Batch Processing¶
Processing a single request at a time leaves most hardware resources idle. The model weights occupy memory regardless of how many sequences use them; matrix operations have fixed kernel-launch overhead that amortises over larger inputs. Batch processing groups multiple generation requests and processes them together, dramatically improving throughput.
1. Why Batching¶
Throughput vs Latency
- Latency: time from request submission to completion (single request).
- Throughput: total requests (or tokens) completed per unit time.
Batching trades slightly increased per-request latency for substantially higher throughput.
The key insight is that the transformer forward pass is dominated by matrix multiplications that scale efficiently with batch size:
The overhead of loading model weights from memory is paid once per batch rather than once per request. On modern hardware, this weight-loading cost often dominates single-request inference, so batching can yield 5--10x throughput improvement.
2. Batching Strategies¶
ZigLlama supports four batching strategies, selectable via the BatchingStrategy enum:
pub const BatchingStrategy = enum {
FixedSize, // Wait for exactly max_batch_size requests
DynamicTimeout, // Collect requests until timeout or batch full
Adaptive, // Adjust batch size based on queue depth and latency
Continuous, // Process immediately, never wait
};
| Strategy | Latency | Throughput | Complexity | Best For |
|---|---|---|---|---|
FixedSize | High (waits for full batch) | Highest | Simple | Offline processing |
DynamicTimeout | Bounded by timeout | High | Moderate | Production servers |
Adaptive | Variable | High | Complex | Variable load patterns |
Continuous | Lowest | Lower | Simple | Interactive applications |
2.1 Fixed-Size Batching¶
Accumulates exactly max_batch_size requests before processing. Simple to implement but introduces high latency when request arrival rate is low.
2.2 Dynamic Timeout Batching¶
The default strategy. Collects requests until either the batch is full or a timeout expires, whichever comes first:
Dynamic Timeout Batching
Parameters: max_batch_size, max_wait_time_ms
- Set timer to
max_wait_time_ms. - while batch size \( < \)
max_batch_sizeand timer not expired:- Wait on request queue condition variable.
- If new request arrives, add to current batch.
- Process batch (even if smaller than
max_batch_size).
sequenceDiagram
participant C1 as Client 1
participant C2 as Client 2
participant C3 as Client 3
participant Q as Request Queue
participant P as BatchProcessor
C1->>Q: submit(prompt_1)
Note over Q: Timer starts (100ms)
C2->>Q: submit(prompt_2)
C3->>Q: submit(prompt_3)
Note over Q: Timer expires
Q->>P: batch = [prompt_1, prompt_2, prompt_3]
P->>P: processBatch(batch)
P-->>C1: result_1
P-->>C2: result_2
P-->>C3: result_3 2.3 Adaptive Batching¶
Monitors queue depth and recent latency to dynamically adjust the batch size. When the queue is deep, it forms larger batches to drain the backlog; when latency is high, it reduces batch size to improve responsiveness.
2.4 Continuous Batching¶
Processes each request as it arrives, without waiting for additional requests. This provides the lowest latency but the lowest throughput, as it cannot amortise weight-loading costs.
3. BatchProcessor¶
The BatchProcessor is the top-level struct for batch inference:
pub const BatchProcessor = struct {
model: *LLaMAModel,
tokenizer: *SimpleTokenizer,
config: BatchConfig,
queue: RequestQueue,
stats: BatchStats,
allocator: Allocator,
workers: []Thread,
is_running: bool,
next_request_id: RequestId,
result_callback: ?*const fn (result: BatchResult, user_data: ?*anyopaque) void,
user_data: ?*anyopaque,
pub fn init(model: *LLaMAModel, tokenizer: *SimpleTokenizer,
config: BatchConfig, allocator: Allocator) !BatchProcessor { ... }
pub fn deinit(self: *BatchProcessor) void { ... }
pub fn start(self: *BatchProcessor) !void { ... }
pub fn stop(self: *BatchProcessor) void { ... }
pub fn submit(self: *BatchProcessor, prompt: []const u8,
config: GenerationConfig) !RequestId { ... }
pub fn getStats(self: BatchProcessor) BatchStats { ... }
};
3.1 BatchConfig¶
pub const BatchConfig = struct {
strategy: BatchingStrategy = .DynamicTimeout,
max_batch_size: u32 = 8,
min_batch_size: u32 = 1,
max_wait_time_ms: u32 = 100,
max_queue_size: u32 = 1000,
num_workers: u32 = 1,
enable_priority: bool = false,
memory_limit_bytes: ?usize = null,
};
Choosing max_batch_size
The optimal batch size depends on available memory and compute resources. A larger batch uses more memory (each sequence needs its own KV cache) but amortises fixed costs better. Start with 4--8 for CPU inference and 16--64 for GPU inference, then tune based on profiling data.
4. Sequence Padding¶
When sequences in a batch have different lengths, shorter sequences must be padded so that all sequences can be processed as a single batched matrix operation:
Attention Masking for Padded Sequences
Given a batch of \( B \) sequences with lengths \( \ell_1, \ldots, \ell_B \), pad all to length \( L = \max_i \ell_i \). Construct an attention mask \( \mathbf{M} \in \{0, -\infty\}^{B \times L} \):
The mask is added to the attention logits before softmax, ensuring padding positions receive zero attention weight.
block-beta
columns 6
block:header:6
h1["pos 0"] h2["pos 1"] h3["pos 2"] h4["pos 3"] h5["pos 4"] h6["pos 5"]
end
block:seq1["Sequence 1 (len=6)"]:6
s1a["The"] s1b["cat"] s1c["sat"] s1d["on"] s1e["the"] s1f["mat"]
end
block:seq2["Sequence 2 (len=3)"]:6
s2a["Hello"] s2b["world"] s2c["!"] s2d["[PAD]"] s2e["[PAD]"] s2f["[PAD]"]
end
block:seq3["Sequence 3 (len=4)"]:6
s3a["What"] s3b["is"] s3c["the"] s3d["answer"] s3e["[PAD]"] s3f["[PAD]"]
end
style s2d fill:#f5b7b1
style s2e fill:#f5b7b1
style s2f fill:#f5b7b1
style s3e fill:#f5b7b1
style s3f fill:#f5b7b1 Padding Overhead
Padding wastes compute on tokens that produce no useful output. If sequence lengths vary widely (e.g., 10 tokens vs 1000 tokens), the shorter sequences waste most of their compute budget. Mitigation strategies include sorting requests by length before batching and using continuous batching to avoid padding entirely.
5. Dynamic Batching¶
The RequestQueue implements thread-safe request queuing with timeout-based batch extraction:
const RequestQueue = struct {
requests: ArrayList(BatchRequest),
mutex: Mutex,
condition: Condition,
allocator: Allocator,
closed: bool,
pub fn push(self: *RequestQueue, request: BatchRequest) !void { ... }
pub fn popBatch(self: *RequestQueue, max_size: u32, timeout_ms: u32) []BatchRequest { ... }
pub fn close(self: *RequestQueue) void { ... }
};
The popBatch method blocks until either max_size requests are available or timeout_ms elapses:
flowchart TD
START["popBatch(max_size, timeout_ms)"]
LOCK["Acquire mutex"]
CHECK{"Queue empty<br/>AND not closed?"}
WAIT["Wait on condition<br/>(with remaining timeout)"]
TIMEOUT{"Timed out?"}
EXTRACT["Extract min(available, max_size)<br/>requests from queue"]
RETURN["Return batch"]
START --> LOCK --> CHECK
CHECK -->|Yes| WAIT
WAIT --> TIMEOUT
TIMEOUT -->|Yes| EXTRACT
TIMEOUT -->|No| CHECK
CHECK -->|No| EXTRACT
EXTRACT --> RETURN 6. Throughput Scaling¶
Throughput Analysis
Let \( C \) be the fixed cost per batch (weight loading, kernel launch) and \( c \) be the per-sequence cost within a batch. Total time for a batch of size \( B \):
Throughput (sequences per second):
As \( B \to \infty \): throughput \( \to 1/c \) (limited by per-sequence cost). As \( B \to 1 \): throughput \( \to 1/(C + c) \) (dominated by fixed cost).
6.1 Diminishing Returns¶
The throughput improvement follows a logarithmic curve. Doubling batch size from 1 to 2 yields the largest relative improvement; each subsequent doubling yields less:
| Batch Size | Relative Throughput | Improvement over Previous |
|---|---|---|
| 1 | 1.0x | -- |
| 2 | 1.8x | +80% |
| 4 | 3.0x | +67% |
| 8 | 4.5x | +50% |
| 16 | 5.5x | +22% |
| 32 | 6.2x | +13% |
Memory Constraint
In practice, batch size is limited by memory rather than compute. Each sequence in the batch requires its own KV cache, so the total memory scales as \( B \times M_{\text{cache}} \) where \( M_{\text{cache}} \) is the per-sequence cache size (see KV Cache).
7. BatchRequest and BatchResult¶
7.1 BatchRequest¶
pub const BatchRequest = struct {
id: RequestId,
prompt: []const u8,
config: GenerationConfig,
max_tokens: u32,
generated_tokens: ArrayList(TokenId),
generated_text: ArrayList(u8),
cumulative_log_prob: f32,
created_at: i64,
started_at: ?i64,
completed_at: ?i64,
completed: bool,
stop_reason: ?StopReason,
pub fn addToken(self: *BatchRequest, token_id: TokenId,
text: []const u8, log_prob: f32) !void { ... }
pub fn markCompleted(self: *BatchRequest, stop_reason: StopReason) void { ... }
pub fn getLatency(self: BatchRequest) ?i64 { ... }
pub fn getQueueTime(self: BatchRequest) ?i64 { ... }
};
Request Lifecycle
A request passes through four phases: queued (waiting in the queue), started (assigned to a batch), generating (tokens being produced), and completed (stop condition met or error). Timestamps at each transition enable precise latency analysis.
7.2 BatchResult¶
pub const BatchResult = struct {
request_id: RequestId,
tokens: []TokenId,
text: []u8,
log_probs: []f32,
total_log_prob: f32,
num_tokens: u32,
stop_reason: StopReason,
latency_ms: i64, // Processing time
queue_time_ms: i64, // Time spent waiting in queue
};
8. Batch Statistics¶
The BatchStats struct tracks aggregate performance metrics:
pub const BatchStats = struct {
total_requests: u64,
total_batches: u64,
avg_batch_size: f32,
avg_latency_ms: f32,
avg_queue_time_ms: f32,
requests_per_second: f32,
tokens_per_second: f32,
current_queue_size: u32,
peak_queue_size: u32,
timeout_count: u32,
error_count: u32,
};
8.1 Request Priority¶
When enable_priority is set in BatchConfig, requests can be assigned priority levels that affect queue ordering:
Higher-priority requests are dequeued before lower-priority ones, even if they arrived later. This is useful for serving real-time interactive requests alongside background batch jobs.
References¶
-
Yu, G. et al. "Orca: A Distributed Serving System for Transformer-Based Language Generation." OSDI, 2022. ↩
-
Kwon, W. et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP, 2023. ↩
-
Gerganov, G. "llama.cpp -- Inference of LLaMA model in C/C++." GitHub, 2023. ↩
-
Agrawal, A. et al. "Sarathi: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills." arXiv:2308.16369, 2023. ↩