inference.batching¶
Module Path¶
Source file: src/inference/batching.zig
Public Types¶
BatchingStrategy¶
| Variant | Behavior |
|---|---|
| FixedSize | Wait for exactly N requests before processing |
| DynamicTimeout | Process when batch is full or a timeout expires |
| Adaptive | Adjust batch size based on queue pressure and latency targets |
| Continuous | Process requests as they arrive, inserting into running batches |
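The table describes when each strategy releases a batch. The sketch below is one possible reading of those rules as a single decision function. `QueueState`, `shouldFlush`, and the `adaptive_target` field are hypothetical names introduced for illustration, and the `Adaptive` branch in particular is an assumption, since the source does not say how the adaptive policy picks its batch size.

```zig
const std = @import("std");

// Mirrors the variants in the table above.
const BatchingStrategy = enum { FixedSize, DynamicTimeout, Adaptive, Continuous };

/// Hypothetical snapshot of the queue; not part of the documented API.
const QueueState = struct {
    queued: usize, // requests currently waiting
    waited_ms: u64, // age of the oldest waiting request
    adaptive_target: usize, // batch size currently chosen by an adaptive policy
};

/// Returns true when a worker should stop waiting and process a batch.
fn shouldFlush(strategy: BatchingStrategy, state: QueueState, max_batch_size: usize, timeout_ms: u64) bool {
    if (state.queued == 0) return false; // nothing to process
    return switch (strategy) {
        .FixedSize => state.queued >= max_batch_size,
        .DynamicTimeout => state.queued >= max_batch_size or state.waited_ms >= timeout_ms,
        // Assumption: the adaptive policy exposes a target size and still honors the timeout.
        .Adaptive => state.queued >= @min(state.adaptive_target, max_batch_size) or state.waited_ms >= timeout_ms,
        .Continuous => true,
    };
}

test "flush decisions per strategy" {
    // Three requests have waited 120 ms; the batch limit is 8 and the timeout is 100 ms.
    const s = QueueState{ .queued = 3, .waited_ms = 120, .adaptive_target = 4 };
    try std.testing.expect(!shouldFlush(.FixedSize, s, 8, 100)); // still waiting for 8
    try std.testing.expect(shouldFlush(.DynamicTimeout, s, 8, 100)); // timeout expired
    try std.testing.expect(shouldFlush(.Continuous, s, 8, 100)); // never waits
}
```

Under this reading, `FixedSize` favors throughput at the cost of unbounded waiting when traffic is light, which is why the timeout-based variants exist.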
BatchRequest¶
pub const BatchRequest = struct {
    id: u64,
    prompt: []const u8,
    config: GenerationConfig,
    priority: u8,
    callback: ?StreamCallback,
};
A single generation request submitted to the batch processor.
BatchResult¶
pub const BatchResult = struct {
    id: u64,
    result: GenerationResult,
    latency_ms: u64,
    queue_time_ms: u64,
};
Result for one request within a batch, including timing metadata.
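To make the two timing fields concrete, here is a minimal sketch of how they could be derived from three timestamps. The `Timestamps` and `Timing` types and the `computeTiming` function are hypothetical. The sketch also assumes that `latency_ms` is measured end to end, so it includes `queue_time_ms`; the source does not state this.

```zig
const std = @import("std");

/// Hypothetical per-request timestamps in milliseconds from a monotonic clock.
const Timestamps = struct {
    submitted_ms: u64, // request entered the queue
    started_ms: u64, // a worker picked the request up
    finished_ms: u64, // generation completed
};

const Timing = struct { latency_ms: u64, queue_time_ms: u64 };

/// Assumption: latency is end-to-end, so it includes the time spent queued.
fn computeTiming(t: Timestamps) Timing {
    return .{
        .latency_ms = t.finished_ms - t.submitted_ms,
        .queue_time_ms = t.started_ms - t.submitted_ms,
    };
}

test "queue time is part of end-to-end latency" {
    const timing = computeTiming(.{ .submitted_ms = 1_000, .started_ms = 1_040, .finished_ms = 1_250 });
    try std.testing.expectEqual(@as(u64, 250), timing.latency_ms);
    try std.testing.expectEqual(@as(u64, 40), timing.queue_time_ms);
}
```

With this definition, the difference between the two fields is the time spent in generation itself, which helps separate a slow model from an overloaded queue.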
BatchProcessor¶
pub const BatchProcessor = struct {
    queue: std.ArrayList(BatchRequest),
    workers: []std.Thread,
    config: BatchConfig,
    stats: BatchStats,
    model: *LLaMAModel,
    tokenizer: *SimpleTokenizer,
    allocator: std.mem.Allocator,
};
Manages a request queue and a pool of worker threads that process batches of requests against a shared model.
BatchConfig¶
pub const BatchConfig = struct {
    max_batch_size: usize = 32,
    strategy: BatchingStrategy = .DynamicTimeout,
    timeout_ms: u64 = 100,
    num_workers: usize = 1,
    max_queue_size: usize = 1024,
};
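Because every field of `BatchConfig` has a default, a caller only needs to name the fields it wants to change. The sketch below copies the struct so that it compiles on its own; the `BatchingStrategy` enum is reconstructed from the variant table earlier on this page, not copied from the source file.

```zig
const std = @import("std");

// Reconstructed from the variant table; the real declaration may differ.
const BatchingStrategy = enum { FixedSize, DynamicTimeout, Adaptive, Continuous };

// Local copy of the struct documented above.
const BatchConfig = struct {
    max_batch_size: usize = 32,
    strategy: BatchingStrategy = .DynamicTimeout,
    timeout_ms: u64 = 100,
    num_workers: usize = 1,
    max_queue_size: usize = 1024,
};

test "unspecified fields keep their defaults" {
    // Override two fields; the other three fall back to the declared defaults.
    const config = BatchConfig{ .max_batch_size = 8, .num_workers = 4 };
    try std.testing.expectEqual(@as(usize, 8), config.max_batch_size);
    try std.testing.expectEqual(@as(usize, 4), config.num_workers);
    try std.testing.expectEqual(BatchingStrategy.DynamicTimeout, config.strategy);
    try std.testing.expectEqual(@as(u64, 100), config.timeout_ms);
    try std.testing.expectEqual(@as(usize, 1024), config.max_queue_size);
}
```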
BatchStats¶
pub const BatchStats = struct {
    total_requests: u64,
    total_tokens: u64,
    avg_latency_ms: f64,
    avg_throughput_tps: f64,
};
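The source does not show how these aggregates are maintained. One plausible approach, sketched below, folds each finished request into a running mean so that no per-request history has to be stored. The `record` method, the zero defaults, and the definition of throughput as total tokens divided by elapsed wall-clock seconds are all assumptions made for this example. It uses `@floatFromInt`, so it needs Zig 0.11 or later.

```zig
const std = @import("std");

// Local copy of the documented fields, with zero defaults added for the sketch.
const BatchStats = struct {
    total_requests: u64 = 0,
    total_tokens: u64 = 0,
    avg_latency_ms: f64 = 0,
    avg_throughput_tps: f64 = 0,

    /// Hypothetical helper: fold one finished request into the aggregates.
    /// `elapsed_s` is the wall-clock time since the processor started.
    fn record(self: *BatchStats, tokens: u64, latency_ms: u64, elapsed_s: f64) void {
        self.total_requests += 1;
        self.total_tokens += tokens;

        const n: f64 = @floatFromInt(self.total_requests);
        const latency: f64 = @floatFromInt(latency_ms);
        // Incremental mean: new_avg = old_avg + (sample - old_avg) / n.
        self.avg_latency_ms += (latency - self.avg_latency_ms) / n;

        // Assumption: throughput is total tokens over total elapsed seconds.
        const total: f64 = @floatFromInt(self.total_tokens);
        if (elapsed_s > 0) self.avg_throughput_tps = total / elapsed_s;
    }
};

test "running averages after two requests" {
    var stats = BatchStats{};
    stats.record(100, 200, 1.0);
    stats.record(300, 400, 2.0);
    try std.testing.expectEqual(@as(u64, 2), stats.total_requests);
    try std.testing.expectEqual(@as(u64, 400), stats.total_tokens);
    try std.testing.expectApproxEqAbs(@as(f64, 300.0), stats.avg_latency_ms, 1e-9);
    try std.testing.expectApproxEqAbs(@as(f64, 200.0), stats.avg_throughput_tps, 1e-9);
}
```

Since several workers may finish requests concurrently, a real implementation would also need to guard these updates with a mutex or atomics.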
Public Functions¶
BatchProcessor.init¶
pub fn init(
    model: *LLaMAModel,
    tokenizer: *SimpleTokenizer,
    config: BatchConfig,
    allocator: std.mem.Allocator,
) !BatchProcessor
Create the batch processor and start worker threads.
BatchProcessor.deinit¶
Drain the queue, join worker threads, and free resources.
BatchProcessor.submit¶
Submit a request and block until the result is ready. The request may be batched with others for higher throughput.
BatchProcessor.submitAsync¶
Submit a request without blocking. Returns a request ID that can be used to poll for the result later.
BatchProcessor.processQueue¶
Internal: called by worker threads to dequeue and process a batch of requests.
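The dequeue step can be pictured as splitting the pending requests into the next batch and the remainder. The sketch below shows that split over a slice of request IDs in FIFO order. `Split` and `takeBatch` are hypothetical names; the actual `processQueue` works on the `std.ArrayList(BatchRequest)` queue and may also take `priority` into account, which this example ignores.

```zig
const std = @import("std");

/// Hypothetical result type: the next batch and whatever stays queued.
const Split = struct { batch: []const u64, rest: []const u64 };

/// Takes at most `max_batch_size` request IDs from the front of `pending`.
fn takeBatch(pending: []const u64, max_batch_size: usize) Split {
    const n = @min(pending.len, max_batch_size);
    return .{ .batch = pending[0..n], .rest = pending[n..] };
}

test "batch is capped at max_batch_size" {
    const pending = [_]u64{ 11, 12, 13, 14, 15 };
    const split = takeBatch(&pending, 2);
    try std.testing.expectEqualSlices(u64, &[_]u64{ 11, 12 }, split.batch);
    try std.testing.expectEqualSlices(u64, &[_]u64{ 13, 14, 15 }, split.rest);
}
```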
Error Types¶
- `error{QueueFull}` -- the request queue has reached `max_queue_size`.
- `error{ProcessorStopped}` -- the batch processor has been shut down.
- Inherits generation errors from `TextGenerator`.
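A caller will usually want to treat the two batching errors differently: a full queue is transient, while a stopped processor is not. The sketch below illustrates that distinction with a stand-in admission check. `SubmitError`, `admit`, `Action`, and `decide` are hypothetical and only mirror the error names listed above; the real `submit` also returns generation errors, which are omitted here.

```zig
const std = @import("std");

// Only the two batching errors documented above.
const SubmitError = error{ QueueFull, ProcessorStopped };

/// Hypothetical admission check, standing in for the real submit path.
fn admit(queue_len: usize, max_queue_size: usize, stopped: bool) SubmitError!void {
    if (stopped) return error.ProcessorStopped;
    if (queue_len >= max_queue_size) return error.QueueFull;
}

/// One possible caller policy: retry on a full queue, give up after shutdown.
const Action = enum { proceed, retry_later, abort };

fn decide(queue_len: usize, max_queue_size: usize, stopped: bool) Action {
    admit(queue_len, max_queue_size, stopped) catch |err| switch (err) {
        error.QueueFull => return .retry_later,
        error.ProcessorStopped => return .abort,
    };
    return .proceed;
}

test "each error maps to a different action" {
    try std.testing.expectEqual(Action.retry_later, decide(1024, 1024, false));
    try std.testing.expectEqual(Action.abort, decide(0, 1024, true));
    try std.testing.expectEqual(Action.proceed, decide(3, 1024, false));
}
```

Switching on the error set exhaustively means the compiler will flag this code if a new error is added to `SubmitError` and left unhandled.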
Usage Example¶
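const std = @import("std");
const batch = @import("zigllama").inference.batching;
// Assumption: GenerationConfig is exported by inference.generation.
const gen = @import("zigllama").inference.generation;

// `model`, `tokenizer`, and `allocator` are assumed to be set up by the caller.
var processor = try batch.BatchProcessor.init(
    &model,
    &tokenizer,
    .{ .max_batch_size = 8, .strategy = .DynamicTimeout },
    allocator,
);
defer processor.deinit();

// Blocks until the result for this request is ready.
const result = try processor.submit(.{
    .id = 1,
    .prompt = "Explain quantum computing",
    .config = gen.GenerationConfig.balanced(),
    .priority = 0,
    .callback = null,
});

std.debug.print("Response: {s}\n", .{result.result.text});
std.debug.print("Latency: {} ms\n", .{result.latency_ms});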
Related Modules¶
- `inference.generation` -- Underlying generation engine used by each worker.
- `inference.kv_cache` -- `MultiSequenceKVCache` manages per-request caches.
- `inference.streaming` -- Per-request streaming via the `callback` field.