inference.kv_cache
Module Path
Source file: src/inference/kv_cache.zig
Public Types
CacheStrategy
| Variant | Behavior |
|---|---|
| Always | Cache K/V for every token at every layer |
| LongSequenceOnly | Enable caching only when sequence length exceeds a threshold |
| Adaptive | Dynamically enable/disable based on available memory |
| Disabled | No caching; recompute K/V each step (saves memory) |
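One way to read the table is as a per-step decision function. The sketch below is illustrative only: the `Policy` struct, its default values, and `shouldCache` are hypothetical names, since the module's actual threshold and memory checks are not documented here.

```zig
const std = @import("std");

// Mirrors the variants in the table; redeclared so the sketch compiles on its own.
const CacheStrategy = enum { Always, LongSequenceOnly, Adaptive, Disabled };

// Hypothetical tuning knobs; the real thresholds are not documented on this page.
const Policy = struct {
    long_seq_threshold: usize = 512,
    min_free_bytes: usize = 256 * 1024 * 1024,
};

/// Returns true when K/V tensors should be cached for the current step.
fn shouldCache(strategy: CacheStrategy, seq_len: usize, free_bytes: usize, policy: Policy) bool {
    return switch (strategy) {
        .Always => true,
        // "Exceeds" is read here as strictly greater than the threshold.
        .LongSequenceOnly => seq_len > policy.long_seq_threshold,
        .Adaptive => free_bytes >= policy.min_free_bytes,
        .Disabled => false,
    };
}
```

Because the switch is exhaustive, adding a new variant to the enum forces this function to be updated at compile time.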
KVCacheEntry
Cached key and value tensors for a single position in a single layer.
LayerKVCache
pub const LayerKVCache = struct {
    entries: std.ArrayList(KVCacheEntry),
    max_seq_len: usize,
    head_dim: usize,
    num_heads: usize,
};
K/V cache for one transformer layer.
ModelKVCache
pub const ModelKVCache = struct {
    layers: []LayerKVCache,
    num_layers: usize,
    strategy: CacheStrategy,
    allocator: std.mem.Allocator,
};
Aggregate cache spanning all layers of the model.
MultiSequenceKVCache
pub const MultiSequenceKVCache = struct {
    sequences: std.AutoHashMap(u64, ModelKVCache),
    max_sequences: usize,
};
Manages independent caches for multiple concurrent sequences (e.g., different requests in a batch).
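The per-sequence bookkeeping can be sketched with a map keyed by sequence ID. This is a minimal illustration, not the module's API: `SeqCache` stands in for `ModelKVCache`, and `MultiSeq`, `getOrCreate`, `release`, and `error.TooManySequences` are hypothetical names.

```zig
const std = @import("std");

// Stand-in for ModelKVCache: only tracks how many tokens are cached.
const SeqCache = struct { cached_tokens: usize = 0 };

const MultiSeq = struct {
    sequences: std.AutoHashMap(u64, SeqCache),
    max_sequences: usize,

    fn init(allocator: std.mem.Allocator, max_sequences: usize) MultiSeq {
        return .{
            .sequences = std.AutoHashMap(u64, SeqCache).init(allocator),
            .max_sequences = max_sequences,
        };
    }

    fn deinit(self: *MultiSeq) void {
        self.sequences.deinit();
    }

    /// Return the cache for `seq_id`, creating it if there is room.
    /// The pointer is only valid until the map is next modified.
    fn getOrCreate(self: *MultiSeq, seq_id: u64) !*SeqCache {
        if (self.sequences.getPtr(seq_id)) |existing| return existing;
        if (self.sequences.count() >= self.max_sequences) return error.TooManySequences;
        try self.sequences.put(seq_id, .{});
        return self.sequences.getPtr(seq_id).?;
    }

    /// Drop a finished sequence so its slot can be reused.
    fn release(self: *MultiSeq, seq_id: u64) void {
        _ = self.sequences.remove(seq_id);
    }
};
```

In a batching loop, each request would call `getOrCreate` with its own ID on every step and `release` once generation finishes.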
SlidingWindowKVCache
Fixed-size sliding window that evicts the oldest entries when the window is full. Used by Mistral-style attention.
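Oldest-first eviction in a fixed-size window is commonly implemented as a ring buffer. The sketch below assumes that layout, which this page does not confirm for SlidingWindowKVCache. `SlidingWindow`, `push`, and `get` are hypothetical names, and `Entry` stands in for KVCacheEntry.

```zig
const std = @import("std");

/// Minimal fixed-capacity ring buffer illustrating oldest-first eviction.
fn SlidingWindow(comptime Entry: type, comptime window: usize) type {
    return struct {
        const Self = @This();

        slots: [window]Entry = undefined,
        start: usize = 0, // index of the oldest live entry
        len: usize = 0, // number of live entries

        /// Append an entry, overwriting the oldest one once the window is full.
        pub fn push(self: *Self, entry: Entry) void {
            if (self.len < window) {
                self.slots[(self.start + self.len) % window] = entry;
                self.len += 1;
            } else {
                self.slots[self.start] = entry;
                self.start = (self.start + 1) % window;
            }
        }

        /// Return the i-th live entry, oldest first, or null if out of range.
        pub fn get(self: *const Self, i: usize) ?Entry {
            if (i >= self.len) return null;
            return self.slots[(self.start + i) % window];
        }
    };
}
```

With this layout, memory use stays constant regardless of how many tokens are generated, and eviction never needs to shift entries.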
Public Functions
ModelKVCache.init
pub fn init(
    num_layers: usize,
    num_heads: usize,
    head_dim: usize,
    max_seq_len: usize,
    strategy: CacheStrategy,
    allocator: std.mem.Allocator,
) !ModelKVCache
Allocate the full cache structure. Memory for individual entries is allocated lazily as tokens are generated.
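Because entries are allocated lazily, it helps to know the upper bound the cache can grow to. The helper below is a back-of-the-envelope sketch, not part of the module. `maxCacheBytes` and `bytes_per_elem` are hypothetical, and it assumes every head stores its own K and V (no grouped-query sharing).

```zig
const std = @import("std");

/// Upper bound on cache size in bytes once every position is filled.
/// The factor of 2 accounts for storing both a key and a value tensor.
fn maxCacheBytes(
    num_layers: usize,
    num_heads: usize,
    head_dim: usize,
    max_seq_len: usize,
    bytes_per_elem: usize, // assumption: 4 for f32, 2 for f16
) usize {
    return 2 * num_layers * num_heads * head_dim * max_seq_len * bytes_per_elem;
}
```

For 32 layers, 32 heads, a head dimension of 128, and 2048 positions, this gives 2,147,483,648 bytes (2 GiB) with 4-byte elements, or 1 GiB with 2-byte elements.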
ModelKVCache.deinit
Free all cached tensors and layer structures.
LayerKVCache.append
Add a new K/V pair at the next position.
LayerKVCache.get
Retrieve the cached K/V pair at the given position. Returns null if the position has not been cached.
ModelKVCache.clear
Discard all cached entries across all layers. The cache structure itself is retained and can be reused.
ModelKVCache.compact
Reclaim memory by removing evicted entries and shrinking internal arrays.
Error Types
- error{CacheFull} -- the cache has reached max_seq_len and the strategy does not allow eviction.
- error{OutOfMemory} -- the allocator could not satisfy a request.
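Callers can distinguish the two errors with a `switch` on the caught error. The sketch below uses a hypothetical `TinyCache` as a stand-in for LayerKVCache, and resetting on CacheFull is only one possible recovery policy; a real caller might stop generation instead.

```zig
const std = @import("std");

const CacheError = error{ CacheFull, OutOfMemory };

// Hypothetical stand-in for LayerKVCache: fails once `len` reaches `max_seq_len`.
const TinyCache = struct {
    len: usize = 0,
    max_seq_len: usize,

    fn append(self: *TinyCache) CacheError!void {
        if (self.len >= self.max_seq_len) return error.CacheFull;
        self.len += 1;
    }
};

/// Append, and on CacheFull reset the cache and retry once.
/// OutOfMemory is propagated to the caller unchanged.
fn appendOrReset(cache: *TinyCache) CacheError!void {
    cache.append() catch |err| switch (err) {
        error.CacheFull => {
            cache.len = 0;
            try cache.append();
        },
        error.OutOfMemory => return err,
    };
}
```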
Usage Example
const kvc = @import("zigllama").inference.kv_cache;

// `allocator`, `layer_idx`, `position`, `key_tensor`, and `value_tensor`
// are supplied by the surrounding generation code.
var cache = try kvc.ModelKVCache.init(
    32, // num_layers
    32, // num_heads
    128, // head_dim (4096 / 32)
    2048, // max_seq_len
    .Always,
    allocator,
);
defer cache.deinit();

// During generation, append K/V after each attention computation
try cache.layers[layer_idx].append(key_tensor, value_tensor);

// Retrieve cached K/V for attention
if (cache.layers[layer_idx].get(position)) |entry| {
    // Use entry.key and entry.value
    _ = entry; // placeholder so the capture counts as used
}

// Reset between prompts
cache.clear();
Related Modules
- transformers.attention -- Produces the K/V tensors that are cached.
- inference.generation -- Uses the cache during autoregressive generation.
- inference.batching -- MultiSequenceKVCache supports batched inference.