inference.kv_cache¶

Module Path¶

zigllama.inference.kv_cache

Source file: src/inference/kv_cache.zig

Public Types¶

`CacheStrategy`¶

pub const CacheStrategy = enum {
    Always,
    LongSequenceOnly,
    Adaptive,
    Disabled,
};

Variant	Behavior
`Always`	Cache K/V for every token at every layer
`LongSequenceOnly`	Enable caching only when sequence length exceeds a threshold
`Adaptive`	Dynamically enable/disable based on available memory
`Disabled`	No caching; recompute K/V at each step (lowest memory use, highest compute cost)
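
One way to read the table is as a per-step decision function. The sketch below is illustrative only: `CacheContext`, its fields, and `shouldCache` are hypothetical names, and the threshold and memory checks the module actually uses are not documented here.

```zig
const std = @import("std");

// Local copy of the documented enum so the sketch compiles on its own.
const CacheStrategy = enum { Always, LongSequenceOnly, Adaptive, Disabled };

// Hypothetical inputs; the real module may track these differently.
const CacheContext = struct {
    seq_len: usize, // tokens in the current sequence
    long_seq_threshold: usize, // assumed cutoff for LongSequenceOnly
    free_bytes: usize, // memory currently available
    bytes_needed: usize, // memory the next cache entry would take
};

/// Returns true when K/V should be cached for the current step.
fn shouldCache(strategy: CacheStrategy, ctx: CacheContext) bool {
    return switch (strategy) {
        .Always => true,
        .LongSequenceOnly => ctx.seq_len > ctx.long_seq_threshold,
        .Adaptive => ctx.bytes_needed <= ctx.free_bytes,
        .Disabled => false,
    };
}

pub fn main() void {
    const ctx = CacheContext{
        .seq_len = 1000,
        .long_seq_threshold = 512,
        .free_bytes = 1024,
        .bytes_needed = 2048,
    };
    std.debug.print("LongSequenceOnly: {}\n", .{shouldCache(.LongSequenceOnly, ctx)});
    std.debug.print("Adaptive: {}\n", .{shouldCache(.Adaptive, ctx)});
}
```

Because the switch is exhaustive, adding a variant to the enum forces this function to be updated at compile time.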

`KVCacheEntry`¶

pub const KVCacheEntry = struct {
    key: Tensor(f32),
    value: Tensor(f32),
    position: usize,
};

Cached key and value tensors for a single position in a single layer.

`LayerKVCache`¶

pub const LayerKVCache = struct {
    entries: std.ArrayList(KVCacheEntry),
    max_seq_len: usize,
    head_dim: usize,
    num_heads: usize,
};

K/V cache for one transformer layer.

`ModelKVCache`¶

pub const ModelKVCache = struct {
    layers: []LayerKVCache,
    num_layers: usize,
    strategy: CacheStrategy,
    allocator: std.mem.Allocator,
};

Aggregate cache spanning all layers of the model.

`MultiSequenceKVCache`¶

pub const MultiSequenceKVCache = struct {
    sequences: std.AutoHashMap(u64, ModelKVCache),
    max_sequences: usize,
};

Manages independent caches for multiple concurrent sequences (e.g., different requests in a batch).
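
The role of `max_sequences` can be sketched as an admission check on the map. This is an assumption about how the limit might be enforced, not the module's code: `admit` and `error.TooManySequences` are hypothetical, and the map value is a plain counter standing in for a `ModelKVCache`.

```zig
const std = @import("std");

// Hypothetical error set for this sketch; the module itself documents
// only CacheFull and OutOfMemory.
const AdmitError = error{ TooManySequences, OutOfMemory };

/// Registers `seq_id` unless doing so would exceed `max_sequences`.
/// The map value stands in for a ModelKVCache (here: a cached-token count).
fn admit(
    sequences: *std.AutoHashMap(u64, usize),
    max_sequences: usize,
    seq_id: u64,
) AdmitError!void {
    if (sequences.contains(seq_id)) return; // already tracked
    if (sequences.count() >= max_sequences) return error.TooManySequences;
    try sequences.put(seq_id, 0);
}
```

In this sketch, removing a finished sequence from the map (`sequences.remove(seq_id)`) frees a slot for the next request.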

`SlidingWindowKVCache`¶

pub const SlidingWindowKVCache = struct {
    cache: LayerKVCache,
    window_size: usize,
};

Fixed-size sliding window that evicts the oldest entries once the window is full, as in Mistral-style sliding-window attention.
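
The eviction behavior can be pictured as a ring buffer. This is a minimal sketch, not the module's implementation: `SlidingWindow`, `push`, and `oldest` are hypothetical names, the window stores bare positions instead of `KVCacheEntry` values, and `window_size` is assumed to be greater than zero.

```zig
const std = @import("std");

/// Hypothetical fixed-capacity window that stores only token positions.
fn SlidingWindow(comptime window_size: usize) type {
    return struct {
        const Self = @This();

        slots: [window_size]usize = undefined,
        len: usize = 0, // number of valid entries
        next: usize = 0, // slot the next push will write

        /// Insert a position, overwriting the oldest one when the window is full.
        fn push(self: *Self, position: usize) void {
            self.slots[self.next] = position;
            self.next = (self.next + 1) % window_size;
            if (self.len < window_size) self.len += 1;
        }

        /// Oldest position still held, or null when the window is empty.
        fn oldest(self: Self) ?usize {
            if (self.len == 0) return null;
            // Until the buffer wraps, slot 0 is the oldest; afterwards the
            // oldest entry is the one `next` is about to overwrite.
            const start = if (self.len < window_size) 0 else self.next;
            return self.slots[start];
        }
    };
}

pub fn main() void {
    var window = SlidingWindow(4){};
    var position: usize = 0;
    while (position < 6) : (position += 1) window.push(position);
    std.debug.print("entries held: {d}\n", .{window.len});
}
```

Overwriting in place keeps memory use constant at `window_size` entries regardless of sequence length.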

Public Functions¶

`ModelKVCache.init`¶

pub fn init(
    num_layers: usize,
    num_heads: usize,
    head_dim: usize,
    max_seq_len: usize,
    strategy: CacheStrategy,
    allocator: std.mem.Allocator,
) !ModelKVCache

Allocate the full cache structure. Memory for individual entries is allocated lazily as tokens are generated.
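
Because entries are allocated lazily, memory use grows with the number of cached tokens. A rough upper bound for a full cache can be computed as below. It assumes (the module does not state this) that each entry holds `num_heads * head_dim` `f32` values for the key and the same number for the value, and it ignores tensor and list overhead. `maxCacheBytes` is a hypothetical helper.

```zig
const std = @import("std");

/// Upper bound on bytes held by a full cache, assuming each entry stores
/// num_heads * head_dim f32 values for K and the same number for V.
fn maxCacheBytes(
    num_layers: usize,
    num_heads: usize,
    head_dim: usize,
    max_seq_len: usize,
) usize {
    const floats_per_entry = 2 * num_heads * head_dim; // K and V
    return num_layers * max_seq_len * floats_per_entry * @sizeOf(f32);
}

pub fn main() void {
    const bytes = maxCacheBytes(32, 32, 128, 2048);
    std.debug.print("{d} bytes ({d} MiB)\n", .{ bytes, bytes / (1024 * 1024) });
}
```

For 32 layers, 32 heads, a head dimension of 128, and a maximum sequence length of 2048, this bound is 2,147,483,648 bytes (2 GiB).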

`ModelKVCache.deinit`¶

pub fn deinit(self: *ModelKVCache) void

Free all cached tensors and layer structures.

`LayerKVCache.append`¶

pub fn append(self: *LayerKVCache, key: Tensor(f32), value: Tensor(f32)) !void

Add a new K/V pair at the next position.

`LayerKVCache.get`¶

pub fn get(self: LayerKVCache, position: usize) ?KVCacheEntry

Retrieve the cached K/V pair at the given position. Returns null if the position has not been cached.

`ModelKVCache.clear`¶

pub fn clear(self: *ModelKVCache) void

Discard all cached entries across all layers. The cache structure itself is retained and can be reused.

`ModelKVCache.compact`¶

pub fn compact(self: *ModelKVCache) !void

Reclaim memory by removing evicted entries and shrinking internal arrays.
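
The two steps named above, dropping evicted entries and then shrinking the backing array, can be sketched on a plain slice. This is an illustration under assumptions, not the module's implementation: `compactSlots` is a hypothetical helper, and `null` marks an evicted slot.

```zig
const std = @import("std");

/// Moves live (non-null) slots to the front, then shrinks the allocation.
/// `null` marks an evicted slot in this sketch. The returned slice replaces
/// `slots`, which must not be used afterwards.
fn compactSlots(allocator: std.mem.Allocator, slots: []?usize) ![]?usize {
    var kept: usize = 0;
    for (slots) |slot| {
        if (slot != null) {
            slots[kept] = slot; // kept never exceeds the current index
            kept += 1;
        }
    }
    return allocator.realloc(slots, kept);
}
```

The relative order of surviving entries is preserved, so positions stay in ascending order after compaction.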

Error Types¶

error{CacheFull} -- the cache has reached max_seq_len and the strategy does not allow eviction.
error{OutOfMemory} -- an allocation failed.
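
A caller will often treat the two errors differently: a full cache can be handled by recomputing K/V, while an allocation failure usually has to be propagated. The sketch below shows that pattern with a stand-in function. `appendStub` and `appendOrFallback` are hypothetical and do not call the real `LayerKVCache.append`.

```zig
const std = @import("std");

const AppendError = error{ CacheFull, OutOfMemory };

// Hypothetical stand-in for an append that fails once `len` reaches
// `max_seq_len`.
fn appendStub(len: *usize, max_seq_len: usize) AppendError!void {
    if (len.* >= max_seq_len) return error.CacheFull;
    len.* += 1;
}

/// Returns true if the entry was stored, false if the caller should fall
/// back to recomputing K/V. Allocation failures are propagated.
fn appendOrFallback(len: *usize, max_seq_len: usize) error{OutOfMemory}!bool {
    appendStub(len, max_seq_len) catch |err| switch (err) {
        error.CacheFull => return false,
        error.OutOfMemory => return error.OutOfMemory,
    };
    return true;
}

pub fn main() !void {
    var len: usize = 0;
    const stored = try appendOrFallback(&len, 1);
    std.debug.print("stored: {}\n", .{stored});
}
```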

Usage Example¶

const kvc = @import("zigllama").inference.kv_cache;

// `allocator`, `layer_idx`, `position`, `key_tensor`, and `value_tensor`
// are assumed to be defined by the surrounding generation code.
var cache = try kvc.ModelKVCache.init(
    32,     // num_layers
    32,     // num_heads
    128,    // head_dim (hidden size 4096 / 32 heads)
    2048,   // max_seq_len
    .Always,
    allocator,
);
defer cache.deinit();

// During generation, append K/V after each attention computation
try cache.layers[layer_idx].append(key_tensor, value_tensor);

// Retrieve cached K/V for attention; get returns null for uncached positions
if (cache.layers[layer_idx].get(position)) |entry| {
    // Use entry.key and entry.value
}

// Reset between prompts; the cache structure is kept for reuse
cache.clear();

Related modules:

transformers.attention -- Produces the K/V tensors that are cached.
inference.generation -- Uses the cache during autoregressive generation.
inference.batching -- MultiSequenceKVCache supports batched inference.
