Memory Management in Zig¶
Large language models are memory-dominated workloads. A 7-billion-parameter model in f16 occupies roughly 14 GB of weight storage alone; add KV caches, activations, and scratch buffers and you can easily exceed 20 GB for a single inference context. Zig's approach to memory management -- explicit allocators, no hidden allocations, deterministic cleanup -- turns out to be an almost ideal fit for this domain.
1. The Allocator Pattern¶
1.1 std.mem.Allocator Interface¶
In Zig, there is no global malloc. Every function that needs heap memory receives an Allocator argument:
const std = @import("std");
const Allocator = std.mem.Allocator;
fn createBuffer(allocator: Allocator, size: usize) ![]u8 {
return allocator.alloc(u8, size);
}
The Allocator is a vtable-style interface with three function pointers:
| Function | Signature (simplified) | Purpose |
|---|---|---|
alloc | (len, alignment) -> ?[*]u8 | Allocate len bytes |
resize | (buf, new_len) -> bool | Try to resize in place |
free | (buf) -> void | Release memory |
Why Explicit Allocators?
Implicit global allocators (C's malloc, Rust's #[global_allocator]) make it difficult to:
- Use different strategies for different lifetimes (arena vs. page-locked vs. general purpose).
- Instrument allocations per subsystem (how much memory does the KV cache use?).
- Test allocation failure paths (inject
OutOfMemoryin unit tests).
Zig's design solves all three problems by making the allocator a first-class parameter.
1.2 Common Allocator Implementations¶
| Allocator | Use Case in ZigLlama |
|---|---|
std.heap.GeneralPurposeAllocator | Development and testing -- detects leaks and double-frees |
std.heap.page_allocator | Large, long-lived allocations (weight buffers) |
std.heap.ArenaAllocator | Per-inference scratch space -- free everything in one shot |
std.heap.FixedBufferAllocator | Stack-backed allocations with zero syscall overhead |
2. GeneralPurposeAllocator¶
The GPA is the recommended allocator during development. It wraps the system allocator with safety checks:
var gpa = std.heap.GeneralPurposeAllocator(.{
.safety = true, // Enable all safety checks
.thread_safe = true, // Guard against data races
.never_unmap = false, // Unmap freed pages (helps valgrind)
}){};
defer {
const status = gpa.deinit();
if (status == .leak) @panic("Memory leak detected!");
}
const allocator = gpa.allocator();
2.1 Safety Checks¶
GPA Safety Features
| Check | What It Catches |
|---|---|
| Leak detection | Any allocation not freed before gpa.deinit() |
| Double-free detection | Calling free on an already-freed pointer |
| Use-after-free canaries | Writing a known pattern over freed memory; reading it later traps |
| Buffer overflow guards | Red-zone pages before/after allocations (optional) |
These checks add runtime overhead and should be disabled in release builds by switching to std.heap.page_allocator or a custom arena.
3. The Init / Deinit Pattern¶
Zig does not have constructors, destructors, or RAII in the C++ sense. Instead, it uses a convention of init (factory function) paired with deinit (cleanup method), linked by defer:
var tensor = try Tensor(f32).init(allocator, &[_]usize{ 2, 3 });
defer tensor.deinit(); // runs when scope exits, no matter how
// ... use tensor ...
3.1 defer -- Deterministic Cleanup¶
defer schedules a statement to execute when the enclosing block exits -- whether through normal flow, return, or an error propagated by try. Multiple defer statements execute in reverse order (LIFO):
{
const a = try allocator.alloc(u8, 100);
defer allocator.free(a); // runs second
const b = try allocator.alloc(u8, 200);
defer allocator.free(b); // runs first
// ... use a and b ...
}
// Both a and b are freed here.
3.2 errdefer -- Partial Failure Cleanup¶
errdefer only executes when the scope exits via an error. This is essential for multi-step initialisation where earlier allocations must be rolled back:
pub fn init(allocator: Allocator, shape: Shape) TensorError!Self {
const data = allocator.alloc(T, size) catch return TensorError.OutOfMemory;
errdefer allocator.free(data); // only if later steps fail
const owned_shape = allocator.dupe(usize, shape) catch return TensorError.OutOfMemory;
errdefer allocator.free(owned_shape);
const strides = allocator.alloc(usize, shape.len) catch return TensorError.OutOfMemory;
errdefer allocator.free(strides);
// All allocations succeeded -- none of the errdefers will fire.
return Self{ .data = data, .shape = owned_shape, .strides = strides, ... };
}
flowchart TD
A["alloc data"] -->|ok| B["alloc shape"]
A -->|fail| ERR1["return OutOfMemory"]
B -->|ok| C["alloc strides"]
B -->|fail| ERR2["errdefer: free data\nreturn OutOfMemory"]
C -->|ok| D["return Self"]
C -->|fail| ERR3["errdefer: free data, shape\nreturn OutOfMemory"] Rule of Thumb
Every alloc inside an init function should have a matching errdefer that frees it. The corresponding free lives in deinit.
4. Allocation Patterns in Transformers¶
4.1 Lifetime Categories¶
Transformer inference involves three distinct memory lifetime categories:
gantt
title Memory Lifetimes During Inference
dateFormat X
axisFormat %s
section Weights
Model weights (mmap) :active, w1, 0, 100
section KV Cache
KV cache (growing) :active, kv, 10, 100
section Activations
Token 1 activations :a1, 10, 20
Token 2 activations :a2, 20, 30
Token 3 activations :a3, 30, 40
Token N activations :aN, 90, 100 | Category | Lifetime | Allocator Strategy |
|---|---|---|
| Weights | Entire process | mmap (zero-copy from disk) or page_allocator |
| KV cache | Per-conversation (grows with sequence length) | Pre-allocated slab, grows by doubling |
| Activations | Per-token (freed after each forward pass) | ArenaAllocator -- reset after each token |
4.2 Arena Allocator for Activations¶
An arena allocator bumps a pointer for each allocation and frees everything in one operation. This is perfect for activations that are computed and discarded every token:
var arena = std.heap.ArenaAllocator.init(std.heap.page_allocator);
for (0..num_tokens) |_| {
defer _ = arena.reset(.retain_capacity); // free all, keep pages
const act_alloc = arena.allocator();
var q = try Tensor(f32).init(act_alloc, &[_]usize{ seq_len, d_model });
// No individual deinit needed -- arena.reset() handles it.
var k = try Tensor(f32).init(act_alloc, &[_]usize{ seq_len, d_model });
var v = try Tensor(f32).init(act_alloc, &[_]usize{ seq_len, d_model });
// ... compute attention ...
}
arena.deinit(); // release pages back to OS
Performance Impact
Arena allocation reduces per-token allocation overhead from \( O(\text{num\_allocs} \cdot \log n) \) (GPA with free-list) to \( O(\text{num\_allocs}) \) (pointer bump) with an \( O(1) \) bulk free. For a 32-layer model this eliminates thousands of free() calls per token.
4.3 Pre-allocated KV Cache¶
The KV cache stores key and value tensors for all previously generated tokens. Its size grows linearly with sequence length:
where \( L \) is the number of layers, \( s \) is the current sequence length, and \( d_h \) is the head dimension. ZigLlama pre-allocates the cache to the maximum context length to avoid reallocation during generation.
5. Memory Safety Without a Garbage Collector¶
Zig achieves memory safety through a combination of language-level guarantees and convention:
5.1 No Hidden Allocations¶
Unlike C++ (std::vector::push_back may reallocate) or Rust (.clone()), Zig never allocates behind the programmer's back. If a function can allocate, it takes an Allocator parameter -- visible in the type signature.
5.2 Ownership Semantics¶
ZigLlama follows a simple ownership rule:
The struct that calls
allocis responsible for callingfree.
For Tensor(T), init allocates three arrays; deinit frees all three. No reference counting, no shared ownership, no cycles.
5.3 Comptime Bounds Checking¶
Zig's []T slices carry their length at runtime. Indexing out of bounds is a detectable illegal behavior that triggers a safety check in Debug and ReleaseSafe modes:
var buf: [4]u8 = .{ 1, 2, 3, 4 };
const slice: []u8 = &buf;
_ = slice[4]; // panic: index 4 out of bounds for slice of length 4
5.4 Sentinel-Free Strings¶
Zig strings are []const u8 slices with explicit length -- no null terminators that can be forgotten or overflowed.
6. Common Pitfalls and Solutions¶
Pitfall 1: Forgetting errdefer
Symptom: Memory leak when a later allocation in init fails.
Fix: Add errdefer allocator.free(ptr) immediately after every allocator.alloc() inside an init function.
Pitfall 2: Storing Allocator by Value
Symptom: Struct holds an Allocator that becomes invalid after the backing allocator (e.g., GPA) is destroyed.
Fix: Ensure the allocator's lifetime exceeds the struct's lifetime. Prefer passing the allocator to deinit rather than storing it, unless the struct truly needs it for internal resizing.
Pitfall 3: Mixing Allocators
Symptom: Freeing memory with a different allocator than the one that allocated it.
Fix: ZigLlama stores the allocator in the Tensor struct so deinit always uses the correct one.
Pitfall 4: Arena Escape
Symptom: Pointer into arena memory is used after arena.reset().
Fix: Never store arena-allocated pointers in long-lived data structures. Copy the data to a persistent allocator if it must survive the arena's lifetime.
Pitfall 5: Leaking on Error Return
Symptom: Caller forgets to deinit a struct returned by a function that can also return an error for other reasons later.
Fix: Document ownership transfer clearly. In ZigLlama, the convention is: if a function returns !Self, the caller owns the result and must call deinit.