Memory Profiling¶
Accurate memory prediction is essential for choosing quantisation levels, setting context lengths, and provisioning hardware. This page derives the formulas for each memory component and shows how to use Zig's built-in tools to detect leaks and measure peak usage.
Zig's GeneralPurposeAllocator and Leak Detection¶
Zig's GeneralPurposeAllocator (GPA) is a debug allocator that tracks every allocation and reports leaks when deinit is called:
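const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{
        .stack_trace_frames = 8, // capture 8 frames per allocation
        .enable_memory_limit = true, // needed for gpa.total_requested_bytes
    }){};
    defer {
        const status = gpa.deinit(); // .leak if anything was never freed
        if (status == .leak) {
            std.log.err("LEAK DETECTED", .{});
        }
    }
    const allocator = gpa.allocator();
    // Placeholder workload; pass allocator to the rest of the program instead.
    const buf = try allocator.alloc(u8, 4096);
    defer allocator.free(buf);
}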
When a leak is found, GPA prints the stack trace of the offending allocation:
error: memory leak at 0x7f3a2c001000 (4096 bytes)
src/models/llama.zig:142:25: LLaMAModel.init
src/server/http_server.zig:182:19: ZigLlamaServer.loadModel
examples/main.zig:23:5: main
Release builds
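GPA's tracking is intended for Debug and ReleaseSafe builds. It does not switch itself off in ReleaseFast: to avoid the tracking overhead there, select a thinner allocator based on the build mode (for example std.heap.c_allocator, a thin wrapper around malloc).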
Workflow¶
- Run the test suite under GPA: zig build test.
- If GPA reports zero leaks, the allocator contract is satisfied.
- For long-running server processes, periodically log gpa.total_requested_bytes to monitor growth. This field is only tracked when the allocator is configured with .enable_memory_limit = true.
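The first workflow step can be sketched as a unit test. This is a minimal illustration, assuming the test suite uses std.testing.allocator, the standard library's leak-checking test allocator; makeBuffer is a hypothetical helper and not part of ZigLlama.

```zig
const std = @import("std");

// Hypothetical helper standing in for any function that returns owned memory.
fn makeBuffer(allocator: std.mem.Allocator, len: usize) ![]f32 {
    const buf = try allocator.alloc(f32, len);
    @memset(buf, 0.0);
    return buf;
}

test "makeBuffer result is freed by the caller" {
    // std.testing.allocator fails the test if an allocation is still live at the end.
    const allocator = std.testing.allocator;

    const buf = try makeBuffer(allocator, 1024);
    defer allocator.free(buf); // without this line the test run reports a leak

    try std.testing.expectEqual(@as(usize, 1024), buf.len);
}
```

Running the file with zig test exercises the allocation and the matching free in one place, which keeps ownership rules visible next to the code they apply to.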
Model Memory Budget¶
Parameter Count Formula¶
For a standard LLaMA-style transformer with \(L\) layers and embedding dimension \(d\), the parameter count \(P\) is approximately:
$$ P \approx 12\,L\,d^2 $$
This accounts for four projection matrices per layer (\(W^Q\), \(W^K\), \(W^V\), \(W^O\), each \(d \times d\)), the FFN gate and up projections (\(d \times \tfrac{8}{3}d\) each), and the down projection.
| Model | \(L\) | \(d\) | Params (formula) | Params (actual) |
|---|---|---|---|---|
| 7B | 32 | 4096 | 6.4 B | 6.7 B |
| 13B | 40 | 5120 | 12.6 B | 13.0 B |
| 30B | 60 | 6656 | 31.9 B | 32.5 B |
| 65B | 80 | 8192 | 64.4 B | 65.2 B |
Discrepancy
The formula slightly undercounts because it omits the embedding table and the output projection (\(V \times d\) each, where \(V\) is the vocabulary size) and the RMSNorm weights (\(d\) per norm). These contribute <5 % of total parameters.
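The per-layer breakdown can be checked with a few lines of Zig. This is a sketch under the assumptions stated above (square attention projections and an FFN width of \(\tfrac{8}{3}d\)); estimateParams is an illustrative name, not a ZigLlama function.

```zig
const std = @import("std");

/// Approximate parameter count of the transformer blocks only
/// (no embedding table, output head, or norm weights).
fn estimateParams(layers: u64, dim: u64) u64 {
    const attention = 4 * dim * dim; // W^Q, W^K, W^V, W^O, each d x d
    const ffn = 8 * dim * dim; // gate, up, down: three matrices of d x (8/3)d
    return layers * (attention + ffn);
}

pub fn main() void {
    const params = estimateParams(32, 4096); // 7B row of the table
    const billions = @as(f64, @floatFromInt(params)) / 1e9;
    std.debug.print("{d:.1} B parameters\n", .{billions});
}
```

For the 7B shape this evaluates to about 6.4 B, the value in the formula column of the table.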
Weight Memory¶
Given the parameter count \(P\) and the number of bits per weight (\(b\)), the weight memory is:
$$ M_{\text{weights}} = \frac{P \times b}{8}\;\text{bytes} $$
| Format | \(b\) | 7B Size | 13B Size | 65B Size |
|---|---|---|---|---|
| FP32 | 32 | 28.0 GB | 52.0 GB | 260 GB |
| FP16 | 16 | 14.0 GB | 26.0 GB | 130 GB |
| Q8_0 | 8 | 7.0 GB | 13.0 GB | 65 GB |
| Q6_K | 6.5 | 5.7 GB | 10.6 GB | 53 GB |
| Q4_K_M | 4.5 | 3.9 GB | 7.3 GB | 37 GB |
| Q4_0 | 4 | 3.5 GB | 6.5 GB | 33 GB |
| IQ2_XS | 2.3 | 2.0 GB | 3.7 GB | 19 GB |
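The table rows follow directly from parameter count times bits per weight. The sketch below reproduces a few of them, assuming nominal parameter counts (7e9 for the 7B model) and decimal gigabytes; weightBytes is an illustrative name.

```zig
const std = @import("std");

/// Weight memory in bytes for `params` parameters stored at `bits_per_weight` bits each.
fn weightBytes(params: f64, bits_per_weight: f64) f64 {
    return params * bits_per_weight / 8.0;
}

pub fn main() void {
    const formats = [_]struct { name: []const u8, bits: f64 }{
        .{ .name = "FP16", .bits = 16.0 },
        .{ .name = "Q8_0", .bits = 8.0 },
        .{ .name = "Q4_K_M", .bits = 4.5 },
    };
    for (formats) |f| {
        const gb = weightBytes(7e9, f.bits) / 1e9; // decimal GB
        std.debug.print("{s}: {d:.1} GB\n", .{ f.name, gb });
    }
}
```

The bits-per-weight values for the K-quant formats are averages, so real file sizes can differ slightly from these estimates.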
Activation Memory¶
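During a forward pass, each transformer layer produces intermediate activations. When a sequence of \(n\) tokens is processed in one pass through a model with dimension \(d\) and \(H\) attention heads, the per-layer FP32 activation memory is approximately:
$$ M_{\text{act}} \approx 4Hn^2 + 12nd + \tfrac{32}{3}nd\;\text{bytes} $$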
Concretely, the dominant activations are:
| Activation | Shape | Size (FP32) |
|---|---|---|
| Attention scores | \(H \times n \times n\) | \(4Hn^2\) bytes |
| QKV projections | \(3 \times n \times d\) | \(12nd\) bytes |
| FFN intermediate | \(n \times \tfrac{8}{3}d\) | \(\tfrac{32}{3}nd\) bytes |
For a 7B model (\(d=4096\), \(H=32\)) at context length \(n=2048\):
- Attention scores: \(4 \times 32 \times 2048^2 \approx 537\;\text{MB}\) (per layer, without KV cache)
- QKV projections: \(12 \times 2048 \times 4096 \approx 101\;\text{MB}\)
- FFN intermediate: \(\tfrac{32}{3} \times 2048 \times 4096 \approx 89\;\text{MB}\)
Activation memory scales quadratically
The attention-score tensor grows as \(O(n^2)\). At \(n=8192\) the scores alone require 8.6 GB per layer. This is why KV caching (which reduces attention to \(O(n)\) per step) is essential for long contexts.
In practice, ZigLlama only materialises one layer's activations at a time, reusing the buffer across layers. Peak activation memory is therefore a single layer's worth, not \(L\) layers.
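The size of that single reusable buffer can be estimated from the three dominant tensors. The following sketch assumes FP32 activations and that the whole sequence of \(n\) tokens is processed at once; the function names are illustrative and not taken from ZigLlama.

```zig
const std = @import("std");

const bytes_per_f32: u64 = 4;

/// FP32 attention-score bytes for one layer: H x n x n elements.
fn attentionScoreBytes(heads: u64, n: u64) u64 {
    return bytes_per_f32 * heads * n * n;
}

/// FP32 Q, K and V projection bytes for one layer: 3 x n x d elements.
fn qkvBytes(n: u64, dim: u64) u64 {
    return bytes_per_f32 * 3 * n * dim;
}

/// FP32 FFN intermediate bytes for one layer: n x (8/3)d elements, rounded down.
fn ffnBytes(n: u64, dim: u64) u64 {
    return bytes_per_f32 * n * dim * 8 / 3;
}

pub fn main() void {
    const n: u64 = 2048;
    const dim: u64 = 4096;
    const heads: u64 = 32;
    const total = attentionScoreBytes(heads, n) + qkvBytes(n, dim) + ffnBytes(n, dim);
    std.debug.print("per-layer activations: {d} MB\n", .{total / 1_000_000});
}
```

Because the buffer is reused across layers, this per-layer figure is also the peak; doubling n roughly quadruples the attention-score term.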
KV Cache Memory¶
The KV cache stores the key and value tensors for all layers and all past positions. For \(L\) layers, \(H\) heads, head dimension \(d_h\), and context length \(n\):
$$ M_{\text{KV}} = 2 \times L \times n \times H \times d_h \times 4\;\text{bytes} $$
The factor of 2 accounts for keys and values; the factor of 4 is the byte width of f32.
| Model | \(L\) | \(H\) | \(d_h\) | Cache at \(n=2048\) | Cache at \(n=4096\) |
|---|---|---|---|---|---|
| 7B | 32 | 32 | 128 | 2.15 GB | 4.29 GB |
| 13B | 40 | 40 | 128 | 3.36 GB | 6.71 GB |
| 30B | 60 | 52 | 128 | 6.54 GB | 13.09 GB |
| 65B | 80 | 64 | 128 | 10.74 GB | 21.47 GB |
Reducing KV cache memory
- FP16 cache: Halves the cache by storing keys/values in half precision. Minimal quality impact.
- Grouped-query attention (GQA): Models like LLaMA 2 70B use fewer KV heads than query heads, reducing the cache by \(H_q / H_{kv}\).
- Sliding-window attention: Mistral limits attention to a fixed window, capping cache size regardless of context length.
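The effect of the first two options can be explored with a small helper. This is a sketch, assuming the cache layout described above (keys and values for every layer, position, KV head, and head dimension); kvCacheBytes and the 8-KV-head configuration are illustrative, not ZigLlama settings.

```zig
const std = @import("std");

/// KV cache size in bytes: keys and values (factor 2) for every layer,
/// position, KV head and head dimension.
fn kvCacheBytes(layers: u64, n: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) u64 {
    return 2 * layers * n * kv_heads * head_dim * bytes_per_elem;
}

pub fn main() void {
    // 7B-style shape: L=32, H=32, d_h=128, context length 2048.
    const fp32 = kvCacheBytes(32, 2048, 32, 128, 4);
    const fp16 = kvCacheBytes(32, 2048, 32, 128, 2);
    // Hypothetical GQA variant with 8 KV heads instead of 32.
    const gqa = kvCacheBytes(32, 2048, 8, 128, 4);
    std.debug.print("fp32={d} fp16={d} gqa={d} bytes\n", .{ fp32, fp16, gqa });
}
```

Passing the number of KV heads rather than query heads is what lets the same function cover grouped-query attention: the cache shrinks by the ratio \(H_q / H_{kv}\).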
Peak Memory Analysis¶
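The total peak memory during inference is the sum of the weight, KV cache, activation, and overhead components:
$$ M_{\text{peak}} = M_{\text{weights}} + M_{\text{KV}} + M_{\text{act}} + M_{\text{overhead}} $$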
For a 7B Q4_K_M model at context length 2048:
| Component | Size | % of Peak |
|---|---|---|
| Weights (Q4_K_M) | 3.9 GB | 62.4 % |
| KV cache (FP32) | 2.15 GB | 34.4 % |
| Activations (1 layer) | 0.15 GB | 2.4 % |
| Overhead (allocator, HTTP) | 0.05 GB | 0.8 % |
| Total | 6.25 GB | 100 % |
pie title Peak Memory Breakdown (7B Q4_K_M)
    "Weights" : 3.9
    "KV Cache" : 2.15
    "Activations" : 0.15
    "Overhead" : 0.05
Optimisation Strategies¶
1. Quantise Weights¶
The most impactful lever. Moving from FP16 to Q4_K_M reduces weight memory by 3.6x.
2. Quantise the KV Cache¶
Storing keys and values in FP16 halves cache memory relative to FP32 with negligible quality loss. An INT8 KV cache (experimental) reduces it by 4x relative to FP32.
3. Reduce Context Length¶
If your application generates short responses (e.g., classification, entity extraction), set max_seq_len to the minimum sufficient value. Going from 4096 to 512 reduces KV cache by 8x.
4. Share Activations Across Layers¶
ZigLlama already does this: a single activation buffer is allocated and reused for every layer. Ensure custom model implementations follow the same pattern.
5. Memory-Mapped Model Loading¶
mmap avoids holding two copies of the weights in RAM during loading (one in the OS page cache, one in a heap buffer). The mapped pages in the page cache serve as the in-memory copy.
6. Monitor with GPA¶
// Periodically log memory usage.
// Requires .enable_memory_limit = true in the GPA config.
const bytes = gpa.total_requested_bytes;
std.log.info("Memory in use: {d:.1} MB", .{
    @as(f64, @floatFromInt(bytes)) / (1024 * 1024),
});
Memory Planning Calculator¶
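Use this formula to estimate whether a model fits in your available RAM, given \(P\) parameters, \(b\) bits per weight, \(L\) layers, context length \(n\), \(H\) heads of dimension \(d_h\), an FP32 KV cache, and roughly 0.2 GB for activations and overhead:
$$ M_{\text{total}} \approx \frac{P \times b}{8} + 2 \times L \times n \times H \times d_h \times 4 + 0.2\;\text{GB} $$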
Example: 13B Q4_K_M, context 4096
$$ \frac{13 \times 10^9 \times 4.5}{8} + 2 \times 40 \times 4096 \times 128 \times 40 \times 4 + 0.2\;\text{GB} $$
$$ = 7.3\;\text{GB} + 6.71\;\text{GB} + 0.2\;\text{GB} = 14.21\;\text{GB} $$
This fits in a 16 GB machine, though with little headroom, but not in 8 GB.
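The same estimate can be wrapped in a small planning helper. This is a sketch rather than a ZigLlama API: ModelShape, totalBytes, and the flat overhead argument are assumptions made for illustration, and sizes are in decimal bytes.

```zig
const std = @import("std");

// Hypothetical description of a model; fields are f64 to keep the arithmetic simple.
const ModelShape = struct {
    params: f64,
    layers: f64,
    heads: f64,
    head_dim: f64,
};

/// Estimated bytes needed: quantised weights + KV cache + a flat overhead.
fn totalBytes(m: ModelShape, bits_per_weight: f64, n: f64, kv_elem_bytes: f64, overhead: f64) f64 {
    const weights = m.params * bits_per_weight / 8.0;
    const kv = 2.0 * m.layers * n * m.heads * m.head_dim * kv_elem_bytes;
    return weights + kv + overhead;
}

pub fn main() void {
    const m13 = ModelShape{ .params = 13e9, .layers = 40, .heads = 40, .head_dim = 128 };
    // Q4_K_M weights (4.5 bits), context 4096, FP32 cache (4 bytes), 0.2 GB overhead.
    const need = totalBytes(m13, 4.5, 4096, 4, 0.2e9);
    std.debug.print("required: {d:.2} GB, fits in 16 GB: {}\n", .{ need / 1e9, need <= 16e9 });
}
```

Changing kv_elem_bytes to 2 models an FP16 cache, and lowering n shows how much headroom a shorter context buys on a given machine.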
Source Reference¶
| File | Key Types |
|---|---|
| src/inference/kv_cache.zig | KVCache, cache sizing logic |
| src/inference/profiling.zig | InferenceProfiler, memory tracking |
| src/foundation/memory_mapping.zig | MemoryMappedFile |
| src/foundation/tensor.zig | Tensor, allocation tracking |