Memory Profiling¶

Accurate memory prediction is essential for choosing quantisation levels, setting context lengths, and provisioning hardware. This page derives the formulas for each memory component and shows how to use Zig's built-in tools to detect leaks and measure peak usage.

Zig's GeneralPurposeAllocator and Leak Detection¶

Zig's GeneralPurposeAllocator (GPA) is a debug allocator that tracks every allocation and reports leaks when deinit is called:

var gpa = std.heap.GeneralPurposeAllocator(.{
    .stack_trace_frames = 8,  // capture 8 frames per allocation
}){};
defer {
    const status = gpa.deinit();
    if (status == .leak) {
        std.log.err("LEAK DETECTED", .{});
    }
}
const allocator = gpa.allocator();

When a leak is found, GPA prints the stack trace of the offending allocation:

error: memory leak at 0x7f3a2c001000 (4096 bytes)
  src/models/llama.zig:142:25: LLaMAModel.init
  src/server/http_server.zig:182:19: ZigLlamaServer.loadModel
  examples/main.zig:23:5: main

Release builds

GPA is active only in debug and ReleaseSafe modes. In ReleaseFast the allocator degrades to a thin wrapper around malloc with no tracking overhead.

Workflow¶

Run the test suite under GPA: zig build test.
If GPA reports zero leaks, the allocator contract is satisfied.
For long-running server processes, periodically log gpa.total_requested_bytes to monitor growth.

Model Memory Budget¶

Parameter Count Formula¶

For a standard LLaMA-style transformer with $L$ layers and embedding dimension $d$:

\[ \text{params} \approx 12Ld^2 \]

This accounts for four projection matrices per layer ($W^Q$, $W^K$, $W^V$, $W^O$, each $d \times d$), the FFN gate and up projections ($d \times \tfrac{8}{3}d$ each), and the down projection.

Model	$L$	$d$	Params (formula)	Params (actual)
7B	32	4096	6.4 B	6.7 B
13B	40	5120	12.6 B	13.0 B
30B	60	6656	31.9 B	32.5 B
65B	80	8192	64.4 B	65.2 B

Discrepancy

The formula slightly undercounts because it omits the embedding table ($V \times d$, where $V$ is the vocabulary size) and the final RMSNorm parameters ($d$ per layer). These contribute <5 % of total parameters.

Weight Memory¶

Given the parameter count and the number of bits per weight ($b$):

\[ \text{weight memory} = \frac{\text{params} \times b}{8} \;\text{bytes} \]

Format	$b$	7B Size	13B Size	65B Size
FP32	32	28.0 GB	52.0 GB	260 GB
FP16	16	14.0 GB	26.0 GB	130 GB
Q8_0	8	7.0 GB	13.0 GB	65 GB
Q6_K	6.5	5.5 GB	10.6 GB	53 GB
Q4_K_M	4.5	3.9 GB	7.3 GB	37 GB
Q4_0	4	3.5 GB	6.5 GB	33 GB
IQ2_XS	2.3	2.0 GB	3.7 GB	19 GB

Activation Memory¶

During a forward pass, each transformer layer produces intermediate activations. For a single token at position $n$ in a model with dimension $d$:

\[ \text{activation per layer} = O(n \cdot d) \]

Concretely, the dominant activations are:

Activation	Shape	Size (FP32)
Attention scores	$H \times n \times n$	$4Hn^2$ bytes
QKV projections	$3 \times n \times d$	$12nd$ bytes
FFN intermediate	$n \times \tfrac{8}{3}d$	$\tfrac{32}{3}nd$ bytes

For a 7B model ($d=4096$, $H=32$) at context length $n=2048$:

Attention scores: $4 \times 32 \times 2048^2 \approx 537\;\text{MB}$ (per layer, without KV cache)
QKV projections: $12 \times 2048 \times 4096 \approx 96\;\text{MB}$
FFN intermediate: $\tfrac{32}{3} \times 2048 \times 4096 \approx 89\;\text{MB}$

Activation memory scales quadratically

The attention-score tensor grows as $O(n^2)$. At $n=8192$ the scores alone require 8.6 GB per layer. This is why KV caching (which reduces attention to $O(n)$ per step) is essential for long contexts.

In practice, ZigLlama only materialises one layer's activations at a time, reusing the buffer across layers. Peak activation memory is therefore a single layer's worth, not $L$ layers.

KV Cache Memory¶

The KV cache stores the key and value tensors for all layers and all past positions. For $L$ layers, $H$ heads, head dimension $d_h$, and context length $n$:

\[ \text{KV cache} = 2 \times L \times n \times d_h \times H \times 4 \;\text{bytes (FP32)} \]

The factor of 2 accounts for keys and values; the factor of 4 is the byte width of f32.

Model	$L$	$H$	$d_h$	Cache at $n=2048$	Cache at $n=4096$
7B	32	32	128	1.07 GB	2.15 GB
13B	40	40	128	1.68 GB	3.36 GB
30B	60	52	128	3.25 GB	6.50 GB
65B	80	64	128	5.37 GB	10.74 GB

Reducing KV cache memory

FP16 cache: Halves the cache by storing keys/values in half precision. Minimal quality impact.
Grouped-query attention (GQA): Models like LLaMA 2 70B use fewer KV heads than query heads, reducing the cache by $H_q / H_{kv}$.
Sliding-window attention: Mistral limits attention to a fixed window, capping cache size regardless of context length.

Peak Memory Analysis¶

The total peak memory during inference is:

\[ \text{peak} = \text{weights} + \text{KV cache} + \text{activations} + \text{overhead} \]

For a 7B Q4_K_M model at context length 2048:

Component	Size	% of Peak
Weights (Q4_K_M)	3.9 GB	75 %
KV cache (FP32)	1.07 GB	21 %
Activations (1 layer)	0.15 GB	3 %
Overhead (allocator, HTTP)	0.05 GB	1 %
Total	5.17 GB	100 %

pie title Peak Memory Breakdown (7B Q4_K_M)
    "Weights" : 3.9
    "KV Cache" : 1.07
    "Activations" : 0.15
    "Overhead" : 0.05

Optimisation Strategies¶

1. Quantise Weights¶

The most impactful lever. Moving from FP16 to Q4_K_M reduces weight memory by 3.6x.

2. Quantise the KV Cache¶

Storing keys and values in FP16 halves cache memory with negligible quality loss. INT8 KV cache (experimental) reduces it by 4x.

3. Reduce Context Length¶

If your application generates short responses (e.g., classification, entity extraction), set max_seq_len to the minimum sufficient value. Going from 4096 to 512 reduces KV cache by 8x.

ZigLlama already does this: a single activation buffer is allocated and reused for every layer. Ensure custom model implementations follow the same pattern.

5. Memory-Mapped Model Loading¶

mmap avoids doubling memory during loading (one copy on disk, one in RAM). The OS page cache serves as the in-memory copy.

6. Monitor with GPA¶

// Periodically log memory usage
const bytes = gpa.total_requested_bytes;
std.log.info("Memory in use: {d:.1f} MB", .{
    @as(f64, @floatFromInt(bytes)) / (1024 * 1024),
});

Memory Planning Calculator¶

Use this formula to estimate whether a model fits in your available RAM:

\[ \text{required RAM} \geq \frac{\text{params} \times b}{8} + 2Lnd_h H \times 4 + 0.2\;\text{GB} \]

Example: 13B Q4_K_M, context 4096

$$ \frac{13 \times 10^9 \times 4.5}{8} + 2 \times 40 \times 4096 \times 128 \times 40 \times 4 + 0.2\;\text{GB} $$ $$ = 7.3\;\text{GB} + 3.36\;\text{GB} + 0.2\;\text{GB} = 10.86\;\text{GB} $$ This fits comfortably in a 16 GB machine but not in 8 GB.

Source Reference¶

File	Key Types
`src/inference/kv_cache.zig`	`KVCache`, cache sizing logic
`src/inference/profiling.zig`	`InferenceProfiler`, memory tracking
`src/foundation/memory_mapping.zig`	`MemoryMappedFile`
`src/foundation/tensor.zig`	`Tensor`, allocation tracking

Activation	Shape	Size (FP32)
Attention scores	\(H \times n \times n\)	\(4Hn^2\) bytes
QKV projections	\(3 \times n \times d\)	\(12nd\) bytes
FFN intermediate	\(n \times \tfrac{8}{3}d\)	\(\tfrac{32}{3}nd\) bytes