Optimization Guide¶
This guide provides a systematic approach to improving ZigLlama's inference performance. The overarching principle is profile first, optimise second -- identify the actual bottleneck before applying any technique.
Profile First¶
The cardinal rule
Never optimise without measurement. A 10x improvement to a function that consumes 1 % of wall-clock time yields a 0.1 % end-to-end gain.
Identifying Bottlenecks¶
ZigLlama includes a profiling module (src/inference/profiling.zig) that timestamps each phase of generation:
const profiling = @import("inference/profiling.zig");
var profiler = profiling.InferenceProfiler.init(allocator);
defer profiler.deinit();
// ... run generation ...
profiler.printReport();
Typical output for a 7B model:
Phase Time (ms) % Total
--------------------------------------------
Tokenisation 2.1 0.4%
Embedding lookup 1.5 0.3%
Attention (QKV proj) 85.3 16.2%
Attention (softmax) 42.0 8.0%
Feed-forward 310.5 59.0%
Output projection 62.1 11.8%
Sampling 1.2 0.2%
Detokenisation 0.8 0.2%
Other 20.5 3.9%
In this profile, the feed-forward network dominates. Since FFN layers are pure matrix multiplications, the highest-impact optimisation is SIMD matmul.
Memory Optimisations¶
Quantisation Selection¶
Choose a quantisation level based on your memory budget and quality requirements:
flowchart TD
Start["Available RAM?"] --> R64[">=32 GB"]
Start --> R16["16-32 GB"]
Start --> R8["8-16 GB"]
Start --> R4["<8 GB"]
R64 --> Q8["Q8_0 or Q6_K<br/>Best quality"]
R16 --> QK["Q4_K_M<br/>Good trade-off"]
R8 --> Q4["Q4_0 or IQ3_XS<br/>Moderate quality"]
R4 --> IQ["IQ2_XS or IQ1_S<br/>Aggressive compression"] | RAM Budget | 7B Recommendation | 13B Recommendation |
|---|---|---|
| 64 GB | Q8_0 | Q6_K |
| 32 GB | Q6_K | Q4_K_M |
| 16 GB | Q4_K_M | IQ3_XS |
| 8 GB | IQ3_XS | -- |
| 4 GB | IQ2_XS | -- |
Memory-Mapped I/O¶
For models stored on fast SSDs, memory mapping (mmap) enables lazy loading -- pages are faulted into RAM only when first accessed. This reduces time-to-first-token because the entire file need not be read sequentially:
const mmap = @import("foundation/memory_mapping.zig");
var mapping = try mmap.MemoryMappedFile.init(allocator, "model.gguf");
defer mapping.deinit();
// Access tensors directly through the mapped region
const tensor_data = mapping.getSlice(offset, length);
SSD vs HDD
Memory mapping on a spinning disk can cause severe random-access latency. On HDD, prefer sequential reading into a heap buffer.
KV Cache Sizing¶
Pre-allocate the KV cache to the maximum expected context length to avoid mid-generation reallocation:
const kv_cache = @import("inference/kv_cache.zig");
var cache = try kv_cache.KVCache.init(allocator, .{
.num_layers = 32,
.num_heads = 32,
.head_dim = 128,
.max_seq_len = 4096, // pre-allocate for up to 4096 tokens
});
defer cache.deinit();
If memory is tight, set max_seq_len to the actual generation limit rather than the model's training context length.
Compute Optimisations¶
SIMD Acceleration¶
ZigLlama uses Zig's @Vector type for portable SIMD:
const Vec4 = @Vector(4, f32);
fn dotProduct4(a: [*]const f32, b: [*]const f32) f32 {
const va: Vec4 = a[0..4].*;
const vb: Vec4 = b[0..4].*;
const product = va * vb;
return @reduce(.Add, product);
}
Ensuring AVX2
Compile with -Dcpu=x86_64_v3 (or -OReleaseFast) to guarantee AVX2 code generation. Without this flag, Zig may emit SSE2-only code that processes only 4 floats at a time instead of 8.
Verification:
# Check which SIMD extensions the binary uses
objdump -d zig-out/bin/zigllama | grep -c vfmadd
# Non-zero count confirms FMA instructions are present
BLAS Integration¶
For the largest matrices (output projection, embedding lookup), delegating to an optimised BLAS library can outperform hand-written SIMD by 2--3x:
const blas = @import("foundation/blas_integration.zig");
// Use system BLAS for large matmuls
blas.sgemm('N', 'N', m, n, k, 1.0, A, lda, B, ldb, 0.0, C, ldc);
ZigLlama detects OpenBLAS, Apple Accelerate, or Intel MKL at build time.
Threading Configuration¶
The thread pool size should match the number of physical cores (not logical cores with hyperthreading):
const threading = @import("foundation/threading.zig");
var pool = try threading.ThreadPool.init(allocator, .{
.num_threads = 8, // match physical core count
});
defer pool.deinit();
NUMA awareness
On multi-socket systems, pin threads to the socket closest to the model's memory. ZigLlama's threading module provides NUMA topology detection.
Inference Optimisations¶
KV Cache¶
The single most impactful optimisation for autoregressive generation. Without a cache, every forward pass recomputes attention for all previous tokens. With a cache, only the new token's key/value pair is computed and appended:
| Context Length | Without Cache | With Cache | Speedup |
|---|---|---|---|
| 128 | 1.6 s | 0.08 s | 20x |
| 512 | 25.6 s | 0.32 s | 80x |
| 2048 | 409 s | 1.28 s | 320x |
The speedup grows linearly with context length because the cached version is \(O(n)\) per token while the non-cached version is \(O(n^2)\).
Batch Processing¶
When serving multiple users, batch their prompts into a single forward pass:
const batching = @import("inference/batching.zig");
var batch = try batching.InferenceBatch.init(allocator, .{
.max_batch_size = 8,
.max_seq_len = 2048,
});
Batching improves GPU utilisation and amortises the cost of weight loads across sequences. On CPU, the benefit is smaller (2--4x) but still meaningful.
Streaming¶
Streaming sends tokens to the client as they are generated, reducing perceived latency to the time-to-first-token rather than the full generation time. No throughput change, but user experience improves dramatically.
Platform-Specific Tips¶
x86_64 (Intel / AMD)¶
- Compile with AVX2:
-Dcpu=x86_64_v3or-Dcpu=native. - Enable FMA: Fused multiply-add reduces round-off and increases throughput. Zig enables FMA automatically when targeting AVX2+.
- Large pages:
madvise(MADV_HUGEPAGE)on the model mmap region can reduce TLB misses by 10--20 % for large models. - Turbo boost: Ensure the CPU governor is set to
performanceduring benchmarks.
ARM (Apple Silicon, Graviton)¶
- NEON: Zig's
@Vector(4, f32)maps directly to NEON 128-bit registers. No special flags needed on aarch64. - Apple Accelerate: Link against Accelerate for hardware-optimised BLAS on macOS (
-framework Accelerateviabuild.zig). - Unified memory: Apple Silicon's unified memory architecture eliminates the CPU-GPU copy penalty, but ZigLlama is CPU-only and does not currently benefit from the GPU.
General¶
- Huge pages: Reduce TLB misses for models > 4 GB.
- NUMA pinning: Keep threads and data on the same socket.
- Compiler flags: Always benchmark with
-OReleaseFast. Debug builds are 10--50x slower due to bounds checking and safety instrumentation.
Optimisation Checklist¶
- Profile with
InferenceProfilerto identify the bottleneck. - Choose the appropriate quantisation level for your memory budget.
- Enable KV caching (enabled by default in
TextGenerator). - Compile with
-OReleaseFast -Dcpu=native. - Set thread count to physical core count.
- Use memory mapping for model loading.
- Consider BLAS integration for the largest matrices.
- Pre-allocate KV cache to avoid mid-generation reallocation.
- Use streaming to minimise perceived latency.
Source Reference¶
| File | Key Types |
|---|---|
src/inference/profiling.zig | InferenceProfiler |
src/inference/kv_cache.zig | KVCache |
src/inference/batching.zig | InferenceBatch |
src/foundation/threading.zig | ThreadPool |
src/foundation/blas_integration.zig | BLAS wrappers |
src/foundation/memory_mapping.zig | MemoryMappedFile |