Performance

This section provides quantitative data and actionable guidance for understanding and improving ZigLlama's inference throughput, memory consumption, and output quality.

Section Overview

graph TD
    A[Benchmarks] --> B[Optimization Guide]
    B --> C[Parity Analysis]
    B --> D[Memory Profiling]
    style A fill:#7c4dff,color:#fff

Page	Description
Benchmarks	Methodology, matrix-multiplication timings, inference throughput, KV cache impact, and memory usage tables.
Optimization Guide	Practical steps for profiling bottlenecks and applying SIMD, quantization, caching, and threading optimizations.
Parity Analysis	Feature-by-feature and performance comparison with llama.cpp.
Memory Profiling	Formulas for predicting model, activation, and KV cache memory; leak-detection workflow with Zig's GPA.

Key Performance Numbers (7B Model, CPU-Only)

Metric	Unoptimized	Fully Optimized
Tokens / second	~5	~200
Time per token	~200 ms	~5 ms
Peak memory (FP32)	28 GB	--
Peak memory (Q4_K_M)	--	~3.9 GB
KV cache speedup	1x	20x
SIMD speedup (matmul)	1x	3-5x
Combined speedup	1x	~400x
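
Several rows in the table are linked by simple arithmetic, and a short script makes those links explicit. The sketch below is a back-of-the-envelope check, not part of ZigLlama. It assumes that "7B" means exactly 7e9 parameters, that 1 GB is 1e9 bytes, and that the memory figures count model weights only. The function names are hypothetical and chosen for illustration.

```python
# Back-of-the-envelope checks for the 7B table.
# Assumptions (not from the table): 7B = exactly 7e9 parameters,
# 1 GB = 1e9 bytes, and only model weights are counted.

PARAMS_7B = 7e9
BYTES_PER_GB = 1e9


def ms_per_token(tokens_per_second: float) -> float:
    """Convert throughput (tokens/s) to latency (ms/token)."""
    return 1000.0 / tokens_per_second


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory needed to store the weights alone, in GB."""
    return n_params * bits_per_weight / 8 / BYTES_PER_GB


def implied_bits_per_weight(memory_gb: float, n_params: float) -> float:
    """Average bits per weight implied by a given weight footprint."""
    return memory_gb * BYTES_PER_GB * 8 / n_params


print(ms_per_token(5))                                     # 200.0 ms
print(ms_per_token(200))                                   # 5.0 ms
print(weight_memory_gb(PARAMS_7B, 32))                     # 28.0 GB for FP32
print(round(implied_bits_per_weight(3.9, PARAMS_7B), 2))   # 4.46 bits/weight
print(200 / 5)                                             # 40.0x throughput ratio
```

Under these assumptions the FP32 figure is exactly 4 bytes per parameter, and the ~3.9 GB Q4_K_M figure corresponds to roughly 4.5 bits per weight on average. The throughput row on its own gives a ratio of about 40x; the combined-speedup row is quoted as reported and is not derived from these rows.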

Hardware

The numbers above are representative of a 2024-era x86_64 workstation (8-core, AVX2), with ZigLlama built using Zig 0.13 and -OReleaseFast. Your results will vary with CPU, memory bandwidth, and model size.