Performance¶
This section provides quantitative data and actionable guidance for understanding and improving ZigLlama's inference throughput, memory consumption, and output quality.
Section Overview¶
graph TD
A[Benchmarks] --> B[Optimization Guide]
B --> C[Parity Analysis]
B --> D[Memory Profiling]
style A fill:#7c4dff,color:#fff | Page | Description |
|---|---|
| Benchmarks | Methodology, matrix-multiplication timings, inference throughput, KV cache impact, and memory usage tables. |
| Optimization Guide | Practical steps for profiling bottlenecks and applying SIMD, quantization, caching, and threading optimizations. |
| Parity Analysis | Feature-by-feature and performance comparison with llama.cpp. |
| Memory Profiling | Formulas for predicting model, activation, and KV cache memory; leak-detection workflow with Zig's GPA. |
Key Performance Numbers (7B Model, CPU-Only)¶
| Metric | Unoptimised | Fully Optimised |
|---|---|---|
| Tokens / second | ~5 | ~200 |
| Time per token | ~200 ms | ~5 ms |
| Peak memory (FP32) | 28 GB | -- |
| Peak memory (Q4_K_M) | -- | ~3.9 GB |
| KV cache speedup | 1x | 20x |
| SIMD speedup (matmul) | 1x | 3--5x |
| Combined speedup | 1x | ~400x |
Hardware
Numbers above are representative of a 2024-era x86_64 workstation (8-core, AVX2) running Zig 0.13 with -OReleaseFast. Your results will vary with CPU, memory bandwidth, and model size.