Parity Analysis with llama.cpp
ZigLlama is an educational reimplementation of transformer inference inspired by llama.cpp. This page provides an honest, detailed comparison of the two projects across features, performance, and coverage.
Feature Comparison
| Feature | llama.cpp | ZigLlama | Notes |
|---|---|---|---|
| Core Architecture | | | |
| Tensor operations | Full | Full | Both support n-dimensional tensors. |
| GGUF format | Full | Full | Read and write GGUF v3. |
| Tokenisation (BPE/SPM) | Full | Simplified | ZigLlama uses a word-level tokeniser for demos. |
| RoPE | Full | Full | Identical rotary encoding. |
| RMSNorm | Full | Full | |
| Quantisation | | | |
| Legacy (Q4_0, Q8_0, etc.) | 6 formats | 6 formats | Full parity. |
| K-Quant (Q4_K, Q5_K, Q6_K) | 8 formats | 6 formats | Missing Q2_K, Q3_K. |
| IQ (Importance Quant) | 8 formats | 6 formats | Missing IQ1_M, IQ4_NL. |
| Mixed precision | Yes | No | Per-layer format selection. |
| Inference | | | |
| Autoregressive generation | Full | Full | |
| KV cache | Full | Full | |
| Batch processing | Full | Basic | llama.cpp has continuous batching. |
| Streaming | Full | Full | SSE-compatible. |
| Grammar constraints (GBNF) | Full | Basic | ZigLlama supports JSON schema. |
| Mirostat sampling | V1 + V2 | V2 | |
| Typical sampling | Yes | Yes | |
| Speculative decoding | Yes | No | |
| Model Support | | | |
| LLaMA / LLaMA 2 / 3 | Full | Full | |
| Mistral / Mixtral | Full | Full | |
| GPT-2 | Full | Full | |
| Falcon | Full | Full | |
| Phi | Full | Full | |
| Qwen | Full | Full | |
| GPT-J | Full | Full | |
| GPT-NeoX | Full | Full | |
| BLOOM | Full | Full | |
| Mamba | Full | Full | |
| BERT | Partial | Full | |
| Gemma | Full | Full | |
| StarCoder | Full | Full | |
| Mixture of Experts | Full | Basic | |
| Multi-modal (LLaVA, etc.) | Full | Basic | |
| Server | | | |
| OpenAI-compatible API | Full | Full | Both support chat + completions. |
| Parallel requests | Full | Basic | llama.cpp uses slot-based scheduling. |
| Platform | | | |
| CPU (x86 AVX2) | Full | Full | |
| CPU (ARM NEON) | Full | Full | |
| CUDA | Full | No | Major gap. |
| Metal | Full | No | Major gap. |
| Vulkan | Full | No | |
| SYCL | Full | No | |
Performance Comparison (CPU-Only)
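All measurements use the same hardware (8-core x86_64, AVX2, 64 GB DDR5) and model (LLaMA 7B, Q4_K_M). The Ratio column expresses ZigLlama relative to llama.cpp, so a value below 1.0x means ZigLlama is worse: for throughput it is ZigLlama / llama.cpp, and for latency, memory, load time, and perplexity (where lower is better) it is llama.cpp / ZigLlama.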
| Metric | llama.cpp | ZigLlama | Ratio |
|---|---|---|---|
| Tokens/sec (single-threaded) | 8 | 5 | 0.6x |
| Tokens/sec (8 threads) | 45 | 18 | 0.4x |
| Time to first token | 120 ms | 200 ms | 0.6x |
| Peak memory | 4.8 GB | 5.2 GB | 0.9x |
| Model load time | 1.2 s | 1.8 s | 0.7x |
| Perplexity (WikiText-103) | 5.90 | 5.92 | 1.00x |
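The ratios in the table can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the helper names `higherIsBetter` and `lowerIsBetter` are hypothetical and not part of ZigLlama, and the direction of each ratio is inferred from the table values.

```zig
const std = @import("std");

/// Throughput-style metric (higher is better): ratio = ZigLlama / llama.cpp.
fn higherIsBetter(llama_cpp: f64, zigllama: f64) f64 {
    return zigllama / llama_cpp;
}

/// Cost-style metric (lower is better): ratio = llama.cpp / ZigLlama.
fn lowerIsBetter(llama_cpp: f64, zigllama: f64) f64 {
    return llama_cpp / zigllama;
}

pub fn main() void {
    // Values copied from the table above.
    std.debug.print("tokens/sec (8 threads): {d:.2}\n", .{higherIsBetter(45, 18)});
    std.debug.print("time to first token:    {d:.2}\n", .{lowerIsBetter(120, 200)});
    std.debug.print("peak memory:            {d:.2}\n", .{lowerIsBetter(4.8, 5.2)});
}
```

Under this convention every row reads the same way: the closer the ratio is to 1.0x, the closer ZigLlama is to llama.cpp on that metric.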
Why ZigLlama is slower
The performance gap stems from two factors:
- Matmul kernels: llama.cpp has hand-tuned SIMD kernels for AVX-512, AVX2, and NEON with cache-blocking. ZigLlama relies on Zig's `@Vector` abstraction, which produces good but not optimal code.
- Threading: llama.cpp uses a custom work-stealing scheduler with per-thread scratch buffers. ZigLlama uses `std.Thread.Pool` with coarser work partitioning.
Perplexity is nearly identical because the mathematical operations are equivalent; only the speed differs.
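To make the `@Vector` point concrete, here is a minimal sketch of a portable SIMD dot product, the inner loop of a matmul. It is not taken from ZigLlama's source: the function name `dot` and the lane count of 8 are assumptions chosen for illustration, and a production kernel would add cache-blocking and architecture-specific tuning on top.

```zig
const std = @import("std");

const lanes = 8; // assumption: 8 x f32 fills one 256-bit AVX2 register
const Vec = @Vector(lanes, f32);

/// Dot product written against Zig's portable vector type.
/// The compiler lowers the vector operations to whatever SIMD the target offers.
fn dot(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);

    var acc: Vec = @splat(0.0);
    var i: usize = 0;

    // Main loop: multiply and accumulate `lanes` elements per iteration.
    while (i + lanes <= a.len) : (i += lanes) {
        const va: Vec = a[i..][0..lanes].*;
        const vb: Vec = b[i..][0..lanes].*;
        acc += va * vb;
    }

    // Horizontal sum of the accumulator, then a scalar loop for the tail.
    var sum: f32 = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) {
        sum += a[i] * b[i];
    }
    return sum;
}

pub fn main() void {
    const a = [_]f32{ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };
    const b = [_]f32{1} ** 10;
    std.debug.print("dot = {d}\n", .{dot(&a, &b)});
}
```

The same source compiles for x86 and ARM without changes, which is the trade-off described above: portability and readability in exchange for the last increment of speed that per-architecture kernels provide.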
Quantisation Coverage
| Format | Bits/Weight | llama.cpp | ZigLlama |
|---|---|---|---|
| Q4_0 | 4.0 | Yes | Yes |
| Q4_1 | 4.5 | Yes | Yes |
| Q5_0 | 5.0 | Yes | Yes |
| Q5_1 | 5.5 | Yes | Yes |
| Q8_0 | 8.0 | Yes | Yes |
| Q2_K | 2.6 | Yes | No |
| Q3_K | 3.4 | Yes | No |
| Q4_K_S | 4.5 | Yes | Yes |
| Q4_K_M | 4.5 | Yes | Yes |
| Q5_K_S | 5.5 | Yes | Yes |
| Q5_K_M | 5.5 | Yes | Yes |
| Q6_K | 6.5 | Yes | Yes |
| IQ1_S | 1.5 | Yes | Yes |
| IQ1_M | 1.8 | Yes | No |
| IQ2_XXS | 2.1 | Yes | Yes |
| IQ2_XS | 2.3 | Yes | Yes |
| IQ3_XXS | 3.1 | Yes | Yes |
| IQ3_XS | 3.3 | Yes | Yes |
| IQ4_XS | 4.3 | Yes | Yes |
| IQ4_NL | 4.5 | Yes | No |
Coverage: 16 / 20 formats (80 %).
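The Bits/Weight column translates directly into approximate weight storage. The sketch below assumes a nominal 7.0e9 parameters for a "7B" model (real checkpoints differ slightly) and ignores the KV cache and scratch buffers; `weightBytes` is a hypothetical helper, not a ZigLlama function.

```zig
const std = @import("std");

/// Approximate weight storage in bytes for `n_params` weights
/// stored at an average of `bits_per_weight` bits each.
fn weightBytes(n_params: f64, bits_per_weight: f64) f64 {
    return n_params * bits_per_weight / 8.0;
}

pub fn main() void {
    const n_params: f64 = 7.0e9; // assumption: nominal parameter count

    // Bits/weight values copied from the table above.
    const formats = [_]struct { name: []const u8, bits: f64 }{
        .{ .name = "Q4_0", .bits = 4.0 },
        .{ .name = "Q4_K_M", .bits = 4.5 },
        .{ .name = "Q8_0", .bits = 8.0 },
    };

    for (formats) |f| {
        const gb = weightBytes(n_params, f.bits) / 1.0e9;
        std.debug.print("{s}: ~{d:.2} GB of weights\n", .{ f.name, gb });
    }
}
```

For Q4_K_M this estimate comes to roughly 3.9 GB of weights, which is in the same range as the peak-memory figures reported in the performance table once runtime buffers are added on top.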
Model Coverage
| Architecture | llama.cpp | ZigLlama |
|---|---|---|
| LLaMA / LLaMA 2 / 3 | Yes | Yes |
| Mistral / Mixtral | Yes | Yes |
| GPT-2 | Yes | Yes |
| Falcon | Yes | Yes |
| GPT-J | Yes | Yes |
| GPT-NeoX | Yes | Yes |
| BLOOM | Yes | Yes |
| Phi / Phi-2 / Phi-3 | Yes | Yes |
| Qwen / Qwen-2 | Yes | Yes |
| StarCoder | Yes | Yes |
| Mamba | Yes | Yes |
| BERT | Partial | Yes |
| Gemma | Yes | Yes |
| MoE (Mixtral) | Yes | Basic |
| Multi-modal (LLaVA) | Yes | Basic |
| Command-R | Yes | No |
| InternLM | Yes | No |
| Orion | Yes | No |
Coverage: 15 / 18 listed architectures (83 %), counting the two with Basic support.
ZigLlama's Unique Advantages
Despite the performance gap, ZigLlama offers several properties that llama.cpp does not:
Educational Design
- Layered architecture: Six progressively complex layers, each independently testable.
- Inline documentation: Every function includes a docstring explaining the mathematical operation it implements.
- Educational tests: 285+ tests that double as executable specifications.
Code Clarity
- Allocation-explicit: Every allocation is visible and paired with a `defer deinit`. No hidden allocators, no global state.
- No C dependencies (core): The inference engine is pure Zig with no `libc` requirement. BLAS integration is optional.
- Readable builds: `build.zig` is a single file with named targets for every example and test.
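The allocation-explicit style can be sketched as follows. `Tensor` here is a hypothetical stand-in written for this example, not ZigLlama's actual tensor type; the point is only the pattern of passing an allocator in and pairing `init` with `defer deinit`.

```zig
const std = @import("std");

/// Hypothetical tensor type, for illustration only.
const Tensor = struct {
    data: []f32,
    allocator: std.mem.Allocator,

    /// The caller supplies the allocator, so the allocation is visible at the call site.
    fn init(allocator: std.mem.Allocator, len: usize) !Tensor {
        const data = try allocator.alloc(f32, len);
        @memset(data, 0.0);
        return Tensor{ .data = data, .allocator = allocator };
    }

    /// Frees the buffer with the same allocator that created it.
    fn deinit(self: *Tensor) void {
        self.allocator.free(self.data);
        self.* = undefined;
    }
};

pub fn main() !void {
    var t = try Tensor.init(std.heap.page_allocator, 16);
    defer t.deinit(); // the release is declared right next to the allocation

    t.data[0] = 1.0;
    std.debug.print("len = {d}, first = {d}\n", .{ t.data.len, t.data[0] });
}
```

In tests, swapping in `std.testing.allocator` makes a forgotten `deinit` fail the test, which is one way the leak detection described under Safety shows up in practice.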
Safety
- Bounds checking: Debug builds catch out-of-bounds tensor access at the exact call site.
- Leak detection: Zig's general-purpose allocator (GPA) reports every leaked allocation when it is deinitialised, typically at process exit.
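- Checked illegal behaviour: In Debug and ReleaseSafe builds, Zig turns many of the bug classes that plague C codebases (integer overflow, out-of-bounds access, unwrapping a null optional) into immediate panics. ReleaseFast builds disable these checks.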
Modern Tooling
- Cross-compilation: `zig build -Dtarget=aarch64-linux` produces an ARM binary from an x86 host in one command.
- Hermetic builds: No Makefiles, no CMake, no system-library hunting.
- Integrated testing: `zig build test` runs all 285+ tests with a single invocation.
Parity Roadmap
| Gap | Priority | Difficulty | Status |
|---|---|---|---|
| Q2_K, Q3_K quantisation | Medium | Low | Planned |
| Speculative decoding | Low | High | Research |
| Continuous batching | Medium | Medium | Planned |
| Mixed-precision quantisation | High | Medium | Planned |
| CUDA / Metal backends | Low | Very High | Not planned (educational focus) |
Contributing
Closing the parity gap is a community effort. See Contributing for guidelines on submitting quantisation kernels, model-support patches, or benchmark results.