Comparison with llama.cpp
ZigLlama and llama.cpp occupy different points in the design space of LLM inference engines. llama.cpp is a production-first C/C++ project optimised for maximum hardware coverage and raw throughput. ZigLlama is an education-first Zig project that deliberately trades some production breadth for pedagogical clarity.
This page provides an honest, detailed comparison.
1. Executive Summary
| Dimension | ZigLlama | llama.cpp |
|---|---|---|
| Primary goal | Education + inference | Production inference |
| Language | Zig | C / C++ |
| Production feature parity | ~90% | 100% (reference) |
| Educational parity | 100% (reference) | ~20% |
| Quantisation formats | 18+ | 30+ |
| Model architectures | 18 | 94+ |
| GPU backends | None (CPU only) | CUDA, Metal, Vulkan, SYCL, HIP, Kompute |
| GGUF compatibility | Full v3 | Full v3 (originator) |
| Test count | 285+ | ~200 (integration-focused) |
| Inline documentation | Extensive (every function) | Sparse |
Parity Definition
Production parity measures the fraction of llama.cpp features that ZigLlama implements. Educational parity measures the fraction of concepts that are documented to a level sufficient for a graduate student to learn from the source alone.
2. Feature Comparison Table
| Category | Feature | ZigLlama | llama.cpp | Parity |
|---|---|---|---|---|
| Core | GGUF loading | Full v3 | Full v3 | 100% |
| Core | Autoregressive generation | Yes | Yes | 100% |
| Core | KV cache | Multi-seq + sliding window | Multi-seq + sliding window + paged | 90% |
| Core | Streaming output | Thread-safe | Thread-safe | 100% |
| Core | Batch inference | Dynamic batching | Continuous batching | 85% |
| Sampling | Greedy | Yes | Yes | 100% |
| Sampling | Top-K | Yes | Yes | 100% |
| Sampling | Top-P (nucleus) | Yes | Yes | 100% |
| Sampling | Temperature | Yes | Yes | 100% |
| Sampling | Mirostat v1/v2 | Yes | Yes | 100% |
| Sampling | Typical sampling | Yes | Yes | 100% |
| Sampling | Tail-free sampling | Yes | Yes | 100% |
| Sampling | Classifier-free guidance | Yes | Yes | 100% |
| Sampling | Contrastive search | Yes | No | -- |
| Sampling | Grammar constraints | JSON, Regex, CFG, XML, EBNF | JSON, Regex, CFG, EBNF | 100% |
| Quantisation | Q4_0 / Q4_1 | Yes | Yes | 100% |
| Quantisation | Q5_0 / Q5_1 | Yes | Yes | 100% |
| Quantisation | Q8_0 / Q8_1 | Yes | Yes | 100% |
| Quantisation | K-quants (Q4_K -- Q6_K) | Yes | Yes | 100% |
| Quantisation | IQ formats (IQ1_S -- IQ4_NL) | Yes (12 formats) | Yes (12 formats) | 100% |
| Quantisation | F16 / BF16 | F16 | F16 + BF16 | 80% |
| Quantisation | Q2_K / Q3_K | Type tags only | Full | 40% |
| Hardware | x86_64 SIMD (AVX/AVX2) | Yes | Yes | 100% |
| Hardware | ARM NEON | Yes | Yes | 100% |
| Hardware | CUDA | No | Yes | 0% |
| Hardware | Metal | No | Yes | 0% |
| Hardware | Vulkan | No | Yes | 0% |
| Hardware | SYCL | No | Yes | 0% |
| Hardware | BLAS integration | OpenBLAS, MKL, Accelerate | OpenBLAS, MKL, Accelerate, cuBLAS | 75% |
| Server | HTTP API | OpenAI-compatible | OpenAI-compatible | 90% |
| Server | Chat templates | ChatML, LLaMA-2, Alpaca, Vicuna | ChatML, LLaMA-2, Alpaca, + many more | 70% |
| Tooling | Model converter | GGUF conversion | GGUF + GGML + HF conversion | 50% |
| Tooling | Perplexity evaluation | Yes | Yes | 100% |
| Tooling | Profiling / benchmarks | Yes | Yes | 100% |
| Documentation | Inline math & theory | Every function | Rare | -- |
| Documentation | Architectural docs | Comprehensive MkDocs | README + wiki | -- |
| Documentation | Learning path | 6-layer progressive | None | -- |
3. Quantisation Format Coverage
Formats implemented in ZigLlama
ZigLlama supports 18+ quantisation formats across three families:
| Family | Formats | Bits/Weight | Description |
|---|---|---|---|
| Legacy | Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q8_1, INT8, F16 | 4--16 | Original GGML quantisation with per-block scaling |
| K-quant | Q4_K, Q5_K, Q6_K | 4--6 | Super-block quantisation with sub-block scales (256-element blocks) |
| IQ (importance) | IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M, IQ4_XS, IQ4_NL | 1--4 | Importance-weighted quantisation preserving critical weights |
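To make "per-block scaling" concrete, here is a minimal sketch of a Q8_0-style scheme: each block of 32 weights stores one scale plus 32 signed 8-bit integers. It is written in plain Python for readability and is not ZigLlama's or llama.cpp's code; the function names are hypothetical, and Python's `round` (ties to even) stands in for the C `roundf` a real kernel would use.

```python
BLOCK_SIZE = 32  # weights per block in the legacy GGML formats


def quantize_block_q8(values):
    """Quantise one block to (scale, int8 values) using a symmetric per-block scale."""
    assert len(values) == BLOCK_SIZE
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 0.0
    inv = 1.0 / scale if scale else 0.0
    quants = [max(-127, min(127, round(v * inv))) for v in values]
    return scale, quants


def dequantize_block_q8(scale, quants):
    """Reconstruct approximate weights: w_i ~= scale * q_i."""
    return [scale * q for q in quants]


def bits_per_weight(payload_bits, scale_bytes=2):
    """Storage cost per weight: quantised payload plus one fp16 scale per block."""
    return (BLOCK_SIZE * payload_bits + scale_bytes * 8) / BLOCK_SIZE


block = [(i - 16) / 10 for i in range(BLOCK_SIZE)]
scale, quants = quantize_block_q8(block)
restored = dequantize_block_q8(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f}  max_error={max_error:.5f}")
print(bits_per_weight(8), bits_per_weight(4))  # 8.5 and 4.5 bits per weight
```

The reconstruction error of any single weight is bounded by half the block's scale, which is why one outlier in a block degrades the precision of its 31 neighbours. The same accounting explains why a nominally 4-bit format with one fp16 scale per block costs 4.5 bits per weight rather than 4.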
Comparison with llama.cpp
llama.cpp supports 30+ formats, including several that ZigLlama has not yet implemented:
| Format | ZigLlama | llama.cpp | Notes |
|---|---|---|---|
| Q4_0 | Yes | Yes | |
| Q4_1 | Yes | Yes | |
| Q5_0 | Yes | Yes | |
| Q5_1 | Yes | Yes | |
| Q8_0 | Yes | Yes | |
| Q8_1 | Yes | Yes | |
| Q2_K | Tags only | Yes | Planned |
| Q3_K | Tags only | Yes | Planned |
| Q4_K | Yes | Yes | |
| Q5_K | Yes | Yes | |
| Q6_K | Yes | Yes | |
| IQ1_S -- IQ4_NL | Yes (12) | Yes (12) | Full parity |
| BF16 | No | Yes | Planned |
| Q4_0_4x4 | No | Yes | SIMD-optimised layout |
| Q4_0_4x8 | No | Yes | SIMD-optimised layout |
| Q4_0_8x8 | No | Yes | SIMD-optimised layout |
| TQ1_0 / TQ2_0 | No | Yes | Ternary quantisation |
Quantisation Quality Hierarchy
Ordered from smallest to largest model size at a given quality level:
4. Model Architecture Coverage
ZigLlama supports 18 of 94+ model architectures tracked by llama.cpp. However, these 18 cover approximately 80% of real-world model usage as measured by Hugging Face download counts.
| Architecture | ZigLlama | llama.cpp | HF Downloads (approx.) |
|---|---|---|---|
| LLaMA / LLaMA 2 / LLaMA 3 | Yes | Yes | Very High |
| Mistral / Mixtral | Yes (dense) | Yes (dense + MoE) | Very High |
| Phi / Phi-2 / Phi-3 | Yes | Yes | High |
| Gemma / Gemma 2 | Yes | Yes | High |
| Qwen / Qwen 2 | Yes | Yes | High |
| GPT-2 | Yes | Yes | High |
| BERT | Yes | Yes | High |
| Falcon | Yes | Yes | Medium |
| GPT-NeoX | Yes | Yes | Medium |
| GPT-J | Yes | Yes | Medium |
| BLOOM | Yes | Yes | Medium |
| StarCoder | Yes | Yes | Medium |
| Mamba | Yes | Yes | Medium |
| CodeLlama | Yes | Yes | Medium |
| MoE (generic) | Yes | Yes | Medium |
| Multi-modal | Yes | Yes | Low--Medium |
| Command R | No | Yes | Medium |
| InternLM | No | Yes | Medium |
| Persimmon | No | Yes | Low |
| Refact | No | Yes | Low |
| StableLM | No | Yes | Low |
| Orion | No | Yes | Low |
| RWKV | No | Yes | Low |
| ... (70+ more) | No | Yes | Low--Negligible |
The 80/20 Rule
By supporting the 18 most-used architectures, ZigLlama covers the vast majority of models that practitioners actually download and run. The remaining 76+ architectures in llama.cpp are niche, experimental, or deprecated.
5. Performance Comparison (CPU-Only)
All benchmarks below are CPU-only (no GPU offloading) on the same hardware to ensure a fair comparison. ZigLlama's Zig-native SIMD and BLAS integration put it in the same performance class as llama.cpp for CPU inference.
Benchmark Caveats
These are indicative figures, not rigorous benchmarks. Performance varies significantly with hardware, compiler version, model size, context length, batch size, and quantisation format. Always benchmark on your own hardware.
Tokens per second (single-threaded, Q4_0, 2048 context)
| Model | ZigLlama (est.) | llama.cpp | Ratio |
|---|---|---|---|
| LLaMA-7B | ~12 tok/s | ~14 tok/s | 0.86x |
| LLaMA-13B | ~6 tok/s | ~7 tok/s | 0.86x |
Tokens per second (multi-threaded, 8 threads, Q4_0)
| Model | ZigLlama (est.) | llama.cpp | Ratio |
|---|---|---|---|
| LLaMA-7B | ~38 tok/s | ~45 tok/s | 0.84x |
| LLaMA-13B | ~20 tok/s | ~24 tok/s | 0.83x |
Memory usage (peak RSS, Q4_0)
| Model | ZigLlama | llama.cpp | Ratio |
|---|---|---|---|
| LLaMA-7B | ~3.9 GB | ~3.8 GB | 1.03x |
| LLaMA-13B | ~7.5 GB | ~7.3 GB | 1.03x |
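As a sanity check on these figures, the weight storage of a Q4_0 model can be estimated from first principles. The sketch below assumes every tensor is stored as Q4_0 (32 four-bit weights plus one fp16 scale, i.e. 18 bytes per 32 weights, or 4.5 bits per weight) and uses approximate parameter counts of 6.74 billion and 13.0 billion; both are simplifying assumptions of this example, not measurements from either project.

```python
def q4_0_weight_gb(n_params, bits_per_weight=4.5):
    """Approximate weight storage in GB (10^9 bytes) for a uniformly Q4_0 model."""
    return n_params * bits_per_weight / 8 / 1e9


# Approximate parameter counts (assumed for illustration).
models = {"LLaMA-7B": 6.74e9, "LLaMA-13B": 13.0e9}

for name, n_params in models.items():
    print(f"{name}: ~{q4_0_weight_gb(n_params):.1f} GB of weights")
```

This prints roughly 3.8 GB and 7.3 GB. The estimate covers weights only: the KV cache, scratch buffers, and any tensors kept at higher precision are excluded, so measured peak RSS can differ from it.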
Analysis
- Throughput gap: ZigLlama is approximately 14--17% slower than llama.cpp on raw token throughput. The gap is attributable to llama.cpp's hand-tuned SIMD kernels (mostly compiler intrinsics, with some inline assembly) versus ZigLlama's compiler-autovectorised Zig code.
- Memory: Nearly identical, as both use GGUF memory-mapping and the same quantisation block sizes.
- Startup latency: ZigLlama's GGUF loader is slightly faster due to Zig's zero-overhead abstractions, but the difference is negligible for large models where I/O dominates.
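The "slower" and "faster" percentages are easy to confuse because they use different baselines. The short calculation below takes the throughput pairs from the two tables above and computes both views; the helper names are introduced here for illustration only.

```python
# (ZigLlama tok/s, llama.cpp tok/s), copied from the two throughput tables.
rows = {
    "7B, 1 thread": (12, 14),
    "13B, 1 thread": (6, 7),
    "7B, 8 threads": (38, 45),
    "13B, 8 threads": (20, 24),
}


def slowdown_pct(zig, ref):
    """How much slower ZigLlama is, relative to llama.cpp's throughput."""
    return (1 - zig / ref) * 100


def speedup_pct(zig, ref):
    """How much faster llama.cpp is, relative to ZigLlama's throughput."""
    return (ref / zig - 1) * 100


for name, (zig, ref) in rows.items():
    print(f"{name}: ZigLlama {slowdown_pct(zig, ref):.1f}% slower, "
          f"llama.cpp {speedup_pct(zig, ref):.1f}% faster")
```

On these indicative numbers ZigLlama is roughly 14 to 17 percent slower, which is the same gap as llama.cpp being roughly 17 to 20 percent faster. The two statements are equivalent, but the percentages are not interchangeable.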
Closing the Gap
The throughput gap can be narrowed by:
- Writing architecture-specific SIMD intrinsics for the hot matmul path.
- Implementing paged KV cache to reduce memory pressure.
- Adding multi-token prediction (speculative decoding).
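Of the three items, speculative decoding is the least obvious, so here is a toy sketch of its greedy form. A cheap draft model proposes `k` tokens, and the target model keeps the longest prefix it agrees with, then contributes one token of its own. The "models" are hypothetical callables that map a token list to the next token; a real engine would verify all `k` positions in a single batched forward pass rather than in a Python loop.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """One round of greedy speculative decoding; returns the tokens emitted this round."""
    # 1. The draft model proposes k tokens autoregressively.
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # 2. The target model checks each proposed position in order.
    emitted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            emitted.append(expected)  # first mismatch: take the target's token and stop
            return emitted
        emitted.append(tok)
        ctx.append(tok)

    # 3. Every proposal matched, so the target also supplies one extra token.
    emitted.append(target_next(ctx))
    return emitted


def target(ctx):
    """Toy target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft(ctx):
    """Toy draft model: agrees with the target except after token 3."""
    return 0 if ctx[-1] == 3 else target(ctx)


print(speculative_step(target, draft, [0]))   # three drafts accepted, then a correction
print(speculative_step(target, target, [0]))  # perfect draft: k tokens plus one extra
```

The emitted tokens always match what plain greedy decoding with the target model would produce. The gain comes from producing several of them per target pass whenever the draft model guesses well.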
6. What ZigLlama Has That llama.cpp Doesn't
While llama.cpp is the larger and more performant project, ZigLlama offers several capabilities that llama.cpp does not:
6.1 Educational inline documentation
Every public function in ZigLlama includes:
- A mathematical definition in doc-comment format.
- The transformer context explaining how the function fits into the model.
- A worked example with expected input and output.
llama.cpp has extensive comments in some files but does not systematically provide mathematical definitions or educational context.
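As an illustration of that three-part structure, here is what such a comment could look like for a softmax function. This is a hypothetical example written in Python for brevity; ZigLlama's actual doc-comments are in Zig and may be laid out differently.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution.

    Mathematical definition:
        softmax(x)_i = exp(x_i - m) / sum_j exp(x_j - m),  where m = max_j x_j
        (subtracting m leaves the result unchanged and avoids overflow).

    Transformer context:
        Applied to attention scores so that each query's weights over the keys
        sum to 1, and to the final logits to obtain next-token probabilities.

    Worked example:
        softmax([0.0, 0.0])         -> [0.5, 0.5]
        softmax([0.0, math.log(3)]) -> approximately [0.25, 0.75]
    """
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The worked example doubles as a test case: a reader can run it, compare the result with the comment, and see immediately whether the implementation and the mathematics agree.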
6.2 Progressive 6-layer architecture
ZigLlama's architecture is explicitly designed as a learning path. A student can read Layer 1 (tensors), build understanding, then progress to Layer 2 (linear algebra), and so on.
llama.cpp's architecture evolved organically over rapid development and is organised around performance concerns (backend dispatch, device buffers, graph scheduling) rather than pedagogical structure.
6.3 Comprehensive test suite with educational intent
ZigLlama's 285+ tests include:
- Reference tests that validate numerical outputs against known-good values from the literature.
- Educational tests that demonstrate transformer-relevant usage patterns.
- Scaling tests that show how performance changes with problem size.
llama.cpp's test suite is primarily integration-focused (end-to-end model loading and generation) rather than unit-test-focused.
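A sketch of what a reference test and an educational test might look like, using RMSNorm without its learned gain as the function under test. The function and test names are hypothetical and the code is Python rather than Zig; it only illustrates the two test styles, not ZigLlama's actual suite.

```python
import math


def rms_norm(x, eps=0.0):
    """y_i = x_i / sqrt(mean(x^2) + eps), without the learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


def test_reference_values():
    # Reference test: values derived by hand.
    # mean([9, 16]) = 12.5, so y = [3, 4] / sqrt(12.5) = [0.6, 0.8] * sqrt(2).
    y = rms_norm([3.0, 4.0])
    assert math.isclose(y[0], 0.6 * math.sqrt(2), rel_tol=1e-9)
    assert math.isclose(y[1], 0.8 * math.sqrt(2), rel_tol=1e-9)


def test_unit_rms_property():
    # Educational test: with eps = 0 the output has RMS 1 (up to rounding),
    # which is the property that gives the layer its name.
    y = rms_norm([0.5, -2.0, 7.0, 1.5])
    assert math.isclose(sum(v * v for v in y) / len(y), 1.0, rel_tol=1e-9)


test_reference_values()
test_unit_rms_property()
```

The first test pins the implementation to independently derived numbers. The second documents an invariant a student should expect to hold for any input.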
6.4 MkDocs documentation site
ZigLlama ships a complete documentation site with:
- Architectural diagrams (Mermaid).
- Mathematical foundations (LaTeX).
- Cross-cutting comparisons and design rationale.
- Step-by-step learning paths.
llama.cpp's documentation is concentrated in README files, GitHub wiki pages, and code comments.
6.5 Contrastive Search sampling
ZigLlama implements Contrastive Search (Su et al., 2022), a decoding strategy that produces more coherent and less repetitive text by penalising candidate tokens whose hidden representation is too similar to those of previous tokens. As of this writing, llama.cpp does not include Contrastive Search.
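The selection rule can be sketched as follows. For each of the top-k candidates v, the score is (1 - alpha) * p(v) - alpha * max_j cos(h_v, h_j), where h_j are the hidden states of the tokens generated so far; the candidate with the highest score wins. The code assumes the probabilities and hidden states are already available as plain lists. The function names and the `alpha=0.6` default are choices made for this example and are not taken from ZigLlama.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_pick(candidates, prev_hidden, alpha=0.6):
    """Pick the next token from top-k candidates given as (token, prob, hidden) tuples."""
    def score(candidate):
        _, prob, hidden = candidate
        # Degeneration penalty: similarity to the most similar previous token.
        penalty = max(cosine(hidden, h) for h in prev_hidden)
        return (1 - alpha) * prob - alpha * penalty
    return max(candidates, key=score)[0]


prev_hidden = [[1.0, 0.0]]
candidates = [
    ("the", 0.6, [1.0, 0.0]),  # most likely, but identical to the previous state
    ("cat", 0.4, [0.0, 1.0]),  # less likely, but orthogonal to it
]
print(contrastive_pick(candidates, prev_hidden))             # cat
print(contrastive_pick(candidates, prev_hidden, alpha=0.0))  # the
```

With `alpha=0.0` the rule reduces to greedy decoding over the candidate set. Larger values trade model confidence for novelty. In a real engine each candidate's hidden state requires an extra forward step, which is the main cost of the method.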
7. ZigLlama's Unique Value Proposition
ZigLlama is not trying to replace llama.cpp. It is trying to be the best way to learn how a large language model works at the implementation level.
quadrantChart
    title Feature Coverage vs Educational Value
    x-axis Low Feature Coverage --> High Feature Coverage
    y-axis Low Educational Value --> High Educational Value
    quadrant-1 "Best of both worlds"
    quadrant-2 "Education leaders"
    quadrant-3 "Minimal"
    quadrant-4 "Production leaders"
    ZigLlama: [0.65, 0.95]
    llama.cpp: [0.95, 0.25]
    Hugging Face Transformers: [0.80, 0.50]
    vLLM: [0.70, 0.15]
    llm.c: [0.20, 0.80]

Target audiences
| Audience | Why ZigLlama? |
|---|---|
| Graduate students | Learn transformer internals from a single, self-contained codebase with mathematical rigour. |
| Systems programmers | See how SIMD, memory mapping, threading, and quantisation work in a real inference engine. |
| Zig enthusiasts | A large-scale, well-documented Zig project that demonstrates idiomatic patterns. |
| Educators | A progressive curriculum from tensors to text generation, ready to use in courses. |
| Researchers | A readable reference implementation for verifying paper results. |
8. Roadmap Toward Full Parity
The following items represent the path from ZigLlama's current ~90% production parity to full feature parity with llama.cpp.
Phase 1 -- Quantisation completeness (short-term)
| Item | Status | Priority |
|---|---|---|
| BF16 support | Planned | High |
| Q2_K / Q3_K full implementation | Planned | High |
| SIMD-optimised quant layouts (4x4, 4x8, 8x8) | Planned | Medium |
| Ternary quantisation (TQ1_0, TQ2_0) | Planned | Low |
Phase 2 -- Performance (medium-term)
| Item | Status | Priority |
|---|---|---|
| Hand-written SIMD matmul kernels | Planned | High |
| Paged KV cache | Planned | High |
| Speculative decoding (multi-token prediction) | Planned | Medium |
| Flash Attention kernel | Planned | Medium |
| Continuous batching | Planned | Medium |
Phase 3 -- Hardware backends (long-term)
| Item | Status | Priority |
|---|---|---|
| Vulkan compute backend | Planned | High |
| Metal backend (macOS/iOS) | Planned | Medium |
| CUDA backend | Under consideration | Low |
| WebGPU backend (WASM) | Under consideration | Low |
Phase 4 -- Model coverage (ongoing)
| Item | Status | Priority |
|---|---|---|
| Command R | Planned | Medium |
| InternLM | Planned | Medium |
| StableLM | Planned | Low |
| RWKV | Planned | Low |
| Remaining 70+ niche architectures | As demand arises | Low |
Contributing
Each roadmap item is an excellent contribution opportunity. The progressive architecture means you can implement a new quantisation format in Layer 2 without touching any other layer. See the Design Principles page for contribution standards.
Summary
| Dimension | ZigLlama Advantage | llama.cpp Advantage |
|---|---|---|
| Learning experience | Comprehensive, progressive, mathematical | N/A |
| Feature breadth | N/A | More quant formats, GPU backends, architectures |
| CPU performance | Competitive (~85% of llama.cpp) | ~17--20% faster due to hand-tuned kernels |
| Memory efficiency | Equivalent | Equivalent |
| Code readability | Explicit allocators, no hidden control flow | N/A |
| Documentation | MkDocs site, LaTeX, Mermaid, admonitions | README + wiki |
| Test philosophy | Unit + reference + educational + perf | Integration-focused |
ZigLlama and llama.cpp are complementary projects. Use llama.cpp when you need maximum throughput on diverse hardware. Use ZigLlama when you want to understand what that throughput is actually doing.