ZigLlama -- Educational LLM Inference in Zig

Version 0.1.0  |  ~30,000 lines of Zig  |  285+ tests  |  18 architectures  |  ~90% llama.cpp parity

What is ZigLlama

ZigLlama is an educational implementation of transformer-based large language models written entirely in Zig. It covers the full inference stack -- from raw tensor storage to autoregressive text generation -- organised into six progressively more complex layers. Every module is documented with the mathematical theory it implements, the design trade-offs it embodies, and the connections to production systems such as llama.cpp.

The project targets two audiences simultaneously. For learners, ZigLlama provides a self-contained curriculum: each layer depends only on the layers below it, so you can study attention mechanisms without first understanding quantisation, or explore KV-caching without reading the tokeniser. For practitioners, the same codebase demonstrates production-grade techniques -- SIMD matrix kernels, K-quantisation, grammar-constrained decoding, streaming generation, and memory-mapped model loading -- all expressed in readable, allocation-explicit Zig.


Why Zig for LLM Inference

Language choice rationale

Zig was selected not because it is the most popular systems language, but because its semantics make the costs of every operation visible to the reader -- exactly the property an educational codebase needs.

| Zig Feature | Benefit for LLM Inference |
|---|---|
| comptime generics | Tensor operations are monomorphised at compile time. There is no runtime dispatch for element types, quantisation formats, or SIMD widths; the optimiser sees concrete types everywhere. |
| No hidden allocations | Every byte of heap memory is allocated through an explicit std.mem.Allocator. Readers can trace every buffer from creation to deinit -- essential when a 7 B-parameter model consumes gigabytes. |
| SIMD intrinsics | Zig exposes @Vector as a first-class type. Auto-vectorised matrix kernels can target AVX2, AVX-512, or NEON without external intrinsics headers. |
| Zero mandatory dependencies | The entire project builds with a single zig build invocation. No CMake, no pkg-config, no system libraries required (BLAS back-ends are optional). |
| Memory safety without GC | Bounds-checked slices, errdefer cleanup, and optional runtime safety checks catch bugs without the latency jitter of a garbage collector -- a hard requirement for real-time streaming inference. |
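Three of the features in the table above (comptime generics, explicit allocation, and @Vector) can be seen together in a small dot-product kernel. This is a minimal sketch, not code taken from ZigLlama: the function name `dot` and its parameters are illustrative assumptions, and the syntax assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Dot product of two equal-length slices, `lanes` elements at a time.
/// `T` and `lanes` are comptime parameters, so every call site gets its
/// own specialised copy with no runtime dispatch.
/// (Hypothetical helper for illustration; not part of ZigLlama's API.)
fn dot(comptime T: type, comptime lanes: usize, a: []const T, b: []const T) T {
    std.debug.assert(a.len == b.len);
    const Vec = @Vector(lanes, T);
    var acc: Vec = @splat(@as(T, 0));
    var i: usize = 0;
    // Main loop: one vector multiply-add per `lanes` elements.
    while (i + lanes <= a.len) : (i += lanes) {
        const va: Vec = a[i..][0..lanes].*;
        const vb: Vec = b[i..][0..lanes].*;
        acc += va * vb;
    }
    // Horizontal sum of the lanes, then a scalar loop for the remainder.
    var sum: T = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) sum += a[i] * b[i];
    return sum;
}

test "dot product over explicitly allocated buffers" {
    const allocator = std.testing.allocator; // reports leaks when the test ends
    const a = try allocator.alloc(f32, 10);
    defer allocator.free(a);
    const b = try allocator.alloc(f32, 10);
    defer allocator.free(b);

    for (a, b, 0..) |*x, *y, k| {
        x.* = @floatFromInt(k); // 0, 1, ..., 9
        y.* = 2.0;
    }
    try std.testing.expectEqual(@as(f32, 90.0), dot(f32, 4, a, b));
}
```

Saved to a file, the block can be run with `zig test`. Every buffer is visible at its `alloc` and `free` sites, and changing `f32` to `f64` or `4` to `8` produces a different specialisation at compile time rather than a runtime branch.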

Architecture at a Glance

The following diagram shows the six layers of ZigLlama as a dependency stack. Each layer builds exclusively on the layers beneath it.

```mermaid
block-beta
  columns 1
  block:L6["Layer 6 -- Inference"]
    G["Text Generation"] S["Sampling Strategies"] KV["KV Cache"] ST["Streaming"] B["Batching"]
  end
  block:L5["Layer 5 -- Models"]
    LL["LLaMA / Mistral / GPT-2 / ..."] TK["Tokeniser"] GU["GGUF Loader"] CF["Config"]
  end
  block:L4["Layer 4 -- Transformers"]
    ATT["Multi-Head Attention"] FF["Feed-Forward Networks"] TB["Transformer Blocks"]
  end
  block:L3["Layer 3 -- Neural Primitives"]
    ACT["Activations (SwiGLU, GELU, ...)"] NRM["Normalization (RMSNorm, LayerNorm)"] EMB["Embeddings & RoPE"]
  end
  block:L2["Layer 2 -- Linear Algebra"]
    MAT["SIMD Matrix Ops"] QNT["Quantisation (K-quant, IQ)"] BLS["BLAS Integration"]
  end
  block:L1["Layer 1 -- Foundation"]
    TEN["Tensors"] MEM["Memory Mapping"] GGF["GGUF Format"] THR["Threading"]
  end

  L6 --> L5
  L5 --> L4
  L4 --> L3
  L3 --> L2
  L2 --> L1
```

Layer numbering

Throughout this documentation, layers are numbered bottom-up: Layer 1 (Foundation) is at the bottom and Layer 6 (Inference) is at the top. This mirrors both the conceptual dependency order and the typical learning path.


Learning Paths

ZigLlama supports multiple learning trajectories depending on your background and available time.

Fast Track (2--4 hours)

For experienced ML engineers who want to understand how inference works at the systems level.

| Step | Topic | Time |
|---|---|---|
| 1 | Foundations: Tensors | 30 min |
| 2 | Transformers: Attention Mechanisms | 45 min |
| 3 | Models: LLaMA | 45 min |
| 4 | Inference: Text Generation | 30 min |
| 5 | Inference: Sampling Strategies | 30 min |

Complete Journey (1--2 weeks)

Work through every layer from the ground up. This path assumes familiarity with basic linear algebra (\( Ax = b \), matrix multiplication) and a working knowledge of at least one systems language.

| Week | Layers | Key concepts |
|---|---|---|
| 1, days 1--2 | Layer 1: Foundation | Tensors, memory layout, GGUF binary format |
| 1, days 3--4 | Layer 2: Linear Algebra | SIMD kernels, quantisation theory, cache blocking |
| 1, day 5 | Layer 3: Neural Primitives | Activation functions, RMSNorm, RoPE |
| 2, days 1--2 | Layer 4: Transformers | Scaled dot-product attention, feed-forward variants |
| 2, days 3--4 | Layer 5: Models | LLaMA architecture, GGUF loading, tokenisation |
| 2, day 5 | Layer 6: Inference | KV caching, sampling, streaming, batching |

API Reference

For developers integrating ZigLlama into their own projects or extending it with new model architectures, the API Reference provides per-module documentation of every public function and type.


Key Statistics

The numbers below are computed from the repository at the v0.1.0 tag.

| Metric | Value |
|---|---|
| Source lines (.zig) | ~30,000 |
| Test cases | 285+ (all passing) |
| Model architectures | 18 families (LLaMA, Mistral, GPT-2, Falcon, Qwen, Phi, GPT-J, GPT-NeoX, BLOOM, Mamba, BERT, Gemma, StarCoder, and more) |
| Quantisation formats | 18+ (Q4_0, Q8_0, K-quant family, IQ family) |
| Sampling strategies | 8 (greedy, top-k, top-p, temperature, Mirostat, typical, tail-free, contrastive) |
| Combined inference speedup | ~400x over naive implementation |
| llama.cpp production parity | ~90% |
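Two of the sampling strategies listed above, greedy and temperature, are small enough to sketch in full. The function names and signatures below are illustrative assumptions rather than ZigLlama's API, and the random number `u` is passed in by the caller so that the sketch does not depend on any particular random-number interface. The syntax assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Greedy decoding: index of the largest logit.
/// (Hypothetical helper for illustration.)
fn argmax(logits: []const f32) usize {
    var best: usize = 0;
    for (logits, 0..) |x, i| {
        if (x > logits[best]) best = i;
    }
    return best;
}

/// Temperature sampling. Writes softmax(logits / temperature) into `probs`,
/// then returns the first index whose cumulative probability exceeds `u`,
/// where `u` is a caller-supplied uniform random number in [0, 1).
/// (Hypothetical helper for illustration.)
fn sampleTemperature(logits: []const f32, temperature: f32, probs: []f32, u: f32) usize {
    std.debug.assert(logits.len > 0 and logits.len == probs.len);
    std.debug.assert(temperature > 0);

    // Subtract the maximum logit before exponentiating for numerical stability.
    const max_logit = logits[argmax(logits)];
    var sum: f32 = 0;
    for (logits, probs) |x, *p| {
        p.* = @exp((x - max_logit) / temperature);
        sum += p.*;
    }
    for (probs) |*p| p.* /= sum;

    // Inverse-CDF sampling over the normalised probabilities.
    var cumulative: f32 = 0;
    for (probs, 0..) |p, i| {
        cumulative += p;
        if (u < cumulative) return i;
    }
    return probs.len - 1; // guard against rounding when u is close to 1
}

test "greedy and temperature sampling" {
    const logits = [_]f32{ 1.0, 3.0, 2.0 };
    var probs: [3]f32 = undefined;

    try std.testing.expectEqual(@as(usize, 1), argmax(&logits));
    // At temperature 1 the softmax is roughly { 0.09, 0.67, 0.24 }.
    try std.testing.expectEqual(@as(usize, 0), sampleTemperature(&logits, 1.0, &probs, 0.05));
    try std.testing.expectEqual(@as(usize, 1), sampleTemperature(&logits, 1.0, &probs, 0.50));
    try std.testing.expectEqual(@as(usize, 2), sampleTemperature(&logits, 1.0, &probs, 0.90));
    // A low temperature sharpens the distribution towards the greedy choice.
    try std.testing.expectEqual(@as(usize, 1), sampleTemperature(&logits, 0.05, &probs, 0.90));
}
```

Top-k and top-p can be layered on the same skeleton by zeroing entries of `probs` before the cumulative pass and renormalising; that step is omitted here to keep the sketch short.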

Performance Highlights

Inference cost model

Without optimisation, generating the \( T \)-th token of a sequence with a model of hidden dimension \( d_\text{model} \) and \( L \) layers costs

\[ \mathcal{O}\!\bigl(T \cdot L \cdot d_\text{model}^{\,2}\bigr) \]

because each step recomputes the key and value projections of all \( T \) tokens seen so far. With KV caching those projections are stored and reused, so the per-token projection cost drops to \( \mathcal{O}(L \cdot d_\text{model}^{\,2}) \), independent of \( T \), yielding a 20x speedup for typical sequence lengths.
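The idea behind the cost model can be sketched with a minimal cache that stores each token's key and value rows exactly once. The `KvCache` type, its methods, and `projectionCount` are hypothetical names chosen for illustration (single layer, single head), not ZigLlama's actual types; the syntax assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Hypothetical single-head KV cache; ZigLlama's real type may differ.
const KvCache = struct {
    allocator: std.mem.Allocator,
    keys: []f32, // capacity * dim, one row per cached token
    values: []f32,
    dim: usize,
    len: usize = 0, // number of tokens cached so far

    fn init(allocator: std.mem.Allocator, capacity: usize, dim: usize) !KvCache {
        const keys = try allocator.alloc(f32, capacity * dim);
        errdefer allocator.free(keys); // undo if the second allocation fails
        const values = try allocator.alloc(f32, capacity * dim);
        return .{ .allocator = allocator, .keys = keys, .values = values, .dim = dim };
    }

    fn deinit(self: *KvCache) void {
        self.allocator.free(self.keys);
        self.allocator.free(self.values);
    }

    /// Store the K and V projections of the newest token only.
    fn append(self: *KvCache, k: []const f32, v: []const f32) !void {
        std.debug.assert(k.len == self.dim and v.len == self.dim);
        const start = self.len * self.dim;
        if (start + self.dim > self.keys.len) return error.CacheFull;
        @memcpy(self.keys[start..][0..self.dim], k);
        @memcpy(self.values[start..][0..self.dim], v);
        self.len += 1;
    }

    /// Cached key row for token `t`.
    fn key(self: *const KvCache, t: usize) []const f32 {
        return self.keys[t * self.dim ..][0..self.dim];
    }
};

/// Number of K/V projections computed while generating `steps` tokens.
fn projectionCount(steps: usize, use_cache: bool) usize {
    var total: usize = 0;
    var t: usize = 1;
    while (t <= steps) : (t += 1) total += if (use_cache) 1 else t;
    return total;
}

test "cache stores each token's projections once" {
    var cache = try KvCache.init(std.testing.allocator, 4, 2);
    defer cache.deinit();

    try cache.append(&[_]f32{ 1, 2 }, &[_]f32{ 3, 4 });
    try cache.append(&[_]f32{ 5, 6 }, &[_]f32{ 7, 8 });
    try std.testing.expectEqual(@as(usize, 2), cache.len);
    try std.testing.expectEqualSlices(f32, &[_]f32{ 5, 6 }, cache.key(1));

    // 1 + 2 + ... + 100 projections without a cache, 100 with one.
    try std.testing.expectEqual(@as(usize, 5050), projectionCount(100, false));
    try std.testing.expectEqual(@as(usize, 100), projectionCount(100, true));
}
```

The count covers only the key and value projections. Attention scores and feed-forward layers still cost time at every step, so the ratio printed by `projectionCount` should not be read as a prediction of the measured speedup.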

| Optimisation | Speedup | Memory Reduction |
|---|---|---|
| KV Caching | 20x | 50% |
| SIMD Vectorisation | 3--5x | -- |
| K-Quantisation (Q4_K) | -- | 87% |
| IQ-Quantisation (IQ1_S) | -- | 95% |
| Memory Mapping | 10x load time | 90% |
| Batch Processing | 5--10x throughput | -- |
| Combined | ~400x | ~95% |
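One way to read the quantisation rows is as the ratio of bits per weight against a 32-bit float baseline. The baseline and the nominal widths used here (4 bits for Q4_K, about 1.56 bits for IQ1_S) are assumptions for this worked example; real block formats also store scale metadata, so the achieved reduction can be slightly lower.

\[ 1 - \frac{4}{32} = 87.5\%, \qquad 1 - \frac{1.56}{32} \approx 95.1\% \]

Similarly, the combined speedup is consistent with the individual factors multiplying independently, for example KV caching at 20x, SIMD at the midpoint of its range (4x), and batching at the lower end of its range (5x). This decomposition is an assumption, not something the table states:

\[ 20 \times 4 \times 5 = 400 \]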

Quick Navigation

  • Getting Started

    Install Zig, clone the repo, run your first test.

  • Architecture

    Design principles, the 6-layer model, and module dependencies.

  • Layer 1 -- Foundations

    Tensors, memory management, GGUF binary format, BLAS integration.

  • Layer 2 -- Linear Algebra

    SIMD matrix operations, quantisation theory and formats.

  • Layer 3 -- Neural Primitives

    Activation functions, normalisation layers, embeddings, RoPE.

  • Layer 4 -- Transformers

    Multi-head attention, feed-forward networks, full transformer blocks.

  • Layer 5 -- Models

    18 model architectures, GGUF loading, tokenisation, chat templates.

  • Layer 6 -- Inference

    Text generation, sampling, KV cache, streaming, batch processing.

  • API Reference

    Per-module documentation for every public type and function.

  • Examples and Tutorials

    Hands-on walkthroughs: first inference, attention visualisation, quantisation.

  • Performance

    Benchmarks, optimisation guide, parity analysis with llama.cpp.

  • References

    Academic papers, glossary, contributing guide, changelog.


Supported Model Architectures

ZigLlama implements 18 of the 94 architecture families tracked by llama.cpp. These 18 families cover approximately 80% of real-world model usage.

| Category | Architectures | Count |
|---|---|---|
| Core language models | LLaMA/LLaMA 2, Mistral, GPT-2, Falcon, Qwen, Phi, GPT-J, GPT-NeoX, BLOOM | 9 |
| Specialised models | Mamba (state-space), BERT (bidirectional), Gemma, StarCoder (code) | 4 |
| Advanced components | Mixture of Experts, Multi-modal (vision-language), BLAS integration | 3 |
| Tooling | Model converter, Perplexity evaluation | 2 |

How to Cite

If you use ZigLlama in academic work, please cite:

```bibtex
@software{zigllama2024,
  title   = {ZigLlama: Educational LLM Inference in Zig},
  author  = {Dipankar Sarkar and Contributors},
  year    = {2024},
  url     = {https://github.com/dipankar/zigllama},
  version = {0.1.0}
}
```

License

ZigLlama is released under the MIT License. See LICENSE for the full text.