Design Principles¶
ZigLlama is governed by a small set of non-negotiable principles. Every pull request, every new module, and every documentation paragraph is evaluated against these tenets. This page states them explicitly so that contributors and readers can understand the why behind the code.
1. Educational First, Performance Second¶
The primary purpose of ZigLlama is to teach. A reader who has never seen a transformer implementation should be able to open any source file and, by reading the comments alone, understand what mathematical operation is being performed, why it matters for language modelling, and how it relates to the components above and below it in the stack.
Guiding Question
"If a competent Zig programmer who has never read the Attention Is All You Need paper opens this file, can they learn the concept from the code and its comments alone?"
Concrete practices¶
- Every public function begins with a doc-comment that states the mathematical definition, the transformer context, and a minimal worked example where appropriate.
- Inline comments annotate non-obvious algorithmic steps with references to the corresponding equation numbers in the canonical literature (Vaswani et al., 2017; Touvron et al., 2023).
- Where an optimised path coexists with a naive path, both are present in the source: the naive version for readability, the optimised version for production use.
/// Rectified Linear Unit (ReLU)
///
/// ## Mathematical Definition
/// ```
/// ReLU(x) = max(0, x) = { x if x > 0, 0 otherwise }
/// ```
///
/// ## Transformer Context
/// ReLU was used in the original Transformer feed-forward network.
/// Modern architectures (LLaMA) have replaced it with SwiGLU.
pub fn relu(x: f32) f32 {
return @max(0.0, x);
}
Why Zig?
Zig's comptime generics, explicit allocators, and absence of hidden control flow make it an ideal language for educational systems programming. Every allocation is visible; every error is handled; there is no garbage collector to abstract away memory behaviour.
2. Progressive Component Architecture¶
The codebase is organised into six layers that build from the simplest abstractions (tensors, memory maps) to the most complex (text generation, streaming inference). This mirrors the way a textbook introduces concepts: fundamentals first, composition second.
Layer 6 Inference
Layer 5 Models
Layer 4 Transformers
Layer 3 Neural Primitives
Layer 2 Linear Algebra
Layer 1 Foundation
Dependency Rule
A module in layer \( L_i \) may import any module in layer \( L_j \) where \( j < i \). It must not import modules in the same layer or in layers above it. This produces a strict DAG.
Benefits¶
| Benefit | Explanation |
|---|---|
| Incremental learning | A reader can study Layer 1 in isolation before encountering Layer 2. |
| Testability | Each layer can be unit-tested against its own contracts without mocking upper layers. |
| Compile-time safety | Zig's import system enforces the DAG at compile time -- circular imports are a compile error. |
| Refactorability | Replacing the SIMD backend in Layer 2 cannot break Layer 5. |
3. Test-Driven Development¶
ZigLlama currently contains 285+ tests spanning four categories:
| Category | Purpose | Example |
|---|---|---|
| Unit tests | Verify a single function or type in isolation. | tensor.init returns correct shape. |
| Integration tests | Verify cross-layer interactions. | A TransformerBlock produces the expected output when composed from attention + FFN. |
| Reference tests | Validate numerical results against known-good values from the literature or from a reference implementation (PyTorch / llama.cpp). | GELU approximation matches the Hendrycks & Gimpel (2016) formula to \( < 10^{-5} \). |
| Performance tests | Guard against regressions and document scaling behaviour. | SIMD matmul is at least 3x faster than the naive loop for \( n \geq 256 \). |
Test Lifecycle
- Write the mathematical specification in the doc-comment.
- Derive the expected outputs by hand or from a reference implementation.
- Implement the function.
- Write unit tests that assert the expected outputs.
- Add an integration test that exercises the function in a realistic pipeline.
Test counts by layer¶
| Layer | Tests |
|---|---|
| 1 -- Foundation | 6 |
| 2 -- Linear Algebra | 5 |
| 3 -- Neural Primitives | 9 |
| 4 -- Transformers | 11 |
| 5 -- Models | 45 |
| 6 -- Inference | 47 |
| Cross-layer / Main | 2 |
| Extended suite | 160+ |
| Total | 285+ |
4. Documentation as Code¶
Documentation is not an afterthought -- it is a deliverable on par with the implementation. ZigLlama treats three kinds of documentation as first-class:
4.1 Inline documentation (source files)¶
Every source file opens with a module-level doc-comment that states:
- Educational Objectives -- what the reader will learn.
- Mathematical Foundation -- the relevant equations.
- Transformer Context -- how the component fits into the larger model.
4.2 MkDocs site (this documentation)¶
The MkDocs site provides long-form exposition that cannot fit into source comments: architectural diagrams, cross-cutting comparisons, learning paths, and reference tables. It uses:
- LaTeX for inline (\( x \)) and display ([ \sum_{i=1}^{n} x_i ]) mathematics.
- Mermaid for architectural and data-flow diagrams.
- Material for MkDocs admonitions for definitions, theorems, algorithms, and warnings.
4.3 Historical and theoretical context¶
Where appropriate, documentation references the original papers, explains the historical evolution of a technique (e.g. LayerNorm to RMSNorm), and provides intuition for why the technique works, not just how to implement it.
Documentation Triad
Every significant component should be documented along three axes:
- Mathematical: the formal definition with notation.
- Historical: who introduced it, when, and why.
- Practical: a worked example or code snippet.
5. Feature Parity and Beyond¶
ZigLlama aims for meaningful feature parity with llama.cpp, the industry-standard C++ inference engine. "Meaningful" is defined as covering the features that matter for understanding and using large language models, rather than chasing every niche backend or quantisation variant.
Current parity targets¶
| Area | Status |
|---|---|
| GGUF model loading | Full v3 spec compliance |
| Quantisation formats | 18+ formats (Q4_0 through IQ4_NL) |
| Model architectures | 18 (covering ~80% of real-world usage) |
| Sampling strategies | Greedy, Top-K, Top-P, Temperature, Combined, Mirostat, Typical, Tail-Free, Contrastive |
| KV cache | Multi-sequence, sliding window |
| Batch inference | Dynamic batching with request queuing |
| Streaming generation | Thread-safe token streaming |
| Grammar constraints | JSON, Regex, CFG, XML, EBNF |
| BLAS integration | OpenBLAS, Intel MKL, Apple Accelerate, generic fallback |
| SIMD | AVX, AVX2, NEON auto-detection |
What llama.cpp has that ZigLlama does not (yet)
GPU offloading (CUDA, Metal, Vulkan), 30+ quantisation formats, 94+ model architectures. See the full comparison.
6. Iterative Refinement¶
Development follows a three-pass strategy for every component:
Pass 1 -- Naive implementation¶
Write the simplest correct version. Prioritise readability. Use \( O(n^3) \) matrix multiplication with three nested loops. Document the algorithm thoroughly.
Pass 2 -- Optimised implementation¶
Add SIMD, cache blocking, and quantisation paths. Keep the naive version in the code (guarded behind a comptime flag or as a separate function) so that readers can compare.
Pass 3 -- Documentation and testing¶
Write the MkDocs page, add performance benchmarks, and record the speed-up ratios. This pass ensures that the educational value of the optimisation is captured, not just its runtime benefit.
Documenting Trade-offs
Every optimisation must document:
- Before: the naive complexity and measured time.
- After: the optimised complexity and measured time.
- Trade-off: what readability or generality was sacrificed.
- When to use: under what workload the optimisation matters.
7. Implementation Standards¶
The following standards apply to all contributions.
7.1 Code quality¶
| Standard | Rationale |
|---|---|
| Clarity over cleverness | A ten-line readable loop is preferred over a two-line bit-manipulation trick, unless the trick is itself the subject of the lesson. |
| Explicit allocation | Every std.mem.Allocator parameter is passed explicitly. No global allocators. |
| Error handling | All fallible operations return !T. Error sets are documented. |
| Naming conventions | Zig standard: camelCase functions, PascalCase types, snake_case fields. |
| No hidden control flow | No async, no hidden allocations, no implicit conversions. |
7.2 Documentation standards¶
| Standard | Rationale |
|---|---|
| Mathematical notation | Use LaTeX for any formula with more than one operator. |
| Historical context | Cite the introducing paper at least once per major component. |
| Worked examples | Include at least one input/output pair in each doc-comment. |
| Diagram-first | When explaining data flow, draw a diagram before writing prose. |
7.3 Testing standards¶
| Standard | Rationale |
|---|---|
| Deterministic | Tests must not depend on wall-clock time, random seeds, or network access. |
| Self-contained | Each test file can be run independently with zig test. |
| Named assertions | Use std.testing.expectEqual with descriptive variable names rather than raw boolean assertions. |
| Edge cases | Every numeric function must be tested at 0, negative values, very large values, and NaN/inf where applicable. |
Summary¶
The seven principles form a coherent whole:
graph TD
A[Educational First] --> B[Progressive Architecture]
B --> C[Test-Driven Development]
C --> D[Documentation as Code]
D --> E[Feature Parity]
E --> F[Iterative Refinement]
F --> G[Implementation Standards]
G --> A Each principle reinforces the others. Progressive architecture makes testing easier; testing makes documentation concrete; documentation keeps education front and centre; and the implementation standards ensure that the cycle repeats with every new contribution.