Project Structure¶
This page provides a complete map of the ZigLlama repository. Every directory and significant file is annotated with its purpose and the architectural layer it belongs to. Refer to this page whenever you need to locate a module or understand where a new file should live.
Full Repository Tree¶
zigllama/
├── build.zig # Zig build system entry point
├── README.md # Project overview and quick-start
├── PROGRESS.md # Development progress report
│
├── src/ # All production source code
│ ├── main.zig # Library root: re-exports all layers
│ │
│ ├── foundation/ # Layer 1 -- Foundation
│ │ ├── tensor.zig # Multi-dimensional tensor type
│ │ ├── memory_mapping.zig # mmap/mlock for model files
│ │ ├── gguf_format.zig # GGUF binary format parser
│ │ ├── blas_integration.zig # OpenBLAS / MKL / Accelerate bridge
│ │ └── threading.zig # Thread pool and NUMA-aware scheduling
│ │
│ ├── linear_algebra/ # Layer 2 -- Linear Algebra
│ │ ├── matrix_ops.zig # SIMD matrix multiplication & dot product
│ │ ├── quantization.zig # Basic quantisation (Q4_0, Q8_0, INT8)
│ │ ├── k_quantization.zig # K-quantisation family (Q2_K .. Q6_K)
│ │ └── iq_quantization.zig # Importance quantisation (IQ1_S .. IQ4_XS)
│ │
│ ├── neural_primitives/ # Layer 3 -- Neural Primitives
│ │ ├── activations.zig # ReLU, GELU, SiLU, SwiGLU, GeGLU, etc.
│ │ ├── normalization.zig # LayerNorm, RMSNorm, BatchNorm, GroupNorm
│ │ └── embeddings.zig # Token embeddings, positional encodings, RoPE
│ │
│ ├── transformers/ # Layer 4 -- Transformers
│ │ ├── attention.zig # Multi-head & grouped-query attention
│ │ ├── feed_forward.zig # Standard, GELU, and gated FFN variants
│ │ └── transformer_block.zig # Full pre-norm / post-norm blocks
│ │
│ ├── models/ # Layer 5 -- Models
│ │ ├── config.zig # ModelConfig, ModelSize, hyperparameters
│ │ ├── tokenizer.zig # SimpleTokenizer, BPE support
│ │ ├── gguf.zig # GGUF model file loader
│ │ ├── chat_templates.zig # ChatML, Alpaca, Vicuna, etc.
│ │ ├── llama.zig # LLaMA / LLaMA 2 architecture
│ │ ├── mistral.zig # Mistral (sliding-window attention)
│ │ ├── gpt2.zig # GPT-2
│ │ ├── falcon.zig # Falcon (multi-query attention)
│ │ ├── qwen.zig # Qwen (YARN RoPE scaling)
│ │ ├── phi.zig # Phi (partial RoPE)
│ │ ├── gptj.zig # GPT-J (parallel residuals)
│ │ ├── gpt_neox.zig # GPT-NeoX (fused QKV)
│ │ ├── bloom.zig # BLOOM (ALiBi attention)
│ │ ├── mamba.zig # Mamba (state-space model)
│ │ ├── bert.zig # BERT (bidirectional encoder)
│ │ ├── gemma.zig # Gemma (soft capping)
│ │ ├── starcoder.zig # StarCoder (code generation)
│ │ ├── mixture_of_experts.zig # MoE routing and load balancing
│ │ └── multi_modal.zig # Vision-language cross-modal projection
│ │
│ ├── inference/ # Layer 6 -- Inference
│ │ ├── generation.zig # Autoregressive text generation loop
│ │ ├── advanced_sampling.zig # Mirostat, typical, tail-free, contrastive
│ │ ├── kv_cache.zig # KV cache allocation and management
│ │ ├── streaming.zig # Real-time token streaming
│ │ ├── batching.zig # Batch inference scheduler
│ │ ├── grammar_constraints.zig # JSON, RegEx, CFG constraint engines
│ │ └── profiling.zig # Inference timing and memory profiling
│ │
│ ├── server/ # Tools -- HTTP server & CLI
│ │ ├── http_server.zig # REST API compatible with llama.cpp
│ │ └── cli.zig # Command-line interface
│ │
│ ├── tools/ # Tools -- Converters & evaluation
│ │ ├── model_converter.zig # Convert between model formats
│ │ └── converter_cli.zig # CLI wrapper for the converter
│ │
│ └── evaluation/ # Tools -- Model quality
│ └── perplexity.zig # Perplexity evaluation pipeline
│
├── tests/ # Test suites (mirror src/ structure)
│ ├── unit/
│ │ ├── test_tensor.zig # Foundation layer unit tests
│ │ └── test_linear_algebra.zig # Linear algebra unit tests
│ ├── test_neural_primitives.zig # Neural primitives tests
│ ├── test_basic_linear_algebra.zig # Additional LA tests
│ ├── test_transformer_components.zig# Transformer layer tests
│ ├── test_models.zig # Model architecture tests
│ ├── test_inference.zig # Inference pipeline tests
│ ├── models_test.zig # Extended model tests
│ ├── threading_test.zig # Threading and concurrency tests
│ ├── gguf_test.zig # GGUF format tests
│ ├── chat_templates_test.zig # Chat template rendering tests
│ ├── multi_modal_test.zig # Multi-modal integration tests
│ ├── multi_modal_basic_test.zig # Multi-modal basic tests
│ ├── multi_modal_simple_test.zig # Multi-modal simple tests
│ ├── multi_modal_concepts_test.zig # Multi-modal concept tests
│ ├── advanced_quantization_test.zig # K-quant and IQ-quant tests
│ ├── perplexity_test.zig # Perplexity evaluation tests
│ ├── server_test.zig # HTTP server tests
│ ├── model_converter_test.zig # Converter tests
│ └── production_parity_test.zig # End-to-end parity with llama.cpp
│
├── examples/ # Standalone runnable demos
│ ├── simple_demo.zig # End-to-end tour
│ ├── main.zig # Library entry-point demo
│ ├── educational_demo.zig # Detailed educational walkthrough
│ ├── benchmark_demo.zig # Performance benchmarks
│ ├── parity_demo.zig # llama.cpp feature comparison
│ ├── gguf_demo.zig # GGUF loading demo
│ ├── model_architectures_demo.zig # 18-architecture tour
│ ├── chat_templates_demo.zig # Chat template demo
│ ├── multi_modal_demo.zig # Vision-language demo
│ ├── multi_modal_concepts_demo.zig # Multi-modal concepts
│ ├── threading_demo.zig # Multi-threaded inference demo
│ └── perplexity_demo.zig # Model evaluation demo
│
├── benchmarks/ # Performance measurement tools
│ └── main.zig # Benchmark entry point
│
├── documentation/ # MkDocs documentation site
│ ├── mkdocs.yml # Site configuration and nav
│ ├── requirements.txt # Python dependencies for MkDocs
│ └── docs/ # Markdown content (this site)
│ ├── index.md
│ ├── getting-started/
│ ├── architecture/
│ ├── foundations/
│ ├── linear-algebra/
│ ├── neural-primitives/
│ ├── transformers/
│ ├── models/
│ ├── inference/
│ ├── tools/
│ ├── api/
│ ├── examples/
│ ├── performance/
│ ├── references/
│ ├── javascripts/
│ └── stylesheets/
│
├── docs/ # Legacy / quick-reference docs
│ └── ...
│
└── llama.cpp/ # Reference: llama.cpp source (read-only)
Source Code Organisation by Layer¶
The src/ directory is the heart of ZigLlama. Each subdirectory maps to exactly one architectural layer, and imports flow strictly upward (lower layers never import higher ones).
flowchart BT
F["foundation/"] --> LA["linear_algebra/"]
LA --> NP["neural_primitives/"]
NP --> TR["transformers/"]
TR --> MO["models/"]
MO --> IN["inference/"]
SV["server/"] --> IN
TO["tools/"] --> MO
EV["evaluation/"] --> IN
style F fill:#e8daef,stroke:#7d3c98
style LA fill:#d5f5e3,stroke:#27ae60
style NP fill:#fdebd0,stroke:#e67e22
style TR fill:#d6eaf8,stroke:#2e86c1
style MO fill:#fadbd8,stroke:#e74c3c
style IN fill:#fef9e7,stroke:#f1c40f Layer 1 -- Foundation (src/foundation/)¶
Dependency rule
Foundation modules import only from std. They have zero internal dependencies.
| Module | Responsibility | Key types |
|---|---|---|
tensor.zig | Multi-dimensional array with shape, strides, and element access | Tensor(T) |
memory_mapping.zig | Memory-mapped file I/O (mmap on POSIX, MapViewOfFile on Windows) | MappedFile |
gguf_format.zig | Parser for the GGUF binary container format | GGUFHeader, GGUFTensorInfo |
blas_integration.zig | Thin wrapper over external BLAS libraries | BlasBackend |
threading.zig | Thread pool with work-stealing and optional NUMA affinity | ThreadPool, Task |
Layer 2 -- Linear Algebra (src/linear_algebra/)¶
| Module | Responsibility | Key types |
|---|---|---|
matrix_ops.zig | SIMD-accelerated matrix multiplication, dot product, element-wise ops | SimdMatMul |
quantization.zig | Basic quantisation formats: Q4_0, Q8_0, INT8 | QuantizedTensor, QuantFormat |
k_quantization.zig | K-quantisation family (Q2_K through Q6_K) with block-wise scaling | KQuantBlock |
iq_quantization.zig | Importance quantisation (IQ1_S through IQ4_XS) with non-uniform codebooks | IQCodebook |
Layer 3 -- Neural Primitives (src/neural_primitives/)¶
| Module | Responsibility | Key types |
|---|---|---|
activations.zig | Activation functions: ReLU, GELU, SiLU, SwiGLU, GeGLU, GLU, Tanh, Sigmoid | ActivationFn |
normalization.zig | Normalisation layers: LayerNorm, RMSNorm, BatchNorm, GroupNorm | RMSNorm, LayerNorm |
embeddings.zig | Token embeddings, sinusoidal positional encodings, RoPE, segment embeddings | EmbeddingTable, RoPEEncoder |
Layer 4 -- Transformers (src/transformers/)¶
| Module | Responsibility | Key types |
|---|---|---|
attention.zig | Scaled dot-product attention, multi-head attention, grouped-query attention, sliding window | MultiHeadAttention |
feed_forward.zig | Position-wise FFN variants: standard, GELU-gated, SwiGLU-gated | FeedForward |
transformer_block.zig | Complete encoder/decoder blocks with residual connections and normalisation | TransformerBlock |
Layer 5 -- Models (src/models/)¶
This is the largest layer, containing 18 model architectures plus shared infrastructure.
| Module | Responsibility |
|---|---|
config.zig | Hyperparameter definitions for all model variants |
tokenizer.zig | Tokenisation (simple, BPE, SentencePiece-compatible) |
gguf.zig | High-level GGUF model loader (wraps foundation/gguf_format.zig) |
chat_templates.zig | Prompt formatting for ChatML, Alpaca, Vicuna, Zephyr, etc. |
llama.zig | LLaMA / LLaMA 2 forward pass |
mistral.zig | Mistral with sliding-window attention and GQA |
gpt2.zig | GPT-2 with learned positional embeddings |
falcon.zig | Falcon with multi-query attention and parallel blocks |
qwen.zig | Qwen with YARN RoPE scaling |
phi.zig | Phi with partial RoPE and QK-LayerNorm |
gptj.zig | GPT-J with parallel residual connections |
gpt_neox.zig | GPT-NeoX with fused QKV projections |
bloom.zig | BLOOM with ALiBi positional bias |
mamba.zig | Mamba state-space model with selective scan |
bert.zig | BERT bidirectional encoder |
gemma.zig | Gemma with soft-capping and GQA |
starcoder.zig | StarCoder for code generation with fill-in-the-middle |
mixture_of_experts.zig | MoE routing, expert selection, load balancing |
multi_modal.zig | Vision transformer encoder and cross-modal projector |
Layer 6 -- Inference (src/inference/)¶
| Module | Responsibility | Key types |
|---|---|---|
generation.zig | Autoregressive generation loop and sampling dispatch | TextGenerator, GenerationResult |
advanced_sampling.zig | Mirostat, typical sampling, tail-free sampling, contrastive search | SamplingStrategy |
kv_cache.zig | Key-value cache allocation, eviction, and ring-buffer management | KVCache |
streaming.zig | Token-by-token streaming with thread-safe buffer | StreamingSession |
batching.zig | Batch scheduler: continuous batching, padding, dynamic grouping | BatchScheduler |
grammar_constraints.zig | Constrained decoding: JSON schema, regex, context-free grammar | GrammarEngine |
profiling.zig | Per-layer timing, memory watermarks, throughput metrics | InferenceProfiler |
Test Organisation¶
Tests in ZigLlama live in two places:
-
Inline tests --
testblocks inside source files insrc/. These are the primary unit tests and are collected automatically byzig build test. -
Standalone test files -- files in
tests/that exercise cross-module interactions, integration scenarios, and production parity checks.
flowchart LR
subgraph "Inline tests (src/)"
T1["tensor.zig tests"]
T2["matrix_ops.zig tests"]
T3["attention.zig tests"]
T4["generation.zig tests"]
end
subgraph "Standalone tests (tests/)"
S1["test_tensor.zig"]
S2["test_linear_algebra.zig"]
S3["production_parity_test.zig"]
S4["multi_modal_test.zig"]
end
B["zig build test"] --> T1 & T2 & T3 & T4
B --> S1 & S2 & S3 & S4 Naming conventions
- Files prefixed with
test_(e.g.,test_tensor.zig) contain unit tests for a specific module. - Files suffixed with
_test(e.g.,gguf_test.zig) contain integration or feature-level tests. production_parity_test.zigvalidates end-to-end behaviour against llama.cpp reference outputs.
Test Distribution¶
| Layer / Category | Test files | Approximate test count |
|---|---|---|
| Foundation | unit/test_tensor.zig, inline in foundation/*.zig | 8 |
| Linear Algebra | unit/test_linear_algebra.zig, test_basic_linear_algebra.zig, advanced_quantization_test.zig | 25 |
| Neural Primitives | test_neural_primitives.zig | 12 |
| Transformers | test_transformer_components.zig | 15 |
| Models | test_models.zig, models_test.zig, gguf_test.zig, chat_templates_test.zig, multi_modal_*.zig | 120 |
| Inference | test_inference.zig, server_test.zig, perplexity_test.zig, threading_test.zig | 80 |
| Parity | production_parity_test.zig | 25 |
| Total | 20 test files | 285+ |
Examples Organisation¶
All examples reside in examples/ as self-contained .zig files. Each can be run with zig run examples/<name>.zig and requires no arguments.
| Category | Examples | Purpose |
|---|---|---|
| Overview | simple_demo.zig, main.zig, educational_demo.zig | Broad tours of the architecture |
| Performance | benchmark_demo.zig | Throughput and latency measurement |
| Compatibility | parity_demo.zig, gguf_demo.zig | llama.cpp feature comparison and format support |
| Model breadth | model_architectures_demo.zig, chat_templates_demo.zig | All 18 architectures and prompt formats |
| Advanced | multi_modal_demo.zig, multi_modal_concepts_demo.zig | Vision-language pipelines |
| Systems | threading_demo.zig | Concurrent inference and NUMA |
| Evaluation | perplexity_demo.zig | Model quality scoring |
Documentation Organisation¶
The documentation lives under documentation/ and is built with Material for MkDocs.
documentation/
├── mkdocs.yml # Navigation, theme, extensions
├── requirements.txt # Python: mkdocs-material, plugins
└── docs/
├── index.md # Home page (this documentation site)
├── getting-started/ # Installation, quick start, building, project structure
├── architecture/ # Design principles, 6-layer overview, llama.cpp comparison
├── foundations/ # Layer 1 deep dives
├── linear-algebra/ # Layer 2 deep dives
├── neural-primitives/ # Layer 3 deep dives
├── transformers/ # Layer 4 deep dives
├── models/ # Layer 5 deep dives (one page per architecture)
├── inference/ # Layer 6 deep dives
├── tools/ # Server, CLI, converter, perplexity
├── api/ # Per-module API reference
├── examples/ # Tutorials and demo walkthroughs
├── performance/ # Benchmarks, optimisation, parity analysis
├── references/ # Papers, glossary, contributing, changelog
├── javascripts/ # MathJax configuration
└── stylesheets/ # Custom CSS overrides
Building the Documentation Locally¶
cd documentation
pip install -r requirements.txt
mkdocs serve
# Open http://127.0.0.1:8000 in your browser
The main.zig Module Map¶
The library entry point src/main.zig re-exports every layer as a nested namespace. This means downstream code can import the entire library with a single @import:
const zigllama = @import("src/main.zig");
// Access any layer:
const Tensor = zigllama.foundation.tensor.Tensor;
const MatMul = zigllama.linear_algebra.matrix_ops;
const Attention = zigllama.transformers.attention;
const LLaMA = zigllama.models.llama;
const Generator = zigllama.inference.generation;
flowchart TD
MAIN["src/main.zig"] --> F["foundation"]
MAIN --> LA["linear_algebra"]
MAIN --> NP["neural_primitives"]
MAIN --> TR["transformers"]
MAIN --> MO["models"]
MAIN --> IN["inference"]
F --> F1["tensor"] & F2["memory_mapping"] & F3["gguf_format"] & F4["blas_integration"] & F5["threading"]
LA --> LA1["matrix_ops"] & LA2["quantization"] & LA3["k_quantization"] & LA4["iq_quantization"]
NP --> NP1["activations"] & NP2["normalization"] & NP3["embeddings"]
TR --> TR1["attention"] & TR2["feed_forward"] & TR3["transformer_block"]
MO --> MO1["llama"] & MO2["config"] & MO3["tokenizer"] & MO4["gguf"] & MO5["... 14 more"]
IN --> IN1["generation"] & IN2["kv_cache"] & IN3["streaming"] & IN4["batching"] & IN5["profiling"] Not all modules are re-exported
src/main.zig re-exports the core library modules. Utility modules (server/, tools/, evaluation/) are imported directly by their respective entry points rather than through the library namespace.
Conventions¶
| Convention | Rule |
|---|---|
| File naming | snake_case.zig everywhere |
| Module exports | One primary type per file; file name matches the type name |
| Test naming | test "descriptive name" blocks; names read as sentences |
| Error sets | Named error sets per module (TensorError, GGUFError, etc.) |
| Documentation | /// doc comments on every public declaration |
| Memory | Caller provides allocator; callee never stores a global allocator |
Next Steps¶
- Architecture Overview -- understand the design principles behind this structure.
- Layer 1: Foundations -- begin the technical deep dive with tensors and memory management.
- API Reference -- per-function documentation for every public symbol.