Project Structure

This page provides a complete map of the ZigLlama repository. Every directory and significant file is annotated with its purpose and the architectural layer it belongs to. Refer to this page whenever you need to locate a module or understand where a new file should live.

Full Repository Tree

zigllama/
├── build.zig                          # Zig build system entry point
├── README.md                          # Project overview and quick-start
├── PROGRESS.md                        # Development progress report
│
├── src/                               # All production source code
│   ├── main.zig                       # Library root: re-exports all layers
│   │
│   ├── foundation/                    # Layer 1 -- Foundation
│   │   ├── tensor.zig                 #   Multi-dimensional tensor type
│   │   ├── memory_mapping.zig         #   mmap/mlock for model files
│   │   ├── gguf_format.zig            #   GGUF binary format parser
│   │   ├── blas_integration.zig       #   OpenBLAS / MKL / Accelerate bridge
│   │   └── threading.zig              #   Thread pool and NUMA-aware scheduling
│   │
│   ├── linear_algebra/                # Layer 2 -- Linear Algebra
│   │   ├── matrix_ops.zig             #   SIMD matrix multiplication & dot product
│   │   ├── quantization.zig           #   Basic quantisation (Q4_0, Q8_0, INT8)
│   │   ├── k_quantization.zig         #   K-quantisation family (Q2_K .. Q6_K)
│   │   └── iq_quantization.zig        #   Importance quantisation (IQ1_S .. IQ4_XS)
│   │
│   ├── neural_primitives/             # Layer 3 -- Neural Primitives
│   │   ├── activations.zig            #   ReLU, GELU, SiLU, SwiGLU, GeGLU, etc.
│   │   ├── normalization.zig          #   LayerNorm, RMSNorm, BatchNorm, GroupNorm
│   │   └── embeddings.zig             #   Token embeddings, positional encodings, RoPE
│   │
│   ├── transformers/                  # Layer 4 -- Transformers
│   │   ├── attention.zig              #   Multi-head & grouped-query attention
│   │   ├── feed_forward.zig           #   Standard, GELU, and gated FFN variants
│   │   └── transformer_block.zig      #   Full pre-norm / post-norm blocks
│   │
│   ├── models/                        # Layer 5 -- Models
│   │   ├── config.zig                 #   ModelConfig, ModelSize, hyperparameters
│   │   ├── tokenizer.zig              #   SimpleTokenizer, BPE support
│   │   ├── gguf.zig                   #   GGUF model file loader
│   │   ├── chat_templates.zig         #   ChatML, Alpaca, Vicuna, etc.
│   │   ├── llama.zig                  #   LLaMA / LLaMA 2 architecture
│   │   ├── mistral.zig                #   Mistral (sliding-window attention)
│   │   ├── gpt2.zig                   #   GPT-2
│   │   ├── falcon.zig                 #   Falcon (multi-query attention)
│   │   ├── qwen.zig                   #   Qwen (YARN RoPE scaling)
│   │   ├── phi.zig                    #   Phi (partial RoPE)
│   │   ├── gptj.zig                   #   GPT-J (parallel residuals)
│   │   ├── gpt_neox.zig               #   GPT-NeoX (fused QKV)
│   │   ├── bloom.zig                  #   BLOOM (ALiBi attention)
│   │   ├── mamba.zig                  #   Mamba (state-space model)
│   │   ├── bert.zig                   #   BERT (bidirectional encoder)
│   │   ├── gemma.zig                  #   Gemma (soft capping)
│   │   ├── starcoder.zig              #   StarCoder (code generation)
│   │   ├── mixture_of_experts.zig     #   MoE routing and load balancing
│   │   └── multi_modal.zig            #   Vision-language cross-modal projection
│   │
│   ├── inference/                     # Layer 6 -- Inference
│   │   ├── generation.zig             #   Autoregressive text generation loop
│   │   ├── advanced_sampling.zig      #   Mirostat, typical, tail-free, contrastive
│   │   ├── kv_cache.zig               #   KV cache allocation and management
│   │   ├── streaming.zig              #   Real-time token streaming
│   │   ├── batching.zig               #   Batch inference scheduler
│   │   ├── grammar_constraints.zig    #   JSON, RegEx, CFG constraint engines
│   │   └── profiling.zig              #   Inference timing and memory profiling
│   │
│   ├── server/                        # Tools -- HTTP server & CLI
│   │   ├── http_server.zig            #   REST API compatible with llama.cpp
│   │   └── cli.zig                    #   Command-line interface
│   │
│   ├── tools/                         # Tools -- Converters & evaluation
│   │   ├── model_converter.zig        #   Convert between model formats
│   │   └── converter_cli.zig          #   CLI wrapper for the converter
│   │
│   └── evaluation/                    # Tools -- Model quality
│       └── perplexity.zig             #   Perplexity evaluation pipeline
│
├── tests/                             # Standalone test suites (unit, integration, parity)
│   ├── unit/
│   │   ├── test_tensor.zig            #   Foundation layer unit tests
│   │   └── test_linear_algebra.zig    #   Linear algebra unit tests
│   ├── test_neural_primitives.zig     #   Neural primitives tests
│   ├── test_basic_linear_algebra.zig  #   Additional LA tests
│   ├── test_transformer_components.zig #   Transformer layer tests
│   ├── test_models.zig                #   Model architecture tests
│   ├── test_inference.zig             #   Inference pipeline tests
│   ├── models_test.zig                #   Extended model tests
│   ├── threading_test.zig             #   Threading and concurrency tests
│   ├── gguf_test.zig                  #   GGUF format tests
│   ├── chat_templates_test.zig        #   Chat template rendering tests
│   ├── multi_modal_test.zig           #   Multi-modal integration tests
│   ├── multi_modal_basic_test.zig     #   Multi-modal basic tests
│   ├── multi_modal_simple_test.zig    #   Multi-modal simple tests
│   ├── multi_modal_concepts_test.zig  #   Multi-modal concept tests
│   ├── advanced_quantization_test.zig #   K-quant and IQ-quant tests
│   ├── perplexity_test.zig            #   Perplexity evaluation tests
│   ├── server_test.zig                #   HTTP server tests
│   ├── model_converter_test.zig       #   Converter tests
│   └── production_parity_test.zig     #   End-to-end parity with llama.cpp
│
├── examples/                          # Standalone runnable demos
│   ├── simple_demo.zig                #   End-to-end tour
│   ├── main.zig                       #   Library entry-point demo
│   ├── educational_demo.zig           #   Detailed educational walkthrough
│   ├── benchmark_demo.zig             #   Performance benchmarks
│   ├── parity_demo.zig                #   llama.cpp feature comparison
│   ├── gguf_demo.zig                  #   GGUF loading demo
│   ├── model_architectures_demo.zig   #   18-architecture tour
│   ├── chat_templates_demo.zig        #   Chat template demo
│   ├── multi_modal_demo.zig           #   Vision-language demo
│   ├── multi_modal_concepts_demo.zig  #   Multi-modal concepts
│   ├── threading_demo.zig             #   Multi-threaded inference demo
│   └── perplexity_demo.zig            #   Model evaluation demo
│
├── benchmarks/                        # Performance measurement tools
│   └── main.zig                       #   Benchmark entry point
│
├── documentation/                     # MkDocs documentation site
│   ├── mkdocs.yml                     #   Site configuration and nav
│   ├── requirements.txt               #   Python dependencies for MkDocs
│   └── docs/                          #   Markdown content (this site)
│       ├── index.md
│       ├── getting-started/
│       ├── architecture/
│       ├── foundations/
│       ├── linear-algebra/
│       ├── neural-primitives/
│       ├── transformers/
│       ├── models/
│       ├── inference/
│       ├── tools/
│       ├── api/
│       ├── examples/
│       ├── performance/
│       ├── references/
│       ├── javascripts/
│       └── stylesheets/
│
├── docs/                              # Legacy / quick-reference docs
│   └── ...
│
└── llama.cpp/                         # Reference: llama.cpp source (read-only)

Source Code Organisation by Layer

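The src/ directory is the heart of ZigLlama. Each of the six layer directories maps to exactly one architectural layer, and dependencies point strictly downward: a layer may import from the layers beneath it, but a lower layer never imports a higher one. The server/, tools/, and evaluation/ directories are tooling that sits on top of the layer stack rather than layers in their own right.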

flowchart BT
    F["foundation/"] --> LA["linear_algebra/"]
    LA --> NP["neural_primitives/"]
    NP --> TR["transformers/"]
    TR --> MO["models/"]
    MO --> IN["inference/"]

    IN --> SV["server/"]
    MO --> TO["tools/"]
    IN --> EV["evaluation/"]

    style F fill:#e8daef,stroke:#7d3c98
    style LA fill:#d5f5e3,stroke:#27ae60
    style NP fill:#fdebd0,stroke:#e67e22
    style TR fill:#d6eaf8,stroke:#2e86c1
    style MO fill:#fadbd8,stroke:#e74c3c
    style IN fill:#fef9e7,stroke:#f1c40f
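
The layering rule can be expressed as a small predicate. The sketch below is illustrative only: the `Layer` enum and `mayImport` function are hypothetical names that do not exist in the repository, and allowing imports within the same layer is an assumption.

```zig
const std = @import("std");

/// Hypothetical enum mirroring the six layer directories under src/.
/// The integer value encodes the position in the stack (1 = lowest).
const Layer = enum(u8) {
    foundation = 1,
    linear_algebra = 2,
    neural_primitives = 3,
    transformers = 4,
    models = 5,
    inference = 6,
};

/// Returns true when a module in `importer` may import a module in
/// `imported` under the "lower layers never import higher ones" rule.
/// Same-layer imports are allowed here; that is an assumption.
fn mayImport(importer: Layer, imported: Layer) bool {
    return @intFromEnum(imported) <= @intFromEnum(importer);
}

test "lower layers never import higher ones" {
    // models/ reaching down into foundation/ is fine.
    try std.testing.expect(mayImport(.models, .foundation));
    // foundation/ reaching up into inference/ would break the rule.
    try std.testing.expect(!mayImport(.foundation, .inference));
}
```

Saved to a scratch file, this can be checked with `zig test <file>.zig`.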

Layer 1 -- Foundation (`src/foundation/`)

Dependency rule

Foundation modules import only from std. They do not depend on any other ZigLlama layer.

| Module | Responsibility | Key types |
| --- | --- | --- |
| `tensor.zig` | Multi-dimensional array with shape, strides, and element access | `Tensor(T)` |
| `memory_mapping.zig` | Memory-mapped file I/O (`mmap` on POSIX, `MapViewOfFile` on Windows) | `MappedFile` |
| `gguf_format.zig` | Parser for the GGUF binary container format | `GGUFHeader`, `GGUFTensorInfo` |
| `blas_integration.zig` | Thin wrapper over external BLAS libraries | `BlasBackend` |
| `threading.zig` | Thread pool with work-stealing and optional NUMA affinity | `ThreadPool`, `Task` |
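
To make the "shape, strides, and element access" responsibility concrete, here is a minimal sketch of row-major stride arithmetic. The helpers `rowMajorStrides` and `flatOffset` are illustrative names, not taken from `tensor.zig`, and the real `Tensor(T)` may lay out its data differently.

```zig
const std = @import("std");

/// Computes row-major (C-order) strides, measured in elements, for `shape`.
/// The last dimension is contiguous, so its stride is 1.
fn rowMajorStrides(comptime rank: usize, shape: [rank]usize) [rank]usize {
    var strides: [rank]usize = undefined;
    var running: usize = 1;
    var i: usize = rank;
    while (i > 0) {
        i -= 1;
        strides[i] = running;
        running *= shape[i];
    }
    return strides;
}

/// Maps a multi-dimensional index to an offset into the flat data buffer.
fn flatOffset(comptime rank: usize, strides: [rank]usize, index: [rank]usize) usize {
    var offset: usize = 0;
    for (strides, index) |s, ix| offset += s * ix;
    return offset;
}

test "strides and offsets for a 2x3x4 tensor" {
    const strides = rowMajorStrides(3, .{ 2, 3, 4 });
    try std.testing.expectEqual([3]usize{ 12, 4, 1 }, strides);
    // 1*12 + 2*4 + 3*1 = 23
    try std.testing.expectEqual(@as(usize, 23), flatOffset(3, strides, .{ 1, 2, 3 }));
}
```

Keeping strides separate from the shape is what lets views such as transposes reuse the same buffer without copying.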

Layer 2 -- Linear Algebra (`src/linear_algebra/`)

| Module | Responsibility | Key types |
| --- | --- | --- |
| `matrix_ops.zig` | SIMD-accelerated matrix multiplication, dot product, element-wise ops | `SimdMatMul` |
| `quantization.zig` | Basic quantisation formats: Q4_0, Q8_0, INT8 | `QuantizedTensor`, `QuantFormat` |
| `k_quantization.zig` | K-quantisation family (Q2_K through Q6_K) with block-wise scaling | `KQuantBlock` |
| `iq_quantization.zig` | Importance quantisation (IQ1_S through IQ4_XS) with non-uniform codebooks | `IQCodebook` |
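
As a rough illustration of block-wise quantisation, the sketch below quantises 32 floats to signed 8-bit values that share one scale, in the spirit of Q8_0. It is not the code from `quantization.zig`: `BlockQ8`, `quantizeBlock`, and `dequantize` are hypothetical names, and the scale is kept as `f32` for brevity where ggml's Q8_0 stores it as `f16`.

```zig
const std = @import("std");

/// Number of values sharing one scale. 32 matches ggml's Q8_0 block size.
const block_size = 32;

/// Simplified Q8_0-style block: one scale plus 32 signed 8-bit values.
const BlockQ8 = struct {
    scale: f32,
    values: [block_size]i8,
};

fn quantizeBlock(input: [block_size]f32) BlockQ8 {
    // Find the largest magnitude so that it maps to +/-127.
    var max_abs: f32 = 0;
    for (input) |x| max_abs = @max(max_abs, if (x < 0) -x else x);

    const scale: f32 = max_abs / 127.0;
    const inv_scale: f32 = if (scale == 0) 0 else 1.0 / scale;

    var block = BlockQ8{ .scale = scale, .values = undefined };
    for (input, &block.values) |x, *q| {
        q.* = @intFromFloat(@round(x * inv_scale));
    }
    return block;
}

fn dequantize(block: BlockQ8, i: usize) f32 {
    return @as(f32, @floatFromInt(block.values[i])) * block.scale;
}

test "round trip stays within half a quantisation step" {
    var input: [block_size]f32 = undefined;
    for (&input, 0..) |*x, i| x.* = @as(f32, @floatFromInt(i)) - 16.0;

    const block = quantizeBlock(input);
    for (input, 0..) |x, i| {
        try std.testing.expectApproxEqAbs(x, dequantize(block, i), block.scale * 0.5 + 1e-6);
    }
}
```

The K- and IQ-families refine this idea with nested scales and codebooks, which this sketch does not attempt to show.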

Layer 3 -- Neural Primitives (`src/neural_primitives/`)

| Module | Responsibility | Key types |
| --- | --- | --- |
| `activations.zig` | Activation functions: ReLU, GELU, SiLU, SwiGLU, GeGLU, GLU, Tanh, Sigmoid | `ActivationFn` |
| `normalization.zig` | Normalisation layers: LayerNorm, RMSNorm, BatchNorm, GroupNorm | `RMSNorm`, `LayerNorm` |
| `embeddings.zig` | Token embeddings, sinusoidal positional encodings, RoPE, segment embeddings | `EmbeddingTable`, `RoPEEncoder` |
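
For orientation, RMSNorm scales each element by the reciprocal root-mean-square of the vector and then by a learned weight. The free function below is a minimal sketch under that definition; its name and signature are assumptions and do not reflect the `RMSNorm` type exported by `normalization.zig`.

```zig
const std = @import("std");

/// RMSNorm: out[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i].
/// Writes into `out`; all three slices must have the same length.
fn rmsNorm(out: []f32, x: []const f32, weight: []const f32, eps: f32) void {
    std.debug.assert(out.len == x.len and x.len == weight.len);

    var sum_sq: f32 = 0;
    for (x) |v| sum_sq += v * v;

    const mean_sq = sum_sq / @as(f32, @floatFromInt(x.len));
    const inv_rms = 1.0 / @sqrt(mean_sq + eps);

    for (out, x, weight) |*o, v, w| o.* = v * inv_rms * w;
}

test "unit weights rescale by the reciprocal RMS" {
    const x = [_]f32{ 3.0, -4.0 };
    const w = [_]f32{ 1.0, 1.0 };
    var y: [2]f32 = undefined;
    rmsNorm(&y, &x, &w, 0.0);
    // mean(x^2) = 12.5, so each element is divided by sqrt(12.5).
    try std.testing.expectApproxEqAbs(@as(f32, 0.8485281), y[0], 1e-5);
    try std.testing.expectApproxEqAbs(@as(f32, -1.1313708), y[1], 1e-5);
}
```

Unlike LayerNorm, no mean is subtracted and no bias is added, which is why the routine needs only a single pass to gather statistics.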

Layer 4 -- Transformers (`src/transformers/`)

| Module | Responsibility | Key types |
| --- | --- | --- |
| `attention.zig` | Scaled dot-product attention, multi-head attention, grouped-query attention, sliding window | `MultiHeadAttention` |
| `feed_forward.zig` | Position-wise FFN variants: standard, GELU-gated, SwiGLU-gated | `FeedForward` |
| `transformer_block.zig` | Complete encoder/decoder blocks with residual connections and normalisation | `TransformerBlock` |
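
Scaled dot-product attention is the core operation in this layer. The sketch below handles a single query vector with compile-time dimensions so that it needs no allocator. `attendOne` is a hypothetical helper, not part of `attention.zig`, and it omits masking, multiple heads, and batching.

```zig
const std = @import("std");

/// Scaled dot-product attention for one query vector:
/// out = softmax(q . K^T / sqrt(d)) . V
/// `keys` and `values` each hold `n` rows of dimension `d`.
fn attendOne(
    comptime n: usize,
    comptime d: usize,
    q: [d]f32,
    keys: [n][d]f32,
    values: [n][d]f32,
) [d]f32 {
    const scale = 1.0 / @sqrt(@as(f32, @floatFromInt(d)));

    // Scores: scaled dot product of the query with each key.
    var scores: [n]f32 = undefined;
    var max_score = -std.math.inf(f32);
    for (keys, &scores) |k, *s| {
        var dot: f32 = 0;
        for (q, k) |a, b| dot += a * b;
        s.* = dot * scale;
        max_score = @max(max_score, s.*);
    }

    // Softmax, subtracting the maximum for numerical stability.
    var sum: f32 = 0;
    for (&scores) |*s| {
        s.* = @exp(s.* - max_score);
        sum += s.*;
    }

    // Output: attention-weighted sum of the value rows.
    var out = [_]f32{0} ** d;
    for (scores, values) |s, v| {
        for (&out, v) |*o, x| o.* += (s / sum) * x;
    }
    return out;
}

test "equal scores average the value rows" {
    const q = [2]f32{ 1.0, 0.0 };
    const keys = [2][2]f32{ .{ 0.0, 1.0 }, .{ 0.0, 1.0 } };
    const values = [2][2]f32{ .{ 1.0, 2.0 }, .{ 3.0, 4.0 } };
    const out = attendOne(2, 2, q, keys, values);
    try std.testing.expectApproxEqAbs(@as(f32, 2.0), out[0], 1e-6);
    try std.testing.expectApproxEqAbs(@as(f32, 3.0), out[1], 1e-6);
}
```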

Layer 5 -- Models (`src/models/`)

This is the largest layer, containing 18 model architectures plus shared infrastructure.

| Module | Responsibility |
| --- | --- |
| `config.zig` | Hyperparameter definitions for all model variants |
| `tokenizer.zig` | Tokenisation (simple, BPE, SentencePiece-compatible) |
| `gguf.zig` | High-level GGUF model loader (wraps `foundation/gguf_format.zig`) |
| `chat_templates.zig` | Prompt formatting for ChatML, Alpaca, Vicuna, Zephyr, etc. |
| `llama.zig` | LLaMA / LLaMA 2 forward pass |
| `mistral.zig` | Mistral with sliding-window attention and GQA |
| `gpt2.zig` | GPT-2 with learned positional embeddings |
| `falcon.zig` | Falcon with multi-query attention and parallel blocks |
| `qwen.zig` | Qwen with YARN RoPE scaling |
| `phi.zig` | Phi with partial RoPE and QK-LayerNorm |
| `gptj.zig` | GPT-J with parallel residual connections |
| `gpt_neox.zig` | GPT-NeoX with fused QKV projections |
| `bloom.zig` | BLOOM with ALiBi positional bias |
| `mamba.zig` | Mamba state-space model with selective scan |
| `bert.zig` | BERT bidirectional encoder |
| `gemma.zig` | Gemma with soft-capping and GQA |
| `starcoder.zig` | StarCoder for code generation with fill-in-the-middle |
| `mixture_of_experts.zig` | MoE routing, expert selection, load balancing |
| `multi_modal.zig` | Vision transformer encoder and cross-modal projector |

Layer 6 -- Inference (`src/inference/`)

| Module | Responsibility | Key types |
| --- | --- | --- |
| `generation.zig` | Autoregressive generation loop and sampling dispatch | `TextGenerator`, `GenerationResult` |
| `advanced_sampling.zig` | Mirostat, typical sampling, tail-free sampling, contrastive search | `SamplingStrategy` |
| `kv_cache.zig` | Key-value cache allocation, eviction, and ring-buffer management | `KVCache` |
| `streaming.zig` | Token-by-token streaming with thread-safe buffer | `StreamingSession` |
| `batching.zig` | Batch scheduler: continuous batching, padding, dynamic grouping | `BatchScheduler` |
| `grammar_constraints.zig` | Constrained decoding: JSON schema, regex, context-free grammar | `GrammarEngine` |
| `profiling.zig` | Per-layer timing, memory watermarks, throughput metrics | `InferenceProfiler` |
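
The "ring-buffer management" mentioned for `kv_cache.zig` can be pictured with the sketch below. It is a guess at the general technique and not the actual `KVCache`: `RingCache` is a hypothetical type, and each slot records only a token position where a real cache would hold key and value tensors per layer.

```zig
const std = @import("std");

/// Hypothetical fixed-capacity ring buffer of cache slots.
fn RingCache(comptime capacity: usize) type {
    return struct {
        const Self = @This();

        positions: [capacity]usize = undefined,
        len: usize = 0, // number of valid slots
        next: usize = 0, // slot the next token will be written to

        /// Stores `pos`, overwriting (evicting) the oldest entry when full.
        /// Returns the index of the slot that was written.
        fn push(self: *Self, pos: usize) usize {
            const slot = self.next;
            self.positions[slot] = pos;
            self.next = (self.next + 1) % capacity;
            if (self.len < capacity) self.len += 1;
            return slot;
        }

        /// Oldest token position still held in the cache.
        fn oldest(self: *const Self) usize {
            std.debug.assert(self.len > 0);
            const slot = if (self.len < capacity) 0 else self.next;
            return self.positions[slot];
        }
    };
}

test "oldest entries are evicted first" {
    var cache = RingCache(4){};
    var pos: usize = 0;
    while (pos < 6) : (pos += 1) _ = cache.push(pos);
    try std.testing.expectEqual(@as(usize, 4), cache.len);
    // Positions 0 and 1 were overwritten by 4 and 5.
    try std.testing.expectEqual(@as(usize, 2), cache.oldest());
}
```

Overwriting in place keeps memory use constant for long generations, at the cost of losing the earliest context.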

Test Organisation

Tests in ZigLlama live in two places:

Inline tests -- test blocks inside source files in src/. These are the primary unit tests and are collected automatically by zig build test.
Standalone test files -- files in tests/ that exercise cross-module interactions, integration scenarios, and production parity checks.

flowchart LR
    subgraph "Inline tests (src/)"
        T1["tensor.zig tests"]
        T2["matrix_ops.zig tests"]
        T3["attention.zig tests"]
        T4["generation.zig tests"]
    end

    subgraph "Standalone tests (tests/)"
        S1["test_tensor.zig"]
        S2["test_linear_algebra.zig"]
        S3["production_parity_test.zig"]
        S4["multi_modal_test.zig"]
    end

    B["zig build test"] --> T1 & T2 & T3 & T4
    B --> S1 & S2 & S3 & S4

Naming conventions

Files prefixed with test_ (e.g., test_tensor.zig) contain unit tests for a specific module.
Files suffixed with _test (e.g., gguf_test.zig) contain integration or feature-level tests.
production_parity_test.zig validates end-to-end behaviour against llama.cpp reference outputs.
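
For readers new to Zig, an inline test is a `test` block placed next to the code it exercises. The example below is illustrative: `silu` is written from the standard definition and is not copied from `activations.zig`.

```zig
const std = @import("std");

/// SiLU (swish) activation: x * sigmoid(x), written as x / (1 + e^-x).
pub fn silu(x: f32) f32 {
    return x / (1.0 + @exp(-x));
}

// Inline tests sit beside the function; their names read as sentences.
test "silu of zero is zero" {
    try std.testing.expectEqual(@as(f32, 0.0), silu(0.0));
}

test "silu approaches identity for large positive inputs" {
    try std.testing.expectApproxEqAbs(@as(f32, 20.0), silu(20.0), 1e-4);
}
```

A single file's inline tests can be run on their own with `zig test <path-to-file>.zig`, which is handy while iterating on one module.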

Test Distribution

| Layer / Category | Test files | Approximate test count |
| --- | --- | --- |
| Foundation | `unit/test_tensor.zig`, inline in `foundation/*.zig` | 8 |
| Linear Algebra | `unit/test_linear_algebra.zig`, `test_basic_linear_algebra.zig`, `advanced_quantization_test.zig` | 25 |
| Neural Primitives | `test_neural_primitives.zig` | 12 |
| Transformers | `test_transformer_components.zig` | 15 |
| Models | `test_models.zig`, `models_test.zig`, `gguf_test.zig`, `chat_templates_test.zig`, `multi_modal_*.zig` | 120 |
| Inference | `test_inference.zig`, `server_test.zig`, `perplexity_test.zig`, `threading_test.zig` | 80 |
| Parity | `production_parity_test.zig` | 25 |
| Total | 20 test files | 285+ |

Examples Organisation

All examples reside in examples/ as self-contained .zig files. Each can be run with zig run examples/<name>.zig and requires no arguments.

| Category | Examples | Purpose |
| --- | --- | --- |
| Overview | `simple_demo.zig`, `main.zig`, `educational_demo.zig` | Broad tours of the architecture |
| Performance | `benchmark_demo.zig` | Throughput and latency measurement |
| Compatibility | `parity_demo.zig`, `gguf_demo.zig` | llama.cpp feature comparison and format support |
| Model breadth | `model_architectures_demo.zig`, `chat_templates_demo.zig` | All 18 architectures and prompt formats |
| Advanced | `multi_modal_demo.zig`, `multi_modal_concepts_demo.zig` | Vision-language pipelines |
| Systems | `threading_demo.zig` | Concurrent inference and NUMA |
| Evaluation | `perplexity_demo.zig` | Model quality scoring |

Documentation Organisation

The documentation lives under documentation/ and is built with Material for MkDocs.

documentation/
├── mkdocs.yml           # Navigation, theme, extensions
├── requirements.txt     # Python: mkdocs-material, plugins
└── docs/
    ├── index.md         # Home page (this documentation site)
    ├── getting-started/ # Installation, quick start, building, project structure
    ├── architecture/    # Design principles, 6-layer overview, llama.cpp comparison
    ├── foundations/     # Layer 1 deep dives
    ├── linear-algebra/  # Layer 2 deep dives
    ├── neural-primitives/ # Layer 3 deep dives
    ├── transformers/    # Layer 4 deep dives
    ├── models/          # Layer 5 deep dives (one page per architecture)
    ├── inference/       # Layer 6 deep dives
    ├── tools/           # Server, CLI, converter, perplexity
    ├── api/             # Per-module API reference
    ├── examples/        # Tutorials and demo walkthroughs
    ├── performance/     # Benchmarks, optimisation, parity analysis
    ├── references/      # Papers, glossary, contributing, changelog
    ├── javascripts/     # MathJax configuration
    └── stylesheets/     # Custom CSS overrides

Building the Documentation Locally

cd documentation
pip install -r requirements.txt
mkdocs serve
# Open http://127.0.0.1:8000 in your browser

The `main.zig` Module Map

The library entry point src/main.zig re-exports every layer as a nested namespace. This means downstream code can reach all six layers through a single @import:

// The path is resolved relative to the importing file; adjust it to match your tree.
const zigllama = @import("src/main.zig");

// Access any layer:
const Tensor = zigllama.foundation.tensor.Tensor;
const MatMul = zigllama.linear_algebra.matrix_ops;
const Attention = zigllama.transformers.attention;
const LLaMA = zigllama.models.llama;
const Generator = zigllama.inference.generation;

flowchart TD
    MAIN["src/main.zig"] --> F["foundation"]
    MAIN --> LA["linear_algebra"]
    MAIN --> NP["neural_primitives"]
    MAIN --> TR["transformers"]
    MAIN --> MO["models"]
    MAIN --> IN["inference"]

    F --> F1["tensor"] & F2["memory_mapping"] & F3["gguf_format"] & F4["blas_integration"] & F5["threading"]
    LA --> LA1["matrix_ops"] & LA2["quantization"] & LA3["k_quantization"] & LA4["iq_quantization"]
    NP --> NP1["activations"] & NP2["normalization"] & NP3["embeddings"]
    TR --> TR1["attention"] & TR2["feed_forward"] & TR3["transformer_block"]
    MO --> MO1["llama"] & MO2["config"] & MO3["tokenizer"] & MO4["gguf"] & MO5["... 15 more"]
    IN --> IN1["generation"] & IN2["kv_cache"] & IN3["streaming"] & IN4["batching"] & IN5["profiling"]

Not all modules are re-exported

src/main.zig re-exports the core library modules. Utility modules (server/, tools/, evaluation/) are imported directly by their respective entry points rather than through the library namespace.
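
The nested-namespace pattern can be sketched in a single file. In the real tree each namespace would presumably be a `pub const name = @import("...");` line; inline structs stand in for those files here so that the example compiles on its own. The placeholder `Tensor` body and the `default_max_tokens` constant are invented for illustration.

```zig
const std = @import("std");

// Stand-in for a root file such as src/main.zig.
pub const foundation = struct {
    pub const tensor = struct {
        /// Placeholder generic type; the real `Tensor(T)` is richer.
        pub fn Tensor(comptime T: type) type {
            return struct { data: []const T };
        }
    };
};

pub const inference = struct {
    pub const generation = struct {
        pub const default_max_tokens: usize = 128; // hypothetical constant
    };
};

test "nested namespaces resolve with dotted paths" {
    const Tensor = foundation.tensor.Tensor;
    const t = Tensor(f32){ .data = &[_]f32{ 1.0, 2.0 } };
    try std.testing.expectEqual(@as(usize, 2), t.data.len);
    try std.testing.expectEqual(@as(usize, 128), inference.generation.default_max_tokens);
}
```

Because Zig files are themselves structs, swapping each inline struct for an `@import` leaves the dotted access paths unchanged.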

Conventions

| Convention | Rule |
| --- | --- |
| File naming | `snake_case.zig` everywhere |
| Module exports | One primary type per file; the file name is the snake_case form of that type or topic (e.g., `tensor.zig` defines `Tensor`) |
| Test naming | `test "descriptive name"` blocks; names read as sentences |
| Error sets | Named error sets per module (`TensorError`, `GGUFError`, etc.) |
| Documentation | `///` doc comments on every public declaration |
| Memory | Caller provides `allocator`; callee never stores a global allocator |
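
Several of these conventions can be seen together in one small function. The sketch below is hypothetical: `allocZeroed` and the members of this `TensorError` set are chosen for illustration and may not match the declarations in `tensor.zig`.

```zig
const std = @import("std");

/// Named error set for the module, following the `TensorError` convention.
pub const TensorError = error{ EmptyShape, OutOfMemory };

/// Allocates a zero-filled buffer with one element per tensor entry.
/// The caller supplies the allocator and owns (must free) the result.
pub fn allocZeroed(allocator: std.mem.Allocator, shape: []const usize) TensorError![]f32 {
    if (shape.len == 0) return TensorError.EmptyShape;

    var count: usize = 1;
    for (shape) |dim| count *= dim;

    const data = try allocator.alloc(f32, count);
    @memset(data, 0);
    return data;
}

test "caller owns the buffer and frees it" {
    const allocator = std.testing.allocator; // reports leaks when the test ends
    const data = try allocZeroed(allocator, &.{ 2, 3 });
    defer allocator.free(data);
    try std.testing.expectEqual(@as(usize, 6), data.len);
    try std.testing.expectError(TensorError.EmptyShape, allocZeroed(allocator, &.{}));
}
```

Passing the allocator explicitly is what allows tests to substitute `std.testing.allocator` and catch leaks without touching library code.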

Next Steps

Architecture Overview -- understand the design principles behind this structure.
Layer 1: Foundations -- begin the technical deep dive with tensors and memory management.
API Reference -- per-function documentation for every public symbol.