The 6-Layer Progressive Architecture

ZigLlama decomposes the problem of running a large language model into six progressively more abstract layers. A reader can begin at Layer 1 and build a working mental model before advancing to Layer 2; by the time they reach Layer 6 they have seen every concept needed to generate text from a transformer.

1. Architectural Philosophy

Why progressive layers?

Language models are deep stacks of repeated operations, but they are typically presented either as opaque library calls or as a wall of linear-algebra code. Neither approach is good for learning.

ZigLlama's layered design offers a middle path:

| Problem with monolithic code | How layers solve it |
| --- | --- |
| "Where does the tensor come from?" | Layer 1 defines tensors in isolation, with full doc-comments and tests. |
| "What is quantisation doing to my weights?" | Layer 2 treats quantisation as a self-contained topic, tested against known-good values. |
| "How does RMSNorm fit into the model?" | Layer 3 implements normalisation as an independent primitive; Layer 4 composes it into blocks. |
| "What is KV caching and why does it matter?" | Layer 6 explains and benchmarks caching without requiring the reader to understand model loading (Layer 5). |

Progressive Disclosure

Progressive disclosure is an interaction-design principle: present the simplest useful information first, and reveal complexity only when the user asks for it. ZigLlama applies this principle to source code.

The dependency invariant

\[ \forall\, m \in L_i,\; n \in L_j : \text{imports}(m, n) \implies j < i \]

Every module in layer \( i \) may depend only on modules in layers strictly below \( i \). The Zig compiler does not enforce this on its own (files that import each other still compile), so the invariant is maintained by convention and code review.
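The invariant can also be checked mechanically over an import graph. The following is a minimal sketch in Python, chosen for brevity; it is not part of ZigLlama. The module-to-layer map is written by hand with names taken from the diagram in Section 2, and the `violations` helper is hypothetical.

```python
# Hypothetical module -> layer map; names follow the diagram in Section 2.
LAYER = {
    "tensor": 1,
    "matrix_ops": 2,
    "normalization": 3,
    "attention": 4,
    "llama": 5,
    "generation": 6,
}

def violations(imports, layer=LAYER):
    """Return every (importer, imported) pair that breaks j < i."""
    return [
        (m, n)
        for m, deps in imports.items()
        for n in deps
        if not layer[n] < layer[m]
    ]

ok = {
    "matrix_ops": ["tensor"],
    "attention": ["matrix_ops", "normalization"],
    "llama": ["attention", "normalization"],
    "generation": ["llama"],
}
# Illustrative violation: a Layer 1 module importing a Layer 4 module.
bad = dict(ok, tensor=["attention"])

print(violations(ok))   # []
print(violations(bad))  # [('tensor', 'attention')]
```

A check of this kind could run in CI next to the unit tests, with the import lists extracted from `@import` statements instead of being written by hand.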

2. Layer Dependency Diagram

```mermaid
graph TB
    subgraph "Layer 6 -- Inference"
        gen[generation.zig]
        kv[kv_cache.zig]
        stream[streaming.zig]
        batch[batching.zig]
        prof[profiling.zig]
        adv_samp[advanced_sampling.zig]
        gram[grammar_constraints.zig]
    end

    subgraph "Layer 5 -- Models"
        llama[llama.zig]
        cfg[config.zig]
        tok[tokenizer.zig]
        gguf[gguf.zig]
        bert[bert.zig]
        gpt2[gpt2.zig]
        mistral[mistral.zig]
        falcon[falcon.zig]
        phi[phi.zig]
        bloom[bloom.zig]
        gemma[gemma.zig]
        qwen[qwen.zig]
        mamba[mamba.zig]
        gptj[gptj.zig]
        gptneox[gpt_neox.zig]
        starcoder[starcoder.zig]
        moe[mixture_of_experts.zig]
        mm[multi_modal.zig]
        chat[chat_templates.zig]
    end

    subgraph "Layer 4 -- Transformers"
        attn[attention.zig]
        ff[feed_forward.zig]
        tb[transformer_block.zig]
    end

    subgraph "Layer 3 -- Neural Primitives"
        act[activations.zig]
        norm[normalization.zig]
        emb[embeddings.zig]
    end

    subgraph "Layer 2 -- Linear Algebra"
        matops[matrix_ops.zig]
        quant[quantization.zig]
        kquant[k_quantization.zig]
        iquant[iq_quantization.zig]
    end

    subgraph "Layer 1 -- Foundation"
        tensor[tensor.zig]
        mmap[memory_mapping.zig]
        gguf_fmt[gguf_format.zig]
        blas[blas_integration.zig]
        thread[threading.zig]
    end

    %% Layer 6 -> 5
    gen --> llama
    gen --> tok
    stream --> gen
    batch --> llama & tok & gen & kv
    prof --> gen & batch & kv

    %% Layer 5 -> 4
    llama --> attn & ff & tb
    llama --> norm & emb

    %% Layer 4 -> 3 & 2
    attn --> matops & act & norm
    ff --> matops & act
    tb --> attn & ff & norm

    %% Layer 3 -> 1
    act --> tensor
    norm --> tensor
    emb --> tensor

    %% Layer 2 -> 1
    matops --> tensor
    quant --> tensor
    kquant --> tensor
    iquant --> tensor

    %% Foundation internal
    gguf_fmt --> tensor & quant
    mmap --> tensor
    blas --> tensor & thread
```

3. Layer-by-Layer Reference

Layer 1 -- Foundation

Purpose

Provide the lowest-level abstractions that every higher layer depends on: typed multi-dimensional arrays, memory-mapped I/O, GGUF binary parsing, BLAS dispatch, and thread-pool management.

| Component | File | Description |
| --- | --- | --- |
| `Tensor(T)` | `foundation/tensor.zig` | Generic \( n \)-dimensional array with row-major storage, shape metadata, stride calculation, and basic operations (indexing, fill, print, `matmul`). |
| `MemoryMap` | `foundation/memory_mapping.zig` | POSIX `mmap` wrapper for loading multi-gigabyte model files without copying them into heap memory. Supports read/write protection, huge pages, and page locking. |
| `GGUFReader` | `foundation/gguf_format.zig` | Full GGUF v3 parser: magic-number validation, metadata key-value pairs, tensor descriptors, quantisation-type tags, and byte-offset calculation. |
| `BlasInterface` | `foundation/blas_integration.zig` | Runtime BLAS backend selection among OpenBLAS, Intel MKL, Apple Accelerate, ATLAS, and a pure-Zig generic fallback. Auto-detects available libraries. |
| `ThreadPool` | `foundation/threading.zig` | Work-stealing thread pool with NUMA-aware allocation, CPU topology detection, and configurable affinity. |

Key types exported:

```zig
pub const Tensor = foundation.tensor.Tensor;     // Tensor(f32), Tensor(i32), ...
pub const MemoryMap = memory_mapping.MemoryMap;
pub const GGUFReader = gguf_format.GGUFReader;
pub const BlasConfig = blas_integration.BlasConfig;
pub const ThreadPoolConfig = threading.ThreadPoolConfig;
```

Dependencies: std only.

Test count: 6 (tensor shape, indexing, fill, matmul, print, error cases).

Layer 2 -- Linear Algebra

Purpose

Accelerate the numerical core: SIMD-vectorised matrix multiplication, cache-blocking algorithms, and a comprehensive quantisation framework covering legacy (Q4/Q8), K-quant, and importance-quant (IQ) formats.

| Component | File | Description |
| --- | --- | --- |
| `matmulSIMD` | `linear_algebra/matrix_ops.zig` | Auto-vectorised matrix multiplication with compile-time SIMD width detection (AVX, AVX2, NEON). Falls back to scalar on unknown targets. |
| `QuantizedTensor` | `linear_algebra/quantization.zig` | Per-channel and per-group quantisation for Q4_0, Q4_1, Q8_0, INT8, and F16. Includes dequantisation and GGUF-compatible block layouts. |
| `KQuantizer` | `linear_algebra/k_quantization.zig` | K-quantisation formats (Q4_K, Q5_K, Q6_K) with 256-element super-blocks and sub-block scaling, matching the llama.cpp `k_quants` specification. |
| `IQuantizer` | `linear_algebra/iq_quantization.zig` | Importance quantisation (IQ1_S through IQ4_NL): 12 formats that allocate more bits to statistically important weights. |
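As an illustration of block quantisation, the Python sketch below round-trips one Q8_0-style block: 32 weights share a single scale, and each weight is stored as a signed byte. It follows the general shape of the GGUF Q8_0 layout (an fp16 scale plus 32 int8 codes, 34 bytes in total) but simplifies details such as rounding mode, and the function names are hypothetical rather than taken from `quantization.zig`.

```python
import struct

BLOCK = 32  # weights per Q8_0 block

def quantize_q8_0(values):
    """Quantise one block of 32 floats to (scale, 32 int8 codes)."""
    assert len(values) == BLOCK
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    codes = [round(v / scale) if scale else 0 for v in values]
    return scale, codes

def dequantize_q8_0(scale, codes):
    """Recover approximate floats; the error is at most scale / 2 per weight."""
    return [scale * q for q in codes]

def pack_q8_0(scale, codes):
    """fp16 scale followed by 32 signed bytes: 34 bytes per block."""
    return struct.pack("<e32b", scale, *codes)

weights = [i / 10 - 1.6 for i in range(BLOCK)]
scale, codes = quantize_q8_0(weights)
restored = dequantize_q8_0(scale, codes)
print(len(pack_q8_0(scale, codes)))                        # 34
print(max(abs(a - b) for a, b in zip(weights, restored)))  # bounded by scale / 2
```

At 34 bytes per 32 weights the format costs 8.5 bits per weight, which is why quantised model sizes are slightly larger than a flat one byte per parameter would suggest.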

Key mathematical operation -- blocked matrix multiply:

\[ C_{ij} = \sum_{k=0}^{K-1} A_{ik} \, B_{kj} \]

Blocked into \( b \times b \) tiles to fit in L1 cache:

\[ C_{IJ} = \sum_{K'} A_{IK'} \, B_{K'J}, \quad I = [i \cdot b,\, (i+1) \cdot b), \; \text{etc.} \]
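The tiling above can be sketched in a few lines. The Python below is a didactic version, not the `matmulSIMD` implementation: it shows that iterating tile by tile produces the same result as the naive triple loop. The function names and the `tile` parameter are assumptions made for the example.

```python
def matmul_naive(a, b):
    """Reference triple loop: C[i][j] = sum_k A[i][k] * B[k][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def matmul_blocked(a, b, tile=2):
    """Same product, visiting tile x tile sub-blocks of A, B and C."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):
                # Accumulate the contribution of one pair of tiles into C.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_blocked(a, b))  # [[58.0, 64.0], [139.0, 154.0]]
```

In pure Python the tiling brings no speed-up; the benefit appears in compiled code, where each tile of `A`, `B` and `C` stays resident in cache while it is reused.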

SIMD Speed-up

On AVX2-capable x86_64 CPUs, a 256-bit SIMD register holds 8 single-precision floats, so each vector instruction does the work of 8 scalar operations -- a theoretical 8x throughput gain, typically realised as 4--6x once memory bandwidth becomes the bottleneck.

Dependencies: Layer 1 (Tensor).

Test count: 5 (SIMD correctness, quantise/dequantise round-trip, block-format layout, edge cases).

Layer 3 -- Neural Primitives

Purpose

Implement the building blocks that sit between raw linear algebra and full transformer layers: activation functions, normalisation layers, and embedding lookups.

| Component | File | Description |
| --- | --- | --- |
| Activations | `neural_primitives/activations.zig` | ReLU, GELU, SiLU, GLU, GeGLU, SwiGLU, Tanh, Sigmoid -- both scalar and tensor-wide variants. |
| Normalisation | `neural_primitives/normalization.zig` | LayerNorm, RMSNorm, BatchNorm, GroupNorm with configurable \(\varepsilon\) and learnable scale/shift. |
| Embeddings | `neural_primitives/embeddings.zig` | `TokenEmbedding` (vocabulary lookup), sinusoidal positional encoding, learned positional encoding, Rotary Position Embeddings (RoPE), and segment embeddings. |

Key mathematical definitions:

RMSNorm (Zhang & Sennrich, 2019):

\[ \text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \varepsilon}} \odot \gamma \]
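A direct transcription of the RMSNorm formula, as a Python sketch on plain lists (the name `rms_norm` and the default `eps` are assumptions, not values taken from `normalization.zig`):

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    """Divide x by its root-mean-square, then scale element-wise by gamma."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

y = rms_norm([3.0, 4.0], gamma=[1.0, 1.0])
print(sum(v * v for v in y) / len(y))  # close to 1: the output has unit RMS
```

Unlike LayerNorm, no mean is subtracted, so the output is (up to \(\varepsilon\)) invariant to rescaling the input but not to shifting it.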

SwiGLU (Shazeer, 2020):

\[ \text{SwiGLU}(x, W_1, W_2, W_3) = \bigl(\text{SiLU}(x W_1) \odot (x W_3)\bigr)\, W_2 \]
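The order of operations matters here: the gate and up projections are multiplied element-wise first, and only then projected down by \( W_2 \). A minimal Python sketch with hypothetical helper names, for a single row vector \( x \):

```python
import math

def silu(z):
    """SiLU (swish): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def swiglu_ffn(x, w1, w2, w3):
    gate = [silu(v) for v in matvec(x, w1)]     # SiLU(x W1)
    up = matvec(x, w3)                          # x W3
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise product
    return matvec(hidden, w2)                   # (...) W2

identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn([1.0, 2.0], w1=identity, w2=[[1.0], [1.0]], w3=identity)
print(out)  # one value: silu(1)*1 + silu(2)*2
```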

Rotary Position Embedding:

\[ \text{RoPE}(x_m, m) = \begin{pmatrix} x_m^{(1)} \cos m\theta_1 - x_m^{(2)} \sin m\theta_1 \\ x_m^{(1)} \sin m\theta_1 + x_m^{(2)} \cos m\theta_1 \\ \vdots \end{pmatrix} \]
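The rotation can be sketched as follows. This Python version assumes the frequencies \( \theta_i = 10000^{-2(i-1)/d} \) from the RoFormer paper and rotates adjacent pairs of dimensions; other implementations pair dimension \( i \) with \( i + d/2 \) instead, and the layout used by `embeddings.zig` is not specified here.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle pos * theta_i."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]
# The score depends only on the relative offset (2 in both cases).
print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

Because each pair is rotated, vector norms are preserved, and the dot product between a rotated query and key depends only on the distance between their positions.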

Dependencies: Layer 1 (Tensor).

Test count: 9 (activation properties, normalisation invariants, embedding lookup, RoPE rotation).

Layer 4 -- Transformers

Purpose

Compose neural primitives into the core transformer building blocks: multi-head attention, position-wise feed-forward networks, and complete encoder / decoder / encoder-decoder blocks.

| Component | File | Description |
| --- | --- | --- |
| `MultiHeadAttention` | `transformers/attention.zig` | Scaled dot-product attention with configurable head count, causal masking, and cross-attention support. |
| `FeedForward` | `transformers/feed_forward.zig` | Five FFN variants: Standard (ReLU), GELU, SwiGLU, GeGLU, and classic GLU. |
| `TransformerBlock` | `transformers/transformer_block.zig` | Full blocks in Encoder, Decoder, and EncoderDecoder configurations with Pre-Norm or Post-Norm placement. |

Scaled dot-product attention (Vaswani et al., 2017):

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \]

Multi-head decomposition:

\[ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) \, W^O \]

where each head computes:

\[ \text{head}_i = \text{Attention}(Q W_i^Q,\, K W_i^K,\, V W_i^V) \]
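For a single head, the attention formula with a causal mask can be written out directly. The Python sketch below works on nested lists and is meant to show the data flow, not the `MultiHeadAttention` interface; the function names and the `causal` flag are assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax; -inf entries receive weight 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, causal=True):
    """q, k: [seq, d_k]; v: [seq, d_v]; returns [seq, d_v]."""
    d_k = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if causal and j > i:
                scores.append(float("-inf"))  # position i cannot see the future
            else:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k))
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out

q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
result = attention(q, k, v)
print(result[0])  # [1.0, 2.0]: the first position attends only to itself
print(result[1])  # a weighted average of both value rows
```

A multi-head layer would run this once per head on the projected inputs \( Q W_i^Q \), \( K W_i^K \), \( V W_i^V \), concatenate the results, and apply \( W^O \).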

Why scale by \( 1/\sqrt{d_k} \)?

If the entries of \( Q \) and \( K \) are independent random variables with zero mean and unit variance, then each entry of \( Q K^\top \) is a sum of \( d_k \) independent products and therefore has variance \( d_k \). Dividing by \( \sqrt{d_k} \) restores unit variance, which keeps the softmax in its non-saturated regime and improves gradient flow.
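The variance claim follows from \( \operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k \), since each product \( q_i k_i \) has mean 0 and variance 1. It can also be checked empirically. The Python sketch below draws standard-normal vectors (one distribution that satisfies the assumptions) and estimates the variance of their dot product; the sample count and seed are arbitrary choices.

```python
import random

def dot_variance(d_k, samples=3000, seed=0):
    """Empirical variance of q . k for standard-normal q and k of length d_k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / samples
    return sum((x - mean) ** 2 for x in dots) / samples

for d_k in (16, 64):
    raw = dot_variance(d_k)
    # The ratio raw / d_k should be close to 1, i.e. Var(q . k) is about d_k;
    # after dividing the scores by sqrt(d_k) the variance is about 1.
    print(d_k, round(raw / d_k, 2))
```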

Dependencies: Layers 1--3 (Tensor, matrix_ops, activations, normalization).

Test count: 11 (attention output shape, causal mask, FFN variants, block residual connection, pre/post-norm equivalence).

Layer 5 -- Models

Purpose

Define complete model architectures, configuration presets, tokenisation pipelines, and the GGUF loading path that turns a binary file into a runnable model.

Supported architectures (18)

| # | Architecture | File | Notes |
| --- | --- | --- | --- |
| 1 | LLaMA | `models/llama.zig` | Primary reference implementation. 7B--65B presets. |
| 2 | Mistral | `models/mistral.zig` | Sliding-window attention variant. |
| 3 | Falcon | `models/falcon.zig` | Multi-query attention. |
| 4 | GPT-2 | `models/gpt2.zig` | Classic autoregressive decoder. |
| 5 | GPT-J | `models/gptj.zig` | Rotary embeddings + parallel attention. |
| 6 | GPT-NeoX | `models/gpt_neox.zig` | EleutherAI architecture. |
| 7 | BERT | `models/bert.zig` | Encoder-only, masked LM. |
| 8 | BLOOM | `models/bloom.zig` | Multilingual, ALiBi positional. |
| 9 | Phi | `models/phi.zig` | Microsoft small-model family. |
| 10 | Gemma | `models/gemma.zig` | Google DeepMind. |
| 11 | Qwen | `models/qwen.zig` | Alibaba Cloud. |
| 12 | StarCoder | `models/starcoder.zig` | Code generation. |
| 13 | Mamba | `models/mamba.zig` | State-space model (SSM). |
| 14 | Mixture of Experts | `models/mixture_of_experts.zig` | Sparse MoE routing. |
| 15 | Multi-modal | `models/multi_modal.zig` | Vision + language. |
| 16--18 | CodeLlama variants | via `config.zig` | 7B, 13B, 34B code presets. |

Supporting modules

| Module | File | Description |
| --- | --- | --- |
| `ModelConfig` | `models/config.zig` | `ModelSize` enum (LLaMA_7B -- CodeLlama_34B), activation and normalisation type enums, parameter-count calculator. |
| `SimpleTokenizer` | `models/tokenizer.zig` | BPE / SentencePiece-compatible tokeniser with special-token handling. |
| `GGUFLoader` | `models/gguf.zig` | High-level GGUF loading: reads header, resolves tensor offsets, dequantises weights into `Tensor(f32)`. |
| `ChatTemplates` | `models/chat_templates.zig` | Prompt formatting for ChatML, LLaMA-2-Chat, Alpaca, Vicuna, and custom templates. |
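The first step of GGUF loading, reading the header, can be sketched in Python with the `struct` module. The sketch assumes the GGUF v2/v3 layout (4-byte magic `GGUF`, little-endian `uint32` version, `uint64` tensor count, `uint64` metadata key-value count) and parses a synthetic buffer rather than a real model file; `read_gguf_header` is a hypothetical helper, not the `GGUFLoader` API.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_FORMAT = "<4sIQQ"  # magic, version, tensor_count, metadata_kv_count

def read_gguf_header(buf):
    """Parse the fixed 24-byte header of a GGUF v2/v3 file."""
    if len(buf) < struct.calcsize(HEADER_FORMAT):
        raise ValueError("buffer too short for a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack_from(HEADER_FORMAT, buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}

# Synthetic header with made-up counts, standing in for the start of a model file.
header = struct.pack(HEADER_FORMAT, GGUF_MAGIC, 3, 291, 19)
print(read_gguf_header(header))
# {'version': 3, 'tensor_count': 291, 'metadata_kv_count': 19}
```

After the header come the metadata key-value pairs and the tensor descriptors, whose offsets tell the loader where each weight block lives in the file.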

Dependencies: Layers 1--4.

Test count: 45 (config presets, tokeniser encode/decode, GGUF header parsing, LLaMA forward pass, model size calculations).

Layer 6 -- Inference

Purpose

Turn a loaded model into a text-generation system: autoregressive decoding, sampling strategies, KV caching, streaming output, batch processing, advanced sampling, grammar constraints, and performance profiling.

| Component | File | Description |
| --- | --- | --- |
| `TextGenerator` | `inference/generation.zig` | Autoregressive loop with Greedy, Top-K, Top-P, Temperature, and Combined sampling. |
| `KVCache` / `ModelKVCache` | `inference/kv_cache.zig` | Per-layer key/value cache with multi-sequence support and sliding-window eviction. Reduces redundant computation by >95% on long sequences. |
| `StreamingGenerator` | `inference/streaming.zig` | Thread-safe token-by-token streaming via producer/consumer buffer with back-pressure. |
| `BatchProcessor` | `inference/batching.zig` | Dynamic-batching engine: request queue, padding, concurrent execution, and per-request KV caches. |
| `AdvancedSampler` | `inference/advanced_sampling.zig` | Mirostat (v1/v2), Typical, Tail-Free, Locally Typical, Classifier-Free Guidance, and Contrastive Search. |
| `GrammarConstraint` | `inference/grammar_constraints.zig` | Constrained decoding for JSON, Regex, CFG, XML Schema, and EBNF grammars. |
| `Profiler` | `inference/profiling.zig` | Wall-clock timing, memory-usage tracking, tokens/sec measurement, and regression detection. |
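The scale of the KV-cache saving can be estimated with simple counting. The Python sketch below counts how many token positions have their key/value projections computed while generating `new_tokens` tokens, with and without a cache. It is a back-of-the-envelope model under the assumption that an uncached model reprocesses the whole sequence at every step; it is not a measurement of ZigLlama.

```python
def kv_projection_work(prompt_len, new_tokens, cached):
    """Count token positions whose K/V projections are computed."""
    if cached:
        # Process the prompt once, then one new position per later step.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step t re-projects the whole sequence so far.
    return sum(prompt_len + t for t in range(new_tokens))

without = kv_projection_work(16, 512, cached=False)
with_cache = kv_projection_work(16, 512, cached=True)
saved = 1 - with_cache / without
print(without, with_cache, f"{saved:.1%}")  # 139008 527 99.6%
```

The uncached cost grows quadratically with the number of generated tokens while the cached cost grows linearly, so the fraction saved approaches 100% on long sequences. The attention scores against all cached keys still have to be computed at every step.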

Autoregressive Generation

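```python
def generate(prompt, model, tokenize, detokenize, sample, config, max_tokens, EOS):
    tokens = tokenize(prompt)
    for step in range(max_tokens):
        logits = model.forward(tokens)       # scores for the next token
        next_token = sample(logits, config)  # greedy, top-k, top-p, ...
        tokens.append(next_token)
        if next_token == EOS:
            break
    return detokenize(tokens)
```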
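To see the loop run end to end, the Python sketch below pairs it with a toy "model" (a fixed next-token table) and a sampler that combines temperature, top-k, and top-p. Everything here is illustrative: the vocabulary, the `forward` stand-in, and the `sample` signature are assumptions and do not reflect the `TextGenerator` API.

```python
import math
import random

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token id; temperature == 0 means greedy decoding."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    ranked = sorted(((p / total, i) for i, p in enumerate(probs)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]           # keep the k most likely tokens
    kept, mass = [], 0.0
    for p, i in ranked:                   # nucleus: smallest prefix with mass >= top_p
        kept.append((p, i))
        mass += p
        if mass >= top_p:
            break
    r = rng.random() * mass               # draw from the renormalised survivors
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]

VOCAB = ["<eos>", "once", "upon", "a", "time"]
EOS = 0
NEXT = {1: 2, 2: 3, 3: 4, 4: 0}           # toy model: once -> upon -> a -> time -> <eos>

def forward(tokens):
    """Stand-in for model.forward: logits for the token after the sequence."""
    logits = [0.0] * len(VOCAB)
    logits[NEXT.get(tokens[-1], EOS)] = 5.0
    return logits

def generate(prompt_ids, max_tokens, **sampling):
    tokens = list(prompt_ids)
    for _ in range(max_tokens):
        next_token = sample(forward(tokens), **sampling)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

ids = generate([1], max_tokens=10, temperature=0)
print(" ".join(VOCAB[i] for i in ids))    # once upon a time <eos>
```

With `temperature=0` the output is deterministic; raising the temperature or loosening `top_k` and `top_p` lets lower-probability tokens through.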

Dependencies: Layers 1--5.

Test count: 47 (sampling distributions, KV cache append/evict, streaming delivery order, batch scheduling, profiler accuracy).

4. Data Flow Diagram

The following sequence diagram traces a single inference request from raw text to generated output.

```mermaid
sequenceDiagram
    participant User
    participant Gen as TextGenerator<br/>(Layer 6)
    participant Tok as Tokenizer<br/>(Layer 5)
    participant Model as LLaMAModel<br/>(Layer 5)
    participant TB as TransformerBlock<br/>(Layer 4)
    participant Attn as MultiHeadAttention<br/>(Layer 4)
    participant FFN as FeedForward<br/>(Layer 4)
    participant NP as Neural Primitives<br/>(Layer 3)
    participant LA as Linear Algebra<br/>(Layer 2)
    participant KV as KVCache<br/>(Layer 6)

    User->>Gen: generate("Once upon a time")
    Gen->>Tok: encode(prompt)
    Tok-->>Gen: token_ids[]

    loop For each generation step
        Gen->>Model: forward(token_ids)
        Model->>NP: TokenEmbedding.lookup(token_ids)
        NP-->>Model: embeddings [seq_len, d_model]

        loop For each transformer layer
            Model->>TB: forward(hidden_state)
            TB->>NP: RMSNorm(hidden_state)
            TB->>Attn: forward(Q, K, V)
            Attn->>LA: matmulSIMD(Q, K^T)
            LA-->>Attn: attention scores
            Attn->>NP: softmax(scores / sqrt(d_k))
            Attn->>LA: matmulSIMD(weights, V)
            LA-->>Attn: attention output
            Attn-->>TB: attended
            TB->>TB: residual + attended
            TB->>NP: RMSNorm(hidden_state)
            TB->>FFN: forward(normalized)
            FFN->>LA: matmulSIMD (gate, up, down)
            FFN->>NP: SwiGLU activation
            FFN-->>TB: ffn_output
            TB->>TB: residual + ffn_output
            TB-->>Model: layer_output
            Model->>KV: cache(K, V, layer_id)
        end

        Model->>NP: RMSNorm(final_hidden)
        Model->>LA: output_projection
        Model-->>Gen: logits [vocab_size]
        Gen->>Gen: sample(logits)
        Gen-->>User: next_token (streamed)
    end
```

5. Parameter and Memory Budget

The table below shows approximate parameter counts and memory requirements for LLaMA-family models at different quantisation levels.

Notation

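\( P \) = total parameter count.
Memory is computed as \( P \times \text{bytes per parameter} \). The Q8_0 and Q4_0 columns include the fp16 scale stored with each block of 32 weights (34 and 18 bytes per block, i.e. 8.5 and 4.5 bits per weight).
KV cache memory assumes sequence length 2048 and fp16 storage.

| Model | Parameters \( P \) | FP32 Memory | FP16 Memory | Q8_0 Memory | Q4_0 Memory | KV Cache (2048 ctx) |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | 6.7 B | 26.8 GB | 13.4 GB | 7.1 GB | 3.8 GB | ~1.0 GB |
| LLaMA-13B | 13.0 B | 52.0 GB | 26.0 GB | 13.8 GB | 7.3 GB | ~1.6 GB |
| LLaMA-30B | 32.5 B | 130.0 GB | 65.0 GB | 34.5 GB | 18.3 GB | ~3.2 GB |
| LLaMA-65B | 65.2 B | 260.8 GB | 130.4 GB | 69.3 GB | 36.7 GB | ~5.2 GB |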

Practical Guidance

On a consumer machine with 16 GB of RAM, Q4_0 quantisation makes the 7B model comfortable and the 13B model feasible. The 30B and 65B models require server-grade memory or aggressive quantisation (IQ2/IQ3).
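The budget arithmetic is easy to reproduce. The Python sketch below assumes that block formats are charged for their per-block fp16 scale (8.5 bits per weight for Q8_0 and 4.5 for Q4_0, which gives Q8_0 sizes about 6% above a flat one byte per weight) and that the KV cache stores one key and one value vector of width `d_model` per layer and position. The layer count and width used in the example are the published LLaMA-7B values; the function names are hypothetical.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "q8_0": 8.5, "q4_0": 4.5}

def weight_gb(params, fmt):
    """Weight memory in decimal gigabytes for a given storage format."""
    return params * BITS_PER_WEIGHT[fmt] / 8 / 1e9

def kv_cache_gb(n_layers, d_model, ctx=2048, bytes_per_value=2):
    """Keys and values: 2 tensors per layer, each [ctx, d_model]."""
    return 2 * n_layers * ctx * d_model * bytes_per_value / 1e9

# LLaMA-7B: 6.7 B parameters, 32 layers, d_model = 4096.
print(round(weight_gb(6.7e9, "q4_0"), 1))  # 3.8
print(round(kv_cache_gb(32, 4096), 2))     # 1.07
```

Adding the two figures gives a rough lower bound on resident memory for a single 2048-token sequence; activations and runtime overhead come on top.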

Memory breakdown for LLaMA-7B (Q4_0)

| Component | Size | Share |
| --- | --- | --- |
| Embedding matrix (\( V \times d \)) | 0.5 GB | 13% |
| Transformer layers (\( 32 \times \) attention + FFN) | 2.9 GB | 76% |
| Output projection | 0.3 GB | 8% |
| Norms and biases | 0.02 GB | <1% |
| KV cache (2048 tokens) | 0.08 GB | 2% |
| Total | ~3.8 GB | 100% |
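A per-component split can also be derived from the model's hyperparameters. The Python sketch below uses the published LLaMA-7B dimensions and makes a simplifying assumption that every weight matrix is stored as Q4_0 (18 bytes per 32 weights); norm weights and the KV cache are left out. A build that keeps the embedding or output matrices at higher precision, or that counts the KV cache, will show a different split, so the output should be read as one scenario rather than as a restatement of the table.

```python
V, D, LAYERS, FFN = 32000, 4096, 32, 11008  # published LLaMA-7B hyperparameters
Q4_0_BYTES = 18 / 32                        # 18 bytes per block of 32 weights

params = {
    "embedding": V * D,
    # Per layer: 4 attention matrices (D x D) and 3 FFN matrices (D x FFN).
    "transformer layers": LAYERS * (4 * D * D + 3 * D * FFN),
    "output projection": V * D,
}
total = sum(params.values())

for name, count in params.items():
    print(f"{name:20s} {count * Q4_0_BYTES / 1e9:5.2f} GB {count / total:6.1%}")
print(f"{'total':20s} {total * Q4_0_BYTES / 1e9:5.2f} GB")
```

The parameter total of about 6.74 billion agrees with the 6.7 B figure quoted for LLaMA-7B.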

Summary

The six-layer architecture transforms the inherently complex task of language model inference into a sequence of self-contained, testable, and documentable modules. Each layer adds exactly one level of abstraction:

1. Foundation -- data structures and system interfaces.
2. Linear Algebra -- fast numerics and compression.
3. Neural Primitives -- non-linearities and normalisation.
4. Transformers -- attention and feed-forward composition.
5. Models -- architecture definitions and weight loading.
6. Inference -- generation loop and production optimisations.

This separation is the single most important architectural decision in ZigLlama and the one most responsible for its educational effectiveness.