Skip to content

Changelog

This changelog documents ZigLlama's development milestones. Each version represents a complete architectural layer, following the project's progressive bottom-up design. Within each milestone, components are listed in dependency order -- foundational pieces first, higher-level abstractions second.


v0.6.0 -- Inference Engine

Layer 6: Text generation, sampling, caching, streaming, and batch processing.

This milestone completes the full inference stack, enabling end-to-end text generation from a loaded model.

Added

  • Text Generation (text_generation.zig). Autoregressive token-by-token generation with configurable stopping criteria (max tokens, EOS token, stop sequences).
  • Sampling Strategies (sampling.zig). Eight decoding strategies:
    • Greedy (argmax).
    • Top-k sampling with configurable \( k \).
    • Top-p (nucleus) sampling with cumulative probability threshold.
    • Temperature scaling.
    • Mirostat v1 and v2 adaptive sampling.
    • Locally typical sampling.
    • Tail-free sampling.
  • Repetition Penalty. Logit-based penalty for previously generated tokens, with configurable penalty factor and lookback window.
  • KV Cache (kv_cache.zig). Pre-allocated key-value cache supporting incremental updates. Per-token cost drops from \( O(T) \) to \( O(1) \) for previously computed positions.
  • Streaming (streaming.zig). Callback-based token emission for real-time output. Tokens are delivered to the caller as soon as they are decoded, before the full sequence is complete.
  • Batch Processing (batch_processing.zig). Concurrent processing of multiple independent sequences with length-based grouping for efficient padding.
  • Grammar-Constrained Decoding. BNF-based grammar constraints that mask invalid tokens at each generation step, ensuring output conforms to a specified grammar (e.g., JSON, code).
  • Profiling and Diagnostics. Per-layer timing, memory-usage tracking, and token-throughput reporting for performance analysis.

v0.5.0 -- Model Architectures

Layer 5: Eighteen model families, GGUF loading, tokenisation, and chat templates.

This milestone adds support for loading and running real pre-trained models.

Added

  • LLaMA / LLaMA 2 (llama.zig). The primary reference architecture: RMSNorm, SwiGLU, RoPE, pre-norm blocks. Supports 7B, 13B, 33B, and 65B configurations. Llama 2 adds grouped-query attention.
  • Mistral (mistral.zig). Sliding window attention with configurable window size, GQA, and SwiGLU.
  • GPT-2 (gpt2.zig). Post-norm transformer with GELU activation and learned positional embeddings.
  • Falcon (falcon.zig). Multi-query attention, ALiBi positional encoding, and parallel attention-FFN computation.
  • Qwen (qwen.zig). RoPE with NTK-aware scaling, SwiGLU, and RMSNorm.
  • Phi (phi.zig). Partial RoPE application, dense attention, and GELU activation.
  • GPT-J (gptj.zig). Rotary embeddings applied to a subset of head dimensions, parallel attention-FFN layout.
  • GPT-NeoX (gpt_neox.zig). Full rotary embeddings with parallel attention-FFN and post-norm configuration.
  • BLOOM (bloom.zig). ALiBi positional encoding, LayerNorm, and GELU activation.
  • Mamba (mamba.zig). Selective state-space model with linear-time sequence processing, selective scan mechanism.
  • BERT (bert.zig). Bidirectional encoder with masked language modelling, segment embeddings, and [CLS]/[SEP] tokens.
  • Gemma (gemma.zig). GeGLU activation, RoPE, and RMSNorm with Google's weight initialisation conventions.
  • StarCoder (starcoder.zig). Multi-query attention, fill-in-the-middle support, and code-specific tokenisation.
  • Mixture of Experts (mixture_of_experts.zig). Top-k expert routing with load-balancing, configurable number of experts and active expert count.
  • Multi-Modal (multi_modal.zig). Vision-language architecture with ViT image encoder, projection layer, and cross-attention to the language model.
  • Model Converter (model_converter.zig). Conversion between quantisation formats for existing GGUF files.
  • Perplexity Evaluation (perplexity.zig). Token-level log-likelihood computation for model quality assessment.
  • GGUF Loader (gguf_format.zig). Version 3 format reader with typed metadata extraction, tensor descriptor parsing, and alignment validation.
  • Tokenizer (tokenizer.zig). SentencePiece (LLaMA), byte-level BPE (GPT-2), and WordPiece (BERT) support with encode/decode round-trip fidelity.
  • Model Configuration (model_config.zig). Unified configuration struct covering all 18 architectures with architecture-specific parameter validation.
  • Chat Templates. Jinja-style template rendering for instruction-tuned models (Llama 2 Chat, Mistral Instruct, ChatML).

v0.4.0 -- Transformer Components

Layer 4: Multi-head attention, feed-forward networks, and complete transformer blocks.

Added

  • Multi-Head Attention (multi_head_attention.zig). Full MHA with support for MHA, GQA, and MQA configurations. Includes:
    • Q/K/V linear projections.
    • Scaled dot-product attention with causal masking.
    • Output projection and concatenation.
    • Sliding window attention (Mistral).
    • ALiBi bias computation (BLOOM, Falcon).
  • Feed-Forward Networks (feed_forward.zig). Standard two-layer FFN and gated variants:
    • Standard: \( W_2 \cdot \sigma(W_1 x + b_1) + b_2 \).
    • SwiGLU: \( W_2 \cdot (\text{SiLU}(W_1 x) \odot W_3 x) \).
    • GeGLU: \( W_2 \cdot (\text{GELU}(W_1 x) \odot W_3 x) \).
  • Transformer Blocks (transformer_block.zig). Composable block abstraction supporting:
    • Pre-norm and post-norm ordering.
    • Parallel and sequential attention-FFN layout.
    • Configurable normalization (RMSNorm, LayerNorm).
    • Configurable attention type (MHA, GQA, MQA).

v0.3.0 -- Neural Primitives

Layer 3: Activation functions, normalization layers, embeddings, and positional encodings.

Added

  • Activation Functions (activation_functions.zig).
    • ReLU, Leaky ReLU, GELU (exact and fast approximation).
    • Sigmoid, Tanh, SiLU (Swish).
    • SwiGLU and GeGLU gated activations.
    • Softmax with log-sum-exp numerical stabilisation.
  • Normalization (normalization.zig).
    • Layer Normalization (Ba et al., 2016).
    • RMS Normalization (Zhang & Sennrich, 2019).
    • Configurable epsilon, learnable scale and bias parameters.
  • Embeddings (embeddings.zig).
    • Token embedding lookup.
    • Learned positional embeddings (GPT-2).
    • Segment embeddings (BERT).
    • Tied embedding / un-embedding weight sharing.
  • Rotary Position Embeddings (rope.zig).
    • Standard RoPE with configurable base frequency.
    • NTK-aware frequency scaling for context extension.
    • Partial application (Phi-style, applied to a subset of dimensions).
    • YaRN interpolation support.

v0.2.0 -- Linear Algebra

Layer 2: SIMD-accelerated matrix operations and quantisation formats.

Added

  • SIMD Operations (simd_operations.zig).
    • Vectorised dot product, matrix-vector multiply, matrix-matrix multiply.
    • Cache-blocking for L1/L2 residency.
    • Auto-vectorisation targeting AVX2, AVX-512, and NEON via Zig @Vector.
  • Basic Quantization (quantization.zig).
    • Q4_0: 4-bit quantisation with per-block scale.
    • Q4_1: 4-bit with per-block scale and minimum.
    • Q5_0, Q5_1: 5-bit variants.
    • Q8_0, Q8_1: 8-bit quantisation.
    • Quantise and dequantise routines for all formats.
  • K-Quantization (k_quantization.zig).
    • Q2_K, Q3_K, Q4_K, Q5_K, Q6_K formats.
    • Super-block structure with nested scale quantisation.
    • SIMD-accelerated dequantisation kernels.
  • IQ-Quantization (iq_quantization.zig).
    • IQ1_S, IQ1_M: ultra-low 1-bit formats.
    • IQ2_XXS, IQ2_XS, IQ2_S: 2-bit formats.
    • IQ3_XXS, IQ3_S: 3-bit formats.
    • IQ4_NL, IQ4_XS: 4-bit non-linear formats.
    • Importance-based weight selection.
    • Lookup-table dequantisation.

v0.1.0 -- Foundation Layer

Layer 1: Core data structures, memory management, model file I/O, and threading.

Added

  • Tensor Operations (tensor.zig).
    • Generic Tensor(T) struct with compile-time type specialisation.
    • Row-major memory layout with explicit shape and stride.
    • Element-wise arithmetic, slicing, reshaping, transposition.
    • Bounds-checked access with optional runtime safety.
  • Memory Management.
    • Allocator-explicit design throughout the codebase.
    • defer / errdefer resource cleanup patterns.
    • Arena allocator for per-inference scratch memory.
  • Memory-Mapped I/O (memory_mapping.zig).
    • MemoryMap abstraction over POSIX mmap.
    • ModelFileMapper for zero-copy model weight access.
    • Automatic page alignment and size calculation.
  • GGUF Binary Format (gguf_format.zig).
    • Version 3 header parsing (magic, version, tensor count, metadata count).
    • Typed metadata key-value store (uint8 through string arrays).
    • Tensor descriptor parsing (name, dimensions, type, offset).
    • Alignment-aware data section access.
  • BLAS Integration (blas_integration.zig).
    • Vtable-based interface supporting OpenBLAS, MKL, and Accelerate.
    • Pure-Zig SIMD fallback when no external BLAS is available.
    • GEMM, GEMV, and batch operations.
  • CPU Threading (threading.zig).
    • Work-stealing thread pool with configurable worker count.
    • NUMA-aware task distribution.
    • Parallel matrix and attention operations.
  • Basic Math (math.zig).
    • Fused multiply-add, fast reciprocal square root.
    • Log-sum-exp for numerical stability.
    • Half-precision (f16) conversion utilities.

Current Status

Metric Value
Test cases 285+ (all passing)
Examples 12
Model architectures 18 families
Quantisation formats 18+
Sampling strategies 8
Source lines (approx.) ~30,000
llama.cpp production parity ~90%

Roadmap

The following items are under consideration for future milestones. No timeline is committed.

  • Speculative decoding for accelerated generation.
  • Beam search as an additional decoding strategy.
  • C API header generation for cross-language integration.
  • WebAssembly compilation target for browser-based inference.
  • Additional model architectures as they are released in GGUF format.
  • Expanded documentation including video walkthroughs and interactive notebooks.