Quick Start Guide
This page gets you from a verified installation to interactive text generation in under ten minutes. It assumes you have already completed the Installation steps and that zig build test passes.
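Running Your First Demo
The fastest way to see ZigLlama in action is the bundled simple_demo (examples/simple_demo.zig). Like the other examples, it can be run with zig run from the repository root.
The demo walks through every layer of the architecture, printing educational commentary as it goes: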
- Tensor creation and matrix multiplication (Layer 1).
- SIMD-accelerated operations and quantisation (Layer 2).
- Activation functions and normalisation (Layer 3).
- Multi-head attention and feed-forward networks (Layer 4).
- LLaMA model initialisation and tokenisation (Layer 5).
- Autoregressive text generation with sampling (Layer 6).
No model download required
The demo uses randomly initialised weights. It demonstrates the full inference pipeline -- allocation, forward pass, sampling, token decoding -- without requiring a multi-gigabyte GGUF download.
Basic Usage Example
The following program is a minimal but complete inference pipeline. It initialises a LLaMA-7B configuration, creates a model and tokeniser, and generates text from a prompt.
```zig
const std = @import("std");
const zigllama = @import("src/main.zig");

pub fn main() !void {
    // Allocator used for every heap allocation in this program.
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // The model and the tokeniser are both sized from the same configuration.
    const config = zigllama.models.config.ModelConfig.llama(.LLaMA_7B);
    var model = try zigllama.models.llama.LLaMAModel.init(config, allocator);
    defer model.deinit();
    var tokenizer = try zigllama.models.tokenizer.SimpleTokenizer.init(allocator, config.vocab_size);
    defer tokenizer.deinit();

    // null: no explicit sampling configuration is supplied.
    var generator = zigllama.inference.generation.TextGenerator.init(&model, &tokenizer, allocator, null);
    const result = try generator.generate("The future of AI is");
    defer result.deinit(allocator);
    std.debug.print("Generated: {s}\n", .{result.text orelse ""});
}
```
Key types in the example
- `ModelConfig` -- a compile-time-friendly struct that encodes every hyperparameter of a model variant: hidden dimension, number of layers, number of attention heads, vocabulary size, activation type, and more.
- `LLaMAModel` -- the full LLaMA forward-pass implementation. On `init` it allocates weight buffers sized according to the config. With a real GGUF file, weights are memory-mapped instead of heap-allocated.
- `SimpleTokenizer` -- a lightweight tokeniser sufficient for demonstration. Production use cases should load a SentencePiece or BPE vocabulary from the GGUF metadata.
- `TextGenerator` -- the autoregressive loop. Each call to `generate` tokenises the prompt, runs repeated forward passes, samples the next token, and decodes the result back to text.
Step-by-step Walkthrough
```mermaid
sequenceDiagram
    participant User
    participant Generator as TextGenerator
    participant Model as LLaMAModel
    participant Tokeniser as SimpleTokenizer
    User ->> Generator: generate("The future of AI is")
    Generator ->> Tokeniser: encode(prompt)
    Tokeniser -->> Generator: token_ids[]
    loop For each new token
        Generator ->> Model: forward(token_ids)
        Model -->> Generator: logits[vocab_size]
        Generator ->> Generator: sample(logits)
        Generator ->> Tokeniser: decode(new_token)
        Tokeniser -->> Generator: text_piece
    end
    Generator -->> User: GenerationResult { text, tokens, timing }
```
Autoregressive generation
Given a prompt \( x_1, x_2, \ldots, x_n \), the generator produces tokens one at a time:
\[ x_{t+1} \sim \operatorname{Categorical}\!\left( \operatorname{softmax}\!\left( \mathbf{z}_t / \tau \right) \right), \qquad t = n, n+1, \ldots \]
where \( \mathbf{z}_t \) is the logit vector output by the model at position \( t \) and \( \tau \) is the temperature parameter. The sampling strategy (greedy, top-k, top-p, etc.) determines how the categorical distribution is truncated before drawing.
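The temperature-scaled sampling step described above can be sketched in a few lines of standalone Zig. This is an illustration only, not ZigLlama's implementation: the function names `softmaxWithTemperature` and `sampleCategorical` are hypothetical, and the uniform sample `u` is passed in as a fixed value instead of being drawn from a random number generator so that the example stays deterministic.

```zig
const std = @import("std");

/// Writes softmax(logits / temperature) into `probs`.
/// Subtracting the largest logit first keeps exp() from overflowing.
/// Hypothetical helper, not part of the ZigLlama API.
fn softmaxWithTemperature(logits: []const f32, temperature: f32, probs: []f32) void {
    std.debug.assert(logits.len == probs.len and logits.len > 0);
    std.debug.assert(temperature > 0.0);

    var max_logit: f32 = logits[0];
    for (logits) |z| max_logit = @max(max_logit, z);

    var sum: f32 = 0.0;
    for (logits, probs) |z, *p| {
        p.* = @exp((z - max_logit) / temperature);
        sum += p.*;
    }
    for (probs) |*p| p.* /= sum;
}

/// Returns the index selected by a uniform sample `u` in [0, 1)
/// by walking the cumulative distribution.
/// Hypothetical helper, not part of the ZigLlama API.
fn sampleCategorical(probs: []const f32, u: f32) usize {
    var cumulative: f32 = 0.0;
    for (probs, 0..) |p, i| {
        cumulative += p;
        if (u < cumulative) return i;
    }
    return probs.len - 1; // guards against floating-point rounding
}

pub fn main() void {
    const logits = [_]f32{ 2.0, 1.0, 0.1 };
    var probs: [3]f32 = undefined;
    softmaxWithTemperature(&logits, 0.8, &probs);
    const token = sampleCategorical(&probs, 0.5);
    std.debug.print("probs = {any}, sampled token = {d}\n", .{ probs, token });
}
```

Lowering the temperature below 1 sharpens the distribution towards the highest logit, while raising it above 1 flattens the distribution; a real generator would replace the fixed `u` with a fresh random draw for every token.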
Running Tests Layer by Layer
ZigLlama's test suite is organised to match the 6-layer architecture. You can validate each layer independently to build confidence before moving on.
```bash
# Layer 1 -- Foundation (tensors, memory management, GGUF)
zig build test-foundation

# Layer 2 -- Linear Algebra (SIMD, quantisation)
zig build test-linear-algebra

# All layers at once
zig build test
```
Test counts by layer
| Layer | Tests | Focus areas |
|---|---|---|
| 1. Foundation | 8 | Tensor ops, memory mapping, GGUF parsing |
| 2. Linear Algebra | 25 | SIMD mat-mul, K-quant, IQ-quant |
| 3. Neural Primitives | 12 | Activations, normalisation, embeddings |
| 4. Transformers | 15 | Attention, sliding window, FFN |
| 5. Models | 120 | 18 architectures, GGUF loading, tokenisation |
| 6. Inference | 80 | Generation, sampling, KV cache, streaming |
| Integration | 25 | Production parity, end-to-end |
| Total | 285+ | |
Example Files
The examples/ directory contains 12 self-contained programs. Each can be run directly with zig run.
| File | Description | Layers exercised |
|---|---|---|
| simple_demo.zig | End-to-end tour of all six layers | 1--6 |
| main.zig | Library entry-point demo | 1 |
| educational_demo.zig | Detailed educational walkthrough with commentary | 1--6 |
| benchmark_demo.zig | Performance benchmarks (SIMD, quantisation, caching) | 1--2, 6 |
| parity_demo.zig | Comparison with llama.cpp feature set | 5--6 |
| gguf_demo.zig | GGUF file loading and inspection | 1, 5 |
| model_architectures_demo.zig | Tour of all 18 supported architectures | 5 |
| chat_templates_demo.zig | Chat-template rendering (ChatML, Alpaca, etc.) | 5 |
| multi_modal_demo.zig | Vision-language multi-modal pipeline | 3--5 |
| multi_modal_concepts_demo.zig | Multi-modal theory and design patterns | 3--5 |
| threading_demo.zig | Multi-threaded inference and NUMA awareness | 1, 6 |
| perplexity_demo.zig | Perplexity evaluation and model quality metrics | 5--6 |
Run any example from the repository root by passing its path to zig run, for instance zig run examples/simple_demo.zig.
Reading order for newcomers
Start with simple_demo.zig for the big picture, then work through educational_demo.zig for deeper explanations. After that, pick examples that match your interests -- benchmark_demo.zig for performance, model_architectures_demo.zig for breadth, or gguf_demo.zig for format internals.
Understanding the Output
When you run the simple demo, expect output similar to:
```text
=== ZigLlama Foundation Layer Demo ===
Demonstrating basic tensor operations...
Matrix A (2x3):
[[1.0, 2.0, 3.0],
 [4.0, 5.0, 6.0]]
Matrix B (3x2):
[[1.0, 0.0],
 [0.0, 1.0],
 [0.5, 0.5]]
Result A x B (2x2):
[[2.5, 3.5],
 [7.0, 8.0]]
```
Matrix multiplication refresher
The result element at row \( i \), column \( j \) is the dot product of row \( i \) of \( A \) and column \( j \) of \( B \):
\[ C_{i,j} = \sum_{k} A_{i,k} \, B_{k,j} \]
For the element \( C_{0,0} \): \( 1 \times 1 + 2 \times 0 + 3 \times 0.5 = 2.5 \).
This operation is the computational backbone of every transformer layer: query-key-value projections, attention score computation, and feed-forward network evaluations are all matrix multiplications.
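To make the row-by-column rule concrete, the following standalone sketch multiplies the same 2x3 and 3x2 matrices shown in the demo output. It uses plain fixed-size arrays, not ZigLlama's tensor type, and the `matmul` helper is a hypothetical name introduced only for this illustration.

```zig
const std = @import("std");

/// Multiplies an m x k matrix by a k x n matrix using the textbook
/// triple loop. Hypothetical helper, not part of the ZigLlama API.
fn matmul(
    comptime m: usize,
    comptime k: usize,
    comptime n: usize,
    a: [m][k]f32,
    b: [k][n]f32,
) [m][n]f32 {
    var c: [m][n]f32 = undefined;
    for (0..m) |i| {
        for (0..n) |j| {
            // Dot product of row i of A with column j of B.
            var sum: f32 = 0.0;
            for (0..k) |p| {
                sum += a[i][p] * b[p][j];
            }
            c[i][j] = sum;
        }
    }
    return c;
}

pub fn main() void {
    const a = [2][3]f32{ .{ 1.0, 2.0, 3.0 }, .{ 4.0, 5.0, 6.0 } };
    const b = [3][2]f32{ .{ 1.0, 0.0 }, .{ 0.0, 1.0 }, .{ 0.5, 0.5 } };
    const c = matmul(2, 3, 2, a, b);
    for (c) |row| {
        std.debug.print("[{d:.1}, {d:.1}]\n", .{ row[0], row[1] });
    }
}
```

The computed entries should match the 2x2 result in the demo output. A production kernel would typically add blocking and SIMD on top of this loop structure, which is the territory of Layer 2.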
Loading a Real Model (Optional)
If you have a GGUF model file (e.g., downloaded from Hugging Face), you can load it directly:
```zig
const std = @import("std");
const zigllama = @import("src/main.zig");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Load model from GGUF file
    const gguf_file = try zigllama.models.gguf.GGUFFile.init(
        allocator,
        "models/llama-2-7b-chat.Q4_K_M.gguf",
    );
    defer gguf_file.deinit();

    // Extract configuration from GGUF metadata
    const config = try zigllama.models.config.ModelConfig.fromGGUF(gguf_file);
    std.debug.print("Model: {s}\n", .{config.name});
    std.debug.print("Parameters: {d:.1}B\n", .{config.parameterCount()});
    std.debug.print("Vocab size: {d}\n", .{config.vocab_size});
}
```
Model file sizes
GGUF files can be large. A 7B-parameter model in Q4_K_M quantisation is approximately 4 GB. Ensure you have sufficient disk space and RAM (or rely on memory mapping, which ZigLlama enables by default).
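As a rough sanity check on that figure, file size can be estimated from the parameter count \( N \) and the average number of bits stored per weight \( b \). The value \( b \approx 4.5 \) used here is an assumption chosen for illustration; the true average for Q4_K_M depends on the mix of quantisation types in the file, and metadata adds a small overhead on top.

\[ \text{size} \approx \frac{N \cdot b}{8} \ \text{bytes} = \frac{7 \times 10^{9} \times 4.5}{8} \ \text{bytes} \approx 3.9 \times 10^{9} \ \text{bytes} \]

The same formula with \( b = 16 \) gives about 14 GB for unquantised half-precision weights, which is why quantised files are the practical choice on most desktop hardware.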
Exploring the Inference Pipeline
The generation pipeline offers several configuration knobs. Here is a slightly more advanced example that demonstrates temperature scaling and top-p sampling:
```zig
// Fragment: assumes `model`, `tokenizer` and `allocator` are set up
// as in the basic usage example.
const sampling_config = zigllama.inference.generation.SamplingConfig{
    .strategy = .top_p,
    .temperature = 0.8,
    .top_p = 0.95,
    .max_tokens = 128,
};

var generator = zigllama.inference.generation.TextGenerator.init(
    &model,
    &tokenizer,
    allocator,
    &sampling_config,
);
const result = try generator.generate("Once upon a time");
defer result.deinit(allocator);
std.debug.print("{s}\n", .{result.text orelse ""});
```
Top-p (nucleus) sampling
Top-p sampling selects the smallest set \( V_p \subseteq V \) such that
\[ \sum_{x \in V_p} P(x \mid x_1, \ldots, x_t) \ge p \]
where \( p \) is the nucleus threshold (typically 0.9--0.95). This dynamically adjusts the number of candidate tokens based on the shape of the distribution, unlike top-k, which always considers a fixed number.
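The nucleus selection can be sketched as a greedy loop that keeps adding the most probable remaining token until the cumulative probability reaches the threshold. The code below is a standalone illustration under that assumption, not ZigLlama's sampler; `nucleusMask` is a hypothetical name, and the repeated linear scan is chosen for clarity over speed (a real implementation would usually sort or partially sort the candidates).

```zig
const std = @import("std");

/// Marks the smallest set of tokens whose probabilities sum to at least `p`.
/// `keep[i]` is set to true for every token inside the nucleus.
/// Returns the number of tokens kept.
/// Hypothetical helper, not part of the ZigLlama API.
fn nucleusMask(probs: []const f32, p: f32, keep: []bool) usize {
    std.debug.assert(probs.len == keep.len);
    for (keep) |*k| k.* = false;

    var cumulative: f32 = 0.0;
    var count: usize = 0;
    while (cumulative < p and count < probs.len) {
        // Find the most probable token that is not yet in the nucleus.
        var best: ?usize = null;
        for (probs, 0..) |prob, i| {
            if (keep[i]) continue;
            if (best == null or prob > probs[best.?]) best = i;
        }
        const idx = best.?;
        keep[idx] = true;
        cumulative += probs[idx];
        count += 1;
    }
    return count;
}

pub fn main() void {
    const probs = [_]f32{ 0.5, 0.3, 0.15, 0.05 };
    var keep: [4]bool = undefined;
    const kept = nucleusMask(&probs, 0.9, &keep);
    std.debug.print("kept {d} of {d} tokens: {any}\n", .{ kept, probs.len, keep });
}
```

After masking, the kept probabilities are normally renormalised so that they sum to 1 before a token is drawn from them.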
What to Read Next
After completing this quick start, choose the path that fits your goals:
| Goal | Next page |
|---|---|
| Understand the build system in depth | Building from Source |
| Navigate the codebase confidently | Project Structure |
| Learn the design philosophy | Architecture Overview |
| Dive into the math, starting from tensors | Layer 1: Foundations |
| Jump to attention mechanisms | Layer 4: Transformers |
| See all sampling strategies | Layer 6: Inference -- Sampling |