Layer 6: Inference
The Inference layer is the capstone of ZigLlama. It composes every layer beneath it -- tensors, linear algebra, neural primitives, transformer blocks, and model loading -- into a complete text generation pipeline. This is where a static collection of weight matrices becomes a system that produces language.
Given a prompt and a set of configuration parameters, the inference layer generates text autoregressively, one token at a time: it runs a forward pass through the model, converts the raw logits into a probability distribution, selects the next token via a sampling strategy, and repeats until a stop condition is met. The modules beyond this core loop (advanced sampling, KV caching, streaming, batching, grammar constraints, and profiling) address the engineering challenges of making it fast, controllable, and production-ready.
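The loop described above can be sketched in a few lines. This is an illustration in Python rather than ZigLlama's Zig code, and every name in it is an assumption: `toy_forward` is a hypothetical stand-in for the real model forward pass, and token selection is plain greedy argmax.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_forward(tokens, vocab_size):
    """Hypothetical stand-in for the model: favours (last_token + 1) % vocab_size."""
    logits = [0.0] * vocab_size
    logits[(tokens[-1] + 1) % vocab_size] = 5.0
    return logits

def generate(prompt_tokens, max_new_tokens, eos_token, vocab_size):
    """Autoregressive loop: forward pass -> softmax -> select -> append -> stop check."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = toy_forward(tokens, vocab_size)
        probs = softmax(logits)
        next_token = max(range(vocab_size), key=probs.__getitem__)  # greedy selection
        tokens.append(next_token)
        if next_token == eos_token:  # stop condition: end-of-sequence token
            break
    return tokens

print(generate([0], max_new_tokens=10, eos_token=4, vocab_size=6))  # [0, 1, 2, 3, 4]
```

The two stop conditions shown here (a token budget and an end-of-sequence token) are the minimum; a real generator would typically add others, such as stop strings.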
Learning Objectives
After completing the eight modules in this layer, you will be able to:
- Implement autoregressive text generation from first principles, including the forward pass, logit processing, and token selection loop.
- Compare sampling strategies -- greedy, temperature, top-k, top-p, and combined -- and predict their effects on output quality and diversity.
- Explain advanced sampling methods (Mirostat, typical, tail-free, contrastive search) and select appropriate strategies for different tasks.
- Design a KV cache that reduces per-token attention cost from \( O(n^2 \cdot d) \) to \( O(n \cdot d) \) for sequence length \( n \) and hidden dimension \( d \), including sliding-window variants.
- Build a streaming generation pipeline with thread-safe token buffers, callback-based delivery, and natural break detection.
- Architect a batch processing system with dynamic batching, request queuing, and throughput scaling analysis.
- Constrain generation output using grammar specifications (JSON, regex, context-free grammars) via token masking.
- Profile inference performance using RAII measurement blocks, percentile statistics, and the roofline model for bottleneck identification.
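To make the KV cache objective concrete, here is a minimal single-head sketch in Python (an illustration, not ZigLlama's `kv_cache.zig`). The `KVCache` class and the toy vectors are assumptions. It shows that attending over cached keys and values gives the same output for the newest position as recomputing causal attention for every position.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over the given positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(d)]

class KVCache:
    """Hypothetical single-head cache: one key and one value vector per position."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the newest position's key/value are added; older ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)  # n dot products of length d

def full_recompute(queries, keys, values):
    """No cache: redo causal attention for every position, keep the last output."""
    outputs = [attend(queries[t], keys[: t + 1], values[: t + 1])
               for t in range(len(queries))]
    return outputs[-1]  # 1 + 2 + ... + n dot products of length d

xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy vectors used as q, k and v alike
cache = KVCache()
for x in xs:
    cached_out = cache.step(x, x, x)
assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, full_recompute(xs, xs, xs)))
```

Counting query-key dot products makes the gap visible: at position n = 1000 the uncached path performs 1000 · 1001 / 2 = 500,500 of them, while the cached step performs 1,000.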
Prerequisites
Required Background
This layer assumes familiarity with:
- Layer 4 -- Transformers: Multi-head attention, feed-forward blocks, and the full transformer forward pass.
- Layer 5 -- Models: LLaMA architecture, tokenisation, and GGUF model loading.
- Probability: Softmax, entropy \( H(X) = -\sum p(x) \log p(x) \), and sampling from discrete distributions.
- Systems Programming: Threading primitives (mutex, condition variable), memory management, and basic concurrency patterns.
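As a quick self-check on the probability prerequisite, the entropy formula above can be evaluated directly. This is a small Python sketch; the function name `entropy` and the example distributions are chosen here for illustration. It uses the natural logarithm, so results are in nats.

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p(x) log p(x); terms with p = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncertain over 4 tokens
peaked = [0.97, 0.01, 0.01, 0.01]    # model is nearly certain of one token

# The uniform distribution attains the maximum, log(4); the peaked one is far lower.
assert abs(entropy(uniform) - math.log(4)) < 1e-12
assert entropy(peaked) < entropy(uniform)
```

Entropy-targeting samplers such as Mirostat build on this quantity, which is why it is listed as background.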
Components Overview
| Module | Page | Source | Key Types |
|---|---|---|---|
| Text Generation | text-generation.md | src/inference/generation.zig | TextGenerator, GenerationConfig, GenerationResult |
| Sampling Strategies | sampling-strategies.md | src/inference/generation.zig | SamplingStrategy, TokenProb |
| Advanced Sampling | advanced-sampling.md | src/inference/advanced_sampling.zig | AdvancedSampler, MirostatConfig, TypicalConfig |
| KV Cache | kv-cache.md | src/inference/kv_cache.zig | KVCacheEntry, MultiSequenceKVCache, SlidingWindowKVCache |
| Streaming | streaming.md | src/inference/streaming.zig | StreamingGenerator, TokenBuffer, StreamStatus |
| Batch Processing | batching.md | src/inference/batching.zig | BatchProcessor, BatchRequest, BatchingStrategy |
| Grammar Constraints | grammar-constraints.md | src/inference/grammar_constraints.zig | GrammarConstrainedSampler, JSONConstraint, CFGConstraint |
| Profiling | profiling.md | src/inference/profiling.zig | Profiler, BenchmarkRunner, PerformanceStats |
Inference Pipeline Architecture
The following diagram shows the complete data flow from a user prompt to generated text. Each box inside the Layer 6 subgraph is a step handled by one of the modules in this layer.
flowchart TD
PROMPT["User Prompt"]
subgraph "Layer 6 -- Inference"
TOK["Tokenize Prompt"]
FWD["Forward Pass (Layer 5 Model)"]
LOGITS["Raw Logits"]
REP["Repetition Penalty"]
GRAM["Grammar Mask (optional)"]
SAMP["Sampling Strategy"]
ADV["Advanced Sampling (optional)"]
CACHE["KV Cache Update"]
CHECK["Stop Condition Check"]
BUF["Stream Buffer"]
BATCH["Batch Scheduler"]
end
OUTPUT["Generated Text"]
PROMPT --> TOK --> FWD
FWD --> LOGITS --> REP --> GRAM --> SAMP
SAMP --> ADV
ADV --> CACHE --> CHECK
CHECK -->|continue| FWD
CHECK -->|stop| OUTPUT
CHECK -->|stream| BUF --> OUTPUT
PROMPT --> BATCH --> TOK
style SAMP fill:#d5f5e3,stroke:#1e8449
style CACHE fill:#d6eaf8,stroke:#2e86c1
style BUF fill:#fdebd0,stroke:#e67e22
style GRAM fill:#e8daef,stroke:#7d3c98

Dependency Graph
Within the Inference layer, the modules depend on each other as follows:
graph LR
GEN["generation.zig"]
ADV["advanced_sampling.zig"]
KV["kv_cache.zig"]
STR["streaming.zig"]
BAT["batching.zig"]
GRM["grammar_constraints.zig"]
PRF["profiling.zig"]
STR --> GEN
BAT --> GEN
BAT --> KV
PRF --> GEN
PRF --> BAT
PRF --> KV
ADV --> GEN
GRM --> GEN

generation.zig is the central module: in this graph, every other inference module except kv_cache.zig depends on it, either directly or through the types it exports.
Suggested Reading Order
1. Text Generation -- start with the core autoregressive loop and generation configuration.
2. Sampling Strategies -- understand how tokens are selected from the model's probability distribution.
3. Advanced Sampling -- explore entropy-targeting and information-theoretic sampling methods.
4. KV Cache -- learn the key optimisation that makes autoregressive generation practical.
5. Streaming -- see how tokens are delivered in real time to users.
6. Batch Processing -- scale throughput with dynamic batching and request scheduling.
7. Grammar Constraints -- constrain output to valid JSON, regex patterns, or formal grammars.
8. Profiling -- measure, benchmark, and optimise the entire pipeline.
Key Design Decisions
Composition over Inheritance
ZigLlama's inference layer is composed of independent modules connected through explicit function calls and shared types. The TextGenerator owns a model and tokenizer; the StreamingGenerator wraps a TextGenerator; the BatchProcessor manages a queue of generation requests. There is no class hierarchy -- Zig's comptime generics and explicit allocation make composition both natural and efficient.
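The ownership relationships described above might look like the following sketch. It is written in Python for illustration only. The type names come from the text, but every field and method (`next_token`, `decode`, `tokens`, `run`, `submit`, `drain`) is a hypothetical placeholder, and the batch processor here simply runs requests one after another rather than performing real batching.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List

@dataclass
class TextGenerator:
    """Owns the model and tokenizer (both are hypothetical callables here)."""
    next_token: Callable[[List[int]], int]  # stand-in for forward pass + sampling
    decode: Callable[[int], str]            # stand-in for the tokenizer

    def tokens(self, prompt: List[int], max_new: int) -> Iterator[int]:
        seq = list(prompt)
        for _ in range(max_new):
            tok = self.next_token(seq)
            seq.append(tok)
            yield tok

@dataclass
class StreamingGenerator:
    """Wraps a TextGenerator and delivers each token through a callback."""
    inner: TextGenerator

    def run(self, prompt: List[int], max_new: int,
            on_text: Callable[[str], None]) -> None:
        for tok in self.inner.tokens(prompt, max_new):
            on_text(self.inner.decode(tok))

@dataclass
class BatchProcessor:
    """Manages a queue of requests that share one generator."""
    inner: TextGenerator
    queue: List[List[int]] = field(default_factory=list)

    def submit(self, prompt: List[int]) -> None:
        self.queue.append(prompt)

    def drain(self, max_new: int) -> List[List[int]]:
        results = [list(self.inner.tokens(p, max_new)) for p in self.queue]
        self.queue.clear()
        return results

gen = TextGenerator(next_token=lambda seq: seq[-1] + 1, decode=str)
pieces = []
StreamingGenerator(gen).run([0], 3, pieces.append)
print(pieces)  # ['1', '2', '3']
```

Each wrapper holds a reference to the component it builds on instead of inheriting from it, so the pieces can be tested and swapped independently.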
Sampling as a Pluggable Strategy
A switch on the SamplingStrategy enum selects the sampling function at runtime. Because the strategy is fixed for an entire generation (not chosen per token), the same branch is taken on every step, so it is easy for the branch predictor and the overhead relative to a direct function call is negligible.
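A rough picture of enum-based dispatch, again in Python rather than Zig: the member names, the default values for `k` and `p`, and the helper functions are assumptions made for this sketch, not ZigLlama's actual definitions.

```python
import math
import random
from enum import Enum, auto

class SamplingStrategy(Enum):  # member names are assumed for illustration
    GREEDY = auto()
    TEMPERATURE = auto()
    TOP_K = auto()
    TOP_P = auto()

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalise(probs, keep):
    """Zero every token outside `keep`, then rescale so the rest sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalise(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return renormalise(probs, keep)

def sample(logits, strategy, rng, temperature=1.0, k=40, p=0.9):
    """Dispatch on the strategy once per call; the strategy is fixed per generation."""
    if strategy is SamplingStrategy.GREEDY:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax(logits, temperature)
    if strategy is SamplingStrategy.TOP_K:
        probs = top_k_filter(probs, k)
    elif strategy is SamplingStrategy.TOP_P:
        probs = top_p_filter(probs, p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

rng = random.Random(0)
print(sample([0.1, 3.0, 0.2], SamplingStrategy.GREEDY, rng))  # 1
```

Greedy selection skips the softmax entirely because the argmax of the logits and of the probabilities coincide.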
Performance Summary
| Optimisation | Source Module | Typical Impact |
|---|---|---|
| KV Caching | kv_cache.zig | ~100x per-token speedup for long sequences |
| Batch Processing | batching.zig | 5--10x throughput improvement |
| Streaming | streaming.zig | User sees output after first-token latency instead of waiting for the full generation |
| Grammar Masking | grammar_constraints.zig | Guaranteed valid output structure |
| Combined Sampling | generation.zig | Quality-diversity control |
| RAII Profiling | profiling.zig | Zero-overhead when disabled |
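The grammar-masking row deserves a brief illustration of why the guarantee holds. In this Python sketch the vocabulary, the allowed-token rule, and the `apply_mask` helper are all hypothetical. Disallowed tokens get a logit of negative infinity, so they can never be selected no matter how strongly the model prefers them.

```python
import math

def apply_mask(logits, allowed):
    """Set every disallowed token's logit to -inf; exp(-inf) is 0, so its
    probability after softmax is exactly zero."""
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

# Hypothetical 5-token vocabulary and a toy rule:
# directly after "{", only a string key or "}" may follow.
vocab = ["{", "}", '"key"', ":", "42"]
allowed_after_open_brace = {1, 2}

logits = [1.0, 0.5, 2.0, 3.0, 2.5]  # the unconstrained model prefers ":"
masked = apply_mask(logits, allowed_after_open_brace)
choice = max(range(len(masked)), key=masked.__getitem__)
print(vocab[choice])  # "key"
```

The sketch assumes the allowed set is never empty; a grammar state with no valid continuation has to be handled separately.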
References
- Vaswani, A. et al. "Attention Is All You Need." NeurIPS, 2017.
- Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023.
- Holtzman, A. et al. "The Curious Case of Neural Text Degeneration." ICLR, 2020.
- Basu, S. et al. "Mirostat: A Neural Text Decoding Algorithm." ICLR, 2021.
- Gerganov, G. "llama.cpp -- Inference of LLaMA model in C/C++." GitHub, 2023.