API Reference
Overview
ZigLlama exposes its functionality through a layered module hierarchy defined in src/main.zig. The architecture follows a bottom-up design where each layer builds on the one below it:
- Foundation -- Tensors, memory management, file formats, threading.
- Linear Algebra -- SIMD matrix operations and quantization.
- Neural Primitives -- Activations, normalization, embeddings.
- Transformers -- Attention, feed-forward networks, transformer blocks.
- Models -- Complete LLaMA architecture, configuration, tokenization, GGUF loading.
- Inference -- Generation engine, caching, streaming, batching, profiling.
All public modules are reachable from the root import under the dotted paths listed in the Module Index (for example, `foundation.tensor`).
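As a rough illustration of how such dotted paths resolve, the sketch below mimics a layered root module with nested namespaces. It is not taken from `src/main.zig`: the `zigllama` name, the `tensor_stub` struct, and its `description` field are hypothetical stand-ins so that the example compiles on its own. In a real root file each leaf would more likely be an `@import` of the corresponding source file.

```zig
const std = @import("std");

// Hypothetical stand-in for a leaf module such as src/foundation/tensor.zig.
const tensor_stub = struct {
    pub const description = "Generic tensor type";
};

// Hypothetical stand-in for the root module. A real root file would
// typically write `pub const tensor = @import("foundation/tensor.zig");`
// instead of pointing at an inline struct.
const zigllama = struct {
    pub const foundation = struct {
        pub const tensor = tensor_stub;
    };
};

pub fn main() void {
    // Consumers reach a module through its dotted path from the root.
    const tensor = zigllama.foundation.tensor;
    std.debug.print("{s}\n", .{tensor.description});
}
```

When the library is consumed as a package, the inline `zigllama` struct would be replaced by an `@import` of the package name configured in the build script; that name is not stated in this reference.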
Module Index
Foundation
| Module | Source | Description |
|---|---|---|
| `foundation.tensor` | `src/foundation/tensor.zig` | Generic tensor type with multi-dimensional operations |
| `foundation.memory_mapping` | `src/foundation/memory_mapping.zig` | Memory-mapped I/O (internal) |
| `foundation.gguf_format` | `src/foundation/gguf_format.zig` | Low-level GGUF format constants and helpers (internal) |
| `foundation.blas_integration` | `src/foundation/blas_integration.zig` | BLAS library interface (internal) |
| `foundation.threading` | `src/foundation/threading.zig` | Thread pool, work stealing, and NUMA support (internal) |
Linear Algebra
| Module | Source | Description |
|---|---|---|
| `linear_algebra.matrix_ops` | `src/linear_algebra/matrix_ops.zig` | SIMD-accelerated matrix operations |
| `linear_algebra.quantization` | `src/linear_algebra/quantization.zig` | Quantization and dequantization framework |
| `linear_algebra.k_quantization` | `src/linear_algebra/k_quantization.zig` | K-quantization formats (internal) |
| `linear_algebra.iq_quantization` | `src/linear_algebra/iq_quantization.zig` | Importance quantization formats (internal) |
Neural Primitives
| Module | Source | Description |
|---|---|---|
| `neural_primitives.activations` | `src/neural_primitives/activations.zig` | Activation functions (ReLU, GELU, SiLU, SwiGLU) |
| `neural_primitives.normalization` | `src/neural_primitives/normalization.zig` | Normalization layers (LayerNorm, RMSNorm) |
| `neural_primitives.embeddings` | `src/neural_primitives/embeddings.zig` | Token and positional embeddings |
Transformers
| Module | Source | Description |
|---|---|---|
| `transformers.attention` | `src/transformers/attention.zig` | Multi-head attention mechanisms |
| `transformers.feed_forward` | `src/transformers/feed_forward.zig` | Feed-forward network variants |
| `transformers.transformer_block` | `src/transformers/transformer_block.zig` | Complete transformer blocks |
Models
| Module | Source | Description |
|---|---|---|
| `models.llama` | `src/models/llama.zig` | LLaMA model architecture |
| `models.config` | `src/models/config.zig` | Model configuration and presets |
| `models.tokenizer` | `src/models/tokenizer.zig` | Tokenization (Simple, BPE) |
| `models.gguf` | `src/models/gguf.zig` | GGUF file reader |
Inference
| Module | Source | Description |
|---|---|---|
| `inference.generation` | `src/inference/generation.zig` | Text generation engine |
| `inference.kv_cache` | `src/inference/kv_cache.zig` | Key-value cache for inference |
| `inference.streaming` | `src/inference/streaming.zig` | Streaming token generation |
| `inference.batching` | `src/inference/batching.zig` | Batch request processing |
| `inference.profiling` | `src/inference/profiling.zig` | Performance profiling |
| `inference.advanced_sampling` | `src/inference/advanced_sampling.zig` | Advanced sampling methods (Mirostat, Typical, Tail-Free) |
| `inference.grammar_constraints` | `src/inference/grammar_constraints.zig` | Grammar-constrained generation |
Conventions
- All allocating functions accept a `std.mem.Allocator` and return errors via Zig's error union syntax (`!T`).
- Types suffixed with `Error` are error sets specific to that module.
- Functions prefixed with `deinit` free resources owned by a struct instance.
- Modules marked (internal) are implementation details; their APIs may change between releases.
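The first three conventions can be sketched together in one small type. The example below is illustrative only and assumes nothing about ZigLlama's actual signatures: `Buffer`, `BufferError`, and `EmptyShape` are hypothetical names chosen to show an allocator parameter, an error-union return type, an `Error`-suffixed error set, and a `deinit` method.

```zig
const std = @import("std");

/// Hypothetical module-specific error set (note the `Error` suffix).
pub const BufferError = error{EmptyShape};

/// Hypothetical resource-owning struct; not part of ZigLlama's API.
pub const Buffer = struct {
    data: []f32,
    allocator: std.mem.Allocator,

    /// Allocating functions take the allocator explicitly and return an
    /// error union, so callers handle failure with `try` or `catch`.
    pub fn init(
        allocator: std.mem.Allocator,
        len: usize,
    ) (BufferError || std.mem.Allocator.Error)!Buffer {
        if (len == 0) return BufferError.EmptyShape;
        const data = try allocator.alloc(f32, len);
        for (data) |*x| x.* = 0; // zero-initialize the storage
        return Buffer{ .data = data, .allocator = allocator };
    }

    /// Frees the memory owned by this instance.
    pub fn deinit(self: *Buffer) void {
        self.allocator.free(self.data);
        self.* = undefined;
    }
};

pub fn main() !void {
    var buf = try Buffer.init(std.heap.page_allocator, 8);
    defer buf.deinit(); // pair every successful init with a deinit
    std.debug.print("allocated {d} floats\n", .{buf.data.len});
}
```

Storing the allocator in the struct lets `deinit` release memory without extra arguments. Some Zig APIs instead take the allocator again in `deinit`; which style each ZigLlama module uses is not stated here.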