API Reference

Overview

ZigLlama exposes its functionality through a layered module hierarchy defined in `src/main.zig`. The architecture follows a bottom-up design in which each layer builds on the one below it:

Foundation -- Tensors, memory management, file formats, threading.
Linear Algebra -- SIMD matrix operations and quantization.
Neural Primitives -- Activations, normalization, embeddings.
Transformers -- Attention, feed-forward networks, transformer blocks.
Models -- Complete LLaMA architecture, configuration, tokenization, GGUF loading.
Inference -- Generation engine, caching, streaming, batching, profiling.

All public modules are reachable from the root import:

const zigllama = @import("zigllama");
const Tensor = zigllama.foundation.tensor.Tensor;

Module Index

Foundation

Module	Source	Description
`foundation.tensor`	`src/foundation/tensor.zig`	Generic tensor type with multi-dimensional operations
`foundation.memory_mapping`	`src/foundation/memory_mapping.zig`	Memory-mapped I/O (internal)
`foundation.gguf_format`	`src/foundation/gguf_format.zig`	Low-level GGUF format constants and helpers (internal)
`foundation.blas_integration`	`src/foundation/blas_integration.zig`	BLAS library interface (internal)
`foundation.threading`	`src/foundation/threading.zig`	Thread pool, work stealing, and NUMA support (internal)

Linear Algebra

Module	Source	Description
`linear_algebra.matrix_ops`	`src/linear_algebra/matrix_ops.zig`	SIMD-accelerated matrix operations
`linear_algebra.quantization`	`src/linear_algebra/quantization.zig`	Quantization and dequantization framework
`linear_algebra.k_quantization`	`src/linear_algebra/k_quantization.zig`	K-quantization formats (internal)
`linear_algebra.iq_quantization`	`src/linear_algebra/iq_quantization.zig`	Importance quantization formats (internal)

Neural Primitives

Module	Source	Description
`neural_primitives.activations`	`src/neural_primitives/activations.zig`	Activation functions (ReLU, GELU, SiLU, SwiGLU)
`neural_primitives.normalization`	`src/neural_primitives/normalization.zig`	Normalization layers (LayerNorm, RMSNorm)
`neural_primitives.embeddings`	`src/neural_primitives/embeddings.zig`	Token and positional embeddings

Transformers

Module	Source	Description
`transformers.attention`	`src/transformers/attention.zig`	Multi-head attention mechanisms
`transformers.feed_forward`	`src/transformers/feed_forward.zig`	Feed-forward network variants
`transformers.transformer_block`	`src/transformers/transformer_block.zig`	Complete transformer blocks

Models

Module	Source	Description
`models.llama`	`src/models/llama.zig`	LLaMA model architecture
`models.config`	`src/models/config.zig`	Model configuration and presets
`models.tokenizer`	`src/models/tokenizer.zig`	Tokenization (Simple, BPE)
`models.gguf`	`src/models/gguf.zig`	GGUF file reader

Inference

Module	Source	Description
`inference.generation`	`src/inference/generation.zig`	Text generation engine
`inference.kv_cache`	`src/inference/kv_cache.zig`	Key-value cache for inference
`inference.streaming`	`src/inference/streaming.zig`	Streaming token generation
`inference.batching`	`src/inference/batching.zig`	Batch request processing
`inference.profiling`	`src/inference/profiling.zig`	Performance profiling
`inference.advanced_sampling`	`src/inference/advanced_sampling.zig`	Advanced sampling methods (Mirostat, Typical, Tail-Free)
`inference.grammar_constraints`	`src/inference/grammar_constraints.zig`	Grammar-constrained generation

Conventions

All allocating functions accept a `std.mem.Allocator` and return errors via Zig's error union syntax (`!T`).
Types suffixed with `Error` are error sets specific to that module.
Functions prefixed with `deinit` free resources owned by a struct instance.
Modules marked (internal) are implementation details; their APIs may change between releases.
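
The conventions above can be illustrated with a minimal, self-contained sketch. The `Buffer` struct and `BufferError` set below are hypothetical stand-ins written only for this example; they are not part of ZigLlama's API, and real module signatures may differ.

```zig
const std = @import("std");

// Hypothetical module-specific error set, named with the `Error` suffix.
const BufferError = error{EmptyBuffer};

// Hypothetical struct used only to illustrate the conventions.
const Buffer = struct {
    data: []f32,
    allocator: std.mem.Allocator,

    // Allocating function: accepts an allocator and returns an error union.
    pub fn init(
        allocator: std.mem.Allocator,
        len: usize,
    ) (BufferError || std.mem.Allocator.Error)!Buffer {
        if (len == 0) return BufferError.EmptyBuffer;
        const data = try allocator.alloc(f32, len);
        for (data) |*x| x.* = 0; // zero-initialize the storage
        return Buffer{ .data = data, .allocator = allocator };
    }

    // Frees the memory owned by this instance.
    pub fn deinit(self: *Buffer) void {
        self.allocator.free(self.data);
        self.* = undefined;
    }
};

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    var buf = try Buffer.init(allocator, 16);
    defer buf.deinit(); // pair every successful init with a deinit
    std.debug.print("allocated {d} elements\n", .{buf.data.len});

    // A zero length is rejected through the module's error set.
    if (Buffer.init(allocator, 0)) |_| {
        unreachable;
    } else |err| {
        std.debug.print("init failed: {s}\n", .{@errorName(err)});
    }
}
```

Storing the allocator in the struct is one common way to let `deinit` release memory without extra arguments; a library may instead require the caller to pass the allocator to `deinit` again.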

Version

pub const version = std.SemanticVersion{
    .major = 0,
    .minor = 1,
    .patch = 0,
};
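
Because the version is a `std.SemanticVersion`, it can be compared with the standard library's `order` function. The sketch below copies the declaration into a local constant so that it compiles on its own; the `required` value and the `isAtLeast` helper are hypothetical, and reading the constant as `zigllama.version` assumes it is exported from the root module.

```zig
const std = @import("std");

// Local copy of the declaration above, standing in for the library's constant.
const version = std.SemanticVersion{ .major = 0, .minor = 1, .patch = 0 };

// Hypothetical minimum version that an application might require.
const required = std.SemanticVersion{ .major = 0, .minor = 1, .patch = 0 };

// Returns true when `actual` is the same as or newer than `minimum`.
fn isAtLeast(actual: std.SemanticVersion, minimum: std.SemanticVersion) bool {
    return actual.order(minimum) != .lt;
}

pub fn main() void {
    if (isAtLeast(version, required)) {
        std.debug.print("version {d}.{d}.{d} meets the requirement\n", .{
            version.major, version.minor, version.patch,
        });
    } else {
        std.debug.print("version {d}.{d}.{d} is too old\n", .{
            version.major, version.minor, version.patch,
        });
    }
}
```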