Model Configuration

Overview

Every model architecture in ZigLLM is instantiated from a configuration struct that fully specifies its dimensions, component choices, and numerical parameters. The ModelConfig struct in src/models/config.zig serves as the shared configuration type, while each architecture also defines its own specialized config (e.g., LLaMAConfig, MistralConfig) that converts to ModelConfig when needed.

Understanding model configuration is a prerequisite for working with any architecture. The configuration determines parameter count, memory footprint, and the specific algorithmic choices (activation functions, normalization, position encoding) that define the model's behavior.

ModelConfig Struct

The central configuration type captures all parameters needed to instantiate a transformer-based language model.

pub const ModelConfig = struct {
    // === Model Architecture ===
    d_model: usize,          // Hidden dimension (embedding size)
    num_layers: usize,       // Number of transformer layers
    num_heads: usize,        // Number of attention heads
    d_ff: usize,             // Feed-forward intermediate dimension

    // === Vocabulary and Sequence ===
    vocab_size: usize,       // Number of tokens in vocabulary
    max_seq_len: usize,      // Maximum supported sequence length

    // === Component Choices ===
    activation: ActivationType,           // FFN activation function
    normalization: NormalizationType,     // Normalization layer type
    position_encoding: PositionEncodingType, // Position encoding strategy

    // === Numerical Parameters ===
    norm_eps: f32,           // Epsilon for numerical stability in norms
    dropout_rate: f32,       // Dropout rate (training only, 0.0 for inference)

    // === RoPE Parameters ===
    rope_theta: f32,         // RoPE base frequency (typically 10000.0)
    rope_scaling: f32,       // RoPE scaling factor (1.0 = no scaling)

    // === Attention Options ===
    attention_bias: bool,    // Use bias in attention output projection
    qkv_bias: bool,          // Use bias in Q, K, V projections

    // === Memory Optimization ===
    gradient_checkpointing: bool,  // Trade compute for memory
    use_flash_attention: bool,     // Use memory-efficient attention kernel
};

Key Dimensions

\( d_\text{model} \): The hidden dimension that flows through the entire network. All residual connections operate at this dimension.
\( d_\text{ff} \): The intermediate dimension of the feed-forward network. For standard FFN, \( d_\text{ff} = 4 \cdot d_\text{model} \). For gated activations (SwiGLU), \( d_\text{ff} \approx \frac{8}{3} \cdot d_\text{model} \).
\( d_\text{head} = d_\text{model} / n_\text{heads} \): Per-head dimension, typically 64 or 128.

Enumerated Types

ActivationType

Controls the nonlinearity in the feed-forward network.

pub const ActivationType = enum {
    ReLU,    // max(0, x) -- original transformer
    GELU,    // Gaussian Error Linear Unit -- GPT-2, BERT
    SwiGLU,  // Swish-gated -- LLaMA, Mistral, Qwen
    GeGLU,   // GELU-gated -- Gemma
    GLU,     // Standard gated linear unit
};

The parameterMultiplier() method returns, as a floating-point multiplier, the number of \( d_\text{model} \times d_\text{ff} \) weight matrices the feed-forward block needs:

Activation	Multiplier	Matrices	Used By
ReLU	2.0	up + down	Original Transformer
GELU	2.0	up + down	GPT-2, BERT, Falcon
SwiGLU	3.0	gate + up + down	LLaMA, Mistral, Qwen
GeGLU	3.0	gate + up + down	Gemma
GLU	3.0	gate + up + down	--
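The table above can be expressed as an enum method. The following is a minimal sketch, not the actual ZigLLM source: the `f32` return type is inferred from how parameterCount() uses the multiplier, and the switch arms simply restate the table.

```zig
const std = @import("std");

const ActivationType = enum {
    ReLU,
    GELU,
    SwiGLU,
    GeGLU,
    GLU,

    /// Number of d_model x d_ff weight matrices in the feed-forward block.
    /// Sketch only: the real implementation may differ.
    pub fn parameterMultiplier(self: ActivationType) f32 {
        return switch (self) {
            .ReLU, .GELU => 2.0, // up + down
            .SwiGLU, .GeGLU, .GLU => 3.0, // gate + up + down
        };
    }
};

pub fn main() void {
    std.debug.print("GELU: {d}, SwiGLU: {d}\n", .{
        ActivationType.GELU.parameterMultiplier(),
        ActivationType.SwiGLU.parameterMultiplier(),
    });
}
```

Because the switch has no `else` arm, adding a new activation variant forces the author to decide its multiplier at compile time.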

NormalizationType

pub const NormalizationType = enum {
    LayerNorm,  // Original transformer, GPT-2, BERT
    RMSNorm,    // LLaMA, Mistral, Qwen (more efficient)
    None,       // No normalization
};

RMSNorm vs LayerNorm

RMSNorm omits the mean-centering step of LayerNorm, computing only \( \hat{x} = x / \text{RMS}(x) \) where \( \text{RMS}(x) = \sqrt{\frac{1}{d} \sum_{i=1}^{d} x_i^2 + \epsilon} \). This saves one reduction (the mean) per normalization call, usually with negligible accuracy loss.
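To make the RMS formula concrete, here is a small sketch that normalizes a slice in place. The function name `rmsNormInPlace` is hypothetical, and the learned per-channel gain that normally follows the normalization is left out for brevity.

```zig
const std = @import("std");

/// Normalizes `x` in place: x_i <- x_i / sqrt(mean(x^2) + eps).
/// Hypothetical helper; the learned gain vector is omitted.
fn rmsNormInPlace(x: []f32, eps: f32) void {
    if (x.len == 0) return;
    var sum_sq: f32 = 0.0;
    for (x) |v| sum_sq += v * v; // the single reduction RMSNorm needs
    const mean_sq = sum_sq / @as(f32, @floatFromInt(x.len));
    const inv_rms = 1.0 / @sqrt(mean_sq + eps);
    for (x) |*v| v.* *= inv_rms;
}

pub fn main() void {
    var x = [_]f32{ 3.0, 4.0 };
    rmsNormInPlace(&x, 1e-6);
    std.debug.print("{d:.4} {d:.4}\n", .{ x[0], x[1] });
}
```

For the input `{3, 4}` the mean square is 12.5, so each element is divided by roughly 3.5355. LayerNorm would additionally subtract the mean (3.5) before scaling.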

PositionEncodingType

pub const PositionEncodingType = enum {
    None,        // No positional information
    Sinusoidal,  // Fixed sinusoidal (original Transformer)
    Learned,     // Trainable position embeddings (GPT-2)
    RoPE,        // Rotary Position Embeddings (LLaMA, Mistral)
    ALiBi,       // Attention with Linear Biases (BLOOM, Falcon-1B)
};

Preset Configurations

ZigLLM provides factory methods for canonical model sizes. The ModelConfig.llama() method creates configurations for the LLaMA family.

LLaMA Family Presets

Parameter	7B	13B	30B	65B
`d_model`	4096	5120	6656	8192
`num_layers`	32	40	60	80
`num_heads`	32	40	52	64
`d_ff`	11008	13824	17920	22016
`vocab_size`	32000	32000	32000	32000
`max_seq_len`	2048	2048	2048	2048
`activation`	SwiGLU	SwiGLU	SwiGLU	SwiGLU
`normalization`	RMSNorm	RMSNorm	RMSNorm	RMSNorm
`position_encoding`	RoPE	RoPE	RoPE	RoPE
`norm_eps`	1e-6	1e-6	1e-6	1e-6
`rope_theta`	10000.0	10000.0	10000.0	10000.0
`d_head`	128	128	128	128
`gradient_checkpointing`	false	false	true	true

Scaling Pattern

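Observe that \( d_\text{ff} \approx \frac{8}{3} \cdot d_\text{model} \) for all LLaMA variants (the ratios in the table range from 2.69 to 2.70). This ratio is characteristic of SwiGLU activations, which use three weight matrices instead of two, so the intermediate dimension is reduced from \( 4 \cdot d_\text{model} \) to compensate: \( 3 \cdot \frac{8}{3} d_\text{model}^2 = 2 \cdot 4\, d_\text{model}^2 \), which keeps the FFN parameter count unchanged.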
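The preset `d_ff` values are not exactly 8/3 of `d_model` because they are rounded. The sketch below shows one rounding rule that reproduces the table: take the ceiling of 8/3 · `d_model`, then round up to a multiple of 256. Both the helper name `gatedFfnDim` and the multiple of 256 are assumptions for illustration; this document does not say how ZigLLM derives the presets.

```zig
const std = @import("std");

/// Hypothetical helper: ceil(8/3 * d_model), rounded up to a multiple of `multiple_of`.
fn gatedFfnDim(d_model: usize, multiple_of: usize) usize {
    const raw = (8 * d_model + 2) / 3; // integer ceiling of 8 * d_model / 3
    return ((raw + multiple_of - 1) / multiple_of) * multiple_of;
}

pub fn main() void {
    const dims = [_]usize{ 4096, 5120, 6656, 8192 };
    for (dims) |d| {
        std.debug.print("d_model={d} -> d_ff={d}\n", .{ d, gatedFfnDim(d, 256) });
    }
}
```

With `multiple_of = 256` this yields 11008, 13824, 17920, and 22016 for the four preset sizes, matching the `d_ff` row above.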

CodeLlama Variants

CodeLlama extends LLaMA with longer context support:

.CodeLlama_7B => blk: {
    var config = ModelConfig.llama(.LLaMA_7B);
    config.max_seq_len = 16384;  // 8x longer than base LLaMA
    config.rope_scaling = 1.0;
    break :blk config;
},

Configuration Validation

The validate() method checks for common configuration errors before model instantiation.

pub fn validate(self: ModelConfig) !void {
    // Check the head count first: the modulus below would divide by zero
    // if num_heads were 0
    if (self.num_heads < 1 or self.num_heads > 256)
        return error.UnreasonableHeadCount;

    // Head dimension must divide evenly
    if (self.d_model % self.num_heads != 0)
        return error.IncompatibleHeadDimension;

    // Reasonable dimension ranges
    if (self.d_model < 64 or self.d_model > 32768)
        return error.UnreasonableModelDimension;

    if (self.num_layers < 1 or self.num_layers > 1000)
        return error.UnreasonableLayerCount;

    if (self.vocab_size < 1000 or self.vocab_size > 1000000)
        return error.UnreasonableVocabSize;

    // Numerical stability
    if (self.norm_eps < 1e-12 or self.norm_eps > 1e-3)
        return error.UnreasonableEpsilon;
}

Validation Checks

Head dimension compatibility: \( d_\text{model} \bmod n_\text{heads} = 0 \)
Dimension bounds: \( 64 \le d_\text{model} \le 32768 \)
Layer count bounds: \( 1 \le n_\text{layers} \le 1000 \)
Head count bounds: \( 1 \le n_\text{heads} \le 256 \)
Vocabulary bounds: \( 1000 \le V \le 1{,}000{,}000 \)
Epsilon bounds: \( 10^{-12} \le \epsilon \le 10^{-3} \)
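A caller will usually want to turn these errors into actionable messages. The sketch below shows one way to do that with `catch` and an exhaustive `switch`. `MiniConfig`, `ConfigError`, and `describe` are hypothetical stand-ins that carry only two of the checks listed above, so the example stays self-contained; they are not part of ZigLLM.

```zig
const std = @import("std");

const ConfigError = error{ IncompatibleHeadDimension, UnreasonableModelDimension };

/// Reduced stand-in for ModelConfig with only the fields these two checks need.
const MiniConfig = struct {
    d_model: usize,
    num_heads: usize,

    fn validate(self: MiniConfig) ConfigError!void {
        // Guard against num_heads == 0 before taking the modulus.
        if (self.num_heads == 0 or self.d_model % self.num_heads != 0)
            return error.IncompatibleHeadDimension;
        if (self.d_model < 64 or self.d_model > 32768)
            return error.UnreasonableModelDimension;
    }
};

/// Maps a validation result to a human-readable message.
fn describe(cfg: MiniConfig) []const u8 {
    cfg.validate() catch |err| return switch (err) {
        error.IncompatibleHeadDimension => "d_model is not divisible by num_heads",
        error.UnreasonableModelDimension => "d_model is outside [64, 32768]",
    };
    return "ok";
}

pub fn main() void {
    std.debug.print("{s}\n", .{describe(.{ .d_model = 4096, .num_heads = 32 })});
    std.debug.print("{s}\n", .{describe(.{ .d_model = 100, .num_heads = 8 })});
    std.debug.print("{s}\n", .{describe(.{ .d_model = 16, .num_heads = 4 })});
}
```

Declaring an explicit error set, rather than the inferred `!void`, lets the compiler verify that the `switch` handles every possible failure.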

Parameter Counting

The parameterCount() method computes the total number of trainable parameters.

\[ P = \underbrace{V \cdot d}_{\text{embeddings}} + L \cdot \Big( \underbrace{3d^2 + d^2}_{\text{attention}} + \underbrace{m \cdot d \cdot d_\text{ff}}_{\text{FFN}} + \underbrace{2d}_{\text{norms}} \Big) + \underbrace{d}_{\text{final norm}} + \underbrace{d_\text{pos}}_{\text{position}} \]

where:

\( V \) = vocab_size, \( d \) = d_model, \( L \) = num_layers
\( m \) = activation parameter multiplier (2.0 for standard, 3.0 for gated)
\( d_\text{pos} \) = position encoding parameters (if any)
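The \( d_\text{pos} \) term depends on the position encoding. As a sketch, assuming that only learned embeddings store trainable weights (one `d_model`-sized vector per position) and that the other strategies are parameter-free, it could be computed as follows. The helper name `positionParams` is hypothetical.

```zig
const std = @import("std");

const PositionEncodingType = enum { None, Sinusoidal, Learned, RoPE, ALiBi };

/// Trainable parameters contributed by the position encoding (d_pos in the formula).
/// Assumption: only learned embeddings store weights.
fn positionParams(kind: PositionEncodingType, max_seq_len: usize, d_model: usize) usize {
    return switch (kind) {
        .Learned => max_seq_len * d_model,
        .None, .Sinusoidal, .RoPE, .ALiBi => 0,
    };
}

pub fn main() void {
    // A GPT-2-small-sized example: 1024 positions, d_model = 768.
    std.debug.print("Learned: {d}\n", .{positionParams(.Learned, 1024, 768)});
    std.debug.print("RoPE:    {d}\n", .{positionParams(.RoPE, 2048, 4096)});
}
```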

pub fn parameterCount(self: ModelConfig) usize {
    const embedding_params = self.vocab_size * self.d_model;

    var layer_params: usize = 0;
    const qkv_params = 3 * self.d_model * self.d_model;
    const attention_output_params = self.d_model * self.d_model;
    layer_params += qkv_params + attention_output_params;

    const ffn_multiplier = self.activation.parameterMultiplier();
    const ffn_params = @as(usize, @intFromFloat(
        ffn_multiplier * @as(f32, @floatFromInt(self.d_model * self.d_ff))
    ));
    layer_params += ffn_params;

    if (self.normalization.hasParameters()) {
        layer_params += 2 * self.d_model;
    }

    return embedding_params + (layer_params * self.num_layers) + ...;
}
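As a worked check of the formula, the LLaMA-7B preset can be plugged in by hand: \( V = 32000 \), \( d = 4096 \), \( L = 32 \), \( d_\text{ff} = 11008 \), \( m = 3 \), and \( d_\text{pos} = 0 \) because RoPE has no trainable parameters. This covers only the terms written in the formula; it illustrates the arithmetic and is not the verified output of parameterCount().

\[
\begin{aligned}
V \cdot d &= 32000 \cdot 4096 = 131{,}072{,}000 \\
3d^2 + d^2 &= 4 \cdot 4096^2 = 67{,}108{,}864 \\
m \cdot d \cdot d_\text{ff} &= 3 \cdot 4096 \cdot 11008 = 135{,}266{,}304 \\
2d &= 8{,}192 \\
\text{per layer} &= 67{,}108{,}864 + 135{,}266{,}304 + 8{,}192 = 202{,}383{,}360 \\
P &= 131{,}072{,}000 + 32 \cdot 202{,}383{,}360 + 4{,}096 + 0 = 6{,}607{,}343{,}616
\end{aligned}
\]

The formula has no separate term for an output projection. If the model has an untied \( V \cdot d \) output matrix (an assumption about the model, not something the formula states), the total rises to 6,738,415,616, in line with the commonly quoted figure of about 6.7B.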

Memory Requirements

The memoryRequirements() method estimates the total memory needed for inference at a given batch size and sequence length.

Memory Breakdown

Component	Formula	LLaMA-7B (B=1, S=2048)
Parameters	\( P \times 4 \) bytes	~26.8 GB
Activations	\( B \times S \times d \times L \times 4 \)	~1.0 GB
KV Cache	\( 2 \times B \times H \times S \times d_h \times L \times 4 \)	~2.0 GB
Total	Sum	~29.8 GB

pub fn memoryRequirements(self: ModelConfig, batch_size: usize, sequence_length: usize)
    struct { parameters: usize, activations: usize, kv_cache: usize, total: usize }
{
    const param_memory = self.parameterCount() * @sizeOf(f32);
    const activation_memory = batch_size * sequence_length *
        self.d_model * self.num_layers * @sizeOf(f32);
    const kv_cache_memory = 2 * batch_size * self.num_heads *
        sequence_length * self.headDim() * self.num_layers * @sizeOf(f32);
    return .{
        .parameters = param_memory,
        .activations = activation_memory,
        .kv_cache = kv_cache_memory,
        .total = param_memory + activation_memory + kv_cache_memory,
    };
}
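The activation and KV-cache rows of the table can be reproduced with two free functions that mirror the formulas. This is a sketch: `activationBytes`, `kvCacheBytes`, and `toGiB` are hypothetical helpers, not ZigLLM APIs. Like the table, they assume f32 storage and one K/V pair per attention head.

```zig
const std = @import("std");

/// Activation estimate from the table: B * S * d * L * sizeof(f32).
fn activationBytes(batch: usize, seq: usize, d_model: usize, layers: usize) usize {
    return batch * seq * d_model * layers * @sizeOf(f32);
}

/// KV-cache estimate from the table: 2 * B * H * S * d_h * L * sizeof(f32).
fn kvCacheBytes(batch: usize, heads: usize, seq: usize, head_dim: usize, layers: usize) usize {
    return 2 * batch * heads * seq * head_dim * layers * @sizeOf(f32);
}

/// Converts a byte count to binary gigabytes (2^30 bytes).
fn toGiB(bytes: usize) f64 {
    return @as(f64, @floatFromInt(bytes)) / (1024.0 * 1024.0 * 1024.0);
}

pub fn main() void {
    // LLaMA-7B preset at B = 1, S = 2048.
    const act = activationBytes(1, 2048, 4096, 32);
    const kv = kvCacheBytes(1, 32, 2048, 128, 32);
    std.debug.print("activations: {d} bytes ({d:.2} GiB)\n", .{ act, toGiB(act) });
    std.debug.print("kv cache:    {d} bytes ({d:.2} GiB)\n", .{ kv, toGiB(kv) });
}
```

For these inputs the byte counts are 1,073,741,824 and 2,147,483,648, i.e. exactly 1 and 2 binary gigabytes. Since \( H \cdot d_h = d \), the KV cache is always twice the activation estimate under these formulas.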

Custom Configurations

To create a non-standard configuration, use the custom() factory method or construct the struct directly.

// Factory method with sensible defaults
const config = ModelConfig.custom(
    512,    // d_model
    6,      // num_layers
    8,      // num_heads
    1000,   // vocab_size
);
// d_ff defaults to 4 * d_model = 2048
// activation defaults to SwiGLU
// normalization defaults to RMSNorm
// position_encoding defaults to RoPE

// Validate before use
try config.validate();

For full control, construct the struct directly:

const config = ModelConfig{
    .d_model = 256,
    .num_layers = 4,
    .num_heads = 4,
    .d_ff = 688,              // ~8/3 * 256
    .vocab_size = 500,        // Note: below validate()'s minimum of 1000
    .max_seq_len = 512,
    .activation = .GELU,      // Override: use GELU instead of SwiGLU
    .normalization = .LayerNorm,
    .position_encoding = .Learned,
    .norm_eps = 1e-5,
    .dropout_rate = 0.0,
    .rope_theta = 10000.0,
    .rope_scaling = 1.0,
    .attention_bias = true,
    .qkv_bias = true,
    .gradient_checkpointing = false,
    .use_flash_attention = false,
};

Testing Tip

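When writing tests, use very small configurations (e.g., d_model=16, num_layers=1) to keep test execution fast while still exercising the full forward pass. Such configurations fall outside the bounds enforced by validate() (d_model ≥ 64, vocab_size ≥ 1000), so construct them directly and skip validate().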

Architecture-Specific Configs

Each model architecture provides its own configuration struct that captures architecture-specific parameters not present in the generic ModelConfig.

Architecture	Config Struct	Extra Fields
LLaMA	`LLaMAConfig`	`use_rope`, `rope_theta`
Mistral	`MistralConfig`	`n_kv_heads`, `sliding_window`, `num_experts`
GPT-2	`GPT2Config`	`dropout` (learned positions implicit)
Falcon	`FalconConfig`	`parallel_attn`, `multi_query`, `alibi`, `new_decoder_architecture`
Qwen	`QwenConfig`	`rope_scaling`, `use_dynamic_ntk`, `use_logn_attn`, `use_sliding_window`

These specialized configs typically provide a fromVariant() or create() factory and, where applicable, a toModelConfig() conversion method for interoperability.

References

Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023.
Zhang, S. et al. "OPT: Open Pre-trained Transformer Language Models." arXiv:2205.01068, 2022.