Text Generation

Text generation is where language models produce output. This page covers the mathematical foundation of autoregressive decoding, the TextGenerator struct that orchestrates the generation loop, and the configuration parameters that control output quality and style.


1. Autoregressive Decoding

Autoregressive Factorisation

A language model defines a joint distribution over a token sequence \( (x_1, x_2, \ldots, x_T) \) by factoring it into a product of conditional distributions:

\[ p(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}) \]

where \( x_{<t} = (x_1, \ldots, x_{t-1}) \) denotes all tokens preceding position \( t \).
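As a small worked instance of the factorisation, take \( T = 3 \). The product then has three factors, each conditioned on everything to its left:

\[ p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \]

Taking logarithms turns the product into a sum, which is the form in which per-token log-probabilities are normally accumulated:

\[ \log p(x_1, \ldots, x_T) = \sum_{t=1}^{T} \log p(x_t \mid x_{<t}) \]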

At each step the model takes the entire sequence generated so far, performs a forward pass through the transformer stack, and produces a logit vector \( \mathbf{z} \in \mathbb{R}^{|V|} \) over the vocabulary \( V \). The logits are converted to probabilities via softmax:

\[ p(x_t = v \mid x_{<t}) = \frac{\exp(z_v)}{\sum_{j \in V} \exp(z_j)} \]

A sampling strategy then selects one token from this distribution, the token is appended to the sequence, and the process repeats.
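The softmax step can be sketched in a few lines of Zig. This is a minimal illustration, not ZigLlama's own code: the function name `softmax` and its signature are assumptions made for the example. It subtracts the largest logit before exponentiating, which leaves the probabilities unchanged but keeps `@exp` from overflowing.

```zig
const std = @import("std");

/// Hypothetical helper: writes softmax(logits) into `probs`.
/// Subtracting the maximum logit is mathematically a no-op but
/// avoids overflow when logits are large.
pub fn softmax(logits: []const f32, probs: []f32) void {
    std.debug.assert(logits.len == probs.len);
    std.debug.assert(logits.len > 0);

    // Find the largest logit.
    var max_logit: f32 = logits[0];
    for (logits[1..]) |z| {
        if (z > max_logit) max_logit = z;
    }

    // Exponentiate the shifted logits and accumulate the normaliser.
    var sum: f32 = 0.0;
    for (logits, probs) |z, *p| {
        p.* = @exp(z - max_logit);
        sum += p.*;
    }

    // Normalise so the entries sum to one.
    for (probs) |*p| p.* /= sum;
}

test "softmax produces a probability distribution" {
    const logits = [_]f32{ 2.0, 1.0, 0.1 };
    var probs: [3]f32 = undefined;
    softmax(&logits, &probs);
    try std.testing.expectApproxEqAbs(@as(f32, 1.0), probs[0] + probs[1] + probs[2], 1e-5);
    try std.testing.expect(probs[0] > probs[1] and probs[1] > probs[2]);
}
```

The caller supplies the output buffer, so the sketch needs no allocator.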


2. TextGenerator Struct

The TextGenerator is the top-level entry point for text generation in ZigLlama. It holds references to the model and tokenizer, owns a random number generator, and stores the current generation configuration.

pub const TextGenerator = struct {
    /// Model for inference
    model: *LLaMAModel,
    /// Tokenizer for text conversion
    tokenizer: *SimpleTokenizer,
    /// Memory allocator
    allocator: Allocator,
    /// Random number generator
    rng: Random,
    /// Current generation configuration
    config: GenerationConfig,

    pub fn init(
        model: *LLaMAModel,
        tokenizer: *SimpleTokenizer,
        allocator: Allocator,
        seed: ?u64,
    ) TextGenerator { ... }

    pub fn setConfig(self: *TextGenerator, config: GenerationConfig) !void { ... }

    pub fn generate(self: *TextGenerator, prompt: []const u8) !GenerationResult { ... }
};

Deterministic Reproduction

Pass a fixed seed to init to get reproducible output across runs. When seed is null, the generator uses the system clock, giving different results each time.

2.1 Initialization

init creates a PRNG from the provided seed (or the current timestamp), sets the default configuration to GenerationConfig.balanced(), and stores the model and tokenizer references. The generator does not own these resources -- the caller is responsible for their lifetime.
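The seeding behaviour can be illustrated with the standard library PRNG. The sketch below is an assumption-laden illustration: the page does not say which PRNG type ZigLlama uses, `makePrng` is a hypothetical helper, and the `std.Random` namespace assumes a recent Zig release (older releases exposed it as `std.rand`). The `fallback` argument stands in for a clock-derived value supplied by the caller.

```zig
const std = @import("std");

/// Hypothetical helper: build a PRNG from an optional seed.
/// `fallback` plays the role of the timestamp used when no seed is given.
fn makePrng(seed: ?u64, fallback: u64) std.Random.DefaultPrng {
    return std.Random.DefaultPrng.init(seed orelse fallback);
}

test "the same seed yields the same random stream" {
    // Both generators get seed 42; the fallback values are ignored.
    var a = makePrng(42, 0);
    var b = makePrng(42, 999);
    const ra = a.random();
    const rb = b.random();

    var i: usize = 0;
    while (i < 8) : (i += 1) {
        try std.testing.expectEqual(ra.int(u64), rb.int(u64));
    }
}
```

Because every sampling decision draws from this stream, fixing the seed fixes the sequence of draws, and therefore the output for a given model, prompt, and configuration.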

2.2 Configuration Update

setConfig validates the new configuration (see Section 3) and, if a new seed is provided, reinitialises the PRNG. This allows mid-session reconfiguration without constructing a new generator.


3. GenerationConfig

GenerationConfig bundles every parameter that controls the generation process.

pub const GenerationConfig = struct {
    strategy: SamplingStrategy = .Combined,
    temperature: f32 = 0.7,
    top_k: u32 = 40,
    top_p: f32 = 0.9,
    max_tokens: u32 = 512,
    min_tokens: u32 = 1,
    stop_tokens: []const TokenId = &[_]TokenId{SpecialTokens.EOS},
    stop_strings: []const []const u8 = &[_][]const u8{},
    repetition_penalty: f32 = 1.1,
    length_penalty: f32 = 1.0,
    seed: ?u64 = null,
};

| Parameter | Type | Default | Description |
|---|---|---|---|
| strategy | SamplingStrategy | .Combined | Which sampling algorithm to use |
| temperature | f32 | 0.7 | Softmax temperature \( T \) |
| top_k | u32 | 40 | Number of highest-probability tokens to keep |
| top_p | f32 | 0.9 | Cumulative probability threshold for nucleus sampling |
| max_tokens | u32 | 512 | Hard upper limit on generated tokens |
| min_tokens | u32 | 1 | Minimum tokens before stop conditions are checked |
| stop_tokens | []const TokenId | {EOS} | Token IDs that trigger generation stop |
| stop_strings | []const []const u8 | {} | String patterns that trigger stop |
| repetition_penalty | f32 | 1.1 | Multiplicative penalty for repeated tokens |
| length_penalty | f32 | 1.0 | Penalty applied to longer sequences |
| seed | ?u64 | null | Optional RNG seed for reproducibility |

3.1 Validation

validate() enforces the following invariants:

  • \( T \ge 0 \) (temperature must be non-negative)
  • \( 0 \le p \le 1 \) (top-p must be a valid probability)
  • max_tokens > 0
  • min_tokens <= max_tokens
  • repetition_penalty >= 0

Violations return typed errors (InvalidTemperature, InvalidTopP, etc.) rather than panicking, enabling graceful error handling in server contexts.
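A validation routine along these lines might look like the sketch below. It is not the library's implementation: `MiniConfig` is a cut-down stand-in for GenerationConfig, and only `InvalidTemperature` and `InvalidTopP` are named on this page; the remaining error names are hypothetical. Writing the float checks as negated comparisons is a choice made for the sketch so that NaN is rejected too.

```zig
const std = @import("std");

/// Error set for the sketch. Only the first two names appear in the text;
/// the other three are assumed for illustration.
const ConfigError = error{
    InvalidTemperature,
    InvalidTopP,
    InvalidMaxTokens,
    InvalidMinTokens,
    InvalidRepetitionPenalty,
};

/// Hypothetical stand-in holding just the validated fields.
const MiniConfig = struct {
    temperature: f32 = 0.7,
    top_p: f32 = 0.9,
    max_tokens: u32 = 512,
    min_tokens: u32 = 1,
    repetition_penalty: f32 = 1.1,

    fn validate(self: MiniConfig) ConfigError!void {
        // `!(x >= 0)` is true for negative values and for NaN.
        if (!(self.temperature >= 0.0)) return error.InvalidTemperature;
        if (!(self.top_p >= 0.0 and self.top_p <= 1.0)) return error.InvalidTopP;
        if (self.max_tokens == 0) return error.InvalidMaxTokens;
        if (self.min_tokens > self.max_tokens) return error.InvalidMinTokens;
        if (!(self.repetition_penalty >= 0.0)) return error.InvalidRepetitionPenalty;
    }
};

test "defaults pass, out-of-range top_p is rejected" {
    const ok = MiniConfig{};
    try ok.validate();

    const bad = MiniConfig{ .top_p = 1.5 };
    try std.testing.expectError(error.InvalidTopP, bad.validate());
}
```

A caller such as a request handler can `catch` the specific error and report it, instead of the process aborting.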


4. Presets

ZigLlama ships four configuration presets that cover common use cases.

| Preset | Temperature | Top-K | Top-P | Rep. Penalty | Strategy | Use Case |
|---|---|---|---|---|---|---|
| creative() | 0.9 | 50 | 0.95 | 1.05 | Combined | Story writing, brainstorming |
| balanced() | 0.7 | 40 | 0.9 | 1.1 | Combined | General-purpose chat |
| focused() | 0.3 | 20 | 0.8 | 1.15 | Combined | Factual Q&A, summarisation |
| deterministic() | 0.0 | 1 | 1.0 | 1.0 | Greedy | Testing, exact reproduction |

Greedy is Not Always Best

The deterministic preset uses greedy decoding (\( T = 0 \)). While this maximises the probability of each individual token, it often produces repetitive or degenerate text for open-ended generation. Use it primarily for evaluation and regression testing.
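The link between \( T = 0 \) and greedy decoding can be made explicit. Assuming temperature is applied in the usual way, as a divisor of the logits (this page does not spell the formula out), the tempered distribution for \( T > 0 \) is:

\[ p_T(x_t = v \mid x_{<t}) = \frac{\exp(z_v / T)}{\sum_{j \in V} \exp(z_j / T)} \]

Here \( T \) is the temperature, not the sequence length from Section 1. As \( T \to 0^{+} \) the largest logit dominates the sum, so when the maximum is unique:

\[ \lim_{T \to 0^{+}} p_T(x_t = v \mid x_{<t}) = \begin{cases} 1 & \text{if } v = \arg\max_{j \in V} z_j \\ 0 & \text{otherwise} \end{cases} \]

The formula itself is undefined at \( T = 0 \), which is why that value is naturally handled as a direct arg-max rather than by dividing the logits.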

// Example: switch to creative mode
try generator.setConfig(GenerationConfig.creative());
const result = try generator.generate("Once upon a time");

5. Generation Loop

Autoregressive Generation Loop

Input: prompt string, GenerationConfig

Output: GenerationResult

  1. Tokenise the prompt into token IDs \( (x_1, \ldots, x_n) \).
  2. Initialise generated_tokens with prompt tokens.
  3. Set num_generated \( \leftarrow 0 \).
  4. while num_generated < max_tokens do
    1. Run the model forward pass on generated_tokens.
    2. Extract the logit vector \( \mathbf{z} \in \mathbb{R}^{|V|} \) for the last position.
    3. Copy logits and apply repetition penalty (Section 6).
    4. Sample next token \( x_{n+1} \) using the configured strategy.
    5. if \( x_{n+1} \) triggers a stop condition then break.
    6. Append \( x_{n+1} \) to generated_tokens.
    7. Record \( \log p(x_{n+1}) \).
    8. Increment num_generated.
  5. Decode generated_tokens back to text.
  6. Compute GenerationStats from elapsed time and token count.
  7. return GenerationResult.
flowchart TD
    A["Tokenise Prompt"] --> B["Forward Pass"]
    B --> C["Extract Last-Position Logits"]
    C --> D["Apply Repetition Penalty"]
    D --> E["Sample Token"]
    E --> F{"Stop Condition?"}
    F -->|No| G["Append Token"]
    G --> B
    F -->|Yes| H["Decode and Return Result"]
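The control flow above can be exercised with a toy stand-in for the model. The sketch below is illustrative only: `toyForward`, `argmax`, the four-token vocabulary, and the `eos` id are all invented for the example. It uses greedy selection and a fixed buffer, and it leaves out the repetition penalty, log-probability recording, and decoding steps.

```zig
const std = @import("std");

const vocab_size = 4;
const eos: u32 = 3; // hypothetical end-of-sequence id for the toy vocabulary

/// Toy "forward pass": gives a high logit to (last token + 1) mod vocab_size.
fn toyForward(tokens: []const u32, logits: *[vocab_size]f32) void {
    const last = tokens[tokens.len - 1];
    for (logits, 0..) |*z, i| {
        z.* = if (i == (last + 1) % vocab_size) 5.0 else 0.0;
    }
}

/// Index of the largest logit (greedy selection).
fn argmax(logits: []const f32) u32 {
    var best: usize = 0;
    for (logits, 0..) |z, i| {
        if (z > logits[best]) best = i;
    }
    return @intCast(best);
}

/// `buf[0..prompt_len]` holds the prompt; new tokens are appended in place.
/// Returns the number of generated tokens.
fn generate(buf: []u32, prompt_len: usize, max_tokens: usize) usize {
    std.debug.assert(prompt_len > 0 and prompt_len <= buf.len);
    var len = prompt_len;
    var num_generated: usize = 0;
    while (num_generated < max_tokens and len < buf.len) {
        var logits: [vocab_size]f32 = undefined;
        toyForward(buf[0..len], &logits); // forward pass, last-position logits
        const next = argmax(&logits); // sample (greedy here)
        if (next == eos) break; // stop condition
        buf[len] = next; // append
        len += 1;
        num_generated += 1;
    }
    return num_generated;
}

test "loop stops at the toy EOS token" {
    var buf = [_]u32{0} ** 16;
    const n = generate(&buf, 1, 10);
    try std.testing.expectEqual(@as(usize, 2), n);
    try std.testing.expectEqual(@as(u32, 1), buf[1]);
    try std.testing.expectEqual(@as(u32, 2), buf[2]);
}
```

Starting from the prompt `{0}`, the toy model proposes 1, then 2, then the EOS id 3, so the loop stops after two generated tokens even though `max_tokens` is 10.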

Per-Step Complexity

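Each iteration of the inner loop performs one full model forward pass. Without KV caching, the cost is \( O(n \cdot L \cdot d_{\text{model}}^{\,2}) \) where \( n \) is the current sequence length and \( L \) is the number of transformer layers. With KV caching (see KV Cache), this drops to \( O(L \cdot d_{\text{model}}^{\,2}) \) per token. Both figures count only the weight-matrix multiplications; attention over the context adds \( O(n^2 \cdot L \cdot d_{\text{model}}) \) without the cache and \( O(n \cdot L \cdot d_{\text{model}}) \) with it.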
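To see what the per-step figures mean for a whole generation, suppose the prompt has \( n_0 \) tokens and \( m \) tokens are generated. Taking the matrix-multiplication costs quoted above at face value, the uncached loop reprocesses the full sequence at every step:

\[ \sum_{i=0}^{m-1} (n_0 + i) = m\,n_0 + \frac{m(m-1)}{2} \]

so the total is \( O\big((m\,n_0 + m^2) \cdot L \cdot d_{\text{model}}^{\,2}\big) \), against \( O\big((n_0 + m) \cdot L \cdot d_{\text{model}}^{\,2}\big) \) when the prompt is processed once and each later step handles a single position. For \( n_0 = 100 \) and \( m = 100 \), that is \( 10\,000 + 4\,950 = 14\,950 \) position evaluations without the cache versus about \( 200 \) with it.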


6. Repetition Penalty

Repetition penalty discourages the model from producing the same tokens repeatedly. ZigLlama implements the method from Keskar et al. (2019) [1]:

Repetition Penalty

For each token \( v \) that appears in the recent history window (last 64 tokens by default), the logit is modified as:

\[ z'_v = \begin{cases} z_v \;/\; \rho & \text{if } z_v > 0 \\ z_v \times \rho & \text{if } z_v \le 0 \end{cases} \]

where \( \rho \ge 1 \) is the repetition penalty factor.

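The asymmetric treatment ensures that the penalty always lowers a token's score: positive logits are reduced, and negative logits are pushed further negative. Setting \( \rho = 1.0 \) disables the penalty entirely. Note that the implementation below loops over every position in the window, so a token that occurs \( k \) times there is penalised \( k \) times rather than once.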
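As a quick numerical check with \( \rho = 1.1 \), consider two tokens from the history window, one with a positive logit and one with a negative logit:

\[ z_a = 2.2 \;\Rightarrow\; z'_a = 2.2 / 1.1 = 2.0, \qquad z_b = -1.0 \;\Rightarrow\; z'_b = -1.0 \times 1.1 = -1.1 \]

Both logits move downwards. Multiplying the positive logit by \( \rho \), or dividing the negative one, would have raised the score instead.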

fn applyRepetitionPenalty(self: *TextGenerator, logits: []f32, tokens: []const TokenId) !void {
    if (self.config.repetition_penalty == 1.0) return;

    const history_window = @min(tokens.len, 64);
    const recent_tokens = if (tokens.len > history_window)
        tokens[tokens.len - history_window ..]
    else
        tokens;

    for (recent_tokens) |token| {
        if (token < logits.len) {
            if (logits[token] > 0) {
                logits[token] /= self.config.repetition_penalty;
            } else {
                logits[token] *= self.config.repetition_penalty;
            }
        }
    }
}

Choosing the Penalty Factor

  • \( \rho = 1.0 \): No penalty (good for short, factual answers).
  • \( \rho = 1.05\text{--}1.15 \): Light penalty (recommended for most tasks).
  • \( \rho > 1.3 \): Aggressive penalty (may cause incoherent output as the model is forced away from natural continuations).

7. GenerationResult

The generate method returns a GenerationResult struct that bundles the output with metadata for logging, evaluation, and downstream processing.

pub const GenerationResult = struct {
    tokens: []TokenId,        // Generated token IDs
    text: ?[]u8,              // Decoded text (if tokenizer available)
    log_probs: []f32,         // Per-token log probabilities
    total_log_prob: f32,      // Sum of log probabilities
    num_tokens: u32,          // Count of generated tokens
    stop_reason: StopReason,  // Why generation stopped
    stats: GenerationStats,   // Performance statistics
};

7.1 Stop Reasons

| StopReason | Description |
|---|---|
| MaxTokens | Reached the max_tokens limit |
| StopToken | Encountered a token in stop_tokens |
| StopString | Detected a string in stop_strings |
| EndOfSequence | Model produced the EOS token |
| Error | An error occurred during generation |
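One plausible shape for this type is a plain enum with a method that maps each case to text. The sketch below is an assumption: the variant names come from the table above, but the enum layout and the strings returned by `description` are illustrative, not taken from ZigLlama's source.

```zig
const std = @import("std");

/// Sketch of a stop-reason enum; strings reuse the table's wording.
const StopReason = enum {
    MaxTokens,
    StopToken,
    StopString,
    EndOfSequence,
    Error,

    /// Human-readable explanation for logs and diagnostics.
    fn description(self: StopReason) []const u8 {
        return switch (self) {
            .MaxTokens => "Reached the max_tokens limit",
            .StopToken => "Encountered a token in stop_tokens",
            .StopString => "Detected a string in stop_strings",
            .EndOfSequence => "Model produced the EOS token",
            .Error => "An error occurred during generation",
        };
    }
};

test "description covers every variant" {
    try std.testing.expectEqualStrings(
        "Reached the max_tokens limit",
        StopReason.MaxTokens.description(),
    );
}
```

Because the `switch` has no `else` branch, adding a new variant without a description is a compile error.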

7.2 Generation Statistics

GenerationStats captures timing and throughput data computed at the end of generation.

pub const GenerationStats = struct {
    generation_time_ms: f64,     // Wall-clock time
    tokens_per_second: f32,      // Throughput
    time_per_token_ms: f32,      // Average latency per token
    peak_memory_bytes: usize,    // Peak memory during generation
    num_forward_passes: u32,     // Number of model evaluations
};
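The two throughput fields follow directly from the wall-clock time and the token count. The helper below is a hypothetical sketch (the name `computeThroughput` and the `Throughput` struct are not from the library) showing the arithmetic, with a guard against division by zero when nothing was generated.

```zig
const std = @import("std");

/// Hypothetical subset of GenerationStats used for this example.
const Throughput = struct {
    tokens_per_second: f32,
    time_per_token_ms: f32,
};

/// Derives throughput figures from elapsed milliseconds and token count.
fn computeThroughput(generation_time_ms: f64, num_tokens: u32) Throughput {
    // Avoid dividing by zero for empty or instantaneous runs.
    if (num_tokens == 0 or generation_time_ms <= 0.0) {
        return Throughput{ .tokens_per_second = 0.0, .time_per_token_ms = 0.0 };
    }
    const n: f64 = @floatFromInt(num_tokens);
    const tps: f32 = @floatCast(n / (generation_time_ms / 1000.0));
    const ms_per_token: f32 = @floatCast(generation_time_ms / n);
    return Throughput{ .tokens_per_second = tps, .time_per_token_ms = ms_per_token };
}

test "50 tokens in 2 seconds" {
    const t = computeThroughput(2000.0, 50);
    try std.testing.expectApproxEqAbs(@as(f32, 25.0), t.tokens_per_second, 1e-4);
    try std.testing.expectApproxEqAbs(@as(f32, 40.0), t.time_per_token_ms, 1e-4);
}
```

The two values are reciprocal up to the unit change: 25 tokens/s corresponds to 40 ms per token.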

Tokens Per Second

The tokens_per_second metric is the primary throughput indicator. Typical values for a 7B parameter model on CPU:

  • Without KV cache: 0.5--2 tokens/s
  • With KV cache: 5--20 tokens/s
  • With KV cache + Q4 quantisation: 15--50 tokens/s

8. Sampling Dispatch

The TextGenerator delegates token selection to the configured sampling strategy through a switch dispatch:

fn sampleToken(self: *TextGenerator, logits: []f32) !TokenProb {
    return switch (self.config.strategy) {
        .Greedy => try self.sampleGreedy(logits),
        .TopK => try self.sampleTopK(logits, self.config.top_k),
        .TopP => try self.sampleTopP(logits, self.config.top_p),
        .Temperature => try self.sampleTemperature(logits, self.config.temperature),
        .Combined => try self.sampleCombined(logits),
    };
}

Each sampling method is covered in detail in Sampling Strategies and Advanced Sampling.
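For orientation, the two simplest selection rules can be sketched on an already-normalised probability vector. This is a simplified illustration, not the library's code: the real methods take logits and a `TextGenerator`, the fields of `TokenProb` shown here are assumed, and the uniform draw `u` is passed in by the caller so the example does not depend on a particular RNG.

```zig
const std = @import("std");

/// Hypothetical result type; ZigLlama's actual TokenProb fields are not shown on this page.
const TokenProb = struct { token: u32, prob: f32 };

/// Greedy selection: return the most probable token.
fn sampleGreedy(probs: []const f32) TokenProb {
    var best: usize = 0;
    for (probs, 0..) |p, i| {
        if (p > probs[best]) best = i;
    }
    return TokenProb{ .token = @intCast(best), .prob = probs[best] };
}

/// Inverse-CDF sampling: `u` is a uniform draw in [0, 1).
/// Walks the cumulative sum until it exceeds `u`.
fn sampleCategorical(probs: []const f32, u: f32) TokenProb {
    var cumulative: f32 = 0.0;
    for (probs, 0..) |p, i| {
        cumulative += p;
        if (u < cumulative) return TokenProb{ .token = @intCast(i), .prob = p };
    }
    // Rounding can leave the running sum slightly below 1; use the last token.
    const last = probs.len - 1;
    return TokenProb{ .token = @intCast(last), .prob = probs[last] };
}

test "greedy and categorical selection" {
    const probs = [_]f32{ 0.1, 0.7, 0.2 };
    try std.testing.expectEqual(@as(u32, 1), sampleGreedy(&probs).token);
    try std.testing.expectEqual(@as(u32, 0), sampleCategorical(&probs, 0.05).token);
    try std.testing.expectEqual(@as(u32, 1), sampleCategorical(&probs, 0.5).token);
}
```

Top-k and top-p can be viewed as filters that zero out part of the distribution and renormalise before a draw of this kind.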


9. End-to-End Example

const std = @import("std");
const generation = @import("inference/generation.zig");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // loadModel and loadTokenizer stand in for the loading code (see Layer 5)
    var model = try loadModel(allocator, "model.gguf");
    defer model.deinit();
    var tokenizer = try loadTokenizer(allocator, "tokenizer.model");
    defer tokenizer.deinit();

    // Create generator with reproducible seed
    var gen = generation.TextGenerator.init(&model, &tokenizer, allocator, 42);

    // Use creative preset
    try gen.setConfig(generation.GenerationConfig.creative());

    // Generate text
    const result = try gen.generate("The theory of relativity states that");
    defer result.deinit(allocator);

    // Print output
    if (result.text) |text| {
        std.debug.print("Generated: {s}\n", .{text});
    }
    std.debug.print("Tokens: {d}, Speed: {d:.1} t/s, Stop: {s}\n", .{
        result.num_tokens,
        result.stats.tokens_per_second,
        result.stop_reason.description(),
    });
}

References


  1. Keskar, N.S. et al. "CTRL: A Conditional Transformer Language Model for Controllable Generation." arXiv:1909.05858, 2019. 

  2. Vaswani, A. et al. "Attention Is All You Need." NeurIPS, 2017. 

  3. Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023. 

  4. Gerganov, G. "llama.cpp -- Inference of LLaMA model in C/C++." GitHub, 2023.