Text Generation¶

This guide covers autoregressive text generation with UniLLM.

Basic Generation¶

The simplest way to generate text:

use unillm::models_v2::llama::{LlamaModelV2, LlamaConfig};
use unillm::{Model, GenerationConfig};

let model = LlamaModelV2::new(LlamaConfig::default())?;

let gen_config = GenerationConfig::default();
let response = model.generate("Once upon a time", &gen_config)?;

println!("{}", response);

Generation Configuration¶

Customize generation behavior with GenerationConfig:

let gen_config = GenerationConfig {
    max_new_tokens: 100,      // Maximum tokens to generate
    temperature: 0.7,          // Sampling temperature
    top_p: 0.9,               // Nucleus sampling threshold
    top_k: Some(50),          // Top-k sampling
    do_sample: true,          // Enable sampling (vs greedy)
    repetition_penalty: 1.1,  // Penalize repeated tokens
    eos_token_id: 2,          // End of sequence token
    pad_token_id: 0,          // Padding token
    stop_sequences: vec![],   // Stop generation on these strings
};

Sampling Strategies¶

Greedy Decoding¶

Always select the most likely token:

let gen_config = GenerationConfig {
    do_sample: false,  // Greedy decoding
    ..Default::default()
};

When to Use Greedy

Greedy decoding is deterministic and produces the same output every time. Use it for:

Code generation
Factual responses
Reproducible outputs

Temperature Sampling¶

Control randomness with temperature:

let gen_config = GenerationConfig {
    do_sample: true,
    temperature: 0.7,  // Lower = more focused, Higher = more creative
    ..Default::default()
};

Temperature	Behavior
0.0	Greedy (deterministic)
0.3	Focused, less creative
0.7	Balanced (recommended)
1.0	Full distribution
1.5+	Very creative, may be incoherent

Top-K Sampling¶

Limit sampling to top K tokens:

let gen_config = GenerationConfig {
    do_sample: true,
    temperature: 0.7,
    top_k: Some(50),  // Only consider top 50 tokens
    ..Default::default()
};

Nucleus (Top-P) Sampling¶

Sample from the smallest set of tokens with cumulative probability >= p:

let gen_config = GenerationConfig {
    do_sample: true,
    temperature: 0.7,
    top_p: 0.9,  // Sample from tokens comprising 90% probability mass
    ..Default::default()
};

Combined Sampling¶

Combine multiple strategies:

let gen_config = GenerationConfig {
    do_sample: true,
    temperature: 0.7,
    top_p: 0.9,
    top_k: Some(50),
    repetition_penalty: 1.1,
    ..Default::default()
};

Repetition Penalty¶

Prevent the model from repeating itself:

let gen_config = GenerationConfig {
    repetition_penalty: 1.1,  // Penalize tokens already generated
    ..Default::default()
};

Penalty	Effect
1.0	No penalty
1.1	Slight penalty (recommended)
1.5	Strong penalty
2.0+	Very strong, may affect quality

Stop Sequences¶

Stop generation when specific strings are encountered:

let gen_config = GenerationConfig {
    stop_sequences: vec![
        "\n\n".to_string(),
        "Human:".to_string(),
        "```".to_string(),
    ],
    ..Default::default()
};

Generation Flow¶

Understanding how generation works:

┌──────────────────────────────────────────────────────┐
│ Input: "Hello"                                        │
└──────────────────────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────┐
│ Tokenize: [1, 15496]                                 │
└──────────────────────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────┐
│ Forward Pass → Logits [1, 2, vocab_size]             │
└──────────────────────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────┐
│ Sample Next Token (using temperature, top_p, etc.)   │
│ Selected: 3186 ("world")                             │
└──────────────────────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────┐
│ Append to sequence: [1, 15496, 3186]                 │
│ Repeat until max_tokens or EOS                       │
└──────────────────────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────┐
│ Decode: "Hello world"                                │
└──────────────────────────────────────────────────────┘

Example: Chat Completion¶

fn chat_completion(model: &LlamaModelV2, messages: &[Message]) -> Result<String> {
    // Format messages into prompt
    let prompt = format_chat_prompt(messages);

    let gen_config = GenerationConfig {
        max_new_tokens: 512,
        temperature: 0.7,
        top_p: 0.9,
        stop_sequences: vec!["Human:".to_string()],
        ..Default::default()
    };

    model.generate(&prompt, &gen_config)
}

fn format_chat_prompt(messages: &[Message]) -> String {
    messages.iter()
        .map(|m| format!("{}: {}", m.role, m.content))
        .collect::<Vec<_>>()
        .join("\n")
}

Example: Code Generation¶

let gen_config = GenerationConfig {
    max_new_tokens: 256,
    temperature: 0.2,      // Lower temperature for code
    do_sample: true,
    top_p: 0.95,
    stop_sequences: vec!["```".to_string()],
    ..Default::default()
};

let prompt = "Write a Python function to calculate fibonacci:\n```python\n";
let code = model.generate(prompt, &gen_config)?;

Performance Tips¶

Generation Optimization

KV Caching - Reuse computed key/value pairs (in development)
Batched Generation - Generate multiple sequences in parallel
Early Stopping - Use stop sequences to avoid generating unnecessary tokens
Quantized Models - Use GGUF Q4/Q8 for faster generation

Next Steps¶

Learn about Configuration Options
Explore the Model Catalog for model-specific settings