inference.generation¶
Module Path¶
Source file: src/inference/generation.zig
Public Types¶
SamplingStrategy¶
| Variant | Behavior |
|---|---|
| Greedy | Always pick the highest-probability token |
| TopK | Sample from the k most probable tokens |
| TopP | Sample from the smallest set whose cumulative probability exceeds p |
| Temperature | Scale logits by 1/temperature before sampling |
| Combined | Apply top-k, then top-p, then temperature (the default) |
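The strategies in the table can be illustrated with a small, allocation-free sketch over a toy four-token distribution. This is not the module's implementation: the helper names (`argmax`, `softmax`, `keepTopK`, `keepTopP`, `sampleIndex`) are hypothetical, the random draw is passed in as a plain `u` so the example stays independent of any RNG API, and treating `k == 0` as "filter disabled" is an assumption. The order in which `main` chains the filters is for illustration only.

```zig
const std = @import("std");

/// Greedy: index of the largest logit.
fn argmax(logits: []const f32) usize {
    var best: usize = 0;
    for (logits, 0..) |v, i| {
        if (v > logits[best]) best = i;
    }
    return best;
}

/// Temperature: softmax over logits scaled by 1/temperature (must be > 0).
fn softmax(logits: []const f32, temperature: f32, probs: []f32) void {
    var max_logit = logits[0];
    for (logits) |v| max_logit = @max(max_logit, v);
    var sum: f32 = 0;
    for (logits, probs) |v, *p| {
        p.* = @exp((v - max_logit) / temperature); // max subtracted for stability
        sum += p.*;
    }
    for (probs) |*p| p.* /= sum;
}

/// Rescale the surviving probabilities so they sum to 1 again.
fn renormalize(probs: []f32) void {
    var sum: f32 = 0;
    for (probs) |p| sum += p;
    for (probs) |*p| p.* /= sum;
}

/// TopK: zero everything outside the k most probable entries.
/// Assumption: k == 0 (or k >= n) leaves the distribution untouched.
fn keepTopK(comptime n: usize, probs: *[n]f32, k: usize) void {
    if (k == 0 or k >= n) return;
    const original = probs.*; // copy, so zeroing does not disturb the ranking
    for (original, 0..) |p, i| {
        var rank: usize = 0; // number of entries that outrank entry i
        for (original, 0..) |q, j| {
            if (q > p or (q == p and j < i)) rank += 1;
        }
        if (rank >= k) probs[i] = 0;
    }
    renormalize(probs);
}

/// TopP: keep the smallest high-probability set whose cumulative mass exceeds p.
fn keepTopP(comptime n: usize, probs: *[n]f32, p: f32) void {
    const all: []f32 = probs;
    var taken = [_]bool{false} ** n;
    var cumulative: f32 = 0;
    while (cumulative <= p) {
        var best: ?usize = null; // largest entry not taken yet
        for (all, 0..) |q, i| {
            if (taken[i]) continue;
            if (best == null or q > all[best.?]) best = i;
        }
        const idx = best orelse break; // every entry is already taken
        taken[idx] = true;
        cumulative += all[idx];
    }
    for (all, 0..) |*q, i| {
        if (!taken[i]) q.* = 0;
    }
    renormalize(all);
}

/// Draw an index given a uniform random number u in [0, 1).
fn sampleIndex(probs: []const f32, u: f32) usize {
    var cumulative: f32 = 0;
    var last_nonzero: usize = 0;
    for (probs, 0..) |p, i| {
        if (p == 0) continue;
        cumulative += p;
        last_nonzero = i;
        if (u < cumulative) return i;
    }
    return last_nonzero; // guards against rounding when u is close to 1
}

pub fn main() void {
    const logits = [_]f32{ 2.0, 1.0, 0.5, -1.0 };
    var probs: [4]f32 = undefined;
    softmax(&logits, 0.7, &probs);
    keepTopK(4, &probs, 3);
    keepTopP(4, &probs, 0.9);
    std.debug.print("greedy: {d}, sampled: {d}\n", .{ argmax(&logits), sampleIndex(&probs, 0.5) });
}
```

Lower temperatures sharpen the distribution toward the greedy choice, while the top-k and top-p filters remove the low-probability tail before a token is drawn.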
GenerationConfig¶
pub const GenerationConfig = struct {
    strategy: SamplingStrategy = .Combined,
    temperature: f32 = 0.7,
    top_k: u32 = 40,
    top_p: f32 = 0.9,
    max_tokens: u32 = 512,
    min_tokens: u32 = 1,
    stop_tokens: []const TokenId = &[_]TokenId{SpecialTokens.EOS},
    stop_strings: []const []const u8 = &[_][]const u8{},
    repetition_penalty: f32 = 1.1,
    length_penalty: f32 = 1.0,
    seed: ?u64 = null,
};
Controls every aspect of the generation loop.
StopReason¶
GenerationResult¶
pub const GenerationResult = struct {
    tokens: []TokenId,
    text: []const u8,
    log_probs: ?[]f32,
    stop_reason: StopReason,
    stats: GenerationStats,
};
Returned by generate. Contains the produced tokens, decoded text, optional per-token log probabilities, the reason generation stopped, and timing statistics.
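Because `log_probs` is optional, callers have to handle the `null` case before using it. The sketch below shows one way to do that, turning per-token log probabilities into a perplexity score (the exponential of the negative mean log probability). The helper names `meanLogProb` and `perplexity` are hypothetical and not part of the module, and the sketch assumes natural-log probabilities.

```zig
const std = @import("std");

/// Mean log probability of the generated tokens, or null when
/// log probabilities were not recorded (or the slice is empty).
fn meanLogProb(log_probs: ?[]const f32) ?f32 {
    const lp = log_probs orelse return null;
    if (lp.len == 0) return null;
    var sum: f32 = 0;
    for (lp) |v| sum += v;
    return sum / @as(f32, @floatFromInt(lp.len));
}

/// Perplexity = exp(-mean log probability); lower means a more confident model.
fn perplexity(log_probs: ?[]const f32) ?f32 {
    const mean = meanLogProb(log_probs) orelse return null;
    return @exp(-mean);
}

pub fn main() void {
    // Stand-in for the `log_probs` field of a result (values are made up).
    const recorded: []const f32 = &[_]f32{ -0.25, -1.5, -0.75 };
    if (perplexity(recorded)) |ppl| {
        std.debug.print("perplexity: {d:.3}\n", .{ppl});
    }
    if (perplexity(null) == null) {
        std.debug.print("log probabilities were not recorded\n", .{});
    }
}
```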
TextGenerator¶
pub const TextGenerator = struct {
    model: *LLaMAModel,
    tokenizer: *SimpleTokenizer,
    config: GenerationConfig,
    allocator: std.mem.Allocator,
};
High-level generation engine that ties together a model, tokenizer, and sampling configuration.
Public Functions¶
TextGenerator.init¶
pub fn init(
    model: *LLaMAModel,
    tokenizer: *SimpleTokenizer,
    allocator: std.mem.Allocator,
    config: GenerationConfig,
) TextGenerator
Construct a generator. Does not allocate; the returned struct is ready to use immediately.
TextGenerator.generate¶
End-to-end text generation:
- Encode `prompt` to token IDs.
- Run the autoregressive generation loop with the configured sampling strategy.
- Decode output tokens to text.
- Return a `GenerationResult`.
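The four steps above can be sketched with a toy model and tokenizer. Everything here is a stand-in chosen to keep the example self-contained: the letter-based tokenizer, the `nextToken` "model" that simply predicts the next letter, the EOS ID of 0, the fixed-size buffers, and the `Result` struct with a `hit_eos` flag in place of the real `StopReason`. Prompts are assumed to contain only the letters `a` to `h`.

```zig
const std = @import("std");

const TokenId = u32;
const eos: TokenId = 0; // hypothetical end-of-sequence ID
const max_len = 32; // fixed buffers keep the sketch allocation-free

const Result = struct {
    text: [max_len]u8 = undefined,
    len: usize = 0,
    hit_eos: bool = false,
};

/// Toy stand-in for a forward pass plus greedy sampling:
/// predicts the next letter of the alphabet, and EOS after 'h' (token 8).
fn nextToken(context: []const TokenId) TokenId {
    const last = context[context.len - 1];
    return if (last < 8) last + 1 else eos;
}

fn generate(prompt: []const u8, max_tokens: usize) error{EmptyPrompt}!Result {
    if (prompt.len == 0) return error.EmptyPrompt;
    std.debug.assert(prompt.len < max_len);

    // 1. Encode: toy tokenizer mapping 'a'..'h' to token IDs 1..8.
    var context: [max_len]TokenId = undefined;
    var n: usize = 0;
    for (prompt) |c| {
        context[n] = c - 'a' + 1;
        n += 1;
    }

    // 2. Autoregressive loop: each new token is appended to the context
    //    and fed back in on the next step.
    var result = Result{};
    while (result.len < max_tokens and n < max_len) {
        const tok = nextToken(context[0..n]);
        if (tok == eos) {
            result.hit_eos = true;
            break;
        }
        context[n] = tok;
        n += 1;
        // 3. Decode the new token back to a character.
        result.text[result.len] = @as(u8, @intCast(tok)) + 'a' - 1;
        result.len += 1;
    }

    // 4. Return the decoded text together with the stop information.
    return result;
}

pub fn main() !void {
    const r = try generate("ab", 10);
    std.debug.print("{s} (hit_eos={})\n", .{ r.text[0..r.len], r.hit_eos });
}
```

With the prompt `"ab"` the toy model continues the alphabet until it emits EOS after `h`; with a smaller `max_tokens` the loop stops on the token budget instead.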
GenerationConfig.creative¶
Preset: temperature=1.0, top_k=0, top_p=0.95. Good for creative writing.
GenerationConfig.balanced¶
Preset: temperature=0.7, top_k=40, top_p=0.9. The default -- good balance between coherence and variety.
GenerationConfig.deterministic¶
Preset: strategy=.Greedy, temperature=0.0. Always produces the same output for the same input.
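The three presets can be read as thin constructors over the struct's field defaults. The following is a minimal sketch of how such presets might be written, not the module's source: the struct is a reduced copy containing only the fields the presets touch, and modelling `SamplingStrategy` as a plain `enum` is an assumption.

```zig
const std = @import("std");

// Assumption: a plain enum with the variants listed in the table above.
const SamplingStrategy = enum { Greedy, TopK, TopP, Temperature, Combined };

/// Reduced copy of the documented struct: only the fields the presets touch.
const GenerationConfig = struct {
    strategy: SamplingStrategy = .Combined,
    temperature: f32 = 0.7,
    top_k: u32 = 40,
    top_p: f32 = 0.9,
    max_tokens: u32 = 512,

    pub fn creative() GenerationConfig {
        return .{ .temperature = 1.0, .top_k = 0, .top_p = 0.95 };
    }

    pub fn balanced() GenerationConfig {
        return .{}; // the field defaults already match this preset
    }

    pub fn deterministic() GenerationConfig {
        return .{ .strategy = .Greedy, .temperature = 0.0 };
    }
};

pub fn main() void {
    // Presets are plain values, so individual fields can still be overridden.
    var config = GenerationConfig.deterministic();
    config.max_tokens = 64;
    std.debug.print("strategy={s} max_tokens={d}\n", .{ @tagName(config.strategy), config.max_tokens });
}
```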
Error Types¶
- `error{EmptyPrompt}` -- prompt string is empty.
- `error{ModelError}` -- forward pass failed.
- `error{OutOfMemory}`
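Callers that want to react to each error separately can switch on the error value instead of propagating it with `try`. The sketch below uses a hypothetical `fakeGenerate` stand-in so that it compiles on its own; only the three error names come from the list above, and the messages in `describe` are illustrative.

```zig
const std = @import("std");

const GenerateError = error{ EmptyPrompt, ModelError, OutOfMemory };

/// Hypothetical stand-in for `TextGenerator.generate`; it can only fail
/// with EmptyPrompt, but it shares the documented error set.
fn fakeGenerate(prompt: []const u8) GenerateError![]const u8 {
    if (prompt.len == 0) return error.EmptyPrompt;
    return "generated text";
}

/// Map each documented error to a human-readable message.
/// The switch is exhaustive, so adding a new error forces an update here.
fn describe(err: GenerateError) []const u8 {
    return switch (err) {
        error.EmptyPrompt => "prompt was empty",
        error.ModelError => "forward pass failed",
        error.OutOfMemory => "allocation failed",
    };
}

pub fn main() void {
    const text = fakeGenerate("") catch |err| {
        std.debug.print("generation failed: {s}\n", .{describe(err)});
        return;
    };
    std.debug.print("{s}\n", .{text});
}
```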
Usage Example¶
const std = @import("std");
const gen = @import("zigllama").inference.generation;

// Start from the balanced preset and override a single field.
var config = gen.GenerationConfig.balanced();
config.max_tokens = 256;

// `model`, `tokenizer`, and `allocator` are assumed to be initialized already.
var generator = gen.TextGenerator.init(&model, &tokenizer, allocator, config);
const result = try generator.generate("Once upon a time");
// The caller frees the returned buffers.
defer allocator.free(result.tokens);
defer allocator.free(result.text);

std.debug.print("{s}\n", .{result.text});
std.debug.print("Stop reason: {}\n", .{result.stop_reason});
Related Modules¶
- `models.llama` -- The `LLaMAModel` driven by the generator.
- `models.tokenizer` -- Encodes prompts and decodes outputs.
- `inference.kv_cache` -- Speeds up autoregressive generation.
- `inference.streaming` -- Stream tokens as they are generated.
- `inference.advanced_sampling` -- Mirostat, typical, tail-free sampling strategies.