Text Generation¶

Learn how to generate text with Mullama's inference engine, from basic completions to advanced chat interactions with fine-tuned sampling control.

Basic Generation¶

The simplest way to generate text from a prompt:

Node.jsPythonRustCLI

import { Model, Context } from 'mullama';

const model = await Model.load('./model.gguf');
const context = new Context(model);

const response = await context.generate("Once upon a time,", 100);
console.log(response);

from mullama import Model, Context

model = Model.load("./model.gguf")
context = Context(model)

response = context.generate("Once upon a time,", max_tokens=100)
print(response)

use mullama::{Model, Context, ContextParams};
use std::sync::Arc;

let model = Arc::new(Model::load("model.gguf")?);
let mut context = Context::new(model.clone(), ContextParams::default())?;

let response = context.generate("Once upon a time,", 100)?;
println!("{}", response);

mullama run llama3.2:1b "Once upon a time,"

Context Parameters¶

Configure the inference context for your workload. Context parameters control the maximum sequence length, parallelism, and memory efficiency:

Node.jsPythonRustCLI

const context = new Context(model, {
  nCtx: 4096,            // Context window size (max tokens)
  nBatch: 512,           // Tokens processed per prompt batch
  nThreads: 8,           // CPU threads for generation
  nThreadsBatch: 8,      // CPU threads for batch processing
  ropeFreqBase: 0.0,     // RoPE frequency base (0 = model default)
  ropeFreqScale: 0.0,    // RoPE frequency scale (0 = model default)
  flashAttn: true,       // Enable flash attention
  kvCacheType: 'f16',    // KV cache quantization
});

from mullama import Context, ContextParams

context = Context(model, ContextParams(
    n_ctx=4096,              # Context window size (max tokens)
    n_batch=512,             # Tokens processed per prompt batch
    n_threads=8,             # CPU threads for generation
    n_threads_batch=8,       # CPU threads for batch processing
    rope_freq_base=0.0,      # RoPE frequency base (0 = model default)
    rope_freq_scale=0.0,     # RoPE frequency scale (0 = model default)
    flash_attn=True,         # Enable flash attention
    kv_cache_type="f16",     # KV cache quantization
))

use mullama::{Context, ContextParams, KvCacheType};

let params = ContextParams {
    n_ctx: 4096,
    n_batch: 512,
    n_threads: 8,
    n_threads_batch: 8,
    rope_freq_base: 0.0,
    rope_freq_scale: 0.0,
    flash_attn: true,
    kv_cache_type: KvCacheType::F16,
    ..Default::default()
};

let mut context = Context::new(model, params)?;

mullama run llama3.2:1b "Hello!" \
  --ctx-size 4096 \
  --batch-size 512 \
  --threads 8 \
  --flash-attn

Parameter Reference¶

Parameter	Type	Default	Description
`n_ctx`	`usize`	Model default	Maximum context length in tokens
`n_batch`	`usize`	2048	Prompt processing batch size
`n_threads`	`usize`	System cores	CPU threads for token generation
`n_threads_batch`	`usize`	System cores	CPU threads for batch prompt processing
`rope_freq_base`	`f32`	0.0 (auto)	RoPE positional encoding frequency base
`rope_freq_scale`	`f32`	0.0 (auto)	RoPE frequency scaling factor
`flash_attn`	`bool`	`true`	Enable flash attention for memory efficiency
`embeddings`	`bool`	`false`	Enable embedding extraction
`kv_cache_type`	`KvCacheType`	`F16`	KV cache quantization (F32/F16/BF16/Q8_0/Q4_0)

Context Size vs Training Length

Setting n_ctx larger than the model's training context may work with RoPE scaling but can degrade quality. For best results, stay within the trained context window.

Sampling Parameters¶

Control the randomness and quality of generation with sampling parameters:

Node.jsPythonRustCLI

const response = await context.generate("Write a poem:", 200, {
  temperature: 0.7,       // Controls randomness (0.0 = deterministic)
  topK: 40,               // Consider only top K candidates
  topP: 0.9,              // Nucleus sampling threshold
  minP: 0.05,             // Minimum probability relative to top token
  typicalP: 1.0,          // Typical sampling threshold
  penaltyRepeat: 1.1,     // Penalize repeated tokens
  penaltyFreq: 0.0,       // Frequency-based penalty
  penaltyPresent: 0.0,    // Presence-based penalty
  seed: 42,               // Random seed for reproducibility
});

from mullama import SamplerParams

response = context.generate("Write a poem:", max_tokens=200, params=SamplerParams(
    temperature=0.7,         # Controls randomness (0.0 = deterministic)
    top_k=40,                # Consider only top K candidates
    top_p=0.9,               # Nucleus sampling threshold
    min_p=0.05,              # Minimum probability relative to top token
    typical_p=1.0,           # Typical sampling threshold
    penalty_repeat=1.1,      # Penalize repeated tokens
    penalty_freq=0.0,        # Frequency-based penalty
    penalty_present=0.0,     # Presence-based penalty
    seed=42,                 # Random seed for reproducibility
))

use mullama::SamplerParams;

let sampling = SamplerParams {
    temperature: 0.7,
    top_k: 40,
    top_p: 0.9,
    min_p: 0.05,
    typical_p: 1.0,
    penalty_repeat: 1.1,
    penalty_freq: 0.0,
    penalty_present: 0.0,
    seed: 42,
    ..Default::default()
};

let response = context.generate_with_params("Write a poem:", 200, sampling)?;

mullama run llama3.2:1b "Write a poem:" \
  --temperature 0.7 \
  --top-k 40 \
  --top-p 0.9 \
  --min-p 0.05 \
  --repeat-penalty 1.1 \
  --seed 42

Sampling Presets¶

Use these presets as starting points for common use cases:

Node.jsPythonRustCLI

// Deterministic: best for code generation, factual Q&A
const deterministic = { temperature: 0.0 };

// Balanced: good for general conversation
const balanced = { temperature: 0.7, topK: 40, topP: 0.9 };

// Creative: for stories, brainstorming, poetry
const creative = { temperature: 1.2, topK: 80, topP: 0.95 };

const response = await context.generate(prompt, 200, balanced);

# Deterministic: best for code generation, factual Q&A
deterministic = SamplerParams(temperature=0.0)

# Balanced: good for general conversation
balanced = SamplerParams(temperature=0.7, top_k=40, top_p=0.9)

# Creative: for stories, brainstorming, poetry
creative = SamplerParams(temperature=1.2, top_k=80, top_p=0.95)

response = context.generate(prompt, max_tokens=200, params=balanced)

// Deterministic: best for code generation, factual Q&A
let deterministic = SamplerParams { temperature: 0.0, ..Default::default() };

// Balanced: good for general conversation
let balanced = SamplerParams {
    temperature: 0.7, top_k: 40, top_p: 0.9, ..Default::default()
};

// Creative: for stories, brainstorming, poetry
let creative = SamplerParams {
    temperature: 1.2, top_k: 80, top_p: 0.95, ..Default::default()
};

# Deterministic
mullama run llama3.2:1b "What is 2+2?" --temperature 0

# Balanced
mullama run llama3.2:1b "Tell me about Rust" --temperature 0.7 --top-k 40

# Creative
mullama run llama3.2:1b "Write a haiku" --temperature 1.2 --top-k 80

For a comprehensive guide to all sampling strategies, see Sampling Strategies.

Token-by-Token Generation¶

For fine-grained control, generate one token at a time. This enables custom stop logic, progress tracking, and dynamic parameter adjustment:

Node.jsPythonRustCLI

import { Model, Context, SamplerChain } from 'mullama';

const model = await Model.load('./model.gguf');
const context = new Context(model);

// Tokenize the prompt
const tokens = model.tokenize("The meaning of life is");

// Feed prompt into context
await context.decode(tokens);

// Build a sampler chain
const sampler = new SamplerChain()
  .addTemperature(0.7)
  .addTopK(40)
  .addTopP(0.9)
  .addDist(42);

// Generate tokens one at a time
let output = '';
for (let i = 0; i < 200; i++) {
  const token = sampler.sample(context);

  if (model.isEndOfGeneration(token)) break;

  const text = model.tokenToString(token);
  output += text;
  process.stdout.write(text);

  await context.evalToken(token);
}
console.log('\n\nGenerated:', output);

from mullama import Model, Context, SamplerChain

model = Model.load("./model.gguf")
context = Context(model)

# Tokenize the prompt
tokens = model.tokenize("The meaning of life is")

# Feed prompt into context
context.decode(tokens)

# Build a sampler chain
sampler = (SamplerChain()
    .add_temperature(0.7)
    .add_top_k(40)
    .add_top_p(0.9)
    .add_dist(42))

# Generate tokens one at a time
output = ""
for _ in range(200):
    token = sampler.sample(context)

    if model.is_end_of_generation(token):
        break

    text = model.token_to_string(token)
    output += text
    print(text, end="", flush=True)

    context.eval_token(token)

print(f"\n\nGenerated: {output}")

use mullama::{Model, Context, ContextParams};
use mullama::sampling::{Sampler, SamplerChain};
use std::sync::Arc;

let model = Arc::new(Model::load("model.gguf")?);
let mut context = Context::new(model.clone(), ContextParams::default())?;

// Tokenize the prompt
let tokens = model.tokenize("The meaning of life is", true, false)?;
context.decode(&tokens)?;

// Build a sampler chain
let mut chain = SamplerChain::new()
    .add(Sampler::temperature(0.7)?)
    .add(Sampler::top_k(40)?)
    .add(Sampler::top_p(0.9, 1)?)
    .add(Sampler::dist(42)?);

// Generate tokens one at a time
let mut output = String::new();
for _ in 0..200 {
    let token = chain.sample(&context)?;

    if model.token_is_eog(token) {
        break;
    }

    let text = model.token_to_str(token, 0, false)?;
    output.push_str(&text);
    print!("{}", text);

    context.eval_token(token)?;
}
println!("\n\nGenerated: {}", output);

# CLI handles token-by-token generation internally
# Use --verbose to see per-token timing
mullama run llama3.2:1b "The meaning of life is" --verbose

Chat Templates¶

Format messages for chat/instruction-tuned models using their built-in templates:

Node.jsPythonRustCLI

import { Model, Context, ChatMessage } from 'mullama';

const model = await Model.load('./model.gguf');
const context = new Context(model);

const messages = [
  ChatMessage.system("You are a helpful coding assistant."),
  ChatMessage.user("Write a function to compute fibonacci numbers."),
];

// Apply the model's built-in chat template
const prompt = model.applyChatTemplate(messages);
const response = await context.generate(prompt, 500);
console.log(response);

from mullama import Model, Context, ChatMessage

model = Model.load("./model.gguf")
context = Context(model)

messages = [
    ChatMessage.system("You are a helpful coding assistant."),
    ChatMessage.user("Write a function to compute fibonacci numbers."),
]

# Apply the model's built-in chat template
prompt = model.apply_chat_template(messages)
response = context.generate(prompt, max_tokens=500)
print(response)

use mullama::ChatMessage;

let messages = vec![
    ChatMessage::system("You are a helpful coding assistant."),
    ChatMessage::user("Write a function to compute fibonacci numbers."),
];

let prompt = model.apply_chat_template(&messages)?;
let response = context.generate(&prompt, 500)?;
println!("{}", response);

# CLI automatically applies chat template
mullama run llama3.2:1b "Write a function to compute fibonacci numbers." \
  --system "You are a helpful coding assistant."

Chat Template Formats¶

Different models use different chat formats. Mullama automatically detects the correct template from model metadata. For models without embedded templates, you can format manually:

ChatML (Qwen, etc.)Llama 3Mistral

<|im_start|>system
You are helpful.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are helpful.<|eot_id|>
<|start_header_id|>user<|end_header_id|>

Hello!<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>

[INST] Hello! [/INST]

Multi-Turn Conversations¶

Build multi-turn conversations by accumulating messages:

Node.jsPythonRustCLI

import { Model, Context, ChatMessage } from 'mullama';
import * as readline from 'readline';

const model = await Model.load('./model.gguf');
const context = new Context(model, { nCtx: 4096 });

const messages = [
  ChatMessage.system("You are a helpful assistant."),
];

const rl = readline.createInterface({ input: process.stdin, output: process.stdout });

function ask(question) {
  return new Promise(resolve => rl.question(question, resolve));
}

while (true) {
  const userInput = await ask("You: ");
  if (userInput === 'quit') break;

  messages.push(ChatMessage.user(userInput));

  const prompt = model.applyChatTemplate(messages);
  const response = await context.generate(prompt, 500);

  console.log(`Assistant: ${response}`);
  messages.push(ChatMessage.assistant(response));
}

from mullama import Model, Context, ContextParams, ChatMessage

model = Model.load("./model.gguf")
context = Context(model, ContextParams(n_ctx=4096))

messages = [
    ChatMessage.system("You are a helpful assistant."),
]

while True:
    user_input = input("You: ")
    if user_input == "quit":
        break

    messages.append(ChatMessage.user(user_input))

    prompt = model.apply_chat_template(messages)
    response = context.generate(prompt, max_tokens=500)

    print(f"Assistant: {response}")
    messages.append(ChatMessage.assistant(response))

use mullama::ChatMessage;

let mut messages = vec![
    ChatMessage::system("You are a helpful assistant."),
];

loop {
    let user_input = get_user_input();  // Your input function
    messages.push(ChatMessage::user(&user_input));

    let prompt = model.apply_chat_template(&messages)?;
    let response = context.generate(&prompt, 500)?;

    println!("Assistant: {}", response);
    messages.push(ChatMessage::assistant(&response));
}

# Interactive chat mode
mullama run llama3.2:1b --interactive \
  --system "You are a helpful assistant."

Stop Conditions¶

Control when generation terminates using stop sequences:

Node.jsPythonRustCLI

const response = await context.generate("Generate a list:", 500, {
  stopSequences: ["\n\n", "User:", "```"],
});

response = context.generate("Generate a list:", max_tokens=500, params=SamplerParams(
    stop_sequences=["\n\n", "User:", "```"],
))

let params = SamplerParams {
    stop_sequences: vec![
        "\n\n".to_string(),
        "User:".to_string(),
        "```".to_string(),
    ],
    ..Default::default()
};

let response = context.generate_with_params(prompt, 500, params)?;

mullama run llama3.2:1b "Generate a list:" --stop "\n\n" --stop "User:"

KV Cache Management¶

The KV (Key-Value) cache stores attention state. Managing it is important for long conversations:

Node.jsPythonRustCLI

// Check context usage
const nPast = context.tokenCount();
const nCtx = context.contextSize();
console.log(`Context usage: ${nPast}/${nCtx} tokens`);

// Clear context when nearly full
if (nPast > nCtx - 200) {
  context.clear();
  // Re-inject system prompt after clearing
  const systemTokens = model.tokenize("System: You are helpful.\n");
  await context.decode(systemTokens);
}

# Check context usage
n_past = context.token_count()
n_ctx = context.context_size()
print(f"Context usage: {n_past}/{n_ctx} tokens")

# Clear context when nearly full
if n_past > n_ctx - 200:
    context.clear()
    # Re-inject system prompt after clearing
    system_tokens = model.tokenize("System: You are helpful.\n")
    context.decode(system_tokens)

// Check context usage
let n_past = context.n_past();
let n_ctx = context.n_ctx();
println!("Context usage: {}/{} tokens", n_past, n_ctx);

// Clear context when nearly full
if n_past > n_ctx - 200 {
    context.clear()?;
    let sys_tokens = model.tokenize("System: You are helpful.\n", true, true)?;
    context.decode(&sys_tokens)?;
}

# CLI manages context automatically in interactive mode
mullama run llama3.2:1b --interactive --ctx-size 4096

KV Cache Quantization¶

Reduce memory usage by quantizing the KV cache. Lower precision uses less memory at the cost of minor quality loss:

Type	Bits	Memory (4096 ctx, 7B)	Quality Impact
F32	32	~4 GB	None
F16	16	~2 GB	Negligible
BF16	16	~2 GB	Negligible
Q8_0	8	~1 GB	Very minor
Q4_0	4	~0.5 GB	Minor

Recommended

Use F16 for most workloads. Switch to Q8_0 or Q4_0 when running with very large context windows (32K+) or on memory-constrained systems.

Batch Processing¶

Process multiple prompts efficiently using the Batch API:

Node.jsPythonRustCLI

import { Batch } from 'mullama';

const prompts = [
  "Translate to French: Hello",
  "Translate to French: Goodbye",
  "Translate to French: Thank you",
];

const batch = new Batch(512, prompts.length);

for (let i = 0; i < prompts.length; i++) {
  const tokens = model.tokenize(prompts[i]);
  batch.addSequence(i, tokens);
}

await context.decodeBatch(batch);

from mullama import Batch

prompts = [
    "Translate to French: Hello",
    "Translate to French: Goodbye",
    "Translate to French: Thank you",
]

batch = Batch(max_tokens=512, n_sequences=len(prompts))

for i, prompt in enumerate(prompts):
    tokens = model.tokenize(prompt)
    batch.add_sequence(i, tokens)

context.decode_batch(batch)

use mullama::Batch;

let prompts = vec![
    "Translate to French: Hello",
    "Translate to French: Goodbye",
    "Translate to French: Thank you",
];

let mut batch = Batch::new(512, prompts.len());
for (i, prompt) in prompts.iter().enumerate() {
    let tokens = model.tokenize(prompt, true, false)?;
    batch.add_sequence(i, &tokens)?;
}

context.decode_batch(&batch)?;

# Batch via daemon REST API
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "prompt": ["Translate to French: Hello", "Translate to French: Goodbye"]
  }'

Performance Tips¶

Use appropriate batch sizes -- Larger n_batch values speed up prompt processing for long inputs
Enable flash attention -- Reduces memory usage and improves speed for long contexts
Tune thread count -- Set n_threads to the number of physical cores (not hyperthreads)
Reuse contexts -- Creating a new context is expensive; clear and reuse instead
Use streaming -- Better user experience and enables early stopping to save compute
Match n_ctx to need -- Smaller context windows use less memory and are faster
Quantize KV cache -- Use F16 or Q8_0 for large context windows to reduce memory pressure
Offload to GPU -- Even partial GPU offloading significantly accelerates generation

Thread Count

Setting n_threads higher than your physical core count can actually reduce performance due to hyperthreading overhead. Use nproc --all on Linux or check your CPU specifications.

Text Generation¶

Basic Generation¶

Context Parameters¶

Parameter Reference¶

Sampling Parameters¶

Sampling Presets¶

Token-by-Token Generation¶

Chat Templates¶

Chat Template Formats¶

Multi-Turn Conversations¶

Stop Conditions¶

KV Cache Management¶

KV Cache Quantization¶

Batch Processing¶

Performance Tips¶

See Also¶