Context API¶

The Context struct manages the inference state for text generation. It holds the KV-cache, processes token batches, and coordinates sampling. Each context is bound to a single model and maintains its own generation state.

Context Struct¶

/// Represents a model context for inference.
///
/// Holds the KV-cache and internal state for token processing.
/// Not thread-safe -- each thread should create its own context.
pub struct Context {
    pub model: Arc<Model>,
    pub ctx_ptr: *mut llama_context,
}

Thread Safety

Context is not Send or Sync. It must not be shared across threads or moved between threads. Each thread should create its own context from a shared Arc<Model>. For async use, wrap operations with AsyncContext which uses spawn_blocking internally.

Creating Contexts¶

`Context::new`¶

Create a new inference context for a model with specified parameters.

pub fn new(model: Arc<Model>, params: ContextParams) -> Result<Self, MullamaError>

Parameters:

Name	Type	Default	Description
`model`	`Arc<Model>`	--	Shared reference to the loaded model
`params`	`ContextParams`	--	Context configuration parameters

Returns: Result<Context, MullamaError>

Errors:

MullamaError::ContextError -- Failed to allocate context (usually insufficient memory for KV-cache)
MullamaError::InvalidInput -- Invalid parameter combination

Example:

use mullama::{Model, Context, ContextParams};
use std::sync::Arc;

let model = Arc::new(Model::load("model.gguf")?);

let mut params = ContextParams::default();
params.n_ctx = 4096;
params.n_batch = 512;
params.n_threads = 8;

let mut ctx = Context::new(model, params)?;

ContextParams¶

Full configuration for context behavior including threading, RoPE, flash attention, and KV-cache quantization.

#[derive(Debug, Clone)]
pub struct ContextParams {
    pub n_ctx: u32,
    pub n_batch: u32,
    pub n_ubatch: u32,
    pub n_seq_max: u32,
    pub n_threads: i32,
    pub n_threads_batch: i32,
    pub rope_scaling_type: llama_rope_scaling_type,
    pub pooling_type: llama_pooling_type,
    pub attention_type: llama_attention_type,
    pub flash_attn_type: llama_flash_attn_type,
    pub rope_freq_base: f32,
    pub rope_freq_scale: f32,
    pub yarn_ext_factor: f32,
    pub yarn_attn_factor: f32,
    pub yarn_beta_fast: f32,
    pub yarn_beta_slow: f32,
    pub yarn_orig_ctx: u32,
    pub defrag_thold: f32,
    pub embeddings: bool,
    pub offload_kqv: bool,
    pub no_perf: bool,
    pub op_offload: bool,
    pub swa_full: bool,
    pub kv_unified: bool,
    pub type_k: KvCacheType,
    pub type_v: KvCacheType,
}

Core Fields¶

Name	Type	Default	Description
`n_ctx`	`u32`	`0` (model default)	Context window size in tokens. Set to 0 to use model's training context length.
`n_batch`	`u32`	`2048`	Maximum batch size for prompt processing. Larger values use more memory but process prompts faster.
`n_ubatch`	`u32`	`512`	Physical batch size (micro-batch). Internal chunking size for processing.
`n_seq_max`	`u32`	`1`	Maximum number of parallel sequences (for beam search or parallel generation).
`n_threads`	`i32`	CPU count	Number of threads for single-token generation.
`n_threads_batch`	`i32`	CPU count	Number of threads for batch/prompt processing. Can differ from `n_threads`.

Attention and RoPE Fields¶

Name	Type	Default	Description
`flash_attn_type`	`llama_flash_attn_type`	`AUTO`	Flash attention mode (AUTO, DISABLED, ENABLED).
`rope_freq_base`	`f32`	`0.0` (model default)	RoPE base frequency for positional encoding.
`rope_freq_scale`	`f32`	`0.0` (model default)	RoPE frequency scaling factor. Values < 1.0 extend context.
`rope_scaling_type`	`llama_rope_scaling_type`	`UNSPECIFIED`	RoPE scaling strategy (NONE, LINEAR, YARN).
`yarn_ext_factor`	`f32`	`-1.0`	YaRN extension factor. -1.0 uses model default.
`yarn_attn_factor`	`f32`	`1.0`	YaRN attention scaling factor.
`yarn_beta_fast`	`f32`	`32.0`	YaRN fast interpolation beta parameter.
`yarn_beta_slow`	`f32`	`1.0`	YaRN slow interpolation beta parameter.
`yarn_orig_ctx`	`u32`	`0`	YaRN original context size (0 = use model default).
`pooling_type`	`llama_pooling_type`	`UNSPECIFIED`	Pooling type for embedding models.
`attention_type`	`llama_attention_type`	`UNSPECIFIED`	Attention type (CAUSAL, NON_CAUSAL).

Memory and Performance Fields¶

Name	Type	Default	Description
`embeddings`	`bool`	`false`	Enable embedding output mode. Required for generating embeddings.
`offload_kqv`	`bool`	`true`	Offload KQV operations to GPU when GPU layers are used.
`type_k`	`KvCacheType`	`F16`	Key cache quantization type. Lower precision saves memory.
`type_v`	`KvCacheType`	`F16`	Value cache quantization type. Lower precision saves memory.
`defrag_thold`	`f32`	`-1.0`	KV-cache defragmentation threshold (-1.0 = disabled).
`no_perf`	`bool`	`false`	Disable performance counters for slightly faster inference.
`op_offload`	`bool`	`false`	Enable operation offloading to GPU.
`swa_full`	`bool`	`true`	Use full sliding window attention.
`kv_unified`	`bool`	`false`	Use unified KV-cache layout.

KvCacheType¶

Controls KV-cache quantization precision for memory optimization. Lower quantization significantly reduces memory usage at the cost of some quality.

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum KvCacheType {
    F32,     // 2x memory vs F16 (highest precision)
    #[default]
    F16,     // 1x memory (default, best balance)
    BF16,    // 1x memory (alternative 16-bit, better for some hardware)
    Q8_0,    // 0.5x memory (~50% savings, minimal quality loss)
    Q4_0,    // 0.25x memory (~75% savings, may affect quality)
}

Memory Factors¶

Type	Memory Factor	Quality Impact	Use Case
`F32`	2.0x	None	Maximum precision, debugging
`F16`	1.0x (baseline)	None	Default, recommended for most uses
`BF16`	1.0x	Negligible	Alternative for hardware with BF16 support
`Q8_0`	0.5x	Minimal	Large context windows with limited memory
`Q4_0`	0.25x	Noticeable	Extreme memory constraints, shorter contexts

Example with KV-cache quantization:

use mullama::{Context, ContextParams, KvCacheType};

let params = ContextParams {
    n_ctx: 8192,
    type_k: KvCacheType::Q8_0,  // 50% KV-cache memory savings
    type_v: KvCacheType::Q8_0,
    ..Default::default()
};
// For a 7B model with 8192 context, Q8_0 saves ~1GB of KV-cache memory

Flash Attention Types¶

pub enum llama_flash_attn_type {
    LLAMA_FLASH_ATTN_TYPE_AUTO = -1,     // Auto-detect best setting
    LLAMA_FLASH_ATTN_TYPE_DISABLED = 0,  // Disable flash attention
    LLAMA_FLASH_ATTN_TYPE_ENABLED = 1,   // Enable flash attention
}

ContextBuilder¶

Fluent API for context creation with validation.

use mullama::builder::ContextBuilder;

let ctx = ContextBuilder::new(model.clone())
    .context_size(4096)
    .batch_size(512)
    .threads(8)
    .embeddings(true)
    .flash_attention(true)
    .kv_cache_type(KvCacheType::Q8_0)
    .build()
    .await?;

Methods¶

Method	Parameter	Description
`new(model)`	`Arc<Model>`	Create builder with model reference
`context_size(n)`	`u32`	Set context window size
`batch_size(n)`	`u32`	Set batch processing size
`threads(n)`	`i32`	Set thread count for generation
`threads_batch(n)`	`i32`	Set thread count for batch ops
`embeddings(b)`	`bool`	Enable embedding output
`flash_attention(b)`	`bool`	Enable flash attention
`kv_cache_type(t)`	`KvCacheType`	Set KV-cache quantization for both K and V
`rope_freq_base(f)`	`f32`	Set RoPE base frequency
`rope_freq_scale(f)`	`f32`	Set RoPE frequency scale
`build()`	--	Build the context

Generation¶

`generate`¶

Generate text from prompt tokens using default sampling parameters.

pub fn generate(
    &mut self,
    prompt_tokens: &[TokenId],
    max_tokens: usize,
) -> Result<String, MullamaError>

Parameters:

Name	Type	Default	Description
`prompt_tokens`	`&[TokenId]`	--	Tokenized prompt (use `model.tokenize()`)
`max_tokens`	`usize`	--	Maximum number of tokens to generate

Returns: Result<String, MullamaError> -- The generated text (decoded from tokens).

Errors:

MullamaError::GenerationError -- Decode failure or empty prompt
MullamaError::ContextError -- Context overflow (prompt + generation > n_ctx)

Example:

use mullama::{Model, Context, ContextParams};
use std::sync::Arc;

let model = Arc::new(Model::load("model.gguf")?);
let mut ctx = Context::new(model.clone(), ContextParams::default())?;

let tokens = model.tokenize("Once upon a time", true, false)?;
let output = ctx.generate(&tokens, 100)?;
println!("{}", output);

`generate_with_params`¶

Generate text with custom sampling parameters for control over temperature, top-k, top-p, and penalties.

pub fn generate_with_params(
    &mut self,
    prompt_tokens: &[TokenId],
    max_tokens: usize,
    sampler_params: &SamplerParams,
) -> Result<String, MullamaError>

Parameters:

Name	Type	Default	Description
`prompt_tokens`	`&[TokenId]`	--	Tokenized prompt
`max_tokens`	`usize`	--	Maximum tokens to generate
`sampler_params`	`&SamplerParams`	--	Custom sampling configuration

Example:

use mullama::SamplerParams;

let params = SamplerParams {
    temperature: 0.7,
    top_k: 50,
    top_p: 0.9,
    penalty_repeat: 1.1,
    ..Default::default()
};

let output = ctx.generate_with_params(&tokens, 100, &params)?;

Token-Level Operations¶

`decode`¶

Process a batch of tokens through the model. Automatically handles chunking if the token count exceeds n_batch.

pub fn decode(&mut self, tokens: &[TokenId]) -> Result<(), MullamaError>

Parameters:

Name	Type	Default	Description
`tokens`	`&[TokenId]`	--	Tokens to process through the model

Errors: MullamaError::GenerationError -- Decode failure.

Example:

let tokens = model.tokenize("Hello, world!", true, false)?;
ctx.decode(&tokens)?;
// Logits are now available for sampling

`decode_single`¶

Optimized single-token decode that avoids heap allocation. Used in generation loops for maximum performance.

pub fn decode_single(&mut self, token: TokenId) -> Result<(), MullamaError>

Example:

// During generation loop, decode one token at a time
ctx.decode_single(next_token)?;

`eval`¶

Lower-level token evaluation with explicit position tracking.

pub fn eval(&mut self, tokens: &[TokenId], n_past: i32) -> Result<(), MullamaError>

Parameters:

Name	Type	Default	Description
`tokens`	`&[TokenId]`	--	Tokens to evaluate
`n_past`	`i32`	--	Number of previously processed tokens (KV-cache position)

KV-Cache Management¶

`clear_cache`¶

Clear the entire KV-cache, resetting the context state. Use this to start a new conversation or prompt.

pub fn clear_cache(&mut self)

`kv_cache_seq_rm`¶

Remove tokens from the KV-cache for a specific sequence. Useful for implementing sliding window or trimming old context.

pub fn kv_cache_seq_rm(
    &mut self,
    seq_id: i32,
    p0: i32,
    p1: i32,
) -> bool

Parameters:

Name	Type	Default	Description
`seq_id`	`i32`	--	Sequence ID (-1 for all sequences)
`p0`	`i32`	--	Start position inclusive (-1 for beginning)
`p1`	`i32`	--	End position exclusive (-1 for end)

Returns: bool -- Whether the operation succeeded.

`kv_cache_seq_cp`¶

Copy a sequence in the KV-cache. Useful for beam search or branching conversations.

pub fn kv_cache_seq_cp(
    &mut self,
    seq_id_src: i32,
    seq_id_dst: i32,
    p0: i32,
    p1: i32,
)

`kv_cache_seq_shift`¶

Shift token positions in the KV-cache. Used for implementing context window sliding.

pub fn kv_cache_seq_shift(
    &mut self,
    seq_id: i32,
    p0: i32,
    p1: i32,
    delta: i32,
)

Parameters:

Name	Type	Default	Description
`seq_id`	`i32`	--	Sequence ID
`p0`	`i32`	--	Start position
`p1`	`i32`	--	End position
`delta`	`i32`	--	Position shift amount (negative to shift left)

Logits Access¶

`get_logits`¶

Get the logits array for all tokens in the last decoded batch.

pub fn get_logits(&self) -> &[f32]

Returns: Slice of logit values with length n_vocab.

`get_logits_ith`¶

Get logits for a specific token position in the batch.

pub fn get_logits_ith(&self, i: i32) -> &[f32]

Parameters:

Name	Type	Default	Description
`i`	`i32`	--	Token index in the batch (-1 for last token)

Properties¶

`n_ctx`¶

Get the context window size (number of tokens this context can hold).

pub fn n_ctx(&self) -> u32

`n_batch`¶

Get the configured batch size.

pub fn n_batch(&self) -> u32

Thread Safety Notes¶

Context is NOT Send

Context holds a raw pointer to the llama.cpp context and is not safe to send between threads. For multi-threaded applications:

Create one Context per thread from a shared Arc<Model>
Use AsyncContext (feature: async) for non-blocking operations
Never share a Context reference across thread boundaries

// CORRECT: Each thread creates its own context
let model = Arc::new(Model::load("model.gguf")?);

let handle = std::thread::spawn({
    let model = model.clone();
    move || {
        let mut ctx = Context::new(model, ContextParams::default()).unwrap();
        ctx.generate(&[1, 2, 3], 50)
    }
});

// INCORRECT: This will not compile
// let ctx = Context::new(model.clone(), ContextParams::default())?;
// std::thread::spawn(move || ctx.generate(&[1], 10)); // Error: Context is !Send

Complete Generation Example¶

use mullama::{Model, Context, ContextParams, SamplerParams, SamplerChain};
use std::sync::Arc;

fn main() -> Result<(), mullama::MullamaError> {
    // Load model
    let model = Arc::new(Model::load("model.gguf")?);

    // Create context with custom parameters
    let params = ContextParams {
        n_ctx: 2048,
        n_batch: 512,
        n_threads: 8,
        type_k: KvCacheType::Q8_0,
        type_v: KvCacheType::Q8_0,
        ..Default::default()
    };
    let mut ctx = Context::new(model.clone(), params)?;

    // Configure sampling
    let sampler_params = SamplerParams {
        temperature: 0.7,
        top_k: 40,
        top_p: 0.9,
        penalty_repeat: 1.1,
        ..Default::default()
    };
    let mut sampler = sampler_params.build_chain(model.clone())?;

    // Tokenize prompt
    let prompt = "The meaning of life is";
    let tokens = model.tokenize(prompt, true, false)?;

    // Process prompt
    ctx.decode(&tokens)?;

    // Generate tokens one at a time
    print!("{}", prompt);
    for _ in 0..100 {
        let next_token = sampler.sample(&mut ctx, -1);
        sampler.accept(next_token);

        if model.token_is_eog(next_token) {
            break;
        }

        let text = model.token_to_str(next_token, 0, false)?;
        print!("{}", text);

        ctx.decode_single(next_token)?;
    }
    println!();

    Ok(())
}

Context API¶

Context Struct¶

Creating Contexts¶

Context::new¶

ContextParams¶

Core Fields¶

Attention and RoPE Fields¶

Memory and Performance Fields¶

KvCacheType¶

Memory Factors¶

Flash Attention Types¶

ContextBuilder¶

Methods¶

Generation¶

generate¶

generate_with_params¶

Token-Level Operations¶

decode¶

decode_single¶

eval¶

KV-Cache Management¶

clear_cache¶

kv_cache_seq_rm¶

kv_cache_seq_cp¶

kv_cache_seq_shift¶

Logits Access¶

get_logits¶

get_logits_ith¶

Properties¶

n_ctx¶

n_batch¶

Thread Safety Notes¶

Complete Generation Example¶

`Context::new`¶

`generate`¶

`generate_with_params`¶

`decode`¶

`decode_single`¶

`eval`¶

`clear_cache`¶

`kv_cache_seq_rm`¶

`kv_cache_seq_cp`¶

`kv_cache_seq_shift`¶

`get_logits`¶

`get_logits_ith`¶

`n_ctx`¶

`n_batch`¶