Batch API¶

The Batch struct provides efficient multi-token processing for prompt evaluation and parallel sequence handling. It uses SmallVec optimization to avoid heap allocation for small batches (32 tokens or fewer).

Batch Struct¶

/// Represents a batch of tokens for processing.
///
/// Uses SmallVec (via TokenBuffer) for token storage, providing:
/// - Stack allocation for batches up to 32 tokens (no heap allocation)
/// - Transparent heap fallback for larger batches
/// - 5-10% faster for typical small batch operations
pub struct Batch {
    inner: Option<llama_batch>,
    tokens_storage: Option<TokenBuffer>,
    needs_free: bool,
}

SmallVec Optimization

For batches of 32 tokens or fewer, Batch uses stack-allocated storage via SmallVec<[TokenId; 32]>, completely avoiding heap allocation. This is a Rust-exclusive optimization that provides 5-10% faster processing for typical interactive generation where single tokens are decoded one at a time.

Creating Batches¶

`Batch::new`¶

Create a new batch with pre-allocated memory for a maximum number of tokens.

pub fn new(max_tokens: usize, embd: i32, max_seq: usize) -> Self

Parameters:

Name	Type	Default	Description
`max_tokens`	`usize`	--	Maximum number of tokens the batch can hold
`embd`	`i32`	--	Embedding dimension (0 for token-based batches)
`max_seq`	`usize`	--	Maximum number of parallel sequences

Example:

use mullama::batch::Batch;

// Create batch for up to 512 tokens, 1 sequence
let batch = Batch::new(512, 0, 1);

// Create batch supporting 4 parallel sequences
let batch = Batch::new(2048, 0, 4);

`Batch::from_tokens`¶

Create a batch from a token slice. Uses stack allocation for 32 tokens or fewer.

pub fn from_tokens(tokens: &[TokenId]) -> Self

Parameters:

Name	Type	Default	Description
`tokens`	`&[TokenId]`	--	Slice of token IDs to process

Example:

use mullama::batch::Batch;

let tokens = vec![1, 2, 3, 4, 5];
let batch = Batch::from_tokens(&tokens);
assert_eq!(batch.len(), 5);
// Stack-allocated since len <= 32

`Batch::from_tokens_owned`¶

Create a batch from an owned Vec<TokenId>, avoiding a copy when you already have ownership.

pub fn from_tokens_owned(tokens: Vec<TokenId>) -> Self

Example:

let tokens = model.tokenize("Hello, world!", true, false)?;
let batch = Batch::from_tokens_owned(tokens);

`Batch::from_token_buffer`¶

Create a batch from a TokenBuffer (SmallVec-based storage).

pub fn from_token_buffer(tokens: TokenBuffer) -> Self

Methods¶

`len`¶

Get the number of tokens currently in the batch.

pub fn len(&self) -> usize

`is_empty`¶

Check if the batch contains no tokens.

pub fn is_empty(&self) -> bool

`get_llama_batch`¶

Get a reference to the internal llama_batch struct for advanced/FFI use.

pub fn get_llama_batch(&self) -> Option<&llama_batch>

Default¶

The default batch is created with 512 max tokens, 0 embedding dimensions, and 1 sequence:

impl Default for Batch {
    fn default() -> Self {
        Self::new(512, 0, 1)
    }
}

Multi-Sequence Support¶

Batches can handle multiple parallel sequences for use cases like beam search, speculative decoding, or parallel generation:

use mullama::batch::Batch;

// Create batch supporting up to 4 parallel sequences
let batch = Batch::new(2048, 0, 4);

When using multi-sequence batches, each token can be assigned to one or more sequence IDs, enabling independent generation streams within a single context.

Usage with Context::decode¶

The primary use of Batch is with Context::decode() for processing tokens:

use mullama::{Model, Context, ContextParams, batch::Batch};
use std::sync::Arc;

let model = Arc::new(Model::load("model.gguf")?);
let mut ctx = Context::new(model.clone(), ContextParams::default())?;

// Tokenize prompt
let tokens = model.tokenize("Hello, world!", true, false)?;

// Method 1: Direct decode (creates batch internally)
ctx.decode(&tokens)?;

// Method 2: Manual batch for more control
let batch = Batch::from_tokens(&tokens);
// The batch is used internally by Context operations

Memory Management¶

Batch implements Drop for automatic cleanup:

Batches created with Batch::new() (via llama_batch_init) allocate internal memory that is freed on drop
Batches created with from_tokens / from_token_buffer (via llama_batch_get_one) only manage the TokenBuffer storage

{
    let batch = Batch::new(1024, 0, 1); // Allocates internal C memory
    // ... use batch ...
} // Automatically freed here via llama_batch_free

{
    let batch = Batch::from_tokens(&[1, 2, 3]); // Stack-allocated for small sizes
    // ... use batch ...
} // TokenBuffer dropped, no llama_batch_free needed

Memory Behavior¶

Creation Method	Storage	Allocation	Drop Behavior
`Batch::new(n, 0, s)`	C heap	Always heap	Calls `llama_batch_free`
`from_tokens` (<=32)	Rust stack	No allocation	Stack unwind only
`from_tokens` (>32)	Rust heap	Vec allocation	Vec dropped
`from_tokens_owned`	Rust heap	Uses existing Vec	Vec dropped

Performance Considerations¶

Scenario	Recommendation	Reason
Single token decode	Use `Context::decode_single()`	Avoids all allocation
Small prompt (1-32 tokens)	Use `Batch::from_tokens()`	Stays entirely on stack
Medium prompt (33-512 tokens)	Use `Batch::from_tokens()`	Heap fallback is fast
Large prompt (>512 tokens)	Use `Context::decode()`	Handles chunking automatically
Multiple sequences	Use `Batch::new(n, 0, seq_count)`	Explicit capacity for sequences

Generation Loop Performance

During token-by-token generation, always use Context::decode_single() rather than creating a Batch for each token. The single-token path avoids all allocation overhead and is the fastest way to process generated tokens.

Complete Example¶

use mullama::{Model, Context, ContextParams, SamplerParams, batch::Batch};
use std::sync::Arc;

fn main() -> Result<(), mullama::MullamaError> {
    let model = Arc::new(Model::load("model.gguf")?);
    let mut ctx = Context::new(model.clone(), ContextParams::default())?;

    // Tokenize and process prompt
    let prompt_tokens = model.tokenize("The capital of France is", true, false)?;

    // For typical use, just call decode directly:
    ctx.decode(&prompt_tokens)?;

    // For generation, decode tokens one at a time using the optimized path:
    let mut sampler = SamplerParams::default().build_chain(model.clone())?;

    for _ in 0..50 {
        let token = sampler.sample(&mut ctx, -1);
        sampler.accept(token);

        if model.token_is_eog(token) {
            break;
        }

        // Single-token decode avoids batch allocation entirely
        ctx.decode_single(token)?;

        let text = model.token_to_str(token, 0, false)?;
        print!("{}", text);
    }
    println!();

    Ok(())
}

Batch API¶

Batch Struct¶

Creating Batches¶

Batch::new¶

Batch::from_tokens¶

Batch::from_tokens_owned¶

Batch::from_token_buffer¶

Methods¶

len¶

is_empty¶

get_llama_batch¶

Default¶

Multi-Sequence Support¶

Usage with Context::decode¶

Memory Management¶

Memory Behavior¶

Performance Considerations¶

Complete Example¶

`Batch::new`¶

`Batch::from_tokens`¶

`Batch::from_tokens_owned`¶

`Batch::from_token_buffer`¶

`len`¶

`is_empty`¶

`get_llama_batch`¶