Model API¶

The Model struct is the foundational type in Mullama, representing a loaded GGUF language model. Models are reference-counted and thread-safe, designed to be shared across multiple contexts and threads.

Model Struct¶

/// Represents a loaded LLM model.
///
/// Models are reference-counted via Arc and can be safely cloned and shared
/// across threads. The underlying C++ model is freed when the last reference
/// is dropped.
#[derive(Debug, Clone)]
pub struct Model {
    inner: Arc<ModelInner>,
}

Thread Safety: Model implements Send + Sync and uses Arc internally. Cloning a Model is a cheap reference count increment -- the underlying model data is shared.

Node.js / Python Equivalents

Node.js: const model = await Model.load("model.gguf")
Python: model = Model.load("model.gguf")

Loading Models¶

`Model::load`¶

Load a model from a GGUF file with default parameters.

pub fn load(path: impl AsRef<Path>) -> Result<Self, MullamaError>

Parameters:

Name	Type	Default	Description
`path`	`impl AsRef<Path>`	--	Path to the GGUF model file

Returns: Result<Model, MullamaError>

Errors:

MullamaError::ModelLoadError -- File not found, invalid GGUF format, or insufficient memory
MullamaError::IoError -- Filesystem access failure

Example:

use mullama::Model;

let model = Model::load("models/llama-7b.gguf")?;
println!("Loaded model with {} parameters", model.n_params());
println!("Vocabulary size: {}", model.n_vocab());
println!("Embedding dimension: {}", model.n_embd());

`Model::load_with_params`¶

Load a model with custom parameters for fine-grained control over GPU offloading, memory mapping, and other options.

pub fn load_with_params(
    path: impl AsRef<Path>,
    params: ModelParams,
) -> Result<Self, MullamaError>

Parameters:

Name	Type	Default	Description
`path`	`impl AsRef<Path>`	--	Path to the GGUF model file
`params`	`ModelParams`	--	Custom loading parameters

Returns: Result<Model, MullamaError>

Errors:

MullamaError::ModelLoadError -- Loading failure due to parameters or file issues
MullamaError::GpuError -- GPU layer offloading failure
MullamaError::IoError -- Filesystem access failure

Example:

use mullama::{Model, ModelParams};

let params = ModelParams {
    n_gpu_layers: 32,
    use_mmap: true,
    use_mlock: false,
    ..Default::default()
};

let model = Model::load_with_params("models/llama-7b.gguf", params)?;

ModelParams¶

Configuration for model loading behavior.

#[derive(Debug, Clone)]
pub struct ModelParams {
    pub n_gpu_layers: i32,
    pub split_mode: llama_split_mode,
    pub main_gpu: i32,
    pub tensor_split: Vec<f32>,
    pub vocab_only: bool,
    pub use_mmap: bool,
    pub use_mlock: bool,
    pub check_tensors: bool,
    pub use_extra_bufts: bool,
    pub kv_overrides: Vec<ModelKvOverride>,
    pub progress_callback: Option<fn(f32) -> bool>,
}

Fields¶

Name	Type	Default	Description
`n_gpu_layers`	`i32`	`0`	Number of layers to offload to GPU. Set to `999` to offload all layers.
`split_mode`	`llama_split_mode`	`NONE`	How to split the model across multiple GPUs (NONE, LAYER, ROW)
`main_gpu`	`i32`	`0`	Main GPU device index for single-GPU or primary device
`tensor_split`	`Vec<f32>`	`[]`	Proportion of model to assign to each GPU (must sum to 1.0)
`vocab_only`	`bool`	`false`	Load only vocabulary (for tokenization-only use cases)
`use_mmap`	`bool`	`true`	Use memory-mapped file I/O for faster loading and lower RAM usage
`use_mlock`	`bool`	`false`	Lock model memory to prevent swapping to disk
`check_tensors`	`bool`	`true`	Validate tensor data integrity on load (slight overhead)
`use_extra_bufts`	`bool`	`false`	Use extra buffer types for specialized backends
`kv_overrides`	`Vec<ModelKvOverride>`	`[]`	Override model metadata key-value pairs
`progress_callback`	`Option<fn(f32) -> bool>`	`None`	Progress callback during loading (return `false` to cancel)

GPU Layer Offloading

When n_gpu_layers is 0, the split_mode and main_gpu fields are ignored to avoid "invalid value for main_gpu" errors when no GPU is available. Set n_gpu_layers to a positive number to enable GPU acceleration.

ModelKvOverride¶

Override model metadata values at load time. This allows customizing model behavior without modifying the GGUF file.

#[derive(Debug, Clone)]
pub struct ModelKvOverride {
    pub key: String,
    pub value: ModelKvOverrideValue,
}

#[derive(Debug, Clone)]
pub enum ModelKvOverrideValue {
    Int(i64),
    Float(f64),
    Bool(bool),
    Str(String),
}

Example:

use mullama::{Model, ModelParams, ModelKvOverride, ModelKvOverrideValue};

let params = ModelParams {
    kv_overrides: vec![
        ModelKvOverride {
            key: "tokenizer.chat_template".to_string(),
            value: ModelKvOverrideValue::Str("custom template".to_string()),
        },
    ],
    ..Default::default()
};

let model = Model::load_with_params("model.gguf", params)?;

ModelBuilder¶

Fluent API for model configuration with validation and sensible defaults.

use mullama::builder::ModelBuilder;

let model = ModelBuilder::new()
    .path("models/llama-7b.gguf")
    .gpu_layers(32)
    .context_size(4096)
    .memory_mapping(true)
    .build()
    .await?;

Methods¶

Method	Parameter	Description
`new()`	--	Create a new builder
`path(p)`	`impl Into<String>`	Set model file path (required)
`gpu_layers(n)`	`i32`	Set GPU layer offload count
`context_size(n)`	`u32`	Set context size in tokens
`memory_mapping(b)`	`bool`	Enable/disable mmap
`memory_lock(b)`	`bool`	Enable/disable mlock
`vocab_only(b)`	`bool`	Load vocabulary only
`check_tensors(b)`	`bool`	Enable tensor validation
`progress_callback(f)`	`fn(f32) -> bool`	Set loading progress callback
`build()`	--	Build and load the model

Model Metadata Methods¶

Access model architecture and configuration information. These methods are safe to call from any thread.

`n_params`¶

pub fn n_params(&self) -> u64

Returns the total number of parameters in the model.

`n_vocab` / `vocab_size`¶

pub fn n_vocab(&self) -> usize

Returns the vocabulary size (number of unique tokens). Aliased as vocab_size().

`n_embd`¶

pub fn n_embd(&self) -> u32

Returns the embedding dimension (hidden size).

`n_layer`¶

pub fn n_layer(&self) -> u32

Returns the number of transformer layers.

`n_head`¶

pub fn n_head(&self) -> usize

Returns the number of attention heads.

`n_ctx_train`¶

pub fn n_ctx_train(&self) -> u32

Returns the training context length (maximum sequence length the model was trained with).

`model_type`¶

pub fn model_type(&self) -> String

Returns the model architecture type (e.g., "llama", "mistral", "phi").

`model_desc`¶

pub fn model_desc(&self) -> String

Returns a human-readable model description including parameter count and quantization.

Example:

use mullama::Model;

let model = Model::load("model.gguf")?;
println!("Parameters: {}", model.n_params());
println!("Vocab size: {}", model.n_vocab());
println!("Embedding dim: {}", model.n_embd());
println!("Layers: {}", model.n_layer());
println!("Heads: {}", model.n_head());
println!("Training ctx: {}", model.n_ctx_train());
println!("Type: {}", model.model_type());
println!("Description: {}", model.model_desc());

Tokenization¶

`tokenize`¶

Convert text to a sequence of token IDs.

pub fn tokenize(
    &self,
    text: &str,
    add_bos: bool,
    special: bool,
) -> Result<Vec<TokenId>, MullamaError>

Parameters:

Name	Type	Default	Description
`text`	`&str`	--	Input text to tokenize
`add_bos`	`bool`	--	Prepend beginning-of-sequence token
`special`	`bool`	--	Parse special tokens (e.g., `<\|system\|>`) in the text

Returns: Result<Vec<TokenId>, MullamaError>

Errors: MullamaError::TokenizationError -- Invalid text or internal tokenizer failure.

Example:

let tokens = model.tokenize("Hello, world!", true, false)?;
println!("Token count: {}", tokens.len());
println!("Tokens: {:?}", tokens);

`detokenize`¶

Convert a sequence of token IDs back to text.

pub fn detokenize(
    &self,
    tokens: &[TokenId],
    remove_special: bool,
    unparse_special: bool,
) -> Result<String, MullamaError>

Parameters:

Name	Type	Default	Description
`tokens`	`&[TokenId]`	--	Token IDs to convert back to text
`remove_special`	`bool`	--	Skip control/special tokens in output
`unparse_special`	`bool`	--	Render special tokens as their text markers

Returns: Result<String, MullamaError>

SentencePiece Leading Space

SentencePiece-based tokenizers include a leading space marker on the first token. If you need exact round-trip fidelity for text that did not start with a space, you may need to strip the leading space from the result.

Example:

let tokens = model.tokenize("Hello, world!", true, false)?;
let text = model.detokenize(&tokens, true, false)?;
assert_eq!(text.trim(), "Hello, world!");

`token_to_str`¶

Convert a single token ID to its text representation.

pub fn token_to_str(
    &self,
    token: TokenId,
    lstrip: i32,
    special: bool,
) -> Result<String, MullamaError>

Parameters:

Name	Type	Default	Description
`token`	`TokenId`	--	Token ID to convert
`lstrip`	`i32`	--	Number of leading spaces to strip (for SentencePiece)
`special`	`bool`	--	Render special tokens as text

Returns: Result<String, MullamaError>

Vocabulary Access¶

`vocab_info`¶

Get information about the model's vocabulary.

pub fn vocab_info(&self) -> VocabInfo

`vocab_get_text`¶

Get the text representation of a token by ID.

pub fn vocab_get_text(&self, token: TokenId) -> Option<&str>

`vocab_get_score`¶

Get the score/probability of a token in the vocabulary.

pub fn vocab_get_score(&self, token: TokenId) -> f32

Special Tokens¶

Access the model's special token IDs.

/// Beginning-of-sequence token
pub fn bos_token(&self) -> TokenId

/// End-of-sequence token
pub fn eos_token(&self) -> TokenId

/// Check if a token is end-of-generation
pub fn token_is_eog(&self, token: TokenId) -> bool

/// Check if a token is a control token
pub fn token_is_control(&self, token: TokenId) -> bool

Example:

let bos = model.bos_token();
let eos = model.eos_token();
println!("BOS token ID: {}", bos);
println!("EOS token ID: {}", eos);

// Check during generation
if model.token_is_eog(generated_token) {
    println!("Generation complete");
}

Chat Templates¶

`apply_chat_template`¶

Apply the model's built-in chat template to format messages for instruction-tuned models.

pub fn apply_chat_template(
    &self,
    messages: &[ChatMessage],
    add_generation_prompt: bool,
) -> Result<String, MullamaError>

Parameters:

Name	Type	Default	Description
`messages`	`&[ChatMessage]`	--	Conversation messages with roles
`add_generation_prompt`	`bool`	--	Append the assistant generation prompt marker

Example:

use mullama::{Model, ChatMessage};

let model = Model::load("model.gguf")?;

let messages = vec![
    ChatMessage { role: "system".into(), content: "You are helpful.".into() },
    ChatMessage { role: "user".into(), content: "Hello!".into() },
];

let prompt = model.apply_chat_template(&messages, true)?;
println!("{}", prompt);

Context Creation¶

`create_context`¶

Convenience method to create an inference context from this model.

pub fn create_context(&self, params: ContextParams) -> Result<Context, MullamaError>

Example:

use mullama::{Model, ContextParams};

let model = Model::load("model.gguf")?;
let ctx = model.create_context(ContextParams::default())?;

Thread Safety¶

Model is designed for safe concurrent access:

use mullama::Model;
use std::sync::Arc;
use std::thread;

let model = Arc::new(Model::load("model.gguf")?);

// Share across threads safely
let model_clone = model.clone(); // Cheap Arc clone
thread::spawn(move || {
    let tokens = model_clone.tokenize("Hello", true, false).unwrap();
    println!("Tokens: {:?}", tokens);
});

Clone -- Cheap reference count increment (Arc-based)
Send -- Can be moved between threads
Sync -- Can be shared between threads via &Model

Memory Considerations¶

Aspect	Behavior
Model size	Determined by GGUF file size and quantization level
mmap mode	Model stays on disk, pages loaded on demand (lower RSS)
mlock mode	Entire model locked in RAM (prevents swapping, higher RSS)
GPU offload	Layers on GPU consume VRAM, reduce RAM usage proportionally
Drop behavior	Model freed when last `Arc` reference is dropped

Memory-Mapped Loading

With use_mmap: true (the default), the OS memory-maps the model file. This means the model loads almost instantly and the OS manages which pages are resident in RAM. For production deployments with limited RAM, this is the recommended approach.

Model API¶

Model Struct¶

Loading Models¶

Model::load¶

Model::load_with_params¶

ModelParams¶

Fields¶

ModelKvOverride¶

ModelBuilder¶

Methods¶

Model Metadata Methods¶

n_params¶

n_vocab / vocab_size¶

n_embd¶

n_layer¶

n_head¶

n_ctx_train¶

model_type¶

model_desc¶

Tokenization¶

tokenize¶

detokenize¶

token_to_str¶

Vocabulary Access¶

vocab_info¶

vocab_get_text¶

vocab_get_score¶

Special Tokens¶

Chat Templates¶

apply_chat_template¶

Context Creation¶

create_context¶

Thread Safety¶

Memory Considerations¶

`Model::load`¶

`Model::load_with_params`¶

`n_params`¶

`n_vocab` / `vocab_size`¶

`n_embd`¶

`n_layer`¶

`n_head`¶

`n_ctx_train`¶

`model_type`¶

`model_desc`¶

`tokenize`¶

`detokenize`¶

`token_to_str`¶

`vocab_info`¶

`vocab_get_text`¶

`vocab_get_score`¶

`apply_chat_template`¶

`create_context`¶