Model API¶
The Model struct is the foundational type in Mullama, representing a loaded GGUF language model. Models are reference-counted and thread-safe, designed to be shared across multiple contexts and threads.
Model Struct¶
/// Represents a loaded LLM model.
///
/// Models are reference-counted via Arc and can be safely cloned and shared
/// across threads. The underlying C++ model is freed when the last reference
/// is dropped.
#[derive(Debug, Clone)]
pub struct Model {
inner: Arc<ModelInner>,
}
Thread Safety: Model implements Send + Sync and uses Arc internally. Cloning a Model is a cheap reference count increment -- the underlying model data is shared.
Node.js / Python Equivalents
- Node.js:
const model = await Model.load("model.gguf") - Python:
model = Model.load("model.gguf")
Loading Models¶
Model::load¶
Load a model from a GGUF file with default parameters.
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
path |
impl AsRef<Path> |
-- | Path to the GGUF model file |
Returns: Result<Model, MullamaError>
Errors:
MullamaError::ModelLoadError-- File not found, invalid GGUF format, or insufficient memoryMullamaError::IoError-- Filesystem access failure
Example:
use mullama::Model;
let model = Model::load("models/llama-7b.gguf")?;
println!("Loaded model with {} parameters", model.n_params());
println!("Vocabulary size: {}", model.n_vocab());
println!("Embedding dimension: {}", model.n_embd());
Model::load_with_params¶
Load a model with custom parameters for fine-grained control over GPU offloading, memory mapping, and other options.
pub fn load_with_params(
path: impl AsRef<Path>,
params: ModelParams,
) -> Result<Self, MullamaError>
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
path |
impl AsRef<Path> |
-- | Path to the GGUF model file |
params |
ModelParams |
-- | Custom loading parameters |
Returns: Result<Model, MullamaError>
Errors:
MullamaError::ModelLoadError-- Loading failure due to parameters or file issuesMullamaError::GpuError-- GPU layer offloading failureMullamaError::IoError-- Filesystem access failure
Example:
use mullama::{Model, ModelParams};
let params = ModelParams {
n_gpu_layers: 32,
use_mmap: true,
use_mlock: false,
..Default::default()
};
let model = Model::load_with_params("models/llama-7b.gguf", params)?;
ModelParams¶
Configuration for model loading behavior.
#[derive(Debug, Clone)]
pub struct ModelParams {
pub n_gpu_layers: i32,
pub split_mode: llama_split_mode,
pub main_gpu: i32,
pub tensor_split: Vec<f32>,
pub vocab_only: bool,
pub use_mmap: bool,
pub use_mlock: bool,
pub check_tensors: bool,
pub use_extra_bufts: bool,
pub kv_overrides: Vec<ModelKvOverride>,
pub progress_callback: Option<fn(f32) -> bool>,
}
Fields¶
| Name | Type | Default | Description |
|---|---|---|---|
n_gpu_layers |
i32 |
0 |
Number of layers to offload to GPU. Set to 999 to offload all layers. |
split_mode |
llama_split_mode |
NONE |
How to split the model across multiple GPUs (NONE, LAYER, ROW) |
main_gpu |
i32 |
0 |
Main GPU device index for single-GPU or primary device |
tensor_split |
Vec<f32> |
[] |
Proportion of model to assign to each GPU (must sum to 1.0) |
vocab_only |
bool |
false |
Load only vocabulary (for tokenization-only use cases) |
use_mmap |
bool |
true |
Use memory-mapped file I/O for faster loading and lower RAM usage |
use_mlock |
bool |
false |
Lock model memory to prevent swapping to disk |
check_tensors |
bool |
true |
Validate tensor data integrity on load (slight overhead) |
use_extra_bufts |
bool |
false |
Use extra buffer types for specialized backends |
kv_overrides |
Vec<ModelKvOverride> |
[] |
Override model metadata key-value pairs |
progress_callback |
Option<fn(f32) -> bool> |
None |
Progress callback during loading (return false to cancel) |
GPU Layer Offloading
When n_gpu_layers is 0, the split_mode and main_gpu fields are ignored to avoid "invalid value for main_gpu" errors when no GPU is available. Set n_gpu_layers to a positive number to enable GPU acceleration.
ModelKvOverride¶
Override model metadata values at load time. This allows customizing model behavior without modifying the GGUF file.
#[derive(Debug, Clone)]
pub struct ModelKvOverride {
pub key: String,
pub value: ModelKvOverrideValue,
}
#[derive(Debug, Clone)]
pub enum ModelKvOverrideValue {
Int(i64),
Float(f64),
Bool(bool),
Str(String),
}
Example:
use mullama::{Model, ModelParams, ModelKvOverride, ModelKvOverrideValue};
let params = ModelParams {
kv_overrides: vec![
ModelKvOverride {
key: "tokenizer.chat_template".to_string(),
value: ModelKvOverrideValue::Str("custom template".to_string()),
},
],
..Default::default()
};
let model = Model::load_with_params("model.gguf", params)?;
ModelBuilder¶
Fluent API for model configuration with validation and sensible defaults.
use mullama::builder::ModelBuilder;
let model = ModelBuilder::new()
.path("models/llama-7b.gguf")
.gpu_layers(32)
.context_size(4096)
.memory_mapping(true)
.build()
.await?;
Methods¶
| Method | Parameter | Description |
|---|---|---|
new() |
-- | Create a new builder |
path(p) |
impl Into<String> |
Set model file path (required) |
gpu_layers(n) |
i32 |
Set GPU layer offload count |
context_size(n) |
u32 |
Set context size in tokens |
memory_mapping(b) |
bool |
Enable/disable mmap |
memory_lock(b) |
bool |
Enable/disable mlock |
vocab_only(b) |
bool |
Load vocabulary only |
check_tensors(b) |
bool |
Enable tensor validation |
progress_callback(f) |
fn(f32) -> bool |
Set loading progress callback |
build() |
-- | Build and load the model |
Model Metadata Methods¶
Access model architecture and configuration information. These methods are safe to call from any thread.
n_params¶
Returns the total number of parameters in the model.
n_vocab / vocab_size¶
Returns the vocabulary size (number of unique tokens). Aliased as vocab_size().
n_embd¶
Returns the embedding dimension (hidden size).
n_layer¶
Returns the number of transformer layers.
n_head¶
Returns the number of attention heads.
n_ctx_train¶
Returns the training context length (maximum sequence length the model was trained with).
model_type¶
Returns the model architecture type (e.g., "llama", "mistral", "phi").
model_desc¶
Returns a human-readable model description including parameter count and quantization.
Example:
use mullama::Model;
let model = Model::load("model.gguf")?;
println!("Parameters: {}", model.n_params());
println!("Vocab size: {}", model.n_vocab());
println!("Embedding dim: {}", model.n_embd());
println!("Layers: {}", model.n_layer());
println!("Heads: {}", model.n_head());
println!("Training ctx: {}", model.n_ctx_train());
println!("Type: {}", model.model_type());
println!("Description: {}", model.model_desc());
Tokenization¶
tokenize¶
Convert text to a sequence of token IDs.
pub fn tokenize(
&self,
text: &str,
add_bos: bool,
special: bool,
) -> Result<Vec<TokenId>, MullamaError>
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
text |
&str |
-- | Input text to tokenize |
add_bos |
bool |
-- | Prepend beginning-of-sequence token |
special |
bool |
-- | Parse special tokens (e.g., <|system|>) in the text |
Returns: Result<Vec<TokenId>, MullamaError>
Errors: MullamaError::TokenizationError -- Invalid text or internal tokenizer failure.
Example:
let tokens = model.tokenize("Hello, world!", true, false)?;
println!("Token count: {}", tokens.len());
println!("Tokens: {:?}", tokens);
detokenize¶
Convert a sequence of token IDs back to text.
pub fn detokenize(
&self,
tokens: &[TokenId],
remove_special: bool,
unparse_special: bool,
) -> Result<String, MullamaError>
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
tokens |
&[TokenId] |
-- | Token IDs to convert back to text |
remove_special |
bool |
-- | Skip control/special tokens in output |
unparse_special |
bool |
-- | Render special tokens as their text markers |
Returns: Result<String, MullamaError>
SentencePiece Leading Space
SentencePiece-based tokenizers include a leading space marker on the first token. If you need exact round-trip fidelity for text that did not start with a space, you may need to strip the leading space from the result.
Example:
let tokens = model.tokenize("Hello, world!", true, false)?;
let text = model.detokenize(&tokens, true, false)?;
assert_eq!(text.trim(), "Hello, world!");
token_to_str¶
Convert a single token ID to its text representation.
pub fn token_to_str(
&self,
token: TokenId,
lstrip: i32,
special: bool,
) -> Result<String, MullamaError>
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
token |
TokenId |
-- | Token ID to convert |
lstrip |
i32 |
-- | Number of leading spaces to strip (for SentencePiece) |
special |
bool |
-- | Render special tokens as text |
Returns: Result<String, MullamaError>
Vocabulary Access¶
vocab_info¶
Get information about the model's vocabulary.
vocab_get_text¶
Get the text representation of a token by ID.
vocab_get_score¶
Get the score/probability of a token in the vocabulary.
Special Tokens¶
Access the model's special token IDs.
/// Beginning-of-sequence token
pub fn bos_token(&self) -> TokenId
/// End-of-sequence token
pub fn eos_token(&self) -> TokenId
/// Check if a token is end-of-generation
pub fn token_is_eog(&self, token: TokenId) -> bool
/// Check if a token is a control token
pub fn token_is_control(&self, token: TokenId) -> bool
Example:
let bos = model.bos_token();
let eos = model.eos_token();
println!("BOS token ID: {}", bos);
println!("EOS token ID: {}", eos);
// Check during generation
if model.token_is_eog(generated_token) {
println!("Generation complete");
}
Chat Templates¶
apply_chat_template¶
Apply the model's built-in chat template to format messages for instruction-tuned models.
pub fn apply_chat_template(
&self,
messages: &[ChatMessage],
add_generation_prompt: bool,
) -> Result<String, MullamaError>
Parameters:
| Name | Type | Default | Description |
|---|---|---|---|
messages |
&[ChatMessage] |
-- | Conversation messages with roles |
add_generation_prompt |
bool |
-- | Append the assistant generation prompt marker |
Example:
use mullama::{Model, ChatMessage};
let model = Model::load("model.gguf")?;
let messages = vec![
ChatMessage { role: "system".into(), content: "You are helpful.".into() },
ChatMessage { role: "user".into(), content: "Hello!".into() },
];
let prompt = model.apply_chat_template(&messages, true)?;
println!("{}", prompt);
Context Creation¶
create_context¶
Convenience method to create an inference context from this model.
Example:
use mullama::{Model, ContextParams};
let model = Model::load("model.gguf")?;
let ctx = model.create_context(ContextParams::default())?;
Thread Safety¶
Model is designed for safe concurrent access:
use mullama::Model;
use std::sync::Arc;
use std::thread;
let model = Arc::new(Model::load("model.gguf")?);
// Share across threads safely
let model_clone = model.clone(); // Cheap Arc clone
thread::spawn(move || {
let tokens = model_clone.tokenize("Hello", true, false).unwrap();
println!("Tokens: {:?}", tokens);
});
- Clone -- Cheap reference count increment (Arc-based)
- Send -- Can be moved between threads
- Sync -- Can be shared between threads via
&Model
Memory Considerations¶
| Aspect | Behavior |
|---|---|
| Model size | Determined by GGUF file size and quantization level |
| mmap mode | Model stays on disk, pages loaded on demand (lower RSS) |
| mlock mode | Entire model locked in RAM (prevents swapping, higher RSS) |
| GPU offload | Layers on GPU consume VRAM, reduce RAM usage proportionally |
| Drop behavior | Model freed when last Arc reference is dropped |
Memory-Mapped Loading
With use_mmap: true (the default), the OS memory-maps the model file. This means the model loads almost instantly and the OS manages which pages are resident in RAM. For production deployments with limited RAM, this is the recommended approach.