Loading Models
Mullama supports loading GGUF format models with flexible configuration options for GPU offloading, memory mapping, and multi-threaded access.
Basic Model Loading
The simplest way to load a model from a local GGUF file is to pass its path to Model.load (Model::load in Rust).
Shared Ownership
Models are designed to be shared between multiple contexts and threads. In Rust, wrap the model in an Arc. In Node.js and Python, sharing is handled automatically through reference counting.
Model Parameters
Configure model loading with ModelParams for fine-grained control over GPU offloading, memory mapping, and tensor distribution:
Node.js

import { Model } from 'mullama';

const model = await Model.loadWithParams('./model.gguf', {
  nGpuLayers: 35,    // Offload 35 layers to GPU
  useMmap: true,     // Memory-map the model file
  useMlock: false,   // Don't lock in RAM
  vocabOnly: false,  // Load full model, not just vocab
  mainGpu: 0,        // Primary GPU device index
});

Python

from mullama import Model, ModelParams

model = Model.load_with_params("./model.gguf", ModelParams(
    n_gpu_layers=35,   # Offload 35 layers to GPU
    use_mmap=True,     # Memory-map the model file
    use_mlock=False,   # Don't lock in RAM
    vocab_only=False,  # Load full model, not just vocab
    main_gpu=0,        # Primary GPU device index
))
Parameter Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| `n_gpu_layers` | `i32` | `0` | Number of layers to offload to GPU (-1 for all) |
| `split_mode` | `SplitMode` | `Layer` | How to split model across GPUs |
| `main_gpu` | `i32` | `0` | Primary GPU device index |
| `tensor_split` | `Vec<f32>` | `[]` | Proportional split across GPUs |
| `vocab_only` | `bool` | `false` | Load only vocabulary (for tokenization) |
| `use_mmap` | `bool` | `true` | Enable memory-mapped file loading |
| `use_mlock` | `bool` | `false` | Lock model pages in physical RAM |
ModelBuilder Pattern
For complex configurations, use the builder pattern which provides a fluent API:
Node.js

import { Model } from 'mullama';

const model = await Model.builder('./model.gguf')
  .gpuLayers(35)
  .useMmap(true)
  .useMlock(false)
  .vocabOnly(false)
  .tensorSplit([0.6, 0.4])
  .progressCallback((progress) => {
    console.log(`Loading: ${(progress * 100).toFixed(1)}%`);
    return true; // Return true to continue
  })
  .build();

Python

from mullama import Model

def on_progress(progress: float) -> bool:
    print(f"Loading: {progress * 100:.1f}%")
    return True  # Return True to continue

model = (Model.builder("./model.gguf")
    .gpu_layers(35)
    .use_mmap(True)
    .use_mlock(False)
    .vocab_only(False)
    .tensor_split([0.6, 0.4])
    .progress_callback(on_progress)
    .build())

Rust

use mullama::ModelBuilder;

let model = ModelBuilder::new("model.gguf")
    .with_n_gpu_layers(35)
    .with_use_mmap(true)
    .with_use_mlock(false)
    .with_vocab_only(false)
    .with_tensor_split(&[0.6, 0.4])
    .with_progress_callback(|progress| {
        println!("Loading: {:.1}%", progress * 100.0);
        true // Return true to continue, false to abort
    })
    .build()?;
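The progress callback can also be used to cancel a slow load. The following is a minimal sketch, assuming the Python binding treats a False return value as an abort request in the same way the Rust comment above describes. The names make_progress_callback, timeout_s, and clock are hypothetical and not part of Mullama.

```python
import time
from typing import Callable


def make_progress_callback(
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[[float], bool]:
    """Build a callback that reports progress and gives up after timeout_s seconds.

    Hypothetical helper: the clock argument exists only so the timing can be
    replaced in tests.
    """
    start = clock()

    def on_progress(progress: float) -> bool:
        elapsed = clock() - start
        print(f"Loading: {progress * 100:.1f}% ({elapsed:.1f}s elapsed)")
        # True asks the loader to continue; False asks it to abort
        # (assumed to match the Rust callback semantics).
        return elapsed <= timeout_s

    return on_progress
```

The returned function has the same shape as on_progress in the Python builder example, so it could be passed to progress_callback in its place.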
GPU Layer Offloading
Offload transformer layers to the GPU for significantly faster inference:
| Value | Behavior |
|---|---|
| `0` | CPU only -- no GPU acceleration |
| `1` to `N` | Offload N layers to GPU |
| `-1` or large number | Offload all layers to GPU |
Finding the Right Balance
Start with all layers on GPU (-1). If you run out of VRAM, reduce the count until the model fits. Monitor GPU memory with:
- NVIDIA: `nvidia-smi`
- Apple Silicon: Activity Monitor (Memory tab)
- AMD: `rocm-smi`
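The trial-and-error approach above can be approximated before loading. This is a rough sketch, assuming every layer needs about the same amount of VRAM; in practice the KV cache and scratch buffers also need room, which is what the reserve is for. pick_gpu_layers and its parameters are hypothetical helpers, not part of Mullama.

```python
def pick_gpu_layers(
    total_layers: int,
    layer_bytes: int,
    free_vram_bytes: int,
    reserve_bytes: int = 0,
) -> int:
    """Return a value for n_gpu_layers: -1 if every layer fits, else the count that fits.

    Assumes uniform layer size (an approximation) and keeps reserve_bytes of
    VRAM free for the KV cache and other buffers.
    """
    if total_layers <= 0 or layer_bytes <= 0:
        raise ValueError("total_layers and layer_bytes must be positive")
    usable = max(free_vram_bytes - reserve_bytes, 0)
    fit = min(usable // layer_bytes, total_layers)
    # -1 means "offload all layers"; 0 means CPU only.
    return -1 if fit == total_layers else fit
```

Treat the result as a starting point and keep checking actual VRAM usage with the tools listed above.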
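Multi-GPU Tensor Splitting
For systems with multiple GPUs, distribute model layers across devices with tensor_split, which sets the proportion of the model assigned to each GPU (for example, [0.6, 0.4] for two GPUs), and split_mode, which controls how the model is split.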
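A minimal sketch of deriving tensor_split proportions from the VRAM of each GPU, plus an estimate of how many layers each device would hold. Both functions are hypothetical helpers, and the layer estimate uses simple largest-remainder rounding; how the backend actually assigns layers may differ.

```python
from typing import List, Sequence


def tensor_split_from_vram(vram_gb: Sequence[float]) -> List[float]:
    """Normalize per-GPU VRAM sizes into proportions that sum to 1."""
    total = sum(vram_gb)
    if total <= 0 or any(v < 0 for v in vram_gb):
        raise ValueError("VRAM sizes must be non-negative with a positive sum")
    return [v / total for v in vram_gb]


def layers_per_gpu(n_layers: int, split: Sequence[float]) -> List[int]:
    """Estimate the layer count per GPU for a given split (illustrative only)."""
    total = sum(split)
    exact = [n_layers * s / total for s in split]
    counts = [int(x) for x in exact]
    # Give any leftover layers to the GPUs with the largest fractional share.
    leftover = n_layers - sum(counts)
    order = sorted(range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```

For example, a 24 GB card paired with a 16 GB card yields proportions close to the [0.6, 0.4] split used in the builder examples.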
Model Introspection
Query model properties after loading to understand its architecture and capabilities:
Node.js

import { Model } from 'mullama';

const model = await Model.load('./model.gguf');

// Architecture information
console.log(`Embedding dimension: ${model.embeddingDim()}`);
console.log(`Number of layers: ${model.layerCount()}`);
console.log(`Training context length: ${model.trainContextLength()}`);

// Vocabulary information
console.log(`Vocabulary size: ${model.vocabSize()}`);
console.log(`BOS token: ${model.bosToken()}`);
console.log(`EOS token: ${model.eosToken()}`);

// Model description
console.log(`Description: ${model.description()}`);

Python

from mullama import Model

model = Model.load("./model.gguf")

# Architecture information
print(f"Embedding dimension: {model.embedding_dim()}")
print(f"Number of layers: {model.layer_count()}")
print(f"Training context length: {model.train_context_length()}")

# Vocabulary information
print(f"Vocabulary size: {model.vocab_size()}")
print(f"BOS token: {model.bos_token()}")
print(f"EOS token: {model.eos_token()}")

# Model description
print(f"Description: {model.description()}")

Rust

let model = Model::load("model.gguf")?;

// Architecture information
println!("Embedding dimension: {}", model.n_embd());
println!("Number of layers: {}", model.n_layer());
println!("Training context length: {}", model.n_ctx_train());

// Vocabulary information
println!("Vocabulary size: {}", model.vocab_size());
println!("BOS token: {:?}", model.bos_token());
println!("EOS token: {:?}", model.eos_token());

// Model description
println!("Description: {:?}", model.description());
Key Metadata Fields
| Method | Returns | Description |
|---|---|---|
| `vocab_size()` | `usize` | Total vocabulary size |
| `n_embd()` / `embeddingDim()` | `usize` | Embedding/hidden dimension |
| `n_layer()` / `layerCount()` | `usize` | Number of transformer layers |
| `n_ctx_train()` / `trainContextLength()` | `usize` | Maximum trained context length |
| `bos_token()` / `bosToken()` | `Option<TokenId>` | Beginning-of-sequence token |
| `eos_token()` / `eosToken()` | `Option<TokenId>` | End-of-sequence token |
| `description()` | `Option<String>` | Model description from metadata |
Tokenization and Detokenization
Convert between text and tokens using the model's vocabulary:
Node.js

// Text to tokens
const tokens = model.tokenize("Hello, world!");
console.log(`Token IDs: ${tokens}`);
console.log(`Token count: ${tokens.length}`);

// Tokens back to text
const text = model.detokenize(tokens);
console.log(`Decoded text: ${text}`);

// Single token to string
const tokenStr = model.tokenToString(tokens[0]);
console.log(`First token: '${tokenStr}'`);

Python

# Text to tokens
tokens = model.tokenize("Hello, world!")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Tokens back to text
text = model.detokenize(tokens)
print(f"Decoded text: {text}")

# Single token to string
token_str = model.token_to_string(tokens[0])
print(f"First token: '{token_str}'")

Rust

// Text to tokens
// Arguments: text, add_bos, parse_special_tokens
let tokens = model.tokenize("Hello, world!", true, false)?;
println!("Token IDs: {:?}", tokens);
println!("Token count: {}", tokens.len());

// Single token to text
let text = model.token_to_str(tokens[0], 0, false)?;
println!("First token text: '{}'", text);

// Detokenize a sequence
let decoded = model.detokenize(&tokens)?;
println!("Decoded: {}", decoded);
Special Tokens
By default, tokenization adds the BOS (beginning-of-sequence) token. In Rust, control this with the add_bos parameter. In Node.js and Python, use the addBos/add_bos option.
Model Aliases
When using the Mullama daemon, you can reference models by aliases instead of file paths.
Downloading from HuggingFace
You can download GGUF models directly from HuggingFace using the daemon.
Memory Considerations
GGUF models come in various quantization levels, trading quality for size and speed:
| Quantization | Bits | Quality | Speed | RAM (7B model) |
|---|---|---|---|---|
| F16 | 16 | Best | Slow | ~14 GB |
| Q8_0 | 8 | Excellent | Medium | ~7 GB |
| Q6_K | 6 | Very Good | Fast | ~5.5 GB |
| Q5_K_M | 5 | Good | Fast | ~5 GB |
| Q4_K_M | 4 | Good | Fastest | ~4 GB |
| Q3_K_M | 3 | Acceptable | Fastest | ~3.5 GB |
| Q2_K | 2 | Poor | Fastest | ~3 GB |
Recommended Quantization
Q4_K_M offers the best balance of quality, speed, and size for most use cases. Use Q8_0 when quality is paramount, or Q3_K_M when memory is extremely limited.
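The table above can also be turned into a simple selection rule. The sketch below implements one possible policy, picking the highest-quality level whose 7B RAM figure fits a given budget; it is not the only sensible choice, and pick_quantization and the 1 GB headroom default are hypothetical.

```python
from typing import List, Optional, Tuple

# Approximate RAM for a 7B model, copied from the table above,
# ordered from highest to lowest quality.
QUANT_RAM_GB_7B: List[Tuple[str, float]] = [
    ("F16", 14.0),
    ("Q8_0", 7.0),
    ("Q6_K", 5.5),
    ("Q5_K_M", 5.0),
    ("Q4_K_M", 4.0),
    ("Q3_K_M", 3.5),
    ("Q2_K", 3.0),
]


def pick_quantization(available_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the highest-quality quantization that fits, or None if none does.

    headroom_gb is memory kept free for the KV cache and the rest of the
    process (an arbitrary illustrative default).
    """
    budget = available_gb - headroom_gb
    for name, ram_gb in QUANT_RAM_GB_7B:
        if ram_gb <= budget:
            return name
    return None
```

The figures apply to 7B models only; larger or smaller models would need their own table.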
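Estimating Memory Requirements
A rough formula for estimating RAM usage is total = model weights + KV cache, where model weights = parameters * bits per weight / 8 and KV cache (F16) = 2 * layers * context length * embedding dimension * 2 bytes.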
For a 7B parameter model with Q4_K_M quantization and 4096 context:
Model weights: 7B * 4 bits / 8 = ~3.5 GB
KV cache (F16): 2 * 32 layers * 4096 ctx * 4096 dim * 2 bytes = ~2 GB
Total: ~5.5 GB
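The arithmetic above can be wrapped in a small function. This is a sketch of the same rough estimate, not a measurement: estimate_memory_gb is a hypothetical helper, it uses the nominal bits per weight, and it ignores runtime overhead such as scratch buffers.

```python
from typing import Dict


def estimate_memory_gb(
    n_params: float,
    bits_per_weight: float,
    n_layers: int,
    n_ctx: int,
    n_embd: int,
    kv_bytes_per_element: int = 2,  # 2 bytes per element for an F16 KV cache
) -> Dict[str, float]:
    """Estimate model weight and KV cache memory in GB (1 GB = 1e9 bytes)."""
    weights_bytes = n_params * bits_per_weight / 8
    # The leading 2 covers the separate key and value tensors in every layer.
    kv_cache_bytes = 2 * n_layers * n_ctx * n_embd * kv_bytes_per_element
    return {
        "weights_gb": weights_bytes / 1e9,
        "kv_cache_gb": kv_cache_bytes / 1e9,
        "total_gb": (weights_bytes + kv_cache_bytes) / 1e9,
    }


# The 7B / Q4_K_M / 4096-context example from above.
estimate = estimate_memory_gb(7e9, 4, n_layers=32, n_ctx=4096, n_embd=4096)
```

For this example the weights come to 3.5 GB and the KV cache to about 2.1 GB, which the worked example rounds to ~2 GB.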
GPU VRAM
When offloading to GPU, the offloaded layers consume VRAM instead of system RAM. Ensure your GPU has sufficient VRAM for the layers you offload.
Error Handling
Node.js

import { Model } from 'mullama';

try {
  const model = await Model.load('./model.gguf');
} catch (error) {
  if (error.code === 'MODEL_LOAD_ERROR') {
    console.error(`Failed to load: ${error.message}`);
  } else if (error.code === 'FILE_NOT_FOUND') {
    console.error('Model file not found');
  } else {
    console.error(`Unexpected error: ${error.message}`);
  }
}
Best Practices
- Share models across contexts -- A single loaded model can serve many concurrent inference contexts
- Match quantization to hardware -- Use Q4_K_M for consumer GPUs, Q8_0 for high-end systems
- Enable GPU offloading -- Significant speedup when VRAM is available
- Use mmap for large models -- Faster loading and better memory efficiency
- Cache loaded models -- Model loading is expensive; reuse loaded models across requests
- Start with full GPU offload -- Use -1 for gpu_layers, then reduce if OOM occurs
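The advice to cache loaded models can be sketched as a small path-keyed cache. ModelCache is a hypothetical helper, not part of Mullama; it takes the loader as an argument so that it stays independent of any particular loading API.

```python
import threading
from typing import Any, Callable, Dict


class ModelCache:
    """Load each model path at most once and hand out the shared instance."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._models: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def get(self, path: str) -> Any:
        # Holding the lock while loading keeps two threads from loading the
        # same file twice, at the cost of serializing all loads.
        with self._lock:
            if path not in self._models:
                self._models[path] = self._loader(path)
            return self._models[path]
```

With the Python binding, the loader would presumably be Model.load, for example ModelCache(Model.load).get("./model.gguf").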
See Also
- Text Generation -- Using loaded models for inference
- Memory Management -- Detailed memory optimization strategies
- API Reference: Model -- Complete Model API documentation
- Daemon: Model Management -- Managing models with the daemon