Loading Models

Mullama loads GGUF-format models and exposes configuration options for GPU offloading, memory mapping, and multi-threaded access.

Basic Model Loading

The simplest way to load a model from a local GGUF file:

Node.js, Python, Rust, CLI (in that order):

import { Model } from 'mullama';

const model = await Model.load('./path/to/model.gguf');
console.log(`Loaded model with ${model.vocabSize()} tokens`);

from mullama import Model

model = Model.load("./path/to/model.gguf")
print(f"Loaded model with {model.vocab_size()} tokens")

use mullama::Model;
use std::sync::Arc;

let model = Arc::new(Model::load("path/to/model.gguf")?);
println!("Loaded model with {} tokens", model.vocab_size());

# Load and run directly
mullama run ./path/to/model.gguf "Hello!"

# Or use a model alias
mullama run llama3.2:1b "Hello!"

Shared Ownership

Models are designed to be shared between multiple contexts and threads. In Rust, wrap the model in an Arc, as in the example above. In Node.js and Python, sharing is handled automatically through reference counting.

Model Parameters

Configure model loading with ModelParams for fine-grained control over GPU offloading, memory mapping, and tensor distribution:

Node.js, Python, Rust, CLI (in that order):

import { Model } from 'mullama';

const model = await Model.loadWithParams('./model.gguf', {
  nGpuLayers: 35,        // Offload 35 layers to GPU
  useMmap: true,         // Memory-map the model file
  useMlock: false,       // Don't lock in RAM
  vocabOnly: false,      // Load full model, not just vocab
  mainGpu: 0,            // Primary GPU device index
});

from mullama import Model, ModelParams

model = Model.load_with_params("./model.gguf", ModelParams(
    n_gpu_layers=35,       # Offload 35 layers to GPU
    use_mmap=True,         # Memory-map the model file
    use_mlock=False,       # Don't lock in RAM
    vocab_only=False,      # Load full model, not just vocab
    main_gpu=0,            # Primary GPU device index
))

use mullama::{Model, ModelParams};

let params = ModelParams {
    n_gpu_layers: 35,
    use_mmap: true,
    use_mlock: false,
    vocab_only: false,
    main_gpu: 0,
    ..Default::default()
};

let model = Model::load_with_params("model.gguf", params)?;

mullama run llama3.2:1b "Hello!" \
  --gpu-layers 35 \
  --mmap \
  --main-gpu 0

Parameter Reference

Parameter	Type	Default	Description
`n_gpu_layers`	`i32`	`0`	Number of layers to offload to GPU (-1 for all)
`split_mode`	`SplitMode`	`Layer`	How to split model across GPUs
`main_gpu`	`i32`	`0`	Primary GPU device index
`tensor_split`	`Vec<f32>`	`[]`	Proportional split across GPUs
`vocab_only`	`bool`	`false`	Load only vocabulary (for tokenization)
`use_mmap`	`bool`	`true`	Enable memory-mapped file loading
`use_mlock`	`bool`	`false`	Lock model pages in physical RAM

ModelBuilder Pattern

For complex configurations, use the builder pattern, which provides a fluent API and accepts a progress callback:

Node.js, Python, Rust, CLI (in that order):

import { Model } from 'mullama';

const model = await Model.builder('./model.gguf')
  .gpuLayers(35)
  .useMmap(true)
  .useMlock(false)
  .vocabOnly(false)
  .tensorSplit([0.6, 0.4])
  .progressCallback((progress) => {
    console.log(`Loading: ${(progress * 100).toFixed(1)}%`);
    return true;  // Return true to continue
  })
  .build();

from mullama import Model

def on_progress(progress: float) -> bool:
    print(f"Loading: {progress * 100:.1f}%")
    return True  # Return True to continue

model = (Model.builder("./model.gguf")
    .gpu_layers(35)
    .use_mmap(True)
    .use_mlock(False)
    .vocab_only(False)
    .tensor_split([0.6, 0.4])
    .progress_callback(on_progress)
    .build())

use mullama::ModelBuilder;

let model = ModelBuilder::new("model.gguf")
    .with_n_gpu_layers(35)
    .with_use_mmap(true)
    .with_use_mlock(false)
    .with_vocab_only(false)
    .with_tensor_split(&[0.6, 0.4])
    .with_progress_callback(|progress| {
        println!("Loading: {:.1}%", progress * 100.0);
        true  // Return true to continue, false to abort
    })
    .build()?;

mullama run llama3.2:1b "Hello!" \
  --gpu-layers 35 \
  --mmap \
  --tensor-split "0.6,0.4"
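The progress callback's return value decides whether loading continues. As a minimal sketch of how that can be used, the helper below builds a callback that asks the loader to stop once a deadline has passed. The name make_deadline_callback and the injectable clock are hypothetical and not part of Mullama. The sketch also assumes that returning False aborts the load in Python, as the Rust comment above states for Rust.

```python
import time
from typing import Callable


def make_deadline_callback(
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[[float], bool]:
    """Build a progress callback that cancels loading after timeout_s seconds.

    Hypothetical helper; only the (progress: float) -> bool shape comes from
    the builder examples above.
    """
    start = clock()

    def on_progress(progress: float) -> bool:
        elapsed = clock() - start
        print(f"Loading: {progress * 100:.1f}% ({elapsed:.1f}s elapsed)")
        return elapsed <= timeout_s  # False asks the loader to abort

    return on_progress


# Simulated clock so the example runs without a real model load.
fake_times = iter([0.0, 1.0, 5.0])
callback = make_deadline_callback(timeout_s=3.0, clock=lambda: next(fake_times))
print(callback(0.25))  # 1.0s elapsed, within the deadline
print(callback(0.50))  # 5.0s elapsed, past the deadline
```

In real use, the returned function would be passed wherever the examples above pass on_progress.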

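GPU Layer Offloading

Offloading transformer layers to the GPU can speed up inference considerably. The n_gpu_layers setting (--gpu-layers on the CLI) controls how many layers are offloaded:

`n_gpu_layers` value	Behavior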
`0`	CPU only -- no GPU acceleration
`1` to `N`	Offload N layers to GPU
`-1` or large number	Offload all layers to GPU

Node.js, Python, Rust, CLI (in that order):

// Offload all layers to GPU
const model = await Model.loadWithParams('./model.gguf', {
  nGpuLayers: -1
});

# Offload all layers to GPU
model = Model.load_with_params("./model.gguf", ModelParams(
    n_gpu_layers=-1
))

// Offload all layers to GPU
let model = ModelBuilder::new("model.gguf")
    .with_n_gpu_layers(-1)
    .build()?;

# Offload all layers to GPU
mullama run llama3.2:1b "Hello!" --gpu-layers -1

Finding the Right Balance

Start with all layers on GPU (-1). If you run out of VRAM, reduce the count until the model fits. Monitor GPU memory with:

NVIDIA: nvidia-smi
Apple Silicon: Activity Monitor (Memory tab)
AMD: rocm-smi
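To pick a starting point without trial and error, the layer count can be estimated from the free VRAM reported by the tools above. The following is a rough sketch, not a Mullama API. The helper layers_that_fit is hypothetical, it assumes all layers are the same size, and the 1 GB reserve for the KV cache and scratch buffers is a guess that should be tuned per model and context size.

```python
def layers_that_fit(
    model_size_gb: float,
    total_layers: int,
    free_vram_gb: float,
    reserve_gb: float = 1.0,
) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers.

    reserve_gb is headroom for the KV cache and scratch buffers (an assumption).
    """
    if total_layers <= 0 or model_size_gb <= 0:
        raise ValueError("model_size_gb and total_layers must be positive")
    per_layer_gb = model_size_gb / total_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(total_layers, int(usable_gb / per_layer_gb))


# A ~4 GB Q4_K_M 7B model with 32 layers:
print(layers_that_fit(4.0, 32, free_vram_gb=8.0))  # everything fits
print(layers_that_fit(4.0, 32, free_vram_gb=3.0))  # partial offload
```

The result is only a first guess for n_gpu_layers; if loading still fails with an out-of-memory error, lower the value further.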

Multi-GPU Tensor Splitting

On systems with multiple GPUs, use tensor_split to distribute the model across devices. Each entry is the proportion of the model assigned to the GPU at that index:

Node.js, Python, Rust, CLI (in that order):

// Split 60% on GPU 0, 40% on GPU 1
const model = await Model.loadWithParams('./large-model.gguf', {
  nGpuLayers: -1,
  tensorSplit: [0.6, 0.4]
});

# Split 60% on GPU 0, 40% on GPU 1
model = Model.load_with_params("./large-model.gguf", ModelParams(
    n_gpu_layers=-1,
    tensor_split=[0.6, 0.4]
))

// Split 60% on GPU 0, 40% on GPU 1
let model = ModelBuilder::new("large-model.gguf")
    .with_n_gpu_layers(-1)
    .with_tensor_split(&[0.6, 0.4])
    .build()?;

mullama run large-model "Hello!" \
  --gpu-layers -1 \
  --tensor-split "0.6,0.4"
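To get a feel for what a split such as [0.6, 0.4] means in practice, the sketch below converts proportions into per-GPU layer counts. It is illustrative only: split_layers is a hypothetical helper, it assumes the proportions are normalized by their sum, and the backend's actual placement may differ by a layer or two.

```python
from typing import List, Sequence


def split_layers(total_layers: int, tensor_split: Sequence[float]) -> List[int]:
    """Divide total_layers across GPUs in proportion to tensor_split."""
    if not tensor_split or any(p < 0 for p in tensor_split):
        raise ValueError("tensor_split must contain non-negative proportions")
    total = sum(tensor_split)
    if total <= 0:
        raise ValueError("tensor_split must sum to a positive value")

    counts: List[int] = []
    assigned = 0
    cumulative = 0.0
    for proportion in tensor_split:
        cumulative += proportion / total
        # Round the cumulative boundary so the counts add up to total_layers.
        boundary = round(cumulative * total_layers)
        counts.append(boundary - assigned)
        assigned = boundary
    counts[-1] += total_layers - assigned  # guard against float drift
    return counts


print(split_layers(32, [0.6, 0.4]))  # 32 layers over two GPUs
print(split_layers(32, [3, 1]))      # unnormalized ratios work too
```

Under these assumptions a 32-layer model with a 60/40 split places 19 layers on GPU 0 and 13 on GPU 1, which helps when checking whether each device has enough VRAM.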

Model Introspection

Query model properties after loading to understand its architecture and capabilities:

Node.js, Python, Rust, CLI (in that order):

const model = await Model.load('./model.gguf');

// Architecture information
console.log(`Embedding dimension: ${model.embeddingDim()}`);
console.log(`Number of layers: ${model.layerCount()}`);
console.log(`Training context length: ${model.trainContextLength()}`);

// Vocabulary information
console.log(`Vocabulary size: ${model.vocabSize()}`);
console.log(`BOS token: ${model.bosToken()}`);
console.log(`EOS token: ${model.eosToken()}`);

// Model description
console.log(`Description: ${model.description()}`);

model = Model.load("./model.gguf")

# Architecture information
print(f"Embedding dimension: {model.embedding_dim()}")
print(f"Number of layers: {model.layer_count()}")
print(f"Training context length: {model.train_context_length()}")

# Vocabulary information
print(f"Vocabulary size: {model.vocab_size()}")
print(f"BOS token: {model.bos_token()}")
print(f"EOS token: {model.eos_token()}")

# Model description
print(f"Description: {model.description()}")

let model = Model::load("model.gguf")?;

// Architecture information
println!("Embedding dimension: {}", model.n_embd());
println!("Number of layers: {}", model.n_layer());
println!("Training context length: {}", model.n_ctx_train());

// Vocabulary information
println!("Vocabulary size: {}", model.vocab_size());
println!("BOS token: {:?}", model.bos_token());
println!("EOS token: {:?}", model.eos_token());

// Model description
println!("Description: {:?}", model.description());

# Show model metadata
mullama show llama3.2:1b

# Show full modelfile including parameters
mullama show llama3.2:1b --modelfile

Key Metadata Fields

Method (Rust / Node.js / Python)	Returns (Rust type)	Description
`vocab_size()` / `vocabSize()` / `vocab_size()`	`usize`	Total vocabulary size
`n_embd()` / `embeddingDim()` / `embedding_dim()`	`usize`	Embedding/hidden dimension
`n_layer()` / `layerCount()` / `layer_count()`	`usize`	Number of transformer layers
`n_ctx_train()` / `trainContextLength()` / `train_context_length()`	`usize`	Context length the model was trained with
`bos_token()` / `bosToken()` / `bos_token()`	`Option<TokenId>`	Beginning-of-sequence token
`eos_token()` / `eosToken()` / `eos_token()`	`Option<TokenId>`	End-of-sequence token
`description()`	`Option<String>`	Model description from metadata

Tokenization and Detokenization

Convert between text and tokens using the model's vocabulary:

Node.js, Python, Rust, CLI (in that order):

// Text to tokens
const tokens = model.tokenize("Hello, world!");
console.log(`Token IDs: ${tokens}`);
console.log(`Token count: ${tokens.length}`);

// Tokens back to text
const text = model.detokenize(tokens);
console.log(`Decoded text: ${text}`);

// Single token to string
const tokenStr = model.tokenToString(tokens[0]);
console.log(`First token: '${tokenStr}'`);

# Text to tokens
tokens = model.tokenize("Hello, world!")
print(f"Token IDs: {tokens}")
print(f"Token count: {len(tokens)}")

# Tokens back to text
text = model.detokenize(tokens)
print(f"Decoded text: {text}")

# Single token to string
token_str = model.token_to_string(tokens[0])
print(f"First token: '{token_str}'")

// Text to tokens
let tokens = model.tokenize("Hello, world!", true, false)?;
// Arguments: text, add_bos, parse_special_tokens
println!("Token IDs: {:?}", tokens);
println!("Token count: {}", tokens.len());

// Single token to text
let text = model.token_to_str(tokens[0], 0, false)?;
println!("First token text: '{}'", text);

// Detokenize a sequence
let decoded = model.detokenize(&tokens)?;
println!("Decoded: {}", decoded);

# Tokenize text (shows token count)
mullama tokenize llama3.2:1b "Hello, world!"

Special Tokens

By default, tokenization adds the BOS (beginning-of-sequence) token. In Rust, control this with the add_bos parameter. In Node.js and Python, use the addBos/add_bos option.

Model Aliases

When using the Mullama daemon, you can reference models by aliases instead of file paths:

Node.js, Python, Rust, CLI (in that order):

import { Model } from 'mullama';

// Connect to the daemon and use an alias
const model = await Model.fromAlias('llama3.2:1b');

from mullama import Model

# Connect to the daemon and use an alias
model = Model.from_alias("llama3.2:1b")

use mullama::Model;

// When using the daemon client
let model = Model::from_alias("llama3.2:1b")?;

# List available model aliases
mullama list

# Run with alias
mullama run llama3.2:1b "Hello!"

# Create a custom alias via Modelfile
mullama create my-assistant -f ./Modelfile

Downloading from Hugging Face

Download GGUF models directly from Hugging Face using the daemon:

Node.js, Python, Rust, CLI (in that order):

import { daemon } from 'mullama';

// Pull a model by name
await daemon.pull('llama3.2:1b');

// Pull with progress tracking
await daemon.pull('llama3.2:1b', (progress) => {
  console.log(`Download: ${(progress * 100).toFixed(1)}%`);
});

from mullama import daemon

# Pull a model by name
daemon.pull("llama3.2:1b")

# Pull with progress tracking
def on_progress(progress: float):
    print(f"Download: {progress * 100:.1f}%")

daemon.pull("llama3.2:1b", progress_callback=on_progress)

use mullama::daemon::DaemonClient;

let client = DaemonClient::connect().await?;
client.pull_model("llama3.2:1b").await?;

# Pull a model
mullama pull llama3.2:1b

# Use HuggingFace reference in Modelfile
# FROM hf:Qwen/Qwen2.5-7B-Instruct-GGUF

# Pin to a specific revision
# FROM hf:Qwen/Qwen2.5-7B-Instruct-GGUF@a1b2c3d
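The two Modelfile references above suggest the shape hf:<owner>/<repo> with an optional @<revision> suffix. The sketch below splits such a string into its parts. The grammar is inferred from those two examples only, and parse_hf_ref is a hypothetical helper, so Mullama's own parser may accept forms this one rejects.

```python
from typing import NamedTuple, Optional


class HfRef(NamedTuple):
    repo_id: str
    revision: Optional[str]


def parse_hf_ref(ref: str) -> HfRef:
    """Parse 'hf:<owner>/<repo>[@<revision>]' (assumed format) into its parts."""
    prefix = "hf:"
    if not ref.startswith(prefix):
        raise ValueError(f"not an hf: reference: {ref!r}")
    repo_id, sep, revision = ref[len(prefix):].partition("@")
    owner, slash, name = repo_id.partition("/")
    if not (owner and slash and name):
        raise ValueError(f"expected 'hf:<owner>/<repo>', got {ref!r}")
    if sep and not revision:
        raise ValueError(f"empty revision in {ref!r}")
    return HfRef(repo_id, revision or None)


print(parse_hf_ref("hf:Qwen/Qwen2.5-7B-Instruct-GGUF"))
print(parse_hf_ref("hf:Qwen/Qwen2.5-7B-Instruct-GGUF@a1b2c3d"))
```

Pinning a revision, as in the second call, keeps a deployment reproducible when the upstream repository is updated.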

Memory Considerations

GGUF models come in various quantization levels, trading quality for size and speed. The figures below are approximate:

Quantization	Bits (nominal)	Quality	Speed	Approx. RAM (7B model)
F16	16	Best	Slow	~14 GB
Q8_0	8	Excellent	Medium	~7 GB
Q6_K	6	Very Good	Fast	~5.5 GB
Q5_K_M	5	Good	Fast	~5 GB
Q4_K_M	4	Good	Fastest	~4 GB
Q3_K_M	3	Acceptable	Fastest	~3.5 GB
Q2_K	2	Poor	Fastest	~3 GB

Recommended Quantization

Q4_K_M offers the best balance of quality, speed, and size for most use cases. Use Q8_0 when quality is paramount, or Q3_K_M when memory is extremely limited.
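The table can also be read as a lookup: given a memory budget, take the highest-quality quantization that still fits. The sketch below does that for a 7B model using the approximate figures from the table. The function name and the default 2 GB overhead for context memory are assumptions for illustration.

```python
from typing import Optional

# (name, approximate RAM in GB for a 7B model), best quality first.
# Values copied from the quantization table above.
QUANT_RAM_GB_7B = [
    ("F16", 14.0),
    ("Q8_0", 7.0),
    ("Q6_K", 5.5),
    ("Q5_K_M", 5.0),
    ("Q4_K_M", 4.0),
    ("Q3_K_M", 3.5),
    ("Q2_K", 3.0),
]


def best_quant_for_budget(budget_gb: float, overhead_gb: float = 2.0) -> Optional[str]:
    """Return the highest-quality quantization whose weights plus overhead fit.

    overhead_gb approximates context memory (an assumption; it grows with
    context length). Returns None when nothing fits.
    """
    for name, weights_gb in QUANT_RAM_GB_7B:
        if weights_gb + overhead_gb <= budget_gb:
            return name
    return None


print(best_quant_for_budget(16.0))
print(best_quant_for_budget(8.0))
print(best_quant_for_budget(4.0))
```

A None result means even Q2_K does not fit alongside the assumed overhead, so a smaller model or shorter context is needed.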

Estimating Memory Requirements

A rough formula for estimating RAM usage in bytes, where context_memory is dominated by the KV cache:

RAM = (model_parameters * bits_per_weight) / 8 + context_memory

For a 7B parameter model with Q4_K_M quantization and 4096 context:

Model weights: 7B * 4 bits / 8 = ~3.5 GB
KV cache (F16): 2 * 32 layers * 4096 ctx * 4096 dim * 2 bytes = ~2 GB
Total: ~5.5 GB
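The same arithmetic as a small function, so other model sizes and context lengths can be plugged in. This is a sketch of the rough formula only. The function name is hypothetical, and the KV cache term assumes standard multi-head attention with an F16 cache; models that use grouped-query attention need less. Computed exactly, the KV cache term for this example is about 2.1 GB, so the total comes out near 5.6 GB.

```python
def estimate_ram_gb(
    n_params: float,
    bits_per_weight: float,
    n_layers: int,
    n_ctx: int,
    n_embd: int,
    kv_bytes_per_value: int = 2,  # F16 cache
) -> dict:
    """Rough RAM estimate: quantized weights plus KV cache, in GB (1e9 bytes)."""
    weights_bytes = n_params * bits_per_weight / 8
    # Keys and values (factor 2) for every layer, position, and embedding dim.
    kv_bytes = 2 * n_layers * n_ctx * n_embd * kv_bytes_per_value
    return {
        "weights_gb": weights_bytes / 1e9,
        "kv_cache_gb": kv_bytes / 1e9,
        "total_gb": (weights_bytes + kv_bytes) / 1e9,
    }


# 7B parameters, 4 bits per weight, 32 layers, 4096 context, 4096 embedding dim.
estimate = estimate_ram_gb(7e9, 4, n_layers=32, n_ctx=4096, n_embd=4096)
for name, value in estimate.items():
    print(f"{name}: {value:.1f} GB")
```

Real Q4_K_M files use somewhat more than 4 bits per weight on average, which is why the quantization table lists about 4 GB rather than 3.5 GB for a 7B model.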

GPU VRAM

When offloading to GPU, the offloaded layers consume VRAM instead of system RAM. Ensure your GPU has sufficient VRAM for the layers you offload.

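Error Handling

Loading can fail because the file is missing or because it cannot be loaded as a model. Handle both cases explicitly:

Node.js, Python, Rust, CLI (in that order):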

try {
  const model = await Model.load('./model.gguf');
} catch (error) {
  if (error.code === 'MODEL_LOAD_ERROR') {
    console.error(`Failed to load: ${error.message}`);
  } else if (error.code === 'FILE_NOT_FOUND') {
    console.error('Model file not found');
  } else {
    console.error(`Unexpected error: ${error.message}`);
  }
}

from mullama import Model, MullamaError

try:
    model = Model.load("./model.gguf")
except FileNotFoundError:
    print("Model file not found")
except MullamaError as e:
    print(f"Failed to load model: {e}")

use mullama::{Model, MullamaError};

match Model::load("model.gguf") {
    Ok(model) => {
        println!("Loaded: {} layers", model.n_layer());
    }
    Err(MullamaError::ModelLoadError(msg)) => {
        eprintln!("Failed to load model: {}", msg);
    }
    Err(e) => eprintln!("Unexpected error: {}", e),
}

# CLI provides descriptive error messages automatically
mullama run nonexistent-model "Hello!"
# Error: model 'nonexistent-model' not found. Run 'mullama list' to see available models.
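A cheap pre-flight check can separate "file missing" from "file is not GGUF" before the loader is called. GGUF files begin with the four ASCII bytes GGUF followed by a 32-bit version number. The sketch below reads just that header. The helper gguf_version is hypothetical and independent of Mullama, and it assumes a little-endian file, which is the common case.

```python
import struct
import tempfile
from pathlib import Path

GGUF_MAGIC = b"GGUF"


def gguf_version(path: str) -> int:
    """Return the GGUF version from the file header, or raise if it is not GGUF."""
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"model file not found: {path}")
    with file.open("rb") as f:
        header = f.read(8)  # 4-byte magic + 4-byte version
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} is not a GGUF file")
    (version,) = struct.unpack("<I", header[4:8])  # little-endian uint32
    return version


# Demo with a throwaway file instead of a real model.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "tiny.gguf"
    sample.write_bytes(GGUF_MAGIC + struct.pack("<I", 3))
    print(gguf_version(str(sample)))
```

A passing check does not guarantee the model will load, since the tensors themselves may still be truncated or unsupported, so the error handling shown above is still needed.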

Best Practices

Share models across contexts -- A single loaded model can serve many concurrent inference contexts
Match quantization to hardware -- Use Q4_K_M for consumer GPUs, Q8_0 for high-end systems
Enable GPU offloading -- Significant speedup when VRAM is available
Use mmap for large models -- Faster loading and better memory efficiency
Cache loaded models -- Model loading is expensive; reuse loaded models across requests
Start with full GPU offload -- Set n_gpu_layers to -1 (--gpu-layers -1 on the CLI), then reduce it if you run out of VRAM
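The "cache loaded models" advice can be sketched as a small thread-safe cache keyed by path. ModelCache is a hypothetical class, not part of Mullama. It takes the loader as an argument so the example runs with a stand-in; in an application the loader could be Model.load.

```python
import threading
from typing import Callable, Dict, Generic, List, TypeVar

M = TypeVar("M")


class ModelCache(Generic[M]):
    """Load each model path once and hand the same instance to every caller."""

    def __init__(self, loader: Callable[[str], M]) -> None:
        self._loader = loader
        self._models: Dict[str, M] = {}
        self._lock = threading.Lock()

    def get(self, path: str) -> M:
        with self._lock:  # serialize loads so a path is never loaded twice
            if path not in self._models:
                self._models[path] = self._loader(path)
            return self._models[path]


# Stand-in loader that records each call instead of reading a real file.
load_calls: List[str] = []


def fake_load(path: str) -> dict:
    load_calls.append(path)
    return {"path": path}


cache = ModelCache(fake_load)
first = cache.get("./model.gguf")
second = cache.get("./model.gguf")
print(first is second, len(load_calls))
```

Holding one lock for the whole load keeps the sketch simple, at the cost of blocking requests for other models while one is loading. A per-path lock would avoid that.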
