Loading Models¶

UniLLM supports loading models from multiple formats with a unified interface.

Supported Formats¶

Format	Extension	Description
GGUF	`.gguf`	Quantized format used by llama.cpp, Ollama
SafeTensors	`.safetensors`	HuggingFace's safe serialization format
PyTorch	`.bin`, `.pt`	PyTorch checkpoint format

Using WeightLoader¶

The WeightLoader provides format-agnostic loading:

Auto-Detection¶

use unillm::weight_loader_core::WeightLoader;

// Automatically detect format from file extension
let weights = WeightLoader::auto_detect("path/to/model")?;

Format-Specific Loading¶

// Load GGUF (Ollama/llama.cpp format)
let weights = WeightLoader::from_gguf("model.gguf")?;

// Load SafeTensors (HuggingFace format)
let weights = WeightLoader::from_safetensors("model.safetensors")?;

// Load PyTorch checkpoint
let weights = WeightLoader::from_pytorch("model.bin")?;

Loading from Ollama¶

The easiest way to get models is through the Ollama integration:

use unillm::ollama::OllamaRegistry;

// Download and cache model
let registry = OllamaRegistry::new();
let model_path = registry.pull("tinyllama")?;

// Load the downloaded model
let weights = WeightLoader::from_gguf(&model_path)?;

Available Ollama Models¶

# List some popular models
tinyllama      # 1.1B parameters, ~600MB
llama2:7b      # 7B parameters, ~4GB (Q4)
mistral:7b     # 7B parameters, ~4GB (Q4)
mixtral:8x7b   # 47B parameters, ~26GB (Q4)

Creating Models with Weights¶

Once weights are loaded, create a model instance:

use unillm::models_v2::llama::{LlamaModelV2, LlamaConfig};

// 1. Define configuration
let config = LlamaConfig {
    vocab_size: 32000,
    hidden_size: 4096,
    num_hidden_layers: 32,
    num_attention_heads: 32,
    ..Default::default()
};

// 2. Load weights
let weights = WeightLoader::from_gguf("llama-7b.gguf")?;

// 3. Create model with weights
let model = LlamaModelV2::from_weights(config, weights)?;

Weight Format Details¶

GGUF Format¶

GGUF files contain:

Model weights (quantized)
Tokenizer vocabulary
Model configuration metadata

// GGUF provides configuration automatically
let gguf_config = weights.metadata().gguf_config();
let config = LlamaConfig::from_gguf_config(&gguf_config);

SafeTensors Format¶

SafeTensors files are memory-mapped for efficiency:

// SafeTensors supports lazy loading
let weights = WeightLoader::from_safetensors("model.safetensors")?;

// Weights are loaded on-demand when accessed
let embed_weight = weights.get("model.embed_tokens.weight")?;

Working with ModelWeights¶

The ModelWeights container provides access to loaded tensors:

// Get a specific weight
if let Some(weight) = weights.get("model.layers.0.self_attn.q_proj.weight") {
    println!("Shape: {:?}", weight.shape());
}

// Iterate over all weights
for key in weights.keys() {
    println!("Weight: {}", key);
}

// Check weight count
println!("Total weights: {}", weights.len());

Device Transfer¶

Move loaded weights to a specific device:

use unillm::tensor_core::Device;

// Load weights on CPU first
let mut weights = WeightLoader::from_gguf("model.gguf")?;

// Transfer to GPU
weights.to_device(&Device::CUDA(0))?;

// Create model on GPU
let model = LlamaModelV2::from_weights(config, weights)?;

Memory Considerations¶

Memory Usage

GGUF Q4 models use ~0.5 bytes per parameter
SafeTensors F16 models use ~2 bytes per parameter
Full F32 models use ~4 bytes per parameter

Model Size	GGUF Q4	SafeTensors F16	Full F32
7B	~3.5GB	~14GB	~28GB
13B	~6.5GB	~26GB	~52GB
70B	~35GB	~140GB	~280GB

Error Handling¶

use anyhow::Result;

fn load_model() -> Result<LlamaModelV2> {
    let weights = WeightLoader::from_gguf("model.gguf")
        .map_err(|e| anyhow::anyhow!("Failed to load weights: {}", e))?;

    let config = LlamaConfig::default();
    let model = LlamaModelV2::from_weights(config, weights)?;

    Ok(model)
}

Next Steps¶

Learn about Running Inference with loaded models
Explore Configuration Options for fine-tuning