Your First Model

This tutorial walks you through running your first model with UniLLM.

Using the Ollama Integration

The easiest way to get started is to use the built-in Ollama integration, which downloads and manages models automatically.

Step 1: Run the Ollama Test

cargo run --bin test_ollama -p runtime

This will:

Download TinyLlama (~600MB) from the Ollama registry
Load the GGUF weights
Run a sample inference

Expected output (the generated text may differ):

Downloading tinyllama...
Loading model weights...
Running inference...
Prompt: "The quick brown fox"
Generated: "jumps over the lazy dog..."

Step 2: Try Different Models

# Use a different model
cargo run --bin test_ollama -p runtime -- --model llama2:7b

# List cached models
cargo run --bin test_ollama -p runtime -- --list-cached

Using the API Directly

For more control, use the API directly in your Rust code.

Basic Example

use unillm::models_v2::llama::{LlamaModelV2, LlamaConfig};
use unillm::{Model, ModelInputs, GenerationConfig};
use unillm::tensor_core::{ops_fn, DataType, Device};

fn main() -> anyhow::Result<()> {
    // 1. Create model configuration (these dimensions match LLaMA-7B)
    let config = LlamaConfig {
        vocab_size: 32000,
        hidden_size: 4096,
        num_hidden_layers: 32,
        num_attention_heads: 32,
        ..Default::default()
    };

    // 2. Create the model from the configuration alone (no weights file is passed here)
    let model = LlamaModelV2::new(config)?;

    // 3. Configure generation
    let gen_config = GenerationConfig {
        max_new_tokens: 50,
        temperature: 0.7,
        do_sample: true,
        ..Default::default()
    };

    // 4. Generate text
    let response = model.generate("Hello, world!", &gen_config)?;
    println!("Generated: {}", response);

    Ok(())
}
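
To make the `temperature` setting above more concrete, here is a minimal sketch of temperature-scaled softmax followed by sampling. It is not UniLLM's implementation: `softmax_with_temperature` and `sample_index` are hypothetical helper names, and the uniform random number `r` is supplied by the caller so that the sketch needs no external crates.

```rust
/// Convert raw logits into probabilities, dividing by `temperature` first.
/// Lower temperatures sharpen the distribution; higher ones flatten it.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Subtract the maximum before exponentiating for numerical stability.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&logit| ((logit - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Pick an index by inverse-CDF sampling.
/// `r` is assumed to be a uniform random number in [0, 1).
fn sample_index(probs: &[f32], r: f32) -> usize {
    let mut cumulative = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if r < cumulative {
            return i;
        }
    }
    // Fall back to the last index if rounding kept the total just below `r`.
    probs.len().saturating_sub(1)
}
```

For the logits `[2.0, 1.0, 0.0]`, lowering the temperature from 1.0 to 0.7 raises the probability of the first token, which is why lower temperatures tend to produce more predictable text.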

Loading Pre-trained Weights

use unillm::weight_loader_core::WeightLoader;

// The three loaders below are alternatives; use the one that matches your file.

// Load from SafeTensors
let weights = WeightLoader::from_safetensors("model.safetensors")?;

// Load from GGUF (Ollama format)
let weights = WeightLoader::from_gguf("model.gguf")?;

// Auto-detect format
let weights = WeightLoader::auto_detect("model_file")?;

// Create model with weights (`config` is a `LlamaConfig`, as in the basic example)
let model = LlamaModelV2::from_weights(config, weights)?;
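
How `auto_detect` tells the formats apart is not described here. One plausible approach, sketched below as an assumption rather than as UniLLM's actual logic, is to inspect the first bytes of the file: GGUF files start with the ASCII magic `GGUF`, and SafeTensors files start with an 8-byte little-endian header length followed by a JSON header. `WeightFormat`, `detect_format_from_bytes`, and `detect_format` are hypothetical names.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical result type for this sketch (not part of UniLLM).
#[derive(Debug, PartialEq)]
enum WeightFormat {
    Gguf,
    SafeTensors,
    Unknown,
}

/// Classify a weights file from its leading bytes.
fn detect_format_from_bytes(head: &[u8]) -> WeightFormat {
    // GGUF files begin with the 4-byte ASCII magic "GGUF".
    if head.starts_with(b"GGUF") {
        return WeightFormat::Gguf;
    }
    // SafeTensors files begin with an 8-byte little-endian header length,
    // followed by a JSON object, so the ninth byte is expected to be '{'.
    if head.len() >= 9 && head[8] == b'{' {
        return WeightFormat::SafeTensors;
    }
    WeightFormat::Unknown
}

/// Read only the first 16 bytes of the file and classify them.
fn detect_format(path: &Path) -> io::Result<WeightFormat> {
    let mut head = Vec::with_capacity(16);
    File::open(path)?.take(16).read_to_end(&mut head)?;
    Ok(detect_format_from_bytes(&head))
}
```

Checking the content rather than the file extension means a renamed or extension-less file, such as `model_file` above, can still be classified.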

Understanding the Output

When you run inference, UniLLM:

Tokenizes the input text into token IDs
Embeds the tokens into continuous vectors
Runs the forward pass through transformer layers
Samples the next token from output logits
Decodes tokens back to text

Input: "Hello"
  ↓ Tokenize
[1, 15496]  (token IDs; values are illustrative)
  ↓ Embed
[batch, seq, hidden_size]  (embeddings)
  ↓ Forward Pass
[batch, seq, vocab_size]  (logits)
  ↓ Sample
[12345]  (next token)
  ↓ Decode
"Hello world"

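The five steps above can be sketched end to end with a toy model. Everything in this block is a hypothetical stand-in and not UniLLM code: the vocabulary has four entries, the weights are hand-picked, a single matrix product replaces the transformer layers, and the next token is chosen greedily instead of being sampled.

```rust
const VOCAB: [&str; 4] = ["<unk>", "hello", "world", "!"];
const HIDDEN: usize = 2;

// One embedding vector per vocabulary entry (hand-picked toy values).
const EMBEDDINGS: [[f32; HIDDEN]; 4] = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];
// Output projection: one row of weights per vocabulary entry.
const OUTPUT: [[f32; HIDDEN]; 4] = [[0.0, 0.0], [-1.0, -1.0], [2.0, 0.0], [0.0, 2.0]];

/// Tokenize: map whitespace-separated words to token IDs (0 = unknown).
fn tokenize(text: &str) -> Vec<usize> {
    text.split_whitespace()
        .map(|word| {
            VOCAB
                .iter()
                .position(|entry| entry.eq_ignore_ascii_case(word))
                .unwrap_or(0)
        })
        .collect()
}

/// Embed: look up one vector per token, giving a [seq, hidden] matrix.
fn embed(ids: &[usize]) -> Vec<[f32; HIDDEN]> {
    ids.iter().map(|&id| EMBEDDINGS[id]).collect()
}

/// Forward pass stand-in: project the last position onto the vocabulary.
/// A real model runs transformer layers first and produces logits for
/// every position; only the last position is needed to pick the next token.
fn forward(hidden: &[[f32; HIDDEN]]) -> Vec<f32> {
    let last = hidden.last().copied().unwrap_or([0.0; HIDDEN]);
    OUTPUT
        .iter()
        .map(|row| row.iter().zip(last.iter()).map(|(w, h)| w * h).sum::<f32>())
        .collect()
}

/// Greedy selection: index of the largest logit (first one wins on ties).
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &logit) in logits.iter().enumerate() {
        if logit > logits[best] {
            best = i;
        }
    }
    best
}

/// Decode: map token IDs back to text.
fn decode(ids: &[usize]) -> String {
    ids.iter().map(|&id| VOCAB[id]).collect::<Vec<_>>().join(" ")
}

/// Repeat embed, forward, and select once per new token, then decode.
fn generate(prompt: &str, max_new_tokens: usize) -> String {
    let mut ids = tokenize(prompt);
    for _ in 0..max_new_tokens {
        let hidden = embed(&ids);
        let logits = forward(&hidden);
        ids.push(argmax(&logits));
    }
    decode(&ids)
}
```

With these toy weights, `generate("Hello", 1)` returns `"hello world"`. The loop in `generate` shows the part the diagram leaves implicit: the embed, forward, and selection steps run once for every generated token.
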
Model Memory Requirements

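Memory requirements grow with parameter count. The figures below are approximate and cover the weights alone at 16-bit precision (2 bytes per parameter); the KV cache and activations need additional memory:

Model	Parameters	RAM Required (FP16 weights)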
TinyLlama	1.1B	~2GB
LLaMA-7B	7B	~14GB
LLaMA-13B	13B	~26GB
Mixtral-8x7B	47B	~94GB

Quantized Models

GGUF models are typically quantized, which reduces memory requirements significantly. For example, a 4-bit (Q4) quantized 7B model uses ~4GB instead of ~14GB.
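
The figures in this section follow from simple arithmetic: parameter count times bits per weight, divided by eight to get bytes. The helper below is a hypothetical illustration (the name `weight_memory_gb` is not part of UniLLM) and estimates the weights only, ignoring the KV cache, activations, and runtime overhead.

```rust
/// Approximate memory for the weights alone, in gigabytes (10^9 bytes).
/// `parameters` is the parameter count and `bits_per_weight` the storage
/// width, for example 16.0 for FP16.
fn weight_memory_gb(parameters: f64, bits_per_weight: f64) -> f64 {
    let bytes = parameters * bits_per_weight / 8.0;
    bytes / 1e9
}
```

For 7 billion parameters this gives 14 GB at 16 bits per weight. At 4.5 bits per weight, the effective size of the Q4_0 GGUF format once its per-block scale is counted, it gives about 3.9 GB, in line with the ~4GB figure above.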

Next Steps

Now that you've run your first model:

Learn about Loading Models in detail
Explore the Model Catalog for all supported architectures
Understand Configuration Options