Your First Model
This tutorial walks you through running your first model with UniLLM.
Using the Ollama Integration
The easiest way to get started is to use the built-in Ollama integration, which downloads and manages models automatically.
Step 1: Run the Ollama Test
Run the test_ollama binary from the runtime package (the same binary used in Step 2). This will:
- Download TinyLlama (~600MB) from the Ollama registry
- Load the GGUF weights
- Run a sample inference
Expected output:
Downloading tinyllama...
Loading model weights...
Running inference...
Prompt: "The quick brown fox"
Generated: "jumps over the lazy dog..."
Step 2: Try Different Models
# Use a different model
cargo run --bin test_ollama -p runtime -- --model llama2:7b
# List cached models
cargo run --bin test_ollama -p runtime -- --list-cached
Using the API Directly
For more control, use the API directly in your Rust code.
Basic Example
use unillm::models_v2::llama::{LlamaModelV2, LlamaConfig};
use unillm::{Model, ModelInputs, GenerationConfig};
use unillm::tensor_core::{ops_fn, DataType, Device};

fn main() -> anyhow::Result<()> {
    // 1. Create model configuration
    let config = LlamaConfig {
        vocab_size: 32000,
        hidden_size: 4096,
        num_hidden_layers: 32,
        num_attention_heads: 32,
        ..Default::default()
    };

    // 2. Create the model
    let model = LlamaModelV2::new(config)?;

    // 3. Configure generation
    let gen_config = GenerationConfig {
        max_new_tokens: 50,
        temperature: 0.7,
        do_sample: true,
        ..Default::default()
    };

    // 4. Generate text
    let response = model.generate("Hello, world!", &gen_config)?;
    println!("Generated: {}", response);

    Ok(())
}
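The temperature and do_sample fields control how the next token is chosen from the model's logits. The block below is not UniLLM's implementation; it is a minimal standalone sketch, using only the standard library, of the usual temperature-scaled softmax. The function name `softmax_with_temperature` is hypothetical.

```rust
/// Convert raw logits into probabilities, scaling by `temperature` first.
/// Hypothetical helper for illustration; not part of the UniLLM API.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Dividing by the temperature sharpens (< 1.0) or flattens (> 1.0) the distribution.
    let scaled: Vec<f32> = logits.iter().map(|&l| l / temperature).collect();
    // Subtract the maximum before exponentiating for numerical stability.
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|&s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

With a temperature of 0.7, as in the example above, the most likely token receives a larger share of the probability than it would at 1.0. In many libraries, disabling sampling skips this step and simply takes the highest logit; whether UniLLM behaves the same way is an assumption here.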
Loading Pre-trained Weights
use unillm::models_v2::llama::LlamaModelV2;
use unillm::weight_loader_core::WeightLoader;

// Choose one of the three loaders below.
// Load from SafeTensors
let weights = WeightLoader::from_safetensors("model.safetensors")?;
// Load from GGUF (Ollama format)
let weights = WeightLoader::from_gguf("model.gguf")?;
// Auto-detect format
let weights = WeightLoader::auto_detect("model_file")?;

// Create model with weights (config is a LlamaConfig, as in the basic example)
let model = LlamaModelV2::from_weights(config, weights)?;
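How `auto_detect` decides between formats is not described here. As a rough illustration of one way such detection can work, the sketch below inspects the first bytes of a file: GGUF files start with the ASCII magic "GGUF", and SafeTensors files start with an 8-byte little-endian header length followed by a JSON object. The `WeightFormat` enum and `detect_format` function are hypothetical names, and UniLLM may use different rules.

```rust
/// Weight file formats mentioned in this tutorial.
#[derive(Debug, PartialEq)]
enum WeightFormat {
    Gguf,
    SafeTensors,
    Unknown,
}

/// Guess the format from the first bytes of a file.
/// Hypothetical helper; UniLLM's `auto_detect` may use different rules.
fn detect_format(header: &[u8]) -> WeightFormat {
    // GGUF files begin with the four ASCII bytes "GGUF".
    if header.starts_with(b"GGUF") {
        return WeightFormat::Gguf;
    }
    // SafeTensors files begin with an 8-byte little-endian header length,
    // followed by a JSON object, so byte 8 is expected to be '{'.
    if header.len() > 8 && header[8] == b'{' {
        return WeightFormat::SafeTensors;
    }
    WeightFormat::Unknown
}
```

In practice the caller would read the first few bytes of the file and pass them to `detect_format`, falling back to the file extension when the result is `Unknown`.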
Understanding the Output
When you run inference, UniLLM:
- Tokenizes the input text into token IDs
- Embeds the tokens into continuous vectors
- Runs the forward pass through transformer layers
- Samples the next token from output logits
- Decodes tokens back to text
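The steps above can be sketched with a toy example. This is not UniLLM code: the four-entry vocabulary, the function names, and the hard-coded logits that stand in for the embedding and forward pass are all assumptions made for illustration, and the sampling step is simplified to a greedy argmax.

```rust
/// Toy vocabulary: the index of each string is its token ID.
const VOCAB: [&str; 4] = ["<unk>", "Hello", " world", "!"];

/// Tokenize: map each known piece to its ID (0 for unknown pieces).
fn tokenize(pieces: &[&str]) -> Vec<usize> {
    pieces
        .iter()
        .map(|p| VOCAB.iter().position(|v| v == p).unwrap_or(0))
        .collect()
}

/// Sample (greedy): pick the ID with the highest logit.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}

/// Decode: concatenate the text of each token ID.
fn decode(ids: &[usize]) -> String {
    ids.iter().map(|&id| VOCAB[id]).collect()
}

/// One generation step; `fake_logits` stands in for embed + forward pass.
fn generate_one(prompt: &[&str], fake_logits: &[f32]) -> String {
    let mut ids = tokenize(prompt);
    ids.push(argmax(fake_logits));
    decode(&ids)
}
```

Calling `generate_one(&["Hello"], &[0.1, 0.2, 3.0, 0.5])` selects token 2 and decodes the sequence to "Hello world". A real model repeats this step, feeding each new token back in, until it reaches the configured token limit or a stop token.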
Input: "Hello"
↓ Tokenize
[1, 15496] (token IDs)
↓ Embed
[batch, seq, hidden_size] (embeddings)
↓ Forward Pass
[batch, seq, vocab_size] (logits)
↓ Sample
[12345] (next token)
↓ Decode
"Hello world"
Model Memory Requirements
Memory requirements scale with parameter count. The figures below correspond to about 2 bytes per parameter (16-bit weights):
| Model | Parameters | RAM Required |
|---|---|---|
| TinyLlama | 1.1B | ~2GB |
| LLaMA-7B | 7B | ~14GB |
| LLaMA-13B | 13B | ~26GB |
| Mixtral-8x7B | 47B | ~94GB |
Quantized Models
GGUF models distributed through Ollama are typically quantized, which reduces memory requirements significantly. A 4-bit (Q4) quantized 7B model uses ~4GB instead of ~14GB.
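The numbers in this section follow from simple arithmetic: parameter count times bits per weight, divided by 8 to get bytes. The sketch below shows that calculation. The function name `weight_memory_gb` is hypothetical, and the estimate covers the weights only, not activations, the KV cache, or runtime overhead.

```rust
/// Approximate memory for the weights alone, in gigabytes (10^9 bytes).
/// Hypothetical helper; ignores activations, KV cache and runtime overhead.
fn weight_memory_gb(num_params: f64, bits_per_weight: f64) -> f64 {
    // bits -> bytes -> gigabytes
    num_params * bits_per_weight / 8.0 / 1e9
}
```

For a 7B model this gives 14 GB at 16 bits and 3.5 GB at exactly 4 bits per weight. Common Q4 formats also store per-block scale factors, so their effective size is somewhat above 4 bits per weight, which is consistent with the ~4GB figure quoted above.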
Next Steps
Now that you've run your first model:
- Learn about Loading Models in detail
- Explore the Model Catalog for all supported architectures
- Understand Configuration Options