UniLLM¶
High-performance Rust-based LLM inference engine with solid abstractions
UniLLM is a modern, high-performance inference runtime built in Rust, designed with clean architecture and solid abstractions. It provides a unified interface for running large language models across different architectures and deployment targets.
Key Features¶
-
:material-layers-triple:{ .lg .middle } Solid Architecture
Clean three-layer abstraction system: TensorCore, ModelCore, and WeightLoaderCore
-
:material-robot:{ .lg .middle } 45+ Model Architectures
Support for LLMs, MoE, Vision-Language, Audio models with consistent interfaces
-
:material-lightning-bolt:{ .lg .middle } Performance Ready
Device abstraction for CPU/GPU, async runtime, zero-cost abstractions
-
:material-puzzle:{ .lg .middle } Easy Extension
Add new models with minimal boilerplate using the
model_config!macro
Supported Models¶
UniLLM supports 45+ model architectures across 10 categories:
| Category | Models |
|---|---|
| Core LLMs | LLaMA, Qwen, Gemma, Phi, DeepSeek, Mistral, Mixtral |
| GPT Family | GPT-2, GPT-J, GPT-NeoX, OPT, BLOOM, MPT |
| Code Models | StarCoder, CodeLlama |
| MoE Models | DeepSeek-MoE, DBRX, Grok, Arctic, Jamba |
| RWKV/Linear | RWKV-4, RWKV-6, RecurrentGemma |
| Vision-Language | Qwen2-VL, Phi-3-Vision, InternVL, CogVLM, LLaVA, CLIP |
| Audio/Speech | Wav2Vec2, HuBERT, MusicGen, Encodec, Whisper |
| Additional | Yi, Falcon, Baichuan, InternLM, ChatGLM, BERT, T5, Mamba |
Quick Start¶
Installation¶
Basic Usage¶
use unillm::models_v2::llama::{LlamaModelV2, LlamaConfig};
use unillm::{Model, ModelInputs, GenerationConfig};
// Create model with default configuration
let config = LlamaConfig::default();
let model = LlamaModelV2::new(config)?;
// Generate text
let gen_config = GenerationConfig::default();
let response = model.generate("Hello, world!", &gen_config)?;
println!("{}", response);
Using with Ollama¶
# Run inference with a model from Ollama registry
cargo run --bin test_ollama -p runtime -- --model tinyllama
Architecture Overview¶
UniLLM uses a three-layer abstraction system:
┌─────────────────────────────────────────────────────────────────┐
│ UniLLM │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ TensorCore │ │ ModelCore │ │WeightLoaderCore │ │
│ │ │ │ │ │ │ │
│ │ • Tensor │ │ • Model trait │ │ • WeightLoader │ │
│ │ • TensorOps │ │ • ModelConfig │ │ • ModelWeights │ │
│ │ • Device │ │ • model_config! │ │ • Format detect │ │
│ │ • ops_fn │ │ • ModelInputs │ │ • GGUF, SafeT. │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ ──────────────────────────────────────────────────────────── │
│ Model Implementations │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ LlamaV2 │ │ QwenV2 │ │MixtralV2│ │ LLaVAV2 │ │WhisperV2│ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
└─────────────────────────────────────────────────────────────────┘
Learn more about the architecture :material-arrow-right:
Current Status¶
Production Ready Components
- 47 model architectures implemented
- 166 passing tests
- GGUF and SafeTensors weight loading
- Ollama registry integration
- Full LLaMA inference pipeline
In Development
- KV caching for efficient generation
- GPU acceleration (CUDA, Metal)
- Production HTTP server
Next Steps¶
-
Learn how to install and run your first model
-
Detailed guides for loading models and running inference
-
Browse all supported model architectures
-
Complete API documentation