Design Decisions

This document explains the key architectural decisions made in UniLLM and their rationale.

Functional Tensor Operations

Decision: Use a functional interface (ops_fn::*) rather than methods on Tensor.

Rationale:

// Chosen approach: Functional
let result = ops_fn::matmul(&a, &b)?;
let normed = ops_fn::layer_norm(&hidden, &weight, None, 1e-6)?;

// Alternative: Method-based
let result = a.matmul(&b)?;
let normed = hidden.layer_norm(&weight, None, 1e-6)?;

Benefits of functional approach:

  • Explicit operations - Clear what operation is being performed
  • Backend flexibility - Easy to swap implementations
  • Consistency - Same pattern for all operations
  • Testing - Operations can be tested independently
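To make the backend-flexibility and testing points concrete, here is a minimal, self-contained sketch. The toy `Tensor` struct, the `String` error type, and the body of `matmul` are illustrative assumptions, not UniLLM's actual types; only the `ops_fn::matmul(&a, &b)?` call shape comes from the example above.

```rust
/// Toy row-major 2-D tensor (hypothetical; stands in for the real Tensor type).
#[derive(Debug, Clone, PartialEq)]
pub struct Tensor {
    pub rows: usize,
    pub cols: usize,
    pub data: Vec<f32>,
}

pub mod ops_fn {
    use super::Tensor;

    /// Free function: the implementation lives in one place and could be
    /// replaced by another backend without touching `Tensor` itself.
    pub fn matmul(a: &Tensor, b: &Tensor) -> Result<Tensor, String> {
        if a.cols != b.rows {
            return Err(format!(
                "shape mismatch: {}x{} @ {}x{}",
                a.rows, a.cols, b.rows, b.cols
            ));
        }
        let mut data = vec![0.0; a.rows * b.cols];
        for i in 0..a.rows {
            for k in 0..a.cols {
                let a_ik = a.data[i * a.cols + k];
                for j in 0..b.cols {
                    data[i * b.cols + j] += a_ik * b.data[k * b.cols + j];
                }
            }
        }
        Ok(Tensor { rows: a.rows, cols: b.cols, data })
    }
}
```

Because `matmul` is a plain function over plain data, it can be unit-tested without constructing a model or selecting a device.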

Single Model Trait

Decision: All models implement a single Model trait rather than specialized traits per model type.

Rationale:

// Chosen approach: Universal trait
pub trait Model {
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs>;
    fn generate(&self, prompt: &str, config: &GenerationConfig) -> Result<String>;
}

// Alternative: Specialized traits
pub trait LanguageModel { fn generate(&self, ...) -> ...; }
pub trait VisionModel { fn encode_image(&self, ...) -> ...; }
pub trait AudioModel { fn transcribe(&self, ...) -> ...; }

Benefits:

  • Simplicity - One interface to learn
  • Composability - Models can be used interchangeably
  • Extensibility - Easy to add new model types
  • Enum-based I/O - ModelInputs and ModelOutputs handle type differences
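The composability point can be sketched as follows. This is a reduced illustration, not the real API: the enums hold plain vectors instead of tensors, `generate` is omitted, the error type is a `String`, and `ToyTextModel`, `ToyVisionModel`, and `run_all` are hypothetical names.

```rust
/// Simplified stand-ins for the real I/O enums.
pub enum ModelInputs {
    Text { input_ids: Vec<u32> },
    Image { pixel_values: Vec<f32> },
}

#[derive(Debug, PartialEq)]
pub enum ModelOutputs {
    Logits(Vec<f32>),
    Embedding(Vec<f32>),
}

pub trait Model {
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs, String>;
}

/// Hypothetical text model: accepts only `Text` inputs.
pub struct ToyTextModel;

impl Model for ToyTextModel {
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs, String> {
        match inputs {
            ModelInputs::Text { input_ids } => Ok(ModelOutputs::Logits(
                input_ids.iter().map(|&id| id as f32).collect(),
            )),
            _ => Err("ToyTextModel only accepts text inputs".to_string()),
        }
    }
}

/// Hypothetical vision model: accepts only `Image` inputs.
pub struct ToyVisionModel;

impl Model for ToyVisionModel {
    fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs, String> {
        match inputs {
            ModelInputs::Image { pixel_values } => {
                let total: f32 = pixel_values.iter().sum();
                Ok(ModelOutputs::Embedding(vec![total]))
            }
            _ => Err("ToyVisionModel only accepts image inputs".to_string()),
        }
    }
}

/// Calling code depends on the single trait, not on a per-modality trait.
pub fn run_all(
    models: &[Box<dyn Model>],
    inputs: &ModelInputs,
) -> Vec<Result<ModelOutputs, String>> {
    models.iter().map(|m| m.forward(inputs)).collect()
}
```

One consequence worth noting: a model that receives an unsupported input variant reports it at runtime as an error, rather than being rejected at compile time as it would be with specialized traits.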

model_config! Macro

Decision: Use a macro to generate configuration structs and trait implementations.

Rationale:

// With macro (current)
model_config!(LlamaConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
});

// Without macro (alternative)
#[derive(Clone, Debug)]
pub struct LlamaConfig {
    pub vocab_size: usize,
    pub hidden_size: usize,
}

impl Default for LlamaConfig { /* ... */ }
impl ModelConfig for LlamaConfig { /* ... */ }

Benefits:

  • Reduced boilerplate - 47 models × ~50 lines each ≈ 2,350 lines saved
  • Consistency - All configs follow the same pattern
  • Defaults inline - Default values are visible next to each field
  • Automatic validation - The macro can also generate validation code
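For readers curious how such a macro can work, here is a minimal `macro_rules!` sketch that accepts the same call syntax as above. It is an assumption about the general shape, not the actual `model_config!` implementation; in particular, the `ModelConfig` trait shown here with a single `name()` method is hypothetical.

```rust
/// Hypothetical trait; the real `ModelConfig` may require different methods.
pub trait ModelConfig: Default + Clone {
    fn name() -> &'static str;
}

/// Generates the struct, its `Default` impl, and the trait impl
/// from one `field: Type = default` list.
macro_rules! model_config {
    ($name:ident { $($field:ident : $ty:ty = $default:expr),* $(,)? }) => {
        #[derive(Clone, Debug, PartialEq)]
        pub struct $name {
            $(pub $field: $ty,)*
        }

        impl Default for $name {
            fn default() -> Self {
                Self { $($field: $default,)* }
            }
        }

        impl ModelConfig for $name {
            fn name() -> &'static str {
                stringify!($name)
            }
        }
    };
}

model_config!(LlamaConfig {
    vocab_size: usize = 32000,
    hidden_size: usize = 4096,
});
```

With this in place, `LlamaConfig::default()` yields the inline defaults, and a variant can override individual fields with struct update syntax, for example `LlamaConfig { hidden_size: 512, ..Default::default() }`.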

Enum-Based I/O

Decision: Use enums for ModelInputs and ModelOutputs rather than generics.

Rationale:

// Chosen approach: Enums
pub enum ModelInputs {
    Text { input_ids: Tensor, ... },
    Image { pixel_values: Tensor, ... },
    Multimodal { ... },
    Audio { ... },
}

// Alternative: Generics
pub trait ModelInput { fn to_tensor(&self) -> Tensor; }
fn forward<I: ModelInput>(&self, input: I) -> Result<O>;

Benefits:

  • Explicit types - Clear what each model accepts
  • Runtime flexibility - Can switch input types dynamically
  • Pattern matching - Easy to handle different cases
  • Serialization - Simpler to serialize/deserialize
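The runtime-flexibility and pattern-matching points can be illustrated with a small sketch. The enum below is a simplified stand-in: it uses plain vectors instead of tensors, omits the `Multimodal` variant, and the `samples` field on `Audio` is a hypothetical placeholder for the fields elided above.

```rust
/// Simplified stand-in for the real enum.
#[derive(Debug)]
pub enum ModelInputs {
    Text { input_ids: Vec<u32> },
    Image { pixel_values: Vec<f32> },
    Audio { samples: Vec<f32> },
}

/// Exhaustive match: adding a new variant makes this function fail to
/// compile until the new case is handled.
pub fn element_count(inputs: &ModelInputs) -> usize {
    match inputs {
        ModelInputs::Text { input_ids } => input_ids.len(),
        ModelInputs::Image { pixel_values } => pixel_values.len(),
        ModelInputs::Audio { samples } => samples.len(),
    }
}

/// One collection can hold mixed input kinds chosen at runtime, which a
/// generic `I: ModelInput` parameter could not do without trait objects.
pub fn total_elements(batch: &[ModelInputs]) -> usize {
    batch.iter().map(element_count).sum()
}
```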

Automatic Dequantization

Decision: GGUF quantized weights are automatically dequantized to F32 during loading.

Rationale:

// Current behavior
let weights = WeightLoader::from_gguf("model-Q4_K_M.gguf")?;
// All weights are F32, ready for any backend

// Alternative: Keep quantized
let weights = WeightLoader::from_gguf("model-Q4_K_M.gguf")?;
// Weights stay quantized, need special ops

Benefits:

  • Backend compatibility - All backends support F32
  • Simpler models - No quantization-aware code paths
  • Correctness - Same numerical results as reference

Trade-offs:

  • Memory - F32 weights use more memory than the quantized originals
  • Future - Quantized inference is planned as a later optimization
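As a rough illustration of what dequantization involves and why it costs memory, the sketch below expands a simplified 8-bit block format to F32. This is an assumption made for clarity, not the loader's actual code: the `QuantBlock` layout (one `f32` scale per 32 `i8` values) is hypothetical, and real GGUF formats such as Q4_K_M use more involved block layouts.

```rust
pub const BLOCK_SIZE: usize = 32;

/// Hypothetical simplified quantized block: one scale shared by 32 int8 values.
pub struct QuantBlock {
    pub scale: f32,
    pub quants: [i8; BLOCK_SIZE],
}

/// Expand every block to F32: value = scale * quant.
pub fn dequantize(blocks: &[QuantBlock]) -> Vec<f32> {
    let mut out = Vec::with_capacity(blocks.len() * BLOCK_SIZE);
    for block in blocks {
        out.extend(block.quants.iter().map(|&q| block.scale * q as f32));
    }
    out
}

/// Bytes used by the quantized form and by the F32 form of the same blocks.
pub fn memory_bytes(blocks: &[QuantBlock]) -> (usize, usize) {
    let quantized = blocks.len() * std::mem::size_of::<QuantBlock>();
    let dequantized = blocks.len() * BLOCK_SIZE * std::mem::size_of::<f32>();
    (quantized, dequantized)
}
```

Even in this 8-bit toy format the F32 form is several times larger than the quantized one; 4-bit formats widen the gap further, which is the memory trade-off noted above.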

Device as Enum

Decision: Use an enum for device representation rather than trait objects.

Rationale:

// Chosen approach: Enum
#[derive(Clone, Copy)] // required for the pass-by-value benefit listed below
pub enum Device {
    CPU,
    CUDA(usize),
    Metal(usize),
}

// Alternative: Trait object
pub trait Device: Send + Sync { ... }
pub struct CpuDevice;
pub struct CudaDevice(usize);

Benefits:

  • Simple - Easy to understand and use
  • Pattern matching - Clean device-specific logic
  • No vtable - Slightly more efficient
  • Copy - Can be passed by value
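A short sketch of the pattern-matching and `Copy` points follows. The derives and the `label` and `is_gpu` helpers are illustrative assumptions; only the three variants come from the definition above.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Device {
    CPU,
    CUDA(usize),
    Metal(usize),
}

impl Device {
    /// Device-specific logic expressed as one exhaustive match.
    /// Takes `self` by value, which is cheap because `Device` is `Copy`.
    pub fn label(self) -> String {
        match self {
            Device::CPU => "cpu".to_string(),
            Device::CUDA(index) => format!("cuda:{}", index),
            Device::Metal(index) => format!("metal:{}", index),
        }
    }

    /// True for any accelerator variant.
    pub fn is_gpu(self) -> bool {
        !matches!(self, Device::CPU)
    }
}
```

Because the enum is `Copy`, a value such as `Device::CUDA(1)` can be handed to several layers without cloning or borrowing.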

Candle as Backend

Decision: Use Candle as the primary tensor backend.

Rationale:

Evaluated alternatives:

  • PyTorch (via tch-rs) - Heavy dependency, C++ complexity
  • ONNX Runtime - Graph-based, less flexible
  • Custom - Too much work for same functionality
  • Candle - Pure Rust, active development, good API

Benefits of Candle:

  • Pure Rust - No C++ dependencies
  • HuggingFace - Same maintainers as transformers
  • CUDA/Metal - GPU support built-in
  • Quantization - GGUF support included
  • Active - Regular updates and improvements

Error Handling with anyhow

Decision: Use anyhow::Result for error handling rather than custom error types.

Rationale:

// Chosen approach: anyhow
use anyhow::Result;
fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs>;

// Alternative: Custom errors
#[derive(Debug, thiserror::Error)] // thiserror's derive also requires Debug
enum ModelError { ... }
fn forward(&self, inputs: &ModelInputs) -> Result<ModelOutputs, ModelError>;

Benefits:

  • Simplicity - Less boilerplate
  • Composability - Different error types work together
  • Context - Easy to add error context
  • Development speed - Can refine error types later
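The composability point rests on how `?` converts different error types into one catch-all type. The sketch below shows that mechanism using only the standard library so it compiles without extra crates: `BoxResult` is a stand-in for `anyhow::Result`, not the same type, and `parse_shape` and `parse_shape_bytes` are hypothetical helpers.

```rust
use std::error::Error;

/// Std-only stand-in for `anyhow::Result`, used so this sketch needs no crates.
pub type BoxResult<T> = std::result::Result<T, Box<dyn Error + Send + Sync>>;

/// Parses a shape such as "4x8" and returns the element count.
pub fn parse_shape(s: &str) -> BoxResult<usize> {
    // A plain string message becomes a boxed error via `?`.
    let (rows, cols) = s.split_once('x').ok_or("expected ROWSxCOLS")?;
    // `ParseIntError` is converted by the same `?`.
    let rows: usize = rows.trim().parse()?;
    let cols: usize = cols.trim().parse()?;
    Ok(rows * cols)
}

/// Same as `parse_shape`, but starting from raw bytes.
pub fn parse_shape_bytes(raw: &[u8]) -> BoxResult<usize> {
    // `Utf8Error` is a third, unrelated error type; no custom enum is needed.
    let text = std::str::from_utf8(raw)?;
    parse_shape(text)
}
```

With `anyhow`, the function signatures would use `anyhow::Result<T>` instead, and call sites could additionally attach messages through its `Context` trait.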

Layer-by-Layer Construction

Decision: Models are constructed layer by layer from weights, not from a graph.

Rationale:

// Chosen approach: Manual construction
let layers: Vec<Layer> = (0..config.num_layers)
    .map(|i| Layer::from_weights(&weights, i))
    .collect::<Result<Vec<_>>>()?; // collect into a Result so `?` can propagate the first error

// Alternative: Graph-based
let graph = Graph::from_onnx("model.onnx")?;
let model = Model::from_graph(graph)?;

Benefits:

  • Flexibility - Full control over model structure
  • Debugging - Easy to inspect intermediate states
  • Optimization - Can apply architecture-specific optimizations
  • Understanding - Clear how model works
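The construction loop can be sketched end to end as follows. The `Weights` map, the `layers.{i}.weight` naming scheme, and the `String` error type are assumptions chosen to keep the example self-contained; the real weight loader and layer types will differ.

```rust
use std::collections::HashMap;

/// Toy weight store keyed by tensor name (hypothetical naming scheme).
pub type Weights = HashMap<String, Vec<f32>>;

#[derive(Debug)]
pub struct Layer {
    pub index: usize,
    pub weight: Vec<f32>,
}

impl Layer {
    /// Looks up this layer's tensor by name and fails if it is missing.
    pub fn from_weights(weights: &Weights, index: usize) -> Result<Layer, String> {
        let key = format!("layers.{}.weight", index);
        let weight = weights
            .get(&key)
            .cloned()
            .ok_or_else(|| format!("missing tensor: {}", key))?;
        Ok(Layer { index, weight })
    }
}

/// Builds all layers in order; collecting into `Result<Vec<_>, _>` stops
/// at the first layer that fails to load and returns its error.
pub fn build_layers(weights: &Weights, num_layers: usize) -> Result<Vec<Layer>, String> {
    (0..num_layers)
        .map(|i| Layer::from_weights(weights, i))
        .collect()
}
```

Since each layer is an ordinary value in a `Vec`, intermediate states can be inspected by indexing into the vector and running one layer at a time.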

No Training Support

Decision: UniLLM is inference-only; no gradient computation or training.

Rationale:

  • Focus - Inference and training have different requirements
  • Performance - No backward pass overhead
  • Simplicity - Much simpler codebase
  • Use case - Target audience wants fast inference

Ollama Integration

Decision: Integrate with Ollama registry for model downloads.

Rationale:

// Easy model access
let path = OllamaRegistry::pull("llama2:7b")?;
let weights = WeightLoader::from_gguf(&path)?;

Benefits:

  • Ecosystem - Leverage existing model library
  • Convenience - No manual download needed
  • Caching - Built-in model caching
  • Compatibility - GGUF format already supported

Consistent Naming

Decision: Follow consistent naming conventions across all code.

Patterns:

  • Models: {Name}ModelV2 (e.g., LlamaModelV2)
  • Configs: {Name}Config (e.g., LlamaConfig)
  • Layers: {Name}Layer (e.g., LlamaLayer)
  • Attention: {Name}Attention (e.g., LlamaAttention)
  • MLP: {Name}MLP (e.g., LlamaMLP)

Benefits:

  • Predictability - Know names without looking
  • Search - Easy to find related code
  • Refactoring - Consistent patterns to update

Test Structure

Decision: Tests live in #[cfg(test)] modules within source files.

Rationale:

// In model.rs
pub struct MyModel { ... }

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_forward() { ... }
}

Benefits:

  • Proximity - Tests next to implementation
  • Access - Can test private functions
  • Compilation - Only compile tests when needed
  • Discovery - Easy to find tests for code