Mixtral¶
Mixtral is Mistral AI's Mixture-of-Experts (MoE) model, achieving near GPT-4 performance with efficient sparse computation.
Overview¶
| Property | Value |
|---|---|
| Architecture | Sparse MoE Transformer |
| Parameters | 46.7B total, 12.9B active |
| Experts | 8 experts per layer |
| Active Experts | 2 per token |
| Context Length | 32K tokens |
| Attention | Grouped Query (8 KV heads) |
| Position Encoding | RoPE |
Quick Start¶
use unillm::models_v2::mixtral::{MixtralModelV2, MixtralConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, GenerationConfig};
// Load model
let weights = WeightLoader::from_gguf("mixtral-8x7b.gguf")?;
let config = MixtralConfig::from_gguf_metadata(weights.metadata())?;
let model = MixtralModelV2::from_weights(config, weights)?;
// Generate
let response = model.generate(
"Explain how neural networks learn:",
&GenerationConfig::default(),
)?;
Configuration¶
model_config!(MixtralConfig {
vocab_size: usize = 32000,
hidden_size: usize = 4096,
intermediate_size: usize = 14336,
num_hidden_layers: usize = 32,
num_attention_heads: usize = 32,
num_key_value_heads: usize = 8,
num_experts: usize = 8,
num_experts_per_tok: usize = 2,
max_position_embeddings: usize = 32768,
sliding_window: usize = 4096,
rope_theta: f32 = 1000000.0,
rms_norm_eps: f32 = 1e-5,
router_aux_loss_coef: f32 = 0.02,
});
How MoE Works¶
Architecture¶
Input
│
▼
┌─────────────────────────────────┐
│ Self-Attention (GQA) │
└─────────────────────────────────┘
│
▼
┌─────────────────────────────────┐
│ Router Network │
│ ├─ Expert 0 ─┐ │
│ ├─ Expert 1 ─┼─ Select Top-2 │
│ ├─ Expert 2 ─┘ │
│ ├─ Expert 3 │
│ ├─ Expert 4 │
│ ├─ Expert 5 │
│ ├─ Expert 6 │
│ └─ Expert 7 │
└─────────────────────────────────┘
│
▼
Output
Router¶
The router selects which experts process each token:
fn forward_moe(&self, hidden: &Tensor) -> Result<Tensor> {
// Compute router logits
let router_logits = ops_fn::linear(hidden, &self.gate, None)?;
// Select top-k experts
let (weights, indices) = ops_fn::topk(
&router_logits,
self.config.num_experts_per_tok, // k=2
-1
)?;
// Normalize weights
let weights = ops_fn::softmax(&weights, -1)?;
// Process through selected experts and combine
// ...
}
Benefits¶
- Efficiency: Only ~25% of parameters active per token
- Capacity: Total knowledge of 47B parameters
- Speed: Similar latency to 13B dense model
- Quality: Performance approaching GPT-4
Loading from Ollama¶
use unillm::ollama::OllamaRegistry;
// Full model
let path = OllamaRegistry::pull("mixtral:8x7b")?;
// Instruct version
let path = OllamaRegistry::pull("mixtral:8x7b-instruct")?;
// Quantized (recommended)
let path = OllamaRegistry::pull("mixtral:8x7b-q4_0")?;
Generation Examples¶
Chat Format¶
let prompt = "<s>[INST] Explain quantum computing [/INST]";
let config = GenerationConfig {
max_new_tokens: 512,
temperature: 0.7,
top_p: 0.9,
..Default::default()
};
let response = model.generate(prompt, &config)?;
Multi-turn¶
let prompt = r#"<s>[INST] What is Python? [/INST] Python is a high-level programming language...</s>
[INST] How do I install it? [/INST]"#;
let response = model.generate(prompt, &config)?;
Memory Requirements¶
| Format | VRAM | RAM |
|---|---|---|
| F16 | 90 GB | 90 GB |
| Q8_0 | 50 GB | 50 GB |
| Q6_K | 40 GB | 40 GB |
| Q4_K_M | 28 GB | 28 GB |
| Q3_K_M | 22 GB | 22 GB |
For consumer GPUs, use Q4_K_M or lower quantization.
Performance¶
Benchmarks¶
| Benchmark | Mixtral 8x7B | LLaMA 2 70B | GPT-3.5 |
|---|---|---|---|
| MMLU | 70.6 | 68.9 | 70.0 |
| HellaSwag | 84.4 | 85.3 | 85.5 |
| Arc-C | 66.4 | 64.6 | 85.2 |
| HumanEval | 40.2 | 29.9 | 48.1 |
Inference Speed¶
Mixtral is faster than dense models with similar quality:
| Model | Active Params | Speed* |
|---|---|---|
| Mixtral 8x7B | 12.9B | 30 tok/s |
| LLaMA 2 13B | 13B | 28 tok/s |
| LLaMA 2 70B | 70B | 8 tok/s |
*Approximate, hardware dependent
Expert Analysis¶
Expert Specialization¶
Experts tend to specialize in different domains:
| Expert | Tendency |
|---|---|
| 0 | General knowledge |
| 1 | Technical/scientific |
| 2 | Code and formatting |
| 3 | Creative writing |
| ... | ... |
Load Balancing¶
The router includes auxiliary loss to prevent expert collapse:
let config = MixtralConfig {
router_aux_loss_coef: 0.02, // Encourages balanced routing
..Default::default()
};
Use Cases¶
Ideal For¶
- Complex reasoning - Benefits from expert diversity
- Multi-domain tasks - Experts cover different areas
- Production - Better quality/compute ratio
- Long context - 32K native context
When to Use¶
| Scenario | Recommendation |
|---|---|
| Limited VRAM (<16GB) | Use Mistral 7B Q4 |
| 24GB VRAM | Use Mixtral Q4 |
| 48GB+ VRAM | Use Mixtral Q8 or F16 |
| Maximum quality | Use Mixtral Instruct |
Best Practices¶
- Use quantization - Q4_K_M maintains good quality
- Proper prompting - Use [INST] format for Instruct
- Balance load - MoE benefits from diverse prompts
- Monitor memory - All experts must fit in memory
Variants¶
Mixtral 8x7B Base¶
Pre-trained model for fine-tuning.
Mixtral 8x7B Instruct¶
Instruction-tuned for chat and assistance.
Mixtral 8x22B¶
Larger MoE variant (not yet supported).