DeepSeek-MoE¶
DeepSeek-MoE is a Mixture-of-Experts model with fine-grained expert segmentation and shared expert isolation for improved efficiency.
Overview¶
| Property | Value |
|---|---|
| Architecture | Fine-grained MoE Transformer |
| Parameters | 16B total, ~2.8B active |
| Experts | 64 routed + 2 shared |
| Active Experts | 6 per token |
| Context Length | 4096 tokens |
| Attention | Grouped Query |
| Position Encoding | RoPE |
Quick Start¶
use unillm::models_v2::deepseek_moe::{DeepSeekMoEModelV2, DeepSeekMoEConfig};
use unillm::weight_loader_core::WeightLoader;
use unillm::{Model, GenerationConfig};
// Load model
let weights = WeightLoader::from_gguf("deepseek-moe-16b.gguf")?;
let config = DeepSeekMoEConfig::from_gguf_metadata(weights.metadata())?;
let model = DeepSeekMoEModelV2::from_weights(config, weights)?;
// Generate
let response = model.generate(
"Explain the concept of entropy:",
&GenerationConfig::default(),
)?;
Configuration¶
model_config!(DeepSeekMoEConfig {
vocab_size: usize = 102400,
hidden_size: usize = 2048,
intermediate_size: usize = 1408,
num_hidden_layers: usize = 28,
num_attention_heads: usize = 16,
num_key_value_heads: usize = 16,
num_experts: usize = 64,
num_shared_experts: usize = 2,
num_experts_per_tok: usize = 6,
moe_intermediate_size: usize = 1408,
max_position_embeddings: usize = 4096,
rope_theta: f32 = 10000.0,
rms_norm_eps: f32 = 1e-6,
first_k_dense_replace: usize = 1,
});
Key Innovations¶
Fine-Grained Experts¶
Instead of 8 large experts, DeepSeek-MoE uses 64 smaller experts:
Benefits: - More specialization - Experts can focus on narrow domains - Better routing - More options for token assignment - Efficiency - Smaller experts are faster
Shared Expert Isolation¶
Two experts are always active for every token:
fn forward_moe(&self, hidden: &Tensor) -> Result<Tensor> {
// Shared experts always process
let shared_out = self.forward_shared_experts(hidden)?;
// Route to top-k routed experts
let routed_out = self.forward_routed_experts(hidden)?;
// Combine outputs
ops_fn::add(&shared_out, &routed_out)
}
First Layer Dense¶
The first layer uses dense FFN instead of MoE:
let config = DeepSeekMoEConfig {
first_k_dense_replace: 1, // First 1 layer(s) are dense
..Default::default()
};
Architecture Details¶
Expert Structure¶
Layer Input
│
├──────────────────────────┐
│ │
▼ ▼
┌─────────┐ ┌──────────────────┐
│ Shared │ │ Router (Top-6) │
│ Expert 1│ │ → 64 Experts │
│ Expert 2│ └──────────────────┘
└─────────┘ │
│ │
└──────────┬───────────┘
│
▼
Output
Efficiency Analysis¶
| Model | Total Params | Active Params | Ratio |
|---|---|---|---|
| DeepSeek-MoE 16B | 16.4B | 2.8B | 17% |
| Mixtral 8x7B | 46.7B | 12.9B | 28% |
| Dense 7B | 7B | 7B | 100% |
Generation Examples¶
Basic Generation¶
let config = GenerationConfig {
max_new_tokens: 256,
temperature: 0.7,
top_p: 0.9,
..Default::default()
};
let response = model.generate(
"What are the applications of machine learning?",
&config,
)?;
Chat Format¶
let prompt = "User: What is deep learning?\nAssistant:";
let config = GenerationConfig {
stop_sequences: vec!["\nUser:".to_string()],
..Default::default()
};
let response = model.generate(prompt, &config)?;
Memory Requirements¶
| Format | Memory |
|---|---|
| F16 | 33 GB |
| Q8_0 | 17 GB |
| Q4_K_M | 10 GB |
| Q3_K_M | 8 GB |
Performance¶
Benchmarks¶
| Benchmark | DeepSeek-MoE 16B | LLaMA 2 7B | Mixtral 8x7B |
|---|---|---|---|
| MMLU | 45.0 | 45.3 | 70.6 |
| HellaSwag | 77.1 | 77.2 | 84.4 |
| HumanEval | 26.8 | 12.8 | 40.2 |
Efficiency¶
DeepSeek-MoE 16B matches 7B dense models while using: - 40% less compute per token - Similar memory to 7B model during inference
Use Cases¶
Ideal For¶
- Efficiency-focused - Maximum quality per FLOP
- Research - Novel MoE architecture
- Fine-tuning - Good base for domain adaptation
Comparison¶
| Priority | Best Choice |
|---|---|
| Smallest active params | DeepSeek-MoE |
| Highest quality | Mixtral 8x7B |
| Simplest deployment | Dense 7B |
Advanced Topics¶
Expert Load Balancing¶
DeepSeek-MoE uses auxiliary losses for balanced routing:
Capacity Factor¶
Controls maximum tokens per expert:
// Expert capacity = (tokens / num_experts) * capacity_factor
let capacity_factor = 1.25; // 25% extra capacity
Best Practices¶
- Use Q4 quantization - Good balance of quality/memory
- Monitor expert usage - Check for collapsed experts
- Batch efficiently - MoE benefits from larger batches
- Consider dense - For simpler deployments