Mixture of Experts (MoE)
The Mixture of Experts paradigm is one of the most important ideas in scaling language models efficiently. Rather than activating all parameters for every token, MoE models activate only a small subset of expert sub-networks per token, selected by a learned router. This allows the total parameter count, and with it model capacity, to grow dramatically while the computational cost per token stays nearly constant. For example, Mixtral 8x7B has about 47B total parameters but activates only about 13B per token, so its per-token compute is close to that of a 13B dense model while its quality is competitive with much larger dense models [1].
1. Architecture Overview
Historical Context
The Mixture of Experts concept dates back to Jacobs et al. (1991) [2] and was revived as a sparsely gated layer for large neural networks (between stacked LSTM layers) by Shazeer et al. (2017) [3]. The modern resurgence in transformers began with the Switch Transformer (Fedus et al., 2021) [4], which simplified routing to \( k=1 \) (one expert per token). Mixtral (Jiang et al., 2024) [1] demonstrated that MoE could produce a state-of-the-art open model at a fraction of the inference compute of comparably capable dense models.
In a standard transformer, every token passes through the same feed-forward network (FFN). In an MoE transformer, the FFN layer is replaced by a set of \( N \) parallel expert FFNs, and a router (also called a gate) selects the top-\( k \) experts for each token.
2. Key Innovations
2.1 Sparse Activation
Sparse Expert Activation
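For \( N \) experts and top-\( k \) routing, only \( k \) experts are active per token. The fraction of expert parameters that is active per token, and the corresponding sparsity, are:
\[ \frac{P_{\text{active experts}}}{P_{\text{all experts}}} = \frac{k}{N}, \qquad \text{sparsity} = 1 - \frac{k}{N} \]
For Mixtral 8x7B with \( k=2 \), \( N=8 \): \( k/N = 25\% \), so sparsity = 75%.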
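2.2 Router / Gate
The router is a small linear layer that maps each token's hidden state to a score over all experts:
Expert Router
\[ G(x) = \text{Softmax}\big(\text{TopK}(x W_g,\, k)\big) \]
where \( x \in \mathbb{R}^d \) is the token's hidden state, \( W_g \in \mathbb{R}^{d \times N} \) is the gate weight matrix, and TopK keeps the \( k \) highest scores and masks the rest (sets them to \( -\infty \)) before the softmax. The gate values for non-selected experts are therefore exactly zero.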
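2.3 Expert Computation
Each expert is a standard FFN (typically SwiGLU in modern models) with independent weights:
\[ E_i(x) = \big(\text{SiLU}(x W_1^{(i)}) \odot x W_3^{(i)}\big)\, W_2^{(i)} \]
where \( W_1^{(i)}, W_3^{(i)} \in \mathbb{R}^{d \times d_{\text{ff}}} \) and \( W_2^{(i)} \in \mathbb{R}^{d_{\text{ff}} \times d} \) are the weights of expert \( i \), and \( \odot \) is element-wise multiplication.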
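2.4 Weighted Expert Output
The final output is a weighted sum of the selected expert outputs:
MoE Layer Output
\[ y = \sum_{i=1}^{N} G(x)_i \, E_i(x) \]
Only the top-\( k \) experts contribute, because \( G(x)_i = 0 \) for every other expert. The gate values \( G(x)_i \) are re-normalized to sum to 1 over the selected experts:
\[ G(x)_i = \frac{\exp(s_i)}{\sum_{j \in \text{TopK}(s,\,k)} \exp(s_j)} \quad \text{for } i \in \text{TopK}(s, k) \]
where \( s_i = (xW_g)_i \) is the raw gate score for expert \( i \).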
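As a small worked example of this re-normalization, take \( N=4 \) experts and \( k=2 \). The scores below are made up purely for illustration:

\[ s = (2.0,\ 1.0,\ 0.5,\ -1.0) \;\Rightarrow\; \text{TopK}(s, 2) = \{1, 2\} \]
\[ G(x)_1 = \frac{e^{2.0}}{e^{2.0} + e^{1.0}} = \frac{1}{1 + e^{-1}} \approx 0.731, \qquad G(x)_2 = \frac{e^{1.0}}{e^{2.0} + e^{1.0}} \approx 0.269, \qquad G(x)_3 = G(x)_4 = 0 \]

Taking a softmax over all \( N \) scores first and then dividing the selected probabilities by their sum gives the same values, because the full softmax denominator cancels in the ratio.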
2.5 Load Balancing Loss
Without regularization, the router tends to collapse to always selecting the same few experts (the "rich get richer" problem). An auxiliary loss encourages uniform expert utilization:
Auxiliary Load Balancing Loss
\[ \mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i \]
where:
- \( f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}[i \in \text{TopK}(x_t)] \) is the fraction of tokens routed to expert \( i \)
- \( P_i = \frac{1}{T} \sum_{t=1}^{T} G(x_t)_i \) is the average gate probability for expert \( i \)
- \( \alpha \) is a small coefficient (typically 0.01)
- \( N \) is the number of experts
When all experts receive equal traffic, \( f_i = k/N \) and \( P_i = 1/N \), so the loss is minimized.
3. Architecture Diagram

    flowchart TD
        INPUT["Hidden State x"] --> ROUTER["Router\n(Linear + Softmax + TopK)"]
        ROUTER -->|"gate_1 = 0.6"| E1["Expert 1\n(SwiGLU FFN)"]
        ROUTER -->|"gate_5 = 0.4"| E5["Expert 5\n(SwiGLU FFN)"]
        ROUTER -.->|"gate_i = 0 (inactive)"| E2["Expert 2"]
        ROUTER -.->|"gate_i = 0 (inactive)"| E3["Expert 3"]
        ROUTER -.->|"gate_i = 0 (inactive)"| E4["Expert 4"]
        ROUTER -.->|"gate_i = 0 (inactive)"| E6["Expert 6"]
        ROUTER -.->|"gate_i = 0 (inactive)"| E7["Expert 7"]
        ROUTER -.->|"gate_i = 0 (inactive)"| E8["Expert 8"]
        E1 -->|"0.6 * E1(x)"| SUM["Weighted Sum\ny = 0.6*E1(x) + 0.4*E5(x)"]
        E5 -->|"0.4 * E5(x)"| SUM
        SUM --> OUTPUT["Output y"]
        style E2 fill:#f5f5f5,stroke:#ccc,stroke-dasharray: 5 5
        style E3 fill:#f5f5f5,stroke:#ccc,stroke-dasharray: 5 5
        style E4 fill:#f5f5f5,stroke:#ccc,stroke-dasharray: 5 5
        style E6 fill:#f5f5f5,stroke:#ccc,stroke-dasharray: 5 5
        style E7 fill:#f5f5f5,stroke:#ccc,stroke-dasharray: 5 5
        style E8 fill:#f5f5f5,stroke:#ccc,stroke-dasharray: 5 5

    flowchart TD
        subgraph FULL_BLOCK["MoE Transformer Block"]
            LN1["RMSNorm"] --> ATTN["Multi-Head Attention\n(standard, dense)"]
            ATTN --> RES1["Residual Add"]
            RES1 --> LN2["RMSNorm"]
            LN2 --> MOE["MoE Layer\n(Router + N Experts)"]
            MOE --> RES2["Residual Add"]
        end

4. Configuration Parameters
| Parameter | Mixtral 8x7B | Mixtral 8x22B | Qwen2-MoE-57B |
|---|---|---|---|
| n_layers | 32 | 56 | 28 |
| d_model | 4096 | 6144 | 3584 |
| n_heads | 32 | 48 | 28 |
| n_kv_heads | 8 | 8 | 4 |
| n_experts | 8 | 8 | 64 |
| n_experts_active (k) | 2 | 2 | 8 |
| d_expert_ff | 14336 | 16384 | 2560 |
| vocab_size | 32000 | 32768 | 151936 |
| total_params | 46.7B | 141B | 57.4B |
| active_params | 12.9B | 39B | 14.3B |
| activation | SwiGLU | SwiGLU | SwiGLU |
| sliding_window | 4096 | - | - |
Parameter Accounting
For \( N \) experts, each with FFN parameters \( P_{\text{ffn}} \):
\[ P_{\text{total}} = P_{\text{non-expert}} + N \cdot P_{\text{ffn}} \]
\[ P_{\text{active}} = P_{\text{non-expert}} + k \cdot P_{\text{ffn}} \]
Attention layers, embeddings, and norms are shared (dense) and always active. Only the expert FFNs are sparse.
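As a rough check of these formulas against the Mixtral 8x7B column of the table above, assume each SwiGLU expert holds three \( d_{\text{model}} \times d_{\text{expert\_ff}} \) matrices and that every one of the 32 layers has its own set of experts. Both are assumptions of this sketch. Here \( P_{\text{ffn}} \) counts one expert summed over all layers, and \( P_{\text{non-expert}} \) is inferred from the listed total rather than counted directly:

\[ P_{\text{ffn}} = 32 \cdot 3 \cdot 4096 \cdot 14336 \approx 5.64\text{B} \]
\[ N \cdot P_{\text{ffn}} \approx 8 \cdot 5.64\text{B} \approx 45.1\text{B} \;\Rightarrow\; P_{\text{non-expert}} \approx 46.7\text{B} - 45.1\text{B} = 1.6\text{B} \]
\[ P_{\text{active}} \approx 1.6\text{B} + 2 \cdot 5.64\text{B} \approx 12.9\text{B} \]

The result agrees with the 12.9B active parameters listed in the table, and it shows that almost all of the model's parameters sit in the expert FFNs.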
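5. Mathematical Formulation
5.1 Complete MoE Layer
For input \( x \in \mathbb{R}^d \) and \( N \) experts:
Step 1: Router scores
\[ s = x W_g \in \mathbb{R}^N \]
Step 2: TopK selection
\[ \mathcal{T} = \text{TopK}(s, k) \]
Step 3: Gate normalization (over selected experts only)
\[ G(x)_i = \begin{cases} \dfrac{\exp(s_i)}{\sum_{j \in \mathcal{T}} \exp(s_j)} & i \in \mathcal{T} \\ 0 & \text{otherwise} \end{cases} \]
Step 4: Expert computation and aggregation
\[ y = \sum_{i \in \mathcal{T}} G(x)_i \, E_i(x) \]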
5.2 Routing Strategies
| Strategy | k | Description | Reference |
|---|---|---|---|
| Top-2 | 2 | Standard, used by Mixtral | Shazeer et al., 2017 [3] |
| Switch | 1 | Single expert per token, simpler routing | Fedus et al., 2021 [4] |
| Expert Choice | varies | Experts choose their top tokens (inverted) | Zhou et al., 2022 [5] |
| Soft MoE | all | Soft routing, all experts get weighted input | Puigcerver et al., 2024 |
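5.3 Load Balancing: Detailed Derivation
The ideal distribution routes \( T \cdot k / N \) tokens to each expert. The auxiliary loss measures deviation from this ideal:
\[ \mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} f_i \cdot P_i \]
Under perfect balance (\( f_i = k/N \), \( P_i = 1/N \)):
\[ \mathcal{L}_{\text{aux}} = \alpha \cdot N \cdot \sum_{i=1}^{N} \frac{k}{N} \cdot \frac{1}{N} = \alpha \cdot k \]
The loss grows as load concentrates on fewer experts: a favored expert has both a larger \( f_i \) and a larger \( P_i \), so its term \( f_i \cdot P_i \) grows roughly quadratically in its share of the traffic.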
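To make this concrete, here is a short worked case under a simplifying assumption that is not part of the derivation above: traffic is spread evenly over only \( m \) of the \( N \) experts (\( k \le m \le N \)), so those experts have \( f_i = k/m \) and \( P_i = 1/m \), and all others have zero.

\[ \mathcal{L}_{\text{aux}}(m) = \alpha \cdot N \cdot m \cdot \frac{k}{m} \cdot \frac{1}{m} = \frac{\alpha \cdot N \cdot k}{m} \]
\[ N = 8,\ k = 2,\ \alpha = 0.01: \quad \mathcal{L}_{\text{aux}}(8) = 0.02, \quad \mathcal{L}_{\text{aux}}(4) = 0.04, \quad \mathcal{L}_{\text{aux}}(2) = 0.08 \]

The fully collapsed case \( m = k \) gives \( \alpha \cdot N \), which is \( N/k \) times the balanced value.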
6. Zig Implementation
6.1 MoEConfig

    pub const MoEConfig = struct {
        n_layers: u32,
        d_model: u32,
        n_heads: u32,
        n_kv_heads: u32,
        n_experts: u32, // N: total number of experts
        n_experts_active: u32, // k: experts active per token
        d_expert_ff: u32, // FFN hidden dim per expert
        vocab_size: u32,
        max_seq_len: u32,
        rope_theta: f32 = 10000.0,
        norm_eps: f32 = 1e-5,
        router_aux_loss_coef: f32 = 0.01,

        pub fn headDim(self: MoEConfig) u32 {
            return self.d_model / self.n_heads;
        }

        // Expert parameters only. Assumes every layer is an MoE layer and each
        // SwiGLU expert holds three d_model x d_expert_ff matrices.
        pub fn totalParams(self: MoEConfig) u64 {
            const expert_params = @as(u64, self.n_layers) * self.n_experts *
                3 * self.d_model * self.d_expert_ff;
            // ... add attention, embedding, norm params
            return expert_params;
        }

        // Expert parameters used per token (k of the N experts in each layer).
        // Non-expert parameters are always active and are not counted here.
        pub fn activeParams(self: MoEConfig) u64 {
            const active_expert_params = @as(u64, self.n_layers) * self.n_experts_active *
                3 * self.d_model * self.d_expert_ff;
            return active_expert_params;
        }
    };

6.2 Expert Router

    // Tensor, linearForward, softmaxInPlace, and topK are assumed to be
    // defined elsewhere in the project.
    pub const ExpertRouter = struct {
        gate_weight: Tensor(f32), // [d_model, n_experts]
        n_experts: u32,
        n_active: u32,

        pub const RoutingResult = struct {
            expert_indices: []u32, // top-k expert indices
            gate_values: []f32, // normalized gate scores
        };

        pub fn route(self: *ExpertRouter, x: []f32) !RoutingResult {
            // Compute gate scores: x @ W_g
            const scores = try linearForward(x, self.gate_weight);
            // Softmax over all experts
            softmaxInPlace(scores);
            // Select top-k
            const result = try topK(scores, self.n_active);
            // Re-normalize gate values over selected experts
            var gate_sum: f32 = 0;
            for (result.gate_values) |g| gate_sum += g;
            for (result.gate_values) |*g| g.* /= gate_sum;
            return result;
        }
    };

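The router code above calls `softmaxInPlace` and `topK` without showing them. The sketch below is one possible version of such helpers, not the project's actual implementation. The name `topKInto` and its signature are hypothetical: it writes into caller-provided buffers instead of allocating a `RoutingResult`, which keeps the example self-contained.

```zig
const std = @import("std");

/// Numerically stable softmax that overwrites `scores` with probabilities.
pub fn softmaxInPlace(scores: []f32) void {
    if (scores.len == 0) return;
    var max_score: f32 = scores[0];
    for (scores) |s| max_score = @max(max_score, s);
    var sum: f32 = 0;
    for (scores) |*s| {
        s.* = @exp(s.* - max_score);
        sum += s.*;
    }
    for (scores) |*s| s.* /= sum;
}

/// Hypothetical helper: writes the indices and values of the k largest
/// entries of `scores` into `out_idx` / `out_val` (both of length k),
/// in descending order. Plain O(N * k) selection, which is adequate for
/// small expert counts.
pub fn topKInto(scores: []const f32, out_idx: []u32, out_val: []f32) void {
    std.debug.assert(out_idx.len == out_val.len);
    std.debug.assert(out_idx.len <= scores.len);
    for (out_idx, out_val, 0..) |*slot_idx, *slot_val, rank| {
        var best: ?usize = null;
        for (scores, 0..) |s, i| {
            // Skip experts already chosen at an earlier rank.
            var taken = false;
            for (out_idx[0..rank]) |prev| {
                if (prev == i) taken = true;
            }
            if (taken) continue;
            if (best == null or s > scores[best.?]) best = i;
        }
        slot_idx.* = @intCast(best.?);
        slot_val.* = scores[best.?];
    }
}
```

After these two calls, dividing the `k` selected values by their sum yields the re-normalized gate values that `route` returns.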
6.3 MoE Layer

    // `allocator` is assumed to be a file-scope std.mem.Allocator, and
    // SwiGLU_FFN.forward is assumed to allocate its result with it.
    pub const MoELayer = struct {
        router: ExpertRouter,
        experts: []SwiGLU_FFN, // N independent FFN experts
        config: MoEConfig,

        // The caller owns the returned slice and must free it.
        pub fn forward(self: *MoELayer, x: []f32) ![]f32 {
            // Route: determine which experts process this token
            const routing = try self.router.route(x);
            // Compute weighted sum of selected expert outputs
            const output = try allocator.alloc(f32, x.len);
            // Do not leak the buffer if an expert's forward pass fails
            errdefer allocator.free(output);
            @memset(output, 0);
            for (routing.expert_indices, routing.gate_values) |idx, gate| {
                const expert_out = try self.experts[idx].forward(x);
                defer allocator.free(expert_out);
                // Accumulate: output += gate * expert_output
                for (0..output.len) |i| {
                    output[i] += gate * expert_out[i];
                }
            }
            return output;
        }
    };

6.4 Load Balancing Computation

    // `allocator` is assumed to be a file-scope std.mem.Allocator.
    pub fn computeAuxLoss(
        routing_decisions: []const ExpertRouter.RoutingResult,
        n_experts: u32,
        n_tokens: u32,
        coef: f32,
    ) !f32 {
        const expert_counts = try allocator.alloc(f32, n_experts);
        defer allocator.free(expert_counts);
        const expert_probs = try allocator.alloc(f32, n_experts);
        defer allocator.free(expert_probs);
        @memset(expert_counts, 0);
        @memset(expert_probs, 0);
        for (routing_decisions) |decision| {
            for (decision.expert_indices, decision.gate_values) |idx, gate| {
                expert_counts[idx] += 1.0;
                // Assumption: RoutingResult carries only the re-normalized
                // top-k gates G(x_t)_i, so those are accumulated for P_i.
                // A full-softmax variant would need the complete distribution.
                expert_probs[idx] += gate;
            }
        }
        // Normalize
        const n = @as(f32, @floatFromInt(n_tokens));
        for (0..n_experts) |i| {
            expert_counts[i] /= n; // f_i
            expert_probs[i] /= n; // P_i
        }
        // L_aux = alpha * N * sum(f_i * P_i)
        var loss: f32 = 0;
        for (0..n_experts) |i| {
            loss += expert_counts[i] * expert_probs[i];
        }
        return coef * @as(f32, @floatFromInt(n_experts)) * loss;
    }

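The final reduction in `computeAuxLoss` can be exercised on its own. The sketch below is a self-contained version that takes already-normalized \( f_i \) and \( P_i \) arrays. The function name `auxLoss` and its signature are hypothetical and exist only for this example.

```zig
const std = @import("std");

/// L_aux = alpha * N * sum_i f_i * P_i, given per-expert token fractions `f`
/// and mean gate values `p` that have already been divided by the token count.
pub fn auxLoss(f: []const f32, p: []const f32, alpha: f32) f32 {
    std.debug.assert(f.len == p.len);
    var dot: f32 = 0;
    for (f, p) |f_i, p_i| dot += f_i * p_i;
    const n: f32 = @floatFromInt(f.len);
    return alpha * n * dot;
}
```

Feeding it a uniform split and then a split concentrated on `k` experts is a quick sanity check that the loss rises as routing collapses.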
7. Variants
| Model | Total Params | Active Params | Experts | k | Notes |
|---|---|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 | 2 | Mistral-based MoE [1] |
| Mixtral 8x22B | 141B | 39B | 8 | 2 | Larger Mistral MoE |
| Switch-C | 1.6T | 1.6B | 2048 | 1 | Extreme sparsity [4] |
| Qwen2-MoE-57B | 57.4B | 14.3B | 64 | 8 | Fine-grained experts |
| DeepSeek-MoE | 16.4B | 2.8B | 64 | 6 | Shared + routed experts |
| DBRX | 132B | 36B | 16 | 4 | Databricks MoE |
| Arctic | 480B | 17B | 128 | 2 | Snowflake, extreme scale |
8. Educational Value
What MoE Teaches
- Sparse computation: MoE is the primary example of sparse activation in modern deep learning. Understanding that most parameters are "dormant" for any given input challenges the intuition that all parameters contribute to every prediction.
- Routing as a learned decision: The router is itself a small neural network making a discrete (TopK) selection. The selection step is not differentiable, so the router learns through the gate values of the experts it picked. This connects to broader topics such as differentiable discrete optimization and the straight-through estimator.
- Load balancing and emergent specialization: The auxiliary loss creates a tension between letting experts specialize (some tokens naturally prefer certain experts) and ensuring all experts are utilized. This is a microcosm of the exploration-exploitation trade-off.
- Scaling laws: MoE models demonstrate that total parameter count and active parameter count are distinct axes of scaling. A 47B MoE model does not pay the per-token compute of a dense 47B model, which challenges naive parameter-count comparisons.
- Inference implications: MoE models require all expert weights in memory (or on disk for offloading) even though only a fraction are active. This creates a unique memory-compute trade-off that students must reason about when designing inference systems.
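A back-of-the-envelope calculation illustrates the memory-compute trade-off for Mixtral 8x7B. It assumes 16-bit weights (2 bytes per parameter) and ignores the KV cache and activations. Both simplifications are assumptions of this sketch:

\[ \text{weights held in memory} \approx 46.7 \times 10^{9} \cdot 2\ \text{bytes} \approx 93.4\ \text{GB} \]
\[ \text{weights used per token} \approx 12.9 \times 10^{9} \cdot 2\ \text{bytes} \approx 25.8\ \text{GB} \]

The model therefore needs the memory footprint of a 47B-parameter model while each token touches only about a quarter of those weights.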
9. References
1. Jiang, A. Q. et al. "Mixtral of Experts." arXiv:2401.04088, 2024.
2. Jacobs, R. A. et al. "Adaptive Mixtures of Local Experts." Neural Computation, 3(1):79-87, 1991.
3. Shazeer, N. et al. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR, 2017.
4. Fedus, W., Zoph, B. & Shazeer, N. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR, 2022.
5. Zhou, Y. et al. "Mixture-of-Experts with Expert Choice Routing." NeurIPS, 2022.