Feed-Forward Networks¶
Every transformer block contains a feed-forward network (FFN) applied independently to each position in the sequence. While attention allows positions to communicate, the FFN is where the model performs its per-position processing -- transforming representations through a non-linear bottleneck. In modern LLMs, FFN layers hold approximately two-thirds of all per-layer parameters, making them the largest single component of the model.
1. Role in Transformers¶
1.1 Position-Wise Processing¶
Position Independence
The FFN applies the same transformation to every position in the sequence, with no interaction between positions:
This is in contrast to attention, which mixes information across all positions. The FFN's job is to process the information that attention has gathered.
The position-wise property makes FFNs embarrassingly parallel across the sequence dimension -- every position can be computed independently, which is ideal for GPU/SIMD execution.
1.2 The Two Halves of a Transformer Block¶
A transformer block alternates between two types of computation:
- Attention: What information is relevant? (inter-position mixing)
- FFN: How should I transform this information? (per-position processing)
This alternation is repeated \(L\) times, with residual connections ensuring information can flow directly from early to late layers.
2. Standard FFN (Original Transformer)¶
Standard Feed-Forward Network
where \(W_1 \in \mathbb{R}^{d \times d_{ff}}\), \(W_2 \in \mathbb{R}^{d_{ff} \times d}\), and \(\sigma\) is the activation function (ReLU in the original paper).1
The standard FFN is a two-layer MLP with a hidden dimension \(d_{ff}\) that is larger than the model dimension \(d\). The expansion to \(d_{ff}\) followed by compression back to \(d\) creates an information bottleneck that forces the network to learn useful intermediate representations.
Convention: \(d_{ff} = 4\, d_{\text{model}}\).
| Parameter | Shape | Count |
|---|---|---|
| \(W_1\) | \(d \times d_{ff}\) | \(d \cdot d_{ff}\) |
| \(W_2\) | \(d_{ff} \times d\) | \(d_{ff} \cdot d\) |
| Total | \(2\, d \cdot d_{ff} = 8\, d^2\) (with \(d_{ff} = 4d\)) |
3. SwiGLU FFN (LLaMA Architecture)¶
3.1 Definition¶
SwiGLU Feed-Forward Network
where \(W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{d \times d_{ff}}\) and \(W_{\text{down}} \in \mathbb{R}^{d_{ff} \times d}\).2
3.2 The Three Matrices¶
Unlike the standard FFN with two weight matrices, SwiGLU uses three:
| Matrix | Role | Shape |
|---|---|---|
| \(W_{\text{gate}}\) | Gate projection (passed through SiLU) | \(d \times d_{ff}\) |
| \(W_{\text{up}}\) | Up projection (content stream) | \(d \times d_{ff}\) |
| \(W_{\text{down}}\) | Down projection (output compression) | \(d_{ff} \times d\) |
3.3 Computation Flow¶
flowchart LR
X["x ∈ R^d"] --> GATE["x @ W_gate"]
X --> UP["x @ W_up"]
GATE --> SILU["SiLU(...)"]
SILU --> MUL["Element-wise Multiply"]
UP --> MUL
MUL --> DOWN["... @ W_down"]
DOWN --> OUT["output ∈ R^d"]
style SILU fill:#d5f5e3,stroke:#1e8449
style MUL fill:#fdebd0,stroke:#e67e22 3.4 Intuition: Why Gating Helps¶
The gating mechanism allows the network to learn which features to activate. The gate stream \(\operatorname{SiLU}(xW_{\text{gate}})\) produces values in \(\approx(-0.28, +\infty)\) that modulate the content stream \(xW_{\text{up}}\) element-wise. This is more expressive than applying a single activation to the entire hidden representation.
4. GeGLU FFN¶
GeGLU Feed-Forward Network
GeGLU is identical to SwiGLU except that the gate activation is GELU instead of SiLU. The two variants perform comparably; choice is typically dictated by consistency with the rest of the architecture.
5. Hidden Dimension Sizing¶
5.1 Standard FFN: \(d_{ff} = 4\, d_{\text{model}}\)¶
The original Transformer used a 4:1 ratio, giving each FFN layer \(8d^2\) parameters. This was found empirically to provide a good trade-off between model capacity and computational cost.
5.2 SwiGLU: \(d_{ff} = \frac{8}{3}\, d_{\text{model}}\)¶
Matching Parameter Counts
The standard FFN has \(2 \cdot d \cdot d_{ff}^{\text{std}} = 8d^2\) parameters (with \(d_{ff}^{\text{std}} = 4d\)).
The SwiGLU FFN has \(3 \cdot d \cdot d_{ff}^{\text{gated}}\) parameters. Setting these equal:
In practice, \(d_{ff}\) is often rounded to the nearest multiple of 256 for hardware alignment.
For LLaMA-7B with \(d = 4096\):
6. Parameter Distribution¶
6.1 FFN Dominance¶
Per-Layer Parameter Breakdown
For a standard transformer block with \(d_{ff} = 4d\):
| Component | Parameters | Fraction |
|---|---|---|
| Attention (Q, K, V, O) | \(4d^2\) | 33.3% |
| FFN (W1, W2) | \(8d^2\) | 66.7% |
| Normalization | \(\sim 4d\) | ~0% |
| Total | \(12d^2 + O(d)\) | 100% |
For a SwiGLU block with \(d_{ff} = \frac{8}{3}d\):
| Component | Parameters | Fraction |
|---|---|---|
| Attention (Q, K, V, O) | \(4d^2\) | 33.3% |
| FFN (gate, up, down) | \(8d^2\) | 66.7% |
| Normalization | \(\sim 2d\) (RMSNorm, no bias) | ~0% |
| Total | \(12d^2 + O(d)\) | 100% |
The parameter count is identical by design (the SwiGLU hidden dimension was chosen to match). The difference is that SwiGLU allocates those parameters across three matrices with gating, which empirically yields better quality.
6.2 FLOPs per Token¶
For a single FFN forward pass:
| Variant | Matrix Multiplies | FLOPs/token |
|---|---|---|
| Standard | 2 | \(2 \times 2 \times d \times d_{ff} = 16 d^2\) |
| SwiGLU | 3 | \(3 \times 2 \times d \times d_{ff} = 16 d^2\) |
The per-token FLOP count is the same because SwiGLU compensates with a smaller \(d_{ff}\).
7. Comparison Table¶
| Variant | Matrices | \(d_{ff}\) | Activation | Parameters | Quality | Models |
|---|---|---|---|---|---|---|
| Standard (ReLU) | 2 | \(4d\) | ReLU | \(8d^2\) | Baseline | Original Transformer |
| GELU | 2 | \(4d\) | GELU | \(8d^2\) | Better | BERT, GPT-2, GPT-3 |
| SwiGLU | 3 | \(\frac{8}{3}d\) | SiLU (gate) | \(8d^2\) | Best | LLaMA, LLaMA 2, Mistral |
| GeGLU | 3 | \(\frac{8}{3}d\) | GELU (gate) | \(8d^2\) | ~Best | Some T5 variants |
| GLU | 3 | \(\frac{8}{3}d\) | Sigmoid (gate) | \(8d^2\) | Good | Older research |
8. Mixture of Experts (MoE) -- Brief Extension¶
ZigLlama also implements an ExpertFFN for Mixture of Experts models, where multiple FFN "experts" exist per layer and a routing network selects which expert(s) process each token.
flowchart TD
X["Input token"] --> ROUTER["Router Network"]
ROUTER --> |"Top-k selection"| E1["Expert 1 (FFN)"]
ROUTER --> |"Top-k selection"| E2["Expert 2 (FFN)"]
ROUTER -.-> E3["Expert 3"]
ROUTER -.-> EN["Expert N"]
E1 --> COMBINE["Weighted Sum"]
E2 --> COMBINE
COMBINE --> OUT["Output"]
style ROUTER fill:#e8daef,stroke:#7d3c98
style E3 fill:#f2f3f4,stroke:#aab7b8
style EN fill:#f2f3f4,stroke:#aab7b8 Each expert is a standard FeedForward instance. The router produces softmax-normalized weights, and only the top-\(k\) experts (typically \(k=2\)) are activated per token, keeping compute constant while scaling model capacity.
9. Implementation in ZigLlama¶
9.1 FFNType Enum¶
pub const FFNType = enum {
Standard, // Linear -> ReLU -> Linear
GELU, // Linear -> GELU -> Linear
SwiGLU, // SwiGLU gated activation (LLaMA)
GeGLU, // GeGLU gated activation
GLU, // Classic GLU (sigmoid gate)
};
9.2 FeedForward Struct¶
pub const FeedForward = struct {
ffn_type: FFNType,
d_model: usize,
d_ff: usize,
w1: Tensor(f32), // [d_model x d_ff] (gate for gated variants)
w2: Tensor(f32), // [d_ff x d_model] (down projection)
w3: ?Tensor(f32), // [d_model x d_ff] (up projection, gated only)
allocator: Allocator,
pub fn init(allocator: Allocator, d_model: usize, d_ff: usize, ffn_type: FFNType) !FeedForward {
var w1 = try Tensor(f32).init(allocator, &[_]usize{ d_model, d_ff });
var w2 = try Tensor(f32).init(allocator, &[_]usize{ d_ff, d_model });
var w3: ?Tensor(f32) = null;
// Gated variants need a third weight matrix
if (ffn_type == .SwiGLU or ffn_type == .GeGLU or ffn_type == .GLU) {
w3 = try Tensor(f32).init(allocator, &[_]usize{ d_model, d_ff });
}
// ... Xavier initialization ...
return FeedForward{ .ffn_type = ffn_type, .d_model = d_model, ... };
}
};
9.3 SwiGLU Forward Pass¶
fn forwardSwiGLU(self: *const FeedForward, input: Tensor(f32)) !Tensor(f32) {
// Gate projection: gate = input @ W1
var gate = try input.matmul(self.w1, self.allocator);
defer gate.deinit();
// Up projection: up = input @ W3
var up = try input.matmul(self.w3.?, self.allocator);
defer up.deinit();
// Apply SiLU to up projection
var up_activated = try activations.silu(f32, up, self.allocator);
defer up_activated.deinit();
// Element-wise gating: hidden = gate * SiLU(up)
var gated = try gate.elementWiseMultiply(up_activated, self.allocator);
defer gated.deinit();
// Down projection: output = hidden @ W2
return try gated.matmul(self.w2, self.allocator);
}
9.4 Standard Forward Pass¶
fn forwardStandard(self: *const FeedForward, input: Tensor(f32)) !Tensor(f32) {
// Up: hidden = input @ W1
var hidden = try input.matmul(self.w1, self.allocator);
defer hidden.deinit();
// Activation: ReLU(hidden)
var activated = try activations.relu(f32, hidden);
defer activated.deinit();
// Down: output = activated @ W2
return try activated.matmul(self.w2, self.allocator);
}
9.5 ExpertFFN Struct¶
pub const ExpertFFN = struct {
num_experts: usize,
experts: []FeedForward,
router: Tensor(f32), // [d_model x num_experts]
top_k: usize,
allocator: Allocator,
pub fn forward(self: *const ExpertFFN, input: Tensor(f32)) !Tensor(f32) {
// 1. Compute routing probabilities
// 2. Select top-k experts per token
// 3. Apply selected experts
// 4. Weighted combination of expert outputs
}
};
Source File
Full implementation: src/transformers/feed_forward.zig (approximately 510 lines including ExpertFFN, tests, and all FFN variants).
10. Exercises¶
- Verify that setting \(d_{ff} = \frac{8}{3}d\) for SwiGLU yields the same parameter count as \(d_{ff} = 4d\) for standard FFN. What is the exact parameter count for \(d = 4096\)?
- Compare the FLOPs per token for standard FFN vs. SwiGLU FFN with matched parameter counts. Are they equal?
- Explain why FFN layers are described as "memories" in recent interpretability research (keys = input patterns, values = output patterns).
- Implement a simple top-2 routing mechanism for
ExpertFFNthat selects the two experts with the highest router scores per token.
References¶
-
Vaswani, A. et al. "Attention Is All You Need." NeurIPS, 2017. ↩
-
Shazeer, N. "GLU Variants Improve Transformer." arXiv:2002.05202, 2020. ↩
-
Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023. ↩
-
Fedus, W., Zoph, B. & Shazeer, N. "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR, 2022. ↩
-
Geva, M. et al. "Transformer Feed-Forward Layers Are Key-Value Memories." EMNLP, 2021. ↩