GPT-NeoX
GPT-NeoX is EleutherAI's 20-billion-parameter autoregressive language model, released in 2022. It extended the architectural patterns established by GPT-J (parallel residual connections and Rotary Position Embeddings) to a significantly larger scale, used fused QKV projections, and gave the community one of the first fully open 20B-class models, complete with training code and data documentation [1].
1. Architecture Overview
Origins
GPT-NeoX-20B was trained on The Pile using 96 A100-40GB GPUs with the GPT-NeoX library (built on Megatron-LM and DeepSpeed). The paper by Black et al. (2022) documented not only the architecture but also the training infrastructure, making it a reference for large-scale distributed training [1].
GPT-NeoX is a decoder-only transformer that shares GPT-J's parallel residual design but adds fused query-key-value projections and applies RoPE to 25% of the head dimension by default.
2. Key Innovations
2.1 Fused QKV Projection
Instead of three separate linear projections for Q, K, and V, GPT-NeoX fuses them into a single matrix multiplication:
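Fused QKV
\[ [\, Q \;\; K \;\; V \,] = x\, W_{QKV} + b_{QKV} \]
where \( W_{QKV} \in \mathbb{R}^{d \times 3d} \) and \( b_{QKV} \in \mathbb{R}^{3d} \). The output is then split along the last dimension into three equal parts. One large matmul replaces three smaller ones, which cuts the kernel launches for this step from three to one and reads the input activations once instead of three times.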
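The equivalence between the fused and separate projections can be checked numerically. The following is a minimal sketch on toy sizes (2 tokens, model width 3); the sizes, the weight values, and the helper names `matmul`, `fuse`, and `sliceMatches` are illustrative assumptions, and the bias is left out for brevity.

```zig
const std = @import("std");

// Toy sizes (hypothetical, chosen for readability): 2 tokens, model width 3.
const s = 2;
const d = 3;

/// Naive row-major matmul: out[s x n] = x[s x d] * w[d x n].
fn matmul(comptime n: usize, x: [s][d]f32, w: [d][n]f32) [s][n]f32 {
    var out: [s][n]f32 = undefined;
    for (0..s) |i| {
        for (0..n) |j| {
            var acc: f32 = 0;
            for (0..d) |k| acc += x[i][k] * w[k][j];
            out[i][j] = acc;
        }
    }
    return out;
}

/// Concatenates W_Q, W_K, W_V column-wise into a single [d x 3d] matrix.
fn fuse(wq: [d][d]f32, wk: [d][d]f32, wv: [d][d]f32) [d][3 * d]f32 {
    var w: [d][3 * d]f32 = undefined;
    for (0..d) |r| {
        for (0..d) |c| {
            w[r][c] = wq[r][c];
            w[r][d + c] = wk[r][c];
            w[r][2 * d + c] = wv[r][c];
        }
    }
    return w;
}

/// True when columns [offset, offset + d) of the fused output equal `separate`.
fn sliceMatches(fused: [s][3 * d]f32, offset: usize, separate: [s][d]f32) bool {
    for (0..s) |i| {
        for (0..d) |j| {
            if (fused[i][offset + j] != separate[i][j]) return false;
        }
    }
    return true;
}

pub fn main() void {
    const x = [s][d]f32{ .{ 1, 2, 3 }, .{ 4, 5, 6 } };
    const wq = [d][d]f32{ .{ 1, 0, 0 }, .{ 0, 1, 0 }, .{ 0, 0, 1 } };
    const wk = [d][d]f32{ .{ 2, 0, 1 }, .{ 0, 2, 0 }, .{ 1, 0, 2 } };
    const wv = [d][d]f32{ .{ 0, 1, 0 }, .{ 1, 0, 1 }, .{ 0, 1, 0 } };

    // One matmul against the fused weight, then compare each third.
    const fused = matmul(3 * d, x, fuse(wq, wk, wv));
    const ok = sliceMatches(fused, 0, matmul(d, x, wq)) and
        sliceMatches(fused, d, matmul(d, x, wk)) and
        sliceMatches(fused, 2 * d, matmul(d, x, wv));
    std.debug.print("fused == separate: {}\n", .{ok});
}
```

Each output column accumulates the same products in the same order in both paths, so the comparison can use exact equality here. A real kernel may reorder the accumulation, in which case a small tolerance would be the safer check.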
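2.2 Parallel Attention and FFN
Like GPT-J, GPT-NeoX computes attention and the feed-forward network in parallel from the same LayerNorm output, rather than one after the other:
\[ y = x + \text{Attn}(\text{LN}(x)) + \text{FFN}(\text{LN}(x)) \]
This single-norm, dual-path design yields a single residual addition per block. Note that the released GPT-NeoX-20B model applies two independent LayerNorms, one per branch, which Black et al. [1] attribute to an oversight in the training code; this page describes the tied single-norm form.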
2.3 Advanced RoPE Scaling
GPT-NeoX exposes two RoPE settings as configuration parameters, both of which later became common configuration options:
| Parameter | Description |
|---|---|
| `rotary_pct` | Fraction of head dimensions receiving RoPE (default 0.25) |
| `rotary_emb_base` | Base frequency \(\theta\) for RoPE (default 10000) |
The ability to adjust rotary_pct allows trading off positional sensitivity against pure content-based attention within each head.
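The effect of `rotary_pct` on a single head vector can be sketched as follows. This is a minimal illustration rather than the library's implementation: the function name `applyPartialRope` is hypothetical, and pairing entry i with entry i + rot_dim / 2 (the "rotate-half" pairing) is an assumption of the sketch.

```zig
const std = @import("std");

/// Rotates the first `rot_dim` entries of one head vector in place and leaves
/// the remaining entries untouched. Entry i is paired with entry i + rot_dim / 2
/// (the "rotate-half" pairing, an assumption of this sketch).
fn applyPartialRope(v: []f32, pos: u32, rot_dim: usize, base: f32) void {
    const half = rot_dim / 2;
    const m: f32 = @floatFromInt(pos);
    const rot_f: f32 = @floatFromInt(rot_dim);
    for (0..half) |i| {
        // theta_i = base^(-2i / rot_dim)
        const i_f: f32 = @floatFromInt(i);
        const angle = m * std.math.pow(f32, base, -2.0 * i_f / rot_f);
        const cos_a = @cos(angle);
        const sin_a = @sin(angle);
        const a = v[i];
        const b = v[i + half];
        // Standard 2D rotation of the pair (a, b) by `angle`.
        v[i] = a * cos_a - b * sin_a;
        v[i + half] = a * sin_a + b * cos_a;
    }
}

pub fn main() void {
    const head_dim = 96;
    const rot_dim = 24; // rotary_pct * head_dim = 0.25 * 96
    var q = [_]f32{1.0} ** head_dim;

    applyPartialRope(&q, 5, rot_dim, 10000.0);

    // Count how many entries the rotation actually touched.
    var changed: usize = 0;
    for (q) |value| {
        if (value != 1.0) changed += 1;
    }
    std.debug.print("entries changed by RoPE: {d} of {d}\n", .{ changed, head_dim });
}
```

With these inputs only the first 24 entries are rotated; the other 72 keep their original values and therefore contribute position-independent, content-only terms to the attention score.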
3. Architecture Diagram
flowchart TD
    INPUT["Token IDs"] --> EMB["Token Embedding"]
    EMB --> BLOCKS
    subgraph BLOCKS["44 x NeoXBlock (Parallel Residual)"]
        LN["LayerNorm"]
        LN --> FUSED["Fused QKV Projection\n(single matmul)"]
        FUSED --> SPLIT["Split -> Q, K, V"]
        SPLIT --> ROPE["Partial RoPE\n(25% of head dims)"]
        ROPE --> MHA["Multi-Head Attention"]
        LN --> FFN["Feed-Forward\n(GELU activation)"]
        MHA --> SUM["Sum"]
        FFN --> SUM
        SUM --> RES["Residual Add with x"]
    end
    BLOCKS --> FINAL_LN["Final LayerNorm"]
    FINAL_LN --> LM_HEAD["LM Head"]
4. Configuration Parameters
| Parameter | GPT-NeoX-20B |
|---|---|
| `n_layers` | 44 |
| `d_model` | 6144 |
| `n_heads` | 64 |
| `d_head` | 96 |
| `d_ff` | 24576 |
| `vocab_size` | 50432 |
| `max_seq_len` | 2048 |
| `rotary_pct` | 0.25 |
| `rotary_emb_base` | 10000 |
| `activation` | GELU |
| `parallel_residual` | true |
| `use_bias` | true |
| `norm_eps` | 1e-5 |
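As a rough cross-check of the table, the weight-matrix parameter count can be worked out from `d_model`, `d_ff`, `n_layers`, and `vocab_size`. The estimate below is a sketch under two assumptions that this page does not state: the input embedding and the LM head are separate (untied) matrices, and biases and LayerNorm parameters are ignored as negligible.

\[
\begin{aligned}
\text{per block} &= \underbrace{3d^2}_{\text{fused QKV}} + \underbrace{d^2}_{\text{attention output}} + \underbrace{2\,d\,d_{\text{ff}}}_{\text{FFN}} = 12d^2 = 12 \times 6144^2 = 452{,}984{,}832 \\
\text{all blocks} &= 44 \times 452{,}984{,}832 = 19{,}931{,}332{,}608 \\
\text{embeddings} &= 2 \times 50432 \times 6144 = 619{,}708{,}416 \\
\text{total} &\approx 20.55 \times 10^{9}
\end{aligned}
\]

Under these assumptions the estimate lands close to the nominal 20B figure, with about 97% of the weights sitting in the transformer blocks.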
5. Mathematical Formulation
5.1 Fused QKV Computation
For input \( x \in \mathbb{R}^{s \times d} \):
\[ [\, Q \;\; K \;\; V \,] = x\, W_{QKV} + b_{QKV}, \qquad Q, K, V \in \mathbb{R}^{s \times d} \]
5.2 Partial RoPE
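With `rotary_pct` = 0.25 and head dimension \( d_h = 96 \), the rotary dimension is \( d_{\text{rot}} = 0.25 \times 96 = 24 \). For position \( m \), each query head vector (and likewise each key head vector) is split into its first \( d_{\text{rot}} \) components and the remainder, and only the first part is rotated:
\[ q_m = [\, q_m^{\text{rot}} \;;\; q_m^{\text{pass}} \,], \qquad \tilde{q}_m = [\, R_{\Theta, m}\, q_m^{\text{rot}} \;;\; q_m^{\text{pass}} \,] \]
Here \( R_{\Theta, m} \) is the RoPE rotation of Su et al. [2] with frequencies \( \theta_i = \text{base}^{-2i / d_{\text{rot}}} \) for \( i = 0, \dots, d_{\text{rot}}/2 - 1 \), where base is `rotary_emb_base`.
5.3 Full Block Computation
\[ h = \text{LN}(x), \qquad y = x + \text{Attn}(h) + \text{FFN}(h) \]
where:
\[ \text{Attn}(h) = \text{Concat}(\text{head}_1, \dots, \text{head}_{n_{\text{heads}}})\, W_O + b_O, \qquad \text{head}_j = \text{softmax}\!\left( \frac{\tilde{Q}_j \tilde{K}_j^{\top}}{\sqrt{d_h}} + M \right) V_j \]
\[ \text{FFN}(h) = \text{GELU}(h W_1 + b_1)\, W_2 + b_2 \]
with \( M \) the causal mask and \( \tilde{Q}_j, \tilde{K}_j \) the per-head queries and keys after partial RoPE.
FLOPs per Block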
The fused QKV projection replaces three kernel launches with one but performs identical arithmetic. Counting each multiply-add as 2 FLOPs, for sequence length \( s \) and model dimension \( d \):
- QKV projection: \( 6sd^2 \) FLOPs (same as separate, just one matmul)
- Attention: \( 2s^2 d + 2s^2 d \) FLOPs (the scores \( QK^{\top} \) plus the weighted sum over \( V \))
- Attention output projection: \( 2sd^2 \) FLOPs
- FFN: \( 16sd^2 \) FLOPs (with \( d_{\text{ff}} = 4d \))
- Total per block: \( \approx 24sd^2 + 4s^2 d \)
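To get a feel for the relative size of these terms, the ratio of the attention-score cost to the FFN cost can be evaluated at the configured sizes. This worked example uses only the attention and FFN terms listed above, with \( s = 2048 \) and \( d = 6144 \) taken from the configuration table:

\[ \frac{4 s^2 d}{16 s d^2} = \frac{s}{4d} = \frac{2048}{4 \times 6144} = \frac{1}{12} \]

In absolute terms this is roughly \( 1.24 \times 10^{12} \) FLOPs for the FFN against \( 1.03 \times 10^{11} \) for the attention scores, per block and per forward pass over a full sequence. At this context length the weight matmuls dominate; the quadratic attention term would only match the FFN term at \( s = 4d = 24576 \).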
6. Zig Implementation
6.1 GPTNeoXConfig
// Only GELU is used by GPT-NeoX-20B; the enum leaves room for other activations.
pub const ActivationType = enum { gelu };

pub const GPTNeoXConfig = struct {
    n_layers: u32 = 44,
    d_model: u32 = 6144,
    n_heads: u32 = 64,
    d_ff: u32 = 24576,
    vocab_size: u32 = 50432,
    max_seq_len: u32 = 2048,
    rotary_pct: f32 = 0.25,
    rotary_emb_base: f32 = 10000.0,
    parallel_residual: bool = true,
    use_bias: bool = true,
    norm_eps: f32 = 1e-5,
    activation: ActivationType = .gelu,

    // Per-head width: 6144 / 64 = 96 for the defaults above.
    pub fn headDim(self: GPTNeoXConfig) u32 {
        return self.d_model / self.n_heads;
    }

    // Number of leading head dimensions that receive RoPE: 96 * 0.25 = 24.
    pub fn rotaryDim(self: GPTNeoXConfig) u32 {
        return @intFromFloat(@as(f32, @floatFromInt(self.headDim())) * self.rotary_pct);
    }
};
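6.2 Fused QKV Projection
const std = @import("std");

// Tensor(f32) is this page's tensor type; matmul, addBias, and slice are
// assumed to be helpers on it.
pub const QKV = struct { q: Tensor(f32), k: Tensor(f32), v: Tensor(f32) };

pub const FusedQKVProjection = struct {
    w_qkv: Tensor(f32), // [d_model, 3 * d_model]
    b_qkv: Tensor(f32), // [3 * d_model]

    pub fn forward(self: *const FusedQKVProjection, allocator: std.mem.Allocator, x: Tensor(f32)) !QKV {
        // Single matmul: [seq, d] @ [d, 3d] -> [seq, 3d]
        var fused = try x.matmul(self.w_qkv, allocator);
        // slice() is assumed to copy, so the fused buffer can be freed on return.
        defer fused.deinit();
        // Add the bias, broadcast over the sequence dimension.
        try fused.addBias(self.b_qkv);

        const d = self.w_qkv.shape[0];
        return .{
            .q = try fused.slice(1, 0, d),
            .k = try fused.slice(1, d, 2 * d),
            .v = try fused.slice(1, 2 * d, 3 * d),
        };
    }
};
6.3 NeoX Block (Parallel Residual)
pub const NeoXBlock = struct {
    config: GPTNeoXConfig,
    ln: LayerNorm,
    qkv: FusedQKVProjection,
    attention: MultiHeadAttention,
    out_proj: Linear,
    ffn: FeedForward,

    // LayerNorm, MultiHeadAttention, Linear, and FeedForward are assumed to
    // expose forward() on Tensor(f32); Tensor is assumed to expose a flat `data` slice.
    pub fn forward(self: *NeoXBlock, allocator: std.mem.Allocator, x: Tensor(f32), pos: u32) ![]f32 {
        const h = try self.ln.forward(x);

        // Parallel path 1: attention, followed by the output projection
        var qkv = try self.qkv.forward(allocator, h);
        applyPartialRoPE(&qkv.q, &qkv.k, pos, self.config.rotaryDim());
        const heads_out = try self.attention.forward(qkv.q, qkv.k, qkv.v);
        const attn_out = try self.out_proj.forward(heads_out);

        // Parallel path 2: FFN (from the same normalized input)
        const ffn_out = try self.ffn.forward(h);

        // Single residual addition
        const output = try allocator.alloc(f32, x.data.len);
        for (output, x.data, attn_out.data, ffn_out.data) |*o, xi, a, f| {
            o.* = xi + a + f;
        }
        return output;
    }
};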
7. Variants
| Variant | Parameters | Notes |
|---|---|---|
| GPT-NeoX-20B | 20B | Original release, trained on The Pile |
| Pythia suite | 70M to 12B | Successor family using NeoX architecture with controlled training |
| Dolly 2.0 | 12B | Databricks fine-tune of Pythia-12B for instruction following |
Pythia Connection
The Pythia model suite (Biderman et al., 2023) [3] reuses the GPT-NeoX architecture across eight model sizes, each trained on The Pile and again on a deduplicated copy of it. Pythia checkpoints are released at regular training intervals, which makes them well suited to studying training dynamics.
8. Educational Value
What GPT-NeoX Teaches
- Fused projections: The fused QKV pattern demonstrates how mathematically equivalent operations can be restructured for hardware efficiency. Students can verify that splitting a single \( d \times 3d \) matmul yields identical results to three separate \( d \times d \) matmuls.
- Scaling parallel residuals: GPT-NeoX validated that the parallel residual design (pioneered in GPT-J at 6B) remains effective at 20B scale, suggesting it is a robust architectural choice.
- RoPE configuration space: The `rotary_pct` and `rotary_emb_base` parameters expose the design space around positional encoding, allowing students to experiment with how much positional information each head receives.
- Large-scale training infrastructure: The GPT-NeoX paper is one of the most transparent accounts of distributed training, covering pipeline parallelism, data parallelism, and mixed-precision training.
9. References
1. Black, S. et al. "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." arXiv:2204.06745, 2022.
2. Su, J. et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, 2021.
3. Biderman, S. et al. "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling." ICML, 2023.