GPT-NeoX

GPT-NeoX is EleutherAI's 20-billion-parameter autoregressive language model, released in 2022. It carries the architectural patterns of GPT-J (parallel residual connections and Rotary Position Embeddings) to a much larger scale and uses fused QKV projections. It was one of the first fully open 20B-class models, released together with training code and data documentation¹.

1. Architecture Overview

Origins

GPT-NeoX-20B was trained on The Pile using 96 A100-40GB GPUs with the GPT-NeoX library, which builds on Megatron-LM and DeepSpeed. The paper by Black et al. (2022) documented not only the architecture but also the training infrastructure, making it a reference for large-scale distributed training¹.

GPT-NeoX is a decoder-only transformer that shares GPT-J's parallel residual design but adds fused query-key-value projections and applies RoPE to 25% of the head dimension by default.

2. Key Innovations

2.1 Fused QKV Projection

Instead of three separate linear projections for Q, K, and V, GPT-NeoX fuses them into a single matrix multiplication:

Fused QKV

\[ [Q \,|\, K \,|\, V] = x W_{QKV} + b_{QKV} \]

where \( W_{QKV} \in \mathbb{R}^{d \times 3d} \) and \( b_{QKV} \in \mathbb{R}^{3d} \). The output is then split along the last dimension into three equal parts. Replacing three matmuls with one reduces kernel launch overhead on GPUs and reads the input activations once instead of three times.
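The equivalence between one fused projection and three separate ones can be checked numerically. The following is a minimal sketch on toy sizes, not the document's `Tensor`-based implementation: `linear` and `maxFusedVsSeparateDiff` are hypothetical helper names, the weights are arbitrary example values, and the fused matrix is assumed to be the column-wise concatenation \( [W_Q \,|\, W_K \,|\, W_V] \) used in the formula above.

```zig
const std = @import("std");

const s = 2; // sequence length (toy value)
const d = 3; // model dimension (toy value)

/// y = x W + b with plain loops; x is [s][d], w is [d][n], b is [n].
fn linear(comptime n: usize, x: [s][d]f32, w: [d][n]f32, b: [n]f32) [s][n]f32 {
    var y: [s][n]f32 = undefined;
    for (0..s) |i| {
        for (0..n) |j| {
            var acc: f32 = b[j];
            for (0..d) |t| acc += x[i][t] * w[t][j];
            y[i][j] = acc;
        }
    }
    return y;
}

fn absDiff(a: f32, b: f32) f32 {
    return if (a > b) a - b else b - a;
}

/// Largest |fused - separate| over all Q, K, V entries for a small fixed input.
pub fn maxFusedVsSeparateDiff() f32 {
    const x = [s][d]f32{ .{ 1.0, -2.0, 0.5 }, .{ 0.25, 3.0, -1.0 } };
    const wq = [d][d]f32{ .{ 0.1, 0.2, 0.3 }, .{ 0.4, 0.5, 0.6 }, .{ 0.7, 0.8, 0.9 } };
    const wk = [d][d]f32{ .{ -1.0, 0.0, 1.0 }, .{ 2.0, -2.0, 0.5 }, .{ 0.0, 1.5, -0.5 } };
    const wv = [d][d]f32{ .{ 0.3, -0.3, 0.0 }, .{ 1.0, 1.0, 1.0 }, .{ -0.2, 0.4, 0.6 } };
    const bq = [d]f32{ 0.01, 0.02, 0.03 };
    const bk = [d]f32{ -0.1, 0.0, 0.1 };
    const bv = [d]f32{ 1.0, 2.0, 3.0 };

    // Build W_QKV = [Wq | Wk | Wv] and b_QKV = [bq | bk | bv].
    var w_qkv: [d][3 * d]f32 = undefined;
    var b_qkv: [3 * d]f32 = undefined;
    for (0..d) |c| {
        b_qkv[c] = bq[c];
        b_qkv[d + c] = bk[c];
        b_qkv[2 * d + c] = bv[c];
        for (0..d) |r| {
            w_qkv[r][c] = wq[r][c];
            w_qkv[r][d + c] = wk[r][c];
            w_qkv[r][2 * d + c] = wv[r][c];
        }
    }

    const h = linear(3 * d, x, w_qkv, b_qkv); // one fused projection
    const q = linear(d, x, wq, bq); // three separate projections
    const k = linear(d, x, wk, bk);
    const v = linear(d, x, wv, bv);

    var max_diff: f32 = 0.0;
    for (0..s) |i| {
        for (0..d) |j| {
            max_diff = @max(max_diff, absDiff(h[i][j], q[i][j]));
            max_diff = @max(max_diff, absDiff(h[i][d + j], k[i][j]));
            max_diff = @max(max_diff, absDiff(h[i][2 * d + j], v[i][j]));
        }
    }
    return max_diff;
}
```

Each fused output column is the same dot product as the corresponding separate column, so the difference should be zero up to floating-point rounding.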

2.2 Parallel Attention and FFN

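Like GPT-J, GPT-NeoX computes attention and the feed-forward network in parallel from the block input, rather than feeding the attention output into the FFN. GPT-J applies one shared LayerNorm to both paths. GPT-NeoX-20B applies two independent LayerNorms, which Black et al. attribute to an oversight in the training code¹:

\[ x' = x + \text{Attn}(\text{LN}_1(x)) + \text{FFN}(\text{LN}_2(x)) \]

Both paths read the same input \( x \), so each block performs a single residual addition.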

2.3 Configurable RoPE

GPT-NeoX exposes two RoPE settings as configuration parameters:

Parameter	Description
`rotary_pct`	Fraction of head dimensions receiving RoPE (default 0.25)
`rotary_emb_base`	Base frequency \(\theta\) for RoPE (default 10000)

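Adjusting `rotary_pct` trades positional sensitivity against purely content-based attention within each head: only the first `rotary_pct` fraction of each head's dimensions is rotated by position, and the remaining dimensions attend on content alone.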
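To make the two parameters concrete, here is a small sketch of how they could determine the rotary dimension and the per-pair rotation frequencies. The function names are hypothetical, and the frequency formula \( \theta_i = \text{base}^{-2i/d_{\text{rot}}} \) is the standard RoFormer definition applied over the rotary dimensions only, which is an assumption about how the partial variant is parameterized.

```zig
const std = @import("std");

/// Number of leading dimensions per head that receive RoPE.
pub fn rotaryDim(head_dim: u32, rotary_pct: f32) u32 {
    const scaled = @as(f32, @floatFromInt(head_dim)) * rotary_pct;
    return @intFromFloat(scaled);
}

/// theta_i = base^(-2i / rotary_dim) for pair index i in [0, rotary_dim / 2).
pub fn inverseFrequency(pair_index: u32, rotary_dim: u32, base: f32) f32 {
    const i: f32 = @floatFromInt(pair_index);
    const n: f32 = @floatFromInt(rotary_dim);
    return std.math.pow(f32, base, -2.0 * i / n);
}
```

With the defaults in the table, `rotaryDim(96, 0.25)` gives 24, so each head has 12 rotated pairs. Pair 0 rotates fastest (frequency 1), and higher pair indices rotate progressively more slowly as `rotary_emb_base` is raised to a more negative power.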

3. Architecture Diagram

flowchart TD
    INPUT["Token IDs"] --> EMB["Token Embedding"]
    EMB --> BLOCKS

    subgraph BLOCKS["44 x NeoXBlock (Parallel Residual)"]
        LN1["LayerNorm 1"]
        LN2["LayerNorm 2"]
        LN1 --> FUSED["Fused QKV Projection\n(single matmul)"]
        FUSED --> SPLIT["Split -> Q, K, V"]
        SPLIT --> ROPE["Partial RoPE on Q, K\n(25% of head dims)"]
        ROPE --> MHA["Multi-Head Attention"]
        MHA --> PROJ["Output Projection"]
        LN2 --> FFN["Feed-Forward\n(GELU activation)"]
        PROJ --> SUM["Sum"]
        FFN --> SUM
        SUM --> RES["Residual Add with x"]
    end

    BLOCKS --> FINAL_LN["Final LayerNorm"]
    FINAL_LN --> LM_HEAD["LM Head"]

4. Configuration Parameters

Parameter	GPT-NeoX-20B
`n_layers`	44
`d_model`	6144
`n_heads`	64
`d_head`	96
`d_ff`	24576
`vocab_size`	50432
`max_seq_len`	2048
`rotary_pct`	0.25
`rotary_emb_base`	10000
`activation`	GELU
`parallel_residual`	true
`use_bias`	true
`norm_eps`	1e-5
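As a sanity check on the table, the configuration values can be turned into a parameter count. The sketch below is an estimate under stated assumptions rather than an official accounting: it assumes two LayerNorms per block (each with gain and bias), biases on every linear layer inside the blocks, an attention output projection, and separate (untied) input embedding and LM head matrices with no bias. `Shape`, `blockParams`, and `totalParams` are hypothetical names.

```zig
const std = @import("std");

pub const Shape = struct {
    n_layers: u64 = 44,
    d_model: u64 = 6144,
    d_ff: u64 = 24576,
    vocab_size: u64 = 50432,
};

/// Weights and biases in one transformer block.
pub fn blockParams(sh: Shape) u64 {
    const d = sh.d_model;
    const qkv = d * 3 * d + 3 * d; // fused W_QKV and b_QKV
    const out = d * d + d; // attention output projection
    const ffn = d * sh.d_ff + sh.d_ff + sh.d_ff * d + d; // W1, b1, W2, b2
    const norms = 2 * (2 * d); // two LayerNorms, each with gain and bias
    return qkv + out + ffn + norms;
}

/// All blocks plus embeddings, LM head, and the final LayerNorm.
pub fn totalParams(sh: Shape) u64 {
    const embed = sh.vocab_size * sh.d_model; // input embedding
    const lm_head = sh.vocab_size * sh.d_model; // assumed untied, no bias
    const final_norm = 2 * sh.d_model;
    return sh.n_layers * blockParams(sh) + embed + lm_head + final_norm;
}
```

Under these assumptions `totalParams(.{})` evaluates to 20,554,567,680, or roughly 20.55B. The blocks contribute about 19.93B of that and the two vocabulary matrices about 0.62B.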

5. Mathematical Formulation

5.1 Fused QKV Computation

For input \( x \in \mathbb{R}^{s \times d} \):

\[ H = x W_{QKV} + b_{QKV}, \quad W_{QKV} \in \mathbb{R}^{d \times 3d} \]

\[ Q = H_{:, \, 0:d}, \quad K = H_{:, \, d:2d}, \quad V = H_{:, \, 2d:3d} \]

5.2 Partial RoPE

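With `rotary_pct` = 0.25 and head dimension \( d_h = 96 \), the rotary dimension is \( d_{\text{rot}} = 0.25 \times 96 = 24 \). For a query or key vector at position \( m \), only the first 24 dimensions are rotated. The value vectors are left unchanged:

\[ \hat{q}_{[:24]} = R_{\Theta, m} \, q_{[:24]}, \quad \hat{q}_{[24:]} = q_{[24:]} \]

\[ \hat{k}_{[:24]} = R_{\Theta, m} \, k_{[:24]}, \quad \hat{k}_{[24:]} = k_{[24:]} \]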
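A minimal sketch of the partial rotation on a single head vector follows. `ropeRotateHead` is a hypothetical name, not the `applyPartialRoPE` helper used in the implementation section. The pairing convention is also an assumption: entry \( i \) is paired with entry \( i + d_{\text{rot}}/2 \) (the "rotate half" layout), whereas some implementations pair adjacent entries instead.

```zig
const std = @import("std");

/// Rotates the first `rotary_dim` entries of one head vector in place and
/// leaves the rest untouched. Entry i is paired with entry i + rotary_dim / 2.
/// `rotary_dim` must be even and no larger than `head.len`.
pub fn ropeRotateHead(head: []f32, pos: u32, rotary_dim: usize, base: f32) void {
    std.debug.assert(rotary_dim % 2 == 0 and rotary_dim <= head.len);
    const half = rotary_dim / 2;
    const m: f32 = @floatFromInt(pos);
    const n: f32 = @floatFromInt(rotary_dim);
    for (0..half) |i| {
        // Angle for this pair: m * base^(-2i / rotary_dim).
        const idx: f32 = @floatFromInt(i);
        const angle = m * std.math.pow(f32, base, -2.0 * idx / n);
        const c = @cos(angle);
        const sn = @sin(angle);
        const a = head[i];
        const b = head[i + half];
        // 2D rotation of the pair (a, b) by `angle`.
        head[i] = a * c - b * sn;
        head[i + half] = b * c + a * sn;
    }
}
```

The same function would be applied to every query and key head. Because each pair undergoes a plain 2D rotation, the norm of the rotated slice is preserved, and at position 0 the vector is returned unchanged.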

5.3 Full Block Computation

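GPT-NeoX-20B normalizes the two paths with separate LayerNorms¹:

\[ h_a = \text{LayerNorm}_1(x^{(l)}), \quad h_f = \text{LayerNorm}_2(x^{(l)}) \]

\[ x^{(l+1)} = x^{(l)} + \text{Attn}(h_a) + \text{FFN}(h_f) \]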

where:

\[ \text{FFN}(h) = \text{GELU}(hW_1 + b_1)W_2 + b_2 \]

FLOPs per Block

The fused QKV projection replaces three kernel launches with one but performs identical arithmetic. Counting each multiply-add as 2 FLOPs, for sequence length \( s \) and model dimension \( d \):

QKV projection: \( 6sd^2 \) FLOPs (same as separate, just one matmul)
Attention: \( 2s^2 d \) for \( QK^\top \) plus \( 2s^2 d \) for the weighted sum over \( V \), so \( 4s^2 d \) FLOPs
Attention output projection: \( 2sd^2 \) FLOPs
FFN: \( 16sd^2 \) FLOPs (with \( d_{\text{ff}} = 4d \))
Total per block: \( \approx 24sd^2 + 4s^2 d \)
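The per-block count can be evaluated for concrete sizes. This sketch counts a multiply-add as 2 FLOPs and includes the attention output projection (\( 2sd^2 \)) alongside the QKV, attention, and FFN terms. It ignores softmax, LayerNorm, GELU, and bias adds, which are lower-order. `BlockFlops` and `blockFlops` are hypothetical names.

```zig
const std = @import("std");

pub const BlockFlops = struct {
    qkv: u64,
    attention: u64,
    out_proj: u64,
    ffn: u64,

    pub fn total(self: BlockFlops) u64 {
        return self.qkv + self.attention + self.out_proj + self.ffn;
    }
};

/// Forward-pass matmul FLOPs of one block for sequence length `s` and
/// model dimension `d`, assuming d_ff = 4d.
pub fn blockFlops(s: u64, d: u64) BlockFlops {
    return .{
        .qkv = 6 * s * d * d, // [s, d] @ [d, 3d]
        .attention = 4 * s * s * d, // Q K^T plus weights @ V
        .out_proj = 2 * s * d * d, // [s, d] @ [d, d]
        .ffn = 16 * s * d * d, // [s, d] @ [d, 4d] and [s, 4d] @ [4d, d]
    };
}
```

For `blockFlops(2048, 6144)` the total is 1,958,505,086,976, about 1.96 TFLOPs per block per forward pass. The quadratic attention term accounts for only about 5% of that at this sequence length.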

6. Zig Implementation

6.1 GPTNeoXConfig

pub const GPTNeoXConfig = struct {
    n_layers: u32 = 44,
    d_model: u32 = 6144,
    n_heads: u32 = 64,
    d_ff: u32 = 24576,
    vocab_size: u32 = 50432,
    max_seq_len: u32 = 2048,
    rotary_pct: f32 = 0.25,
    rotary_emb_base: f32 = 10000.0,
    parallel_residual: bool = true,
    norm_eps: f32 = 1e-5,
    activation: ActivationType = .gelu,

    pub fn headDim(self: GPTNeoXConfig) u32 {
        return self.d_model / self.n_heads;
    }

    pub fn rotaryDim(self: GPTNeoXConfig) u32 {
        return @intFromFloat(@as(f32, @floatFromInt(self.headDim()))
            * self.rotary_pct);
    }
};

6.2 Fused QKV Projection

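const std = @import("std");

pub const FusedQKVProjection = struct {
    w_qkv: Tensor(f32),    // [d_model, 3 * d_model]
    b_qkv: Tensor(f32),    // [3 * d_model]
    allocator: std.mem.Allocator, // backs the intermediate [seq, 3d] buffer

    pub fn forward(self: *FusedQKVProjection, x: Tensor(f32)) !struct {
        q: Tensor(f32), k: Tensor(f32), v: Tensor(f32)
    } {
        // Single matmul: [seq, d] @ [d, 3d] -> [seq, 3d]
        var fused = try x.matmul(self.w_qkv, self.allocator);
        // Freeing here is safe only if slice() returns owned copies. If it
        // returns views into `fused`, ownership must pass to the caller.
        defer fused.deinit();

        // Add b_qkv to every row. `addRowBias` is a hypothetical helper:
        // the Tensor API for broadcasting adds is not shown in this document.
        try fused.addRowBias(self.b_qkv);

        // Split the last axis into three [seq, d] parts.
        const d = self.w_qkv.shape[0];
        return .{
            .q = try fused.slice(1, 0, d),
            .k = try fused.slice(1, d, 2 * d),
            .v = try fused.slice(1, 2 * d, 3 * d),
        };
    }
};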

6.3 NeoX Block (Parallel Residual)

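pub const NeoXBlock = struct {
    config: GPTNeoXConfig,
    allocator: std.mem.Allocator,
    ln_attn: LayerNorm,  // GPT-NeoX-20B uses separate (untied) norms
    ln_ffn: LayerNorm,   // for the attention and FFN paths
    qkv: FusedQKVProjection,
    attention: MultiHeadAttention,
    out_proj: Linear,
    ffn: FeedForward,

    // Sketch: assumes the sub-module forward() methods accept and return the
    // same activation type; freeing of intermediate buffers is elided.
    pub fn forward(self: *NeoXBlock, x: []f32, pos: u32) ![]f32 {
        // Parallel path 1: attention, followed by its output projection
        const h_attn = self.ln_attn.forward(x);
        var qkv = try self.qkv.forward(h_attn); // var: RoPE mutates q and k
        applyPartialRoPE(&qkv.q, &qkv.k, pos, self.config.rotaryDim());
        const ctx = try self.attention.forward(qkv.q, qkv.k, qkv.v);
        const attn_out = try self.out_proj.forward(ctx);

        // Parallel path 2: FFN, computed from x rather than from attn_out
        const h_ffn = self.ln_ffn.forward(x);
        const ffn_out = try self.ffn.forward(h_ffn);

        // Single residual addition
        const output = try self.allocator.alloc(f32, x.len);
        for (0..x.len) |i| {
            output[i] = x[i] + attn_out[i] + ffn_out[i];
        }
        return output;
    }
};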

7. Variants

Variant	Parameters	Notes
GPT-NeoX-20B	20B	Original release, trained on The Pile
Pythia suite	70M--12B	Successor family using NeoX architecture with controlled training
Dolly 2.0	12B	Databricks fine-tune of Pythia-12B for instruction following

Pythia Connection

The Pythia model suite (Biderman et al., 2023)³ reuses the GPT-NeoX architecture across eight model sizes, each trained on both the Pile and a deduplicated version of it. Checkpoints are released at regular intervals throughout training, which makes the suite useful for studying training dynamics.

8. Educational Value

What GPT-NeoX Teaches

Fused projections: The fused QKV pattern demonstrates how mathematically equivalent operations can be restructured for hardware efficiency. Students can verify that splitting a single \( d \times 3d \) matmul yields identical results to three separate \( d \times d \) matmuls.
Scaling parallel residuals: GPT-NeoX validated that the parallel residual design (pioneered in GPT-J at 6B) remains effective at 20B scale, suggesting it is a robust architectural choice.
RoPE configuration space: The rotary_pct and rotary_emb_base parameters expose the design space around positional encoding, allowing students to experiment with how much positional information each head receives.
Large-scale training infrastructure: The GPT-NeoX paper is one of the most transparent accounts of distributed training, covering pipeline parallelism, data parallelism, and mixed-precision training.

9. References

1. Black, S. et al. "GPT-NeoX-20B: An Open-Source Autoregressive Language Model." arXiv:2204.06745, 2022.
2. Su, J. et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, 2021.
3. Biderman, S. et al. "Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling." ICML, 2023.