GPT-2¶
Overview¶
GPT-2 (Generative Pre-trained Transformer 2), released by OpenAI in February 2019 [1], is the classic decoder-only transformer language model. Although later architectures have superseded it in capability, GPT-2 remains an essential reference point for understanding transformer design. Its straightforward architecture -- learned position embeddings, GELU activation, pre-LayerNorm -- makes it a natural starting point for learning how transformer language models work.
ZigLLM implements GPT-2 in src/models/gpt2.zig as both a functional model and a pedagogical reference.
Historical Significance
GPT-2 was famously released in stages due to concerns about misuse, with OpenAI initially publishing only the 124M parameter version. It demonstrated that scaling language models produces emergent capabilities across diverse tasks without task-specific training, a finding that motivated the scaling laws research that led to GPT-3, PaLM, and LLaMA.
Key Features¶
| Feature | GPT-2 Choice | Modern Alternative |
|---|---|---|
| Position encoding | Learned embeddings | RoPE (LLaMA) |
| Activation | GELU | SwiGLU (LLaMA) |
| Normalization | Pre-LayerNorm | Pre-RMSNorm (LLaMA) |
| Attention | Standard MHA | GQA (Mistral) |
| Bias terms | Yes (all linear layers) | No (LLaMA) |
| FFN structure | Standard 2-matrix | Gated 3-matrix (LLaMA) |
Learned Position Embeddings¶
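GPT-2 uses a trainable embedding table for position information, with one embedding vector per position up to max_seq_len. The input to the first block is the sum of the token and position embeddings:
\[ h_t = W_e[x_t] + W_p[t], \qquad t = 0, \dots, n - 1 \]
where \( x_t \) is the token ID at position \( t \), \( n \le S \) is the sequence length, \( W_e \in \mathbb{R}^{V \times d} \) is the token embedding matrix and \( W_p \in \mathbb{R}^{S \times d} \) is the position embedding matrix.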
Learned vs Rotary Positions
Learned position embeddings store a separate vector for each absolute position, consuming \( S \times d \) parameters. They cannot generalize beyond the trained sequence length. RoPE, used by LLaMA, encodes relative positions through rotation and requires no additional parameters beyond the frequency schedule.
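The lookup-and-add described above can be sketched on flat slices. This is a minimal illustration, assuming row-major storage and Zig 0.11 or later; the function name `embed`, its slice-based signature, and the error name `SequenceTooLong` are hypothetical and are not ZigLLM's API.

```zig
const std = @import("std");

/// Adds token and position embeddings for one sequence.
/// `wte` is [vocab_size, d_model] and `wpe` is [max_seq_len, d_model],
/// both row-major. `out` must hold input_ids.len * d_model values.
fn embed(
    wte: []const f32,
    wpe: []const f32,
    d_model: usize,
    max_seq_len: usize,
    input_ids: []const u32,
    out: []f32,
) !void {
    // A learned table has no row for positions at or past max_seq_len.
    if (input_ids.len > max_seq_len) return error.SequenceTooLong;
    for (input_ids, 0..) |token_id, pos| {
        const id: usize = token_id;
        const tok_row = wte[id * d_model ..][0..d_model];
        const pos_row = wpe[pos * d_model ..][0..d_model];
        const out_row = out[pos * d_model ..][0..d_model];
        // Element-wise sum of the two rows.
        for (out_row, tok_row, pos_row) |*o, t, p| o.* = t + p;
    }
}
```

The length check makes the limitation of learned positions concrete: a sequence longer than the table simply has no embedding to look up.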
GELU Activation¶
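The Gaussian Error Linear Unit is a smooth alternative to ReLU that weights each input by the standard normal CDF \( \Phi \). The original GPT-2 release uses the tanh approximation:
\[ \text{GELU}(x) = x\,\Phi(x) \approx 0.5\,x\left(1 + \tanh\left[\sqrt{2/\pi}\,\left(x + 0.044715\,x^3\right)\right]\right) \]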
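As a scalar sketch, the tanh approximation of GELU, \( 0.5\,x\,(1 + \tanh[\sqrt{2/\pi}\,(x + 0.044715\,x^3)]) \), can be written as follows. The function name `geluTanh` is illustrative; ZigLLM's own `neural_primitives.gelu` operates on tensors and may be implemented differently.

```zig
const std = @import("std");

/// Tanh approximation of GELU for a single value.
fn geluTanh(x: f32) f32 {
    const k: f32 = 0.7978845608; // sqrt(2 / pi)
    const inner = k * (x + 0.044715 * x * x * x);
    return 0.5 * x * (1.0 + std.math.tanh(inner));
}
```

For large positive inputs the result approaches `x`, and for large negative inputs it approaches zero, which is the ReLU-like behavior with a smooth transition around the origin.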
Causal Attention¶
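GPT-2 uses standard multi-head self-attention with a causal mask to prevent attending to future positions. For each head:
\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^\top}{\sqrt{d_\text{head}}} + M\right) V, \qquad M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases} \]
Here \( Q \), \( K \), and \( V \) are the query, key, and value projections of the block input (in this formula \( V \) denotes the value matrix, not the vocabulary size).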
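A causal mask is commonly built as an additive matrix that is zero where attention is allowed and negative infinity where it is not, so that masked scores vanish after softmax. The sketch below assumes that convention, row-major storage, and Zig 0.11 or later; the function name `causalMask` is illustrative, and ZigLLM's `createCausalMask` may use a different representation.

```zig
const std = @import("std");

/// Builds a row-major [seq_len, seq_len] additive mask: 0 where query i
/// may attend to key j (j <= i), -inf where j lies in the future.
/// The caller owns the returned slice and must free it.
fn causalMask(allocator: std.mem.Allocator, seq_len: usize) ![]f32 {
    const mask = try allocator.alloc(f32, seq_len * seq_len);
    for (0..seq_len) |i| {
        for (0..seq_len) |j| {
            mask[i * seq_len + j] = if (j <= i) 0.0 else -std.math.inf(f32);
        }
    }
    return mask;
}
```

Adding this mask to the scaled scores before softmax leaves the lower triangle untouched and drives every future position's weight to zero.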
Configuration¶
GPT2Config Struct¶
pub const GPT2Config = struct {
    d_model: usize,     // Model dimension
    n_heads: usize,     // Number of attention heads
    n_layers: usize,    // Number of transformer layers
    vocab_size: usize,  // Vocabulary size (50257 for GPT-2)
    max_seq_len: usize, // Maximum sequence length (1024)
    dropout: f32,       // Dropout rate (training only)
};
Variant Configurations¶
| Variant | d_model | n_heads | n_layers | d_ff | Parameters |
|---|---|---|---|---|---|
| GPT-2 Small | 768 | 12 | 12 | 3072 | 124M |
| GPT-2 Medium | 1024 | 16 | 24 | 4096 | 355M |
| GPT-2 Large | 1280 | 20 | 36 | 5120 | 774M |
| GPT-2 XL | 1600 | 25 | 48 | 6400 | 1.5B |
Scaling Pattern
GPT-2 uses a fixed FFN expansion ratio of \( d_\text{ff} = 4 \times d_\text{model} \) with a standard 2-matrix FFN (no gating). The head dimension is the same in every variant: 768/12 = 1024/16 = 1280/20 = 1600/25 = 64, so \( d_\text{head} = 64 \) throughout.
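The two invariants in the variant table can be checked mechanically. The sketch below copies the table into an illustrative `Variant` struct (hypothetical, not part of ZigLLM) and derives the head dimension and the FFN ratio from it.

```zig
const std = @import("std");

// Illustrative struct: one row of the variant table.
const Variant = struct {
    name: []const u8,
    d_model: usize,
    n_heads: usize,
    d_ff: usize,
};

const variants = [_]Variant{
    .{ .name = "GPT-2 Small", .d_model = 768, .n_heads = 12, .d_ff = 3072 },
    .{ .name = "GPT-2 Medium", .d_model = 1024, .n_heads = 16, .d_ff = 4096 },
    .{ .name = "GPT-2 Large", .d_model = 1280, .n_heads = 20, .d_ff = 5120 },
    .{ .name = "GPT-2 XL", .d_model = 1600, .n_heads = 25, .d_ff = 6400 },
};

/// Per-head dimension; d_model must divide evenly across the heads.
fn headDim(v: Variant) usize {
    std.debug.assert(v.d_model % v.n_heads == 0);
    return v.d_model / v.n_heads;
}

/// True when the FFN width follows the 4x expansion rule.
fn usesFourXExpansion(v: Variant) bool {
    return v.d_ff == 4 * v.d_model;
}
```

Scaling GPT-2 therefore means adding heads and layers while keeping each head at 64 dimensions, rather than widening the heads themselves.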
Architecture Diagram¶
flowchart TD
IN["Input Token IDs"] --> TE["Token Embedding\n(50257 x d_model)"]
IN --> PE["Position Embedding\n(1024 x d_model)"]
TE --> ADD["Element-wise Add"]
PE --> ADD
ADD --> LOOP
subgraph LOOP ["Transformer Block x N"]
direction TB
LN1["LayerNorm"] --> ATT["Multi-Head Attention\n(causal mask)"]
ATT --> RES1["Residual Add"]
RES1 --> LN2["LayerNorm"]
LN2 --> FC["Linear (d -> 4d)"]
FC --> GELU["GELU"]
GELU --> PROJ["Linear (4d -> d)"]
PROJ --> RES2["Residual Add"]
end
LOOP --> LNF["Final LayerNorm"]
LNF --> LMH["LM Head\n(d_model x vocab_size)"]
LMH --> OUT["Logits\n[seq_len, 50257]"]
style IN fill:#f0f0f0,color:#333
style OUT fill:#4a9eff,color:#fff
style LOOP fill:#e8f5e9,color:#333
Forward Pass¶
Token and Position Embedding¶
GPT-2's embedding combines token and position information through addition:
pub fn forward(self: *Self, input_ids: []const u32) !Tensor(f32) {
    const seq_len = input_ids.len;
    // The position table has only max_seq_len rows (error name is illustrative)
    if (seq_len > self.config.max_seq_len) return error.SequenceTooLong;
    // Token embeddings: look up each token ID
    const token_embeds = try self.getTokenEmbeddings(input_ids);
    // Position embeddings: look up positions [0, 1, ..., seq_len-1]
    const pos_embeds = try self.getPositionEmbeddings(seq_len);
    // Combined embedding = token + position
    var hidden_states = try self.addEmbeddings(token_embeds, pos_embeds);
    // Pass through transformer blocks
    for (self.blocks) |*block| {
        const new_hidden = try block.forward(hidden_states);
        hidden_states = new_hidden;
    }
    // Final LayerNorm + LM head
    const normed = try self.ln_f.forward(hidden_states);
    return try normed.matmul(self.lm_head, self.allocator);
}
Transformer Block¶
Each GPT2Block implements the pre-norm residual pattern with GELU MLP:
pub fn forward(self: *Self, input: Tensor(f32)) !Tensor(f32) {
    // Pre-attention LayerNorm
    const normed1 = try self.ln_1.forward(input);
    // Causal self-attention
    const causal_mask = try self.createCausalMask(input.shape[0]);
    const attn_output = try self.attn.forward(
        normed1, normed1, normed1, causal_mask);
    // First residual
    const after_attn = try self.addResidual(input, attn_output);
    // Pre-MLP LayerNorm
    const normed2 = try self.ln_2.forward(after_attn);
    // MLP: Linear -> GELU -> Linear
    const mlp_output = try self.mlpForward(normed2);
    // Second residual
    return try self.addResidual(after_attn, mlp_output);
}
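The block above applies LayerNorm before attention and again before the MLP. As a minimal sketch of what that normalization computes over a single `d_model`-sized row, assuming an epsilon of 1e-5 and Zig 0.11 or later: the function name `layerNorm` and its slice-based signature are illustrative and are not ZigLLM's `LayerNorm` API.

```zig
const std = @import("std");

/// Normalizes one row to zero mean and unit variance, then applies the
/// learned per-feature scale (gamma) and shift (beta).
/// All four slices must have the same length.
fn layerNorm(x: []const f32, gamma: []const f32, beta: []const f32, out: []f32) void {
    const n: f32 = @floatFromInt(x.len);
    var mean: f32 = 0.0;
    for (x) |v| mean += v;
    mean /= n;
    var variance: f32 = 0.0;
    for (x) |v| variance += (v - mean) * (v - mean);
    variance /= n;
    // Epsilon guards against division by zero for constant rows.
    const inv_std = 1.0 / @sqrt(variance + 1e-5);
    for (out, x, gamma, beta) |*o, v, g, b| o.* = (v - mean) * inv_std * g + b;
}
```

In the pre-norm arrangement only the normalized copy feeds the sublayer; the residual addition uses the unnormalized input, which keeps an identity path through the whole stack.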
MLP with GELU¶
fn mlpForward(self: *Self, input: Tensor(f32)) !Tensor(f32) {
    // Project up: d_model -> 4 * d_model
    const intermediate = try input.matmul(self.mlp_c_fc, self.allocator);
    // GELU activation
    const activated = try neural_primitives.gelu(f32, intermediate, self.allocator);
    // Project down: 4 * d_model -> d_model
    return try activated.matmul(self.mlp_c_proj, self.allocator);
}
Model Struct¶
pub const GPT2Model = struct {
    config: GPT2Config,
    token_embeddings: Tensor(f32),    // [vocab_size, d_model]
    position_embeddings: Tensor(f32), // [max_seq_len, d_model]
    blocks: []GPT2Block,              // N transformer blocks
    ln_f: LayerNorm,                  // Final layer normalization
    lm_head: Tensor(f32),             // [d_model, vocab_size]
    allocator: Allocator,
};
Weight Tying¶
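GPT-2 ties the token embedding matrix to the output projection (LM head), so the same \( V \times d \) weights map tokens to vectors and hidden states to logits. Because the token embeddings are stored as [vocab_size, d_model] and the LM head as [d_model, vocab_size], the values must be transposed when they are copied:
// LM head: token embeddings transposed from [V, d] to [d, V] (assumes row-major data)
const lm_head_data = try allocator.alloc(f32, config.d_model * config.vocab_size);
for (0..config.vocab_size) |v| {
    for (0..config.d_model) |i| lm_head_data[i * config.vocab_size + v] = token_emb_data[v * config.d_model + i];
}
Tying removes \( V \times d \) trainable parameters (approximately 38.6M for GPT-2 Small) and is often credited with a mild regularizing effect. Note that copying the values into a second buffer does not by itself reduce memory use; that saving requires both layers to share one buffer.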
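One way to tie the weights without a second buffer is to compute logits directly against the rows of the token embedding table, since \( \text{logits}_v = h \cdot W_e[v] \). The sketch below illustrates that idea under the assumption of row-major storage and Zig 0.11 or later; `tiedLogits` is a hypothetical helper and not how ZigLLM implements its LM head.

```zig
const std = @import("std");

/// Computes logits for one hidden vector by reusing the token embedding
/// table as the output projection: logits[v] = dot(hidden, wte[v]).
/// `wte` is [vocab_size, d_model] row-major; logits.len is vocab_size.
fn tiedLogits(wte: []const f32, d_model: usize, hidden: []const f32, logits: []f32) void {
    std.debug.assert(hidden.len == d_model);
    std.debug.assert(wte.len == logits.len * d_model);
    for (logits, 0..) |*logit, v| {
        const row = wte[v * d_model ..][0..d_model];
        var acc: f32 = 0.0;
        for (row, hidden) |w, h| acc += w * h;
        logit.* = acc;
    }
}
```

Reading the table row by row avoids both the transposed copy and the extra \( V \times d \) floats, at the cost of a projection routine that is specific to the LM head.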
Parameter Count¶
pub fn parameterCount(self: *Self) usize {
    const d = self.config.d_model;
    // Per block: 12d^2 + 13d, assuming biases on all linear layers
    const block_params = 12 * d * d + 13 * d;
    var total: usize = 0;
    total += self.token_embeddings.data.len;      // V * d
    total += self.position_embeddings.data.len;   // S * d
    total += block_params * self.config.n_layers; // Per-layer
    total += 2 * self.config.d_model;             // Final LN
    total += self.lm_head.data.len;               // d * V (counted again if tied)
    return total;
}
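Per-block parameters, assuming biases on every linear layer and a scale and shift in each LayerNorm:
\[ \underbrace{4d^2 + 4d}_{\text{attention}} + \underbrace{8d^2 + 5d}_{\text{MLP}} + \underbrace{4d}_{\text{2 LayerNorms}} = 12d^2 + 13d \]
Total for GPT-2 Small (\( d=768, L=12, V=50257, S=1024 \)), counting the tied LM head once:
\[ Vd + Sd + L\,(12d^2 + 13d) + 2d = 38{,}597{,}376 + 786{,}432 + 85{,}054{,}464 + 1{,}536 = 124{,}439{,}808 \approx 124\text{M} \]
Counting the LM head as a separate \( d \times V \) matrix adds another 38,597,376 parameters, for 163,037,184 in total.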
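The count can be reproduced with a few lines of arithmetic. This sketch assumes the standard GPT-2 layout (biases on every linear layer, two LayerNorms per block, one final LayerNorm); the function names and the `tied` flag are illustrative and independent of the `parameterCount` method.

```zig
const std = @import("std");

/// Parameters in one block: attention (4d^2 + 4d), MLP (8d^2 + 5d),
/// and two LayerNorms (4d).
fn blockParams(d: usize) usize {
    return 12 * d * d + 13 * d;
}

/// Total parameters. With `tied` set, the LM head shares the token
/// embedding weights and is counted once.
fn totalParams(d: usize, n_layers: usize, vocab: usize, max_seq: usize, tied: bool) usize {
    var total: usize = vocab * d + max_seq * d; // token + position embeddings
    total += n_layers * blockParams(d);
    total += 2 * d; // final LayerNorm scale and shift
    if (!tied) total += d * vocab; // separate LM head
    return total;
}
```

For GPT-2 Small, `totalParams(768, 12, 50257, 1024, true)` gives 124,439,808, which matches the 124M figure in the variant table; passing `false` gives 163,037,184.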
GPT-2 vs Modern Architectures¶
| Design Choice | GPT-2 | LLaMA | Impact |
|---|---|---|---|
| Position | Learned (S x d params) | RoPE (0 params) | RoPE encodes relative offsets and is easier to extend to longer sequences |
| Activation | GELU (2 matrices) | SwiGLU (3 matrices) | SwiGLU is generally reported to give slightly better quality at the same param count |
| Norm | LayerNorm | RMSNorm | RMSNorm skips mean-centering and the shift term, so it is cheaper to compute |
| Bias | Yes | No | Marginal effect; no bias is simpler |
| Context | 1024 | 2048+ | Learned positions are capped at the trained length; RoPE has no fixed table to outgrow |
| Vocab | 50257 (BPE) | 32000 (SentencePiece) | Smaller vocab = less embedding memory |
Why Study GPT-2?
GPT-2 is the simplest complete decoder-only transformer in ZigLLM. Every subsequent architecture (LLaMA, Mistral, Falcon, Qwen) can be understood as a set of targeted improvements to the GPT-2 baseline. Master GPT-2 first, then understand each innovation incrementally.