Phi
The Phi family represents Microsoft Research's sustained investigation into the hypothesis that high-quality training data can compensate for reduced parameter counts. Beginning with Phi-1 (2023) and continuing through Phi-3 (2024), each generation demonstrated that small models trained on carefully curated "textbook-quality" data can match or exceed models several times their size on standard benchmarks.
1. Architecture Overview
Lineage
| Model | Year | Parameters | Key Paper |
|---|---|---|---|
| Phi-1 | 2023 | 1.3B | Gunasekar et al. 2023 [1] |
| Phi-2 | 2023 | 2.7B | Microsoft Research Blog |
| Phi-3 Mini | 2024 | 3.8B | Abdin et al. 2024 [2] |
| Phi-3 Medium | 2024 | 14B | Abdin et al. 2024 [2] |
Phi-1 targeted code generation exclusively. Phi-2 broadened to general language understanding while retaining the "small but capable" philosophy. Phi-3 introduced long-context variants (up to 128K tokens) and a Mixture of Experts variant (PhiMoE).
2. Key Innovations
2.1 Partial Rotary Embeddings
Unlike LLaMA, which applies Rotary Position Embeddings (RoPE) to the full head dimension, Phi-1 and Phi-2 apply RoPE to only a fraction of each head's query and key dimensions. The remaining dimensions receive no positional encoding.
Partial RoPE
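Given head dimension \( d_h \) and a partial ratio \( r \in (0, 1] \), define \( d_{\text{rot}} = \lfloor r \cdot d_h \rfloor \). For a query vector \( q \in \mathbb{R}^{d_h} \):
\[
q' = \big[\, \text{RoPE}(q_{1:d_{\text{rot}}},\, m) \;;\; q_{d_{\text{rot}}+1:d_h} \,\big]
\]
where \( m \) is the token position and \( [\,\cdot\,;\,\cdot\,] \) denotes concatenation. Key vectors are transformed the same way. The non-rotated dimensions carry content-only information, providing the model with a mix of positional and purely semantic features.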
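The definition above can be sketched for a single head vector as follows. This is a minimal illustration rather than the project's implementation: the name `partialRopeInPlace` is hypothetical, and the sketch assumes the RoFormer layout in which adjacent pairs of dimensions are rotated together and `rot_dim` is even.

```zig
const std = @import("std");

/// Rotates the first `rot_dim` entries of `x` in place for token position `pos`.
/// Entries from `rot_dim` onward are left untouched (content-only dimensions).
/// Assumption: adjacent pairs (x[i], x[i + 1]) are rotated together.
pub fn partialRopeInPlace(x: []f32, pos: u32, rot_dim: usize, theta_base: f32) void {
    std.debug.assert(rot_dim % 2 == 0 and rot_dim <= x.len);
    const m: f32 = @floatFromInt(pos);
    const d: f32 = @floatFromInt(rot_dim);
    var i: usize = 0;
    while (i < rot_dim) : (i += 2) {
        // Frequency of this pair: theta_base^(-i / rot_dim), with i the even start index.
        const exponent = -@as(f32, @floatFromInt(i)) / d;
        const angle = m * std.math.pow(f32, theta_base, exponent);
        const c = @cos(angle);
        const s = @sin(angle);
        const x0 = x[i];
        const x1 = x[i + 1];
        x[i] = x0 * c - x1 * s;
        x[i + 1] = x0 * s + x1 * c;
    }
}

pub fn main() void {
    // Head dimension 8 with a partial ratio of 0.5, so only the first 4 entries rotate.
    var q = [_]f32{ 1.0, 0.0, 1.0, 0.0, 0.5, -0.5, 2.0, 3.0 };
    partialRopeInPlace(&q, 3, 4, 10000.0);
    std.debug.print("{any}\n", .{q});
}
```

Because each pair undergoes a pure rotation, the norm of the rotated prefix is preserved, and the trailing entries are bit-for-bit identical to the input.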
2.2 QK LayerNorm
Phi-2 and later variants apply LayerNorm to the query and key projections before computing attention scores. This stabilizes the magnitude of the dot products, reducing the need for careful learning-rate tuning.
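To make the stabilizing effect concrete, the sketch below applies LayerNorm to one head vector in place. The function name `layerNormInPlace` and the per-head granularity are assumptions for illustration; the source does not specify whether the norm acts per head or over the full projection.

```zig
const std = @import("std");

/// Normalizes `x` in place to zero mean and unit variance, then applies the
/// learned per-dimension scale `gamma` and shift `beta`.
pub fn layerNormInPlace(x: []f32, gamma: []const f32, beta: []const f32, eps: f32) void {
    std.debug.assert(x.len == gamma.len and x.len == beta.len);
    const n: f32 = @floatFromInt(x.len);

    var mean: f32 = 0.0;
    for (x) |v| mean += v;
    mean /= n;

    var variance: f32 = 0.0;
    for (x) |v| variance += (v - mean) * (v - mean);
    variance /= n;

    const inv_std = 1.0 / @sqrt(variance + eps);
    for (x, gamma, beta) |*v, g, b| {
        v.* = (v.* - mean) * inv_std * g + b;
    }
}

pub fn main() void {
    const gamma = [_]f32{ 1.0, 1.0, 1.0, 1.0 };
    const beta = [_]f32{ 0.0, 0.0, 0.0, 0.0 };
    var q = [_]f32{ 1.0, 2.0, 3.0, 4.0 };
    var q_big = [_]f32{ 100.0, 200.0, 300.0, 400.0 };
    layerNormInPlace(&q, &gamma, &beta, 1e-5);
    layerNormInPlace(&q_big, &gamma, &beta, 1e-5);
    // Both vectors normalize to nearly the same values despite the 100x scale gap.
    std.debug.print("{any}\n{any}\n", .{ q, q_big });
}
```

With unit `gamma` and zero `beta`, the output norm is at most \( \sqrt{d_h} \), so the scaled dot product of a normalized query and key is bounded by \( \sqrt{d_h} \) in magnitude regardless of how large the raw projections grow.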
2.3 Data-Centric Training
The Phi series was among the first to demonstrate that a 1.3B model trained on "textbook-quality" synthetic data could outperform 10B+ models trained on raw web crawls. This insight is architecture-agnostic, but it deeply influenced the model's design constraints.
3. Architecture Diagram
```mermaid
flowchart TD
    INPUT["Token IDs"] --> EMB["Token Embedding"]
    EMB --> BLOCKS
    subgraph BLOCKS["N x PhiDecoderLayer"]
        LN1["LayerNorm"] --> ATTN
        subgraph ATTN["Phi Attention"]
            QKV["QKV Projection"] --> QKN["QK LayerNorm"]
            QKN --> PROPE["Partial RoPE\n(first d_rot dims only)"]
            PROPE --> DOT["Scaled Dot-Product"]
            DOT --> OUT_PROJ["Output Projection"]
        end
        ATTN --> RES1["Residual Add"]
        RES1 --> LN2["LayerNorm"]
        LN2 --> FFN["FFN (GELU / SwiGLU)"]
        FFN --> RES2["Residual Add"]
    end
    BLOCKS --> FINAL_LN["Final LayerNorm"]
    FINAL_LN --> LM_HEAD["LM Head"]
```
4. Configuration Parameters
| Parameter | Phi-1 (1.3B) | Phi-2 (2.7B) | Phi-3 Mini (3.8B) | Phi-3 Medium (14B) |
|---|---|---|---|---|
| n_layers | 24 | 32 | 32 | 40 |
| d_model | 2048 | 2560 | 3072 | 5120 |
| n_heads | 32 | 32 | 32 | 40 |
| n_kv_heads | 32 | 32 | 32 | 10 |
| d_ff | 8192 | 10240 | 8192 | 17920 |
| partial_rotary_factor | 0.5 | 0.4 | 1.0 | 1.0 |
| max_seq_len | 2048 | 2048 | 4096 | 4096 |
| vocab_size | 51200 | 51200 | 32064 | 32064 |
| activation | GELU | GELU | SwiGLU | SwiGLU |
| qk_layernorm | No | Yes | Yes | Yes |
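5. Mathematical Formulation
5.1 Partial RoPE Application
For position \( m \) and index \( i \in \{1, \dots, d_{\text{rot}}/2\} \) over pairs of head dimensions, with RoPE base \( b \) (`rope_theta`):
\[
\theta_i = b^{-2(i-1)/d_{\text{rot}}}, \qquad
\begin{pmatrix} q'_{2i-1} \\ q'_{2i} \end{pmatrix} =
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} q_{2i-1} \\ q_{2i} \end{pmatrix}
\]
The rotation is applied only to the first \( d_{\text{rot}} \) dimensions of \( q \) and \( k \), leaving the remaining \( d_h - d_{\text{rot}} \) dimensions unchanged.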
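5.2 Attention with QK Norm
\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{\text{LN}(Q)\, \text{LN}(K)^{\top}}{\sqrt{d_h}} \right) V
\]
where \(\text{LN}\) denotes LayerNorm applied independently to Q and K. Partial RoPE, applied to the normalized queries and keys as in Section 5.1, is omitted from the formula for brevity.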
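As a small illustration of the scaled dot-product and softmax step, the sketch below computes attention weights for one query against a handful of keys. The name `attentionWeights` is hypothetical, the inputs are assumed to be already normalized (and rotated), and causal masking and the multiplication by V are left out.

```zig
const std = @import("std");

/// Computes softmax(q . k_j / sqrt(d_h)) over all keys for a single query.
/// `q` and each row of `keys` are assumed to have already passed through
/// QK LayerNorm and partial RoPE. Results are written to `out`.
pub fn attentionWeights(q: []const f32, keys: []const []const f32, out: []f32) void {
    std.debug.assert(keys.len == out.len and keys.len > 0);
    const scale = 1.0 / @sqrt(@as(f32, @floatFromInt(q.len)));

    var max_score = -std.math.inf(f32);
    for (keys, out) |k, *o| {
        var dot: f32 = 0.0;
        for (q, k) |qi, ki| dot += qi * ki;
        o.* = dot * scale;
        max_score = @max(max_score, o.*);
    }

    // Subtract the maximum before exponentiating for numerical stability.
    var sum: f32 = 0.0;
    for (out) |*o| {
        o.* = @exp(o.* - max_score);
        sum += o.*;
    }
    for (out) |*o| o.* /= sum;
}

pub fn main() void {
    const q = [_]f32{ 1.0, 0.0 };
    const k0 = [_]f32{ 1.0, 0.0 };
    const k1 = [_]f32{ 0.0, 1.0 };
    const keys = [_][]const f32{ &k0, &k1 };
    var weights: [2]f32 = undefined;
    attentionWeights(&q, &keys, &weights);
    // The key aligned with the query receives the larger weight.
    std.debug.print("{any}\n", .{weights});
}
```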
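5.3 Feed-Forward (Phi-3 SwiGLU Variant)
\[
\text{FFN}_{\text{SwiGLU}}(x) = W_{\text{down}} \big( \text{SiLU}(W_{\text{gate}}\, x) \odot W_{\text{up}}\, x \big)
\]
where \( \odot \) is elementwise multiplication, \( W_{\text{gate}}, W_{\text{up}} \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}} \), and \( W_{\text{down}} \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}} \).
Earlier Phi models use standard GELU activation without gating:
\[
\text{FFN}_{\text{GELU}}(x) = W_2\, \text{GELU}(W_1 x + b_1) + b_2
\]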
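The difference between the gated and ungated variants comes down to how the hidden activation is formed. The sketch below shows only that elementwise step, taking the projected vectors as given. The helper names are hypothetical, and the tanh approximation of GELU is an assumption made here for simplicity.

```zig
const std = @import("std");

/// SiLU (swish): x * sigmoid(x).
fn silu(x: f32) f32 {
    return x / (1.0 + @exp(-x));
}

/// GELU using the tanh approximation (assumed here; an exact erf form also exists).
fn gelu(x: f32) f32 {
    const c: f32 = 0.7978845608; // sqrt(2 / pi)
    return 0.5 * x * (1.0 + std.math.tanh(c * (x + 0.044715 * x * x * x)));
}

/// Gated hidden activation: out[i] = silu(gate[i]) * up[i].
/// `gate` and `up` are the two d_ff-sized projections of the same input.
pub fn swigluHidden(gate: []const f32, up: []const f32, out: []f32) void {
    for (gate, up, out) |g, u, *o| o.* = silu(g) * u;
}

/// Ungated hidden activation: out[i] = gelu(pre[i]).
/// `pre` is the single d_ff-sized projection of the input.
pub fn geluHidden(pre: []const f32, out: []f32) void {
    for (pre, out) |p, *o| o.* = gelu(p);
}

pub fn main() void {
    const gate = [_]f32{ 0.0, 1.0, -2.0 };
    const up = [_]f32{ 5.0, 2.0, 1.0 };
    var gated: [3]f32 = undefined;
    var ungated: [3]f32 = undefined;
    swigluHidden(&gate, &up, &gated);
    geluHidden(&gate, &ungated);
    std.debug.print("swiglu: {any}\ngelu:   {any}\n", .{ gated, ungated });
}
```

The gated form needs two input projections instead of one, which is why SwiGLU models often choose a smaller `d_ff` relative to `d_model` than GELU models of similar size.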
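6. Zig Implementation
6.1 PhiConfig
```zig
/// Feed-forward activation type (minimal definition assumed for this snippet).
pub const ActivationType = enum { gelu, swiglu };

pub const PhiConfig = struct {
    n_layers: u32,
    d_model: u32,
    n_heads: u32,
    n_kv_heads: u32,
    d_ff: u32,
    vocab_size: u32,
    max_seq_len: u32,
    partial_rotary_factor: f32, // fraction of head dim to rotate
    qk_layernorm: bool, // apply LayerNorm to Q and K
    rope_theta: f32 = 10000.0,
    norm_eps: f32 = 1e-5,
    activation: ActivationType = .gelu,

    /// Per-head dimension; assumes d_model is divisible by n_heads.
    pub fn headDim(self: PhiConfig) u32 {
        return self.d_model / self.n_heads;
    }

    /// Number of leading head dimensions that receive RoPE (truncated toward zero).
    pub fn rotaryDim(self: PhiConfig) u32 {
        const head_dim: f32 = @floatFromInt(self.headDim());
        return @intFromFloat(head_dim * self.partial_rotary_factor);
    }
};
```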
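As a quick check of the arithmetic behind `headDim` and `rotaryDim`, the sketch below evaluates both for the Phi-1 and Phi-3 Mini columns of the configuration table. `RotarySpec` is a hypothetical trimmed-down struct used only to keep the example self-contained, and it rounds before converting to an integer, which is a defensive choice of this sketch rather than something the source prescribes.

```zig
const std = @import("std");

/// Hypothetical trimmed-down config holding only the fields used below.
const RotarySpec = struct {
    d_model: u32,
    n_heads: u32,
    partial_rotary_factor: f32,

    fn headDim(self: RotarySpec) u32 {
        return self.d_model / self.n_heads;
    }

    fn rotaryDim(self: RotarySpec) u32 {
        const head_dim: f32 = @floatFromInt(self.headDim());
        // Round before converting so floating-point error cannot truncate one short.
        return @intFromFloat(@round(head_dim * self.partial_rotary_factor));
    }
};

pub fn main() void {
    const phi1 = RotarySpec{ .d_model = 2048, .n_heads = 32, .partial_rotary_factor = 0.5 };
    const phi3_mini = RotarySpec{ .d_model = 3072, .n_heads = 32, .partial_rotary_factor = 1.0 };
    std.debug.print("phi-1: head_dim={d} rotary_dim={d}\n", .{ phi1.headDim(), phi1.rotaryDim() });
    std.debug.print("phi-3 mini: head_dim={d} rotary_dim={d}\n", .{ phi3_mini.headDim(), phi3_mini.rotaryDim() });
}
```

For Phi-1 this gives a head dimension of 64 with 32 rotated dimensions; for Phi-3 Mini the head dimension is 96 and all 96 dimensions are rotated.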
6.2 Forward Pass Overview
```zig
pub fn forward(self: *PhiModel, tokens: []const u32, pos: u32) ![]f32 {
    var hidden = self.embedding.lookup(tokens);

    for (self.layers) |*layer| {
        // Pre-attention normalization
        const normed = layer.input_norm.forward(hidden);

        // QKV projection
        var q = layer.wq.forward(normed);
        var k = layer.wk.forward(normed);
        const v = layer.wv.forward(normed);

        // Optional QK LayerNorm
        if (self.config.qk_layernorm) {
            q = layer.q_norm.forward(q);
            k = layer.k_norm.forward(k);
        }

        // Partial RoPE -- rotate only first rotary_dim dimensions
        const rot_dim = self.config.rotaryDim();
        applyPartialRoPE(&q, &k, pos, rot_dim);

        // Attention followed by the first residual connection
        const attn_out = layer.attention.forward(q, k, v);
        hidden = residualAdd(hidden, attn_out);

        // Feed-forward block followed by the second residual connection
        const ff_normed = layer.post_norm.forward(hidden);
        const ff_out = layer.ffn.forward(ff_normed);
        hidden = residualAdd(hidden, ff_out);
    }

    // Final LayerNorm; the LM head projection is not shown in this overview.
    return self.output_norm.forward(hidden);
}
```
7. Variants
| Variant | Description |
|---|---|
| Phi-1 | Code-only, 1.3B, GELU activation, no QK norm |
| Phi-1.5 | Extended to natural language, same architecture as Phi-1 |
| Phi-2 | 2.7B, introduced QK LayerNorm, partial RoPE at 0.4 |
| Phi-3 Mini | 3.8B, switched to SwiGLU, full RoPE, long context support |
| Phi-3 Medium | 14B, larger hidden dims |
| PhiMoE | Mixture of Experts variant with sparse activation |
8. Educational Value
What Phi Teaches
- Data quality vs. scale: Phi demonstrated that training data curation can be more impactful than parameter count -- a foundational insight for understanding model capabilities.
- Partial positional encoding: The partial RoPE mechanism illustrates that not all dimensions need positional information. Leaving part of each head unrotated lets those dimensions match on content alone.
- QK normalization: Applying LayerNorm to queries and keys is a simple technique that stabilizes attention scores, especially at initialization. This is an accessible lesson in training stability.
- Incremental architecture evolution: Tracing Phi-1 through Phi-3 shows how a model family evolves -- adding gating (SwiGLU), expanding context, and incorporating MoE -- while preserving core design choices.
9. References
1. Gunasekar, S. et al. "Textbooks Are All You Need." arXiv:2306.11644, 2023.
2. Abdin, M. et al. "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone." arXiv:2404.14219, 2024.
3. Su, J. et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, 2021.
4. Ba, J. L. et al. "Layer Normalization." arXiv:1607.06450, 2016.