Skip to content

Tutorial: Understanding Attention

In this tutorial you will construct the scaled dot-product attention mechanism from first principles, using ZigLlama's tensor and linear-algebra primitives. Every line of code is paired with the mathematical operation it implements, so you can trace the data-flow from raw embeddings to attended representations.

Prerequisites: Familiarity with matrix multiplication and softmax.

Estimated time: 20 minutes.


The Attention Equation

The core computation is:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\!\top}}{\sqrt{d_k}}\right) V \]

where:

  • \(Q \in \mathbb{R}^{n \times d_k}\) -- queries: "what am I looking for?"
  • \(K \in \mathbb{R}^{n \times d_k}\) -- keys: "what do I contain?"
  • \(V \in \mathbb{R}^{n \times d_v}\) -- values: "what do I offer?"
  • \(d_k\) -- key/query dimension (used for scaling)

Step 1: Create Q, K, V Tensors

We start with a small sequence of 4 tokens, each with a 64-dimensional embedding. In a real model, Q, K, and V are obtained by multiplying the input embeddings by learned weight matrices \(W^Q\), \(W^K\), \(W^V\).

const std = @import("std");
const Tensor = @import("foundation/tensor.zig").Tensor;

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const seq_len: usize = 4;
    const d_k: usize = 8;  // small for illustration

    // Shape: [batch=1, heads=1, seq_len=4, d_k=8]
    var Q = try Tensor(f32).init(allocator, &[_]usize{ 1, 1, seq_len, d_k });
    defer Q.deinit();
    var K = try Tensor(f32).init(allocator, &[_]usize{ 1, 1, seq_len, d_k });
    defer K.deinit();
    var V = try Tensor(f32).init(allocator, &[_]usize{ 1, 1, seq_len, d_k });
    defer V.deinit();

    // Fill with illustrative values
    Q.fill(0.5);
    K.fill(0.3);
    V.fill(1.0);
}

Shape convention

ZigLlama uses the 4-D shape [batch, heads, seq_len, d_k] for attention tensors. The batch and head dimensions are both 1 here to keep things simple.


Step 2: Compute Attention Scores \(QK^{\top}\)

Multiplying queries by transposed keys produces a \(n \times n\) score matrix where element \((i, j)\) measures how much token \(i\) should attend to token \(j\).

const attention = @import("transformers/attention.zig");

// scaledDotProductAttention performs all steps internally, but let us
// walk through them manually first.

// Batched matmul with K transposed: [1,1,4,8] x [1,1,4,8]^T -> [1,1,4,4]
var scores = try attention.batchedMatMul(Q, K, allocator, true);
defer scores.deinit();

std.debug.print("Score matrix shape: [{d},{d},{d},{d}]\n", .{
    scores.shape[0], scores.shape[1], scores.shape[2], scores.shape[3],
});
// Output: Score matrix shape: [1,1,4,4]

Step 3: Scale by \(\sqrt{d_k}\)

Without scaling, the dot products grow proportionally to \(d_k\), pushing the softmax into saturation (one element near 1, the rest near 0). Dividing by \(\sqrt{d_k}\) controls the variance:

\[ \text{score}_{ij} \leftarrow \frac{\text{score}_{ij}}{\sqrt{d_k}} \]
const scale: f32 = 1.0 / @sqrt(@as(f32, @floatFromInt(d_k)));
// scale = 1/sqrt(8) ~ 0.354

for (0..scores.size) |i| {
    scores.data[i] *= scale;
}

Intuition for the scaling factor

If \(Q\) and \(K\) entries are i.i.d. with mean 0 and variance 1, each dot product has variance \(d_k\). Dividing by \(\sqrt{d_k}\) restores unit variance, keeping softmax in its sensitive region.


Step 4: Apply Causal Mask

For autoregressive (decoder) models, position \(i\) must not attend to any position \(j > i\). This is enforced by setting future positions to \(-\infty\) before softmax:

var mask = try attention.createCausalMask(allocator, seq_len);
defer mask.deinit();

// Mask pattern (0 = allowed, -inf = blocked):
//   0   -inf -inf -inf
//   0    0   -inf -inf
//   0    0    0   -inf
//   0    0    0    0

try attention.applyMask(&scores, mask);

After masking, the score matrix for position 0 has only one finite entry (itself), while position 3 can attend to all four positions.


Step 5: Apply Softmax

Softmax converts raw scores into a probability distribution over keys:

\[ \alpha_{ij} = \frac{\exp(\text{score}_{ij})}{\sum_{k=1}^{n} \exp(\text{score}_{ik})} \]

ZigLlama's softmaxLastDim operates on the last axis of a 4-D tensor:

var weights = try attention.softmaxLastDim(scores, allocator);
defer weights.deinit();

// For position 0 (can only attend to itself):  [1.0, 0.0, 0.0, 0.0]
// For position 3 (uniform inputs after mask):  [0.25, 0.25, 0.25, 0.25]

Numerical stability

The implementation subtracts the row maximum before exponentiating (\(\exp(x_i - \max_j x_j)\)), preventing overflow for large logits.


Step 6: Multiply by Values

The final step produces the attended output by taking a weighted sum of value vectors:

\[ \text{output}_i = \sum_{j} \alpha_{ij} \, V_j \]
var output = try attention.batchedMatMul(weights, V, allocator, false);
defer output.deinit();
// Shape: [1, 1, 4, 8] -- same as V

Each row of the output is a convex combination of the value vectors, with coefficients given by the attention weights.


Multi-Head Attention: Splitting the Work

A single attention head can only capture one type of relationship. Multi-head attention runs \(h\) parallel attention heads, each operating on a \(d_k = d_\text{model} / h\) subspace:

\[ \text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W^O \]
flowchart LR
    X[Input X] --> WQ["W_Q projection"]
    X --> WK["W_K projection"]
    X --> WV["W_V projection"]
    WQ --> Split["Split into h heads"]
    WK --> Split
    WV --> Split
    Split --> H1["Head 1: Attention"]
    Split --> H2["Head 2: Attention"]
    Split --> Hh["Head h: Attention"]
    H1 --> Cat[Concatenate]
    H2 --> Cat
    Hh --> Cat
    Cat --> WO["W_O projection"]
    WO --> Out[Output]

In ZigLlama, the MultiHeadAttention struct encapsulates this:

const mha = @import("transformers/attention.zig").MultiHeadAttention;

var attn = try mha.init(allocator, 64, 8); // d_model=64, 8 heads -> d_k=8
defer attn.deinit();

// input shape: [batch=1, seq_len=4, d_model=64]
var input = try Tensor(f32).init(allocator, &[_]usize{ 1, 4, 64 });
defer input.deinit();
input.fill(0.1);

// Forward pass: project -> split -> attend -> concat -> project
var output_mha = try attn.forward(input, input, input, null);
defer output_mha.deinit();
// output shape: [1, 4, 64]

How the Split Works

reshapeForHeads rearranges [batch, seq, d_model] into [batch, heads, seq, d_k] by interleaving dimensions:

Dimension Before After
[0] batch batch
[1] seq_len num_heads
[2] d_model seq_len
[3] -- d_k

After attention, reshapeFromHeads reverses the operation and concatenates all head outputs back into a single d_model-wide vector.


Rotary Position Embeddings (RoPE)

LLaMA uses RoPE to inject positional information directly into the Q and K vectors. For each dimension pair \((2i, 2i+1)\) at position \(m\):

\[ \begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix} \]

where \(\theta_i = 10000^{-2i/d_k}\). This is a 2-D rotation whose angle depends on position, making the dot product between two positions depend only on their relative distance.

var rope_result = try attention.applyRotaryEncoding(Q, K, seq_len, allocator);
defer rope_result.q.deinit();
defer rope_result.k.deinit();
// rope_result.q and rope_result.k are now position-aware

Verifying RoPE

RoPE preserves the magnitude of each dimension pair. The test "RoPE rotation properties" in attention.zig checks exactly this by comparing \(\lVert (x, y) \rVert\) before and after rotation.


Summary

Step Math ZigLlama Function
Project Q, K, V \(QW^Q\), \(KW^K\), \(VW^V\) Tensor.matmul
Reshape for heads -- reshapeForHeads
Score \(QK^\top\) batchedMatMul(..., true)
Scale \(/ \sqrt{d_k}\) element-wise multiply
Mask set future to \(-\infty\) applyMask, createCausalMask
Softmax \(\text{softmax}(\cdot)\) softmaxLastDim
Attend \(\alpha V\) batchedMatMul(..., false)
Concat + project \(\text{Concat} \cdot W^O\) reshapeFromHeads, Tensor.matmul

What to Try Next

  • Modify d_k and observe how the scaling factor changes the attention distribution.
  • Replace Q.fill(0.5) with non-uniform values and print the attention weights -- you will see the model "focusing" on specific positions.
  • Proceed to Quantization in Practice to learn how weight compression affects these computations.