StarCoder

StarCoder is a family of code-generation language models developed by the BigCode open-science collaboration. Released in 2023, StarCoder demonstrated that code models trained on permissively licensed data could match or exceed some proprietary alternatives; the paper reports that it matches or outperforms OpenAI's code-cushman-001¹. It combines two techniques that matter for code generation: Multi-Query Attention (MQA), an architectural choice for efficient inference during long code completions, and Fill-in-the-Middle (FIM) training, which enables non-left-to-right code generation¹.

1. Architecture Overview

The BigCode Project

BigCode is an open-science collaboration led by Hugging Face and ServiceNow, involving researchers from over 60 institutions. StarCoder was trained on a subset of The Stack v1, a 6.4TB dataset of permissively licensed source code from GitHub; the training subset covers 86 programming languages¹.

StarCoder is a decoder-only autoregressive transformer optimized for code. It uses Multi-Query Attention (a single KV head shared across all query heads), learned absolute position embeddings, and a context length of 8,192 tokens -- sufficient for most single-file code completions.

2. Key Innovations

2.1 Multi-Query Attention (MQA)

MQA, introduced by Shazeer (2019)², uses a single key and value head shared across all query heads:

Multi-Query Attention

\[ Q = xW_Q \in \mathbb{R}^{s \times n_h \times d_h} \]

\[ K = xW_K \in \mathbb{R}^{s \times 1 \times d_h} \]

\[ V = xW_V \in \mathbb{R}^{s \times 1 \times d_h} \]

All query heads share the same K and V:

\[ \text{Attn}_i(Q, K, V) = \text{softmax}\!\left(\frac{Q_i K^T}{\sqrt{d_h}}\right) V \]

Memory savings: The KV cache stores only 1 set of K/V per layer instead of \( n_h \) sets, reducing KV cache memory by a factor of \( n_h \).

2.2 Fill-in-the-Middle (FIM)

A model trained only left-to-right cannot condition on the code that follows an insertion point, so it cannot fill a gap between existing code. FIM (Bavarian et al., 2022)⁴ transforms training examples by splitting them into prefix, middle, and suffix, then reordering:

FIM Training Transformation

Input: Code sequence \( [t_1, t_2, \ldots, t_n] \)

1. Choose random split points \( a \) and \( b \) where \( a < b \).
2. Define:
   - Prefix: \( P = [t_1, \ldots, t_a] \)
   - Middle: \( M = [t_{a+1}, \ldots, t_b] \)
   - Suffix: \( S = [t_{b+1}, \ldots, t_n] \)
3. Rearrange as: <fim_prefix> P <fim_suffix> S <fim_middle> M
4. Train autoregressively on this reordered sequence.

At inference time, the model can be prompted with prefix and suffix to generate the middle portion -- enabling code infilling, not just completion.
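The reordering step of the transformation above can be sketched in Zig as follows. This is a minimal illustration, not the BigCode training pipeline: the sentinel ids are placeholder values, the function name `fimTransform` is hypothetical, the split points are passed in rather than sampled, and the two-argument `@memcpy` form assumes Zig 0.11 or later.

```zig
const std = @import("std");

// Placeholder ids for illustration only; real ids come from the tokenizer vocabulary.
const FIM_PREFIX: u32 = 1;
const FIM_MIDDLE: u32 = 2;
const FIM_SUFFIX: u32 = 3;

/// Reorders `tokens` as <fim_prefix> P <fim_suffix> S <fim_middle> M, where
/// P = tokens[0..a], M = tokens[a..b], S = tokens[b..]. Caller owns the result.
pub fn fimTransform(
    allocator: std.mem.Allocator,
    tokens: []const u32,
    a: usize,
    b: usize,
) ![]u32 {
    std.debug.assert(a < b and b <= tokens.len);
    const prefix = tokens[0..a];
    const middle = tokens[a..b];
    const suffix = tokens[b..];

    // Three sentinel tokens are added to the original sequence.
    const out = try allocator.alloc(u32, tokens.len + 3);
    var i: usize = 0;

    out[i] = FIM_PREFIX;
    i += 1;
    @memcpy(out[i .. i + prefix.len], prefix);
    i += prefix.len;

    out[i] = FIM_SUFFIX;
    i += 1;
    @memcpy(out[i .. i + suffix.len], suffix);
    i += suffix.len;

    out[i] = FIM_MIDDLE;
    i += 1;
    @memcpy(out[i .. i + middle.len], middle);

    return out;
}

test "prefix-suffix-middle reordering" {
    const code = [_]u32{ 10, 11, 12, 13, 14 };
    const out = try fimTransform(std.testing.allocator, &code, 2, 4);
    defer std.testing.allocator.free(out);
    // P = {10, 11}, M = {12, 13}, S = {14}
    try std.testing.expectEqualSlices(u32, &[_]u32{ 1, 10, 11, 3, 14, 2, 12, 13 }, out);
}
```

Because the middle segment ends up last, the ordinary next-token loss on the reordered sequence trains the model to produce the middle given both surrounding segments.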

2.3 Code-Specific Tokenizer

StarCoder uses a byte-level BPE tokenizer with special handling for code:

- Whitespace grouping: Sequences of spaces are tokenized as single tokens (e.g., 4 spaces = one token), which is critical for indentation-sensitive languages
- Repository context tokens: Special tokens such as <reponame> and <filename> encode file paths and repository structure
- FIM special tokens: <fim_prefix>, <fim_middle>, <fim_suffix>

3. Architecture Diagram

flowchart TD
    INPUT["Code Token IDs"] --> EMB

    subgraph EMB["Embedding"]
        TOK_EMB["Token Embedding"]
        POS_EMB["Learned Position Embedding"]
        SUM["Sum"]
        TOK_EMB --> SUM
        POS_EMB --> SUM
    end

    SUM --> BLOCKS

    subgraph BLOCKS["N x StarCoderBlock"]
        LN1["LayerNorm"] --> ATTN
        subgraph ATTN["Multi-Query Attention"]
            Q_PROJ["Q Projection\n(n_heads x d_head)"]
            KV_PROJ["K,V Projection\n(1 x d_head each)"]
            Q_PROJ --> DOT["Scaled Dot-Product Scores\n(all Q heads share K)"]
            KV_PROJ --> DOT
            DOT --> MASK["Causal Mask"]
            MASK --> SM["Softmax"]
            SM --> WSUM["Weighted Sum\n(all Q heads share V)"]
            KV_PROJ --> WSUM
            WSUM --> OUT["Output Projection"]
        end
        ATTN --> RES1["Residual Add"]
        RES1 --> LN2["LayerNorm"]
        LN2 --> FFN["FFN (GELU)"]
        FFN --> RES2["Residual Add"]
    end

    BLOCKS --> FINAL_LN["Final LayerNorm"]
    FINAL_LN --> LM_HEAD["LM Head"]

4. Configuration Parameters

Parameter	StarCoder-1B	StarCoder-3B	StarCoder-7B	StarCoder-15B
`n_layers`	24	36	42	40
`d_model`	2048	2816	4096	6144
`n_heads`	16	22	32	48
`n_kv_heads`	1	1	1	1
`d_ff`	8192	11264	16384	24576
`vocab_size`	49152	49152	49152	49152
`max_seq_len`	8192	8192	8192	8192
`activation`	GELU	GELU	GELU	GELU
`norm_type`	LayerNorm	LayerNorm	LayerNorm	LayerNorm
`positional_encoding`	Learned	Learned	Learned	Learned
`attention_type`	MQA	MQA	MQA	MQA

5. Mathematical Formulation

5.1 MQA Attention Scores

For query head \( i \) with shared key and value:

\[ A^{(i)} = \text{softmax}\!\left(\frac{Q^{(i)} K^T}{\sqrt{d_h}} + M_{\text{causal}}\right) \]

\[ O^{(i)} = A^{(i)} V \]

The final output concatenates all heads:

\[ O = \text{Concat}(O^{(1)}, \ldots, O^{(n_h)}) W_O \]
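The per-head formulas above can be traced on plain `f32` slices. The sketch below is an illustration under stated assumptions, not StarCoder's actual kernel: the function name `mqaDecodeStep` and its buffer layout are hypothetical, it handles a single decode step (so the causal mask is implicit, since the cache holds only positions up to the current one), it stops before the \( W_O \) projection, and it uses Zig 0.11+ builtins.

```zig
const std = @import("std");

/// One decode step of Multi-Query Attention over plain slices.
/// q:      [n_heads * d_head]  queries for the current position
/// keys:   [n_pos * d_head]    the single shared K head
/// values: [n_pos * d_head]    the single shared V head
/// scores: [n_pos]             scratch buffer, overwritten per head
/// out:    [n_heads * d_head]  concatenated head outputs (before W_O)
pub fn mqaDecodeStep(
    q: []const f32,
    keys: []const f32,
    values: []const f32,
    n_heads: usize,
    d_head: usize,
    scores: []f32,
    out: []f32,
) void {
    const n_pos = scores.len;
    std.debug.assert(q.len == n_heads * d_head and out.len == q.len);
    std.debug.assert(keys.len == n_pos * d_head and values.len == keys.len);
    const scale = 1.0 / @sqrt(@as(f32, @floatFromInt(d_head)));

    for (0..n_heads) |h| {
        const q_h = q[h * d_head .. (h + 1) * d_head];
        const o_h = out[h * d_head .. (h + 1) * d_head];

        // Scaled dot products of this query head against the shared keys.
        var max_score = -std.math.inf(f32);
        for (0..n_pos) |t| {
            var dot: f32 = 0;
            for (q_h, keys[t * d_head .. (t + 1) * d_head]) |qv, kv| dot += qv * kv;
            scores[t] = dot * scale;
            max_score = @max(max_score, scores[t]);
        }

        // Numerically stable softmax (normalization is applied in the next loop).
        var sum: f32 = 0;
        for (scores) |*s| {
            s.* = @exp(s.* - max_score);
            sum += s.*;
        }

        // Weighted sum of the shared values.
        @memset(o_h, 0);
        for (0..n_pos) |t| {
            const w = scores[t] / sum;
            for (o_h, values[t * d_head .. (t + 1) * d_head]) |*o, vv| o.* += w * vv;
        }
    }
}

test "all-zero keys average the shared values for every head" {
    const q = [_]f32{ 1, 0, 0, 1 }; // 2 heads, d_head = 2
    const keys = [_]f32{ 0, 0, 0, 0 }; // 2 cached positions
    const values = [_]f32{ 1, 2, 3, 4 };
    var scores: [2]f32 = undefined;
    var out: [4]f32 = undefined;
    mqaDecodeStep(&q, &keys, &values, 2, 2, &scores, &out);
    try std.testing.expectEqualSlices(f32, &[_]f32{ 2, 3, 2, 3 }, &out);
}
```

Note that `keys` and `values` carry no head index at all: every iteration of the outer loop reads the same two buffers, which is the whole difference from multi-head attention.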

5.2 KV Cache Memory Comparison

KV Cache Size: MHA vs. MQA

For sequence length \( s \), model dimension \( d \), number of heads \( n_h \), head dimension \( d_h = d / n_h \), number of layers \( L \), and precision \( p \) bytes per element (the leading factor of 2 accounts for keys and values):

Attention Type	KV Cache Size
MHA	\( 2 \cdot L \cdot s \cdot n_h \cdot d_h \cdot p \)
GQA (g groups)	\( 2 \cdot L \cdot s \cdot g \cdot d_h \cdot p \)
MQA	\( 2 \cdot L \cdot s \cdot d_h \cdot p \)

For StarCoder-15B with \( n_h = 48 \), MQA uses \( 48\times \) less KV cache memory than MHA.
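As a worked instance of the table, assume fp16 storage (\( p = 2 \)), a full context of \( s = 8192 \) tokens, and the StarCoder-15B values from Section 4 (\( L = 40 \), \( n_h = 48 \), \( d_h = 6144 / 48 = 128 \)). The precision and sequence length are assumptions chosen for illustration; the figures are per sequence.

\[ \text{MHA: } 2 \cdot 40 \cdot 8192 \cdot 48 \cdot 128 \cdot 2 = 8{,}053{,}063{,}680 \text{ bytes} = 7.5 \text{ GiB} \]

\[ \text{MQA: } 2 \cdot 40 \cdot 8192 \cdot 128 \cdot 2 = 167{,}772{,}160 \text{ bytes} = 160 \text{ MiB} \]

With f32 storage, as in the `kvCacheSize` helper in Section 6.1, both figures double while the \( 48\times \) ratio stays the same.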

5.3 FIM Probability Factorization

Standard autoregressive:

\[ P(x) = \prod_{i=1}^{n} P(x_i \mid x_{<i}) \]

FIM reorders the factorization:

\[ P(x) = P(\text{prefix}) \cdot P(\text{suffix} \mid \text{prefix}) \cdot P(\text{middle} \mid \text{prefix}, \text{suffix}) \]

This allows the model to condition on both prefix and suffix when generating the middle segment.
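Spelled out at the token level with the split points \( a \) and \( b \) from Section 2.2, the middle term expands as shown below. This is a sketch that leaves the sentinel tokens out of the conditioning context for readability.

\[ P(\text{middle} \mid \text{prefix}, \text{suffix}) = \prod_{i=a+1}^{b} P(t_i \mid t_1, \ldots, t_a,\; t_{b+1}, \ldots, t_n,\; t_{a+1}, \ldots, t_{i-1}) \]

Each middle token is therefore predicted with the entire suffix already in context, which the standard left-to-right factorization never allows.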

6. Zig Implementation

6.1 StarCoderConfig

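// ActivationType is assumed to be defined elsewhere in the project.
pub const StarCoderConfig = struct {
    n_layers: u32,
    d_model: u32,
    n_heads: u32,
    n_kv_heads: u32 = 1,        // MQA: always 1
    d_ff: u32,
    vocab_size: u32 = 49152,
    max_seq_len: u32 = 8192,
    norm_eps: f32 = 1e-5,
    activation: ActivationType = .gelu,
    use_bias: bool = true,

    /// Per-head dimension; assumes d_model is divisible by n_heads
    pub fn headDim(self: StarCoderConfig) u32 {
        return self.d_model / self.n_heads;
    }

    /// KV cache size in bytes for `seq_len` cached positions, assuming f32 storage.
    /// With MQA (n_kv_heads = 1) this is n_heads times smaller than with MHA.
    pub fn kvCacheSize(self: StarCoderConfig, seq_len: u32) u64 {
        // Factor 2 covers keys and values; n_kv_heads is 1 for MQA
        return 2 * @as(u64, self.n_layers) * seq_len
            * self.n_kv_heads * self.headDim() * @sizeOf(f32);
    }
};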

6.2 Multi-Query Attention

// Linear, Tensor, KVCache, MAX_HEADS, and the tensor helpers used below
// are assumed to be defined elsewhere in the project.
pub const MultiQueryAttention = struct {
    wq: Linear,    // [d_model, n_heads * d_head]
    wk: Linear,    // [d_model, d_head]  -- single head
    wv: Linear,    // [d_model, d_head]  -- single head
    wo: Linear,    // [n_heads * d_head, d_model]
    n_heads: u32,
    d_head: u32,

    pub fn forward(
        self: *MultiQueryAttention,
        x: Tensor(f32),
        pos: u32,
        kv_cache: *KVCache,
    ) !Tensor(f32) {
        // Q: [seq, n_heads * d_head]
        const q = self.wq.forward(x);

        // K, V: [seq, d_head] -- single head, shared by all query heads
        const k = self.wk.forward(x);
        const v = self.wv.forward(x);

        // Update KV cache (only 1 head to cache)
        kv_cache.update(k, v, pos);

        // The cached K and V are identical for every query head, so fetch them once
        const cached_k = kv_cache.keys(pos);
        const cached_v = kv_cache.values(pos);

        // Each query head attends to the SAME K, V
        var outputs: [MAX_HEADS]Tensor(f32) = undefined;
        for (0..self.n_heads) |h| {
            const q_h = q.headSlice(h, self.d_head);

            const scores = scaledDotProduct(q_h, cached_k, self.d_head);
            applyCausalMask(scores, pos);
            const attn = softmax(scores);
            outputs[h] = matmul(attn, cached_v);
        }

        // Concatenate the per-head outputs and apply the output projection
        const concat = try concatenateHeads(outputs[0..self.n_heads]);
        return self.wo.forward(concat);
    }
};

6.3 FIM Prompt Construction

const std = @import("std");
const Allocator = std.mem.Allocator;

pub const FIMFormatter = struct {
    prefix_token: u32,    // <fim_prefix>
    suffix_token: u32,    // <fim_suffix>
    middle_token: u32,    // <fim_middle>

    /// Build an infill prompt in prefix-suffix-middle order.
    /// The caller owns the returned slice and must free it with `allocator`.
    pub fn formatInfill(
        self: FIMFormatter,
        prefix: []const u32,
        suffix: []const u32,
        allocator: Allocator,
    ) ![]u32 {
        var tokens = std.ArrayList(u32).init(allocator);
        errdefer tokens.deinit(); // release the buffer if any append fails

        try tokens.append(self.prefix_token);
        try tokens.appendSlice(prefix);
        try tokens.append(self.suffix_token);
        try tokens.appendSlice(suffix);
        try tokens.append(self.middle_token);
        // Model generates the middle portion autoregressively

        return tokens.toOwnedSlice();
    }
};

7. Variants

Model	Year	Parameters	Key Improvements
StarCoder	2023	15B	StarCoderBase further fine-tuned on Python; MQA, FIM
StarCoderBase	2023	15B	Base model trained on The Stack v1; MQA, FIM
StarCoder2-3B	2024	3B	Trained on The Stack v2; GQA, RoPE, 16K context
StarCoder2-7B	2024	7B	Same data pipeline and architecture changes at 7B scale
StarCoder2-15B	2024	15B	Largest StarCoder2 model; strongest results in the family

StarCoder2 Improvements

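StarCoder2 (Lozhkov et al., 2024)³ was trained on The Stack v2, which is roughly an order of magnitude larger than The Stack v1 (about 67.5TB vs. 6.4TB of source data), with improved data deduplication and filtering. Architecturally, it replaces MQA with GQA and learned absolute position embeddings with rotary position embeddings (RoPE), and it extends the context window to 16,384 tokens with sliding-window attention. The overall decoder-only design is unchanged; much of the improvement comes from data and training.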

8. Educational Value

What StarCoder Teaches

- Multi-Query Attention: StarCoder is the clearest practical example of MQA. With a single KV head shared across 48 query heads (in the 15B variant), students can directly observe the memory savings and understand the quality-efficiency trade-off.
- Fill-in-the-Middle: FIM is a simple but powerful technique that challenges the assumption that language models can only generate left-to-right. The reordering trick (prefix-suffix-middle) demonstrates how training data formatting can expand model capabilities without architecture changes.
- Code-specific design: StarCoder shows how domain-specific requirements (indentation sensitivity, long contexts for full files, repository structure) influence tokenizer design and context length decisions.
- MQA vs. GQA vs. MHA spectrum: Comparing StarCoder (MQA, 1 KV head) with LLaMA 2 70B (GQA, 8 KV heads) and GPT-J (MHA, one KV head per query head) provides a concrete understanding of the full attention efficiency spectrum.
- Data licensing and ethics: BigCode's decision to train only on permissively licensed code raises important questions about training data provenance, which are integral to understanding modern LLM development.

9. References

1. Li, R. et al. "StarCoder: May the Source Be with You!" arXiv:2305.06161, 2023.
2. Shazeer, N. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150, 2019.
3. Lozhkov, A. et al. "StarCoder 2 and The Stack v2: The Next Generation." arXiv:2402.19173, 2024.
4. Bavarian, M. et al. "Efficient Training of Language Models to Fill in the Middle." arXiv:2207.14255, 2022.