Falcon¶
Overview¶
Falcon is a family of causal decoder-only language models developed by the Technology Innovation Institute (TII) of Abu Dhabi, first released in mid-20231. The Falcon architecture introduced two key efficiency innovations: Multi-Query Attention (MQA) with a single key-value head shared across all query heads, and parallel attention+FFN blocks that compute attention and the feed-forward network simultaneously rather than sequentially.
ZigLLM implements the Falcon architecture in src/models/falcon.zig, covering all four model sizes (1B, 7B, 40B, 180B) and both the original MQA design and the later GQA variants.
Key Innovations¶
Multi-Query Attention (MQA)¶
Falcon-7B pioneered the extreme form of key-value head sharing: a single key head and a single value head serve all 71 query heads.
where \( H = 71 \) query heads share a single K and V projection of dimension \( d_h = 64 \).
KV Cache Savings
| Attention Type | KV Params per Layer | Memory (S=2048) |
|---|---|---|
| MHA (H=71) | \( 2 \times 71 \times 64 \times S \) | 18.2 MB |
| MQA (H=1) | \( 2 \times 1 \times 64 \times S \) | 0.26 MB |
| GQA (H=8) | \( 2 \times 8 \times 64 \times S \) | 2.0 MB |
MQA achieves a 71x reduction in KV cache memory compared to standard MHA, which is critical for high-throughput serving with many concurrent sequences.
Parallel Attention and FFN¶
In the standard (sequential) transformer block, attention output feeds into the FFN:
Falcon's parallel block computes both sub-layers from the same normalized input and sums their outputs:
flowchart LR
subgraph "Sequential (LLaMA)"
direction TB
S_IN["Input"] --> S_N1["Norm"]
S_N1 --> S_ATT["Attention"]
S_ATT --> S_R1["+ Residual"]
S_R1 --> S_N2["Norm"]
S_N2 --> S_FFN["FFN"]
S_FFN --> S_R2["+ Residual"]
end
subgraph "Parallel (Falcon)"
direction TB
P_IN["Input"] --> P_N["Norm"]
P_N --> P_ATT["Attention"]
P_N --> P_FFN["FFN"]
P_ATT --> P_ADD["Sum + Residual"]
P_FFN --> P_ADD
end
style S_IN fill:#f0f0f0,color:#333
style P_IN fill:#f0f0f0,color:#333
style S_R2 fill:#4a9eff,color:#fff
style P_ADD fill:#4a9eff,color:#fff Why Parallel?
The parallel block uses only one normalization layer per block instead of two, and the attention and FFN computations can be overlapped on hardware that supports concurrent execution (e.g., GPUs with multiple stream processors). The quality impact is minimal -- within noise on benchmarks.
Configuration¶
FalconConfig Struct¶
pub const FalconConfig = struct {
variant: FalconVariant,
vocab_size: u32,
hidden_size: u32,
num_attention_heads: u32,
num_kv_heads: u32, // 1 for MQA, 8 for GQA
num_hidden_layers: u32,
intermediate_size: u32,
max_sequence_length: u32,
layer_norm_eps: f32,
rope_theta: f32,
parallel_attn: bool, // Parallel attention + FFN
use_bias: bool, // Linear layer bias
attention_dropout: f32,
multi_query: bool, // True for MQA (1 KV head)
alibi: bool, // ALiBi position encoding
new_decoder_architecture: bool,
};
Variant Configurations¶
| Parameter | Falcon-1B | Falcon-7B | Falcon-40B | Falcon-180B |
|---|---|---|---|---|
hidden_size | 2048 | 4544 | 8192 | 14848 |
num_attention_heads | 32 | 71 | 128 | 232 |
num_kv_heads | 1 | 1 | 8 | 8 |
num_hidden_layers | 24 | 32 | 60 | 80 |
intermediate_size | 8192 | 18176 | 32768 | 59392 |
parallel_attn | false | true | true | true |
multi_query | true | true | false | false |
alibi | true | false | false | false |
| Attention type | MQA + ALiBi | MQA + RoPE | GQA + RoPE | GQA + RoPE |
Architecture Evolution
Notice how the Falcon family evolved across sizes:
- 1B: ALiBi positions + MQA + sequential blocks (most conservative)
- 7B: RoPE + MQA + parallel blocks (MQA at scale)
- 40B/180B: RoPE + GQA + parallel blocks (GQA for better quality at scale)
Architecture Components¶
Falcon Attention¶
The attention module handles both MQA and GQA configurations through a unified projection scheme:
pub const FalconAttention = struct {
config: FalconConfig,
query_key_value: LinearLayer, // Combined QKV projection
dense: LinearLayer, // Output projection
head_dim: u32,
pub fn init(config: FalconConfig, allocator: Allocator) !Self {
const head_dim = config.hidden_size / config.num_attention_heads;
const kv_head_dim = config.hidden_size / config.num_kv_heads;
// QKV projection size depends on attention type
const qkv_size = if (config.multi_query)
config.hidden_size + 2 * kv_head_dim // Q + single K,V
else
3 * config.hidden_size; // Standard Q,K,V
const query_key_value = try LinearLayer.init(
config.hidden_size, qkv_size, config.use_bias, allocator);
// ...
}
};
Falcon MLP¶
Falcon uses a standard 2-matrix FFN with GELU activation (not gated SwiGLU):
pub const FalconMLP = struct {
dense_h_to_4h: LinearLayer, // Up projection [d, 4d]
dense_4h_to_h: LinearLayer, // Down projection [4d, d]
pub fn forward(self: *Self, hidden_states: Tensor(f32)) !Tensor(f32) {
const intermediate = try self.dense_h_to_4h.forward(hidden_states);
const activated = try neural_primitives.gelu(intermediate, self.allocator);
return try self.dense_4h_to_h.forward(activated);
}
};
Falcon Layer (Parallel vs Sequential)¶
pub fn forward(self: *Self, hidden_states: Tensor(f32),
attention_mask: ?Tensor(f32),
position_ids: ?Tensor(u32)) !Tensor(f32) {
var residual = hidden_states;
if (self.config.parallel_attn) {
// === Parallel Block ===
const normed = try self.input_layernorm.forward(hidden_states);
// Compute attention and MLP from the same normed input
const attn_output = try self.self_attn.forward(
normed, attention_mask, position_ids);
const mlp_output = try self.mlp.forward(normed);
// Sum both outputs + residual
const combined = try self.addTensors(attn_output, mlp_output);
return try self.addTensors(combined, residual);
} else {
// === Sequential Block ===
const normed1 = try self.input_layernorm.forward(hidden_states);
const attn_output = try self.self_attn.forward(
normed1, attention_mask, position_ids);
hidden_states = try self.addTensors(residual, attn_output);
residual = hidden_states;
const normed2 = try self.post_attention_layernorm.?.forward(hidden_states);
const mlp_output = try self.mlp.forward(normed2);
return try self.addTensors(residual, mlp_output);
}
}
Parallel Block Saves One Norm
In the parallel variant, post_attention_layernorm is null. Only one LayerNorm is applied per block, saving both parameters and compute. The input_layernorm output feeds both the attention and MLP paths.
Full Model¶
pub const FalconModel = struct {
config: FalconConfig,
embeddings: FalconEmbeddings, // Token embeddings (no position embeddings)
layers: []FalconLayer, // Transformer layers
ln_f: LayerNorm, // Final normalization
lm_head: LinearLayer, // Output projection
pub fn forward(self: *Self, input_ids: Tensor(u32),
attention_mask: ?Tensor(f32),
position_ids: ?Tensor(u32)) !Tensor(f32) {
var hidden_states = try self.embeddings.forward(input_ids);
for (self.layers) |*layer| {
hidden_states = try layer.forward(
hidden_states, attention_mask, position_ids);
}
hidden_states = try self.ln_f.forward(hidden_states);
return try self.lm_head.forward(hidden_states);
}
};
ALiBi (Falcon-1B)¶
Falcon-1B uses Attention with Linear Biases (ALiBi)2 instead of RoPE. ALiBi adds a position-dependent linear bias to attention scores:
where \( D[i,j] = -(i - j) \) is the distance matrix and \( m \) is a head-specific slope. Slopes are geometrically spaced: \( m_h = 2^{-8h/H} \) for head \( h \).
ALiBi vs RoPE
- ALiBi: Adds bias to attention scores. No extra parameters. Linear position decay. Excellent length extrapolation.
- RoPE: Rotates Q/K vectors. No extra parameters. Relative position via angle difference. Good extrapolation with scaling.
Falcon-7B+ switched from ALiBi to RoPE, following the broader industry trend.
Falcon vs Other Architectures¶
| Aspect | GPT-2 | LLaMA | Falcon-7B | Mistral |
|---|---|---|---|---|
| Attention | MHA | MHA | MQA (1 KV head) | GQA (8 KV heads) |
| Block style | Sequential | Sequential | Parallel | Sequential |
| FFN | GELU 2-matrix | SwiGLU 3-matrix | GELU 2-matrix | SwiGLU 3-matrix |
| Norm | LayerNorm | RMSNorm | LayerNorm | RMSNorm |
| KV cache | Full | Full | Minimal | Reduced |
Educational Value¶
Falcon demonstrates two important design principles:
- Extreme KV sharing (MQA): How far can you reduce KV heads without quality loss? Falcon-7B shows that a single KV head works surprisingly well.
- Parallel computation: The parallel block design shows that the strict sequential dependency between attention and FFN can be relaxed with minimal quality impact, enabling hardware optimization.
References¶
-
Penedo, G. et al. "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only." arXiv:2306.01116, 2023. ↩
-
Press, O. et al. "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR, 2022. ↩
-
Shazeer, N. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150, 2019. ↩