Layer 5: Model Architectures¶
Overview¶
Layer 5 implements complete neural network architectures that compose the primitives from Layers 1--4 (linear algebra, tensors, neural primitives, transformer components) into end-to-end language models. ZigLLM ships 18 model implementations spanning the most influential architectures in modern NLP, covering an estimated 80% of real-world deployment scenarios despite representing only 19% of the GGUF specification's 94 registered architecture identifiers.
Coverage Analysis
ZigLLM targets the architectures that dominate actual usage. The long-tail of niche or deprecated formats (GPT-J v0, RWKV variants, Persimmon, InternLM v1, etc.) accounts for the remaining 81% of identifiers but less than 20% of production traffic.
Architecture Taxonomy¶
The 18 supported architectures are organized into four categories based on their structural characteristics and intended use cases.
graph TB
subgraph "Decoder-Only (Autoregressive)"
direction TB
L[LLaMA / LLaMA 2] --> M[Mistral]
L --> F[Falcon]
L --> Q[Qwen]
L --> P[Phi]
G2[GPT-2] --> GJ[GPT-J]
G2 --> GN[GPT-NeoX]
G2 --> BL[BLOOM]
L --> SC[StarCoder]
end
subgraph "Encoder-Only"
B[BERT]
end
subgraph "State-Space"
MA[Mamba]
end
subgraph "Multi-Modal"
MM[Multi-Modal]
end
subgraph "Specialized Components"
MoE[Mixture of Experts]
BLAS[BLAS Integration]
GE[Gemma]
end
style L fill:#4a9eff,color:#fff
style G2 fill:#4a9eff,color:#fff
style B fill:#ff9f43,color:#fff
style MA fill:#10ac84,color:#fff
style MM fill:#ee5a24,color:#fff Category Breakdown¶
| Category | Count | Architectures | Primary Use |
|---|---|---|---|
| Core Language Models | 9 | LLaMA, Mistral, GPT-2, Falcon, Qwen, Phi, GPT-J, GPT-NeoX, BLOOM | Text generation, chat, code |
| Specialized Models | 4 | Mamba, BERT, Gemma, StarCoder | SSM inference, embeddings, code completion |
| Advanced Components | 3 | MoE, Multi-modal, BLAS | Sparse routing, vision-language, hardware acceleration |
| Total | 18 |
Architecture Comparison¶
The following table summarizes the key design choices across all supported architectures.
| Architecture | Attention | Position Encoding | Normalization | Activation | FFN Style |
|---|---|---|---|---|---|
| LLaMA | MHA | RoPE | RMSNorm (Pre) | SwiGLU | Gated |
| Mistral | GQA + Sliding Window | RoPE | RMSNorm (Pre) | SwiGLU | Gated |
| GPT-2 | MHA | Learned | LayerNorm (Pre) | GELU | Standard |
| Falcon | MQA / GQA | RoPE / ALiBi | LayerNorm | GELU | Parallel |
| Qwen | GQA | RoPE (NTK/YARN) | RMSNorm (Pre) | SwiGLU | Gated |
| Phi | MHA | RoPE | LayerNorm (Pre) | GELU | Parallel |
| GPT-J | MHA | RoPE | LayerNorm | GELU | Parallel |
| GPT-NeoX | MHA | RoPE | LayerNorm (Pre) | GELU | Standard |
| BLOOM | MHA | ALiBi | LayerNorm | GELU | Standard |
| Mamba | N/A (SSM) | Implicit | RMSNorm | SiLU | N/A |
| BERT | MHA (Bidirectional) | Learned | LayerNorm (Post) | GELU | Standard |
| Gemma | MHA / GQA | RoPE | RMSNorm (Pre) | GeGLU | Gated |
| StarCoder | MQA | Learned | LayerNorm | GELU | Standard |
| MoE | GQA + Router | RoPE | RMSNorm (Pre) | SwiGLU | Sparse Gated |
Attention Abbreviations
MHA = Multi-Head Attention, GQA = Grouped-Query Attention (\( n_\text{kv} < n_\text{heads} \)), MQA = Multi-Query Attention (\( n_\text{kv} = 1 \)), SSM = Structured State-Space Model.
Parameter Scale¶
The architectures span a wide range of parameter counts, from sub-billion to hundreds of billions.
| Architecture | Smallest | Largest | Typical Deployment |
|---|---|---|---|
| GPT-2 | 124M | 1.5B | Education, prototyping |
| Phi | 1.3B | 3.8B | Edge, mobile |
| Falcon | 1B | 180B | General purpose |
| LLaMA | 7B | 65B | Foundation model |
| Mistral | 7B | 8x7B (MoE) | Efficiency-focused |
| Qwen | 0.5B | 72B | Multilingual |
| BLOOM | 560M | 176B | Multilingual open science |
| BERT | 110M | 340M | Embeddings, classification |
| StarCoder | 1B | 15B | Code generation |
| Mamba | 130M | 2.8B | Long-sequence tasks |
Learning Path¶
We recommend the following progression through the model architectures, ordered by conceptual complexity.
Beginner¶
- GPT-2 -- The classic decoder-only transformer. Learned position embeddings, GELU activation, post-LayerNorm. Start here to understand the baseline.
- LLaMA -- Modern best practices: RoPE, SwiGLU, RMSNorm, pre-normalization. The canonical reference architecture for 2023--2025 era models.
- BERT -- Bidirectional encoder. Contrasts with decoder-only design; essential for understanding embeddings and classification tasks.
Intermediate¶
- Mistral -- Sliding window attention and grouped-query attention. Introduces memory-efficient attention patterns.
- Falcon -- Multi-query attention and parallel attention+FFN blocks. Demonstrates alternative efficiency strategies.
- Qwen -- Advanced RoPE scaling (NTK-Aware, YARN, Dynamic NTK). The most comprehensive position encoding implementation.
Advanced¶
- Mamba -- State-space models. A fundamentally different approach with \( O(n) \) sequence complexity instead of \( O(n^2) \).
- Mixture of Experts -- Sparse routing, expert parallelism, load balancing. Scales parameter count without proportional compute increase.
- Multi-Modal -- Vision encoders, cross-attention fusion, modality alignment. Extends language models to process images.
Source Code Organization¶
All model implementations reside in src/models/:
src/models/
config.zig # Shared ModelConfig, ActivationType, NormalizationType
tokenizer.zig # Vocabulary, SimpleTokenizer, BPETokenizer
chat_templates.zig # ChatMessage, TemplateType, ChatTemplate (10 formats)
gguf.zig # GGUFReader, GGMLType, tensor loading
llama.zig # LLaMAModel, LLaMAConfig, LLaMATransformerLayer
mistral.zig # MistralModel, GroupedQueryAttention, SlidingWindow
gpt2.zig # GPT2Model, learned position embeddings
falcon.zig # FalconModel, parallel attention+FFN
qwen.zig # QwenModel, RopeScaling, Dynamic NTK
phi.zig # PhiModel
gptj.zig # GPTJModel
gpt_neox.zig # GPTNeoXModel
bloom.zig # BLOOMModel, ALiBi
mamba.zig # MambaModel, selective state spaces
bert.zig # BERTModel, bidirectional attention
gemma.zig # GemmaModel
starcoder.zig # StarCoderModel
mixture_of_experts.zig # MoE routing and expert dispatch
multi_modal.zig # Vision-language fusion
Infrastructure Pages¶
Before diving into individual architectures, the following pages cover the shared infrastructure that all models depend on:
- Model Configuration --
ModelConfigstruct, preset configurations, validation, parameter counting, and memory estimation. - Tokenization -- Vocabulary management, SimpleTokenizer, BPE algorithm, batch encoding, and subword theory.
- Chat Templates -- Multi-turn conversation formatting for 10 model families (Llama2, ChatML, Mistral, etc.).
- GGUF Model Loading -- Binary format parsing, metadata extraction, quantized tensor loading, and memory-mapped I/O.
What Comes Next¶
After understanding individual architectures, Layer 6 (Inference) covers how these models are used in practice: KV-cache management, sampling strategies, batched generation, streaming output, and performance profiling.
References¶
-
Touvron, H. et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023. ↩
-
Jiang, A. Q. et al. "Mistral 7B." arXiv:2310.06825, 2023. ↩
-
Radford, A. et al. "Language Models are Unsupervised Multitask Learners." OpenAI, 2019. ↩
-
Penedo, G. et al. "The RefinedWeb Dataset for Falcon LLM." arXiv:2306.01116, 2023. ↩
-
Bai, J. et al. "Qwen Technical Report." arXiv:2309.16609, 2023. ↩
-
Gu, A. and Dao, T. "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752, 2023. ↩
-
Devlin, J. et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL, 2019. ↩