Layer 2: Linear Algebra
Layer 2 sits directly above the Foundation layer and provides the computational kernels that every neural-network operation ultimately calls. Where Layer 1 defines what a tensor is (shape, strides, memory layout), Layer 2 defines what you can do with tensors efficiently -- matrix multiplication, dot products, and the quantized arithmetic that makes multi-billion-parameter models fit in commodity RAM.
Learning Objectives
After completing this layer you will be able to:
- Explain how SIMD (Single Instruction, Multiple Data) instruction sets accelerate matrix arithmetic and identify the relevant instruction families (SSE, AVX, AVX2, NEON) at compile time in Zig.
- Implement vectorised dot products and matrix multiplications using Zig's @Vector built-in, including cache-blocked variants for large matrices.
- Derive the mathematics of uniform quantization -- symmetric, asymmetric, and block-wise -- and compute storage requirements and compression ratios for each format.
- Describe the Q8_0, Q4_0, and INT8 quantization formats at the byte level, including their dequantization formulas.
- Distinguish K-quantization (two-level scales with sub-block granularity) from basic quantization and explain the BlockQ4K, BlockQ5K, and BlockQ6K structures.
- Explain importance quantization (IQ) -- how saliency-weighted bit allocation and non-linear lookup tables push compression below 2 bits per weight.
- Analyse performance using roofline models, SIMD speedup curves, and memory-bandwidth bottleneck identification.
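To make the second objective concrete, here is a minimal sketch of a vectorised dot product built on @Vector. The function name `dotSimd` and the lane count of 8 are illustrative assumptions rather than names from this codebase, and the code assumes Zig 0.11 or later (single-argument @splat).

```zig
const std = @import("std");

// Assumption: 8 x f32 lanes, which matches one 256-bit register on AVX targets.
const lanes = 8;
const Vec = @Vector(lanes, f32);

/// Dot product of two equal-length f32 slices (hypothetical helper).
pub fn dotSimd(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);

    var acc: Vec = @splat(0.0);
    var i: usize = 0;

    // Main loop: one fused multiply-add per group of `lanes` elements.
    while (i + lanes <= a.len) : (i += lanes) {
        const va: Vec = a[i..][0..lanes].*;
        const vb: Vec = b[i..][0..lanes].*;
        acc = @mulAdd(Vec, va, vb, acc);
    }

    // Horizontal sum of the per-lane accumulators.
    var sum: f32 = @reduce(.Add, acc);

    // Scalar tail for lengths that are not a multiple of `lanes`.
    while (i < a.len) : (i += 1) sum += a[i] * b[i];
    return sum;
}
```

The scalar tail keeps the function correct for any length. A production kernel would likely choose the lane count per target and use several independent accumulators.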
Components
| Page | Focus | Key Concepts |
|---|---|---|
| SIMD Matrix Operations | Vectorised arithmetic kernels | @Vector, @splat, FMA, cache blocking, tiling |
| Quantization Fundamentals | Core quantization theory | Symmetric / asymmetric, Q8_0, Q4_0, INT8, block-wise scales |
| K-Quantization | Two-level quantization with sub-block scales | BlockQ4K, BlockQ5K, BlockQ6K, QK_K = 256 |
| Importance Quantization | Saliency-weighted extreme compression | IQ1_S through IQ4_NL, importance bitmaps, non-linear LUTs |
| Performance Analysis | Benchmarks and optimization | Roofline model, SIMD speedups, bandwidth analysis |
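As a preview of the symmetric, block-wise scheme listed under Quantization Fundamentals, the sketch below quantizes 32 floats to 8-bit integers with one shared scale and then reconstructs them as `x ≈ scale * q`. The `Block` type and both function names are hypothetical. For simplicity the scale is held as f32 here, whereas the on-disk Q8_0 format in GGML stores it as f16. Assumes Zig 0.11 or later.

```zig
const std = @import("std");

const block_size = 32; // weights per block, as in Q8_0

// Hypothetical in-memory block for illustration only.
const Block = struct {
    scale: f32,
    q: [block_size]i8,
};

/// Symmetric quantization: scale = max|x| / 127, q = round(x / scale).
fn quantizeBlock(x: [block_size]f32) Block {
    var amax: f32 = 0.0;
    for (x) |v| amax = @max(amax, if (v < 0.0) -v else v);

    const scale: f32 = amax / 127.0;
    // An all-zero block has scale 0; avoid dividing by it.
    const inv: f32 = if (scale == 0.0) 0.0 else 1.0 / scale;

    var out = Block{ .scale = scale, .q = undefined };
    for (x, 0..) |v, i| {
        out.q[i] = @intFromFloat(@round(v * inv));
    }
    return out;
}

/// Dequantization: x' = scale * q.
fn dequantizeBlock(b: Block) [block_size]f32 {
    var out: [block_size]f32 = undefined;
    for (b.q, 0..) |q, i| {
        out[i] = b.scale * @as(f32, @floatFromInt(q));
    }
    return out;
}
```

With round-to-nearest, the reconstruction error per weight is at most about half the scale, which is why smaller blocks (and so tighter scales) trade storage for accuracy.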
Connection to Layer 1: Foundation
Layer 2 consumes Tensor(T) values produced by Layer 1. Every kernel in this layer operates on raw element slices obtained via tensor.data -- the shape and stride metadata are used only to compute offsets. The key contracts are:
flowchart LR
subgraph Layer1["Layer 1 -- Foundation"]
T["Tensor(T)"]
MM["MemoryMap"]
GGUF["GGUF Reader"]
end
subgraph Layer2["Layer 2 -- Linear Algebra"]
MO["Matrix Operations"]
QF["Quantization Formats"]
PA["Performance Kernels"]
end
T -->|".data slice"| MO
T -->|".shape, .strides"| MO
MM -->|"mmap'd weight bytes"| QF
GGUF -->|"block pointers"| QF
QF -->|"dequantized vectors"| MO
MO -->|"result tensors"| PA
Dependency rule
Layer 2 may @import any Layer 1 module but never imports from Layers 3--6. This invariant is enforced by the build system and verified by the import-graph test.
What Layer 1 provides
| Layer 1 Module | What Layer 2 Uses |
|---|---|
| tensor.Tensor(T) | Element storage, shape metadata, strided access |
| memory_map.MemoryMap | Zero-copy access to weight files on disk |
| gguf.GGUFReader | Block-level pointers into quantized weight data |
| threading.ThreadPool | Parallel dispatch of blocked matrix multiplications |
Connection to Layer 4: Transformers
Transformers are the primary consumer of Layer 2 kernels. Every forward pass through a transformer block issues dozens of matrix multiplications, each of which bottlenecks on the routines documented here.
Dominant matrix multiplications in a transformer forward pass
For a model with hidden dimension \( d \), per-head dimension \( d_k \), sequence length \( n \), and feed-forward expansion factor 4:
| Operation | Shape | FLOPs (sequence of \( n \) tokens) |
|---|---|---|
| QKV projection | \( (n, d) \times (d, 3d) \) | \( \mathcal{O}(n \cdot d^2) \) |
| Attention scores | \( (n, d_k) \times (d_k, n) \) | \( \mathcal{O}(n^2 \cdot d_k) \) |
| Attention output projection | \( (n, d) \times (d, d) \) | \( \mathcal{O}(n \cdot d^2) \) |
| FFN up-projection | \( (n, d) \times (d, 4d) \) | \( \mathcal{O}(n \cdot d^2) \) |
| FFN down-projection | \( (n, 4d) \times (4d, d) \) | \( \mathcal{O}(n \cdot d^2) \) |
At inference time the QKV and FFN projections dominate. In single-token generation the KV cache avoids recomputing keys and values for earlier tokens, so only one new row passes through each projection, and with \( d \gg n \) the \( d^2 \) terms outweigh the attention-score term. This is why SIMD matrix kernels and quantized matmul are the most impactful optimizations in the stack.
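A rough worked example illustrates the gap. The numbers are illustrative assumptions, not measurements: \( d = 4096 \), one new token, and a cached context of \( L = 2048 \) tokens (\( L \) is a symbol introduced here for the cache length). Counting 2 FLOPs per multiply-add and assuming the heads together span the hidden dimension (number of heads times \( d_k \) equals \( d \)):

\[
\begin{aligned}
\text{projections} &= \underbrace{6d^2}_{\text{QKV}} + \underbrace{2d^2}_{\text{output}} + \underbrace{8d^2}_{\text{FFN up}} + \underbrace{8d^2}_{\text{FFN down}} = 24d^2 \approx 4.0 \times 10^{8} \\
\text{attention scores} &= 2 L d \approx 1.7 \times 10^{7} \\
\frac{\text{projections}}{\text{attention scores}} &= \frac{24d^2}{2Ld} = \frac{12d}{L} = 24
\end{aligned}
\]

Under these assumptions the projections cost about 24 times as much as the score computation for each generated token. The ratio shrinks as the context grows relative to \( d \).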
flowchart TB
subgraph Layer4["Layer 4 -- Transformers"]
ATT["Multi-Head Attention"]
FFN["Feed-Forward Network"]
end
subgraph Layer2["Layer 2 -- Linear Algebra"]
MATMUL["matmulSIMD_f32"]
QMATMUL["quantized matmul"]
DEQUANT["dequantize()"]
end
ATT -->|"Q, K, V projections"| MATMUL
ATT -->|"score = Q * K^T"| MATMUL
FFN -->|"up / gate / down"| QMATMUL
QMATMUL --> DEQUANT
DEQUANT --> MATMUL
Quantization Format Landscape
The following table summarises every quantization format implemented in Layer 2, ordered by bits per weight. Detailed coverage is split across three dedicated pages.
| Format | Bits per Weight | Page | Family |
|---|---|---|---|
| IQ1_S | 1.5 | Importance Quantization | IQ |
| IQ1_M | 1.75 | Importance Quantization | IQ |
| IQ2_XXS | 2.06 | Importance Quantization | IQ |
| IQ2_XS | 2.31 | Importance Quantization | IQ |
| IQ2_S | 2.5 | Importance Quantization | IQ |
| IQ2_M | 2.7 | Importance Quantization | IQ |
| IQ3_XXS | 3.06 | Importance Quantization | IQ |
| IQ3_XS | 3.3 | Importance Quantization | IQ |
| IQ3_S | 3.44 | Importance Quantization | IQ |
| IQ4_XS | 4.25 | Importance Quantization | IQ |
| Q4_0 | 4.5 | Quantization Fundamentals | Basic |
| Q4_K | 4.5 | K-Quantization | K |
| IQ4_NL | 4.5 | Importance Quantization | IQ |
| Q5_K | 5.5 | K-Quantization | K |
| Q6_K | 6.56 | K-Quantization | K |
| Q8_0 | 8.5 | Quantization Fundamentals | Basic |
| F16 | 16.0 | Quantization Fundamentals | Unquantized |
| F32 | 32.0 | Quantization Fundamentals | Unquantized |
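The bits-per-weight figures follow directly from block size: total block bytes times 8, divided by the number of weights in the block. The sketch below applies that formula to four formats. The byte counts are assumptions taken from the GGML block layouts used by llama.cpp (f16 scales, 32-weight basic blocks, 256-weight K-quant super-blocks). If this project's structs differ, the results will differ too. `bitsPerWeight` is a hypothetical helper, and the code assumes Zig 0.11 or later.

```zig
const std = @import("std");

/// Bits per weight for a block of `weights` values stored in `block_bytes` bytes.
fn bitsPerWeight(block_bytes: usize, weights: usize) f64 {
    const bits: f64 = @floatFromInt(block_bytes * 8);
    const n: f64 = @floatFromInt(weights);
    return bits / n;
}

pub fn main() void {
    // Q4_0: 2-byte f16 scale + 32 x 4-bit weights (16 bytes) = 18 bytes.
    std.debug.print("Q4_0: {d:.2}\n", .{bitsPerWeight(2 + 16, 32)});
    // Q8_0: 2-byte f16 scale + 32 x 8-bit weights (32 bytes) = 34 bytes.
    std.debug.print("Q8_0: {d:.2}\n", .{bitsPerWeight(2 + 32, 32)});
    // Q4_K: two f16 values (4 bytes) + 12 bytes of packed sub-block scales
    // + 256 x 4-bit weights (128 bytes) = 144 bytes.
    std.debug.print("Q4_K: {d:.2}\n", .{bitsPerWeight(4 + 12 + 128, 256)});
    // Q5_K: as Q4_K plus 32 bytes holding the fifth bit of each weight = 176 bytes.
    std.debug.print("Q5_K: {d:.2}\n", .{bitsPerWeight(4 + 12 + 32 + 128, 256)});
}
```

The per-block scale is the overhead that lifts a 4-bit format to 4.5 bits per weight. Larger blocks spread that overhead over more weights, at the cost of a coarser scale.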
Recommended Reading Order
flowchart LR
A["SIMD Matrix Operations"] --> B["Quantization Fundamentals"]
B --> C["K-Quantization"]
B --> D["Importance Quantization"]
C --> E["Performance Analysis"]
D --> E
- Start with SIMD Matrix Operations to understand the baseline compute kernels.
- Move to Quantization Fundamentals for the mathematical framework.
- Branch into K-Quantization or Importance Quantization based on your interest (K-quant is more widely deployed; IQ-quant pushes compression further).
- Finish with Performance Analysis to tie everything together with empirical data.
Prerequisites
Assumed knowledge
- Basic linear algebra: matrix multiplication, dot products, transpose.
- Binary representations: IEEE 754 floating-point, two's complement integers.
- Familiarity with Zig syntax (see Getting Started).
- Completion of Layer 1: Foundation is recommended but not strictly required.
Key References
The material in this layer draws on the following sources:
- Dettmers, T. et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS, 2022.
- Frantar, E. et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR, 2023.
- Gerganov, G. et al. "llama.cpp -- Inference of LLaMA model in pure C/C++." GitHub, 2023.
- Intel Corporation. "Intel 64 and IA-32 Architectures Optimization Reference Manual." 2024.