Layer 2: Linear Algebra

Layer 2 sits directly above the Foundation layer and provides the computational kernels that every neural-network operation ultimately calls. Where Layer 1 defines what a tensor is (shape, strides, memory layout), Layer 2 defines what you can do with tensors efficiently -- matrix multiplication, dot products, and the quantized arithmetic that makes multi-billion-parameter models fit in commodity RAM.

Learning Objectives

After completing this layer you will be able to:

Explain how SIMD (Single Instruction, Multiple Data) instruction sets accelerate matrix arithmetic and identify the relevant instruction families (SSE, AVX, AVX2, NEON) at compile time in Zig.
Implement vectorised dot products and matrix multiplications using Zig's @Vector built-in, including cache-blocked variants for large matrices.
Derive the mathematics of uniform quantization -- symmetric, asymmetric, and block-wise -- and compute storage requirements and compression ratios for each format.
Describe the Q8_0, Q4_0, and INT8 quantization formats at the byte level, including their dequantization formulas.
Distinguish K-quantization (two-level scales with sub-block granularity) from basic quantization and explain the BlockQ4K, BlockQ5K, and BlockQ6K structures.
Explain importance quantization (IQ) -- how saliency-weighted bit allocation and non-linear lookup tables push compression below 2 bits per weight.
Analyse performance using roofline models, SIMD speedup curves, and memory-bandwidth bottleneck identification.

Components

Page	Focus	Key Concepts
SIMD Matrix Operations	Vectorised arithmetic kernels	`@Vector`, `@splat`, FMA, cache blocking, tiling
Quantization Fundamentals	Core quantization theory	Symmetric / asymmetric, Q8_0, Q4_0, INT8, block-wise scales
K-Quantization	Two-level quantization with sub-block scales	BlockQ4K, BlockQ5K, BlockQ6K, QK_K = 256
Importance Quantization	Saliency-weighted extreme compression	IQ1_S through IQ4_NL, importance bitmaps, non-linear LUTs
Performance Analysis	Benchmarks and optimization	Roofline model, SIMD speedups, bandwidth analysis

Connection to Layer 1: Foundation

Layer 2 consumes `Tensor(T)` values produced by Layer 1. Every dense kernel in this layer operates on raw element slices obtained via `tensor.data`; the shape and stride metadata are used only to compute offsets. Quantized weights arrive separately, as memory-mapped bytes and GGUF block pointers. The key contracts are:

flowchart LR
    subgraph Layer1["Layer 1 -- Foundation"]
        T["Tensor(T)"]
        MM["MemoryMap"]
        GGUF["GGUF Reader"]
    end
    subgraph Layer2["Layer 2 -- Linear Algebra"]
        MO["Matrix Operations"]
        QF["Quantization Formats"]
        PA["Performance Kernels"]
    end

    T -->|".data slice"| MO
    T -->|".shape, .strides"| MO
    MM -->|"mmap'd weight bytes"| QF
    GGUF -->|"block pointers"| QF
    QF -->|"dequantized vectors"| MO
    MO -->|"result tensors"| PA
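
The `.data slice` and `.shape, .strides` edges in the diagram can be illustrated with a minimal sketch. It assumes Zig 0.11 or later, strides counted in elements, and a lane count of 8. The names `dotSlices` and `rowSlice` are hypothetical helpers for illustration, not this project's API.

```zig
const std = @import("std");

// Assumed lane count: 8 x f32 fills one 256-bit AVX register. Zig lowers
// @Vector to whatever the target supports, so this is a tuning choice.
const lanes = 8;
const Vec = @Vector(lanes, f32);

/// Dot product over two raw element slices (hypothetical helper; the
/// slices stand in for what `tensor.data` would hand to a kernel).
pub fn dotSlices(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);

    var acc: Vec = @splat(0.0);
    var i: usize = 0;

    // Main loop: one vector multiply-add per `lanes` elements, lowered to
    // an FMA instruction where the target has one.
    while (i + lanes <= a.len) : (i += lanes) {
        const va: Vec = a[i..][0..lanes].*;
        const vb: Vec = b[i..][0..lanes].*;
        acc = @mulAdd(Vec, va, vb, acc);
    }

    // Horizontal sum of the lanes, then a scalar loop for the tail.
    var sum: f32 = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) {
        sum += a[i] * b[i];
    }
    return sum;
}

/// Row `r` of a row-major matrix, located using the row stride only.
/// `row_stride` is counted in elements (an assumption about Layer 1).
pub fn rowSlice(data: []const f32, r: usize, cols: usize, row_stride: usize) []const f32 {
    const start = r * row_stride;
    return data[start .. start + cols];
}

test "dot product of a matrix row with a vector" {
    // 2 x 10 row-major matrix, so row_stride == cols == 10.
    var m: [20]f32 = undefined;
    for (&m, 0..) |*x, idx| x.* = @floatFromInt(idx);
    const ones = [_]f32{1.0} ** 10;

    // Row 1 holds 10..19, whose sum is 145.
    const got = dotSlices(rowSlice(&m, 1, 10, 10), &ones);
    try std.testing.expectApproxEqAbs(@as(f32, 145.0), got, 1e-4);
}
```

The shape and stride values only select which elements the kernel sees; the arithmetic itself never touches the metadata.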

Dependency rule

Layer 2 may `@import` any Layer 1 module but never imports from Layers 3--6. This invariant is enforced by the build system and verified by the import-graph test.

What Layer 1 provides

Layer 1 Module	What Layer 2 Uses
`tensor.Tensor(T)`	Element storage, shape metadata, strided access
`memory_map.MemoryMap`	Zero-copy access to weight files on disk
`gguf.GGUFReader`	Block-level pointers into quantized weight data
`threading.ThreadPool`	Parallel dispatch of blocked matrix multiplications

Connection to Layer 4: Transformers

Transformers are the primary consumer of Layer 2 kernels. Every forward pass through a transformer block issues dozens of matrix multiplications, and each one runs through the routines documented here, so kernel throughput bounds the speed of the whole block.

Dominant matrix multiplications in a transformer forward pass

For a model with hidden dimension \( d \), per-head dimension \( d_k \), sequence length \( n \), and feed-forward expansion factor 4:

Operation	Shape	FLOPs (all \( n \) tokens)
QKV projection	\( (n, d) \times (d, 3d) \)	\( \mathcal{O}(n \cdot d^2) \)
Attention scores (per head)	\( (n, d_k) \times (d_k, n) \)	\( \mathcal{O}(n^2 \cdot d_k) \)
Attention output projection	\( (n, d) \times (d, d) \)	\( \mathcal{O}(n \cdot d^2) \)
FFN up-projection	\( (n, d) \times (d, 4d) \)	\( \mathcal{O}(n \cdot d^2) \)
FFN down-projection	\( (n, 4d) \times (4d, d) \)	\( \mathcal{O}(n \cdot d^2) \)

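During single-token generation the QKV, output, and FFN projections dominate. With a KV cache each step processes only the new token, so the projections cost \( \mathcal{O}(d^2) \) while attention over \( n \) cached positions costs \( \mathcal{O}(n \cdot d) \); the projections win whenever \( d \) is large relative to \( n \). This is why SIMD matrix kernels and quantized matmul are the most impactful optimizations in the stack.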
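
As a rough worked example (an illustration under assumed values, not a measurement), take \( d = 4096 \) and a cache of \( n = 2048 \) positions, count one multiply-add as 2 FLOPs, and use the two-matrix FFN from the table. The attention term sums over all heads, using \( h \cdot d_k = d \), and includes the weighted sum over \( V \), which the table does not list. For one generated token in one layer:

\[
\begin{aligned}
\text{projections} &= \underbrace{2 \cdot d \cdot 3d}_{\text{QKV}} + \underbrace{2 \cdot d \cdot d}_{\text{output}} + \underbrace{2 \cdot d \cdot 4d}_{\text{FFN up}} + \underbrace{2 \cdot 4d \cdot d}_{\text{FFN down}} = 24\,d^2 \approx 4.0 \times 10^{8} \\
\text{attention} &= \underbrace{2 \cdot n \cdot d}_{\text{scores}} + \underbrace{2 \cdot n \cdot d}_{\text{weighted sum of } V} = 4\,n\,d \approx 3.4 \times 10^{7} \\
\frac{\text{projections}}{\text{attention}} &= \frac{24\,d^2}{4\,n\,d} = \frac{6d}{n} = 12
\end{aligned}
\]

Under these assumptions the ratio shrinks as the context grows, and the two terms break even at \( n = 6d \).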

flowchart TB
    subgraph Layer4["Layer 4 -- Transformers"]
        ATT["Multi-Head Attention"]
        FFN["Feed-Forward Network"]
    end
    subgraph Layer2["Layer 2 -- Linear Algebra"]
        MATMUL["matmulSIMD_f32"]
        QMATMUL["quantized matmul"]
        DEQUANT["dequantize()"]
    end

    ATT -->|"Q, K, V projections"| MATMUL
    ATT -->|"score = Q * K^T"| MATMUL
    FFN -->|"up / gate / down"| QMATMUL
    QMATMUL --> DEQUANT
    DEQUANT --> MATMUL
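
The quantized matmul → `dequantize()` → matmul path in the diagram can be sketched for a single row. The block layout below is assumed from llama.cpp's `block_q8_0` (one f16 scale followed by 32 signed 8-bit values); the names `BlockQ8_0`, `dequantizeBlock`, and `dotQ8_0` are hypothetical and may not match this project's code. The sketch assumes Zig 0.11 or later.

```zig
const std = @import("std");

// Block layout assumed from llama.cpp's ggml (block_q8_0): one f16 scale
// followed by 32 signed 8-bit quants.
const QK8_0 = 32;
const BlockQ8_0 = extern struct {
    d: f16, // per-block scale
    qs: [QK8_0]i8, // quantized weights
};

/// Dequantize one block: out[i] = d * qs[i].
pub fn dequantizeBlock(block: *const BlockQ8_0, out: *[QK8_0]f32) void {
    const scale: f32 = block.d; // f16 widens to f32 implicitly
    for (block.qs, out) |q, *x| {
        x.* = scale * @as(f32, @floatFromInt(q));
    }
}

/// Dot product of a quantized row with an f32 activation vector. Each block
/// is dequantized into a small scratch buffer, so the full f32 row is never
/// materialized.
pub fn dotQ8_0(blocks: []const BlockQ8_0, x: []const f32) f32 {
    std.debug.assert(x.len == blocks.len * QK8_0);

    var sum: f32 = 0.0;
    var tmp: [QK8_0]f32 = undefined;
    for (blocks, 0..) |*block, b| {
        dequantizeBlock(block, &tmp);
        const xs = x[b * QK8_0 .. (b + 1) * QK8_0];
        var block_sum: f32 = 0.0;
        for (tmp, xs) |w, xi| block_sum += w * xi;
        sum += block_sum;
    }
    return sum;
}

test "Q8_0 dequantize and dot" {
    var block = BlockQ8_0{ .d = 0.5, .qs = undefined };
    for (&block.qs, 0..) |*q, i| q.* = @intCast(i); // quants 0..31
    const blocks = [_]BlockQ8_0{block};
    const x = [_]f32{1.0} ** QK8_0;

    // 0.5 * (0 + 1 + ... + 31) = 0.5 * 496 = 248
    const got = dotQ8_0(&blocks, &x);
    try std.testing.expectApproxEqAbs(@as(f32, 248.0), got, 1e-3);
}
```

A production kernel would more likely vectorise the inner loop and apply the scale once per block; the scalar version is kept here only to make the data flow visible.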

Quantization Format Landscape

The following table summarises every quantization format implemented in Layer 2, ordered by bits per weight. Detailed coverage is split across three dedicated pages.

Format	Bits per Weight	Page	Family
IQ1_S	1.5	Importance Quantization	IQ
IQ1_M	1.75	Importance Quantization	IQ
IQ2_XXS	2.06	Importance Quantization	IQ
IQ2_XS	2.31	Importance Quantization	IQ
IQ2_S	2.5	Importance Quantization	IQ
IQ2_M	2.7	Importance Quantization	IQ
IQ3_XXS	3.06	Importance Quantization	IQ
IQ3_XS	3.3	Importance Quantization	IQ
IQ3_S	3.44	Importance Quantization	IQ
IQ4_XS	4.25	Importance Quantization	IQ
Q4_0	4.5	Quantization Fundamentals	Basic
Q4_K	4.5	K-Quantization	K
IQ4_NL	4.5	Importance Quantization	IQ
Q5_K	5.5	K-Quantization	K
Q6_K	6.5	K-Quantization	K
Q8_0	8.5	Quantization Fundamentals	Basic
F16	16.0	Quantization Fundamentals	Unquantized
F32	32.0	Quantization Fundamentals	Unquantized
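
The bits-per-weight column follows from the block layout. As a sketch, assuming the llama.cpp block sizes (Q4_0: a 2-byte f16 scale plus 16 bytes of packed 4-bit values per 32 weights; Q4_K: 144 bytes per 256-weight super-block):

\[
\text{bpw} = \frac{8 \cdot \text{bytes per block}}{\text{weights per block}}, \qquad
\text{bpw}_{\text{Q4\_0}} = \frac{8 \cdot (2 + 16)}{32} = 4.5, \qquad
\text{bpw}_{\text{Q4\_K}} = \frac{8 \cdot 144}{256} = 4.5
\]

Weight storage is then \( N \cdot \text{bpw} / 8 \) bytes for \( N \) weights. For a hypothetical model with \( N = 7 \times 10^9 \) weights:

\[
\text{size}_{4.5\,\text{bpw}} = \frac{7 \times 10^{9} \cdot 4.5}{8} \approx 3.9 \times 10^{9}\ \text{bytes}, \qquad
\text{size}_{\text{F16}} = \frac{7 \times 10^{9} \cdot 16}{8} = 1.4 \times 10^{10}\ \text{bytes}
\]

That is a compression ratio of \( 16 / 4.5 \approx 3.6 \) relative to F16, before counting any tensors that are kept at higher precision.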

Prerequisites

Assumed knowledge

Basic linear algebra: matrix multiplication, dot products, transpose.
Binary representations: IEEE 754 floating-point, two's complement integers.
Familiarity with Zig syntax (see Getting Started).
Completion of Layer 1: Foundation is recommended but not strictly required.

Key References

The material in this layer draws on the following sources:

Dettmers, T. et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." NeurIPS, 2022. https://arxiv.org/abs/2208.07339
Frantar, E. et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR, 2023. https://arxiv.org/abs/2210.17323
Gerganov, G. et al. "llama.cpp -- Inference of LLaMA model in pure C/C++." GitHub, 2023. https://github.com/ggerganov/llama.cpp
Intel Corporation. "Intel 64 and IA-32 Architectures Optimization Reference Manual." 2024. https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html