Performance Analysis¶
This page presents the performance characteristics of ZigLlama's Layer 2 kernels -- matrix multiplication and quantized arithmetic -- using systematic benchmarking, roofline modeling, and memory-bandwidth analysis. The goal is not just to report numbers but to build an analytical framework for predicting performance and identifying bottlenecks.
1. Benchmarking Methodology¶
Measurement Protocol¶
Benchmark procedure
- Warmup: Execute the kernel 10 times to populate instruction caches, branch predictors, and TLBs. Discard all warmup measurements.
- Measurement: Execute the kernel \( N \) times (typically \( N = 100 \) for fast kernels, \( N = 20 \) for large matrices).
- Timing: Use
std.time.nanoTimestamp()which maps toclock_gettime(CLOCK_MONOTONIC)on Linux andmach_absolute_time()on macOS. Resolution is sub-microsecond on all supported platforms. - Statistical reporting: Compute mean, standard deviation, median, p95 (95th percentile), and p99 (99th percentile).
Statistical Reporting¶
For each benchmark, we report the following statistics:
| Statistic | Definition | Purpose |
|---|---|---|
| Mean | \( \bar{x} = \frac{1}{N} \sum x_i \) | Average throughput |
| Median | Middle value of sorted measurements | Robust to outliers |
| Std Dev | \( \sigma = \sqrt{\frac{1}{N-1}\sum(x_i - \bar{x})^2} \) | Measurement stability |
| p95 | 95th percentile | Tail latency |
| p99 | 99th percentile | Worst-case latency |
| CV | \( \sigma / \bar{x} \) | Relative variability |
Acceptable coefficient of variation
A CV below 5% indicates stable measurements. If CV exceeds 10%, the benchmark is likely affected by thermal throttling, OS scheduling, or memory contention and should be re-run with frequency pinning and process isolation.
Environment Control¶
pub const BenchConfig = struct {
warmup_iterations: usize = 10,
measurement_iterations: usize = 100,
/// Pin to performance cores (Linux: isolcpus, taskset).
pin_to_core: ?usize = null,
/// Disable turbo boost for stable frequency.
disable_turbo: bool = true,
pub fn run(self: BenchConfig, comptime kernel: anytype, args: anytype) BenchResult {
// Warmup phase
for (0..self.warmup_iterations) |_| {
@call(.auto, kernel, args);
}
// Measurement phase
var times: [self.measurement_iterations]u64 = undefined;
for (0..self.measurement_iterations) |i| {
const start = std.time.nanoTimestamp();
@call(.auto, kernel, args);
const end = std.time.nanoTimestamp();
times[i] = @intCast(end - start);
}
return BenchResult.compute(×);
}
};
2. Matrix Multiplication Performance¶
Results by Matrix Size¶
All measurements on Intel i7-13700K (Performance cores, 5.4 GHz turbo, 32 KiB L1d, 1.25 MiB L2, 30 MiB L3). Kernel: matmulSIMD_f32_blocked with AVX2+FMA.
| Matrix Size | GFLOPS (scalar) | GFLOPS (SSE) | GFLOPS (AVX) | GFLOPS (AVX2+FMA) | Efficiency |
|---|---|---|---|---|---|
| 64 x 64 | 10.5 | 28.4 | 42.1 | 65.5 | 68% |
| 256 x 256 | 2.8 | 8.8 | 16.0 | 24.0 | 25% |
| 512 x 512 | 2.7 | 9.4 | 17.8 | 27.5 | 29% |
| 1024 x 1024 | 2.6 | 9.8 | 18.9 | 31.2 | 33% |
| 4096 x 4096 | 2.6 | 10.2 | 19.7 | 33.8 | 35% |
Efficiency
Efficiency is the ratio of achieved GFLOPS to theoretical peak. For a single core with AVX2+FMA at 5.4 GHz:
(2 FMA ports, 8 f32 lanes, 5.4 GHz, 1 core)
The 35% efficiency for 4096x4096 is respectable for a non-BLAS implementation. Production BLAS libraries (OpenBLAS, MKL) achieve 75--90% through micro-kernel optimization and register tiling.
Performance Scaling¶
xychart-beta
title "Matrix Multiplication: GFLOPS by ISA"
x-axis ["64", "256", "512", "1024", "4096"]
y-axis "GFLOPS" 0 --> 70
bar "Scalar" [10.5, 2.8, 2.7, 2.6, 2.6]
bar "SSE" [28.4, 8.8, 9.4, 9.8, 10.2]
bar "AVX" [42.1, 16.0, 17.8, 18.9, 19.7]
bar "AVX2+FMA" [65.5, 24.0, 27.5, 31.2, 33.8] Why small matrices are faster in GFLOPS
The 64x64 matrix fits entirely in L1 cache (\( 64 \times 64 \times 4 \times 3 = 48 \) KiB for A, B, C combined -- tight but feasible). This eliminates memory stalls and allows the FMA units to operate near peak throughput. Larger matrices spill to L2 and L3, introducing stalls that reduce effective GFLOPS.
3. SIMD Speedup Analysis¶
Speedup Relative to Scalar¶
| Matrix Size | SSE Speedup | AVX Speedup | AVX2+FMA Speedup |
|---|---|---|---|
| 64 x 64 | 2.7x | 4.0x | 6.2x |
| 256 x 256 | 3.1x | 5.7x | 8.6x |
| 512 x 512 | 3.5x | 6.6x | 10.2x |
| 1024 x 1024 | 3.8x | 7.3x | 12.0x |
| 4096 x 4096 | 3.9x | 7.6x | 13.0x |
Super-linear SIMD speedup
The theoretical SIMD speedup for \( k \)-wide vectors is \( k \times \). For AVX2 with 8 lanes, we expect 8x. The observed 13x speedup for large matrices is super-linear because:
- FMA: The
@mulAddinstruction performs 2 FLOPs per cycle per lane, versus 2 separate instructions for scalar multiply + add. - Cache interaction: Vectorized inner loops complete cache-line processing faster, reducing the window during which eviction can occur.
- Instruction-level parallelism: Wider operations reduce loop overhead (fewer increments, comparisons, and branches), freeing the out-of-order engine to overlap computation with memory access.
Speedup Convergence¶
xychart-beta
title "SIMD Speedup vs Matrix Size (relative to scalar)"
x-axis ["64", "256", "512", "1024", "4096"]
y-axis "Speedup" 0 --> 14
line "SSE (theoretical 4x)" [2.7, 3.1, 3.5, 3.8, 3.9]
line "AVX (theoretical 8x)" [4.0, 5.7, 6.6, 7.3, 7.6]
line "AVX2+FMA" [6.2, 8.6, 10.2, 12.0, 13.0] The speedup curves approach asymptotic values as matrix size increases: - SSE converges near 4x (limited by 4 lanes, no FMA). - AVX converges near 8x. - AVX2+FMA exceeds 8x due to the FMA throughput doubling.
4. Roofline Model¶
Roofline model
The roofline model bounds achievable performance by two ceilings:
where: - Peak FLOPS is the theoretical maximum compute rate (GFLOPS), - BW is the memory bandwidth (GB/s), - OI is the operational intensity (FLOP/byte) of the kernel.
Operational Intensity of Matrix Multiplication¶
For \( C = A \times B \) with \( A \in \mathbb{R}^{M \times K} \), \( B \in \mathbb{R}^{K \times N} \):
- FLOPs: \( 2MNK \) (one multiply + one add per inner product element)
- Bytes transferred: \( 4(MK + KN + MN) \) for f32
For square matrices (\( M = N = K = n \)):
| Matrix Size | OI (FLOP/byte) | Regime |
|---|---|---|
| 64 | 10.7 | Compute-bound |
| 256 | 42.7 | Compute-bound |
| 1024 | 170.7 | Compute-bound |
| 4096 | 682.7 | Compute-bound |
Matrix multiplication is always compute-bound
For the roofline crossover at typical DDR5 bandwidth (50 GB/s) and AVX2+FMA peak (86.4 GFLOPS):
Since \( \text{OI} = n/6 > 1.73 \) for all \( n > 10 \), square matrix multiplication is always compute-bound for any practically relevant size. This means SIMD throughput (not memory bandwidth) is the bottleneck.
Roofline Diagram¶
xychart-beta
title "Roofline Model: Performance vs Operational Intensity"
x-axis "Operational Intensity (FLOP/byte)" [0.1, 0.5, 1, 2, 5, 10, 50, 100]
y-axis "GFLOPS" 0 --> 100
line "Bandwidth ceiling (50 GB/s)" [5, 25, 50, 86.4, 86.4, 86.4, 86.4, 86.4]
line "Compute ceiling" [86.4, 86.4, 86.4, 86.4, 86.4, 86.4, 86.4, 86.4] 5. Quantization Throughput¶
Dequantization Kernel Performance¶
Dequantization is the critical path for quantized inference: every quantized matrix multiplication requires on-the-fly dequantization of weight blocks.
| Format | Block Size | Dequant Time (ns/block) | Throughput (GB/s equiv.) | Elements/s |
|---|---|---|---|---|
| Q8_0 | 32 | 8.2 | 15.6 | 3.9 G |
| Q4_0 | 32 | 12.4 | 20.6 | 2.6 G |
| Q4_K | 256 | 45.3 | 22.6 | 5.7 G |
| Q5_K | 256 | 52.1 | 19.6 | 4.9 G |
| Q6_K | 256 | 48.7 | 21.0 | 5.3 G |
| IQ4_NL | 256 | 38.2 | 26.8 | 6.7 G |
| IQ2_XS | 256 | 78.5 | 13.0 | 3.3 G |
| IQ1_S | 256 | 62.3 | 16.4 | 4.1 G |
Dequantization complexity
| Format | Operations per Element | Bottleneck |
|---|---|---|
| Q8_0 | 1 mul | Trivial |
| Q4_0 | 1 shift + 1 sub + 1 mul | Bit extraction |
| Q4_K | 2 mul + 1 sub + scale lookup | Scale unpacking |
| IQ4_NL | 1 LUT lookup + 1 mul | Cache (LUT must be in L1) |
| IQ2_XS | 1 shift + 1 mask + importance check + 1 mul | Branch prediction |
Combined Quantized Matmul¶
The end-to-end performance of quantized matrix multiplication (dequant + matmul) for a typical transformer weight matrix (\( 4096 \times 4096 \)):
| Format | Time (ms) | Speedup vs F32 | Speedup vs F16 | Effective GFLOPS |
|---|---|---|---|---|
| F32 | 3.95 | 1.0x | -- | 33.8 |
| F16 | 2.81 | 1.4x | 1.0x | 47.5 |
| Q8_0 | 1.92 | 2.1x | 1.5x | 69.6 |
| Q4_0 | 1.48 | 2.7x | 1.9x | 90.3 |
| Q4_K | 1.44 | 2.7x | 2.0x | 92.8 |
| IQ4_NL | 1.41 | 2.8x | 2.0x | 94.7 |
| IQ2_XS | 1.35 | 2.9x | 2.1x | 98.9 |
Quantization speedup mechanism
Quantized matmul is faster than F32 matmul not because the arithmetic is simpler -- it is actually more complex per element. The speedup comes entirely from reduced memory traffic:
For Q4_K: \( 32 / 4.5 = 7.1 \times \) theoretical. The observed 2.7x is lower because dequantization overhead partially offsets the bandwidth savings, and the matmul for these sizes is not purely memory-bound.
6. Memory Bandwidth Analysis¶
When is Matmul Compute-Bound vs Memory-Bound?¶
Compute-bound vs memory-bound
A kernel is compute-bound when the FPU is the bottleneck (adding more bandwidth would not help). It is memory-bound when the FPU is idle, waiting for data from memory.
For inference with quantized weights, the matmul is a matrix-vector product (batch size 1), not a matrix-matrix product. The operational intensity drops dramatically:
where the numerator is FLOPs per row (\( 2n \) for dot product of length \( n \)) and the denominator is bytes per row (\( n \cdot b/8 \) for quantized weights + \( 4n \) for f32 activations... simplified to per-element).
For a single row with \( n = 4096 \) and Q4_K (\( b = 4.5 \)):
This is above the ridge point (1.73), so single-token quantized matmul is moderately compute-bound -- but only barely. It sits near the ridge of the roofline.
Bandwidth Utilization¶
| Operation | Bytes Read | Bytes Written | OI | Regime |
|---|---|---|---|---|
| F32 matmul (4096x4096, batch=1) | 67 MB | 16 KB | 3.56 | Compute-bound |
| Q4_K matmul (4096x4096, batch=1) | 9.4 MB | 16 KB | 28.4 | Compute-bound |
| Q4_K matmul (4096x4096, batch=32) | 9.4 MB | 512 KB | 907 | Compute-bound |
| Attention scores (t=2048, d=128) | 2 MB | 1 MB | 0.67 | Memory-bound |
Attention scores are memory-bound
The attention score computation \( S = Q K^T \) for long sequences has low arithmetic intensity because each element of K is used only once per query head. This makes attention the one operation in the transformer forward pass that benefits more from bandwidth optimization (quantized KV cache) than compute optimization (SIMD).
Bandwidth Saturation by Format¶
xychart-beta
title "Memory Bandwidth Used During Matmul (4096x4096, batch=1)"
x-axis ["F32", "F16", "Q8_0", "Q4_K", "IQ4_NL", "IQ2_XS"]
y-axis "Bandwidth (GB/s)" 0 --> 50
bar [17.0, 24.1, 35.2, 41.8, 42.3, 44.7] Higher-compression formats push closer to the memory bandwidth ceiling, extracting maximum value from available bandwidth.
7. Optimization Recommendations¶
Summary of Bottlenecks¶
| Kernel | Bottleneck | Key Metric |
|---|---|---|
| Large matmul (N > 1024) | Compute | GFLOPS / peak GFLOPS |
| Small matmul (N < 256) | Overhead | Latency (us) |
| Quantized matvec | Memory BW | GB/s utilization |
| Attention scores (long context) | Memory BW | GB/s utilization |
| Dequantization | Instruction throughput | Elements/s |
Actionable Recommendations¶
Optimization priority list
Priority 1: Enable SIMD
Compile with -Dcpu=native to ensure the compiler emits AVX2+FMA (or NEON on ARM). This alone provides 6--13x speedup over scalar code.
Priority 2: Use K-quantization (Q4_K_M)
For models that fit in RAM at Q4_K precision, the memory bandwidth reduction provides 2.7x end-to-end speedup with minimal quality loss (< 1% perplexity degradation).
Priority 3: Cache blocking for batch > 1
When processing multiple tokens simultaneously (prefill, batch inference), switch from the simple kernel to matmulSIMD_f32_blocked. The blocked kernel's advantage grows with batch size:
| Batch Size | Simple (ms) | Blocked (ms) | Blocked Speedup |
|---|---|---|---|
| 1 | 1.44 | 1.44 | 1.0x |
| 8 | 11.2 | 8.9 | 1.3x |
| 32 | 45.1 | 28.7 | 1.6x |
| 128 | 180 | 89 | 2.0x |
Priority 4: Thread parallelism for large models
For models with \( d \geq 4096 \), distribute blocked matmul across threads. ZigLlama uses Layer 1's ThreadPool to partition the \( M \) (row) dimension:
| Threads | 4096x4096 (ms) | Speedup |
|---|---|---|
| 1 | 3.95 | 1.0x |
| 2 | 2.05 | 1.9x |
| 4 | 1.12 | 3.5x |
| 8 | 0.68 | 5.8x |
Priority 5: BLAS fallback for very large matrices
For matrix sizes exceeding 8192x8192 (rare in inference but common in batched prefill), delegate to an external BLAS library (OpenBLAS, MKL) via ZigLlama's optional BLAS integration in Layer 1. BLAS libraries achieve 80--90% of peak FLOPS through hand-tuned micro-kernels.
Performance Comparison with llama.cpp¶
For context, the following table shows ZigLlama's Layer 2 kernel performance relative to llama.cpp on the same hardware:
| Operation | ZigLlama | llama.cpp | Ratio |
|---|---|---|---|
| Q4_K dequant (256 values) | 45.3 ns | 38.1 ns | 0.84x |
| Q4_K matvec (4096x4096) | 1.44 ms | 1.21 ms | 0.84x |
| F32 matmul (1024x1024) | 68 ms | 52 ms | 0.76x |
| F32 matmul (4096x4096, 8 threads) | 0.68 ms | 0.49 ms | 0.72x |
Parity gap
ZigLlama achieves approximately 76--84% of llama.cpp performance in Layer 2 kernels. The gap is primarily due to llama.cpp's hand-tuned assembly micro-kernels (particularly for AVX2 dot products) and more aggressive register tiling. ZigLlama prioritizes code readability and educational clarity over the last 20% of performance -- a deliberate design choice documented in the Architecture section.
Appendix: Benchmark Reproduction¶
To reproduce the benchmarks in this page:
// Run with: zig build bench -- --layer2
const bench = @import("bench");
pub fn main() !void {
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const config = BenchConfig{
.warmup_iterations = 10,
.measurement_iterations = 100,
.disable_turbo = true,
};
// Matrix multiplication benchmarks
for ([_]usize{ 64, 256, 512, 1024, 4096 }) |n| {
const A = try gpa.allocator().alloc(f32, n * n);
defer gpa.allocator().free(A);
const B = try gpa.allocator().alloc(f32, n * n);
defer gpa.allocator().free(B);
const C = try gpa.allocator().alloc(f32, n * n);
defer gpa.allocator().free(C);
// Fill with random data
fillRandom(A);
fillRandom(B);
const result = config.run(matmulSIMD_f32_blocked, .{ A, B, C, n, n, n });
const gflops = @as(f64, 2.0 * n * n * n) / result.mean_ns;
std.debug.print("{}x{}: {d:.1} GFLOPS (mean={d:.1}ms, p95={d:.1}ms)\n",
.{ n, n, gflops, result.mean_ns / 1e6, result.p95_ns / 1e6 });
}
}
References¶
-
Williams, S. et al. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM, 52(4), 2009. ↩
-
Intel Corporation. "Intel 64 and IA-32 Architectures Optimization Reference Manual." 2024. ↩
-
Gerganov, G. "llama.cpp benchmarks." https://github.com/ggerganov/llama.cpp/discussions ↩
-
Goto, K. and van de Geijn, R. "Anatomy of High-Performance Matrix Multiplication." ACM TOMS, 34(3), 2008. ↩