CPU Threading and NUMA Optimization¶
Transformer inference is inherently parallelizable: attention heads are independent, matrix rows can be distributed across workers, and multiple requests can be batched. ZigLlama provides a work-stealing thread pool with NUMA awareness to exploit multi-core and multi-socket hardware, implemented in src/foundation/threading.zig.
1. Parallelism in Transformer Inference¶
1.1 Which Operations Parallelize¶
Not all operations in a transformer forward pass benefit equally from threading. The key distinction is between data-parallel (embarrassingly parallel) and sequential operations:
| Operation | Parallel Strategy | Granularity |
|---|---|---|
| Matrix multiply (GEMM) | Partition output rows across workers | Row blocks |
| Multi-head attention | One head per worker (or group of heads) | Head |
| Softmax | Each row is independent | Row |
| RMSNorm / LayerNorm | Each token position is independent | Token |
| Residual addition | Element-wise, trivially parallel | Block |
| Token embedding lookup | Independent per token | Token |
| Autoregressive decoding | Sequential (each token depends on previous) | -- |
| Layer execution | Sequential (layer \( l+1 \) depends on \( l \)) | -- |
The Sequential Bottleneck
During autoregressive generation, each token is produced one at a time. The parallelism is within each token's forward pass, not across tokens (unless batching is used). This limits the achievable speedup per Amdahl's Law (see Section 7).
1.2 Parallelism Diagram¶
flowchart LR
subgraph "Parallel (within one token)"
direction TB
A["GEMM rows"] --> B["Attention heads"]
B --> C["Softmax rows"]
C --> D["FFN rows"]
end
subgraph "Sequential"
direction TB
L1["Layer 1"] --> L2["Layer 2"]
L2 --> L3["..."]
L3 --> LN["Layer N"]
end
T1["Token t"] --> L1
LN --> T2["Token t+1"] 2. ThreadPool Architecture¶
2.1 Configuration¶
pub const ThreadPoolConfig = struct {
num_threads: u32,
numa_policy: NumaPolicy = .balanced,
work_stealing: bool = true,
affinity_enabled: bool = true,
stack_size: usize = 2 * 1024 * 1024, // 2 MB per thread
pub fn detect() ThreadPoolConfig {
const cpu_count = @as(u32, @intCast(std.Thread.getCpuCount() catch 4));
return ThreadPoolConfig{
.num_threads = @max(1, cpu_count - 1),
};
}
};
Thread Count Heuristic
detect() reserves one core for the operating system and I/O. For a dedicated inference server, using all cores (cpu_count) may be preferable. For systems with hyper-threading, using only the number of physical cores often yields better throughput due to reduced cache contention.
2.2 Worker Struct¶
Each worker owns a private work-stealing queue and runs an event loop:
pub const Worker = struct {
id: u32,
thread: std.Thread,
queue: WorkStealingQueue,
pool: *ThreadPool,
should_stop: std.atomic.Atomic(bool),
fn workerLoop(self: *Worker) void {
while (!self.should_stop.load(.Acquire)) {
// 1. Try own queue
var work_item = self.queue.pop();
// 2. If empty, steal from a peer
if (work_item == null) {
work_item = self.stealWork();
}
// 3. Execute or yield
if (work_item) |item| {
item.execute();
} else {
std.Thread.yield() catch {};
}
}
}
};
2.3 ThreadPool Struct¶
pub const ThreadPool = struct {
workers: []Worker,
config: ThreadPoolConfig,
topology: CpuTopology,
allocator: std.mem.Allocator,
next_worker: std.atomic.Atomic(u32),
pub fn submit(self: *ThreadPool, work_item: *WorkItem) bool {
// Round-robin assignment
const id = self.next_worker.fetchAdd(1, .AcqRel)
% @as(u32, @intCast(self.workers.len));
return self.workers[id].queue.push(work_item);
}
};
3. Work-Stealing Algorithm¶
Work stealing balances load when tasks have uneven execution times. Each worker has a deque (double-ended queue): it pushes and pops from the tail (LIFO for locality), while thieves steal from the head (FIFO for fairness)1.
3.1 WorkStealingQueue¶
pub const WorkStealingQueue = struct {
items: []?*WorkItem,
head: std.atomic.Atomic(u32),
tail: std.atomic.Atomic(u32),
capacity: u32,
/// Owner pushes to tail
pub fn push(self: *WorkStealingQueue, item: *WorkItem) bool {
const tail = self.tail.load(.Acquire);
const next = (tail + 1) % self.capacity;
if (next == self.head.load(.Acquire)) return false; // full
self.items[tail] = item;
self.tail.store(next, .Release);
return true;
}
/// Owner pops from tail (LIFO)
pub fn pop(self: *WorkStealingQueue) ?*WorkItem {
const tail = self.tail.load(.Acquire);
if (tail == self.head.load(.Acquire)) return null; // empty
const prev = if (tail == 0) self.capacity - 1 else tail - 1;
const item = self.items[prev];
self.items[prev] = null;
self.tail.store(prev, .Release);
return item;
}
/// Thief steals from head (FIFO) using CAS
pub fn steal(self: *WorkStealingQueue) ?*WorkItem {
const head = self.head.load(.Acquire);
if (head == self.tail.load(.Acquire)) return null; // empty
const item = self.items[head] orelse return null;
const next = (head + 1) % self.capacity;
if (self.head.compareAndSwap(head, next, .AcqRel, .Monotonic) == null) {
return item; // CAS succeeded
}
return null; // contention, try again later
}
};
Work-Stealing Protocol
- Worker \( w \) checks its own deque tail. If non-empty, pop and execute.
- If empty, select victim \( v = (w + 1 + i) \mod P \) for \( i = 0, 1, \ldots, P-1 \) (round-robin).
- Attempt
steal()from \( v \)'s deque head using CAS. - If CAS fails (contention), move to next victim.
- If all victims are empty, call
Thread.yield().
3.2 Lock-Free Correctness¶
The compareAndSwap on head ensures that at most one thief can claim a given work item. The Acquire/Release memory orderings guarantee that:
- A thief sees the
items[head]written by the producer (Release on push, Acquire on steal). - The producer sees the updated
headafter a successful steal (AcqRel on CAS).
4. NUMA Awareness¶
4.1 NUMA Architecture¶
On multi-socket servers, each CPU socket has its own local memory. Accessing remote memory (attached to another socket) incurs 1.5-2x latency:
graph LR
subgraph Socket 0
C0["Cores 0-7"]
M0["DDR5 Bank 0"]
C0 --- M0
end
subgraph Socket 1
C1["Cores 8-15"]
M1["DDR5 Bank 1"]
C1 --- M1
end
C0 ---|"QPI/UPI\n~100ns"| C1
C0 ---|"Local\n~60ns"| M0
C1 ---|"Local\n~60ns"| M1 4.2 NumaPolicy Enum¶
pub const NumaPolicy = enum {
local, // Prefer local NUMA node for all allocations
interleave, // Round-robin pages across all NUMA nodes
balanced, // Automatic balancing based on topology
disabled, // No NUMA awareness (default for single-socket)
};
| Policy | Best For |
|---|---|
local | Latency-sensitive single-stream inference |
interleave | Bandwidth-bound workloads (large batch GEMM) |
balanced | General purpose; auto-selects based on topology |
disabled | Single-socket systems or debugging |
4.3 CpuTopology Detection¶
pub const CpuTopology = struct {
num_cores: u32,
num_threads: u32,
num_numa_nodes: u32,
cache_line_size: u32,
l1_cache_size: u64,
l2_cache_size: u64,
l3_cache_size: u64,
pub fn detect() CpuTopology {
const cpu_count = @as(u32, @intCast(std.Thread.getCpuCount() catch 4));
return CpuTopology{
.num_cores = cpu_count,
.num_threads = cpu_count,
.num_numa_nodes = @max(1, cpu_count / 8),
.cache_line_size = 64,
.l1_cache_size = 32 * 1024,
.l2_cache_size = 256 * 1024,
.l3_cache_size = 8 * 1024 * 1024,
};
}
};
Cache Hierarchy Relevance
The detected cache sizes inform tiling decisions in the GEMM kernel. For example, an L1-resident tile for f32 should fit within 32 KB: \( \sqrt{32768 / 4} \approx 90 \) elements per side, so a \( 64 \times 64 \) tile is a common choice.
5. Parallel Operations¶
5.1 Parallel Matrix Multiplication¶
ParallelOps.matmul partitions the output matrix by rows and distributes chunks to workers:
const MatMulContext = struct {
a: *const Tensor,
b: *const Tensor,
c: *Tensor,
start_row: u32,
end_row: u32,
fn execute(work_item: *WorkItem) void {
const ctx = @as(*MatMulContext, @ptrCast(@alignCast(work_item.data.?)));
for (ctx.start_row..ctx.end_row) |i| {
for (0..ctx.c.shape[1]) |j| {
var sum: f32 = 0.0;
for (0..ctx.a.shape[1]) |k| {
sum += ctx.a.get(.{i, k}) * ctx.b.get(.{k, j});
}
ctx.c.set(.{i, j}, sum);
}
}
}
};
Row Partitioning
For a \( M \times P \) output matrix and \( W \) workers, worker \( w \) computes rows
Each worker accesses its own output rows exclusively -- no synchronization is needed on the output matrix.
5.2 Parallel Softmax¶
Softmax is computed independently per row, making it trivially parallelizable:
ZigLlama's ParallelOps.softmax distributes rows across workers using the same row-partitioning scheme as matmul.
6. Cache-Line Alignment¶
6.1 False Sharing¶
When two threads write to different variables that share a cache line (64 bytes on x86-64), the cache coherency protocol forces both cores to repeatedly invalidate and reload the line -- a phenomenon called false sharing.
6.2 Aligned Allocation¶
pub const ThreadingUtils = struct {
pub fn allocAligned(allocator: std.mem.Allocator,
comptime T: type, count: usize) ![]T {
const cache_line_size = getCacheLineSize(); // 64
const alignment = @max(@alignOf(T), cache_line_size);
const total_size = count * @sizeOf(T) + alignment;
const raw_mem = try allocator.alloc(u8, total_size);
const addr = @intFromPtr(raw_mem.ptr);
const aligned_addr = (addr + alignment - 1) & ~(alignment - 1);
return @as([*]T, @ptrFromInt(aligned_addr))[0..count];
}
};
When to Align
Align the start of each worker's output slice to a cache-line boundary. This prevents false sharing between adjacent workers' output regions during parallel matmul.
7. Amdahl's Law Analysis¶
7.1 The Law¶
Amdahl's Law
If a fraction \( p \) of a program's execution time is perfectly parallelizable and the remaining fraction \( 1 - p \) is serial, the maximum speedup with \( W \) workers is
7.2 Transformer Workload Decomposition¶
For a typical LLaMA-7B forward pass during autoregressive generation (sequence length 1, context length 2048):
| Component | Fraction of Runtime | Parallelizable? |
|---|---|---|
| GEMM (Q/K/V + FFN) | ~75% | Yes (row partitioning) |
| Attention score + softmax | ~10% | Yes (per-head) |
| RMSNorm | ~5% | Yes (per-token) |
| Residual add | ~3% | Yes (element-wise) |
| Overhead (scheduling, sync) | ~5% | No |
| Sequential (layer transitions) | ~2% | No |
So \( p \approx 0.93 \) and \( 1 - p \approx 0.07 \).
7.3 Speedup Predictions¶
xychart-beta
title "Amdahl's Law: Speedup vs. Workers (p=0.93)"
x-axis "Workers" [1, 2, 4, 8, 16, 32, 64]
y-axis "Speedup" 0 --> 15
line [1, 1.87, 3.1, 5.2, 7.6, 10.0, 11.5] Practical Implications
With 93% parallelizable work, diminishing returns begin around 16 cores. Beyond 32 cores, the serial fraction dominates. This is why ZigLlama defaults to cpu_count - 1 workers rather than spawning hundreds of threads, and why batch inference (which increases \( p \) by parallelizing across requests) is critical for throughput-oriented deployments.
8. Performance Monitoring¶
pub const ThreadPoolStats = struct {
total_tasks_executed: std.atomic.Atomic(u64),
total_work_stolen: std.atomic.Atomic(u64),
average_queue_depth: f64,
cpu_utilization: f64,
};
Key metrics to watch:
| Metric | Healthy Range | Indicates Problem When |
|---|---|---|
| Work stolen / total tasks | 5--20% | >40% (unbalanced partitioning) |
| Average queue depth | 0--2 | >10 (submission outpaces execution) |
| CPU utilization | >90% | <50% (too many yields, insufficient work) |
References¶
-
Blumofe, R. and Leiserson, C. "Scheduling Multithreaded Computations by Work Stealing." JACM, 46(5), 1999. ↩
-
Amdahl, G. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." AFIPS, 1967. ↩
-
Drepper, U. "What Every Programmer Should Know About Memory." 2007. ↩
-
Pope, R. et al. "Efficiently Scaling Transformer Inference." MLSys, 2023. ↩