CPU Threading and NUMA Optimization¶

Transformer inference is inherently parallelizable: attention heads are independent, matrix rows can be distributed across workers, and multiple requests can be batched. ZigLlama provides a work-stealing thread pool with NUMA awareness to exploit multi-core and multi-socket hardware, implemented in src/foundation/threading.zig.

1. Parallelism in Transformer Inference¶

1.1 Which Operations Parallelize¶

Not all operations in a transformer forward pass benefit equally from threading. The key distinction is between data-parallel (embarrassingly parallel) and sequential operations:

Operation	Parallel Strategy	Granularity
Matrix multiply (GEMM)	Partition output rows across workers	Row blocks
Multi-head attention	One head per worker (or group of heads)	Head
Softmax	Each row is independent	Row
RMSNorm / LayerNorm	Each token position is independent	Token
Residual addition	Element-wise, trivially parallel	Block
Token embedding lookup	Independent per token	Token
Autoregressive decoding	Sequential (each token depends on previous)	--
Layer execution	Sequential (layer \( l+1 \) depends on \( l \))	--

The Sequential Bottleneck

During autoregressive generation, each token is produced one at a time. The parallelism is within each token's forward pass, not across tokens (unless batching is used). This limits the achievable speedup per Amdahl's Law (see Section 7).

1.2 Parallelism Diagram¶

flowchart LR
    subgraph "Parallel (within one token)"
        direction TB
        A["GEMM rows"] --> B["Attention heads"]
        B --> C["Softmax rows"]
        C --> D["FFN rows"]
    end

    subgraph "Sequential"
        direction TB
        L1["Layer 1"] --> L2["Layer 2"]
        L2 --> L3["..."]
        L3 --> LN["Layer N"]
    end

    T1["Token t"] --> L1
    LN --> T2["Token t+1"]

2. ThreadPool Architecture¶

2.1 Configuration¶

pub const ThreadPoolConfig = struct {
    num_threads: u32,
    numa_policy: NumaPolicy = .balanced,
    work_stealing: bool = true,
    affinity_enabled: bool = true,
    stack_size: usize = 2 * 1024 * 1024,  // 2 MB per thread

    pub fn detect() ThreadPoolConfig {
        const cpu_count = @as(u32, @intCast(std.Thread.getCpuCount() catch 4));
        return ThreadPoolConfig{
            .num_threads = @max(1, cpu_count - 1),
        };
    }
};

Thread Count Heuristic

detect() reserves one core for the operating system and I/O. For a dedicated inference server, using all cores (cpu_count) may be preferable. For systems with hyper-threading, using only the number of physical cores often yields better throughput due to reduced cache contention.

2.2 Worker Struct¶

Each worker owns a private work-stealing queue and runs an event loop:

pub const Worker = struct {
    id: u32,
    thread: std.Thread,
    queue: WorkStealingQueue,
    pool: *ThreadPool,
    should_stop: std.atomic.Atomic(bool),

    fn workerLoop(self: *Worker) void {
        while (!self.should_stop.load(.Acquire)) {
            // 1. Try own queue
            var work_item = self.queue.pop();

            // 2. If empty, steal from a peer
            if (work_item == null) {
                work_item = self.stealWork();
            }

            // 3. Execute or yield
            if (work_item) |item| {
                item.execute();
            } else {
                std.Thread.yield() catch {};
            }
        }
    }
};

2.3 ThreadPool Struct¶

pub const ThreadPool = struct {
    workers: []Worker,
    config: ThreadPoolConfig,
    topology: CpuTopology,
    allocator: std.mem.Allocator,
    next_worker: std.atomic.Atomic(u32),

    pub fn submit(self: *ThreadPool, work_item: *WorkItem) bool {
        // Round-robin assignment
        const id = self.next_worker.fetchAdd(1, .AcqRel)
                   % @as(u32, @intCast(self.workers.len));
        return self.workers[id].queue.push(work_item);
    }
};

3. Work-Stealing Algorithm¶

Work stealing balances load when tasks have uneven execution times. Each worker has a deque (double-ended queue): it pushes and pops from the tail (LIFO for locality), while thieves steal from the head (FIFO for fairness)¹.

3.1 WorkStealingQueue¶

pub const WorkStealingQueue = struct {
    items: []?*WorkItem,
    head: std.atomic.Atomic(u32),
    tail: std.atomic.Atomic(u32),
    capacity: u32,

    /// Owner pushes to tail
    pub fn push(self: *WorkStealingQueue, item: *WorkItem) bool {
        const tail = self.tail.load(.Acquire);
        const next = (tail + 1) % self.capacity;
        if (next == self.head.load(.Acquire)) return false;  // full
        self.items[tail] = item;
        self.tail.store(next, .Release);
        return true;
    }

    /// Owner pops from tail (LIFO)
    pub fn pop(self: *WorkStealingQueue) ?*WorkItem {
        const tail = self.tail.load(.Acquire);
        if (tail == self.head.load(.Acquire)) return null;  // empty
        const prev = if (tail == 0) self.capacity - 1 else tail - 1;
        const item = self.items[prev];
        self.items[prev] = null;
        self.tail.store(prev, .Release);
        return item;
    }

    /// Thief steals from head (FIFO) using CAS
    pub fn steal(self: *WorkStealingQueue) ?*WorkItem {
        const head = self.head.load(.Acquire);
        if (head == self.tail.load(.Acquire)) return null;  // empty
        const item = self.items[head] orelse return null;
        const next = (head + 1) % self.capacity;
        if (self.head.compareAndSwap(head, next, .AcqRel, .Monotonic) == null) {
            return item;  // CAS succeeded
        }
        return null;  // contention, try again later
    }
};

Work-Stealing Protocol

Worker \( w \) checks its own deque tail. If non-empty, pop and execute.
If empty, select victim \( v = (w + 1 + i) \mod P \) for \( i = 0, 1, \ldots, P-1 \) (round-robin).
Attempt steal() from \( v \)'s deque head using CAS.
If CAS fails (contention), move to next victim.
If all victims are empty, call Thread.yield().

3.2 Lock-Free Correctness¶

The compareAndSwap on head ensures that at most one thief can claim a given work item. The Acquire/Release memory orderings guarantee that:

A thief sees the items[head] written by the producer (Release on push, Acquire on steal).
The producer sees the updated head after a successful steal (AcqRel on CAS).

4. NUMA Awareness¶

4.1 NUMA Architecture¶

On multi-socket servers, each CPU socket has its own local memory. Accessing remote memory (attached to another socket) incurs 1.5-2x latency:

graph LR
    subgraph Socket 0
        C0["Cores 0-7"]
        M0["DDR5 Bank 0"]
        C0 --- M0
    end
    subgraph Socket 1
        C1["Cores 8-15"]
        M1["DDR5 Bank 1"]
        C1 --- M1
    end
    C0 ---|"QPI/UPI\n~100ns"| C1
    C0 ---|"Local\n~60ns"| M0
    C1 ---|"Local\n~60ns"| M1

4.2 NumaPolicy Enum¶

pub const NumaPolicy = enum {
    local,       // Prefer local NUMA node for all allocations
    interleave,  // Round-robin pages across all NUMA nodes
    balanced,    // Automatic balancing based on topology
    disabled,    // No NUMA awareness (default for single-socket)
};

Policy	Best For
`local`	Latency-sensitive single-stream inference
`interleave`	Bandwidth-bound workloads (large batch GEMM)
`balanced`	General purpose; auto-selects based on topology
`disabled`	Single-socket systems or debugging

4.3 CpuTopology Detection¶

pub const CpuTopology = struct {
    num_cores: u32,
    num_threads: u32,
    num_numa_nodes: u32,
    cache_line_size: u32,
    l1_cache_size: u64,
    l2_cache_size: u64,
    l3_cache_size: u64,

    pub fn detect() CpuTopology {
        const cpu_count = @as(u32, @intCast(std.Thread.getCpuCount() catch 4));
        return CpuTopology{
            .num_cores = cpu_count,
            .num_threads = cpu_count,
            .num_numa_nodes = @max(1, cpu_count / 8),
            .cache_line_size = 64,
            .l1_cache_size = 32 * 1024,
            .l2_cache_size = 256 * 1024,
            .l3_cache_size = 8 * 1024 * 1024,
        };
    }
};

Cache Hierarchy Relevance

The detected cache sizes inform tiling decisions in the GEMM kernel. For example, an L1-resident tile for f32 should fit within 32 KB: \( \sqrt{32768 / 4} \approx 90 \) elements per side, so a \( 64 \times 64 \) tile is a common choice.

5. Parallel Operations¶

5.1 Parallel Matrix Multiplication¶

ParallelOps.matmul partitions the output matrix by rows and distributes chunks to workers:

const MatMulContext = struct {
    a: *const Tensor,
    b: *const Tensor,
    c: *Tensor,
    start_row: u32,
    end_row: u32,

    fn execute(work_item: *WorkItem) void {
        const ctx = @as(*MatMulContext, @ptrCast(@alignCast(work_item.data.?)));
        for (ctx.start_row..ctx.end_row) |i| {
            for (0..ctx.c.shape[1]) |j| {
                var sum: f32 = 0.0;
                for (0..ctx.a.shape[1]) |k| {
                    sum += ctx.a.get(.{i, k}) * ctx.b.get(.{k, j});
                }
                ctx.c.set(.{i, j}, sum);
            }
        }
    }
};

Row Partitioning

For a \( M \times P \) output matrix and \( W \) workers, worker \( w \) computes rows

\[ \left[ w \cdot \left\lfloor \frac{M}{W} \right\rfloor, \; \min\!\left((w+1) \cdot \left\lfloor \frac{M}{W} \right\rfloor, \; M\right) \right) \]

Each worker accesses its own output rows exclusively -- no synchronization is needed on the output matrix.

5.2 Parallel Softmax¶

Softmax is computed independently per row, making it trivially parallelizable:

\[ \text{softmax}(\mathbf{x})_j = \frac{e^{x_j - \max(\mathbf{x})}}{\sum_i e^{x_i - \max(\mathbf{x})}} \]

ZigLlama's ParallelOps.softmax distributes rows across workers using the same row-partitioning scheme as matmul.

6. Cache-Line Alignment¶

When two threads write to different variables that share a cache line (64 bytes on x86-64), the cache coherency protocol forces both cores to repeatedly invalidate and reload the line -- a phenomenon called false sharing.

6.2 Aligned Allocation¶

pub const ThreadingUtils = struct {
    pub fn allocAligned(allocator: std.mem.Allocator,
                        comptime T: type, count: usize) ![]T {
        const cache_line_size = getCacheLineSize();  // 64
        const alignment = @max(@alignOf(T), cache_line_size);
        const total_size = count * @sizeOf(T) + alignment;
        const raw_mem = try allocator.alloc(u8, total_size);
        const addr = @intFromPtr(raw_mem.ptr);
        const aligned_addr = (addr + alignment - 1) & ~(alignment - 1);
        return @as([*]T, @ptrFromInt(aligned_addr))[0..count];
    }
};

When to Align

Align the start of each worker's output slice to a cache-line boundary. This prevents false sharing between adjacent workers' output regions during parallel matmul.

7. Amdahl's Law Analysis¶

7.1 The Law¶

Amdahl's Law

If a fraction \( p \) of a program's execution time is perfectly parallelizable and the remaining fraction \( 1 - p \) is serial, the maximum speedup with \( W \) workers is

\[ S(W) = \frac{1}{(1 - p) + \frac{p}{W}} \]

7.2 Transformer Workload Decomposition¶

For a typical LLaMA-7B forward pass during autoregressive generation (sequence length 1, context length 2048):

Component	Fraction of Runtime	Parallelizable?
GEMM (Q/K/V + FFN)	~75%	Yes (row partitioning)
Attention score + softmax	~10%	Yes (per-head)
RMSNorm	~5%	Yes (per-token)
Residual add	~3%	Yes (element-wise)
Overhead (scheduling, sync)	~5%	No
Sequential (layer transitions)	~2%	No

So \( p \approx 0.93 \) and \( 1 - p \approx 0.07 \).

7.3 Speedup Predictions¶

\[ \begin{aligned} S(4) &= \frac{1}{0.07 + 0.93/4} \approx 3.1\times \\ S(8) &= \frac{1}{0.07 + 0.93/8} \approx 5.2\times \\ S(16) &= \frac{1}{0.07 + 0.93/16} \approx 7.6\times \\ S(64) &= \frac{1}{0.07 + 0.93/64} \approx 11.5\times \\ S(\infty) &= \frac{1}{0.07} \approx 14.3\times \\ \end{aligned} \]

xychart-beta
    title "Amdahl's Law: Speedup vs. Workers (p=0.93)"
    x-axis "Workers" [1, 2, 4, 8, 16, 32, 64]
    y-axis "Speedup" 0 --> 15
    line [1, 1.87, 3.1, 5.2, 7.6, 10.0, 11.5]

Practical Implications

With 93% parallelizable work, diminishing returns begin around 16 cores. Beyond 32 cores, the serial fraction dominates. This is why ZigLlama defaults to cpu_count - 1 workers rather than spawning hundreds of threads, and why batch inference (which increases \( p \) by parallelizing across requests) is critical for throughput-oriented deployments.

8. Performance Monitoring¶

pub const ThreadPoolStats = struct {
    total_tasks_executed: std.atomic.Atomic(u64),
    total_work_stolen: std.atomic.Atomic(u64),
    average_queue_depth: f64,
    cpu_utilization: f64,
};

Key metrics to watch:

Metric	Healthy Range	Indicates Problem When
Work stolen / total tasks	5--20%	>40% (unbalanced partitioning)
Average queue depth	0--2	>10 (submission outpaces execution)
CPU utilization	>90%	<50% (too many yields, insufficient work)

References¶

Blumofe, R. and Leiserson, C. "Scheduling Multithreaded Computations by Work Stealing." JACM, 46(5), 1999. ↩
Amdahl, G. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities." AFIPS, 1967. ↩
Drepper, U. "What Every Programmer Should Know About Memory." 2007. ↩
Pope, R. et al. "Efficiently Scaling Transformer Inference." MLSys, 2023. ↩