Performance Profiling¶
Optimising inference performance requires measurement before speculation. ZigLlama's profiling module provides high-resolution timing, memory tracking, statistical aggregation, and benchmark orchestration -- all designed to integrate seamlessly with the inference pipeline without distorting the results.
1. Profiler Struct¶
The Profiler is the central data structure for collecting and analysing performance measurements:
pub const Profiler = struct {
measurements: HashMap([]const u8, ArrayList(MeasurementPoint), ...),
statistics: HashMap([]const u8, PerformanceStats, ...),
active_measurements: HashMap([]const u8, MeasurementPoint, ...),
allocator: Allocator,
timer: Timer,
enabled: bool,
pub fn init(allocator: Allocator) !Profiler { ... }
pub fn deinit(self: *Profiler) void { ... }
pub fn setEnabled(self: *Profiler, enabled: bool) void { ... }
pub fn startMeasurement(self: *Profiler, name: []const u8) !void { ... }
pub fn endMeasurement(self: *Profiler, name: []const u8) !void { ... }
pub fn measureBlock(self: *Profiler, name: []const u8) MeasurementBlock { ... }
pub fn getStatistics(self: Profiler, name: []const u8) ?PerformanceStats { ... }
pub fn updatePercentiles(self: *Profiler) !void { ... }
pub fn printReport(self: *Profiler, writer: anytype) !void { ... }
pub fn reset(self: *Profiler) void { ... }
};
1.1 Design Principles¶
- Named measurements: Each operation is identified by a string key (e.g.,
"forward_pass","sampling","kv_cache_update"), enabling per-component analysis. - Zero-overhead when disabled: When
enabled = false, all measurement methods return immediately. - Thread-compatible: While the profiler itself is not thread-safe (each thread should use its own instance), the
MeasurementPointdata is self-contained and can be merged across threads.
2. RAII Measurement: measureBlock¶
The measureBlock pattern leverages Zig's defer to automatically end a measurement when the enclosing scope exits, even on error paths:
pub const MeasurementBlock = struct {
profiler: *Profiler,
name: []const u8,
started: bool,
pub fn init(profiler: *Profiler, name: []const u8) MeasurementBlock { ... }
pub fn deinit(self: MeasurementBlock) void {
if (self.started) {
self.profiler.endMeasurement(self.name) catch {};
}
}
};
Usage:
// Automatic measurement -- timing stops when block exits
{
const block = profiler.measureBlock("forward_pass");
defer block.deinit();
// ... model forward pass ...
// If this errors out, the measurement still ends cleanly
try model.forward(tokens);
}
Nested Measurements
Measurements can be nested to profile sub-operations within a larger operation:
{
const outer = profiler.measureBlock("generation_step");
defer outer.deinit();
{
const inner1 = profiler.measureBlock("forward_pass");
defer inner1.deinit();
try model.forward(tokens);
}
{
const inner2 = profiler.measureBlock("sampling");
defer inner2.deinit();
_ = try generator.sampleToken(logits);
}
}
3. PerformanceStats¶
Each named operation accumulates statistics in a PerformanceStats struct:
pub const PerformanceStats = struct {
count: u64,
min_duration: f64, // milliseconds
max_duration: f64,
avg_duration: f64,
median_duration: f64,
p95_duration: f64, // 95th percentile
p99_duration: f64, // 99th percentile
std_deviation: f64,
total_memory_allocated: u64,
peak_memory_usage: usize,
};
3.1 Running Average Update¶
Statistics are updated incrementally using a running average:
This avoids storing all measurements in memory just to compute the mean.
3.2 Percentile Calculation¶
Percentiles require the full measurement history. The updatePercentiles method sorts the stored durations and computes the median, P95, and P99:
pub fn calculatePercentiles(self: *PerformanceStats, durations: []f64) void {
std.mem.sort(f64, durations, {}, std.sort.asc(f64));
self.median_duration = durations[durations.len / 2];
self.p95_duration = durations[(durations.len * 95) / 100];
self.p99_duration = durations[(durations.len * 99) / 100];
// Standard deviation
var variance: f64 = 0.0;
for (durations) |d| {
const diff = d - self.avg_duration;
variance += diff * diff;
}
variance /= @as(f64, @floatFromInt(durations.len));
self.std_deviation = @sqrt(variance);
}
Why P95 and P99?
The average latency is misleading for user-facing systems because it hides long-tail outliers. The P95 tells you that 95% of requests complete within that time; the P99 catches the worst-case spikes that users may actually experience. For production SLOs, P99 is the standard metric.
3.3 Memory Metrics¶
Each MeasurementPoint records memory usage at start and end:
pub const MeasurementPoint = struct {
name: []const u8,
start_time: u64,
end_time: u64,
start_memory: usize,
end_memory: usize,
context: ?[]const u8,
pub fn duration(self: MeasurementPoint) f64 { ... }
pub fn memoryDelta(self: MeasurementPoint) i64 { ... }
};
A positive memoryDelta indicates net allocation; a negative delta indicates net deallocation. The peak memory across all measurements is tracked in PerformanceStats.peak_memory_usage.
4. BenchmarkRunner¶
The BenchmarkRunner orchestrates systematic benchmarks with warmup, measurement rounds, and statistical analysis:
pub const BenchmarkRunner = struct {
allocator: Allocator,
profiler: Profiler,
config: BenchmarkConfig,
pub fn init(allocator: Allocator, config: BenchmarkConfig) !BenchmarkRunner { ... }
pub fn deinit(self: *BenchmarkRunner) void { ... }
pub fn benchmarkGeneration(self: *BenchmarkRunner, generator: *TextGenerator,
test_prompts: []const []const u8,
config: GenerationConfig) !BenchmarkResult { ... }
};
4.1 BenchmarkConfig¶
pub const BenchmarkConfig = struct {
warmup_runs: u32 = 5, // Warm up CPU caches and JIT
measurement_runs: u32 = 100, // Statistical sample size
max_time_per_run: u32 = 10000, // Timeout per run (ms)
show_progress: bool = true,
memory_limit: ?usize = null,
};
Why Warmup Matters
The first few runs of any benchmark are typically slower due to cold CPU caches, branch predictor training, and (on some platforms) dynamic frequency scaling. Warmup runs ensure that subsequent measurements reflect steady-state performance. Five warmup runs is generally sufficient for CPU inference.
4.2 BenchmarkResult¶
pub const BenchmarkResult = struct {
name: []const u8,
config: BenchmarkConfig,
stats: PerformanceStats,
throughput: ThroughputMetrics,
memory: MemoryMetrics,
success_rate: f32,
total_duration_ms: f64,
};
Where ThroughputMetrics and MemoryMetrics provide domain-specific measurements:
pub const ThroughputMetrics = struct {
requests_per_second: f32,
tokens_per_second: f32,
characters_per_second: f32,
batches_per_second: f32,
};
pub const MemoryMetrics = struct {
peak_usage: usize,
average_usage: usize,
per_request_allocation: usize,
efficiency: f32, // useful bytes / total bytes
};
5. Roofline Model¶
Roofline Model
The roofline model characterises the performance of a computation by two hardware parameters and one application parameter:
- \( \pi \): peak computational throughput (FLOPS)
- \( \beta \): peak memory bandwidth (bytes/second)
- \( I \): arithmetic intensity (FLOPS per byte transferred)
The achievable performance is:
If \( \beta \cdot I < \pi \), the operation is memory-bound (limited by how fast data can be fed to the compute units). Otherwise it is compute-bound (limited by raw arithmetic throughput).
5.1 Arithmetic Intensity of Key Operations¶
| Operation | FLOPS | Bytes Transferred | Intensity \( I \) | Typical Bound |
|---|---|---|---|---|
| Matrix multiply \( (m,n) \times (n,p) \) | \( 2mnp \) | \( 4(mn + np + mp) \) | \( \approx \frac{m}{2} \) for large \( n \) | Compute (large \( m \)) |
| Attention \( \mathbf{Q}\mathbf{K}^{\!\top} \) | \( 2 n^2 d_h \) | \( 4(2n d_h + n^2) \) | \( \approx d_h / 2 \) | Memory (small \( d_h \)) |
| Softmax | \( 4n \) | \( 8n \) | 0.5 | Memory |
| RMSNorm | \( 3d \) | \( 8d \) | 0.375 | Memory |
| Token embedding lookup | 0 | \( 4d \) | 0 | Memory |
5.2 Applying the Roofline to Inference¶
For single-token generation on a typical CPU:
- \( \pi \approx 100 \) GFLOPS (AVX2, 8 cores)
- \( \beta \approx 50 \) GB/s (DDR4 dual-channel)
- Model weight loading per token: \( \approx 4d^2 L \) bytes (all weight matrices)
- Compute per token: \( \approx 8d^2 L \) FLOPS
The arithmetic intensity is:
Single-token generation sits right at the boundary -- it is memory-bound for single sequences (batch size 1) and shifts toward compute-bound as batch size increases.
Practical Implication
For single-request inference, the dominant bottleneck is loading model weights from RAM, not computing the matrix products. This is why quantisation (which reduces weight size) has such a large impact on single-request throughput: Q4 quantisation roughly halves the bytes transferred per token, nearly doubling inference speed.
6. Bottleneck Identification¶
The profiling module helps identify whether an operation is compute-bound or memory-bound by comparing actual throughput against the roofline:
| Symptom | Diagnosis | Optimisation |
|---|---|---|
| Throughput scales linearly with batch size | Memory-bound | Quantise weights, increase bandwidth |
| Throughput plateaus with larger batch | Compute-bound | Use BLAS, vectorise inner loops |
| P99 >> P50 | Tail latency spikes | Check GC pauses, thread contention |
| Memory grows linearly with sequence length | KV cache dominant | Use sliding window, quantise cache |
| First-token latency >> per-token latency | Prefill bottleneck | Chunk prefill, cache prompt tokens |
6.1 Profiling Workflow¶
Bottleneck Identification Workflow
- Instrument the inference pipeline with named measurements:
forward_pass,kv_cache_update,sampling,token_decode. - Run the benchmark with
BenchmarkRunner(5 warmup + 100 measurement rounds). - Analyse the
BenchmarkResult:- Which operation has the highest P99?
- What fraction of total time does each operation consume?
- Does memory grow unexpectedly?
- Compute arithmetic intensity for the bottleneck operation.
- Compare against the roofline to determine if it is memory-bound or compute-bound.
- Apply the appropriate optimisation and re-measure.
7. Integration with Inference Pipeline¶
The profiler integrates at each stage of the generation loop:
// Example: profiled generation loop
pub fn generateWithProfiling(
generator: *TextGenerator,
profiler: *Profiler,
prompt: []const u8,
) !GenerationResult {
{
const block = profiler.measureBlock("tokenize");
defer block.deinit();
// ... tokenize prompt ...
}
while (generating) {
{
const block = profiler.measureBlock("forward_pass");
defer block.deinit();
logits = try model.forward(tokens);
}
{
const block = profiler.measureBlock("kv_cache_update");
defer block.deinit();
try cache.appendToSequence(seq_id, layer_id, &keys, &values);
}
{
const block = profiler.measureBlock("sampling");
defer block.deinit();
next_token = try generator.sampleToken(logits);
}
}
{
const block = profiler.measureBlock("decode");
defer block.deinit();
// ... decode tokens to text ...
}
}
After generation, print the report:
Sample output:
=== Performance Profile Report ===
Operation: forward_pass
Measurements: 128
Duration (ms):
Min: 12.50
Max: 18.30
Avg: 14.20
Median: 14.00
P95: 16.80
P99: 17.90
Std Dev: 1.20
Memory:
Total allocated: 524288 bytes
Peak usage: 262144 bytes
Operation: sampling
Measurements: 128
Duration (ms):
Min: 0.05
Max: 0.30
Avg: 0.08
Median: 0.07
P95: 0.15
P99: 0.25
Std Dev: 0.04
Where Does Time Go?
For a typical 7B model on CPU, the time breakdown is approximately:
| Component | Share | Bound |
|---|---|---|
| Forward pass (GEMM) | 85--95% | Memory/Compute |
| KV cache update | 2--5% | Memory |
| Sampling | 1--3% | Compute |
| Tokenize/Decode | < 1% | Compute |
The forward pass dominates, which is why optimisations at layers 1--2 (SIMD, quantisation, BLAS) have the highest return on investment.
References¶
-
Williams, S., Waterman, A. & Patterson, D. "Roofline: An Insightful Visual Performance Model for Multicore Architectures." Communications of the ACM, 52(4), 2009. ↩
-
Ivanov, A. et al. "Data Movement Is All You Need: A Case Study on Optimizing Transformers." MLSys, 2021. ↩
-
Kim, S. et al. "Full Stack Optimization of Transformer Inference: A Survey." arXiv:2302.14017, 2023. ↩
-
Pope, R. et al. "Efficiently Scaling Transformer Inference." MLSys, 2023. ↩