Perplexity Evaluation
Perplexity is the standard intrinsic metric for language models. ZigLlama's evaluation module (src/evaluation/perplexity.zig) provides a configurable evaluator, a sliding-window strategy for sequences that exceed the model's context length, and a benchmark suite for comparing quantisation levels.
What is Perplexity
Given a sequence of \(N\) tokens \(x_1, x_2, \ldots, x_N\), perplexity is defined as:
\[
\text{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\right)
\]
Intuitively, perplexity measures how "surprised" the model is by the data. Lower perplexity means the model assigns higher probability to the observed tokens.
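The definition can be sketched in a few lines of Zig. This is a minimal illustration that assumes the per-token log-probabilities are natural logarithms and are already available; `perplexityFromLogProbs` is a hypothetical helper, not a function from src/evaluation/perplexity.zig.

```zig
const std = @import("std");

/// Perplexity from per-token natural-log probabilities:
/// PPL = exp(-(1/N) * sum_i log p(x_i | x_<i)).
/// Hypothetical helper for illustration; not part of ZigLlama's API.
fn perplexityFromLogProbs(log_probs: []const f64) f64 {
    // An empty sequence has no defined perplexity.
    if (log_probs.len == 0) return std.math.nan(f64);
    var sum: f64 = 0.0;
    for (log_probs) |lp| sum += lp;
    const n: f64 = @floatFromInt(log_probs.len);
    return @exp(-sum / n);
}

test "a uniform guess over four tokens has perplexity 4" {
    const lp = @log(@as(f64, 0.25));
    const log_probs = [_]f64{ lp, lp, lp };
    try std.testing.expectApproxEqAbs(@as(f64, 4.0), perplexityFromLogProbs(&log_probs), 1e-9);
}
```

Summing log-probabilities and exponentiating once at the end avoids the underflow that multiplying raw probabilities would cause on long sequences.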
Relationship to cross-entropy
The exponent is the average negative log-likelihood, which is the empirical cross-entropy \(H\) between the data and the model's predictions. Perplexity is therefore \(2^{H}\) when \(H\) is measured in bits (base-2 logarithms), or \(e^{H}\) when it is measured in nats (natural logarithms).
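A quick worked case makes the relationship concrete. Suppose, purely as an illustration, that a model assigns the uniform probability \(1/V\) to every token of a vocabulary of size \(V\) (the symbol \(V\) is introduced here for the example). The cross-entropy in nats and the resulting perplexity are then:

```latex
\[
H = -\frac{1}{N}\sum_{i=1}^{N} \ln\frac{1}{V} = \ln V
\quad\Longrightarrow\quad
\text{PPL} = e^{H} = V
\]
```

A perplexity of \(k\) can therefore be read as the model being, on average, as uncertain as a uniform choice among \(k\) tokens.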
Related metrics computed alongside perplexity:
| Metric | Formula | Interpretation |
|---|---|---|
| Log-perplexity | \(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\) | Numerically stable form; used internally. |
| Bits per token (BPT) | \(\text{log-PPL} / \ln 2\) | Information cost per token in bits (with log-PPL in nats). |
| Bits per character (BPC) | \(\text{BPT} \times N_{\text{tokens}} / N_{\text{chars}}\) | Normalises across tokenisers. |
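The conversions in the table are simple enough to sketch directly. The snippet below assumes log-perplexity is expressed in nats; `DerivedMetrics` and `deriveMetrics` are illustrative names, not part of the module.

```zig
const std = @import("std");

/// Hypothetical container for the derived metrics; field names are illustrative.
const DerivedMetrics = struct {
    bits_per_token: f64,
    bits_per_char: f64,
};

/// Converts log-perplexity (in nats) into bits per token and bits per character.
fn deriveMetrics(log_ppl_nats: f64, total_tokens: usize, total_chars: usize) DerivedMetrics {
    // Dividing by ln 2 changes the unit from nats to bits.
    const bpt = log_ppl_nats / std.math.ln2;
    const tokens: f64 = @floatFromInt(total_tokens);
    const chars: f64 = @floatFromInt(total_chars);
    return .{
        .bits_per_token = bpt,
        // Total bits spread over characters; undefined for empty text.
        .bits_per_char = if (total_chars == 0) std.math.nan(f64) else bpt * tokens / chars,
    };
}

test "log-perplexity of ln 4 nats is 2 bits per token" {
    const m = deriveMetrics(@log(@as(f64, 4.0)), 100, 400);
    try std.testing.expectApproxEqAbs(@as(f64, 2.0), m.bits_per_token, 1e-12);
    try std.testing.expectApproxEqAbs(@as(f64, 0.5), m.bits_per_char, 1e-12);
}
```

Because bits per character divides by the character count rather than the token count, it stays comparable between models whose tokenisers split the same text into different numbers of tokens.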
PerplexityConfig
pub const PerplexityConfig = struct {
    max_sequence_length: usize = 2048,
    batch_size: usize = 1,
    sliding_window: usize = 512,
    window_overlap: usize = 64,
    use_log_probs: bool = true,
    temperature: f32 = 1.0,
    normalize_probs: bool = true,
    verbose: bool = false,
};
| Parameter | Default | Purpose |
|---|---|---|
| max_sequence_length | 2048 | Sequences up to this length are evaluated in one pass; longer ones use the sliding window. |
| batch_size | 1 | Number of sequences processed in parallel. |
| sliding_window | 512 | Window width for long-sequence evaluation. |
| window_overlap | 64 | Tokens shared between consecutive windows. |
| temperature | 1.0 | Applied to logits before softmax. Keep at 1.0 for standard perplexity. |
Temperature and perplexity
Setting temperature != 1.0 distorts the probability distribution and makes perplexity values incomparable across runs. Always use 1.0 for benchmarking.
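To see why temperature changes the score, the sketch below computes the log-probability of a target token under softmax(logits / temperature). It is an illustration under stated assumptions: `logProbWithTemperature` is a hypothetical helper, and the evaluator's actual code path may differ.

```zig
const std = @import("std");

/// Log-probability of `target` under softmax(logits / temperature),
/// computed with the log-sum-exp trick for numerical stability.
/// Illustrative helper; not ZigLlama's actual implementation.
fn logProbWithTemperature(logits: []const f32, target: usize, temperature: f32) f64 {
    std.debug.assert(logits.len > 0 and target < logits.len and temperature > 0);
    const t: f64 = temperature;
    // Subtract the largest scaled logit so the exponentials cannot overflow.
    var max_scaled: f64 = -std.math.inf(f64);
    for (logits) |l| max_scaled = @max(max_scaled, @as(f64, l) / t);
    var sum_exp: f64 = 0.0;
    for (logits) |l| sum_exp += @exp(@as(f64, l) / t - max_scaled);
    return @as(f64, logits[target]) / t - max_scaled - @log(sum_exp);
}

test "temperature above 1 flattens the distribution" {
    // At temperature 1 these logits give probabilities {0.25, 0.75}.
    const logits = [_]f32{ 0.0, @log(@as(f32, 3.0)) };
    const at_one = logProbWithTemperature(&logits, 1, 1.0);
    const at_two = logProbWithTemperature(&logits, 1, 2.0);
    try std.testing.expectApproxEqAbs(@log(@as(f64, 0.75)), at_one, 1e-6);
    try std.testing.expect(at_two < at_one);
}
```

The same logits yield a different log-probability at each temperature, so the resulting perplexity describes the rescaled distribution rather than the model itself.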
PerplexityResult
Every evaluation returns a PerplexityResult:
pub const PerplexityResult = struct {
    perplexity: f64,
    log_perplexity: f64,
    bits_per_char: f64,
    bits_per_token: f64,
    total_tokens: usize,
    total_chars: usize,
    token_log_probs: []f64,
    sequence_perplexities: []f64,
    evaluation_time_ms: u64,
    peak_memory_usage: usize,
};
The token_log_probs array stores the per-position log-probability, enabling fine-grained analysis (e.g., identifying passages where the model struggles). sequence_perplexities is populated when evaluating a multi-sequence dataset.
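As one example of such fine-grained analysis, the sketch below locates the position with the lowest log-probability. `mostSurprisingPosition` is a hypothetical helper; it only assumes that token_log_probs is a slice of f64 values, as declared in the struct above.

```zig
const std = @import("std");

/// Returns the index of the lowest log-probability, i.e. the position
/// where the model was most surprised, or null for an empty slice.
/// Hypothetical helper for illustration.
fn mostSurprisingPosition(token_log_probs: []const f64) ?usize {
    if (token_log_probs.len == 0) return null;
    var worst: usize = 0;
    for (token_log_probs, 0..) |lp, i| {
        if (lp < token_log_probs[worst]) worst = i;
    }
    return worst;
}

test "finds the hardest token" {
    const lps = [_]f64{ -0.1, -2.3, -0.5 };
    try std.testing.expectEqual(@as(?usize, 1), mostSurprisingPosition(&lps));
}
```

Mapping the returned index back to the source text shows which passage drove the perplexity up.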
Sliding Window Evaluation
When a token sequence exceeds max_sequence_length, the evaluator switches to a sliding-window strategy:
flowchart LR
subgraph Window 1
T1[t1...t512]
end
subgraph Window 2
T2[t449...t960]
end
subgraph Window 3
T3[t897...t1408]
end
T1 -- "overlap 64" --> T2
T2 -- "overlap 64" --> T3

- The first window spans positions \([0, W)\), where \(W\) = sliding_window.
- Each subsequent window advances by \(W - O\) positions, where \(O\) = window_overlap.
- Within each window, log-probabilities are computed via a full forward pass.
- In every window after the first, tokens in the overlap region are discarded to avoid double-counting. Only positions in the non-overlapping suffix contribute to the aggregate.
The overlap guarantees that every scored token after the first window has at least \(O\) tokens of left context, so predictions just past a window boundary are not made from a cold start.
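The window arithmetic described above can be sketched as follows. The `Window` struct and `windowAt` function are hypothetical and use 0-based, half-open ranges; they illustrate the indexing only and are not taken from the evaluator's source.

```zig
const std = @import("std");

/// One evaluation window: the model sees [start, end), but only
/// [score_from, end) contributes to the aggregate log-likelihood.
const Window = struct { start: usize, end: usize, score_from: usize };

/// Returns the window at `index` for a sequence of `n` tokens, or null once
/// every token has been scored. Assumes overlap < window.
fn windowAt(index: usize, n: usize, window: usize, overlap: usize) ?Window {
    std.debug.assert(overlap < window);
    const stride = window - overlap;
    const start = index * stride;
    // The first window scores everything; later ones skip the overlap prefix.
    const score_from = if (index == 0) @as(usize, 0) else start + overlap;
    if (score_from >= n) return null;
    return .{ .start = start, .end = @min(start + window, n), .score_from = score_from };
}

test "defaults reproduce the diagram and score each token once" {
    const second = windowAt(1, 2000, 512, 64).?;
    try std.testing.expectEqual(@as(usize, 448), second.start); // t449 in 1-based terms
    try std.testing.expectEqual(@as(usize, 960), second.end); // t960 inclusive
    try std.testing.expectEqual(@as(usize, 512), second.score_from);

    var total: usize = 0;
    var i: usize = 0;
    while (windowAt(i, 1000, 512, 64)) |w| : (i += 1) total += w.end - w.score_from;
    try std.testing.expectEqual(@as(usize, 1000), total);
}
```

With the default values, the stride is 448 tokens, so each forward pass after the first processes 512 tokens but adds only 448 new scores. That extra compute is the cost of the added context.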
Choosing window parameters
- Set sliding_window close to the model's training context length for maximum accuracy.
- An overlap of 64--128 tokens is usually sufficient for stable boundary transitions.
Interpreting Results
What is "good" perplexity?
Perplexity depends on the vocabulary, tokeniser, dataset, and model size. The following table gives rough baselines on WikiText-103 for common model sizes:
| Model Size | FP16 PPL | Q4_K_M PPL | Q8_0 PPL |
|---|---|---|---|
| 7B | ~5.7 | ~5.9 | ~5.7 |
| 13B | ~5.1 | ~5.3 | ~5.1 |
| 30B | ~4.3 | ~4.5 | ~4.3 |
| 65B | ~3.5 | ~3.7 | ~3.5 |
Absolute vs. relative
Absolute perplexity varies wildly across datasets (news articles might yield PPL 15 while code yields PPL 3). Compare models on the same dataset and focus on relative differences.
Degradation thresholds
| Degradation | Verdict |
|---|---|
| < 1 % | Negligible -- quantisation is lossless for practical purposes. |
| 1--5 % | Acceptable for most applications. |
| 5--15 % | Noticeable quality loss; consider a higher-bit quantisation. |
| > 15 % | Severe -- avoid for tasks requiring factual accuracy. |
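The thresholds can be applied mechanically, as in the sketch below. It assumes degradation is measured as the percentage increase in perplexity over the baseline and that boundary values (exactly 5 % or 15 %) fall into the milder band, which the table leaves open. `Verdict`, `degradationPercent` and `classify` are illustrative names.

```zig
const std = @import("std");

/// Verdict bands from the degradation table; enum names are illustrative.
const Verdict = enum { negligible, acceptable, noticeable, severe };

/// Relative perplexity increase of a quantised model over its baseline,
/// as a percentage: 100 * (quantised - baseline) / baseline.
fn degradationPercent(baseline_ppl: f64, quantised_ppl: f64) f64 {
    return 100.0 * (quantised_ppl - baseline_ppl) / baseline_ppl;
}

/// Maps a degradation percentage to a verdict.
/// Assumption: boundary values belong to the milder band.
fn classify(percent: f64) Verdict {
    if (percent < 1.0) return .negligible;
    if (percent <= 5.0) return .acceptable;
    if (percent <= 15.0) return .noticeable;
    return .severe;
}

test "7B row of the baseline table" {
    // FP16 ~5.7 vs Q4_K_M ~5.9 is roughly a 3.5 % increase.
    try std.testing.expectEqual(Verdict.acceptable, classify(degradationPercent(5.7, 5.9)));
    // FP16 ~5.7 vs Q8_0 ~5.7 shows no measurable increase.
    try std.testing.expectEqual(Verdict.negligible, classify(degradationPercent(5.7, 5.7)));
}
```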
Benchmarking Quantisation Quality
A typical quantisation-quality workflow:
const std = @import("std");
const perplexity = @import("evaluation/perplexity.zig");

// 1. Evaluate the FP16 baseline
const config = perplexity.PerplexityConfig{ .verbose = true };
var evaluator = perplexity.PerplexityEvaluator.init(
    allocator, config, model_fp16, tokenizer,
);
const baseline = try evaluator.evaluateDataset(wiki_dataset);

// 2. Evaluate Q4_K_M with the same config, tokenizer, and dataset
var evaluator_q4 = perplexity.PerplexityEvaluator.init(
    allocator, config, model_q4km, tokenizer,
);
const q4_result = try evaluator_q4.evaluateDataset(wiki_dataset);

// 3. Compare. relative_difference is (PPL_1 - PPL_2) / PPL_2, so the
//    quantised result goes first to measure degradation against the baseline.
const comparison = perplexity.PerplexityUtils.compareResults(
    q4_result, baseline,
);
std.log.info("Relative degradation: {d:.2}%",
    .{comparison.relative_difference * 100});
The PerplexityComparison struct returned by compareResults reports:
| Field | Type | Description |
|---|---|---|
| absolute_difference | f64 | \(\text{PPL}_1 - \text{PPL}_2\) |
| relative_difference | f64 | \((\text{PPL}_1 - \text{PPL}_2) / \text{PPL}_2\) |
| is_statistically_significant | bool | true if the relative difference exceeds 5 %. This is a fixed threshold, not a statistical test. |
| better_result | enum | .First or .Second. |
Benchmark Suite
For systematic evaluation across multiple datasets, use BenchmarkSuite:
var suite = perplexity.BenchmarkSuite.init(allocator, evaluator);
defer suite.deinit();
// Add standard benchmarks
try suite.generateSyntheticDataset("synthetic-small", 100, 256);
try suite.generateSyntheticDataset("synthetic-large", 50, 1024);
// Run all benchmarks
var results = try suite.runBenchmark();
defer results.deinit(allocator);
results.printReport();
try results.saveToFile("benchmark_results.json", allocator);
The suite supports the following standard benchmark identifiers via StandardBenchmarks:
- WIKITEXT_103
- PENN_TREEBANK
- LAMBADA
- HELLASWAG
- SYNTHETIC_SMALL / SYNTHETIC_LARGE
Confidence Intervals
PerplexityUtils.calculateConfidenceInterval computes percentile-based confidence bounds over per-sequence perplexity values:
const ci = PerplexityUtils.calculateConfidenceInterval(
result.sequence_perplexities, 0.95,
);
std.log.info("95% CI: [{d:.2}, {d:.2}]", .{ ci.lower, ci.upper });
This is particularly useful when comparing two quantisation levels: if their 95 % intervals overlap substantially, the measured difference may not be meaningful.
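A percentile-based interval of this kind can be sketched as below. This is one possible construction, assuming the input is already sorted in ascending order and using outward-rounded rank indexing; `Interval` and `percentileInterval` are hypothetical names, and calculateConfidenceInterval may use a different method.

```zig
const std = @import("std");

/// Lower and upper bounds of a percentile interval.
const Interval = struct { lower: f64, upper: f64 };

/// Percentile interval over `sorted` (ascending) values. The tails are cut
/// symmetrically and ranks are rounded outward. Returns null for empty input.
/// Hypothetical sketch; ZigLlama's method may differ.
fn percentileInterval(sorted: []const f64, confidence: f64) ?Interval {
    std.debug.assert(confidence > 0.0 and confidence <= 1.0);
    if (sorted.len == 0) return null;
    const tail = (1.0 - confidence) / 2.0;
    const last: f64 = @floatFromInt(sorted.len - 1);
    const lo: usize = @intFromFloat(@floor(tail * last));
    const hi: usize = @intFromFloat(@ceil((1.0 - tail) * last));
    return .{ .lower = sorted[lo], .upper = sorted[hi] };
}

test "95 % interval over the values 0..100" {
    var values: [101]f64 = undefined;
    for (&values, 0..) |*v, i| v.* = @floatFromInt(i);
    const ci = percentileInterval(&values, 0.95).?;
    try std.testing.expectEqual(@as(f64, 2.0), ci.lower);
    try std.testing.expectEqual(@as(f64, 98.0), ci.upper);
}
```

With only a handful of sequences the bounds collapse toward the minimum and maximum, so the interval is most informative on datasets with many sequences.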
Source Reference
| File | Key Types |
|---|---|
| src/evaluation/perplexity.zig | PerplexityEvaluator, PerplexityConfig, PerplexityResult, BenchmarkSuite, BenchmarkResults, PerplexityUtils, StandardBenchmarks |
| examples/perplexity_demo.zig | End-to-end demo of configuration, evaluation, and reporting |