Tokenization¶
Overview¶
Tokenization is the bridge between human-readable text and the numerical representations consumed by transformer models. ZigLLM implements a complete tokenization pipeline in src/models/tokenizer.zig, providing both a simple word-level tokenizer for prototyping and a full BPE (Byte Pair Encoding) tokenizer compatible with SentencePiece models used by LLaMA and its derivatives.
flowchart LR
A["Raw Text"] --> B["Preprocessing"]
B --> C["Subword\nSegmentation"]
C --> D["Token ID\nMapping"]
D --> E["Special Token\nWrapping"]
E --> F["Token IDs\n[u32]"]
style A fill:#f0f0f0,color:#333
style F fill:#4a9eff,color:#fff The Vocabulary¶
The Vocabulary struct manages the bidirectional mapping between text pieces and integer token IDs.
pub const Vocabulary = struct {
allocator: Allocator,
piece_to_id: HashMap([]const u8, TokenId, ...), // text -> ID
id_to_piece: ArrayList(TokenPiece), // ID -> text + metadata
vocab_size: usize,
};
Each entry in the vocabulary is a TokenPiece:
pub const TokenPiece = struct {
piece: []const u8, // The text content (e.g., "hello", "ing", "<s>")
score: f32, // SentencePiece score (higher = more preferred)
id: TokenId, // Numeric ID
is_special: bool, // Whether this is a control token
};
Special Tokens¶
Special tokens carry structural information rather than linguistic content. They are always present at fixed positions in the vocabulary.
| Token | ID | Symbol | Purpose |
|---|---|---|---|
| UNK | 0 | <unk> | Unknown / out-of-vocabulary fallback |
| BOS | 1 | <s> | Beginning of sequence marker |
| EOS | 2 | </s> | End of sequence marker |
| PAD | 3 | <pad> | Padding for batch alignment |
pub const SpecialTokens = struct {
pub const UNK: TokenId = 0;
pub const BOS: TokenId = 1;
pub const EOS: TokenId = 2;
pub const PAD: TokenId = 3;
pub fn isSpecial(token_id: TokenId) bool {
return token_id <= 3;
}
};
Role of Special Tokens
BOS signals to the model that a new sequence is starting and activates the model's initial state distribution. EOS signals completion and is used during generation as a stopping criterion. PAD tokens are masked in attention so they do not influence computation.
Vocabulary Initialization¶
var vocab = try Vocabulary.init(allocator, 32000); // LLaMA vocab size
// Special tokens are added automatically: <unk>, <s>, </s>, <pad>
// Add regular tokens
try vocab.addToken("hello", -1.0, 4, false);
try vocab.addToken("world", -1.5, 5, false);
SimpleTokenizer¶
The SimpleTokenizer provides a word-level tokenizer suitable for testing and prototyping. It splits text on whitespace and performs exact vocabulary lookups.
Encoding¶
pub fn encode(self: SimpleTokenizer, text: []const u8) ![]TokenId {
var tokens = ArrayList(TokenId).init(self.allocator);
// 1. Prepend BOS
try tokens.append(SpecialTokens.BOS);
// 2. Split on spaces, look up each word
var word_iter = std.mem.split(u8, text, " ");
while (word_iter.next()) |word| {
if (self.vocabulary.getTokenId(word)) |token_id| {
try tokens.append(token_id);
} else {
try tokens.append(SpecialTokens.UNK); // Fallback
}
}
// 3. Append EOS
try tokens.append(SpecialTokens.EOS);
return try tokens.toOwnedSlice();
}
When to Use SimpleTokenizer
Use SimpleTokenizer for unit tests and educational examples where you want deterministic, easily verifiable tokenization. For production inference with real model weights, always use BPETokenizer.
Decoding¶
Decoding reverses the process: look up each token ID, concatenate pieces, and skip special tokens.
pub fn decode(self: SimpleTokenizer, token_ids: []const TokenId) ![]u8 {
var result = ArrayList(u8).init(self.allocator);
for (token_ids, 0..) |token_id, i| {
if (token_id == SpecialTokens.BOS or
token_id == SpecialTokens.EOS or
token_id == SpecialTokens.PAD) continue;
if (self.vocabulary.getTokenPiece(token_id)) |piece| {
if (i > 0 and result.items.len > 0) try result.append(' ');
try result.appendSlice(piece.piece);
}
}
return try result.toOwnedSlice();
}
BPETokenizer¶
The BPETokenizer implements the full Byte Pair Encoding algorithm used by SentencePiece, which is the standard tokenizer for LLaMA, Mistral, and most modern decoder-only models.
Data Structures¶
pub const BPETokenizer = struct {
vocabulary: Vocabulary,
allocator: Allocator,
merges: ArrayList(MergePair), // Ordered merge rules
merge_ranks: HashMap([]const u8, u32, ...), // "left right" -> rank
};
pub const MergePair = struct {
left: []const u8, // Left symbol
right: []const u8, // Right symbol
rank: u32, // Priority (lower = merged first)
};
BPE Encoding Algorithm¶
The encoding procedure follows five steps.
BPE Encode
Input: raw text string
Output: sequence of token IDs
- Prepare text: prepend the SentencePiece marker
\u{2581}(Unicode LOWER ONE EIGHTH BLOCK) and replace all spaces with\u{2581}. - Initial split: segment the prepared text into individual UTF-8 characters as initial symbols.
- Iterative merging: repeatedly find the adjacent pair with the lowest merge rank and concatenate them into a single symbol. Stop when no more merges are applicable.
- Vocabulary lookup: map each final symbol to its token ID. Symbols not found in the vocabulary map to
UNK(ID 0). - Wrap: prepend
BOS(ID 1) and appendEOS(ID 2).
flowchart TD
A["'hello world'"] --> B["'\u2581hello\u2581world'"]
B --> C["['\u2581', 'h', 'e', 'l', 'l', 'o', '\u2581', 'w', 'o', 'r', 'l', 'd']"]
C --> D{"Find lowest\nrank pair"}
D -->|"merge 'l'+'l' rank=42"| E["['\u2581', 'h', 'e', 'll', 'o', '\u2581', 'w', 'o', 'r', 'l', 'd']"]
E --> D
D -->|"no more merges"| F["Vocabulary\nLookup"]
F --> G["[1, 234, 567, 89, 2]"]
style G fill:#4a9eff,color:#fff Implementation¶
pub fn encode(self: *const BPETokenizer, text: []const u8) ![]TokenId {
var tokens = ArrayList(TokenId).init(self.allocator);
try tokens.append(SpecialTokens.BOS);
// Step 1: Prepare text with SentencePiece marker
var prepared = ArrayList(u8).init(self.allocator);
try prepared.appendSlice("\xe2\x96\x81"); // U+2581 in UTF-8
for (text) |c| {
if (c == ' ') {
try prepared.appendSlice("\xe2\x96\x81");
} else {
try prepared.append(c);
}
}
// Step 2: Split into UTF-8 characters
var symbols = ArrayList([]const u8).init(self.allocator);
var pos: usize = 0;
while (pos < prepared.items.len) {
const byte = prepared.items[pos];
const char_len: usize = if (byte < 0x80) 1
else if (byte < 0xE0) 2
else if (byte < 0xF0) 3
else 4;
const end = @min(pos + char_len, prepared.items.len);
try symbols.append(try self.allocator.dupe(u8, prepared.items[pos..end]));
pos = end;
}
// Step 3: BPE merge loop
while (symbols.items.len > 1) {
var best_rank: u32 = std.math.maxInt(u32);
var best_idx: ?usize = null;
for (0..symbols.items.len - 1) |i| {
const key = try std.fmt.allocPrint(self.allocator,
"{s} {s}", .{ symbols.items[i], symbols.items[i + 1] });
defer self.allocator.free(key);
if (self.merge_ranks.get(key)) |rank| {
if (rank < best_rank) { best_rank = rank; best_idx = i; }
}
}
if (best_idx == null) break;
// Merge the pair at best_idx
const idx = best_idx.?;
const merged = try std.fmt.allocPrint(self.allocator,
"{s}{s}", .{ symbols.items[idx], symbols.items[idx + 1] });
self.allocator.free(symbols.items[idx]);
self.allocator.free(symbols.items[idx + 1]);
symbols.items[idx] = merged;
_ = symbols.orderedRemove(idx + 1);
}
// Step 4: Vocabulary lookup
for (symbols.items) |sym| {
if (self.vocabulary.getTokenId(sym)) |id| {
try tokens.append(id);
} else {
try tokens.append(SpecialTokens.UNK);
}
}
try tokens.append(SpecialTokens.EOS);
return try tokens.toOwnedSlice();
}
BPE Complexity
The naive BPE merge loop is \( O(n^2 \cdot M) \) where \( n \) is the number of symbols and \( M \) is the number of merge rules. Production implementations use a priority queue to achieve \( O(n \cdot M \log n) \). ZigLLM's implementation uses the simpler approach for clarity.
Batch Encoding and Padding¶
For batch inference, multiple sequences must be encoded and padded to equal length.
// Encode a batch of texts
const texts = [_][]const u8{ "hello world", "hi there friend" };
const encoded = try tokenizer.batchEncode(&texts);
// Pad to uniform length
try tokenizer.padSequences(encoded, null); // null = pad to longest
// Result: all sequences are now the same length, padded with PAD (3)
The padding procedure:
- Find the maximum sequence length in the batch (or use a specified
max_length). - Extend shorter sequences with
PADtokens. - Truncate longer sequences if
max_lengthis specified and shorter than the sequence.
Attention Masking
Padded tokens must be masked in attention computation. The attention mask should be 0 for real tokens and -inf for pad positions. This prevents the model from attending to meaningless padding content.
Subword Tokenization Theory¶
Three major subword algorithms are used in modern LLMs. ZigLLM implements BPE; the others are documented here for comparison.
BPE (Byte Pair Encoding)¶
Used by: LLaMA, Mistral, GPT-2, Falcon
Training: Start with character vocabulary. Iteratively count all adjacent pairs in the training corpus, merge the most frequent pair, and add the merged symbol to the vocabulary. Repeat until the desired vocabulary size is reached.
Inference: Apply learned merge rules in rank order (greedy, left-to-right).
WordPiece¶
Used by: BERT, DistilBERT
Training: Similar to BPE but selects merges by maximizing the likelihood of the training data, not raw frequency. The merge score is:
Inference: Greedy longest-match-first from left to right. Unknown subwords are prefixed with ##.
Unigram¶
Used by: SentencePiece (T5, mBART)
Training: Start with a large vocabulary and iteratively remove tokens that least reduce the corpus likelihood. Uses the EM algorithm.
Inference: Find the segmentation that maximizes the product of unigram probabilities using the Viterbi algorithm.
Comparison¶
| Property | BPE | WordPiece | Unigram |
|---|---|---|---|
| Merge criterion | Frequency | Likelihood ratio | Marginal likelihood |
| Inference | Greedy merge | Greedy longest-match | Viterbi (optimal) |
| Deterministic | Yes | Yes | Can produce multiple segmentations |
| OOV handling | Falls back to bytes | ## prefix | Log-probability fallback |
| Typical vocab size | 32K--64K | 30K | 32K |
API Summary¶
| Type | Method | Description |
|---|---|---|
Vocabulary | init(allocator, vocab_size) | Create empty vocabulary with special tokens |
Vocabulary | addToken(piece, score, id, is_special) | Register a token |
Vocabulary | getTokenId(piece) | Look up ID by text |
Vocabulary | getTokenPiece(id) | Look up text by ID |
SimpleTokenizer | encode(text) | Word-level tokenization |
SimpleTokenizer | decode(token_ids) | Reconstruct text from IDs |
SimpleTokenizer | batchEncode(texts) | Encode multiple texts |
SimpleTokenizer | padSequences(seqs, max_len) | Pad batch to uniform length |
BPETokenizer | addMerge(left, right, rank) | Register a BPE merge rule |
BPETokenizer | encode(text) | Full BPE tokenization |
BPETokenizer | decode(token_ids) | Reconstruct text from IDs |
References¶
-
Sennrich, R., Haddow, B., and Birch, A. "Neural Machine Translation of Rare Words with Subword Units." ACL, 2016. ↩
-
Kudo, T. and Richardson, J. "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing." EMNLP, 2018. ↩
-
Kudo, T. "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates." ACL, 2018. ↩