GGUF Binary Format Specification¶
GGUF (GGML Unified Format) is the standard binary format for distributing quantized large language models. Popularized by the llama.cpp ecosystem, GGUF encodes model weights, tokenizer vocabularies, and architectural metadata in a single, self-describing file that can be memory-mapped and parsed with minimal overhead.
1. Format History¶
1.1 GGML Era¶
The original GGML library stored tensors in a minimalist binary layout: a small header followed by raw tensor data. Metadata (model architecture, vocabulary, hyperparameters) lived in separate files or was hardcoded in the loader.
1.2 Motivation for GGUF¶
As the number of supported architectures grew (LLaMA, Falcon, GPT-NeoX, MPT, ...), hardcoded loaders became unmaintainable. GGUF was designed to be:
| Requirement | Solution in GGUF |
|---|---|
| Self-describing | Typed key-value metadata section |
| Architecture-agnostic | Architecture name stored as metadata |
| Extensible | New keys can be added without breaking readers |
| Efficient | Tensor data section is contiguous and aligned for mmap |
| Single file | Weights + vocabulary + config in one .gguf file |
1.3 Timeline¶
timeline
title Format Evolution
2023-03 : GGML format (ggml.h)
2023-06 : GGJT format (versioned headers)
2023-08 : GGUF v1 (self-describing metadata)
2023-09 : GGUF v2 (64-bit tensor counts)
2023-10 : GGUF v3 (current, stable specification) 2. GGUF v3 Specification¶
2.1 File Layout Overview¶
block-beta
columns 1
A["Header (24 bytes)"]
B["Metadata Key-Value Pairs (variable)"]
C["Tensor Info Array (variable)"]
D["Alignment Padding (to 32-byte boundary)"]
E["Tensor Data (contiguous, aligned)"] 2.2 Header¶
The header is exactly 24 bytes in little-endian format:
| Offset | Size | Field | Type | Value |
|---|---|---|---|---|
| 0 | 4 | magic | u32 | 0x46554747 ("GGUF" in ASCII, little-endian) |
| 4 | 4 | version | u32 | 3 |
| 8 | 8 | tensor_count | u64 | Number of tensors in the file |
| 16 | 8 | metadata_kv_count | u64 | Number of metadata key-value pairs |
pub const GGUFHeader = struct {
magic: u32,
version: u32,
tensor_count: u64,
metadata_kv_count: u64,
pub fn validate(self: GGUFHeader) !void {
if (self.magic != GGUF_MAGIC) return error.InvalidGGUFMagic;
if (self.version != GGUF_VERSION) return error.UnsupportedGGUFVersion;
}
};
Magic Number Mnemonic
0x46554747 decodes to the ASCII bytes G, G, U, F when read in little-endian order. This allows quick identification with hexdump:
2.3 Metadata Section¶
Immediately after the header, metadata_kv_count key-value pairs are stored sequentially. Each pair has the following layout:
┌──────────────────────────────────────────────────┐
│ key_length : u64 │
│ key_data : u8[key_length] │
│ value_type : u32 (GGUFType enum) │
│ value_data : (type-dependent) │
└──────────────────────────────────────────────────┘
Common metadata keys for LLaMA-family models:
| Key | Type | Example Value |
|---|---|---|
general.architecture | string | "llama" |
general.name | string | "LLaMA-2-7B" |
llama.context_length | u32 | 4096 |
llama.embedding_length | u32 | 4096 |
llama.block_count | u32 | 32 |
llama.attention.head_count | u32 | 32 |
llama.attention.head_count_kv | u32 | 32 |
llama.rope.dimension_count | u32 | 128 |
tokenizer.ggml.model | string | "llama" |
tokenizer.ggml.tokens | array[string] | 32000 token strings |
2.4 Tensor Info Section¶
After all metadata, tensor_count tensor descriptors appear:
┌──────────────────────────────────────────────────┐
│ name_length : u64 │
│ name_data : u8[name_length] │
│ n_dimensions : u32 │
│ dimensions : u64[n_dimensions] │
│ type : u32 (GGMLType enum) │
│ offset : u64 (from start of data section)│
└──────────────────────────────────────────────────┘
pub const GGUFTensorInfo = struct {
name: []const u8,
n_dims: u32,
dimensions: []u64,
type: GGMLType,
offset: u64,
pub fn elementCount(self: GGUFTensorInfo) u64 {
var count: u64 = 1;
for (self.dimensions) |dim| count *= dim;
return count;
}
pub fn sizeInBytes(self: GGUFTensorInfo) u64 {
const elements = self.elementCount();
const block_size = self.type.blockSize();
const type_size = self.type.typeSize();
if (block_size == 1) return elements * type_size;
return ((elements + block_size - 1) / block_size) * type_size;
}
};
2.5 Alignment Padding¶
After the last tensor info entry, the file is padded to the next 32-byte boundary. This alignment ensures that tensor data can be accessed with SIMD instructions without unaligned-access penalties.
2.6 Tensor Data Section¶
Starting at data_offset, tensor data is stored contiguously. Each tensor's data begins at data_offset + tensor_info.offset. The data is stored in the format specified by the tensor's GGMLType -- for quantized types, this is a sequence of quantization blocks (see Section 4).
3. Supported Value Types¶
The GGUFType enum defines the types available in metadata key-value pairs:
pub const GGUFType = enum(u32) {
uint8 = 0,
int8 = 1,
uint16 = 2,
int16 = 3,
uint32 = 4,
int32 = 5,
float32 = 6,
bool = 7,
string = 8,
array = 9,
uint64 = 10,
int64 = 11,
float64 = 12,
};
| Type | Size (bytes) | Notes |
|---|---|---|
uint8 | 1 | |
int8 | 1 | |
uint16 | 2 | |
int16 | 2 | |
uint32 | 4 | |
int32 | 4 | |
float32 | 4 | IEEE 754 single |
float64 | 8 | IEEE 754 double |
bool | 1 | 0 = false, non-zero = true |
string | variable | Length-prefixed: u64 length + u8[length] |
array | variable | Type tag (u32) + count (u64) + elements |
String Encoding
GGUF strings are not null-terminated. They are length-prefixed with a u64 byte count, which can safely contain embedded null bytes (important for tokenizer vocabularies).
4. GGMLType -- Tensor Data Types¶
The GGMLType enum specifies how tensor data is encoded. This ranges from full-precision floats to aggressive sub-2-bit quantization:
4.1 Unquantized Types¶
| Enum | Value | Bits/Element | Block Size |
|---|---|---|---|
f32 | 0 | 32 | 1 |
f16 | 1 | 16 | 1 |
bf16 | 25 | 16 | 1 |
4.2 Basic Quantization (Q-series)¶
| Enum | Value | Bits/Element | Block Size | Bytes/Block |
|---|---|---|---|---|
q4_0 | 2 | ~4.5 | 32 | 20 |
q4_1 | 3 | ~5.0 | 32 | 24 |
q5_0 | 6 | ~5.5 | 32 | 24 |
q5_1 | 7 | ~6.0 | 32 | 28 |
q8_0 | 8 | ~8.5 | 32 | 36 |
q8_1 | 9 | ~9.0 | 32 | 40 |
4.3 K-Quantization¶
K-quantization uses larger blocks (256 elements) with more sophisticated encoding, achieving better quality at the same bit rate1:
| Enum | Value | Block Size | Bytes/Block | Approx. Bits/Weight |
|---|---|---|---|---|
q2_k | 10 | 256 | 82 | ~2.6 |
q3_k | 11 | 256 | 110 | ~3.4 |
q4_k | 12 | 256 | 144 | ~4.5 |
q5_k | 13 | 256 | 176 | ~5.5 |
q6_k | 14 | 256 | 208 | ~6.5 |
q8_k | 15 | 256 | 256 | ~8.0 |
4.4 Importance Quantization (IQ-series)¶
IQ methods use non-uniform codebooks and importance-weighted quantization for extreme compression2:
| Enum | Value | Block Size | Bytes/Block | Approx. Bits/Weight |
|---|---|---|---|---|
iq1_s | 19 | 256 | 50 | ~1.6 |
iq1_m | 24 | 256 | 56 | ~1.75 |
iq2_xxs | 16 | 256 | 66 | ~2.06 |
iq2_xs | 17 | 256 | 74 | ~2.31 |
iq2_s | 22 | 256 | 82 | ~2.56 |
iq3_xxs | 18 | 256 | 98 | ~3.06 |
iq3_s | 21 | 256 | 110 | ~3.44 |
iq4_nl | 20 | 256 | 144 | ~4.5 |
iq4_xs | 23 | 256 | 144 | ~4.5 |
iq4_ks | 26 | 256 | 144 | ~4.5 |
5. GGUFReader Implementation¶
5.1 Opening and Reading¶
The GGUFReader struct wraps a std.fs.File and provides sequential parsing:
pub const GGUFReader = struct {
file: std.fs.File,
allocator: std.mem.Allocator,
pub fn init(file: std.fs.File, allocator: std.mem.Allocator) GGUFReader {
return GGUFReader{ .file = file, .allocator = allocator };
}
pub fn readFile(self: *GGUFReader) !GGUFFile {
var gguf = GGUFFile.init(self.allocator);
errdefer gguf.deinit();
gguf.file_size = (try self.file.stat()).size;
try self.file.seekTo(0);
gguf.header = try self.readHeader();
try gguf.header.validate();
try self.readMetadata(&gguf);
try self.readTensorInfo(&gguf);
// Align to 32-byte boundary for tensor data
const current_pos = try self.file.getPos();
gguf.data_offset = std.mem.alignForward(u64, current_pos, 32);
return gguf;
}
};
5.2 Reading the Header¶
fn readHeader(self: *GGUFReader) !GGUFHeader {
const reader = self.file.reader();
return GGUFHeader{
.magic = try reader.readInt(u32, .little),
.version = try reader.readInt(u32, .little),
.tensor_count = try reader.readInt(u64, .little),
.metadata_kv_count = try reader.readInt(u64, .little),
};
}
5.3 Reading Metadata¶
Each key-value pair is read sequentially. The value type tag determines how the value bytes are interpreted:
fn readValue(self: *GGUFReader, value_type: GGUFType) !GGUFValue {
const reader = self.file.reader();
return switch (value_type) {
.uint8 => GGUFValue{ .uint8 = try reader.readInt(u8, .little) },
.int32 => GGUFValue{ .int32 = try reader.readInt(i32, .little) },
.float32 => GGUFValue{ .float32 = @bitCast(try reader.readInt(u32, .little)) },
.string => GGUFValue{ .string = try self.readString() },
.array => blk: {
const arr_type = @as(GGUFType, @enumFromInt(try reader.readInt(u32, .little)));
const arr_len = try reader.readInt(u64, .little);
const elem_size = arr_type.size() orelse return error.UnsupportedArrayType;
const data = try self.allocator.alloc(u8, arr_len * elem_size);
_ = try reader.readAll(data);
break :blk GGUFValue{ .array = .{ .type = arr_type, .len = arr_len, .data = data } };
},
// ... remaining types ...
};
}
5.4 Loading Tensor Data¶
pub fn readTensorData(self: *GGUFReader, gguf: *GGUFFile,
tensor_info: *GGUFTensorInfo, allocator: std.mem.Allocator) ![]u8 {
const size = tensor_info.sizeInBytes();
const data = try allocator.alloc(u8, size);
try self.file.seekTo(gguf.data_offset + tensor_info.offset);
_ = try self.file.readAll(data);
return data;
}
6. Model Loading Pipeline¶
The complete pipeline from file on disk to usable tensors:
flowchart TD
A["Open .gguf file"] --> B["Read 24-byte header"]
B --> C{"Magic == 0x46554747?"}
C -->|No| ERR["Error: InvalidGGUFMagic"]
C -->|Yes| D["Read metadata KV pairs"]
D --> E["Read tensor info descriptors"]
E --> F["Compute data_offset\n(align to 32 bytes)"]
F --> G["For each tensor:\nSeek to data_offset + offset\nRead sizeInBytes() bytes"]
G --> H{"Quantized?"}
H -->|No (f32/f16)| I["Direct cast to Tensor(f32)"]
H -->|Yes| J["Dequantize blocks\n(Q4_0, Q8_0, K-quant, ...)"]
J --> I
I --> K["Tensor ready for inference"] 6.1 GGUFModelLoader Convenience API¶
pub const GGUFModelLoader = struct {
gguf: GGUFFile,
reader: GGUFReader,
allocator: std.mem.Allocator,
pub fn init(file_path: []const u8, allocator: std.mem.Allocator) !GGUFModelLoader {
const file = try std.fs.cwd().openFile(file_path, .{});
var reader = GGUFReader.init(file, allocator);
const gguf = try reader.readFile();
return GGUFModelLoader{ .gguf = gguf, .reader = reader, .allocator = allocator };
}
pub fn loadTensor(self: *GGUFModelLoader, name: []const u8) !Tensor {
const info = self.gguf.getTensor(name) orelse return error.TensorNotFound;
const raw = try self.reader.readTensorData(&self.gguf, info, self.allocator);
defer self.allocator.free(raw);
// ... dequantize based on info.type ...
}
};
References¶
-
Dettmers, T. et al. "The case for 4-bit precision: k-bit Inference Scaling Laws." arXiv:2212.09720, 2022. ↩
-
Egiazarian, V. et al. "Extreme Compression of Large Language Models via Additive Quantization." arXiv:2401.06118, 2024. ↩
-
Gerganov, G. "GGUF Specification." GitHub -- ggerganov/ggml, 2023. ↩
-
Frantar, E. et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers." ICLR, 2023. ↩