GGUF Binary Format Specification

GGUF (GGML Unified Format) is the standard binary format for distributing quantized large language models. Popularized by the llama.cpp ecosystem, GGUF encodes model weights, tokenizer vocabularies, and architectural metadata in a single, self-describing file that can be memory-mapped and parsed with minimal overhead.

1. Format History

1.1 GGML Era

The original GGML library stored tensors in a minimalist binary layout: a small header followed by raw tensor data. Metadata (model architecture, vocabulary, hyperparameters) lived in separate files or was hardcoded in the loader.

1.2 Motivation for GGUF

As the number of supported architectures grew (LLaMA, Falcon, GPT-NeoX, MPT, ...), hardcoded loaders became unmaintainable. GGUF was designed to be:

Requirement	Solution in GGUF
Self-describing	Typed key-value metadata section
Architecture-agnostic	Architecture name stored as metadata
Extensible	New keys can be added without breaking readers
Efficient	Tensor data section is contiguous and aligned for mmap
Single file	Weights + vocabulary + config in one `.gguf` file

1.3 Timeline

timeline
    title Format Evolution
    2023-03 : GGML format (ggml.h)
    2023-06 : GGJT format (versioned headers)
    2023-08 : GGUF v1 (self-describing metadata)
    2023-09 : GGUF v2 (64-bit tensor counts)
    2023-10 : GGUF v3 (current, stable specification)

2. GGUF v3 Specification

2.1 File Layout Overview

block-beta
    columns 1
    A["Header (24 bytes)"]
    B["Metadata Key-Value Pairs (variable)"]
    C["Tensor Info Array (variable)"]
    D["Alignment Padding (to 32-byte boundary)"]
    E["Tensor Data (contiguous, aligned)"]

2.2 Header

The header is exactly 24 bytes in little-endian format:

Offset	Size	Field	Type	Value
0	4	`magic`	`u32`	`0x46554747` ("GGUF" in ASCII, little-endian)
4	4	`version`	`u32`	`3`
8	8	`tensor_count`	`u64`	Number of tensors in the file
16	8	`metadata_kv_count`	`u64`	Number of metadata key-value pairs

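// Constants referenced by validate(); values match the header table above.
pub const GGUF_MAGIC: u32 = 0x46554747; // "GGUF" read as a little-endian u32
pub const GGUF_VERSION: u32 = 3;

pub const GGUFHeader = struct {
    magic: u32,
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,

    /// Rejects files with the wrong magic or an unsupported version.
    pub fn validate(self: GGUFHeader) !void {
        if (self.magic != GGUF_MAGIC) return error.InvalidGGUFMagic;
        if (self.version != GGUF_VERSION) return error.UnsupportedGGUFVersion;
    }
};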

Magic Number Mnemonic

0x46554747 decodes to the ASCII bytes G, G, U, F when read in little-endian order. This allows quick identification with hexdump:

$ hexdump -C model.gguf | head -1
00000000  47 47 55 46 03 00 00 00  ...
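The same check can be expressed in code. The sketch below is illustrative only: hasGgufMagic is a hypothetical helper, not part of the reader shown later, and it assumes the first bytes of the file are already in memory.

```zig
const std = @import("std");

/// The magic as the u32 value a little-endian reader sees.
pub const GGUF_MAGIC: u32 = 0x46554747;

/// Hypothetical helper: true if `prefix` starts with the ASCII bytes "GGUF".
pub fn hasGgufMagic(prefix: []const u8) bool {
    if (prefix.len < 4) return false;
    // Decoding the first four bytes as little-endian must yield the magic.
    return std.mem.readInt(u32, prefix[0..4], .little) == GGUF_MAGIC;
}
```

A byte-wise comparison against "GGUF" would work equally well; decoding as a u32 simply mirrors how the header struct stores the field.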

2.3 Metadata Section

Immediately after the header, metadata_kv_count key-value pairs are stored sequentially. Each pair has the following layout:

┌──────────────────────────────────────────────────┐
│  key_length : u64                                │
│  key_data   : u8[key_length]                     │
│  value_type : u32  (GGUFType enum)               │
│  value_data : (type-dependent)                   │
└──────────────────────────────────────────────────┘
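To make the pair layout concrete, here is a minimal sketch that decodes the key and type tag from an in-memory buffer. KvHead and parseKvHead are hypothetical names introduced for illustration; decoding value_data is left to a type-specific reader.

```zig
const std = @import("std");

pub const KvHead = struct {
    key: []const u8, // borrowed from the input buffer, not copied
    value_type: u32, // raw type tag; validate before converting to an enum
    value_pos: usize, // offset of value_data within the buffer
};

/// Parses key_length, key_data and value_type starting at `pos`.
pub fn parseKvHead(buf: []const u8, pos: usize) error{Truncated}!KvHead {
    var cur = pos;
    if (cur > buf.len or buf.len - cur < 8) return error.Truncated;
    const raw_len = std.mem.readInt(u64, buf[cur..][0..8], .little);
    cur += 8;
    // The length is untrusted: it must fit in usize and in the remaining bytes.
    const key_len = std.math.cast(usize, raw_len) orelse return error.Truncated;
    if (buf.len - cur < key_len) return error.Truncated;
    const key = buf[cur .. cur + key_len];
    cur += key_len;
    if (buf.len - cur < 4) return error.Truncated;
    const value_type = std.mem.readInt(u32, buf[cur..][0..4], .little);
    cur += 4;
    return .{ .key = key, .value_type = value_type, .value_pos = cur };
}
```

Every length is checked against the remaining bytes before use, since a corrupt key_length would otherwise read past the end of the buffer.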

Common metadata keys for LLaMA-family models:

Key	Type	Example Value
`general.architecture`	string	`"llama"`
`general.name`	string	`"LLaMA-2-7B"`
`llama.context_length`	u32	`4096`
`llama.embedding_length`	u32	`4096`
`llama.block_count`	u32	`32`
`llama.attention.head_count`	u32	`32`
`llama.attention.head_count_kv`	u32	`32`
`llama.rope.dimension_count`	u32	`128`
`tokenizer.ggml.model`	string	`"llama"`
`tokenizer.ggml.tokens`	array[string]	32000 token strings

2.4 Tensor Info Section

After all metadata, tensor_count tensor descriptors appear:

┌──────────────────────────────────────────────────┐
│  name_length : u64                               │
│  name_data   : u8[name_length]                   │
│  n_dimensions : u32                              │
│  dimensions   : u64[n_dimensions]                │
│  type         : u32  (GGMLType enum)             │
│  offset       : u64  (from start of data section)│
└──────────────────────────────────────────────────┘

pub const GGUFTensorInfo = struct {
    name: []const u8,
    n_dims: u32,
    dimensions: []u64,
    type: GGMLType,
    offset: u64,

    pub fn elementCount(self: GGUFTensorInfo) u64 {
        var count: u64 = 1;
        for (self.dimensions) |dim| count *= dim;
        return count;
    }

    pub fn sizeInBytes(self: GGUFTensorInfo) u64 {
        const elements = self.elementCount();
        const block_size = self.type.blockSize();
        const type_size = self.type.typeSize();
        if (block_size == 1) return elements * type_size;
        return ((elements + block_size - 1) / block_size) * type_size;
    }
};
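As a worked example of the size arithmetic, the sketch below repeats the same calculation with overflow checks, since dimensions come from an untrusted file. The function names are hypothetical, and the block parameters are passed in explicitly: f16 is assumed to be 2 bytes per element, and a Q4_0 block is assumed to be 18 bytes for 32 elements (a 2-byte f16 scale plus 16 bytes of packed 4-bit values).

```zig
const std = @import("std");

/// Product of all dimensions, failing instead of wrapping on overflow.
pub fn elementCount(dims: []const u64) error{Overflow}!u64 {
    var count: u64 = 1;
    for (dims) |d| count = try std.math.mul(u64, count, d);
    return count;
}

/// Bytes needed for a tensor stored as whole blocks of `block_size` elements.
pub fn byteSize(
    dims: []const u64,
    block_size: u64,
    bytes_per_block: u64,
) error{ Overflow, ZeroBlockSize }!u64 {
    if (block_size == 0) return error.ZeroBlockSize;
    const elements = try elementCount(dims);
    // Round up to whole blocks without computing (elements + block_size - 1),
    // which could itself overflow.
    const blocks = elements / block_size + @intFromBool(elements % block_size != 0);
    return std.math.mul(u64, blocks, bytes_per_block);
}
```

For a 4096 x 4096 tensor this gives 16,777,216 elements, which is 33,554,432 bytes as f16 and 524,288 blocks x 18 bytes = 9,437,184 bytes as Q4_0.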

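2.5 Alignment Padding

After the last tensor info entry, the file is padded to the next alignment boundary. The alignment defaults to 32 bytes and can be overridden by the optional general.alignment metadata key; the reader code in this document assumes the default. Aligned tensor data can be memory-mapped and loaded with SIMD instructions without unaligned-access penalties. With the default alignment: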

\[ \text{data\_offset} = \lceil \text{current\_position} / 32 \rceil \times 32 \]
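The rounding in the formula can be sketched as follows. alignUp is a hypothetical helper; it takes the alignment as a parameter on the assumption that a reader may want to honor a general.alignment override rather than hardcode 32, and it does not require a power of two.

```zig
const std = @import("std");

pub const DEFAULT_ALIGNMENT: u64 = 32;

/// Rounds `pos` up to the next multiple of `alignment`.
pub fn alignUp(pos: u64, alignment: u64) error{ InvalidAlignment, Overflow }!u64 {
    if (alignment == 0) return error.InvalidAlignment;
    const rem = pos % alignment;
    if (rem == 0) return pos; // already on a boundary, no padding needed
    return std.math.add(u64, pos, alignment - rem);
}
```

For example, a tensor info section ending at byte 1000 yields a data offset of 1024, while one ending exactly at 1024 needs no padding.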

2.6 Tensor Data Section

Starting at data_offset, tensor data is stored contiguously. Each tensor's data begins at data_offset + tensor_info.offset. The data is stored in the format specified by the tensor's GGMLType -- for quantized types, this is a sequence of quantization blocks (see Section 4).
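Because offsets and sizes come from the file itself, it is worth checking that a tensor's byte range actually lies inside the file before seeking or slicing a memory map. A minimal sketch, with hypothetical names Range and tensorRange:

```zig
const std = @import("std");

pub const Range = struct { start: u64, end: u64 };

/// Absolute byte range [start, end) of a tensor, validated against the file size.
pub fn tensorRange(
    data_offset: u64,
    tensor_offset: u64,
    size: u64,
    file_size: u64,
) error{ Overflow, OutOfBounds }!Range {
    // Checked additions: a crafted offset must not wrap around.
    const start = try std.math.add(u64, data_offset, tensor_offset);
    const end = try std.math.add(u64, start, size);
    if (end > file_size) return error.OutOfBounds;
    return .{ .start = start, .end = end };
}
```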

3. Supported Value Types

The GGUFType enum defines the types available in metadata key-value pairs:

pub const GGUFType = enum(u32) {
    uint8   = 0,
    int8    = 1,
    uint16  = 2,
    int16   = 3,
    uint32  = 4,
    int32   = 5,
    float32 = 6,
    bool    = 7,
    string  = 8,
    array   = 9,
    uint64  = 10,
    int64   = 11,
    float64 = 12,
};

Type	Size (bytes)	Notes
`uint8`	1
`int8`	1
`uint16`	2
`int16`	2
`uint32`	4
`int32`	4
`uint64`	8
`int64`	8
`float32`	4	IEEE 754 single
`float64`	8	IEEE 754 double
`bool`	1	0 = false, 1 = true; any other value is invalid
`string`	variable	Length-prefixed: `u64` length + `u8[length]`
`array`	variable	Element type tag (`u32`) + count (`u64`) + elements
String Encoding

GGUF strings are not null-terminated. Each string is prefixed with a u64 byte count, so it can safely contain embedded null bytes (important for tokenizer vocabularies).
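Because every string carries its own length, an array of strings such as tokenizer.ggml.tokens has no fixed element size and has to be walked element by element. The following sketch shows one way to do that over an in-memory buffer; Cursor and readStringArray are hypothetical names, and the returned slices borrow from the input rather than copying.

```zig
const std = @import("std");

pub const Cursor = struct {
    buf: []const u8,
    pos: usize = 0,

    /// Returns the next `n` bytes and advances; pos never exceeds buf.len.
    fn take(self: *Cursor, n: usize) error{Truncated}![]const u8 {
        if (self.buf.len - self.pos < n) return error.Truncated;
        const out = self.buf[self.pos .. self.pos + n];
        self.pos += n;
        return out;
    }

    fn readU32(self: *Cursor) error{Truncated}!u32 {
        const bytes = try self.take(4);
        return std.mem.readInt(u32, bytes[0..4], .little);
    }

    fn readU64(self: *Cursor) error{Truncated}!u64 {
        const bytes = try self.take(8);
        return std.mem.readInt(u64, bytes[0..8], .little);
    }

    /// Length-prefixed string: u64 byte count, then that many raw bytes.
    fn readString(self: *Cursor) error{Truncated}![]const u8 {
        const raw_len = try self.readU64();
        const len = std.math.cast(usize, raw_len) orelse return error.Truncated;
        return self.take(len);
    }
};

/// Reads an array value whose element type tag is string (8) into `out`.
pub fn readStringArray(
    cur: *Cursor,
    out: [][]const u8,
) error{ Truncated, NotStringArray, TooMany }![]const []const u8 {
    const tag = try cur.readU32();
    if (tag != 8) return error.NotStringArray;
    const raw_count = try cur.readU64();
    const count = std.math.cast(usize, raw_count) orelse return error.TooMany;
    if (count > out.len) return error.TooMany;
    for (out[0..count]) |*slot| slot.* = try cur.readString();
    return out[0..count];
}
```

A C-style strlen on the first test string below would stop at the embedded zero byte; the length prefix preserves all three bytes.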

4. GGMLType -- Tensor Data Types

The GGMLType enum specifies how tensor data is encoded. Encodings range from full-precision floats to aggressive sub-2-bit quantization.

4.1 Unquantized Types

Enum	Value	Bits/Element	Block Size
`f32`	0	32	1
`f16`	1	16	1
`bf16`	30	16	1

4.2 Basic Quantization (Q-series)

Enum	Value	Bits/Element	Block Size	Bytes/Block
`q4_0`	2	~4.5	32	18
`q4_1`	3	~5.0	32	20
`q5_0`	6	~5.5	32	22
`q5_1`	7	~6.0	32	24
`q8_0`	8	~8.5	32	34
`q8_1`	9	~9.0	32	36

4.3 K-Quantization

K-quantization uses larger blocks (256 elements) with more sophisticated encoding, achieving better quality at the same bit rate¹:

Enum	Value	Block Size	Bytes/Block	Approx. Bits/Weight
`q2_k`	10	256	84	~2.6
`q3_k`	11	256	110	~3.4
`q4_k`	12	256	144	~4.5
`q5_k`	13	256	176	~5.5
`q6_k`	14	256	210	~6.6
`q8_k`	15	256	292	~9.1
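The bits-per-weight column follows from the other two: bytes per block times 8, divided by elements per block. A small sketch of that arithmetic, using a hypothetical bitsPerWeight helper:

```zig
const std = @import("std");

/// Average storage cost per weight, including per-block scales and metadata.
pub fn bitsPerWeight(bytes_per_block: u32, block_size: u32) f64 {
    const bits: f64 = @floatFromInt(bytes_per_block * 8);
    const elems: f64 = @floatFromInt(block_size);
    return bits / elems;
}
```

For q4_k this gives 144 x 8 / 256 = 4.5 bits, and for q3_k 110 x 8 / 256 = 3.4375 bits, which the table rounds to ~3.4.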

4.4 Importance Quantization (IQ-series)

IQ methods use non-uniform codebooks and importance-weighted quantization for extreme compression²:

Enum	Value	Block Size	Bytes/Block	Approx. Bits/Weight
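`iq1_s`	19	256	50	~1.56
`iq1_m`	29	256	56	~1.75
`iq2_xxs`	16	256	66	~2.06
`iq2_xs`	17	256	74	~2.31
`iq2_s`	22	256	82	~2.56
`iq3_xxs`	18	256	98	~3.06
`iq3_s`	21	256	110	~3.44
`iq4_nl`	20	32	18	~4.5
`iq4_xs`	23	256	136	~4.25
`iq4_ks`	26	256	144	~4.5

Note: `iq4_ks` is not defined in upstream ggml, where value 26 is assigned to `i32`. Treat that row as specific to this implementation or to forks that provide the type.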

5. GGUFReader Implementation

5.1 Opening and Reading

The GGUFReader struct wraps a std.fs.File and provides sequential parsing:

pub const GGUFReader = struct {
    file: std.fs.File,
    allocator: std.mem.Allocator,

    pub fn init(file: std.fs.File, allocator: std.mem.Allocator) GGUFReader {
        return GGUFReader{ .file = file, .allocator = allocator };
    }

    pub fn readFile(self: *GGUFReader) !GGUFFile {
        var gguf = GGUFFile.init(self.allocator);
        errdefer gguf.deinit();

        gguf.file_size = (try self.file.stat()).size;
        try self.file.seekTo(0);

        gguf.header = try self.readHeader();
        try gguf.header.validate();

        try self.readMetadata(&gguf);
        try self.readTensorInfo(&gguf);

        // Align to 32-byte boundary for tensor data
        const current_pos = try self.file.getPos();
        gguf.data_offset = std.mem.alignForward(u64, current_pos, 32);

        return gguf;
    }
};

5.2 Reading the Header

fn readHeader(self: *GGUFReader) !GGUFHeader {
    const reader = self.file.reader();
    return GGUFHeader{
        .magic            = try reader.readInt(u32, .little),
        .version          = try reader.readInt(u32, .little),
        .tensor_count     = try reader.readInt(u64, .little),
        .metadata_kv_count = try reader.readInt(u64, .little),
    };
}

5.3 Reading Metadata

Each key-value pair is read sequentially. The value type tag determines how the value bytes are interpreted:

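fn readValue(self: *GGUFReader, value_type: GGUFType) !GGUFValue {
    const reader = self.file.reader();
    return switch (value_type) {
        .uint8   => GGUFValue{ .uint8   = try reader.readInt(u8, .little) },
        .int32   => GGUFValue{ .int32   = try reader.readInt(i32, .little) },
        .float32 => GGUFValue{ .float32 = @bitCast(try reader.readInt(u32, .little)) },
        .string  => GGUFValue{ .string  = try self.readString() },
        .array   => blk: {
            // The tag comes from an untrusted file, so validate it rather than
            // calling @enumFromInt directly (an unknown tag is illegal behavior).
            const raw_type = try reader.readInt(u32, .little);
            const arr_type = std.meta.intToEnum(GGUFType, raw_type) catch return error.InvalidArrayType;
            const arr_len  = try reader.readInt(u64, .little);
            // Only fixed-size element types are handled here; string and nested
            // arrays have no fixed size and need an element-by-element reader.
            const elem_size = arr_type.size() orelse return error.UnsupportedArrayType;
            // Checked multiply and narrowing: arr_len is untrusted as well.
            const total = try std.math.mul(u64, arr_len, elem_size);
            const byte_len = std.math.cast(usize, total) orelse return error.ArrayTooLarge;
            const data = try self.allocator.alloc(u8, byte_len);
            errdefer self.allocator.free(data);
            // readNoEof fails on a short read instead of returning partial data.
            try reader.readNoEof(data);
            break :blk GGUFValue{ .array = .{ .type = arr_type, .len = arr_len, .data = data } };
        },
        // ... remaining types ...
    };
}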

5.4 Loading Tensor Data

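pub fn readTensorData(self: *GGUFReader, gguf: *const GGUFFile,
                      tensor_info: *const GGUFTensorInfo, allocator: std.mem.Allocator) ![]u8 {
    // Narrow to usize explicitly so 32-bit targets fail cleanly on huge tensors.
    const size = std.math.cast(usize, tensor_info.sizeInBytes()) orelse return error.TensorTooLarge;
    const data = try allocator.alloc(u8, size);
    // Free the buffer if seeking or reading fails.
    errdefer allocator.free(data);
    try self.file.seekTo(gguf.data_offset + tensor_info.offset);
    // readAll returns the number of bytes read; a short count means a truncated file.
    const n = try self.file.readAll(data);
    if (n != data.len) return error.UnexpectedEndOfFile;
    return data;
}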

6. Model Loading Pipeline

The complete pipeline from file on disk to usable tensors:

flowchart TD
    A["Open .gguf file"] --> B["Read 24-byte header"]
    B --> C{"Magic == 0x46554747?"}
    C -->|No| ERR["Error: InvalidGGUFMagic"]
    C -->|Yes| D["Read metadata KV pairs"]
    D --> E["Read tensor info descriptors"]
    E --> F["Compute data_offset\n(align to 32 bytes)"]
    F --> G["For each tensor:\nSeek to data_offset + offset\nRead sizeInBytes() bytes"]
    G --> H{"Quantized?"}
    H -->|No (f32/f16)| I["Direct cast to Tensor(f32)"]
    H -->|Yes| J["Dequantize blocks\n(Q4_0, Q8_0, K-quant, ...)"]
    J --> I
    I --> K["Tensor ready for inference"]
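As an illustration of the dequantization step, the sketch below expands a single Q8_0 block. It assumes the block layout is a 2-byte little-endian f16 scale followed by 32 signed 8-bit values, with each output computed as scale times value; dequantizeQ8_0 and the constant names are hypothetical and are not the loader's actual API.

```zig
const std = @import("std");

pub const QK8_0 = 32; // elements per Q8_0 block
pub const Q8_0_BLOCK_BYTES = 2 + QK8_0; // f16 scale + 32 int8 values

/// Expands one Q8_0 block into 32 f32 values.
pub fn dequantizeQ8_0(block: *const [Q8_0_BLOCK_BYTES]u8, out: *[QK8_0]f32) void {
    // The scale is stored as raw f16 bits; reinterpret, then widen to f32.
    const scale_bits = std.mem.readInt(u16, block[0..2], .little);
    const scale: f32 = @as(f16, @bitCast(scale_bits));
    for (block[2..], out) |byte, *dst| {
        const q: i8 = @bitCast(byte); // quantized values are signed
        dst.* = scale * @as(f32, @floatFromInt(q));
    }
}
```

Other quantized types follow the same pattern of per-block scales applied to packed integers, but with more involved bit unpacking.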

6.1 GGUFModelLoader Convenience API

pub const GGUFModelLoader = struct {
    gguf: GGUFFile,
    reader: GGUFReader,
    allocator: std.mem.Allocator,

    pub fn init(file_path: []const u8, allocator: std.mem.Allocator) !GGUFModelLoader {
        const file = try std.fs.cwd().openFile(file_path, .{});
        // Close the handle if parsing fails; on success the loader owns it.
        errdefer file.close();
        var reader = GGUFReader.init(file, allocator);
        const gguf = try reader.readFile();
        return GGUFModelLoader{ .gguf = gguf, .reader = reader, .allocator = allocator };
    }

    pub fn loadTensor(self: *GGUFModelLoader, name: []const u8) !Tensor {
        const info = self.gguf.getTensor(name) orelse return error.TensorNotFound;
        const raw = try self.reader.readTensorData(&self.gguf, info, self.allocator);
        defer self.allocator.free(raw);
        // ... dequantize based on info.type ...
    }
};

References

1. Dettmers, T. and Zettlemoyer, L. "The case for 4-bit precision: k-bit Inference Scaling Laws." arXiv:2212.09720, 2022.
2. Egiazarian, V. et al. "Extreme Compression of Large Language Models via Additive Quantization." arXiv:2401.06118, 2024.
3. Gerganov, G. "GGUF Specification." GitHub -- ggerganov/ggml, 2023.
4. Frantar, E. et al. "GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers." ICLR, 2023.