linear_algebra.iq_quantization¶
Module Path¶
Source file: src/linear_algebra/iq_quantization.zig
Internal module. This API may change between releases.
Public Types¶
IQType¶
Importance quantization formats. These achieve very low bit rates (1--4 bits per weight) by exploiting non-uniform weight distributions.
| Variant | Approx Bits/Weight | Notes |
|---|---|---|
IQ1_S | 1.5 | 1-bit with super-block scales |
IQ1_M | 1.75 | 1-bit with mixed scales |
IQ2_XXS | 2.06 | 2-bit extra-extra-small overhead |
IQ2_XS | 2.31 | 2-bit extra-small overhead |
IQ2_S | 2.5 | 2-bit standard |
IQ3_XXS | 3.06 | 3-bit extra-extra-small overhead |
IQ3_S | 3.44 | 3-bit standard |
IQ4_XS | 4.25 | 4-bit extra-small overhead |
IQ4_NL | 4.5 | 4-bit with non-linear mapping |
BlockIQ1S¶
BlockIQ2XS¶
BlockIQ3S¶
pub const BlockIQ3S = struct {
d: f16,
qs: [3 * QK_K / 8]u8,
qh: [QK_K / 8]u8,
signs: [QK_K / 8]u8,
scales: [QK_K / 64]u8,
};
BlockIQ4XS¶
pub const BlockIQ4XS = struct {
d: f16,
scales_h: u16,
scales_l: [QK_K / 64]u8,
qs: [QK_K / 2]u8,
};
IQuantizer¶
pub const IQuantizer = struct {
allocator: std.mem.Allocator,
lookup_tables: ?*const LookupTables,
};
Quantizer/dequantizer for importance-quantized formats. Uses precomputed lookup tables for fast dequantization of IQ4_NL and IQ3_S.
Public Functions¶
IQuantizer.quantize¶
Quantize a contiguous f32 buffer. Input length must be a multiple of QK_K (256).
IQuantizer.dequantize¶
Dequantize packed bytes back to f32.
Error Types¶
error{InvalidAlignment}-- input length is not a multiple ofQK_K.error{UnsupportedIQType}-- the requested IQ format is not yet implemented.error{OutOfMemory}
Usage Example¶
const iq = @import("zigllama").linear_algebra.iq_quantization;
var iquantizer = iq.IQuantizer{
.allocator = allocator,
.lookup_tables = null, // auto-initialize on first use
};
const packed = try iquantizer.quantize(float_data, .IQ4_NL);
defer allocator.free(packed);
const restored = try iquantizer.dequantize(packed, .IQ4_NL);
defer allocator.free(restored);
Related Modules¶
linear_algebra.k_quantization-- K-quant family (Q4_K, Q5_K, Q6_K).linear_algebra.quantization-- Higher-level quantization API.foundation.gguf_format--GGMLTypevariants that correspond to IQ formats.