Importance Quantization¶

Importance quantization (IQ) is a family of formats that extend the idea of non-uniform bit allocation: instead of treating every weight equally, IQ formats use importance maps to steer precision toward weights that most affect model output. Combined with non-linear quantization levels and codebook-based encoding, IQ formats achieve viable model quality at compressions that were previously considered destructive.

1. Theory: Importance-Weighted Quantization¶

Saliency and Bit Allocation¶

Weight importance (saliency)

The importance of weight \( w_i \) measures how much the model's loss changes when \( w_i \) is perturbed. A common proxy is the diagonal of the Fisher information matrix:

\[ I_i = \mathbb{E}\!\left[\left(\frac{\partial \mathcal{L}}{\partial w_i}\right)^{\!2}\right] \]

Weights with high \( I_i \) are "important" -- quantization error in these weights causes disproportionate output degradation.

Optimal Bit Allocation¶

Rate-distortion optimal allocation

Given a total bit budget \( B \) across \( n \) weights, the distortion-minimizing allocation assigns bits proportional to the log of importance:

\[ b_i^* = \bar{b} + \frac{1}{2} \log_2 \frac{I_i}{\left(\prod_{j=1}^n I_j\right)^{1/n}} \]

where \( \bar{b} = B/n \) is the average bit rate. High-importance weights receive more bits; low-importance weights receive fewer.

In practice, IQ formats approximate this optimal allocation using discrete importance levels (bitmaps) rather than per-weight bit rates.

flowchart LR
    W["Weight Tensor"] --> S["Compute Saliency\n(Fisher diagonal)"]
    S --> IM["Build Importance Map\n(bitmap / multi-level)"]
    IM --> HI["High-Importance Weights\n(more bits / finer grid)"]
    IM --> LO["Low-Importance Weights\n(fewer bits / coarser grid)"]
    HI --> PACK["Pack into IQ Block"]
    LO --> PACK

2. IQ Format Family¶

ZigLlama implements the full IQ format family from llama.cpp, spanning 1.5 to 4.5 bits per weight:

Format	bpw	Block Size	Key Feature
IQ1_S	1.56	256	Extreme compression with importance bitmaps
IQ1_M	1.75	256	Medium variant with additional scale bits
IQ2_XXS	2.06	256	Ultra-small 2-bit with grid quantization
IQ2_XS	2.31	256	Extra-small 2-bit
IQ2_S	2.50	256	Standard 2-bit with importance
IQ2_M	2.70	256	Medium 2-bit
IQ3_XXS	3.06	256	Ultra-small 3-bit
IQ3_XS	3.30	256	Extra-small 3-bit
IQ3_S	3.44	256	Standard 3-bit
IQ4_XS	4.25	256	Extra-small 4-bit with importance
IQ4_NL	4.50	256	Non-linear quantization with lookup table

Naming convention

The naming follows a pattern: IQ{bits}_{size} where bits is the approximate integer precision and size indicates the overhead level: S (standard), XS (extra-small), XXS (ultra-small), M (medium), NL (non-linear).

3. Block Structures¶

BlockIQ1S -- Extreme 1-bit Compression¶

IQ1_S block format

IQ1_S represents 256 values using importance bitmaps that classify each value as either +1 or -1 (after scaling), with high-importance values receiving separate treatment.

Field	Type	Size (bytes)	Description
`d`	`f16`	2	Super-block scale
`qs`	`[32]u8`	32	Packed sign bits (1 bit per value)
`qh`	`[8]u16`	16	Importance bitmaps (high-importance indicators)
Total		50	Per 256 values

pub const BlockIQ1S = extern struct {
    d: f16,
    qs: [32]u8,       // sign bits: 0 = +scale, 1 = -scale
    qh: [8]u16,       // importance bitmaps for 8 sub-blocks

    pub fn dequantize(self: BlockIQ1S, output: *[QK_K]f32) void {
        const scale: f32 = @floatCast(self.d);

        for (0..QK_K) |i| {
            // Extract sign bit
            const sign: u1 = @truncate(self.qs[i / 8] >> @truncate(i % 8));
            const base_val: f32 = if (sign == 0) scale else -scale;

            // Check importance bitmap for this sub-block
            const sub_block = i / 32;
            const sub_idx: u4 = @truncate(i % 32);
            const important: bool = (self.qh[sub_block] >> sub_idx) & 1 == 1;

            // Important values get a scaled magnitude boost
            output[i] = if (important) base_val * 1.5 else base_val;
        }
    }
};

IQ1_S bits per weight

\[ \text{bpw} = \frac{50 \times 8}{256} = 1.5625 \approx 1.56 \]

This is approximately 20x compression versus F32.

BlockIQ2XS -- 2-bit with Importance¶

Field	Type	Size (bytes)	Description
`d`	`f16`	2	Super-block scale
`qs`	`[64]u8`	64	Packed 2-bit values (4 per byte)
`scales`	`[8]u8`	8	Sub-block scales with importance flags
Total		74	Per 256 values

pub const BlockIQ2XS = extern struct {
    d: f16,
    qs: [64]u8,        // packed 2-bit quantized values
    scales: [8]u8,      // sub-block scales + importance bits

    pub fn dequantize(self: BlockIQ2XS, output: *[QK_K]f32) void {
        const d_val: f32 = @floatCast(self.d);

        for (0..8) |j| {
            const sc_raw = self.scales[j];
            const sc: f32 = @floatFromInt(sc_raw & 0x0F);
            const importance_shift: u1 = @truncate(sc_raw >> 4);

            for (0..32) |k| {
                const idx = j * 32 + k;
                const byte_idx = idx / 4;
                const shift: u3 = @truncate((idx % 4) * 2);
                const q2: i32 = @as(i32, (self.qs[byte_idx] >> shift) & 0x03) - 1;

                const imp_factor: f32 = if (importance_shift == 1 and
                    q2 != 0) 1.25 else 1.0;
                output[idx] = d_val * sc * @as(f32, @floatFromInt(q2)) * imp_factor;
            }
        }
    }
};

BlockIQ3S -- 3-bit with Importance¶

Field	Type	Size (bytes)	Description
`d`	`f16`	2	Super-block scale
`qs`	`[96]u8`	96	Packed 3-bit values
`qh`	`[16]u8`	16	High bits for 3-bit reconstruction
`signs`	`[16]u8`	16	Sign bits for importance-weighted values
`scales`	`[8]u8`	8	Sub-block scales
Total		138	Per 256 values

BlockIQ4XS -- 4-bit Extra-Small¶

Field	Type	Size (bytes)	Description
`d`	`f16`	2	Super-block scale
`scales_h`	`[2]u8`	2	High bits of sub-block scales
`scales_l`	`[8]u8`	8	Low bits of sub-block scales
`qs`	`[128]u8`	128	Packed 4-bit quantized values
Total		140	Per 256 values

Bits per weight: \( 140 \times 8 / 256 = 4.375 \approx 4.25 \) (with alignment).

4. Importance Maps¶

How Importance Bitmaps Work¶

Importance maps are bit arrays where each bit indicates whether the corresponding weight is "important" (high saliency) or "normal." The quantizer computes importance during the quantization pass and embeds the bitmap into the block structure.

flowchart TB
    subgraph Quantization["Quantization Pass"]
        FISH["Compute Fisher\ndiagonal I_i"]
        THRESH["Threshold:\nI_i > median(I)?"]
        BIT["Set importance\nbit = 1"]
        NOBIT["Set importance\nbit = 0"]
    end
    subgraph Dequantization["Dequantization"]
        CHECK["Check importance bit"]
        HI_PATH["High-importance path:\nfiner grid or magnitude boost"]
        LO_PATH["Normal path:\nstandard dequantization"]
    end

    FISH --> THRESH
    THRESH -->|Yes| BIT
    THRESH -->|No| NOBIT
    BIT --> CHECK
    NOBIT --> CHECK
    CHECK -->|bit = 1| HI_PATH
    CHECK -->|bit = 0| LO_PATH

Multi-Level Importance¶

Some IQ formats use multiple importance levels rather than a binary bitmap. For example, IQ2_M uses 4 importance tiers:

Tier	Fraction of Weights	Effective Precision	Treatment
0 (low)	~50%	1.5 bits	Coarse grid, sign only
1 (medium)	~25%	2 bits	Standard 2-bit grid
2 (high)	~15%	2.5 bits	Finer grid
3 (critical)	~10%	3+ bits	Maximum available precision

Information-theoretic justification

The importance-weighted total distortion is:

\[ D = \sum_{i=1}^{n} I_i \cdot (w_i - \hat{w}_i)^2 \]

Minimizing \( D \) subject to a fixed total bit budget yields the allocation that assigns more bits to high-\( I_i \) weights -- exactly what the multi-level importance map achieves in a discrete approximation.

Importance Array Storage¶

/// Multi-level importance map for a 256-element super-block.
pub const ImportanceMap = struct {
    /// 2-bit importance level per value, packed 4 per byte.
    levels: [64]u8,  // 256 values * 2 bits / 8 = 64 bytes

    pub fn getLevel(self: ImportanceMap, idx: usize) u2 {
        const byte = self.levels[idx / 4];
        const shift: u3 = @truncate((idx % 4) * 2);
        return @truncate(byte >> shift);
    }
};

5. Non-Linear Quantization: IQ4_NL¶

Motivation¶

Non-linear quantization

Standard (linear) quantization uses uniformly spaced levels. Non-linear quantization uses arbitrarily spaced levels chosen to minimize reconstruction error for the actual weight distribution.

Neural network weights follow approximately Gaussian distributions, not uniform. More weights cluster near zero than at the tails. Non-linear quantization places more levels near zero where weights are dense:

flowchart LR
    subgraph Linear["Linear 4-bit (16 levels)"]
        L0["|----|----|----|----|----|----|----|"]
        L1["uniform spacing"]
    end
    subgraph NonLinear["Non-linear 4-bit (16 levels)"]
        N0["|--|-|--|---|-----|---|--|-|--|"]
        N1["dense near zero, sparse at tails"]
    end

IQ4_NL Lookup Table¶

IQ4_NL uses 16 cluster centers determined by k-means clustering on representative weight distributions:

/// IQ4_NL dequantization lookup table.
/// 16 non-linearly spaced reconstruction values.
pub const IQ4NL_LUT: [16]f32 = .{
    -1.0000, -0.6962, -0.5251, -0.3949,
    -0.2876, -0.1922, -0.1040, -0.0218,
     0.0536,  0.1290,  0.2093,  0.2972,
     0.3979,  0.5220,  0.6944,  1.0000,
};

IQ4_NL dequantization

Extract the 4-bit index \( q_i \in [0, 15] \) from the packed data.
Look up the reconstruction value: \( v_i = \text{LUT}[q_i] \).
Scale by the block scale: \( \hat{w}_i = d \cdot v_i \).

No subtraction, no zero-point -- just a table lookup and a multiply.

Block Structure¶

Field	Type	Size (bytes)	Description
`d`	`f16`	2	Block scale
`qs`	`[128]u8`	128	Packed 4-bit indices into LUT (2 per byte)
Total		130	Per 256 values

Block size note

IQ4_NL can use either 32-element or 256-element blocks depending on the implementation. The llama.cpp reference uses a super-block of 256 with sub-block scales, yielding the same 4.5 bpw as Q4_K but with non-linear levels.

pub const BlockIQ4NL = extern struct {
    d: f16,
    qs: [128]u8,        // 4-bit LUT indices, packed

    pub fn dequantize(self: BlockIQ4NL, output: *[QK_K]f32) void {
        const scale: f32 = @floatCast(self.d);

        for (0..QK_K / 2) |j| {
            const byte = self.qs[j];
            const idx_lo: u4 = @truncate(byte);
            const idx_hi: u4 = @truncate(byte >> 4);

            output[2 * j] = scale * IQ4NL_LUT[idx_lo];
            output[2 * j + 1] = scale * IQ4NL_LUT[idx_hi];
        }
    }
};

Non-Linear Importance Curves¶

The cluster centers in the LUT are not uniformly spaced. The spacing reflects the Gaussian-like distribution of neural network weights:

Region	LUT Indices	Spacing	Weight Density
Near zero	6--9	~0.05	High (many weights)
Mid-range	3--5, 10--12	~0.09	Medium
Tails	0--2, 13--15	~0.17	Low (few weights)

This non-uniform spacing ensures that the dense central region has fine-grained representation, while the sparse tails (which contribute less to total error) use coarser levels.

6. Extreme Compression¶

IQ1_S: The 1.5 bpw Frontier¶

Compression ratios at extreme bit rates

Format	bpw	7B Model Size	Compression vs F32
F32	32.0	26.0 GB	1x
Q4_K	4.5	4.0 GB	6.5x
IQ3_S	3.44	3.0 GB	8.7x
IQ2_XS	2.31	2.0 GB	13.0x
IQ1_S	1.56	1.4 GB	18.6x

At 1.56 bpw, IQ1_S achieves nearly 20x compression -- a 7B model fits in under 1.5 GB of RAM. This makes it possible to run LLMs on devices with as little as 2 GB total memory.

Quality at Extreme Compression¶

The trade-off is significant quality degradation:

Format	bpw	Perplexity (LLaMA-2 7B)	PPL Increase
F16	16.0	5.68	baseline
IQ4_XS	4.25	5.75	+1.2%
IQ3_S	3.44	5.98	+5.3%
IQ2_S	2.50	6.72	+18.3%
IQ2_XS	2.31	7.10	+25.0%
IQ1_M	1.75	8.95	+57.6%
IQ1_S	1.56	10.42	+83.5%

Scaling law for extreme quantization

Empirically, perplexity increases approximately as:

\[ \text{PPL}(b) \approx \text{PPL}_0 \cdot \exp\!\left(\frac{\alpha}{b^2}\right) \]

where \( b \) is bits per weight, \( \text{PPL}_0 \) is the F16 baseline, and \( \alpha \) is a model-dependent constant. The quadratic exponent explains why quality degrades gently from 8 to 4 bpw but steeply below 2 bpw.

When Extreme Compression Makes Sense¶

Edge deployment: Running any model is better than no model when RAM is severely constrained (mobile, embedded, browser WASM).
Draft models: Speculative decoding uses a small/fast draft model to propose tokens, verified by a larger model. IQ1_S draft models are fast and fit in L3 cache.
Rapid prototyping: Quick experiments where exact quality is less important than iteration speed.

7. K-Quant vs IQ-Quant Comparison¶

Criterion	K-Quantization	Importance Quantization
Core idea	Two-level hierarchical scales	Saliency-weighted bit allocation
Scale structure	Super-block + sub-block scales	Super-block + importance bitmaps
Minimum bpw	~3.35 (Q2_K)	~1.56 (IQ1_S)
Best quality at 4.5 bpw	Q4_K_M (PPL 5.73)	IQ4_NL (PPL 5.72)
Dequantization speed	Fast (simple arithmetic)	Moderate (LUT lookups, bit extraction)
Quantization speed	Fast	Slow (requires importance computation)
Format complexity	Moderate	High
llama.cpp maturity	Stable, well-tested	Newer, actively evolving

Decision Matrix¶

flowchart TD
    START["Choose Quantization Format"]
    BPW{"Target bpw?"}
    QUAL{"Quality priority?"}
    SPEED{"Dequant speed?"}

    START --> BPW
    BPW -->|"> 5 bpw"| Q6K["Q6_K"]
    BPW -->|"4-5 bpw"| QUAL
    BPW -->|"2-4 bpw"| IQ3["IQ3_S or IQ3_XS"]
    BPW -->|"< 2 bpw"| IQ1["IQ1_S or IQ1_M"]

    QUAL -->|"Maximum"| SPEED
    QUAL -->|"Good enough"| Q4KS["Q4_K_S"]

    SPEED -->|"Fast dequant"| Q4KM["Q4_K_M"]
    SPEED -->|"Quality over speed"| IQ4["IQ4_NL or IQ4_XS"]

8. IQuantizer API¶

pub const IQuantizer = struct {
    allocator: std.mem.Allocator,
    /// Pre-computed importance scores (optional; if null, uniform importance).
    importance: ?[]const f32,

    pub fn init(
        allocator: std.mem.Allocator,
        importance: ?[]const f32,
    ) IQuantizer {
        return .{
            .allocator = allocator,
            .importance = importance,
        };
    }

    /// Quantize to IQ4_NL (non-linear 4-bit).
    pub fn quantizeIQ4NL(
        self: IQuantizer,
        data: []const f32,
    ) ![]BlockIQ4NL {
        const n_blocks = (data.len + QK_K - 1) / QK_K;
        const blocks = try self.allocator.alloc(BlockIQ4NL, n_blocks);

        for (0..n_blocks) |bi| {
            const start = bi * QK_K;
            const end = @min(start + QK_K, data.len);
            const imp = if (self.importance) |imp| imp[start..end] else null;
            blocks[bi] = quantizeBlockIQ4NL(data[start..end], imp);
        }

        return blocks;
    }

    /// Quantize to IQ2_XS (2-bit with importance).
    pub fn quantizeIQ2XS(
        self: IQuantizer,
        data: []const f32,
    ) ![]BlockIQ2XS {
        const n_blocks = (data.len + QK_K - 1) / QK_K;
        const blocks = try self.allocator.alloc(BlockIQ2XS, n_blocks);

        for (0..n_blocks) |bi| {
            const start = bi * QK_K;
            const end = @min(start + QK_K, data.len);
            const imp = if (self.importance) |imp| imp[start..end] else null;
            blocks[bi] = quantizeBlockIQ2XS(data[start..end], imp);
        }

        return blocks;
    }

    /// Quantize to IQ1_S (extreme 1-bit compression).
    pub fn quantizeIQ1S(
        self: IQuantizer,
        data: []const f32,
    ) ![]BlockIQ1S {
        const n_blocks = (data.len + QK_K - 1) / QK_K;
        const blocks = try self.allocator.alloc(BlockIQ1S, n_blocks);

        for (0..n_blocks) |bi| {
            const start = bi * QK_K;
            const end = @min(start + QK_K, data.len);
            const imp = if (self.importance) |imp| imp[start..end] else null;
            blocks[bi] = quantizeBlockIQ1S(data[start..end], imp);
        }

        return blocks;
    }

    /// Dequantize IQ4_NL blocks to f32.
    pub fn dequantizeIQ4NL(
        blocks: []const BlockIQ4NL,
        output: []f32,
    ) void {
        for (blocks, 0..) |block, bi| {
            block.dequantize(output[bi * QK_K ..][0..QK_K]);
        }
    }

    /// Compute importance scores from gradient information.
    /// Returns Fisher diagonal approximation.
    pub fn computeImportance(
        self: IQuantizer,
        gradients: []const f32,
        weights: []const f32,
    ) ![]f32 {
        std.debug.assert(gradients.len == weights.len);
        const imp = try self.allocator.alloc(f32, weights.len);

        for (0..weights.len) |i| {
            // Fisher diagonal: (dL/dw)^2
            imp[i] = gradients[i] * gradients[i];
        }

        return imp;
    }

    pub fn deinit(self: *IQuantizer, blocks: anytype) void {
        self.allocator.free(blocks);
    }
};

IQ quantization workflow

Compute importance scores from gradient data (or use uniform importance if gradients are unavailable).
Sort weights by importance within each super-block to determine which weights receive higher precision.
Build importance bitmap by thresholding importance scores.
Quantize high-importance weights using finer grids or more bits.
Quantize low-importance weights using coarser grids.
Pack both groups into the IQ block structure with embedded bitmaps.

The quantization step is more expensive than K-quantization (requires a sorting pass and importance computation), but the resulting model achieves better quality at the same bit rate.

References¶

Gerganov, G. "Importance matrix quantization." llama.cpp, 2024. https://github.com/ggerganov/llama.cpp/pull/4773 ↩
Dettmers, T. and Zettlemoyer, L. "The case for 4-bit precision: k-bit Inference Scaling Laws." ICML, 2023. https://arxiv.org/abs/2212.09720 ↩
Egiazarian, V. et al. "Extreme Compression of Large Language Models via Additive Quantization." 2024. https://arxiv.org/abs/2401.06118 ↩
Chee, J. et al. "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." NeurIPS, 2023. https://arxiv.org/abs/2307.13304 ↩