linear_algebra.quantization¶
Module Path¶
Source file: src/linear_algebra/quantization.zig
Public Types¶
QuantType¶
Supported quantization formats. Lower-bit formats trade precision for memory savings.
| Variant | Bits/Weight | Description |
|---|---|---|
| F32 | 32 | Full precision (no quantization) |
| F16 | 16 | Half precision |
| INT8 | 8 | Symmetric 8-bit integer |
| Q4_0 | 4 | 4-bit with per-block scale |
| Q4_1 | 4 | 4-bit with per-block scale and minimum |
| Q8_0 | 8 | 8-bit with per-block scale |
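To make the memory trade-off concrete, the sketch below computes the nominal storage for a 4096 x 4096 weight matrix at each bit width from the table. `nominalBytes` is a hypothetical helper, not part of this module, and it deliberately ignores the per-block scale and minimum values that the block formats store alongside the weights, so real sizes will be somewhat larger.

```zig
const std = @import("std");

/// Nominal storage in bytes for `n` weights at `bits` bits per weight.
/// Hypothetical helper: ignores per-block scale/minimum overhead.
fn nominalBytes(n: usize, bits: usize) usize {
    return (n * bits + 7) / 8; // round up to whole bytes
}

test "nominal footprint of a 4096 x 4096 matrix" {
    const n: usize = 4096 * 4096;
    const mib: usize = 1024 * 1024;
    try std.testing.expectEqual(64 * mib, nominalBytes(n, 32)); // F32
    try std.testing.expectEqual(32 * mib, nominalBytes(n, 16)); // F16
    try std.testing.expectEqual(16 * mib, nominalBytes(n, 8)); // INT8, Q8_0
    try std.testing.expectEqual(8 * mib, nominalBytes(n, 4)); // Q4_0, Q4_1
}
```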
QuantParams¶
pub const QuantParams = struct {
    quant_type: QuantType,
    block_size: usize = 32,
    symmetric: bool = true,
    calibration_data: ?Tensor(f32) = null,
};
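Parameters controlling the quantization process. Only quant_type is required: block_size defaults to 32, symmetric defaults to true, and calibration_data is optional (null by default).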
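As a rough illustration of how the field defaults behave, the sketch below redeclares stand-in versions of `QuantType` and `QuantParams` so that it compiles on its own. The stand-ins are assumptions for illustration only; in particular `calibration_data` uses a plain slice in place of `?Tensor(f32)`.

```zig
const std = @import("std");

// Stand-in declarations so the sketch is self-contained; the real module
// provides these, and uses ?Tensor(f32) for calibration_data.
const QuantType = enum { F32, F16, INT8, Q4_0, Q4_1, Q8_0 };

const QuantParams = struct {
    quant_type: QuantType,
    block_size: usize = 32,
    symmetric: bool = true,
    calibration_data: ?[]const f32 = null, // stand-in for ?Tensor(f32)
};

test "omitted fields take their defaults" {
    const defaults = QuantParams{ .quant_type = .Q4_0 };
    try std.testing.expectEqual(@as(usize, 32), defaults.block_size);
    try std.testing.expect(defaults.symmetric);
    try std.testing.expect(defaults.calibration_data == null);

    // Any default can be overridden field by field.
    const custom = QuantParams{ .quant_type = .Q8_0, .block_size = 64 };
    try std.testing.expectEqual(@as(usize, 64), custom.block_size);
}
```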
QuantizedTensor(quant_type)¶
Compile-time-specialized tensor that stores weights in the given quantized format. Exposes the same shape metadata as Tensor(f32) but uses a packed internal representation.
Public Functions¶
quantizeTensor¶
Quantize a full-precision tensor into the specified format. The input tensor is not modified.
Parameters:
| Name | Type | Description |
|---|---|---|
| tensor | Tensor(f32) | Source tensor in f32 |
| quant_type | QuantType | Target format |
Returns: a QuantizedTensor containing the packed data and quantization metadata (scales, zero-points).
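The following is a minimal sketch of what symmetric 8-bit quantization with a per-block scale can look like, assuming a block of 32 values and a scale chosen so the largest magnitude maps to 127. `Block` and `quantizeBlock` are hypothetical names; this is not taken from the module's source and its actual storage layout may differ. The builtin names assume Zig 0.11 or later.

```zig
const std = @import("std");

const block_size = 32;

/// Hypothetical block layout: one f32 scale plus 32 signed 8-bit values.
const Block = struct {
    scale: f32,
    q: [block_size]i8,
};

/// Symmetric quantization: q = round(x / scale), scale = max|x| / 127.
fn quantizeBlock(x: [block_size]f32) Block {
    var amax: f32 = 0;
    for (x) |v| {
        const a = if (v < 0) -v else v;
        amax = @max(amax, a);
    }
    // An all-zero block gets scale 1 to avoid dividing by zero.
    const scale: f32 = if (amax == 0) 1 else amax / 127.0;

    var out: Block = .{ .scale = scale, .q = undefined };
    for (x, 0..) |v, i| {
        out.q[i] = @intFromFloat(@round(v / scale));
    }
    return out;
}

test "largest magnitude maps to the end of the i8 range" {
    var x: [block_size]f32 = undefined;
    for (&x, 0..) |*v, i| v.* = @as(f32, @floatFromInt(i)) - 16.0;

    const b = quantizeBlock(x);
    try std.testing.expectEqual(@as(i8, -127), b.q[0]); // x[0] = -16
    try std.testing.expectEqual(@as(i8, 0), b.q[16]); // x[16] = 0
}
```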
dequantizeTensor¶
Reconstruct an approximate f32 tensor from quantized data. The result is allocated with the allocator stored in the quantized tensor; the caller owns it and releases it with deinit().
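To show why the reconstruction is only approximate, here is a sketch of dequantizing one packed 4-bit block. The layout is an assumption made for illustration (32 weights per block, two values per byte with the low nibble first, stored with an offset of 8, and one f32 scale); `PackedBlock` and `dequantizeBlock` are hypothetical names and the module's real packing may differ.

```zig
const std = @import("std");

/// Hypothetical packed block: 32 four-bit values in 16 bytes plus a scale.
const PackedBlock = struct {
    scale: f32,
    nibbles: [16]u8,
};

/// Unpack each byte into two values in [-8, 7] and rescale to f32.
fn dequantizeBlock(b: PackedBlock) [32]f32 {
    var out: [32]f32 = undefined;
    for (b.nibbles, 0..) |byte, i| {
        const lo: i32 = @as(i32, byte & 0x0F) - 8;
        const hi: i32 = @as(i32, byte >> 4) - 8;
        out[2 * i] = @as(f32, @floatFromInt(lo)) * b.scale;
        out[2 * i + 1] = @as(f32, @floatFromInt(hi)) * b.scale;
    }
    return out;
}

test "nibbles unpack to scaled values" {
    // 0xF0: low nibble 0 -> -8, high nibble 15 -> 7.
    const b = PackedBlock{ .scale = 0.5, .nibbles = [_]u8{0xF0} ** 16 };
    const x = dequantizeBlock(b);
    try std.testing.expectEqual(@as(f32, -4.0), x[0]);
    try std.testing.expectEqual(@as(f32, 3.5), x[1]);
}
```

Every weight in a block can only take one of 16 values spaced `scale` apart, which is the source of the precision loss.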
quantizedMatmul¶
Multiply two quantized tensors without fully dequantizing them first. Uses fused dequantize-multiply kernels for better cache utilization.
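The idea of multiplying without fully dequantizing can be sketched with a single dot product over 8-bit blocks: the integer products are accumulated per block and the two scales are applied once per block instead of once per element. `Block8` and `dotBlocks` are hypothetical names, and this is an illustration of the general technique rather than the module's kernel.

```zig
const std = @import("std");

/// Hypothetical 8-bit block: one scale and 32 quantized values.
const Block8 = struct {
    scale: f32,
    q: [32]i8,
};

/// Dot product of two quantized vectors, block by block.
fn dotBlocks(a: []const Block8, b: []const Block8) f32 {
    std.debug.assert(a.len == b.len);
    var sum: f32 = 0;
    for (a, b) |ba, bb| {
        // Integer accumulation stays in i32; no f32 copy of the block is made.
        var acc: i32 = 0;
        for (ba.q, bb.q) |qa, qb| {
            acc += @as(i32, qa) * @as(i32, qb);
        }
        // Apply both scales once per block.
        sum += @as(f32, @floatFromInt(acc)) * ba.scale * bb.scale;
    }
    return sum;
}

test "block dot product" {
    const a = [_]Block8{.{ .scale = 0.5, .q = [_]i8{2} ** 32 }};
    const b = [_]Block8{.{ .scale = 0.25, .q = [_]i8{3} ** 32 }};
    // 32 * (2 * 3) = 192, then 192 * 0.5 * 0.25 = 24.
    try std.testing.expectEqual(@as(f32, 24.0), dotBlocks(&a, &b));
}
```

A matrix multiply would repeat this for each output element, one row of the left operand against one column of the right.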
Error Types¶
- error{UnsupportedQuantType} -- requested format is not implemented.
- error{InvalidBlockSize} -- tensor element count is not a multiple of the block size.
- TensorError.OutOfMemory
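A caller can distinguish these errors with an ordinary `switch`. The sketch below uses a hypothetical `blockCount` helper and a stand-in error set with the same names to show the InvalidBlockSize condition described above; it is not the module's own validation code.

```zig
const std = @import("std");

// Stand-in error set mirroring the names listed above.
const QuantError = error{ UnsupportedQuantType, InvalidBlockSize };

/// Returns the number of blocks, or InvalidBlockSize when the element
/// count is not a multiple of the block size.
fn blockCount(element_count: usize, block_size: usize) QuantError!usize {
    if (block_size == 0 or element_count % block_size != 0) {
        return error.InvalidBlockSize;
    }
    return element_count / block_size;
}

test "block size validation" {
    try std.testing.expectEqual(@as(usize, 128), try blockCount(4096, 32));
    try std.testing.expectError(error.InvalidBlockSize, blockCount(100, 32));

    // Callers can branch on the specific error.
    const n = blockCount(100, 32) catch |err| switch (err) {
        error.InvalidBlockSize => @as(usize, 0),
        error.UnsupportedQuantType => unreachable,
    };
    try std.testing.expectEqual(@as(usize, 0), n);
}
```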
Usage Example¶
const quant = @import("zigllama").linear_algebra.quantization;
const Tensor = @import("zigllama").foundation.tensor.Tensor;
var weights = try Tensor(f32).init(allocator, &[_]usize{ 4096, 4096 });
defer weights.deinit();
// ... fill weights ...
// Quantize to 4-bit
var q_weights = try quant.quantizeTensor(weights, .Q4_0);
defer q_weights.deinit();
// Dequantize back to f32 for verification
var recovered = try quant.dequantizeTensor(q_weights);
defer recovered.deinit();
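One simple way to carry out the verification step is to compare the original and recovered values element by element. The helper below is a hypothetical sketch that works on plain `f32` slices; how to obtain such slices from a `Tensor(f32)` is not shown here because the tensor's accessor API is not documented in this section.

```zig
const std = @import("std");

/// Largest absolute difference between two equally sized buffers.
fn maxAbsError(a: []const f32, b: []const f32) f32 {
    std.debug.assert(a.len == b.len);
    var worst: f32 = 0;
    for (a, b) |x, y| {
        const d = if (x > y) x - y else y - x;
        worst = @max(worst, d);
    }
    return worst;
}

test "max absolute error" {
    const original = [_]f32{ 1.0, 2.0, 3.0 };
    const recovered = [_]f32{ 1.0, 2.5, 2.75 };
    try std.testing.expectEqual(@as(f32, 0.5), maxAbsError(&original, &recovered));
}
```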
Related Modules¶
- linear_algebra.k_quantization -- K-quant formats (Q4_K, Q5_K, Q6_K).
- linear_algebra.iq_quantization -- Importance quantization formats.
- foundation.gguf_format -- GGMLType enum maps to these quantization variants.