HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

2026-06-22

Machine Learning, Artificial Intelligence
AI summary

The authors introduce HyperQuant, a post-training method that reduces the memory footprint of large language and diffusion models with little loss in accuracy. It transforms and compresses model weights and key-value (KV) caches by combining four established techniques: a randomized Hadamard transform, lattice quantization, Rice coding, and bias correction. The approach outperforms HIGGS on weights at 3 to 5 bits per scalar and both TurboQuant and OCTOPUS on the KV cache, and it also quantizes the 19B-parameter LTX-2 video model without observable per-frame artifacts. The bias correction keeps the KV-cache reconstruction unbiased under inner products, so the attention mechanism behaves as before. Finally, the pipeline is integrated with 8-bit and 4-bit Tensor-Core MMA paths for fast computation.

post-training quantization, large language models, key-value cache, Hadamard transform, lattice quantization, Rice coding, bias correction, entropy coding, Tensor-Core MMA, compression ratio
Authors
Yuval Domb, Hadar Sackstein, Tomer Solberg
Abstract
We present HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the weights and the KV cache of large language and diffusion transformers. Across a suite of self-contained experiments (Table 1), HyperQuant outperforms the recent HIGGS scheme at every operating point from 3 to 5 bits per scalar (bps) on weights, and beats both TurboQuant and OCTOPUS on KV quantization down to 1.7 bps. Beyond the LLM setting, HyperQuant quantizes the 19B-parameter LTX-2 DiT video model with no observable per-frame artifacts. End-to-end on an H100 at 4 bps, HyperQuant compresses the linear weights ~3.9x and the KV cache ~3.79x at near-lossless quality. HyperQuant combines four known ideas into a single construction: (i) a per-tile Randomized Hadamard Transform that makes the per-coordinate distribution of weights and activations approximately Gaussian; (ii) quantization to a low-dimensional optimal lattice (E8, D4, A2, or Z); (iii) lossless bit-stripping and near-entropy-optimal variable-length Rice coding of the lattice indices; and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics. We further integrate the pipeline with 8-bit and 4-bit Tensor-Core MMA paths (fp8-e4m3, int8, nvfp4, mxfp4), and find that int8 beats fp8 on the post-RHT lattice output. Project page: https://moonmath.ai/hyperquant/
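To make components (i) to (iii) of the abstract concrete, the following is a minimal pure-Python sketch of one possible tile pipeline: random sign flips followed by an orthonormal Walsh-Hadamard transform, nearest-point rounding to the D4 lattice (using the standard Conway-Sloane rule), and Rice coding of the resulting integer indices. It is an illustration under stated assumptions, not the authors' implementation. The function names, the tile length of 16, the step size of 0.5, the zigzag mapping of signed indices, and the Rice parameter k = 1 are all hypothetical choices made for this example. The bit-stripping step and the KV-cache bias correction are not shown, since the abstract does not specify them.

```python
import math
import random

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    x, n, h = list(x), len(x), 1
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[j], x[j + h] = x[j] + x[j + h], x[j] - x[j + h]
        h *= 2
    return [v / math.sqrt(n) for v in x]

def rht(x, signs):
    """Randomized Hadamard transform: flip signs, then apply the orthonormal WHT."""
    return fwht([s * v for s, v in zip(signs, x)])

def rht_inverse(y, signs):
    """The orthonormal WHT is its own inverse; afterwards undo the sign flips."""
    return [s * v for s, v in zip(signs, fwht(y))]

def nearest_d4(v):
    """Nearest point of D4 (integer 4-vectors whose coordinates sum to an even number)."""
    r = [math.floor(c + 0.5) for c in v]
    if sum(r) % 2 == 0:
        return r
    # Odd parity: re-round the coordinate with the largest rounding error the other way.
    i = max(range(4), key=lambda k: abs(v[k] - r[k]))
    r[i] += 1 if v[i] >= r[i] else -1
    return r

def zigzag(n):
    """Map signed ints to non-negative ints: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1

def unzigzag(u):
    return u // 2 if u % 2 == 0 else -(u + 1) // 2

def rice_encode(values, k):
    """Rice code: quotient (u >> k) in unary, then the k low bits of u in binary."""
    bits = []
    for n in values:
        u = zigzag(n)
        bits.append("1" * (u >> k) + "0")
        if k:
            bits.append(format(u & ((1 << k) - 1), "0{}b".format(k)))
    return "".join(bits)

def rice_decode(bits, k, count):
    out, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q, pos = q + 1, pos + 1
        pos += 1  # skip the "0" that terminates the unary part
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        out.append(unzigzag((q << k) | r))
    return out

def quantize_tile(tile, step, seed=0):
    """Hypothetical helper: RHT a tile, then quantize it in 4-D blocks to step * D4."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in tile]
    y = rht(tile, signs)
    idx = []
    for i in range(0, len(y), 4):
        idx.extend(nearest_d4([c / step for c in y[i:i + 4]]))
    return idx, signs

def dequantize_tile(idx, signs, step):
    return rht_inverse([step * q for q in idx], signs)

# Demo on a synthetic Gaussian tile (stand-in data, not real model weights).
rng = random.Random(1)
tile = [rng.gauss(0.0, 1.0) for _ in range(16)]
idx, signs = quantize_tile(tile, step=0.5)
code = rice_encode(idx, k=1)
assert rice_decode(code, 1, len(idx)) == idx  # the entropy-coding stage is lossless
recon = dequantize_tile(idx, signs, 0.5)
mse = sum((a - b) ** 2 for a, b in zip(tile, recon)) / len(tile)
print("bits per scalar:", len(code) / len(tile), "MSE:", mse)
```

Because the transform is orthonormal, the squared error measured on the reconstructed tile equals the squared error introduced by the lattice rounding in the transformed domain. The printed bits-per-scalar figure depends on the step size and on k, so it should not be compared with the operating points reported in the abstract.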