GRINQH: Graded Input-based Quantization Hierarchy for Efficient LLM Generation

2026-06-22 · Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors address slow text generation in large language models, which is limited mainly by GPU memory bandwidth. They introduce GRINQH, a method that quantizes model weights and uses activation magnitudes as a measure of importance, so that the model can apply different precision levels dynamically during decoding. This speeds up decoding without greatly sacrificing output quality. The authors tested GRINQH on Llama3 and Qwen3 models, where it outperformed previous fixed- and mixed-precision methods. They also built a custom GPU kernel to make these improvements practical.

Autoregressive decoding, Large Language Models (LLMs), Quantization, Post-training quantization, GPU memory bandwidth, Sparse representation, Mixed-precision arithmetic, Activation magnitude, Custom GPU kernel, Llama3 and Qwen3 models
Authors
Jette Oberländer, Jan Finkbeiner, Catherine M. Schöfmann, Emre Neftci
Abstract
Autoregressive decoding with LLMs is primarily bottlenecked by GPU memory bandwidth, especially in edge-computing settings. While quantization is essential for mitigating this bottleneck, most existing methods treat inference as a uniform process and fail to account for the asymmetry between the compute-bound prefill stage and the memory-bound decoding stage. We propose GRINQH (GRaded INput-based Quantization Hierarchy), a weight-only post-training quantization framework that accelerates decoding by unifying quantization and sparsification. GRINQH leverages activation magnitudes as a proxy for computational importance to dynamically assign weight channels to different precision levels, enabling flexible average bit widths during decoding. Evaluated on Llama3 and Qwen3 models, GRINQH outperforms state-of-the-art fixed- and mixed-precision baselines at comparable 3- and 4-bit settings, even enabling effective 2-bit generation. We experimentally verify theoretical speedups by leveraging a hierarchical nested memory layout for multi-precision storage in a custom GPU kernel. Ultimately, GRINQH establishes a new state-of-the-art Pareto frontier for LLM generation, enabling a dynamic trade-off between generation quality and inference speed.
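
The idea of grading weight channels by activation magnitude can be illustrated with a small toy example. The sketch below is not the authors' implementation. It assumes one possible reading of the abstract: weights are stored once as 4-bit integer codes, a lower-precision view is obtained by keeping only the top bits of each code (a simple stand-in for a nested layout), and channels with the smallest activations are skipped entirely (0 bits, i.e. sparsification). The function names, the 25% / 50% / 25% split, and the bit widths 4 / 2 / 0 are all hypothetical choices made for illustration.

```python
def truncate_code(code, bits_full, bits):
    """Keep the top `bits` bits of a signed `bits_full`-bit integer code.

    Assumption: this models a nested layout in which the low-precision
    value is a prefix of the high-precision one.
    """
    shift = bits_full - bits
    return (code >> shift) << shift  # arithmetic shift, rounds toward -inf


def assign_bits(x, levels=((0.25, 4), (0.5, 2))):
    """Assign a bit width to each input channel from its activation magnitude.

    `levels` is a hypothetical schedule of (fraction of channels, bit width),
    applied from the largest |x_j| downward. Channels left over get 0 bits
    and are skipped.
    """
    order = sorted(range(len(x)), key=lambda j: abs(x[j]), reverse=True)
    bits = [0] * len(x)
    start = 0
    for frac, width in levels:
        count = int(round(frac * len(x)))
        for j in order[start:start + count]:
            bits[j] = width
        start += count
    return bits


def graded_matvec(codes, scale, x, bits, bits_full=4):
    """Compute y = W x, reading each weight column at its assigned precision.

    `codes[i][j]` is the integer code of weight W[i][j]; `scale` is a single
    dequantization scale (a simplification; real schemes use per-group scales).
    """
    out = [0.0] * len(codes)
    for j, (xj, width) in enumerate(zip(x, bits)):
        if width == 0:
            continue  # channel is skipped: no weights are read for it
        for i, row in enumerate(codes):
            out[i] += truncate_code(row[j], bits_full, width) * scale * xj
    return out


# Toy activations for 8 input channels.
x = [0.1, -3.0, 0.5, 2.0, -0.2, 0.05, 1.0, -0.7]
bits = assign_bits(x)
avg_bits = sum(bits) / len(bits)  # average weight bits read per channel
```

With this schedule the two largest activations are paired with 4-bit weights, the next four with 2-bit weights, and the two smallest are dropped, for an average of 2 bits per weight read. Changing `levels` moves the average bit width up or down, which is the kind of quality-versus-speed trade-off the abstract describes for memory-bound decoding.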