SageBwd: A Trainable Low-bit Attention

2026-03-02

Machine Learning, Artificial Intelligence
AI summary

The authors studied a way to make the attention component of machine learning models use less memory and run faster by computing with lower-bit numbers, a method called SageBwd. While it worked well for fine-tuning, it lagged behind full-precision attention during pre-training. By analyzing the method, they identified how to close this gap: adjusting normalization and smoothing techniques, and controlling how many tokens are processed per training step. With these changes, SageBwd matches full-precision attention during pre-training.

Low-bit attention, SageAttention, SageBwd, Quantization, Attention matrix multiplication, Pre-training, QK-norm, Backward-pass gradient, K-smoothing, Tokens per step
Authors
Jintao Zhang, Marco Chen, Haoxu Wang, Kai Jiang, Ion Stoica, Joseph E. Gonzalez, Jianfei Chen, Jun Zhu
Abstract
Low-bit attention, such as SageAttention, has emerged as an effective approach for accelerating model inference, but its applicability to training remains poorly understood. In prior work, we introduced SageBwd, a trainable INT8 attention that quantizes six of the seven attention matrix multiplications while preserving fine-tuning performance. However, SageBwd exhibited a persistent performance gap relative to full-precision attention (FPA) during pre-training. In this work, we investigate why this gap occurs and demonstrate that SageBwd can match full-precision attention during pre-training. Through experiments and theoretical analysis, we arrive at several key insights and conclusions: (i) QK-norm is necessary for stable training at large tokens per step, (ii) quantization errors primarily arise from the backward-pass score gradient dS, (iii) reducing tokens per step enables SageBwd to match FPA performance in pre-training, and (iv) K-smoothing remains essential for training stability, while Q-smoothing provides limited benefit during pre-training.
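To make the two central ideas concrete, the following is a minimal NumPy sketch, not the authors' kernel, of INT8-quantized attention scores with K-smoothing. The per-tensor scale and the function names are illustrative simplifications (SageAttention-style kernels quantize at finer, per-block granularity and run on GPU). The key observation it demonstrates: subtracting K's per-channel mean before quantization leaves the row-wise softmax unchanged, because the removed term is constant across keys for each query.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor INT8 quantization (hypothetical simplification;
    # real low-bit attention kernels use per-block scales).
    scale = max(np.abs(x).max(), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_attention_scores(Q, K, smooth_k=True):
    # K-smoothing: subtract K's per-channel mean (over tokens) before
    # quantization, centering K so outlier channels waste fewer INT8 levels.
    # The dropped correction Q @ mean(K) is constant across keys for each
    # query row, so the subsequent row-wise softmax is unaffected.
    if smooth_k:
        K = K - K.mean(axis=0, keepdims=True)
    qQ, sQ = quantize_int8(Q)
    qK, sK = quantize_int8(K)
    # Integer matmul in INT32 accumulation, then dequantize with the scales.
    return (qQ.astype(np.int32) @ qK.astype(np.int32).T) * (sQ * sK)

def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8)).astype(np.float32)
K = rng.standard_normal((6, 8)).astype(np.float32)

P_int8 = softmax(int8_attention_scores(Q, K))
P_fp = softmax(Q @ K.T)
print("max |P_int8 - P_fp| =", np.abs(P_int8 - P_fp).max())
```

In this toy forward pass the quantized probabilities stay close to full precision; the abstract's point (ii) is that the harder errors appear in the backward pass, in the score gradient dS, which this sketch does not model.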