Attention Drift: What Autoregressive Speculative Decoding Models Learn

2026-05-11

Machine Learning, Artificial Intelligence
AI summary

The authors study a problem with speeding up large language models using smaller helper models called drafters. They find that as these drafters generate token after token, their focus drifts away from the original prompt and onto their own recent outputs, which hurts performance. This "attention drift" happens because the drafter's internal hidden states keep growing in magnitude, since they are never re-normalized between steps. To fix this, the authors propose two changes to the model's architecture involving normalization techniques that stabilize these values. These adjustments help the drafters produce longer, more reliable token sequences across various tasks and conditions.

speculative decoding, large language models, attention drift, transformer architecture, residual connections, normalization, Post-norm, RMSNorm, autoregressive generation, context length
Authors
Doğaç Eldenk, Payal Mohapatra, Yigitcan Comlek, Kaan Oktay, Hongyang Zhang, Stephen Xia
Abstract
Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously unreported phenomenon we call **attention drift**: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto the drafter's own recently generated tokens. We observe this in both *EAGLE3* drafters and *MTP heads*, suggesting drift is a property of drafter designs. We trace it to the un-normalized residual path between chain steps: the drafter's hidden-state magnitude grows monotonically with chain depth, exhibiting dynamics consistent with the drafter acting as additional pre-norm transformer layers stacked on the target model rather than as a standalone autoregressive predictor. To limit this growth, we propose two architectural changes: Post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to 2× under template perturbation, 1.18× on long-context tasks, and 1.10× on seven standard benchmarks spanning multi-turn chat, math, and coding. They also allow drafters trained with shorter chain depths to generalize to longer drafting sequences at test time.
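The core mechanism the abstract describes, residual hidden states growing with chain depth under pre-norm and staying bounded under post-norm, can be illustrated with a toy sketch. This is not the paper's implementation: the sub-layer here is a placeholder identity function, the RMSNorm omits its learnable gain, and the step functions are simplified stand-ins, chosen only to show how the root-mean-square of the residual stream evolves across speculation-chain steps.

```python
import math

def rms(x):
    # Root-mean-square of a vector.
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x, eps=1e-6):
    # RMSNorm: rescale x to (approximately) unit RMS.
    # The learnable per-dimension gain is omitted for brevity.
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def pre_norm_step(h, f):
    # Pre-norm residual update: normalize the input to the sub-layer,
    # then add its output back. The residual stream itself is never
    # re-normalized, so its magnitude can grow with every step.
    return [a + b for a, b in zip(h, f(rms_norm(h)))]

def post_norm_step(h, f):
    # Post-norm residual update: normalize AFTER the residual add,
    # keeping the hidden-state magnitude bounded across chain steps.
    return rms_norm([a + b for a, b in zip(h, f(h))])

# Placeholder sub-layer (identity), so each pre-norm step adds a
# unit-RMS vector and the stream's RMS grows by roughly 1 per step.
f = lambda x: x

h_pre = h_post = [1.0, 2.0, 3.0]
for _ in range(8):  # eight speculation-chain steps
    h_pre = pre_norm_step(h_pre, f)
    h_post = post_norm_step(h_post, f)

print(f"pre-norm RMS after 8 steps:  {rms(h_pre):.2f}")
print(f"post-norm RMS after 8 steps: {rms(h_post):.2f}")
```

With the identity sub-layer, the pre-norm stream's RMS climbs roughly linearly with depth while the post-norm stream stays pinned near 1, which is the stabilizing effect the proposed changes rely on.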