Stabilizing Extrapolation in Looped Transformers via Learned Stochastic Stopping
2026-06-29 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors study Looped Transformers, models that apply the same transformer block repeatedly and are therefore well suited to inputs of varying length. They find that these models generalize unreliably to sequences longer than those seen during training: models that perform equally well in-distribution can behave very differently out-of-distribution. The authors trace this variance to a spurious correlation between sequence length and the number of loops during training. Training with a randomized number of loops reduces the variance and makes predictions more stable across inference-time loop counts. They also analyze a learned stochastic schedule, RL-Halting, which generally improves the trade-off between accuracy and stability but can also stabilize a suboptimal computation.
Looped Transformers, length generalization, out-of-distribution variance, stochastic training, RL-Halting, algorithmic tasks, binary addition, Dyck-1 language, inference-time computation, training-time design
Authors
Hsun-Yu Kuo, El Mahdi Chayti, Patrik Reizinger, Wieland Brendel, Martin Jaggi
Abstract
Looped Transformers, which repeatedly apply a shared transformer block, are an architecturally natural fit for variable-length algorithmic tasks. Although they can exhibit strong length generalization beyond the length of training sequences, this behavior is brittle, yielding high out-of-distribution (OOD) variance, even across well-performing in-distribution solutions. We trace this variance to the spurious correlation in simple algorithmic tasks between sequence length and number of loops. Introducing stochasticity into the number of loops during training sharply reduces OOD variance and stabilizes predictions across inference-time loop counts. To improve upon heuristic randomization schemes, we further analyze RL-Halting as a learned stochastic schedule and find that it generally improves the accuracy-stability trade-off. Across binary addition, Dyck-1, Unique Set, and Copy, learned stochastic stopping often improves this trade-off but can also stabilize a suboptimal computation. Our work suggests that "when to stop" should be treated as a training-time design choice, not merely an inference-time computation-allocation rule.
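To make the idea of a randomized loop count concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical heuristic, `sample_num_loops`, that adds uniform jitter to a length-based loop count, and it replaces the learned transformer block with a hand-written toy block, `carry_step`, that performs one round of carry propagation for binary addition. All names and the `max_extra` parameter are illustrative assumptions.

```python
import random


def looped_apply(block, state, num_loops):
    """Apply the same (weight-shared) block `num_loops` times."""
    for _ in range(num_loops):
        state = block(state)
    return state


def sample_num_loops(seq_len, max_extra=4, rng=random):
    """Hypothetical randomization heuristic (an assumption, not from the paper):
    use a length-based loop count plus uniform jitter in [0, max_extra], so the
    loop count is no longer a deterministic function of sequence length."""
    return seq_len + rng.randint(0, max_extra)


def carry_step(state):
    """Toy stand-in for the shared block: one round of carry propagation.
    `state` is (partial_sum, carry); partial_sum + carry is preserved."""
    partial_sum, carry = state
    return (partial_sum ^ carry, (partial_sum & carry) << 1)


def add_with_loops(a, b, num_loops):
    """Add two non-negative integers by looping `carry_step` a given number of times."""
    initial_state = (a ^ b, (a & b) << 1)
    return looped_apply(carry_step, initial_state, num_loops)


# Once the carry reaches zero the state is a fixed point, so running more
# loops than necessary does not change the answer.
rng = random.Random(0)
num_loops = sample_num_loops(seq_len=4, max_extra=4, rng=rng)
partial_sum, carry = add_with_loops(0b0111, 0b0001, num_loops)
```

In this toy, the block converges to a fixed point, which is the kind of behavior that makes predictions insensitive to the inference-time loop count. Whether a trained Looped Transformer reaches such a fixed point is exactly what the training-time stopping schedule influences; the sketch only illustrates the mechanics of sampling the loop count and reusing one block.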