AI summary
The authors propose improving language-model reasoning by adding a second source of diversity among answer attempts (rollouts). Besides random token-level sampling, they vary the layer span $L$: the number of top decoder layers that a frozen model re-applies at tokens where it is uncertain. They introduce Entropy-Gated Latent Recursion (EGLR), a training-free decoding procedure that re-applies the top $L$ layers at high-uncertainty tokens until the next-token distribution converges, for at most $K_{\max}$ iterations. Combined with $T$ temperature samples, this turns the rollout pool into an $L\times T$ grid at almost the same per-rollout cost. The authors characterize this space across 8 instruction-tuned models and 6 math reasoning benchmarks and report that the two axes solve different subsets of problems: on MATH-500 with Qwen2.5-3B-Instruct, the joint oracle reaches 91.6%, compared with 83.4% for temperature alone and 81.2% for layer span alone. The larger rollout pool could benefit any method that consumes multiple rollouts, such as self-consistency, best-of-$N$ with verifiers, and GRPO, without relying on additional stochastic noise.
inference-time scaling, language models, stochastic token-level sampling, layer span, Entropy-Gated Latent Recursion (EGLR), temperature sampling, decoder layers, self-consistency, verifiers, group-relative RL training (GRPO)
Authors
Soham Bhattacharjee, Dushyant Singh Chauhan, Salem Lahlou, Martin Takac, Nils Lukas
Abstract
Inference-time scaling has become the dominant lever for improving language-model reasoning, but existing methods derive rollout diversity from a single source: stochastic token-level sampling. We argue that this single-axis sampling space is fundamentally limiting, and identify a second, fully deterministic and complementary axis: the layer span $L$ at which a frozen model's top decoder layers are recursively re-applied at high-uncertainty tokens. Different choices of $L$ produce distinct rollouts that solve different subsets of problems, with no stochasticity. We instantiate this axis through Entropy-Gated Latent Recursion (EGLR), a training-free decoding procedure that re-applies the top-$L$ layers for at most $K_{\max}$ iterations until the next-token distribution converges. Combined with $T$ temperature samples, EGLR turns a single-axis stochastic rollout pool into an $L\times T$ Cartesian sampling space at almost the same per-rollout cost. We characterize this space across $8$ instruction-tuned models and $6$ math reasoning benchmarks, and show that the $L$-axis is genuinely complementary to temperature: on MATH-500 with Qwen2.5-3B-Instruct, the joint $L\times T$ oracle reaches $91.6\%$, $+8.2$ percentage points beyond the temperature-only oracle ($83.4\%$) and $+10.4$ points beyond the layer-only oracle ($81.2\%$), confirming that the two axes solve different subsets of problems. The expanded rollout pool provides richer per-prompt candidates for any downstream procedure that consumes rollouts, including self-consistency, best-of-$N$ with verifiers, and group-relative RL training (GRPO), opening a new direction for inference-time scaling that does not rely on stochastic noise.
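The decoding rule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The abstract specifies only an entropy gate, re-application of the top-$L$ layers, a cap of $K_{\max}$ iterations, and a convergence stop. Everything else here is an assumption made for illustration: the entropy threshold `tau`, the total-variation convergence test with tolerance `eps`, feeding the final hidden state straight back into the top layers, and the hypothetical helpers `make_toy_layers` and `identity_unembed`, which stand in for a real decoder stack and unembedding.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Sequence[float]], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def eglr_step(hidden: Sequence[float], layers: Sequence[Layer],
              unembed: Layer, span: int, k_max: int,
              tau: float, eps: float) -> Tuple[Vector, int]:
    """Next-token distribution for one position, plus recursions used.

    Assumed gate: recurse only if the entropy exceeds tau.
    Assumed stop: successive distributions differ by less than eps
    in total variation, or k_max recursions have been applied.
    """
    if not 1 <= span <= len(layers):
        raise ValueError("span must be between 1 and the number of layers")
    h = list(hidden)
    for layer in layers:              # ordinary forward pass
        h = layer(h)
    probs = softmax(unembed(h))
    if entropy(probs) <= tau:         # confident token: no recursion
        return probs, 0
    top = layers[-span:]              # the top-L layers
    for k in range(1, k_max + 1):
        for layer in top:             # re-apply the top-L layers
            h = layer(h)
        new_probs = softmax(unembed(h))
        converged = total_variation(probs, new_probs) < eps
        probs = new_probs
        if converged:
            return probs, k
    return probs, k_max


def make_toy_layers(target: Sequence[float], n_layers: int) -> List[Layer]:
    """Hypothetical stand-in: each layer moves halfway toward `target`."""
    def layer(h: Sequence[float]) -> Vector:
        return [0.5 * x + 0.5 * t for x, t in zip(h, target)]
    return [layer] * n_layers


def identity_unembed(h: Sequence[float]) -> Vector:
    """Hypothetical stand-in: hidden state is used directly as logits."""
    return list(h)


layers = make_toy_layers([4.0, 0.0, 0.0], n_layers=2)
for span in (1, 2):
    probs, used = eglr_step([0.0, 0.0, 0.0], layers, identity_unembed,
                            span=span, k_max=8, tau=0.0, eps=0.01)
    print(f"span={span} recursions={used} p(token 0)={probs[0]:.4f}")
```

In this toy setting the loop involves no sampling, so each layer span yields one deterministic distribution, and different spans yield different ones. That mirrors the abstract's point that $L$ is a source of rollout diversity independent of temperature. In a real model the gate would be evaluated at every decoding position.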