LatentRevise: Learning from Zero-Hit Reasoning

2026-06-29
Computation and Language
AI summary

The authors address a problem in reinforcement learning with verifiable rewards: on hard prompts, correct solutions are so unlikely to be sampled that training receives little useful feedback. They propose LatentRevise, a method that learns from failed attempts by optimizing the input embeddings of the reasoning prefix, steering it away from the failed continuation and toward the gold answer. The optimization is constrained to the convex hull of the model's vocabulary embeddings, so the revised prefix stays close to real token representations. Continuations from the revised prefix reach correct answers that the original rollouts missed, and training on them improves SFT and RLVR on math benchmarks over standard baselines.

reinforcement learning, verifiable rewards, latent revision, sampling frontier, input embeddings, token embeddings, self-reflection, supervised fine-tuning, math benchmarks
Authors
Yiqiu Guo, Xueting Han, Qi Jia, Guangtao Zhai, Jing Bai
Abstract
Reinforcement learning with verifiable rewards (RLVR) is bottlenecked by hard prompts on which correct trajectories have low probability, so sampling misses them within a practical budget and leaves the policy update with little useful signal. We frame such zero-hit prompts as RLVR's sampling frontier, where new reasoning behavior is most valuable yet least likely to be sampled. Importantly, failed rollouts can be informative: they expose where the model's reasoning went wrong. We introduce LatentRevise, a first-order latent revision method that recovers training signal for this zero-hit regime. Given a failed rollout and the gold answer as an anchor, LatentRevise optimizes the input embeddings of its reasoning prefix under two complementary gradients, moving the prefix away from the failed continuation and toward the gold answer. The optimization is constrained to the convex hull of the model's vocabulary embeddings, so each update moves the latent toward a real token embedding rather than an arbitrary feature direction. We find that continuations from the revised prefix lengthen, exhibit self-reflection, and reach correct answers missed by the original rollouts. Used as training data, these trajectories improve SFT and RLVR on math benchmarks over standard baselines.
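
The abstract does not state the exact update rule, so the following is only a minimal sketch of the general idea under several assumptions. It uses a single latent vector instead of a full reasoning prefix, a toy linear-softmax scorer `W` in place of the language model, and a Frank-Wolfe-style step as one possible way to keep the latent inside the convex hull of the vocabulary embeddings `E`. The names `loss_and_grad`, `revise`, `lam`, and `gamma` are hypothetical and do not come from the paper.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                      # shift for numerical stability
    return z - np.log(np.exp(z).sum())

def loss_and_grad(x, W, gold, fail, lam=0.5):
    """Toy objective (an assumption): -log p(gold | x) + lam * log p(fail | x)."""
    logp = log_softmax(x @ W)
    p = np.exp(logp)
    onehot = np.eye(W.shape[1])
    g_gold = W @ (onehot[gold] - p)      # gradient of log p(gold | x) w.r.t. x
    g_fail = W @ (onehot[fail] - p)      # gradient of log p(fail | x) w.r.t. x
    loss = -logp[gold] + lam * logp[fail]
    return loss, -g_gold + lam * g_fail  # pull toward gold, push away from fail

def revise(w, E, W, gold, fail, steps=20, gamma=0.2):
    """Frank-Wolfe-style revision of simplex weights w; the latent is x = w @ E."""
    for _ in range(steps):
        _, g = loss_and_grad(w @ E, W, gold, fail)
        j = np.argmin(E @ g)             # token embedding best aligned with -g
        vertex = np.zeros_like(w)
        vertex[j] = 1.0
        w = (1 - gamma) * w + gamma * vertex  # convex step toward that token
    return w

rng = np.random.default_rng(0)
E = rng.normal(size=(6, 4))              # 6 toy token embeddings, dimension 4
W = rng.normal(size=(4, 3))              # toy scorer over 3 outcomes
w0 = np.zeros(6)
w0[0] = 1.0                              # start at a real token embedding
w = revise(w0, E, W, gold=1, fail=2)
x_revised = w @ E                        # revised latent, a convex combination of rows of E
```

Because every step is a convex combination of the current weights and a one-hot vertex, `w` stays on the probability simplex and `x_revised` stays in the convex hull of the rows of `E`. With a fixed step size on this non-convex toy objective, a decrease in the loss is not guaranteed.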