LatentRevise: Learning from Zero-Hit Reasoning

2026-06-29 • Computation and Language

AI summary

The authors tackle a problem in reinforcement learning with verifiable rewards: on hard prompts, sampling a correct solution by chance is so unlikely that normal training receives little useful feedback. They propose LatentRevise, a method that learns from failed attempts by adjusting the input embeddings of a failed attempt's reasoning prefix, pushing them away from the failed continuation and toward the known correct answer. The adjustment is constrained to the convex hull of the model's vocabulary embeddings, so each update moves toward a real token embedding rather than an arbitrary direction. Continuations from the revised prefix reach correct answers that the original attempts missed, and training on them improves supervised fine-tuning and reinforcement learning on math benchmarks over standard baselines.

reinforcement learning, verifiable rewards, latent revision, sampling frontier, input embeddings, token embeddings, self-reflection, supervised fine-tuning, math benchmarks

Authors

Yiqiu Guo, Xueting Han, Qi Jia, Guangtao Zhai, Jing Bai

Abstract

Reinforcement learning with verifiable rewards (RLVR) is bottlenecked by hard prompts on which correct trajectories have low probability, so sampling misses them within a practical budget and leaves the policy update with little useful signal. We frame such zero-hit prompts as RLVR's sampling frontier, where new reasoning behavior is most valuable yet least likely to be sampled. Importantly, failed rollouts can be informative: they expose where the model's reasoning went wrong. We introduce LatentRevise, a first-order latent revision method that recovers training signal for this zero-hit regime. Given a failed rollout and the gold answer as an anchor, LatentRevise optimizes the input embeddings of its reasoning prefix under two complementary gradients, moving the prefix away from the failed continuation and toward the gold answer. The optimization is constrained to the convex hull of the model's vocabulary embeddings, so each update moves the latent toward a real token embedding rather than an arbitrary feature direction. We find that continuations from the revised prefix lengthen, exhibit self-reflection, and reach correct answers missed by the original rollouts. Used as training data, these trajectories improve SFT and RLVR on math benchmarks over standard baselines.
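
The revision step described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation, and several pieces are assumptions made for illustration: the language model is replaced by a small linear readout with a softmax, the two gradients are combined with a hypothetical weight `push`, and the convex-hull constraint is enforced with a Frank-Wolfe style update, which is one standard first-order method whose every step moves toward a vertex (here, a token embedding). The abstract does not state which update rule LatentRevise actually uses. All function and parameter names below are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nll_grad(latent, readout, target):
    """Gradient of -log softmax(readout @ latent)[target] w.r.t. latent.

    `readout` is a toy stand-in for the model (assumption): one row of
    weights per possible continuation.
    """
    probs = softmax([dot(row, latent) for row in readout])
    return [
        sum((probs[k] - (1.0 if k == target else 0.0)) * readout[k][d]
            for k in range(len(readout)))
        for d in range(len(latent))
    ]

def revise_step(latent, vocab, readout, gold, failed, step=0.2, push=0.5):
    """One constrained update of a single latent position.

    Combines two gradients: descend the loss of the gold continuation,
    ascend the loss of the failed continuation (weighted by `push`).
    The Frank-Wolfe update then moves the latent toward the vocabulary
    embedding that most decreases the linearized objective, so the
    latent stays inside the convex hull of `vocab` if it started there.
    """
    g_gold = nll_grad(latent, readout, gold)
    g_fail = nll_grad(latent, readout, failed)
    g = [a - push * b for a, b in zip(g_gold, g_fail)]
    best = min(range(len(vocab)), key=lambda v: dot(g, vocab[v]))
    revised = [(1 - step) * x + step * t for x, t in zip(latent, vocab[best])]
    return revised, best

# Toy setup (all values are illustrative, not from the paper).
vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # token embeddings
readout = [[1.0, 0.0], [0.0, 1.0]]               # row 0: failed, row 1: gold
latent = list(vocab[0])                          # prefix starts at a real token

for _ in range(10):
    latent, chosen = revise_step(latent, vocab, readout, gold=1, failed=0)

gold_prob = softmax([dot(row, latent) for row in readout])[1]
print(latent, gold_prob)
```

In this toy setting the latent starts at a token embedding that favors the failed continuation and is pulled step by step toward the token embedding that favors the gold one, without ever leaving the hull. A real application would differentiate through the language model with respect to the prefix's input embeddings and apply the update at every prefix position.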
