Credit Assignment with Resets in Language Model Reasoning
2026-05-25 • Artificial Intelligence
AI summary
The authors discuss a problem in teaching AI language models to solve reasoning tasks step-by-step, where typically the model is rewarded only at the end, making it hard to know which steps led to success or failure. They propose two new methods that allow the model to 'reset' to earlier points in its reasoning and try different paths, helping it figure out and fix the specific mistakes. One method picks random steps to reset, while the other lets the model find the mistake itself and focus learning there. Their experiments show the self-reset approach helps improve reasoning better than previous methods without needing extra guidance.
reinforcement learning, credit assignment, language models, multi-step reasoning, policy optimization, counterfactual reasoning, trajectory resets, Conservative Policy Iteration, reward methods, self-localization
Authors
Ankur Samanta, Akshayaa Magesh, Ayush Jain, Youliang Yu, Daniel Jiang, Kavosh Asadi, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni
Abstract
Contemporary methods for reinforcement learning with verifiable rewards post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one simple mechanism for doing so: returning to an intermediate state and resampling counterfactual continuations lets outcome differences be attributed to the decisions made at that point. We propose two such methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO. It does so by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself with no external supervision.
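The reset mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the convention that the step at the reset index is itself resampled, and the group-mean/standard-deviation normalization of suffix rewards (borrowed from GRPO-style baselines) are all assumptions made for illustration. The policy and the verifier are passed in as plain callables.

```python
import random
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple

Step = str  # one reasoning step, represented as text (assumption)


def random_reset_index(trajectory: Sequence[Step], rng: random.Random) -> int:
    """RRPO-style choice (assumed): pick the reset step uniformly at random."""
    return rng.randrange(len(trajectory))


def resample_from_reset(
    trajectory: Sequence[Step],
    reset_idx: int,
    sample_suffix: Callable[[List[Step]], List[Step]],
    reward_fn: Callable[[List[Step]], float],
    num_samples: int,
) -> Tuple[List[List[Step]], List[float]]:
    """Keep the prefix before reset_idx, then sample counterfactual suffixes.

    sample_suffix stands in for the policy and reward_fn for the outcome
    verifier; both are hypothetical hooks, not a real library API.
    """
    prefix = list(trajectory[:reset_idx])
    suffixes = [sample_suffix(prefix) for _ in range(num_samples)]
    rewards = [reward_fn(prefix + suffix) for suffix in suffixes]
    return suffixes, rewards


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within the group of suffixes sharing one prefix.

    Because every suffix starts from the same reset state, reward differences
    are attributable to decisions made from that state onward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For an SRPO-style variant, `reset_idx` would come from the model's own judgment of which step is erroneous rather than from `random_reset_index`; how that self-localization is prompted or trained is not specified in the abstract, so it is left out of the sketch.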