Provable Benefits of RLVR over SFT for Reasoning Models: Learning to Backtrack Efficiently
2026-06-22 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors explain why fine-tuning large language models with reinforcement learning (RL) on verifiable outcome rewards yields better reasoning than supervised fine-tuning (SFT). They model chain-of-thought reasoning as pathfinding on a graph and show that SFT, trained only on shortest-path examples with no negative examples, fails to learn how to backtrack efficiently from dead ends. An RL-trained model, by contrast, learns efficient backtracking from outcome reward alone, which gives an exponential separation in inference-time compute between the two methods. The authors also show that the reasoning traces of the RL-trained model can be distilled into a base model so that it backtracks efficiently as well.
Large Language Models, Reinforcement Learning, Supervised Fine-Tuning, Chain-of-Thought Reasoning, Backtracking, Pathfinding, Inference-time Compute, Reward Signals, Model Distillation
Authors
Stanley Wei, Juno Kim
Abstract
Recent advances in large language models (LLMs) have demonstrated that reinforcement fine-tuning of pretrained base models can lead to significant gains in reasoning performance at inference time. In this work, we theoretically analyze why reinforcement fine-tuning induces better reasoning ability than purely supervised fine-tuning (SFT) methods. We model chain-of-thought (CoT) reasoning as a pathfinding problem on graphs and compare the popular method of reinforcement learning with verifiable rewards (RLVR) against traditional SFT. We prove that SFT, when trained on golden shortest paths without negative examples, fails to learn how to efficiently backtrack. In contrast, an RLVR-trained model can learn how to efficiently backtrack from dead ends using only outcome reward. This leads to an exponential separation in inference-time compute between the two methods, and demonstrates that RLVR leads the model to learn the location of difficult decisions in a reasoning chain, ultimately allowing for better allocation of inference-time compute. Finally, we show that the reasoning traces of an RLVR model can be distilled to train a base model to backtrack efficiently as well.
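To make the pathfinding view concrete, the following is a minimal sketch of a search that backtracks from dead ends and records the full trace of moves. It is an illustration only and not the authors' construction: the function name `find_path_with_backtracking`, the `toy_graph` layout, and the choice of plain depth-first search are all assumptions made for this example.

```python
def find_path_with_backtracking(graph, start, goal):
    """Depth-first search that logs every move, including backtracks.

    Hypothetical helper for illustration; `graph` maps each node to a
    list of successor nodes.
    """
    path = [start]      # current partial path from start
    visited = {start}   # nodes already tried
    trace = [start]     # full trace: forward moves and backtracks
    while path:
        node = path[-1]
        if node == goal:
            return path, trace
        # Pick the first successor that has not been tried yet, if any.
        nxt = next((n for n in graph.get(node, []) if n not in visited), None)
        if nxt is None:
            path.pop()  # dead end: step back to the previous node
            if path:
                trace.append(path[-1])
        else:
            visited.add(nxt)
            path.append(nxt)
            trace.append(nxt)
    return None, trace  # goal unreachable from start


# Toy graph (assumed for illustration): "s" has a two-step dead-end
# branch (d1 -> d2) listed before the correct branch (a -> goal).
toy_graph = {
    "s": ["d1", "a"],
    "d1": ["d2"],
    "d2": [],
    "a": ["goal"],
    "goal": [],
}

path, trace = find_path_with_backtracking(toy_graph, "s", "goal")
print(path)        # the solution path found
print(len(trace))  # total moves, including the wasted dead-end excursion
```

Here the returned path is the shortest route, while the trace also contains the excursion into the dead end and back. The length of the trace is a rough stand-in for inference-time compute: the deeper a searcher goes into dead ends before turning back, the longer the trace becomes.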