ReNIO: Reweighting Negative Trajectory Importance for LLM On-Policy Distillation

2026-06-22 Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors study on-policy distillation (OPD), which improves large language models (LLMs) by training them on their own generated answers. In controlled experiments, training only on the model's incorrect answers worked better than training only on correct ones, apparently because incorrect answers preserve longer, more exploratory reasoning. To exploit this without needing complete answers, they designed ReNIO, which uses the student-to-teacher probability ratio to give more weight to samples whose reasoning is likely heading toward a mistake. The method improves reasoning on math and coding tasks while keeping OPD's ability to train on partial outputs (prefixes) rather than full rollouts.

large language models, on-policy distillation, self-distillation, reasoning traces, probability ratio, exploratory reasoning, reinforcement learning, token weighting, mathematical reasoning, code generation
Authors
Chen Lin, Kedi Chen, Wei Zhang
Abstract
On-policy distillation (OPD) improves LLM reasoning by training a student model on its own generated outputs, but standard OPD treats all student-generated outputs (SGOs) equally regardless of their informativeness. We observe a consistent asymmetry in controlled filtering experiments: in both OPD and on-policy self-distillation (OPSD), training only on incorrect SGOs outperforms training only on correct ones. Our further analysis suggests that models trained on correct-only SGOs tend to generate shorter reasoning traces and show weaker reflection behavior, while incorrect SGOs better preserve exploratory reasoning near the model's capability boundary. To exploit this signal without requiring full answer-containing rollouts, we introduce ReNIO, which Reweights Negative trajectory Importance for LLM On-policy distillation. Using the student-to-teacher probability ratio, ReNIO identifies pivotal tokens that lead to wrong reasoning traces and aggregates their information into a normalized sample weight, inherently assigning larger weights to likely negative trajectories without observing the correctness of the final answer. Since ReNIO uses only prefix-conditioned token probabilities, it preserves OPD's prefix training advantage over full-rollout reinforcement learning. Across both mathematical reasoning and code generation tasks, ReNIO improves both OPD and OPSD, with representative relative gains of up to 8.90% for Qwen3-1.7B and 10.00% for R1-Distill-Qwen-7B on mathematical reasoning benchmarks. Code repo: https://github.com/BDML-lab/ReNIO.
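The abstract does not state the weighting formula, so the following is only a minimal sketch of the general idea under stated assumptions. It takes per-token log-probabilities from the student and the teacher on the same prefix, treats the tokens with the largest student-to-teacher log ratio as "pivotal", and turns the aggregated score into a batch-normalized sample weight. The top-k aggregation, the softmax normalization, the `top_k` and `temperature` parameters, and all function names are hypothetical choices made for illustration, not the authors' method.

```python
import math


def sample_score(student_logps, teacher_logps, top_k=2):
    """Hypothetical per-sample score from student-to-teacher log ratios.

    A large positive log ratio marks a token the student prefers far more
    than the teacher does. Here (an assumption) the top_k largest ratios
    are treated as the pivotal tokens and averaged.
    """
    if not student_logps or len(student_logps) != len(teacher_logps):
        raise ValueError("need two non-empty sequences of equal length")
    ratios = [s - t for s, t in zip(student_logps, teacher_logps)]
    pivotal = sorted(ratios, reverse=True)[:top_k]
    return sum(pivotal) / len(pivotal)


def normalized_weights(scores, temperature=1.0):
    """Softmax over the batch, rescaled so the weights average to 1.

    The normalization scheme is an assumption; the source only says the
    sample weight is normalized.
    """
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [len(scores) * e / total for e in exps]


# Toy batch of two prefixes (made-up log-probabilities).
# Sample 0: student and teacher roughly agree on every token.
# Sample 1: the teacher assigns much lower probability to two tokens.
student_batch = [[-0.1, -0.2, -0.1], [-0.1, -0.2, -0.1]]
teacher_batch = [[-0.1, -0.3, -0.1], [-0.2, -3.0, -2.5]]

scores = [sample_score(s, t) for s, t in zip(student_batch, teacher_batch)]
weights = normalized_weights(scores)

# Weighted distillation loss: each sample's loss is scaled by its weight.
per_sample_losses = [0.4, 0.9]  # placeholder values for illustration
weighted_loss = sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)
```

In this toy batch the second sample, where student and teacher disagree sharply, receives the larger weight, which matches the abstract's description of upweighting likely negative trajectories. Only prefix-conditioned token probabilities are used, so no final answer or correctness label is needed.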