TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
2026-05-11 • Artificial Intelligence • Machine Learning
AI summary
The authors studied a way for models trained with reinforcement learning to teach themselves by focusing guidance on the important parts of a response rather than on every token. They found that distilling on all tokens wastes gradient on mostly redundant positions and leaks privileged information, causing rising entropy, shorter reasoning, and degradation on unseen problems during long-horizon math training. Their method, TRACE, applies teacher guidance only on annotator-marked critical spans, which helps the model learn more effectively and avoids degradation on out-of-distribution tests. The approach improved performance across several benchmarks and still worked when the model annotated its own rollouts during training.
reinforcement learning • self-distillation • KL divergence • privileged context • entropy • long-horizon reasoning • forward KL • reverse KL • GRPO • out-of-distribution
Authors
Jiaxuan Wang, Xuan Ouyang, Zhiyu Chen, Yulan Hu, Zheng Pan, Xin Li, Lan-Zhe Guo
Abstract
On-policy self-distillation (self-OPD) densifies reinforcement learning with verifiable rewards (RLVR) by letting a policy teach itself under privileged context. We find that when this guidance spans the full response, all-token KL spends gradients on mostly redundant positions and amplifies privileged-information leakage, causing rising entropy, shortened reasoning, and out-of-distribution degradation in long-horizon math training. We propose Token-Routed Alignment for Critical rEasoning (TRACE), which distills only on annotator-marked critical spans: forward KL on key spans of correct rollouts, optional reverse KL on localized error spans, and GRPO on all remaining tokens, with the KL channel annealed away after a short warm-up. Our analysis explains TRACE through two effects: forward KL provides non-vanishing lift to teacher-supported tokens that the student under-allocates, while span masking and decay keep cumulative privileged-gradient exposure finite. On four held-out math benchmarks plus GPQA-Diamond, TRACE improves over GRPO by 2.76 percentage points on average and preserves the Qwen3-8B base OOD score on GPQA-Diamond, where GRPO and all-token self-OPD baselines degrade. Gains persist under online self-annotation (+1.90 percentage points, about 69% of the strong-API gain), reducing the concern that TRACE merely imports external annotator capability. Across scales, the best routed action is base-dependent: on Qwen3-8B it is forward KL on key spans, while on Qwen3-1.7B it shifts to reverse KL on error spans.
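For concreteness, here is a minimal sketch of the token-routed loss the abstract describes, assuming full-vocabulary distributions from the student and the privileged-context teacher, binary span masks from an annotator, and a plain policy-gradient surrogate standing in for GRPO's clipped-ratio objective. The function name, the linear anneal schedule, and the mask conventions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def trace_loss(student_logits, teacher_logits, token_ids, advantages,
               key_mask, error_mask, step, warmup_steps=200, kl_coef=0.1):
    """Token-routed loss sketch: forward KL on key spans, reverse KL on
    error spans, policy gradient elsewhere; KL weight annealed to zero."""
    # Anneal the distillation channel away after a short warm-up so
    # cumulative exposure to privileged-teacher gradients stays finite.
    beta = kl_coef * max(0.0, 1.0 - step / warmup_steps)

    log_p = F.log_softmax(student_logits, dim=-1)  # student, shape (T, V)
    log_q = F.log_softmax(teacher_logits, dim=-1)  # privileged teacher, (T, V)

    # Forward KL(q || p) per position: lifts teacher-supported tokens the
    # student under-allocates. Applied on key spans of correct rollouts.
    fwd = F.kl_div(log_p, log_q, log_target=True, reduction="none").sum(-1)

    # Reverse KL(p || q) per position: pushes student mass off tokens the
    # teacher does not support. Applied on localized error spans.
    rev = F.kl_div(log_q, log_p, log_target=True, reduction="none").sum(-1)

    # Policy-gradient surrogate on all remaining tokens (simplified GRPO),
    # using group-normalized advantages.
    tok_logp = log_p.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # (T,)
    rest = (1.0 - key_mask) * (1.0 - error_mask)
    pg = -(advantages * tok_logp * rest)

    return (pg + beta * (key_mask * fwd + error_mask * rev)).mean()
```

The `beta` decay implements the abstract's "annealed away after a short warm-up": once `step` reaches `warmup_steps`, gradients flow only through the policy-gradient term, which is one way to keep cumulative privileged-gradient exposure bounded.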