DemoPSD: Disagreement-Modulated Policy Self-Distillation

2026-07-02, Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors study how a single language model teaching itself can sometimes learn shortcuts based on extra hidden information it shouldn't use when tested normally. They propose a new method called DemoPSD, which carefully mixes the teacher's guidance with the student's own reasoning so the model doesn't overfit or lose its ability to explore different answers. Their approach reduces the problem of leaking privileged information and helps the model generalize better to different topics. Tests on scientific reasoning tasks show DemoPSD leads to stronger and more flexible models compared to previous methods.

on-policy self-distillation, large language models, token-level supervision, privileged information leakage, reverse-KL barycenter, model generalization, entropy, out-of-distribution, scientific reasoning benchmarks
Authors
Yunhe Li, Hao Shi, Wenhao Liu, Mengzhe Ruan, Hanxu Hou, Zhongxiang Dai, Shuang Qiu, Linqi Song
Abstract
On-policy self-distillation (OPSD) has emerged as a practical method for training large language models (LLMs) to reason: a single model acts as both teacher and student, with the two roles given different levels of information access. However, recent studies have found that the teacher's dense token-level supervision, conditioned on privileged information, can cause overfitting to in-domain patterns, suppress exploration, and hurt cross-domain generalization. It also introduces a more fundamental issue, *privileged information leakage*, in which the student encodes answer-dependent shortcuts that are unavailable at test time. We introduce **DemoPSD**, a framework that addresses these problems through *selective adoption of teacher guidance*. Instead of fitting the full teacher distribution, DemoPSD steers the student toward a *reverse-KL barycenter target*, a weighted geometric combination of the teacher and student distributions that balances learning from the teacher with preserving the student's own reasoning capacity. We measure the discrepancy between the two distributions and use it to adaptively control the blending at each token position. We prove that DemoPSD achieves **(1)** *leakage attenuation*, i.e., effective mitigation of privileged information leakage; and **(2)** *exploration preservation*, i.e., preservation of exploration capacity under dense token-level distillation. Extensive experiments on SciKnowEval across four scientific fields show that DemoPSD outperforms both GRPO and SDPO while maintaining higher training entropy and generalizing robustly to out-of-distribution GPQA benchmarks.
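
To make the "weighted geometric combination" concrete, here is a minimal sketch of a reverse-KL barycenter target at a single token position. It relies on the standard result that the distribution q minimizing lam * KL(q || p_T) + (1 - lam) * KL(q || p_S) is the normalized geometric mean p_T^lam * p_S^(1 - lam). The abstract does not state how the disagreement is measured or how it maps to the blending weight, so `disagreement_weight` below is a hypothetical rule (Jensen-Shannon divergence, with less trust in the teacher as disagreement grows), not the authors' formula. All function names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in nats, bounded in [0, ln 2]."""
    mid = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def disagreement_weight(p_student, p_teacher):
    """HYPOTHETICAL rule (not from the paper): the teacher weight lam
    shrinks from 1 toward 0 as teacher-student disagreement grows."""
    lam = 1.0 - js_divergence(p_student, p_teacher) / math.log(2.0)
    return min(1.0, max(0.0, lam))


def barycenter_target(p_student, p_teacher, lam):
    """Reverse-KL barycenter: q_i proportional to p_T,i^lam * p_S,i^(1 - lam).
    Assumes the two distributions share support (true for softmax outputs)."""
    unnorm = [(pt ** lam) * (ps ** (1.0 - lam))
              for ps, pt in zip(p_student, p_teacher)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Usage: one token position with a 3-word vocabulary (made-up logits).
p_s = softmax([2.0, 0.5, -1.0])   # student, without privileged information
p_t = softmax([0.0, 3.0, -1.0])   # teacher, with privileged information
lam = disagreement_weight(p_s, p_t)
target = barycenter_target(p_s, p_t, lam)
```

With `lam = 1` the target collapses to the teacher distribution (plain distillation), and with `lam = 0` it collapses to the student's own distribution, so the per-token weight interpolates between the two in log-probability space. In a real training loop this would be vectorized over all token positions in a tensor library rather than run on Python lists.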