DOPD: Dual On-policy Distillation

2026-06-29 • Artificial Intelligence

AI summary

The authors study a method called on-policy distillation (OPD), which helps a 'student' model learn from a 'teacher' model by giving feedback on the student's own outputs. They identify a problem they call 'privilege illusion,' where extra information given to the teacher or student creates gaps that can be imitated but never truly closed, so the student cannot tell them apart from gaps in actual capability. To fix this, they propose DOPD, a method that decides, token by token, which model should supervise each part of the output based on how much better one is than the other. Their experiments show DOPD works better than existing methods across language and vision-language tasks, and it also helps with stability and robustness in learning.

On-Policy Distillation, Teacher-Student Learning, Privileged Information, Token-Level Supervision, Advantage Gap, Large Language Models, Vision-Language Models, Model Distillation, Robustness, Continual Learning

Authors

Xinlei Yu, Gen Li, Qingyi Si, Guibin Zhang, Yuqi Xu, Congcong Wang, Shuai Dong, Kaiwen Tuo, Xiangyu Zeng, Kaituo Feng, Qunzhong Wang, Yang Shi, Xiaobin Hu, Xiangyu Yue, Jiaqi Wang, Shuicheng Yan

Abstract

On-policy distillation (OPD) offers superior capacity transfer by supervising student-sampled trajectories with dense token-level signals. To furnish high-quality supervision sources and thereby elevate the performance frontier of distillation, an intuitive direction is to inject privileged information into either the teacher or the student itself. However, this additional input induces a potential failure mode we dub privilege illusion: a pattern that conflates the transferable capability gap that students are meant to close with the information-asymmetry gap, which can only be mimicked but never replicated. This issue is further amplified by the inherent non-uniformity of token-level supervision, where only a small subset of tokens carries pivotal capability-bearing signals. To this end, we propose DOPD, an advantage-aware dual distillation paradigm that dynamically routes token-level supervision between privileged teacher and privileged student policies based on their advantage gap and relative probabilities. Each token receives supervision of a different strength, objective, and strategy from either the teacher or the student itself, so that credible capability is transferred while auxiliary signals are still received, alleviating privilege illusion. Extensive experiments on both large language model (LLM) and vision-language model (VLM) settings demonstrate that DOPD consistently outperforms Vanilla OPD and other counterparts. Further results on stability, robustness, continual learning, and out-of-distribution tasks validate its superiority.
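
The abstract does not spell out the routing rule, so the following is only a rough sketch of what token-level routing between two supervision sources could look like, not the authors' method. Everything in it is an assumption made for illustration: the function name `route_tokens`, the `margin` threshold, the use of a per-token log-probability difference as a stand-in for the "advantage gap", and the log-ratio penalty (a common single-sample signal in on-policy distillation).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TokenSignal:
    source: str     # "teacher" or "self" (the privileged student)
    penalty: float  # log p_student - log p_source for the sampled token


def route_tokens(
    student_logp: List[float],
    teacher_logp: List[float],
    self_logp: List[float],
    margin: float = 0.0,
) -> List[TokenSignal]:
    """Hypothetical per-token routing between two supervision sources.

    Each list holds the log-probability of the student-sampled token under
    one policy: the plain student, the privileged teacher, and the
    privileged student. ASSUMPTION: the gap is teacher_logp - self_logp,
    and a token is routed to the teacher only when that gap exceeds
    `margin`. The paper's actual criterion may differ.
    """
    if not (len(student_logp) == len(teacher_logp) == len(self_logp)):
        raise ValueError("all log-prob sequences must have the same length")

    signals = []
    for s, t, p in zip(student_logp, teacher_logp, self_logp):
        gap = t - p  # assumed stand-in for the "advantage gap"
        if gap > margin:
            source, ref = "teacher", t
        else:
            source, ref = "self", p
        # Negative penalty: the chosen source rates the token higher
        # than the student does.
        signals.append(TokenSignal(source, s - ref))
    return signals


# Toy trajectory of three sampled tokens (made-up numbers).
signals = route_tokens(
    student_logp=[-1.0, -2.0, -0.5],
    teacher_logp=[-0.5, -0.25, -1.5],
    self_logp=[-0.75, -1.0, -0.5],
    margin=0.5,
)
for i, sig in enumerate(signals):
    print(i, sig.source, sig.penalty)
```

In this toy run only the second token is routed to the teacher, since it is the only position where the teacher's edge over the privileged student exceeds the margin. A real implementation would operate on full token distributions and, per the abstract, would also vary the strength and objective of supervision per token, which this sketch leaves out.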
