Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization
2026-06-08 • Machine Learning
Machine Learning, Computer Vision and Pattern Recognition
AI summary
The authors study on-policy distillation (OPD), a post-training technique in which a model learns from a stronger teacher model's feedback on its own sampled outputs. They observe that the usual token-level form of this supervision can destabilize training, because the feedback values in some outlier states are too extreme. To fix this, they propose GNDPO, a method that normalizes the raw KL scores at the batch level into relative advantages, which keeps training stable while still learning from detailed token-level guidance. Their experiments show that GNDPO improves training stability and performance on multimodal reasoning tasks.
on-policy distillation, reinforcement learning, KL divergence, gradient instability, policy optimization, multimodal reasoning, batch normalization, teacher model, token-level supervision, trajectory sampling
Authors
Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu
Abstract
On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability caused by magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
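The abstract does not give the exact normalization formula, so the following is only a minimal sketch of one plausible reading of "transforming raw KL scores into batch-level relative advantages". It assumes the raw per-token score is the teacher log-probability minus the student log-probability on the sampled token, and that normalization is a z-score over every token in the batch. The function names `token_scores` and `globally_normalize`, the `eps` parameter, and the z-score form are illustrative assumptions, not taken from the released code.

```python
import math


def token_scores(teacher_logprobs, student_logprobs):
    """Raw per-token scores for a batch of sampled trajectories.

    ASSUMPTION: score = log p_teacher(token) - log p_student(token),
    a common single-sample estimate of the negative reverse-KL term.
    Both arguments are lists of per-trajectory lists of log-probabilities.
    """
    return [
        [t - s for t, s in zip(t_seq, s_seq)]
        for t_seq, s_seq in zip(teacher_logprobs, student_logprobs)
    ]


def globally_normalize(scores, eps=1e-6):
    """Z-score every token score against statistics of the whole batch.

    ASSUMPTION: "global" is read here as pooling all tokens of all
    trajectories in the batch, rather than normalizing per trajectory.
    Trajectories may have different lengths; their shapes are preserved.
    """
    flat = [s for seq in scores for s in seq]
    if not flat:
        return [[] for _ in scores]
    mean = sum(flat) / len(flat)
    var = sum((s - mean) ** 2 for s in flat) / len(flat)
    std = math.sqrt(var)
    # eps guards against division by zero when all scores are identical.
    return [[(s - mean) / (std + eps) for s in seq] for seq in scores]


# One outlier token (-25.0) dominates the raw scores of this toy batch.
raw = [[-0.1, -0.2, -0.1], [-0.3, -25.0]]
advantages = globally_normalize(raw)
```

In this toy batch the outlier token still receives the largest-magnitude advantage, but its scale is bounded relative to the rest of the batch instead of being two orders of magnitude larger, which is the kind of effect the abstract attributes to the normalization.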