Escaping the KL Agreement Trap in On-Policy Distillation

2026-06-08 · Machine Learning

Machine Learning, Computation and Language
AI summary

The authors study a failure mode in on-policy distillation: once the student model drifts into a prefix it cannot recover from, the teacher tends to agree with the degraded continuation, so the reverse KL is low but the training signal carries little correction. They call this the low-KL agreement trap and show that tokens generated during and after a trap provide less useful supervision. To address it, they propose KAT, a method that detects when a rollout is stuck in the trap and terminates it, so training skips those weak signals. Across four math benchmarks this improves avg@k accuracy by 2.66% and pass@k by 3.43%, while cutting average rollout length by 59.73%.

on-policy distillation, teacher-student model, KL divergence, training signal, rollout, supervision, low-KL agreement trap, pass@k, accuracy, training termination
Authors
Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong
Abstract
On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
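
The abstract describes KAT only at a high level, so the following is a minimal illustrative sketch of what a persistent low-KL termination rule could look like, not the authors' implementation. Everything beyond "per-token reverse KL", "persistent low-KL agreement", and "dynamic threshold" is an assumption made for illustration: the names `reverse_kl`, `find_trap_start`, and `AdaptiveThreshold`, the `patience` window, and the choice of a scaled moving average as the threshold are all hypothetical.

```python
import math
from typing import Optional, Sequence


def reverse_kl(student_probs: Sequence[float],
               teacher_probs: Sequence[float],
               eps: float = 1e-12) -> float:
    """Reverse KL, KL(student || teacher), at one token position.

    Both inputs are probability distributions over the same vocabulary.
    """
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def find_trap_start(token_kls: Sequence[float],
                    threshold: float,
                    patience: int) -> Optional[int]:
    """Index where the first run of `patience` consecutive tokens with
    KL below `threshold` begins, or None if no such run exists.

    ASSUMPTION: "persistent" is modeled as a fixed-length consecutive run.
    """
    run = 0
    for i, kl in enumerate(token_kls):
        run = run + 1 if kl < threshold else 0
        if run >= patience:
            return i - patience + 1
    return None


class AdaptiveThreshold:
    """Hypothetical training-adaptive threshold.

    ASSUMPTION: the threshold is `scale` times an exponential moving
    average of the mean per-token KL, so it shrinks as the student
    approaches the teacher over the course of training.
    """

    def __init__(self, scale: float = 0.1, decay: float = 0.9) -> None:
        self.scale = scale
        self.decay = decay
        self.ema: Optional[float] = None

    def update(self, token_kls: Sequence[float]) -> float:
        """Fold one rollout's KL values into the average; return the threshold."""
        mean_kl = sum(token_kls) / len(token_kls)
        if self.ema is None:
            self.ema = mean_kl
        else:
            self.ema = self.decay * self.ema + (1.0 - self.decay) * mean_kl
        return self.scale * self.ema


# Example: truncate a rollout's supervised tokens at the detected trap.
kls = [0.8, 0.5, 0.01, 0.02, 0.9, 0.01, 0.01, 0.01, 0.4]
start = find_trap_start(kls, threshold=0.05, patience=3)
kept = kls if start is None else kls[:start]
```

In this example the two low-KL tokens at positions 2 and 3 do not trigger termination because the run is shorter than `patience`, while the run beginning at position 5 does, so only the first five tokens would contribute to the loss. Whether a real rule should cut at the start of the run or at the point of detection is a design choice this sketch does not settle.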