A Formula-Driven Survey and Research Agenda for On-Policy Distillation

2026-06-22

Artificial Intelligence
AI summary

The authors study on-policy distillation (OPD), a method where a student language model learns by generating text and getting scored by a teacher model on those generated tokens. They present a new way to categorize OPD methods based on how feedback from the teacher updates the student, focusing on both direct loss approaches and policy-gradient approaches. They identify key factors that affect OPD's success, like how feedback is weighted over time and how probability is shifted when certain tokens are discouraged. The authors also highlight two often-confused mechanisms related to timing of feedback and routing of probabilities, proposing new ideas to improve stability and accuracy. Finally, their work provides a framework for diagnosing and reporting OPD experiments systematically.

on-policy distillation, large language model, policy gradient, log-ratio updates, KL divergence, temporal credit assignment, probability routing, generalized advantage estimation, counterfactual routing, feedback-to-update
Authors
Bowen Zhang
Abstract
On-policy distillation (OPD) trains an LLM on states induced by the current or recent student policy: the student generates complete or partial rollouts, a teacher or self-teacher scores the resulting tokens under their generated contexts, and dense log-probability, logit, or distributional signals are converted into post-training updates. This survey studies OPD as a feedback-to-update problem rather than a single loss family. We develop a formula-driven taxonomy from two routes -- direct distributional losses and policy-gradient-style log-ratio updates -- and use it to organize core methods, verifier- or outcome-guided hybrids, industrial reports, framework implementations, failure modes, and stabilization recipes under explicit evidence boundaries. The taxonomy shows that OPD effectiveness depends not only on KL direction or teacher access, but also on state compatibility, support construction, temporal credit, vocabulary-level probability routing, gates and weights, and regularization. We further separate two mechanisms often conflated in sampled-token OPD stability discussions. Temporal credit asks how teacher-student log-ratio returns should weight sampled actions across a rollout; vocabulary routing asks where probability mass should move when negative feedback suppresses a sampled token. This distinction yields bias boundaries for immediate, return-to-go, discounted, and baseline-corrected estimators, motivates GAE-OPD as a value-based hypothesis for log-ratio returns, and motivates Counterfactual Routed OPD (CR-OPD) for routing probability mass toward teacher-supported, student-reachable alternatives. We close by mapping actionability diagnostics, failure mechanisms, case studies, open problems, and a reporting checklist onto the same feedback-to-update variables.
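
The temporal-credit question in the abstract can be made concrete with a minimal Python sketch. It assumes the per-token log-probabilities of the sampled tokens under the teacher and the student are already available for one rollout. The function names, the `mode` strings, and the constant scalar `baseline` are illustrative assumptions, not the paper's implementation; GAE-OPD and CR-OPD are not shown because the abstract does not specify them.

```python
def log_ratio_rewards(teacher_logps, student_logps):
    """Per-token signal r_t = log p_teacher(y_t | ctx_t) - log p_student(y_t | ctx_t).

    Both inputs are log-probabilities of the *sampled* tokens of one rollout.
    """
    return [t - s for t, s in zip(teacher_logps, student_logps)]


def credit_weights(rewards, mode="immediate", gamma=1.0, baseline=0.0):
    """Weight that multiplies grad log pi_student(y_t | ctx_t) at each position t.

    mode="immediate":     w_t = r_t - baseline
    mode="return_to_go":  w_t = sum_{k >= t} r_k - baseline
    mode="discounted":    w_t = sum_{k >= t} gamma**(k - t) * r_k - baseline

    The constant baseline is a simplifying assumption for illustration.
    """
    if mode == "immediate":
        return [r - baseline for r in rewards]
    if mode == "return_to_go":
        gamma = 1.0  # return-to-go is the undiscounted special case
    elif mode != "discounted":
        raise ValueError(f"unknown mode: {mode}")

    weights = []
    running = 0.0
    # Accumulate from the last token backwards so each position sees its future.
    for r in reversed(rewards):
        running = r + gamma * running
        weights.append(running - baseline)
    weights.reverse()
    return weights


if __name__ == "__main__":
    rewards = log_ratio_rewards([-1.0, -0.5, -3.0], [0.0, -1.0, -1.0])
    for mode in ("immediate", "return_to_go", "discounted"):
        print(mode, credit_weights(rewards, mode=mode, gamma=0.5))
```

Under these assumptions the three modes differ only in how much of the future log-ratio signal is attributed to each sampled token. Where the probability mass goes when a weight is negative is a separate question, which is the vocabulary-routing mechanism the abstract distinguishes from temporal credit.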