On-Policy Context Distillation for Language Models
2026-02-12 • Computation and Language
AI summary
The authors introduce On-Policy Context Distillation (OPCD), a method in which a student model trains on its own generated outputs while matching the behavior of a teacher model that sees additional context. This lets models internalize knowledge from their past solution traces and useful instructions from optimized prompts. On mathematical reasoning and text-based games, OPCD outperforms existing methods and retains performance on new, out-of-distribution problems. It also allows smaller models to learn effectively from larger teachers.
Context Distillation, On-Policy Learning, Kullback-Leibler Divergence, Teacher-Student Model, Experiential Knowledge, Prompt Distillation, Mathematical Reasoning, Text-Based Games, Model Distillation, Out-of-Distribution Generalization
Authors
Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei
Abstract
Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
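To make the objective concrete, below is a minimal PyTorch-style sketch of one OPCD update under my own simplifying assumptions, not the authors' released code: `student` and `teacher` are assumed to be causal LMs that return per-position vocabulary logits, `sample_from` is a hypothetical rollout helper, and the teacher is conditioned on the extra context (e.g. experiential traces or an optimized system prompt) while the student sees only the bare prompt. The loss is the reverse KL from the student to the context-conditioned teacher, evaluated on the student's own sampled trajectory, as described in the abstract.

```python
# Illustrative sketch of one OPCD update step (assumed interfaces, not the paper's code).
import torch
import torch.nn.functional as F

def opcd_step(student, teacher, prompt_ids, context_ids, optimizer, max_new_tokens=128):
    # 1) On-policy rollout: the student samples a trajectory from the bare prompt.
    #    `sample_from` is a hypothetical helper that returns the newly generated token ids.
    with torch.no_grad():
        traj_ids = sample_from(student, prompt_ids, max_new_tokens)

    # 2) Student log-probs over its own trajectory (no extra context).
    student_logits = student(torch.cat([prompt_ids, traj_ids], dim=-1))
    # Logits at position t predict token t+1, so slice the positions that predict traj_ids.
    student_logp = F.log_softmax(
        student_logits[:, prompt_ids.size(-1) - 1 : -1], dim=-1
    )

    # 3) Teacher log-probs over the same trajectory, conditioned on the context.
    with torch.no_grad():
        teacher_in = torch.cat([context_ids, prompt_ids, traj_ids], dim=-1)
        teacher_logits = teacher(teacher_in)
        offset = context_ids.size(-1) + prompt_ids.size(-1) - 1
        teacher_logp = F.log_softmax(teacher_logits[:, offset:-1], dim=-1)

    # 4) Reverse KL(student || teacher): exact per-token KL over the vocabulary
    #    at the student-sampled positions, averaged over the trajectory.
    rkl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()

    optimizer.zero_grad()
    rkl.backward()
    optimizer.step()
    return rkl.item()
```

Because the gradient flows only through the student's log-probabilities at positions it sampled itself, the update pushes the context-free student toward the teacher's context-conditioned behavior on the student's own distribution, which is the on-policy aspect the abstract emphasizes.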