Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

2026-06-08 · Machine Learning

AI summary

The authors discuss On-Policy Distillation (OPD), a technique for transferring knowledge from domain-expert teacher models to student language models. OPD normally requires the teacher and student to share a tokenizer (the component that splits text into tokens), which limits the model pairs it can be applied to. The authors enable OPD between models with different tokenizers by using a precise token-mapping algorithm. Their experiments show that the method is more compute-efficient than baselines across various benchmarks, making OPD usable for a wider variety of teacher-student pairs.

On-Policy Distillation, Large Language Models, Tokenizer, Supervised Fine-Tuning, Cross-tokenizer Distillation, Token Mapping, Probability Distribution, Model Adaptation
Authors
Yifan Niu, Han Xiao, Dongyi Liu, Zelong Wang, Dihong Gong, Yasheng Wang, Jia Li
Abstract
On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD methods require the teacher and student models to share the same tokenizer, which restricts OPD to models within the same series. For cross-tokenizer distillation, current mainstream practice typically falls back on Supervised Fine-Tuning (SFT) on teacher-generated responses, which fails to capture the rich knowledge embedded in the teacher's probability distribution. In this work, we enable the standard on-policy distillation method to operate across model families, using a precise token-mapping algorithm so that high-fidelity token-level signals can propagate across different tokenizers. Extensive experiments show that cross-tokenizer OPD is significantly more compute-efficient than baselines on various benchmarks. Our results unlock a broader range of teacher-student pairs for OPD, opening up new avenues for adapting and enhancing interactions between LLMs.
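
To make the tokenizer barrier concrete, the following Python sketch shows one simple way that token-level scores from two different tokenizations of the same text could be compared. It is an illustration only and is not the token-mapping algorithm of the paper, whose details the abstract does not give. The function names (`align_chunks`, `chunk_logprobs`), the toy tokenizations, and the log-probability values are all hypothetical. The sketch groups tokens into the smallest chunks whose character boundaries coincide on both sides, then sums log-probabilities within each chunk (by the chain rule, the log-probability of a chunk is the sum over its tokens).

```python
from typing import List, Tuple


def align_chunks(
    teacher_tokens: List[str], student_tokens: List[str]
) -> List[Tuple[List[int], List[int]]]:
    """Group token indices into minimal chunks covering the same characters.

    Hypothetical helper for illustration; assumes both token lists
    concatenate to exactly the same string.
    """
    if "".join(teacher_tokens) != "".join(student_tokens):
        raise ValueError("both tokenizations must spell the same text")

    chunks = []
    t_idx, s_idx = 0, 0      # next unread token on each side
    t_end, s_end = 0, 0      # characters consumed so far on each side
    t_group, s_group = [], []
    while t_idx < len(teacher_tokens) or s_idx < len(student_tokens):
        # Advance whichever side has consumed fewer characters.
        if t_end <= s_end and t_idx < len(teacher_tokens):
            t_end += len(teacher_tokens[t_idx])
            t_group.append(t_idx)
            t_idx += 1
        else:
            s_end += len(student_tokens[s_idx])
            s_group.append(s_idx)
            s_idx += 1
        # A shared character boundary closes the current chunk.
        if t_end == s_end and t_group and s_group:
            chunks.append((t_group, s_group))
            t_group, s_group = [], []
    return chunks


def chunk_logprobs(token_logprobs: List[float], groups: List[List[int]]) -> List[float]:
    """Sum per-token log-probabilities inside each chunk (chain rule)."""
    return [sum(token_logprobs[i] for i in group) for group in groups]


# Toy example: the same string split differently by two tokenizers.
teacher_tokens = ["token", "izer", "s"]
student_tokens = ["tok", "en", "izers"]
teacher_lp = [-0.5, -0.25, -0.25]   # made-up teacher log-probs per token
student_lp = [-1.0, -0.5, -1.5]     # made-up student log-probs per token

chunks = align_chunks(teacher_tokens, student_tokens)
teacher_chunk_lp = chunk_logprobs(teacher_lp, [t for t, _ in chunks])
student_chunk_lp = chunk_logprobs(student_lp, [s for _, s in chunks])

# Per-chunk log-probability gap on the student-sampled text; a negative
# value means the teacher rates that chunk as more likely than the student.
gaps = [s - t for s, t in zip(student_chunk_lp, teacher_chunk_lp)]
```

In this toy case the two tokenizations share a boundary only after "token" and at the end of the string, so the comparison happens over two chunks rather than over individual tokens. A chunk-level signal of this kind is coarser than a true token-level one, which is the gap a precise token mapping aims to close.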