Strong Teacher Not Needed? On Distillation in LLM Pretraining

2026-05-22

Machine Learning, Computation and Language
AI summary

The authors investigated how distilling from smaller or weaker teacher models affects larger student language models during pretraining. They found that even small or undertrained teachers can improve bigger student models, provided the language modeling and distillation losses are mixed properly. Surprisingly, making the teacher bigger or training it on more tokens does not always lead to better students: the gains can level off or even reverse. They also found that distillation improves generalization (out-of-distribution and downstream performance) more readily than it improves fit to the training distribution. Overall, the authors challenge the idea that only strong teachers are useful for distillation in language model pretraining.

knowledge distillation, language model, pretraining, teacher-student model, model size, training tokens, generalization, out-of-distribution, downstream tasks
Authors
Taiming Lu, Zhuang Liu
Abstract
Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation's effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.
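The "mixing of the language modeling and knowledge distillation losses" mentioned in the abstract can be illustrated with a minimal per-token sketch. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the convex combination with a weight `alpha`, the forward KL divergence from teacher to student as the distillation term, and the function names `log_softmax` and `mixed_distillation_loss` are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mixed_distillation_loss(student_logits, teacher_logits, target, alpha=0.5):
    """Per-token loss: (1 - alpha) * LM loss + alpha * KD loss.

    Assumed form (not confirmed by the source):
      LM loss = cross-entropy of the student against the ground-truth token
      KD loss = KL(teacher || student) over the vocabulary
    alpha = 0 recovers plain language modeling; alpha = 1 is pure distillation.
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)

    # Language modeling term: negative log-likelihood of the true next token.
    lm_loss = -s_logp[target]

    # Distillation term: KL divergence from the teacher to the student.
    kd_loss = sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))

    return (1.0 - alpha) * lm_loss + alpha * kd_loss


# Toy three-token vocabulary; the logits are made-up numbers.
student = [2.0, 0.5, -1.0]
teacher = [1.0, 1.2, -0.5]
loss = mixed_distillation_loss(student, teacher, target=0, alpha=0.3)
```

In this sketch a small `alpha` keeps the ground-truth signal dominant, which is one plausible way a weak teacher could still contribute without pulling the student toward its own errors; the weighting actually used by the authors may differ.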