Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails
2026-06-22 • Computation and Language
AI summary
The authors study Knowledge Distillation (KD), in which a smaller language model learns from a larger one, during the post-training stage. They find that KD outperforms supervised fine-tuning when training data is scarce, but that its advantage shrinks as more data is added unless the teacher is a stronger instruction-tuned model. They also propose a two-stage approach for domain-specific, low-data settings: train first on synthetic teacher-labeled data, then refine on human annotations. The results offer practical guidance for building compact models in resource-constrained settings.
Large Language Models, Knowledge Distillation, Supervised Fine-Tuning, Instruction Tuning, Post-Training, Low-Data Regimes, Synthetic Data, Human Annotations, Resource-Constrained Environments, Domain-Specific Models
Authors
Xin Liu, Simin Ma, Shujian Liu, Song Wang, Sathish Reddy Indurthi, Haoyun Deng, Lu Wang, Kaiqiang Song
Abstract
Large language models (LLMs) achieve strong performance across many tasks, but their high computational cost limits deployment in resource-constrained environments. Knowledge Distillation (KD) offers a practical solution by transferring knowledge from a larger teacher model to a smaller student model. While prior work has mainly examined task-specific or small-scale settings, the post-training stage for building general instruction-following models has received limited attention. In this paper, we conduct a systematic study of KD in post-training using the large-scale Tulu 3 dataset. We find that KD outperforms supervised fine-tuning (SFT) in low-data regimes, but its advantage diminishes as more training data is added. Distilling from a stronger instruction-tuned teacher restores substantial gains even with abundant data, indicating that KD remains effective when the teacher provides knowledge that the student cannot easily acquire from the training data alone. We further study domain-specific, low-resource scenarios and propose a two-stage KD strategy that leverages synthetic teacher-labeled data followed by refinement on human annotations. This method consistently improves student performance, providing practical guidance for building compact models in data-scarce environments.
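As a rough illustration of how KD differs from SFT at the level of a single token, the sketch below contrasts a hard-label cross-entropy loss with a soft-label distillation loss. The abstract does not state which distillation objective is used; the forward KL divergence, the temperature parameter, and the function names here are assumptions chosen for illustration only.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def sft_loss(target_index, student_logits):
    """SFT: cross-entropy against a single hard label (the reference token)."""
    return -log_softmax(student_logits)[target_index]


def kd_loss(teacher_logits, student_logits, temperature=1.0):
    """KD (assumed objective): forward KL(teacher || student).

    The student is pulled toward the teacher's full distribution rather
    than toward a single reference token.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]   # hypothetical logits over a 3-token vocabulary
    student = [0.5, 0.5, 0.5]
    print("SFT loss:", sft_loss(0, student))
    print("KD loss: ", kd_loss(teacher, student))
```

In a sequence-level setting, either loss would be averaged over all response tokens. The KD loss is zero only when the student matches the teacher's distribution exactly, which is one way to read the claim that distillation carries information the hard labels alone do not.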