Reliability-Gated Multi-Teacher Distillation for Low-Resource Abstractive Summarization
2026-04-03 • Computation and Language
Computation and Language · Artificial Intelligence
AI summary
The authors explore improving text summarization using multiple teacher models in situations with limited data. They create methods called EWAD and CPDP that decide how much to trust teacher models based on their agreement and ensure the student model stays balanced among them. Testing on Bangla language datasets and others, they find simpler teacher guidance often works better, especially for longer summaries. They also show cross-language learning can keep much of the teacher's quality while shrinking model size. Human checks reveal that relying on a single judge can bias evaluations, highlighting the need for careful validation.
knowledge distillation, abstractive summarization, entropy weighting, multi-teacher learning, logit-level distillation, cross-lingual learning, ROUGE score, model compression, semantic similarity, evaluation bias
Authors
Dipto Sumit, Ankan Kumar Roy, Sadia Khair Rodela, Atia Haque Asha, Mourchona Afrin, Niloy Farhan, Farig Yousuf Sadeque
Abstract
We study multi-teacher knowledge distillation for low-resource abstractive summarization from a reliability-aware perspective. We introduce EWAD (Entropy-Weighted Agreement-Aware Distillation), a token-level mechanism that routes supervision between teacher distillation and gold supervision based on inter-teacher agreement, and CPDP (Capacity-Proportional Divergence Preservation), a geometric constraint on the student's position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and 8 Qwen2.5 experiments, we find that logit-level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross-lingual pseudo-label KD across ten languages retains 71-122 percent of teacher ROUGE-L at 3.2x compression. A human-validated multi-judge LLM evaluation further reveals calibration bias in single-judge pipelines. Overall, our results show that reliability-aware distillation helps characterize when multi-teacher supervision improves summarization and when data scaling outweighs loss engineering.
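To make the agreement-gated routing idea concrete, the following is a minimal PyTorch sketch of a token-level gate between a distillation term and gold cross-entropy. It is an illustrative assumption rather than the authors' EWAD implementation: the agreement signal (normalized entropy of the averaged teacher distribution), the temperature tau, and the function name agreement_gated_loss are all hypothetical choices made for this sketch.

    import torch
    import torch.nn.functional as F

    def agreement_gated_loss(student_logits, teacher_logits_list, gold_ids, tau=2.0):
        # student_logits: (batch, seq, vocab); teacher_logits_list: list of tensors of the
        # same shape, one per teacher; gold_ids: (batch, seq) reference token ids.
        # Padding masking is omitted for brevity.

        # Average the teachers' softened distributions.
        teacher_probs = torch.stack(
            [F.softmax(t / tau, dim=-1) for t in teacher_logits_list]
        ).mean(dim=0)                                        # (batch, seq, vocab)

        # Agreement proxy: the averaged distribution has low entropy when teachers
        # concentrate on the same tokens; normalize entropy to [0, 1] and invert.
        entropy = -(teacher_probs * teacher_probs.clamp_min(1e-9).log()).sum(-1)
        max_entropy = torch.log(torch.tensor(float(student_logits.size(-1))))
        agreement = 1.0 - entropy / max_entropy              # (batch, seq)

        # Per-token distillation term: KL(teacher || student) soft-target loss at temperature tau.
        log_p_student = F.log_softmax(student_logits / tau, dim=-1)
        kd = F.kl_div(log_p_student, teacher_probs, reduction="none").sum(-1) * tau ** 2

        # Per-token gold term: standard cross-entropy against the reference summary.
        ce = F.cross_entropy(student_logits.transpose(1, 2), gold_ids, reduction="none")

        # Route supervision token by token: trust the teachers where they agree,
        # fall back to the gold labels where they disagree.
        return (agreement * kd + (1.0 - agreement) * ce).mean()

In the paper's framing, a gate of this kind is what lets supervision shift toward gold references on tokens where heterogeneous teachers disagree; the exact agreement measure and weighting used by EWAD may differ from this sketch.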