ARKD: Adaptive Reinforcement Learning-Guided Bidirectional KL Divergence Distillation for Text Generation

2026-06-29 | Computation and Language

Computation and Language, Artificial Intelligence
AI summary

The authors study knowledge distillation, which makes large language models smaller and faster by training a smaller model to imitate a larger one. They find that a single KL divergence objective does not balance fitting the teacher's high-probability outputs against its low-probability (long-tail) outputs. To address this, they combine forward and reverse KL divergence and use a reinforcement-learning policy network to decide how much weight each term receives during training. The resulting student model scores higher on Rouge-L and BertScore than previous distillation methods.

Knowledge Distillation, Large Language Models, KL Divergence, Forward KL, Reverse KL, Reinforcement Learning, Distribution Alignment, Rouge-L, BertScore
Authors
Zilong Liu, Xuewen Zhang, Jinrui Xing, Juyi Qiao, Huiyong Wang, Junming Jiao
Abstract
Knowledge distillation (KD) is a key technique for compressing Large Language Models (LLMs), yet methods relying on a single KL objective often fail to balance primary distribution fitting with long-tail probability modeling, limiting both generation quality and generalization. To address this, we analyze the complementary roles of forward and reverse KL divergence (FKL/RKL) in distribution alignment from theoretical and empirical perspectives. We then propose a reinforcement-learning-based adaptive KL-weighted distillation framework, in which a policy network dynamically assigns weights to FKL and RKL based on teacher-student distributional characteristics, guided by immediate reward signals to achieve dual alignment on principal and long-tail modes. Extensive experiments demonstrate consistent improvements across Rouge-L and BertScore metrics, surpassing greedy heuristics by 0.4-0.6 points and outperforming other baseline methods on diverse benchmarks.
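The weighted combination of forward and reverse KL described in the abstract can be illustrated with a minimal sketch for a single token position. This is not the authors' implementation. The function names (`softmax`, `kl`, `mixed_kl_loss`) and the scalar weight `alpha` are hypothetical, and `alpha` is passed in by hand here, whereas the paper has a policy network choose the weighting. With teacher distribution p and student distribution q, forward KL is KL(p || q) and reverse KL is KL(q || p).

```python
import math

def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def mixed_kl_loss(teacher_logits, student_logits, alpha):
    """Return (loss, fkl, rkl) with loss = alpha * FKL + (1 - alpha) * RKL.

    alpha is an assumed scalar in [0, 1]; in the paper the weighting is
    produced adaptively by a policy network, which is not modeled here.
    """
    p = softmax(teacher_logits)  # teacher distribution
    q = softmax(student_logits)  # student distribution
    fkl = kl(p, q)  # forward KL: penalizes the student for missing teacher mass
    rkl = kl(q, p)  # reverse KL: penalizes student mass where the teacher has little
    return alpha * fkl + (1.0 - alpha) * rkl, fkl, rkl

if __name__ == "__main__":
    # Toy logits over a four-token vocabulary (made-up values).
    teacher = [2.0, 1.0, 0.1, -1.0]
    student = [1.5, 1.2, -0.5, -0.5]
    for alpha in (0.0, 0.5, 1.0):
        loss, fkl, rkl = mixed_kl_loss(teacher, student, alpha)
        print(f"alpha={alpha:.1f} loss={loss:.4f} fkl={fkl:.4f} rkl={rkl:.4f}")
```

Setting `alpha` to 1 recovers a pure forward-KL objective and setting it to 0 recovers pure reverse KL. An adaptive scheme of the kind the abstract describes would vary this weight during training instead of fixing it.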