PRIDE: Privileged Information-enhanced Distillation for Empathetic Dialogue Generation
2026-06-22 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors address the problem that large language models for empathetic dialogue are too computationally demanding to run on resource-constrained devices. They propose PRIDE, a distillation method that uses privileged information, available only during training, to teach smaller models to understand and respond with empathy. PRIDE prompts the teacher to break empathetic reasoning into explicit steps, and it uses a multi-source attention mechanism and a dual-alignment loss so that the smaller model learns well without needing the extra information at inference time. Their experiments show that the smaller models are competitive and in some cases match or surpass larger teacher models in accuracy and semantic relevance.
Large language models, Empathetic dialogue, Knowledge distillation, Privileged information, Multi-source attention, Dual-alignment loss, Kullback-Leibler divergence, Maximum mean discrepancy, Model compression, Natural language generation
Authors
Jiaqiang Wu, Zhouan Zhu, Shangfei Wang
Abstract
Large language models have demonstrated significant capabilities in generating diverse and context-aware responses for empathetic dialogue. However, their computational demands severely limit their deployment in resource-constrained environments. While knowledge distillation offers a promising compression solution, it often fails to transfer the nuanced understanding essential for empathy, as it overlooks the implicit contextual cues that guide human connection. To bridge this gap, we propose a privileged information-enhanced knowledge distillation method for empathetic dialogue generation (PRIDE). Our method leverages privileged information, such as expert psychological annotations or future event summaries, which is available only during training and not at inference time. This allows us to transfer the teacher model's empathetic reasoning to smaller models without relying on extra inputs during deployment. Specifically, PRIDE has three key components: (1) an empathy-reasoning prompt that guides the teacher to explicitly decompose the empathetic process into understanding feelings and analyzing situations step by step; (2) a multi-source attention mechanism that directs the student to effectively integrate privileged information; (3) a dual-alignment loss that combines reversed Kullback-Leibler divergence and maximum mean discrepancy to ensure robust knowledge transfer at both the logit and feature levels. Experiments on multi-modal and text-only datasets demonstrate that our method achieves competitive performance, and in some cases matches or even surpasses larger teacher models in terms of accuracy and semantic relevance.
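As a rough illustration of the third component, the sketch below combines a reverse KL term on logits with a maximum mean discrepancy (MMD) term on features. It is not the authors' implementation: the abstract does not say how the two terms are weighted, which kernel is used, or which layers supply the features. The RBF kernel, the `bandwidth` parameter, the `mmd_weight` parameter, and all function names are assumptions made for illustration, and the code uses only the Python standard library so that it stays self-contained.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the expectation is taken under the student."""
    log_q = log_softmax(student_logits)
    log_p = log_softmax(teacher_logits)
    return sum(math.exp(lq) * (lq - lp) for lq, lp in zip(log_q, log_p))


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel; the kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def dual_alignment_loss(student_logits, teacher_logits,
                        student_feats, teacher_feats, mmd_weight=1.0):
    """Logit-level reverse KL plus feature-level MMD.

    mmd_weight is a hypothetical balancing coefficient, not taken from the paper.
    """
    logit_term = reverse_kl(student_logits, teacher_logits)
    feature_term = mmd_squared(student_feats, teacher_feats)
    return logit_term + mmd_weight * feature_term
```

Both terms vanish when the student reproduces the teacher exactly and grow as the distributions drift apart. Reverse KL is asymmetric: because the expectation is taken under the student, the student is penalized mainly for placing probability mass where the teacher places little, which is one common motivation for preferring it over forward KL when distilling generative models.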