From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding

2026-06-15 • Computation and Language


AI summary

The authors study why end-to-end spoken dialogue systems fail to stay consistent with earlier context in multi-turn conversations. They find that even when a model internally recognizes the relevant past utterances, it often fails to act on them while generating a response, because strong learned priors dominate decoding. To address this, they propose an audio-adapted Context-Aware Decoding method that uses the model's attention to identify key past rounds and amplifies their influence during decoding. Evaluations on the Audio MultiChallenge benchmark show improvements on the Semantic Memory and Self Coherence subtasks.

end-to-end spoken dialogue systems, context adherence, multi-turn conversation, latent context awareness, parametric priors, attention mechanisms, Context-Aware Decoding, multimodal contextual signals, Audio MultiChallenge benchmark, semantic memory

Authors

Che Hyun Lee, Heeseung Kim, Sungroh Yoon

Abstract

Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.
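
The contrast described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used context-aware decoding form, in which next-token logits computed with the key context are pushed away from logits computed without it; the abstract does not give the authors' exact audio-adapted formulation, so this may differ from theirs. The function names, the weight `alpha`, the per-round attention scores, and the toy logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_key_rounds(attention_mass, top_k=1):
    """Pick the top_k dialogue rounds by attention mass (hypothetical helper).

    attention_mass[i] is assumed to be the total attention the model
    places on round i; how it is aggregated is not specified here.
    Indices are returned in dialogue order.
    """
    ranked = sorted(range(len(attention_mass)),
                    key=lambda i: attention_mass[i], reverse=True)
    return sorted(ranked[:top_k])


def contrastive_logits(logits_with, logits_without, alpha=0.5):
    """Assumed contrast: (1 + alpha) * with_context - alpha * without_context."""
    return [(1 + alpha) * w - alpha * wo
            for w, wo in zip(logits_with, logits_without)]


# Toy setup: three past rounds, three candidate next tokens.
attention_mass = [0.05, 0.70, 0.25]           # assumed per-round attention
key_rounds = select_key_rounds(attention_mass, top_k=1)

logits_with = [2.0, 1.8, 0.0]                 # full history available
logits_without = [2.0, 0.5, 0.0]              # key rounds masked out

adjusted = contrastive_logits(logits_with, logits_without, alpha=1.0)
probs_plain = softmax(logits_with)
probs_cad = softmax(adjusted)

best_plain = probs_plain.index(max(probs_plain))
best_cad = probs_cad.index(max(probs_cad))
print(key_rounds, best_plain, best_cad)
```

In this toy case token 0 is favored regardless of the key round (a stand-in for a parametric prior), while token 1 is supported only when the key round is present. The contrast raises token 1 above token 0, which is the kind of shift toward context-dependent tokens that the abstract describes.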
