From Awareness to Adherence: Bridging the Context Gap in Spoken Dialogue Systems via Context-Aware Decoding
2026-06-15 • Computation and Language
AI summary
The authors study why end-to-end spoken dialogue systems fail to stay consistent with earlier context in multi-turn conversations. They find that the model often internally recognizes the relevant past utterances but does not act on them when generating a response, because strong parametric priors overshadow those signals during decoding. To address this, they propose an audio-adapted Context-Aware Decoding method that uses the model's attention to isolate key past rounds and contrasts the output distributions with and without that context. On the Audio MultiChallenge benchmark, the method improves results on the Semantic Memory and Self Coherence subtasks.
end-to-end spoken dialogue systems, context adherence, multi-turn conversation, latent context awareness, parametric priors, attention mechanisms, Context-Aware Decoding, multimodal contextual signals, Audio MultiChallenge benchmark, semantic memory
Authors
Che Hyun Lee, Heeseung Kim, Sungroh Yoon
Abstract
Despite the success of end-to-end (E2E) spoken dialogue systems, maintaining strict context adherence in multi-round conversations remains a challenge. While prior works attribute these failures to models forgetting dialogue history, we highlight an equally critical but overlooked bottleneck: a gap between latent context awareness and active adherence. Although models internally recognize relevant past utterances, strong parametric priors often overshadow these signals during decoding. To bridge this gap, we propose an audio-adapted Context-Aware Decoding (CAD) approach. By leveraging internal attention mechanisms to isolate key historical rounds, our approach contrasts output distributions with and without this key context during inference, directly amplifying multimodal contextual signals. Evaluations on the Audio MultiChallenge benchmark demonstrate significant improvements in Semantic Memory and Self Coherence subtasks, successfully enforcing strict, context-faithful adherence.
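
The two steps the abstract describes, isolating key historical rounds from attention and contrasting output distributions with and without them, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the per-round attention scores, the top-k selection rule, and the weight alpha are all hypothetical. The combination rule (1 + alpha) * log p_with - alpha * log p_without is borrowed from the general contrastive form of context-aware decoding and is an assumption here, since the abstract does not give the exact formula.

```python
import math


def select_key_rounds(attn_per_round, top_k=1):
    """Pick the top_k dialogue rounds by attention mass (assumed selection rule)."""
    ranked = sorted(range(len(attn_per_round)),
                    key=lambda i: attn_per_round[i], reverse=True)
    return sorted(ranked[:top_k])  # return in chronological order


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_logprobs(logits_with, logits_without, alpha=1.0):
    """Contrast next-token distributions with and without the key context.

    Assumed rule: (1 + alpha) * log p_with - alpha * log p_without,
    renormalized. alpha = 0 recovers ordinary decoding with full context.
    """
    lp_with = log_softmax(logits_with)
    lp_without = log_softmax(logits_without)
    adjusted = [(1 + alpha) * a - alpha * b
                for a, b in zip(lp_with, lp_without)]
    return log_softmax(adjusted)


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Toy example: token 0 is favored by the prior, token 1 by the key context.
key_rounds = select_key_rounds([0.1, 0.6, 0.3], top_k=2)
logits_with = [1.0, 0.9, -1.0]      # key rounds kept in the history
logits_without = [2.0, -1.0, -1.0]  # key rounds removed from the history
plain = contrastive_logprobs(logits_with, logits_without, alpha=0.0)
contrasted = contrastive_logprobs(logits_with, logits_without, alpha=1.0)
```

In this toy case the uncontrasted distribution still prefers the prior-favored token, while the contrasted one shifts to the token that only the key context supports. In a real system the two logit vectors would come from two forward passes of the dialogue model, one with the selected rounds present and one with them masked or removed.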