Proactive for Uncertainty: Cause-Aware Error Diagnosis and Interactive Clarification for Spoken Dialogue Systems

2026-05-25 · Computation and Language

AI summary

The authors studied how to keep speech recognition errors from degrading spoken dialogue systems that feed transcripts to a language model. Instead of relying only on confidence scores to catch errors, they built small detectors that tell whether a problem comes from mishearing the audio, misunderstanding the words, or dropping words entirely. This diagnosis lets the language model ask targeted clarifying questions over several turns. In their tests, the approach more than doubled recall on domain-shift errors (57.96% vs. 23.66%) and yielded up to a 30% reduction in word error rate and a 17% improvement on the downstream task across accents, distortions, and domains.

Automatic Speech Recognition, Large Language Models, Spoken Dialogue Systems, Error Propagation, Confidence Scores, Token-level Errors, Perception Errors, Comprehension Errors, Deletion Errors, Word Error Rate
Authors
Yizhou Peng, Ziyang Ma, Changsong Liu, Yi-Wen Chao, Xie Chen, Eng Siong Chng
Abstract
Cascaded pipelines that pair Automatic Speech Recognition with a Large Language Model (ASR-LLM) remain popular for industrial Spoken Dialogue Systems (SDS), primarily because their decoupled design ensures perceptual verifiability. However, cascaded systems suffer from error propagation: transcription failures carry over to subsequent components and degrade the final interaction quality. Although ASR confidence scores offer a simple filter for unreliable inputs, this approach is fundamentally limited because it typically fails to detect deletion errors and cannot distinguish between acoustic (inability to hear clearly) and linguistic (inability to understand) mismatches, both of which require targeted recovery strategies. In this paper, we propose a cause-aware error recovery paradigm that rethinks robustness in SDS. Unlike traditional confidence filtering, we introduce a suite of small, precision-focused detectors that exploit deep ASR latent representations to disentangle token-level errors into perception, comprehension, and deletion failures. This fine-grained diagnosis enables the LLM to orchestrate targeted, multi-turn clarification strategies, turning ambiguous signals into smooth user interactions. Experimental results support our approach, which more than doubles recall on domain-shift errors (57.96% vs. 23.66%) compared to baselines. Crucially, this diagnostic precision yields up to a 30% reduction in word error rate (WER) and a 17% improvement on the downstream task across diverse accents, distortions, and domains.
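As a rough illustration of the cause-aware idea described in the abstract, the sketch below routes a diagnosed token-level error to a different clarification question depending on its cause. It is not the authors' implementation: the names `ErrorCause`, `TokenDiagnosis`, and `clarification_prompt` are hypothetical, the prompt wording is invented, and the diagnosis is assumed to be given (the paper's detectors derive it from ASR latent representations, which is not shown here).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ErrorCause(Enum):
    """The three failure types named in the abstract."""
    PERCEPTION = "perception"        # acoustic: could not hear clearly
    COMPREHENSION = "comprehension"  # linguistic: heard but not understood
    DELETION = "deletion"            # a word is missing from the transcript


@dataclass
class TokenDiagnosis:
    """Hypothetical detector output for one suspected error."""
    position: int          # token index; for a deletion, where the word is missing
    token: Optional[str]   # transcribed token, or None for a deletion
    cause: ErrorCause


def clarification_prompt(transcript: List[str], diag: TokenDiagnosis) -> str:
    """Pick a clarification question that matches the diagnosed cause."""
    if diag.cause is ErrorCause.PERCEPTION:
        # Acoustic problem: asking the user to repeat is the natural recovery.
        return f'Sorry, I did not hear "{diag.token}" clearly. Could you repeat it?'
    if diag.cause is ErrorCause.COMPREHENSION:
        # Linguistic problem: repeating may not help, so ask for a rephrase.
        return f'I heard "{diag.token}" but do not recognize it. Could you rephrase or spell it?'
    # Deletion: no token exists to quote, so refer to the preceding word.
    if diag.position == 0:
        return "I think I missed the beginning. Could you say that again?"
    previous = transcript[diag.position - 1]
    return f'I think I missed something after "{previous}". Could you say that part again?'
```

For example, with the transcript `["book", "a", "flight", "to"]` and a deletion diagnosed at position 4, the function asks about what followed "to". A confidence filter has no token to score in that case, which is why the abstract singles out deletion errors as something confidence scores typically miss.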