From Text Metrics to Model Internals: A Study of Whisper ASR Hallucination Detection
2026-06-22 • Sound
Sound, Artificial Intelligence
AI summary
The authors studied how to detect when an automatic speech recognition (ASR) system such as Whisper hallucinates, meaning it outputs fluent text that has no basis in the audio. They compared three approaches: classifying the text output with text-evaluation metrics, prompting large language models (LLMs) with domain-specific prompts, and probing Whisper's own internal decoder states. Probing the decoder states, which requires no reference transcript, performed best among the three. Combining the text-based and decoder-state classifiers in a late-fusion meta-classifier gave the most accurate detection overall. This helps make ASR systems more reliable by catching fluent but wrong transcriptions.
ASR (Automatic Speech Recognition), hallucination detection, Whisper model, decoder states, large language models (LLMs), text classification, prompt conditioning, meta-classifier, end-to-end speech recognition, error detection
Authors
Jan Jasiński, Mateusz Barański, Julitta Bartolewska, Marcin Witkowski, Konrad Kowalczyk
Abstract
Hallucinations of ASR models, i.e. fluent transcriptions with no basis in the audio, degrade system performance and pose risks in downstream applications. Robust detection of such errors remains a challenge. This paper studies hallucination detection for Whisper large v3 on human-annotated real-speech data across three paradigms: text-based, LLM-based, and internal decoder state probing. Text classifiers that use text-evaluation metrics achieve high recall but degrade without reference transcripts. LLM-based detection improves in precision with domain-specific prompt conditioning, yet remains less competitive than the lightweight text-based methods. Probing Whisper's decoder representations, without a ground-truth reference, yields the strongest performance, revealing that hallucination traits are encoded across intermediate decoding layers. A late-fusion meta-classifier combining text and internal-state outputs achieves the best overall detection performance.
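The late-fusion idea in the abstract can be illustrated with a minimal sketch. It assumes each base detector (a text-based classifier and a decoder-state probe) already emits a hallucination score in [0, 1], and that the meta-classifier is a plain logistic regression over those two scores. The abstract does not specify the authors' meta-classifier, so the model choice, the function names (`fit_late_fusion`, `fuse`, `make_synthetic`), and the synthetic scores below are all hypothetical and stand in for real detector outputs.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_late_fusion(scores, labels, lr=0.5, epochs=1000):
    """Fit logistic-regression weights over base-detector scores.

    scores: list of (text_score, probe_score) tuples, each in [0, 1]
    labels: list of 0/1 ints, where 1 marks a hallucination
    Returns (weights, bias). Assumption: this is an illustrative
    meta-classifier, not the one used in the paper.
    """
    n_feat = len(scores[0])
    n = len(scores)
    w = [0.0] * n_feat
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(scores, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss w.r.t. the logit
            for j in range(n_feat):
                grad_w[j] += err * x[j]
            grad_b += err
        # Full-batch gradient-descent step on the mean log loss.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def fuse(w, b, x):
    # Fused hallucination probability for one utterance.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def make_synthetic(n, seed=0):
    # Hypothetical base-detector scores: noisy, but higher on hallucinations.
    rng = random.Random(seed)
    scores, labels = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.3 else 0
        text = min(1.0, max(0.0, rng.gauss(0.7 if y else 0.3, 0.2)))
        probe = min(1.0, max(0.0, rng.gauss(0.75 if y else 0.25, 0.15)))
        scores.append((text, probe))
        labels.append(y)
    return scores, labels


if __name__ == "__main__":
    scores, labels = make_synthetic(400)
    w, b = fit_late_fusion(scores, labels)
    print("weights:", w, "bias:", b)
    print("fused score, both detectors high:", fuse(w, b, (0.9, 0.9)))
    print("fused score, both detectors low:", fuse(w, b, (0.1, 0.1)))
```

Fusing at the score level keeps the base detectors independent: either one can be retrained or replaced without touching the other, and the learned weights give a rough indication of how much the fused decision leans on each signal.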