HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems

2026-06-22

Sound, Artificial Intelligence
AI summary

The authors created HALAS, a dataset of the hallucinations that speech recognition systems produce on real, natural audio from earnings calls. They labeled each hallucination at the span level to study how often hallucinations occur and what kinds appear across seven state-of-the-art ASR models. They found that hallucinations can occur even when the overall transcription is mostly correct. Their results also show that current methods for detecting hallucinations are not very accurate (the best detection methods reach an F1 score of only 53.1%), which positions HALAS as the first real-world benchmark for detecting and mitigating ASR hallucinations.

Automatic Speech Recognition, ASR hallucinations, dataset annotation, earnings calls, transcription errors, Word Error Rate, error detection, benchmark dataset
Authors
Mateusz Barański, Jan Jasiński, Julitta Bartolewska, Marcin Witkowski, Konrad Kowalczyk
Abstract
End-to-end Automatic Speech Recognition (ASR) systems hallucinate on natural speech, yet existing mitigation methods are typically evaluated on non-speech or artificially corrupted audio. We introduce HALAS, the first human-annotated dataset of naturally occurring hallucinations from seven state-of-the-art ASR models on real, unprocessed earnings call recordings. HALAS provides span-level labels, enabling analysis of hallucination patterns and their severity. Our analysis reveals strong cross-model vocabulary overlap and confirms that hallucinations also occur in almost correctly transcribed speech (characterized by a low Word Error Rate). The proposed HALAS benchmark shows that character- and semantic-level metrics used as a proxy for hallucination detection reach 81% ROC-AUC, while state-of-the-art detection methods achieve an F1 score of only 53.1%. As such, HALAS establishes the first rigorous non-artificial benchmark for the detection and mitigation of ASR hallucinations.
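The point that hallucinations can hide behind a low Word Error Rate can be illustrated with a small sketch. The code below is not taken from HALAS: the function name `word_error_rate`, the whitespace tokenization, and the example sentence are all assumptions made for illustration. It computes WER as the word-level edit distance divided by the reference length, then scores a hypothetical transcript in which a short phrase that was never spoken is appended to an otherwise perfect transcription.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a plain lowercase whitespace split,
    which is an assumption and not necessarily what the authors use.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word present in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up 20-word reference; it is not a sample from the dataset.
reference = (
    "revenue for the third quarter grew eight percent year over year "
    "driven by strong demand in our cloud services segment"
)
# Hypothetical ASR output: a perfect transcript plus two words nobody said.
hypothesis = reference + " thank you"

wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}")
```

Two inserted words against a 20-word reference give a WER of 0.10, a score that would usually be read as a good transcription even though the output contains fabricated content. This is why an aggregate metric such as WER cannot stand in for span-level hallucination labels.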