Evaluation of Automatic Speech Recognition Using Generative Large Language Models

2026-04-23 · Computation and Language

AI summary

The authors looked at better ways to judge how well speech recognition works by focusing on meaning instead of just counting word errors. They tested large language models (LLMs) in three ways: picking the better transcription from two candidates, measuring how close two sentences are in meaning, and sorting recognition errors by type. Their tests showed that LLM judgments agree with human annotators far more often than traditional metrics such as Word Error Rate. This suggests LLMs could help make speech recognition evaluation more interpretable and more focused on meaning.
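The second approach above, measuring how close meanings are, is typically scored as a distance between embedding vectors. A minimal sketch of that idea, assuming cosine distance as the metric and using toy hand-written vectors (real embeddings would come from a decoder-based LLM's hidden states, and the paper does not specify the pooling method here):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """Semantic distance as 1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Toy 3-d vectors, illustrative only: similar directions -> small distance,
# orthogonal directions -> distance of 1.0.
e_ref = [0.9, 0.1, 0.2]
e_hyp = [0.8, 0.2, 0.3]
print(cosine_distance(e_ref, e_hyp))
```

A hypothesis whose embedding points in nearly the same direction as the reference's gets a distance near zero, regardless of how many surface words differ, which is what makes such metrics meaning-sensitive where word-level counting is not.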

Keywords
Automatic Speech Recognition, Word Error Rate, Large Language Models, Semantic Evaluation, Embeddings, Hypothesis Selection, Generative Models, HATS Dataset, Decoder-based Models, Human Annotation
Authors
Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu, Mickael Rouvier, Jane Wottawa, Richard Dufour
Abstract
Automatic Speech Recognition (ASR) is traditionally evaluated using Word Error Rate (WER), a metric that is insensitive to meaning. Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task. This paper evaluates their relevance through three approaches: (1) selecting the best hypothesis between two candidates, (2) computing semantic distance using generative embeddings, and (3) qualitative classification of errors. On the HATS dataset, the best LLMs achieve 92-94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics. Embeddings from decoder-based LLMs show performance comparable to encoder models. Finally, LLMs offer a promising direction for interpretable and semantic ASR evaluation.
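The insensitivity of WER to meaning is easy to demonstrate. WER is the word-level Levenshtein (edit) distance between reference and hypothesis, normalized by the reference length, so a harmless substitution and a meaning-reversing one can score identically. A minimal sketch (the example sentences are hypothetical, not from the HATS dataset):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Both hypotheses differ from the reference by one substitution (WER = 0.2),
# yet only the second changes the meaning:
print(wer("turn left at the bank", "turn left at a bank"))    # → 0.2
print(wer("turn left at the bank", "turn right at the bank")) # → 0.2
```

This is exactly the failure mode the semantic and LLM-based approaches above are meant to address: a human (or an LLM judge) would strongly prefer the first hypothesis, while WER ranks them as equally good.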