Overcoming Decoder Inconsistencies in Whisper for Dravidian and Low-Resource Languages

2026-06-08 • Computation and Language

Computation and Language, Sound

AI summary

The authors studied why a popular speech recognition model called Whisper makes more errors on Dravidian languages than on Indo-Aryan ones. They found that Dravidian languages have longer words, more varied vocabulary, and less repetition, which makes it harder for the model to predict the right output. To address this, the authors created two new decoder techniques: one balances how much the model relies on the audio versus the surrounding language context, and the other feeds the model's intermediate predictions back in to keep its output consistent. Their methods improved recognition accuracy, especially for languages that have fewer resources and complex word structures.

Multilingual ASR, Whisper model, Word Error Rate (WER), Dravidian languages, Indo-Aryan languages, decoder attention, self-attention, cross-attention, token distribution, agglutinative languages

Authors

Chowdam Venkata Kumar, Kumud Tripathi, Pankaj Wasnik

Abstract

Multilingual ASR models such as Whisper perform well on high-resource languages but exhibit substantially higher Word Error Rates (WER) for Dravidian languages than for Indo-Aryan ones. Through linguistic and dataset analysis, we show that Dravidian languages have longer words, higher vocabulary diversity, and lower repetition, resulting in sparse token distributions and frequent character-level substitution errors. Baseline fine-tuning further reveals a decoder imbalance between self-attention (linguistic context) and cross-attention (acoustic cues). Although synthetic token-repetition experiments indicate potential gains, they are impractical. Motivated by these observations, we introduce two decoder-level enhancements: Weighted-Attention, which adaptively balances attention sources, and Self-Conditioning, which reinjects intermediate predictions to improve token consistency. Experiments demonstrate consistent WER reductions for low-resource and agglutinative languages.
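
The corpus properties named in the abstract (word length, vocabulary diversity, repetition) and the WER metric can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names `corpus_stats` and `wer`, the whitespace tokenization, and the use of type-token ratio as the diversity measure are all assumptions made for illustration.

```python
from collections import Counter


def corpus_stats(sentences):
    """Summarize a list of transcripts (hypothetical helper, not from the paper).

    Assumes whitespace-separated words. Word length is counted in Unicode
    code points, so combining marks in Indic scripts count individually.
    """
    words = [w for s in sentences for w in s.split()]
    if not words:
        raise ValueError("corpus contains no words")
    counts = Counter(words)
    type_token_ratio = len(counts) / len(words)
    return {
        "mean_word_length": sum(len(w) for w in words) / len(words),
        # Distinct word types per token: higher means a more diverse vocabulary.
        "type_token_ratio": type_token_ratio,
        # Fraction of tokens that repeat an already-seen word type.
        "repetition_rate": 1.0 - type_token_ratio,
    }


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Under these assumptions, a corpus with longer words and a higher type-token ratio has each word type appearing less often, which is one way to picture the sparse token distributions the abstract describes. Because WER counts a word as wrong when any character differs, a single character-level substitution inside a long agglutinative word costs a full word error.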
