WAXAL-NET: Finetuned Edge ASR Across 19 African Languages

2026-06-01Computation and Language

Computation and LanguageComputers and SocietyHuman-Computer Interaction
AI summary

The authors tested if smaller, specialized speech recognition models work better than huge multilingual ones for conversational African languages. They found that fine-tuned smaller models made much fewer errors across 19 languages compared to large zero-shot models. These smaller models also handled new kinds of speech decently and worked differently depending on language family and script. The authors created an error guide and showed that standard error rates can be misleading for some writing systems. They also shared all their models and data to help future research.

ASR (Automatic Speech Recognition)WER (Word Error Rate)CTC (Connectionist Temporal Classification)autoregressive modelszero-shot learningmultilingual modelsfine-tuningout-of-distribution (OOD)syllabary scriptWAXAL corpus
Authors
Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi, Wanchi Lucia Yen, John Uzodinma, Sunday Ajayi, Oluwademilade Williams, Kausar Moshood, Innocent Elendu Anyaele, Akebert Arefaine, Candace Hunzwi, Wongel Dawit Daniel, Emmilly Namuganga, Cleophas Kadima, Athanase Bahizire, Onitsiky Ranaivoson, Emmanuel Aaron, Nicholaus Ladislaus, Idris Muhammed, Jonathan Enoch Simenya, Martin Koome, Matewos Tegete Endaylalu, Peter Ifeoluwa Adeyemo, Hondi Prisca Birindwa, Ukachi Agnes Eze-Mbey, Yacoba Oduro-Yeboah, Pericles Adjovi, Mikel K. Ngueajio, Toluwani Aremu, Prasenjit Mitra
Abstract
We evaluate whether compact domain-specialized ASR models can outperform massively multilingual foundation models for conversational African speech across 19 languages in the WAXAL corpus. Fine-tuned edge models achieve a macro-averaged WER of $38.0\%$ compared to $64.9\%$ for the best zero-shot baseline, a $26.9$ percentage-point reduction using models $3-40\times$ smaller. Results confirm that domain specialization dominates scale for spontaneous African speech. Cross-domain evaluation shows that fine-tuned models recover usable performance on out-of-distribution (OOD) speech, while zero-shot models regain an advantage when the test domain matches their pretraining distribution. A distributed native-speaker audit across all surveyed languages produces a linguistically-grounded error taxonomy, showing that CTC and autoregressive architectures behave differently across language families. We further show that WER alone misrepresents performance for syllabary-script languages where CER/WER ratios reveal substantially higher character-level accuracy than headline WER suggests. Finally, to contribute to future African ASR research, we release all model weights, fine-tuning and evaluation scripts, and a cleaned WAXAL subset covering all $19$ languages.