NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages

2026-04-17


Keywords
speech translation, low-resource languages, parallel dataset, Igbo, Hausa, Yorùbá, Nigerian Pidgin, end-to-end models, AudioLLM, cascaded translation
Authors
Marie Maltais, Yejin Jeon, Min Ma, Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Maryam Ibrahim Mukhtar, Daud Abolade, Joel Okepefi, Johnson Sewedo, David Ifeoluwa Adelani
Abstract
Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than fine-tuned cascaded and end-to-end methods. However, for speech-to-speech translation, the cascaded and audio LLM paradigms yield comparable performance, indicating that there is still considerable room for improvement in developing targeted, task-specific models for this setting. By providing both a high-quality dataset and a systematic benchmark, we hope that NaijaS2ST will serve as a strong foundation for advancing research in low-resource, multilingual speech translation.