WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data
2026-05-13 • Computation and Language • Artificial Intelligence
AI summary
The authors created WARDEN, a system that turns spoken Wardaman, an endangered Australian indigenous language, into English text. Because they had only 6 hours of recorded and labeled Wardaman audio, they split the task into two steps: first turning speech into phonemes, then translating those phonemes into English. They improved the first step by borrowing phoneme representations from Sundanese, a language with similar sounds, and helped the second step by giving the system a Wardaman-English dictionary. This two-step method worked better than trying to do transcription and translation all at once with so little data, and WARDEN outperformed bigger models despite the small dataset.
Wardaman language · language transcription · language translation · low-resource languages · phonemic transcription · fine-tuning · large language models · Sundanese language · indigenous languages · machine learning
Authors
Ziheng Zhang, Yunzhong Hou, Naijing Liu, Liang Zheng
Abstract
This paper introduces WARDEN, an early language model system capable of transcribing and translating Wardaman, an endangered Australian indigenous language, into English. The significant challenge we face is the lack of large-scale training data: in fact, we have only 6 hours of annotated audio. Therefore, while it is common practice to train a single model for transcription and translation using large datasets (as in English-to-French), this practice is no longer viable in the Wardaman-to-English setting. To tackle the low-resource challenge, we design WARDEN with separate transcription and translation models: WARDEN first turns a Wardaman audio input into a phonemic transcription, and then turns the transcription into an English translation. Further, we propose two techniques to enhance performance. For transcription, we initialize the Wardaman tokens from Sundanese, a language that shares similar phonemes with Wardaman, to accelerate fine-tuning of the transcription model. For translation, we compile a Wardaman-English dictionary from expert annotations and provide this domain-specific knowledge to a large language model (LLM), which reasons over the entries to decide the final output. We empirically demonstrate that this two-stage design works better than data-hungry unified approaches in extremely low-data settings. Using a mere 6 hours of annotated data, WARDEN outperforms larger open-source and proprietary models and establishes a strong baseline. Data and code are available.
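To make the first technique concrete, here is a minimal sketch of initializing Wardaman phoneme tokens from a Sundanese model, assuming a CTC-style transcription head (the paper does not specify the architecture). The checkpoint name for the Sundanese model, both phoneme vocabularies, and the resulting mapping are illustrative placeholders, not the authors' code or data.

```python
# A minimal sketch, not the authors' released code: the Sundanese checkpoint
# name, the toy vocabularies, and the phoneme mapping are all assumptions.
import torch
from transformers import Wav2Vec2ForCTC

# Toy phoneme-to-id tables; <pad> doubles as the CTC blank.
sundanese_vocab = {"<pad>": 0, "a": 1, "i": 2, "u": 3, "ng": 4, "ny": 5}
wardaman_vocab  = {"<pad>": 0, "a": 1, "i": 2, "u": 3, "ng": 4, "rr": 5}

# Source: a model already fine-tuned for Sundanese phoneme CTC (hypothetical name).
src = Wav2Vec2ForCTC.from_pretrained("your-org/wav2vec2-ctc-sundanese")
# Target: a multilingual backbone with a freshly initialized Wardaman CTC head.
tgt = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-300m", vocab_size=len(wardaman_vocab)
)

# Copy output-layer rows for phonemes the two inventories share, so the
# Wardaman head starts from Sundanese acoustics instead of random weights.
shared = set(sundanese_vocab) & set(wardaman_vocab)
with torch.no_grad():
    for ph in shared:
        w_id, s_id = wardaman_vocab[ph], sundanese_vocab[ph]
        tgt.lm_head.weight[w_id] = src.lm_head.weight[s_id]
        tgt.lm_head.bias[w_id] = src.lm_head.bias[s_id]
```

Phonemes unique to Wardaman keep their random initialization and are learned during fine-tuning on the 6 hours of annotated audio.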
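For the translation stage, a hedged sketch of dictionary-augmented prompting: we assume a word-segmented phonemic transcription, a dictionary stored as a plain Python dict, and an OpenAI-style chat LLM. The prompt wording, model choice, and function names are ours, not the paper's.

```python
# A minimal sketch of giving dictionary entries to an LLM as domain knowledge.
# The dictionary format, prompt, and choice of LLM are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def translate(transcription: str, dictionary: dict[str, str]) -> str:
    # Retrieve glosses only for words that actually occur in the input.
    glosses = {w: g for w, g in dictionary.items() if w in transcription.split()}
    gloss_lines = "\n".join(f"{w}: {g}" for w, g in glosses.items())
    prompt = (
        "You are translating the endangered Australian language Wardaman.\n"
        f"Dictionary entries relevant to this sentence:\n{gloss_lines}\n\n"
        f"Phonemic transcription: {transcription}\n"
        "Give the most natural English translation."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any instruction-following LLM works here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Restricting the prompt to glosses that appear in the transcription keeps the context short, which matters when the full expert-compiled dictionary is larger than the model's context window.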