BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language

2026-06-02

Computation and Language, Artificial Intelligence
AI summary

The authors created BaltiVoice, a collection of 16.8 hours of spoken Balti recordings. Balti previously had no publicly available speech recognition data. They used these recordings to fine-tune a speech recognition model called Whisper-small on Balti. After fine-tuning, the model's word error rate on held-out Balti speech fell from 182.18% to 30.07%. The authors have released the recordings, the fine-tuned model, and a live transcription demo for others to use.

Balti language, Automatic Speech Recognition (ASR), corpus, Whisper model, Word Error Rate (WER), Mozilla Common Voice, Nastaliq script, fine-tuning, speech dataset, language resources
Authors
Muhammad Ali
Abstract
We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a Word Error Rate (WER) of 30.07% on a held-out validation set of 538 utterances, down from a measured zero-shot baseline of 182.18% for Whisper-small on Balti. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.
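
A zero-shot WER above 100% can look like an error, but it follows from how WER is usually defined: substitutions, deletions, and insertions are counted against the number of reference words, so a model that emits many extra words can exceed 100%. The sketch below is a minimal illustration of that standard definition, not the authors' scoring script. The function name `word_error_rate`, the whitespace tokenization, and the toy strings are assumptions made for illustration; the abstract does not state what text normalization was applied before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: tokens are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Toy placeholder tokens, not Balti text.
perfect = word_error_rate("a b c", "a b c")
# Two reference words, four inserted words in the hypothesis.
over_100 = word_error_rate("a b", "a x b y z w")
print(f"{perfect:.2%} {over_100:.2%}")
```

In the second call the hypothesis contains both reference words plus four extra ones, so the four insertions are divided by two reference words and the score is well above 100%. Corpus-level figures are commonly obtained by summing errors and reference words over all utterances rather than averaging per-utterance scores; whether the reported numbers were computed that way is not stated here.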