Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering

2026-04-24

Machine Learning, Computation and Language
AI summary

The authors developed a new way to find word patterns in Giriama, a Bantu language with very little labeled data. They combined learning from a related language, Swahili, with a method that groups similar word forms without supervision. This approach helped them discover new word prefixes and achieve good accuracy in breaking words down and identifying their base forms. Their work also expanded the known vocabulary of Giriama, supporting better documentation of the language's structure.

Bantu languages, morphology, cross-lingual transfer learning, unsupervised clustering, noun classes, lemmatization, vocabulary expansion, Swahili, prefixes, linguistic documentation
Authors
Hillary Mutisya, John Mugane
Abstract
We present a method for discovering morphological features in low-resource Bantu languages by combining cross-lingual transfer learning with unsupervised clustering. Applied to Giriama (nyf), a language with only 91 labeled paradigms, our pipeline discovers noun class assignments for 2,455 words and identifies two previously undocumented morphological patterns: an a- prefix variant for Class 2 (a vowel-coalesced form of wa-, in which two adjacent vowels merge; 95.1% consistency) and a contracted k'- prefix (98.5% consistency). External validation on 444 known Giriama verb paradigms confirms 78.2% lemmatization accuracy, while a v3 corpus expansion to 19,624 words (9,014 unique lemmas) achieves 97.3% segmentation and 86.7% lemmatization rates across all major word classes. Our ensemble combines transfer learning from Swahili with unsupervised clustering via weighted voting and exploits their complementary strengths: transfer excels at cognate detection (leveraging ~60% vocabulary overlap), while clustering discovers language-specific innovations invisible to transfer. We release all code and discovered lexicons to support morphological documentation for low-resource Bantu languages.
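The weighted-voting step can be illustrated with a minimal sketch. The abstract does not state the weights, the score format, or how ties are broken, so everything here is an assumption made for illustration: the function name `weighted_vote`, the model names `transfer` and `clustering`, the placeholder labels `class_1` and `class_2`, and the numeric confidences and weights are all hypothetical and are not taken from the authors' released code.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-model label scores by weighted sum and pick the top label.

    predictions: {model_name: {label: confidence}}  (hypothetical format)
    weights:     {model_name: non-negative weight}  (hypothetical values)
    Returns (best_label, totals); best_label is None if there are no scores.
    """
    totals = defaultdict(float)
    for model, scores in predictions.items():
        w = weights.get(model, 0.0)  # models without a weight contribute nothing
        for label, conf in scores.items():
            totals[label] += w * conf
    if not totals:
        return None, {}
    # Iterate labels in sorted order so ties resolve deterministically.
    best = max(sorted(totals), key=totals.__getitem__)
    return best, dict(totals)


# Hypothetical scores for one word: the two models disagree on the noun class.
predictions = {
    "transfer": {"class_2": 0.6, "class_1": 0.4},
    "clustering": {"class_2": 0.3, "class_1": 0.7},
}

# Equal weights let the more confident clustering vote win;
# weighting transfer heavily flips the decision.
equal_label, equal_totals = weighted_vote(
    predictions, {"transfer": 0.5, "clustering": 0.5}
)
skewed_label, skewed_totals = weighted_vote(
    predictions, {"transfer": 0.9, "clustering": 0.1}
)
```

In this toy case the choice of weights decides which model's evidence dominates, which is one way a combined system could favor transfer for likely cognates and clustering for forms without a Swahili counterpart.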