Multilinguality of Large Language Models From a Structural Perspective

2026-06-01 • Computation and Language

Computation and Language • Artificial Intelligence • Machine Learning

AI summary

The authors studied how large language models (LLMs) handle languages other than English. Instead of looking at individual token representations, they examined the overall structure of the language representations inside the models. They found that low-resource languages are structurally more different from English than high- and mid-resource languages. They also showed that post-training a model on a specific language changes these structures but keeps the relationships between languages intact.

large language models, multilinguality, language representation, low-resource languages, post-training, structural analysis, token representations, inter-language relationships

Authors

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

Abstract

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.
