LangMAP: A Language-Adaptive Approach to Tokenization

2026-06-22 | Computation and Language

AI summary

The authors present LangMAP, a method that produces language-specific tokenizations from a single shared vocabulary. It avoids the usual cost of language-specific tokenizers, which require either training a new model from scratch or adapting a pretrained model's vocabulary. LangMAP can be used when training a multilingual model from scratch or applied to a pretrained model's tokenizer without changing its vocabulary. Language labels are needed at training time, but at inference it tokenizes in a language-specific way without being told the input's language. Across 14 open-source tokenizers, 9 natural languages, and 9 programming languages, it improved how well token boundaries align with morpheme boundaries and, for code, with abstract syntax tree (AST) leaf boundaries. Fine-tuning results were mixed: gains were clearer on grammatical acceptability than on knowledge-related tasks.

Keywords: tokenizer, UnigramLM, multilingual language model, tokenization, vocabulary adaptation, morphological boundary, abstract syntax tree (AST), fine-tuning, grammatical acceptability, language labels
Authors
Clara Meister, Suchir Salhan, Andrzej Szablewski, Pietro Lesci, Paula Buttery, Tiago Pimentel
Abstract
Language-specific tokenizers improve tokenization quality and the downstream performance of models on those languages. However, using such a tokenizer comes at a cost: either a new model must be trained from scratch, or the vocabulary of an existing pretrained model must be adapted. We propose Language-adaptive Maximum a Posteriori (LangMAP) Tokenization, a tokenization scheme that extends the UnigramLM algorithm to the multilingual setting, producing language-specific tokenization from a single shared vocabulary. Notably, LangMAP can be used when training a multilingual language model from scratch or to adapt a pretrained model's tokenizer to individual languages without changing its vocabulary. While language labels are required at training time, a key feature of the algorithm is that it then performs language-specific tokenization at inference without knowledge of the input's language. Across 14 open-source tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment and, for all coding languages tested, alignment with abstract syntax tree (AST) leaf boundaries. In fine-tuning experiments, results are mixed: LangMAP improves target-language grammatical acceptability (MultiBLiMP) on the languages tested; its benefits are less consistent on knowledge-related tasks (Global-PIQA, Belebele).
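
To make the idea of language-specific tokenization from one shared vocabulary concrete, here is a minimal sketch. It is not the authors' algorithm: the abstract does not give LangMAP's formulation, so the sketch assumes one plausible reading, in which each language has its own unigram distribution over the same vocabulary and the language is treated as a latent variable that is maximized over jointly with the segmentation. The vocabulary, probabilities, language codes, and function names (`viterbi_segment`, `joint_map_segment`) are all hypothetical toy values chosen for illustration.

```python
import math

def viterbi_segment(text, logp):
    """Highest-probability segmentation of `text` under a unigram model.

    `logp` maps each vocabulary token to its log-probability.
    Returns (tokens, total_log_prob), or (None, -inf) if no segmentation exists.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    best[0] = 0.0
    max_len = max(len(tok) for tok in logp)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        return None, -math.inf
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1], best[n]

def joint_map_segment(text, lang_logps, lang_prior):
    """ASSUMPTION (not from the paper): pick the (language, segmentation)
    pair maximizing log p(language) + sum of per-token log-probs."""
    best = (None, None, -math.inf)
    for lang, logp in lang_logps.items():
        tokens, score = viterbi_segment(text, logp)
        score += math.log(lang_prior[lang])
        if tokens is not None and score > best[2]:
            best = (lang, tokens, score)
    return best

# Hypothetical toy data: one shared vocabulary, two per-language distributions.
PROBS = {
    "en": {"u": .02, "n": .02, "i": .02, "t": .02, "e": .02, "s": .30,
           "un": .05, "unit": .10, "es": .05, "unite": .30, "ites": .10},
    "fr": {"u": .02, "n": .02, "i": .02, "t": .02, "e": .02, "s": .05,
           "un": .40, "unit": .05, "es": .05, "unite": .05, "ites": .30},
}
LANG_LOGPS = {lang: {tok: math.log(p) for tok, p in dist.items()}
              for lang, dist in PROBS.items()}
LANG_PRIOR = {"en": 0.5, "fr": 0.5}

if __name__ == "__main__":
    for lang, logp in LANG_LOGPS.items():
        print(lang, viterbi_segment("unites", logp)[0])
    print(joint_map_segment("unites", LANG_LOGPS, LANG_PRIOR)[:2])
```

In this toy setup the same string and the same vocabulary yield `['unite', 's']` under the `en` distribution and `['un', 'ites']` under the `fr` distribution, and the joint maximization needs no language label at inference. How LangMAP itself estimates its parameters and resolves the language is described in the paper, not here.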