Learning to Route Languages for Multilingual Policy Optimization
2026-05-25 • Computation and Language
AI summary
The authors propose language-routed policy optimization (LRPO), a method for improving how large language models learn from multiple languages during reinforcement learning. Instead of limiting each training question to one response language, LRPO treats the response language as a choice, samples responses in several languages, and uses their relative quality to guide policy updates. A trainable router, formulated as a multi-armed bandit, picks which languages to sample, balancing exploration of underused languages against exploitation of the most informative ones. The authors report that this approach consistently improves multilingual performance in their experiments.
Large Language Models, Multilingual Training, Policy Optimization, Reinforcement Learning, Multi-Armed Bandit, Cross-Lingual Learning, Rollouts, Preference-Based Learning
Authors
Geyang Guo, Hiromi Wakaki, Yuki Mitsufuji, Alan Ritter, Wei Xu
Abstract
Large language models (LLMs) are trained on heterogeneous multilingual corpora, yet existing policy optimization methods often implicitly restrict each training question to a single response language or rely on a fixed dominant language for supervision. We propose language-routed policy optimization (LRPO), an online policy optimization framework that treats language as a selectable variable. LRPO elicits multilingual rollouts for each training question and integrates their relative quality into preference-based policy updates, increasing the diversity and informativeness of training signals under a fixed rollout budget. To adaptively determine which languages to explore during reinforcement learning, we introduce a trainable language router formulated as a multi-armed bandit, balancing exploration of underutilized languages with exploitation of more informative ones. Extensive experiments show that LRPO consistently improves multilingual performance, demonstrating that adaptive language routing enables effective cross-lingual knowledge exploitation for training. We release all the resources at https://github.com/Guochry/LRPO.
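To make the bandit framing concrete, the following is a minimal sketch of a language router built on the standard UCB1 rule. It is an illustration only, not the authors' router: the abstract does not specify the bandit algorithm, the reward definition, or how the router is trained, so the class name `LanguageRouter`, the exploration constant `c`, the language codes, and the simulated per-language reward rates are all assumptions made for this example.

```python
import math
import random


class LanguageRouter:
    """Hypothetical UCB1 bandit over response languages (illustrative only)."""

    def __init__(self, languages, c=1.0):
        self.languages = list(languages)
        self.c = c  # assumed exploration strength, not from the paper
        self.counts = {lang: 0 for lang in self.languages}
        self.means = {lang: 0.0 for lang in self.languages}

    def select(self):
        # Explore: try every language once before comparing estimates.
        for lang in self.languages:
            if self.counts[lang] == 0:
                return lang
        total = sum(self.counts.values())

        def ucb(lang):
            # Mean reward plus a bonus that shrinks as a language is sampled more.
            bonus = self.c * math.sqrt(2.0 * math.log(total) / self.counts[lang])
            return self.means[lang] + bonus

        # Exploit: pick the language with the highest upper confidence bound.
        return max(self.languages, key=ucb)

    def update(self, lang, reward):
        # Incremental running mean of the reward observed for this language.
        self.counts[lang] += 1
        self.means[lang] += (reward - self.means[lang]) / self.counts[lang]


def simulate(steps=2000, seed=0):
    """Run the router against made-up reward rates (placeholder for rollout feedback)."""
    rng = random.Random(seed)
    signal = {"en": 0.30, "ja": 0.60, "sw": 0.45}  # assumed values, not measured
    router = LanguageRouter(signal)
    for _ in range(steps):
        lang = router.select()
        reward = 1.0 if rng.random() < signal[lang] else 0.0
        router.update(lang, reward)
    return router


if __name__ == "__main__":
    router = simulate()
    print(router.counts)
```

In a training loop, the simulated reward would be replaced by some measure of how informative a language's rollouts were for the policy update. How that signal is defined is left open here because the abstract does not state it.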