DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning

2026-05-25, Computation and Language

Computation and Language, Machine Learning
AI summary

The authors study how to improve language model training when several reward objectives must be balanced at once. They point out that existing methods either cause unstable training or rely on fixed weight settings. Their new method, Dynamic Variance-adaptive Advantage Optimization (DVAO), reweights the objectives during training according to the empirical reward variance of each one within a rollout group. They prove mathematically that this keeps advantage magnitudes bounded, which supports stable training, and they report that it outperforms previous approaches in experiments on math reasoning and tool use.

Reinforcement Learning, Large Language Models, Multi-objective Optimization, Policy Optimization, Reward Scalarization, Advantage Function, Training Stability, Pareto Frontier, Qwen Models
Authors
Guochao Jiang, Jingyi Song, Guofeng Quan, Chuzhan Hao, Guohua Liu, Yuewei Zhang
Abstract
Reinforcement Learning has become a standard paradigm for aligning Large Language Models with human intent and task requirements. While Group Relative Policy Optimization offers an efficient, value-model-free alternative to Proximal Policy Optimization, adapting it to real-world multi-reward settings remains challenging. Standard scalarization practices, such as Reward Combination and Advantage Combination, suffer from significant drawbacks: Reward Combination frequently generates advantages with excessively large squared magnitudes that lead to training instability, while Advantage Combination relies on static hyperparameters and ignores cross-objective correlations. To address these limitations, we propose Dynamic Variance-adaptive Advantage Optimization (DVAO), which dynamically adjusts combination weights based on the empirical reward variance of each objective within a rollout group, effectively up-weighting objectives with a stronger learning signal while suppressing noisy ones. We mathematically prove that DVAO maintains bounded advantage magnitudes for stable training and introduces a self-adaptive cross-objective regularization mechanism. Extensive experiments on mathematical reasoning and tool-use benchmarks using Qwen3 and Qwen2.5 models demonstrate that DVAO significantly outperforms baseline methods, achieving a superior multi-objective Pareto frontier and robust training stability.
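The three scalarization strategies named in the abstract can be contrasted with a minimal NumPy sketch. The abstract does not state DVAO's weighting formula, so `variance_adaptive_combination` below is a hypothetical stand-in: it assumes weights proportional to each objective's within-group reward standard deviation, normalized to sum to one. All function names, the `EPS` constant, and the toy rewards are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

EPS = 1e-8  # assumed small constant to avoid division by zero


def normalize_per_objective(rewards):
    """GRPO-style group normalization applied to each objective column.

    rewards has shape (G, K): G rollouts in the group, K reward objectives.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + EPS)


def reward_combination(rewards, weights):
    """Mix the raw rewards with static weights, then normalize the scalar reward."""
    combined = np.asarray(rewards, dtype=float) @ np.asarray(weights, dtype=float)
    return (combined - combined.mean()) / (combined.std() + EPS)


def advantage_combination(rewards, weights):
    """Normalize each objective separately, then mix with static weights."""
    return normalize_per_objective(rewards) @ np.asarray(weights, dtype=float)


def variance_adaptive_combination(rewards):
    """Hypothetical variance-adaptive mixing (NOT the paper's exact formula).

    Assumption: weights are proportional to the within-group standard
    deviation of each objective, so an objective with no spread in the
    group (and hence no learning signal) receives zero weight.
    """
    rewards = np.asarray(rewards, dtype=float)
    std = rewards.std(axis=0)
    total = std.sum()
    if total < EPS:
        # Every objective is constant in this group: fall back to uniform weights.
        weights = np.full(rewards.shape[1], 1.0 / rewards.shape[1])
    else:
        weights = std / total
    return normalize_per_objective(rewards) @ weights, weights


# Toy rollout group: 4 rollouts, 2 objectives. The second objective is
# constant across the group, so it carries no signal for this update.
rewards = np.array([[1.0, 0.5],
                    [0.0, 0.5],
                    [1.0, 0.5],
                    [0.0, 0.5]])

static_adv = advantage_combination(rewards, [0.5, 0.5])
adaptive_adv, adaptive_weights = variance_adaptive_combination(rewards)
```

In this toy group, the static 0.5/0.5 mix halves the advantage of the only informative objective, while the adaptive weights shift all mass onto it. Because the assumed weights are non-negative and sum to one, the sum of squared advantages in this sketch cannot exceed the group size G; whether this matches the bound proved in the paper is not stated in the abstract.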