Exploration-Driven Optimization for Test-Time Large Language Model Reasoning

2026-05-11

Machine Learning
AI summary

The authors identify a tension: training large language models (LLMs) with reinforcement learning makes their outputs less diverse, yet that diversity is exactly what inference-time reasoning methods rely on. They propose a method called Exploration-Driven Optimization (EDO) that encourages more varied outputs during training without sacrificing quality. By applying EDO to existing training methods, their models produce more diverse and more accurate reasoning answers, especially when combined with test-time techniques such as self-consistency. Their experiments show small but consistent accuracy improvements and more stable training behavior.
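As a concrete illustration of the kind of test-time technique the summary refers to, the sketch below implements plain self-consistency: sample several reasoning traces at nonzero temperature, extract each final answer, and return the majority vote. The `sample_answer` callable and the stub model are placeholders assumed for illustration; they are not part of the paper.

```python
import random
from collections import Counter

def self_consistency(sample_answer, prompt, n_samples=8):
    """Self-consistency: sample several solutions and majority-vote the final answers.

    `sample_answer` is a placeholder callable that runs the LLM once on `prompt`
    (with temperature > 0) and returns its final answer as a string.
    """
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a stub "model" that answers correctly 3 times out of 4.
stub = lambda p: random.choice(["42", "42", "42", "41"])
print(self_consistency(stub, "What is 6 * 7?"))  # usually "42"
```

Self-consistency only helps when the sampled answers are diverse enough to vote over, which is the motivation for keeping the post-trained policy from collapsing.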

large language models, reinforcement learning, inference-time scaling, reward biasing, exploration-exploitation, post-training, Direct Preference Optimization, self-consistency, model entropy, policy optimization
Authors
Changhao Li, Yuchen Zhuang, Chenxiao Gao, Haotian Sun, Rushi Qiang, Chao Zhang, Bo Dai
Abstract
Post-training techniques combined with inference-time scaling significantly enhance the reasoning and alignment capabilities of large language models (LLMs). However, a fundamental tension arises: inference-time methods benefit from diverse sampling from a relatively flattened probability distribution, whereas reinforcement learning (RL)-based post-training inherently sharpens these distributions. To address this, we propose Exploration-Driven Optimization (EDO), which extends reward-biasing style exploration objectives to iterative post-training and integrates them into standard RL objectives, encouraging greater diversity in sampled solutions while facilitating more effective inference-time computation. We incorporate EDO into iterative Direct Preference Optimization (iDPO) and Group Relative Policy Optimization (GRPO), resulting in two variants: ED-iDPO and ED-GRPO. Extensive experiments demonstrate that both ED-iDPO and ED-GRPO exhibit greater solution diversity and improved reasoning abilities, particularly when combined with test-time computation techniques like self-consistency. Across three in-distribution reasoning benchmarks, EDO achieves a 1.0-1.3% improvement over the strongest baselines, and delivers an additional 1.5% average gain on five out-of-distribution tasks. Beyond accuracy, EDO preserves model entropy and stabilizes RL training dynamics, highlighting its effectiveness in preventing over-optimization collapse. Taken together, these results establish EDO as a practical framework for balancing exploration and exploitation in LLM reasoning, especially in settings that rely on test-time scaling.
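The exact EDO objective is not reproduced on this page; the minimal sketch below only illustrates the reward-biasing idea the abstract describes, applied to GRPO-style group-relative advantages: each sampled solution's reward is biased toward low-probability samples before normalization, so the policy is rewarded for keeping rarer but correct solutions in play instead of collapsing onto a single high-probability answer. The coefficient name `exploration_coef` and the use of average per-token log-probabilities are assumptions made for this sketch, not the paper's exact formulation.

```python
import math

def group_relative_advantages(rewards):
    """Standard GRPO-style advantages: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

def edo_advantages(rewards, sample_logprobs, exploration_coef=0.1):
    """Bias rewards toward low-probability (more novel) samples, then normalize.

    `sample_logprobs` are the policy's average per-token log-probabilities for each
    sampled solution; subtracting a scaled log-probability adds a bonus that grows
    as a sample becomes less likely, acting as a reward-biasing exploration term
    that discourages the sampling distribution from sharpening too aggressively.
    """
    biased = [r - exploration_coef * lp for r, lp in zip(rewards, sample_logprobs)]
    return group_relative_advantages(biased)

# Toy usage: four sampled solutions for one prompt.
rewards = [1.0, 0.0, 1.0, 0.0]        # e.g. correctness of each sampled solution
logprobs = [-0.4, -1.9, -0.6, -2.3]   # average per-token log-probability of each sample
print(edo_advantages(rewards, logprobs))
```

Under this sketch, a correct but low-probability sample receives a larger effective advantage than an equally correct high-probability one, which is one way to preserve entropy while still exploiting the reward signal.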