Dual-Flow Reinforcement Learning with State-Aware Exploration

2026-06-29 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors address continuous-control reinforcement learning problems in which both the optimal actions and the resulting returns are multimodal. They introduce Dual-Flow RL, an actor-critic framework that uses conditional flow matching to model the return distribution and a multimodal policy distribution jointly. To improve exploration, they add an Entropy-Covariance Exploration Regulator (ECER) that adjusts exploration per state using policy entropy and action-uncertainty covariance. Their experiments on DeepMind Control Suite and Humanoid-Bench show that this approach outperforms prior diffusion-based and flow-based methods on most tasks.

reinforcement learning, continuous control, multimodal distribution, value estimation, policy distribution, conditional flow matching, entropy, exploration, actor-critic, DeepMind Control Suite

Authors

Qijun Li, Zheng Fu, Qi Song, Yifei He, Weitao Zhou, Kun Jiang, Diange Yang

Abstract

In complex continuous-control reinforcement learning tasks, multimodal optimal actions often coincide with uncertain, multimodal return distributions, making reliable value estimation and multimodal exploration challenging. Existing value estimation methods using unimodal Gaussians restrict expressiveness and yield biased estimates. Recent generative policies can represent multimodal actions but often collapse to a few modes and under-explore high-value areas of the action space. Motivated by these challenges, we propose Dual-Flow RL, a unified actor-critic framework that jointly models a continuous return distribution and a multimodal policy distribution using conditional flow matching (CFM). This design supports reliable value estimation and sustained multimodal exploration. To further enhance exploration, we introduce an Entropy-Covariance Exploration Regulator (ECER) that enables state-aware exploration regulation leveraging policy entropy and action-uncertainty covariance. Experiments on DeepMind Control Suite and Humanoid-Bench show that Dual-Flow RL achieves state-of-the-art performance on most tasks, significantly outperforming prior diffusion-based and flow-based methods.
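
For readers unfamiliar with conditional flow matching, the following is a minimal generic sketch of a state-conditioned CFM objective and an Euler sampler. It is not the authors' implementation: the abstract does not specify the probability path, the network, or the sampler, so the straight-line path, the Gaussian base distribution, and all names here (`cfm_training_pair`, `cfm_loss`, `sample_euler`, `velocity_fn`) are assumptions chosen for illustration.

```python
import numpy as np

def cfm_training_pair(x0, x1, t):
    """Point and target velocity on a straight path from noise x0 to data x1.

    The linear path is an assumption; the abstract does not say which path is used.
    """
    t = t[:, None]                      # broadcast time over the feature dimension
    x_t = (1.0 - t) * x0 + t * x1       # interpolated sample at time t
    u_t = x1 - x0                       # target velocity, constant along the path
    return x_t, u_t

def cfm_loss(velocity_fn, x1, cond, rng):
    """Monte Carlo estimate of the CFM regression loss for one batch.

    velocity_fn(x_t, t, cond) is a hypothetical learned vector field,
    conditioned on `cond` (for example, the state).
    """
    x0 = rng.standard_normal(x1.shape)             # base Gaussian samples
    t = rng.uniform(0.0, 1.0, size=x1.shape[0])    # one time per sample
    x_t, u_t = cfm_training_pair(x0, x1, t)
    pred = velocity_fn(x_t, t, cond)
    return float(np.mean(np.sum((pred - u_t) ** 2, axis=-1)))

def sample_euler(velocity_fn, cond, dim, rng, steps=10):
    """Draw samples by integrating the vector field from t=0 to t=1 with Euler steps."""
    x = rng.standard_normal((cond.shape[0], dim))
    dt = 1.0 / steps
    for k in range(steps):
        t = np.full(cond.shape[0], k * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

In a dual-flow setting, the same objective could in principle be applied twice: once with `x1` as actions (the policy) and once with `x1` as return samples (the critic), each conditioned on the state. How the paper couples the two flows is not described in the abstract.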
