Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

2026-06-15 | Machine Learning

Machine Learning, Neural and Evolutionary Computing
AI summary

The authors address the problem that reinforcement learning agents often perform poorly in environments that differ from those seen during training. Their method, GERS, improves generalization by shaping the reward using feedback from validation environments that expose only scalar performance scores, not trajectory data. The approach is a two-level optimization: the agent learns on a limited set of training environments, while CMA-ES tunes the reward-shaping parameters based on the agent's performance in the validation environments. On continuous control tasks, GERS outperforms a standard RL baseline on unseen test environments and performs comparably to domain randomization, even though domain randomization trains with trajectory access to both the training and validation environments. This makes GERS suited to settings with privacy or data-access restrictions.

Reinforcement Learning, Generalization, Domain Randomization, Reward Shaping, Bilevel Optimization, CMA-ES, Scalar Feedback, Continuous Control, Trajectory Data, Validation Environment
Authors
Ekasit Usaratniwart, Xilin Gao, Marc Ong, Youhei Akimoto
Abstract
Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but require access to diverse training environments and full trajectory observability, assumptions that fail in privacy-preserving or restricted scenarios where only scalar performance metrics are available. We propose Generalization via Evolutionary Reward Shaping (GERS), a bilevel optimization approach to improve generalization on unseen test environments using only scalar feedback from validation environments. At the lower level, an RL agent guided via a reward function shaped by the upper level learns a policy on a limited set of training environments with accessible trajectory data; at the upper level, CMA-ES optimizes the reward shaping parameters to maximize the cumulative unshaped reward on separate validation environments for which trajectory access is unavailable. Results on continuous control tasks indicate that GERS outperforms the standard RL baseline on unseen test environments. GERS performance is comparable to DR, despite DR treating the combined set of training and validation environments of GERS as a single training set that requires trajectory access, whereas GERS cannot access validation trajectories. These results confirm that GERS effectively enhances generalization under restricted data access constraints.
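The bilevel structure described in the abstract can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration and not the paper's setup: the one-step, one-dimensional "environment" with a goal position, the linear shaping term `w * a`, and all function and constant names are hypothetical. The upper level uses a plain truncation-selection evolution strategy as a stand-in for CMA-ES, so it shows only the black-box, scalar-feedback interface, not CMA-ES's covariance adaptation.

```python
import random

TRAIN_GOAL = 0.0               # training environment: full access (assumed toy setup)
VALIDATION_GOALS = [1.0, 1.2]  # validation environments: only a scalar score is used
TEST_GOALS = [0.9, 1.3]        # unseen test environments


def true_return(action, goal):
    """Unshaped return of a one-step episode: higher when the action is near the goal."""
    return -(action - goal) ** 2


def mean_return(action, goals):
    """Average unshaped return of a fixed action over a set of environments."""
    return sum(true_return(action, g) for g in goals) / len(goals)


def train_policy(w, steps=100, lr=0.1):
    """Lower level: gradient ascent on the shaped training reward.

    Shaped reward (hypothetical form): -(a - TRAIN_GOAL)**2 + w * a,
    where w is the shaping parameter chosen by the upper level.
    """
    a = 0.0
    for _ in range(steps):
        grad = -2.0 * (a - TRAIN_GOAL) + w  # derivative of the shaped reward w.r.t. a
        a += lr * grad
    return a


def validation_score(w):
    """Black-box objective for the upper level: one scalar per candidate w."""
    return mean_return(train_policy(w), VALIDATION_GOALS)


def evolve_shaping(generations=40, popsize=12, n_elite=3, sigma=1.0, seed=0):
    """Upper level: simple evolution strategy (a stand-in for CMA-ES)."""
    rng = random.Random(seed)
    mean = 0.0
    for _ in range(generations):
        candidates = [mean + sigma * rng.gauss(0.0, 1.0) for _ in range(popsize)]
        # Rank candidates by the scalar validation score only.
        candidates.sort(key=validation_score, reverse=True)
        mean = sum(candidates[:n_elite]) / n_elite  # recombine the elites
        sigma *= 0.9  # fixed step-size decay; CMA-ES adapts this instead
    return mean


if __name__ == "__main__":
    w = evolve_shaping()
    print("shaping parameter:", w)
    print("baseline test return:", mean_return(train_policy(0.0), TEST_GOALS))
    print("shaped test return:", mean_return(train_policy(w), TEST_GOALS))
```

In this toy, the unshaped baseline (`w = 0`) fits the training goal exactly and scores poorly on the shifted test goals, while the evolved shaping term pulls the learned action toward the region the validation scores reward. The upper level never reads anything from the validation environments except `validation_score`, which mirrors the restricted-access assumption in the abstract.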