Evolutionary Bilevel Reward Shaping for Generalization in Reinforcement Learning

2026-06-15 • Machine Learning

Machine Learning, Neural and Evolutionary Computing

AI summary

The authors address the problem that reinforcement learning agents often perform poorly in environments that differ from those seen during training. They propose GERS, a method that improves generalization by shaping the reward using feedback from validation environments that return only scalar scores rather than full trajectory data. The approach is a two-level optimization: the agent learns on a limited set of training environments, while CMA-ES adjusts the reward-shaping parameters based on performance in validation environments whose trajectories are not accessible. Experiments on continuous control tasks show that GERS outperforms a standard RL baseline and performs comparably to domain randomization, even though it uses less information, which makes it suited to settings with privacy or data-access restrictions.

Reinforcement Learning, Generalization, Domain Randomization, Reward Shaping, Bilevel Optimization, CMA-ES, Scalar Feedback, Continuous Control, Trajectory Data, Validation Environment

Authors

Ekasit Usaratniwart, Xilin Gao, Marc Ong, Youhei Akimoto

Abstract

Reinforcement learning (RL) often suffers from performance degradation when deployed in environments that differ from those encountered during training. Existing techniques such as domain randomization (DR) mitigate this, but require access to diverse training environments and full trajectory observability, assumptions that fail in privacy-preserving or restricted scenarios where only scalar performance metrics are available. We propose Generalization via Evolutionary Reward Shaping (GERS), a bilevel optimization approach to improve generalization on unseen test environments using only scalar feedback from validation environments. At the lower level, an RL agent guided via a reward function shaped by the upper level learns a policy on a limited set of training environments with accessible trajectory data; at the upper level, CMA-ES optimizes the reward shaping parameters to maximize the cumulative unshaped reward on separate validation environments for which trajectory access is unavailable. Results on continuous control tasks indicate that GERS outperforms the standard RL baseline on unseen test environments. GERS performance is comparable to DR, despite DR treating the combined set of training and validation environments of GERS as a single training set that requires trajectory access, whereas GERS cannot access validation trajectories. These results confirm that GERS effectively enhances generalization under restricted data access constraints.
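The bilevel structure described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation; everything in it is an assumption made for illustration. The "environments" are hypothetical one-dimensional tasks whose unshaped reward is -(action - c)^2 for an environment parameter c, the shaping term w * action is an invented one-parameter form, the lower-level learner is plain gradient ascent standing in for an RL agent, and the upper level is a simple truncation-selection evolution strategy standing in for CMA-ES. The point it preserves is that the upper level sees only a scalar validation score.

```python
import random

# Hypothetical toy environments, each described by one parameter c.
TRAIN_ENVS = [0.0]        # lower level may use reward gradients here
VALID_ENVS = [1.0, 1.4]   # upper level only receives a scalar score
TEST_ENV = 1.2            # never used during optimization


def true_reward(action, env_param):
    """Unshaped reward: peaks at action == env_param."""
    return -(action - env_param) ** 2


def train_policy(w, steps=200, lr=0.1):
    """Lower level: gradient ascent on the shaped training reward.

    Shaped reward (assumed form): true_reward(a, c) + w * a.
    This stands in for an RL agent; the 'policy' is a single action.
    """
    action = 0.0
    for _ in range(steps):
        # Derivative of -(a - c)^2 + w * a, averaged over training envs.
        grad = sum(-2.0 * (action - c) + w for c in TRAIN_ENVS) / len(TRAIN_ENVS)
        action += lr * grad
    return action


def validation_score(action):
    """Scalar feedback only: mean unshaped reward on validation envs."""
    return sum(true_reward(action, c) for c in VALID_ENVS) / len(VALID_ENVS)


def optimize_shaping(generations=40, popsize=8, elite=4, sigma=1.0, seed=0):
    """Upper level: simple evolution strategy over the shaping parameter w.

    A stand-in for CMA-ES: sample candidates, keep the best by validation
    score, move the mean to the elite average, and shrink the step size.
    """
    rng = random.Random(seed)
    mean = 0.0
    for _ in range(generations):
        candidates = [mean + sigma * rng.gauss(0.0, 1.0) for _ in range(popsize)]
        ranked = sorted(
            candidates,
            key=lambda w: validation_score(train_policy(w)),
            reverse=True,
        )
        mean = sum(ranked[:elite]) / elite
        sigma *= 0.9  # fixed decay; CMA-ES adapts this automatically
    return mean


if __name__ == "__main__":
    w_best = optimize_shaping()
    baseline = true_reward(train_policy(0.0), TEST_ENV)   # no shaping
    shaped = true_reward(train_policy(w_best), TEST_ENV)  # learned shaping
    print("shaping parameter:", w_best)
    print("test reward without shaping:", baseline)
    print("test reward with shaping:", shaped)
```

In this toy setting the unshaped learner settles on the training optimum, while the learned shaping term pulls the policy toward actions that also score well on the validation environments, so the held-out test reward improves. A real use would replace `train_policy` with an actual RL training run and the evolution strategy with a CMA-ES library, both of which are far more expensive per evaluation.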
