AI summary
The authors explain why reinforcement learning (RL) systems often degrade when they face conditions that differ from those seen in training. They build a taxonomy of these changes based on where a shift originates in the agent-environment interaction, which they formalize as a Partially Observable Markov Decision Process (POMDP). Their approach links two settings that are usually studied separately: shifts between training and evaluation (in-distribution vs. out-of-distribution generalization) and environments that change over time (non-stationarity). It does so by separating shifts driven by the agent from shifts driven by the environment. They also propose an evaluation framework that measures how much a shift degrades the agent's performance and how well the agent recovers. Overall, the authors offer a more systematic way to analyze RL robustness when conditions change.
Reinforcement Learning, Distributional Shift, In-Distribution Generalization, Out-of-Distribution Generalization, Non-Stationarity, Partially Observable Markov Decision Process (POMDP), Agent-Environment Interaction, Policy, Transition Dynamics, Robustness
Authors
Ardianto Wibowo, Paulo E Santos, Amer Baghdadi, Matthew Stephenson, Karl Sammut, Jean-Philippe Diguet
Abstract
Reinforcement learning (RL) systems often degrade when operating conditions differ from those previously encountered, reflecting distributional shifts in the underlying data-generating process. Such shifts may occur between training and evaluation, as in In-Distribution (ID) and Out-of-Distribution (OOD) generalization, or within non-stationary settings where environment dynamics evolve over time. However, the formal relationship between these views remains unclear, and existing work mainly focuses on mitigation rather than the causal origin of shift within the agent-environment interaction. This work develops a unified causal-origin taxonomy that characterizes sources of distributional shift in RL and relates ID/OOD generalization to non-stationary settings. We transfer the classical dataset-shift principle from supervised learning to RL by reformulating distributional shift in terms of the generative interaction process. Using a Partially Observable Markov Decision Process (POMDP), we decompose the interaction into structural components, including the state distribution, observation process, policy, reward, and transition dynamics, together with the shifted-time boundary. The proposed taxonomy distinguishes internal (agent-driven) from external (environment-driven) distributional shifts. The shifted-time boundary perspective further characterizes explicit, implicit, and hybrid shifts. This formulation unifies ID/OOD generalization and non-stationarity as structured changes in the underlying process. We also introduce an evaluation framework for measuring shift impact and adaptation through performance degradation and recovery metrics. By grounding distributional shift in the causal-origin structure of RL, this work supports systematic analysis of robustness under distributional shift.
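To make the idea of degradation and recovery metrics concrete, here is a minimal sketch of how such quantities might be computed from a series of episode returns. It is an illustration only and not the authors' evaluation framework: the abstract does not give their definitions. The function name `shift_metrics`, the parameters `window` and `tolerance`, and the sample data are all hypothetical, and the shift boundary (`shift_step`) is assumed to be known in advance.

```python
from typing import Optional, Sequence


def shift_metrics(returns: Sequence[float], shift_step: int,
                  window: int = 3, tolerance: float = 0.9) -> dict:
    """Summarize performance around a known shift boundary.

    Hypothetical definitions (assumptions, not taken from the paper):
      baseline      mean return over the last `window` episodes before the shift
      drop          baseline minus mean return over the first `window` episodes after it
      degradation   drop relative to |baseline| (NaN if the baseline is zero)
      recovery_time episodes after the shift until the return is back within
                    (1 - tolerance) * |baseline| of the baseline, or None
    """
    if not 0 < shift_step < len(returns):
        raise ValueError("shift_step must fall strictly inside the return series")

    pre = returns[max(0, shift_step - window):shift_step]
    post = returns[shift_step:]
    baseline = sum(pre) / len(pre)

    early_post = post[:window]
    drop = baseline - sum(early_post) / len(early_post)
    degradation = drop / abs(baseline) if baseline != 0 else float("nan")

    # Recovery threshold; using |baseline| keeps it sensible for negative returns.
    threshold = baseline - (1.0 - tolerance) * abs(baseline)
    recovery_time: Optional[int] = None
    for offset, value in enumerate(post):
        if value >= threshold:
            recovery_time = offset
            break

    return {"baseline": baseline, "drop": drop,
            "degradation": degradation, "recovery_time": recovery_time}


# Made-up returns: stable at 10, a shift at episode 3, then gradual recovery.
episode_returns = [10, 10, 10, 4, 5, 7, 9.5, 10]
print(shift_metrics(episode_returns, shift_step=3))
```

A fixed pre-shift window is the simplest choice of baseline. When the boundary is not known, it would first have to be estimated (for example with a change-point detector), which this sketch does not attempt.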