SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning

2026-04-10, Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors address how to update reinforcement learning (RL) policies safely when environments or performance goals change. They introduce the Rashomon set, a region of policy parameter space certified to satisfy safety constraints on the demonstration data distribution. By projecting policy updates onto this set, they obtain formal, a priori safety guarantees for arbitrary RL update algorithms during continual learning. In grid-world navigation experiments (Frozen Lake and Poisoned Apple), the method preserves safety on the source task during adaptation, whereas regularisation-based baselines catastrophically forget the safety constraints.

reinforcement learning, policy update, safety guarantees, continual learning, non-stationary environments, Rashomon set, policy parameter space, catastrophic forgetting
Authors
Maksim Anisimov, Francesco Belardinelli, Matthew Wicker
Abstract
Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how can an RL policy be updated while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple), where we provide a priori, provable, deterministic safety guarantees on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines experience catastrophic forgetting of safety constraints, while our approach enables strong adaptation with provable guarantees that safety is preserved.
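
The projection step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the shape of the Rashomon set, so this example assumes, purely for illustration, that the certified region is an axis-aligned box given by per-parameter lower and upper bounds. The names `project_onto_box` and `safe_update` and all numeric values are hypothetical and not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def project_onto_box(theta: Sequence[float],
                     lower: Sequence[float],
                     upper: Sequence[float]) -> Vector:
    """Euclidean projection onto the box [lower, upper] (clamp each coordinate)."""
    return [min(max(t, lo), hi) for t, lo, hi in zip(theta, lower, upper)]


def safe_update(theta: Sequence[float],
                step: Sequence[float],
                lower: Sequence[float],
                upper: Sequence[float]) -> Vector:
    """Apply an update step from any RL algorithm, then project the result
    back into the (assumed) certified box so the parameters never leave it."""
    proposed = [t + s for t, s in zip(theta, step)]
    return project_onto_box(proposed, lower, upper)


# Hypothetical certified region around a source-task policy (illustrative values).
theta = [0.5, -0.2, 1.0]
lower = [0.3, -0.4, 0.8]
upper = [0.7, 0.0, 1.2]

# A hypothetical update proposed by a downstream learner; the first
# coordinate would leave the box and is clamped to its upper bound.
step = [0.5, 0.1, -0.05]
new_theta = safe_update(theta, step, lower, upper)

inside = all(lo <= t <= hi for t, lo, hi in zip(new_theta, lower, upper))
print(new_theta, inside)
```

Because the projection is applied after the update, the sketch is agnostic to how `step` is computed, which mirrors the abstract's claim that the guarantee holds for arbitrary update algorithms. Whether the parameters stay safe then depends entirely on the region itself being correctly certified.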