Robust Probabilistic Shielding for Safe Offline Reinforcement Learning
2026-05-11 • Machine Learning
Machine Learning • Artificial Intelligence
AI summary
The authors work on offline reinforcement learning, where a policy is learned from a fixed dataset without interacting with the real environment. They address two main challenges: guaranteeing that the new policy performs at least as well as a safe baseline, and ensuring that it remains safe. Their method combines safe policy improvement (which provides the performance guarantee) with a shielding technique (which blocks unsafe actions), using only the given dataset and knowledge of which states are safe and unsafe. Experiments show that the shielded approach improves both average and worst-case results, especially when data is limited.
offline reinforcement learning • policy • safe policy improvement (SPI) • shielding • safety guarantee • performance guarantee • baseline policy • action space • offline dataset • low-data regimes
Authors
Maris F. L. Galesloot, Thomas Rhemrev, Nils Jansen
Abstract
In offline reinforcement learning (RL), we learn policies from fixed datasets without environment interaction. The major challenges are to provide guarantees on the (1) performance and (2) safety of the resulting policy. A technique called safe policy improvement (SPI) provides a performance guarantee: with high probability, the new policy outperforms a given baseline policy, which is assumed to be safe. Orthogonally, in the context of safe RL, a shield provides a safety guarantee by restricting the action space to those actions that are provably safe with respect to a given safety-relevant model. We integrate these paradigms by extending shielding to offline RL, relying solely on the available dataset and knowledge of safe and unsafe states. Then, we shield the policy improvement steps, guaranteeing, with high probability, a safe policy. Experimental results demonstrate that shielded SPI outperforms its unshielded counterpart, improving both average and worst-case performance, particularly in low-data regimes.
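To make the two guarantees concrete, the sketch below combines an SPI-style rule (only deviate from the baseline on state-action pairs with enough data, so that with high probability the new policy is not worse than the baseline) with a simple shield derived from the same dataset (block actions whose empirical transition model can reach a known unsafe state). This is a minimal illustration under stated assumptions, not the authors' algorithm; names such as `dataset`, `baseline`, `unsafe_states`, and the visit-count threshold `N_min` are hypothetical.

```python
# Minimal sketch: shielded, count-based policy improvement from a fixed dataset.
# Assumptions (not from the paper): tabular MDP, `dataset` is a list of
# (s, a, r, s') tuples, `baseline` is an (S, A) stochastic policy matrix.
import numpy as np

def shielded_improvement(dataset, baseline, n_states, n_actions,
                         unsafe_states, gamma=0.95, N_min=10):
    # Empirical model and state-action visit counts from the offline data.
    counts = np.zeros((n_states, n_actions))
    P = np.zeros((n_states, n_actions, n_states))
    R = np.zeros((n_states, n_actions))
    for s, a, r, s2 in dataset:
        counts[s, a] += 1
        P[s, a, s2] += 1
        R[s, a] += r
    seen = counts > 0
    P[seen] /= counts[seen][:, None]
    R[seen] /= counts[seen]

    # Shield: block any action that, under the empirical model, reaches a
    # known unsafe state in one step with positive probability.
    unsafe = np.zeros(n_states, dtype=bool)
    unsafe[list(unsafe_states)] = True
    blocked = P[:, :, unsafe].sum(axis=-1) > 0.0

    # Evaluate the baseline under the empirical model (fixed number of sweeps
    # shown here; iterate to convergence in practice).
    V = np.zeros(n_states)
    for _ in range(200):
        Q = R + gamma * (P @ V)
        V = (baseline * Q).sum(axis=1)

    # SPI-style improvement: deviate from the baseline only on state-action
    # pairs that are both sufficiently visited and allowed by the shield.
    policy = baseline.copy()
    for s in range(n_states):
        candidates = np.where((counts[s] >= N_min) & ~blocked[s])[0]
        if candidates.size > 0:
            best = candidates[np.argmax(Q[s, candidates])]
            policy[s] = 0.0
            policy[s, best] = 1.0
    return policy
```

In this toy version, raising `N_min` makes the improvement more conservative (closer to the baseline, tighter performance guarantee from fewer, better-estimated deviations), while the shield removes empirically unsafe actions from the improvement step, mirroring how the paper restricts the action space before improving the policy.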