Sample-Efficient Learning of Probabilistic Causes for Reachability in Markov Decision Processes with Probabilistic Guarantees
2026-06-29 • Artificial Intelligence
AI summary
The authors focus on a method to understand why certain bad outcomes happen in systems modeled as Markov decision processes (MDPs). They improve on existing techniques by developing a way to find states that increase the chance of these outcomes without needing full knowledge of the system's probabilities. Their new approach uses a restart trick to simplify checking causes and comes with theoretical guarantees about its accuracy and learning efficiency. They also created an algorithm that learns and refines its knowledge over time, successfully tested on example problems.
Markov decision process, probabilistic model checking, probability-raising causality, conditional reachability, sample complexity, value iteration, MDP learning, restart mechanism, anytime algorithm
Authors
Ryohei Oura, Georgios Fainekos, Hideki Okamoto, Bardh Hoxha
Abstract
Probabilistic model checking for Markov decision processes (MDPs) provides quantitative guarantees, but often offers limited insight into why undesired outcomes occur. Probability-raising (PR) causality addresses this by identifying states whose visitation increases the probability of reaching designated states. Existing PR-cause identification methods, however, rely on MDP modifications that are not well suited to learning: the gap between conditional and unconditional reachability probabilities can be hard to detect from transition samples, and constructing the modified MDP requires reachability probabilities of the original MDP, which are unavailable when transition probabilities are unknown. We study unknown MDPs and propose a learning approach with probabilistic guarantees for PR-cause identification. Our key ingredient is a restart-based MDP modification that reduces PR-cause checking to two conditional reachability queries without using reachability values of the original MDP. We prove correctness, establish sample-complexity bounds, and develop an anytime learning-and-checking algorithm based on two-sided value iteration that progressively classifies states as causal, non-causal, or undecided. Experiments on two benchmarks demonstrate reliable and fast identification of PR causes.
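To make the probability-raising condition concrete, the following minimal sketch checks it on a small, fully known Markov chain, that is, an MDP with a single fixed policy. It is an illustration under simplifying assumptions, not the restart-based construction or the learning algorithm described in the abstract: transition probabilities are assumed known, effect states are assumed absorbing (so the probability of reaching the effect conditioned on visiting a candidate state equals the reachability probability from that state), and plain one-sided value iteration stands in for two-sided value iteration. The chain `P`, the state names, and the functions `reach_prob` and `is_pr_cause` are hypothetical and chosen only for this example.

```python
def reach_prob(P, targets, iters=10_000, tol=1e-12):
    """Approximate Pr(eventually reach `targets`) from every state of a
    finite Markov chain by value iteration from below.

    P maps each state to a dict {successor: probability}.
    """
    v = {s: (1.0 if s in targets else 0.0) for s in P}
    for _ in range(iters):
        delta = 0.0
        for s in P:
            if s in targets:
                continue  # target states keep value 1
            new = sum(p * v[t] for t, p in P[s].items())
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            break
    return v


def is_pr_cause(P, init, candidate, effects):
    """Strict probability-raising check for a single candidate state.

    Assumes effect states are absorbing, so that
    Pr(reach effects | visit candidate) = Pr_candidate(reach effects).
    """
    v = reach_prob(P, effects)
    return v[candidate] > v[init]


# Hypothetical four-state chain: s0 is initial, eff is the undesired effect.
P = {
    "s0":   {"c": 0.5, "safe": 0.5},
    "c":    {"eff": 0.8, "safe": 0.2},
    "safe": {"safe": 1.0},
    "eff":  {"eff": 1.0},
}

v = reach_prob(P, {"eff"})
print(v["s0"], v["c"])                        # unconditional vs. conditional value
print(is_pr_cause(P, "s0", "c", {"eff"}))     # c raises the probability of eff
print(is_pr_cause(P, "s0", "safe", {"eff"}))  # safe does not
```

In this toy chain, visiting `c` raises the probability of reaching `eff` from 0.4 to 0.8, so `c` satisfies the condition while `safe` does not. In the setting of the abstract the transition probabilities are unknown and must be estimated from samples, which is why a small gap between the two values can be hard to detect.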