Learning Policy from a Single Trajectory in Average-Reward Markov Decision Process

2026-06-15Machine Learning

Machine Learning
AI summary

The authors studied a type of decision-making problem called average-reward Markov decision processes (MDPs), focusing on cases where the system can be weakly connected but not necessarily fully ergodic. They developed new methods that learn from just one sequence of decisions and outcomes, instead of needing many or special assumptions. Their methods, which do not rely on knowing specific problem details beforehand, come with theoretical guarantees on how much data is needed to learn good strategies. This work expands the understanding of learning in these MDPs without strong assumptions or extra data models.

Markov decision processesaverage rewardsample complexityweakly communicating MDPsmodel-free methodspolicy-based methodsvalue-based methodssingle trajectoryergodicityreinforcement learning
Authors
Jongmin Lee, Ernest K. Ryu, Vaneet Aggarwal
Abstract
While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assumptions such as ergodicity or access to a generative model. In this work, we establish the first finite sample complexity guarantees from a single trajectory for weakly communicating average-reward MDPs. To this end, we study the dynamics of a single trajectory in weakly communicating MDPs and based on this analysis, we develop novel model-free methods. Notably, our value-based and policy-based methods provide finite sample complexity guarantees of $\widetilde{O}(1/\varepsilon^2)$ and $\widetilde{O}(1/\varepsilon^4)$ from a single trajectory in weakly communicating MDPs, respectively. Furthermore, we introduce the first model-free method that requires no prior knowledge of problem-dependent quantities for communicating MDPs.