Minimax-Optimal Policy Regret in Partially Observable Markov Games

2026-06-01 • Machine Learning

Machine Learning

AI summaryⓘ

The authors study how to make good decisions step-by-step in situations where you can't see everything clearly and where an opponent changes their behavior based on your actions. They model this as a complex game with unknown hidden rules and show an algorithm that learns to do almost as well as the best fixed strategy over time, measured by a special regret metric. Their method gradually refines its policies using growing periods of experience and keeps track of uncertainty using mathematical tools. They also prove that their performance results are close to the best possible and extend their approach to handle changing time horizons and opponents who gradually forget past interactions.

Partially Observable Markov GamesSequential Decision-MakingStrategic OpponentPolicy RegretOptimistic Maximum-LikelihoodEpoch-Based AlgorithmEluder DimensionConfidence SetsAdversary MemoryFading Memory

Authors

Raman Arora

Abstract

We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in $T$. We also prove a lower bound matching the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.

View PDFOpen arXiv