Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

2026-06-15 • Machine Learning

AI summary

The authors study how to infer the motivation behind observed actions in mean-field games, which model systems where very many agents interact over time. They assume the experts act according to an unknown reward and recover a policy that explains this behaviour using the maximum causal entropy principle. They develop methods for both finite-dimensional linear reward models and infinite-dimensional (RKHS) reward models, and give gradient-based algorithms with convergence guarantees. They test their methods on a malware-spread example and a consumer-choice example, finding that the recovered policies closely match expert behaviour.

Inverse Reinforcement Learning, Mean-Field Games, Maximum Causal Entropy, Occupation Measure, Soft Bellman Equation, Reproducing Kernel Hilbert Space, Gradient Descent, Average-Reward Criterion, Stationary Equilibrium

Authors

Şevket Kaan Alkır, Naci Saldı, Berkay Anahtarcı, Can Deha Karıksız

Abstract

We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unknown reward, and the goal is to recover a policy explaining the observed behaviour via the maximum causal entropy principle. We formulate the inverse problem by enforcing consistency with the expert mean-field term and long-run feature expectations, treating two reward classes within a unified occupation-measure framework. For finite-dimensional linear rewards, we give a convex dual reformulation with an explicit log-partition objective, and prove smoothness and curvature properties justifying constant-step-size gradient descent. For infinite-dimensional RKHS rewards, we develop a Lagrangian relaxation whose inner-maximising policy is characterised by a soft Bellman equation. The main obstacle is the absence of a discount-factor contraction. We resolve this by introducing a minorisation-based sub-stochastic kernel that yields a strict contraction of the soft Bellman operator. We establish Fréchet differentiability and Lipschitz smoothness of the log-likelihood score, leading to a gradient ascent algorithm with convergence guarantees. Two numerical examples, a malware-spread MFG and an RKHS-based consumer-choice model, show that the recovered policies closely match expert behaviour.
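
To make the contraction argument concrete, the following is a minimal sketch of one standard way a minorisation condition can replace the missing discount factor. It is not taken from the paper and its exact construction may differ. The assumptions are: finite state and action spaces, a fixed mean-field term (so the transition kernel is a plain array `P[s, a, s']`), a known reward array `r[s, a]`, and the minorising measure chosen as the entrywise minimum of `P` over all state-action pairs. The function name `soft_average_reward_solve` and all variable names are hypothetical.

```python
import numpy as np


def logsumexp_rows(x):
    """Numerically stable log-sum-exp over the action axis of an (S, A) array."""
    m = x.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=1, keepdims=True))).ravel()


def soft_average_reward_solve(P, r, tol=1e-10, max_iter=10_000):
    """Fixed-point iteration for a soft Bellman equation without discounting.

    P: (S, A, S) transition kernel with the mean-field term held fixed (assumption).
    r: (S, A) reward.
    Returns the relative value h, the gain rho, the softmax policy, and eps.
    """
    # Assumed minorisation: P(s' | s, a) >= nu(s') for every (s, a),
    # with nu the (unnormalised) entrywise minimum and eps its total mass.
    nu = P.min(axis=(0, 1))
    eps = nu.sum()
    if eps <= 0:
        raise ValueError("minorisation fails: no state is reachable from every (s, a)")

    # Sub-stochastic kernel: non-negative, each row has mass 1 - eps < 1.
    Q = P - nu

    # h -> logsumexp_a(r + Q h) is a sup-norm contraction with modulus 1 - eps,
    # because log-sum-exp is 1-Lipschitz in the sup norm.
    h = np.zeros(P.shape[0])
    for _ in range(max_iter):
        h_new = logsumexp_rows(r + Q @ h)
        done = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if done:
            break

    rho = nu @ h                      # gain, since P h = Q h + nu . h
    q = r + Q @ h                     # soft state-action values
    policy = np.exp(q - logsumexp_rows(q)[:, None])
    return h, rho, policy, eps
```

At the fixed point, `h + rho` equals the log-sum-exp over actions of `r + P @ h`, which is the average-reward soft Bellman equation for the original kernel, so the subtraction of `nu` plays the role a discount factor would play in the discounted setting.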
