Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

2026-06-01 · Artificial Intelligence

AI summary

The authors address constrained cooperative learning, in which many agents must learn to work together while satisfying constraints. The problem becomes intractable as the team grows because the joint action space grows exponentially with the number of agents. Their framework, CG-CMARL, decomposes the joint problem into pairwise regions on a coordination graph. Max-Sum message passing coordinates actions at execution time, and a Lagrangian multiplier sets the tradeoff between the objective and the constraints, so a single trained model covers different tradeoffs without retraining. They give convergence guarantees under mild conditions and report Pareto fronts that dominate baselines trained at fixed reward-shaping ratios on cooperative navigation tasks with teams of up to 10 agents.

Multi-agent reinforcement learning, Coordination graphs, Lagrangian duality, Max-Sum message passing, Pareto front, Constrained optimization, Q-functions, Decentralized control, Reward shaping, Scalability
Authors
Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson
Abstract
Constrained multi-agent reinforcement learning (CMARL) faces two intertwined challenges: the joint action space grows exponentially with the number of agents, and additional requirements couple agents in ways that reward structure alone does not capture. We introduce Coordination Graphs for Constrained Multi-Agent Reinforcement Learning (CG-CMARL), a framework that addresses both challenges by combining coordination graphs with Lagrangian duality. The system decomposes the joint problem into pairwise regions, each served by a set of shared Q-functions, one for the primary objective and one for each of the constraints, so that the number of learned models is independent of the number of agents. At execution time, Max-Sum message passing coordinates actions across the factor graph, while a Lagrangian multiplier controls the objective-constraint tradeoff, allowing a single trained model to trace a Pareto front without retraining. We provide convergence guarantees under mild conditions, together with a compositional error bound that decomposes into separate interpretable sources, each traceable to a specific design choice and independently controllable. Experiments on cooperative navigation tasks (where teams of up to 10 agents must coordinate to reach target positions while satisfying pairwise constraints) show that our method produces Pareto fronts dominating established baselines trained at fixed reward-shaping ratios, while scaling to team sizes where centralized approaches become intractable.
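
The execution-time procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: random pairwise tables stand in for the learned objective and constraint Q-functions evaluated on each edge, there is a single constraint, the constraint Q-function is treated as a cost (so the scalarized payoff is `q_obj - lam * q_con`), and the coordination graph is a 4-agent chain. All function names (`lagrangian_edge_q`, `max_sum`, `joint_value`, `brute_force_value`) are hypothetical.

```python
import itertools
import numpy as np

def lagrangian_edge_q(q_obj, q_con, lam):
    """Scalarize one edge: objective Q minus lam times constraint-cost Q (assumed sign)."""
    return q_obj - lam * q_con

def joint_value(actions, edge_q):
    """Sum of pairwise payoffs for a joint action; edge_q[(i, j)] is indexed [a_i, a_j]."""
    return sum(q[actions[i], actions[j]] for (i, j), q in edge_q.items())

def max_sum(n_agents, n_actions, edges, edge_q, n_iters=10):
    """Synchronous Max-Sum on a pairwise graph; exact on trees with a unique optimum."""
    neighbors = {i: [] for i in range(n_agents)}
    msgs = {}  # msgs[(i, j)]: message from agent i to agent j, a vector over a_j
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
        msgs[(i, j)] = np.zeros(n_actions)
        msgs[(j, i)] = np.zeros(n_actions)

    def payoff(i, j):
        # Return the edge table indexed [a_i, a_j], whichever way it was stored.
        return edge_q[(i, j)] if (i, j) in edge_q else edge_q[(j, i)].T

    for _ in range(n_iters):
        new_msgs = {}
        for i, j in msgs:
            # Sum of messages into i from all neighbors except j, as a function of a_i.
            incoming = sum((msgs[(k, i)] for k in neighbors[i] if k != j),
                           np.zeros(n_actions))
            m = np.max(payoff(i, j) + incoming[:, None], axis=0)  # maximize over a_i
            new_msgs[(i, j)] = m - m.mean()  # normalization keeps messages bounded
        msgs = new_msgs

    # Each agent picks the action maximizing the sum of its incoming messages.
    return [int(np.argmax(sum((msgs[(k, i)] for k in neighbors[i]),
                              np.zeros(n_actions))))
            for i in range(n_agents)]

def brute_force_value(n_agents, n_actions, edge_q):
    """Exhaustive optimum over the joint action space, for checking small cases only."""
    return max(joint_value(a, edge_q)
               for a in itertools.product(range(n_actions), repeat=n_agents))

rng = np.random.default_rng(0)
n_agents, n_actions = 4, 3
edges = [(0, 1), (1, 2), (2, 3)]  # chain-shaped coordination graph (assumed)
# Random stand-ins for the learned pairwise Q-values on each edge.
q_obj = {e: rng.normal(size=(n_actions, n_actions)) for e in edges}
q_con = {e: rng.uniform(size=(n_actions, n_actions)) for e in edges}

# Sweeping the multiplier reuses the same tables, so no retraining is involved.
for lam in (0.0, 1.0, 5.0):
    edge_q = {e: lagrangian_edge_q(q_obj[e], q_con[e], lam) for e in edges}
    actions = max_sum(n_agents, n_actions, edges, edge_q)
    print(lam, actions, joint_value(actions, q_obj), joint_value(actions, q_con))
```

Each printed row gives the multiplier, the selected joint action, and the resulting objective and constraint totals, which is one point on the tradeoff curve. The brute-force helper enumerates all joint actions and is only feasible for tiny teams; the message-passing routine touches only pairwise tables, which is the scaling argument the abstract makes. On graphs with cycles, Max-Sum is generally approximate, so the exactness check used here applies only to the tree-shaped example.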