EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

2026-05-11 • Artificial Intelligence

Artificial Intelligence, Multiagent Systems

AI summary

The authors created EnactToM, a set of tasks in which AI agents must understand what others know and act together in a 3D household setting. Unlike previous tests that only ask AI about beliefs directly, these tasks require AI to use that knowledge to complete actions successfully. They found that current AI models perform poorly on this more realistic, functional use of others’ beliefs, even though they do better on direct belief questions. Most failures stemmed from breakdowns in communication and coordination between agents, which points to a clear gap in AI teamwork skills.

Theory of Mind, epistemic state, functional Theory of Mind, embodied environment, multi-agent systems, partial observability, private information, epistemic coordination, benchmark, epistemic depth

Authors

Gurusha Juneja, Dylan Lu, Saaket Agashe, Parth Diwane, Edward Gunn, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Yali Du, Xin Eric Wang

Abstract

Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.
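The abstract does not define Pass^3. A common reading of a pass^k metric is that a task counts as passed only if all k independent trials succeed. The sketch below illustrates that reading only; the function name `pass_hat_k`, the data layout, and the example task IDs are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k trials all succeeded.

    Assumption: Pass^k means "every one of k trials passes"; the paper's
    exact definition may differ.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    passed = 0
    for task_id, trials in results.items():
        if len(trials) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} trials")
        # A single failed trial among the first k fails the whole task.
        if all(trials[:k]):
            passed += 1
    return passed / len(results)


# Hypothetical outcomes: three trials per task.
example_results = {
    "task_a": [True, True, True],     # counts as passed
    "task_b": [True, False, True],    # one failure, so not passed
    "task_c": [False, False, False],  # not passed
}

if __name__ == "__main__":
    print(f"Pass^3 = {pass_hat_k(example_results, k=3):.3f}")
```

Under this reading, a score of 0.0% means that no task on the hard split was solved in all three trials, which is a stricter criterion than succeeding at least once.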
