MEME: Multi-entity & Evolving Memory Evaluation

2026-05-12

Machine Learning · Computation and Language
AI summary

The authors studied how well language-model agents remember and update information across many sessions, particularly when facts depend on each other or are later removed. They created a new benchmark, MEME, with harder tasks than prior work, such as reasoning over dependent facts and over deleted information. They tested six memory systems and found that most perform very poorly on these reasoning tasks, even with the best current techniques. Only one system, paired with a powerful model, improved performance, but it required far more compute, making it impractical for real use. This shows that current methods still have major limitations in handling complex memory updates over time.

LLM, memory systems, multi-session reasoning, dependency reasoning, evaluation benchmark, information updating, retrieval, language models, agent architectures
Authors
Seokwon Jung, Alexander Rubinstein, Arnas Uselis, Sangdoo Yun, Seong Joon Oh
Abstract
LLM-based agents increasingly operate in persistent environments where they must store, update, and reason over information across many sessions. While prior benchmarks evaluate only single-entity updates, MEME defines six tasks spanning the full space of the multi-entity and evolving axes, including three not scored by prior work: Cascade and Absence (dependency reasoning) and Deletion (post-removal state). Evaluating six memory systems spanning three memory paradigms on 100 controlled episodes, we find that all systems collapse on dependency reasoning under the default configuration (Cascade: 3%, Absence: 1% average accuracy), despite adequate static retrieval performance. Prompt optimization, deeper retrieval, reduced filler noise, and most stronger LLMs fail to close this gap. Only a file-based agent paired with Claude Opus 4.7 as its internal LLM partially closes it, but at ~70x the baseline cost, indicating that closure currently depends on configurations that are not practical at scale. Code and data are available on the project page: https://seokwonjung-jay.github.io/meme-eval/.
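To make the task categories concrete, the sketch below shows one way a Cascade-style episode could be represented and resolved. The entity names, the episode schema, and the `latest_value` helper are illustrative assumptions for this page, not the actual MEME data format; see the project page for the real benchmark definition.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FactUpdate:
    session: int            # session index in which the fact is asserted or removed
    entity: str             # entity the fact is about
    attribute: str          # attribute being set
    value: Optional[str]    # None models a deletion (post-removal state)

@dataclass
class Episode:
    updates: List[FactUpdate]
    question: str
    answer: str

# Hypothetical Cascade-style episode: the answer depends on a chain of updates
# across sessions (Bob's building follows his team, and the team later moves).
episode = Episode(
    updates=[
        FactUpdate(session=1, entity="Bob",    attribute="team",     value="Search"),
        FactUpdate(session=2, entity="Search", attribute="building", value="B2"),
        FactUpdate(session=5, entity="Search", attribute="building", value="B7"),  # evolving fact
    ],
    question="Which building does Bob work in now?",
    answer="B7",  # requires composing Bob -> Search -> B7, not retrieving a single fact
)

def latest_value(updates: List[FactUpdate], entity: str, attribute: str) -> Optional[str]:
    """Return the most recent value of (entity, attribute), or None if never set or deleted."""
    value = None
    for u in sorted(updates, key=lambda u: u.session):
        if u.entity == entity and u.attribute == attribute:
            value = u.value
    return value

# Resolving the cascade by hand: follow the dependency chain across entities.
team = latest_value(episode.updates, "Bob", "team")
building = latest_value(episode.updates, team, "building") if team else None
assert building == episode.answer
```

A Deletion variant of the same structure would set `value=None` in a later session and expect the system to answer that the fact no longer holds; an Absence variant would ask about a fact that was never asserted.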