MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

2026-06-29 • Computation and Language

Computation and Language • Machine Learning

AI summary

The authors created MemDelta, an evaluation protocol that tests agent memory systems by changing only one component at a time, so it is clear what actually affects performance. They found that reported improvements sometimes come from swapping a component such as the embedding model rather than from the memory method itself. Agent self-memory also scored lower than basic retrieval (42% vs. 47%), and on the two question types where Mem0 matched cloud RAG, it did so at 50x the cost. They recommend keeping the embedding model fixed across comparisons, breaking results down by model family, and reporting write-path cost before concluding that one memory system is better.

agent memory system • retrieval-augmented generation (RAG) • embedding model • full-context model • LongMemEval-S • self-memory • write-path cost • memory evaluation • model family

Authors

Kuan Wang

Abstract

Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it unclear what is actually being measured. We present MemDelta, a controlled evaluation protocol that varies one component at a time on LongMemEval-S (500 questions, 50+ sessions, three model families). Four findings emerge: (1) verbatim RAG matches full-context GPT-4o-mini (47.2% vs. 49.8%, p = 0.34), but the ranking reverses across models: Gemini gains +14pp from full context, while Sonnet gains +31pp from RAG, partly because it refuses 63% of full-context queries; (2) swapping only the embedding model in an identical pipeline shifts accuracy by +6.2pp at n = 500 (p = 0.004), and Mem0 beats MiniLM-RAG by +11pp but loses to cloud-RAG by 1.2pp, so one variable flips the conclusion; (3) agent self-memory (42%) underperforms basic retrieval (47%); (4) on 2 of 6 question types (n = 88), Mem0 matches cloud RAG (72.7% vs. 73.9%, p = 1.0) at 50x the cost, suggesting narrow rather than general gains. We recommend memory evaluations fix embedding models across comparisons, stratify by model family, and report write-path cost before attributing gains to architecture.
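The comparisons above are paired: each system answers the same questions, and a p-value is reported next to each accuracy gap. The abstract does not say which significance test was used. As a minimal sketch, assuming an exact McNemar test on per-question correctness, a paired comparison could look like the following. The function names `mcnemar_exact` and `delta_pp` are hypothetical, and the toy data is illustrative only, not taken from the paper.

```python
from math import comb


def delta_pp(correct_a, correct_b):
    """Accuracy of system A minus system B, in percentage points."""
    acc_a = sum(correct_a) / len(correct_a)
    acc_b = sum(correct_b) / len(correct_b)
    return 100.0 * (acc_a - acc_b)


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired per-question correctness.

    Assumption: this is one reasonable test for paired accuracy data;
    the source does not state which test the authors used.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same questions")
    # Only discordant pairs (exactly one system correct) carry information.
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(only_a, only_b)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy per-question results (illustrative only, not data from the paper).
memory_system = [True, True, False, True, False, True, True, False]
rag_baseline = [True, False, True, True, False, True, True, True]

print(f"delta = {delta_pp(memory_system, rag_baseline):+.1f}pp")
print(f"p = {mcnemar_exact(memory_system, rag_baseline):.3f}")
```

Because the test depends only on the questions where the two systems disagree, a small accuracy gap on a small stratum can produce a p-value at or near 1.0, which is the kind of result the abstract reports for the n = 88 subset.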
