TriggerBench: Investigating Prospective Memory for Large Language Models

2026-06-22 • Computation and Language

AI summary

The authors study prospective memory (PM) in large language models (LLMs): the ability to recall and act on an earlier constraint without being prompted to do so. They introduce TriggerBench, a benchmark that checks whether models notice and act on latent cues during long interactions, in contrast to earlier evaluations that test memory through explicit queries. Models perform worse on PM than on retrospective memory (RM) over identical contexts, and PM accuracy drops further when constraints are implicit or when the trigger coincides with other user requests. Stronger reasoning improves proactive recall, but can push models toward reminding all the time, which raises false alarms. PM accuracy also appears to track how much spare reasoning capacity a model has. Overall, robust PM remains an open challenge for LLMs.

Large Language Models, Prospective Memory, Retrospective Memory, TriggerBench, Context Length, Attention, Proactive Recall, False-Alarms, Behavioral Probe, Reasoning Capacity

Authors

Tianhua Zhang, Xinjiang Wang, Qianxi Zhang, Qi Chen, Kun Li, Yaoqi Chen, Dingdong Wang, Helen Meng, Yan Lu

Abstract

While Large Language Models (LLMs) are increasingly deployed in long interactions, existing evaluations focus predominantly on retrospective memory (RM) via explicit queries. Prospective memory (PM), the critical ability to spontaneously recall and act on latent constraints without direct prompts, remains largely unevaluated. We introduce TriggerBench, a comprehensive PM benchmark spanning five dimensions across both daily assistants and professional workflows. TriggerBench pairs scenarios with matched RM controls, contrastive positive/negative variants, and overloaded triggers, enabling fine-grained measurement of proactive recall, false-alarm rate, and attentional robustness under a single protocol. Our evaluation yields three key findings. (i) PM shows a precision-recall trade-off and attentional fragility. Though enhanced reasoning significantly improves proactive recall, models may overfit to an "always-remind" heuristic. Furthermore, PM accuracy degrades substantially under implicit constraints or triggers overloaded by concurrent user requests, indicating that robust PM remains an open challenge. (ii) PM is notably harder than RM: on identical contexts, RM near-saturates up to 100K tokens, while PM decays sharply as context length scales. (iii) PM may serve as a behavioral probe of spare reasoning capacity. Pairing PM scenarios with AIME-2025 math problems reveals that successful trajectories yield higher PM accuracy than failed ones at the same context length, showing PM tracks spare reasoning budget that token count obscures. Project page: https://github.com/KristenZHANG/TriggerBench-Official.
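
As an illustration of the precision-recall trade-off described in finding (i), the following minimal sketch shows how proactive recall and false-alarm rate could be computed from contrastive positive/negative variants. It is not the TriggerBench scoring code: the function name `pm_metrics` and the `(should_act, acted)` trial format are assumptions made for this example only.

```python
def pm_metrics(trials):
    """Score prospective-memory trials (hypothetical format, not TriggerBench's).

    Each trial is a (should_act, acted) pair of booleans:
      should_act: True for a positive variant (the trigger is present),
                  False for a negative variant (no action is warranted).
      acted:      True if the model proactively acted on the constraint.
    """
    positives = [acted for should_act, acted in trials if should_act]
    negatives = [acted for should_act, acted in trials if not should_act]

    hits = sum(positives)          # acted when it should have
    false_alarms = sum(negatives)  # acted when it should not have
    acted_total = hits + false_alarms

    return {
        # Share of positive variants where the model acted.
        "recall": hits / len(positives) if positives else None,
        # Share of negative variants where the model acted anyway.
        "false_alarm_rate": false_alarms / len(negatives) if negatives else None,
        # Share of the model's actions that were warranted.
        "precision": hits / acted_total if acted_total else None,
    }


# A model that follows an "always-remind" heuristic acts on every trial.
always_remind = [(True, True), (True, True), (False, True), (False, True)]
print(pm_metrics(always_remind))
# {'recall': 1.0, 'false_alarm_rate': 1.0, 'precision': 0.5}
```

Under these assumptions, an "always-remind" model reaches perfect recall only by also acting on every negative variant, which is why recall needs to be read together with the false-alarm rate.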