Plans Don't Persist: Why Context Management Is Load Bearing for LLM Agents

2026-06-22 • Artificial Intelligence

Artificial Intelligence, Computation and Language

AI summary

The authors studied how language models handle plans, which are important instructions used over many steps but often removed from memory early. They found that these models do not remember plans internally but rely on having the plan text visible in their context. They developed a method called replay pairing to measure plan memory in the model’s hidden states and discovered that plan information decays quickly once removed. Their tests also showed that simply removing plans hurts task success, and trying to protect plans alone isn’t enough to keep performance high. Overall, they highlight that managing what stays in a model's limited context is crucial for long tasks.

long-horizon agents, context management, hidden-state representations, plan eviction, replay pairing, hidden-state cosine distance, layer probe, reasoning-trace confound, Llama model, compression stress test

Authors

Aman Mehta, Anupam Datta

Abstract

Long-horizon agents depend on context management: systems compress, summarize, and evict old tokens so tasks can continue beyond finite windows. That is safe only when dropped information is no longer needed or has been internalized. Plans are the stress case: they are written early, used for many steps, and first to be evicted. We introduce replay pairing, a diagnostic that runs the same trajectory with and without the plan in history and measures hidden-state cosine distance. On Llama-3.1-70B, plan signal spikes to 0.453 one step after the plan, then falls 4.1x in a single action-observation step; HotpotQA falls 12.4x. This is evidence that standard LLM agents do not carry plans forward as persistent state, and instead depend on the plan remaining in context. A layer-L32 probe detects this decay as a diagnostic, not as proof that it reads plan content itself. Reasoning models add a measurement confound: their `<think>` traces re-derive plan content, so standard stripping leaves plan evidence in the stripped condition. We name this the reasoning-trace confound and fix it with strict stripping, which removes prior `<think>` blocks from the stripped run only. It recovers +163% of the step+1 signal in-sample and +153% held out, while not meaningfully changing non-reasoning Llama (+4.8%). On DeepSeek-R1-Distill-Llama-70B, a Llama-trained probe transfers at AUROC 0.748 (p=6e-4), while R1-specific probes reach 1.000, suggesting R1 encodes plan signal in a different hidden-state direction. Finally, a compression stress test shows the practical cost: naive plan eviction cuts ALFWorld success by 34.7pp, while probe-gated re-surfacing does not recover it. The contribution is a measurement and stress-test framework showing that agent-critical information can be context-resident rather than persistent. Context management is load bearing, but plan protection alone is not enough.
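The replay-pairing measurement and the strict-stripping fix described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cosine_distance`, `plan_signal`, `fold_decay`, `strict_strip`), the toy two-dimensional vectors, and the assumption that one hidden-state vector is read out per step are all hypothetical choices made for illustration. The abstract does not specify which token position or layer the distance is computed at, so the sketch takes the per-step vectors as given.

```python
import math
import re
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def plan_signal(
    with_plan: Sequence[Sequence[float]],
    without_plan: Sequence[Sequence[float]],
) -> List[float]:
    """Per-step distance between paired runs of the same trajectory.

    Assumption: element i of each argument is the hidden state at step i,
    taken from the run with the plan in history and the run without it.
    """
    if len(with_plan) != len(without_plan):
        raise ValueError("paired runs must cover the same steps")
    return [cosine_distance(a, b) for a, b in zip(with_plan, without_plan)]


def fold_decay(signal: Sequence[float], step: int) -> float:
    """Ratio of the signal at `step` to the signal one step later."""
    return signal[step] / signal[step + 1]


# Hypothetical pattern for reasoning traces delimited by <think> tags.
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def strict_strip(history: Sequence[str]) -> List[str]:
    """Remove prior <think> blocks from each message of the stripped run."""
    return [THINK_BLOCK.sub("", message) for message in history]


# Toy data only: two steps, two-dimensional "hidden states".
toy_with = [[1.0, 0.0], [1.0, 0.0]]
toy_without = [[0.0, 1.0], [1.0, 1.0]]
toy_signal = plan_signal(toy_with, toy_without)
toy_fold = fold_decay(toy_signal, 0)
toy_history = strict_strip(["<think>plan: go to kitchen</think>I open the door."])
```

In this reading, a figure such as "falls 4.1x in a single action-observation step" corresponds to `fold_decay` evaluated at the step where the signal peaks, and `strict_strip` would be applied to the history of the stripped run only, leaving the with-plan run untouched.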
