LLM Agents Are Latent Context Managers: Eliciting Self-Managed Context via a Proprioceptive Dashboard
2026-06-29 • Computation and Language
AI summary
The authors argue that language models cannot tell, from the prompt alone, how much of their context window they are using, how old each part is, or how often it has been used, which makes managing long tasks difficult. They propose VISTA, a layer that shows the model a 'dashboard' of its context, reporting per-block token usage, recency, and access history, without any extra training. This helps models keep track of information over long tasks and improves performance on LOCA-Bench, BrowseComp-Plus, and GAIA. The approach works across different models, and its benefit grows as the context gets more crowded.
language models, context window, tool agents, memory management, prompt engineering, token usage, model-agnostic, runtime dashboard, context compression, long-horizon tasks
Authors
Binyan Xu, Haitao Li, Kehuan Zhang
Abstract
Long-horizon tool agents are bottlenecked by context that grows toward the limit of the context window. Recent systems make context management agent- or system-controlled, but they either learn a compression policy that discards evidence or manage context in a layer the agent never sees. We argue that both leave a more basic gap unaddressed: frontier language models are proprioceptively blind to their own context. From the prompt alone they cannot see how large, how old, or how heavily used each block is, which are the signals a keep-or-drop decision needs. We hypothesize that competent context management is already latent in capable models, and that what is missing is not a learned policy but an interface exposing this state. We introduce VISTA (Visible Internal State for Tool Agents), a training-free, model-agnostic layer that represents working memory as typed, addressable blocks, surfaces a runtime dashboard of per-block token usage, recency, and access history, and archives blocks as recoverable full-fidelity payloads. On LOCA-Bench, BrowseComp-Plus, and GAIA, the same untrained interface transfers across million-, 100K-, and 10K-scale trajectories. On LOCA-Bench it improves four backbones and lifts Gemini-3-Flash from 22.7% to 50.7%. The lift grows with context pressure and transfers across backbones. Ablations further confirm that the dashboard matters beyond the archive and recovery tools.
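To make the three ingredients named in the abstract concrete (typed, addressable blocks; a dashboard of per-block token usage, recency, and access history; archiving with full-fidelity recovery), here is a minimal sketch in Python. It is not the authors' implementation. Every name in it (Block, Workspace, archive_block, recover, dashboard), the dashboard text layout, and the whitespace-based token count are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One typed, addressable unit of working memory (hypothetical schema)."""
    block_id: str
    kind: str              # e.g. "tool_result" or "note"; labels are assumptions
    text: str
    created_step: int
    last_access_step: int
    access_count: int = 0

    @property
    def tokens(self) -> int:
        # Crude whitespace count standing in for a real tokenizer.
        return len(self.text.split())


@dataclass
class Workspace:
    """Live context plus an archive, with a step counter used for recency."""
    budget_tokens: int
    step: int = 0
    live: dict = field(default_factory=dict)
    archive: dict = field(default_factory=dict)

    def add(self, block_id: str, kind: str, text: str) -> None:
        self.step += 1
        self.live[block_id] = Block(block_id, kind, text, self.step, self.step)

    def read(self, block_id: str) -> str:
        # Reading a block updates its access history.
        self.step += 1
        block = self.live[block_id]
        block.access_count += 1
        block.last_access_step = self.step
        return block.text

    def archive_block(self, block_id: str) -> None:
        # Move the full payload out of the live context, unchanged.
        self.step += 1
        self.archive[block_id] = self.live.pop(block_id)

    def recover(self, block_id: str) -> None:
        # Bring an archived payload back into the live context verbatim.
        self.step += 1
        block = self.archive.pop(block_id)
        block.last_access_step = self.step
        self.live[block_id] = block

    def used_tokens(self) -> int:
        return sum(b.tokens for b in self.live.values())

    def dashboard(self) -> str:
        # Text the agent would see in its prompt: size, age, idle time, reads.
        lines = [f"context: {self.used_tokens()}/{self.budget_tokens} tokens, "
                 f"step {self.step}"]
        for b in self.live.values():
            lines.append(
                f"[{b.block_id}] type={b.kind} tokens={b.tokens} "
                f"age={self.step - b.created_step} "
                f"idle={self.step - b.last_access_step} reads={b.access_count}"
            )
        if self.archive:
            lines.append("archived: " + ", ".join(self.archive))
        return "\n".join(lines)


if __name__ == "__main__":
    ws = Workspace(budget_tokens=20)
    ws.add("b1", "tool_result", "alpha beta gamma delta")
    ws.add("b2", "note", "one two three")
    ws.read("b1")
    print(ws.dashboard())
```

In this sketch the keep-or-drop decision stays with the model: the layer only reports state and executes archive or recover requests, which matches the abstract's claim that what is missing is an interface rather than a learned policy. A real system would use the backbone's tokenizer and its own block types.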