Self-Compacting Language Model Agents

2026-06-22 · Computation and Language

AI summary

The authors observe that long chains of thought and tool calls by AI agents pile up outdated information, which misleads later reasoning and eventually overflows the model's context window. Instead of summarizing at fixed intervals, they created SelfCompact, which lets the model decide when to summarize using a simple rubric and a dedicated summarization tool. This approach helps the model clean up its context more effectively, cutting token cost while improving performance on competitive math and agentic search tasks. The method requires no extra training and shows that simple ‘thinking about thinking’ rules help models manage their context better.

agent trace, chain of thought, compaction, context window, scaffold, inference-time, summarization, meta-cognition, rubric, agentic search
Authors
Tianjian Li, Jingyu Zhang, William Jurayj, Xi Wang, Chuanyang Jin, Mehrdad Farajtabar, Eric Nalisnick, Daniel Khashabi
Abstract
Long agent traces composed of chains of thought and tool calls accumulate stale content that anchors subsequent generations, and eventually outgrow the context window. Existing scaffolds mitigate this with fixed-interval compaction triggered at a token threshold. Such triggers pay no heed to trajectory structure, risking discard of partial results mid-derivation or mid-search. We propose SelfCompact, a scaffold that allows the model itself to decide when and how to compact. Specifically, it pairs two inference-time elements: (i) a compaction tool the model invokes to summarize the accumulated context, and (ii) a lightweight rubric specifying when to fire (a sub-task has resolved, or the trajectory is converging) and when to suppress (mid-derivation, or when stuck). Both are needed. The tool alone is unevenly used across open-weight models, often invoked at unhelpful moments or not at all; the rubric alone cannot act. Together, they elicit effective adaptive compaction without any fine-tuning or external supervision. We present empirical results on six benchmarks (competitive math and agentic search) and seven models. Our results show that SelfCompact matches or exceeds fixed-interval summarization at a fraction of the token cost, improving over a no-summarization baseline by up to 18.1 points on math and 5-9 points on agentic search at 30-70% lower per-question cost. Our results expose a meta-cognitive gap: although unprompted models cannot reliably tell when their own context is rotting, a lightweight rubric closes this gap, reframing when to compact as a capability that scaffolds can supply without training.
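The pairing of a compaction tool with a fire/suppress rubric can be illustrated with a minimal sketch. This is not the authors' implementation: the rubric wording is a paraphrase of the abstract, and every name here (`RUBRIC`, `Message`, `Action`, `Trace`, `compact`, `run_agent`, `scripted_policy`) is hypothetical. A scripted policy and a fixed summary string stand in for a real model and a real summarizer so that the example runs without any external service.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical rubric text, paraphrasing the fire/suppress conditions in the abstract.
RUBRIC = (
    "You may call the `compact` tool to summarize the context so far.\n"
    "Fire when: a sub-task has resolved, or the trajectory is converging.\n"
    "Suppress when: you are mid-derivation, or you are stuck."
)

PINNED_ROLES = {"system", "user"}  # assumption: rubric and task are never compacted


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "summary"
    content: str


@dataclass
class Action:
    kind: str      # "think", "compact", or "answer"
    text: str = ""


@dataclass
class Trace:
    messages: List[Message] = field(default_factory=list)


def compact(trace: Trace, summarize: Callable[[str], str]) -> Trace:
    """Replace all non-pinned messages with a single summary message."""
    pinned = [m for m in trace.messages if m.role in PINNED_ROLES]
    rest = [m for m in trace.messages if m.role not in PINNED_ROLES]
    if not rest:
        return trace  # nothing to compact yet
    summary = summarize("\n".join(m.content for m in rest))
    return Trace(pinned + [Message("summary", summary)])


def run_agent(policy: Callable[[Trace], Action],
              summarize: Callable[[str], str],
              task: str,
              max_steps: int = 20) -> Tuple[Trace, int]:
    """Run the loop; the policy (the model) decides when compaction fires."""
    trace = Trace([Message("system", RUBRIC), Message("user", task)])
    compactions = 0
    for _ in range(max_steps):
        action = policy(trace)
        if action.kind == "compact":
            trace = compact(trace, summarize)
            compactions += 1
            continue
        trace.messages.append(Message("assistant", action.text))
        if action.kind == "answer":
            break
    return trace, compactions


def scripted_policy(script: List[Action]) -> Callable[[Trace], Action]:
    """Toy stand-in for a model: replays a fixed list of actions."""
    actions = iter(script)
    return lambda trace: next(actions)


script = [
    Action("think", "Sub-task 1: try x = 2, fails; try x = 3, works."),
    Action("compact"),  # sub-task resolved, so the rubric says to fire
    Action("think", "Sub-task 2: using x = 3, compute y = 9."),
    Action("answer", "y = 9"),
]
toy_summarize = lambda text: "Resolved so far: x = 3."
final_trace, n_compactions = run_agent(scripted_policy(script), toy_summarize, "Find y.")
```

In this toy run the failed attempt (`x = 2`) is dropped at the compaction step, while the resolved result survives in the summary message. The contrast with a fixed-interval scaffold is that the trigger sits inside `policy` rather than in a token-count check in `run_agent`.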