Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

2026-04-13 · Computation and Language

AI summary

The authors study how to improve long, multi-step AI agent tasks by running several attempts in parallel and then combining the results. They point out that combining only the final answers loses important details from the attempts, while feeding every full attempt to the model overflows its context window. To solve this, the authors create AggAgent, an aggregation agent that treats the set of parallel attempts as an environment and uses lightweight tools to inspect candidate answers and search across the attempts. They tested the method on six benchmarks and three model families and found that it outperformed existing aggregation methods while adding little overhead. This suggests that the approach is a practical and cost-efficient way to scale up complex, multi-step agent tasks.

agentic tasks, parallel rollouts, aggregation, chain-of-thought reasoning, context window, trajectory, test-time scaling, agentic search, deep research, GLM-4.7
Authors
Yoonsang Lee, Howard Yen, Xi Ye, Danqi Chen
Abstract
We study parallel test-time scaling for long-horizon agentic tasks such as agentic search and deep research, where multiple rollouts are generated in parallel and aggregated into a final response. While such scaling has proven effective for chain-of-thought reasoning, agentic tasks pose unique challenges: trajectories are long, multi-turn, and tool-augmented, and outputs are often open-ended. Aggregating only final answers discards rich information from trajectories, while concatenating all trajectories exceeds the model's context window. To address this, we propose AggAgent, an aggregation agent that treats parallel trajectories as an environment. We equip it with lightweight tools to inspect candidate solutions and search across trajectories, enabling it to navigate and synthesize information on demand. Across six benchmarks and three model families (GLM-4.7, Qwen3.5, MiniMax-M2.5), AggAgent outperforms all existing aggregation methods, by up to 5.3% absolute on average and 10.3% on two deep research tasks, while adding minimal overhead, as the aggregation cost remains bounded by a single agentic rollout. Our findings establish agentic aggregation as an effective and cost-efficient approach to parallel test-time scaling.
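
As a rough illustration of the idea of treating parallel trajectories as an environment, the sketch below wraps a set of rollouts in a small Python class with one tool for inspecting candidate answers and one for searching across trajectories. It is not the authors' implementation: the names `Step`, `Trajectory`, `TrajectoryEnv`, `list_candidates`, and `search`, the substring-matching search, and the snippet length cap are all assumptions made for this example, since the abstract does not specify the tool interface.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Step:
    """One turn of a rollout: an action (e.g. a tool call) and its observation."""
    action: str
    observation: str


@dataclass
class Trajectory:
    """A single parallel rollout: its steps and the final answer it produced."""
    steps: list[Step]
    final_answer: str


class TrajectoryEnv:
    """Hypothetical environment exposing parallel rollouts to an aggregator."""

    def __init__(self, trajectories: list[Trajectory], max_chars: int = 200):
        self.trajectories = trajectories
        self.max_chars = max_chars  # assumed cap on each returned snippet

    def list_candidates(self) -> list[tuple[int, str]]:
        """Inspect tool: return (rollout index, final answer) pairs."""
        return [(i, t.final_answer) for i, t in enumerate(self.trajectories)]

    def search(self, query: str, limit: int = 5) -> list[tuple[int, int, str]]:
        """Search tool: case-insensitive substring match over every step.

        Returns at most `limit` (rollout index, step index, snippet) hits,
        so the aggregator reads only what it asks for instead of the
        concatenation of all trajectories.
        """
        needle = query.lower()
        hits: list[tuple[int, int, str]] = []
        for i, traj in enumerate(self.trajectories):
            for j, step in enumerate(traj.steps):
                text = f"{step.action} -> {step.observation}"
                if needle in text.lower():
                    hits.append((i, j, text[: self.max_chars]))
                    if len(hits) == limit:
                        return hits
        return hits


# Toy usage with two made-up rollouts that disagree on the final answer.
env = TrajectoryEnv([
    Trajectory([Step("search('capital of Australia')", "Canberra is the capital.")], "Canberra"),
    Trajectory([Step("search('largest city Australia')", "Sydney is the largest city.")], "Sydney"),
])
candidates = env.list_candidates()
evidence = env.search("capital")
```

In this toy setup an aggregator would first call `list_candidates` to see where the rollouts disagree, then call `search` to pull only the supporting evidence it needs. Because each call returns a bounded number of capped snippets, the context consumed per tool call stays small regardless of how many rollouts are wrapped.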