Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback
2026-06-08 • Artificial Intelligence
Artificial Intelligence • Computation and Language • Machine Learning
AI summary
The authors studied whether deep research agents (DRAs) improve their reports when given feedback over multiple rounds. They tested two feedback settings: self-reflection, in which the agent revises its report with no external signal, and process-level feedback, in which the agent receives targeted guidance about gaps in its research strategy. Self-reflection alone produced negligible net improvement, while a single round of targeted feedback raised the normalized score by roughly 8 to 15 points. Repeated feedback rounds did not compound these gains, because agents sometimes undid earlier improvements when rewriting the full report. Overall, the authors show that the DRAs they evaluated still struggle to improve reliably across multiple rounds of feedback.
Deep Research Agents • Multi-turn Evaluation • Self-reflection • Process-level Feedback • Research Gap Inference • Rubric Criteria • Report Revision • Performance Improvement • Evaluation Benchmark
Authors
Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan
Abstract
Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately $8$-$15$ points and yielding a roughly $35$-$40\%$ incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to $24\%$ of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
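To make the incorporation and regression figures concrete, the sketch below computes both rates between two consecutive turns. The definitions are assumptions inferred from the abstract's wording, not taken from the authors' code: incorporation rate is taken to be the share of previously unsatisfied rubric criteria that become satisfied, and regression rate the share of previously satisfied criteria that become unsatisfied. The function name `turn_transition_rates` and the criterion IDs are hypothetical.

```python
from typing import Dict, Tuple


def turn_transition_rates(
    before: Dict[str, bool], after: Dict[str, bool]
) -> Tuple[float, float]:
    """Return (incorporation_rate, regression_rate) between two turns.

    `before` and `after` map a rubric criterion ID to whether the report
    satisfied it at that turn. Assumed definitions (not from the paper):
      incorporation = newly satisfied / previously unsatisfied
      regression    = newly unsatisfied / previously satisfied
    """
    shared = before.keys() & after.keys()  # only compare criteria scored at both turns
    unmet = [c for c in shared if not before[c]]
    met = [c for c in shared if before[c]]

    incorporated = sum(1 for c in unmet if after[c])
    regressed = sum(1 for c in met if not after[c])

    inc_rate = incorporated / len(unmet) if unmet else 0.0
    reg_rate = regressed / len(met) if met else 0.0
    return inc_rate, reg_rate


# Illustrative (made-up) rubric outcomes for one report across two turns.
turn_1 = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": False}
turn_2 = {"c1": True, "c2": False, "c3": True, "c4": False, "c5": False}

inc, reg = turn_transition_rates(turn_1, turn_2)
net_change = sum(turn_2.values()) - sum(turn_1.values())
print(f"incorporation={inc:.2f} regression={reg:.2f} net={net_change}")
```

In this toy case one criterion is gained and one is lost, so the number of satisfied criteria does not change even though the report was revised. This is the pattern the abstract describes for self-reflection, where incorporation and regression roughly cancel.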