Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle
2026-06-05 • Artificial Intelligence
AI summary
The authors studied advanced AI systems designed to act like research assistants and found that they still miss subtle but important details that human researchers easily catch. They created a new test called AARRI-Bench to measure whether AI can think and work like a real research intern. Their experiments showed that even the best-performing configuration succeeds on only about 68% of the test. This suggests that building researcher-like AI requires a better understanding of research behavior, not just more complex scaffolding.
foundation models, AI agents, research benchmarks, long-horizon tasks, scientific judgment, agent scaffolding, AARRI-Bench, autonomous research, Claude Opus 4.7, Mini-SWE-Agent
Authors
Jiayu Wang, Weijiang Lv, Bowen Fu, Jing Fu, Jiayi Song, Lingyu Zhang, Lanxuan Xue, Luodi Chen, Zepeng Xin, Kaiyu Li, Xiangyong Cao
Abstract
As foundation models advance and agent scaffolding becomes increasingly sophisticated, agents have demonstrated remarkable proficiency in complex, long-horizon coding tasks and even autonomous experiment execution. Despite their evolution from research assistants into autonomous research agents, these systems still exhibit significant limitations in field sensitivity, research ethics, and nuanced scientific judgment. Consequently, frontier agents remain unable to fully replace human researchers. To bridge this gap, we conceptualize the AARR (Act As a Real Researcher) benchmark series. Unlike existing benchmarks that primarily assess macro-level execution capabilities, AARR focuses on whether agents can emulate the professionalism, thoroughness, and nuanced reasoning that characterize human researchers in granular research scenarios. In this work, we propose AARRI-Bench (Act As a Real Research Intern), the first benchmark in this series. We conduct extensive experiments across frontier models and agentic systems, revealing that even the best-performing configuration (Mini-SWE-Agent with Claude Opus 4.7) achieves only a 68.3% success rate, frequently overlooking subtle yet critical details that are obvious to human researchers. Our results indicate that developing researcher-like AI requires further exploration of research behavior rather than merely more complex scaffolding. Our data is released at https://github.com/AARR-bench/AARRI-bench.