Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

2026-06-15

Computation and Language, Information Retrieval
AI summary

The authors created MetaSyn, a dataset for testing how well automated systems can carry out meta-analyses, which are studies that combine results from many research papers to answer a single question. The dataset includes expert-curated questions, a corpus of about 140k PubMed articles, the studies that should be included, and hard negatives that look similar but do not meet the eligibility criteria. They found that although retrieval can surface most of the relevant papers, current systems and language models struggle to pick out the studies that truly belong, especially when many candidates look almost right. The authors argue that measuring each stage separately shows where systems fail more clearly than a single overall score.

Meta-analysis, PI/ECO criteria, Evidence synthesis, Literature retrieval, Screening, Recall, Large Language Models (LLMs), Benchmarking, Systematic review, Research corpus
Authors
Anzhe Xie, Weihang Su, Yujia Zhou, Yiqun Liu, Qingyao Ai
Abstract
Meta-analysis is a demanding form of evidence synthesis that combines literature retrieval, PI/ECO-guided study selection, and statistical aggregation. Its structured, verifiable workflow makes it an ideal substrate for evaluating systematic scientific reasoning, yet existing benchmarks lack ground truth across the full retrieval-screening-synthesis pipeline. We introduce MetaSyn, a dataset of 442 expert-curated meta-analyses from Nature Portfolio journals. Each entry pairs a research question with PI/ECO criteria, a retrieval corpus of 140k PubMed articles, verified positive studies, hard negatives that are topically similar but PI/ECO-ineligible, and complete search strategies and date bounds. Benchmarking twelve pipeline configurations (nine RAG variants and a protocol-driven agent) reveals a critical screening bottleneck: despite a retrieval ceiling of 90.9% recall at K=200, no system recovers more than 52.7% of ground-truth included literature. Current LLMs fail to reliably separate eligible studies from PI/ECO-failing distractors in pools of comparable topical relevance. Stage-attributed metrics capture where systems succeed and fail; a single end-to-end score does not.
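To illustrate the contrast between stage-attributed and end-to-end scoring, the following is a minimal sketch rather than the authors' evaluation code. It assumes a pipeline that first retrieves a ranked list and then screens only the top-K candidates. The function name `stage_metrics`, the metric names, and the toy document IDs are all hypothetical.

```python
from typing import Dict, Iterable, Sequence


def stage_metrics(
    ranked_ids: Sequence[str],
    included_ids: Iterable[str],
    positives: Iterable[str],
    k: int,
) -> Dict[str, float]:
    """Split end-to-end recall into a retrieval stage and a screening stage.

    ranked_ids   -- retrieval output, best first (hypothetical format)
    included_ids -- IDs the screening step decided to include
    positives    -- ground-truth included studies
    k            -- retrieval cutoff; screening only sees the top-k
    """
    positive_set = set(positives)
    candidates = set(ranked_ids[:k])

    # Screening can only keep what retrieval surfaced.
    kept = set(included_ids) & candidates
    retrieved_pos = positive_set & candidates
    kept_pos = positive_set & kept

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0

    return {
        # Ceiling set by retrieval: positives present in the top-k pool.
        "retrieval_recall": ratio(len(retrieved_pos), len(positive_set)),
        # Share of retrieved positives that survive screening.
        "screening_retention": ratio(len(kept_pos), len(retrieved_pos)),
        # Share of screened-in studies that are true positives.
        "screening_precision": ratio(len(kept_pos), len(kept)),
        # What a single overall score would report.
        "end_to_end_recall": ratio(len(kept_pos), len(positive_set)),
    }


# Toy example: "p*" are eligible studies, "n*" are hard negatives.
ranked = ["p1", "n1", "p2", "n2", "p3", "n3"]
truth = {"p1", "p2", "p3", "p4"}
screened_in = ["p1", "n1", "p2"]

metrics = stage_metrics(ranked, screened_in, truth, k=5)
print(metrics)
```

In this toy case retrieval recall is 0.75 while end-to-end recall is 0.5, and the end-to-end value equals retrieval recall multiplied by screening retention. Reporting the two factors separately shows whether lost studies were never retrieved or were retrieved and then wrongly screened out, which a single end-to-end number cannot distinguish.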