StatABench: Dataset and Framework for Evaluating Statistical Analysis Capabilities of LLMs
2026-06-22 • Computation and Language
Computation and Language • Artificial Intelligence
AI summary
The authors created StatABench, a new test to check how good large language models (LLMs) are at understanding and doing statistics. StatABench has two parts: one with 404 varied questions and another with 30 tough real-world tasks. They tested multiple LLMs, including GPT-5.1, and found that even the best models don't fully master statistical analysis yet. Their work shows that LLMs still struggle with practical use of tools, choosing methods, and completing full statistical projects.
large language models, statistical analysis, benchmark, multiple-choice questions, open-ended tasks, model evaluation, LangChain MCP, LLM-as-Judge, methodological decision-making, statistical modeling
Authors
Youxin Zhu, Yixuan Ding, Peng Lai, Longyue Wang, Bingyi Jing, Guanhua Chen
Abstract
Statistical analysis is a broad, complex field requiring both domain knowledge and tool proficiency. While prior work has evaluated large language models (LLMs) in this domain, existing benchmarks remain limited in scope and format. To bridge this gap, we introduce StatABench (Statistical Analysis Benchmark), a benchmark designed to systematically assess LLMs' statistical analysis capabilities. StatABench comprises two complementary components: Stat-Closed, containing 404 questions across 18 statistical topics in multiple formats (multiple-choice, fill-in-the-blank, decision-making, and practical application), and Stat-Open, featuring 30 complex open-ended modeling tasks adapted from professional competitions. We evaluate diverse LLMs using the LangChain MCP framework and multiple data science agents, and assess Stat-Open solutions via a validated LLM-as-Judge protocol. Experiments show that even GPT-5.1 achieves only 68.6% on Stat-Closed, while the best open-source model reaches 60.6%. On Stat-Open, the top agent framework scores 61.86 on average. These results reveal the gap between current LLMs and reliable statistical analysis, highlighting persistent challenges in tool-grounded reasoning, methodological decision-making, and end-to-end statistical modeling.