SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

2026-06-29

Software Engineering, Artificial Intelligence
AI summary

The authors created SpreadsheetBench 2, a new test set designed to see how well AI can handle realistic, multi-step spreadsheet tasks used in business, like creating formulas, fixing errors, and making charts. Unlike previous tests that only looked at small tasks, their benchmark uses large, real-world spreadsheet examples with many linked sheets. They tested eight advanced AI models and found that these systems still struggle, especially with finding and fixing errors. The authors highlight that current AI often fails to properly check the spreadsheets or pick the right cells to work on, showing there is still a lot of room for improvement.

spreadsheet automation, workflow benchmark, large language models, debugging spreadsheets, spreadsheet visualization, multi-sheet workbooks, business data, task accuracy, cell modification, spreadsheet agents
Authors
Jian Zhu, Yuzheng Zhang, Zeyao Ma, Bohan Zhang, Armin Schoepf, Daniel Woloch, Peter Yiliu Wang, Guangyu Robert Yang, Samuel Jacob, Siddharth Nagisetty, Abhiram Chundru, Jean Lin, Spencer Mateega, Jing Zhang
Abstract
Spreadsheets are widely used for business analysis, financial modeling, reporting, and decision-making. However, most existing spreadsheet benchmarks evaluate isolated operations such as single-formula generation or local cell edits, and therefore fail to capture end-to-end workflows in realistic business settings. We introduce SpreadsheetBench 2, a workflow-level benchmark for spreadsheet agents that covers three task categories: generation, debugging, and visualization. The benchmark is constructed from authentic business data, including financial reports and corporate filings, and is annotated and validated by domain experts. It contains 321 tasks; on average, each instance spans 11.8 worksheets and requires 593.5 cell modifications, reflecting large multi-sheet workbooks with cross-sheet dependencies. We evaluate eight frontier large language models under a unified multi-turn agent scaffold, and additionally include several LLM-based spreadsheet products as complementary baselines. Results show that current systems remain far from reliable on real-world workflows: the best model achieves 34.89% overall task accuracy, and debugging accuracy is as low as 12.00%. Trajectory analysis and a failure taxonomy further indicate that insufficient spreadsheet inspection and incorrect target-cell selection are the dominant bottlenecks. Together, these findings position SpreadsheetBench 2 as a challenging testbed for advancing reliable spreadsheet automation. Project page: https://spreadsheetbench.github.io/
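To make the reported accuracy figures concrete, the following minimal sketch shows one way overall and per-category task accuracy could be computed from pass/fail outcomes. It assumes that task accuracy is the plain fraction of tasks judged correct; the abstract does not spell out the scoring rule, and the function name `task_accuracy` and the input format are hypothetical, not taken from the benchmark's code.

```python
from collections import defaultdict


def task_accuracy(results):
    """Return (overall, per_category) accuracy in percent.

    `results` is a list of (category, passed) pairs. This input format is
    an assumption made for illustration, not the benchmark's own schema.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    per_category = {c: 100.0 * passed[c] / totals[c] for c in totals}
    overall = 100.0 * sum(passed.values()) / sum(totals.values())
    return overall, per_category


# Toy outcomes (made up) over the three task categories named in the abstract.
toy = [
    ("generation", True),
    ("generation", False),
    ("debugging", False),
    ("visualization", True),
]
overall, per_category = task_accuracy(toy)
print(overall, per_category)

# Under the plain-fraction assumption, 112 correct tasks out of 321
# would round to the reported 34.89%.
print(round(100.0 * 112 / 321, 2))
```

Under this assumption, the best model's 34.89% would correspond to roughly 112 of the 321 tasks; the actual count depends on how the authors define and aggregate task accuracy.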