ParseBench: A Document Parsing Benchmark for AI Agents
2026-04-09 • Computer Vision and Pattern Recognition
AI summary
The authors point out that for AI to work well with documents in businesses, it’s important to keep the meaning and structure exactly right, like tables and charts being understood correctly. They created a new test called ParseBench, which uses about 2,000 real pages from industries like finance and insurance to check AI tools on five key skills. When they tested 14 different AI methods, no single one did well on everything, though one called LlamaParse Agentic scored the best overall. Their work helps show where current AI still struggles with understanding complex documents.
document parsing, semantic correctness, tables, charts, content faithfulness, semantic formatting, visual grounding, enterprise automation, vision-language models, LlamaParse
Authors
Boyang Zhang, Sebastián G. Acosta, Preston Carlson, Sacha Bron, Pierre-Loïc Doulcet, Simon Suo
Abstract
AI agents are changing the requirements for document parsing. What matters is semantic correctness: parsed output must preserve the structure and meaning needed for autonomous decisions, including correct table structure, precise chart data, semantically meaningful formatting, and visual grounding. Existing benchmarks do not fully capture this setting for enterprise automation, relying on narrow document distributions and text-similarity metrics that miss agent-critical failures. We introduce ParseBench, a benchmark of ~2,000 human-verified pages from enterprise documents spanning insurance, finance, and government, organized around five capability dimensions: tables, charts, content faithfulness, semantic formatting, and visual grounding. Across 14 methods spanning vision-language models, specialized document parsers, and LlamaParse, the benchmark reveals a fragmented capability landscape: no method is consistently strong across all five dimensions. LlamaParse Agentic achieves the highest overall score at \agenticoverall%, and the benchmark highlights the remaining capability gaps across current systems. Dataset and evaluation code are available on HuggingFace (https://huggingface.co/datasets/llamaindex/ParseBench) and GitHub (https://github.com/run-llama/ParseBench).