POTATR: A Lightweight Image-to-Graph Model for Page-Level Table Extraction

2026-06-08

Computer Vision and Pattern Recognition
AI summary

The authors developed a new model called POTATR that extracts tables from document page images quickly and accurately. Unlike larger models that are slow and costly to run, POTATR has only 29M parameters, yet it outperforms every model the authors tested on the PubTables-v2 Single Pages benchmark. It also marks where each recognized table element sits on the page with a bounding box, which makes results easy to check visually and lets text be assigned to cells by position. Because of this, POTATR can be combined with other tools, such as external OCR for scanned documents and cross-page merging for full-document extraction.

table extraction, image-to-graph model, Table Transformer, PubTables-v2 benchmark, bounding box, OCR, cross-page merging, MLLMs
Authors
Brandon Smock, Libin Liang, Max Sokolov, Amrit Ramesh, Valerie Faucon-Morin, Tayyibah Khanam, Maury Courtland
Abstract
Large-scale document processing requires contextually aware table extraction (TE) that is both accurate and efficient. Yet current approaches require billions of parameters, hundreds of autoregressive steps, or costly API inference. Motivated by this, we introduce the Page-Object Table Transformer (POTATR), a lightweight 29M-parameter image-to-graph model that extends the Table Transformer (TATR) for contextualized page-level TE. POTATR outperforms all models tested on the PubTables-v2 Single Pages benchmark -- including frontier MLLMs -- achieving a $\textrm{GriTS}_{\textrm{Con}}$ of 0.964 while running over 130$\times$ faster at roughly 300$\times$ lower cost. Further, POTATR's output is spatially grounded: every recognized element has a bounding box, enabling visual verification and geometric text assignment. As a result, POTATR performs unified page-level TE while composing with other models, enabling extension to scanned documents via external OCR and to full-document TE via techniques like cross-page merging. Code and models will be released.
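
To make "geometric text assignment" concrete, the following is a minimal sketch of one common way to do it: each word box (for example, from an external OCR engine or a PDF text layer) is assigned to the cell box that covers the largest fraction of the word's area. This is not taken from the POTATR code, which is not yet released. The function names, the `min_overlap` threshold, the cell identifiers, and the `(x_min, y_min, x_max, y_max)` box convention are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in page pixels.
Box = Tuple[float, float, float, float]
Word = Tuple[str, Box]


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, width) * max(0.0, height)


def assign_words_to_cells(
    words: List[Word],
    cells: Dict[str, Box],
    min_overlap: float = 0.5,  # hypothetical threshold, not from the paper
) -> Dict[str, str]:
    """Assign each word to the cell covering most of the word's area.

    Words whose best cell covers less than `min_overlap` of their area
    are treated as lying outside the table and dropped.
    """
    assigned: Dict[str, List[Word]] = {cell_id: [] for cell_id in cells}
    for text, word_box in words:
        area = box_area(word_box)
        if area == 0.0:
            continue  # skip degenerate word boxes
        best_id, best_fraction = None, 0.0
        for cell_id, cell_box in cells.items():
            fraction = intersection_area(word_box, cell_box) / area
            if fraction > best_fraction:
                best_id, best_fraction = cell_id, fraction
        if best_id is not None and best_fraction >= min_overlap:
            assigned[best_id].append((text, word_box))

    # Simplified reading order: sort by top edge, then left edge.
    return {
        cell_id: " ".join(
            text for text, _ in sorted(ws, key=lambda w: (w[1][1], w[1][0]))
        )
        for cell_id, ws in assigned.items()
    }


# Toy example with two hypothetical cells and four word boxes.
cells = {"r0c0": (0, 0, 50, 20), "r0c1": (50, 0, 100, 20)}
words = [
    ("(mg)", (32, 5, 48, 15)),
    ("Dose", (5, 5, 30, 15)),
    ("12.5", (60, 5, 80, 15)),
    ("stray", (200, 200, 220, 210)),  # outside every cell, dropped
]
cell_text = assign_words_to_cells(words, cells)
print(cell_text)
```

Because the assignment depends only on box geometry, the same routine works whether the words come from a born-digital PDF or from OCR on a scan. A production version would likely need a more robust reading-order sort for multi-line cells.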