Beyond IID: How General Are Tabular Foundation Models, Really?
2026-06-29 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors highlight that existing tests for machine learning models working with table-shaped data are mostly focused on easy cases, which limits progress. They created BeyondArena, a new, more complete test set that includes a variety of hard tasks involving different types of data and challenges. They also made Data Foundry, a tool to help organize and prepare these datasets for testing. Their experiments show that current foundation models do well on small, simple data, but older methods perform better on bigger, more complex problems. This new benchmark aims to push research toward better models that can handle a wide range of real-world tabular data challenges.
foundation models, tabular data, benchmark, IID data, non-IID data, tree-based models, deep learning, feature dimensionality, temporal data, high cardinality
Authors
Lennart Purucker, Andrej Tschalzev, Nick Erickson, Gioia Blayer, David Holzmüller, Alan Arazi, Alexander Pfefferle, Mustafa Tajjar, Gaël Varoquaux, Frank Hutter
Abstract
Foundation models for predictive machine learning on tabular data have recently gained significant traction in academia and industry. Research communities across disciplines are increasingly evaluating tabular foundation models on diverse datasets and tasks. However, these task- and discipline-specific evaluations remain largely inaccessible to model researchers because benchmark software and evaluation protocols are fragmented. As a result, model researchers rely on standard benchmarks, which are mostly defined for tasks where tabular foundation models already excel. The most challenging scenarios are excluded, limiting meaningful progress in the field by focusing on marginal improvements on IID data rather than on broader, more demanding challenges. To overcome this, we introduce BeyondArena, the first unified holistic benchmark for tabular data that supports diverse task types (IID, temporal, grouped), across sample size and feature dimensionality scales, with diverse feature types (with text, with high cardinality) from a broad range of disciplines. To enable unified benchmarking beyond standard benchmarks, we introduce Data Foundry, a Python framework and metadata schema for curating tabular datasets for predictive machine learning. Our results across 11 models and 142 curated datasets show that existing tabular foundation models excel on tiny- to medium-sized IID data, while traditional tree-based and deep learning models still dominate on non-IID, large, and high-dimensional datasets. BeyondArena guides model research for the most demanding challenges in tabular data, enabling progress towards truly foundational tabular models.
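To make the three task types named in the abstract (IID, temporal, grouped) concrete, the following is a minimal sketch of how train/test splits typically differ between them. It is an illustration only, not BeyondArena's or Data Foundry's actual protocol; the function names `iid_split`, `temporal_split`, and `grouped_split`, the default `test_fraction`, and the toy data are all hypothetical.

```python
import random


def iid_split(n_rows, test_fraction=0.25, seed=0):
    """IID: shuffle rows, so any row may land in the test set."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(n_rows * test_fraction))
    return sorted(indices[n_test:]), sorted(indices[:n_test])


def temporal_split(timestamps, test_fraction=0.25):
    """Temporal: train on the earliest rows, test on the latest rows."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    n_test = max(1, int(len(order) * test_fraction))
    return order[:-n_test], order[-n_test:]


def grouped_split(groups, test_fraction=0.25, seed=0):
    """Grouped: hold out whole groups, so no group spans train and test."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test = max(1, int(len(unique) * test_fraction))
    held_out = set(unique[:n_test])
    train = [i for i, g in enumerate(groups) if g not in held_out]
    test = [i for i, g in enumerate(groups) if g in held_out]
    return train, test


# Toy data (hypothetical): eight rows with a timestamp and a group label each.
timestamps = [3, 1, 4, 1, 5, 9, 2, 6]
groups = ["a", "a", "b", "b", "c", "c", "d", "d"]

iid_train, iid_test = iid_split(len(timestamps))
time_train, time_test = temporal_split(timestamps)
group_train, group_test = grouped_split(groups)
```

In the IID case, test rows are drawn from the same shuffled pool as training rows. The temporal and grouped splits instead force the model to extrapolate to later time points or to unseen groups, which is one plausible reading of why the abstract treats non-IID tasks as harder.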