AI summary
The authors studied how well recent tabular foundation models can estimate the full conditional distribution of an outcome (its conditional density) from tabular inputs, especially when the uncertainty varies across inputs or has a complex shape. They compared these models to traditional parametric, tree-based, and neural methods on 39 real datasets across a range of training sizes, checking density accuracy, calibration, and speed. They found that the foundation models generally gave the best distributional estimates, though task-specific neural methods were sometimes better calibrated when much more data was available. In an astronomy case study, one foundation model given far less training data still outperformed every baseline trained on the full dataset. Overall, the authors show that these new models are strong, ready-to-use tools for conditional density estimation.
Conditional Density Estimation, Tabular Foundation Models, TabPFN, TabICL, Heteroscedasticity, Calibration, Log-likelihood, CRPS, Photometric Redshift, SDSS DR18
Abstract
Conditional density estimation (CDE), recovering the full conditional distribution of a response given tabular covariates, is essential in settings with heteroscedasticity, multimodality, or asymmetric uncertainty. Recent tabular foundation models, such as TabPFN and TabICL, naturally produce predictive distributions, but, unlike their well-studied point-prediction performance, their effectiveness as general-purpose CDE methods has not been systematically evaluated. We benchmark three tabular foundation model variants against a diverse set of parametric, tree-based, and neural CDE baselines on 39 real-world datasets, across training set sizes from 50 to 20,000, using six metrics covering density accuracy, calibration, and computation time. Across all sample sizes, foundation models achieve the best CDE loss, log-likelihood, and CRPS on the large majority of datasets tested. Calibration is competitive at small sample sizes but, for some metrics and datasets, lags behind task-specific neural baselines at larger sample sizes, suggesting that post-hoc recalibration may be a valuable complement. In a photometric redshift case study using SDSS DR18, TabPFN given only 50,000 training galaxies outperforms all baselines trained on the full 500,000-galaxy dataset. Taken together, these results establish tabular foundation models as strong off-the-shelf conditional density estimators.
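To make the kind of evaluation described above concrete, the sketch below (not from the paper; the toy data, the Gaussian stand-in predictor, and all variable names are illustrative assumptions) shows how a conditional density estimator can be scored on held-out data with two of the metrics named in the abstract, negative log-likelihood and a sample-based CRPS estimate, plus a PIT-based calibration check.

```python
# Minimal sketch (not the paper's benchmark code): scoring a conditional density
# estimator with negative log-likelihood, sample-based CRPS, and PIT calibration.
# The "estimator" here is an oracle Gaussian stand-in used only to exercise the
# metrics; a real run would plug in TabPFN/TabICL or any baseline that yields
# predictive samples or a CDF per test point.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy heteroscedastic data: y | x ~ N(sin(3x), (0.1 + 0.3|x|)^2)
def sample_data(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + (0.1 + 0.3 * np.abs(x)) * rng.normal(size=n)
    return x, y

x_test, y_test = sample_data(500)

# Stand-in predictive distribution per test point (replace with the model under test).
mu = np.sin(3 * x_test)
sigma = 0.1 + 0.3 * np.abs(x_test)

# 1) Average negative log-likelihood on the test set.
nll = -np.mean(stats.norm.logpdf(y_test, loc=mu, scale=sigma))

# 2) CRPS via the sample-based estimator CRPS(F, y) = E|S - y| - 0.5 E|S - S'|,
#    using Monte Carlo draws S, S' from each predictive distribution.
m = 100
samples = rng.normal(loc=mu[:, None], scale=sigma[:, None], size=(len(y_test), m))
term1 = np.mean(np.abs(samples - y_test[:, None]), axis=1)
term2 = np.mean(np.abs(samples[:, :, None] - samples[:, None, :]), axis=(1, 2))
crps = np.mean(term1 - 0.5 * term2)

# 3) Probability integral transform: PIT = F(y | x) should be ~Uniform(0, 1)
#    for a well-calibrated estimator; a KS test against uniformity flags drift.
pit = stats.norm.cdf(y_test, loc=mu, scale=sigma)
ks_stat, ks_pval = stats.kstest(pit, "uniform")

print(f"NLL: {nll:.3f}  CRPS: {crps:.3f}  PIT KS p-value: {ks_pval:.3f}")
```

The sample-based CRPS form is used here because it only requires draws from the predictive distribution, so the same scoring code applies to any estimator in a benchmark regardless of whether it exposes an analytic density.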