Active Tabular Augmentation via Policy-Guided Diffusion Inpainting
2026-05-11 • Machine Learning
Machine Learning · Artificial Intelligence
AI summary
The authors study how to generate synthetic data to help train machine learning models when little real data is available. They find that synthetic data which merely looks realistic does not always improve the downstream model. They therefore develop a method called TAP that decides not only how to generate data but also what kind of synthetic data to add and when to inject it, optimizing directly for the model's held-out performance. Across seven real-world datasets, TAP outperforms previous methods, especially when data is very limited.
Generative augmentation · Tabular data · Data scarcity · Diffusion inpainting · Machine learning · Classification accuracy · Regression RMSE · Fidelity-utility gap · Policy learning
Authors
Zheyu Zhang, Shuo Yang, Bardh Prenkaj, Gjergji Kasneci
Abstract
Generative tabular augmentation is appealing in data-scarce domains, yet the prevailing focus on distributional fidelity does not reliably translate into better downstream models. We formalize a fidelity-utility gap: common generative objectives prioritize distributional plausibility, whereas augmentation succeeds only when injected samples reduce the current learner's held-out evaluation loss. This gap motivates learning not just how to generate, but what to generate and when to inject as training evolves. We propose TAP (Tabular Augmentation Policy), which couples diffusion inpainting with a lightweight, learner-conditioned policy to steer generation toward high-utility regions and controls safe injection via explicit gating and conservative windowed commitment. Under severe data scarcity, TAP consistently outperforms strong generative baselines on seven real-world datasets, improving classification accuracy by up to 15.6 percentage points and reducing regression RMSE by up to 32%.
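The gating and windowed-commitment idea in the abstract can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the paper's implementation: the diffusion-inpainting generator is replaced by a hypothetical jitter-based stub, the learner by a simple nearest-centroid classifier, and the learner-conditioned policy is omitted. What it does show is the core utility criterion: candidate batches are only committed, one window at a time, when they do not increase the current learner's held-out evaluation loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic task standing in for a scarce tabular dataset (hypothetical).
X_train = rng.normal(size=(40, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_val = rng.normal(size=(200, 5))
y_val = (X_val[:, 0] + X_val[:, 1] > 0).astype(int)

def held_out_error(X, y):
    """Refit a stand-in learner (nearest class centroid) and return its
    held-out 0-1 error -- the utility signal the gate consults."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    pred = (np.linalg.norm(X_val - mu1, axis=1)
            < np.linalg.norm(X_val - mu0, axis=1)).astype(int)
    return float(np.mean(pred != y_val))

def generate_candidates(X, y, n):
    """Stand-in for diffusion inpainting: jitter real rows so candidates stay
    near the data manifold. A learner-conditioned policy would steer this."""
    idx = rng.integers(0, len(X), size=n)
    return X[idx] + 0.1 * rng.normal(size=(n, X.shape[1])), y[idx]

X_aug, y_aug = X_train, y_train
best_err = held_out_error(X_aug, y_aug)
for window in range(5):  # windowed commitment: evaluate one batch at a time
    Xc, yc = generate_candidates(X_aug, y_aug, 10)
    X_try = np.vstack([X_aug, Xc])
    y_try = np.concatenate([y_aug, yc])
    err = held_out_error(X_try, y_try)
    if err <= best_err:  # gate: commit only if held-out error does not rise
        X_aug, y_aug, best_err = X_try, y_try, err
print(f"augmented size: {len(X_aug)}, held-out error: {best_err:.3f}")
```

The gate makes the augmentation conservative by construction: a batch that merely looks plausible but hurts the held-out metric is discarded, which is exactly the fidelity-utility distinction the abstract draws.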