BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

2026-06-08 · Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors address the problem of generating tabular data when there are many features but very few samples, which is common in fields like omics. They introduce BSTabDiff, a method that partitions the features into a small number of blocks, models the dependencies between blocks in a compact latent space, and then decodes back to the full feature space. The decoder explicitly accounts for non-Gaussian marginals, noisy measurements, and missing data. Their experiments indicate that BSTabDiff produces more realistic and stable synthetic datasets than unstructured tabular generators in this regime.

High-Dimensional Low-Sample Size (HDLSS), Tabular Data, Generative Models, Latent Variables, Copulas, Diffusion Models, Normalizing Flows, Missing Data, Heteroscedastic Noise
Authors
Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Gyawali, Gianfranco Doretto, Donald A. Adjeroh
Abstract
High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ is the number of samples and $m$ is the number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable. This concentrates global dependence learning in the compact block-latent space $\mathbb{R}^M$, while decoding to the full feature space uses copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion models and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data than unstructured tabular generators on HDLSS data.
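
The block-subunit structure described in the abstract can be illustrated with a toy sampler. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sample_block_subunit`, the contiguous block assignment, the one-factor Gaussian copula with correlation `rho`, the Laplace marginals, the noise model, and the block-level missingness rule are all hypothetical choices made for illustration. A standard Gaussian stands in for the learned diffusion or flow prior on the block latents.

```python
import math
import numpy as np

def sample_block_subunit(n, m, M, rho=0.8, miss_rate=0.1, seed=0):
    """Toy block-subunit sampler: n rows, m features, M latent blocks (M << m)."""
    rng = np.random.default_rng(seed)
    # Assumption: contiguous blocks of near-equal size; block_of[j] is feature j's block.
    block_of = np.sort(np.arange(m) % M)
    # Block latents in R^M. Assumption: a standard Gaussian replaces the
    # learned diffusion or normalizing-flow prior.
    z = rng.standard_normal((n, M))
    # One-factor Gaussian copula per block: every feature in block b shares z[:, b],
    # so within-block correlation is rho and cross-block correlation is zero.
    eps = rng.standard_normal((n, m))
    g = math.sqrt(rho) * z[:, block_of] + math.sqrt(1.0 - rho) * eps
    # Map to uniforms with the standard normal CDF (g has unit variance).
    erf = np.vectorize(math.erf)
    u = 0.5 * (1.0 + erf(g / math.sqrt(2.0)))
    u = np.clip(u, 1e-12, 1.0 - 1e-12)
    # Heavier-than-Gaussian marginals: Laplace inverse CDF with per-feature scales.
    scale = rng.uniform(0.5, 2.0, size=m)
    x = -scale * np.sign(u - 0.5) * np.log1p(-2.0 * np.abs(u - 0.5))
    # Heteroscedastic noise: the noise standard deviation grows with |x|.
    x = x + 0.1 * (1.0 + np.abs(x)) * rng.standard_normal((n, m))
    # Structured missingness: whole blocks are dropped per row with prob. miss_rate.
    mask = rng.random((n, M)) < miss_rate
    x[mask[:, block_of]] = np.nan
    return x, block_of

# HDLSS-style call: far fewer samples than features.
x, block_of = sample_block_subunit(n=40, m=2000, M=20)
```

In this sketch the only object that carries global dependence is `z`, which has $M$ columns rather than $m$. Replacing the Gaussian draw of `z` with samples from a model fitted in $\mathbb{R}^M$ is where a diffusion or flow prior would plug in, which is the dimensionality reduction the abstract relies on.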