Statistical Matching via Schrödinger Bridge beyond Conditional Independence

2026-06-22 · Machine Learning

AI summary

The authors address the problem of combining two datasets that share some variables but each measure other important variables separately. Traditional methods assume that, once the shared variables are known, the auxiliary data adds no new information about the target variable. The authors propose a dependency-aware Schrödinger bridge that captures the relationship between the target and auxiliary variables that this assumption ignores. Their approach improves prediction accuracy and gives probabilistic rules for imputing the missing variables in both datasets. They test the method on synthetic and real-world data (CelebA and Adult) and report better results, especially when the target and auxiliary variables are strongly dependent.

Statistical matching, Conditional independence assumption, Schrödinger bridge, Data imputation, Probabilistic modeling, Joint distribution, Gaussian distribution, Transportation cost, Predictive utility, Data recoding
Authors
Eunho Koo, Tongseok Lim, Jinwon Sohn
Abstract
Statistical matching combines partially overlapping datasets that share covariates $X$ but observe the target $Y$ and auxiliary variables $Z$ separately. Classical approaches typically invoke the conditional independence assumption (CIA), which makes the problem identifiable but implies that the imported auxiliary variables provide no additional predictive power for $Y$ once $X$ is known. To capture the latent $Y$--$Z$ dependence that the CIA rules out, we propose a novel dependency-aware Schrödinger bridge for predictive statistical matching. Our approach couples the two separate datasets by tilting the conservative CIA baseline with a transportation-based compatibility cost, recovering an informative joint distribution. The resulting statistical learning framework yields full probabilistic posterior rules for bidirectional imputation. Theoretically, we establish a sufficient condition under which the learned bridge strictly improves over the CIA baseline, alongside an exact joint recovery guarantee in the Gaussian setting under an appropriate cost. Across synthetic benchmarks and real-world datasets (CelebA and Adult), we demonstrate that our dependency-aware completion consistently improves downstream predictive utility, proving especially beneficial in settings such as data recoding, where the underlying population exhibits strong $Y$--$Z$ dependence.
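
To make the idea of "tilting the CIA baseline with a compatibility cost" concrete, here is a minimal sketch of a generic static Schrödinger bridge on a finite grid, solved with Sinkhorn iterations. It is not the authors' algorithm. It assumes a single fixed value of $X$, discrete supports for $Y$ and $Z$, and known conditional marginals; the function name `tilted_coupling`, the cost matrix, and the temperature `eps` are all hypothetical choices made for illustration.

```python
import numpy as np

def tilted_coupling(p_y, p_z, cost, eps=0.5, n_iter=500):
    """Sinkhorn iterations for a static Schrodinger bridge (illustrative only).

    p_y, p_z : marginals of Y and Z given one fixed x (assumed known here).
    cost     : hypothetical compatibility cost c(y, z); low cost = compatible.
    eps      : hypothetical temperature controlling how strongly the cost tilts.
    """
    # Reference measure: the CIA product coupling, tilted by exp(-cost / eps).
    kernel = np.outer(p_y, p_z) * np.exp(-cost / eps)
    u = np.ones_like(p_y)
    v = np.ones_like(p_z)
    for _ in range(n_iter):
        # Alternately rescale rows and columns to match the two marginals.
        u = p_y / (kernel @ v)
        v = p_z / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

# Toy example: binary Y and Z, with a cost that favours y == z.
p_y = np.array([0.5, 0.5])
p_z = np.array([0.5, 0.5])
cost = np.array([[0.0, 1.0],
                 [1.0, 0.0]])

cia = np.outer(p_y, p_z)                  # conditional independence baseline
bridge = tilted_coupling(p_y, p_z, cost)  # dependency-aware coupling
```

In this toy case the bridge keeps both marginals fixed but moves mass toward the low-cost pairs, so $Y$ and $Z$ become dependent given $X$; with a zero cost the result collapses back to the CIA product coupling.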