Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records

2026-06-08

Machine Learning
AI summary

The authors examine synthetic patient data, which is often proposed as a way to protect privacy, and argue that it is hard to check whether such data behaves like real data in medical research. They propose a way to test synthetic data that looks at how well it describes patients, supports predictions, and preserves cause-and-effect relationships. They tested four generative methods on a heart disease cohort of 50,000 people (PRIME-CVD) and found that while all methods reproduced basic data patterns, none captured every important medical detail correctly. This suggests that current ways of judging synthetic health data may be too simple, and that tests should focus on whether the data can support valid medical research.

synthetic data, privacy-preserving, generative models, GAN, VAE, diffusion models, masked modeling, clinical validity, epidemiology, data evaluation
Authors
Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm
Abstract
Synthetic healthcare data are widely proposed as privacy-preserving substitutes for real patient data, yet their evaluation remains dominated by statistical similarity and predictive performance that do not reflect clinical validity. We introduce a multi-dimensional evaluation framework grounded in epidemiology, assessing descriptive fidelity, clinical utility, and structural validity, corresponding to descriptive, predictive, and causal questions. We evaluate four representative generative paradigms - GAN-based, VAE-boosted, diffusion-based, and masked modelling - using PRIME-CVD, a 50,000-person cohort with known ground-truth structure. While all models reproduce marginal distributions, none simultaneously preserve subgroup structure, effect estimates, and dependency structure. Notably, models with strong distributional fidelity can exhibit poor calibration and distorted relationships, leading to unreliable inference. These results show that current evaluation practices can overestimate synthetic data quality and motivate domain-informed assessment based on the ability to support valid clinical and scientific conclusions.
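
The gap between matching marginal distributions and preserving dependency structure can be illustrated with a toy example. The sketch below is not the authors' framework and does not use PRIME-CVD: the variables `age` and `sbp`, their parameters, and the helper functions are hypothetical, chosen only to show that a "synthetic" table can reproduce every marginal exactly while losing the relationship between variables.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical "real" cohort: two correlated variables (illustrative values only).
age = rng.normal(60.0, 10.0, size=n)
sbp = 100.0 + 0.5 * age + rng.normal(0.0, 8.0, size=n)
real = np.column_stack([age, sbp])

# Naive "synthetic" cohort: permute each column independently.
# Each column keeps exactly the same values, so the marginals are identical,
# but the pairing between age and sbp is broken.
synthetic = np.column_stack(
    [rng.permutation(real[:, j]) for j in range(real.shape[1])]
)


def max_marginal_gap(a, b):
    """Largest absolute difference between sorted columns (0 means identical marginals)."""
    return float(np.max(np.abs(np.sort(a, axis=0) - np.sort(b, axis=0))))


def correlation(x):
    """Pearson correlation between the first two columns."""
    return float(np.corrcoef(x[:, 0], x[:, 1])[0, 1])


marginal_gap = max_marginal_gap(real, synthetic)
real_corr = correlation(real)
synth_corr = correlation(synthetic)

print(f"max marginal gap:      {marginal_gap:.3f}")
print(f"correlation (real):    {real_corr:.3f}")
print(f"correlation (shuffled): {synth_corr:.3f}")
```

A check based only on per-variable similarity would rate the shuffled table as perfect, while any analysis relying on the age and blood pressure relationship would give a very different answer on the shuffled data than on the original.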