Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

2026-06-02 · Artificial Intelligence

Artificial Intelligence, Computer Vision and Pattern Recognition, Machine Learning
AI summary

The authors explore whether natural experiments, events that affect only some individuals or groups (for example, the COVID-19 pandemic), exist within real-world datasets. They use causal discovery to recover cause-and-effect relationships and test whether treating the data as interventional rather than observational improves downstream model performance. Testing on both simulated and real data, they find evidence that such natural experiments do exist in real datasets. Their work suggests that recognizing these experiments can help improve predictive models using causal methods. This is an early study in this area with preliminary findings.

Natural experiments, Causal discovery, Causal graph, Feature selection, Interventional data, Observational data, Synthetic graphs, Causal inference, Model performance
Authors
Gautam Gare, John Galeotti, Michael Mozer, Deva Ramanan, Nan Rosemary Ke
Abstract
In nature, events that affect some individuals or groups but not others constitute implicit interventions and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask: do natural experiments occur in existing real-world datasets? If so, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and that we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents an initial foray into this area, offering a preliminary exploration within a limited scope.
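
To make the simulation step in the abstract concrete, here is a minimal sketch of generating data with and without a natural experiment. It is not the authors' setup. The three-variable chain X1 -> X2 -> Y, the linear coefficients, the noise scales, and the names `simulate` and `pearson` are all assumptions chosen for illustration. A natural experiment is modeled as an outside event that sets X2 for a random subset of samples, which cuts the X1 -> X2 edge for those samples only.

```python
import math
import random


def simulate(n, intervened_fraction=0.0, seed=0):
    """Sample n rows from the hypothetical chain X1 -> X2 -> Y.

    Returns a list of (x1, x2, y, intervened) tuples. With
    intervened_fraction=0.0 the data are purely observational.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        intervened = rng.random() < intervened_fraction
        if intervened:
            # Natural experiment (assumed form): X2 is set from outside,
            # so it no longer depends on its usual cause X1.
            x2 = rng.gauss(3.0, 1.0)
        else:
            x2 = 2.0 * x1 + rng.gauss(0.0, 0.5)
        # The downstream mechanism X2 -> Y is unchanged by the intervention.
        y = -1.5 * x2 + rng.gauss(0.0, 0.5)
        rows.append((x1, x2, y, intervened))
    return rows


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


mixed = simulate(5000, intervened_fraction=0.3, seed=0)
treated = [r for r in mixed if r[3]]
control = [r for r in mixed if not r[3]]

# X1 and X2 are strongly correlated only where no intervention occurred.
print("control corr(X1, X2):", pearson([r[0] for r in control], [r[1] for r in control]))
print("treated corr(X1, X2):", pearson([r[0] for r in treated], [r[1] for r in treated]))
# The X2 -> Y dependence survives the intervention.
print("treated corr(X2, Y): ", pearson([r[1] for r in treated], [r[2] for r in treated]))
```

In this toy setting the intervened subset breaks the dependence between X2 and its parent while leaving the dependence of Y on X2 intact. That asymmetry is the kind of signal interventional data can offer a causal discovery method that observational data alone cannot.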