Reducing cross-sample prediction churn in scientific machine learning
2026-05-13 • Machine Learning
AI summary
The authors studied how consistent machine learning models are when trained on slightly different samples of the same data in chemistry tasks. They found that even when overall accuracy is similar, the models often disagree on the class labels of individual molecules, a problem they call "cross-sample prediction churn." They showed that common methods that act on model parameters (deep ensembles, MC dropout, stochastic weight averaging) do not fix this inconsistency, but data-focused approaches like K-bootstrap bagging and their new twin-bootstrap method reduce it significantly without losing accuracy. They suggest reporting this churn measure alongside accuracy to better understand model reliability.
scientific machine learning, cross-sample prediction churn, bootstrap, bagging, deep ensembles, MC dropout, stochastic weight averaging, symmetrized KL divergence, predictive consistency, chemistry benchmarks
Authors
Gordan Prastalo, Kevin Maik Jablonka
Abstract
Scientific machine learning reports predictive performance. It does not report whether the same prediction would survive a different draw of training data. Across $9$ chemistry benchmarks, two classifiers trained on independent bootstraps of the same training set agree on aggregate accuracy to within $1.3\text{--}4.2$ percentage points but disagree on the class label of $8.0\text{--}21.8\%$ of test molecules. We call this gap \emph{cross-sample prediction churn}. The standard parameter-side techniques (deep ensembles, MC dropout, stochastic weight averaging) do not reduce this gap; two data-side methods do. The first is $K$-bootstrap bagging, which cuts the churn rate by $40\text{--}54\%$ on every dataset at no accuracy cost ($K{\times}$-ERM compute). The second is \emph{twin-bootstrap}, our proposal: two networks trained jointly on independent bootstraps with a symmetrized KL (sym-KL) consistency loss between their predictions, which at matched $2{\times}$-ERM compute reduces churn by a further median $45\%$ beyond bagging-$K{=}2$. Cross-sample prediction churn deserves a column alongside predictive performance in scientific-ML benchmark reports, because without it the parameter-side and data-side methods are indistinguishable on the metric they actually differ on.
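To make the quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes churn is the fraction of test items on which two models predict different class labels, that bagging averages class probabilities before taking the argmax, and that the sym-KL term is the unweighted sum $\mathrm{KL}(p\,\|\,q) + \mathrm{KL}(q\,\|\,p)$ averaged over examples. The abstract does not specify any of these details, and all function names here are hypothetical.

```python
import numpy as np


def bootstrap_indices(n, rng):
    """Draw n training indices with replacement (one bootstrap resample)."""
    return rng.integers(0, n, size=n)


def churn_rate(labels_a, labels_b):
    """Fraction of test items whose predicted class differs between two models."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(a != b))


def bagged_labels(prob_list):
    """Assumed bagging rule: average class probabilities over K models, then argmax."""
    mean_probs = np.mean(np.stack([np.asarray(p, dtype=float) for p in prob_list]), axis=0)
    return mean_probs.argmax(axis=1)


def sym_kl(p, q, eps=1e-12):
    """Mean over examples of KL(p||q) + KL(q||p) for rows of class probabilities."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    # KL(p||q) + KL(q||p) simplifies to sum_c (p_c - q_c) * (log p_c - log q_c).
    per_example = np.sum((p - q) * (np.log(p) - np.log(q)), axis=1)
    return float(np.mean(per_example))


# Toy illustration (made-up labels, not data from the paper):
# both models score 4/5 against y_true, yet they disagree on 2 of 5 items.
y_true = np.array([0, 1, 1, 0, 0])
pred_a = np.array([0, 1, 1, 0, 1])
pred_b = np.array([0, 1, 0, 0, 0])
acc_a = float(np.mean(pred_a == y_true))
acc_b = float(np.mean(pred_b == y_true))
toy_churn = churn_rate(pred_a, pred_b)
```

In the toy example the two prediction vectors have identical accuracy but nonzero churn, which is the gap the abstract describes. In a twin-bootstrap-style setup, a term like `sym_kl` would be added to the two networks' supervised losses with some weight; that weighting is likewise an assumption here.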