RobustModelMaker: Coupling Bootstrap Stability Selection with Leakage-Safe Nested Cross-Validation for Scientific Machine Learning
2026-06-01 • Machine Learning
AI summary
The authors address two problems in machine learning on small-to-medium scientific datasets: selected features can change substantially under small changes to the data, and performance scores are optimistically biased when the same data are reused for selection, tuning, and evaluation. They introduce RobustModelMaker, a Python framework that combines bootstrap stability selection with nested cross-validation to produce more reliable feature sets and leakage-safe performance estimates. Evaluations on three real datasets show that the approach is competitive in predictive score with the best alternative selector while offering a combination of score and stability that the alternatives do not match across all task types. Two applications, ovarian cancer biomarker discovery and superconductor critical-temperature prediction, illustrate the trade-offs that appear when stability is treated as a primary goal.
feature selection, nested cross-validation, bootstrap stability, performance bias, machine learning pipeline, classification, regression, stability measure, Jaccard index, biomarker discovery
Authors
Amanda S Barnard
Abstract
Small-to-medium scientific datasets place machine learning pipelines under two compounding pressures. Single-run feature selection produces feature sets that change substantially under small perturbations of the training data, and any procedure that uses the same data for selection, tuning, and evaluation produces optimistically biased performance estimates. The two failure modes are routinely treated as separable, but in the regimes where scientific data live, they interact: an unstable selection inflates the variance of an already-optimistic score, and standard remedies for one rarely address the other. RobustModelMaker is a Python framework that couples bootstrap stability selection with strict nested cross-validation, performs all preprocessing and selection inside each fold, and produces a stability-tested feature subset together with a leakage-safe performance estimate. The framework supports nine algorithms across binary classification, multiclass classification, and regression. Behaviour is verified by a deterministic test suite spanning unit, performance, and reproducibility checks. On three real scientific datasets, the framework is compared with three alternative selectors (ANOVA F-test, recursive feature elimination with cross-validation, and Boruta) on both predictive score and a Jaccard measure of selection stability. RobustModelMaker is competitive in score with the best alternative selector on each dataset, and occupies a position on the joint score-stability frontier that none of the alternatives matches across all three task types. Two example applications, ovarian cancer biomarker discovery from the PLCO Trial and critical-temperature regression on the UCI Superconductivity Data, illustrate how the framework is used in practice and what trade-offs become visible when stability is treated as a first-class deliverable rather than an emergent property.
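
The two ideas named in the abstract, bootstrap stability selection and a Jaccard measure of selection stability, can be illustrated with a minimal standard-library sketch. This is not the RobustModelMaker API: every function name, the toy correlation-based selector, the synthetic data, and the 0.8 selection-frequency threshold are assumptions made for illustration. The sketch covers only the selection step, run on the training rows of each outer fold so that held-out rows never influence which features are kept; it omits the inner tuning loop and the scoring step of a full nested cross-validation.

```python
import math
import random
from collections import Counter
from itertools import combinations


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def mean_pairwise_jaccard(feature_sets):
    """Average Jaccard index over all pairs of selected feature sets."""
    pairs = list(combinations(feature_sets, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (v - my) for x, v in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((v - my) ** 2 for v in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def top_k_selector(X, y, k):
    """Toy base selector (assumption): keep the k features most correlated with y."""
    p = len(X[0])
    scores = [abs_pearson([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def bootstrap_stability_selection(X, y, k, n_boot, threshold, seed):
    """Keep features chosen in at least `threshold` of the bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    counts = Counter()
    selected_sets = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        chosen = top_k_selector([X[i] for i in idx], [y[i] for i in idx], k)
        selected_sets.append(chosen)
        counts.update(chosen)
    stable = {j for j, c in counts.items() if c / n_boot >= threshold}
    return stable, selected_sets


def outer_fold_selections(X, y, n_folds, **kwargs):
    """Run selection on each outer fold's training rows only (no test-row leakage)."""
    n = len(X)
    results = []
    for f in range(n_folds):
        train = [i for i in range(n) if i % n_folds != f]
        stable, _ = bootstrap_stability_selection(
            [X[i] for i in train], [y[i] for i in train], **kwargs)
        results.append(stable)
    return results


def make_toy_data(n, p, seed):
    """Synthetic regression data (assumption): only feature 0 drives the target."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] + rng.gauss(0, 0.1) for row in X]
    return X, y


X, y = make_toy_data(n=60, p=6, seed=0)
stable, boot_sets = bootstrap_stability_selection(
    X, y, k=2, n_boot=50, threshold=0.8, seed=1)
fold_sets = outer_fold_selections(
    X, y, n_folds=5, k=2, n_boot=50, threshold=0.8, seed=1)
print("stable features:", sorted(stable))
print("bootstrap stability:", round(mean_pairwise_jaccard(boot_sets), 3))
print("across-fold stability:", round(mean_pairwise_jaccard(fold_sets), 3))
```

In this toy setting the informative feature is selected in every resample, while the second slot of each top-2 set is filled by whichever noise feature happens to correlate best, so the pairwise Jaccard values fall below 1. That gap between a strong single-run selection and a reproducible one is the kind of trade-off the abstract describes.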