The Hidden Cost of Resampling: How Imbalance Correction Degrades Probability Calibration in Tree Ensembles
2026-06-29 • Machine Learning
Machine Learning, Computation and Language
AI summary
The authors studied how class-balancing resampling methods affect the reliability of predicted probabilities (calibration) in tree ensembles. They found that SMOTE, which creates synthetic minority-class examples, slightly worsens calibration but typically improves discrimination by enough to outweigh that cost, while random undersampling harms calibration more and more severely as class imbalance increases. They showed that a simple post-hoc recalibration step after resampling repairs this damage with a negligible loss in ranking power (AUC). They also explained why the analytic prior-shift correction that works for undersampling does not carry over to SMOTE, so data-driven recalibration remains necessary after these techniques.
SMOTE, random undersampling, class imbalance, probability calibration, expected calibration error (ECE), post-hoc recalibration, Platt scaling, isotonic regression, random forest, gradient boosting
Authors
Zewen Liu
Abstract
Resampling methods such as SMOTE and random under/over-sampling are standard tools for class-imbalanced classification, and they are almost always evaluated by minority-class accuracy or F1. Prior work has established that undersampling degrades probability calibration by distorting the training prior [1]. We extend this lens to synthetic oversampling (SMOTE) and provide a practical, evidence-based guide to when calibration damage matters and how to fix it. Across five public datasets (imbalance ratio, IR, 1.9-70) and two ensemble models (random forest, gradient boosting), with ten seeds and paired statistics, we find: (1) SMOTE's calibration cost is real but small (expected calibration error, ECE, +0.009; Cliff's delta = +0.27, small-to-moderate) across the studied imbalance range, and its discrimination gains typically outweigh the calibration penalty; (2) random undersampling is the genuine danger: its damage grows sharply with imbalance, inflating ECE from 0.008 to 0.395 on a dataset with IR 70, largely because the resulting training sets are too small to estimate probabilities reliably; (3) a single post-hoc recalibration step (Platt or isotonic) eliminates the damage, reducing ECE by up to 66% at a negligible ranking-power cost (AUC -0.002, Cliff's delta = -0.07); and (4) the analytic prior-shift correction that repairs undersampling does not transfer to SMOTE, because SMOTE distorts the class-conditional density rather than only the prior, so data-driven recalibration remains necessary. We recommend that imbalanced-learning studies report calibration alongside discrimination, and that practitioners recalibrate after resampling whenever predicted probabilities drive decisions.
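To make the prior-shift argument in the abstract concrete, the sketch below simulates an idealized undersampling scenario and measures ECE before and after an analytic correction. It is an illustration under stated assumptions, not the authors' code. The function names (`expected_calibration_error`, `undersampled_posterior`, `prior_shift_correction`), the keep rate `beta`, the equal-width 10-bin ECE, and the synthetic Beta-distributed probabilities are all hypothetical choices made here. The correction is the standard inversion for the case where each majority example is kept with probability `beta`, which may differ from the exact form used in the paper, and the simulation ignores the small-training-set effect the abstract also highlights.

```python
import random


def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE with equal-width bins: the weighted mean of
    |observed positive rate - mean predicted probability| per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_p = sum(p for p, _ in b) / len(b)
        pos_rate = sum(y for _, y in b) / len(b)
        ece += len(b) / n * abs(pos_rate - mean_p)
    return ece


def undersampled_posterior(p, beta):
    """Posterior an ideal model would learn if each majority (negative)
    example were kept with probability beta (assumed setup)."""
    return p / (p + beta * (1.0 - p))


def prior_shift_correction(p_s, beta):
    """Analytic inverse of undersampled_posterior: maps a probability
    learned on undersampled data back to the original class prior."""
    return beta * p_s / (beta * p_s - p_s + 1.0)


rng = random.Random(0)
beta = 0.05  # hypothetical keep rate for the majority class

# Synthetic "true" minority probabilities (mean about 0.05) and labels.
true_p = [rng.betavariate(1, 20) for _ in range(20000)]
labels = [1 if rng.random() < p else 0 for p in true_p]

# Scores from an idealized model trained after undersampling, then corrected.
biased = [undersampled_posterior(p, beta) for p in true_p]
corrected = [prior_shift_correction(q, beta) for q in biased]

ece_biased = expected_calibration_error(biased, labels)
ece_corrected = expected_calibration_error(corrected, labels)
print(f"ECE after undersampling: {ece_biased:.3f}")
print(f"ECE after correction:    {ece_corrected:.3f}")
```

The correction works here only because undersampling changes the class prior and nothing else. Per the abstract, SMOTE also alters the class-conditional density, so no single `beta` inverts its effect, and a data-driven recalibrator (Platt or isotonic) fitted on held-out data in the original class proportions is needed instead.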