Characterizing the Generalization Error of Random Feature Regression with Arbitrary Data-Augmentation
2026-05-11 • Machine Learning
AI summary
The authors study how data augmentation, the practice of creating more varied training data, improves supervised regression models when the number of features grows at the same rate as the number of samples. They derive exact formulas that predict the model's test error using only basic properties of the original data and the augmentation method. Their findings hold even when the model's feature extraction is imperfect, and they apply to networks where only the last layer is trained. They also specialize their results to Gaussian data and show that their asymptotic characterization is exact in that setting.
Data Augmentation · Supervised Regression · Proportional Regime · Test Error · Mean Squared Error · Feature Maps · Neural Network · Gaussian Data · Asymptotic Analysis · Regularization
Authors
Lucas Morisset, Alain Durmus, Adrien Hardy
Abstract
This paper analyzes the regularization effect that data augmentation induces on supervised regression methods in the proportional regime, where the number of covariates grows proportionally to the number of samples. We provide a tight characterization of the test error, measured in mean squared error, only in terms of population quantities of the true data and first- and second-order statistics of the augmentation scheme. Our results are valid under misspecified feature maps and for any network architecture in which only the last readout layer is trained, while the rest of the network is either frozen or randomly initialized. We specialize our results to the case of Gaussian data and show that our asymptotic characterization is tight in this setting.
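To make the setting concrete, the sketch below simulates the pipeline the abstract describes: random feature regression with a frozen first layer and a trained readout, fit on augmented copies of Gaussian data, with the test error measured in mean squared error. This is not the authors' code; the dimensions, the ReLU feature map, and the additive Gaussian noise augmentation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, p = 400, 200, 300     # samples, covariates, random features (all comparable: proportional regime)
k = 4                       # augmented copies per sample (assumed augmentation scheme)
sigma_aug = 0.3             # std of additive Gaussian augmentation noise
lam = 1e-2                  # ridge regularization

# Ground-truth linear target on Gaussian covariates
beta = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = X @ beta + 0.1 * rng.normal(size=n)

# Augmentation: k noisy copies of each sample, labels unchanged
X_aug = np.repeat(X, k, axis=0) + sigma_aug * rng.normal(size=(n * k, d))
y_aug = np.repeat(y, k)

# Random feature map: frozen, randomly initialized first layer
W = rng.normal(size=(d, p)) / np.sqrt(d)
phi = lambda Z: np.maximum(Z @ W, 0.0)   # ReLU features

# Only the readout layer is trained: ridge regression on augmented features
F = phi(X_aug)
a = np.linalg.solve(F.T @ F + lam * np.eye(p), F.T @ y_aug)

# Test error (mean squared error) on fresh Gaussian data
X_test = rng.normal(size=(2000, d))
y_test = X_test @ beta
mse = np.mean((phi(X_test) @ a - y_test) ** 2)
print(f"test MSE: {mse:.4f}")
```

In this picture, the paper's contribution is a formula predicting the printed test MSE directly from population statistics of the data and the first- and second-order statistics of the augmentation noise, without running the simulation.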