Target-Aware Linear Regression Under Distribution Shift

2026-06-22 • Machine Learning

AI summary

The authors study how to improve predictions when the data used to train a model differs from the data the model is applied to, in the case where population-level information about the new data (the marginal distributions of the covariates and the response) is known. They focus on multivariate linear regression and identify a benchmark method, the hybrid-loss estimator, that is expensive to compute. As alternatives, they develop two cheaper methods, a constrained moment-matching estimator and a two-stage estimator, and compare all three through asymptotic mean squared error analysis and Monte Carlo experiments. The results show when the simpler methods match the benchmark and when they do not, which guides the choice of estimator.

Keywords: distribution shift, multivariate linear regression, conditional mean, hybrid-loss estimator, moment-matching estimator, two-stage estimator, mean squared error, Monte Carlo experiments, signal-to-noise ratio

Authors

Zhewen Hou, Tian Zheng

Abstract

Distribution shift between training and deployment is a pervasive challenge for modern AI systems. In many cases, the target marginals of covariates and response are known or specified through population-level observations, boundary conditions, properties of simulator configurations, or alignment-time distributional constraints. Such knowledge may provide valuable side information for regression estimation. We study this problem in the multivariate linear regression setting with a stable conditional mean $E[Y\mid X]$ across source and target, and identify the hybrid-loss estimator, which jointly incorporates both target marginals, as a benchmark target-aware estimator. Its direct computation, however, requires solving a coupled nonlinear optimization that is expensive at scale. Our main contribution is to develop and evaluate two computationally tractable alternatives: a constrained moment-matching estimator and a two-stage estimator that augments ordinary least squares with a calibration step. For all three estimators, we derive and compare closed-form asymptotic mean squared errors, yielding conditions under which the tractable alternatives match or closely approximate the hybrid benchmark, and regimes in which they do not. Monte Carlo experiments across three controlled shift regimes validate the theoretical results, investigate the accuracy-runtime tradeoffs among the three estimators, and translate into guidance on estimator choice. In particular, the two-stage estimator nearly matches the hybrid benchmark in the high signal-to-noise regime at essentially no additional cost, providing theoretical grounding for empirical observations in nonlinear settings.
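
To make the setting concrete, the following is a minimal illustrative sketch, not the authors' estimator. The abstract does not specify the form of the calibration step, so the code assumes one simple possibility: fit ordinary least squares on source data, then reset the intercept so that the fitted line passes through the known target means of $X$ and $Y$. The function names `ols_fit` and `calibrate_intercept`, the simulated data, and all parameter values are hypothetical.

```python
import numpy as np

def ols_fit(X, Y):
    """Ordinary least squares with an intercept.

    Returns (intercept, coef) with shapes (q,) and (p, q).
    """
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return beta[0], beta[1:]

def calibrate_intercept(coef, target_mean_x, target_mean_y):
    """Hypothetical calibration step (an assumption, not from the paper):
    choose the intercept so the prediction at the known target mean of X
    equals the known target mean of Y."""
    return target_mean_y - target_mean_x @ coef

# Simulated source data with a linear conditional mean E[Y | X] = a + X B.
rng = np.random.default_rng(0)
n, p, q = 500, 3, 2
B = rng.normal(size=(p, q))
a = np.array([1.0, -2.0])
X_src = rng.normal(size=(n, p))
Y_src = a + X_src @ B + 0.5 * rng.normal(size=(n, q))

# Target marginal means, assumed known. The conditional mean is stable
# across source and target, so E_T[Y] = a + E_T[X] B.
target_mean_x = np.array([2.0, -1.0, 0.5])
target_mean_y = a + target_mean_x @ B

# Stage 1: OLS on source data. Stage 2: intercept calibration.
a_hat, B_hat = ols_fit(X_src, Y_src)
a_cal = calibrate_intercept(B_hat, target_mean_x, target_mean_y)

print("OLS intercept:       ", a_hat)
print("Calibrated intercept:", a_cal)
```

In this sketch the second stage uses only first moments of the target marginals and costs a single matrix-vector product on top of OLS. The estimators analyzed in the paper may use the target marginals differently.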
