Accelerated Dynamic Importance Weighting with Versatile Divergence-Minimizing Estimators

2026-05-25 · Machine Learning

AI summary

The authors address joint distribution shift, where the joint distribution of the training data differs from that of the data the model sees at prediction time. They improve dynamic importance weighting (DIW), an existing technique that folds weight estimation into model training. DIW is slow because it solves an optimization problem to convergence in every mini-batch, and it is limited to a single weight-estimation method, kernel mean matching. Their new approach, accelerated DIW (ADIW), replaces that inner solve with a few projected gradient descent steps warm-started from the previously updated weights, and it supports several weight estimators, including ones based on the Kullback-Leibler divergence, squared distance, and Wasserstein-1 distance. They also prove convergence guarantees under mild conditions and report experiments in which ADIW achieves state-of-the-art IW performance while being substantially more efficient.

Importance Weighting, Joint Distribution Shift, Dynamic Importance Weighting, Kernel Mean Matching, Projected Gradient Descent, Kullback-Leibler Divergence, Wasserstein Distance, Density Ratio Estimation, Convergence Guarantee, Deep Learning
Authors
Tongtong Fang, Nan Lu, Gang Niu, Kenji Fukumizu, Masashi Sugiyama
Abstract
Importance weighting (IW) is a golden solver for joint distribution shift, where the joint distributions differ between the training and test data. To solve this problem, IW estimates test-to-training density ratios as importance weights and reweights the training losses accordingly. Recent advances in dynamic IW (DIW) integrate weight estimation into model training, enabling scalable IW for deep models and achieving strong performance on large modern datasets. Despite its promise, DIW remains limited in two aspects. First, it incurs substantial computational overhead by solving a kernel mean matching (KMM)-induced optimization problem to convergence in every mini-batch. Second, it relies solely on KMM for weight estimation, whereas the IW literature contains diverse estimation methods based on different divergence measures. In this paper, we propose accelerated DIW (ADIW), a unified and efficient IW framework for deep learning under joint distribution shift. ADIW performs a few lightweight projected gradient descent updates that warm-start from previously updated weights, substantially improving efficiency. Moreover, ADIW generalizes DIW into a unified divergence-minimization framework that supports diverse weight-estimation methods in a plug-and-play manner, including those based on the Kullback-Leibler divergence, squared distance, and Wasserstein-1 distance. We establish convergence guarantees for ADIW under mild conditions, and empirical results demonstrate that ADIW achieves state-of-the-art IW performance while being substantially more efficient.
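
The two mechanisms named in the abstract, reweighting training losses by importance weights and refreshing those weights with a few warm-started projected gradient descent steps, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a KMM-style objective with an RBF kernel, keeps only a box constraint on the weights so that clipping is an exact projection, and matches raw inputs of a toy one-dimensional dataset. All function names (`rbf_kernel`, `kmm_objective`, `pgd_steps`, `importance_weighted_loss`), the bound `upper`, and the data are hypothetical choices made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kmm_objective(w, K, kappa, n_te):
    """Squared RKHS distance between weighted training mean and test mean,
    up to an additive constant that does not depend on w."""
    n_tr = len(w)
    return w @ K @ w / n_tr**2 - 2.0 * w @ kappa / (n_tr * n_te)

def pgd_steps(w, K, kappa, n_te, n_steps=5, upper=10.0):
    """A few projected gradient steps starting from the given weights w.
    Assumption: only the box constraint 0 <= w <= upper is enforced."""
    n_tr = len(w)
    # Step size 1/L, where L is the largest eigenvalue of the Hessian 2K/n_tr^2.
    lipschitz = 2.0 * np.linalg.eigvalsh(K)[-1] / n_tr**2
    for _ in range(n_steps):
        grad = 2.0 * K @ w / n_tr**2 - 2.0 * kappa / (n_tr * n_te)
        w = np.clip(w - grad / lipschitz, 0.0, upper)  # projection onto the box
    return w

def importance_weighted_loss(losses, w):
    """Mean of per-example losses reweighted by importance weights."""
    return np.mean(w * losses)

# Toy data (assumption): training inputs centred at 0, test inputs centred at 1.
rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, size=(200, 1))
x_te = rng.normal(1.0, 1.0, size=(50, 1))

# Persistent per-example weights: each mini-batch resumes from its last values
# (the warm start) instead of solving the weight problem from scratch.
weights = np.ones(len(x_tr))

for step in range(100):
    idx = rng.choice(len(x_tr), size=32, replace=False)
    xb = x_tr[idx]
    K = rbf_kernel(xb, xb)
    kappa = rbf_kernel(xb, x_te).sum(axis=1)
    weights[idx] = pgd_steps(weights[idx], K, kappa, len(x_te))
    # In training, per-example losses for this batch would be combined as
    # importance_weighted_loss(losses, weights[idx]) before the model update.
```

With a step size of 1/L, each projected step cannot increase the convex quadratic objective, so stopping after a few steps still leaves the stored weights no worse than where they started. The divergence-specific part is confined to `kmm_objective` and its gradient, which is where an estimator based on a different divergence would be substituted under the same loop structure.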