Sample-Mean Anchored Thompson Sampling for Offline-to-Online Learning with Distribution Shift

2026-05-11 · Machine Learning

AI summary

The authors study a setting in which a system learns to make good decisions from both old offline data and new online data, despite differences between the two data sources. They focus on Thompson Sampling algorithms, which balance exploration and exploitation well but can struggle when the offline and online data do not match. To address this, they propose Anchor-TS, a method whose arm index is the median of an online posterior sample, a hybrid posterior sample, and the online sample mean, which corrects the bias introduced by distribution shift. The method is shown, theoretically and experimentally, to use offline data safely and to speed up online learning.

Offline-to-online learning, Thompson Sampling, Distribution shift, Bandit algorithms, UCB (Upper Confidence Bound), Bias correction, Posterior sampling, Regret, Bayesian methods, Sample mean
Authors
Bochao Li, Yao Fu, Wei Chen, Fang Kong
Abstract
Offline-to-online learning aims to improve online decision-making by leveraging offline logged data. A central challenge in this setting is the distribution shift between the offline and online environments. While some existing works attempt to leverage shifted offline data, they largely rely on UCB-type algorithms. Thompson sampling (TS) is another canonical class of bandit algorithms, well known for its strong empirical performance and, through its Bayesian formulation, naturally suited to offline-to-online learning. However, unlike UCB indices, posterior samples in TS are not guaranteed to be optimistic with respect to the true arm means, which makes indices constructed from purely online data and from hybrid data hard to compare and complicates their use. To address this issue, we propose sample-mean anchored TS (Anchor-TS), which introduces a novel median-based anchoring rule that defines the arm index as the median of an online posterior sample, a hybrid posterior sample, and the online sample mean. Median anchoring systematically corrects the bias induced by distribution shift, mitigating overestimation of suboptimal arms and underestimation of optimal arms, while exploiting offline information to obtain more accurate estimates when the shift is small. We establish theoretical guarantees showing that the proposed algorithm safely leverages offline data to accelerate online learning, and we quantify how the degree of distribution shift and the size of the offline dataset affect the resulting regret reduction. Extensive experiments demonstrate consistent improvements of our algorithm over baselines.
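To make the anchoring rule concrete, the following is a minimal Python sketch of how the median-based index might be computed for a single arm. It assumes Gaussian rewards with a known noise scale and flat-prior Gaussian posteriors over arm means; the function name, the sigma parameter, and this posterior form are illustrative assumptions, not the paper's exact specification.

import numpy as np

rng = np.random.default_rng(0)

def anchor_ts_index(online_rewards, hybrid_rewards, sigma=1.0):
    """Median-anchored index for a single arm (illustrative sketch).

    online_rewards: rewards observed online for this arm (non-empty).
    hybrid_rewards: pooled offline + online rewards for this arm (non-empty).
    sigma: assumed known reward noise scale (a modeling assumption here).
    """
    n_on, n_hy = len(online_rewards), len(hybrid_rewards)
    mu_on = float(np.mean(online_rewards))               # online sample mean: the anchor
    # Posterior samples under a flat prior and Gaussian likelihood (assumed form):
    theta_on = rng.normal(mu_on, sigma / np.sqrt(n_on))  # online-only posterior sample
    mu_hy = float(np.mean(hybrid_rewards))
    theta_hy = rng.normal(mu_hy, sigma / np.sqrt(n_hy))  # hybrid (offline + online) sample
    # Median anchoring: a hybrid sample dragged up or down by shifted offline
    # data is clipped back toward the online statistics, while a hybrid sample
    # consistent with the online data passes through and sharpens the estimate.
    return float(np.median([theta_on, theta_hy, mu_on]))

At each round one would compute this index for every arm and pull the argmax. Note how the median realizes the correction described in the abstract: when shifted offline data inflates the hybrid sample of a suboptimal arm, the median falls back to the online quantities, and when it deflates the hybrid sample of an optimal arm, the median is likewise bounded below by them.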