PFN-TS: Thompson Sampling for Contextual Bandits via Prior-Data Fitted Networks
2026-05-11 • Machine Learning
AI summary
The authors introduce PFN-TS, a Thompson sampling method that uses prior-data fitted networks (PFNs) to estimate the uncertainty needed for decision making. PFNs predict noisy future rewards, whereas Thompson sampling needs samples of the underlying mean reward function. PFN-TS obtains these samples with a subsampled predictive central limit theorem, estimating posterior variance from $O(\log n)$ dataset prefixes rather than the full $O(n)$ predictive sequence. The authors prove that the variance estimator is consistent, give a Bayesian regret bound, and report strong results on synthetic and OpenML benchmarks as well as in an offline mobile-health evaluation.
Thompson sampling, contextual bandits, Prior-data Fitted Networks (PFNs), Bayesian posterior, predictive distribution, central limit theorem, posterior variance, Bayesian regret, offline evaluation, policy value
Authors
Yan Shuo Tan, Kenyon Ng, Ruizhe Deng, Sumetha Loganathan, Qiong Zhang, Bibhas Chakraborty
Abstract
Thompson sampling is a widely used strategy for contextual bandits: at each round, it samples a reward function from a Bayesian posterior and acts greedily under that sample. Prior-data fitted networks (PFNs), such as TabPFN v2+ and TabICL v2, are attractive candidates for this purpose because they approximate Bayesian posterior predictive distributions in a single forward pass. However, PFNs predict noisy future rewards, while Thompson sampling requires uncertainty over the latent mean reward function. We propose PFN-TS, a Thompson sampling algorithm that converts PFN posterior predictives into mean-reward samples using a subsampled predictive central limit theorem. The method estimates posterior variance from a geometric grid of $O(\log n)$ dataset prefixes rather than the full $O(n)$ predictive sequence used in previous predictive-sequence approaches, and reuses TabICL's cached representations across rounds. We prove consistency of the subsampled variance estimator and give a Bayesian regret bound that decomposes PFN-TS regret into exact posterior-sampling regret under the PFN prior plus approximation terms. Empirically, PFN-TS achieves the best average rank across nonlinear synthetic and OpenML classification-to-bandit benchmarks, remains competitive on linear and BART-generated rewards, and attains the highest estimated policy value in an offline mobile-health evaluation. Code is available at https://anonymous.4open.science/r/PFN_TS-36ED/.
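To make two ideas from the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows (i) one way to build a geometric grid of $O(\log n)$ dataset prefix lengths and (ii) a generic Thompson sampling step that draws one mean-reward sample per arm and acts greedily. The function names, the doubling factor `base=2.0`, and the Gaussian form of the mean-reward sample are assumptions made for illustration. The abstract does not specify how the posterior mean and variance are computed from the PFN outputs, so the sketch simply takes them as inputs.

```python
import numpy as np


def geometric_prefix_grid(n, base=2.0, n_min=1):
    """Return increasing prefix lengths n_min, n_min*base, ... ending at n.

    Hypothetical helper: the grid has O(log n) entries, in contrast to
    the n prefixes of the full predictive sequence.
    """
    if n < n_min or n_min < 1 or base <= 1.0:
        raise ValueError("need n >= n_min >= 1 and base > 1")
    sizes = []
    size = float(n_min)
    while size < n:
        sizes.append(int(size))
        size *= base
    sizes.append(n)  # always include the full dataset
    return sorted(set(sizes))


def thompson_step(mean, var, rng):
    """One Thompson sampling decision over a finite set of arms.

    Assumption: the posterior over each arm's mean reward is approximated
    as Gaussian with the given mean and variance. Draws one sample per arm
    and returns the index of the arm with the largest sampled mean reward.
    """
    mean = np.asarray(mean, dtype=float)
    std = np.sqrt(np.asarray(var, dtype=float))
    sample = rng.normal(mean, std)
    return int(np.argmax(sample))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    print(geometric_prefix_grid(1000))
    # Illustrative numbers only, not results from the paper.
    print(thompson_step([0.2, 0.5, 0.4], [0.01, 0.04, 0.02], rng))
```

With zero variance the step reduces to a greedy choice on the posterior mean, while larger variances make the sampled arm vary from round to round, which is what drives exploration.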