A Direct Approach for Handling Contextual Bandits with Latent State Dynamics

2026-04-09Machine Learning

Machine Learning
AI summary

The authors revisit a model from Nelson et al. (2022) where rewards depend on hidden Markov chains but find that Nelson et al. simplified the problem by looking at rewards based on estimated probabilities rather than the hidden states themselves. They point out Nelson et al.'s analysis did not fully handle how to estimate the hidden parameters and focused on average results with complex assumptions. The authors instead study a more direct and natural version of the problem, where rewards depend directly on hidden states, and provide stronger guarantees that hold with high probability. Their approach also estimates hidden Markov model parameters online and results in cleaner bounds that do not rely on specific reward function details.

finite-armed linear bandithidden Markov model (HMM)contextual banditsposterior probabilitiesregret boundshigh-probability boundsHMM parameter estimationadaptive algorithmsreward functionsonline learning
Authors
Zhen Li, Gilles Stoltz
Abstract
We revisit the finite-armed linear bandit model by Nelson et al. (2022), where contexts and rewards are governed by a finite hidden Markov chain. Nelson et al. (2022) approach this model by a reduction to linear contextual bandits; but to do so, they actually introduce a simplification in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts, rather than functions of the hidden states themselves. Their analysis (but not their algorithm) also does not take into account the estimation of the HMM parameters, and only tackles expected, not high-probability, bounds, which suffer in addition from unnecessary complex dependencies on the model (like reward gaps). We instead study the more natural model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits) and also obtain stronger, high-probability, regret bounds for a fully adaptive strategy that estimates HMM parameters online. These bounds do not depend on the reward functions and only depend on the model through the estimation of the HMM parameters.