Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates
2026-05-31 • Machine Learning
AI summary
The authors study a type of decision-making problem called linear contextual bandits, where updates to the learning model happen rarely. They propose two algorithms that only update parameters a very small number of times but still learn nearly as well as methods updating every round. Their first algorithm achieves near-best possible performance using a fixed update schedule, while the second is faster to run and simpler but still performs very well. They also extend these ideas to more general models. Overall, the work shows how to learn efficiently with very few updates and less computation.
linear contextual bandits, parameter updates, minimax regret, batched learning, G-optimal design, sequential decision making, contextual adaptivity, algorithm runtime, generalized linear models, regret minimization
Authors
Sanghoon Yu, Min-hwan Oh
Abstract
We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a practical distinction that is often blurred in the literature: many "strictly batched" methods additionally restrict within-interval context adaptivity, meaning that the action rule inside an interval cannot depend on the sequence of realized contexts/actions in that interval (beyond the current round's context). For linear contextual bandits, we propose two practical algorithms with only $O(\log\log T)$ parameter updates. Our first algorithm BLCE-G attains minimax-optimal regret (up to polylogarithmic factors in $T$) simultaneously in both the small-$K$ and large-$K$ regimes under a static schedule. Our second algorithm BLCE removes the near G-optimal design step -- a dominant computational bottleneck in prior strictly batched static-grid methods -- yet preserves minimax-optimal regret and achieves the lowest known runtime complexity among optimal algorithms. We further extend these rare-update and computational principles to generalized linear contextual bandits. Overall, our results yield statistically optimal algorithms under $O(\log\log T)$ parameter updates that are also computationally efficient in practice.
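To give a feel for how few updates $O(\log\log T)$ is, the sketch below builds a static update schedule of the form $t_m = \lceil T^{1 - 2^{-m}} \rceil$, a grid commonly used in the batched bandit literature. This is an illustrative assumption, not necessarily the schedule used by BLCE-G or BLCE; the function name `static_update_grid` and the exact choice of the number of grid points are hypothetical.

```python
import math


def static_update_grid(T):
    """Return increasing update times ending at T (hypothetical helper).

    Assumed grid: t_m = ceil(T ** (1 - 2 ** -m)), a common choice in the
    batched bandit literature; the paper's own schedule may differ.
    """
    if T < 2:
        raise ValueError("T must be at least 2")
    # About log2(log2(T)) grid points, plus one for slack.
    num_points = max(1, math.ceil(math.log2(math.log2(T)))) + 1
    grid = []
    for m in range(1, num_points + 1):
        t = min(T, math.ceil(T ** (1.0 - 2.0 ** (-m))))
        # Keep the grid strictly increasing (small T can produce ties).
        if not grid or t > grid[-1]:
            grid.append(t)
    # The final update time is always the horizon itself.
    if grid[-1] != T:
        grid.append(T)
    return grid


if __name__ == "__main__":
    for horizon in (10**3, 10**6, 10**9):
        schedule = static_update_grid(horizon)
        print(horizon, len(schedule), schedule)
```

Under this assumed grid, consecutive update times satisfy $t_m / \sqrt{t_{m-1}} \approx \sqrt{T}$, which is the usual heuristic for why a doubly logarithmic number of updates can be compatible with $\sqrt{T}$-type regret: an interval of length about $t_m$ played with an estimate of accuracy about $1/\sqrt{t_{m-1}}$ contributes regret on the order of $\sqrt{T}$.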