Offline Local Search for Online Stochastic Bandits

2026-04-10 • Machine Learning

AI summary

The authors study a problem where a decision-maker repeatedly picks an action and learns its cost, aiming to do nearly as well as the best fixed action in hindsight. They explore local search algorithms, which are commonly used offline but less studied online, and show how to turn such offline methods into online algorithms whose (approximate) regret grows as $O(\log^3 T)$, the cube of the logarithm of the number of time steps $T$. This improves on previous offline-to-online approaches, whose regret grows polynomially (though sublinearly) with $T$. They demonstrate their method on three problems: scheduling, matroid optimization, and clustering under uncertainty.

combinatorial multi-armed bandits, regret minimization, online algorithms, local search, offline-to-online conversion, stochastic optimization, matroid, scheduling, clustering

Authors

Gerdus Benadè, Rathish Das, Thomas Lavastida

Abstract

Combinatorial multi-armed bandits provide a fundamental online decision-making setting in which a decision-maker interacts with an environment across $T$ time steps, each time selecting an action and learning the cost of that action. The goal is to minimize regret, defined as the loss compared to the optimal fixed action in hindsight under full information. There has been substantial interest in leveraging what is known about offline algorithm design in this online setting. Offline greedy and linear optimization algorithms (both exact and approximate) have been shown to provide useful guarantees when deployed online. We investigate local search methods, a broad class of algorithms used widely in both theory and practice, which have thus far been under-explored in this context. We focus on problems where offline local search terminates in an approximately optimal solution and give a generic method for converting such an offline algorithm into an online stochastic combinatorial bandit algorithm with $O(\log^3 T)$ (approximate) regret. In contrast, existing offline-to-online frameworks yield regret (and approximate regret) that depends sublinearly, but polynomially, on $T$. We demonstrate the flexibility of our framework by applying it to three online stochastic combinatorial optimization problems: scheduling to minimize total completion time, finding a minimum cost base of a matroid, and uncertain clustering.
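To make the notion of offline local search concrete, here is a minimal sketch for one of the three applications named above, single-machine scheduling to minimize total completion time. It is an illustration under assumptions, not the authors' algorithm: the adjacent-swap neighborhood, the function names `total_completion_time` and `local_search_schedule`, and the assumption that processing times are known exactly are all hypothetical choices made for this example. Swapping two adjacent jobs changes the total completion time by the difference of their processing times, so the search below stops at a shortest-processing-time-first order.

```python
from typing import List, Sequence


def total_completion_time(order: Sequence[int], proc: Sequence[float]) -> float:
    """Sum of job completion times when jobs run back to back in `order`."""
    elapsed = 0.0
    total = 0.0
    for job in order:
        elapsed += proc[job]   # completion time of this job
        total += elapsed
    return total


def local_search_schedule(proc: Sequence[float]) -> List[int]:
    """Offline local search over adjacent swaps (hypothetical neighborhood).

    Swapping adjacent jobs a, b (a first) changes the objective by
    proc[b] - proc[a], so a swap is a strict improvement iff proc[b] < proc[a].
    The search stops at a local optimum, where no adjacent swap improves.
    """
    order = list(range(len(proc)))      # arbitrary starting solution
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            a, b = order[i], order[i + 1]
            if proc[b] < proc[a]:       # improving move found
                order[i], order[i + 1] = b, a
                improved = True
    return order


proc_times = [3.0, 1.0, 2.0]            # assumed known (offline setting)
schedule = local_search_schedule(proc_times)
cost = total_completion_time(schedule, proc_times)
```

In the online stochastic setting the abstract describes, processing times would not be known in advance and would only be learned through observed costs; this sketch does not attempt that offline-to-online conversion.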
