On the Benefits of Free Exploration for Regret Minimization in Multi-Armed Bandits
2026-05-25 • Machine Learning
Machine LearningArtificial IntelligenceInformation Theory
AI summaryⓘ
The authors study a problem where an agent gets some initial 'free' tries to explore options before it starts losing points (regret). They design a strategy that smartly uses this free exploration phase to reduce losses later. They create a new algorithm called UFE-KLUCB-H that combines exploration with careful decision-making, showing it performs better than methods without free exploration. They also prove their method is almost the best possible in certain cases and show that having free exploration changes how much regret accumulates. Their simulations confirm that forcing exploration and adapting based on past data helps save more regret.
multi-armed banditstochastic banditregret minimizationfree explorationadaptive policyinstance-dependent boundsUFE-KLUCB-H algorithmphase transitionsprobably saving policiesexploration-exploitation tradeoff
Authors
Yunlong Hou, Zixin Zhong, Vincent Y. F. Tan
Abstract
We study a stochastic multi-armed bandit problem where an agent is granted a free exploration budget before regret accumulates, a setting not captured by the classic regret minimization or pure exploration paradigms. The goal is to design an adaptive policy that strategically explores the bandit instance in the initial free exploration phase and minimizes the cumulative regret in the subsequent phase. We formalize this regret minimization with free exploration problem and identify an interesting regime where the free exploration budget scales logarithmically with the time horizon. To quantify the amount of regret saved with high probability as a result of the availability of the free exploration phase, we introduce a novel set of policies known as $(α,β)$-probably saving policies. We propose a two-phase, probably saving algorithm, UFE-KLUCB-H, which consists of a principled free exploration policy, UFE, and a history-aware regret minimization policy KLUCB-H. Instance-dependent upper bounds on UFE-KLUCB-H are derived, showing that UFE-KLUCB-H accumulates strictly less regret than policies that do not have access to a free exploration phase. Complementarily, we derive instance-dependent lower bounds based on novel multi-instance perturbation arguments tailored to the free-exploration setting, demonstrating the near-optimality of UFE-KLUCB-H for two-valued bandits. Our upper and lower bounds reveal sharp phase transitions in the accumulated regret depending on the amount of available free exploration. Simulations are conducted to demonstrate that forced exploration and adaptivity in the algorithm lead to greater regret savings.