Multi-Armed Bandits with Arriving Arms: Sequential Screening, Dynamic Regret, and Sublinear Guarantees
2026-06-08 • Machine Learning
Machine Learning
AI summaryⓘ
The authors study a version of the multi-armed bandit problem where new options (arms) appear over time, which makes comparing to a single best option unfair. They propose a method called UCB-AA that screens new arms before fully testing them alongside existing ones. This approach aims to make better decisions by focusing on the best options available at each time, and their results show it keeps regret (lost opportunity) low under reasonable conditions. They also developed a way to run their method without knowing how long the experiment will last. Simulations suggest their method uses resources more efficiently by testing fewer poor arms.
multi-armed banditstochasticdynamic regretarrival processUCB algorithmelimination methodonline learninggap evolutionsequential experimentationregret bounds
Authors
Deqi Zheng, Xiaoyang Xu, Yuhong Yang
Abstract
We study a stochastic multi-armed bandit problem in which the set of available arms expands over time. This setting arises in sequential experimentation when new actions or treatments become available during an ongoing study, making regret against a single best arm in hindsight inappropriate. We instead evaluate performance relative to the best arm currently available, leading to a dynamic-regret criterion for arriving-arm environments. To address the resulting challenges of arrival information discrepancy (AID) and a drifting benchmark (DB), we propose UCB for Arriving Arms (UCB-AA), an elimination-based procedure with an aiding preliminary screening step for newly arrived arms before full competition with incumbent arms. We show that UCB-AA attains regret bounds that depend explicitly on the arrival process, achieves sublinear dynamic regret under regularity conditions on gap evolution, and admits an online extension for unknown horizons. Simulation results show that UCB-AA reduces wasted pulls and maintains a smaller active arm set while preserving competitive regret performance.