Near-Optimal Regret for KL-Regularized Multi-Armed Bandits
2026-03-02 • Machine Learning • Artificial Intelligence
AI summary
The authors study a version of the multi-armed bandit problem in which a KL-divergence regularizer is added to guide learning. They analyze the KL-UCB algorithm and prove a new upper bound on regret that grows linearly in the number of arms and depends on the regularization strength. They also prove a nearly matching lower bound, showing these results are near-optimal. Additionally, they show that when regularization is weak, the regret behaves as in the classical unregularized setting, independently of the regularization strength. Overall, their work clarifies how regularization affects learning performance in these problems.
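For readers unfamiliar with the setup, one standard formulation of such a KL-regularized objective (the notation $\pi_0$ for the reference policy and $\mu_a$ for the mean reward of arm $a$ is ours, not taken from the paper) is

$$
\pi^\star \;=\; \arg\max_{\pi \in \Delta_K} \Big\{ \mathbb{E}_{a \sim \pi}[\mu_a] \;-\; \eta^{-1}\, \mathrm{KL}(\pi \,\|\, \pi_0) \Big\},
\qquad
\pi^\star(a) \;\propto\; \pi_0(a)\, e^{\eta \mu_a},
$$

so the optimal policy is a Gibbs reweighting of the reference $\pi_0$: as $\eta \to \infty$ (weak regularization) it concentrates on the best arm, recovering the classical setting, while small $\eta$ keeps it close to $\pi_0$.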
Multi-Armed Bandits • KL-Divergence • Regularization • Regret Bound • KL-UCB Algorithm • Online Learning • Peeling Argument • Bayes Prior • Statistical Efficiency • Time Horizon
Authors
Kaixuan Ji, Qingyue Zhao, Heyang Zhao, Qiwei Di, Quanquan Gu
Abstract
Recent studies have shown that reinforcement learning with KL-regularized objectives can enjoy faster rates of convergence or logarithmic regret, in contrast to the classical $\sqrt{T}$-type regret in the unregularized setting. However, the statistical efficiency of online learning with respect to KL-regularized objectives remains far from completely characterized, even when specialized to multi-armed bandits (MABs). We address this problem for MABs via a sharp analysis of KL-UCB using a novel peeling argument, which yields a $\tilde{O}(\eta K \log^2 T)$ upper bound: the first high-probability regret bound with linear dependence on $K$. Here, $T$ is the time horizon, $K$ is the number of arms, $\eta^{-1}$ is the regularization intensity, and $\tilde{O}$ hides all logarithmic factors except those involving $\log T$. The near-tightness of our analysis is certified by the first non-constant lower bound $\Omega(\eta K \log T)$, which follows from subtle hard-instance constructions and a tailored decomposition of the Bayes prior. Moreover, in the low-regularization regime (i.e., large $\eta$), we show that the KL-regularized regret for MABs is $\eta$-independent and scales as $\tilde{\Theta}(\sqrt{KT})$. Overall, our results provide a thorough understanding of KL-regularized MABs across all regimes of $\eta$ and yield nearly optimal bounds in terms of $K$, $\eta$, and $T$.
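As context for the algorithm named in the abstract, below is a minimal sketch of the classical Bernoulli KL-UCB index (Garivier and Cappé, 2011), computed by bisection. The paper analyzes a variant tailored to the KL-regularized objective, so this is illustrative only, and all function and variable names here are our own.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence KL(Bern(p) || Bern(q)), with arguments clipped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(emp_mean, pulls, t, tol=1e-6):
    """Largest q in [emp_mean, 1] with pulls * KL(emp_mean, q) <= log(t).

    KL(p, q) is increasing in q on [p, 1], so bisection applies.
    """
    if pulls == 0:
        return 1.0  # force initial exploration of unpulled arms
    budget = math.log(max(t, 2))
    lo, hi = emp_mean, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pulls * bernoulli_kl(emp_mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Each round t, the policy pulls argmax_a kl_ucb_index(mean[a], pulls[a], t).
```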