Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits

2026-06-01Machine Learning

Machine Learning
AI summary

The authors study a problem where a learner must find the best choice by comparing pairs out of many options, assuming a clear winner exists. They introduce a new method, TG-ITE, that uses a tree-based process to identify a likely best option efficiently and then applies different strategies to either quickly find the best choice or minimize regret from wrong decisions. Their approach works well for three common goals: identifying the best arm, minimizing weak regret, and minimizing strong regret, all with good efficiency. They also show it's possible to do well on both identifying the best arm and minimizing weak regret simultaneously, avoiding some inefficiencies in past methods.

N-armed stochastic dueling banditsCondorcet winnerbest-arm identificationweak regretstrong regretsample complexitywinner-stays algorithmexploration-exploitation trade-offtree-guided identificationregret minimization
Authors
Pu Wang, Yao-Xiang Ding
Abstract
We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), the first unified framework to tackle all these objectives to our knowledge. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within $O(N)$ comparisons. We further propose varied exploitation strategies to utilize this warm-start stage to optimize the specific objectives at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.