Learning to Sparsify Stochastic Linear Bandits

2026-05-11

Machine Learning
AI summary

The authors study a problem where a decision-maker must pick sparse actions—actions with only a few nonzero elements—from a large set in a way that learns over time and minimizes mistakes (regret). Because choosing the best sparse action is usually very hard, they design an algorithm that alternates between exploring and exploiting, using least squares for learning and special methods to find sparse actions. They prove their method performs well under different conditions on the action set, including cases where exact solutions are too hard and approximate methods are used. They also tested their approach in simulations and a recommendation system application.

stochastic linear bandits, sparsity constraint, cumulative regret, ordinary least squares, Euclidean ball, convex action sets, strong convexity, approximation ratio, exploration-exploitation, recommendation systems
Authors
Zhengmiao Wang, Ming Chi, Zhi-Wei Liu, Lintao Ye, Carla Fabiana Chiasserini
Abstract
This paper addresses the problem of learning to sparsify stochastic linear bandits, where a decision-maker sequentially selects actions from a high-dimensional space subject to a sparsity constraint on the number of nonzero elements in the action vector. The key challenge lies in minimizing cumulative regret while tackling the potential NP-hardness of finding optimal sparse actions due to the inherent combinatorial structure of the problem. We propose an adaptively phased exploration and exploitation algorithmic framework, utilizing ordinary least squares for parameter learning and specialized subroutines for sparse action selection. When the action set is a Euclidean ball, optimal sparse actions can be efficiently computed, enabling us to establish a $\tilde{\mathcal{O}}(d\sqrt{T})$ regret, where $d$ is the dimension of the action vector and $T$ is the time horizon length. For general convex and compact action sets, where finding optimal sparse actions is intractable, we employ a greedy subroutine. For general strongly convex action sets, we derive a $\tilde{\mathcal{O}}(d \sqrt{T})$ $\alpha$-regret; for general compact sets lacking strong convexity, we establish a $\tilde{\mathcal{O}}(d T^{2/3})$ $\alpha$-regret, where $\alpha$ is the approximation ratio of the greedy algorithm. Finally, we validate the performance of our algorithms using extensive experiments, including an application to a recommendation system.
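To illustrate why the Euclidean-ball case admits efficient computation, here is a minimal sketch (not the authors' code; the function name `best_sparse_action` and the use of an OLS estimate `theta_hat` are illustrative assumptions). On the unit Euclidean ball, maximizing the estimated reward $\langle \hat{\theta}, x \rangle$ subject to $\|x\|_2 \le 1$ and $\|x\|_0 \le s$ has a closed-form solution: keep the $s$ largest-magnitude coordinates of $\hat{\theta}$ and normalize.

```python
import numpy as np

def best_sparse_action(theta_hat: np.ndarray, s: int) -> np.ndarray:
    """Best s-sparse action on the unit Euclidean ball for estimate theta_hat.

    Keeps the s largest-magnitude coordinates of theta_hat, zeroes the rest,
    and rescales to unit norm -- this maximizes <theta_hat, x> over the
    s-sparse unit ball (a closed-form step, in contrast to the NP-hard
    general case).
    """
    support = np.argsort(np.abs(theta_hat))[-s:]  # indices of top-s entries
    x = np.zeros_like(theta_hat, dtype=float)
    x[support] = theta_hat[support]
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

# Example: with s = 2 the action is supported on the two
# largest-magnitude coordinates (indices 1 and 3 here).
theta_hat = np.array([0.1, -2.0, 0.5, 1.5])
a = best_sparse_action(theta_hat, s=2)
```

For general action sets no such closed form exists, which is why the paper falls back to a greedy subroutine with approximation ratio $\alpha$.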