Near-Optimal Stochastic Linear Bandits with Delay
2026-06-15 • Machine Learning
AI summary
The authors study a decision-making problem in which feedback on each choice arrives after a delay, focusing on settings where the expected loss is linear in the chosen action. They show that when delays do not depend on the realized loss, the delay adds only an additive regret term that does not grow with the dimension of the problem. When delays do depend on the realized loss, linear bandits become substantially harder than multi-armed bandits, and the delay penalty scales with the square root of the dimension. Their work characterizes how delayed feedback affects learning in these linear settings.
stochastic linear bandits, delayed feedback, multi-armed bandits, regret bounds, delay-independent delays, loss-dependent delays, linear generalization, stochastic delays, adversarial delays, delay-as-payoff model
Authors
Ofir Schlisselberg, Mengxiao Zhang, Yishay Mansour
Abstract
We study stochastic linear bandits with delayed feedback under several delay models and establish near-optimal regret guarantees. Our results identify when delayed linear bandits exhibit the same qualitative behavior as multi-armed bandits (MAB), and when the linear structure creates fundamentally new challenges. Specifically, (1) for loss-independent delays, where the delay does not depend on the realized loss (but may depend on the arm), we show that delays incur only an additive regret penalty. Under stochastic delays, this penalty scales with the expected delay, while under adversarial delays, it scales with the maximum number of outstanding observations. Notably, both delay penalties are dimension-free, improving upon the state-of-the-art results; (2) for loss-dependent delays, we show that linear bandits are substantially harder than MAB: we prove matching (up to log factors) upper and lower bounds for linear bandits in which, unlike in MAB, the delay penalty depends on the square root of the dimension; and (3) for the delay-as-payoff model, a special case of loss-dependent delay, we show that the optimal MAB guarantee, which depends only on the delay of the optimal arm, is unattainable in linear bandits. Together, these results provide a sharp characterization of how delayed feedback interacts with linear generalization.
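To make the delay models in the abstract concrete, the following is a minimal sketch of the delayed-feedback interaction protocol only, not the authors' algorithm. It assumes a finite action set, Gaussian noise, and a placeholder learner that plays uniformly at random; the names `simulate_delayed_feedback` and `delay_fn` are hypothetical and chosen for illustration. The sketch tracks the number of outstanding observations, the quantity the abstract associates with adversarial delays.

```python
import heapq
import random


def simulate_delayed_feedback(horizon, actions, theta, delay_fn, seed=0):
    """Simulate a linear bandit whose feedback arrives after a delay.

    Assumption: feedback for round s with delay d becomes usable from
    round s + d + 1 onward. The learner here is a uniform-random
    placeholder, not a regret-minimizing algorithm.
    Returns (observed, max_outstanding).
    """
    rng = random.Random(seed)
    pending = []          # min-heap of (arrival_round, played_round, arm, loss)
    observed = []         # feedback that has reached the learner
    max_outstanding = 0
    for t in range(horizon):
        # Deliver every observation that arrived strictly before round t.
        while pending and pending[0][0] < t:
            observed.append(heapq.heappop(pending))
        # Observations still missing when the learner must act in round t.
        max_outstanding = max(max_outstanding, len(pending))
        # Placeholder learner: pick an action uniformly at random.
        k = rng.randrange(len(actions))
        # Linear loss model: expected loss is the inner product <action, theta>.
        mean_loss = sum(a * th for a, th in zip(actions[k], theta))
        loss = mean_loss + rng.gauss(0.0, 0.1)
        delay = delay_fn(k, loss, rng)
        heapq.heappush(pending, (t + delay, t, k, loss))
    return observed, max_outstanding


actions = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]   # illustrative action set
theta = (0.2, 0.7)                               # illustrative loss vector


def arm_delay(k, loss, rng):
    # Loss-independent delay: depends on the arm only.
    return 2 + k


def loss_delay(k, loss, rng):
    # Loss-dependent delay: grows with the realized loss (illustrative choice).
    return int(10 * abs(loss))


_, outstanding_arm = simulate_delayed_feedback(200, actions, theta, arm_delay)
_, outstanding_loss = simulate_delayed_feedback(200, actions, theta, loss_delay)
print(outstanding_arm, outstanding_loss)
```

Swapping `delay_fn` is the only change needed to move between the loss-independent and loss-dependent settings in this sketch, which mirrors how the abstract separates the models by what the delay is allowed to depend on.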