Signature Approach for Contextual Bandits with Nonlinear and Path-dependent Rewards

2026-05-11

Machine Learning
AI summary

The authors study a problem where decisions are made step-by-step and rewards depend on a sequence of past events in complicated ways. They use a mathematical tool called the signature transform to approximate these complex, path-dependent rewards by linear functions of signature features, making the problem easier to solve. Using this approach, they create a new algorithm named DisSigUCB, for which they prove a sublinear regret bound that holds with high probability. Their tests show that DisSigUCB outperforms classical linear and kernelized contextual bandit methods on tasks like temperature sensor monitoring, sleep-stage classification, and hospital nurse staffing.

contextual bandits, path-dependent rewards, signature transform, upper confidence bound (UCB), sublinear regret, nonlinear functionals, sequential decision making, linear approximations, sensor monitoring, classification
Authors
Xin Guo, Grace He, Xinyu Li
Abstract
We study contextual bandits with nonlinear and path-dependent rewards through a novel signature-transform-based approach. Leveraging the universal nonlinearity property of signatures, we approximate continuous path-dependent reward functionals by linear functionals in the signature space. This representation enables the use of efficient linear contextual bandit methods while preserving expressive sequential structure. Building on this framework, we propose \texttt{DisSigUCB}, a signature-based disjoint upper confidence bound (UCB) algorithm. Under boundedness and non-degeneracy assumptions, we prove a high-probability data-dependent sublinear regret bound of order \(\tilde{\mathcal O}(\sqrt{(d+m)KT})\) where \(d\) is the context dimension and \(m\) is the signature feature dimension. Synthetic experiments and numerical applications on temperature sensor monitoring, sleep-stage classification, and hospital nurse staffing demonstrate that \texttt{DisSigUCB} consistently outperforms classical linear and kernelized contextual bandit baselines in nonlinear and path-dependent settings.
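To make the idea of linearizing path-dependent rewards concrete, the following is a minimal sketch, not the authors' implementation. It assumes a piecewise-linear path, a signature truncated at depth 2, and the standard disjoint LinUCB update applied to the concatenation of a context vector and the signature features. The names `signature_depth2` and `DisjointLinUCB`, the exploration weight `alpha`, and the ridge parameter `reg` are hypothetical choices for illustration; the abstract does not specify the truncation depth or the exact confidence width used by \texttt{DisSigUCB}.

```python
import numpy as np


def signature_depth2(path):
    """Depth-2 truncated signature of a piecewise-linear path.

    path: array of shape (n_points, c). Returns a vector of length c + c*c
    holding the level-1 terms followed by the flattened level-2 terms.
    """
    path = np.asarray(path, dtype=float)
    inc = np.diff(path, axis=0)              # segment increments, shape (n-1, c)
    level1 = inc.sum(axis=0)                 # total displacement
    start = path[:-1] - path[0]              # displacement at the start of each segment
    # Chen's identity for linear segments: cross terms plus half the outer
    # product of each increment with itself.
    level2 = start.T @ inc + 0.5 * (inc.T @ inc)
    return np.concatenate([level1, level2.ravel()])


class DisjointLinUCB:
    """Standard disjoint LinUCB: one ridge regression per arm (illustrative)."""

    def __init__(self, n_arms, dim, alpha=1.0, reg=1.0):
        self.alpha = alpha                                   # exploration weight (assumed)
        self.A = np.stack([reg * np.eye(dim) for _ in range(n_arms)])
        self.b = np.zeros((n_arms, dim))

    def select(self, z):
        scores = []
        for A, b in zip(self.A, self.b):
            theta = np.linalg.solve(A, b)                    # ridge estimate for this arm
            width = np.sqrt(z @ np.linalg.solve(A, z))       # confidence width
            scores.append(theta @ z + self.alpha * width)
        return int(np.argmax(scores))

    def update(self, arm, z, reward):
        self.A[arm] += np.outer(z, z)
        self.b[arm] += reward * z


# Usage: features are [context, signature(path)], so dim = d + m.
path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
context = np.array([1.0])
z = np.concatenate([context, signature_depth2(path)])
bandit = DisjointLinUCB(n_arms=3, dim=z.size)
arm = bandit.select(z)
bandit.update(arm, z, reward=1.0)
```

Here the feature vector has length \(d + m\) with \(d = 1\) and \(m = 2 + 4 = 6\), matching the dimensions that appear in the regret bound. A deeper truncation would enlarge \(m\) and capture higher-order path interactions at greater cost.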