Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration

2026-05-11

Machine Learning
AI summary

The authors study how to speed up Tree-of-Thought (ToT) reasoning for large language models, which is normally slowed by waiting for reward signals at each step of the search. They propose SPEX, a method that speculatively expands promising reasoning paths before rewards arrive, dynamically balances compute across concurrent queries, and stops exploring unhelpful branches early. Evaluating SPEX across different ToT algorithms and models, they show it makes ToT reasoning 1.2 to 3 times faster, and up to 4.1 times faster when combined with token-level speculative decoding. Their work makes complex model reasoning more efficient and scalable.

Large Language Models, Tree-of-Thought Reasoning, Speculative Execution, Reward Dependency, Parallelism, Token-level Speculative Decoding, Search Tree, Early Termination, Resource Allocation, Inference Efficiency
Authors
Shuzhang Zhong, Haochen Huang, Shengxuan Qiu, Pengfei Zuo, Runsheng Wang, Meng Li
Abstract
Tree-of-Thought (ToT) reasoning structures Large Language Model (LLM) inference as a tree-based search, demonstrating strong potential for solving complex mathematical and programming tasks. However, its efficiency is constrained by the reward dependency barrier -- a synchronization bottleneck caused by sequential reward-guided exploration that limits search parallelism and introduces substantial latency. Prior system optimizations, mainly designed for linear Chain-of-Thought (CoT) reasoning, cannot address these challenges, leaving the efficiency of ToT underexplored. To enhance ToT reasoning efficiency, we observe that the reasoning paths can be explored speculatively to break the reward synchronization barrier. Therefore, in this paper, we propose SPEX and introduce three key techniques: (i) intra-query speculative path selection to predict and expand high-potential branches of ToT, (ii) inter-query budget allocation to balance speculative resource allocation across queries dynamically, and (iii) adaptive early termination to prune deep and redundant branches for a skewed search tree. We implement SPEX on top of the SGLang framework and evaluate it across diverse ToT algorithms and LLMs. Extensive experiments show that SPEX achieves $1.2 \sim 3 \times$ speedup for different ToT reasoning algorithms. Moreover, SPEX synergizes with token-level speculative decoding, achieving cumulative speedups of up to $4.1\times$. Ablation studies further confirm the contributions of each technique. Overall, SPEX represents a significant step toward efficient and scalable ToT reasoning, unlocking the parallelism required for high-performance inference-time scaling for LLMs.
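To make the core idea concrete, here is a minimal toy sketch of speculative path selection: rather than blocking on an exact (slow) reward signal before expanding the tree, the search ranks frontier nodes with a cheap proxy score and expands only the top-k branches, pruning the rest early. All names here (`proxy_score`, `expand`, `speculative_tot`) and the toy heuristic are illustrative assumptions, not the paper's actual implementation.

```python
import heapq

def proxy_score(path):
    # Cheap stand-in for a reward model: prefer partial paths whose
    # running sum is closest to a target value (a toy "quality" signal).
    return -abs(sum(path) - 10)

def expand(path):
    # Each node branches into three candidate next "thoughts".
    return [path + [t] for t in (1, 2, 3)]

def speculative_tot(root, depth, k):
    """Beam-style ToT search that expands the k best-scoring branches
    per level using the cheap proxy instead of waiting for an exact
    reward, mimicking speculative exploration past the reward
    synchronization barrier; the discarded branches are the analogue
    of adaptive early termination."""
    frontier = [root]
    for _ in range(depth):
        candidates = [c for p in frontier for c in expand(p)]
        # Speculatively keep only the top-k candidates by proxy score.
        frontier = heapq.nlargest(k, candidates, key=proxy_score)
    return max(frontier, key=proxy_score)

best = speculative_tot([], depth=4, k=3)
```

In a real system the proxy prediction can be wrong, so mispredicted branches must be discarded once true rewards arrive; the paper's inter-query budget allocation additionally decides how much speculative expansion each concurrent query receives.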