PruneTIR: Inference-Time Tool Call Pruning for Effective yet Efficient Tool-Integrated Reasoning
2026-05-11 • Computation and Language • Artificial Intelligence
AI summary
The authors study how to make large language models (LLMs) that use external tools, such as code interpreters, reason more effectively at inference time, without any extra training. They observe that mistakes in tool use hurt the model's answers: some errors are fixed within a few subsequent turns, while others leave the model stuck. To address this, they propose PruneTIR, a method that selectively prunes erroneous tool-call trajectories, resamples tool calls, and suspends tool use during the model's reasoning process to reduce errors and avoid getting stuck. Their experiments show that PruneTIR improves both the accuracy and the efficiency of tool-using LLMs.
large language models • tool-integrated reasoning • code interpreters • inference time • error correction • PruneTIR • trajectory pruning • resampling • tool suspension • Pass@1
Authors
Luan Zhang, Dandan Song, Zhijing Wu, Zhengyu Chen, Chen Zhang, Yuhang Tian, Huipeng Ma, Chenhao Li, Changzhi Zhou, Xudong Li, Shuhao Zhang
Abstract
Tool-integrated reasoning (TIR) enables large language models (LLMs) to enhance their capabilities by interacting with external tools, such as code interpreters (CIs). Most recent studies focus on equipping LLMs with the ability to use tools. However, how to further boost the reasoning ability of already tool-capable LLMs at inference time remains underexplored. Improving reasoning at inference time requires no additional training and helps LLMs better leverage tools to solve problems. We observe that, during tool-capable LLM inference, both the number and the proportion of erroneous tool calls are negatively correlated with answer correctness. Moreover, erroneous tool calls are typically resolved within a few subsequent turns; if not, LLMs often struggle to resolve such errors even with many additional turns. Building on these observations, we propose PruneTIR, an effective yet efficient framework that enhances tool-integrated reasoning at inference time. During LLM inference, PruneTIR prunes trajectories, resamples tool calls, and suspends tool usage through three components: Success-Triggered Pruning, Stuck-Triggered Pruning and Resampling, and Retry-Triggered Tool Suspension. Together, these components mitigate the negative impact of erroneous tool calls and prevent LLMs from getting stuck in repeated failed resolution attempts, thereby improving overall LLM performance. Extensive experimental results demonstrate the effectiveness of PruneTIR, which significantly improves Pass@1 and efficiency while reducing the working context length for tool-capable LLMs.
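The abstract's three components suggest a simple inference-time control loop. The sketch below is a toy illustration of that idea, not the paper's algorithm: the `Turn` record, the `generate_turn` callback, and the `STUCK_LIMIT`/`RETRY_LIMIT` thresholds are all illustrative assumptions, since the abstract does not specify interfaces or values.

```python
# Hypothetical sketch of a PruneTIR-style inference loop. All names and
# thresholds here are assumptions for illustration; the paper's actual
# algorithm and parameters are not given in the abstract.
from dataclasses import dataclass

@dataclass
class Turn:
    text: str
    tool_ok: bool  # did the tool call in this turn succeed?

STUCK_LIMIT = 3   # assumed: resample after this many consecutive failed fixes
RETRY_LIMIT = 2   # assumed: suspend tool use after this many resamples

def prune_tir(generate_turn, max_turns=16):
    """Toy control loop: prune resolved error spans from the working context,
    resample when stuck, and suspend tool calls after repeated failed retries."""
    trajectory, error_span, resamples = [], [], 0
    tools_enabled = True
    for _ in range(max_turns):
        turn = generate_turn(trajectory, tools_enabled)
        if turn is None:                  # model emitted a final answer
            break
        trajectory.append(turn)
        if turn.tool_ok:
            # Success-Triggered Pruning: the error is resolved, so drop the
            # erroneous turns from the working context to shorten it.
            for t in error_span:
                trajectory.remove(t)
            error_span = []
        else:
            error_span.append(turn)
            if len(error_span) >= STUCK_LIMIT:
                # Stuck-Triggered Pruning and Resampling: discard the failed
                # span and let the model retry from the last good state.
                for t in error_span:
                    trajectory.remove(t)
                error_span = []
                resamples += 1
                if resamples >= RETRY_LIMIT:
                    # Retry-Triggered Tool Suspension: stop calling tools.
                    tools_enabled = False
    return trajectory

# Scripted demo: one success, a short error span resolved by a success
# (pruned on success), then a long error span (pruned when stuck).
script = iter([
    Turn("t1", True),
    Turn("e1", False), Turn("e2", False),
    Turn("t2", True),                                   # resolves e1, e2
    Turn("e3", False), Turn("e4", False), Turn("e5", False),  # stuck span
    Turn("t3", True),
    None,
])
traj = prune_tir(lambda trajectory, tools_enabled: next(script))
```

In this toy run, only the successful turns survive in the working context, which mirrors the abstract's claim that PruneTIR reduces context length while mitigating erroneous tool calls.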