Tool Verification for Test-Time Reinforcement Learning

2026-03-02

Artificial Intelligence · Computation and Language
AI summary

The authors study test-time reinforcement learning (TTRL), a way for large reasoning models to improve themselves while answering questions, without extra labeled examples. They found that TTRL can get stuck when it mistakenly trusts a common but wrong answer. To fix this, their method T³RL uses an external tool to check answers at test time, giving more weight to verified correct answers. This approach led to better performance across different math problems and models, especially on harder questions. Overall, the authors show that adding verification at test time helps these models learn more reliably on their own.

Test-time reinforcement learning · Large reasoning models · Self-evolving models · Pseudo-labels · Majority voting · Mode collapse · Tool verification · Code execution · Online adaptation · Math problem datasets
Authors
Ruotong Liao, Nikolai Röhrich, Xiaohan Wang, Yuhui Zhang, Yasaman Samadzadeh, Volker Tresp, Serena Yeung-Levy
Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards through majority voting. However, a spurious yet high-frequency unverified consensus can become a biased and reinforced reward signal, leading to incorrect mode collapse. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses an external tool as evidence (e.g., from code execution) to upweight verified rollouts in a verification-aware voting scheme, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
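The verification-aware voting described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the specific weights, and the `verify` callable (standing in for external tool evidence such as code execution) are all assumptions for exposition.

```python
from collections import defaultdict

def verification_aware_vote(rollouts, verify, verified_weight=3.0, base_weight=1.0):
    """Pick a pseudo-label by weighted majority vote over rollout answers.

    rollouts: list of candidate final answers parsed from model samples
    verify:   callable(answer) -> bool, an external tool check
              (hypothetical stand-in for, e.g., code-execution evidence)
    Verified answers get a larger vote weight than unverified ones, so a
    verified minority can outvote a spurious unverified majority.
    """
    scores = defaultdict(float)
    for ans in rollouts:
        scores[ans] += verified_weight if verify(ans) else base_weight
    # Pseudo-label = answer with the highest total weight
    return max(scores, key=scores.get)

# Toy example: plain majority voting would pick the spurious answer "7",
# but upweighting the tool-verified answer "6" flips the pseudo-label.
rollouts = ["7", "7", "7", "6", "6"]
pseudo_label = verification_aware_vote(rollouts, verify=lambda a: a == "6")
```

Here unweighted voting would reinforce the wrong consensus ("7" appears three times), which is exactly the mode-collapse failure the abstract describes; with the verified answers upweighted, "6" accumulates more total weight and becomes the pseudo-label used for training.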