CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

2026-06-01 Artificial Intelligence

AI summary

The authors point out that current large language model agents learning to reason with search often struggle because they rarely produce fully correct answers during training, making it hard to learn from success alone. They suggest using extra feedback from a verifier to help the model recognize and fix its mistakes while still learning. Their method, called CAPF, lets the model revise its wrong answers during training and gives partial credit for these revisions, because this extra feedback will not be available when the model is later used. Tests show that their approach raises Qwen3-4B's average exact-match score from 44.7% to 48.5% across seven open-domain question-answering benchmarks.

large language model (LLM), reinforcement learning, verifiable rewards, search-augmented reasoning, rollouts, privileged feedback, credit assignment, open-domain question answering, exact-match score, policy revision
Authors
Bin Chen, Xinye Liao, Yiming Liu, Xin Liao, Chonghan Liu
Abstract
Recent LLM search agents use reinforcement learning with verifiable rewards (RLVR) to learn search-augmented reasoning from outcome rewards. On hard problems, these agents rarely sample end-to-end successful rollouts, leaving outcome-only RLVR with few positive-reward trajectories. We argue that improving learning on such problems requires additional guidance during training, and that RLVR already contains verifier-side information that can provide it. This information can identify errors or omissions in the agent's submitted answer and guide revision within the rollout. We propose a training-time mechanism called Credit-Attenuated Privileged Feedback (CAPF), which makes this verifier-side information available through a Privileged Feedback call during training. CAPF lets the policy revise zero-reward attempts into positive-reward repair trajectories and attenuates credit for the feedback call and earlier actions, so that the learned policy transfers to deployment, where this call is unavailable. Experiments show that CAPF improves Qwen3-4B's average exact-match score on seven open-domain QA benchmarks from 44.7% under outcome-only RLVR to 48.5%.
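
To make the credit-attenuation idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract states only that credit is attenuated for the feedback call and earlier actions. The constant multiplicative factor `alpha`, the step labels, and the names `Step`, `credit_weights`, and `step_credits` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    # Hypothetical step labels: "search", "answer", "feedback", "revise".
    kind: str


def credit_weights(steps: List[Step], alpha: float = 0.5) -> List[float]:
    """Return one credit multiplier per rollout step.

    Assumption: steps up to and including the first privileged feedback
    call are scaled by a constant alpha in [0, 1]; later steps keep full
    credit. Rollouts with no feedback call keep full credit everywhere.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    feedback_idx: Optional[int] = next(
        (i for i, s in enumerate(steps) if s.kind == "feedback"), None
    )
    if feedback_idx is None:
        return [1.0] * len(steps)
    return [alpha if i <= feedback_idx else 1.0 for i in range(len(steps))]


def step_credits(steps: List[Step], reward: float, alpha: float = 0.5) -> List[float]:
    """Spread a scalar outcome reward over steps using the weights above."""
    return [w * reward for w in credit_weights(steps, alpha)]


# A repair trajectory: a failed attempt, a feedback call, then a revision.
rollout = [Step("search"), Step("answer"), Step("feedback"),
           Step("search"), Step("revise")]
print(step_credits(rollout, reward=1.0, alpha=0.5))
```

Under these assumptions, only the post-feedback revision steps receive the full outcome reward, which reflects the abstract's stated goal of accommodating deployment without the feedback call. How the attenuation is actually parameterized in CAPF is not specified in the abstract.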