PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents
2026-06-15 • Computation and Language
Computation and Language, Artificial Intelligence, Machine Learning
AI summary
The authors address how to train AI agents that use tools over multiple steps, which is hard because reinforcement learning gets weak feedback while supervised fine-tuning ties the agent too closely to specific example paths. They propose PACT, a method that uses expert examples only to guide learning during training, not to steer the agent’s decisions while it acts. PACT combines reinforcement learning with supervised learning signals in a way that balances following expert knowledge and allowing flexibility. Experiments on FTRL, BFCL, and ToolHop show that this approach improves performance over strong SFT- and RL-based baselines.
multi-turn tool use, reinforcement learning, supervised fine-tuning, expert traces, policy optimization, credit assignment, prompt-only inference, co-training, trace-conditioned learning
Authors
Zhenbang Du, Jun Luo, Zhiwei Zheng, Xiangchi Yuan, Kejing Xia, Dachuan Shi, Qirui Jin, Qijia He, Shaofeng Zou, Yingbin Liang, Wenke Lee
Abstract
Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.
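To make the structure of the objective described in the abstract concrete, the following is a minimal illustrative sketch of how a trace-conditioned RL term, a component-aware SFT term with annealed strength, and a prompt-only anchoring term might be combined into one scalar loss. The abstract does not give the formulas, so everything here is an assumption: the cosine schedule, the additive combination, the function names `annealed_weight` and `combined_loss`, and all default weights are hypothetical and are not taken from the paper.

```python
import math


def annealed_weight(step, total_steps, w_start=1.0, w_end=0.0):
    """Cosine-anneal a weight from w_start to w_end.

    Hypothetical schedule: the abstract only says the SFT strength is
    annealed, not which schedule is used.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def combined_loss(
    rl_trace_loss,       # surrogate on prompt-only rollouts scored under trace context
    sft_reasoning_loss,  # SFT loss on reasoning prefixes
    sft_tool_call_loss,  # SFT loss on tool calls
    anchor_loss,         # prompt-only anchoring term
    step,
    total_steps,
    reasoning_weight=0.5,  # assumed per-component weights
    tool_call_weight=1.0,
    anchor_weight=0.1,     # assumed anchoring coefficient
):
    """Assumed additive combination of the three signals named in the abstract."""
    sft_strength = annealed_weight(step, total_steps)
    sft_loss = (
        reasoning_weight * sft_reasoning_loss
        + tool_call_weight * sft_tool_call_loss
    )
    return rl_trace_loss + sft_strength * sft_loss + anchor_weight * anchor_loss


if __name__ == "__main__":
    # Print the assumed SFT strength and total loss at a few training steps.
    for step in (0, 50, 100):
        total = combined_loss(1.0, 2.0, 3.0, 4.0, step=step, total_steps=100)
        print(step, round(annealed_weight(step, 100), 3), round(total, 3))
```

In this sketch the SFT contribution dominates early and fades as training proceeds, leaving the RL and anchoring terms. That mirrors the abstract's stated goal of using dense expert supervision without over-constraining the model to fixed trajectories, but the actual weighting in PACT may differ.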