Self-Evolution for Multi-Turn Tool-Calling Agents via Divergence-Point Preference Learning
2026-06-22 • Machine Learning
Machine Learning, Artificial Intelligence, Computation and Language
AI summary
The authors study agents that call tools over multiple turns while tracking dialogue state and following policy constraints. They propose ToolGraph, which guides tool selection by combining tool-connection topology derived from schemas, transition weights estimated from successful rollouts, and history-aware controls. They also build preference pairs from the points where trajectories diverge and use them to fine-tune the agent with DPO (Direct Preference Optimization). Across 375 tau2-bench tasks, ToolGraph raises the weighted average reward, and adding DPO raises it further, with the DPO gain concentrated in the airline and retail tasks. They also find that roughly half of telecom trajectories exhaust the step budget before executing an action, and that chosen reward positivity is the most useful signal for selecting DPO checkpoints.
multi-turn tool use, dialogue state tracking, policy constraints, ToolGraph, transition weights, rollouts, DPO (Direct Preference Optimization), tau2-bench, preference learning, reward signals
Authors
Jiaqiang Tang
Abstract
Multi-turn tool-using agents must coordinate long-horizon tool sequences while tracking dialogue state and policy constraints. Existing approaches often separate inference-time orchestration from parameter-level learning, leaving tool selection weakly structured and preference updates vulnerable to train-deployment prompt mismatch. For within-benchmark self-improvement, ToolGraph combines schema-derived topology, transition weights estimated from successful rollouts, and history-aware controls for write prerequisites and repeated-search loops. We then construct 161 preference pairs by locating divergence points via state-based matching and prefix-based alignment, filtered through action-correctness annotations, and train DPO under the same ToolGraph context used at inference. Across 375 tau2-bench tasks, ToolGraph raises the weighted average reward from 0.304 to 0.338 (+11.2% relative), while ToolGraph+DPO reaches 0.355 (+16.8% over the baseline), with the DPO gain concentrated in airline and retail. Fine-grained diagnostics further show that roughly half of telecom trajectories exhaust the step budget before action execution and that chosen reward positivity is the most useful checkpoint signal across our 16 evaluated DPO configurations.
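To make the prefix-based alignment idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each trajectory is a list of `(tool_name, arguments)` tuples and treats the first step where a successful and a failed rollout differ as the divergence point. The names `first_divergence`, `build_pair`, and `relative_gain` are hypothetical, and the state-based matching and action-correctness filtering described in the abstract are not modeled here. The last helper only reproduces the relative-improvement arithmetic from the reported rewards.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding of one agent step: (tool_name, serialized_arguments).
Action = Tuple[str, str]


def first_divergence(success: List[Action], failure: List[Action]) -> Optional[int]:
    """Return the index of the first step where the two trajectories differ.

    Returns None when one trajectory is a prefix of the other (or they are
    identical), because no step then has two competing actions.
    """
    for i, (good, bad) in enumerate(zip(success, failure)):
        if good != bad:
            return i
    return None


def build_pair(success: List[Action], failure: List[Action]) -> Optional[Dict]:
    """Build one preference pair anchored at the divergence point.

    Assumption: the action from the successful rollout is 'chosen' and the
    action from the failed rollout is 'rejected'; both share the same prefix.
    """
    idx = first_divergence(success, failure)
    if idx is None:
        return None
    return {
        "prefix": success[:idx],
        "chosen": success[idx],
        "rejected": failure[idx],
    }


def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Illustrative trajectories (invented for this sketch, not benchmark data).
success = [("find_user", "u1"), ("get_order", "o1"), ("cancel_order", "o1")]
failure = [("find_user", "u1"), ("get_order", "o1"), ("get_order", "o1")]
pair = build_pair(success, failure)

# Relative gains recomputed from the rewards reported in the abstract.
toolgraph_gain = round(100 * relative_gain(0.304, 0.338), 1)
dpo_gain = round(100 * relative_gain(0.304, 0.355), 1)
```

In this toy case the two rollouts share a two-step prefix and diverge at the third step, where the failed rollout repeats a search instead of performing the write. Pairing the two actions under a shared prefix is what lets the preference signal be attached to a single decision rather than to the whole trajectory.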