Towards Pareto-Optimal Tool-Integrated Agents with Pareto Ranking Policy Optimization

2026-06-15 • Computation and Language

AI summary

The authors address a problem in which tool-using language models are trained mostly for task accuracy and neglect other important goals, such as using tools efficiently. They propose ParetoPO, a two-stage method that first adjusts reward weights according to overall progress on the Pareto frontier and then ranks trajectories by Pareto dominance to encourage better trade-offs. Their experiments on math and multi-hop question-answering tasks show that ParetoPO finds policies that balance accuracy and efficiency better than static and heuristic baselines. This approach helps make language models more practical by optimizing multiple objectives at once.

large language models, tool use, multi-objective optimization, Pareto frontier, dynamic scalarization, advantage computation, credit assignment, mathematical reasoning, multi-hop question answering, policy optimization

Authors

Junyi Li, Xiaowei Qian, Yingyi Zhang, Wenlin Zhang, Guojing Li, Sheng Zhang, Xiao Han, Yichao Wang, Xiangyu Zhao

Abstract

Recent advances in tool-integrated language agents have significantly improved their ability to solve complex reasoning tasks. However, existing alignment methods predominantly focus on maximizing task accuracy, while overlooking auxiliary objectives such as tool-use efficiency, which are essential for practical deployment. To address this gap, we introduce ParetoPO, a two-stage multi-objective optimization framework for aligning tool-using large language models (LLMs) under competing objectives. In the first stage, ParetoPO leverages hypervolume-guided dynamic scalarization to adapt reward weights based on global Pareto frontier progress. In the second stage, it replaces scalarized learning signals with Pareto-ranking-based advantage computation, promoting nondominated trajectories through dominance-aware credit assignment. This design enables fine-grained, action-level optimization across multiple conflicting objectives. Experimental results on mathematical reasoning and multi-hop QA tasks show that ParetoPO consistently discovers policies with superior accuracy-efficiency trade-offs compared to static and heuristic baselines.
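
The abstract does not spell out ParetoPO's update rules, so the following is only a minimal sketch of the two generic building blocks it names: a hypervolume measure of Pareto-frontier progress and a Pareto ranking of trajectories turned into advantages. It assumes two objectives that are both maximized (say, accuracy and a normalized tool-efficiency score), a reference point at the origin, and a simple zero-mean mapping from rank to advantage. The function names and the rank-to-advantage mapping are hypothetical illustrations, not the authors' formulation.

```python
from typing import List, Sequence, Tuple

# Hypothetical objective vector: (accuracy, efficiency), both maximized.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_ranks(points: Sequence[Point]) -> List[int]:
    """Nondominated sorting: rank 0 is the Pareto front, rank k is the
    front that remains after all points of rank < k are removed."""
    ranks = [-1] * len(points)
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        front = {
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        }
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return ranks


def hypervolume_2d(points: Sequence[Point], ref: Point = (0.0, 0.0)) -> float:
    """Area dominated by the points and bounded below by ref (2-D, maximization)."""
    # Sweep from largest to smallest first objective, adding one strip per
    # point that improves on the best second objective seen so far.
    pts = sorted(
        (p for p in points if p[0] > ref[0] and p[1] > ref[1]),
        key=lambda p: p[0],
        reverse=True,
    )
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area


def rank_advantages(ranks: Sequence[int]) -> List[float]:
    """Assumed mapping: lower Pareto rank gets a higher, zero-mean advantage."""
    scores = [-float(r) for r in ranks]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# A toy group of sampled trajectories scored on (accuracy, efficiency).
rollouts = [(0.9, 0.2), (0.6, 0.6), (0.3, 0.9), (0.5, 0.5), (0.2, 0.1)]
ranks = pareto_ranks(rollouts)
advantages = rank_advantages(ranks)
hv = hypervolume_2d(rollouts)
```

In this toy group the first three rollouts are mutually nondominated and share the top rank, so they receive the same positive advantage regardless of how any fixed weighting would have scored them. The hypervolume value is the kind of global progress signal that a dynamic scalarization scheme could track across training steps; how ParetoPO actually converts it into reward weights is not stated in the abstract.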
