PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models

2026-06-02 • Artificial Intelligence

AI summary

The authors built PyraMathBench, a large benchmark that tests how well language models solve math problems requiring both numerical computation and reasoning. They found that current models struggle most with questions that demand precise calculation or abstract numerical thinking. To address this, they propose SOLVE, a module, and IRPO, a training method, which together improve how models use external tools for math by managing when and how those tools are called. In their experiments, Qwen-2.5 scored 5.0 points higher with SOLVE and IRPO training. Overall, the work aims to make language models better at combining numerical skill with mathematical reasoning through smarter tool use.

Large Language Models, Numerical Reasoning, Mathematical Reasoning, Benchmarking, PyraMathBench, Tool Use in AI, SOLVE Module, Interactive Relative Policy Optimization, Fuzzy Matching, Mathematical Word Problems

Authors

Zetian Ouyang, Linlin Wang, Gerard de Melo, Liang He

Abstract

Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs' numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.
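
The abstract names fuzzy matching and low-quality call rejection without describing how either works. As a rough illustration only, the sketch below shows one way such ideas could look when a model emits a tool call: a misspelled tool name is mapped to the closest registered tool, and calls that match nothing or carry malformed arguments are rejected. Everything here is an assumption made for illustration: the tool registry, the function names `resolve_tool` and `call_tool`, the 0.75 similarity threshold, and the use of Python's standard `difflib` are not taken from the paper, and this is not the SOLVE or IRPO implementation.

```python
from difflib import SequenceMatcher

# Hypothetical tool registry; these names are illustrative, not from the paper.
TOOLS = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "power": lambda a, b: a ** b,
}


def resolve_tool(name, threshold=0.75):
    """Return the registered tool name most similar to `name`,
    or None if no candidate reaches the (assumed) similarity threshold."""
    best_name, best_score = None, 0.0
    for candidate in TOOLS:
        score = SequenceMatcher(None, name.lower(), candidate).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


def call_tool(name, *args):
    """Dispatch a tool call; return None when the call is rejected."""
    resolved = resolve_tool(name)
    if resolved is None:
        return None  # rejected: no sufficiently similar tool name
    if len(args) != 2 or not all(isinstance(a, (int, float)) for a in args):
        return None  # rejected: wrong number or type of arguments
    return TOOLS[resolved](*args)
```

With this sketch, `call_tool("multipy", 6, 7)` resolves the misspelled name to `multiply` and returns the product, while `call_tool("sqrt", 9, 2)` is rejected because no registered tool is similar enough. The threshold trades recall for safety: a lower value repairs more typos but risks dispatching to the wrong tool.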
