Are We Measuring Strategy or Phrasing? The Gap Between Surface- and Approach-Level Diversity in LLM Math Reasoning

2026-06-29
Computation and Language

AI summary

The authors study how differently large language models (LLMs) solve the same math problem, which they call 'approach-level diversity.' They find that existing diversity metrics mostly capture surface differences rather than genuine changes in problem-solving method. Their experiments show that even when models look diverse under existing metrics, the variety in actual solving approaches can be low. Training models for diversity with an LLM-judge reward can backfire, because the policy learns to exploit the judge's preferences instead of broadening its approaches. The work highlights how hard it is to make LLMs reason in truly varied, human-like ways.

large language models, mathematical reasoning, diversity metrics, approach-level diversity, reinforcement learning, LLM judge, reward optimization, problem-solving strategies
Authors
Sangmook Lee, Minbeom Kim, Jeonghye Kim, Dohyung Kim, Sojeong Rhee, Kyomin Jung
Abstract
Diversity in LLM mathematical reasoning is critical for exploration, but common diversity metrics mostly capture surface-level variation rather than differences in how a problem is solved. We address this gap by introducing approach-level diversity: variation in strategies across correct solutions to the same problem. Using a human-calibrated LLM judge framework, we show that prior diversity measures are unreliable proxies for approach-level diversity, and this mismatch carries over to diversity-aware RLVR, where target metrics are preserved while approach-level diversity declines. Investigating when approach-level diversity helps and whether it can be directly induced, we find that approach-diverse candidate sets improve test-time scaling. However, optimizing an LLM judge diversity reward during training causes the policy to exploit judge-specific preferences rather than broaden its approaches, leaving direct optimization of approach-level diversity as an open problem. Together, our work introduces the notion of approach-level diversity and uncovers a systematic divergence between surface- and approach-level signals, marking a step toward LLMs that reason in genuinely diverse, human-like ways.
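The gap the abstract describes can be illustrated with a toy sketch. This is not the authors' metric or judge framework: the token-set Jaccard distance stands in for a generic surface-level measure, the strategy labels are hand-assigned placeholders for what a human-calibrated LLM judge might produce, and the function names `surface_diversity` and `approach_diversity` are hypothetical.

```python
from itertools import combinations

def surface_diversity(solutions):
    """Mean pairwise Jaccard distance between token sets.

    Assumption: a simple stand-in for surface-level diversity metrics,
    not a metric taken from the paper.
    """
    token_sets = [set(s.lower().split()) for s in solutions]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        # Two empty solutions are treated as identical (distance 0).
        total += 1 - len(a & b) / len(union) if union else 0.0
    return total / len(pairs)

def approach_diversity(labels, correct):
    """Count distinct strategy labels among correct solutions only.

    Assumption: labels come from some external judge; here they are
    assigned by hand for illustration.
    """
    return len({label for label, ok in zip(labels, correct) if ok})

# Three correct solutions to "sum the integers from 1 to 100".
solutions = [
    "Pair the terms 1+100, 2+99, and so on: 50 pairs of 101 give 5050.",
    "Group first and last numbers into 50 couples each totalling 101, so 5050.",
    "Apply n(n+1)/2 with n=100 to get 5050.",
]
labels = ["pairing", "pairing", "formula"]  # hypothetical judge output
correct = [True, True, True]

print("surface diversity:", round(surface_diversity(solutions), 3))
print("distinct approaches:", approach_diversity(labels, correct))  # 2
```

The first two solutions share few tokens, so the surface score treats them as quite different even though both use the same pairing strategy; only the third introduces a new approach.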