The Quality-Utility Paradox: Why High-Reward Data Impairs Small Model Mathematical Reasoning

2026-06-15 • Artificial Intelligence

AI summary

The authors found that when teaching small language models (SLMs) to solve math problems, using better quality answers from a stronger 'Oracle' model sometimes made the SLMs perform worse. This happens because the improved answers are different from what the SLM usually expects, making it harder for the SLM to learn from them. They created a method called Style-Aligned Refinement that fixes the answers while keeping them similar to the SLM's usual style, helping the SLM learn better. Their work suggests that both answer quality and how well the answers match the learner's style matter for teaching SLMs math reasoning.

Knowledge Distillation, Small Language Models, Mathematical Reasoning, Reward Model, Quality-Utility Paradox, Oracle Model, Distributional Drift, Rejection Sampling, Style-Aligned Refinement, Adaptation Cost

Authors

Haolong Qian, Xianliang Yang, Yinuo Ma, Lirong Che, Feng Lu, Ye Guo, Lei Song, Jiang Bian, Chun Yuan

Abstract

Knowledge distillation from powerful reasoning models is widely used to improve Small Language Models (SLMs) on mathematical reasoning, often assuming that traces with higher reward model scores provide more useful supervision. We identify a counterintuitive Quality-Utility Paradox in mathematical reasoning distillation. Data refined or synthesized by a stronger Oracle obtains higher perceived quality according to reward models, yet consistently underperforms traces generated by the SLM itself and selected through rejection sampling across Qwen2.5, LLaMA-3, and DeepSeek families. Our analysis shows that Oracle refinement couples logical repair with distributional drift away from the SLM's native reasoning distribution. This drift increases the learner's adaptation cost and can outweigh the benefit of improved reasoning logic. To test this mechanism, we introduce Style-Aligned Refinement, which preserves the native trajectory of the SLM while retaining logical repair from the Oracle. This intervention lowers adaptation cost and restores downstream utility. These findings suggest that effective mathematical reasoning distillation should jointly optimize perceived solution quality and learner-data compatibility, rather than relying solely on reward-model scores. The datasets and code are available at https://github.com/Dracoqhl/Quality-Utility-Paradox.
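
The self-generated baseline mentioned in the abstract (SLM traces selected through rejection sampling) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the selection criterion, so the code assumes the common variant in which a sampled trace is kept only if its final answer matches a reference answer. The names `extract_final_answer`, `rejection_sample`, and `make_toy_generator`, and the `Answer:` marker convention, are hypothetical and introduced only for illustration; a real setup would replace the toy generator with sampling from the SLM.

```python
import itertools
from typing import Callable, List, Optional


def extract_final_answer(trace: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    marker = "Answer:"
    idx = trace.rfind(marker)
    if idx == -1:
        return None
    return trace[idx + len(marker):].strip()


def rejection_sample(
    problem: str,
    reference_answer: str,
    generate: Callable[[str], str],
    num_samples: int = 8,
) -> List[str]:
    """Sample traces from the learner and keep those with a correct final answer.

    Assumed criterion: exact string match against the reference answer.
    """
    kept = []
    for _ in range(num_samples):
        trace = generate(problem)
        if extract_final_answer(trace) == reference_answer:
            kept.append(trace)
    return kept


def make_toy_generator() -> Callable[[str], str]:
    """Hypothetical stand-in for an SLM: cycles through canned traces."""
    canned = itertools.cycle([
        "2 + 3 = 5. Answer: 5",
        "2 + 3 = 6. Answer: 6",
        "Adding gives 5. Answer: 5",
    ])
    return lambda problem: next(canned)


if __name__ == "__main__":
    kept_traces = rejection_sample(
        "What is 2 + 3?", "5", make_toy_generator(), num_samples=6
    )
    for trace in kept_traces:
        print(trace)
```

Because every kept trace was produced by the learner itself, the resulting training set stays within the SLM's native reasoning distribution, which is the property the abstract contrasts with Oracle-refined data.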
