Too long; didn't solve

2026-04-08

Artificial Intelligence
AI summary

The authors studied how the length of math problems and of their solutions affects how well large language models solve them. They built a new adversarial set of expert-written math problems and found that models fail more often as questions and solutions get longer. In a secondary, exploratory analysis of how much different models disagree on these problems, they found only weak negative associations with length after adjusting for difficulty, slightly stronger for question length. Overall, the authors conclude that longer problems were harder for the models on this dataset.

large language models, mathematical benchmarks, prompt length, solution length, model performance, adversarial dataset, cross-model disagreement, model difficulty, reasoning abilities
Authors
Lucía M. Cabrera, Isaac Saxton-Knight
Abstract
Mathematical benchmarks are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. We find that both prompt length and solution length correlate positively with model failure across models. We also include a secondary, exploratory analysis of cross-model disagreement. Under a difficulty-adjusted normalised analysis, both variables retain weak negative associations with realised model separation, slightly stronger for prompt length. Overall, our main robust finding is that structural length is linked to empirical difficulty in this dataset.
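
The abstract does not say which statistic underlies the reported length and failure correlation. As a minimal sketch of one common choice, the snippet below computes a point-biserial correlation, that is, a Pearson correlation between a numeric length and a binary failure flag. The function name `pearson`, the variable names, and all values are hypothetical and made up for illustration; none of them come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred products and squares; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values for illustration only (NOT data from the paper):
# prompt length in tokens, and whether a model failed the problem (1) or not (0).
prompt_lengths = [40, 55, 80, 120, 150, 200]
failed = [0, 0, 1, 0, 1, 1]

# With a binary second variable, Pearson's r is the point-biserial correlation.
r = pearson(prompt_lengths, failed)
print(f"point-biserial r = {r:.3f}")
```

A positive `r` here would mean that longer prompts go together with more failures in the toy sample, which is the direction of association the abstract reports for its own dataset.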