Generalization in LLM Problem Solving: The Case of the Shortest Path

2026-04-16

Artificial Intelligence, Machine Learning
AI summary

The authors study how well language models solve problems that require composing steps, using a simple map-based pathfinding task. They show that models handle new maps well but struggle when problems require chaining many steps in a row, because errors compound. They also find that training data coverage is decisive: some training techniques make learning more stable, but they do not lift the fundamental limits. Spending more compute at test time helps somewhat, but does not solve the failure on longer tasks.

language models, systematic generalization, shortest-path planning, spatial transfer, length scaling, reinforcement learning, training data coverage, inference-time scaling, recursive instability
Authors
Yao Tong, Jiayuan Ye, Anastasia Borovykh, Reza Shokri
Abstract
Whether language models can systematically generalize remains actively debated. Yet empirical performance is jointly shaped by multiple factors such as training data, training paradigms, and inference-time strategies, making failures difficult to interpret. We introduce a controlled synthetic environment based on shortest-path planning, a canonical composable sequential optimization problem. The setup enables clean separation of these factors and supports two orthogonal axes of generalization: spatial transfer to unseen maps and length scaling to longer-horizon problems. We find that models exhibit strong spatial transfer but consistently fail under length scaling due to recursive instability. We further analyze how distinct stages of the learning pipeline influence systematic problem-solving: for example, data coverage sets capability limits; reinforcement learning improves training stability but does not expand those limits; and inference-time scaling enhances performance but cannot rescue length-scaling failures.
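The shortest-path setting in the abstract can be made concrete with a minimal sketch. This is not the authors' exact environment: assume a map is an undirected graph of named nodes, and the reference solution the model must reproduce is what a classical breadth-first search returns. The `shortest_path` helper below is illustrative only.

```python
from collections import deque

def shortest_path(edges, start, goal):
    """Breadth-first search over an unweighted graph given as (u, v) edge pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    # Track each visited node's predecessor so the path can be reconstructed.
    prev = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None  # goal unreachable from start

# A toy "map": nodes are locations, edges are direct connections.
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C"), ("C", "E")]
print(shortest_path(edges, "A", "E"))
```

Under this framing, spatial transfer means evaluating on edge lists never seen in training, while length scaling means increasing the number of hops between start and goal; the BFS answer stays exact regardless of length, which is precisely the property the paper finds models lose as horizons grow.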