LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

2026-04-15

Machine Learning, Artificial Intelligence
AI summary

The authors created LongCoT, a set of 2,500 challenging problems that test how well language models can reason step by step over very long processes. The problems come from areas such as chemistry, math, computer science, chess, and logic, and solving them requires many linked steps, each easy on its own but hard in combination. The authors found that even the best models answer fewer than 10% of these problems correctly, showing that they struggle with very long chains of thought. LongCoT helps measure and understand these long-range reasoning challenges in AI models.
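The "each step easy on its own, hard in combination" structure can be illustrated with a toy construction. The sketch below is not the authors' problem generator (the page does not describe one); it merely builds an artificial chain of trivially simple dependent steps, where the final answer requires executing every step in order without a single slip:

```python
import random

def chained_problem(n_steps: int, seed: int = 0) -> tuple[list[str], int]:
    """Build a toy chain of n_steps dependent arithmetic updates.

    Each step (add or multiply mod 100) is trivial on its own, but the
    final value depends on every preceding step, so the whole chain must
    be followed exactly -- a rough analogue of the long-horizon
    dependency structure LongCoT is designed to probe.
    """
    rng = random.Random(seed)
    value = rng.randrange(100)
    steps = [f"Start with {value}."]
    for _ in range(n_steps):
        op = rng.choice(["add", "multiply"])
        k = rng.randrange(2, 10)
        value = (value + k if op == "add" else value * k) % 100
        steps.append(f"{op.capitalize()} the result by {k}, then reduce mod 100.")
    return steps, value

steps, answer = chained_problem(500)
print(f"{len(steps)} lines of instructions; ground-truth answer: {answer}")
```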

Tags
language models, chain-of-thought reasoning, long-horizon reasoning, benchmark, chemistry problems, mathematics problems, computer science, chess, logic, AI evaluation
Authors
Sumeet Ramesh Motwani, Daniel Nichols, Charles London, Peggy Li, Fabio Pizzati, Acer Blake, Hasan Hammoud, Tavish McDonald, Akshat Naik, Alesia Ivanova, Vignesh Baskaran, Ivan Laptev, Ruben Glatt, Tal Ben-Nun, Philip Torr, Natasha Jaques, Ameya Prabhu, Brian Bartoldson, Bhavya Kailkhura, Christian Schroeder de Witt
Abstract
As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens of thousands to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.
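The abstract fixes only the interface of a benchmark item: a short input paired with a verifiable answer. A minimal evaluation harness against that interface might look like the sketch below; the record fields and the `solve` callable are assumptions for illustration, not the paper's actual schema or API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LongCoTProblem:
    # Hypothetical schema: the abstract only states that each problem is a
    # short input with a verifiable answer; these field names are assumed.
    domain: str   # e.g. "chemistry", "mathematics", "cs", "chess", "logic"
    prompt: str   # the short input shown to the model
    answer: str   # ground-truth answer used for verification

def evaluate(problems: list[LongCoTProblem],
             solve: Callable[[str], str]) -> float:
    """Score a model by exact match against each problem's verifiable answer.

    `solve` stands in for a model call that returns a final answer after an
    arbitrarily long chain of thought; only the final answer is checked.
    """
    correct = sum(solve(p.prompt).strip() == p.answer.strip()
                  for p in problems)
    return correct / len(problems)

# Usage sketch with a trivial stand-in "model".
demo = [LongCoTProblem("mathematics", "What is 6 * 7?", "42")]
print(f"accuracy = {evaluate(demo, lambda prompt: '42'):.0%}")
```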