Reasoning Gets Harder for LLMs Inside a Dialogue
2026-03-20 • Computation and Language
AI summary
The authors study how well large language models (LLMs) handle reasoning tasks when those tasks are embedded in a conversation rather than posed as standalone problems. They introduce BOULDER, a new benchmark of eight travel-related reasoning tasks, each presented both as a single question and within a dialogue. Experiments show that LLMs perform worse when reasoning happens in multi-turn dialogue than on the same tasks in isolation. The drop is driven mainly by the multi-turn dialogue format, with additional contributions from role-playing and tool use. The authors conclude that testing LLMs in realistic conversational settings is important for understanding their true reasoning abilities.
Large Language Models • Task-Oriented Dialogue • Benchmark • Reasoning • Arithmetic Reasoning • Spatial Reasoning • Temporal Reasoning • Multi-turn Dialogue • Role Conditioning • Tool Use
Authors
Ivan Kartáč, Mateusz Lango, Ondřej Dušek
Abstract
Large Language Models (LLMs) achieve strong performance on many reasoning benchmarks, yet these evaluations typically focus on isolated tasks that differ from real-world usage in task-oriented dialogue (TOD). In this setting, LLMs must reason as an integral part of generating text while adhering to instructions on role, format, and style. This mismatch raises concerns about whether benchmark performance accurately reflects models' reasoning robustness in the TOD setting. We investigate how framing reasoning tasks within TOD affects LLM performance by introducing BOULDER, a new dynamic benchmark covering eight travel-related tasks that require arithmetic, spatial, and temporal reasoning with both commonsense and formal aspects. Each problem is presented in both isolated and dialogue-based variants, enabling controlled comparison while mitigating data contamination. Experiments on eight LLMs reveal a substantial and consistent performance gap between isolated and dialogue settings. Through ablations and qualitative analysis, we show that this gap is largely driven by the multi-turn nature of dialogue, with additional effects from role conditioning and tool-use requirements. Our results highlight the need to evaluate LLM reasoning in realistic interactive scenarios.
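To make the isolated-vs-dialogue contrast concrete, the sketch below renders one hypothetical temporal-reasoning travel problem in both framings using the common chat-message format. The problem text, roles, and persona instructions are invented for illustration and do not reproduce BOULDER's actual prompts, tasks, or data format.

```python
# Illustrative sketch of the two evaluation framings compared in the paper.
# The wording, roles, and structure below are assumptions for illustration;
# they are not BOULDER's actual prompts or data.

PROBLEM = ("A train leaves Prague at 09:40 and arrives in Brno at 12:05. "
           "How long is the trip?")


def isolated_variant(problem: str) -> list[dict]:
    """Standalone framing: the model sees only the reasoning task."""
    return [{"role": "user", "content": problem}]


def dialogue_variant(problem: str) -> list[dict]:
    """TOD framing: the same task arrives mid-conversation, with role and
    style instructions the model must follow while reasoning."""
    return [
        {"role": "system",
         "content": "You are a polite travel-agency assistant. "
                    "Answer briefly and stay in character."},
        {"role": "user", "content": "Hi, I'm planning a trip to Brno next week."},
        {"role": "assistant", "content": "Happy to help! What would you like to know?"},
        {"role": "user", "content": problem},
    ]


if __name__ == "__main__":
    # Both variants target the same answer (2 hours 25 minutes), so any
    # accuracy gap between them isolates the effect of the conversational
    # framing rather than the underlying task difficulty.
    for variant in (isolated_variant, dialogue_variant):
        print(variant.__name__, "->", len(variant(PROBLEM)), "messages")
```

Because the underlying question is identical in both variants, scoring a model on each directly measures the performance gap the paper reports.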