The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
2026-05-11 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors show that many current multimodal large language models (MLLMs) score well on visual reasoning tests largely because those tests use grid-like layouts that models can readily convert into textual coordinates and solve with text-based reasoning. To test whether the models genuinely understand visuals, the authors built a new benchmark, Polaris-Bench, that replaces Cartesian grids with polar coordinates while keeping the underlying logic the same. Model performance dropped sharply on Polaris-Bench, revealing that current models struggle with visual reasoning that is not grid-based and lack an understanding of spatial layouts that generalizes beyond grids.
Multimodal Large Language Models • Visual Reasoning • Cartesian Coordinates • Polar Coordinates • Benchmark • Coordinate Systems • Topology • Logical Constraints • Model Evaluation • Spatial Reasoning
Authors
Xia Hu, Zhenrui Yue, Brian Potetz, Howard Zhou, Leonidas Guibas, Chun-Ta Lu, Zhicheng Wang
Abstract
As current Multimodal Large Language Models (MLLMs) rapidly saturate canonical visual reasoning benchmarks, a key question emerges: do these strong scores genuinely reflect robust visual understanding? We identify a pervasive vulnerability, the Cartesian Shortcut: visual reasoning benchmarks are predominantly built on orthogonal grid-based layouts that can be readily discretized into explicit textual coordinates. Models systematically exploit this property, leaning heavily on text-based deductive reasoning to assist visual problem-solving. To systematically dismantle this shortcut, we introduce Polaris-Bench, which reformulates 53 visual reasoning tasks in polar coordinate space, with paired Cartesian counterparts as reference, while preserving consistent logical constraints and task semantics, thus fundamentally breaking the orthogonal prior that models exploit. Comprehensive evaluation across 14 state-of-the-art MLLMs reveals that frontier models achieving 70-83% on Cartesian layouts collapse to 31-39% on polar equivalents, with degradation persisting even under complete logical equivalence. Moreover, reasoning gains observed on Cartesian layouts are severely diminished on polar equivalents. These findings expose a critical deficiency in current MLLMs: the lack of topology-invariant visual reasoning.
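To make the re-formulation idea concrete, below is a minimal, hypothetical sketch (not from the paper) of one way a grid cell could be mapped into a polar layout: rows become concentric rings and columns become angular sectors, so vertical neighbors stay radial neighbors and horizontal neighbors stay angular neighbors, while the orthogonal visual prior is removed. The function name grid_to_polar and its parameters are illustrative assumptions, not the authors' actual construction.

```python
import math

def grid_to_polar(row, col, n_rows, n_cols):
    """Map a cell of an n_rows x n_cols Cartesian grid onto a polar layout.

    Rows become concentric rings (radius) and columns become angular
    sectors. Vertical grid neighbors become radial neighbors and
    horizontal grid neighbors become angular neighbors; note that the
    angular dimension wraps around, unlike the original grid.
    """
    radius = (row + 1) / n_rows          # ring index, normalized to (0, 1]
    theta = 2 * math.pi * col / n_cols   # sector angle in radians
    # Center of the cell in screen coordinates, e.g. for rendering.
    x = radius * math.cos(theta)
    y = radius * math.sin(theta)
    return radius, theta, (x, y)

# Example: the cell at row 2, column 5 of a hypothetical 9x9 puzzle grid.
r, t, (x, y) = grid_to_polar(2, 5, 9, 9)
print(f"ring radius={r:.3f}, angle={t:.3f} rad, render at ({x:.3f}, {y:.3f})")
```

Because such a mapping is a bijection on cells, any logical constraint defined over cell positions carries over unchanged, which is the sense in which a polar re-formulation can preserve task semantics while breaking the grid prior.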