Can MLLMs Reason Beyond Language? VisReason: A Comprehensive Benchmark for Vision-Centric Reasoning

2026-05-25, Computer Vision and Pattern Recognition

AI summary

The authors introduce VisReason, a benchmark that tests how well multimodal large language models (MLLMs) reason about everyday images using what they actually see. VisReason has 1,505 questions in 10 categories that tightly couple perception and inference. The evaluation shows substantial gaps between current models and humans, and test-time reasoning strategies bring only limited benefits. This makes VisReason a useful diagnostic for checking whether these models reason from visual evidence rather than from language alone.

multimodal large language models, visual reasoning, benchmark, perceptual reasoning, structural reasoning, conceptual reasoning, inference, test-time reasoning strategies, vision-centric reasoning
Authors
Longteng Guo, Yifan Wang, Pengkang Huo, Tailai Chen, Yuze Wu, Jing Liu, Xinxin Zhu
Abstract
Recent multimodal large language models (MLLMs) achieve strong performance on visual reasoning benchmarks, yet it remains unclear to what extent such performance reflects reasoning directly grounded in visual evidence. We introduce VisReason, a benchmark for vision-centric reasoning in everyday scenarios where perception and inference are tightly coupled. VisReason contains 1,505 questions across 10 categories spanning perceptual, structural, and conceptual reasoning. Our evaluation shows that VisReason poses a qualitatively different challenge from existing benchmarks, exposing substantial gaps between humans and current MLLMs and revealing limited benefits from test-time reasoning strategies. VisReason offers a focused diagnostic for evaluating vision-centric reasoning beyond language.