Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations
2026-06-15 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
AI summary
The authors developed Gen-VCoT, a method that helps multimodal large language models reason about visual problems by generating RGB images at intermediate reasoning steps. The approach uses expert vision models to produce segmentation masks, depth maps, and a semantic interpretation of the image, and an adaptive router decides how much reasoning depth to apply to a given question. Evaluations showed that Gen-VCoT improves answers to spatial and depth-related questions but may hurt performance on simple factual queries. Text-based chain-of-thought also outperformed visual intermediates on CLEVR, indicating that the best representation depends on the task.
Multimodal Large Language Models, Chain-of-Thought Reasoning, Visual Grounding, Image Segmentation, Depth Maps, Semantic Reasoning, Adaptive Routing, Visual Intermediates, CLEVR Dataset, Spatial Reasoning
Authors
Zhiqiang Zhou, Junliang Dai, Xu ling
Abstract
Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT) and therefore lack interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key properties. We propose Gen-VCoT, a framework that uses expert vision models to generate RGB images as reasoning intermediates. It has three stages: visual grounding (SAM segmentation), geometric reasoning (Marigold depth maps), and semantic reasoning (Qwen2-VL integration). An adaptive router selects the reasoning depth for each question. Evaluations show that Gen-VCoT improves performance on spatial questions (25% better) and depth questions (50% better), but may hurt simple factual queries. Text CoT outperforms visual intermediates on CLEVR (91.2% vs 62.5%), showing that the optimal representation is task-dependent. Gen-VCoT establishes a new paradigm for interpretable multimodal reasoning.
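The staged design described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the adaptive router decides, so the keyword heuristic below, the stage names, and the helper names (`route`, `run`, `Trace`, `stage_fns`) are all hypothetical, and the expert models (SAM, Marigold, Qwen2-VL) are represented only by caller-supplied functions rather than real API calls.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# ASSUMPTION: keyword lists stand in for the paper's router, whose actual
# decision rule is not described in the abstract.
SPATIAL_WORDS = {"left", "right", "above", "below", "between", "beside"}
DEPTH_WORDS = {"closer", "closest", "farther", "farthest", "nearer", "behind", "depth"}


def route(question: str) -> List[str]:
    """Return the ordered stages to run for a question (hypothetical heuristic)."""
    words = set(question.lower().replace("?", " ").split())
    stages: List[str] = []
    if words & (SPATIAL_WORDS | DEPTH_WORDS):
        stages.append("grounding")   # e.g. a segmentation overlay
    if words & DEPTH_WORDS:
        stages.append("geometry")    # e.g. a depth map
    stages.append("semantic")        # final answer from the MLLM
    return stages


@dataclass
class Trace:
    """Holds the intermediates produced along the way, so they can be inspected."""
    question: str
    intermediates: Dict[str, Any] = field(default_factory=dict)
    answer: str = ""


def run(question: str, image: Any, stage_fns: Dict[str, Callable[..., Any]]) -> Trace:
    """Run only the routed stages; `stage_fns` maps stage names to caller-supplied models."""
    trace = Trace(question)
    for name in route(question):
        if name == "semantic":
            trace.answer = stage_fns[name](question, image, trace.intermediates)
        else:
            trace.intermediates[name] = stage_fns[name](image)
    return trace
```

Under this sketch, a simple factual question skips both visual stages, which reflects the abstract's observation that visual intermediates may hurt such queries. A real system would replace the keyword test with whatever routing signal the authors use.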