Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

AI summary

The authors developed Gen-VCoT, a method that helps multimodal large language models reason through visual problems by generating images as intermediate reasoning steps. Their approach uses expert vision models to segment the image, estimate depth, and integrate the results semantically, and an adaptive router selects the reasoning depth for each question. Tests showed Gen-VCoT improves answers to spatial and depth-related questions but can hurt performance on simple factual queries. They also found that text-based reasoning outperforms visual intermediates on CLEVR, indicating that the best representation depends on the task.

Multimodal Large Language Models, Chain-of-Thought Reasoning, Visual Grounding, Image Segmentation, Depth Maps, Semantic Reasoning, Adaptive Routing, Visual Intermediates, CLEVR Dataset, Spatial Reasoning

Authors

Zhiqiang Zhou, Junliang Dai, Xu Ling

Abstract

Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT), lacking interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key properties. We propose Gen-VCoT, a framework that uses expert vision models to generate RGB images as reasoning intermediates. It has three stages: visual grounding (SAM segmentation), geometric reasoning (Marigold depth maps), and semantic reasoning (Qwen2-VL integration). An adaptive router selects the reasoning depth. Evaluations show that Gen-VCoT improves performance on spatial questions (25% better) and depth questions (50% better), but may hurt performance on simple factual queries. Text CoT outperforms visual intermediates on CLEVR (91.2% vs 62.5%), showing that the optimal representation is task-dependent. Gen-VCoT establishes a new paradigm for interpretable multimodal reasoning.
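
The staged design with an adaptive router can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not say how the router decides, so the keyword cues, the `Depth` levels, and the `route` and `run_pipeline` names are hypothetical. The three stages are passed in as plain callables standing in for SAM, Marigold, and Qwen2-VL; no real model APIs are called.

```python
from enum import IntEnum


class Depth(IntEnum):
    """Hypothetical reasoning-depth levels (not taken from the paper)."""
    TEXT_ONLY = 0   # answer directly, no visual intermediates
    GROUNDING = 1   # add a segmentation intermediate
    GEOMETRIC = 2   # add a depth-map intermediate as well


# Assumed keyword cues; the paper's actual routing criterion is not given.
SPATIAL_CUES = ("left of", "right of", "next to", "between", "above", "below")
DEPTH_CUES = ("closer", "closest", "farther", "farthest", "in front of", "behind")


def route(question: str) -> Depth:
    """Pick a reasoning depth from surface cues in the question text."""
    q = question.lower()
    if any(cue in q for cue in DEPTH_CUES):
        return Depth.GEOMETRIC
    if any(cue in q for cue in SPATIAL_CUES):
        return Depth.GROUNDING
    return Depth.TEXT_ONLY


def run_pipeline(question, image, segment, estimate_depth, answer):
    """Run only the stages the router selects, then answer.

    segment, estimate_depth, and answer are caller-supplied callables
    (placeholders for the segmentation, depth, and MLLM stages).
    Returns the answer and the list of named intermediates produced.
    """
    depth = route(question)
    intermediates = []
    if depth >= Depth.GROUNDING:
        intermediates.append(("segmentation", segment(image)))
    if depth >= Depth.GEOMETRIC:
        intermediates.append(("depth", estimate_depth(image)))
    return answer(question, image, intermediates), intermediates
```

Under these assumptions a simple factual question skips the visual stages entirely, which mirrors the reported observation that visual intermediates may hurt such queries. The returned intermediates can be inspected after the fact, which is the interpretability property the abstract emphasizes.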
