Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

2026-06-08 | Computer Vision and Pattern Recognition

AI summary

The authors present Visual Para-Thinker++, a framework that splits a visual reasoning task into parts handled in parallel by agents that all share a single underlying model. Each agent has a specific role: a Main Agent divides the task, several Worker Agents process their parts in isolation from one another, and a Summary Agent reconciles their full reasoning traces instead of simply voting on final answers. This design reduces mistakes caused by committing too early to a perceptual interpretation or by hallucination. The authors report that the method outperforms single-trajectory and inference-time parallel baselines on V*, CountBench, the RefCOCO family, and HallusionBench, with especially strong gains on hallucination-sensitive tasks.

Visual reasoning, Multi-agent system, Large language models (LLMs), Role-conditioned agents, Parallel processing, Task decomposition, Hallucination in AI, Multi-agent optimization, Inference engine, Visual prefix
Authors
Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Zizhao Tong, Xiaofeng Zhang, Xiaosong Yuan
Abstract
Visual reasoning requires integrating evidence distributed across regions, attributes, and relations, making single-chain reasoning prone to early perceptual commitment and hallucination. We propose Visual Para-Thinker++, a single-policy multi-agent framework in which one shared MLLM policy is instantiated as role-conditioned Main, Worker, and Summary Agents. The Main Agent decomposes the task with fixed allocation patterns; Worker Agents reason in parallel under context isolation; and the Summary Agent reconciles full Worker reasoning traces rather than majority-voting on final labels. The shared policy is trained by Multi-Agent Capability Injection and Role-Decoupled Multi-Agent Optimization, which assign role-specific rewards and advantages to corresponding token segments to reduce gradient conflict among collaborative roles. A native inference engine enables efficient multi-agent rollout through shared visual prefix and KV cache reuse. Across V*, CountBench, the RefCOCO family, and HallusionBench, Visual Para-Thinker++ consistently outperforms single-trajectory and inference-time parallel baselines, with especially strong gains on hallucination-sensitive visual reasoning.
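
The Main, Worker, and Summary flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Policy` callable, the `ROLE_PROMPTS` strings, the line-per-subtask plan format, and `toy_policy` are all hypothetical stand-ins, and the shared visual prefix and KV cache reuse are not modeled. The sketch only shows one shared policy being called under three roles, with Workers isolated from one another and the Summary step receiving full traces rather than voting on labels.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-in for the shared MLLM policy:
# (role_prompt, image_reference, text) -> generated text.
Policy = Callable[[str, str, str], str]

# Hypothetical role prompts; the paper's actual role conditioning is not given here.
ROLE_PROMPTS = {
    "main": "You are the Main Agent. List one subtask per line.",
    "worker": "You are a Worker Agent. Reason about your subtask only.",
    "summary": "You are the Summary Agent. Reconcile the Worker traces.",
}


@dataclass
class WorkerTrace:
    subtask: str
    reasoning: str  # full reasoning trace, not just a final label


def run_pipeline(policy: Policy, image: str, question: str, num_workers: int = 3) -> str:
    # 1) Main Agent: decompose the task (assumed format: one subtask per line).
    plan = policy(ROLE_PROMPTS["main"], image, question)
    subtasks: List[str] = [s.strip() for s in plan.splitlines() if s.strip()][:num_workers]

    # 2) Worker Agents: each sees only the image, the question, and its own
    #    subtask, never another Worker's output (context isolation).
    def work(subtask: str) -> WorkerTrace:
        prompt = f"{question}\nSubtask: {subtask}"
        return WorkerTrace(subtask, policy(ROLE_PROMPTS["worker"], image, prompt))

    with ThreadPoolExecutor(max_workers=max(1, len(subtasks))) as pool:
        traces = list(pool.map(work, subtasks))  # order of subtasks is preserved

    # 3) Summary Agent: receives the full traces instead of a majority vote.
    joined = "\n\n".join(
        f"[Worker {i}] {t.subtask}\n{t.reasoning}" for i, t in enumerate(traces, 1)
    )
    return policy(ROLE_PROMPTS["summary"], image, f"{question}\n\n{joined}")


def toy_policy(role_prompt: str, image: str, text: str) -> str:
    """Deterministic toy policy used only to exercise the control flow."""
    if role_prompt == ROLE_PROMPTS["main"]:
        return "count objects in the left half\ncount objects in the right half"
    if role_prompt == ROLE_PROMPTS["worker"]:
        return "trace for: " + text.splitlines()[-1]
    return "summary over %d traces" % text.count("[Worker")


if __name__ == "__main__":
    print(run_pipeline(toy_policy, "image.png", "How many apples are there?"))
```

In a real system the three calls would hit the same model weights with different role conditioning; a thread pool is used here only to make the parallel Worker step explicit.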