V-ABS: Action-Observer Driven Beam Search for Dynamic Visual Reasoning
2026-05-11 • Computer Vision and Pattern Recognition • Computation and Language
AI summary
The authors address the problem that current multimodal large language models struggle with complex multi-step visual reasoning because they fail to use feedback from their own actions. They propose V-ABS, a method that improves reasoning by repeatedly thinking, acting, and observing, while balancing the model's prior guesses against actual execution feedback through an adaptive weighting scheme. They also build a large supervised fine-tuning dataset to teach the model to assign higher confidence to correct action paths. Experiments across eight benchmarks show that V-ABS substantially outperforms previous models.
multimodal large language models, visual reasoning, beam search, action-observer feedback, imagination-action-observer bias, entropy-based adaptive weighting, supervised fine-tuning, Qwen3-VL-8B, multi-step reasoning, thinker-actor-observer iterations
Authors
Zhiwei Ning, Xuanang Gao, Jiaxi Cao, Gengming Zhang, Shengnan Ma, Wenwen Tong, Hanming Deng, Jie Yang, Wei Liu
Abstract
Multimodal large language models (MLLMs) have achieved remarkable success in general perception, yet complex multi-step visual reasoning remains a persistent challenge. Although recent agentic approaches incorporate tool use, they often neglect critical execution feedback. Consequently, they suffer from the imagination-action-observer (IAO) bias, a misalignment between prior imagination and observer feedback that undermines reasoning stability and optimality. To bridge this gap, we introduce V-ABS, an action-observer driven beam search framework that enables deliberate reasoning through thinker-actor-observer iterations. We also propose an entropy-based adaptive weighting algorithm to mitigate the IAO bias by dynamically balancing the confidence scores between the policy priors and the observational feedback. Moreover, we construct a large-scale supervised fine-tuning (SFT) dataset comprising over 80k samples to guide the model to assign higher prior confidence to correct action paths. Extensive experiments across eight diverse benchmarks show that V-ABS achieves state-of-the-art performance, delivering an average improvement of 19.7% on the Qwen3-VL-8B baseline and consistent gains across both open-source and proprietary models.
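The abstract names the two core mechanisms, the thinker-actor-observer beam search and the entropy-based adaptive weighting, without giving formulas. The sketch below is a minimal illustration rather than the authors' implementation: it assumes the adaptive weight is the normalized entropy of the policy's action distribution, and `expand`, `observe`, and `beam_width` are hypothetical placeholders for the model's proposal step, tool execution, and search width.

```python
import math

def entropy_weight(prior_probs):
    """Normalized Shannon entropy of the policy's action distribution.

    Assumption: high entropy (uncertain prior) shifts weight toward
    observer feedback; low entropy (confident prior) keeps weight on
    the policy prior. The paper's exact formula may differ.
    """
    h = -sum(p * math.log(p) for p in prior_probs if p > 0)
    h_max = math.log(len(prior_probs))  # entropy of a uniform distribution
    return h / h_max if h_max > 0 else 0.0

def score_candidate(prior_conf, observer_conf, lam):
    """Blend prior and observer confidence with the adaptive weight lam."""
    return (1.0 - lam) * prior_conf + lam * observer_conf

def beam_step(beams, expand, observe, beam_width):
    """One thinker-actor-observer iteration of the beam search.

    beams: list of (path, score) tuples
    expand(path) -> list of (action, prior_conf, prior_probs) proposals
    observe(path, action) -> observer feedback confidence in [0, 1]
    """
    candidates = []
    for path, _ in beams:
        for action, prior_conf, prior_probs in expand(path):
            lam = entropy_weight(prior_probs)   # adaptive weight for IAO bias
            obs_conf = observe(path, action)    # execute the action, observe
            score = score_candidate(prior_conf, obs_conf, lam)
            candidates.append((path + [action], score))
    # keep the top-k scored paths for the next iteration
    return sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
```

Under this reading, a confident prior dominates the candidate score, while an uncertain prior defers to observer feedback, which is one plausible way to counteract the imagination-action-observer bias the abstract describes.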