Symbolic and Abstractive Reasoning with Complex Visual Queries

2026-06-08 · Computation and Language

AI summary

The authors studied how to help large multi-modal language models better understand and think about complex visual ideas. They created a new type of task called complex visual queries (CVQs) that require symbolic and abstract reasoning, similar to how humans think. To do this, they built a large dataset using multi-modal knowledge graphs and designed a two-step training process to improve the models' abilities. They tested the models thoroughly on different tasks and scenarios to see how well they learned to reason visually. Their work aims to push forward the ability of these models to handle more advanced visual reasoning tasks.

multi-modal large language models, complex visual queries, symbolic reasoning, abstract reasoning, multi-modal knowledge graphs, first-order logic, training framework, visual reasoning, cross-task generalization, neuro-symbolic reasoning
Authors
Yichi Zhang, Jingdian Lu, Zhuo Chen, Lingbing Guo, Jun Xu, Wen Zhang, Huajun Chen
Abstract
Understanding and reasoning over abstract visual content remains a challenge for current multi-modal large language models (MLLMs). In this paper, we explore a novel abstract data type termed complex visual query (CVQ), designed to probe symbolic and abstractive reasoning, which is a critical yet underexplored dimension of human-like neuro-symbolic reasoning for MLLMs. We present a comprehensive investigation from three perspectives: Data × Paradigm × Exploration. Specifically, we propose a scalable pipeline for synthesizing CVQs grounded in large-scale multi-modal knowledge graphs, generating a diverse dataset encompassing 14 distinct query types via systematic combinations of first-order logic operators. We further introduce a two-stage training framework that progressively equips MLLMs with robust visual reasoning capabilities. We conduct extensive experiments to rigorously evaluate MLLMs across multiple dimensions, including reasoning performance on CVQs, as well as cross-task and cross-scenario generalization. We believe our work opens new perspectives and avenues for advancing the reasoning frontiers of MLLMs.
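
To make "systematic combinations of first-order logic operators" concrete, here is a minimal sketch of how such queries could be composed over a knowledge graph. The abstract does not list the 14 query types or describe how images enter a query, so everything below is an assumption made for illustration: the toy graph, the entity and relation names, the helper functions (`project`, `intersect`, `union`, `negate`), and the choice of these four operators. Entities are plain strings here; in a multi-modal knowledge graph they would presumably also carry images.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy knowledge graph: (head, relation) -> set of tail entities.
# None of these facts or names come from the paper.
KG: Dict[Tuple[str, str], Set[str]] = {
    ("Paris", "located_in"): {"France"},
    ("Berlin", "located_in"): {"Germany"},
    ("France", "has_landmark"): {"Eiffel Tower", "Louvre"},
    ("Germany", "has_landmark"): {"Brandenburg Gate"},
    ("France", "has_museum"): {"Louvre"},
}

# Universe of entities, needed to define negation as a set complement.
ENTITIES: Set[str] = {h for h, _ in KG} | {t for ts in KG.values() for t in ts}


def project(entities: Set[str], relation: str) -> Set[str]:
    """Relation projection: follow `relation` from every entity in the set."""
    out: Set[str] = set()
    for e in entities:
        out |= KG.get((e, relation), set())
    return out


def intersect(a: Set[str], b: Set[str]) -> Set[str]:
    """Logical AND over two answer sets."""
    return a & b


def union(a: Set[str], b: Set[str]) -> Set[str]:
    """Logical OR over two answer sets."""
    return a | b


def negate(a: Set[str]) -> Set[str]:
    """Logical NOT: every entity in the graph that is not in `a`."""
    return ENTITIES - a


# Two-hop chain: landmarks of the country that Paris is located in.
two_hop = project(project({"Paris"}, "located_in"), "has_landmark")

# Intersection: things that are both a landmark and a museum of France.
both = intersect(project({"France"}, "has_landmark"),
                 project({"France"}, "has_museum"))

# Intersection with negation: French landmarks that are not museums.
not_museum = intersect(project({"France"}, "has_landmark"),
                       negate(project({"France"}, "has_museum")))

# Union: landmarks of France or Germany.
either = union(project({"France"}, "has_landmark"),
               project({"Germany"}, "has_landmark"))
```

Nesting these operators in different shapes (chains, intersections of chains, negated branches, and so on) yields a family of structurally distinct query types, which is one plausible reading of how a fixed set of operators can produce many query templates.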