Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts
2026-04-09 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition · Artificial Intelligence · Computation and Language
AI summary
The authors study multimodal Mixture-of-Experts (MoE) models, which handle both images and text. They find that these models can accurately perceive what is in an image yet fail at the reasoning that follows, even though they solve the same problems correctly when posed as pure text. The cause is internal: image and text inputs are routed differently, so the reasoning (domain) experts under-activate for visual inputs. To address this, the authors propose a routing-guided intervention that steers more activation toward these reasoning experts. Their experiments show consistent improvements on visual reasoning tasks across multiple benchmarks.
Multimodal Models · Mixture-of-Experts (MoE) · Vision-Language Tasks · Cross-modal Semantic Sharing · Routing Mechanism · Domain Experts · Visual Reasoning · Routing Distraction · Benchmark Evaluation
Authors
Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang, Longtao Huang, Hui Xue, Yongliang Shen, Weiming Lu, Yueting Zhuang
Abstract
Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
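The abstract's routing-guided intervention can be pictured as a small modification to a standard top-k MoE router: adding a bias to the logits of pre-identified domain experts before selection. The sketch below is a minimal illustration of this idea, not the authors' implementation; the function name, the `boost` parameter, and the assumption that domain-expert indices are already known are all hypothetical.

```python
import torch

def route_with_domain_boost(router_logits, domain_expert_ids, top_k=2, boost=1.0):
    """Top-k MoE routing with an additive logit bias for domain experts.

    router_logits: (num_tokens, num_experts) raw router scores.
    domain_expert_ids: indices of experts treated as reasoning/domain experts
        (hypothetical input; the paper's expert-identification procedure is
        not reproduced here).
    boost: intervention strength; boost=0.0 recovers standard routing.
    """
    logits = router_logits.clone()
    # Encourage activation of the designated domain experts.
    logits[:, domain_expert_ids] += boost
    probs = torch.softmax(logits, dim=-1)
    weights, experts = torch.topk(probs, top_k, dim=-1)
    # Renormalize the selected gates so they sum to 1 per token.
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return weights, experts
```

With `boost=0.0` this is ordinary top-k gating; a positive boost lets an otherwise-ignored reasoning expert enter the top-k set for visual tokens, which is the behavior the Routing Distraction hypothesis predicts is missing.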