Information Router for Mitigating Modality Dominance in Vision-Language Models

2026-04-17

Computer Vision and Pattern Recognition · Machine Learning
AI summary

The authors study Vision-Language models that sometimes rely too much on one type of input, such as just images or just text, instead of using both equally. They point out that past methods only adjust where the model pays attention, which does not help if one type of input lacks important information. To fix this, the authors propose MoIR, which identifies weaker input parts and replaces them with better information from the other modality before combining them. Their experiments show that MoIR helps balance the use of both inputs and makes the models perform better, even when one input is not as useful.

Vision Language Models · Modality Dominance · Attention Mechanism · Information Fusion · Multi-modal Learning · Token Representation · Signal-to-Noise Ratio · Robustness · Downstream Performance · Multi-modal Benchmarks
Authors
Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib
Abstract
Vision Language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses, and cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and their signal-to-noise ratios. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose \textsc{MoIR}: \textit{Multi-modal Information Router}, an information-level fusion method that explicitly reduces information disparity prior to fusion. \textsc{MoIR} identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, \textsc{MoIR} enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate \textsc{MoIR} on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that \textsc{MoIR} consistently yields more balanced modality contributions, and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.
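The abstract describes a two-step idea: score each token's informativeness, then route complementary information from the stronger modality into the low-scoring tokens before fusion. The toy sketch below illustrates that pattern only; it is not the authors' implementation, and the informativeness score (L2 norm), the routed fraction `frac`, and the blend weight `alpha` are all placeholder assumptions.

```python
import numpy as np

def route_tokens(weak, strong, frac=0.5, alpha=0.5):
    """Toy information-routing sketch (illustrative, not the paper's MoIR).

    weak:   (n, d) token features from the less informative modality
    strong: (m, d) token features from the stronger modality
    The bottom `frac` of `weak` tokens (by an assumed L2-norm
    informativeness proxy) are blended with a similarity-weighted
    pool of the stronger modality's tokens.
    """
    scores = np.linalg.norm(weak, axis=-1)      # crude informativeness proxy
    k = int(len(weak) * frac)
    low_idx = np.argsort(scores)[:k]            # least informative tokens
    out = weak.copy()
    for i in low_idx:
        sim = strong @ weak[i]                  # (m,) similarity logits
        w = np.exp(sim - sim.max())
        w /= w.sum()                            # softmax over strong tokens
        pooled = w @ strong                     # (d,) complementary info
        out[i] = (1 - alpha) * weak[i] + alpha * pooled
    return out

rng = np.random.default_rng(0)
text = rng.normal(size=(8, 16))                 # weaker modality tokens
image = rng.normal(size=(32, 16))               # stronger modality tokens
fused_input = route_tokens(text, image)
print(fused_input.shape)                        # (8, 16)
```

The routed sequence keeps its original length and dimensionality, so it can be fed to the downstream language model unchanged; only the low-information tokens are enriched.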