VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

2026-05-25 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors identify that current fast methods for understanding images and text lose some important details because they simplify visual information too much. To fix this, they created VEN-VL, which first gathers richer visual details from multiple views and then smartly condenses them to keep key information. Their approach also uses a special way to check that important parts are not lost during this process. Tests show that their method works well on complicated image tasks while keeping things efficient.

multimodal understanding, visual tokens, mixture of experts (MoE), attention alignment, information capacity, visual representation, adaptive routing, visual supervision, token compression, performance-efficiency tradeoff
Authors
Yinghao Wu, Zhuoyan Luo, Yiyao Yu, Zhaojian Yu, Yujiu Yang, Xiao-Ping Zhang
Abstract
Despite the remarkable progress of recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on a high compression ratio for a single visual clue, together with their reliance on heuristic pruning strategies with coarse attention alignment, creates a bottleneck in the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows the enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying visual representations from different perspectives, and then progressively compact them with adaptive routers in specialized visual experts to enhance the information density. Furthermore, we incorporate the reconstruction ability of the vanilla structure via explicit visual supervision, facilitating the preservation of crucial information. Experimental results demonstrate the superiority of our method on complex visual tasks with few information-condensed tokens, effectively bridging the gap between performance and efficiency.
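
The enrich-then-compact idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the experts, the routing rule, or the reconstruction supervision. The function names `enrich` and `compact`, the random `router` vector standing in for a learned router, and the top-k selection rule are all assumptions made only for illustration.

```python
import math
import random


def enrich(views):
    """Concatenate token lists from several visual views into one sequence."""
    return [token for view in views for token in view]


def compact(tokens, router, keep):
    """Keep the `keep` highest-scoring tokens, each scaled by its softmax gate.

    `router` is a hypothetical stand-in for a learned routing vector.
    """
    scores = [sum(w * x for w, x in zip(router, tok)) for tok in tokens]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    gates = [e / total for e in exps]
    top = sorted(range(len(tokens)), key=lambda i: gates[i], reverse=True)[:keep]
    top.sort()  # preserve the original token order among survivors
    return [[gates[i] * x for x in tokens[i]] for i in top]


rng = random.Random(0)
dim = 4
# Two hypothetical views producing 6 and 10 tokens of dimension 4.
views = [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)] for n in (6, 10)]
router = [rng.gauss(0, 1) for _ in range(dim)]

rich = enrich(views)                      # enrich: 6 + 10 = 16 tokens
stage1 = compact(rich, router, keep=8)    # first compaction stage
stage2 = compact(stage1, router, keep=4)  # second, progressive stage
print(len(rich), len(stage1), len(stage2))  # prints: 16 8 4
```

The two `compact` calls mimic the "progressive" compaction only in the loosest sense; a real system would presumably use separately learned routers per stage and per expert.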