Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference
2026-06-01 • Computer Vision and Pattern Recognition
AI summary
The authors address a problem in Multimodal Large Language Models: reducing the number of visual tokens to save memory and speed up inference distorts positional and attention information relative to the full token sequence. They propose RESTORE, a method that corrects these distortions by adjusting attention weights according to the relative distances between tokens and by carefully choosing the anchor tokens used for merging. Their experiments on multiple benchmarks show that RESTORE improves the accuracy of various token reduction methods while preserving their efficiency.
Multimodal Large Language Models, Visual Tokens, Token Reduction, Attention Mechanism, Positional Encoding, Feature Averaging, Computational Complexity, Vision-Language Tasks
Authors
Hyeonwoo Cho, DongHyeon Baek, Yewon Kim, Bumsub Ham
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, yet the quadratic computational complexity arising from the vast number of visual tokens incurs significant memory and latency bottlenecks. While visual token reduction (VTR) strategies have been explored to mitigate this burden, existing methods overlook the positional and attentional consistency between the full and reduced sequences, resulting in a distorted representation. To this end, we propose RESTORE, a novel VTR framework that rectifies the positional and attentional distortions while maintaining efficiency. Specifically, we present a simple yet effective calibration method that restores lost visual attention by augmenting attention weights based on relative distances. We also introduce a distinctive anchor selection for token merging to mitigate information loss during feature averaging. Experimental results on multiple benchmarks demonstrate that our method consistently improves the accuracy of various reduction methods, achieving state-of-the-art performance while maintaining computational efficiency.
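To make the abstract's two ingredients concrete, the following is a minimal sketch of generic token merging by feature averaging, plus a rough estimate of how much the quadratic attention term shrinks. It is not the RESTORE method: the abstract does not specify how anchors are selected or how attention is calibrated, so the anchor indices here are simply supplied by the caller. The function names `merge_tokens` and `attention_cost_ratio`, the cosine-similarity assignment rule, and the token counts are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def merge_tokens(tokens, anchor_ids):
    """Merge visual tokens into one averaged token per anchor.

    Hypothetical helper: each non-anchor token is assigned to its most
    similar anchor (cosine similarity), then every group is averaged.
    How the anchors are chosen is left to the caller; this is NOT the
    anchor selection proposed in the paper.
    """
    groups = {a: [tokens[a]] for a in anchor_ids}
    for i, tok in enumerate(tokens):
        if i in groups:
            continue  # anchors already seed their own group
        best = max(anchor_ids, key=lambda a: cosine(tok, tokens[a]))
        groups[best].append(tok)

    merged = []
    for a in anchor_ids:
        members = groups[a]
        dim = len(members[0])
        # Feature averaging: element-wise mean over the group.
        merged.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return merged


def attention_cost_ratio(n_full, n_reduced):
    """Ratio of the quadratic attention term after reduction.

    Assumes cost scales with the square of the visual token count and
    ignores text tokens and all non-attention costs.
    """
    return (n_reduced / n_full) ** 2


# Four toy 2-D tokens; tokens 0 and 2 are used as anchors (arbitrary choice).
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged = merge_tokens(tokens, anchor_ids=[0, 2])
print(merged)
print(attention_cost_ratio(576, 144))  # illustrative token counts
```

In this toy run, each averaged token sits between its anchor and the neighbor merged into it, which is the kind of information loss during feature averaging that the abstract says anchor selection is meant to mitigate. Keeping a quarter of the tokens cuts the quadratic term to one sixteenth under the stated assumptions.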