ERASE: Eliminating Redundant Visual Tokens via Adaptive Two-Stage Token Pruning
2026-05-11 • Computer Vision and Pattern Recognition
AI summary
The authors address the problem that Vision-Language Models must process many visual tokens, which is slow and uses a lot of computing power. They created a method called ERASE that cuts down the number of these visual tokens, adapting how aggressively it prunes to how complicated the image is. Their tests show that ERASE keeps most of the model’s accuracy even after removing a large share of visual tokens, doing better than older methods. This helps models understand images faster without losing much detail.
Vision-Language Models · Large Language Models · Vision Tokens · Token Pruning · Multimodal Understanding · Semantic Features · Image Complexity · Computational Overhead · Accuracy Retention
Authors
Yuna Lee, Kyoungho Min, Yulhwa Kim
Abstract
Recent advancements in Vision-Language Models (VLMs) enable large language models (LLMs) to process high-resolution images, significantly improving real-world multimodal understanding. However, this capability introduces a large number of vision tokens, resulting in substantial computational overhead. To mitigate this issue, various vision token pruning methods have been proposed. Nevertheless, existing approaches predominantly rely on learned semantic features within the model to capture visual redundancy. Moreover, they lack adaptive mechanisms to adjust pruning strategies according to the complexity of the input image. In this paper, we propose ERASE, a two-stage vision token pruning framework that identifies and retains salient tokens through pruning strategies adaptive to image complexity. Experimental results demonstrate that ERASE significantly reduces vision tokens while preserving accuracy. For Qwen2.5-VL-7B, at a token pruning ratio of 85%, ERASE retains 89.46% of the original model accuracy, whereas the best prior method retains only 78.1%. Our code is available at https://github.com/Tuna-Luna/ERASE.
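For intuition, the sketch below shows what a generic adaptive two-stage vision token pruning pipeline might look like; it is not the authors' ERASE implementation. The per-token saliency scores, the entropy-based complexity proxy, the `two_stage_prune` function name, and the similarity-based deduplication stage are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of adaptive two-stage vision token pruning.
# NOT the authors' ERASE method; scoring and complexity measures are assumptions.
import torch
import torch.nn.functional as F


def two_stage_prune(tokens: torch.Tensor,
                    saliency: torch.Tensor,
                    base_keep_ratio: float = 0.15,
                    sim_threshold: float = 0.9) -> torch.Tensor:
    """Prune vision tokens in two stages.

    tokens:   (N, D) vision token embeddings.
    saliency: (N,) per-token importance scores (e.g., attention received).
    Returns the retained tokens, shape (M, D) with M <= N.
    """
    n = tokens.size(0)

    # Stage 1: adapt the keep ratio to image complexity. Complexity is
    # approximated here by the normalized entropy of the saliency distribution:
    # near-uniform saliency (busy scene) keeps more tokens, peaked saliency fewer.
    probs = torch.softmax(saliency, dim=0)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    complexity = (entropy / torch.log(torch.tensor(float(n)))).item()  # in [0, 1]
    keep_ratio = min(1.0, base_keep_ratio * (0.5 + complexity))
    k = max(1, int(round(keep_ratio * n)))

    # Keep the top-k most salient tokens.
    top_idx = saliency.topk(k).indices
    kept = tokens[top_idx]

    # Stage 2: drop near-duplicate survivors so the retained set covers
    # distinct visual content rather than many copies of one region.
    normed = F.normalize(kept, dim=-1)
    sim = normed @ normed.T
    keep_mask = torch.ones(k, dtype=torch.bool)
    for i in range(k):
        if not keep_mask[i]:
            continue
        # Mark later tokens that are nearly identical to token i as redundant.
        dup = sim[i, i + 1:] > sim_threshold
        keep_mask[i + 1:] &= ~dup
    return kept[keep_mask]


if __name__ == "__main__":
    torch.manual_seed(0)
    toks = torch.randn(576, 1024)   # e.g., 24x24 patch tokens from a ViT encoder
    sal = torch.rand(576)           # placeholder saliency scores
    pruned = two_stage_prune(toks, sal)
    print(f"kept {pruned.size(0)} of {toks.size(0)} tokens")
```

With `base_keep_ratio = 0.15`, the sketch targets roughly the 85% pruning ratio quoted in the abstract, loosening or tightening that budget as the complexity proxy rises or falls; the actual scoring and adaptation rules used by ERASE are described in the paper and repository.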