Combating Textual Noise and Redundancy: Entropy-Aware Dense Visual Token Pruning

2026-07-02 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors study how to speed up vision-language models (VLMs) through visual token pruning, which discards redundant image patch tokens. They find that existing methods lose important details for two reasons: noisy text tokens distort the cross-modal relevance scores, and standard token selection fragments image features. To address this, they propose Entropy-Aware Dense Pruning (EADP), which uses entropy to measure and filter out textual noise and then selects tokens so that the retained set stays holistic and non-redundant. The method preserves fine-grained visual details under strict token budgets and achieves state-of-the-art results on challenging multimodal benchmarks.

Vision-Language Models, Visual Token Pruning, Textual Noise, Cross-Modal Scoring, Entropy, Submodular Maximization, Token Selection, Multimodal Benchmarks, Feature Fragmentation, Efficiency-Accuracy Trade-off

Authors

Xuehui Wang, Xuankun Yang, Wei Shen

Abstract

Visual token pruning is a crucial strategy for accelerating VLMs by compressing redundant image patches, yet existing methods often fail to preserve critical cues under dense instructions and fine-grained queries. In this paper, we investigate this failure and identify two underlying bottlenecks: the widespread dispersion of textual noise that corrupts dense cross-modal scoring, and the feature fragmentation inherent to standard token selection. To address these issues, we propose Entropy-Aware Dense Pruning (EADP), a framework that reformulates pruning as a structured compression problem. EADP first leverages statistical entropy to quantify and filter out textual noise, yielding a robust, fine-grained instruction relevance score. Subsequently, instead of naive Top-K selection, EADP casts token selection as a submodular maximization problem with a spatial prior, explicitly ensuring a holistic and non-redundant visual representation. Extensive experiments demonstrate that EADP improves the accuracy-efficiency trade-off of VLMs, robustly preserving fine-grained visual cues under strict token budgets while achieving SoTA performance on challenging multimodal benchmarks.
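The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so every name and design choice below is an assumption made for illustration. In particular, `relevance_scores` assumes that a text token whose attention over visual tokens is nearly uniform (high entropy) counts as noise and is down-weighted, and `greedy_select` assumes a facility-location coverage term with a Gaussian spatial kernel as the submodular objective. The parameters `lam` and `sigma` are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0.0)

def relevance_scores(text_to_visual_logits):
    """Assumed scoring rule: text_to_visual_logits[t][v] is the similarity
    between text token t and visual token v. Each text token's row is turned
    into a distribution and weighted by 1 - H / log(N), so flat (noisy) rows
    contribute almost nothing."""
    n_visual = len(text_to_visual_logits[0])
    max_h = math.log(n_visual) if n_visual > 1 else 1.0
    scores = [0.0] * n_visual
    for logits in text_to_visual_logits:
        p = softmax(logits)
        weight = 1.0 - entropy(p) / max_h
        for v, pv in enumerate(p):
            scores[v] += weight * pv
    return scores

def greedy_select(scores, positions, budget, lam=1.0, sigma=1.0):
    """Greedy maximization of an assumed objective
    F(S) = sum_{i in S} scores[i] + lam * sum_j max_{i in S} k(i, j),
    where k is a Gaussian kernel on patch positions (the spatial prior).
    F is monotone submodular, so greedy selection is a standard choice."""
    n = len(scores)

    def kernel(i, j):
        dx = positions[i][0] - positions[j][0]
        dy = positions[i][1] - positions[j][1]
        return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))

    covered = [0.0] * n  # best coverage of each token by the selected set
    selected = []
    for _ in range(min(budget, n)):
        best, best_gain = None, -1.0
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain: own relevance plus newly added coverage.
            gain = scores[i] + lam * sum(
                max(kernel(i, j) - covered[j], 0.0) for j in range(n)
            )
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        for j in range(n):
            covered[j] = max(covered[j], kernel(best, j))
    return selected
```

As a usage example under these assumptions, calling `greedy_select([1.0, 0.9, 0.2, 0.1], [(0, 0), (1, 0), (10, 0), (11, 0)], budget=2)` returns tokens 0 and 2, whereas plain Top-2 on the scores would keep the two adjacent tokens 0 and 1. The coverage term is what spreads the selection across the image.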
