DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation
2026-06-15 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors examined why vision models lose accuracy when most image tokens are removed. They found that larger shifts in the token feature distribution correlate with worse performance. To address this, they designed a two-stage method: the first stage transfers contextual information before tokens are pruned, and the second re-selects representative tokens when the distribution shift is severe. Their approach keeps the model accurate even with very few tokens, retaining 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.
vision token pruning, feature distribution shift, distribution consistency metric, Anchor-Context Graph Recovery (ACGR), Text-Aware Token Cluster Selection (TATCS), token budget, LLaVA-1.5-7B, image tokens
Authors
Xifeng Xue, Xiaokang Wang, Zirui Li, Ming-Ming Cheng, Guolei Sun
Abstract
Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budgets. Our analysis shows that as the token budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with the performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric that estimates the distribution shift between the retained tokens and the full token set. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when a severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budgets. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.
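The abstract does not give the form of the distribution consistency metric, so the following is only an illustrative sketch of the general idea of scoring how far a retained token subset drifts from the full token set. It assumes a simple moment-matching score (relative gap in per-dimension mean plus relative gap in per-dimension standard deviation); the function names, the score itself, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import math

def feature_mean(tokens):
    """Per-dimension mean of a list of equal-length feature vectors."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]

def feature_std(tokens, mean):
    """Per-dimension population standard deviation."""
    n = len(tokens)
    return [
        math.sqrt(sum((x - m) ** 2 for x in col) / n)
        for col, m in zip(zip(*tokens), mean)
    ]

def shift_score(full_tokens, keep_indices):
    """Hypothetical shift score (an assumption, not the paper's metric).

    Returns 0.0 when the retained tokens match the full set in per-dimension
    mean and standard deviation; larger values indicate a larger shift.
    """
    retained = [full_tokens[i] for i in keep_indices]
    mu_full = feature_mean(full_tokens)
    mu_kept = feature_mean(retained)
    sd_full = feature_std(full_tokens, mu_full)
    sd_kept = feature_std(retained, mu_kept)
    eps = 1e-8  # guards against division by zero for all-zero statistics
    mean_gap = math.dist(mu_full, mu_kept) / (math.hypot(*mu_full) + eps)
    std_gap = math.dist(sd_full, sd_kept) / (math.hypot(*sd_full) + eps)
    return mean_gap + std_gap

# Toy 2-D "token features": four tokens near the origin and one far away.
tokens = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
balanced = shift_score(tokens, [0, 1, 2, 3, 4])  # keeps every token
skewed = shift_score(tokens, [0, 1])             # drops the distant token
# Illustrative trigger for a re-selection step; the threshold is arbitrary.
needs_reselection = skewed > 0.5
```

In a two-stage pipeline of the kind the abstract describes, a score like this would be computed after an initial selection, and only a high score would trigger the more expensive re-selection step. How the paper actually detects a severe shift is not stated in the abstract.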