MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

2026-05-25

Artificial Intelligence, Computation and Language
AI summary

The authors study how to make large vision-language models smaller and cheaper to deploy without degrading their chain-of-thought reasoning, the step-by-step process these models use to solve complex multimodal tasks. They find that existing structured pruning methods fail to preserve reasoning accuracy for two reasons: they ignore the sparse pivot tokens on which reasoning consistency depends, and they were designed for text-only models, so they do not account for how activation distributions differ between visual and textual inputs. To address this, the authors propose MuCRASP, a structured pruning method that keeps reasoning-critical components intact while preserving cross-modal alignment. Experiments show that MuCRASP preserves reasoning quality better than prior methods and maintains high reasoning consistency at up to 50% pruning.

Vision-language models, Chain-of-thought reasoning, Structured pruning, Multimodal tasks, Cross-modal alignment, Parameter compression, Activation distribution, Model pruning, Reasoning consistency, Perplexity
Authors
Aritra Dutta, Somak Aditya
Abstract
Vision-language models (VLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex multimodal tasks, but their large parameter sizes make deployment expensive. Structured pruning offers a natural solution; however, existing methods fail to preserve CoT reasoning accuracy in VLMs. We identify two key reasons: (1) CoT consistency depends on sparse transition points (pivot tokens) in the generation trajectory, while existing pruning methods are CoT-agnostic; and (2) pruning methods designed for unimodal LLMs do not account for activation-distribution differences across visual and textual modalities. Motivated by these observations, we propose MuCRASP, a structured pruning framework that targets reasoning-critical components while preserving cross-modal alignment and accounting for layer-wise sensitivity under a global parameter budget. Experiments on four VLMs across three reasoning benchmarks show that MuCRASP consistently preserves reasoning quality under increasing compression. At 30% pruning on Qwen2.5-VL-7B, MuCRASP achieves an LLM-as-a-Judge score of 8.87 versus 7.32 for the strongest baseline on physical reasoning tasks. Furthermore, MuCRASP maintains high reasoning consistency up to 50% pruning, significantly outperforming prior pruning approaches while exhibiting lower perplexity degradation.
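
The abstract does not say how MuCRASP scores components or allocates its global parameter budget, so the following is only a generic illustration of structured pruning under a global budget, not the authors' algorithm. It assumes each prunable unit (for example an attention head or an MLP channel group) already has an importance score in which any layer-wise sensitivity has been folded, and it removes the lowest-scoring units across all layers until a target fraction of parameters is gone. The function name, the `min_keep` safeguard, and the example layer names are hypothetical.

```python
from typing import Dict, List


def select_units_to_prune(
    scores: Dict[str, List[float]],
    params_per_unit: Dict[str, int],
    prune_ratio: float,
    min_keep: int = 1,
) -> Dict[str, List[int]]:
    """Greedy global selection of structured units to remove.

    scores:          per-layer importance score of each unit (assumed given).
    params_per_unit: parameter count of one unit in each layer.
    prune_ratio:     fraction of the total parameters to remove.
    min_keep:        hypothetical safeguard so that no layer is emptied.
    """
    total = sum(len(s) * params_per_unit[name] for name, s in scores.items())
    target = prune_ratio * total

    # Flatten to (score, layer, unit index) and visit the least important first.
    candidates = sorted(
        (score, name, idx)
        for name, layer_scores in scores.items()
        for idx, score in enumerate(layer_scores)
    )

    pruned: Dict[str, List[int]] = {name: [] for name in scores}
    removed = 0
    for _score, name, idx in candidates:
        if removed >= target:
            break
        # Skip this unit if removing it would leave the layer below min_keep.
        if len(scores[name]) - len(pruned[name]) <= min_keep:
            continue
        pruned[name].append(idx)
        removed += params_per_unit[name]
    return {name: sorted(idxs) for name, idxs in pruned.items()}


if __name__ == "__main__":
    example_scores = {
        "layer0": [0.9, 0.1, 0.5, 0.7],
        "layer1": [0.2, 0.8, 0.3, 0.6],
    }
    example_costs = {"layer0": 100, "layer1": 100}
    print(select_units_to_prune(example_scores, example_costs, prune_ratio=0.5))
```

Because the ranking is global rather than per layer, layers whose units score low lose more of them, which is one simple way a single budget can produce non-uniform layer-wise sparsity. A CoT-aware or modality-aware method would differ mainly in how the scores are computed, which this sketch leaves open.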