Adversarial Attacks Against MLLMs via Progressive Resolution Processing and Adaptive Feature Alignment
2026-05-11 • Computer Vision and Pattern Recognition
AI summary
The authors study how to trick multimodal large language models (MLLMs) into misidentifying images as specific target objects, which is important for testing model safety. They found existing attack methods are limited because they only focus on the model’s final features and fixed image regions. To improve this, the authors propose PRAF-Attack, a method that uses multiple layers of the model and processes images at different scales to better fool black-box MLLMs. Their approach consistently works better than previous methods on a variety of models, including commercial APIs.
Multimodal Large Language Models, Adversarial Perturbations, Transfer-based Attacks, Black-box Models, Feature Alignment, Intermediate Layer Representations, Multi-scale Processing, Patch-level Optimization, Gradient Consistency, Targeted Attack
Authors
Haobo Wang, Xiaorong Ma, Weiqi Luo, Xiaojun Jia, Jiwu Huang
Abstract
Adversarial perturbations can mislead Multimodal Large Language Models (MLLMs) into recognizing a benign image as a specific target object, posing serious risks in safety-critical scenarios such as autonomous driving and medical diagnosis. This makes transfer-based targeted attacks crucial for understanding and improving black-box MLLM robustness. Existing transfer-based targeted attack methods typically rely on the final global features of the surrogate encoder and anchor optimization to original-resolution target crops, which limits their transferability and robustness. To address these challenges, we propose Progressive Resolution Processing and Adaptive Feature Alignment (PRAF-Attack), a targeted transfer-based attack framework that integrates multi-scale global semantic guidance with robust intermediate-layer local alignment. Unlike prior methods that align only the surrogate encoder's final layer, we design an adaptive feature alignment strategy that leverages intermediate representations to enhance transferability. Specifically, we introduce an adaptive intermediate layer selection mechanism that identifies transferable hierarchical features across surrogate ensembles via gradient consistency, along with an adaptive patch-level optimization strategy that preserves highly correlated local regions through efficient patch filtering. To overcome the reliance on fixed original-resolution target crops, we propose a progressive resolution processing strategy that gradually refines optimization from coarse to fine, enabling the attack to better exploit target information at multiple scales and achieve stronger transferability. We evaluate PRAF-Attack on a diverse suite of black-box MLLMs, including six open-source models and six closed-source commercial APIs. Compared with seven state-of-the-art targeted attack baselines, PRAF-Attack consistently achieves superior transferability.
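The abstract does not define how gradient consistency is measured, so the following is only a minimal sketch of one plausible reading: for each candidate layer, compare the input-image gradients produced by different surrogates and keep the layers whose gradients agree most. The function names (`cosine`, `consistency_score`, `select_layers`), the use of mean pairwise cosine similarity, and the toy gradient vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def consistency_score(grads):
    """Mean pairwise cosine similarity across surrogate gradients.

    Assumption: each entry is a flattened gradient of one surrogate's
    layer-level loss with respect to the shared input image.
    """
    if len(grads) < 2:
        raise ValueError("need gradients from at least two surrogates")
    pairs = list(combinations(grads, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def select_layers(layer_grads, k):
    """Return the k layer names with the highest consistency score."""
    scores = {name: consistency_score(g) for name, g in layer_grads.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: two surrogates, three candidate layers.
layer_grads = {
    "block_4": [[1.0, 0.0], [0.9, 0.1]],    # surrogates largely agree
    "block_8": [[1.0, 0.0], [0.0, 1.0]],    # orthogonal gradients
    "block_12": [[1.0, 0.0], [-1.0, 0.0]],  # opposing gradients
}
selected = select_layers(layer_grads, k=2)
print(selected)
```

In this toy setting the layer whose surrogate gradients point in nearly the same direction ranks first and the layer with opposing gradients ranks last. An actual attack would obtain these gradients by backpropagating through the surrogate encoders rather than from hand-written vectors.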