AI summary
The authors study jailbreak attacks on vision-language models (VLMs) and find that these attacks succeed more often when the surrogate model used to craft the attack is the same kind of model as the target (homogeneous) than when the two differ (heterogeneous). They call this gap "surrogate dependency." To reduce it, they propose Mosaic, a method that avoids relying on any single surrogate model or any single view of the image. Mosaic perturbs refusal-sensitive wording in the text prompt, optimizes image perturbations across diverse cropped views, and aggregates optimization signals from several surrogate VLMs. In their experiments on safety benchmarks, Mosaic outperforms previous methods at bypassing the safety features of commercial closed-source VLMs.
Vision-Language Models, Multimodal jailbreak attacks, Surrogate models, Homogeneous vs Heterogeneous settings, Adversarial optimization, Text-Side Transformation, Multi-View Image Optimization, Surrogate Ensemble Guidance, Attack Success Rate, Closed-source VLMs
Authors
Yuqin Lan, Gen Li, Yuanze Hu, Weihao Shen, Zhaoxin Fan, Faguo Wu, Xiao Zhang, Laurence T. Yang, Zhiming Zheng
Abstract
Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.
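The abstract does not say how the gap between homogeneous and heterogeneous settings is quantified. As a minimal sketch, assume Attack Success Rate is the fraction of attack attempts judged successful and that the gap is the simple difference between the two rates. The function names and the boolean outcome lists below are hypothetical placeholders, not the authors' evaluation code or results.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attempts judged successful (assumed definition of ASR)."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def surrogate_dependency_gap(
    homogeneous: Sequence[bool], heterogeneous: Sequence[bool]
) -> float:
    """Assumed measure: homogeneous ASR minus heterogeneous ASR.

    A positive value means the attack transfers worse when the
    surrogate and target models differ.
    """
    return attack_success_rate(homogeneous) - attack_success_rate(heterogeneous)


# Placeholder judgments for illustration only, not data from the paper.
same_family = [True, True, True, True]
cross_family = [True, False, False, False]
gap = surrogate_dependency_gap(same_family, cross_family)
```

Under this assumed definition, a gap near zero would indicate little surrogate dependency, which is the behavior Mosaic is designed to approach.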