AI summary
The authors studied how multimodal large language models (MLLMs), which process both text and images, can be tricked into generating harmful content. They found that current safety filters only catch harmful inputs when these are obvious in one mode, like just text, but not when harmful intent is hidden across text and images together. To show this, they created a method called Distributed Semantic Recomposition (DSR) that breaks down harmful ideas into safe-looking text and images which the model then combines to produce harmful outputs. Their experiments show that DSR can bypass safety checks effectively while the inputs appear safe, revealing a challenge where the model's ability to follow instructions also makes it vulnerable to exploitation.
Multimodal Large Language Models, Safety Guardrails, Cross-Modal Jailbreak, Harm-Bearing Content, Distributed Semantic Recomposition, Input Toxicity, Instruction Following, Content Moderation, Model Exploitation, Artificial Intelligence Safety
Authors
Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang, Zuobin Ying, Zhuo Ma
Abstract
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in content synthesis and autonomous reasoning. Previous safety guardrails are primarily designed to intercept unimodal textual input, leaving them vulnerable to cross-modal jailbreak attacks. However, both unimodal textual attacks and existing cross-modal jailbreaks typically include explicit harmful or sensitive content at the input level, a property called Harm-Bearing. This allows the model's safety filters to detect and block such content easily. To address this limitation, we propose Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreak framework that decomposes harmful intent into a set of benign textual and visual primitives. By exploiting the model's reasoning ability, DSR enables the latent fusion of these seemingly innocent components into harmful outputs during the cross-modal inference phase. Extensive experiments on multiple commercial MLLM pipelines demonstrate that DSR achieves superior attack success rates while maintaining an extremely low or even negligible input toxicity rate. Our findings uncover a critical Utility-Safety Paradox in MLLMs, where the model's instruction-following proficiency facilitates its own cognitive exploitation. Content Warning: This paper contains harmful model responses.