DE-FIVE: Detecting Malicious Image Prompts via Fourier Features and Image Vector Embeddings

2026-06-22

Cryptography and Security, Computer Vision and Pattern Recognition
AI summary

The authors study vision language models, which process images together with text and can therefore be manipulated by images crafted to make them produce unintended outputs. They point out that current defenses need large datasets for retraining or extra classifiers, and that few specifically target the image-based attacks known as indirect prompt injections. To address this, the authors created DE-FIVE, a method that detects malicious images without retraining by examining image features in the frequency domain and the model’s internal image representations. Their experiments show that DE-FIVE outperforms existing methods at detecting malicious images given to these models.

vision language models, adversarial perturbations, indirect prompt injection, Fourier features, visual encoder, image vector embeddings, black-box detector, white-box detector, few-shot learning, malicious image prompts
Authors
Xingwei Zhong, Varun Sharma, Kar Wai Fok, Vrizlynn L. L. Thing
Abstract
Vision language models (VLMs) employ both visual and textual modalities to enable advanced vision-language inference. However, incorporating visual modalities expands the attack surface of VLMs, making them more susceptible to security threats such as adversarial perturbations and indirect prompt injection, wherein crafted malicious image prompts can elicit unintended model outputs. Existing defense methods against malicious image prompts remain insufficient, as they typically demand extensive datasets for retraining or the deployment of additional, complex classifiers. Most critically, there is a lack of defense mechanisms specifically targeting indirect prompt injections, a gap that serves as a primary motivation for this work. To address these limitations, we introduce DE-FIVE, a novel training-free framework for detecting malicious image prompts by leveraging Fourier features and the hidden state representations of the visual encoder (image vector embeddings) across perturbations. Specifically, we develop a hybrid detection strategy consisting of a black-box detector that operates on Fourier-domain features and a white-box detector that exploits image vector embeddings derived from only a few-shot malicious set. Extensive experiments demonstrate that the proposed framework consistently outperforms state-of-the-art baselines against malicious image prompts.
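
To make the hybrid strategy concrete, the following is a minimal illustrative sketch, not the authors' implementation. The abstract does not specify which Fourier features, similarity measure, or thresholds DE-FIVE uses, so everything here is an assumption: the black-box score is taken to be the fraction of spectral energy outside a low-frequency disc, the white-box score is taken to be the maximum cosine similarity between an image embedding and a small bank of known-malicious embeddings, and the names `high_freq_energy_ratio`, `max_cosine_similarity`, `is_flagged`, and both thresholds are hypothetical. The sketch also omits the comparison across perturbations that the abstract mentions, and it treats the embedding as a vector already extracted from some visual encoder.

```python
import numpy as np


def high_freq_energy_ratio(image, radius_frac=0.25):
    """Assumed black-box feature: share of spectral energy outside a
    centered low-frequency disc of radius radius_frac * min(H, W)."""
    # Collapse colour channels so the FFT runs on a single 2-D plane.
    gray = image.mean(axis=-1) if image.ndim == 3 else image
    # fftshift moves the zero-frequency (DC) term to index (H // 2, W // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    power = np.abs(spectrum) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low_mask = dist <= radius_frac * min(h, w)
    total = power.sum()
    return float(power[~low_mask].sum() / total) if total > 0 else 0.0


def max_cosine_similarity(embedding, malicious_bank):
    """Assumed white-box feature: highest cosine similarity between one
    embedding (shape [D]) and a few-shot malicious bank (shape [K, D])."""
    e = embedding / np.linalg.norm(embedding)
    bank = malicious_bank / np.linalg.norm(malicious_bank, axis=1, keepdims=True)
    return float((bank @ e).max())


def is_flagged(image, embedding, malicious_bank,
               freq_threshold=0.05, sim_threshold=0.9):
    """Hypothetical hybrid rule: flag if either detector fires.
    Both thresholds are placeholders, not values from the paper."""
    return (high_freq_energy_ratio(image) > freq_threshold
            or max_cosine_similarity(embedding, malicious_bank) > sim_threshold)
```

In this sketch no model is trained: the frequency score needs only the image, and the embedding score needs only a handful of stored malicious examples, which mirrors the training-free, few-shot framing of the abstract. In practice the thresholds would have to be calibrated on held-out benign and malicious images.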