Anti-I2V: Safeguarding your photos from malicious image-to-video generation

2026-03-25

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors focus on stopping fake videos generated from a person's photo and a text prompt by video-generation AI models. Existing defenses mostly target image generation or older AI designs, but their method, Anti-I2V, works across video models, including newer transformer-based ones. The approach perturbs image features in a perceptual color space and in the frequency domain, and targets the parts of the model's processing that matter most, in order to degrade the quality and consistency of any generated video. Tests show Anti-I2V effectively defends against a range of video diffusion models that create fake human animations.

Keywords
diffusion models, image-to-video generation, adversarial attacks, Diffusion Transformer (DiT), UNet architecture, video diffusion models, temporal coherence, frequency domain, L*a*b* color space, denoising process
Authors
Duc Vu, Anh Nguyen, Chi Tran, Anh Tran
Abstract
Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation; relatively few explicitly address image-to-video diffusion models (VDMs), and those that do primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency owing to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation that is applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the $L^*a^*b^*$ and frequency domains, improving robustness and concentrating perturbations on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process and design training objectives that maximize the degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective safeguard against this misuse.
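To make the color-space and frequency-domain idea concrete, here is a minimal, illustrative sketch. It is not the paper's actual optimization (the abstract does not specify the perturbation objective or the targeted layers): it merely shows how an sRGB image can be converted to CIE L*a*b* with the standard D65 formulas and how a small random perturbation can be injected into the high-frequency band of the luminance channel. All function names (`rgb_to_lab`, `perturb_luminance_frequency`) and the noise model are hypothetical choices for this sketch.

```python
import numpy as np

# Standard sRGB (D65) -> XYZ matrix and white point; these constants are
# from the sRGB/CIE specifications, not from the paper.
_SRGB_TO_XYZ = np.array([[0.4124564, 0.3575761, 0.1804375],
                         [0.2126729, 0.7151522, 0.0721750],
                         [0.0193339, 0.1191920, 0.9503041]])
_D65_WHITE = np.array([0.95047, 1.0, 1.08883])

def rgb_to_lab(rgb):
    """Convert an (H, W, 3) sRGB image in [0, 1] to CIE L*a*b*."""
    rgb = np.asarray(rgb, dtype=np.float64)
    # Undo the sRGB gamma, then map linear RGB to XYZ.
    linear = np.where(rgb <= 0.04045,
                      rgb / 12.92,
                      ((rgb + 0.055) / 1.055) ** 2.4)
    xyz = linear @ _SRGB_TO_XYZ.T
    t = xyz / _D65_WHITE
    delta = 6.0 / 29.0
    f = np.where(t > delta ** 3, np.cbrt(t), t / (3 * delta ** 2) + 4.0 / 29.0)
    L = 116.0 * f[..., 1] - 16.0
    a = 500.0 * (f[..., 0] - f[..., 1])
    b = 200.0 * (f[..., 1] - f[..., 2])
    return np.stack([L, a, b], axis=-1)

def perturb_luminance_frequency(lab, eps=0.5, seed=0):
    """Add small random noise to the high-frequency part of the L* channel.

    This stands in for a crafted adversarial perturbation: in a real attack
    the update would be optimized against the video model, not random.
    """
    rng = np.random.default_rng(seed)
    L = lab[..., 0]
    F = np.fft.fftshift(np.fft.fft2(L))
    h, w = L.shape
    yy, xx = np.ogrid[:h, :w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    high_freq = radius > min(h, w) / 4  # leave low frequencies untouched
    F = F + high_freq * eps * (rng.standard_normal(L.shape)
                               + 1j * rng.standard_normal(L.shape))
    # Taking the real part discards the non-Hermitian residue; fine for a sketch.
    L_pert = np.real(np.fft.ifft2(np.fft.ifftshift(F)))
    out = lab.copy()
    out[..., 0] = np.clip(L_pert, 0.0, 100.0)
    return out

img = np.full((32, 32, 3), 0.8)   # flat gray test image
lab = rgb_to_lab(img)
lab_adv = perturb_luminance_frequency(lab)
```

Operating on L*a*b* rather than raw RGB lets the perturbation budget be spent in a perceptually meaningful space, and the frequency mask confines changes to components that are less visible to humans but still influence a model's features; the abstract suggests Anti-I2V combines both domains, though the exact coupling is not described here.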