Concept Unlearning via Cross-Attention Activation Projection for Diffusion Models

2026-05-25 | Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
AI summary

The authors propose a method called PURE to help pretrained text-to-image models forget specific concepts without retraining them. Unlike previous methods that identify concepts using text prompts, PURE focuses on how the model activates internally while generating images to better erase the target concept. This approach makes it harder for different or tricky prompts to bypass the forgetting process. PURE works by adjusting the model's attention mechanisms in a single step, improving the balance between forgetting unwanted concepts and keeping the rest of the model intact.

concept unlearning, text-to-image diffusion model, cross-attention, closed-form method, denoising trajectory, text encoder, linear projector, prompt paraphrasing, model editing, forget-retain trade-off
Authors
Saemi Moon, Suhyeon Jun, Seoyeon Lee, Dongwoo Kim
Abstract
Concept unlearning aims to erase a target concept from a pretrained text-to-image diffusion model without retraining. Closed-form methods are attractive in this setting because they apply a single deterministic edit to the cross-attention weights and add no inference-time cost. Existing closed-form methods, however, represent the target concept through the text encoder's response to a few short anchor prompts that name it, so paraphrased prompts that evoke the concept without naming it consistently bypass the edit. We argue that the target should instead be represented in the cross-attention activation space. Text embeddings describe the user's prompt, while cross-attention activations describe what the model is about to render, and the latter generalize to paraphrases that the anchor templates do not cover. Building on this observation, we propose PURE (Projection in U-Net Rendering for Erasure), a closed-form method that builds the forget and retain bases from per-layer cross-attention activations captured along a short denoising trajectory and applies a single linear projector to the cross-attention key and value weights. On a recent holistic concept-unlearning benchmark covering ten concepts across artistic style, intellectual property, celebrity, and NSFW categories, PURE significantly reduces target leakage under paraphrased and adversarial prompts while preserving retain concepts close to the unedited model, yielding the best overall forget-retain trade-off among evaluated methods.
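
To make the idea of a single linear projector applied to key and value weights more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the forget and retain bases are constructed or combined, so the SVD-based bases, the step that removes retain directions from the forget activations, the rank parameters, and all function names here are assumptions made for illustration only. Activations are assumed to be already collected into matrices of shape (n_samples, d).

```python
import numpy as np

def orthonormal_basis(acts, rank):
    """Top-`rank` right singular vectors of an (n_samples, d) activation matrix.

    Returns a (d, rank) matrix with orthonormal columns.
    """
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    return vt[:rank].T

def erasure_projector(forget_acts, retain_acts, forget_rank, retain_rank):
    """Build a (d, d) projector that removes forget directions (assumed recipe).

    Assumption: retain directions are first projected out of the forget
    activations, so the resulting projector leaves the retain subspace fixed.
    `forget_rank` should not exceed the rank of the residual activations.
    """
    d = forget_acts.shape[1]
    U_r = orthonormal_basis(retain_acts, retain_rank)
    # Keep only the part of the forget activations orthogonal to the retain basis.
    residual = forget_acts - forget_acts @ U_r @ U_r.T
    U_f = orthonormal_basis(residual, forget_rank)
    # Orthogonal projector onto the complement of the forget subspace.
    return np.eye(d) - U_f @ U_f.T

def edit_weight(W, P):
    """Apply the projector once to a (d, d_text) key or value weight matrix."""
    return P @ W
```

In this sketch the edited weight `edit_weight(W, P)` replaces the original key or value weight of one cross-attention layer, so the edit costs nothing extra at inference time; a per-layer version would repeat the same construction with that layer's own activations.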