Concept Removal Guidance: Evidence-Calibrated Negative Guidance for Safe Diffusion Sampling
2026-06-29 • Computer Vision and Pattern Recognition
AI summary
The authors studied how text-to-image AI models can be tricked into creating unwanted or harmful images by certain prompts. They found that existing methods to block unwanted content either cause the images to look bad or don't work well. To fix this, the authors developed a new method called Concept Removal Guidance (CRG) that checks for unwanted content step-by-step during image creation and adjusts the process to keep that content out without messing up the picture quality. Their method works without extra training and can block different types of unwanted content while keeping normal images looking good.
text-to-image diffusion models, adversarial prompts, negative guidance, posterior-odds, denoising trajectory, Concept Removal Guidance, conditional trajectory, attack success rate, image fidelity, red-teaming benchmarks
Authors
Yoonseok Choi, Chaeyoung Oh, Hyunjun Choi, Seokin Seo, Kee-Eung Kim
Abstract
Text-to-image diffusion models remain vulnerable to adversarial prompts that elicit disallowed content, motivating reliable inference-time controls. A popular approach is negative guidance, which subtracts a negative prompt direction with a fixed weight. However, it often forces a safety-fidelity trade-off, causing artifacts or prompt drift when over-applied and failing under attacks when under-applied. Dynamic variants reweight guidance using posterior-odds signals, which can be brittle for open-vocabulary compositional prompts, while lightweight similarity-based methods ignore the evolving image evidence along the denoising trajectory. We introduce Concept Removal Guidance (CRG), a training-free method that estimates unwanted-concept presence at each diffusion step from the model's noise predictions, and adaptively calibrates negative guidance via a closed-form constrained update enforcing a target presence threshold while minimally perturbing the conditional trajectory. Across red-teaming benchmarks, CRG reduces attack success rates while preserving benign fidelity, and extends to additional suppression targets such as artist style and violence without fine-tuning or external classifiers.
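To make the contrast in the abstract concrete, the following minimal sketch compares fixed-weight negative guidance with a threshold-calibrated variant. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify CRG's presence estimator or its update rule. Here the presence score is assumed to be the projection of the guided noise prediction onto the negative direction `eps_neg - eps_uncond`, and the "closed-form constrained update" is assumed to be the Euclidean projection onto the half-space where that score is at most `tau`. All function names and the parameters `w`, `s`, and `tau` are hypothetical.

```python
import numpy as np

def cfg(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance on noise predictions."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def fixed_negative_guidance(eps_uncond, eps_cond, eps_neg, w, s):
    """Subtract the negative-prompt direction with a fixed weight s."""
    return cfg(eps_uncond, eps_cond, w) - s * (eps_neg - eps_uncond)

def presence(eps, eps_uncond, eps_neg):
    """ASSUMED presence score: normalized projection of (eps - eps_uncond)
    onto the negative direction. Not the estimator from the paper."""
    d = eps_neg - eps_uncond
    return float(np.vdot(eps - eps_uncond, d) / (np.vdot(d, d) + 1e-12))

def calibrated_negative_guidance(eps_uncond, eps_cond, eps_neg, w, tau):
    """ASSUMED calibration: the smallest change to the guided prediction
    (in L2 norm) that brings the presence score down to at most tau.
    This is a half-space projection, so the weight has a closed form."""
    eps = cfg(eps_uncond, eps_cond, w)
    d = eps_neg - eps_uncond
    # Weight is zero when the score is already below the threshold.
    s = max(0.0, presence(eps, eps_uncond, eps_neg) - tau)
    return eps - s * d, s

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shape = (4, 8, 8)  # toy latent shape, chosen arbitrarily
    eps_uncond = rng.normal(size=shape)
    eps_neg = eps_uncond + rng.normal(size=shape)
    # A conditional prediction that leans toward the unwanted concept.
    eps_cond = eps_uncond + 0.7 * (eps_neg - eps_uncond)
    out, s = calibrated_negative_guidance(eps_uncond, eps_cond, eps_neg, w=2.0, tau=0.1)
    print("adaptive weight:", s)
    print("presence after update:", presence(out, eps_uncond, eps_neg))
```

Under these assumptions the weight grows only as far as the threshold requires and vanishes for predictions that already satisfy it, which is one way to read the abstract's goal of enforcing a target presence threshold while minimally perturbing the conditional trajectory. A fixed `s` applies the same push regardless of the current evidence.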