SHIFT: Steering Hidden Intermediates in Flow Transformers

2026-04-10 • Computer Vision and Pattern Recognition

AI summary

The authors present SHIFT, a simple method to change images made by DiT diffusion models without needing to retrain them. It works by steering the model's internal activations during image generation to remove unwanted parts or add new styles and objects. This approach lets users control what appears in the image, keeping the rest of the content and quality intact. SHIFT is flexible and fast because it adjusts things dynamically as the image is created.

diffusion models, DiT (Diffusion Transformer), image generation, activation steering, inference time, concept removal, style transfer, prompt adherence

Authors

Nina Konovalova, Andrey Kuznetsov, Aibek Alanov

Abstract

Diffusion models have become leading approaches for high-fidelity image generation. Recent DiT-based diffusion models, in particular, achieve strong prompt adherence while producing high-quality samples. We propose SHIFT, a simple but effective and lightweight framework for concept removal in DiT diffusion models via targeted manipulation of intermediate activations at inference time, inspired by activation steering in large language models. SHIFT learns steering vectors that are dynamically applied to selected layers and timesteps to suppress unwanted visual concepts while preserving the prompt's remaining content and overall image quality. Beyond suppression, the same mechanism can shift generations into a desired style domain or bias samples toward adding or changing target objects. We demonstrate that SHIFT provides effective and flexible control over DiT generation across diverse prompts and targets without time-consuming retraining.
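
As a rough illustration of the general idea of activation steering gated by layer and timestep, the following is a minimal sketch and not the authors' implementation. It assumes the simplest additive form, hidden + strength * vector; the abstract does not specify SHIFT's update rule or how the vectors are learned. The names SteeringRule and steer, the layer set, the timestep window, and the strength value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set

Vector = List[float]


@dataclass
class SteeringRule:
    """Hypothetical container for one steering vector and its gate."""
    vector: Vector    # assumed to be learned elsewhere; fixed here for illustration
    layers: Set[int]  # layer indices where steering is active
    t_min: int        # first timestep (inclusive) where steering is active
    t_max: int        # last timestep (inclusive) where steering is active
    strength: float   # negative pushes away from the concept, positive toward it


def steer(hidden: Vector, layer: int, timestep: int, rule: SteeringRule) -> Vector:
    """Return hidden + strength * vector if the layer/timestep gate is open.

    Outside the selected layers or timestep window the activation is
    returned unchanged (as a copy).
    """
    gate_open = layer in rule.layers and rule.t_min <= timestep <= rule.t_max
    if not gate_open:
        return list(hidden)
    return [h + rule.strength * v for h, v in zip(hidden, rule.vector)]


# Toy usage: a 3-dimensional activation and an assumed steering direction.
rule = SteeringRule(vector=[1.0, 0.0, -1.0], layers={2, 3},
                    t_min=10, t_max=40, strength=-0.5)
steered = steer([0.2, 0.4, 0.6], layer=2, timestep=20, rule=rule)
untouched = steer([0.2, 0.4, 0.6], layer=0, timestep=20, rule=rule)
```

In a real DiT pipeline the same gate would typically sit inside a per-block hook on the transformer's hidden states, with the current denoising timestep passed in from the sampling loop; that wiring is omitted here because it depends on the specific model code.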
