ResEdit: Residual embeddings for precise generative image editing

2026-06-15, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Graphics
AI summary

The authors explore how to improve image editing with conditional diffusion models without needing large paired datasets. They introduce a residual image encoding, used as additional conditioning, that helps preserve the original image's details while allowing precise edits. To keep this residual from interfering with the intended changes, they use a gradient reversal-based optimization strategy that disentangles the residual information from the edited condition. Their approach improves identity preservation and editability, which they demonstrate on intrinsic-based editing and relighting, along with proof-of-concept text-guided manipulation.

conditional diffusion models, image editing, inversion, residual encoding, gradient reversal, identity preservation, image reconstruction, intrinsic-based editing, relighting, text-guided manipulation
Authors
Ahmet Canberk Baykal, Valentin Deschaintre, Yannick Hold-Geoffroy, Michael Fischer, Anna Frühstück, Cengiz Öztireli, Iliyan Georgiev
Abstract
Conditional diffusion image generators can be repurposed for editing through inversion, without the need for large-scale paired fine-tuning data. However, producing high-quality, targeted edits while maintaining image identity and global consistency remains challenging, as weakly conditioned inversion often embeds conflicting image features into the noise. We demonstrate that incorporating a residual image encoding as additional conditioning enables both improved identity preservation and better editability. We optimize this residual encoding to provide a strong conditioning signal for reconstruction, thereby reducing the reliance on inversion and susceptibility to its aforementioned pitfalls. To ensure this residual does not interfere with desired edits, we incorporate a gradient reversal-based optimization strategy that disentangles the residual from the edited condition. We illustrate our method's ability to produce high-fidelity results across precise intrinsic-based editing and relighting, and show proof-of-concept text-guided manipulation.
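
The abstract does not spell out how the gradient reversal is wired in, so the following is only a toy illustration of the generic gradient reversal trick (identity in the forward pass, sign-flipped gradient in the backward pass), not the authors' implementation. The scalar "encoder" weight `w`, the "adversary" weight `a`, the squared-error loss, and all function names are assumptions made for this sketch. The idea it shows: an adversary tries to predict the edited condition `c` from the residual, while the reversed gradient pushes the encoder to make that prediction harder, which is one way a residual could be discouraged from carrying condition information.

```python
def grl_forward(x):
    """Gradient reversal layer, forward pass: plain identity."""
    return x


def grl_backward(grad_output, lam=1.0):
    """Gradient reversal layer, backward pass: flip the sign and scale by lam."""
    return -lam * grad_output


def toy_grads(w, a, x, c, lam=1.0):
    """Manual forward/backward pass for a scalar toy model (hypothetical setup).

    w: weight of a toy residual encoder, r = w * x
    a: weight of a toy adversary predicting the condition, c_hat = a * r
    c: the condition value the adversary tries to recover from the residual
    """
    # Forward pass.
    r = w * x                  # toy residual encoding
    r_rev = grl_forward(r)     # identity, so the adversary sees r unchanged
    c_hat = a * r_rev          # adversary's prediction of the condition
    loss = (c_hat - c) ** 2    # adversary's squared error

    # Backward pass, written out by hand with the chain rule.
    dloss_dchat = 2.0 * (c_hat - c)
    grad_a = dloss_dchat * r_rev            # adversary descends the loss as usual
    grad_r_rev = dloss_dchat * a
    grad_r = grl_backward(grad_r_rev, lam)  # sign flipped before reaching the encoder
    grad_w = grad_r * x                     # encoder effectively ascends the loss
    return loss, grad_a, grad_w


if __name__ == "__main__":
    loss, grad_a, grad_w = toy_grads(w=0.5, a=2.0, x=3.0, c=1.0)
    print(loss, grad_a, grad_w)
```

With a single gradient-descent update applied to both weights, the adversary moves to reduce its prediction error while the encoder moves to increase it; in a full model this adversarial term would be combined with a reconstruction objective, and `lam` would balance the two.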