Mosaic: Compositional Multi-Concept Erasure via Vector Field Blending

2026-05-25

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors study how to remove several unwanted concepts at once from a single image generated by a Text-to-Image model, which is harder than removing one concept per image. They introduce a benchmark, CoME-Bench, that evaluates how well methods erase multiple concepts appearing together in one scene. They also propose Mosaic, a method that builds concept-specific masks over the model's vector field and blends them selectively, with no additional optimization. Their experiments indicate that Mosaic erases multiple target concepts while keeping the rest of the image intact.

Text-to-Image models, concept erasure, flow-based models, multi-concept erasure, compositional scenes, CoME-Bench, spatial locality, masking, image synthesis, vector field
Authors
Junseok Ko, Jungwoo Kim, Jong-Seok Lee
Abstract
Concept erasure has emerged as a key research direction for ensuring safe and ethical image synthesis in Text-to-Image (T2I) models. While existing studies have explored concept erasure across multiple concepts, they typically assume only a single target concept per image, a limitation increasingly exposed by modern flow-based T2I models, which can generate complex scenes with multiple concepts simultaneously. To address this gap, we introduce compositional multi-concept erasure, a new task that aims to simultaneously remove multiple target concepts within a single scene. We propose CoME-Bench, a benchmark for evaluating compositional multi-concept erasure, which covers both intra- and cross-category scenarios. We further propose Mosaic, a novel framework for multi-concept erasure in flow-based T2I models, which exploits the spatial locality of target concepts in the vector field by dynamically constructing concept-specific masks and selectively blending them without additional optimization. Extensive experiments demonstrate that Mosaic effectively removes multiple target concepts in complex compositional scenes while preserving non-target contexts.
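The abstract describes Mosaic only at a high level, so the following is a minimal illustrative sketch of what mask-based vector field blending could look like, not the authors' implementation. It assumes a base velocity field and one "erased" velocity field per target concept, each of shape (C, H, W). The function names `concept_mask` and `blend_fields`, the relative threshold `rel`, and the rule of averaging where masks overlap are all hypothetical choices made for illustration.

```python
import numpy as np


def concept_mask(v_base, v_erased, rel=0.5):
    """Hypothetical mask: 1 where erasing the concept changes the field most.

    v_base, v_erased: arrays of shape (C, H, W). Returns an (H, W) array.
    """
    # Per-location magnitude of the change, taken over the channel axis.
    diff = np.linalg.norm(v_erased - v_base, axis=0)
    peak = diff.max()
    if peak == 0:
        # The erased field equals the base field, so nothing is localized.
        return np.zeros_like(diff)
    # Keep locations whose change exceeds a fraction of the peak change.
    return (diff > rel * peak).astype(v_base.dtype)


def blend_fields(v_base, v_erased_list, rel=0.5):
    """Blend per-concept erased fields into the base field using masks.

    Inside a concept's mask the erased field is used; where masks overlap,
    the erased fields are averaged; elsewhere the base field is kept.
    """
    masks = [concept_mask(v_base, v_e, rel) for v_e in v_erased_list]
    accum = np.zeros_like(v_base)
    weight = np.zeros(v_base.shape[1:], dtype=v_base.dtype)
    for m, v_e in zip(masks, v_erased_list):
        accum += m * v_e  # (H, W) mask broadcasts over the channel axis
        weight += m
    blended = v_base.copy()
    covered = weight > 0
    blended[:, covered] = accum[:, covered] / weight[covered]
    return blended, masks
```

In this sketch, locations outside every mask keep the base field unchanged, which is one simple way to reflect the stated goal of preserving non-target context. The blending is a single pass over precomputed fields with no optimization loop, in line with the abstract's "without additional optimization"; how Mosaic actually constructs and combines its masks is not specified in the text above.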