HomeDiffusion: Zero-Shot Object Customization with Multi-View Representation Learning for Indoor Scenes

2026-06-29
Computer Vision and Pattern Recognition

AI summary

The authors present HomeDiffusion, a method for placing customized objects such as furniture into indoor scenes without per-object training. Earlier approaches often lose fine details and struggle to produce a plausible pose, especially for asymmetrical objects when only a single view is available. HomeDiffusion instead uses several images of the same object taken from different viewpoints to keep its appearance consistent and its pose realistic. The authors also built a dataset of multi-angle furniture and indoor-scene images and added a mechanism that attends to fine details of the reference object during generation. In their experiments, the method outperforms existing zero-shot and few-shot customization methods.

zero-shot generation, object customization, diffusion models, multi-view images, cross-attention, latent space, furniture rendering, pose harmonization, image synthesis, few-shot learning
Authors
Guoqiu Li, Jin Song, Yiyun Fei
Abstract
Zero-shot methods for customized object generation have developed rapidly in recent years and show great potential in applications. In e-commerce, for instance, consumers can preview how furniture would look in their own living spaces or how clothes would look on their own bodies. Many existing approaches build on diffusion models and condition generation on features extracted from a reference object. However, the generated object diverges significantly from the reference object in details such as patterns and curves. For asymmetrical reference objects in particular, the lack of comprehensive multi-viewpoint information prevents these methods from generating object poses that harmonize with the background scene. To address these shortcomings, we construct a novel dataset comprising multi-angle images of furniture and indoor scenes. Building on diffusion models, we introduce HomeDiffusion, which leverages multi-viewpoint images of the same reference object to generate visually harmonious object poses within specified areas of the background scene. During the diffusion process, we further extract high-fidelity details of the reference object and perform cross-attention between these details and the noise latents in the latent space, which preserves details in the customized object. Extensive qualitative and quantitative experiments demonstrate that our method outperforms existing zero-shot and few-shot object customization approaches.
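
The cross-attention step mentioned in the abstract can be pictured with a minimal NumPy sketch. This is an illustration under stated assumptions, not the HomeDiffusion implementation: the abstract does not give the architecture, so the single attention head, the random projection matrices `w_q`, `w_k`, `w_v`, the tensor sizes, and the choice to concatenate tokens from all views into one key/value sequence are all hypothetical. The sketch only shows the general pattern in which queries come from the noise latents and keys and values come from reference-object features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latents, ref_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative only).

    latents:    (n_latent, d_latent) noise-latent tokens, used as queries
    ref_tokens: (n_ref, d_ref) reference-object tokens, used as keys/values
    """
    q = latents @ w_q                      # (n_latent, d_attn)
    k = ref_tokens @ w_k                   # (n_ref, d_attn)
    v = ref_tokens @ w_v                   # (n_ref, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each latent token attends over all reference tokens
    return weights @ v, weights

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
n_views, tokens_per_view, d_ref = 3, 4, 8
d_latent, d_attn = 6, 5

latents = rng.normal(size=(16, d_latent))  # e.g. a flattened 4x4 latent grid
view_feats = rng.normal(size=(n_views, tokens_per_view, d_ref))

# Assumption: tokens from all views are concatenated into one sequence.
ref_tokens = view_feats.reshape(n_views * tokens_per_view, d_ref)

w_q = rng.normal(size=(d_latent, d_attn))
w_k = rng.normal(size=(d_ref, d_attn))
w_v = rng.normal(size=(d_ref, d_attn))

out, weights = cross_attention(latents, ref_tokens, w_q, w_k, w_v)
```

In a full diffusion model, the projections would be learned, and the attended output would typically pass through an output projection and be added back to the latents. Those parts are omitted here.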