Fashion130K: An E-commerce Fashion Dataset for Outfit Generation with Unified Multi-modal Condition
2026-05-11 • Computer Vision and Pattern Recognition
AI summary
The authors introduce a new fashion dataset, Fashion130K, that covers a wide variety of garments, models, and occasions to support outfit image generation. They design a framework called Unified Multi-modal Condition (UMC) that combines information from both text and images to make outfit generation more consistent. The method uses a Fusion Transformer to align text and image embeddings, so the generation model attends to the important details when creating new outfits. Experiments show the approach outperforms existing methods at keeping outfits visually consistent.
fashion outfit generation, multi-modal embedding, Fusion Transformer, visual consistency, text-image alignment, e-commerce dataset, attention mechanism, embedding refiner, noise image
Authors
Yu He, Ting Zhu, Yichun Liu, Lichen Ma, Xinyuan Shan, Jingling Fu, Yu Shi, Junshi Huang, Yan Li
Abstract
Recent work on fashion outfit generation focuses on promoting the visual consistency of garments by leveraging key information from a reference image and a text prompt. However, the potential of outfit generation remains underexplored, requiring a comprehensive e-commerce dataset and careful use of multi-modal conditions. In this paper, we propose a brand-new e-commerce dataset, named Fashion130K, with various occasions, models, and garment types. For consistent garment generation, we design a framework with a Unified Multi-modal Condition (UMC) that aligns and integrates the text and visual prompts into the generation model. Specifically, we explore an embedding refiner to extract unified embeddings of the multi-modal prompts, within which a Fusion Transformer is proposed to align the multi-modal embeddings by narrowing the modality gap between text and image. Based on the unified embeddings, the attention in the generation model is redesigned to emphasize the correlations between the prompts and the noise image, so that the noise image can select the pivotal tokens of the prompts for consistent outfit generation. Our dataset and proposed framework offer a general and nuanced exploration of multi-modal prompts for generation models. Extensive experiments on real-world applications and benchmarks demonstrate the effectiveness of UMC in visual consistency, achieving more promising results than SoTA methods.
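To make the described conditioning pipeline concrete, here is a minimal PyTorch sketch of the general idea: a Fusion Transformer that jointly encodes text and image embeddings into unified prompt tokens, followed by a cross-attention layer through which the noise-image latents select the pivotal prompt tokens. All module names, dimensions, the learned modality embeddings, and the residual layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class FusionTransformer(nn.Module):
    """Hypothetical sketch: align text and image embeddings in one space.

    Learned per-modality offsets nudge the two token streams toward a
    shared space before joint self-attention over the concatenated tokens.
    """
    def __init__(self, dim=768, num_layers=2, num_heads=8):
        super().__init__()
        self.text_mod = nn.Parameter(torch.zeros(1, 1, dim))   # text offset
        self.image_mod = nn.Parameter(torch.zeros(1, 1, dim))  # image offset
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_emb, image_emb):
        # text_emb: (B, Lt, dim), image_emb: (B, Li, dim)
        tokens = torch.cat([text_emb + self.text_mod,
                            image_emb + self.image_mod], dim=1)
        return self.encoder(tokens)  # unified embeddings: (B, Lt+Li, dim)

class UnifiedCrossAttention(nn.Module):
    """Hypothetical sketch: noise-image latents attend to unified prompts."""
    def __init__(self, latent_dim=320, prompt_dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=latent_dim, kdim=prompt_dim, vdim=prompt_dim,
            num_heads=num_heads, batch_first=True)

    def forward(self, noise_tokens, unified_emb):
        # Each noise token softly selects the most relevant prompt tokens.
        out, _ = self.attn(noise_tokens, unified_emb, unified_emb)
        return noise_tokens + out  # residual update of the latents

if __name__ == "__main__":
    fusion = FusionTransformer()
    xattn = UnifiedCrossAttention()
    text = torch.randn(2, 77, 768)    # e.g. text-encoder tokens (assumed shape)
    image = torch.randn(2, 16, 768)   # e.g. reference-garment patch tokens
    unified = fusion(text, image)             # (2, 93, 768)
    latents = torch.randn(2, 64 * 64, 320)    # flattened noise-image tokens
    print(xattn(latents, unified).shape)      # torch.Size([2, 4096, 320])
```

In this reading, the Fusion Transformer plays the role of the embedding refiner's alignment step, and the cross-attention layer stands in for the redesigned attention in the generation model; the actual paper may distribute these components differently across the diffusion backbone.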