Mixed-Modality Dual Face-Hair Retrieval

2026-06-02 • Computer Vision and Pattern Recognition

AI summary

The authors introduce a new image search task called Dual Face-Hair Retrieval (DFHR), where the goal is to find images that match both a person's face and a target hairstyle, with the hairstyle given either as a picture or as text. This is challenging because face identity and hairstyle are independent attributes, and the two parts of the query can come from different modalities. To support the task, the authors built DFHR-Bench, a benchmark of over 180K annotated triplets covering both image-image and image-text queries, and developed a method called MFHC (Multimodal Face-Hair Combiner) that fuses separately extracted identity and hairstyle embeddings. The work targets retrieval that is driven by several kinds of input at once while keeping identity and hairstyle information disentangled.

image retrieval, face recognition, hairstyle attribute, cross-modal learning, embedding space, feature disentanglement, multimodal fusion, dataset annotation, visual search, semantic alignment

Authors

Quoc-Anh Bui-Huynh, Mai-Tuyen Lam, Dai-Anh-Tuan Nguyen, Thanh Duc Ngo

Abstract

We introduce Dual Face-Hair Retrieval (DFHR), a new mixed-modality dual-reference task in image retrieval where a query consists of a face image specifying identity and a hairstyle reference expressed as either an image or text. Unlike prior retrieval settings, DFHR requires cross-component reasoning between two semantically independent attributes -- identity and hairstyle -- originating from heterogeneous modalities. This formulation demands localized feature disentanglement, cross-modal semantic alignment, and mixed-modality composition within a unified embedding space. We construct DFHR-Bench, the first benchmark for mixed-modality face-hair retrieval, comprising over 180K annotated triplets across dual-image and image-text settings, built via a multi-stage annotation protocol ensuring semantic and identity integrity. We further propose MFHC (Multimodal Face-Hair Combiner), a unified framework that fuses disentangled identity and hairstyle embeddings through token injection and multi-view supervision. DFHR and DFHR-Bench together establish a new paradigm for identity-aware, attribute-controllable visual retrieval across modalities.
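To make the task formulation concrete, the following is a minimal sketch of dual-reference ranking under simplifying assumptions. It is not MFHC: the abstract describes fusion through token injection and multi-view supervision, whereas this sketch only combines two cosine similarities with a fixed weight (simple late fusion). The names `GalleryItem`, `rank_gallery`, and `hair_weight`, as well as the toy two-dimensional vectors, are hypothetical. The hairstyle query vector is assumed to already live in a shared embedding space, so it could come from either an image encoder or a text encoder.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


@dataclass
class GalleryItem:
    # Hypothetical container: one gallery image with two precomputed embeddings.
    name: str
    identity: Vector   # who is in the image
    hairstyle: Vector  # hairstyle, in the shared (image/text) space


def rank_gallery(face_query: Vector, hair_query: Vector,
                 gallery: List[GalleryItem],
                 hair_weight: float = 0.5) -> List[GalleryItem]:
    """Rank gallery items by a weighted sum of identity and hairstyle similarity.

    Assumption: fixed-weight late fusion, used here only for illustration.
    """
    def score(item: GalleryItem) -> float:
        return ((1.0 - hair_weight) * cosine(face_query, item.identity)
                + hair_weight * cosine(hair_query, item.hairstyle))

    return sorted(gallery, key=score, reverse=True)


# Toy data: axis 0/1 of `identity` stand for two people,
# axis 0/1 of `hairstyle` stand for two hairstyles.
gallery = [
    GalleryItem("alice_bob_cut", [1.0, 0.0], [1.0, 0.0]),
    GalleryItem("alice_long_hair", [1.0, 0.0], [0.0, 1.0]),
    GalleryItem("carol_bob_cut", [0.0, 1.0], [1.0, 0.0]),
]

face_query = [0.9, 0.1]  # face image of "alice"
hair_query = [0.1, 0.9]  # "long hair", from an image or a text description

ranked = rank_gallery(face_query, hair_query, gallery)
print([item.name for item in ranked])
```

In this toy setup the item that matches both the queried identity and the queried hairstyle ranks first, the item matching only the identity ranks second, and the item matching neither ranks last. A learned combiner would replace the fixed `hair_weight` with a trained fusion of the two embeddings.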
