NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
2026-03-05 • Computer Vision and Pattern Recognition
AI summary
The authors study how to retrieve nail design images from very detailed descriptions of the colors, styles, and decorations users want. They note that current systems struggle to understand such complex descriptions together with users' color choices. To address this, they propose a new method, NaiLIA, that better matches the descriptions and color preferences with images. They evaluated it on a large set of nail design images described by many people and found that it outperforms existing methods.
multimodal retrieval · vision-language models · intent descriptions · palette queries · nail design images · confidence scores · relaxed loss · image annotation · benchmark dataset
Authors
Kanon Amemiya, Daichi Yashima, Kei Katsumata, Takumi Komatsu, Ryosuke Korekata, Seitaro Otsuki, Komei Sugiura
Abstract
We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.
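The abstract mentions a relaxed loss based on confidence scores for unlabeled images, without giving its formulation. One plausible reading, sketched below purely as an illustration (the function name `relaxed_contrastive_loss`, the soft-target construction, and all parameter choices are assumptions, not the paper's actual method), is a contrastive loss whose hard one-hot targets are softened: unlabeled images that a confidence score deems compatible with a description receive some target probability mass instead of being treated as strict negatives.

```python
import numpy as np

def relaxed_contrastive_loss(sim, confidence, tau=0.07):
    """Illustrative confidence-relaxed contrastive loss (hypothetical sketch).

    sim:        (N, N) query-to-image similarity matrix; sim[i, i] is the
                labeled positive pair for query i.
    confidence: (N, N) scores in [0, 1] estimating how well each unlabeled
                image might still align with each description; nonzero values
                soften the penalty for treating that pair as a negative.
    tau:        softmax temperature.
    """
    # Temperature-scaled, numerically stable softmax over images per query.
    logits = sim / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    probs = exp / exp.sum(axis=1, keepdims=True)

    # Soft targets: full mass on the labeled positive, plus relaxed mass on
    # confident unlabeled pairs, renormalized per query.
    targets = confidence.copy().astype(float)
    np.fill_diagonal(targets, 1.0)
    targets = targets / targets.sum(axis=1, keepdims=True)

    # Cross-entropy between soft targets and the predicted distribution.
    return float(-(targets * np.log(probs + 1e-12)).sum(axis=1).mean())
```

With all confidence scores at zero this reduces to the standard one-hot contrastive cross-entropy, so the relaxation only changes the objective for pairs the confidence model flags as potentially aligned.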