Fashion Florence: Fine-Tuning Florence-2 for Structured Fashion Attribute Extraction
2026-05-11 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors created Fashion Florence, a fine-tuned vision-language model that takes a clothing photo and outputs detailed fashion attributes as JSON. They collapsed the labels of a large fashion dataset into a smaller schema and trained the model with LoRA, a parameter-efficient fine-tuning method. In the reported comparison with GPT-4o-mini and Gemini 2.5 Flash, the model scored higher on the reported category, material, and style metrics. It runs on a single GPU and is publicly available as a Hugging Face Space, with an integration into an outfit recommendation system.
vision-language model, LoRA, fashion attribute extraction, iMaterialist Fashion dataset, JSON output, fine-tuning, category accuracy, style tags, Hugging Face Space, outfit recommendation system
Authors
Anushree Berlia
Abstract
We present Fashion Florence, a Florence-2 vision-language model fine-tuned with LoRA to extract structured fashion attributes from clothing images. Given a single photograph, the model generates a JSON object containing category, color, material, style tags, and occasion tags: structured output suitable for direct programmatic consumption by downstream recommendation and retrieval systems. Fine-tuning data is derived from the iMaterialist Fashion dataset (228 labels), whose fine-grained annotations we collapse into a compact 6-category, 16-color, 19-style schema via rule-based label engineering. We apply LoRA (r=16, alpha=32) to all decoder linear layers and train for 3 epochs on 3,688 examples. On a held-out test set of 461 images, Fashion Florence achieves 94.6% category accuracy and 63.0% material accuracy, compared to 89.3% / 43.3% for GPT-4o-mini and 87.4% for Gemini 2.5 Flash. Style tag F1 reaches 0.753 vs. 0.612 (Gemini) and 0.398 (GPT-4o-mini). Fashion Florence produces valid JSON in 99.8% of outputs, has 0.77B parameters, and runs on a single GPU at zero marginal inference cost. The model is deployed as a Hugging Face Space and integrated into Loom, an open-source outfit recommendation system.
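The three kinds of metric reported above (valid-JSON rate, category accuracy, and style tag F1) can be illustrated with a minimal sketch. It is not the authors' evaluation code. The JSON key names (`category`, `color`, `material`, `style_tags`, `occasion_tags`), the choice of micro-averaged F1, and the decision to count an invalid output as a miss are all assumptions made for illustration; the abstract does not specify them.

```python
import json

# Hypothetical key names: the abstract lists the attributes, not the exact keys.
REQUIRED_KEYS = {"category", "color", "material", "style_tags", "occasion_tags"}


def parse_prediction(raw):
    """Return the parsed dict, or None if `raw` is not a usable JSON object."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or not REQUIRED_KEYS.issubset(obj):
        return None
    tags = obj["style_tags"]
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        return None
    return obj


def evaluate(raw_outputs, references):
    """Compute valid-JSON rate, category accuracy, and micro-averaged style tag F1.

    Assumption: an invalid output counts as a wrong category and contributes
    all of its reference style tags as false negatives.
    """
    valid = 0
    category_hits = 0
    tp = fp = fn = 0
    for raw, ref in zip(raw_outputs, references):
        ref_tags = set(ref["style_tags"])
        pred = parse_prediction(raw)
        if pred is None:
            fn += len(ref_tags)
            continue
        valid += 1
        if pred["category"] == ref["category"]:
            category_hits += 1
        pred_tags = set(pred["style_tags"])
        tp += len(pred_tags & ref_tags)
        fp += len(pred_tags - ref_tags)
        fn += len(ref_tags - pred_tags)
    n = len(references)
    denom = 2 * tp + fp + fn
    return {
        "valid_json_rate": valid / n if n else 0.0,
        "category_accuracy": category_hits / n if n else 0.0,
        "style_tag_f1": 2 * tp / denom if denom else 0.0,
    }
```

Calling `evaluate(raw_outputs, references)` with a list of raw model strings and a parallel list of reference dicts returns the three scores as a dictionary. A different averaging scheme (for example, per-image F1 averaged over the test set) would give different numbers, so the sketch should not be used to reproduce the reported figures without checking the paper's definitions.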