StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

2026-06-18

Computation and Language; Computer Vision and Pattern Recognition
AI summary

The authors created a large, controlled dataset called StylisticBias to study how multimodal large language models (MLLMs) judge people based on visual features. By keeping a person's identity fixed and changing only one visual attribute at a time, they could isolate which specific features shift the models' social judgments. They found that age and body type dominate identity-level effects, while fashion style and other visual cues cause the largest attribute-level shifts. About 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. The authors release the dataset and code so that others can evaluate bias in MLLMs at a finer grain.

multimodal large language models, social bias, visual attributes, identity effects, bias evaluation benchmark, photorealistic images, attribute-level variation, social judgment, fashion style, body type
Authors
Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner
Abstract
Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.
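The paired design described in the abstract can be illustrated with a small Python sketch. It is not the authors' code, and the abstract does not specify their metric. The sketch assumes that a shift is the mean absolute change in a binary judgment between a variant image and the base image of the same face, and that concentration is measured as the smallest set of top-ranked attributes covering a given share of the summed shifts. The function names, attribute labels, and toy data are all hypothetical.

```python
from collections import defaultdict


def attribute_shifts(base, variants):
    """Mean absolute change in a binary judgment, per attribute.

    base:     {face_id: 0/1 judgment on the unmodified base face}
    variants: iterable of (face_id, attribute, 0/1 judgment on the variant)
    Each variant is compared only with the base image of the same face,
    so identity is held fixed (assumed metric, not taken from the paper).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for face_id, attribute, judgment in variants:
        totals[attribute] += abs(judgment - base[face_id])
        counts[attribute] += 1
    return {a: totals[a] / counts[a] for a in totals}


def attributes_covering(shifts, share=0.8):
    """Smallest top-ranked set of attributes whose shifts reach `share` of the total."""
    total = sum(shifts.values())
    if total == 0:
        return []
    ranked = sorted(shifts.items(), key=lambda kv: kv[1], reverse=True)
    chosen, running = [], 0.0
    for attribute, value in ranked:
        chosen.append(attribute)
        running += value
        if running >= share * total:
            break
    return chosen


# Toy data: two base faces, three hypothetical single-attribute variations each.
base = {"f1": 1, "f2": 0}
variants = [
    ("f1", "formal_suit", 1), ("f2", "formal_suit", 1),
    ("f1", "glasses", 1), ("f2", "glasses", 0),
    ("f1", "worn_clothing", 0), ("f2", "worn_clothing", 1),
]

shifts = attribute_shifts(base, variants)
top = attributes_covering(shifts, share=0.8)
print(shifts)
print(top)
```

In this toy example the two clothing attributes together cover the requested share, while the glasses variation contributes nothing. At the benchmark's scale the same comparison would run over roughly 500 × 50 = 25,000 images for each model and judgment scenario.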