How LLMs See Creativity: Zero-Shot Scoring of Visual Creativity with Interpretable Reasoning

2026-06-29
Computation and Language

AI summary

The authors studied whether multimodal AI language models, which can process images as well as text, can judge how creative pictures are without any special training. They tested six such models on nearly 2,500 AI-generated images and hand-drawn sketches that humans had already rated for creativity. The models' ratings correlated substantially with the human ratings on both datasets, showing that these models can evaluate visual creativity fairly well on their own. The authors also examined how three of the models explain their ratings step by step. These explanations make the decision process easier to understand, but they did not bring the models' scores any closer to the human ratings. Overall, the research shows that these AI models can judge creativity in images in a way that is understandable and aligned with human opinions.

multimodal large language models, visual creativity, zero-shot evaluation, AI-generated images, human creativity ratings, step-by-step reasoning, creativity assessment, model interpretability, automated scoring, human-AI alignment
Authors
William Orwig, Roger E. Beaty
Abstract
Evaluating the originality of visual images poses enduring challenges for creativity assessment. Automated scoring using AI models has proven effective in the verbal domain, yet key questions remain about evaluating visual creativity and understanding how models arrive at their ratings. The present research asks whether multimodal large language models (LLMs) can serve as zero-shot judges of visual creativity (without any fine-tuning or examples of human ratings) and whether their "reasoning" output offers an interpretable window into their evaluation process. We tested six multimodal LLMs (Gemini 3 Flash, Gemma 4 31B IT, GPT-5.4 Mini, GLM-5v Turbo, Kimi K2.5, and Qwen 3.6 Plus) on 992 AI-generated images (based on human-written prompts) and 1,500 hand-drawn sketches scored for creativity by human raters. In Study 1, all models showed substantial alignment with human creativity ratings on both datasets (r = .57-.68 on AI-generated images; r = .29-.68 on sketches). In Study 2, we analyzed the step-by-step reasoning processes of three LLMs evaluating the same images and drawings. Although reasoning made model evaluations interpretable (showing what they attend to, how they balance originality vs. quality, and how they justify their ratings), it did not improve alignment with human ratings. In sum, our findings indicate that multimodal LLMs can match human judgments of visual creativity without any additional training, and that their reasoning reveals how AI models evaluate creativity. An open scoring app implementing this pipeline is available at https://review-visual-eval-scoring.hf.space.
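
As an illustration of the alignment measure reported above, the following is a minimal sketch of correlating model scores with human ratings. It assumes the reported r is a Pearson correlation, which the abstract does not state explicitly. The function name `pearson_r` and the toy ratings are hypothetical and are not taken from the paper or its scoring app.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each sequence
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data for five images (NOT results from the paper)
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [2.0, 1.0, 4.0, 3.0, 5.0]

r = pearson_r(human_ratings, model_scores)
print(f"r = {r:.2f}")  # prints "r = 0.80"
```

In a real evaluation, each list would hold one entry per image or sketch: the human creativity rating in one and the corresponding zero-shot model score in the other.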