The Human Creativity Benchmark

2026-06-29, Artificial Intelligence

Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction
AI summary

The authors argue that in creative domains, disagreement among expert evaluators is not simply measurement error but reflects genuine differences in taste. They introduce the Human Creativity Benchmark (HCB) to separate convergence (for example, on technical correctness) from divergence (for example, on aesthetic direction) when judging creative AI. They collected 15,000 professional judgments across five creative domains and three workflow phases (ideation, mockup, refinement) and found that verifiable qualities are widely agreed upon, while taste-driven qualities vary by individual. Their work shows that collapsing all feedback into one score loses the information about where models must be correct and where they should remain steerable.

creative AI evaluation, evaluator disagreement, Human Creativity Benchmark, pairwise preferences, prompt adherence, usability, visual appeal, convergence, divergence, creative workflow phases
Authors
Aspen Hopkins, Allison Nulty, Alexandria Minetti, Anoop Pakki, Angad Singh
Abstract
Modern AI evaluation frameworks treat evaluator disagreement as noise to be resolved. In creative domains, professional disagreement reflects genuine differences in taste, not measurement error. We argue that evaluating creative AI requires preserving two distinct signals: convergence, where professionals align around shared best practices, and divergence, where individual taste legitimately varies. We present the Human Creativity Benchmark (HCB), a benchmark that operationalizes this separation by collecting pairwise preferences, scalar ratings on prompt adherence, usability, and visual appeal, and qualitative rationale from domain professionals. Across 15,000 professional judgments spanning five creative domains and three workflow phases (ideation, mockup, refinement), we find that convergence concentrates on verifiable dimensions like technical correctness and visual hierarchy, while divergence concentrates on taste-driven dimensions like aesthetic direction and conceptual risk. No model excels uniformly across all phases. Collapsing these signals into a single quality metric discards the most actionable information: where models must be correct versus where they should remain steerable.
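
As a rough illustration of keeping convergence and divergence separate instead of averaging them away, the sketch below measures how much evaluators disagree on each rating dimension. It is not the authors' method: the abstract does not state which statistic HCB uses, so the spread measure (population standard deviation per item, averaged per dimension), the 1 to 5 rating scale, the threshold of 1.0, and the names `spread_by_dimension` and `split_signals` are all assumptions made for this example. The sample ratings are invented.

```python
from collections import defaultdict
from statistics import mean, pstdev


def spread_by_dimension(ratings):
    """Average evaluator disagreement per rating dimension.

    ratings: iterable of (item_id, evaluator_id, dimension, score).
    Returns {dimension: mean over items of the std. dev. across evaluators}.
    """
    scores = defaultdict(list)  # (dimension, item_id) -> scores from all evaluators
    for item_id, _evaluator_id, dimension, score in ratings:
        scores[(dimension, item_id)].append(score)

    per_dimension = defaultdict(list)
    for (dimension, _item_id), values in scores.items():
        if len(values) >= 2:  # disagreement needs at least two evaluators
            per_dimension[dimension].append(pstdev(values))
    return {dim: mean(spreads) for dim, spreads in per_dimension.items()}


def split_signals(ratings, threshold=1.0):
    """Label dimensions as convergent (low spread) or divergent (high spread).

    The threshold is an arbitrary illustrative cutoff, not a value from the paper.
    """
    spread = spread_by_dimension(ratings)
    convergent = sorted(d for d, s in spread.items() if s < threshold)
    divergent = sorted(d for d, s in spread.items() if s >= threshold)
    return convergent, divergent


# Hypothetical ratings: three evaluators score one output on two dimensions.
ratings = [
    ("img1", "a", "prompt_adherence", 4),
    ("img1", "b", "prompt_adherence", 4),
    ("img1", "c", "prompt_adherence", 5),
    ("img1", "a", "visual_appeal", 1),
    ("img1", "b", "visual_appeal", 5),
    ("img1", "c", "visual_appeal", 3),
]

convergent, divergent = split_signals(ratings)
print("convergent:", convergent)
print("divergent:", divergent)
```

In this toy data the evaluators nearly agree on prompt adherence but split widely on visual appeal, so a single averaged score would hide the fact that the two dimensions behave differently.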