PerceptionRubrics: Calibrating Multimodal Evaluation to Human Perception

2026-06-26
Computer Vision and Pattern Recognition

AI summary

The authors present PerceptionRubrics, a framework for evaluating image-understanding models that goes beyond a single overall matching score. Instead of checking whether a caption roughly fits an image, it audits each caption against detailed, image-specific checklists (rubrics), separating must-get-right facts from easy-to-get-wrong details. A gated scoring method sharply penalizes failures on the essential facts. Their experiments show that models often get individual parts of an image right but fail when every part must be right at once, that open-source models still trail proprietary ones on perception, and that the gated scores align better with human judgment than existing benchmarks do.

PerceptionRubrics, rubric-based evaluation, benchmarking, image captioning, semantic matching, Gated Scoring, peer-review consensus, open-source models, perceptual fidelity, conjunctive constraints
Authors
Yana Wei, Hongbo Peng, Yanlin Lai, Liang Zhao, Kangheng Lin, En Yu, Keyu Lv, Han Zhou, Yin Tang, Haodong Li, Mitt Huang, Hangyu Guo, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Vishal M. Patel
Abstract
We introduce PerceptionRubrics, a rubric-based evaluation framework that addresses the gap between saturated benchmark scores and real-world brittleness. Shifting evaluation from holistic semantic matching to rigorous atomic auditing, PerceptionRubrics pairs 1,038 information-dense images with over 12,000 instance-specific rubrics. These criteria are derived from golden captions constructed via a novel Circular Peer-Review consensus pipeline and then distilled into a dual-stream system of Must-Right (essential facts) and Easy-Wrong (fine-grained details) rubrics. Crucially, PerceptionRubrics implements a Gated Scoring mechanism: unlike linear averages, failure on mandatory visual facts triggers sharp binary penalties. Extensive evaluation yields critical insights: (1) The Reliability Gap: models often verify fragmented elements correctly yet fail strict conjunctive constraints, exposing brittleness in dense domains; (2) Open-Closed Stratification: contrary to reasoning trends, we reveal a persistent 8% perception deficit between open-source and proprietary frontiers; and (3) Human-Aligned Rigor: our gated metrics substantially out-align conventional benchmarks, validating that strict perceptual fidelity is the prerequisite for reliable generation.
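
To make the contrast between a linear average and a gated score concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the scoring formula, so the gate used here (any failed Must-Right rubric sets the image's score to zero, otherwise the score is the pass rate on Easy-Wrong rubrics) is an assumption, and the names `RubricResult`, `linear_score`, and `gated_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RubricResult:
    """Outcome of checking one rubric against a model caption (hypothetical schema)."""
    kind: str     # "must_right" or "easy_wrong"
    passed: bool


def linear_score(results: List[RubricResult]) -> float:
    """Plain average over all rubrics: the kind of score the abstract contrasts with."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


def gated_score(results: List[RubricResult]) -> float:
    """Assumed gate: a single failed Must-Right rubric zeroes the score.

    If every Must-Right rubric passes, the score is the pass rate on the
    Easy-Wrong rubrics (1.0 when there are none). The paper's actual
    penalty may be defined differently.
    """
    must = [r for r in results if r.kind == "must_right"]
    easy = [r for r in results if r.kind == "easy_wrong"]
    if not all(r.passed for r in must):
        return 0.0
    if not easy:
        return 1.0
    return sum(r.passed for r in easy) / len(easy)


# Illustrative data only: one essential fact is wrong, everything else is right.
example = [
    RubricResult("must_right", True),
    RubricResult("must_right", True),
    RubricResult("must_right", False),
    RubricResult("easy_wrong", True),
    RubricResult("easy_wrong", True),
    RubricResult("easy_wrong", True),
]
```

On this made-up example, `linear_score(example)` is 5/6 because five of the six rubrics pass, while `gated_score(example)` is 0.0 because one essential fact is wrong. Under the assumed gate, this is how a caption can verify most fragments correctly and still fail the strict conjunctive constraint the abstract describes.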