ChinaHeritaQA: A Culturally-Grounded Visual Question Answering Dataset for World Heritage Sites in China

2026-06-08 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Computation and Language

AI summary

The authors created ChinaHeritaQA, a dataset with pictures and questions about UNESCO World Heritage sites in China to test how well AI models understand culture and history. The questions cover different thinking skills, from identifying objects to understanding historical periods and architecture. They found that while AI is good at recognizing images, it struggles with deeper cultural and historical reasoning. This dataset can help improve AI models' cultural awareness in the future.

ChinaHeritaQA, vision-language models, UNESCO World Heritage sites, multimodal dataset, cultural reasoning, historical periodization, architectural analysis, bilingual QA, human annotation

Authors

Yi Zhang, Bolei Ma, Yong Cao, Chengyan Wu, Daniel Hershcovich, Anna-Carolina Haensch

Abstract

We introduce ChinaHeritaQA, a multimodal benchmark dataset for evaluating the cultural reasoning abilities of vision-language models (VLMs) on UNESCO World Heritage sites in China. The dataset comprises 2,279 in-the-wild images paired with 14,133 bilingual (Chinese/English) multiple-choice QA pairs spanning seven cognitive dimensions, from basic identity recognition to historical periodization and architectural analysis. Guided by a UNESCO-aligned heritage ontology and verified through rigorous human annotation, the dataset ensures linguistic quality and factual consistency. Evaluations of state-of-the-art VLMs reveal that while top models exceed human performance on average, substantial task-level variation emerges: models excel at visual recognition but struggle with culturally grounded reasoning. Performance also varies by dynasty and region. ChinaHeritaQA reveals that strong visual retrieval does not extend to cultural and historical understanding. We release the dataset to support future research on culturally aware multimodal learning.
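The task-level variation described in the abstract is the kind of result one would obtain by scoring multiple-choice predictions separately for each cognitive dimension. The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The record fields (`dimension`, `answer`, `prediction`), the function name `accuracy_by_dimension`, and the toy records are all hypothetical and chosen only for illustration; the released dataset may use a different schema.

```python
from collections import defaultdict


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      "dimension"  - name of the cognitive dimension the question tests
      "answer"     - gold option label, e.g. "A"
      "prediction" - option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        dim = rec["dimension"]
        total[dim] += 1
        if rec["prediction"] == rec["answer"]:
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Toy records invented for illustration; they are not taken from the dataset.
toy_records = [
    {"dimension": "identity_recognition", "answer": "A", "prediction": "A"},
    {"dimension": "identity_recognition", "answer": "C", "prediction": "C"},
    {"dimension": "historical_periodization", "answer": "B", "prediction": "D"},
    {"dimension": "historical_periodization", "answer": "A", "prediction": "A"},
]

scores = accuracy_by_dimension(toy_records)
for dim, acc in sorted(scores.items()):
    print(f"{dim}: {acc:.2f}")
```

Reporting accuracy per dimension rather than a single average is what makes it possible to see a model doing well on recognition questions while doing poorly on periodization questions. The same grouping could be applied to other keys, such as dynasty or region, if the records carry them.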
