ArtiFact: A Large-Scale Multi-Modal Cultural Heritage Dataset

2026-06-08

Databases, Artificial Intelligence
AI summary

The authors created ArtiFact, a large dataset combining tables, text, and images from museum records to help study multi-modal data management. They tested it by introducing different types of errors and found it hard for current systems to catch subtle mistakes related to history and materials. They also showed that existing tools struggle to answer complex questions involving cultural and historical context. This makes ArtiFact a useful and challenging resource for improving how computers handle mixed data types.

multi-modal data, data integration, semantic query processing, data quality assessment, museum records, error detection, cultural heritage, benchmark dataset
Authors
Luciano Duarte, Olga Ovcharenko, Sebastian Schelter
Abstract
Multi-modal data management has emerged as a central research topic in the database community, spanning data integration, semantic query processing, and data quality assessment. Despite this growing interest, the community lacks large-scale, real-world datasets combining tables, text, and images. We present ArtiFact, a multi-modal cultural heritage dataset of 651,045 museum records collected from the Metropolitan Museum of Art, the Art Institute of Chicago, and the Rijksmuseum. We demonstrate the utility of ArtiFact through two downstream tasks. For cross-modal error detection, we introduce a curated taxonomy of seven error categories injected into 130,209 records and show that reliably detecting subtle domain-specific errors, such as material anachronisms and temporal shifts, remains an open challenge. For semantic query processing, we show that current systems struggle with queries involving cultural proximity, ambiguous object types, and historically contingent terminology. Our results position ArtiFact as a challenging benchmark for multi-modal data management research.
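
To make the error-injection setup more concrete, the following is a minimal illustrative sketch of how a temporal-shift error might be injected into a sample of records. It is not the authors' implementation: the record fields (`id`, `title`, `year`), the function name `inject_temporal_shift`, and the fixed 300-year shift are all hypothetical choices made for illustration. The default fraction of 0.2 mirrors the ratio of corrupted to total records stated in the abstract (130,209 of 651,045).

```python
import random
from copy import deepcopy


def inject_temporal_shift(records, fraction=0.2, shift_years=300, seed=0):
    """Return corrupted copies of `records` plus the set of corrupted indices.

    Hypothetical sketch: shifts the `year` field of a random sample of
    records by +/- `shift_years`, leaving the input list untouched.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    n_corrupt = round(len(records) * fraction)
    corrupted_idx = set(rng.sample(range(len(records)), n_corrupt))

    out = []
    for i, rec in enumerate(records):
        rec = deepcopy(rec)  # never mutate the clean source records
        if i in corrupted_idx:
            rec["year"] += rng.choice([-shift_years, shift_years])
            rec["error_type"] = "temporal_shift"  # ground-truth label
        else:
            rec["error_type"] = None
        out.append(rec)
    return out, corrupted_idx


# Toy records with assumed field names; real museum records differ.
records = [{"id": i, "title": f"Object {i}", "year": 1600 + i} for i in range(10)]
corrupted, idx = inject_temporal_shift(records)
```

Keeping the corrupted indices alongside the data gives a ground-truth label per record, which is what a detector would be scored against. Subtler categories such as material anachronisms would need domain knowledge that a simple numeric shift like this one does not capture.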