Clean Me If You Can: A Large Collection of Real-World Addresses for Data Cleaning Benchmarking
2026-06-30 • Databases
AI summary
The authors address the problem of detecting and correcting errors in tabular data, which is important but hard to do well outside controlled settings. They point out that researchers lack real-world dirty datasets to evaluate on. To fill this gap, the authors collected a large dataset of postal addresses containing errors, together with the corrected versions. They tested existing cleaning methods on this data, found that the methods fall short, and derived guidelines for future work.
data cleaning, error detection, data correction, tabular data, dataset, ground truth, postal dataset, data quality, data preprocessing, automated cleaning
Authors
Fatemeh Ahmadi, Tobias Bernhard, Mohamed Abdelmaksoud, Luca Zecchini, Tilmann Rabl, Ziawasch Abedjan
Abstract
There has been extensive research on automating and scaling data cleaning, i.e., the detection and correction of erroneous values in tabular data. Yet, existing approaches often perform well only within controlled environments. One of the major bottlenecks in data cleaning research is the lack of real-world datasets. In this paper, we address this gap by providing a large, dirty dataset with postal entries and their corresponding ground truth. We discuss the design decisions and challenges for obtaining the dataset. We demonstrate the limitations of existing cleaning approaches when faced with our proposed datasets and derive guidelines for future research.
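To make the role of the ground truth concrete, here is a minimal sketch of how a dirty table paired with its corrected version can be used to score an error detector at the cell level. Everything in it is an assumption for illustration: the function names, the column names, and the two toy address rows are hypothetical and are not taken from the paper's dataset, and the abstract does not state which evaluation protocol the authors use. Precision, recall, and F1 are simply common choices for this kind of comparison.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]
Cell = Tuple[int, str]  # (row index, column name)


def error_cells(dirty: List[Row], clean: List[Row]) -> Set[Cell]:
    """Return the cells whose dirty value differs from the ground truth."""
    return {
        (i, col)
        for i, (d, c) in enumerate(zip(dirty, clean))
        for col in c
        if d.get(col) != c[col]
    }


def precision_recall_f1(flagged: Set[Cell], actual: Set[Cell]) -> Tuple[float, float, float]:
    """Score a detector's flagged cells against the true error cells."""
    tp = len(flagged & actual)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy rows, invented for this sketch (not from the paper's data).
dirty = [
    {"street": "Main Stret", "zip": "10115", "city": "Berlin"},
    {"street": "Hauptstrasse 5", "zip": "1011", "city": "Berlin"},
]
clean = [
    {"street": "Main Street", "zip": "10115", "city": "Berlin"},
    {"street": "Hauptstrasse 5", "zip": "10115", "city": "Berlin"},
]

actual = error_cells(dirty, clean)

# Pretend output of some detector: one correct flag, one false alarm.
flagged = {(0, "street"), (1, "city")}

p, r, f = precision_recall_f1(flagged, actual)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The same comparison extends to correction: instead of checking whether a cell was flagged, one would check whether the repaired value equals the ground-truth value.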