Evaluating Data Quality Tools: Measurement Capabilities and LLM Integration

2026-04-10

Databases
AI summary

The authors evaluated six widely used data quality tools, covering both open-source and proprietary options. They assessed each tool on rule definition, duplicate detection, metric aggregation, and uncertainty handling, and checked whether it integrates Large Language Models (LLMs). Proprietary tools offer more measurement features and some LLM-based assistance, while open-source tools are more flexible but take more effort to implement. None of the tools yet supports validating data directly with an LLM.

data quality, data validation, Large Language Models, rule definition, duplicate detection, metric aggregation, uncertainty handling, open-source tools, proprietary software
Authors
Tobias Rehberger, Thomas Hütter, Lisa Ehrlinger, Wolfram Wöß
Abstract
High data quality is critical for reliable analytics and operational efficiency. A growing ecosystem of tools has emerged to support data quality management, ranging from lightweight open-source libraries to comprehensive enterprise platforms. This paper evaluates six data quality tools: Great Expectations, Deequ, Evidently, Informatica, Experian, and Ataccama. The evaluation criteria cover rule definition, duplicate detection, metric aggregation, and uncertainty handling, and were derived from real-world use cases of company partners. We further examine to what extent these tools integrate Large Language Models (LLMs). Our findings show that proprietary tools offer more comprehensive measurement features and emerging LLM-based assistance, while open-source tools provide flexibility at the cost of higher implementation effort. Across all tools, LLM integration remains limited to rule creation workflows. Direct data validation through LLMs is not yet supported by any of the evaluated tools.
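
To make the four evaluation criteria concrete, the following minimal sketch shows what each could look like in plain Python. It is an illustration only, not the authors' evaluation setup and not the API of any evaluated tool. The sample records, the function names, the rule thresholds, and the choice to report unknown values as a lower and upper bound on the pass rate are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical sample records (not data from the paper).
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 2, "email": "b@example.com", "age": None},
    {"id": 3, "email": "a@example.com", "age": 151},
    {"id": 4, "email": None, "age": 28},
]


def rule_age_in_range(row):
    """Rule definition: age, if present, must lie in [0, 120] (assumed bounds)."""
    age = row["age"]
    if age is None:
        return None  # unknown: the rule cannot be evaluated for this row
    return 0 <= age <= 120


def rule_email_present(row):
    """Rule definition: the email field must not be missing."""
    return row["email"] is not None


def duplicate_keys(rows, key):
    """Duplicate detection by exact match: values of `key` seen more than once."""
    counts = Counter(r[key] for r in rows if r[key] is not None)
    return {value for value, n in counts.items() if n > 1}


def pass_rate_bounds(rows, rule):
    """Uncertainty handling: return (lower, upper) bounds on the pass rate.

    Rows where the rule returns None count as failures for the lower bound
    and as passes for the upper bound.
    """
    results = [rule(r) for r in rows]
    passed = sum(1 for x in results if x is True)
    unknown = sum(1 for x in results if x is None)
    n = len(results)
    return passed / n, (passed + unknown) / n


def aggregate(scores, weights):
    """Metric aggregation: weighted mean of per-rule scores."""
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total


age_bounds = pass_rate_bounds(records, rule_age_in_range)
email_bounds = pass_rate_bounds(records, rule_email_present)
duplicates = duplicate_keys(records, "email")

# Aggregate the conservative (lower-bound) scores with equal weights.
overall = aggregate(
    {"age": age_bounds[0], "email": email_bounds[0]},
    {"age": 1.0, "email": 1.0},
)

print(age_bounds, email_bounds, duplicates, overall)
```

Reporting an interval rather than a single pass rate keeps missing values visible in the aggregated score. Dedicated tools typically go further, for example with fuzzy matching for duplicate detection.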