Consistency evaluation of benchmarks used for causal discovery

2026-06-01 • Artificial Intelligence

AI summary

The authors studied how well standard causal graphs used in research match up with the latest scientific findings. They created a system that automatically looks through thousands of research papers and uses large language models to check if these benchmark graphs agree with current domain knowledge. They found that the popular benchmark graphs differ a lot in how consistent they are with recent research, which matters for testing new causal discovery methods. This work highlights a problem in evaluating methods that learn cause-and-effect relationships from data.

graphical causal model, causal discovery, benchmark causal graphs, large language models, domain knowledge, scientific databases, evaluation, consistency checking, research papers

Authors

Yuzhe Zhang, Chihui Chen, Lina Yao, Chen Wang

Abstract

In graphical causal modeling, causal discovery aims to construct a causal graph from numerical data and from domain knowledge expressed in plain text. However, evaluating causal discovery methods remains a challenge, because progress in domain research often leaves benchmark causal graphs containing misaligned knowledge. This problem especially affects the evaluation of large language model (LLM) based causal discovery methods, as they are sensitive to new discoveries in the literature. This work is the first to systematically study the quality of benchmark causal graphs. Specifically, we design a pipeline that automatically retrieves relevant research papers from scientific databases and prompts LLMs to check the consistency between the benchmark causal graphs and the domain research papers. We evaluate 11 popular real-world benchmarks, for which our pipeline processes 38,081 domain papers in total. Our results show that popular benchmarks vary significantly in their consistency with domain research, with clear implications for causal discovery research.
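
The abstract does not give implementation details, so the following is only a minimal Python sketch, under stated assumptions, of how an edge-by-edge consistency check of this kind could be organized. Every name here (`keyword_judge`, `edge_verdict`, `consistency_report`), the three verdict labels, the majority-vote rule, and the toy graph and texts are hypothetical and are not taken from the paper. The keyword matcher is a stand-in for an LLM prompt, and the hard-coded list of texts is a stand-in for retrieval from a scientific database.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def keyword_judge(edge: Edge, text: str) -> str:
    """Toy stand-in for an LLM call (assumption, not the paper's prompt).

    Returns 'support', 'contradict', or 'unrelated' for one edge and one text.
    """
    cause, effect = edge[0].lower(), edge[1].lower()
    lowered = text.lower()
    if f"{cause} does not cause {effect}" in lowered:
        return "contradict"
    if f"{cause} causes {effect}" in lowered:
        return "support"
    return "unrelated"


def edge_verdict(edge: Edge, papers: List[str],
                 judge: Callable[[Edge, str], str]) -> str:
    """Aggregate per-paper judgments for one edge by simple majority (assumed rule)."""
    votes = Counter(judge(edge, paper) for paper in papers)
    if votes["support"] > votes["contradict"]:
        return "consistent"
    if votes["contradict"] > votes["support"]:
        return "inconsistent"
    return "undetermined"


def consistency_report(edges: List[Edge], papers: List[str],
                       judge: Callable[[Edge, str], str] = keyword_judge
                       ) -> Dict[Edge, str]:
    """Map every benchmark edge to its aggregated verdict."""
    return {edge: edge_verdict(edge, papers, judge) for edge in edges}


# Toy benchmark graph and toy "retrieved" texts (both invented for illustration).
edges = [("rain", "wet grass"), ("sprinkler", "rain"), ("sprinkler", "wet grass")]
papers = [
    "Field observations confirm that rain causes wet grass.",
    "Our analysis shows that sprinkler does not cause rain.",
]

report = consistency_report(edges, papers)
consistent_share = sum(v == "consistent" for v in report.values()) / len(report)
```

In this sketch the judge is passed in as a function, so the keyword matcher could be swapped for a real LLM client without touching the aggregation logic. Edges with no supporting or contradicting text are reported as `undetermined` rather than being counted as consistent.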
