Generalized Rank-based Evaluation for Knowledge Graph Completion: Perspectives, Framework, and Analyses

2026-06-08 | Machine Learning

AI summary

The authors focus on improving how we judge the accuracy of knowledge graph completion (KGC) models, which predict missing facts in knowledge graphs. They identify two important but often ignored factors: how confident predictions are (predictive sharpness) and how fair the evaluation is to less popular facts (popularity-bias robustness). To address this, they propose an evaluation framework called PROBE, which scores each prediction at a chosen level of predictive sharpness and then aggregates those scores at a chosen level of popularity-bias robustness. Their experiments with six KGC models on six real-world knowledge graphs show that PROBE gives more reliable and consistent results than current metrics, especially when only some of the true facts are observed.

Knowledge Graph Completion, Predictive Sharpness, Popularity Bias, Evaluation Metrics, Rank Transformer, Rank Aggregator, Open-world Assumption, Model Performance, Incomplete Knowledge Graphs
Authors
Sooho Moon, Jian Kang, Yunyong Ko
Abstract
Knowledge graph completion (KGC) aims to predict missing facts from an observed knowledge graph (KG), playing a crucial role in a wide range of real-world applications such as drug discovery, recommender systems, and retrieval-augmented generation (RAG). Although numerous KGC models have been proposed, the evaluation of KGC remains underexplored, despite its critical role in reliably assessing model performance and selecting appropriate models for real-world applications. In this paper, we introduce two important perspectives for KGC evaluation that are overlooked by existing evaluation metrics: (P1) predictive sharpness and (P2) popularity-bias robustness. To address both perspectives, we propose a generalized evaluation framework, PROBE, which consists of a rank transformer (RT) that estimates the score of each prediction based on a desired level of predictive sharpness and a rank aggregator (RA) that determines the final evaluation score by aggregating all prediction scores according to a desired level of popularity-bias robustness. We theoretically analyze PROBE by defining six key properties for reliable KGC evaluation and prove that PROBE satisfies all of them, whereas existing metrics fail to satisfy some. In particular, due to the open-world nature of KGs, an evaluation metric should preserve the relative performance of KGC models even when only incomplete facts are observed. We show that PROBE better maintains such consistency, providing a more reliable estimate of intrinsic model performance than existing metrics. Extensive experiments with six KGC models on six real-world KGs reveal that existing metrics may over- or underestimate model performance depending on the evaluation perspective, whereas PROBE enables a more comprehensive, flexible, and consistent evaluation of KGC models.
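
The abstract does not give PROBE's actual rank transformer or rank aggregator, so the following is only an illustrative sketch of the general two-stage idea: turn each prediction's rank into a score, then aggregate the scores. It uses the standard rank-based metrics MRR and Hits@k as example transformers. The function names, the toy ranks, and the "popular"/"rare" grouping are assumptions made for this sketch, and macro-averaging over groups is merely a stand-in for a popularity-aware aggregator, not the authors' method.

```python
from collections import defaultdict
from statistics import mean


def reciprocal_rank(rank):
    """Rank -> score: rewards top ranks steeply (used by MRR)."""
    return 1.0 / rank


def hits_at(k):
    """Rank -> score: 1 if the true entity is within the top k, else 0."""
    def transform(rank):
        return 1.0 if rank <= k else 0.0
    return transform


def micro_mean(scores, groups):
    """Average over all predictions; frequent groups dominate the result."""
    return mean(scores)


def macro_mean(scores, groups):
    """Average per group first, then across groups (hypothetical stand-in
    for an aggregator that is less sensitive to popularity)."""
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return mean(mean(values) for values in by_group.values())


def evaluate(ranks, groups, transform, aggregate):
    """Generic rank-based metric: transform each rank, then aggregate."""
    scores = [transform(rank) for rank in ranks]
    return aggregate(scores, groups)


# Toy data (made up): three test facts about a popular entity, one about a rare one.
ranks = [1, 2, 1, 10]
groups = ["popular", "popular", "popular", "rare"]

mrr = evaluate(ranks, groups, reciprocal_rank, micro_mean)
macro_mrr = evaluate(ranks, groups, reciprocal_rank, macro_mean)
hits3 = evaluate(ranks, groups, hits_at(3), micro_mean)

print(f"MRR={mrr:.3f}  macro-MRR={macro_mrr:.3f}  Hits@3={hits3:.2f}")
```

On this toy input the micro-averaged MRR is noticeably higher than the group-averaged one, because the single poorly ranked rare fact is outweighed by three well-ranked popular facts. This is the kind of over- or underestimation across evaluation perspectives that the abstract refers to.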