Adaptive Graph Refinement and Label Propagation with LLMs for Cost-Effective Entity Resolution

2026-05-25Computation and Language

Computation and LanguageArtificial Intelligence
AI summary

The authors focus on dirty entity resolution, which means finding which records in a messy dataset refer to the same real-world thing. They argue that the common method, which breaks the task into separate steps, creates an incomplete and error-prone graph that leads to mistakes in grouping records. Instead, they propose Alper, a new approach that combines matching and clustering into a single, ongoing process that improves the graph and labels together. Alper smartly uses both cheap but less accurate signals and more costly high-quality queries to get better results efficiently. Their tests on many datasets show Alper works better than current popular methods.

entity resolutionblockingmatchingclusteringlabel propagationgraph algorithmsprobabilistic modelslarge language modelsquery optimizationtransitivity
Authors
Hongtao Wang, Renchi Yang, Haoran Zheng, Xiangyu Ke
Abstract
Dirty entity resolution (ER), which identifies records referring to the same real-world entity from a single, messy dataset, is a fundamental task in data management and mining. However, the dominant blocking-matching-clustering paradigm for ER suffers from critical flaws. Its cascaded, decoupled workflow essentially produces a static, sparse graph plagued by missing edges (due to blocking failures) and noisy links (due to matching errors), causing error propagation and yielding suboptimal clusters, particularly when rigid transitivity is imposed in the clustering. We contend that matching and clustering are fundamentally synergistic, both optimizing for the construction of an ideal entity graph. Building upon this insight, we propose Alper, a unified framework that integrates these steps into an iterative probabilistic label propagation process over a global, evolving graph. Unlike disjoint blocking, Alper refines the graph structure and labels dynamically by adaptively integrating "weak but cheap" signals from graph propagation with "strong but expensive" LLM-based pairwise queries. For higher cost-effectiveness, we formulate the signal selection as a constrained optimization problem maximizing cumulative marginal gain under a query budget, solved via our greedy algorithm with provable theoretical guarantees. Our extensive experiments over eight benchmark datasets demonstrate that Alper is consistently superior to state-of-the-art cascaded pipelines.