Managing Map Cardinality in Automatic Disease Classification Mapping: Balancing Precision, Recall and Coverage
2026-06-29 • Computation and Language
AI summary
The authors tackle the problem of automatically linking disease codes across different versions of the International Classification of Diseases (ICD), which is important for combining health data over time. Existing embedding-based methods mostly match one code to one other code and struggle when a code should link to several codes. To address this, the authors propose a two-step approach: first narrow down the possible matches, then use a large language model to select all valid mappings among them. In their experiments, the method achieves higher precision with comparable recall and broader coverage across multiple ICD version pairs.
International Classification of Diseases (ICD), disease coding, embedding-based methods, one-to-many mapping, entity resolution, blocking, large language models (LLMs), mapping precision, mapping recall, health data integration
Authors
Santosh Purja Pun, Oliver Obst, Jim Basilakis, Jeewani Anupama Ginige
Abstract
Automatic mapping between disease classification systems, such as the International Classification of Diseases (ICD), is a challenging yet essential task for integrating health data and conducting longitudinal data analysis. Existing embedding-based methods primarily focus on \emph{one-to-one} mappings, overlooking more complex \emph{one-to-many} scenarios. Threshold-based and top-K methods offer natural extensions; however, they involve inherent trade-offs between \emph{precision}, \emph{recall} and \emph{mapping coverage}, defined as the proportion of source codes with at least one mapping to a target code. To address this challenge, we introduce a novel method inspired by the \emph{blocking-and-matching} pipeline commonly used in \emph{entity resolution}. In particular, we first generate a block of candidate matches (\emph{blocking}) and then employ a large language model (LLM) to identify all valid mappings within each block (\emph{matching}). Empirically, we show that the proposed method achieves higher precision with comparable recall and broader coverage across multiple ICD version pairs (ICD-9-CM$\leftrightarrow$ICD-10-CM and ICD-10-AM$\leftrightarrow$ICD-11). Our source code and dataset are available at: https://tinyurl.com/46kyn7wp.
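The trade-off described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The codes (`S1`, `T1`, ...) and similarity scores are made-up placeholders, the function names are hypothetical, precision and recall are assumed to be computed over (source, target) pairs, and the `oracle_matcher` is a stand-in for the LLM matching step that simply consults the gold mappings. It therefore shows only the mechanics of top-K selection, threshold selection, and blocking followed by matching, not realistic performance.

```python
from typing import Callable, Dict, List, Set

Scores = Dict[str, Dict[str, float]]  # source code -> {target code: similarity}
Mapping = Dict[str, Set[str]]         # source code -> selected target codes
Matcher = Callable[[str, List[str]], List[str]]


def top_k(scores: Scores, k: int) -> Mapping:
    """Keep the k highest-scoring targets for every source code."""
    return {
        src: set(sorted(cands, key=cands.get, reverse=True)[:k])
        for src, cands in scores.items()
    }


def threshold(scores: Scores, tau: float) -> Mapping:
    """Keep every target with similarity >= tau (the set may be empty)."""
    return {
        src: {tgt for tgt, s in cands.items() if s >= tau}
        for src, cands in scores.items()
    }


def block_and_match(scores: Scores, k: int, matcher: Matcher) -> Mapping:
    """Blocking: take the top-k candidates. Matching: let `matcher` pick valid ones."""
    blocks = top_k(scores, k)
    return {src: set(matcher(src, sorted(block))) for src, block in blocks.items()}


def evaluate(pred: Mapping, gold: Mapping) -> Dict[str, float]:
    """Pair-level precision and recall, plus coverage over all source codes."""
    pred_pairs = {(s, t) for s, ts in pred.items() for t in ts}
    gold_pairs = {(s, t) for s, ts in gold.items() for t in ts}
    tp = len(pred_pairs & gold_pairs)
    covered = sum(1 for ts in pred.values() if ts)
    return {
        "precision": tp / len(pred_pairs) if pred_pairs else 0.0,
        "recall": tp / len(gold_pairs) if gold_pairs else 0.0,
        "coverage": covered / len(pred) if pred else 0.0,
    }


# Toy data (placeholders, not real ICD codes or real similarity scores).
SCORES: Scores = {
    "S1": {"T1": 0.90, "T2": 0.80, "T3": 0.30},
    "S2": {"T3": 0.60, "T4": 0.20, "T1": 0.10},
    "S3": {"T4": 0.40, "T2": 0.35, "T1": 0.10},
}
GOLD: Mapping = {"S1": {"T1", "T2"}, "S2": {"T3"}, "S3": {"T4"}}


def oracle_matcher(src: str, candidates: List[str]) -> List[str]:
    """Stand-in for the LLM step: keeps candidates listed in GOLD."""
    return [c for c in candidates if c in GOLD.get(src, set())]


if __name__ == "__main__":
    print("top-K (k=2):      ", evaluate(top_k(SCORES, 2), GOLD))
    print("threshold (0.5):  ", evaluate(threshold(SCORES, 0.5), GOLD))
    print("block-and-match:  ", evaluate(block_and_match(SCORES, 2, oracle_matcher), GOLD))
```

On this toy input, top-K with k=2 covers every source code but admits spurious pairs, while a threshold of 0.5 is precise but leaves `S3` unmapped. Blocking followed by a matcher keeps the coverage of top-K and delegates the pruning to the matching step. How well that pruning works in practice depends entirely on the matcher used.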