How Far Has AI Come in Liver Fibrosis Staging? A Large-Scale Real-World Dataset and Benchmark
2026-05-25 • Computer Vision and Pattern Recognition
AI summary
The authors created LiFS, a large multi-center dataset of 610 patients, to test how well AI can stage liver fibrosis from gadoxetic acid-enhanced MRI, with stages confirmed by histopathology. They compared nine AI methods with radiologists. The best methods performed about as well as a senior radiologist, but typical (median) AI performance was closer to that of a junior radiologist. They also found that differences between hospitals and scanners, an uneven distribution of fibrosis stages across patients, and variation in contrast-enhanced sequences are the main obstacles to AI accuracy. Finally, design choices such as image registration, input dimensionality, how MRI sequences are combined, and the network backbone appear to affect how well methods hold up across hospitals, but no single choice solves the problem. LiFS offers a real-world benchmark to help improve AI for liver fibrosis diagnosis.
Liver fibrosis, AI staging, Multi-center dataset, Gadoxetic acid-enhanced MRI, Histopathology, Cross-center heterogeneity, Label imbalance, Spatial registration, Multi-modal fusion, Backbone architecture
Authors
Yuanye Liu, Nannan Shi, Zhejia Zhang, Hanxiao Zhang, Boya Wang, Derong Yu, Nao Wang, Yuxin Jin, Yang Zhou, Kunhao Yuan, Siqi Wang, Lida Yang, Xu Qiao, Wentao Liu, Xuelei He, Xin Hong, Guoyan Zheng, Xin Chen, Guang-Zhong Yang, Le Zhang, Lei Li, Yuxin Shi, Xiahai Zhuang
Abstract
Despite years of methodological progress, how far AI has come in liver fibrosis staging has never been systematically evaluated under the heterogeneous, multi-center conditions that define clinical practice. To address this gap, we introduce LiFS, a large-scale dataset and benchmark derived from the MICCAI 2025 CARE-Liver challenge, comprising 610 patients across multiple centers and scanners with multi-sequence MRI. To the best of our knowledge, LiFS is the first benchmark providing complete gadoxetic acid-enhanced sequences with histopathology-confirmed annotations from diverse real-world scanners. By systematically evaluating 9 independently developed methods, selected from 96 registered teams, against in-cohort radiologist reference results, we assess how far current AI has progressed toward clinical-level liver fibrosis staging from three complementary perspectives. First, against radiologists, the best AI methods were broadly comparable to the senior radiologist and significantly exceeded the junior radiologist in selected settings, while median AI performance generally approached junior-radiologist levels. Second, from a data perspective, cross-center heterogeneity, label imbalance, and contrast-enhanced sequence variability emerge as the dominant challenges for AI methods. Third, from a technical perspective, methodological design choices, including spatial registration, input dimensionality, multi-modal fusion strategy, and backbone architecture, appear to modulate cross-center robustness, although no single choice alone closes the gap. Overall, LiFS provides a rigorous real-world benchmark for positioning the current state of AI in liver fibrosis staging and for enabling future research on the key challenges that limit clinically reliable deployment.
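The abstract contrasts the best and the median of the evaluated methods and stresses cross-center robustness, but it does not state here which metric or which binary staging tasks are scored. As an illustration only, the Python sketch below assumes a binary staging task scored by rank-based AUC and shows how per-center best and median scores across methods could be tabulated. The functions `auc`, `per_center_auc`, and `summarize`, the record layout, and the toy records are all hypothetical; they are not LiFS data and not the challenge's official evaluation code.

```python
from statistics import median


def auc(labels, scores):
    """Rank-based (Mann-Whitney) AUC for one binary task; tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_center_auc(records):
    """Group hypothetical (center, label, score) records by center and score each center separately."""
    by_center = {}
    for center, label, score in records:
        labels, scores = by_center.setdefault(center, ([], []))
        labels.append(label)
        scores.append(score)
    return {c: auc(ls, ss) for c, (ls, ss) in by_center.items()}


def summarize(methods):
    """For each center, report the best and the median AUC across all methods."""
    per_method = {name: per_center_auc(recs) for name, recs in methods.items()}
    centers = sorted({c for aucs in per_method.values() for c in aucs})
    summary = {}
    for c in centers:
        values = [aucs[c] for aucs in per_method.values() if c in aucs]
        summary[c] = {"best": max(values), "median": median(values)}
    return summary


# Toy records (center, label, score): invented for illustration, NOT LiFS data.
toy = {
    "method_a": [("A", 1, 0.9), ("A", 0, 0.2), ("A", 1, 0.7), ("B", 1, 0.4), ("B", 0, 0.6)],
    "method_b": [("A", 1, 0.8), ("A", 0, 0.3), ("A", 1, 0.6), ("B", 1, 0.7), ("B", 0, 0.5)],
    "method_c": [("A", 1, 0.55), ("A", 0, 0.45), ("A", 1, 0.35), ("B", 1, 0.5), ("B", 0, 0.5)],
}

for center, stats in summarize(toy).items():
    print(center, stats)
```

On these toy records, center A yields a best AUC of 1.0 and a median of 1.0, while center B keeps a best of 1.0 but drops to a median of 0.5. Reporting the best and the median side by side per center is one simple way to see that a leading method can hold up at a center where typical methods do not.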