Fundus-R1: Training a Fundus-Reading MLLM with Knowledge-Aware Reasoning on Public Data

2026-04-09
Computer Vision and Pattern Recognition

AI summary

The authors work on improving computer programs that read eye images to detect problems early. Since many detailed eye image datasets are private, they use only public data that mostly have simple labels. They create a method to automatically link image details to medical knowledge and train the model to reason consistently about the images. Their new model, Fundus-R1, performs better than other similar models on several test sets. This approach shows it is possible to build strong eye image readers using publicly available data.

Fundus imaging; Color Fundus Photography (CFP); Optical Coherence Tomography (OCT); Ultra-Widefield Imaging (UWF); Multimodal Large Language Model (MLLM); Supervised Finetuning (SFT); Reinforcement Learning with Verifiable Rewards (RLVR); Retrieval-Augmented Generation (RAG); Fundus image annotation
Authors
Yuchuan Deng, Qijie Wei, Kaiheng Qian, Jiazhen Liu, Zijie Xin, Bangxiang Lan, Jingyu Liu, Jianfeng Dong, Xirong Li
Abstract
Fundus imaging such as CFP, OCT and UWF is crucial for the early detection of retinal anomalies and diseases. Fundus image understanding, due to its knowledge-intensive nature, poses a challenging vision-language task. An emerging approach to addressing the task is to post-train a generic multimodal large language model (MLLM), either by supervised finetuning (SFT) or by reinforcement learning with verifiable rewards (RLVR), on a considerable amount of in-house samples paired with high-quality clinical reports. However, these valuable samples are not publicly accessible, which not only hinders reproducibility but also, in practice, limits research to a few players. To overcome this barrier, we make a novel attempt to train a reasoning-enhanced fundus-reading MLLM, which we term Fundus-R1, using exclusively public datasets, wherein over 94% of the data are annotated with only image-level labels. Our technical contributions are two-fold. First, we propose a RAG-based method for composing image-specific, knowledge-aware reasoning traces. Such auto-generated traces link visual findings identified by a generic MLLM to the image labels in terms of ophthalmic knowledge. Second, we enhance RLVR with a process reward that encourages self-consistency of the generated reasoning trace in each rollout. Extensive experiments on three fundus-reading benchmarks, i.e., FunBench, Omni-Fundus and GMAI-Fundus, show that Fundus-R1 clearly outperforms multiple baselines, including its generic counterpart (Qwen2.5-VL) and a stronger edition post-trained without using the generated traces. This work paves the way for training powerful fundus-reading MLLMs with publicly available data.
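The abstract does not say how the process reward is computed, so the following is only a minimal sketch of the general idea of adding a self-consistency term to a verifiable outcome reward. Everything in it is an assumption made for illustration: the `<think>`/`<answer>` rollout format, the function names, the `weight` parameter, and the consistency check itself, which is a crude proxy that only tests whether the reasoning trace mentions the final answer. It is not the authors' reward.

```python
import re

# Hypothetical rollout format: <think>reasoning trace</think><answer>label</answer>
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_rollout(text):
    """Split a rollout into (reasoning trace, final answer); missing parts become ''."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    trace = think.group(1).strip() if think else ""
    final = answer.group(1).strip() if answer else ""
    return trace, final


def outcome_reward(answer, gold_label):
    """Verifiable reward: 1.0 if the final answer matches the image-level label."""
    return 1.0 if answer.lower() == gold_label.lower() else 0.0


def consistency_reward(trace, answer):
    """Assumed proxy for self-consistency: the trace must mention the final answer."""
    return 1.0 if answer and answer.lower() in trace.lower() else 0.0


def total_reward(rollout, gold_label, weight=0.5):
    """Outcome reward plus a weighted process (consistency) term."""
    trace, answer = parse_rollout(rollout)
    return outcome_reward(answer, gold_label) + weight * consistency_reward(trace, answer)


consistent = (
    "<think>Microaneurysms and hard exudates suggest diabetic retinopathy.</think>"
    "<answer>Diabetic retinopathy</answer>"
)
inconsistent = (
    "<think>The cup-to-disc ratio is enlarged, suggesting glaucoma.</think>"
    "<answer>Diabetic retinopathy</answer>"
)
print(total_reward(consistent, "diabetic retinopathy"))
print(total_reward(inconsistent, "diabetic retinopathy"))
```

Under these assumptions, a rollout whose answer is correct but whose reasoning argues for a different diagnosis scores lower than one whose reasoning and answer agree, which is the behaviour a self-consistency process reward is meant to encourage.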