RAPTOR+: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing

2026-05-25 • Computer Vision and Pattern Recognition

AI summary

The authors developed RAPTOR+, a system that uses vision-language models to read urgent colorectal cancer referral forms. The original RAPTOR relied on a separate OCR step and struggled with handwriting and layout variation; RAPTOR+ instead processes the whole document, image and text together. The authors compared several models and found that fine-tuning on these forms improved extraction accuracy and made it far more reliable to link each extracted value back to its location in the original document. RAPTOR+ could therefore help make cancer referrals faster and safer by tying each extracted decision to the visual evidence behind it.

Colorectal cancer, Referral forms, Large Language Models, Vision-Language Models, OCR (Optical Character Recognition), Fine-tuning, Evidence grounding, Clinical document understanding, Zero-shot learning, Model evaluation

Authors

Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Anusha Jose, Benjamin Wallace, William Poulett, Adam Byfield, Lukman Akanbi, Muhammad Bilal

Abstract

Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.
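
The grounding gap described above can be illustrated with a minimal sketch. The abstract does not define Reading Accuracy or Strict Safety, so the definitions here are assumptions made only for illustration: Reading Accuracy is taken to be the fraction of fields whose extracted value matches the reference, and Strict Safety the fraction of fields that are both read correctly and localised with a bounding-box IoU at or above a threshold. The names `FieldResult`, `iou`, `reading_accuracy`, `strict_safety`, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Box format is an assumption: (x_min, y_min, x_max, y_max) in page coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class FieldResult:
    """One extracted form field with its reference annotation (hypothetical schema)."""
    predicted_value: str
    true_value: str
    predicted_box: Optional[Box]  # None when the model returned no evidence region
    true_box: Box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_read_correctly(r: FieldResult) -> bool:
    """Assumed matching rule: case-insensitive exact match after trimming."""
    return r.predicted_value.strip().lower() == r.true_value.strip().lower()


def reading_accuracy(results: List[FieldResult]) -> float:
    """Fraction of fields whose value is correct, ignoring localisation."""
    if not results:
        return 0.0
    return sum(is_read_correctly(r) for r in results) / len(results)


def strict_safety(results: List[FieldResult], iou_threshold: float = 0.5) -> float:
    """Fraction of fields that are correct AND grounded (assumed definition)."""
    if not results:
        return 0.0
    grounded = 0
    for r in results:
        if (
            is_read_correctly(r)
            and r.predicted_box is not None
            and iou(r.predicted_box, r.true_box) >= iou_threshold
        ):
            grounded += 1
    return grounded / len(results)
```

Under these assumed definitions, Strict Safety can never exceed Reading Accuracy, so a model that reads values well but points at the wrong region, or at no region, scores high on the first metric and low on the second, which is the pattern the abstract reports for zero-shot models.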
