BiliVLA: Scene-Aware Vision-Language-Action Model with Reinforcement Learning for Autonomous Biliary Endoscopic Navigation

2026-06-22 • Robotics

AI summary

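The authors developed BiliVLA, a system that autonomously steers an endoscope during endoscopic retrograde cholangiopancreatography (ERCP), a procedure that requires precise navigation and cannulation of the bile duct. Given an endoscopic image and a language instruction for the current stage of the procedure, BiliVLA identifies the target, localizes it with a bounding box, and chooses a discrete motor command, even when the view is degraded by reflections, occlusions, or tissue contact. Training proceeds in two stages: supervised fine-tuning that teaches the model to recognize targets and to retreat conservatively when the scope touches the luminal wall, followed by reinforcement learning that makes its actions more reliable and consistent. In phantom experiments covering three ERCP subtasks, the system reached an average action precision of 91.96% and an overall success rate of 84.85%. The work suggests that combining semantic grounding, safety-aware supervision, and reward-guided optimization helps robots perform delicate endoscopic navigation tasks.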

Endoscopic retrograde cholangiopancreatography (ERCP), Biliary cannulation, Visuomotor learning, Semantic grounding, Scene-aware supervision, Motor command, Closed-loop navigation, Group Relative Policy Optimization (GRPO), Endoscope navigation, Continuum endoscope

Authors

Jinsong Lin, Chi kit Ng, Zhiyong Xiong, Zikang Pan, Yihan Hu, Tabassum Tamima, Ziyi Hao, Eddie Cheung, Jiewen Lai, Huxin Gao, Hongliang Ren

Abstract

Endoscopic retrograde cholangiopancreatography (ERCP) demands precise endoscopic navigation and stable biliary cannulation within a narrow monocular field characterized by specular reflections, partial occlusions, and frequent tissue contact. Although recent robotic systems and vision-based assistance techniques improve operator ergonomics and provide perceptual cues, their performance degrades under pronounced anatomical variability and safety-critical visual artifacts, which hinders reliable autonomy in cannulation-grade procedures. Here, we present BiliVLA, a scene-aware Vision-Language-Action (VLA) framework that formulates biliary endoscopic navigation as an instruction-conditioned visuomotor learning problem. Given an endoscopic observation and a stage-specific language instruction, BiliVLA jointly predicts the target category, a grounded bounding box, and a discrete three-degree-of-freedom (DoF) motor command for a continuum endoscope. The framework incorporates scene-aware supervision to enhance semantic target consistency and safety-aware recovery supervision to induce conservative retreat behaviors under luminal wall contact. A key component of BiliVLA is a two-stage training paradigm that combines grounding-enhanced supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), which significantly improves action reliability and decision consistency during closed-loop navigation. Across three ERCP subtasks, BiliVLA achieves an average action precision of 91.96% and an overall success rate (SR) of 84.85% in real-world phantom experiments. These results indicate that integrating semantic grounding, scene-aware learning, and reward-guided optimization improves perception-action alignment and enables robust autonomous endoscopic navigation.
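
The abstract names GRPO as the second training stage but does not describe its reward or sampling setup. As a rough illustration of the general technique only, the sketch below shows the group-relative advantage normalization that gives GRPO its name: several responses are sampled for the same input, and each reward is standardized against its own group. The function name, the example rewards, and the use of the population standard deviation are assumptions made here for illustration, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and spread of its own group.

    Assumption: population standard deviation is used; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample in the group scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one
# (endoscopic image, instruction) pair; the values are made up.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that score above their group's mean receive positive advantages and are reinforced, while those below the mean are discouraged. A group with identical rewards yields all-zero advantages and so contributes no learning signal.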
