EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory
2026-05-25 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary
The authors identify challenges in segmenting ultrasound videos due to speckle noise, weak boundaries, and rapid anatomical deformation. They introduce EchoPilot, a training-free method that needs only one click on the first frame and the name of the anatomical structure to locate it and track it over time. EchoPilot combines existing frozen models to choose a suitable context around the click and to keep early errors from accumulating as the video progresses. They also created a new dataset of fetal placenta ultrasound videos with 671 annotated frames to test their approach. On three ultrasound video datasets, their results show better performance than training-free baselines and fine-tuned specialist models under the same sparse-interaction setting.
ultrasound video segmentation, speckle noise, vision-language model (VLM), foundation model, semantic localization, mask propagation, few-shot interaction, scale-space prompting, memory update, fetal placenta dataset
Authors
Ruiqiang Xiao, Zhaohu Xing, Yijun Yang, Zhenyan Han, Weiming Wang, Kaishun Wu, Lei Zhu
Abstract
Ultrasound video segmentation is clinically valuable yet difficult due to speckle noise, weak boundaries, and rapid anatomical deformation. Recent promptable foundation models enable point-guided segmentation, but their direct deployment in ultrasound remains unreliable: a single point provides insufficient spatial context to resolve scale ambiguity, and greedy memory updates amplify early errors into severe temporal drift. We present EchoPilot, a training-free framework for ultrasound video segmentation under sparse first-frame interaction, requiring only a single point click and an anatomical category name. EchoPilot orchestrates a frozen medical vision-language model (VLM) for semantic localization, a vision foundation model (VFM) for dense geometric feature extraction, and a promptable video segmentor for mask prediction and propagation. To resolve initialization ambiguity, we propose Scale-Space Semantic Prompting, which first selects an optimal contextual view via a parameter-free S.E.E.D. (Semantic Energy-Entropy Density) criterion, and then synthesizes geometrically precise auxiliary point prompts from dense foundation features without additional user interaction. To reduce propagation drift, a Reliability-Gated Memory update is further introduced to selectively freeze the segmentor's memory bank under uncertain predictions, preventing error accumulation. We also contribute the first dynamic fetal placenta ultrasound video segmentation dataset with 671 annotated frames. Across three ultrasound video datasets, EchoPilot achieves state-of-the-art performance under the sparse-interactive setting, consistently outperforming training-free baselines and finetuned specialists.
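The idea of a reliability-gated memory update can be illustrated with a minimal sketch. The abstract does not specify how reliability is measured or how the memory bank is structured, so everything below is an assumption made for illustration: the `mask_confidence` score (mean distance of per-pixel probabilities from 0.5), the `threshold` value, the `capacity`, and the `GatedMemory` class are hypothetical and are not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Mask = List[List[float]]  # per-pixel foreground probabilities in [0, 1]


def mask_confidence(prob_mask: Mask) -> float:
    """Hypothetical reliability score in [0, 1].

    Averages how far each pixel probability lies from 0.5 (the most
    uncertain value), rescaled so that 0 is fully uncertain and 1 is
    fully confident. This is an assumed proxy, not the paper's measure.
    """
    values = [p for row in prob_mask for p in row]
    if not values:
        return 0.0
    return sum(abs(p - 0.5) * 2.0 for p in values) / len(values)


@dataclass
class GatedMemory:
    """Toy FIFO memory bank that accepts only reliable predictions."""

    threshold: float = 0.6  # assumed value, not taken from the paper
    capacity: int = 4       # assumed value, not taken from the paper
    entries: Deque[Tuple[int, Mask]] = field(default_factory=deque)

    def update(self, frame_idx: int, prob_mask: Mask) -> bool:
        """Write the frame to memory only if its prediction is reliable.

        Returns True if the memory was updated and False if it was frozen.
        """
        if mask_confidence(prob_mask) < self.threshold:
            return False  # freeze: keep the existing memory untouched
        if len(self.entries) == self.capacity:
            self.entries.popleft()  # evict the oldest entry
        self.entries.append((frame_idx, prob_mask))
        return True


# Usage: a confident frame, an uncertain frame, then a confident frame.
memory = GatedMemory(threshold=0.6, capacity=2)
confident = [[0.95, 0.05], [0.9, 0.1]]
uncertain = [[0.55, 0.45], [0.5, 0.6]]
written = [
    memory.update(0, confident),
    memory.update(1, uncertain),
    memory.update(2, confident),
]
stored_frames = [idx for idx, _ in memory.entries]
```

In this toy run the uncertain frame is skipped, so the memory holds only the two confident frames. A greedy update would have stored all three, which is the behavior the abstract identifies as a source of temporal drift.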