PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

2026-06-15 · Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Robotics
AI summary

The authors present PROSE, a method for registering two RGB image sequences of the same indoor space captured at different times with a head-mounted camera. The task is hard because egocentric views are blurry, fast-moving, and only partially overlapping, and no depth data is available. Instead of relying on clean point clouds or a pre-built, annotated scene graph, PROSE uses a pretrained vision-language model to recognize objects and match them across the two sequences. Object heights serve as a prior for matching, each proposed match is verified with a same/different query, and the rigid transform is chosen by geometric consensus, with no training and no depth sensor. On the Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms geometric and learned scene-graph baselines, and the scene graph it builds transfers to downstream tasks.

egocentric vision, RGB-only registration, vision-language models, 3D scene graph, object matching, rigid transform, head-mounted cameras, spatial memory, point cloud, geometry verification
Authors
Zhiang Chen, Nahyuk Lee, Boyang Sun, Taein Kwon, Marc Pollefeys, Zuria Bauer, Sunghwan Hong
Abstract
Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.
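The final step described in the abstract (one candidate transform per matched object, then selection by geometric consensus) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes, purely for illustration, that each matched object comes with a hypothetical 4x4 pose in each scan, so a single match already implies a full rigid transform, and that consensus is the number of matched object centroids that land within a tolerance `tol` after applying the candidate. The function names, the pose representation, and the 0.3 m tolerance are all assumptions that do not come from the abstract.

```python
import numpy as np

def candidate_from_poses(pose_a, pose_b):
    """Rigid transform (4x4) mapping scan A into scan B implied by one match.

    Assumption: each matched object has a 4x4 pose in both scans.
    """
    return pose_b @ np.linalg.inv(pose_a)

def count_inliers(T, cents_a, cents_b, tol):
    """Count matched centroids from A that land within tol of their B partner."""
    moved = cents_a @ T[:3, :3].T + T[:3, 3]
    return int(np.sum(np.linalg.norm(moved - cents_b, axis=1) < tol))

def select_transform(poses_a, poses_b, tol=0.3):
    """Try one candidate per matched object; keep the one with most inliers."""
    cents_a = np.array([p[:3, 3] for p in poses_a])
    cents_b = np.array([p[:3, 3] for p in poses_b])
    best_T, best_score = None, -1
    for pa, pb in zip(poses_a, poses_b):
        T = candidate_from_poses(pa, pb)
        score = count_inliers(T, cents_a, cents_b, tol)
        if score > best_score:
            best_T, best_score = T, score
    return best_T, best_score

def yaw_pose(yaw, t):
    """Helper for the demo: rotation about the vertical axis plus translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

# Synthetic demo: five matched objects, the last match is deliberately wrong.
T_true = yaw_pose(0.5, [1.0, -2.0, 0.0])
poses_a = [yaw_pose(0.1 * i, [float(i), 2.0 * i, 0.5]) for i in range(5)]
poses_b = [T_true @ p for p in poses_a]
poses_b[4] = yaw_pose(2.0, [9.0, 9.0, 0.5])  # simulated false match

T_est, score = select_transform(poses_a, poses_b)
```

In this synthetic setup the four correct matches all vote for the same transform, so the single false match cannot win the consensus. Because every candidate is scored against all matches, the cost is quadratic in the number of matched objects, which is small at object level.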