SAVMap: Structure-Aided Visual Mapping of Large-Scale 2.5D Manhattan Wireframes from Panoramic Video
2026-06-01 • Computer Vision and Pattern Recognition
AI summary
The authors present SAVMap, a method that builds 3D wireframe maps of warehouse shelves and lights using only a 360-degree video camera. A semantic segmentation network extracts sparse structure points from each image, such as shelf corners and light centers, and these points are tracked across the image sequence. By enforcing geometric constraints among the points, such as Manhattan grid alignment, a constrained structure-from-motion step produces accurate reconstructions of large warehouse structures. In a warehouse with 46 shelving rows, the method achieved a mean absolute error of 4.8 cm relative to ground truth.
3D reconstruction, semantic segmentation, wireframe map, panoramic video, structure-from-motion, Manhattan grid, robot localization, digital twin, feature tracking
Authors
Howard Huang, Bharath Surianarayanan, Keifer Lee, Chenyu Wang, Chen Feng
Abstract
Precise 3D representations of industrial environments enable tasks such as robot localization and digital twin generation. We propose SAVMap, a method for generating a semantic wireframe map of warehouse shelf and light structures using only a panoramic video camera as the sensor input. Sequences of rectified images with shelf and ceiling-facing views are extracted from a panoramic video captured along the warehouse aisles. Using a semantic segmentation network front end, a set of sparse, semantic structure feature points (e.g., corners of shelf structures, centers of lights) is extracted from each image and tracked across the sequences. By accounting for real-world geometric relationships among the points, such as Manhattan grids, a constrained structure-from-motion algorithm yields the 3D points that form a wireframe map. We demonstrate the scalability and accuracy of our proposal in a warehouse with 46 shelving rows, each with faces spanning 55 m by 7 m. From an hour of panoramic video content, we create wireframe maps for over 5000 shelf elements across the rows, achieving an aggregate mean absolute error of 4.8 cm with respect to ground truth.
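To illustrate why a Manhattan grid prior can help, the toy sketch below fits a regular, axis-aligned grid to noisy corner estimates on a single shelf face and reports the mean absolute error before and after. It is not the SAVMap algorithm: the abstract describes constraints applied inside structure-from-motion, whereas this example only snaps already-estimated 2D points to a grid by least squares. All names (`fit_regular_spacing`, `snap_to_manhattan_grid`, `mean_absolute_error`) and all numeric values are hypothetical and chosen for illustration.

```python
from statistics import mean


def fit_regular_spacing(indices, coords):
    """Closed-form least-squares fit of coords ~ origin + index * spacing."""
    i_bar, c_bar = mean(indices), mean(coords)
    var = sum((i - i_bar) ** 2 for i in indices)
    cov = sum((i - i_bar) * (c - c_bar) for i, c in zip(indices, coords))
    spacing = cov / var
    origin = c_bar - spacing * i_bar
    return origin, spacing


def snap_to_manhattan_grid(points):
    """Map noisy (x, z) corner estimates onto a regular axis-aligned grid.

    points: dict keyed by (column, level) grid indices, values are (x, z) in metres.
    Assumes evenly spaced columns and levels, which is a simplification.
    """
    keys = list(points)
    x0, dx = fit_regular_spacing([k[0] for k in keys], [points[k][0] for k in keys])
    z0, dz = fit_regular_spacing([k[1] for k in keys], [points[k][1] for k in keys])
    return {k: (x0 + k[0] * dx, z0 + k[1] * dz) for k in keys}


def mean_absolute_error(estimate, reference):
    """Mean absolute per-coordinate error between two point dicts with the same keys."""
    return mean(
        abs(e - r) for k in reference for e, r in zip(estimate[k], reference[k])
    )


# Hypothetical 3 x 3 shelf face: 2.5 m between columns, 1.5 m between levels.
truth = {(c, l): (1.0 + 2.5 * c, 0.2 + 1.5 * l) for c in range(3) for l in range(3)}

# Deterministic, made-up perturbation (metres) standing in for triangulation noise.
offsets = [0.03, -0.06, 0.03]
noisy = {(c, l): (x + offsets[c], z + offsets[l]) for (c, l), (x, z) in truth.items()}

snapped = snap_to_manhattan_grid(noisy)
print("MAE before grid fit:", mean_absolute_error(noisy, truth))
print("MAE after grid fit: ", mean_absolute_error(snapped, truth))
```

The perturbation here is deliberately chosen so that it cannot be explained by a shift or a change of spacing, which lets the grid fit cancel it almost entirely. Real measurement noise would only be partially reduced, and real shelving need not be perfectly evenly spaced.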