Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs
2026-04-23 • Computer Vision and Pattern Recognition
AI summary
The authors propose a new way to understand human actions and their surroundings without using cameras, which raise persistent privacy and safety concerns. Their method, called IMU-to-4D, uses motion sensors like those in earbuds or smartphones to predict how people move and the general shape of the space around them. They repurpose large language models to process this sensor data for detailed 4D motion and scene layout reconstruction. Tests show their approach produces more coherent and temporally stable results than existing multi-stage methods. This suggests that wearable motion sensors alone can provide a rich understanding of human movement and environments.
4D perception, Inertial Measurement Units (IMUs), Human motion reconstruction, Scene layout estimation, Large language models, Wearable sensors, Spatiotemporal understanding, Privacy in sensing, Human-scene dynamics, Non-visual perception
Authors
Hao-Yu Hsu, Tianhang Cheng, Jing Wen, Alexander G. Schwing, Shenlong Wang
Abstract
Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, whose goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a small number of inertial sensors embedded in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.
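To make the described data flow concrete, here is a minimal PyTorch sketch of one plausible shape for such a pipeline: multi-sensor IMU streams feed a shared sequence backbone (a small transformer standing in for the repurposed LLM), with one head emitting per-frame body pose and another a coarse occupancy grid for the scene. Every name, dimension, and layer choice (`IMUTo4DSketch`, `pose_dim=72`, the 16³ grid) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn


class IMUTo4DSketch(nn.Module):
    """Hypothetical sketch of an IMU-to-4D-style model: a shared sequence
    backbone maps multi-sensor IMU streams to per-frame human pose plus a
    coarse scene-occupancy grid. All sizes are illustrative assumptions."""

    def __init__(self, n_sensors=3, d_model=256, pose_dim=72, grid=16):
        super().__init__()
        # Each IMU frame: 3-axis accelerometer + 3-axis gyroscope per sensor.
        self.embed = nn.Linear(n_sensors * 6, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)  # LLM stand-in
        self.pose_head = nn.Linear(d_model, pose_dim)    # per-frame body pose
        self.scene_head = nn.Linear(d_model, grid ** 3)  # coarse occupancy logits
        self.grid = grid

    def forward(self, imu):
        # imu: (batch, time, n_sensors * 6)
        h = self.backbone(self.embed(imu))
        pose = self.pose_head(h)              # (batch, time, pose_dim)
        # Pool over time for a single static estimate of the scene layout.
        occ = self.scene_head(h.mean(dim=1))  # (batch, grid^3)
        return pose, occ.view(-1, self.grid, self.grid, self.grid)


model = IMUTo4DSketch()
imu = torch.randn(1, 120, 3 * 6)  # 1 clip, 120 frames, 3 wearable IMUs
pose, occ = model(imu)
print(pose.shape, occ.shape)      # (1, 120, 72) and (1, 16, 16, 16)
```

Pooling over time for the scene head reflects the abstract's asymmetry: motion is detailed and per-frame ("4D"), while the scene estimate is coarse and roughly static over a clip.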