OctoSense: Self-Supervised Learning for Multimodal Robot Perception

2026-06-25 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Robotics

AI summary

The authors introduce OctoSense, an open-source sensor platform that combines stereo RGB and event cameras, LiDAR, a thermal camera, an IMU, RTK GPS, and proprioceptive data (CAN bus data from a car, joint angles from a quadruped robot). With it they collected 59 hours of time-synchronized driving data across varied environments and times of day, including nighttime and situations where sensors are badly degraded. They train a late-fusion masked autoencoder on all of these sensors together: each sensor type gets its own tokenizer to handle its rate and signal format, and tokens are cached at inference time so that new measurements can be processed as they arrive. The resulting model computes its representation quickly and outperforms image-only foundation models at estimating optical flow, depth, semantic segmentation, and ego-motion, and it stays robust when some sensors are degraded.

Stereo RGB Cameras, Event Cameras, LiDAR, Thermal Camera, Inertial Measurement Unit (IMU), RTK GPS, Self-Supervised Learning, Masked Autoencoder, Optical Flow, Semantic Segmentation

Authors

Anthony Bisulco, Jeremy Wang, Kostas Daniilidis, Randall Balestriero, Pratik Chaudhari

Abstract

We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, an RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies, and noise. Our approach, a "late-fusion" masked autoencoder (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they arrive. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. See our project page for links to the dataset, code, and supplementary videos: https://abisulco.com/octosense/.
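
The token caching mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `TokenCache`, `chunk_tokenizer`, the modality names, and the chunk sizes are all hypothetical, and plain concatenation stands in for the learned fusion encoder. The sketch shows only the idea that a new measurement from one sensor triggers re-tokenization of that sensor alone, while the other sensors' tokens are reused.

```python
from typing import Callable, Dict, List, Sequence

Token = List[float]
Tokenizer = Callable[[Sequence[float]], List[Token]]


def chunk_tokenizer(size: int) -> Tokenizer:
    """Toy tokenizer (assumption): split a flat measurement into fixed-size chunks."""
    def tokenize(measurement: Sequence[float]) -> List[Token]:
        return [list(measurement[i:i + size])
                for i in range(0, len(measurement), size)]
    return tokenize


class TokenCache:
    """Hypothetical cache holding the most recent tokens of each modality."""

    def __init__(self, tokenizers: Dict[str, Tokenizer]) -> None:
        self.tokenizers = tokenizers
        self.tokens: Dict[str, List[Token]] = {}
        # Count tokenizer calls to make the saved work visible.
        self.calls: Dict[str, int] = {name: 0 for name in tokenizers}

    def update(self, modality: str, measurement: Sequence[float]) -> None:
        # Re-tokenize only the modality that produced a new measurement.
        self.tokens[modality] = self.tokenizers[modality](measurement)
        self.calls[modality] += 1

    def fused(self) -> List[Token]:
        # Stand-in for late fusion: concatenate cached tokens in a fixed
        # (alphabetical) modality order. A real model would feed these
        # tokens to a shared encoder instead.
        out: List[Token] = []
        for name in sorted(self.tokens):
            out.extend(self.tokens[name])
        return out


# Sensors report at different rates: one camera frame, two IMU readings.
cache = TokenCache({"rgb": chunk_tokenizer(4), "imu": chunk_tokenizer(3)})
cache.update("rgb", [0.1] * 8)         # tokenized once, then reused
cache.update("imu", [0.0, 0.0, 9.8])
cache.update("imu", [0.1, 0.0, 9.7])   # replaces the older IMU token
fused_tokens = cache.fused()
```

After these updates the camera tokenizer has run once and the IMU tokenizer twice, and `fused_tokens` holds the latest IMU token followed by the two cached camera tokens.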
