SS3D: End2End Self-Supervised 3D from Web Videos

2026-04-24

Computer Vision and Pattern Recognition
AI summary

The authors introduce SS3D, a method that learns to estimate 3D information (depth, camera motion, and camera intrinsics) from ordinary videos without extra labels. Their model predicts all of these in a single forward pass and is trained end-to-end. To learn from diverse, large-scale web videos, they filter clips using a multi-view signal score and train with an easy-to-hard curriculum. Pretraining on a large YouTube video dataset lets the model generalize to new kinds of videos without further training, and the authors release their trained model and code for others to use.

self-supervised learning · structure from motion (SfM) · depth estimation · ego-motion · camera intrinsics · monocular video · end-to-end training · zero-shot transfer · curriculum learning · multi-view signals
Authors
Marwane Hariat, Gianni Franchi, David Filliat, Antoine Manzanera
Abstract
We present SS3D, a web-scale, SfM-based self-supervised pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained and evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and by distilling expert models into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improves fine-tuning performance over prior self-supervised baselines. We release the pretrained checkpoint and code.
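
The sketch below illustrates, in minimal PyTorch, the three ingredients the abstract names: a single network emitting depth, ego-motion, and intrinsics in one forward pass; MVS-proxy filtering with easy-to-hard curriculum ordering; and an intrinsics-first two-stage schedule. Every name, head layout, loss interface, and hyperparameter here (`SS3DStudent`, `mvs_score`, `photometric_loss`, epoch counts, thresholds) is an illustrative assumption, not the authors' released implementation.

```python
# Hypothetical sketch of an SS3D-style setup; not the official code.
import torch
import torch.nn as nn


class SS3DStudent(nn.Module):
    """Single forward pass -> (dense depth, 6-DoF ego-motion, 4 intrinsics)."""

    def __init__(self, feat: int = 64):
        super().__init__()
        # Shared encoder over a stacked frame pair (2 RGB frames -> 6 channels).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, feat, 7, stride=2, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Dense positive depth, upsampled back to input resolution (x4).
        self.depth_head = nn.Sequential(
            nn.Conv2d(feat, 1, 3, padding=1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
            nn.Softplus(),
        )
        # Global heads from pooled features: axis-angle + translation (6),
        # and normalized intrinsics (fx, fy, cx, cy).
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.pose_head = nn.Linear(feat, 6)
        self.intr_head = nn.Linear(feat, 4)

    def forward(self, frame_t, frame_t1):
        f = self.encoder(torch.cat([frame_t, frame_t1], dim=1))
        g = self.pool(f).flatten(1)
        return self.depth_head(f), self.pose_head(g), self.intr_head(g)


def mvs_curriculum(clips, mvs_score, threshold=0.3):
    """Drop clips with weak multi-view signal, then order the survivors
    easy-to-hard (strongest proxy score first) for curriculum sampling."""
    scored = [(c, mvs_score(c)) for c in clips]
    kept = [(c, s) for c, s in scored if s >= threshold]
    return [c for c, _ in sorted(kept, key=lambda cs: cs[1], reverse=True)]


def train_two_stage(model, loader, photometric_loss, stage_epochs=(2, 8)):
    """Intrinsics-first schedule (one possible reading): stage 1 updates only
    the intrinsics head under the self-supervised photometric objective;
    stage 2 then optimizes all parameters jointly."""
    for stage, n_epochs in enumerate(stage_epochs, start=1):
        params = model.intr_head.parameters() if stage == 1 else model.parameters()
        opt = torch.optim.AdamW(params, lr=1e-4)
        for _ in range(n_epochs):
            for frame_t, frame_t1 in loader:
                depth, pose, intr = model(frame_t, frame_t1)
                # View-synthesis objective: warp frame_t1 into frame_t using
                # the predicted depth, pose, and intrinsics, then compare.
                loss = photometric_loss(depth, pose, intr, frame_t, frame_t1)
                opt.zero_grad()
                loss.backward()
                opt.step()
```

In this reading, restricting stage 1 to the intrinsics head keeps early photometric gradients from destabilizing depth and pose before the camera model is roughly calibrated; the paper's actual schedule and loss details may differ.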