Emergence of a Shared Canonical Object Frame from In-the-Wild Videos

2026-06-29 • Computer Vision and Pattern Recognition

AI summary

The authors developed a way for a computer to estimate the position and orientation of objects from different categories by learning from videos, without manual pose labels. Their method uses a simple shared 3D shape to link all object views together, even though the shape carries no category-specific detail. By training on many videos with noisy camera poses from Structure-from-Motion, the system arrives at a common frame of reference on its own. On category-level pose estimation benchmarks, this approach reaches accuracy competitive with methods that rely on canonical pose supervision.

canonical frame, self-supervised learning, Structure-from-Motion (SfM), object-centric videos, pose estimation, dense correspondences, multi-view consistency, geometric bottleneck, semantic priors, mesh alignment

Authors

Tom Fischer, Martin Sundermeyer, Adam Kortylewski, Eddy Ilg

Abstract

Comparing object orientations and positions across different instances requires their poses to be expressed in a shared canonical frame. Establishing such frames has traditionally required manual annotation, creating a scaling bottleneck that limits category and instance diversity. We show that a shared canonical frame can instead emerge from self-supervised training on object-centric videos captured in the wild, using only noisy camera poses from Structure-from-Motion. Our key idea is to route all training sequences through a shared geometric bottleneck: a coarse canonical mesh that carries no category-specific detail. By learning dense correspondences from image pixels to this mesh, and estimating per-sequence alignments from noisy SfM geometry, a common canonical frame emerges from multi-view consistency and the semantic priors of the feature extractor, without any canonical pose labels or category conditioning. Trained in a self-supervised manner on 160,000 in-the-wild object videos, our method achieves competitive accuracy on category-level pose estimation benchmarks compared to methods that rely on canonical pose supervision. The code and checkpoint are available at https://github.com/Fischer-Tom/Emergent-Canonical-Frame/.
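
The abstract does not say how the per-sequence alignments are computed. As an illustration only, the sketch below shows one standard way to map a reconstruction with arbitrary scale, rotation, and translation (as SfM output has) into a common frame: a closed-form least-squares similarity fit (Umeyama's method). It assumes that matched 3D point pairs between the SfM reconstruction and the canonical mesh are already available. The function name `umeyama_alignment` and its arguments are hypothetical and are not taken from the authors' code.

```python
import numpy as np

def umeyama_alignment(src, dst):
    """Closed-form similarity fit (Umeyama): dst_i ~= s * R @ src_i + t.

    src, dst: (N, 3) arrays of matched 3D points. In this sketch, src would
    be SfM points of one sequence and dst their assumed matches on the
    canonical mesh (hypothetical setup, not the authors' pipeline).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    n = len(src)

    # Center both point sets.
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between target and source, then its SVD.
    cov = dst_c.T @ src_c / n
    U, S, Vt = np.linalg.svd(cov)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        d[-1] = -1.0
    R = U @ np.diag(d) @ Vt

    # Scale from the source variance, translation from the centroids.
    var_src = (src_c ** 2).sum() / n
    s = (S * d).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def apply_similarity(points, s, R, t):
    """Map (N, 3) points with the estimated similarity transform."""
    return s * np.asarray(points, dtype=float) @ R.T + t
```

With noisy correspondences, a fit like this would typically be wrapped in a robust estimator such as RANSAC. Whether the paper uses a closed-form fit, a learned alignment, or something else is not stated in the abstract.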
