AI summary
The authors note that current self-supervised learning methods focus on recognizing objects but often overlook the spatial relationships between object parts. They propose Spatial Prediction (SP), a pretext task in which the model predicts the relative position and scale of two local regions taken from the same image, which helps it learn how parts fit together. SP can be added to existing self-supervised frameworks as a plug-in. Experiments show consistent gains in image recognition, fine-grained classification, semantic segmentation, and depth estimation, along with better out-of-distribution robustness for object recognition. The authors also introduce two tests of spatial reasoning: predicting position and scale for pairs of image patches, and a jigsaw task that requires reordering patches and recognizing the reconstructed image.
self-supervised learning, spatial structure, pretext task, representation learning, image recognition, semantic segmentation, depth estimation, out-of-distribution robustness, geometric awareness, patch reordering
Authors
Yang Shen, Yusen Cai, Weronika Hryniewska-Guzik, Qing Lin, Mengmi Zhang
Abstract
Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP), a spatially aware pretext regression task that predicts the relative position and scale between a pair of disentangled local views from the same image. By modeling part-to-part relationships in a continuous geometric space, SP encourages representations to capture fine-grained spatial dependencies beyond invariant categorical semantics, thereby learning the compositional structure of visual scenes. SP is implemented as a decoupled plug-in and can be seamlessly integrated into diverse SSL frameworks. Extensive experiments show consistent improvements across image recognition, fine-grained classification, semantic segmentation, and depth estimation, as well as substantial gains in out-of-distribution robustness for object recognition. To evaluate spatial reasoning, we introduce (1) a position and scale prediction task on image patch pairs and (2) a jigsaw understanding task requiring patch reordering and recognition after reconstruction. Strong performance on these tasks indicates improved spatial structure and geometric awareness. Overall, explicitly modeling spatial information provides an effective inductive bias for SSL, leading to more structured representations and better generalization. Code and models will be released.
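To make the regression target concrete, here is a minimal sketch of how a relative position and scale label could be computed for a pair of local views. The abstract does not give the exact parameterization, so everything here is an assumption for illustration: the `Box` helper, the crop-size range in `sample_box`, the choice to normalize the center offset by the first view's size and to use log size ratios, and the plain mean squared error loss.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned crop in pixel coordinates (hypothetical helper)."""
    cx: float  # center x
    cy: float  # center y
    w: float   # width
    h: float   # height


def sample_box(img_w, img_h, rng, min_frac=0.1, max_frac=0.4):
    """Sample a random local view that lies inside the image.

    The size range is an assumed hyperparameter, not taken from the paper.
    """
    w = rng.uniform(min_frac, max_frac) * img_w
    h = rng.uniform(min_frac, max_frac) * img_h
    cx = rng.uniform(w / 2, img_w - w / 2)
    cy = rng.uniform(h / 2, img_h - h / 2)
    return Box(cx, cy, w, h)


def relative_target(a, b):
    """Assumed target: offset of b's center in units of a's size,
    followed by the log ratios of width and height."""
    dx = (b.cx - a.cx) / a.w
    dy = (b.cy - a.cy) / a.h
    dw = math.log(b.w / a.w)
    dh = math.log(b.h / a.h)
    return (dx, dy, dw, dh)


def mse(pred, target):
    """Mean squared error between a predicted and a true target tuple."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    view_a = sample_box(224, 224, rng)
    view_b = sample_box(224, 224, rng)
    target = relative_target(view_a, view_b)
    # A real model would predict this from the two views' features;
    # an all-zero prediction stands in for it here.
    print(target, mse((0.0, 0.0, 0.0, 0.0), target))
```

With `Box(50, 50, 20, 20)` and `Box(70, 40, 40, 10)`, this parameterization gives an offset of `(1.0, -0.5)` and log size ratios of `log(2)` and `-log(2)`. Log ratios are used in the sketch because swapping the two views then simply flips the sign of the scale terms, which keeps the target continuous and symmetric around zero.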