Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

2026-06-08 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

AI summary

The authors studied whether pretrained video models encode intuitive physics, a basic sense of how physical events unfold, by probing frozen features so that the model weights stay unchanged. They compared three model families on IntPhys2 and Minimal Video Pairs (MVP): the predictive joint-embedding model V-JEPA gave the strongest overall results, the masked reconstruction model VideoMAE remained competitive, and the diffusion-based generator LTX-Video showed a weaker but non-trivial signal. Physics-relevant information is weakest in early layers and most accessible at intermediate-to-late depth, and disrupting the order of video frames substantially reduces probe performance, especially on MVP. The work suggests that video models do acquire some intuitive-physics knowledge during pretraining, but how readily it can be read out depends on the pretraining paradigm, the layer probed, and the type of probe.

pretrained video models, intuitive physics, frozen-feature probing, predictive joint-embedding, masked video reconstruction, diffusion-based video generation, model layers, temporal dynamics, video representation

Authors

Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo, Mohammadreza Salehi

Abstract

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.
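
The layerwise frozen-feature probing described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' setup: the features are synthetic Gaussian noise with a planted class signal instead of V-JEPA, VideoMAE, or LTX-Video activations, the layer names and signal strengths are invented, the probe is a closed-form ridge classifier, and the helper names (`mean_pool`, `flatten_time`, `shuffle_frames`, `fit_ridge_probe`, `layerwise_probe`) are hypothetical. Frame shuffling is applied here to the feature sequence, a simplification; the abstract does not specify how its temporal control is implemented.

```python
import numpy as np

def mean_pool(feats):
    """Order-invariant readout: average over time, (N, T, D) -> (N, D)."""
    return feats.mean(axis=1)

def flatten_time(feats):
    """Order-sensitive readout: concatenate frames, (N, T, D) -> (N, T*D)."""
    return feats.reshape(feats.shape[0], -1)

def shuffle_frames(feats, rng):
    """Temporal control (simplified): permute the time axis of every clip."""
    perm = rng.permutation(feats.shape[1])
    return feats[:, perm, :]

def fit_ridge_probe(X, y, l2=1.0):
    """Closed-form ridge regression onto +/-1 targets; features stay frozen."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    targets = 2.0 * y - 1.0                        # map {0, 1} -> {-1, +1}
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ targets)

def probe_accuracy(w, X, y):
    """Fraction of clips whose thresholded probe output matches the label."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    preds = (Xb @ w > 0).astype(int)
    return float((preds == y).mean())

def layerwise_probe(features_by_layer, y, readout, n_train, l2=1.0):
    """Fit one probe per layer on the first n_train clips, score on the rest."""
    scores = {}
    for layer, feats in features_by_layer.items():
        X = readout(feats)
        w = fit_ridge_probe(X[:n_train], y[:n_train], l2)
        scores[layer] = probe_accuracy(w, X[n_train:], y[n_train:])
    return scores

# Synthetic stand-in for frozen activations (assumption, not real model output).
rng = np.random.default_rng(0)
n_clips, n_frames, dim, n_train = 400, 4, 8, 200
labels = rng.integers(0, 2, size=n_clips)   # e.g. plausible vs. implausible clip
direction = np.ones(dim) / np.sqrt(dim)     # unit vector carrying the class signal

# Invented signal strengths: none at the first layer, strong at a middle layer.
signal_by_layer = {"layer_0": 0.0, "layer_6": 3.0, "layer_11": 2.0}
features_by_layer = {}
for name, strength in signal_by_layer.items():
    noise = rng.normal(size=(n_clips, n_frames, dim))
    shift = strength * (2.0 * labels - 1.0)[:, None, None] * direction
    features_by_layer[name] = noise + shift

scores = layerwise_probe(features_by_layer, labels, mean_pool, n_train)
print(scores)
```

In this toy setup the probe on `layer_0` sits near chance because no signal was planted there, while the later layers are easy to separate. Note that `mean_pool` returns the same vector for a clip before and after `shuffle_frames`, so an order-invariant readout cannot register a shuffle applied at the feature level; only a readout that keeps temporal order, such as `flatten_time`, can be affected by it.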
