CollideNet: Hierarchical Multi-scale Video Representation Learning with Disentanglement for Time-To-Collision Forecasting

2026-04-17 · Computer Vision and Pattern Recognition

Keywords
Time-to-Collision, Transformer, Spatiotemporal, Hierarchical Architecture, Multi-scale Feature Encoding, Trend and Seasonality, Video Analysis, Collision Prediction, Cross-dataset Evaluation
Authors
Nishq Poorav Desai, Ali Etemad, Michael Greenspan
Abstract
Time-to-Collision (TTC) forecasting is a critical task in collision prevention, requiring precise temporal prediction and an understanding of both the local and global patterns encapsulated in a video, spatially as well as temporally. To address the multi-scale nature of video, we introduce a novel spatiotemporal hierarchical transformer-based architecture called CollideNet, specifically tailored for effective TTC forecasting. In the spatial stream, CollideNet aggregates information for each video frame simultaneously at multiple resolutions. In the temporal stream, alongside multi-scale feature encoding, CollideNet also disentangles the non-stationarity, trend, and seasonality components. Our method outperforms prior works on three commonly used public datasets, setting a new state-of-the-art by a considerable margin. We conduct cross-dataset evaluations to analyze the generalization capabilities of our method, and visualize the effects of disentangling the trend and seasonality components of the video data. We release our code at https://github.com/DeSinister/CollideNet/.
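To make the trend/seasonality disentanglement concrete, the sketch below shows one common way such a decomposition is done for a temporal signal: a moving-average filter extracts the slow-varying trend, and the residual is treated as the seasonal (periodic) component. This is an illustrative assumption, not CollideNet's actual implementation; the function name and kernel size are hypothetical.

```python
import numpy as np

def decompose_trend_seasonality(x, kernel_size=9):
    """Split a 1-D temporal feature sequence into trend and seasonal parts
    using a moving-average filter (an illustrative decomposition scheme;
    the paper's exact formulation may differ)."""
    pad = kernel_size // 2
    # Replicate-pad both ends so the smoothed output keeps the input length.
    padded = np.concatenate([np.repeat(x[:1], pad), x, np.repeat(x[-1:], pad)])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")  # slow-varying component
    seasonal = x - trend                               # residual periodic component
    return trend, seasonal

# Toy signal: linear trend plus a period-8 oscillation.
t = np.arange(64)
x = 0.05 * t + np.sin(2 * np.pi * t / 8)
trend, seasonal = decompose_trend_seasonality(x, kernel_size=9)
```

By construction the two components sum back to the original signal, so the decomposition is lossless; a model can then encode each component with its own branch, as the temporal stream described above does at multiple scales.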