Three-Step Hierarchical Transformer for Multi-Pedestrian Trajectory Prediction

2026-06-22 · Computer Vision and Pattern Recognition

AI summary

The authors predict where pedestrians will walk using a model that explicitly separates three steps: encoding motion over time, fusing different kinds of input, and reasoning about social interactions. Lightweight GRU summaries keep the cross-modal attention efficient. The method achieves state-of-the-art results on the real-world JRDB and Urban datasets and competitive results on JTA, and it can anticipate complex pedestrian behaviors such as early turns. Ablations indicate that each of the three steps contributes to the predictions.

pedestrian trajectory prediction, Transformer, temporal encoding, multimodal fusion, social attention, GRU, cross-modal attention, trajectory datasets, social interactions
Authors
Raphaël Delécluse, Hazem Wannous, Laurent Grisoni, Laurent Guimas
Abstract
Pedestrian trajectory prediction requires modeling temporal dynamics, multimodal cues, and social interactions in crowded environments. Existing methods often address these factors separately or entangle them in costly attention blocks, limiting scalability, flexibility, and interpretability. We propose a three-step hierarchical Transformer that explicitly separates temporal encoding, multimodal fusion, and scene-level interaction reasoning. Lightweight GRU summaries enable efficient cross-modal attention, while social attention over time-agent tokens captures inter-pedestrian influences at manageable cost. Experiments on JTA, JRDB, and the Pedestrians and Cyclists in Road Traffic dataset show state-of-the-art performance on real-world datasets (JRDB, Urban) and competitive results on JTA. Ablation and qualitative analyses confirm the contribution of each stage and the model's ability to anticipate complex behaviors such as early turning.
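
To make the cost argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It counts query-key attention scores for a single entangled attention block versus a staged design. The layout is an assumption made for illustration only: each agent's modality is summarized into one vector by a recurrent encoder (not counted as attention), each per-time-step token attends to those summaries, and social attention then runs over one token per agent and time step. The function names and the example sizes are hypothetical.

```python
def attention_scores(num_queries: int, num_keys: int) -> int:
    """Number of query-key score entries computed by one attention layer."""
    return num_queries * num_keys


def entangled_cost(agents: int, steps: int, modalities: int) -> int:
    """One joint self-attention block over every (agent, step, modality) token."""
    tokens = agents * steps * modalities
    return attention_scores(tokens, tokens)


def staged_cost(agents: int, steps: int, modalities: int) -> int:
    """Attention scores for the assumed staged layout.

    Temporal encoding is recurrent here, so it adds no attention scores.
    """
    # Assumed fusion step: for each agent, every time-step token (query)
    # attends to one summary vector per modality (keys).
    fusion = agents * attention_scores(steps, modalities)
    # Assumed social step: self-attention over one token per (time step, agent).
    social_tokens = agents * steps
    social = attention_scores(social_tokens, social_tokens)
    return fusion + social


if __name__ == "__main__":
    # Hypothetical scene: 10 pedestrians, 9 observed steps, 3 input modalities.
    joint = entangled_cost(10, 9, 3)
    staged = staged_cost(10, 9, 3)
    print(f"entangled: {joint} scores, staged: {staged} scores")
```

Under these assumptions the social step dominates the staged cost, and the count grows with the square of agents times steps rather than with the square of agents times steps times modalities.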