MSGL-Transformer: A Multi-Scale Global-Local Transformer for Rodent Social Behavior Recognition
2026-04-08 • Computer Vision and Pattern Recognition
AI summary
The authors created a new computer model called MSGL-Transformer to recognize social behaviors in rodents by analyzing their body movements over time. This model looks at movements at different time scales all at once and uses a special block to focus on important behavior features. They tested their model on two datasets with different behavior types and inputs, and it performed better than several existing methods. Their approach works well across different data by just changing input size and number of behaviors.
transformer, pose estimation, rodent behavior, temporal sequences, multi-scale attention, Behavior-Aware Modulation, cross-validation, F1-score, social behavior recognition, neural networks
Authors
Muhammad Imran Sharif, Doina Caragea
Abstract
Recognition of rodent behavior is important for understanding neural and behavioral mechanisms. Traditional manual scoring is time-consuming and prone to human error. We propose MSGL-Transformer, a Multi-Scale Global-Local Transformer for recognizing rodent social behaviors from pose-based temporal sequences. The model employs a lightweight transformer encoder whose multi-scale attention integrates parallel short-range, medium-range, and global branches to explicitly capture behavior dynamics at multiple temporal scales. We also introduce a Behavior-Aware Modulation (BAM) block, inspired by SE-Networks, which modulates temporal embeddings to emphasize behavior-relevant features prior to attention. We evaluate on two datasets: RatSI (5 behavior classes, 12D pose inputs) and CalMS21 (4 behavior classes, 28D pose inputs). On RatSI, MSGL-Transformer achieves 75.4% mean accuracy and an F1-score of 0.745 across nine cross-validation splits, outperforming TCN, LSTM, and Bi-LSTM. On CalMS21, it achieves 87.1% accuracy and an F1-score of 0.8745, a +10.7% improvement over HSTWFormer, and outperforms ST-GCN, MS-G3D, CTR-GCN, and STGAT. The same architecture generalizes across both datasets with only the input dimensionality and number of classes adjusted.
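To make the two architectural ideas in the abstract concrete, the following is a minimal pure-Python sketch of an SE-style gate applied before three parallel attention branches (short-range, medium-range, global). It is an illustration under stated assumptions, not the authors' implementation: the function names, the window sizes (2 and 8 frames), the fusion by simple averaging, the use of a plain sigmoid in place of a learned excitation network, and the absence of learned query/key/value projections are all hypothetical choices made here for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def se_gate(seq):
    """SE-style gating on a (T, D) sequence given as a list of frames.

    Squeeze: average each feature channel over time.
    Excite: map each channel mean to a gate in (0, 1).
    ASSUMPTION: a plain sigmoid stands in for a learned excitation network.
    """
    T, D = len(seq), len(seq[0])
    means = [sum(frame[d] for frame in seq) / T for d in range(D)]
    gates = [1.0 / (1.0 + math.exp(-m)) for m in means]
    return [[frame[d] * gates[d] for d in range(D)] for frame in seq]


def windowed_attention(seq, window=None):
    """Scaled dot-product self-attention restricted to a temporal window.

    Frame t attends to frames within `window` steps on either side;
    window=None means global attention over the whole sequence.
    ASSUMPTION: no learned projections, so Q = K = V = seq.
    """
    T, D = len(seq), len(seq[0])
    out = []
    for t in range(T):
        lo = 0 if window is None else max(0, t - window)
        hi = T if window is None else min(T, t + window + 1)
        scores = [
            sum(a * b for a, b in zip(seq[t], seq[s])) / math.sqrt(D)
            for s in range(lo, hi)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * seq[s][d] for w, s in zip(weights, range(lo, hi)))
            for d in range(D)
        ])
    return out


def multi_scale_block(seq, short=2, medium=8):
    """Gate the sequence, run three attention branches, average the results.

    ASSUMPTION: window sizes and averaging fusion are illustrative only.
    """
    gated = se_gate(seq)
    branches = [windowed_attention(gated, w) for w in (short, medium, None)]
    T, D = len(seq), len(seq[0])
    return [
        [sum(b[t][d] for b in branches) / len(branches) for d in range(D)]
        for t in range(T)
    ]


# Toy input: 16 frames of 12 pose features (12D matches the RatSI input size).
toy = [[math.sin(0.3 * t + d) for d in range(12)] for t in range(16)]
result = multi_scale_block(toy)
print(len(result), len(result[0]))  # prints "16 12"
```

The output keeps the input shape, so the same block would accept a 28-feature sequence (the CalMS21 input size) without code changes, which mirrors the abstract's point that only input dimensionality and class count vary between datasets.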