AI summary
The authors studied why 3D skeleton-based action recognition models that work well in controlled settings fail badly when applied to new, unseen types of videos or views using just 2D poses. They found that common methods to detect when the model is unsure don't help because the model confidently makes wrong predictions on unfamiliar data. They showed that although some statistical tools can spot unusual data, these tools don't ensure the model safely avoids mistakes when making decisions. To fix this, the authors designed a lightweight gating method that helps the model abstain from uncertain predictions, making it more reliable. Their work highlights important safety challenges when using these models outside controlled environments.
3D skeleton capture, 2D pose estimation, domain shift, out-of-distribution detection, uncertainty estimation, selective classification, energy-based scoring, Mahalanobis distance, calibration, zero-shot transfer
Abstract
The practical deployment gap, the transition from controlled multi-view 3D skeleton capture to unconstrained monocular 2D pose estimation, introduces a compound domain shift whose safety implications remain critically underexplored. We present a systematic study of this severe domain shift using a novel Gym2D dataset (style/viewpoint shift) and the UCF101 dataset (semantic shift). Our Skeleton Transformer achieves 63.2% cross-subject accuracy on NTU-120 but drops to 1.6% under zero-shot transfer to the Gym domain and 1.16% on UCF101. Critically, we demonstrate that high out-of-distribution (OOD) detection AUROC does not guarantee safe selective classification. Standard uncertainty methods fail to detect this performance drop: the model remains confidently incorrect, with 99.6% risk even at 50% coverage across both OOD datasets. While energy-based scoring (AUROC >= 0.91) and Mahalanobis distance provide reliable distributional detection signals, these high AUROC scores coexist with poor risk-coverage behavior when the model must decide whether to predict or abstain. A lightweight finetuned gating mechanism restores calibration and enables graceful abstention, substantially reducing the rate of confident wrong predictions. Our work challenges standard deployment assumptions, providing a principled safety analysis of skeleton recognition deployment under both semantic and geometric shift.
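To make the distinction between OOD detection quality and selective-classification safety concrete, the following is a minimal sketch on synthetic logits; it is not the authors' code and does not reproduce their numbers. It assumes the common formulation of the energy score as a temperature-scaled log-sum-exp of the logits and uses maximum softmax probability as the confidence for abstention. All function names, the toy data, and the chosen constants are illustrative assumptions.

```python
import numpy as np


def energy_score(logits, temperature=1.0):
    """Negative free energy, T * logsumexp(logits / T); higher suggests in-distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return temperature * (m.squeeze(-1) + np.log(np.exp(z - m).sum(axis=-1)))


def max_softmax(logits):
    """Maximum softmax probability, used here as the per-sample confidence."""
    z = np.asarray(logits, dtype=float)
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return (e / e.sum(axis=-1, keepdims=True)).max(axis=-1)


def auroc(scores_pos, scores_neg):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())


def risk_at_coverage(confidence, correct, coverage):
    """Error rate on the `coverage` fraction of samples with the highest confidence."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    n_keep = max(1, int(round(coverage * len(confidence))))
    keep = np.argsort(-confidence, kind="stable")[:n_keep]
    return float(1.0 - correct[keep].mean())


# Toy data (assumption): 20 classes, 1000 samples per domain.
rng = np.random.default_rng(0)
n, k = 1000, 20

# In-distribution logits: one strongly boosted class per sample.
id_logits = rng.normal(size=(n, k))
id_logits[np.arange(n), rng.integers(k, size=n)] += 8.0

# Shifted-domain logits: still peaked, but the peak is unrelated to the true label.
ood_logits = rng.normal(size=(n, k))
ood_logits[np.arange(n), rng.integers(k, size=n)] += 4.0
ood_labels = rng.integers(k, size=n)
ood_correct = ood_logits.argmax(axis=1) == ood_labels

# Detection: can the energy score tell the two domains apart?
detection_auroc = auroc(energy_score(id_logits), energy_score(ood_logits))

# Decision: if we keep the most confident half of shifted samples, how often are we wrong?
ood_risk_at_half = risk_at_coverage(max_softmax(ood_logits), ood_correct, 0.5)

print(f"energy AUROC (ID vs. shifted): {detection_auroc:.3f}")
print(f"risk at 50% coverage on shifted data: {ood_risk_at_half:.3f}")
```

In this toy setup the two domains are easy to separate by energy, yet ranking shifted samples by confidence does not isolate the correct ones, because confidence there is unrelated to correctness. This is the kind of gap between detection AUROC and risk-coverage behavior that the abstract describes, shown here only under the stated synthetic assumptions.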