Improving Temporal Action Segmentation via Constraint-Aware Decoding

2026-05-11
Computer Vision and Pattern Recognition
AI summary

The authors address the problem of dividing videos into labeled action segments, which is hard because actions vary and boundaries are often ambiguous. They propose a simple way to improve predictions by using statistical rules about how actions typically follow each other, how long they last, and where boundaries occur. This method refines results at test time without retraining the model or adding complexity. Their approach works for both fully and semi-supervised models, making segmentation more accurate and efficient.

Keywords: temporal action segmentation, untrimmed videos, Viterbi decoding, structural priors, semi-supervised learning, action boundaries, transition confidence, duration modeling, inference refinement
Authors
Yeo Keat Ee, Debaditya Roy, Chen Li, Hao Zhang, Basura Fernando
Abstract
Temporal action segmentation (TAS) divides untrimmed videos into labeled action segments. While fully supervised methods have advanced the field, challenges such as action variability, ambiguous boundaries, and high annotation costs remain, especially in new or low-resource domains. Grammar-based approaches improve segmentation with structural priors but rely on complex parsing, which limits scalability. In this work, we propose a lightweight, constraint-based refinement framework that enhances TAS predictions by integrating statistical structural priors, such as transition confidence, action boundary sets, and per-class durations, which can be directly extracted from annotated data. These constraints are integrated into a modified Viterbi decoding algorithm, allowing inference-time refinement without retraining or added model complexity. Our approach improves both fully and semi-supervised TAS models by correcting structural prediction errors while maintaining high efficiency. Code is available at https://github.com/LUNAProject22/CAD
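To illustrate the general idea of constraint-aware decoding, the sketch below runs a simplified Viterbi pass over per-frame class probabilities, weighting class switches by a transition prior estimated from training annotations and forbidding switches out of segments shorter than a minimum duration. This is a minimal illustration of the technique, not the authors' implementation: the function name `constrained_viterbi`, the `min_duration` parameter, and the greedy per-path duration tracking are assumptions (an exact duration-constrained decode would require segment-level dynamic programming).

```python
import numpy as np

def constrained_viterbi(frame_probs, transition_prior, min_duration=1):
    """Refine frame-wise TAS predictions with structural priors (sketch).

    frame_probs: (T, C) per-frame class probabilities from a TAS model.
    transition_prior: (C, C) transition weights estimated from training
        annotations (rows: from-class, cols: to-class).
    min_duration: minimum segment length, in frames (illustrative constraint).
    """
    T, C = frame_probs.shape
    log_emit = np.log(frame_probs + 1e-10)
    log_trans = np.log(transition_prior + 1e-10)

    # dp[t, c]: best log-score of a path ending at frame t in class c.
    # dur[t, c]: length of that path's current segment (approximation:
    # one duration per state, not per candidate segmentation).
    dp = np.full((T, C), -np.inf)
    dur = np.ones((T, C), dtype=int)
    back = np.zeros((T, C), dtype=int)
    dp[0] = log_emit[0]

    for t in range(1, T):
        for c in range(C):
            best_prev, best_score = c, dp[t - 1, c]  # stay in class c
            for p in range(C):
                # Switching is allowed only from segments that already
                # satisfy the minimum-duration constraint.
                if p == c or dur[t - 1, p] < min_duration:
                    continue
                s = dp[t - 1, p] + log_trans[p, c]
                if s > best_score:
                    best_prev, best_score = p, s
            dp[t, c] = best_score + log_emit[t, c]
            back[t, c] = best_prev
            dur[t, c] = dur[t - 1, c] + 1 if best_prev == c else 1

    # Backtrace from the best final state.
    labels = np.zeros(T, dtype=int)
    labels[-1] = int(np.argmax(dp[-1]))
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return labels
```

On a toy two-class example, a single noisy frame in the middle of a segment is smoothed away because switching classes for one frame violates the duration constraint, while the genuine boundary later in the sequence is kept.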