SAM2Matting: Generalized Image and Video Matting
2026-06-25 • Computer Vision and Pattern Recognition
AI summary
The authors address the challenge of video matting, which requires both tracking objects across frames and capturing fine details. They propose SAM2Matting, a method that pairs a strong video object tracker with a region-proposal bridge and dedicated matting heads, so that tracking and fine-detail matting are handled by separate components. Although it is trained only on images, the approach achieves state-of-the-art results on video matting, stays consistent over time, and generalizes to both human-centric and in-the-wild scenes. This suggests that video matting can be improved without expensive video-specific training data.
video matting • object tracking • temporal consistency • image matting • SAM tracker • region proposal • deep learning • ground truth • generalization • video object segmentation
Authors
Ruiqi Shen, Guangquan Jie, Chang Liu, Henghui Ding
Abstract
Despite impressive advances in image matting, video matting remains challenging due to the inherent gap between high-level tracking, which requires frame-wise understanding, and low-level matting, which focuses on extremely fine-grained details. Existing methods attempt to bridge this gap with expensive and narrowly scoped video matting datasets, which may limit out-of-domain generalization and compromise tracking robustness. We rethink the paradigm with SAM2Matting, a tracker-to-matting framework that advances video object segmentation (VOS) trackers to high-fidelity video matting. Specifically, it decouples the task by enhancing a foundational tracker (e.g., SAM2, SAM3) with a region-proposal bridge and dedicated matting heads, enabling the uncompromised tracker to handle temporal consistency while the matting components resolve fine-grained details. Notably, despite being trained only on images, SAM2Matting establishes new state-of-the-art performance on video matting, supports diverse prompt types, maintains strong temporal consistency, and demonstrates robust generalization across both human-centric and in-the-wild scenarios.
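To make the decoupling concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a tracker has already produced a binary mask for one frame, and it stands in for the region-proposal bridge with a simple boundary band (dilation minus erosion), which is a common trimap-style heuristic rather than anything stated in the abstract. The names `propose_uncertain_region`, `refine_alpha`, and `fine_alpha` are hypothetical; `fine_alpha` plays the role of a matting head's output.

```python
import numpy as np


def _dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a square window, built from shifted copies."""
    h, w = mask.shape
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def propose_uncertain_region(mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Hypothetical stand-in for a region proposal: a band around the mask edge.

    Assumption: fine detail lives near the tracker mask's boundary.
    """
    mask = mask.astype(bool)
    dilated = _dilate(mask, radius)
    # Erosion expressed as the complement of dilating the background.
    eroded = ~_dilate(~mask, radius)
    return dilated & ~eroded


def refine_alpha(mask: np.ndarray, fine_alpha: np.ndarray, radius: int = 2) -> np.ndarray:
    """Keep the tracker's decision away from the boundary; use the
    (hypothetical) matting-head prediction only inside the proposed band."""
    band = propose_uncertain_region(mask, radius)
    alpha = mask.astype(np.float32)
    alpha[band] = fine_alpha[band]
    return alpha


# Toy frame: a 5x5 square object inside a 9x9 image.
tracker_mask = np.zeros((9, 9), dtype=bool)
tracker_mask[2:7, 2:7] = True
# Placeholder for a matting head's soft prediction.
fine_alpha = np.full((9, 9), 0.5, dtype=np.float32)

band = propose_uncertain_region(tracker_mask, radius=1)
alpha = refine_alpha(tracker_mask, fine_alpha, radius=1)
```

In this toy setup the tracker alone decides the interior and exterior of the object, and the soft prediction is consulted only in the boundary band. For a video, the same function would be applied to each frame's tracked mask, so temporal consistency comes from the tracker rather than from the matting step.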