Motion-Guided Causal Disentanglement for Robust Multi-View Cine Cardiac MRI Diagnosis

2026-06-03

Computer Vision and Pattern Recognition, Multimedia
AI summary

The authors address the problem that current transformer models for cardiac MRI entangle view-specific anatomy (how the heart looks from a given imaging angle) with disease information, which makes accurate diagnosis harder, especially when data is scarce. They propose a method called MoViD that separates view-related features from disease-related features using supervised contrastive learning and a gradient-reversal adversarial constraint. They also use heart motion, computed from differences between video frames, to focus the model on the beating heart region. In their experiments, the method outperformed standard transformer baselines on disease classification and cardiac segmentation, which supports the claim that separating structure and disease features helps in medical image analysis.

Cardiac Magnetic Resonance (CMR), Transformer models, View-disentanglement, ViT-MAE, Supervised contrastive learning, Gradient-reversal, Temporal motion feature, Class imbalance, Venous thrombosis, Medical image segmentation
Authors
Chuankai Xu, Cristiane De Carvalho Singulane, Mohammad Abuannadi, Stephen Chandler, Jeremy Slivnick, Karolina Zareba, Jane Cao, Vidya Nadig, Fabio Fernandes, Seth Uretsky, Diego Perez de Arenaza, Amit Patel, Jianxin Xie
Abstract
Multi-view cardiac magnetic resonance (CMR) imaging provides complementary anatomical information and is widely used for noninvasive disease assessment. Recent transformer-based models have demonstrated strong representation learning capabilities for CMR analysis; however, they typically learn unified latent embeddings that entangle view-specific anatomical variations with disease-related features. Such entanglement biases classifiers toward structural attributes rather than view-invariant pathological patterns. This issue is exacerbated in low-data regimes, particularly for underrepresented cardiac conditions, where limited samples increase susceptibility to shortcut learning and view-dependent decision boundaries. To address this, we propose a Motion-Guided View-Disease Disentanglement framework, MoViD, built upon a ViT-MAE backbone. The model explicitly factorizes latent representations into view-specific and disease-discriminative components using dual-branch supervised contrastive objectives and a gradient-reversal adversarial constraint that minimizes disease leakage into the view embedding. Additionally, an annotation-free temporal motion feature, derived from inter-frame difference maps, is introduced to localize the beating heart region and suppress background artifacts. A focal reweighting mechanism is incorporated into the contrastive loss to mitigate class imbalance. We evaluate the framework on a private clinical venous thrombosis dataset and two public benchmarks (M&Ms, M&Ms2). Across disease classification and cardiac segmentation tasks, our approach consistently outperforms standard transformer baselines and demonstrates competitive performance against large-scale pretrained foundation models, validating the efficacy of structural disentanglement in medical image analysis.
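To make the annotation-free temporal motion feature concrete, the sketch below shows one plausible way to turn inter-frame differences into a motion-saliency map. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the function name `motion_map`, the use of the mean absolute difference between consecutive frames, the min-max normalisation, and the toy input are all hypothetical and are not taken from the paper.

```python
import numpy as np


def motion_map(cine: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Return a [0, 1] motion-saliency map for a cine sequence of shape (T, H, W).

    Assumed formulation (not from the paper): mean absolute difference
    between consecutive frames, followed by min-max normalisation.
    """
    if cine.ndim != 3 or cine.shape[0] < 2:
        raise ValueError("expected an array of shape (T, H, W) with T >= 2")
    frames = cine.astype(np.float64)
    # Absolute difference between consecutive frames: shape (T-1, H, W).
    diffs = np.abs(np.diff(frames, axis=0))
    # Average over time so pixels that change often receive high values.
    saliency = diffs.mean(axis=0)
    # Min-max normalise; a fully static sequence yields an all-zero map.
    low = saliency.min()
    span = saliency.max() - low
    return (saliency - low) / (span + eps)


# Toy example: a static random background with one pixel that pulses over time.
rng = np.random.default_rng(0)
background = rng.random((8, 8))
cine = np.repeat(background[None], 6, axis=0)  # shape (6, 8, 8)
cine[::2, 3, 4] += 1.0  # hypothetical "beating" pixel at row 3, column 4

m = motion_map(cine)
peak = tuple(int(i) for i in np.unravel_index(m.argmax(), m.shape))
print(peak)
```

In this toy sequence only the pulsing pixel changes between frames, so the map is zero everywhere else and peaks at that location. A map like this could then be used to weight or crop the input toward the moving region, though how MoViD actually consumes the motion feature is not specified in the abstract.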