Context-driven Missing-Modality Learning for Robust Medical Diagnosis with Image-Tabular Data

2026-05-25 • Computer Vision and Pattern Recognition

AI summary

The authors address missing data types (modalities) in multimodal medical diagnosis, a common clinical situation that degrades model accuracy. Their method, CMML, first synthesizes representations of the missing modalities from the available ones, then aligns the original and synthesized representations in a shared space. It uses a transformer-based autoencoder with learnable context tokens, together with modality-specific memory banks, to capture relationships between data types. On skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets, the reported average AUC improvements over prior methods are 1.26%, 0.97%, and 1.32%, respectively.

Multimodal data, Missing modality, Transformer, Autoencoder, Semantic alignment, Contrastive learning, Context tokens, Medical diagnosis, Inter-modal dependencies, AUC (Area Under Curve)

Authors

Tianling Liu, Lequan Yu, Tong Han, Liang Wan

Abstract

While multimodal data integrating diverse imaging and clinical tabular records is crucial for accurate medical diagnosis, the arbitrary absence of specific modalities is prevalent in clinical practice, severely degrading the performance of multimodal models. Existing methods either discard missing modalities, leading to information loss, or struggle to synthesize them without capturing complex inter-modal dependencies. To address these limitations, we propose a novel Context-driven Missing-Modality Learning (CMML) framework, which sequentially performs modality synthesis and semantic alignment to achieve robust diagnosis under arbitrary missing conditions. Specifically, we design a Cascade Residual Transformer-based Autoencoder (CRTA) that leverages learnable context tokens acting as a dataset-level semantic prior to capture inter-modal dependencies and synthesize key missing representations. These representations are further enriched by modality-specific memory banks. To resolve the discrepancy between original available and synthesized representations, we transform the learned context tokens into instance-adaptive semantic references by infusing multimodal representations from the CRTA's outputs. These references guide the alignment of heterogeneous modality representations into a unified space, where class-aware contrastive refinement is finally applied to explore discriminative diagnostic cues. Extensive evaluations on skin lesion (Derm7pt), ocular disease (ODIR), and meningioma (MEN) datasets demonstrate that CMML significantly outperforms state-of-the-art (SOTA) methods, yielding AVG AUC improvements of 1.26%, 0.97%, and 1.32%, respectively.
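
The abstract does not specify the form of the class-aware contrastive refinement. As a rough illustration only, the sketch below implements one common class-aware objective, a supervised-contrastive-style loss that pulls same-class representations together and pushes other classes apart. The function names, the `temperature` value, and the toy 2-D vectors are all assumptions made for this example and are not taken from the paper; CMML's actual loss may differ.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def class_aware_contrastive_loss(embeddings, labels, temperature=0.1):
    """Supervised-contrastive-style loss (an assumed stand-in, not CMML's loss).

    embeddings: list of equal-length float vectors, e.g. hypothetical
                aligned multimodal representations, one per sample.
    labels:     list of class ids, one per embedding.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-class partner is skipped
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp over all other samples, shifted for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of picking a same-class sample.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy usage: two tight clusters in 2-D.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = class_aware_contrastive_loss(points, [0, 0, 1, 1])
loss_mismatched = class_aware_contrastive_loss(points, [0, 1, 0, 1])
```

In this toy setup the loss is lower when the labels agree with the geometric clusters than when they cut across them, which is the behavior a class-aware contrastive term is meant to reward.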
