Quantifying Cross-Modal Interactions in Multimodal Glioma Survival Prediction via InterSHAP: Evidence for Additive Signal Integration

2026-03-31 · Machine Learning · Artificial Intelligence
AI summary

The authors studied how combining two types of data, whole-slide images of tumor tissue and gene expression profiles, helps predict survival in brain cancer patients. They tested whether deeply mixing these data sources (cross-modal interactions) improves predictions, but found that better performance comes from adding up the separate strengths of each data type rather than from complex interactions. Their method measures how much the two data types work together versus contribute independently, and shows that simple combination performs just as well. This suggests that simpler fusion models may suffice for this task, and it informs how to combine data sources securely when models are trained across different sites.

Multimodal deep learning · Cancer prognosis · Cox proportional hazards model · Shapley interaction index · Whole-slide imaging (WSI) · RNA sequencing (RNA-seq) · TCGA dataset · Survival prediction · Model fusion · C-index
Authors
Iain Swift, JingHua Ye, Ruairi O'Reilly
Abstract
Multimodal deep learning for cancer prognosis is commonly assumed to benefit from synergistic cross-modal interactions, yet this assumption has not been directly tested in survival prediction settings. This work adapts InterSHAP, a Shapley interaction index-based metric, from classification to Cox proportional hazards models and applies it to quantify cross-modal interactions in glioma survival prediction. Using TCGA-GBM and TCGA-LGG data (n = 575), we evaluate four fusion architectures combining whole-slide image (WSI) and RNA-seq features. Our central finding is an inverse relationship between predictive performance and measured interaction: architectures achieving superior discrimination (C-index 0.64 → 0.82) exhibit equivalent or lower cross-modal interaction (4.8% → 3.0%). Variance decomposition reveals stable additive contributions across all architectures (WSI ≈ 40%, RNA ≈ 55%, Interaction ≈ 4%), indicating that performance gains arise from complementary signal aggregation rather than learned synergy. These findings provide a practical model auditing tool for comparing fusion strategies, reframe the role of architectural complexity in multimodal fusion, and have implications for privacy-preserving federated deployment.
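
To make the adapted metric concrete, the sketch below illustrates the core computation for the two-modality case: with only WSI and RNA as players, the Shapley interaction index reduces to a four-coalition difference of the model's risk scores, and the decomposition f(WSI, RNA) − f(∅) = m_WSI + m_RNA + I is exact. This is a minimal sketch under stated assumptions, not the authors' exact InterSHAP implementation: the function name `modality_decomposition`, the mean-feature baseline used to mask a modality, and the mean-absolute-value normalisation of the shares are all illustrative choices.

```python
import torch


def modality_decomposition(model, wsi, rna, wsi_base, rna_base):
    """Decompose a Cox risk score f(WSI, RNA) over the four modality
    coalitions. For two players the decomposition is exact:
        f(AB) - f(0) = m_WSI + m_RNA + I,
    where I is the pairwise Shapley interaction index.
    """
    with torch.no_grad():
        f_ab = model(wsi, rna)             # both modalities present
        f_a = model(wsi, rna_base)         # RNA masked with its baseline
        f_b = model(wsi_base, rna)         # WSI masked with its baseline
        f_0 = model(wsi_base, rna_base)    # both modalities masked
    m_wsi = f_a - f_0                      # WSI main effect
    m_rna = f_b - f_0                      # RNA main effect
    inter = f_ab - f_a - f_b + f_0         # cross-modal interaction term
    # One plausible normalisation (an assumption): mean absolute share.
    parts = {"WSI": m_wsi, "RNA": m_rna, "Interaction": inter}
    total = sum(p.abs().mean() for p in parts.values())
    return {k: (v.abs().mean() / total).item() for k, v in parts.items()}


# Toy additive fusion head: risk = g(WSI) + h(RNA), so the interaction
# term cancels exactly and its share is ~0.
class AdditiveFusion(torch.nn.Module):
    def __init__(self, d_wsi=16, d_rna=32):
        super().__init__()
        self.g = torch.nn.Linear(d_wsi, 1)
        self.h = torch.nn.Linear(d_rna, 1)

    def forward(self, wsi, rna):
        return (self.g(wsi) + self.h(rna)).squeeze(-1)  # log-hazard scores


wsi, rna = torch.randn(64, 16), torch.randn(64, 32)
shares = modality_decomposition(
    AdditiveFusion(), wsi, rna,
    wsi.mean(0, keepdim=True).expand_as(wsi),   # mean-feature baselines
    rna.mean(0, keepdim=True).expand_as(rna),
)
print(shares)  # Interaction share is ~0 for a purely additive model
```

For a purely additive fusion such as the toy model above, the four coalition scores cancel and the interaction term is exactly zero; models whose gains come from complementary signal aggregation rather than learned synergy should likewise show a small interaction share, which is the regime the paper's variance decomposition reports.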