Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes

2026-04-07
Computer Vision and Pattern Recognition

AI summary

The authors study ZACH-ViT, a compact Vision Transformer designed for medical images in which the location of relevant features can vary widely. Using limited training data (50 samples per class), they test how ZACH-ViT and three comparable models perform when images are corrupted or perturbed by small adversarial changes. ZACH-ViT achieves the best overall rank on both clean and corrupted images and stays competitive under adversarial attacks, ranking first under FGSM and second under PGD, although every model degrades substantially under attack. The design therefore appears to help with realistic degradations such as noisy medical images, while adversarial robustness remains an open problem.

Vision Transformer, ZACH-ViT, medical imaging, robustness, image corruption, adversarial attacks, MedMNIST, permutation-invariant, low-data learning, FGSM, PGD
Authors
Athanasios Angelakis, Marta Gomez-Barrero
Abstract
The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.
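
The headline numbers in the abstract are mean ranks across datasets. The abstract does not spell out the ranking procedure, so the following is only a minimal sketch of one common way to compute such a statistic. It assumes that a higher score is better, that tied models share the average of their ranks, and that each dataset contributes equally. The function names (`rank_scores`, `mean_ranks`), the model and dataset labels, and all scores are hypothetical placeholders, not values from the paper.

```python
from statistics import mean


def rank_scores(scores):
    """Rank models on one dataset: rank 1 is the highest score.

    Assumption: tied models share the average of the ranks they occupy.
    """
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every model tied with the one at position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = avg_rank
        i = j + 1
    return ranks


def mean_ranks(per_dataset_scores):
    """Average each model's per-dataset rank over all datasets."""
    per_dataset_ranks = [rank_scores(s) for s in per_dataset_scores.values()]
    models = per_dataset_ranks[0].keys()
    return {m: mean(r[m] for r in per_dataset_ranks) for m in models}


# Hypothetical placeholder scores (not results from the paper).
toy_scores = {
    "dataset_1": {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70, "model_d": 0.65},
    "dataset_2": {"model_a": 0.60, "model_b": 0.72, "model_c": 0.60, "model_d": 0.50},
}

result = mean_ranks(toy_scores)
print(result)
```

Under these assumptions, a benchmark of seven datasets with no ties yields mean ranks that are multiples of 1/7, which is consistent with the reported values: 11/7 ≈ 1.57, 14/7 = 2.00, and 16/7 ≈ 2.29.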