Certified Robustness from Approximate Gaussian Mixture Structures in Pretrained Latent Spaces
2026-05-25 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors study how to build machine learning models that provably resist small, deliberately crafted changes to their inputs (adversarial perturbations). They exploit latent structure in data representations to obtain certificates that are less conservative than those of existing certified methods. Their main result is that exact structure is not required: if a pretrained encoder maps inputs to a latent distribution that is close in KL divergence to a Gaussian mixture, certified accuracy degrades gracefully, so pretrained models can be used directly. Experiments on CIFAR-10 and ImageNet report state-of-the-art or competitive certified accuracy, with strong clean accuracy and low computational overhead. The authors conclude that approximate latent structure is a practical and principled route to certifiable robustness.
Adversarial perturbations, Certified robustness, Gaussian mixture models, Latent representations, Pretrained encoders, KL divergence, Generalization guarantees, CIFAR-10, ImageNet, Deep learning
Authors
Konstantinos Emmanouilidis, Tianjiao Ding, Nghia Nguyen, Nicolas Loizou, René Vidal
Abstract
Deep learning models are vulnerable to adversarial perturbations, raising important concerns for safety-critical deployment. Empirical defenses can achieve strong robustness in practice, but lack formal guarantees, motivating the need for certifiably robust classifiers. While certified methods provide formal guarantees, they often yield overly conservative bounds due to their inability to exploit structure in complex data distributions. In this work, we propose a framework for designing certifiably robust classifiers that leverages latent structure in data representations. We first analyze the Gaussian mixture setting, deriving necessary and sufficient conditions for the existence of robust classifiers and constructing a classifier with a closed-form robustness certificate and generalization guarantees. Our main contribution is to show that exact structure is not required: we prove that if a pretrained encoder maps inputs to a latent distribution that is $\varepsilon$-close (in KL divergence) to a Gaussian mixture, then certified accuracy degrades gracefully, with an explicit bound relating robustness under the true and approximate distributions. This result enables the direct use of pretrained models without requiring exact distributional assumptions. Empirically, our method achieves state-of-the-art or competitive certified accuracy on CIFAR-10 and ImageNet, while maintaining strong clean performance and low computational overhead. Overall, our work establishes approximate latent structure as a practical and principled route to certifiable robustness.
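To illustrate the kind of closed-form certificate the abstract refers to, here is a minimal Python sketch. It is not the authors' construction. It assumes a two-component Gaussian mixture in latent space with equal priors and a shared isotropic covariance, for which the Bayes classifier is the nearest-mean rule and the certified L2 radius of a point is its distance to the separating hyperplane. All function names are hypothetical. The helper `pinsker_gap` applies Pinsker's inequality, a standard result, to show one way that $\varepsilon$-closeness in KL divergence (measured in nats) limits how far the probability of any event, including "correct and certified at radius r", can differ between two distributions. The paper's explicit bound may take a different form.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _hyperplane(mu_pos, mu_neg):
    # Nearest-mean rule between two means: sign(w . z + b).
    # Assumption: equal priors and shared isotropic covariance.
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    b = -0.5 * (_dot(mu_pos, mu_pos) - _dot(mu_neg, mu_neg))
    return w, b

def predict(z, mu_pos, mu_neg):
    """Return +1 or -1 according to the nearest class mean."""
    w, b = _hyperplane(mu_pos, mu_neg)
    return 1 if _dot(w, z) + b >= 0 else -1

def certified_radius(z, mu_pos, mu_neg):
    """L2 distance from z to the decision hyperplane.

    Any latent perturbation with L2 norm strictly below this value
    leaves the prediction unchanged.
    """
    w, b = _hyperplane(mu_pos, mu_neg)
    return abs(_dot(w, z) + b) / math.sqrt(_dot(w, w))

def certified_accuracy(points, labels, mu_pos, mu_neg, r):
    """Fraction of points that are correct AND certified at radius r."""
    hits = sum(
        1
        for z, y in zip(points, labels)
        if predict(z, mu_pos, mu_neg) == y
        and certified_radius(z, mu_pos, mu_neg) >= r
    )
    return hits / len(points)

def pinsker_gap(eps):
    """Upper bound on |P(A) - Q(A)| for any event A when KL <= eps (nats)."""
    return math.sqrt(eps / 2.0)

if __name__ == "__main__":
    mu_pos, mu_neg = [2.0, 0.0], [-2.0, 0.0]
    points = [[1.0, 0.0], [3.0, 0.0], [-0.5, 0.0], [0.5, 0.0]]
    labels = [1, 1, -1, -1]
    acc = certified_accuracy(points, labels, mu_pos, mu_neg, r=0.75)
    # Under the Pinsker-based reading, certified accuracy under a
    # distribution that is eps-close in KL is at least acc - pinsker_gap(eps).
    print(acc, max(0.0, acc - pinsker_gap(0.02)))
```

The radii in this sketch live in latent space. Carrying them over to input-space perturbations needs further assumptions about the encoder, which the abstract does not specify.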