Leveraging Model Soups to Classify Intangible Cultural Heritage Images from the Mekong Delta
2026-03-02 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition · Artificial Intelligence · Machine Learning
AI summary
The authors address the problem of classifying images of Intangible Cultural Heritage in the Mekong Delta, where data is scarce and classes look very similar. To improve performance, they combine a special neural network called CoAtNet that mixes convolution and attention with a method called model soups, which averages different saved versions of the model to make predictions more stable. Their approach reduces variability in results without much extra bias and performs better than common models on a dataset with 17 cultural classes. They also show that their method picks models that are diverse in how they make predictions, unlike simpler averaging techniques.
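The checkpoint-averaging idea described above can be sketched in a few lines. The code below is a minimal illustration under assumptions, not the authors' implementation: checkpoints are represented as plain dicts of NumPy arrays, `average_weights` builds a uniform soup, and `greedy_soup` follows the standard greedy recipe from the model-soups literature (try checkpoints in order of validation score, keep each one only if the averaged soup's held-out score does not drop). The `evaluate` callable is a hypothetical stand-in for validation accuracy.

```python
import numpy as np

def average_weights(state_dicts):
    """Uniform soup: parameter-wise mean of a list of checkpoint weights."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

def greedy_soup(state_dicts, val_scores, evaluate):
    """Greedy soup: visit checkpoints from best to worst validation score,
    adding each to the soup only if the averaged model does not get worse."""
    order = np.argsort(val_scores)[::-1]          # best checkpoint first
    soup = [state_dicts[order[0]]]
    best = evaluate(average_weights(soup))
    for i in order[1:]:
        candidate = soup + [state_dicts[i]]
        score = evaluate(average_weights(candidate))
        if score >= best:                          # keep only helpful checkpoints
            soup, best = candidate, score
    return average_weights(soup), best
```

Because the averaging happens in weight space, the resulting soup is a single model, so inference cost stays the same as for any one checkpoint.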
Intangible Cultural Heritage · CoAtNet · model soups · convolutional neural networks · self-attention · ensembling · cross-entropy · Multidimensional Scaling · bias-variance decomposition · top-1 accuracy
Authors
Quoc-Khang Tran, Minh-Thien Nguyen, Nguyen-Khang Pham
Abstract
The classification of Intangible Cultural Heritage (ICH) images in the Mekong Delta poses unique challenges due to limited annotated data, high visual similarity among classes, and domain heterogeneity. In such low-resource settings, conventional deep learning models often suffer from high variance or overfit to spurious correlations, leading to poor generalization. To address these limitations, we propose a robust framework that integrates the hybrid CoAtNet architecture with model soups, a lightweight weight-space ensembling technique that averages checkpoints from a single training trajectory without increasing inference cost. CoAtNet captures both local and global patterns through stage-wise fusion of convolution and self-attention. We apply two ensembling strategies, greedy and uniform soup, to selectively combine diverse checkpoints into a final model. Beyond performance improvements, we analyze the ensembling effect through the lens of bias-variance decomposition. Our findings show that model soups reduce variance by stabilizing predictions across diverse model snapshots, while introducing minimal additional bias. Furthermore, using cross-entropy-based distance metrics and Multidimensional Scaling (MDS), we show that model soups select geometrically diverse checkpoints, whereas Soft Voting blends redundant models clustered near the center of output space. Evaluated on the ICH-17 dataset (7,406 images across 17 classes), our approach achieves state-of-the-art results with 72.36% top-1 accuracy and 69.28% macro F1-score, outperforming strong baselines including ResNet-50, DenseNet-121, and ViT. These results underscore that diversity-aware checkpoint averaging provides a principled and efficient way to reduce variance and enhance generalization in culturally rich, data-scarce classification tasks.
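The diversity analysis described in the abstract (cross-entropy-based distances plus MDS) can be sketched as follows. This is an illustrative reconstruction under assumptions, not the paper's exact procedure: each checkpoint is summarized by its matrix of predicted class probabilities on a shared validation set, pairwise dissimilarity is taken as a symmetrized KL divergence (one natural cross-entropy-based choice, since KL divergence is cross-entropy minus entropy), and classical MDS embeds the resulting distance matrix in low dimension for visualization.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence between two checkpoints' predictive
    distributions (rows = validation images, columns = classes)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    kl_pq = np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1))
    kl_qp = np.mean(np.sum(q * (np.log(q) - np.log(p)), axis=1))
    return 0.5 * (kl_pq + kl_qp)

def classical_mds(D, dim=2):
    """Embed an n x n pairwise distance matrix into `dim` coordinates
    via double-centering and an eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # Gram matrix of the embedding
    vals, vecs = np.linalg.eigh(B)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]         # keep the `dim` largest
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))
```

Checkpoints that land far apart in such an embedding make different errors on the validation set, which is precisely the kind of diversity that weight-space averaging can exploit.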