Confidence-Guided Diffusion Augmentation for Enhanced Bangla Compound Character Recognition
2026-05-11 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors address the difficulty of recognizing handwritten Bangla compound characters, which have complex shapes and few labeled examples. They propose a method that uses diffusion models, enhanced with additional techniques for better image quality, to generate high-quality synthetic images of these characters. A confidence check filters the synthetic images so that only the most class-consistent ones are kept and added to the training data. Testing on a well-known dataset, the authors show that their approach helps several models recognize Bangla compound characters more accurately than previous methods.
Bangla script, handwritten character recognition, compound characters, diffusion models, data augmentation, Squeeze-and-Excitation blocks, classifier guidance, confidence filtering, ResNet50, Vision Transformer
Authors
Md. Sultan Al Rayhan, Maheen Islam
Abstract
Recognition of handwritten Bangla compound characters remains a challenging problem due to complex character structures, large intra-class variation, and limited availability of high-quality annotated data. Existing Bangla handwritten character recognition systems often struggle to generalize across diverse writing styles, particularly for compound characters containing intricate ligatures and diacritical variations. In this work, we propose a confidence-guided diffusion augmentation framework for low-resolution Bangla compound character recognition. Our framework combines class-conditional diffusion modeling with classifier guidance to synthesize high-quality handwritten compound character samples. To further improve generation quality, we introduce Squeeze-and-Excitation enhanced residual blocks within the diffusion model's U-Net backbone. We additionally propose a confidence-based filtering mechanism where pre-trained classifiers act as quality gates to retain only highly class-consistent synthetic samples. The filtered synthetic images are fused with the original training data and used to retrain multiple classification architectures. Experiments conducted on the AIBangla compound character dataset demonstrate consistent performance improvements across ResNet50, DenseNet121, VGG16, and Vision Transformer architectures. Our best-performing model achieves 89.2% classification accuracy, surpassing the previously published AIBangla benchmark by a substantial margin. The results demonstrate that quality-aware diffusion augmentation can effectively enhance handwritten character recognition performance in low-resource script domains.
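The abstract mentions Squeeze-and-Excitation enhanced residual blocks inside the diffusion model's U-Net backbone. The PyTorch sketch below illustrates one common way such a block can be structured; the layer choices, normalization, and reduction ratio are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: reweight channels with a learned global gate."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: global average over spatial dims
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),  # per-channel gates in [0, 1]
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # excite: rescale feature channels

class SEResidualBlock(nn.Module):
    """Residual block with an SE gate, of the kind usable inside a diffusion U-Net."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.se = SEBlock(channels)

    def forward(self, x):
        return x + self.se(self.body(x))  # SE-gated residual connection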
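The confidence-based filtering step uses a pre-trained classifier as a quality gate over generated samples. The following minimal sketch (again an assumption, not the authors' code) keeps only synthetic images that the classifier assigns to their intended class with probability above a threshold; the threshold value of 0.9 is illustrative.

import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_synthetic(classifier, images, target_labels, threshold=0.9):
    """Return the synthetic images (and labels) that the classifier predicts
    as the intended class with confidence >= threshold."""
    classifier.eval()
    probs = F.softmax(classifier(images), dim=1)           # class probabilities
    conf, pred = probs.max(dim=1)                          # top-1 confidence and predicted label
    keep = (pred == target_labels) & (conf >= threshold)   # class-consistent and confident
    return images[keep], target_labels[keep]

In a pipeline like the one described, such a filter would run over batches of diffusion-generated samples before they are merged with the real training set and the classifiers are retrained.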