C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification
2026-06-01 • Sound
AI summary
The authors developed a new method called C2GA to help computers better recognize lung sounds, which are important for diagnosing lung diseases. They used a special type of artificial intelligence to generate realistic and clear lung sound data that matches specific disease classes. This helps address problems like small datasets and noisy recordings without losing important details. Their approach aims to improve how well lung sound classifiers work in real medical settings.
Respiratory sound classification, Data augmentation, Variational Autoencoder (VAE), Generative Adversarial Network (GAN), Vector-Quantized Variational Autoencoder (VQ-VAE), Transformer, Mel-spectrogram, Autoregressive model, Class-controllable generation
Authors
Ziqi Ma, Mengyu Han, Anteng Cai, Zhanchong Liu, Bowen Feng, Hang Yu, Sheng Hu
Abstract
Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics. Meanwhile, existing Variational Autoencoder (VAE)- or Generative Adversarial Network (GAN)-based generative approaches often suffer from limited sample fidelity and insufficient controllability over class semantics, particularly under conditions of scarce supervision.
Methods: To overcome these limitations, we propose C2GA, a class-controllable generative augmentation framework. C2GA first constructs a semantically rich discrete latent space using a conditional Vector-Quantized Variational Autoencoder (VQ-VAE), in which local acoustic tokens are explicitly decoupled from global class prototypes. Subsequently, a Transformer-based autoregressive prior is trained to generate label-consistent token sequences. These generated tokens are then fused with the corresponding class prototypes and decoded into high-fidelity Mel-spectrograms for data augmentation.
Conclusion: These results indicate that C2GA provides an effective and semantically reliable augmentation strategy for respiratory sound analysis. By enabling controllable and high-quality data generation, the proposed framework offers a promising solution for improving the robustness and generalization of respiratory sound classification in realistic clinical scenarios.
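The generation path described under Methods (quantize to discrete tokens, sample a label-consistent token sequence, fuse with a class prototype) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sizes, the random codebook and prototypes, the additive fusion rule, and the bigram table standing in for the Transformer prior are all assumptions made for illustration, and the decoder that maps fused embeddings to Mel-spectrograms is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the abstract does not specify any of them.
NUM_CODES, CODE_DIM, NUM_CLASSES, SEQ_LEN = 8, 4, 3, 6

# Random stand-ins for a learned codebook and learned global class prototypes.
codebook = rng.normal(size=(NUM_CODES, CODE_DIM))
prototypes = rng.normal(size=(NUM_CLASSES, CODE_DIM))


def quantize(latents):
    """Map each latent vector in a (T, D) array to its nearest codebook index."""
    diffs = latents[:, None, :] - codebook[None, :, :]
    dists = (diffs ** 2).sum(axis=-1)  # squared Euclidean distances, shape (T, K)
    return dists.argmin(axis=1)


# Toy stand-in for the autoregressive prior: class-conditional bigram logits.
# Row NUM_CODES of the middle axis acts as a start-of-sequence context.
# A Transformer would condition on the full token history instead.
prior_logits = rng.normal(size=(NUM_CLASSES, NUM_CODES + 1, NUM_CODES))


def sample_tokens(label, length=SEQ_LEN):
    """Sample a token sequence one step at a time, conditioned on the label."""
    tokens = []
    prev = NUM_CODES  # start-of-sequence context
    for _ in range(length):
        logits = prior_logits[label, prev]
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        prev = int(rng.choice(NUM_CODES, p=probs))
        tokens.append(prev)
    return np.array(tokens)


def fuse(tokens, label):
    """Assumed fusion rule: add the class prototype to every token embedding."""
    return codebook[tokens] + prototypes[label]


tokens = sample_tokens(label=1)
decoder_input = fuse(tokens, label=1)  # shape (SEQ_LEN, CODE_DIM)
```

In this sketch the class label enters twice, once when sampling tokens and once at fusion, which mirrors the abstract's separation of local acoustic tokens from global class prototypes. How C2GA actually conditions the prior and performs the fusion is not stated in the abstract.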