Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

2026-06-01

Sound, Machine Learning
AI summary

The authors address underwater sound classification, which is difficult because the ocean acoustic environment is complex and changes over time. They combine two representations of sound, waveforms (raw audio samples) and spectrograms (time-frequency representations), in a dual-encoder neural network built on pre-trained backbones with parameter-efficient fine-tuning. The two branches are fused with a differentiable fuzzy aggregation based on the Choquet integral, which lets the model rely more on the less corrupted representation and makes its decisions easier to interpret through the learned fuzzy measures. Evaluations on the DeepShip and ShipsEar datasets show that the approach outperforms single-encoder baselines while limiting the number of trainable parameters, which reduces the risk of overfitting.

underwater acoustic classification, waveform, spectrogram, neural network, pre-trained models, fine-tuning, Choquet integral, fuzzy aggregation, domain adaptation, non-stationary environment
Authors
Amirmohammad Mohammadi, Joshua Peeples, Alexandra Van Dine
Abstract
Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules to enable domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, analyzing the learned fuzzy measures reveals class-specific shifts in the network's reliance on each representation. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.
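To illustrate how a Choquet integral can fuse two sources, the following is a minimal sketch of the standard two-source discrete Choquet integral, not the authors' implementation. The function names, the use of per-class score lists, and the sigmoid parameterization of the fuzzy measure are all assumptions made for illustration. For two sources the measure has two free densities, g({waveform}) and g({spectrogram}), with g of the empty set fixed at 0 and g of both sources fixed at 1. The larger of the two scores is weighted by the density of its source and the smaller score receives the remaining weight.

```python
import math


def sigmoid(x: float) -> float:
    """Map an unconstrained parameter to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def choquet_two_sources(h_wave, h_spec, theta_wave, theta_spec):
    """Fuse two per-class score lists with a two-source discrete Choquet integral.

    h_wave, h_spec : per-class scores from the two branches (hypothetical inputs).
    theta_wave, theta_spec : unconstrained parameters; passing them through a
        sigmoid keeps each density in (0, 1), so monotonicity with respect to
        g({both}) = 1 holds automatically. This parameterization is an
        assumption for illustration only.
    """
    g_wave = sigmoid(theta_wave)  # g({waveform})
    g_spec = sigmoid(theta_spec)  # g({spectrogram})

    fused = []
    for a, b in zip(h_wave, h_spec):
        if a >= b:
            # Waveform score ranks first: it gets g({waveform}),
            # the smaller score gets g({both}) - g({waveform}).
            fused.append(a * g_wave + b * (1.0 - g_wave))
        else:
            # Spectrogram score ranks first: it gets g({spectrogram}),
            # the smaller score gets g({both}) - g({spectrogram}).
            fused.append(b * g_spec + a * (1.0 - g_spec))
    return fused


if __name__ == "__main__":
    wave_scores = [0.2, 0.8]
    spec_scores = [0.6, 0.4]
    # Densities of 0.5 each give an additive measure, i.e. a plain average.
    print(choquet_two_sources(wave_scores, spec_scores, 0.0, 0.0))
    # Densities near 1 push the fusion toward the element-wise maximum.
    print(choquet_two_sources(wave_scores, spec_scores, 50.0, 50.0))
```

The sketch shows why this kind of operator is both flexible and inspectable: densities near 1 recover the maximum, densities near 0 recover the minimum, and densities that sum to 1 recover a weighted average, so the learned values indicate how much each branch is trusted. The expression is piecewise linear in the scores and smooth in the parameters, which is what would allow gradient-based training if it were written in an automatic differentiation framework.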