AI summary
The authors present HILBERT, a system that combines audio and text information from long, segmented documents to build document-level representations, especially when labeled data is limited. It uses frozen pre-trained speech and language models to extract segment features, then fuses them with cross-modal attention. To keep audio and text balanced and aligned, the training objective contrasts each modality against a shared joint embedding rather than contrasting audio directly with text. Two auxiliary losses maintain consistent structure and prevent either modality from dominating the joint space. Finally, a Mixture-of-Experts classifier handles heterogeneous label regimes and yields improved results in experiments.
cross-attention, multimodal learning, contrastive learning, long-sequence modeling, speech encoding, language encoding, mutual information, Centered Kernel Alignment (CKA), Mixture-of-Experts (MoE), audio-text representation
Authors
Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin
Abstract
We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, HILBERT employs a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations to accommodate heterogeneous label regimes. Extensive evaluation across multiple audio-text backbone combinations demonstrates that HILBERT learns semantically meaningful long-sequence representations and achieves superior performance on highly imbalanced multi-class settings.