Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs

2026-04-10

Sound, Artificial Intelligence
AI summary

The authors looked at how audio language models sometimes make things up when generating descriptions of sounds, which is called hallucination. They found that previous methods for spotting these errors treated detection as a yes-or-no decision, missing the variety of mistakes that can happen. To fix this, they created a new method called Noise-Aware In-Context Learning (NAICL), which gives the model examples of noisy sounds to help it be more careful when it isn't sure. They also made a new test set and ways to measure different types of hallucinations. Their method lowered the overall hallucination rate from about 27% to about 17%.

Auditory Large Language Models (ALLMs), Hallucination, Noise-Aware In-Context Learning (NAICL), Audio Captioning, Clotho-1K Dataset, Hallucination Benchmark, Multimodal Reasoning, Generative Models, Fine-grained Analysis, Speculative Associations
Authors
Qixuan Huang, Khalid Zaman, Masashi Unoki
Abstract
Auditory large language models (ALLMs) have demonstrated strong general capabilities in audio understanding and reasoning tasks. However, their reliability is still undermined by hallucination issues. Existing hallucination evaluation methods are formulated as binary classification tasks, which are insufficient to characterize the more complex hallucination patterns that arise in generative tasks. Moreover, current hallucination mitigation strategies rely on fine-tuning, resulting in high computational costs. To address these limitations, we propose a plug-and-play Noise-Aware In-Context Learning (NAICL) method. Specifically, we construct a noise prior library, retrieve noise examples relevant to the input audio, and incorporate them as contextual priors, thereby guiding the model to reduce speculative associations when acoustic evidence is insufficient and to adopt a more conservative generation strategy. In addition, we establish a hallucination benchmark for audio captioning tasks, including the construction of the Clotho-1K multi-event benchmark dataset, the definition of four types of auditory hallucinations, and the introduction of metrics such as hallucination type distribution to support fine-grained analysis. Experimental results show that all evaluated ALLMs exhibit the same hallucination behaviors. Moreover, the proposed NAICL method reduces the overall hallucination rate from 26.53% to 16.98%.
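
The retrieve-then-prompt step described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say how noise examples are represented or retrieved, so the sketch uses cosine similarity over precomputed embeddings, stands in text captions for the in-context noise examples, and invents the names `NoiseExample`, `retrieve_noise_examples`, and `build_prompt`. It is not the authors' implementation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class NoiseExample:
    """One entry in a hypothetical noise prior library."""
    name: str
    embedding: List[float]  # assumed: precomputed audio embedding
    caption: str            # assumed: conservative reference description


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_noise_examples(query: List[float],
                            library: List[NoiseExample],
                            k: int = 2) -> List[NoiseExample]:
    """Return the k library entries most similar to the query embedding."""
    ranked = sorted(library,
                    key=lambda ex: cosine(query, ex.embedding),
                    reverse=True)
    return ranked[:k]


def build_prompt(examples: List[NoiseExample],
                 instruction: str = "Describe only the sounds you can clearly hear.") -> str:
    """Prepend the retrieved noise examples as context before the instruction."""
    lines = [f"Example {i} (noise: {ex.name}): {ex.caption}"
             for i, ex in enumerate(examples, 1)]
    lines.append(instruction)
    return "\n".join(lines)


# Toy library with made-up 3-dimensional embeddings.
library = [
    NoiseExample("white_noise", [1.0, 0.0, 0.0],
                 "Steady broadband noise; no distinct events."),
    NoiseExample("wind", [0.0, 1.0, 0.0],
                 "Wind noise; no other sources are identifiable."),
    NoiseExample("hum", [0.0, 0.0, 1.0],
                 "A low electrical hum; nothing else is audible."),
]

query_embedding = [0.9, 0.4, 0.1]  # made-up embedding of the input audio
top = retrieve_noise_examples(query_embedding, library, k=2)
prompt = build_prompt(top)
print(prompt)
```

Because the examples are only prepended to the prompt, a pipeline of this shape needs no weight updates, which is consistent with the plug-and-play framing in the abstract. In a real system the in-context examples might be audio clips rather than text.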