DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech

2026-04-10

Sound, Artificial Intelligence
AI summary

The authors improve DDSP-QbE, a voice conversion method built on differentiable digital signal processing, by changing how its excitation signal is generated for voiced and unvoiced speech. They detect unvoiced regions and replace the periodic, buzz-like excitation there with filtered noise, which avoids aliased harmonics where they are most audible. They also smooth the abrupt jump at each phase wrap of the sawtooth-like oscillator with a PolyBLEP correction, which reduces aliasing without oversampling. These changes make the converted voice sound cleaner and more natural, as measured by MOS, and add no learnable parameters to the existing training pipeline.

Differentiable Digital Signal Processing, voice conversion, subtractive synthesis, aliasing artifacts, voicing detection, filtered noise, phase-accumulated oscillator, Polynomial Band-Limited Step (PolyBLEP), MOS (Mean Opinion Score)
Authors
Suhita Ghosh, Yamini Sinha, Sebastian Stober
Abstract
Differentiable Digital Signal Processing (DDSP) pipelines for voice conversion rely on subtractive synthesis, where a periodic excitation signal is shaped by a learned spectral envelope to reconstruct the target voice. In DDSP-QbE, the excitation is generated via phase accumulation, producing a sawtooth-like waveform whose abrupt discontinuities introduce aliasing artefacts that manifest perceptually as buzziness and spectral distortion, particularly at higher fundamental frequencies. We propose two targeted improvements to the excitation stage of the DDSP-QbE subtractive synthesizer. First, we incorporate explicit voicing detection to gate the harmonic excitation, suppressing the periodic component in unvoiced regions and replacing it with filtered noise, thereby avoiding aliased harmonic content where it is most perceptually disruptive. Second, we apply Polynomial Band-Limited Step (PolyBLEP) correction to the phase-accumulated oscillator, substituting the hard waveform discontinuity at each phase wrap with a smooth polynomial residual that cancels alias-generating components without oversampling or spectral truncation. Together, these modifications yield a cleaner harmonic roll-off, reduced high-frequency artefacts, and improved perceptual naturalness, as measured by MOS. The proposed approach is lightweight, differentiable, and integrates seamlessly into the existing DDSP-QbE training pipeline with no additional learnable parameters.
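The two excitation-stage changes described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`polyblep_residual`, `saw_excitation`, `gated_excitation`), the per-sample `f0` and `voiced` inputs, the hard 0/1 voicing gate, and the first-difference noise filter are all assumptions made for illustration, and the abstract does not specify the noise filter or how voicing is detected. The residual is the commonly used two-sample PolyBLEP polynomial. NumPy is used for readability only; a trainable pipeline would express the same arithmetic in an autodiff framework.

```python
import numpy as np


def polyblep_residual(phase, dt):
    """Two-sample PolyBLEP residual for a unit-period phase in [0, 1).

    `dt` is the per-sample phase increment (f0 / sample_rate), assumed < 0.5.
    """
    res = np.zeros_like(phase)
    after = phase < dt                # first sample after a phase wrap
    x = phase[after] / dt[after]
    res[after] = 2.0 * x - x * x - 1.0
    before = phase > 1.0 - dt         # last sample before a phase wrap
    x = (phase[before] - 1.0) / dt[before]
    res[before] = x * x + 2.0 * x + 1.0
    return res


def saw_excitation(f0, sample_rate, use_polyblep=True):
    """Phase-accumulated sawtooth in [-1, 1) from a per-sample f0 track in Hz."""
    dt = np.asarray(f0, dtype=np.float64) / sample_rate
    phase = np.cumsum(dt) % 1.0       # phase accumulation with wrap-around
    saw = 2.0 * phase - 1.0           # naive saw: hard jump at every wrap
    if use_polyblep:
        # Replace each hard jump with a smooth polynomial transition.
        saw = saw - polyblep_residual(phase, dt)
    return saw


def gated_excitation(f0, voiced, sample_rate, seed=0):
    """Harmonic excitation where voiced == 1, filtered noise where voiced == 0."""
    voiced = np.asarray(voiced, dtype=np.float64)
    harmonic = saw_excitation(f0, sample_rate)
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(voiced.shape[0])
    # Placeholder filter (assumption): a first difference stands in for
    # whatever noise filter the actual system applies.
    noise = 0.5 * np.diff(white, prepend=0.0)
    return voiced * harmonic + (1.0 - voiced) * noise
```

In this sketch the correction touches only the two samples around each phase wrap, so it adds no parameters and needs no oversampling, which matches the lightweight design the abstract describes. The output of `gated_excitation` would then be shaped by the learned spectral envelope, which is outside the scope of this example.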