The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection

2026-06-22

Sound, Artificial Intelligence
AI summary

The authors studied provenance watermarking, a technique that marks synthetic speech so it can later be identified as machine-generated. They found that when only fake speech is watermarked, detectors can learn to treat the watermark itself as a shortcut for identifying fakes. This causes errors such as labeling watermarked real speech as fake, missing fakes whose watermark has been removed, and performing worse on new data. They demonstrated the problem both in their own controlled experiments and against a commercial system. They also found that training detectors with watermarks present on both real and fake speech fixes these errors. The authors release a dataset to support further study of this issue.

provenance watermarking, synthetic speech, speech-generation models, watermark detection, machine learning shortcuts, white-box experiment, black-box test, Equal Error Rate, detector generalization, data augmentation
Authors
Nicolas M. Müller, Pascal Debus
Abstract
Provenance watermarking is increasingly treated as a safeguard for synthetic speech, whether built directly into speech-generation models such as Chatterbox, provided through dedicated techniques such as AudioSeal, or deployed by commercial platforms such as ElevenLabs. We identify a previously uncharacterized liability: when synthetic speech is watermarked and human speech is not, detectors trained on such data latch onto the watermark as a spurious "watermark => fake" shortcut. This single feature yields three coupled failures: generalization degradation (model performance deteriorates on unseen data), strip-to-evade (a watermarked fake escapes detection once its watermark is removed), and mark-to-frame (watermarking a real voice flags it as fake). In a controlled white-box experiment, a watermark-trained detector shows all three (for example, mark-to-frame lifts Equal Error Rate from 16% to 75%). In a black-box test of a commercial API, we show that adding a watermark to real speech causes it to be classified as fake. However, this shortcut is fixable: retraining with the watermark on both classes decorrelates it from the label and restores clean behavior. We release experiment data as a paired clean-versus-watermarked corpus (WASP).
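
The shortcut mechanism described in the abstract can be illustrated with a toy example. The sketch below is not the authors' experiment: it assumes each utterance is reduced to two hypothetical features (a weak genuine "cue" and a 0/1 watermark flag), and it uses a simple diagonal-LDA-style linear detector as a stand-in for a real neural detector. All names (`fit`, `score`, `with_mark`, `equal_error_rate`) and all numbers are made up for illustration. It trains one detector with the watermark only on fakes and one with the watermark on both classes, then reports the Equal Error Rate under three evaluation conditions.

```python
from statistics import fmean, pvariance

def equal_error_rate(real_scores, fake_scores):
    """Approximate EER; a higher score means 'more likely fake'."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(real_scores) | set(fake_scores)) + [float("inf")]:
        fpr = sum(s >= t for s in real_scores) / len(real_scores)  # real flagged as fake
        fnr = sum(s < t for s in fake_scores) / len(fake_scores)   # fake missed
        if abs(fpr - fnr) < best_gap:
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer

def fit(real, fake, eps=0.1):
    """Toy linear detector: per-feature mean gap / (pooled within-class variance + eps).
    A feature that perfectly separates the classes gets a very large weight."""
    weights = []
    for j in range(2):
        r = [x[j] for x in real]
        f = [x[j] for x in fake]
        pooled = (pvariance(r) + pvariance(f)) / 2
        weights.append((fmean(f) - fmean(r)) / (pooled + eps))
    return weights

def score(weights, samples):
    return [sum(w * v for w, v in zip(weights, x)) for x in samples]

def with_mark(cues, mark):
    """Attach a watermark flag (0.0 or 1.0) to every sample."""
    return [(c, mark) for c in cues]

# Hypothetical genuine cue: overlapping between classes, so only weakly informative.
real_cues = [0.0, 1.0, 2.0, 3.0, 4.0]
fake_cues = [2.0, 3.0, 4.0, 5.0, 6.0]

# Shortcut-prone training: watermark on fakes only.
shortcut = fit(with_mark(real_cues, 0.0), with_mark(fake_cues, 1.0))
# Assumed fix: watermark present (and absent) in both classes.
balanced = fit(with_mark(real_cues, 0.0) + with_mark(real_cues, 1.0),
               with_mark(fake_cues, 0.0) + with_mark(fake_cues, 1.0))

# (watermark on real, watermark on fake) at evaluation time.
conditions = {
    "in-distribution": (0.0, 1.0),
    "strip-to-evade": (0.0, 0.0),
    "mark-to-frame": (1.0, 0.0),
}

results = {}
for name, detector in (("shortcut", shortcut), ("balanced", balanced)):
    for cond, (real_mark, fake_mark) in conditions.items():
        eer = equal_error_rate(score(detector, with_mark(real_cues, real_mark)),
                               score(detector, with_mark(fake_cues, fake_mark)))
        results[name, cond] = eer
        print(f"{name:9s} {cond:16s} EER = {eer:.2f}")
```

In this toy setting the shortcut detector looks perfect in-distribution, falls back to the weak cue once the watermark is stripped, and ranks every watermarked real sample above every unwatermarked fake. The balanced detector assigns zero weight to the watermark flag, so its EER is identical across all three conditions. These figures only illustrate the mechanism and are unrelated to the 16% and 75% reported in the abstract.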