FlowEdit: Associative Memory for Lifelong Pronunciation Adaptation in Flow-Matching TTS

2026-06-18 • Artificial Intelligence


AI summary

The authors developed FlowEdit, a new method that helps text-to-speech systems fix pronunciation mistakes of unusual or new words without retraining the whole model. Instead of changing the model itself, FlowEdit learns small pronunciation fixes and remembers them using a special memory network. When the TTS system speaks again, it uses these corrections to sound more accurate. Tests showed FlowEdit greatly reduced errors on hard-to-pronounce words while keeping overall speech quality the same.

flow-matching, text-to-speech (TTS), zero-shot learning, pronunciation correction, latent conditioning, Modern Hopfield Network, token-level perturbation, episodic memory, phoneme error rate, soft attention

Authors

Harshit Singh, Ayush Pratap Singh, Nityanand Mathur

Abstract

Flow-matching text-to-speech systems achieve remarkable zero-shot quality but remain static after deployment: pronunciation errors on out-of-vocabulary proper nouns persist unless the model is retrained. We introduce FlowEdit, a lifelong adaptation framework for frozen flow-matching TTS that learns pronunciation corrections as latent conditioning edits rather than weight updates. When corrective feedback is provided, FlowEdit optimizes a token-level perturbation in the text embedding space, then stores the correction in a Modern Hopfield Network serving as content-addressable episodic memory. At inference, corrections are retrieved via soft attention with a similarity gate, enabling fuzzy morphological matching. On our curated benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit reduces target-word Phoneme Error Rate by 92.7% relative to the zero-shot baseline while maintaining identical general-speech quality. Corrections complete in approximately 15 seconds on a single GPU.
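
The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `retrieve_edit`, the use of cosine similarity, the inverse temperature `beta`, and the threshold `gate_threshold` are all assumptions made for illustration. The sketch stores each correction as a (key, edit) pair, applies a softmax over scaled similarities in the style of Modern Hopfield retrieval, and returns a zero edit when no stored key is similar enough to the query.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_edit(query, keys, edits, beta=8.0, gate_threshold=0.8):
    """Blend stored edits by soft attention; return zeros if the gate stays closed.

    query: embedding of the token being synthesized (hypothetical).
    keys:  embeddings of tokens that were previously corrected (hypothetical).
    edits: perturbation vectors stored alongside each key (hypothetical).
    beta and gate_threshold are illustrative values, not taken from the paper.
    """
    dim = len(edits[0]) if edits else 0
    if not keys:
        return [0.0] * dim

    sims = [cosine(query, k) for k in keys]

    # Similarity gate: leave the conditioning untouched for unrelated tokens.
    if max(sims) < gate_threshold:
        return [0.0] * dim

    # Softmax over beta-scaled similarities (shifted by the max for stability).
    top = max(sims)
    weights = [math.exp(beta * (s - top)) for s in sims]
    total = sum(weights)
    weights = [w / total for w in weights]

    # Weighted sum of the stored edits.
    return [sum(w * e[i] for w, e in zip(weights, edits)) for i in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
edits = [[0.5, -0.5], [-0.2, 0.3]]

near = retrieve_edit([0.9, 0.1], keys, edits)  # close to the first key
far = retrieve_edit([0.7, 0.7], keys, edits)   # equally far from both keys
```

Under these assumed settings, `near` is dominated by the first stored edit because the query is a near match for the first key, while `far` is a zero vector because neither similarity clears the gate. A soft match of this kind is one way a correction learned for one surface form could carry over to a morphological variant with a nearby embedding.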
