LatentFlowSR: High-Fidelity Audio Super-Resolution via Noise-Robust Latent Flow Matching

2026-04-10 | Sound

AI summary

The authors developed LatentFlowSR, a method that improves low-quality audio by restoring its missing high-frequency content. Instead of working directly on raw audio, they first encode it into a compact latent representation using an autoencoder. They then use conditional flow matching to generate the high-resolution latent representation, which is decoded back into audio. Their experiments show that this approach outperforms baseline methods across audio types, including music and sound effects, not just speech.

audio super-resolution, latent space, autoencoder, conditional flow matching, ordinary differential equation solver, high-frequency reconstruction, low-resolution audio, latent representation, audio decoding, generalization
Authors
Fei Liu, Yang Ai, Hui-Peng Du, Yu-Fei Shi, Zhen-Hua Ling
Abstract
Audio super-resolution aims to recover missing high-frequency details from bandwidth-limited low-resolution audio, thereby improving the naturalness and perceptual quality of the reconstructed signal. However, most existing methods operate directly in the waveform or time-frequency domain, which entails high-dimensional generation spaces, and they are largely limited to speech, leaving substantial room for improvement on more complex audio types such as sound effects and music. To address these limitations, we introduce LatentFlowSR, a new audio super-resolution approach that leverages conditional flow matching (CFM) within a latent representation space. Specifically, we first train a noise-robust autoencoder, which encodes low-resolution audio into a continuous latent space. Conditioned on the low-resolution latent representation, a CFM mechanism progressively generates the corresponding high-resolution latent representation from a Gaussian prior with a one-step ordinary differential equation (ODE) solver. The resulting high-resolution latent representation is then decoded by the pretrained autoencoder to reconstruct the high-resolution audio. Experimental results demonstrate that LatentFlowSR consistently outperforms baseline methods across various audio types and super-resolution settings. These results indicate that the proposed method has strong high-frequency reconstruction capability and robust generalization performance, providing compelling evidence for the effectiveness of latent-space modeling in audio super-resolution. All relevant code will be made publicly available upon completion of the paper review process.
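
The sampling step described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes the commonly used linear probability path for conditional flow matching, and `toy_velocity` is a hypothetical stand-in for a trained velocity network (it simply pretends the high-resolution latent equals twice the conditioning latent). The encoder and decoder are omitted, and all function names are illustrative.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Assumed linear path: x_t = (1 - t) * x0 + t * x1, target velocity x1 - x0."""
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    ut = [b - a for a, b in zip(x0, x1)]
    return xt, ut


def euler_one_step(velocity_fn, x0, cond):
    """Integrate dx/dt = v(x, t, cond) from t = 0 to t = 1 with one Euler step."""
    v = velocity_fn(x0, 0.0, cond)
    return [a + 1.0 * b for a, b in zip(x0, v)]  # step size is 1.0


def toy_velocity(x, t, cond):
    """Hypothetical oracle, NOT a trained model: points straight at 2 * cond."""
    target = [2.0 * c for c in cond]
    return [(g - a) / (1.0 - t) for g, a in zip(target, x)]


rng = random.Random(0)
z_lr = [0.5, -1.0, 0.25]                   # stand-in low-resolution latent
x0 = [rng.gauss(0.0, 1.0) for _ in z_lr]   # sample from the Gaussian prior
z_hr_hat = euler_one_step(toy_velocity, x0, z_lr)
```

Because the toy field follows a straight line from the prior sample to its target, a single Euler step lands on the target up to floating-point error. Whether one step is enough for a learned field depends on how straight its trajectories are, which this sketch does not address.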