Drum Synthesis from Expressive Drum Grids via Neural Audio Codecs

2026-05-11

Sound, Artificial Intelligence
AI summary

The authors built a system that turns expressive drum grids (time-aligned MIDI with microtiming and velocity information) into drum audio. A Transformer model predicts the discrete tokens of a neural audio codec, and a pre-trained codec decoder converts those tokens into a waveform. They compare three codecs (EnCodec, DAC, and X-Codec) to see how the choice of audio representation affects the generated drums, and they train and evaluate the system on the Expanded Groove MIDI Dataset (E-GMD), a large collection of human drum performances with paired MIDI and audio. Their results indicate that codec-token prediction is an effective way to generate drum audio from drum grids.

drum grid, MIDI, Transformer model, neural audio codec, codec tokens, audio synthesis, Expanded Groove MIDI Dataset (E-GMD), waveform audio, microtiming, velocity
Authors
Konstantinos Soiledis, Maximos Kaliakatsos-Papakostas, Dimos Makris, Konstantinos Tsamis
Abstract
Generating realistic drum audio directly from symbolic representations is a challenging task at the intersection of music perception and machine learning. We propose a system that transforms an expressive drum grid, a time-aligned MIDI representation with microtiming and velocity information, into drum audio by predicting discrete codes of a neural audio codec. Our approach uses a Transformer-based model to map the drum grid input to a sequence of codec tokens, which are then converted to waveform audio via a pre-trained codec decoder. We experiment with multiple state-of-the-art neural codecs, namely EnCodec, DAC, and X-Codec, to assess how the choice of audio representation impacts the quality of the generated drums. The system is trained and evaluated on the Expanded Groove MIDI Dataset, E-GMD, a large collection of human drum performances with paired MIDI and audio. We evaluate the fidelity and musical alignment of the generated audio using objective metrics. Overall, our results establish codec-token prediction as an effective route for drum grid-to-audio generation and provide practical insights into selecting audio tokenizers for percussive synthesis.
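To make the input representation concrete, here is a minimal Python sketch of how an expressive drum grid could be built from MIDI-like note events. The abstract does not specify the exact grid layout, so everything here is an assumption made for illustration: the three-voice map, the sixteenth-note resolution, the `Note` type, and the `notes_to_grid` helper are hypothetical and are not the authors' implementation. Each grid cell stores a hit flag, a normalized velocity, and a microtiming offset expressed as a fraction of one grid step.

```python
import math
from dataclasses import dataclass


@dataclass
class Note:
    onset_sec: float  # onset time in seconds
    pitch: int        # MIDI pitch of the drum voice
    velocity: int     # MIDI velocity, 1..127


# Hypothetical voice map using General MIDI percussion pitches:
# 36 = bass drum, 38 = acoustic snare, 42 = closed hi-hat.
VOICES = {36: 0, 38: 1, 42: 2}


def notes_to_grid(notes, bpm, n_steps, steps_per_beat=4):
    """Quantize notes onto a fixed grid, keeping velocity and microtiming.

    Returns three n_steps x n_voices lists:
      hits: 1 where a voice is struck, else 0
      vels: velocity scaled to (0, 1]
      offs: signed offset from the grid line, in fractions of a step
    """
    step_sec = 60.0 / bpm / steps_per_beat  # duration of one grid step
    n_voices = len(VOICES)
    hits = [[0] * n_voices for _ in range(n_steps)]
    vels = [[0.0] * n_voices for _ in range(n_steps)]
    offs = [[0.0] * n_voices for _ in range(n_steps)]

    for note in notes:
        voice = VOICES.get(note.pitch)
        if voice is None:
            continue  # pitch outside the assumed voice map
        pos = note.onset_sec / step_sec   # position in (fractional) steps
        step = math.floor(pos + 0.5)      # nearest grid line
        if not 0 <= step < n_steps:
            continue
        vel = note.velocity / 127.0
        # If two notes land in the same cell, keep the louder one.
        if vel > vels[step][voice]:
            hits[step][voice] = 1
            vels[step][voice] = vel
            offs[step][voice] = pos - step  # lies in [-0.5, 0.5)
    return hits, vels, offs


# One bar at 120 BPM: a kick on the beat, a slightly early hi-hat,
# and a slightly late snare.
example = [Note(0.0, 36, 127), Note(0.12, 42, 100), Note(0.26, 38, 64)]
hits, vels, offs = notes_to_grid(example, bpm=120, n_steps=16)
```

In a pipeline like the one described in the abstract, a grid of this kind would be the conditioning input to the Transformer, whose output tokens a pre-trained codec decoder then turns into audio. Keeping the offset relative to the grid line separates the quantized rhythm from the performer's timing deviations.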