Text Dictates, Music Decorates: Energy-based Attention for Editable Dance Motion Generation

2026-06-22
Artificial Intelligence

AI summary

The authors developed STREAM, an AI system for generating dance movements that follow both music and text descriptions without mixing them up. They addressed the common problem where the music rhythm overpowers the text instructions by creating separate pathways for each input. They also built a new dataset called Motorica++ with detailed dance annotations and introduced new ways to measure how well the system follows editing commands. Their experiments indicate that STREAM produces dance that matches the music and respects the text guidance better than prior methods.

choreographic motion, diffusion transformer, Adaptive Layer Normalization (AdaLN), Bimodal Energy-Based Attention Module (BEAM), modal collapse, motorica dataset, semantic control, zero-shot editability, music-conditioned motion synthesis, modal decoupling
Authors
Seong Jong Yoo, Siyuan Peng, Felix Gu, Stratis Aloimonos, Cornelia Fermüller
Abstract
Choreographic motion generation poses unique challenges for AI, demanding precise semantic control over complex, temporally structured, and expressive full-body dynamics. While existing models can synthesize motion from music, they remain largely black boxes. Conversely, attempting to condition generation on both text and music frequently leads to modality collapse, where dense acoustic rhythms overwhelm sparse semantic text prompts, destroying user controllability. To resolve this spatial-temporal conflict, we propose STREAM (Structural-Temporal Rhythmic Energy-based Attention for Motion), a modality-decoupled diffusion transformer. STREAM strictly separates conditioning pathways: global text semantics dictate the kinematic structure via Adaptive Layer Normalization (AdaLN), while a novel Bimodal Energy-Based Attention Module (BEAM) routes these features to the musical beat without overwriting the semantics. We further introduce Motorica++, a newly curated dataset derived from the existing Motorica dataset and enriched with domain-specific dance vocabulary and frame-level semantic annotations. Additionally, to rigorously quantify zero-shot editability, we propose the Exchange Evaluation Protocol and Editable Dance Score (EDS). In extensive experiments, STREAM achieves state-of-the-art alignment between motion and music while perfectly preserving choreographic semantics, positioning AI not merely as a reactive synthesizer, but as a controllable, collaborative partner for artistic direction. The source code and datasets are available at https://github.com/SeongJong-Yoo/STREAM.
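
To illustrate the text pathway described in the abstract, the following is a minimal NumPy sketch of generic Adaptive Layer Normalization, in which a global conditioning vector produces a per-feature scale and shift applied to normalized motion tokens. It is not taken from the STREAM code: the function names, tensor shapes, and projection matrices `w_scale` and `w_shift` are hypothetical, and the `(1 + scale)` form is one common AdaLN convention assumed here. The abstract does not specify the internal form of BEAM, so it is not sketched.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each frame's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def adaln(x, cond, w_scale, w_shift):
    """Generic AdaLN (assumed form, not the authors' implementation).

    x:       (frames, dim) motion tokens
    cond:    (cond_dim,) global conditioning vector, e.g. a text embedding
    w_scale: (cond_dim, dim) hypothetical projection producing the scale
    w_shift: (cond_dim, dim) hypothetical projection producing the shift
    """
    scale = cond @ w_scale  # (dim,), shared across all frames
    shift = cond @ w_shift  # (dim,), shared across all frames
    return layer_norm(x) * (1.0 + scale) + shift

# Toy inputs; all sizes are arbitrary illustration values.
rng = np.random.default_rng(0)
frames, dim, cond_dim = 8, 16, 4
x = rng.normal(size=(frames, dim))
text_emb = rng.normal(size=cond_dim)
w_scale = rng.normal(size=(cond_dim, dim)) * 0.1
w_shift = rng.normal(size=(cond_dim, dim)) * 0.1

out = adaln(x, text_emb, w_scale, w_shift)
print(out.shape)
```

Because the scale and shift are computed once from the global vector and broadcast over every frame, the conditioning in this sketch modulates the whole sequence uniformly rather than frame by frame, which matches the abstract's description of text as a global signal.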