Towards Continual Motion-Language Agents: LoRA Variants for Incremental Motion Understanding and Generation

2026-06-29 · Machine Learning

Machine Learning, Artificial Intelligence
AI summary

The authors study how to teach AI agents to both understand and create human movements described in language, while learning new types of motions without forgetting old ones. They build on a large language model and use a special technique called low-rank adaptation along with a mixture-of-experts design that picks the right expert automatically for each task. Their experiments on a set of five motion-related tasks show that their approach prevents forgetting and keeps the AI's understanding and generation abilities strong. They also find that choosing one expert decisively works better than mixing experts and note that standard accuracy measures don't always reflect real-world quality well.

motion-to-text, text-to-motion, low-rank adaptation, mixture-of-experts, catastrophic forgetting, continual learning, autoencoder, task routing, HumanML3D dataset, token-level accuracy
Authors
Bertram Taetz, Hugo Albuquerque Cosme da Silva, Gabriele Bleser-Taetz
Abstract
Motion-language agents must possess the bidirectional capability to both understand human movement (motion-to-text, M2T) and generate it from natural language (text-to-motion, T2M). While foundational models have achieved strong performance in static settings, autonomous agents operating in dynamic environments must continuously incorporate new motion concepts -- such as novel athletic styles or specialized gestures -- without catastrophic forgetting of previously acquired skills. We investigate the stability-plasticity trade-off in bidirectional motion-language learning under sequential task exposure. Building on a frozen large language model backbone, we introduce low-rank adaptation (LoRA) variants designed to mitigate inter-task interference. We specifically propose mixture-of-experts architectures that utilize an autoencoder-based router to select task-specific experts at inference time, so that no task label is needed. To evaluate these methods, we establish a reproducible five-task benchmark derived from HumanML3D through semantic clustering of motion descriptions. Our experimental results demonstrate near-zero forgetting across both M2T and T2M directions while maintaining high generation and captioning quality. Furthermore, we show that hard expert selection via routing significantly outperforms soft expert blending in quality metrics, indicating that preserving expert isolation is critical for maintaining performance in our continual learning setting. Finally, we observe that token-level accuracy and downstream generation quality can diverge, highlighting the need for more comprehensive evaluation protocols in future research on lifelong motion-language agents.
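To make the routing idea concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say how the autoencoder-based router scores an input, so the sketch assumes one small autoencoder per task and routes by lowest reconstruction error. All class and function names (`LoRAExpert`, `LinearAutoencoder`, `route_hard`, `route_soft`, `forward`) are hypothetical, the autoencoder weights are set by hand rather than trained, and the soft variant (a softmax over negative errors) is only one possible way to blend experts.

```python
import math


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRAExpert:
    """Low-rank update on a frozen weight: delta(x) = (alpha / r) * B (A x)."""

    def __init__(self, A, B, alpha=1.0):
        self.A, self.B = A, B            # A: r x d_in, B: d_out x r
        self.scale = alpha / len(A)      # r is the number of rows of A

    def delta(self, x):
        return [self.scale * v for v in matvec(self.B, matvec(self.A, x))]


class LinearAutoencoder:
    """Tied-weight autoencoder with a 1-D bottleneck (assumed already trained)."""

    def __init__(self, mean, direction):
        norm = math.sqrt(sum(u * u for u in direction))
        self.mean = mean
        self.u = [u / norm for u in direction]

    def reconstruction_error(self, x):
        centered = [xi - mi for xi, mi in zip(x, self.mean)]
        z = sum(c * u for c, u in zip(centered, self.u))   # encode
        recon = [z * u for u in self.u]                    # decode
        return sum((c - r) ** 2 for c, r in zip(centered, recon))


def route_hard(x, autoencoders):
    """Assumed rule: pick the task whose autoencoder reconstructs x best."""
    errors = [ae.reconstruction_error(x) for ae in autoencoders]
    return min(range(len(errors)), key=errors.__getitem__)


def route_soft(x, autoencoders, temperature=1.0):
    """Assumed soft variant: softmax over negative reconstruction errors."""
    errors = [ae.reconstruction_error(x) for ae in autoencoders]
    lowest = min(errors)
    weights = [math.exp(-(e - lowest) / temperature) for e in errors]
    total = sum(weights)
    return [w / total for w in weights]


def forward(W, experts, weights, x):
    """Frozen layer output plus the weighted sum of expert updates."""
    y = matvec(W, x)
    for w, expert in zip(weights, experts):
        if w == 0.0:
            continue                     # isolated experts contribute nothing
        y = [yi + w * di for yi, di in zip(y, expert.delta(x))]
    return y


# Toy setup: two tasks, 2-D features, identity as the frozen backbone weight.
W = [[1.0, 0.0], [0.0, 1.0]]
experts = [
    LoRAExpert(A=[[1.0, 0.0]], B=[[1.0], [0.0]]),
    LoRAExpert(A=[[0.0, 1.0]], B=[[0.0], [1.0]]),
]
autoencoders = [
    LinearAutoencoder(mean=[0.0, 0.0], direction=[1.0, 0.0]),
    LinearAutoencoder(mean=[0.0, 0.0], direction=[0.0, 1.0]),
]

x = [2.0, 0.1]
chosen = route_hard(x, autoencoders)
one_hot = [1.0 if i == chosen else 0.0 for i in range(len(experts))]
y_hard = forward(W, experts, one_hot, x)
y_soft = forward(W, experts, route_soft(x, autoencoders), x)
```

With hard routing only the selected expert's update is applied, so the other experts cannot leak into the output. With soft routing every expert contributes in proportion to its weight, which is the kind of cross-expert mixing the abstract reports as harmful to quality in this setting.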