Sketch2Colab: Sketch-Conditioned Multi-Human Animation via Controllable Flow Distillation
2026-03-02 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition · Artificial Intelligence · Graphics · Human-Computer Interaction · Machine Learning
AI summary
The authors present Sketch2Colab, a method that turns simple 2D sketches of scenes into detailed 3D animations of multiple people moving and interacting. Instead of relying on slow or complex training like earlier systems, they teach a model to understand sketches and then create fast, realistic motion using a special flow-based approach. Their system also plans specific human interactions and timing, making the movements look natural and coordinated. Tests show their method follows the input sketches better and runs faster than previous methods.
diffusion model · 3D human motion · latent space · rectified flow · continuous-time Markov chain · motion planning · keyframes · physical plausibility · multi-human interaction · storyboard
Authors
Divyanshu Daiya, Aniket Bera
Abstract
We present Sketch2Colab, which turns storyboard-style 2D sketches into coherent, object-aware 3D multi-human motion with fine-grained control over agents, joints, timing, and contacts. Conventional diffusion-based motion generators have advanced realism; however, achieving precise adherence to rich interaction constraints typically demands extensive training and/or costly posterior guidance, and performance can degrade under strong multi-entity conditioning. Sketch2Colab instead first learns a sketch-driven diffusion prior and then distills it into an efficient rectified-flow student operating in latent space for fast, stable sampling. Differentiable energies over keyframes, trajectories, and physics-based constraints directly shape the student's transport field, steering samples toward motions that faithfully satisfy the storyboard while remaining physically plausible. To capture coordinated interaction, we augment the continuous flow with a continuous-time Markov chain (CTMC) planner that schedules discrete events such as touches, grasps, and handoffs, modulating the dynamics to produce crisp, well-phased human-object-human collaborations. Experiments on CORE4D and InterHuman show that Sketch2Colab achieves state-of-the-art constraint adherence and perceptual quality while offering significantly faster inference than diffusion-only baselines.
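The abstract's core sampling idea, differentiable constraint energies directly shaping the distilled student's transport field, can be illustrated with a minimal Euler integration of a rectified-flow ODE. This is a rough sketch under stated assumptions, not the authors' implementation: `velocity_field`, `energy_grad`, and the `guidance` weight are hypothetical stand-ins, and the real system operates on latent motion representations rather than scalars.

```python
def sample_guided_rectified_flow(velocity_field, energy_grad, z0,
                                 n_steps=8, guidance=0.1):
    """Euler integration of a rectified-flow ODE with energy guidance.

    velocity_field(z, t) -> learned transport direction (the distilled student)
    energy_grad(z)       -> gradient of differentiable constraint energies
                            (keyframe, trajectory, and physics terms)
    """
    z = z0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = velocity_field(z, t)
        # Steer the transport field toward low-energy (constraint-satisfying)
        # samples while keeping the few-step sampling budget of the student.
        z = z + dt * (v - guidance * energy_grad(z))
    return z
```

The appeal of distilling to a rectified-flow student is visible in the loop: a handful of Euler steps replaces the long denoising chain of a diffusion sampler, which is where the claimed inference speedup comes from.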
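The CTMC planner the abstract describes schedules discrete interaction events with continuous-time dynamics. A standard way to sample such a chain is Gillespie-style simulation: exponential holding times from the total outgoing rate, then a rate-proportional choice of the next event. The sketch below is illustrative only; the event vocabulary and `rate_fn` are hypothetical, and the paper's planner additionally modulates the continuous flow, which is not shown here.

```python
import random

# Hypothetical event vocabulary for human-object-human collaboration
EVENTS = ["touch", "grasp", "handoff", "release"]

def sample_event_schedule(rate_fn, t_end=4.0, state="idle", seed=0):
    """Gillespie-style sampling of a continuous-time Markov chain.

    rate_fn(state) -> dict mapping candidate events to transition rates;
    holding times are exponential in the total outgoing rate.
    """
    rng = random.Random(seed)
    t, schedule = 0.0, []
    while True:
        rates = rate_fn(state)
        total = sum(rates.values())
        if total <= 0:  # absorbing state: no further events
            break
        t += rng.expovariate(total)  # exponential holding time
        if t >= t_end:
            break
        # Choose the next event with probability proportional to its rate
        r, acc = rng.uniform(0, total), 0.0
        for event, rate in rates.items():
            acc += rate
            if r <= acc:
                schedule.append((t, event))
                state = event
                break
    return schedule
```

The returned `(time, event)` pairs give the crisp event phasing the abstract mentions; in the full system these timestamps would gate when the corresponding contact constraints act on the motion.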