SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation

2026-05-11 · Computer Vision and Pattern Recognition

AI summary

The authors created SocialDirector, a tool that improves AI video generators so they can better control who does what in videos with multiple interacting people. Their method helps avoid mistakes like the wrong person performing an action, or an action aimed at the wrong person. SocialDirector uses two parts: one ensures each person in the video attends only to their own description, and the other boosts attention to words that indicate direction. They tested it on different AI video models and found it made the social interactions in generated videos more accurate, closer to real videos.

video generation, cross-attention maps, actor-action mismatch, spatiotemporal mask, directional reweighting, Visual Language Models (VLMs), social interaction modeling, multi-person video, interaction fidelity, automated evaluation
Authors
Liangyang Ouyang, Ruicong Liu, Caixin Kang, Yifei Huang, Yoichi Sato
Abstract
Video generation has advanced rapidly, producing photorealistic videos from text or image prompts. Meanwhile, film production and social robotics increasingly demand multi-person videos with rich social interactions, including conversations, gestures, and coordinated actions. However, existing models offer no explicit control over interactions, such as who performs which action, when it occurs, and toward whom it is directed. This often results in the wrong person performing unintended actions (actor-action mismatch), disordered social dynamics, and incorrect action targets. To address these challenges, we present SocialDirector, a training-free interaction controller that enhances the generation model by modulating cross-attention maps. SocialDirector contains two modules: Social Actor Masking and Directional Reweighting. Social Actor Masking constrains each person's visual tokens to attend only to their own textual description via a spatiotemporal mask, avoiding actor-action mismatch and disordered social dynamics. Directional Reweighting amplifies attention to directional words (e.g., "leftward", "right"), steering each action toward its intended target. To evaluate generated social interactions, we annotate existing datasets with interaction descriptions and build a fully automated evaluation pipeline powered by open-source VLMs. Experiments on different video generation models show that SocialDirector significantly improves interaction fidelity and approaches the upper bound set by real videos.
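The two attention edits described in the abstract can be sketched in isolation. The following is a minimal numpy illustration, not the authors' implementation: all names (`actor_regions`, `actor_phrases`, `directional_mask`, `gamma`) are assumed for illustration, and in the actual method these operations would run inside a video diffusion model's cross-attention layers over spatiotemporal tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def modulate_cross_attention(scores, actor_regions, actor_phrases,
                             directional_mask, gamma=2.0):
    """Sketch of Social Actor Masking + Directional Reweighting.

    scores:           (V, T) raw cross-attention logits, V visual tokens
                      attending to T text tokens.
    actor_regions:    list of boolean (V,) arrays marking each actor's
                      spatiotemporal region (assumed given, e.g. from a mask).
    actor_phrases:    list of boolean (T,) arrays marking each actor's
                      description tokens in the prompt.
    directional_mask: boolean (T,) array marking directional words
                      (e.g. "leftward", "right").
    gamma:            amplification factor for directional tokens (assumed).
    """
    logits = scores.astype(float).copy()
    any_actor = np.logical_or.reduce(actor_phrases)  # tokens tied to some actor
    # Social Actor Masking: inside an actor's region, forbid attention to
    # every *other* actor's description tokens.
    for region, phrase in zip(actor_regions, actor_phrases):
        forbidden = any_actor & ~phrase
        logits[np.ix_(region, forbidden)] = -np.inf
    probs = softmax(logits, axis=-1)
    # Directional Reweighting: amplify attention on directional words,
    # then renormalize each row back to a distribution.
    probs[:, directional_mask] *= gamma
    return probs / probs.sum(axis=-1, keepdims=True)
```

For example, with two actors occupying two visual tokens each, masking zeroes out actor 1's attention to actor 2's phrase tokens, while the directional token's attention weight grows by roughly the factor `gamma` before renormalization.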