3D Scene-Adaptive Trajectory-Controllable Human Image Animation with Camera Movement

2026-06-29 · Computer Vision and Pattern Recognition

AI summary

The authors address the challenge of creating videos in which a person moves naturally within a 3D scene while the camera viewpoint also changes. They developed a method that automatically adapts the person's movement to changes in ground height and orientation, making the motion path easier to control. Their method also uses 3D scene information to help the video generation process understand and apply camera viewpoint changes more precisely. Experiments on two standard benchmarks show that their approach improves video quality over previous methods.
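To make the ground-adaptive idea concrete, here is a minimal NumPy sketch. It is not the authors' method. It assumes a y-up reconstructed point cloud, a user-drawn path of (x, z) waypoints, and a fixed pelvis height above the ground (0.9 m, an arbitrary choice). The helper names `ground_height_from_points` and `retarget_root_trajectory` are hypothetical. Each waypoint is lifted to the local ground height, which is estimated as the median height of nearby points. The body yaw comes from the path tangent.

```python
import numpy as np


def ground_height_from_points(points, xz, radius=0.2):
    """Hypothetical helper: estimate the ground height (y-up) at a query (x, z)
    as the median y of point-cloud points within `radius` in the xz-plane."""
    d = np.linalg.norm(points[:, [0, 2]] - np.asarray(xz, dtype=float), axis=1)
    near = points[d < radius]
    if len(near) == 0:
        # No point within the radius: fall back to the nearest point's height.
        return float(points[np.argmin(d), 1])
    return float(np.median(near[:, 1]))


def retarget_root_trajectory(points, waypoints_xz, pelvis_height=0.9):
    """Hypothetical sketch: lift a 2D ground-plane path onto the scene ground
    and orient the body along the direction of travel."""
    waypoints_xz = np.asarray(waypoints_xz, dtype=float)  # shape (N, 2), N >= 2
    heights = np.array([ground_height_from_points(points, p) for p in waypoints_xz])
    # Root follows the path in x/z and sits a fixed offset above the local ground.
    roots = np.column_stack(
        [waypoints_xz[:, 0], heights + pelvis_height, waypoints_xz[:, 1]]
    )
    # Heading: yaw about the y-axis from the path tangent
    # (0 rad = facing +z is an assumed convention).
    tangents = np.gradient(waypoints_xz, axis=0)
    yaw = np.arctan2(tangents[:, 0], tangents[:, 1])
    return roots, yaw


# Synthetic scene: flat floor at y = 0 for x < 2, a 0.5 m platform for x >= 2.
xs, zs = np.meshgrid(np.linspace(0.0, 4.0, 41), np.linspace(-1.0, 1.0, 21))
ys = np.where(xs >= 2.0, 0.5, 0.0)
cloud = np.column_stack([xs.ravel(), ys.ravel(), zs.ravel()])

path = [(0.5, 0.0), (1.0, 0.0), (2.5, 0.0), (3.0, 0.0)]
roots, yaw = retarget_root_trajectory(cloud, path)
print(roots[:, 1])  # root heights along the path
print(yaw)          # headings in radians
```

In this toy scene the second half of the path crosses onto a 0.5 m platform. The root therefore rises by 0.5 m while the heading stays along +x. A full retargeting step would also adjust the whole-body pose, such as foot contacts, which this sketch ignores.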

human image animation, 3D motion retargeting, camera trajectory, diffusion models, point cloud, latent fusion, scene visibility, video generation, 3D environment, pose guidance
Authors
Deyin Liu, Jicheng Xu, Lin Yuanbo Wu, Xiaowei Zhao, Xiatian Zhu, Zhe Jin, Anjan Dutta
Abstract
Human image animation, which aims to generate a video of a reference subject following a provided action sequence, has received increasing research interest. With the development of diffusion-based and flow-based video foundation models, existing animation works have begun to upgrade the guidance information from 2D skeletons/poses to 3D modeling conditions. Despite achieving reasonable results, these approaches struggle to synthesize trajectory-controllable human motion within natural scenes under changing camera views. In this work, we present a scene-adaptive human image animation framework that controls both human motion and camera trajectories within a reconstructed 3D environment for video generation. To achieve this, we first develop a ground-adaptive 3D motion retargeting approach that enables user-friendly motion trajectory control and automatically adapts to changes in ground elevation and orientation. We then design a viewpoint-adaptive latent fusion mechanism that injects point-cloud geometric priors into the generative process through scene-visibility masking, providing precise guidance for viewpoint changes under camera control. Experiments on two standard human image animation benchmark datasets demonstrate that our method achieves remarkable improvements over state-of-the-art methods on the relevant video generation metrics. Project page: https://robinhood256100.github.io/web-disp
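The visibility-masked fusion described in the abstract can be illustrated with a generic masked blend. This sketch rests on assumptions the abstract does not state: a pinhole camera with intrinsics fx, fy, cx, cy; a point cloud already expressed in camera coordinates; an 8x spatial downsampling between pixels and latents; and a scene latent `z_scene` obtained by encoding a point-cloud render, with the encoder not shown. All function names are hypothetical, and the authors' mechanism inside the diffusion/flow model may differ.

```python
import numpy as np


def visibility_mask(points_cam, fx, fy, cx, cy, height, width):
    """Mark pixels hit by at least one scene point in front of the camera (z > 0).
    `points_cam` is an (N, 3) array in camera coordinates (assumed convention)."""
    mask = np.zeros((height, width), dtype=bool)
    front = points_cam[points_cam[:, 2] > 1e-6]
    # Pinhole projection to integer pixel coordinates.
    u = np.round(fx * front[:, 0] / front[:, 2] + cx).astype(int)
    v = np.round(fy * front[:, 1] / front[:, 2] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    mask[v[inside], u[inside]] = True
    return mask


def downsample_mask(mask, factor):
    """Max-pool a pixel mask to latent resolution: a latent cell counts as
    visible if any pixel inside it is visible (edge remainders are cropped)."""
    h, w = mask.shape
    hc, wc = h - h % factor, w - w % factor
    blocks = mask[:hc, :wc].reshape(hc // factor, factor, wc // factor, factor)
    return blocks.any(axis=(1, 3))


def fuse_latents(z_gen, z_scene, mask_latent, strength=1.0):
    """Blend scene latents into generated latents only where the scene is visible.
    z_gen, z_scene: (C, h, w) arrays; mask_latent: (h, w) boolean array."""
    m = strength * mask_latent.astype(z_gen.dtype)[None]  # broadcast over channels
    return m * z_scene + (1.0 - m) * z_gen


# Toy example: scene points cover only the left half of a 16x16 view.
H = W = 16
us, vs = np.meshgrid(np.arange(0, 8), np.arange(0, 16))
# Back-project pixel coordinates to depth z = 1 with fx = fy = 16, cx = cy = 8.
pts = np.column_stack([(us.ravel() - 8) / 16, (vs.ravel() - 8) / 16, np.ones(us.size)])

mask = visibility_mask(pts, 16, 16, 8, 8, H, W)
mask_lat = downsample_mask(mask, 8)             # 2x2 latent grid
z_gen = np.zeros((4, 2, 2), dtype=np.float32)   # stand-in generated latent
z_scene = np.ones((4, 2, 2), dtype=np.float32)  # stand-in scene latent
fused = fuse_latents(z_gen, z_scene, mask_lat)
print(mask_lat)
```

Only the latent cells covered by projected scene points take the scene latent. The remaining cells keep the generated latent, so the mask limits where geometric priors can override the generative process.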