Data-driven Head Motion Generation through Natural Gaze-Head Coordination

2026-05-25 · Computer Vision and Pattern Recognition

AI summary

The authors model how eye gaze and head movement coordinate over time, learning from large-scale in-the-wild facial videos. They built an automatic pipeline that collects natural, diverse examples of gaze and head motion, then trained a generative model that predicts plausible head movements from a gaze input. The same model supports gaze-controlled facial video generation with more realistic head motion. In human evaluation, raters preferred the method over baseline approaches by a statistically significant margin.

gaze estimation, head motion, temporal coordination, conditional Variational Autoencoder, generative model, facial video generation, probabilistic modeling, video synthesis, data-driven approach
Authors
Xiaohan Liu, Yilin Wen, Yusuke Sugano
Abstract
We present the first data-driven approach to model temporal gaze-head coordination from large-scale in-the-wild facial videos. To obtain training data for generalizable learning, we propose an automatic pipeline that extracts natural yet diverse gaze and head motions with off-the-shelf appearance-based gaze estimators. To capture the probabilistic correlation and temporal dynamics of gaze-head coordination, we build our model on a generative conditional Variational Autoencoder that produces plausible yet diverse head motion conditioned on gaze. We further apply our framework to gaze-controlled facial video generation, enabling videos with natural and realistic head motion correlated with the input gaze, an aspect that has not been emphasized before. Human evaluation and quantitative comparisons demonstrate our method's effectiveness and validate our design choices, with evaluators showing a statistically significant preference for our approach over baseline methods.
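
To make the conditional Variational Autoencoder idea concrete, the following is a minimal sketch of the two generic ingredients such a model relies on: the KL regularizer on the latent code and reparameterized sampling, plus a loop that draws several head motions for one gaze input. It is not the authors' architecture. The decoder `toy_decoder`, the scalar yaw representation, the `beta` weight, and the numeric values are hypothetical and chosen only for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so z stays a function of mu and logvar."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def toy_decoder(z, gaze_seq):
    """Hypothetical stand-in for a learned decoder (assumption, not from the paper).

    Head yaw is a fraction of gaze yaw at each frame; the latent z[0] shifts that
    fraction, so different z values give different head motions for the same gaze.
    """
    ratio = 0.5 + 0.1 * z[0]
    return [ratio * g for g in gaze_seq]


def cvae_loss(head_seq, recon, mu, logvar, beta=1.0):
    """Reconstruction error (mean squared) plus beta-weighted KL term."""
    mse = sum((h - r) ** 2 for h, r in zip(head_seq, recon)) / len(head_seq)
    return mse + beta * kl_to_standard_normal(mu, logvar)


# Generation: sample the latent from the prior, decode conditioned on gaze.
rng = random.Random(0)
gaze_yaw = [0.0, 5.0, 10.0, 15.0]  # made-up gaze yaw angles in degrees, one per frame
samples = [toy_decoder([rng.gauss(0.0, 1.0)], gaze_yaw) for _ in range(3)]
for head_yaw in samples:
    print([round(v, 2) for v in head_yaw])
```

Each printed row is one candidate head-yaw trajectory for the same gaze sequence. Drawing a fresh latent per sample is what gives a conditional VAE its "plausible yet diverse" outputs, whereas a deterministic regressor would return a single trajectory.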