Unleashing Guidance Without Classifiers for Human-Object Interaction Animation

2026-03-26 · Computer Vision and Pattern Recognition

AI summary

The authors introduce LIGHT, a method for generating more realistic animations of people interacting with objects. Unlike earlier methods that rely on hand-crafted contact rules, LIGHT derives its guidance from the pace of the step-by-step denoising process itself, without manually designed priors. It splits the animation into modality-specific components, denoises them on separate schedules, and lets cleaner components guide noisier ones, which makes the guidance inherently contact-aware. Training is also augmented with a wide range of synthetic object shapes so that the model generalizes across object geometries. Experiments show more believable interactions and better generalization to unseen objects than previous methods.

Human-Object Interaction (HOI), Animation, Diffusion Models, Denoising, Contact Priors, Cross-Attention, Classifier-Free Guidance, Synthetic Geometry, Modality-Specific Representation, Generalization
Authors
Ziyin Wang, Sirui Xu, Chuan Guo, Bing Zhou, Jiangshan Gong, Jian Wang, Yu-Xiong Wang, Liang-Yan Gui
Abstract
Generating realistic human-object interaction (HOI) animations remains challenging because it requires jointly modeling dynamic human actions and diverse object geometries. Prior diffusion-based approaches often rely on hand-crafted contact priors or human-imposed kinematic constraints to improve contact quality. We propose LIGHT, a data-driven alternative in which guidance emerges from the denoising pace itself, reducing dependence on manually designed priors. Building on diffusion forcing, we factor the representation into modality-specific components and assign individualized noise levels with asynchronous denoising schedules. In this paradigm, cleaner components guide noisier ones through cross-attention, yielding guidance without auxiliary classifiers. We find that this data-driven guidance is inherently contact-aware, and can be enhanced when training is augmented with a broad spectrum of synthetic object geometries, encouraging invariance of contact semantics to geometric diversity. Extensive experiments show that pace-induced guidance more effectively mirrors the benefits of contact priors than conventional classifier-free guidance, while achieving higher contact fidelity, more realistic HOI generation, and stronger generalization to unseen objects and tasks.
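
The asynchronous schedule described in the abstract can be pictured with a small sketch. It is only an illustration under stated assumptions: the component names (`object`, `human`), the linear schedule, and the `pace` multiplier are hypothetical and not taken from the paper. In LIGHT the guidance flows through cross-attention inside the denoiser, which this sketch does not model; it only shows how individualized noise levels make one component cleaner than another at the same sampling step.

```python
from __future__ import annotations


def noise_level(step: int, num_steps: int, pace: float = 1.0) -> float:
    """Noise level in [0, 1] for one component at a given sampling step.

    Assumption: a linear schedule. `step` runs from 0 (pure noise) to
    `num_steps` (clean). A component with pace > 1 is denoised faster
    and therefore becomes clean earlier than one with pace == 1.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return max(0.0, 1.0 - pace * step / num_steps)


def async_schedule(num_steps: int, paces: dict[str, float]) -> list[dict[str, float]]:
    """Per-step noise levels for every component, each with its own pace."""
    return [
        {name: noise_level(step, num_steps, pace) for name, pace in paces.items()}
        for step in range(num_steps + 1)
    ]


def guidance_pairs(levels: dict[str, float]) -> list[tuple[str, str]]:
    """Return (guide, guided) pairs: the guide is strictly cleaner.

    Hypothetical helper: it only identifies which component could act as
    the cleaner context for which noisier one at this step.
    """
    return [
        (guide, guided)
        for guide in levels
        for guided in levels
        if levels[guide] < levels[guided]
    ]


if __name__ == "__main__":
    # Hypothetical setup: the object component is denoised twice as fast.
    schedule = async_schedule(4, {"object": 2.0, "human": 1.0})
    for step, levels in enumerate(schedule):
        print(step, levels, guidance_pairs(levels))
```

With these assumed paces, both components start at full noise; one step in, the object component sits at noise level 0.5 while the human component is still at 0.75, so the object is the cleaner component that would guide the human one. Once both reach zero noise, no guidance pair remains.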