IMAGIN-4D: Image-Guided Controllable Interaction Generation

2026-06-22 · Computer Vision and Pattern Recognition

AI summary

The authors study how to create realistic human-object interactions for animation and AI by using text, object shapes, and movement paths as instructions. Since these instructions don't fully specify all details of the interaction, they add a reference image showing what the interaction should look like at one moment. They design a new method called IMAGIN-4D that breaks down the image information in both space and time, helping the system understand different parts of the interaction across frames. They also develop tools to create training data and to measure how well generated motions match the reference image. Their experiments show that IMAGIN-4D creates more detailed and accurate interactions than previous methods while still following the given movement paths.

Human-Object Interaction (HOI), Diffusion-based Generation, Body Pose, Object Pose, Spatial Relationships, Temporal Conditioning, AdaLN (Adaptive Layer Normalization), Motion Synthesis, Synthetic Data Generation, Image-Adherence Metric
Authors
Sai Kumar Dwivedi, Federica Bogo, Buğra Tekin, Chenhongyi Yang, Nadine Bertsch, Tomas Hodan, Michael J. Black, Dimitrios Tzionas, Shreyas Hampali
Abstract
Generating human-object interactions (HOI) is central to character animation, robotics, AR/VR, and embodied AI. Recent HOI generation methods synthesize motion from text, object geometry, and sparse waypoints, controlling action semantics and object trajectories. However, these signals underspecify interaction: the same prompt and trajectory can produce different grasps, approach directions, body poses, object poses, contacts, and body-object layouts. We address this ambiguity with a reference image as a visual specification of the desired interaction snapshot. Yet a single global image representation conflates distinct cues and conditions all frames on identical visual evidence. We therefore introduce IMAGIN-4D, a diffusion-based HOI generator that decomposes image conditioning spatio-temporally. For spatial conditioning, IMAGIN-4D extracts supervised interaction-state tokens for body pose, object pose, body-object contact, and spatial relationships at the depicted frame. For temporal conditioning, it computes frame-aware tokens by querying image patches per generated frame, allowing sequence segments to attend to different visual cues from the same image. To balance image, text, and waypoint cues, IMAGIN-4D uses role-aware conditioning: text, waypoints, and interaction-state tokens use separate AdaLN streams, while frame-aware visual tokens cross-attend with motion tokens. Since HOI motion datasets lack paired images, we build a synthetic motion-to-image rendering pipeline from FullBodyManipulation (FBM) and introduce an image-adherence metric to evaluate whether generated motions match the reference snapshot. Experiments on FBM and BEHAVE show that IMAGIN-4D improves fine-grained interaction control over single-token and uniformly image-conditioned baselines while preserving waypoint-following and motion quality. Code and models will be released at https://imagin4d.github.io.
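
The role-aware conditioning described in the abstract can be pictured with a small NumPy sketch. This is not the authors' implementation: the abstract gives no layer details, so the tensor sizes, the random stand-in weights, the order in which the AdaLN streams are applied, the residual connection, and the helper names `attend` and `adaln` are all illustrative assumptions. It only shows the two generic mechanisms the abstract names: per-frame queries attending over image patches, and adaptive layer normalization driven by separate conditioning vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Single-head scaled dot-product attention.
    queries: (N, d), keys/values: (M, d) -> output (N, d), weights (N, M)."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights

def adaln(x, cond, w_scale, w_shift, eps=1e-5):
    """Adaptive LayerNorm: normalize x per token, then apply a scale and
    shift predicted from the conditioning vector.
    x: (T, d), cond: (c,), w_scale/w_shift: (c, d)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return x_hat * (1.0 + cond @ w_scale) + cond @ w_shift

rng = np.random.default_rng(0)
T, P, d, c = 8, 16, 32, 12           # frames, image patches, widths (assumed)

motion = rng.normal(size=(T, d))      # noisy motion tokens, one per frame
patches = rng.normal(size=(P, d))     # patch features of the reference image
frame_queries = rng.normal(size=(T, d))  # hypothetical per-frame queries

# Temporal conditioning: each generated frame queries the image patches,
# so different frames can pick up different visual cues from one image.
visual_tokens, patch_weights = attend(frame_queries, patches, patches)

# Role-aware conditioning: one AdaLN stream per signal (random stand-ins
# for the text, waypoint, and interaction-state embeddings and weights).
h = motion
for role in ("text", "waypoints", "interaction_state"):
    cond = rng.normal(size=c)
    w_scale = 0.02 * rng.normal(size=(c, d))
    w_shift = 0.02 * rng.normal(size=(c, d))
    h = adaln(h, cond, w_scale, w_shift)

# Motion tokens cross-attend to the frame-aware visual tokens
# (the residual connection is an assumption).
update, _ = attend(h, visual_tokens, visual_tokens)
out = h + update
print(out.shape, patch_weights.shape)
```

Keeping the image on the attention path while text, waypoints, and interaction-state tokens modulate normalization statistics is one way to read the abstract's "role-aware" split; the actual architecture may combine these signals differently.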