OneHOI: Unifying Human-Object Interaction Generation and Editing
2026-04-15 • Computer Vision and Pattern Recognition • Multimedia
AI summary
The authors developed OneHOI, a new model that can both create and change human-object interaction scenes using a single system. Their approach uses a special transformer called R-DiT, which understands how actions connect people and objects in a scene. It combines different data types and conditions, making it better at handling complex interactions than previous methods. They trained and tested OneHOI on several datasets and achieved top performance in creating and editing interaction scenes.
Human-Object Interaction, Diffusion Transformer, Scene Generation, Scene Editing, Structured Interaction Representation, Action Grounding, Attention Mechanism, Multimodal Learning, Pose Estimation, Layout Conditioning
Authors
Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan
Abstract
Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.
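The abstract's ideas of enforcing interaction topology via structured attention and disentangling multiple HOI instances via positional encoding can be illustrated with a toy sketch. The paper does not specify R-DiT's token layout or masking rules, so everything below (three role tokens per triplet, a block-diagonal mask, a per-instance position offset) is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

# Hypothetical token layout: each HOI instance contributes three role tokens,
# one per slot of the <person, action, object> triplet.
triplets = [("person", "ride", "horse"), ("person", "hold", "umbrella")]
ROLES_PER_HOI = 3
n = len(triplets) * ROLES_PER_HOI

# Block-diagonal attention mask (assumed): tokens attend only within their own
# triplet, one simple way to encode interaction topology in attention.
mask = np.zeros((n, n), dtype=bool)
for i in range(len(triplets)):
    s = i * ROLES_PER_HOI
    mask[s:s + ROLES_PER_HOI, s:s + ROLES_PER_HOI] = True

# Instance-aware position indices (a RoPE-flavoured idea, details assumed):
# shift each triplet's indices by a large per-instance offset so that
# rotary-style encodings keep multi-HOI scenes separated.
OFFSET = 100  # assumed gap between HOI instances
positions = np.array(
    [(i // ROLES_PER_HOI) * OFFSET + (i % ROLES_PER_HOI) for i in range(n)]
)

print(mask.astype(int))
print(positions)
```

Here the mask allows "person"–"ride"–"horse" tokens to attend to one another while blocking attention to the second triplet, and the position indices for the two instances are 0–2 and 100–102, keeping them far apart in positional space.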