OrthoMotion: Disentangling Camera and Subject Motion via Geometry Semantics Orthogonal Attention

2026-06-22
Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors address the problem of controlling camera movement and subject motion independently in generated video. They show that standard 2D conditioning entangles the two controls in a way that cannot be separated from the images alone. Their method, OrthoMotion, routes the two motions into algebraically complementary attention channels: a rotation of the rotary position embedding (RoPE) phase for the camera, and a gated value injection in cross-attention for the subject. A lightweight decoupling regularizer keeps the two controls from interfering, improving accuracy without reducing video quality. They also introduce a metric, Cross-Talk Error (CTE), that measures how much the controls mix, and report that their method cuts it by more than 2.4x.

controllable video generation, optical flow, inverse-depth scaling, attention operator, rotary position embedding (RoPE), cross-attention, disentanglement, orthogonality, cross-talk error (CTE), affine transformation
Authors
Zijie Meng
Abstract
Controllable video generation demands independent command of the camera and the subject, yet 2D conditioning entangles them: camera- and object-induced optical flow share the same inverse-depth (1/Z) scaling and cannot be separated from image evidence alone. We first prove that this entanglement is representational, not architectural -- the 2D camera/object split is a non-identifiable inverse problem -- and therefore reframe decoupling as a question of operator design. We resolve it at the level of the attention operator. OrthoMotion routes camera motion into a geometric channel, a norm-preserving rotation of the rotary position embedding (RoPE) phase, and subject motion into a semantic channel, a gated value injection in cross-attention. Because these sub-operators are algebraically complementary -- a rotation versus a translation of the affine action on tokens -- a lightweight decoupling regularizer provably drives their response subspaces to orthogonality, so the two controls stop interfering. To our knowledge OrthoMotion is the first method to guarantee disentanglement by construction rather than hope for it to emerge. It attains state-of-the-art camera and subject accuracy at once while minimizing cross-talk, which we quantify with a new Cross-Talk Error (CTE) metric, cutting cross-talk by more than 2.4x with no loss in fidelity and generalizing across backbones.
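Two claims in the abstract can be illustrated with a small numerical sketch. This is not the OrthoMotion implementation. It is a toy example under stated assumptions: a pinhole camera with purely lateral translation, one rotation angle per token pair standing in for a RoPE phase shift, and a fixed scalar gate. All function names and numeric values (`lateral_flow`, `rotate_pairs`, `gated_inject`, the focal length, the phases) are hypothetical. The first part shows that two different camera/object splits yield the same 2D flow under shared 1/Z scaling. The second shows that a phase rotation preserves a token's norm, whereas an additive gated injection acts as a translation and generally changes it.

```python
import math

def lateral_flow(f, depth, cam_tx, obj_vx):
    """Image-plane shift of a point at `depth` for a pinhole camera with focal
    length f, under small lateral camera (cam_tx) and object (obj_vx) motion.
    Both terms share the same 1/depth factor, so only their difference shows."""
    return f * (obj_vx - cam_tx) / depth

def rotate_pairs(x, phases):
    """Rotate each consecutive (even, odd) pair of x by its own angle (radians),
    mimicking a RoPE-style phase shift. Assumes len(x) == 2 * len(phases)."""
    out = []
    for k, phase in enumerate(phases):
        a, b = x[2 * k], x[2 * k + 1]
        c, s = math.cos(phase), math.sin(phase)
        out.extend([a * c - b * s, a * s + b * c])
    return out

def gated_inject(v, gate, delta):
    """Additive (translation-like) update of a value vector: v + gate * delta."""
    return [vi + gate * di for vi, di in zip(v, delta)]

def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(xi * xi for xi in x))

# 1) Two different (camera, object) motion splits give the same observed flow.
flow_a = lateral_flow(f=500.0, depth=4.0, cam_tx=0.10, obj_vx=0.30)
flow_b = lateral_flow(f=500.0, depth=4.0, cam_tx=-0.10, obj_vx=0.10)
same_flow = math.isclose(flow_a, flow_b)

# 2) Rotation keeps the token norm; additive injection generally does not.
token = [1.0, 2.0, -0.5, 0.25]
rotated = rotate_pairs(token, phases=[0.7, -1.3])            # hypothetical phases
injected = gated_inject(token, 0.5, [0.2, -0.1, 0.4, 0.0])   # hypothetical gate and offset

norm_preserved = math.isclose(norm(rotated), norm(token))
norm_changed = not math.isclose(norm(injected), norm(token))
```

In this toy setting `same_flow` holds because the flow depends only on the difference between object and camera motion, which is one simple instance of the non-identifiability the abstract describes. The orthogonality guarantee itself rests on the paper's regularizer and proof, which this sketch does not reproduce.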