AI summary
The authors created a method called homographic navigation to help cameras precisely capture flat surfaces like pictures or screens. They use a mathematical tool called homography not just as a result but as a core part of their process, allowing them to teach a model to recognize and locate rectangular objects from just one example image by making many virtual variations. To get very accurate results even with low-resolution inputs, they developed a two-step approach: first finding the object roughly and then refining the location more closely. Their method also estimates how confident it is about each part it finds. Tests show this approach can accurately align images with minimal initial information, useful for guiding cameras and analyzing videos.
homography, planar regions, keypoint prediction, data augmentation, camera guidance, two-pass inference, confidence estimation, image alignment, synthetic training data, Stable Warp training
Authors
Dominik Kroupa, Marek Vaško, Muh Yuzril Ihza Baharuddin, Adam Herout
Abstract
We present homographic navigation, a geometry-centric framework for guiding camera acquisition toward precise capture of planar regions. Rather than treating homography as an output, we use it as an organizing variable that unifies learning, alignment, and evaluation. From a single annotated reference image, we generate unlimited synthetic training data via homographic augmentation and train a single-shot model for joint recognition and localization of multiple artifacts (physical objects with a rectangular planar target) through sparse keypoint prediction. To address precision under limited model input resolution, we introduce a two-pass inference scheme with global detection followed by localized refinement, and a Stable Warp training strategy that significantly improves accuracy, particularly in the high-precision regime. The model also predicts a confidence estimate for each predicted keypoint and for the sample as a whole. Experimental results demonstrate that accurate planar alignment can be achieved from minimal supervision, providing a foundation for geometry-driven camera guidance and future learning from in-the-wild video data.
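As an illustration of the homographic augmentation idea, the following is a minimal sketch and not the authors' implementation. It assumes a random homography is obtained by jittering the four image corners and solving for the transform with the direct linear transform (DLT); the annotated keypoints of the reference image are then mapped through the same homography to give labels for the synthetic sample. The function names, the `max_shift` parameter, and the corner-jitter sampling scheme are all hypothetical choices made for this example.

```python
import numpy as np


def homography_from_points(src, dst):
    """Estimate the 3x3 homography mapping src -> dst (four 2D points each) via DLT."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]


def apply_homography(H, pts):
    """Map an (N, 2) array of points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def random_homography(width, height, max_shift=0.2, rng=None):
    """Sample a homography by jittering the image corners (assumed scheme).

    max_shift is the maximum corner displacement as a fraction of image size.
    """
    rng = np.random.default_rng() if rng is None else rng
    corners = np.array(
        [[0, 0], [width, 0], [width, height], [0, height]], dtype=float
    )
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * [width, height]
    return homography_from_points(corners, corners + jitter)


# Hypothetical annotation: corners of a rectangular target in the reference image.
reference_keypoints = np.array(
    [[100, 100], [300, 100], [300, 250], [100, 250]], dtype=float
)

rng = np.random.default_rng(0)
H = random_homography(640, 480, rng=rng)
synthetic_keypoints = apply_homography(H, reference_keypoints)
```

To produce the matching synthetic image, the reference pixels would be warped with the same `H`, for example with OpenCV's `cv2.warpPerspective`. Because the labels come from the geometry itself, each sampled homography yields a new image and keypoint pair without further annotation.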