RT-DocLayout: Real-Time End-to-End Document Layout Analysis with Reading Order in the Wild

2026-06-22 · Computer Vision and Pattern Recognition

AI summary

The authors developed RT-DocLayout, a fast and efficient system to analyze the layout of documents by combining several tasks like detecting elements, segmenting them, and figuring out reading order all in one model. Their approach improves upon previous methods that were slower or used many separate steps, causing errors to add up. By using a single model that learns to understand both the shapes and the structure of the document, they made the system more robust to real-world problems like warped paper. Tests show that their method works very well and is fast enough for practical use, helping improve how documents are processed and read automatically.

document layout analysis, multi-task learning, RT-DETR, end-to-end framework, reading order prediction, pixel-level segmentation, object detection, transformer model, OCR, real-time inference
Authors
Cheng Cui, Tingquan Gao, Xueqing Wang, Changda Zhou, Hongen Liu, Ting Sun, Yubo Zhang, Zelun Zhang, Jiaxuan Liu, Manhui Lin, Yue Zhang, Suyin Liang, Yiqing Xiang, Yi Liu
Abstract
Accurate document layout analysis remains a critical bottleneck for document parsing systems, due to the intricate coupling among heterogeneous document layout elements, geometric distortions (e.g., paper warping and bending, perspective variations), and reading order within diverse layout structures. Existing approaches typically rely on fragmented multi-stage pipelines or computationally heavy generative Transformer architectures, leading to error propagation and limited efficiency. In this paper, we present RT-DocLayout, a highly efficient end-to-end framework for document layout analysis, designed as a front-end for document parsing tasks. The proposed model unifies classification, detection, pixel-level segmentation, and reading order prediction for layout elements within a single 33M-parameter architecture. Built upon RT-DETR, our key contribution is a unified multi-task formulation within a single query-based decoder that simultaneously classifies elements, regresses bounding boxes, generates masks, and constructs relationships among elements to infer reading order. By jointly learning geometric and structural representations, RT-DocLayout introduces multi-task optimization that substantially improves robustness under real-world document distortions. Extensive experiments on public benchmarks demonstrate state-of-the-art performance in document layout analysis while maintaining real-time inference speed (132.1 FPS). When coupled with downstream OCR engines, RT-DocLayout significantly improves full-document reconstruction quality, providing a scalable and practical foundation for real-world document intelligence systems.
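
The abstract does not specify how the predicted relationships are turned into a reading order. As a minimal illustration only, the sketch below assumes the relationship output can be viewed as a square matrix of pairwise "element i is read before element j" scores, and ranks elements by how strongly they precede the others. The function name `decode_reading_order`, the score matrix, and the vote-based decoding are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def decode_reading_order(precede_scores: Sequence[Sequence[float]]) -> List[int]:
    """Return element indices sorted into a reading order.

    precede_scores[i][j] is a hypothetical score in [0, 1] that element i
    is read before element j. The diagonal is ignored.
    """
    n = len(precede_scores)
    if any(len(row) != n for row in precede_scores):
        raise ValueError("precede_scores must be a square matrix")

    # Vote total: how strongly element i is believed to precede all others.
    votes = [
        sum(precede_scores[i][j] for j in range(n) if j != i)
        for i in range(n)
    ]
    # Elements with more "precedes" votes come earlier; ties keep index order.
    return sorted(range(n), key=lambda i: (-votes[i], i))


# Toy example with three layout elements (0 = body, 1 = title, 2 = footer).
scores = [
    [0.0, 0.1, 0.9],  # body: precedes footer, but not title
    [0.9, 0.0, 0.8],  # title: precedes both
    [0.1, 0.2, 0.0],  # footer: precedes neither
]
order = decode_reading_order(scores)
print(order)  # [1, 0, 2]
```

Vote-based ranking is one simple way to get a total order out of pairwise scores that may be mutually inconsistent. A real system might instead use a learned ordering head or a graph-based decoding step.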