Unlimited OCR Works

2026-06-22 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Computation and Language

AI summary

The authors address a problem in end-to-end OCR systems that use a large language model as the decoder: as the output grows longer, the accumulated KV cache increases memory use and slows generation. They propose Unlimited OCR, which replaces the decoder's attention layers with Reference Sliding Window Attention (R-SWA) so that the KV cache stays constant in size throughout decoding. Combined with the high compression rate of the DeepSeek OCR encoder, this lets the model transcribe dozens of pages in a single forward pass within a standard 32K maximum length. The authors state that R-SWA also applies to other parsing tasks such as speech recognition (ASR) and translation.

OCR, end-to-end models, attention mechanism, language model decoder, KV cache, memory consumption, Reference Sliding Window Attention, parsing working memory, ASR, translation

Authors

Youyang Yin, Huanhuan Liu, YY, Qunyi Xie, Chaorun Liu, Shiqi Yang, Shaohua Wang, Zhanlong Liu, Hao Zou, Jinyue Chen, Shu Wei, Jingjing Wu, Mingxin Huang, Zhen Wu, Guibin Wang, Tengyu Du, Lei Jia

Abstract

Recently, end-to-end OCR models, exemplified by DeepSeek OCR, have once again thrust OCR into the spotlight. A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation. This stands in stark contrast to humans, who exhibit no such decline in efficiency during long-horizon copying tasks. In this technical report, we propose Unlimited OCR, a model designed to emulate human parsing working memory. Taking DeepSeek OCR as the baseline, we replace all attention layers in the decoder with our proposed Reference Sliding Window Attention (R-SWA), which reduces attention computation costs while maintaining a constant KV cache throughout the entire decoding process. By combining the high compression rate of DeepSeek OCR's encoder with our constant KV cache design, Unlimited OCR can transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K. More importantly, R-SWA is a general-purpose parsing attention mechanism: beyond OCR, it is equally applicable to tasks such as ASR and translation. Code and model weights are publicly available at http://github.com/baidu/Unlimited-OCR.
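
The abstract does not spell out how R-SWA is built, so the following is only a toy sketch of the constant-KV-cache idea, not the authors' implementation. It assumes that "reference" entries (for example, tokens derived from the encoder output) stay visible at every step, while only the most recent `window` decoded entries are retained. The names `BoundedKVCache` and `simulate`, and the sizes used, are hypothetical and chosen for illustration.

```python
from collections import deque


class BoundedKVCache:
    """Toy cache: fixed reference entries plus the last `window` decoded entries.

    Assumption (not from the paper): reference entries are never evicted.
    """

    def __init__(self, reference, window):
        self.reference = list(reference)    # assumed always visible
        self.recent = deque(maxlen=window)  # oldest decoded entry is evicted automatically

    def append(self, entry):
        self.recent.append(entry)

    def visible(self):
        # Entries a new query could attend to under this assumed layout.
        return self.reference + list(self.recent)

    def __len__(self):
        return len(self.reference) + len(self.recent)


def simulate(num_ref, window, steps):
    """Return (bounded cache sizes, full cache sizes) after each decoding step."""
    cache = BoundedKVCache([f"ref{i}" for i in range(num_ref)], window)
    bounded, full = [], []
    for t in range(steps):
        cache.append(f"tok{t}")
        bounded.append(len(cache))
        full.append(num_ref + t + 1)  # a full-attention cache keeps every entry
    return bounded, full


if __name__ == "__main__":
    bounded, full = simulate(num_ref=4, window=3, steps=6)
    print(bounded)  # [5, 6, 7, 7, 7, 7]
    print(full)     # [5, 6, 7, 8, 9, 10]
```

Under these assumptions the bounded cache stops growing once the window fills (at `num_ref + window` entries), whereas the full cache grows by one entry per generated token. That contrast is the memory behavior the abstract describes.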
