Koshur Pixel: A Large-Scale Synthetic OCR Dataset for Kashmiri

2026-06-22 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Computation and Language

AI summary

The authors created a large dataset called Koshur Pixel to help computers read Kashmiri text, which is hard to recognize because of its Nastaliq script style and complex letter shapes. The dataset contains over 600,000 images of text paired with their correct transcriptions, generated synthetically to avoid manual labeling. It covers multiple fonts and text lengths, from single words to full pages, and simulates real-world document degradation so that the images more closely resemble real scans. The work aims to support better OCR systems for Kashmiri, a language with very few existing digital text resources.

Optical Character Recognition, Kashmiri language, Perso-Arabic Nastaliq script, Synthetic dataset, Glyph shaping, Ligatures, Data augmentation, OCR training data

Authors

Haq Nawaz Malik, Faizan Iqbal, Nahfid Nissar

Abstract

Optical Character Recognition (OCR) for low-resource languages is often constrained by the lack of annotated training data and the complexity of script-specific rendering. Kashmiri, written primarily in the Perso-Arabic Nastaliq script, presents additional challenges due to contextual glyph shaping, dense ligatures, and orthographic variability. We introduce Koshur Pixel, the first large-scale synthetic OCR dataset for Kashmiri, comprising 613,078 image-text pairs generated from the KS-PRET-5M corpus using the SynthOCR-Gen framework. The dataset spans multiple fonts and textual granularities, ranging from individual words to full-page documents, and incorporates more than 25 augmentation strategies that emulate real-world document degradations. Koshur Pixel provides a scalable and cost-effective alternative to manual annotation, establishing a foundational resource for training OCR systems, digitizing Kashmiri textual heritage, and advancing language technologies for a severely under-resourced language.
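
To make the idea of degradation-style augmentation concrete, here is a minimal sketch in plain Python. It is not the SynthOCR-Gen implementation, and the two degradations shown (salt-and-pepper noise and ink fading) are illustrative stand-ins rather than items from the paper's list of 25+ strategies. The function names, the parameter values, and the tiny placeholder image are all assumptions made for this example. The point it shows is that the image is degraded while the paired text label stays unchanged.

```python
import random

def salt_and_pepper(image, rate, rng):
    """Set a random fraction of pixels to pure black (0) or white (255)."""
    noisy = [row[:] for row in image]  # copy so the input is not modified
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < rate:
                row[x] = rng.choice((0, 255))
    return noisy

def fade(image, strength):
    """Blend every pixel toward white to mimic faded ink (strength in [0, 1])."""
    return [[round(p + (255 - p) * strength) for p in row] for row in image]

def augment_pair(image, text, seed=0):
    """Degrade the image only; the ground-truth text is returned untouched."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    degraded = fade(salt_and_pepper(image, rate=0.05, rng=rng), strength=0.3)
    return degraded, text

# Placeholder 8x32 grayscale "text line": white page with a dark band.
image = [[0 if 3 <= y <= 4 else 255 for _ in range(32)] for y in range(8)]
label = "کٲشُر"  # placeholder label string for illustration

degraded, label_out = augment_pair(image, label)
```

In a real pipeline the image would come from rendering corpus text with a Nastaliq font through a shaping-aware renderer, which this sketch deliberately leaves out.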
