Koshur Pixel: A Large-Scale Synthetic OCR Dataset for Kashmiri
2026-06-22 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition • Computation and Language
AI summary
The authors created a large dataset called Koshur Pixel to help computers read Kashmiri text, which is hard because of its Nastaliq script style and complicated letter shapes. The dataset includes over 600,000 images of text paired with their correct transcriptions, generated synthetically to avoid the need for manual labeling. It covers multiple fonts and text lengths, from single words to full pages, and simulates real-world document damage to make the dataset more useful. This work aims to support building better OCR systems for Kashmiri, a language with very few existing digital text resources.
Optical Character Recognition • Kashmiri language • Perso-Arabic Nastaliq script • Synthetic dataset • Glyph shaping • Ligatures • Data augmentation • OCR training data
Authors
Haq Nawaz Malik, Faizan Iqbal, Nahfid Nissar
Abstract
Optical Character Recognition (OCR) for low-resource languages is often constrained by the lack of annotated training data and the complexity of script-specific rendering. Kashmiri, written primarily in the Perso-Arabic Nastaliq script, presents additional challenges due to contextual glyph shaping, dense ligatures, and orthographic variability. We introduce Koshur Pixel, the first large-scale synthetic OCR dataset for Kashmiri, comprising 613,078 image-text pairs generated from the KS-PRET-5M corpus using the SynthOCR-Gen framework. The dataset spans multiple fonts and textual granularities, ranging from individual words to full-page documents, and incorporates more than 25 augmentation strategies that emulate real-world document degradations. Koshur Pixel provides a scalable and cost-effective alternative to manual annotation, establishing a foundational resource for training OCR systems, digitizing Kashmiri textual heritage, and advancing language technologies for a severely under-resourced language.
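To make the idea of degradation-style augmentation concrete, the following is a minimal sketch of how a rendered text image might be perturbed while its transcription stays fixed. It is not the SynthOCR-Gen API and does not reproduce any of the 25+ strategies the abstract mentions: the function names (`salt_and_pepper`, `fade`, `augment`), the parameter values, and the toy image are all assumptions made for illustration, and rendering of Nastaliq text is left out entirely.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Grayscale image as rows of pixels: 0 = black ink, 255 = white paper.
Image = List[List[int]]
Augmentation = Callable[[Image, random.Random], Image]


def salt_and_pepper(img: Image, rng: random.Random, amount: float = 0.02) -> Image:
    """Set a random fraction of pixels to pure black or white (speckle noise)."""
    out = [row[:] for row in img]  # copy so the input is not mutated
    for row in out:
        for x in range(len(row)):
            if rng.random() < amount:
                row[x] = rng.choice((0, 255))
    return out


def fade(img: Image, rng: random.Random, strength: float = 0.4) -> Image:
    """Pull every pixel toward paper white, imitating faded print."""
    return [[int(p + (255 - p) * strength) for p in row] for row in img]


def augment(
    img: Image,
    label: str,
    rng: random.Random,
    ops: Sequence[Augmentation],
    max_ops: int = 2,
) -> Tuple[Image, str]:
    """Apply a random subset of augmentations; the text label is unchanged."""
    k = rng.randint(1, min(max_ops, len(ops)))
    for op in rng.sample(list(ops), k=k):
        img = op(img, rng)
    return img, label


# Hypothetical stand-in for a rendered word image: white page, one black stroke.
clean = [[255] * 16 for _ in range(8)]
clean[4] = [0] * 16

rng = random.Random(0)  # fixed seed so the synthetic sample is reproducible
noisy, text = augment(clean, "کٲشُر", rng, [salt_and_pepper, fade])
```

Because each augmentation only alters pixels, the ground-truth text can be carried through unchanged, which is what lets a synthetic pipeline produce labeled pairs without manual annotation.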