Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Lexical Markup Framework and TEI Lex-0

2026-06-16 • Computation and Language

AI summary

The authors developed a method to turn the Al-Mawrid Arabic-English dictionary, originally in print, into a digital format that computers can easily use. They combined two standards (ISO LMF and TEI Lex-0) to handle the dictionary's complex structure and fixed issues like inconsistent punctuation. By testing their method on part of the dictionary, they showed it works well at accurately capturing word meanings and details. They also compared their work to other Arabic language resources and suggested ways to include it in the wider language data web. This work helps both computer language processing and researchers working with Arabic texts.

Al-Mawrid dictionary • ISO Lexical Markup Framework (LMF) • Text Encoding Initiative (TEI) Lex-0 • digitization • bilingual lexicon • morpho-semantic features • Linked Open Data (LOD) • Arabic NLP • lexical knowledge density • semantic web

Authors

Diaa Fayed, Laurent Romary

Abstract

This paper presents a methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard approach that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative (TEI) Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study supports the encoding process with quantitative evidence, reporting a structural parsing accuracy of 91%. Evaluation of the information extraction rules shows 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource, together with a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons, intended for the Arabic NLP and Digital Humanities communities.
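To make the encoding and the prefix-based referencing more concrete, the following is a minimal Python sketch of what a single TEI Lex-0 style entry with a synonym cross-reference might look like, and how a prefixed pointer could be expanded into a full URI. It is an illustration under stated assumptions, not the authors' actual encoding: the element and attribute choices follow a general reading of TEI Lex-0 (`entry`, `form`, `gramGrp`, `sense`, `cit`, `xr`), while the `mawrid` prefix, the `example.org` base URI, the entry identifiers, the helper names, and the sample word are all hypothetical and are not transcribed from Al-Mawrid.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"
XML_LANG = f"{{{XML_NS}}}lang"

# Hypothetical prefix table: a short prefix mapped to a base URI (assumption).
PREFIXES = {"mawrid": "https://example.org/al-mawrid/entry/"}


def tei(tag):
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_entry(entry_id, lemma, pos, equivalents, synonym_ids):
    """Build one dictionary entry as an ElementTree element."""
    entry = ET.Element(tei("entry"), {XML_ID: entry_id, XML_LANG: "ar"})

    # Headword (lemma) form.
    form = ET.SubElement(entry, tei("form"), {"type": "lemma"})
    ET.SubElement(form, tei("orth")).text = lemma

    # Part of speech.
    gram_grp = ET.SubElement(entry, tei("gramGrp"))
    ET.SubElement(gram_grp, tei("gram"), {"type": "pos"}).text = pos

    # A single sense holding English equivalents and synonym pointers.
    sense = ET.SubElement(entry, tei("sense"), {XML_ID: f"{entry_id}.s1"})
    for equivalent in equivalents:
        cit = ET.SubElement(
            sense, tei("cit"), {"type": "translationEquivalent", XML_LANG: "en"}
        )
        ET.SubElement(cit, tei("quote")).text = equivalent
    for synonym_id in synonym_ids:
        xr = ET.SubElement(sense, tei("xr"), {"type": "synonymy"})
        # Prefixed pointer instead of a full URI.
        ET.SubElement(
            xr, tei("ref"), {"type": "entry", "target": f"mawrid:{synonym_id}"}
        )
    return entry


def expand(pointer):
    """Expand a prefixed pointer to a full URI; leave other pointers as-is."""
    prefix, sep, local = pointer.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return pointer


# Illustrative data only (hypothetical ids, sample word chosen for the sketch).
entry = build_entry(
    entry_id="ayn.0001",
    lemma="علم",
    pos="noun",
    equivalents=["knowledge", "science"],
    synonym_ids=["mim.0042"],
)
targets = [ref.get("target") for ref in entry.iter(tei("ref"))]
expanded = [expand(target) for target in targets]
xml_text = ET.tostring(entry, encoding="unicode")
```

Keeping pointers in the short `prefix:local` form leaves the entries compact and lets the base URI be changed in one place, which is the kind of scalability a prefix-based scheme is generally meant to offer when a lexicon is later published as linked data.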