GLiNER2-PII: A Multilingual Model for Personally Identifiable Information Extraction
2026-05-11 • Computation and Language
Computation and Language • Artificial Intelligence
AI summary
The authors present GLiNER2-PII, a small 0.3B-parameter model that identifies 42 types of personal information in text at character-span resolution. Because real PII data is scarce and risky to collect, they generated a synthetic multilingual corpus of 4,910 annotated texts covering varied domains and formats. On the SPY benchmark, the model achieves the highest span-level F1 among the five systems compared. The model is publicly available on Hugging Face for others to use and improve.
Personally Identifiable Information (PII), Named Entity Recognition, Synthetic Data Generation, Multilingual NLP, Character-span Resolution, Data Privacy, Machine Learning Models, F1 Score, Natural Language Processing, Benchmarking
Authors
Urchade Zaratiana, Ash Lewis, George Hurn-Maloney
Abstract
Reliable detection of personally identifiable information (PII) is increasingly important across modern data-processing systems, yet the task remains difficult: PII spans are heterogeneous, locale-dependent, context-sensitive, and often embedded in noisy or semi-structured documents. We present GLiNER2-PII, a small 0.3B-parameter model adapted from GLiNER2 and designed to recognize a broad taxonomy of 42 PII entity types at character-span resolution. Training such systems, however, is constrained by the scarcity of shareable annotated data and the privacy risks associated with collecting real PII at scale. To address this challenge, we construct a multilingual synthetic corpus of 4,910 annotated texts using a constraint-driven generation pipeline that produces diverse, realistic examples across languages, domains, formats, and entity distributions. On the challenging SPY benchmark, GLiNER2-PII achieves the highest span-level F1 among five compared systems, including OpenAI Privacy Filter and three GLiNER-based detectors. We publicly release the model on Hugging Face to support further research and practical deployment of open PII detection systems.
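To make "span-level F1 at character-span resolution" concrete, here is a minimal sketch of how such a score can be computed. It assumes exact-match scoring, where a predicted span counts as correct only if its start offset, end offset, and label all match a gold span. The abstract does not state the matching rule used on SPY, so this is an illustrative assumption rather than the authors' evaluation code. The `Span` type, the `span_f1` function, and the example text and labels are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type (hypothetical label names below)


def span_f1(gold: set[Span], pred: set[Span]) -> dict[str, float]:
    """Exact-match span-level precision, recall, and F1 (assumed matching rule)."""
    tp = len(gold & pred)  # spans whose offsets and label all agree
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


text = "Contact Jane Doe at jane@example.com."
gold = {Span(8, 16, "person"), Span(20, 36, "email")}
# The predicted email span stops one character early, so it does not match.
pred = {Span(8, 16, "person"), Span(20, 35, "email")}
scores = span_f1(gold, pred)
```

Under exact matching, a boundary error of a single character costs both a false positive and a false negative, which is one reason span-level scores tend to be stricter than token-level ones.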