DialogPII: A multilingual dataset of synthetic dialog transcripts to detect personal information

2026-06-29 • Computation and Language

AI summary

The authors created DialogPII, a dataset of synthetic conversations designed to help computers find and remove personal information from spoken and written data. The dialogs cover eight real-life scenarios, such as emergency calls and therapy sessions, and span 11 languages and 19 types of personal data. The authors generated the dialogs with language models, manually curated them for plausibility and diversity, converted them to speech, and then transcribed and annotated them. They also release baseline models for recognizing personal entities and report checks on the quality of the dataset and its annotations.

de-identification, named entity recognition, multilingual datasets, synthetic dialogs, text-to-speech synthesis, transcription, annotation, transformer models, personal information, inter-annotator agreement

Authors

Roland Roller, Vera Czehmann, Derya Erman, Luke Flanagan, Ibrahim Baroud, Frédéric Blain, Viviana Cotik, Eletta Giusto, Akhil Juneja, Mariana Neves, Maria Słowińska, Christine Hovhannisyan, Aaron Louis Eidt, Lisa Raithel, Sebastian Möller, Maija Poikela

Abstract

Conversational data collected in domains such as healthcare or social sciences is a valuable resource for research and automated analysis. However, responsible data sharing requires the detection and removal of personally identifiable and sensitive information to protect individual privacy. To support the development and evaluation of automatic de-identification systems, we present DialogPII, a multilingual dataset of synthetic dialogs and speech-derived transcripts for personal information detection. DialogPII covers eight interaction scenarios (emergency calls, medical anamnesis interviews, therapy sessions, insurance communication, customer support, clinical interviews regarding an AI-supported dashboard, police reports, and group therapy discussions), 19 entity types, and 11 languages (English, Arabic, Finnish, French, German, Hindi, Italian, Polish, Portuguese, Spanish, and Turkish). Dialogs were generated semi-automatically using large language models, manually curated for plausibility and diversity, and localized to country- and city-specific contexts. All dialogs were additionally converted to speech via text-to-speech synthesis, transcribed with Whisper, and annotated through automatic projection and manual correction, yielding aligned written and speech-derived resources across all languages. We further release baseline multilingual named entity recognition models and provide technical validation through inter-annotator agreement analysis, translation quality evaluation, annotation projection assessment, and benchmark experiments with transformer-based sequence labeling models.
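The abstract mentions annotating the speech-derived transcripts "through automatic projection and manual correction" but does not say how the projection is computed. As a rough illustration only, the sketch below shows one simple way entity spans could be carried over from a written dialog to its transcript: align the two token sequences with Python's standard `difflib.SequenceMatcher` and map each annotated token to its aligned counterpart. The function name `project_spans`, the labels `NAME` and `CITY`, and the example sentence are all hypothetical and are not taken from the dataset or the authors' pipeline.

```python
from difflib import SequenceMatcher


def project_spans(source_tokens, target_tokens, source_spans):
    """Project (start, end, label) token spans from source onto target.

    Spans use half-open token indices [start, end). This is an
    illustrative alignment heuristic, not the authors' method.
    """
    # Align the two token sequences case-insensitively.
    matcher = SequenceMatcher(
        a=[t.lower() for t in source_tokens],
        b=[t.lower() for t in target_tokens],
        autojunk=False,
    )

    # Map each exactly matched source token index to its target index.
    src_to_tgt = {}
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            src_to_tgt[block.a + k] = block.b + k

    projected = []
    for start, end, label in source_spans:
        hits = [src_to_tgt[i] for i in range(start, end) if i in src_to_tgt]
        if hits:
            projected.append((min(hits), max(hits) + 1, label))
        # Spans with no aligned token are dropped here; in practice
        # they would be left for manual correction.
    return projected


# Hypothetical written dialog turn and a transcript with an inserted filler.
written = "My name is Anna Schmidt and I live in Berlin".split()
transcript = "my name is Anna Schmidt and uh I live in Berlin".split()
spans = [(3, 5, "NAME"), (9, 10, "CITY")]  # hypothetical labels

print(project_spans(written, transcript, spans))
# prints [(3, 5, 'NAME'), (10, 11, 'CITY')]
```

Exact token matching is deliberately conservative: if the transcription misspells part of a name, only the matched tokens are projected, which is the kind of case a manual correction pass would need to catch.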
