It does what it says on the tin: safe synthetic data from coarsened margins
2026-06-01 • Machine Learning
AI summary
The authors propose a new way to create synthetic data that is more transparent and safer to use. Their method clearly shows which parts of the original data's relationships are kept in the synthetic version. They also make sure the data comes from information that has already been checked to prevent revealing private details. This is done by adjusting certain counts and relationships before generating the synthetic data using a technique called Iterative Proportional Fitting. They demonstrate their approach using data from the 1901 Census of Scotland.
synthetic data, transparency, statistical disclosure control, Iterative Proportional Fitting, margins, top-coding, bottom-coding, data custodian, 1901 Census of Scotland
Authors
Gillian M Raab
Abstract
This paper proposes a method of creating synthetic data (SD) that offers the user two important advantages over other methods currently available. The first is transparency: unlike with other methods, the recipient of the SD knows which of the relationships between variables in the original data are approximately maintained in the SD. The second is a guarantee that the SD is derived from information that has already been judged to be free of disclosure risk. This is achieved by first defining and calculating the margins in which relationships between variables will be maintained in the SD. Each margin is then subjected to statistical disclosure control (SDC) to the standards defined by the data custodian, e.g. top-coding and bottom-coding, combining small categories, and/or modifying small counts. A further adjustment of the curated margins is advised: coarsening all counts in the table to multiples of the disclosure limit. These adjusted margins are then used to create SD with the Iterative Proportional Fitting (IPF) algorithm. The practical steps involved in creating such SD are illustrated using data from the 1901 Census of Scotland.
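The pipeline described in the abstract (compute margins, coarsen them, fit with IPF, then sample) can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the author's implementation. The function names `coarsen` and `ipf`, the round-to-nearest rule, the disclosure limit of 5, and the toy three-variable table are all assumptions made for the example, and it does not use the 1901 Census data.

```python
import numpy as np

def coarsen(margin, limit):
    """Round each count to the nearest multiple of `limit`.
    Assumption: round-half-up; a custodian may prescribe another rule."""
    m = np.asarray(margin, dtype=float)
    return limit * np.floor(m / limit + 0.5)

def ipf(shape, margins, iters=50):
    """Iterative Proportional Fitting from a uniform starting table.
    `margins` is a list of (axes, target) pairs: `axes` is a sorted tuple
    of the axes the margin keeps, `target` an array over those axes."""
    table = np.ones(shape)
    all_axes = tuple(range(len(shape)))
    for _ in range(iters):
        for axes, target in margins:
            drop = tuple(a for a in all_axes if a not in axes)
            current = table.sum(axis=drop, keepdims=True)
            tgt = np.asarray(target, dtype=float).reshape(current.shape)
            # Scale each cell so the fitted margin moves onto the target;
            # cells whose current margin is zero are set to zero.
            ratio = np.divide(tgt, current,
                              out=np.zeros_like(current), where=current > 0)
            table = table * ratio
    return table

# Toy "original" table over three hypothetical variables A, B, C.
rng = np.random.default_rng(0)
shape = (3, 2, 4)
original = rng.poisson(20, size=shape)

# Only the A-B and B-C relationships are carried into the synthetic data.
limit = 5  # assumed disclosure limit
ab = coarsen(original.sum(axis=2), limit)
bc = coarsen(original.sum(axis=0), limit)

fitted = ipf(shape, [((0, 1), ab), ((1, 2), bc)])

# Draw synthetic cell counts from the fitted cell probabilities.
n = int(round(fitted.sum()))
synth = rng.multinomial(n, fitted.ravel() / fitted.sum()).reshape(shape)
```

Only the coarsened margins `ab` and `bc` enter the fit, which is the transparency property in miniature: a recipient can expect the A-B and B-C relationships to be roughly preserved and no others. Coarsened margins need not agree with each other on shared totals, so in this sketch the last margin fitted is matched exactly and the others only approximately.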