Synthetic data generation for anonymization purposes. Application on the Norwegian Survey on living conditions/EHIS

JOHAN HELDAL AND DIANA-CRISTINA IANCU

STATISTICS NORWAY, DEPARTMENT OF METHODOLOGY AND DATA COLLECTION

JOINT UNECE/EUROSTAT WORK SESSION ON STATISTICAL DATA CONFIDENTIALITY

29-31 OCTOBER 2019, THE HAGUE

Intro

• In some cases de-identification does not offer sufficient protection

• How to merge synthetic data based on surveys with other data sources, such as registers?

Intro



• An ideal solution:

◦ preserves data confidentiality

◦ allows for quality controls

◦ provides the same opportunities for data analysis as the non-anonymized data would

◦ enables international reporting to institutions

◦ provides the possibility to adjust data after a longer period

◦ can be reused with minimal adaptations

• Potential solution: create “satellite” datasets with register data that could be merged with the survey data as needed, based on a key

Synthetic data generation

• Our solution: eliminate the key and match the “satellite” files through statistical matching

• We propose a model-free method that relies on statistical matching to replace the register information corresponding to individuals from the original sample with register information corresponding to other similar individuals

• Draw a “register sample” → Add register variables → Match the sample to the original survey and replace the register data

Data

• Norwegian Survey on living conditions/EHIS (European Health Interview Survey) from 2015

• Comprehensive survey covering several topics

• Multiple uses for both aggregated results and microdata

• Conducted on a representative sample of individuals aged 16 and above

• The sample is divided into 19 strata, corresponding to the 19 counties in Norway

• Survey sample: 14,000 potential respondents for the entire country (700 individuals per county, except for Oslo, with 1400 individuals)

• Gross sample: 13,748 individuals

• Net sample: 8164 individuals

Data

• Data collected before the interview:

◦ residence municipality of the respondent

◦ composition of the household

◦ name and address of the employer of each household member

◦ respondent’s occupation

• Data added after the interview is conducted:

◦ education

◦ income

◦ whether the respondent lives in a densely or sparsely populated area

◦ more detailed demographic information for the household and each family member, such as country of birth and immigrant background

Method

• Step 1. Drawing a “register sample”

◦ We draw a sample of 42,000 individuals from the population register

◦ Survey respondents are not excluded prior to drawing the sample

• Step 2. Adding register variables

◦ Variables that would normally be linked to the Norwegian Survey on living conditions/EHIS
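Steps 1 and 2 can be sketched in a few lines. The slides do not state the sampling design used for the “register sample”, so simple random sampling without replacement is an assumption here, and the toy population, the `person_id` key and the `education` and `income` fields are hypothetical names used only for illustration.

```python
import random


def draw_register_sample(population_ids, n, seed=0):
    """Draw n distinct person IDs from the population register.

    Assumption: simple random sampling without replacement.
    Survey respondents are not excluded, as in Step 1.
    """
    rng = random.Random(seed)
    return rng.sample(population_ids, n)


def add_register_variables(sample_ids, register):
    """Attach register variables to each sampled person by ID lookup."""
    return [{"person_id": pid, **register[pid]} for pid in sample_ids]


# Toy population register (hypothetical values).
population = list(range(1000))
register = {
    pid: {"education": pid % 4, "income": 1000 + pid} for pid in population
}

sample_ids = draw_register_sample(population, 30)
register_sample = add_register_variables(sample_ids, register)
```

In the application described here the draw would be 42,000 individuals, and the enriched “register sample” becomes the donor dataset for Step 3.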

Method

• Step 3. Performing the statistical matching

◦ Match the enriched “register sample” to the original survey sample

◦ We follow the procedure outlined in D’Orazio (2016) for the statistical matching, consisting of five steps:

1) choosing the target variables

2) identifying the common variables

3) choosing the matching variables

4) applying a statistical matching method

5) evaluating the results

Method

• Step 3.1. Choosing the target variables

◦ Variables chosen for each of the two data sources

◦ The “register sample” can be adjusted by adding or removing register variables according to the needs of the synthetic dataset users

Method





• Step 3.2. Choosing the common variables

◦ Assess definitions, accuracy, frequency distributions of the common variables

• Step 3.3. Choosing the matching variables

◦ Choose only relevant variables

◦ Apply the principle of parsimony

◦ In order to preserve the structure of the survey sample, we use county, gender, age and household size as matching variables

Method

Distribution of gender in the EHIS survey sample compared with the “register sample”

Method

Distribution of age in the EHIS survey sample compared with the “register sample”

Method

Distribution of household size in the EHIS survey sample compared with the “register sample”

Method

• Step 3.4. Applying a statistical matching method

◦ Random hot deck statistical matching: for every individual in the original sample survey, a donor is randomly selected from the donor dataset (“register sample”)

◦ Nearest neighbor distance hot deck statistical matching: the closest donor to each record in the original survey is selected from the donor dataset, according to a distance computed on a subset of common variables

◦ Versions used:

RND1: random hot deck, only one donor is chosen randomly
RND2: random hot deck, only one donor is chosen randomly among the 20 closest neighbors in terms of age and household size
NN: distance hot deck statistical matching without constraints
NNC: constrained distance hot deck matching

◦ All four versions use county and gender as donation classes, and age and household size as matching variables

                                                               RND1    RND2    NN      NNC
Allows for the selection of a record as donor multiple times   Yes     Yes     Yes     No
Number of observations in the synthetic dataset                8164    8164    8164    8164
Number of distinct donors in the synthetic dataset             5135    7285    7239    8164
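The three unconstrained versions (RND1, RND2, NN) can be illustrated with a minimal sketch. This is not the implementation used for the results. The Manhattan distance on age and household size, the deterministic tie-breaking, and the field names `county`, `gender`, `age`, `hh_size` and `education` are assumptions made for illustration, and the records are toy data.

```python
import random
from collections import defaultdict


def group_by_class(donors):
    """Group donor records by donation class (county, gender)."""
    classes = defaultdict(list)
    for d in donors:
        classes[(d["county"], d["gender"])].append(d)
    return classes


def distance(a, b):
    """Manhattan distance on the matching variables (an assumed choice)."""
    return abs(a["age"] - b["age"]) + abs(a["hh_size"] - b["hh_size"])


def match(recipients, donors, method="NN", k=20, seed=0):
    """Return one donor per recipient; a donor may be reused."""
    rng = random.Random(seed)
    classes = group_by_class(donors)
    chosen = []
    for r in recipients:
        pool = classes.get((r["county"], r["gender"]), [])
        if not pool:
            raise ValueError("no donor in the recipient's donation class")
        if method == "RND1":
            # Any donor within the donation class, chosen at random.
            donor = rng.choice(pool)
        else:
            ranked = sorted(pool, key=lambda d: distance(r, d))
            if method == "RND2":
                # Random choice among the k closest donors.
                donor = rng.choice(ranked[:k])
            else:
                # NN: closest donor; ties resolved by donor order here.
                donor = ranked[0]
        chosen.append(donor)
    return chosen


# Toy data (hypothetical values).
donors = [
    {"id": 1, "county": "Oslo", "gender": "F", "age": 30, "hh_size": 2,
     "education": "tertiary"},
    {"id": 2, "county": "Oslo", "gender": "F", "age": 45, "hh_size": 4,
     "education": "secondary"},
    {"id": 3, "county": "Oslo", "gender": "M", "age": 31, "hh_size": 2,
     "education": "primary"},
]
recipients = [{"county": "Oslo", "gender": "F", "age": 44, "hh_size": 3}]

nn = match(recipients, donors, "NN")
# Replace the register variable of each survey record with the donor's value.
synthetic = [{**r, "education": d["education"]} for r, d in zip(recipients, nn)]
```

NNC is not covered by this sketch: it additionally requires that each donor is used at most once, which is typically formulated as an assignment problem over all recipients in a donation class rather than a record-by-record search.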

Method

• Step 3.5. Evaluating the results

◦ Assess the representativeness of the synthetic dataset

◦ Check the marginal distribution of the imputed variables

◦ Check the joint distribution of the imputed variables with the matching variables (the distribution in the

donor dataset, i.e. the register sample, is the reference)
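The four measures reported in the results can be computed from two vectors of category proportions. The sketch below assumes the standard definitions: dissimilarity index = ½ Σ|p − q|, overlap = Σ min(p, q), Bhattacharyya coefficient = Σ √(p·q), and Hellinger's distance = √(1 − Bhattacharyya coefficient). The function names and the toy proportions are hypothetical.

```python
from collections import Counter
from math import sqrt


def proportions(values, categories):
    """Relative frequency of each category, in the given category order."""
    counts = Counter(values)
    n = len(values)
    return [counts[c] / n for c in categories]


def compare(p, q):
    """Compare two distributions defined over the same categories."""
    dissimilarity = 0.5 * sum(abs(a - b) for a, b in zip(p, q))
    overlap = sum(min(a, b) for a, b in zip(p, q))
    bhattacharyya = sum(sqrt(a * b) for a, b in zip(p, q))
    # max() guards against tiny negative values from rounding.
    hellinger = sqrt(max(0.0, 1.0 - bhattacharyya))
    return {
        "dissimilarity": dissimilarity,
        "overlap": overlap,
        "bhattacharyya": bhattacharyya,
        "hellinger": hellinger,
    }


# Toy example: reference (donor) distribution vs. synthetic distribution.
p = proportions(["F", "M", "F", "M"], ["F", "M"])
q = [0.6, 0.4]
m = compare(p, q)
```

With these definitions the overlap always equals one minus the dissimilarity index, so values near 0 for the dissimilarity index and Hellinger's distance, and near 1 for the overlap and the Bhattacharyya coefficient, indicate that the synthetic distribution is close to the reference.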

Results

DI = dissimilarity index, OV = overlap, BC = Bhattacharyya coefficient, HD = Hellinger's distance

                        RND1                                RND2
Variable                DI      OV      BC      HD          DI      OV      BC      HD
Gender                  0.0010  0.9990  1.0000  0.0007      0.0010  0.9990  1.0000  0.0007
County                  0.0393  0.9607  0.9985  0.0390      0.0393  0.9607  0.9985  0.0390
Education level         0.0172  0.9828  0.9997  0.0165      0.0276  0.9724  0.9996  0.0208
Education field         0.0176  0.9824  0.9996  0.0199      0.0170  0.9830  0.9997  0.0167
Immigrant category      0.0122  0.9878  0.9998  0.0158      0.0142  0.9858  0.9998  0.0143
Country background      0.0141  0.9859  0.9997  0.0176      0.0161  0.9839  0.9997  0.0178
Degree of urbanization  0.0052  0.9948  0.9999  0.0095      0.0011  0.9989  1.0000  0.0025
Occupation              0.0209  0.9791  0.9996  0.0198      0.0249  0.9751  0.9996  0.0198
Works full-/part-time   0.0018  0.9982  1.0000  0.0014      0.0058  0.9942  1.0000  0.0044
Employment status       0.0028  0.9972  0.9999  0.0084      0.0054  0.9946  0.9999  0.0106

Similarity and dissimilarity measures for comparing estimated distributions of categorical variables from the synthetic datasets generated through random hot deck statistical matching

DI = dissimilarity index, OV = overlap, BC = Bhattacharyya coefficient, HD = Hellinger's distance

                        NN                                  NNC
Variable                DI      OV      BC      HD          DI      OV      BC      HD
Gender                  0.0010  0.9990  1.0000  0.0007      0.0010  0.9990  1.0000  0.0007
County                  0.0393  0.9607  0.9985  0.0390      0.0393  0.9607  0.9985  0.0390
Education level         0.0125  0.9875  0.9998  0.0143      0.0153  0.9847  0.9998  0.0151
Education field         0.0098  0.9902  0.9998  0.0123      0.0170  0.9830  0.9998  0.0158
Immigrant category      0.0136  0.9864  0.9997  0.0168      0.0143  0.9857  0.9998  0.0142
Country background      0.0119  0.9881  0.9998  0.0146      0.0145  0.9855  0.9997  0.0161
Degree of urbanization  0.0109  0.9891  0.9999  0.0093      0.0071  0.9929  0.9999  0.0071
Occupation              0.0162  0.9838  0.9997  0.0169      0.0096  0.9904  0.9999  0.0119
Works full-/part-time   0.0026  0.9974  1.0000  0.0020      0.0023  0.9977  1.0000  0.0018
Employment status       0.0067  0.9933  0.9999  0.0080      0.0034  0.9966  1.0000  0.0046

Similarity and dissimilarity measures for comparing estimated distributions of categorical variables from the synthetic datasets generated through nearest neighbour distance hot deck statistical matching

Results

DI = dissimilarity index, OV = overlap, BC = Bhattacharyya coefficient, HD = Hellinger's distance

                              RND1                                RND2
Variables                     DI      OV      BC      HD          DI      OV      BC      HD
County and education level    0.0780  0.9220  0.9933  0.0819      0.0703  0.9297  0.9950  0.0705
County and employment status  0.0490  0.9510  0.9972  0.0528      0.0519  0.9481  0.9975  0.0503
Gender and education level    0.0206  0.9794  0.9996  0.0199      0.0314  0.9686  0.9993  0.0259
Gender and employment status  0.0152  0.9848  0.9998  0.0145      0.0056  0.9944  0.9999  0.0107

Similarity and dissimilarity measures for comparing joint distributions of categorical variables from the synthetic datasets generated through random hot deck statistical matching

DI = dissimilarity index, OV = overlap, BC = Bhattacharyya coefficient, HD = Hellinger's distance

                              NN                                  NNC
Variables                     DI      OV      BC      HD          DI      OV      BC      HD
County and education level    0.0698  0.9302  0.9951  0.0703      0.0608  0.9392  0.9960  0.0634
County and employment status  0.0463  0.9537  0.9972  0.0526      0.0462  0.9538  0.9978  0.0467
Gender and education level    0.0129  0.9871  0.9997  0.0174      0.0155  0.9845  0.9997  0.0168
Gender and employment status  0.0093  0.9907  0.9999  0.0098      0.0097  0.9903  0.9999  0.0089

Similarity and dissimilarity measures for comparing joint distributions of categorical variables from the synthetic datasets generated through nearest neighbour distance hot deck statistical matching

Results

Distribution of age in each of the four synthetic datasets, compared with the “register sample”

The road further

• Synthesize the entire survey sample, to capture non-response patterns

• Perform robustness checks

◦ Change the size of the “register sample”

◦ Use alternative options in the implementation of the matching procedure

• Test the quality of the resulting synthetic datasets with respect to more variables and to the household structure

• Compare the information loss due to the usage of synthetic data with the information loss caused by applying traditional disclosure control methods on the original survey data

• Simulate an attack by an intruder on the synthetic datasets

Thank you!
