
Statistical Disclosure Control: A Practice Guide

Thijs Benschop, Cathrine Machingauta, Matthew Welch

Jul 26, 2021


Table of Contents

1 Introduction
  1.1 Building a knowledge base
  1.2 Using this guide
  1.3 Outline of this guide

2 Glossary and list of acronyms
  2.1 List of Acronyms
  2.2 Glossary

3 Statistical Disclosure Control (SDC): An Introduction
  3.1 Need for SDC
  3.2 The risk-utility trade-off in the SDC process

4 Release Types
  4.1 Conditions for PUFs
  4.2 Conditions for SUFs
  4.3 Conditions for microdata available in a controlled research data center

5 Measuring Risk
  5.1 Types of disclosure
  5.2 Classification of variables
  5.3 Disclosure scenarios
  5.4 Levels of risk
  5.5 Individual risk
  5.6 Special Uniques Detection Algorithm (SUDA)
  5.7 Risk measures for continuous variables
  5.8 Global risk
  5.9 Household risk

6 Anonymization Methods
  6.1 Classification of SDC methods
  6.2 Non-perturbative methods
  6.3 Perturbative methods
  6.4 Anonymization of geospatial variables
  6.5 Anonymization of the quasi-identifier household size
  6.6 Special case: census data

7 Measuring Utility and Information Loss
  7.1 General utility measures for continuous and categorical variables
  7.2 Utility measures based on the end user’s needs
  7.3 Regression
  7.4 Assessing data utility with the help of data visualizations (in R)
  7.5 Choice of utility measure

8 SDC with sdcMicro in R: Setting Up Your Data and more
  8.1 Installing R, sdcMicro and other packages
  8.2 Read functions in R
  8.3 Missing values
  8.4 Classes in R
  8.5 Objects of class sdcMicroObj
  8.6 Household structure
  8.7 Randomizing order and numbering of individuals or households
  8.8 Computation time
  8.9 Common errors

9 The SDC Process
  9.1 Step 1: Need for confidentiality protection
  9.2 Step 2: Data preparation and exploring data characteristics
  9.3 Step 3: Type of release
  9.4 Step 4: Intruder scenarios and choice of key variables
  9.5 Step 5: Data key uses and selection of utility measures
  9.6 Step 6: Assessing disclosure risk
  9.7 Step 7: Assessing utility measures
  9.8 Step 8: Choice and application of SDC methods
  9.9 Step 9: Re-measure risk
  9.10 Step 10: Re-measure utility
  9.11 Step 11: Audit and Reporting
  9.12 Step 12: Data release

10 Appendices
  10.1 Appendix A: Overview of Case Study Variables
  10.2 Appendix B: Example of Blanket Agreement for SUF
  10.3 Appendix C: Internal and External Reports for Case Studies
  10.4 Case study 1 - Internal report
  10.5 Case study 1 - External report
  10.6 Case study 2 - Internal report
  10.7 Case study 2 - External report
  10.8 Appendix D: Execution Times for Multiple Scenarios Tested using Selected Sample Data

11 Case Studies (Illustrating the SDC Process)
  11.1 Case study 1 - SUF
  11.2 Case study 2 - PUF

Bibliography


Statistical Disclosure Control: A Practice Guide

This is documentation and guidance for using sdcMicro from the command line. sdcMicro is an R package, which provides tools for Statistical Disclosure Control (SDC) for microdata, also known as microdata anonymization. We refer to Statistical Disclosure Control for Microdata: A Theory Guide for an introduction to SDC for microdata with a complete overview of the theory as well as examples from practice.

Authors of this guide: Thijs Benschop and Matthew Welch, The World Bank

Acknowledgments: The authors thank Olivier Dupriez and Cathrine Machingauta (The World Bank) for their technical comments and inputs throughout the process.

Preferred citation of this guide: Benschop, T. and Welch, M. (n.d.) Statistical Disclosure Control for Microdata: A Practice Guide. Retrieved (insert date), from https://sdcpractice.readthedocs.io/en/latest/

The production of this guide was made possible through a World Bank Knowledge for Change II Grant: KCP II - A microdata dissemination challenge: Balancing data protection and data utility (Grant number: TF 015043, Project Number P094376), as well as through United Kingdom DFID funding to the World Bank Multi-Donor Trust Fund - International Household Survey and Accelerated Data Program (TF071804/TF011722/TF0A7461).


CHAPTER 1

Introduction

National statistics agencies are mandated to collect microdata1 from surveys and censuses to inform and measure policy effectiveness. In almost all countries, statistics acts and privacy laws govern these activities. These laws require that agencies protect the identity of respondents, but may also require that agencies disseminate the results and, in appropriate cases, the microdata. Data producers who are not part of national statistics agencies are also often subject to restrictions, through privacy laws or strict codes of conduct and ethics that require a similar commitment to privacy protection. This has to be balanced against the increasing requirement from funders that data produced using donor funds be made publicly available.

1 Microdata are unit-level data obtained from sample surveys, censuses and administrative systems. They provide information about characteristics of individual people or entities such as households, business enterprises, facilities, farms or even geographical areas such as villages or towns. They allow in-depth understanding of socio-economic issues by studying relationships and interactions among phenomena. Microdata are thus key to designing projects and formulating policies, targeting interventions and monitoring and measuring the impact and results of projects, interventions and policies.

This tension, between complying with confidentiality requirements and the requirement that microdata be released, means that a demand exists for practical solutions for applying Statistical Disclosure Control (SDC), also known as microdata anonymization. The provision of adequate solutions and technical support has the potential to “unlock” a large number of datasets.

The International Household Survey Network (IHSN) and the World Bank have contributed to successful programs that have generated tools, resources and guidelines for the curation, preservation and dissemination of microdata and resulted in the documentation of thousands of surveys by countries and agencies across the world. While these programs have ensured substantial improvements in the preservation of data and dissemination of good quality metadata, many agencies are still reluctant to allow access to the microdata. The reasons are technical, legal, ethical and political, and sometimes involve a fear of being criticized for not providing perfect data. When combined with the tools and guidelines already developed by the IHSN/World Bank for the curation, preservation and dissemination of microdata, tools and guidelines for the anonymization of microdata should further reduce or remove some of these obstacles.

Working with the IHSN, PARIS21 (OECD), Statistics Austria and the Vienna University of Technology, the World Bank has contributed to the development of an open source software package for SDC, called sdcMicro TKM15. The package was developed for use with the open source R statistical software, available from the Comprehensive R Archive Network (CRAN) at http://cran.us.r-project.org. The package includes numerous methods for the assessment and reduction of disclosure risk in microdata.

Ensuring that a free open source solution is available to agencies was an important step forward, but not a sufficient one. There is still limited consolidated and reported knowledge on the impact of disclosure risk reduction methods on data utility. This limited access to knowledge, combined with a lack of experience in using the tools and methods, makes it difficult for many agencies to implement optimal solutions, i.e., solutions that meet their obligations towards both privacy protection and the release of data useful for policy monitoring and evaluation. This practice guide attempts to fill this critical gap by:

(i) consolidating knowledge gained at the World Bank through experiments conducted during a large-scale evaluation of anonymization techniques

(ii) translating the experience and key results into practical guidelines

It should be stressed that SDC is only one part of the data release process, and its application must be considered within the complete data release framework. The level and methods of SDC depend on the laws of the country, the sensitivity of the data and the access policy (i.e., who will gain access) considered for release. Agencies that are currently releasing data are already using many of the methods described in this guide and applying appropriate access policies to their data before release. The primary objective of this guide is to provide a primer to those new to the process who are looking for guidance on both theory and practical implementation. This guide is not intended to prescribe or advocate for changes in methods that specific data producers are already using and which they have designed to fit and comply with their existing data release policies.

The guide seeks to provide practical steps to those agencies that want to unlock access to their data in a safe way and ensure that the data remain fit for purpose.

1.1 Building a knowledge base

The release of data is important, as it allows researchers and policymakers to replicate officially published results, generate new insights into issues, avoid duplication of surveys and provide greater returns to the investment in the survey process.

Both the production of reports, with aggregate tables of indicators and statistics, and the release of microdata result in privacy challenges to the producer. In the past, for many agencies, the only requirement was to release a report and some key indicators. The recent movement around Open Data, Open Government and transparency means that agencies are under greater pressure to release their microdata to allow broader use of data collected through public and donor funds. This guide focuses on the methods and processes for the release of microdata.

Releasing data in a safe way is required to protect the integrity of the statistical system, by ensuring agencies honor their commitment to respondents to protect their identity. Agencies do not widely share, in substantial detail, their knowledge and experience using SDC and the processes for creating safe data with other agencies. This makes it difficult for agencies new to the process to implement solutions. To fill this experience and knowledge gap, we evaluated the use of a broad suite of SDC methods on a range of survey microdata covering important development topics related to health, labor, education, poverty and inequality. The data we used were all previously treated to make them safe for release. Given that their producers had already treated these data, it was not possible, nor was it our goal, to pass any judgment on the safety of these data, many of which are in the public domain. The focus was rather on measuring the effects that various methods would have on the risk-utility trade-off for microdata produced to measure common development indicators. We used the experience from this large-scale experimentation to inform our discussion of the processes and methods in this guide.

Important

At no point was any attempt made to re-identify, through matching or any other method, any respondents in the surveys we used in building our knowledge base. All risk assessments were based on frequencies and probabilities.


1.2 Using this guide

The methods discussed in this guide originate from a large body of literature on SDC. The processes underlying many of the methods are the subject of extensive academic research and many, if not all, of them are used extensively by agencies experienced in preparing microdata for release.

Where possible, for each method and topic, we provide elaborate examples, references to the original or seminal work describing the methods and algorithms in detail and recommended readings. This, when combined with the discussion of the method and practical considerations in this guide, should allow the reader to understand the methods and their strengths and weaknesses. It should also provide enough detail for readers to use an existing software solution to implement the methods or program the methods in statistical software of their choice.

For the examples in this guide, we use the open source and free sdcMicro package for SDC, an add-on package to the statistical software R. The package was developed and is maintained by Matthias Templ, Alexander Kowarik and Bernhard Meindl.2 The statistical software R and the sdcMicro package, as well as any other packages needed for the SDC process, are freely available from the Comprehensive R Archive Network (CRAN) mirrors (http://cran.r-project.org/). The software is available for Linux, Windows and Macintosh operating systems. We chose R and sdcMicro because they are freely available, accommodate all main data formats and are easy for the user to adapt. The World Bank, through the IHSN, has also provided funding towards the development of the sdcMicro package to ensure it meets the requirements of the agencies we support.
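As a first practical step, the snippet below shows one way to install and load sdcMicro from CRAN. It is a minimal sketch; the Section Installing R, sdcMicro and other packages covers installation in more detail.

    # Install sdcMicro and its dependencies from a CRAN mirror (needed only once)
    install.packages("sdcMicro")

    # Load the package at the start of every R session in which it is used
    library(sdcMicro)

    # Optionally, check which version of the package is installed
    packageVersion("sdcMicro")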

This guide does not provide a review of all other available packages for implementing the SDC process. Our concern is more with providing practical insight into the application of the methods. We would, however, like to highlight one particular other software package that is commonly used by agencies: 𝜇-ARGUS3. 𝜇-ARGUS is developed by Statistics Netherlands. sdcMicro and 𝜇-ARGUS are both widely used in statistics offices in the European Union and implement many of the same methods.

The user needs some knowledge of R to use sdcMicro. It is beyond the scope of this guide to teach the use of R, but we do provide throughout the guide code examples on how to implement the necessary routines in R.4 We also present a number of case studies that include the code for the anonymization of a number of demo datasets using R. Through these case studies, we demonstrate a number of approaches to the anonymization process in R.5
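To give a first flavor of these code examples, the sketch below sets up an SDC problem for the demo dataset testdata, which ships with the sdcMicro package. The selection of key variables here is purely illustrative; in practice it would follow from the disclosure scenarios discussed in the Section Measuring Risk.

    library(sdcMicro)

    # Load the demo dataset included in the sdcMicro package
    data("testdata")

    # Create an sdcMicro object: categorical quasi-identifiers, continuous
    # variables and the sampling weight (an illustrative selection only)
    sdc <- createSdcObj(dat       = testdata,
                        keyVars   = c("urbrur", "water", "sex", "relat"),
                        numVars   = c("expend", "income", "savings"),
                        weightVar = "sampling_weight")

    # Show a summary of the disclosure risk in the untreated data
    print(sdc, "risk")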

1.3 Outline of this guide

This guide is divided into the following main sections:

(i) the Section Statistical Disclosure Control (SDC): An Introduction is a primer on SDC.

(ii) the Section Release Types gives an introduction to different release types for microdata.

(iii) the Sections Anonymization Methods, Measuring Risk and Measuring Utility and Information Loss cover SDC methods, risk and utility measurement. The goal here is to provide knowledge that allows the reader to independently apply and execute the SDC process. This section is enriched with real examples as well as code snippets from the sdcMicro package. The interested reader can also find more information in the references and recommended readings at the end of each section.

(iv) the Section SDC with sdcMicro in R: Setting Up Your Data and more gives an overview of issues encountered when carrying out anonymization with the sdcMicro package in R, which exceed basic R knowledge. This section also includes tips and solutions to some of the common issues and problems that might be encountered when applying SDC methods in R with sdcMicro.

(v) the Section The SDC Process provides a step-by-step guide to disclosure control, which draws upon the knowledge presented in the previous sections.

(vi) the Section Case Studies (Illustrating the SDC Process) presents a number of detailed case studies that demonstrate the use of the methods, their implementation in sdcMicro and the process that should be followed to reach the optimal risk-utility solution.

2 See http://cran.r-project.org/web/packages/sdcMicro/index.html and the GitHub repository of the developers, https://github.com/sdcTools/sdcMicro. The GitHub repository can also be used to submit bugs found in the package.

3 𝜇-ARGUS is available at: http://neon.vb.cbs.nl/casc/mu.htm. The software was recently ported to open source.

4 There are many free resources for learning R available on the web. One place to start would be the CRAN R Project page: http://cran.r-project.org/other-docs.html

5 The developers of sdcMicro have also developed a graphical user interface (GUI) for the package, which is contained in the sdcMicro package available from the CRAN mirrors. The GUI, however, does not implement the full functionality of the sdcMicro package and is not discussed in this guide. The GUI can be called after loading sdcMicro by typing sdcApp() at the prompt.

References


CHAPTER 2

Glossary and list of acronyms

2.1 List of Acronyms

AFR Sub-Saharan Africa

COICOP Classification of Individual Consumption by Purpose

CRAN Comprehensive R Archive Network

CTBIL Contingency Table-Based Information Loss

DHS Demographic and Health Surveys

DIS Data Intrusion Simulation

EAP East Asia and the Pacific

ECA Europe and Central Asia

EU European Union

GIS Geographical Information System

GPS Global Positioning System

GUI Graphical User Interface

HIV/AIDS Human Immunodeficiency Virus/Acquired Immune Deficiency Syndrome

I2D2 International Income Distribution Database

IHSN International Household Survey Network

LAC Latin America and the Caribbean

LSMS Living Standards Measurement Survey

MDAV Maximum Distance to Average Vector

MDG Millennium Development Goal

MENA Middle East and North Africa


MICS Multiple Indicator Cluster Survey

MME Mean Monthly Expenditures

MMI Mean Monthly Income

MSU Minimal Sample Uniques

NSI National Statistical Institute

NSO National Statistical Office

OECD Organization for Economic Cooperation and Development

PARIS21 Partnership in Statistics for Development in the 21st century

PC Principal Component

PRAM Post Randomization Method

PUF Public Use File

SA South Asia

SDC Statistical Disclosure Control

SHIP Survey-based Harmonized Indicators Program

SSE Sum of Squared Errors

SUDA Special Uniques Detection Algorithm

SUF Scientific Use File

UNICEF United Nations Children’s Fund

2.2 Glossary

Administrative data Data collected for administrative purposes by government agencies. Typically, administrative data require specific SDC methods.

Anonymization Use of techniques that convert confidential data into anonymized data; the removal or masking of identifying information from datasets.

Attribute disclosure Attribute disclosure occurs if an intruder is able to determine new characteristics of an individual or organization based on the information available in the released data.

Categorical variable A variable that takes values over a finite set, e.g., gender. Also called factor in R.

Confidentiality Data confidentiality is a property of data, usually resulting from legislative measures, which prevents it from unauthorized disclosure.2

Confidential data Data that will allow identification of an individual or organization, either directly or indirectly.1

Continuous variable A variable with which numerical and arithmetic operations can be performed, e.g., income.

Data protection Data protection refers to the set of privacy-motivated laws, policies and procedures that aim to minimize intrusion into respondents’ privacy caused by the collection, storage and dissemination of personal data.2

Deterministic methods Anonymization methods that follow a certain algorithm and produce the same results if applied repeatedly to the same data with the same set of parameters.

1 Australian Bureau of Statistics, http://www.nss.gov.au/nss/home.nsf/pages/Confidentiality+-+Glossary

2 OECD, http://stats.oecd.org/glossary


Direct identifier A variable that reveals directly and unambiguously the identity of a respondent, e.g., names, social identity numbers.

Disclosure Disclosure occurs when a person or an organization recognizes or learns something that they did not already know about another person or organization through released data.1 See also Identity disclosure, Attribute disclosure and Inferential disclosure.

Disclosure risk A disclosure risk occurs if an unacceptably narrow estimation of a respondent’s confidential information is possible or if exact disclosure is possible with a high level of confidence.2 Disclosure risk also refers to the probability that successful disclosure could occur.

End user The user of the released microdata file after anonymization. Who the end user is depends on the release type.

Factor variable Factor variables are one way to classify categorical variables in R.

Hierarchical structure Data made up of collections of records that are interconnected through links, e.g., individuals belonging to groups/households or employees belonging to companies.

Identifier An identifier is a variable or piece of information that can be used to establish the identity of an individual or organization. Identifiers can lead to direct or indirect identification.

Identity disclosure Identity disclosure occurs if an intruder associates a known individual or organization with a released data record.

Indirect identification Indirect identification occurs when the identity of an individual or organization is disclosed, not using direct identifiers but through a combination of unique characteristics in key variables.1

Inferential disclosure Inferential disclosure occurs if an intruder is able to determine the value of some characteristic of an individual or organization more accurately with the released data than otherwise would have been possible.

Information loss Information loss refers to the reduction of the information content in the released data relative to the information content in the raw data. Information loss is often measured with respect to common analytical measures, such as regressions and indicators. See also Utility.

Interval A set of numbers between two designated endpoints that may or may not be included. Brackets (e.g., [0, 1]) denote a closed interval, which includes the endpoints 0 and 1. Parentheses (e.g., (0, 1)) denote an open interval, which does not include the endpoints.

Intruder A user who misuses released data by trying to disclose information about an individual or organization, using a set of characteristics known to the user.

𝑘-anonymity The risk measure 𝑘-anonymity is based on the principle that the number of individuals in a sample sharing the same combination of values (key) of categorical key variables should be higher than a specified threshold 𝑘.

Key A combination or pattern of key variables/quasi-identifiers.

Key variables A set of variables that, in combination, can be linked to external information to re-identify respondents in the released dataset. Key variables are also called “quasi-identifiers” or “implicit identifiers”.

Microaggregation Anonymization method that is based on replacing values for a certain variable with a common value for a group of records. The grouping of records is based on a proximity measure of variables of interest. The groups of records are also used to calculate the replacement value.

Microdata A set of records containing information on individual respondents or on economic entities. Such records may contain responses to a survey questionnaire or administrative forms.

Noise addition Anonymization method based on adding or multiplying a stochastic or randomized number to the original values to protect data from exact matching with external files. Noise addition is typically applied to continuous variables.


Non-perturbative methods Anonymization methods that reduce the detail in the data or suppress certain values (masking) without distorting the data structure.

Observation A set of data derived from an object/unit of experiment, e.g., an individual (in individual-level data), a household (in household-level data) or a company (in company data). Observations are also called “records”.

Original data The data before SDC/anonymization methods were applied. Also called “raw data” or “untreated data”.

Outlier An unusual value that is correctly reported but is not typical of the rest of the population. Outliers can also be observations with an unusual combination of values for variables, such as a 20-year-old widow. On their own, age 20 and widow are not unusual values, but their combination may be.1

Perturbative methods Anonymization methods that alter values slightly to limit disclosure risk by creating uncertainty around the true values, while retaining as much content and structure as possible, e.g., microaggregation and noise addition.

Population unique The only record in the population with a particular set of characteristics, such that the individual or organization can be distinguished from other units in the population based on that set of characteristics.

Post Randomization Method (PRAM) Anonymization method for microdata in which the scores of a categorical variable are altered according to certain probabilities. It is thus intentional misclassification with known misclassification probabilities.1

Probabilistic methods Anonymization methods that depend on a probability mechanism or a random number-generating mechanism. Every time a probabilistic method is used, a different outcome is generated.

Privacy Privacy is a concept that applies to data subjects, while confidentiality applies to data. The concept is defined as follows: “It is the status accorded to data which has been agreed upon between the person or organization furnishing the data and the organization receiving it and which describes the degree of protection which will be provided.”2

Public Use File (PUF) Type of release of microdata file, which is freely available to any user, for example on the internet.

Quasi-identifiers A set of variables that, in combination, can be linked to external information to re-identify respondents in the released dataset. Quasi-identifiers are also called “key variables” or “implicit identifiers”.

Raw data The data before SDC/anonymization methods were applied. Also called “original data” or “untreated data”.

Recoding Anonymization method for microdata in which groups of existing categories/values are replaced with new values, e.g., the values ‘protestant’ and ‘catholic’ are replaced with ‘Christian’. Recoding reduces the detail in the data. Recoding of continuous variables leads to a transformation from continuous to categorical, e.g., creating income bands.

Record A set of data derived from an object/unit of experiment, e.g., an individual (in individual-level data), a household (in household-level data) or a company (in company data). Records are also called “observations”.

Regression A statistical process of measuring the relation between the mean value of one variable and corresponding values of other variables.

Re-identification risk See Disclosure risk

Release Dissemination – the release to users of information obtained through a statistical activity.2

Respondents Individuals or units of observation whose information/responses to a survey make up the data file.

Sample unique The only record in the sample with a particular set of characteristics, such that the individual or organization can be distinguished from other units in the sample based on that set of characteristics.

Scientific Use File (SUF) Type of release of microdata file, which is only available to selected researchers under contract. Also known as “licensed file”, “microdata under contract” or “research file”.


sdcMicro An R-based package authored by Templ, M., Kowarik, A. and Meindl, B. with tools for the anonymization of microdata, i.e., for the creation of public- and scientific-use files.

sdcMicroGUI A GUI for the R-based sdcMicro package, which allows users to use the sdcMicro tools without R knowledge.

Sensitive variables Sensitive or confidential variables are those whose values must not be discovered for any respondent in the dataset. The determination of sensitive variables is often subject to legal and ethical concerns.

Statistical Disclosure Control (SDC) Statistical Disclosure Control techniques can be defined as the set of methods to reduce the risk of disclosing information on individuals, businesses or other organizations. Such methods are only related to the dissemination step and are usually based on restricting the amount of or modifying the data released.2

Suppression Data suppression involves not releasing information that is considered unsafe because it fails the confidentiality rules being applied. Sometimes this is done by replacing values signifying individual attributes with missing values. In the context of this guide, it is usually applied to achieve a desired level of 𝑘-anonymity.

Threshold An established level, value, margin or point at which values that fall above or below it will deem the data safe or unsafe. If unsafe, further action will need to be taken to reduce the risk of identification.

Untreated data The data before SDC/anonymization methods were applied. Also called “raw data” or “original data”.

Utility Data utility describes the value of data as an analytical resource, comprising analytical completeness and analytical validity.

Variable Any characteristic, number or quantity that can be measured or counted for each unit of observation.


CHAPTER 3

Statistical Disclosure Control (SDC): An Introduction

3.1 Need for SDC

A large part of the data collected by statistical agencies cannot be published directly due to privacy and confidentiality concerns. These concerns are both of a legal and an ethical nature. SDC seeks to treat and alter the data so that the data can be published or released without revealing the confidential information they contain, while, at the same time, limiting information loss due to the anonymization of the data. In this guide, we discuss only disclosure control for microdata.1

1 There is another strand of literature on the anonymization of tabular data, see, e.g., HDFG12.

Microdata are datasets that provide information on a set of variables for each individual respondent. Respondents can be natural persons, but also legal entities such as companies.

The aim of anonymizing microdata is to transform the datasets to achieve an “acceptable level” of disclosure risk. The level of acceptability of disclosure risk and the need for anonymization are usually at the discretion of the data producer and guided by legislation. These are formulated in the dissemination policies and programs of the data providers and based on considerations including “[. . . ] the costs and expertise involved; questions of data quality, potential misuse and misunderstanding of data by users; legal and ethical matters; and maintaining the trust and support of respondents” (DuBo10). There is a moral, ethical and legal obligation for the data producers to ensure that data provided by the respondents are used only for statistical purposes.

In some cases, the dissemination of microdata is a legal obligation, but, in most cases, the legislation will formulate restrictions. Thus, a country’s legislative framework will shape its microdata dissemination policy. It is crucial for data producers to “ensure there is a sound legal and ethical base (as well as the technical and methodological tools) for protecting confidentiality. This legal and ethical base requires a balanced assessment between the public good of confidentiality protection on the one hand, and the public benefits of research on the other. A decision on whether or not to provide access might depend on the merits of specific research proposals and the credibility of the researcher, and there should be some allowance for this in the legal arrangements.” (DuBo10).

“Data access arrangements should respect the legal rights and legitimate interests of all stakeholders in the public research enterprise. Access to, and use of, certain research data will necessarily be limited by various types of legal requirements, which may include restrictions for reasons of:

• National security: data pertaining to intelligence, military activities, or political decision making may be classified and therefore subject to restricted access.


• Privacy and confidentiality: data on human subjects and other personal data are subject to restricted access under national laws and policies to protect confidentiality and privacy. However, anonymization or confidentiality procedures that ensure a satisfactory level of confidentiality should be considered by custodians of such data to preserve as much data utility as possible for researchers.

• Trade secrets and intellectual property rights: data on, or from, businesses or other parties that contain confidential information may not be accessible for research. (. . . )” (DuBo10).

Box 1, extracted from DuBo10, provides several examples of statistical legislation on microdata release.

Info-box - Examples of statistical legislation on microdata release

1. The US Bureau of the Census operates under Title 13-Census of the US Code. “Title 13, U.S.C., Section 9 prohibits the publication or release of any information that would permit identification of any particular establishment, individual, or household. Disclosure assurance involves the steps taken to ensure that Title 13 data prepared for public release will not result in wrongful disclosure. This includes both the use of disclosure limitation methods and the review process to ensure that the disclosure limitation techniques used provide adequate protection to the information.”

Source: https://www.census.gov/srd/sdc/wendy.drb.faq.pdf

2. In Canada the Statistics Act states that “no person who has been sworn under section 6 shall disclose or knowingly cause to be disclosed, by any means, any information obtained under this Act in such a manner that it is possible from the disclosure to relate the particulars obtained from any individual return to any identifiable individual person, business or organization.”

Source: http://laws-lois.justice.gc.ca/eng/acts/S-19/page-2.html

Statistics Canada does release microdata files. Its microdata release policy states that Statistics Canada will authorise the release of microdata files for public use when:

(a) the release substantially enhances the analytical value of the data collected; and

(b) the Agency is satisfied all reasonable steps have been taken to prevent the identification of particular survey units.

3. In Thailand the Act states: “Personal information obtained under this act shall be strictly considered confidential. A person who performs his or her duty hereunder or a person who has the duty of maintaining such information cannot disclose it to anyone who doesn’t have a duty hereunder except in the case that:

(1) Such disclosure is for the purpose of any investigation or legal proceedings in a case relating to an offense hereunder.

(2) Such disclosure is for the use of agencies in the preparation, analysis or research of statistics provided that such disclosure does not cause damage to the information owner and does not identify or disclose the data owner.”

Source: http://web.nso.go.th/eng/en/about/stat_act2007.pdf section 15.

Besides the legal and ethical concerns and codes of conduct of agencies producing statistics, SDC is important because it safeguards data quality and response rates in future surveys. If respondents feel that data producers are not protecting their privacy, they might not be willing to participate in future surveys. “[. . . ] one incident, particularly if it receives strong media attention, could have a significant impact on respondent cooperation and therefore on the quality of official statistics” (DuBo10). At the same time, if data users are unable to gain enough utility from the data due to excessive or inappropriate SDC protection, or are unable to access the data, then the large investment in producing the data will be lost.


3.2 The risk-utility trade-off in the SDC process

SDC is characterized by the trade-off between risk of disclosure and utility of the data for end users. The risk-utility scale extends between two extremes: (i) no data is released (zero risk of disclosure), and thus users gain no utility from the data; and (ii) data is released without any treatment, and thus with maximum risk of disclosure, but also maximum utility to the user (i.e., no information loss). The goal of a well-implemented SDC process is to find the optimal point where utility for end users is maximized at an acceptable level of risk. Fig. 3.1 illustrates this trade-off. The triangle corresponds to the raw data. The raw data have no information loss, but generally have a disclosure risk higher than the acceptable level. The other extreme is the square, which corresponds to no data release. In that case there is no disclosure risk, but also no utility from the data for the users. The points in-between correspond to different choices of SDC methods and/or parameters for these methods applied to different variables. The SDC process looks for the SDC methods and the parameters for those methods and applies these in a way that reduces the risk sufficiently, while minimizing the information loss.

Fig. 3.1: Risk-utility trade-off

SDC cannot achieve total risk elimination, but can reduce the risk to an acceptable level. Any application of SDC methods will suppress or alter values in the data and as such decrease the utility (i.e., result in information loss) when compared to the original data. A common thread that will be emphasized throughout this guide will be that the process of SDC should prioritize the goal of protecting respondents, while at the same time keeping the data users in mind to limit information loss. In general, the lower the disclosure risk, the higher the information loss and the lower the data utility for end users.

In practice, choosing SDC methods is partially trial and error: after applying methods, disclosure risk and data utility are re-measured and compared to the results of other choices of methods and parameters. If the result is satisfactory, the data can be released. We will see that often the first attempt will not be the optimal one. The risk may not be sufficiently reduced or the information loss may be too high, and the process has to be repeated with different methods or parameters until a satisfactory solution is found. Disclosure risk, data utility and information loss in the SDC context, and how to measure them, are discussed in subsequent chapters of this guide.
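As an illustration of this iterative process, the sketch below runs one such iteration in sdcMicro on its demo dataset testdata; the key variables and the parameter k = 3 are illustrative choices only.

    library(sdcMicro)
    data("testdata")

    # Illustrative setup; in practice, key variables follow from intruder scenarios
    sdc <- createSdcObj(dat       = testdata,
                        keyVars   = c("urbrur", "water", "sex", "relat"),
                        numVars   = c("expend", "income", "savings"),
                        weightVar = "sampling_weight")

    print(sdc, "risk")    # disclosure risk in the untreated data

    # Apply an SDC method, e.g., local suppression to achieve 3-anonymity
    sdc <- localSuppression(sdc, k = 3)

    print(sdc, "risk")    # re-measure disclosure risk after treatment
    print(sdc, "ls")      # information loss: suppressions per key variable

    # If the risk is still too high or the information loss too large,
    # revert the last step and try different methods or parameters
    sdc <- undolast(sdc)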

Again, it must be stressed that the level of SDC and methods applied depend to a large extent on the entire data release framework. For example, a key consideration is to whom and under what conditions the data are to be released (see also the Section Release Types). If data are to be released as public use data, then the level of SDC applied will necessarily need to be higher than in the cases where data are released under license conditions to trusted users after careful vetting. With careful preparation, data may be released under both public and licensed versions. We discuss how this might be achieved later in the guide.

References


CHAPTER 4

Release Types

This section discusses data release. Rather than rewriting work that has already been conducted through the World Bank and its partners at the IHSN, this section extracts from an excellent guide published by DuBo10.

The trade-off between risk and utility in the anonymization process depends greatly on who the users are1 and under what conditions a microdata file is released. Generally, three types of data release methods are practiced and apply to different target groups.

• Public Use File (PUF): the data “are available to anyone agreeing to respect a core set of easy-to-meet conditions. Such conditions relate to what cannot be done with the data (e.g. the data cannot be sold), upon gaining access to the data. In some cases PUFs are disseminated with no conditions; often being made available on-line, [e.g. on the website of the statistical agency]. These data are made easily accessible because the risk of identifying individual respondents is considered minimal. Minimising the risk of disclosure involves eliminating all content that can identify respondents directly—for instance, names, addresses and telephone numbers. In addition this requires purging relevant indirect identifiers from the microdata file. These vary across survey designs, but commonly-suppressed indirect identifiers include geographical information below the sub-national level at which the sample is representative. Occasionally, certain records may be suppressed also from PUFs, as might variables characterised by extremely skewed distribution or outliers. However, in lieu of deleting entire records or variables from microdata files, alternative SDC methods can minimise the risk of disclosure while maximizing information content. Such methods include top-and-bottom coding, local suppression or using data perturbation techniques [(see the Section Anonymization methods for an overview of anonymization methods)]. PUFs are typically generated from census data files using a sub-set [or sample] of records rather than the entire file and [from sample surveys, such as] household surveys.” (DuBo10).

• Scientific Use File (SUF) (also known as a licensed file, microdata under contract or research file): the “dissemination is restricted to users who have received authorization to access them after submitting a documented application and signing an agreement governing the data’s use. While typically licensed files are also anonymised to ensure the risk of identifying individuals is minimised when used in isolation, they may still [potentially] contain identifiable data if linked with other data files. Direct identifiers such as respondents’ names must be removed from a licensed dataset. The data files may, however, still contain indirect variables that could identify respondents by matching them to other data files such as voter lists, land registers or school records. When disseminating licensed files, the recommendation is to establish and sign an agreement between the data producer and external bona fide users – trustworthy users with legitimate need to access the data. Such an agreement should govern access and use of such microdata files2. Sometimes, licensing agreements are only entered into with users affiliated to an appropriate sponsoring institution, i.e., research centers, universities or development partners. It is further recommended that, before entering into a data access and use agreement, the data producer asks potential users to complete an application form to demonstrate the need to use a licensed file (instead of the PUF version, if available) for a stated statistical or research purpose” (DuBo10). This also allows the data producer to learn which characteristics of the data are important for the users, which is valuable information for optimizing future anonymization processes.

• Microdata available in a controlled research data center (also known as data enclave): “Some files may be offered to users under strict conditions in a data enclave. This is a facility [(often on the premises of the data provider)] equipped with computers not linked to the internet or an external network and from which no information can be downloaded via USB ports, CD-DVD or other drives. Data enclaves contain data that are particularly sensitive or allow direct or easy identification of respondents. Examples include complete population census datasets, enterprise surveys and certain health related datasets containing highly-confidential information. Users interested in accessing a data enclave will not necessarily have access to the full dataset – only to the particular data subset they require. They will be asked to complete an application form demonstrating a legitimate need to access these data to fulfill a stated statistical or research purpose [. . . ] The outputs generated must be scrutinised by way of a full disclosure review before release. Operating a data enclave may be expensive – it requires special premises and computer equipment. It also demands staff with the skills and time to review outputs before their removal from the data enclave in order to ensure there is no risk of disclosure. Such staff must be familiar with data analysis and be able to review the request process and manage file servers. Because of the substantial operating costs and technical skills required, some statistical agencies or other official data producers opt to collaborate with academic institutions or research centres to establish and manage data enclaves.”

1 See Section 5 in DuBo10 as to who the users of microdata are and to whom microdata should be made available.

2 Appendix B provides an example of a blanket agreement.

There are other data access possibilities besides these, such as teaching files, files for other specific purposes, remote execution or remote access. Obviously, the required level of protection depends on the type of release; a PUF file must be protected to a much larger extent than a SUF file, which in turn has to be protected more than a file which is only available in an on-site facility. The Section Step 3: Type of release gives more guidance on the choice of the release type and its implications for the anonymization process. The same microdata set can be released in different ways for different users, e.g., as SUF and teaching file. The Section Step 3: Type of release discusses the particular issues of multiple releases of one dataset.

The first step for any agency that wants to release data would be the formulation of clear data dissemination policies for the release of microdata. We will see later that deciding on the level of anonymization needed will depend partly on knowing under what conditions the data will be released. Access policies and conditions provide the framework for the whole release process.

The following sections further specify the conditions under which microdata should be provided under different release types.

4.1 Conditions for PUFs

“Generally, data regarded as public are open to anyone with access to a [National Statistical Office (NSO)] website. It is, however, normally good practice to include statements defining suitable uses for and precautions to be adopted in using the data. While these may not be legally binding, they serve to sensitise the user. Prohibitions such as attempts to link the data to other sources can be part of the ‘use statement’ to which the user must agree, on-line, before the data can be downloaded. [. . . ] Dissemination of microdata files necessarily involves the application of rules or principles. [The info-box] below [taken from DuBo10] shows basic principles normally applying to PUFs.” (DuBo10).

Info-box - Conditions for accessing and using PUFs

1. Data and other material provided by the NSO will not be redistributed or sold to other individuals, institutions or organisations without the NSO’s written agreement.


2. Data will be used for statistical and scientific research purposes only. They will be employed solely for reporting aggregated information, including modelling, and not for investigating specific individuals or organisations.

3. No attempt will be made to re-identify respondents, and there will be no use of the identity of any person or establishment discovered inadvertently. Any such discovery will be reported immediately to the NSO.

4. No attempt will be made to produce links between datasets provided by the NSO or between NSO data and other datasets that could identify individuals or organisations.

5. Any books, articles, conference papers, theses, dissertations, reports or other publications employing data obtained from the NSO will cite the source, in line with the citation requirement provided with the dataset.

6. An electronic copy of all publications based on the requested data will be sent to the NSO.

7. The original collector of the data, the NSO, and the relevant funding agencies bear no responsibility for the data’s use or interpretation or inferences based upon it.

Note: Items 3 and 6 in the list require that users be provided with an easy way to communicate with the data provider. It is good practice to provide a contact number, an email address, and possibly an on-line “feedback provision” system.

Source: DuBo10

4.2 Conditions for SUFs

“For [SUFs], terms and conditions must include the basic common principles plus some additional ones applying to the researcher’s organisation. There are two options: firstly, data are provided to a researcher or a team for a specific purpose; secondly, data are provided to an organization under a blanket agreement for internal use, e.g., to an international body or research agency. In both cases, the researcher’s organisation must be identified, as must suitable representatives to sign the licence” (DuBo10).

Access to a researcher or research team for a specific purpose

“If data are provided for an individual research project, the research team must be identified. This is covered by requiring interested users to complete a formal request to access the data (a model of such a request form is provided in Appendix 1 [in DuBo10]). The conditions to obtain the data (see example in the info-box below) will specify that the files will not be shared outside the organisation and that data will be stored securely. To the possible extent, the intended use of the data – including a list of expected outputs and the organisation’s dissemination policy – must be identified. Access to licensed datasets is only granted when there is a legally-registered sponsoring agency, e.g., government ministry, university, research centre or national or international organization” (DuBo10).

Info-box - Conditions for accessing and using SUFs

Note: Items 1 to 7 below are similar to the conditions for use of public use files in the info-box above. Items 8 and 9 would have to be adapted in the case of a blanket agreement.

1. Data and other material provided by the NSO will not be redistributed or sold to other individuals, institutions or organisations without the NSO’s written agreement.

2. Data will be used for statistical and scientific research purposes only. They will be employed solely for reporting aggregated information, including modelling, and not for investigating specific individuals or organisations.

3. No attempt will be made to re-identify respondents, and there will be no use of the identity of any person or establishment discovered inadvertently. Any such discovery will be reported immediately to the NSO.

4. No attempt will be made to produce links between datasets provided by the NSO or between NSO data and other datasets that could identify individuals or organisations.


5. Any books, articles, conference papers, theses, dissertations, reports or other publications employing data obtained from the NSO will cite the source, in line with the citation requirement provided with the dataset.

6. An electronic copy of all publications based on the requested data will be sent to the NSO.

7. The NSO and the relevant funding agencies bear no responsibility for the data’s use or interpretation or inferences based upon it.


8. The researcher’s organisation must be identified, as must the principal and other researchers involved in using the data. The principal researcher must sign the licence on behalf of the organisation. If the principal researcher is not authorized to sign on behalf of the receiving organisation, a suitable representative must be identified.

9. The intended use of the data, including a list of expected outputs and the organisation’s dissemination policy, must be identified.

(Conditions 8 and 9 may be waived in the case of educational institutions.)

Source: DuBo10

Blanket agreement to an organization

“In the case of a blanket agreement, where it is agreed the data can be used widely but securely within the receiving organisation, the licence should ensure compliance, with a named individual formally assuming responsibility for this. Each additional user must be made aware of the terms and conditions that apply to data files: this can be achieved by having each user sign an affidavit. Where such an agreement exists, with security in place, it is not necessary for users to destroy the data after use” (DuBo10). Appendix B provides an example of the formulation of such an agreement.

4.3 Conditions for microdata available in a controlled research data center

Access to microdata in research data centers is “used for particularly sensitive data or for more detailed data for which sufficient anonymisation to release them outside the NSO premises is not possible. These can also be referred to as data laboratories or research data centres. A [research data centre] may be located at the NSO headquarters or in major centres such as universities close to the research community. They are used to give researchers access to complete data files but without the risk of releasing confidential data. In a typical [research data centre], NSO staff supervise access and use of the data; the computers must not be able to communicate outside the [research data centre]; and the results obtained by the researchers must be screened for confidentiality by an NSO analyst before being taken outside. A model of a data enclave access policy is provided in Appendix 2 [in DuBo10], and a model of a data enclave access request form is in Appendix 3 [in DuBo10]” (DuBo10).

Research data centers “have the advantage of providing access to detailed microdata but the disadvantage of requiring researchers to work at a different location. And they are expensive to set up and operate. It is, however, quite likely that many countries have used on-site researchers as a way of providing access to microdata. These researchers are sworn in under the statistics acts in the same way as regular NSO employees. This approach tends to favour researchers who live near NSO headquarters.” (DuBo10)

Recommended Reading Material on Release Types

Dupriez, O., & Boyko, E. (2010). Dissemination of Microdata Files: Principles, Procedures and Practices. International Household Survey Network (IHSN).


CHAPTER 5

Measuring Risk

5.1 Types of disclosure

Measuring disclosure risk is an important part of the SDC process: risk measures are used to judge whether a data file is safe enough for release. Before measuring disclosure risk, we have to define what type of disclosure is relevant for the data at hand. The literature commonly defines three types of disclosure; we take these directly from Lamb93 (see also HDFG12).

• Identity disclosure, which occurs if the intruder associates a known individual with a released data record. For example, the intruder links a released data record with external information, or identifies a respondent with extreme data values. In this case, an intruder can exploit a small subset of variables to make the linkage, and once the linkage is successful, the intruder has access to all other information in the released data related to the specific respondent.

• Attribute disclosure, which occurs if the intruder is able to determine some new characteristics of an individual based on the information available in the released data. Attribute disclosure occurs if a respondent is correctly re-identified and the dataset contains variables containing information that was previously unknown to the intruder. Attribute disclosure can also occur without identity disclosure. For example, if a hospital publishes data showing that all female patients aged 56 to 60 have cancer, an intruder then knows the medical condition of any female patient aged 56 to 60 in the dataset without having to identify the specific individual.

• Inferential disclosure, which occurs if the intruder is able to determine the value of some characteristic of an individual more accurately with the released data than would otherwise have been possible. For example, with a highly predictive regression model, an intruder may be able to infer a respondent’s sensitive income information using attributes recorded in the data, leading to inferential disclosure.

SDC methods for microdata are intended to prevent identity and attribute disclosure. Inferential disclosure is generally not addressed in SDC in the microdata setting, since microdata is distributed precisely so that researchers can make statistical inference and understand relationships between variables. In that sense, inference cannot be likened to disclosure. Also, inferences are designed to predict aggregate, not individual, behavior, and are therefore usually poor predictors of individual data values.


5.2 Classification of variables

For the purpose of the SDC process, we use the classifications of variables described in the following paragraphs (see Fig. 5.1 for an overview). The initial classification of variables into identifying and non-identifying variables depends on the way the variables can be used by intruders for re-identification (HDFG12, TeMK14):

• Identifying variables: these contain information that can lead to the identification of respondents and can be further categorized as:

– Direct identifiers reveal directly and unambiguously the identity of the respondent. Examples are names, passport numbers, social identity numbers and addresses. Direct identifiers should be removed from the dataset prior to release. Removal of direct identifiers is a straightforward process and always the first step in producing a safe microdata set for release. Removal of direct identifiers, however, is often not sufficient.

– Quasi-identifiers (or key variables) contain information that, when combined with other quasi-identifiers in the dataset, can lead to re-identification of respondents. This is especially the case when they can be used to match the information with other external information or data. Examples of quasi-identifiers are race, birth date, sex and ZIP/postal codes, which might be easily combined or linked to publicly available external information and make identification possible. The combinations of values of several quasi-identifiers are called keys (see also the section Levels of Risk). The values of quasi-identifiers themselves often do not lead to identification (e.g., male/female), but a combination of the values of several quasi-identifiers can render a record unique (e.g., male, 14 years, married) and hence identifiable. It is not generally advisable to simply remove quasi-identifiers from the data to solve the problem. In many cases, they will be important variables for any sensible analysis. In practice, any variable in the dataset could potentially be used as a quasi-identifier. SDC addresses this by identifying variables as quasi-identifiers and anonymizing them while still maintaining the information in the dataset for release.

• Non-identifying variables are variables that cannot be used for re-identification of respondents. This could be because these variables are not contained in any other data files or other external sources and are not observable to an intruder. Non-identifying variables are nevertheless important in the SDC process, since they may contain confidential/sensitive information, which may prove damaging should disclosure occur as a result of identity disclosure based on identifying variables.

These classifications of variables depend partially on the availability of external datasets that might contain information that, when combined with the current data, could lead to disclosure. The identification and classification of variables as quasi-identifiers depends, amongst others, on the availability of information in external datasets. An important step in the SDC process is to define a list of possible disclosure scenarios based on how the quasi-identifiers might be combined with each other and information in external datasets and then treating the data to prevent disclosure. We discuss disclosure scenarios in more detail in the section Disclosure scenarios.

For the SDC process, it is also useful to further classify the quasi-identifiers into categorical, continuous and semi-continuous variables. This classification is important for determining the appropriate SDC methods for that variable, as well as the validity of risk measures.

• Categorical variables take values over a finite set, and any arithmetic operations using them are generally not meaningful or not allowed. Examples of categorical variables are gender, region and education level.

• Continuous variables can take on an infinite number of values in a given set. Examples are income, body height and size of land plot. Continuous variables can be transformed into categorical variables by constructing intervals, such as income bands (see the code sketch below).1

• Semi-continuous variables are continuous variables that take on values that are limited to a finite set. An example is age measured in years, which could take on values in the set {0, 1, . . . , 100}. The finite nature of the values for these variables means that they can be treated as categorical variables for the purpose of SDC.2

1 Recoding a continuous variable is sometimes useful in cases where the data contains only a few continuous variables. We will see in the Section Individual risk that many methods used for risk calculation depend on whether the variables are categorical. We will also see that it is easier for the measurement of risk if the data contains only categorical or only continuous variables.

2 This is discussed in greater detail in the following sections. In cases where the number of possible values is large, recoding the variable, or parts of the set it takes values on, to obtain fewer distinct values is recommended.
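To illustrate the recoding of a continuous variable into intervals mentioned in the list above, the base-R function cut() can be used. The income values, break points and labels below are hypothetical and serve only as a sketch of the approach:

# Hypothetical income values
income <- c(12000, 25000, 48000, 150000)

# Recode into income bands; breaks and labels are illustrative only
income_band <- cut(income, breaks = c(0, 20000, 50000, 100000, Inf),
                   labels = c('low', 'middle', 'high', 'top'))
income_band
[1] low    middle middle top
Levels: low middle high top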

Apart from these classifications of variables, the SDC process further classifies variables according to their sensitivity or confidentiality. Both quasi-identifiers and non-identifying variables can be classified as sensitive (or confidential) or non-sensitive (or non-confidential). This distinction is not important for direct identifiers, since direct identifiers are removed from the released data.

• Sensitive variables contain confidential information that should not be disclosed without suitable treatment using SDC methods to reduce disclosure risk. Examples are income, religion, political affiliation and variables concerning health. Whether a variable is sensitive depends on the context and country: a certain variable can be considered sensitive in one country and non-sensitive in another.

• Non-sensitive variables contain non-confidential information on the respondent, such as place of residence or rural/urban residence. The classification of a variable as non-sensitive, however, does not mean that it does not need to be considered in the SDC process. Non-sensitive variables may still serve as quasi-identifiers when combined with other variables or other external data.

Fig. 5.1: Classification of variables

5.3 Disclosure scenarios

Evaluation of disclosure risk is carried out with reference to the available data sources in the environment where the dataset is to be released. In this setting, disclosure risk is the possibility of correctly re-identifying an individual in the released microdata file by matching their data to an external file based on a set of quasi-identifiers. The risk assessment is done by identifying so-called disclosure or intrusion scenarios. A disclosure scenario describes the information potentially available to the intruder (e.g., census data, electoral rolls, population registers or data collected by private firms) to identify respondents and the ways such information can be combined with the microdata set to be released and used for re-identification of records in the dataset. Typically, these external datasets include direct identifiers. In that case, the re-identification of records in the released dataset leads to identity and, possibly, attribute disclosure. The main outcome of the evaluation of disclosure scenarios is the identification of a set of quasi-identifiers (i.e., key variables) that need to be treated during the SDC process (see ELMP10).

An example of a disclosure scenario could be the spontaneous recognition of a respondent by a researcher. For instance, while going through the data, the researcher recognizes a person with an unusual combination of the variables age and marital status. Of course, this can only happen if the person is well-known or is known to the researcher. Another example of a disclosure scenario for a publicly available file would be if variables in the data could be linked to a publicly available electoral register. An intruder might try matching the entire dataset with individuals in the register. However, this might be difficult, take specialized expertise or software, and require other conditions to be fulfilled; for example, the points in time the datasets were collected should approximately match, and the content of the variables should be (nearly) identical. If these conditions are not fulfilled, exact matching is much less likely.

Note: Not all external data is necessarily in the public domain. Privately owned datasets and datasets that are not released should also be taken into consideration when determining the suitable disclosure scenarios.

Info-box - Disclosure scenarios and different release types

A dataset can have more than one disclosure scenario. Disclosure scenarios also differ depending on the access type under which the data will be released; for example, Public Use Files (PUF), Scientific Use Files (SUF, also known as licensed files) or files in a data enclave. The required level of protection, the potential avenues of disclosure and the availability of other external data sources differ according to the access type under which the data will be released. For example, the user of a Scientific Use File (SUF) might be contractually restricted by an agreement as to what they are allowed to do with the data, whereas a Public Use File (PUF) might be freely available on the internet under a much looser set of conditions. PUFs will in general require more protection than SUFs, and SUFs will require more protection than files only released in a data enclave. Disclosure scenarios should be developed with all of this in mind.

The evaluation of disclosure risk is based on the quasi-identifiers, which are identified in the analysis of disclosure risk scenarios. The disclosure risk directly depends on the inclusion or exclusion of variables in the set of quasi-identifiers chosen. This step in the SDC process (making the choice of quasi-identifiers) should therefore be approached with great thought and care. We will see later, as we discuss the steps in the SDC process in more detail, that the first step for any agency is to undertake an exercise in which an inventory is compiled of all datasets available in the country. Both datasets released by the national statistical office and datasets from other sources are considered, and their availability to intruders as well as the variables included in these datasets are analyzed. It is this information that will serve as a key metric when deciding which variables to choose as potential identifiers, as well as dictate the level of SDC and methods needed.

5.4 Levels of risk

With microdata from surveys and censuses, we often have to be concerned about disclosure at the individual or unit level, i.e., identifying individual respondents. Individual respondents are generally natural persons, but can also be units, such as companies, schools, health facilities, etc. Microdata files often have a hierarchical structure where individual units belong to groups, e.g., people belong to households. The most common hierarchical structure in microdata is the household structure in household survey data. Therefore, in this guide, we sometimes call disclosure risk for data with a hierarchical structure “household risk”. The concepts, however, apply equally to establishment data and other data with hierarchical structures, such as school data with pupils and teachers or company data with employees.

We will see that this hierarchical structure is important to take into consideration when measuring disclosure risk. For hierarchical data, information collected at the higher hierarchical level (e.g., household level) would be the same for all individuals in the group belonging to that higher hierarchical level (e.g., household).3 Some typical examples of variables that would have the same values for all members of the same higher hierarchical unit are, in the case of households, those relating to housing and household income. These variables differ from survey to survey and from country to country.4 This hierarchical structure creates a further level of disclosure risk for two reasons:

1. if one individual in the household is re-identified, the household structure allows for re-identification of the other household members in the same household,

2. values of variables for other household members that are common for all household members can be used for re-identification of another individual of the same household. This is discussed in more detail in the Section Household Risk.

3 Besides variables collected at the higher hierarchical level, also variables collected at the lower level but with no (or little) variation within the groups formed by the hierarchical structure should be treated as higher level variables. An example could be mother tongue, where most households are monolingual, but the variable is collected at the individual level.

4 Religion, for example, can be shared by all household members in some countries, whereas in other countries this variable is measured at the individual level and mixed-religion households exist.

Next, we first discuss risk measures used to evaluate disclosure risk in the absence of a hierarchical structure. This includes risk measures that seek to aggregate the individual risk for all individuals in the microdata file; the objective is to quantify a global disclosure risk measure for the file. We then discuss how risk measures change when taking the hierarchical structure of the data into account.

We will also discuss how risk measures differ for categorical and continuous key variables. For categorical variables, we will use the concept of uniqueness of combinations of values of quasi-identifiers (so-called “keys”) used to identify individuals at risk. The concept of uniqueness, however, is not useful for continuous variables, since it is likely that all or many individuals will have unique values for that variable, by definition of a continuous variable. Risk measures for categorical variables are generally a priori measures, i.e., they can be evaluated before applying anonymization methods since they are based on the principle of uniqueness. Risk measures for continuous variables are a posteriori measures; they are based on comparing the microdata before and after anonymization and are, for example, based on the proximity of observations between the original and the treated (anonymized) datasets.

Files that are limited to only categorical or only continuous key variables are easiest for risk measurement. We will see in later sections that, in cases where both types of variables are present, recoding of continuous variables into categories is one approach to simplify the SDC process, but we will also see that from a utility perspective this may not be desirable. An example might be the use of income quintiles instead of the actual income variables. We will see that measuring the risk of disclosure based on the categorical and continuous variables separately is generally not a valid approach.

The risk measures discussed in the next section are based on several assumptions. In general, these measures rely on quite restrictive assumptions and will often lead to conservative risk estimates. These conservative risk measures may overstate the risk, as they assume a worst-case scenario. Two assumptions should, however, be fulfilled for the risk measures to be valid and meaningful: the microdata should be a sample of a larger population (no census) and the sampling weights should be available. The Section Special case: census data briefly discusses how to deal with census data.

5.5 Individual risk

5.5.1 Categorical key variables and frequency counts

The main focus of risk measurement for categorical quasi-identifiers is on identity disclosure. Measuring disclosure risk is based on the evaluation of the probability of correct re-identification of individuals in the released data. We use measures based on the actual microdata to be released. In general, the rarer a combination of values of the quasi-identifiers (i.e., key) of an observation in the sample, the higher the risk of identity disclosure. An intruder who tries to match an individual who has a relatively rare key within the sample data with an external dataset in which the same key exists will have a higher probability of finding a correct match than when a larger number of individuals share the same key. This is illustrated with the example in Table 5.1.

Table 5.1 shows values for 10 respondents for the quasi-identifiers “residence”, “gender”, “education level” and “labor status”. In the data, we find seven unique combinations of values of quasi-identifiers (i.e., patterns or keys) of the four quasi-identifiers. Examples of keys are {‘urban’, ‘female’, ‘secondary incomplete’, ‘employed’} and {‘urban’, ‘female’, ‘primary incomplete’, ‘non-LF’}. Let 𝑓𝑘 be the sample frequency of the 𝑘th key, i.e., the number of individuals in the sample with values of the quasi-identifiers that coincide with the 𝑘th key. This would be 2 for the key {‘urban’, ‘female’, ‘secondary incomplete’, ‘employed’}, since this key is shared by individuals 1 and 2, and 1 for the key {‘urban’, ‘female’, ‘primary incomplete’, ‘non-LF’}, which is unique to individual 3. By definition, 𝑓𝑘 is the same for each record sharing a particular key.

The fewer the individuals with whom an individual shares his or her combination of quasi-identifiers, the more likely the individual is to be correctly matched in another dataset that contains these quasi-identifiers. Even when direct identifiers are removed from the dataset, that individual has a higher disclosure risk than others, assuming that their sample weights are the same. Table 5.1 reports the sample frequencies 𝑓𝑘 of the keys for all individuals. Individuals with the same keys have the same sample frequency. If 𝑓𝑘 = 1, the individual has a unique combination of values of quasi-identifiers and is called a “sample unique”. The dataset in Table 5.1 contains four sample uniques. Risk measures are based on this sample frequency.

Table 5.1: Example dataset showing sample frequencies, population frequencies and individual disclosure risk

No | Residence | Gender | Education level      | Labor status | Weight | 𝑓𝑘 | 𝐹𝑘  | risk
1  | Urban     | Female | Secondary incomplete | Employed     | 180    | 2  | 360 | 0.0054
2  | Urban     | Female | Secondary incomplete | Employed     | 180    | 2  | 360 | 0.0054
3  | Urban     | Female | Primary incomplete   | Non-LF       | 215    | 1  | 215 | 0.0251
4  | Urban     | Male   | Secondary complete   | Employed     | 76     | 2  | 152 | 0.0126
5  | Rural     | Female | Secondary complete   | Unemployed   | 186    | 1  | 186 | 0.0282
6  | Urban     | Male   | Secondary complete   | Employed     | 76     | 2  | 152 | 0.0126
7  | Urban     | Female | Primary complete     | Non-LF       | 180    | 1  | 180 | 0.0290
8  | Urban     | Male   | Post-secondary       | Unemployed   | 215    | 1  | 215 | 0.0251
9  | Urban     | Female | Secondary incomplete | Non-LF       | 186    | 2  | 262 | 0.0074
10 | Urban     | Female | Secondary incomplete | Non-LF       | 76     | 2  | 262 | 0.0074

In Listing 5.1, we show how to use the sdcMicro package to create a list of sample frequencies 𝑓𝑘 for each record in a dataset. This is done by using the sdcMicro function freq(). A value of 2 for an observation means that in the sample, there is one more individual with exactly the same combination of values for the selected key variables. In Listing 5.1, the function freq() is applied to “sdcInitial”, which is an sdcMicro object. Footnote 5 shows how to initialize the sdcMicro object for this example. For a complete discussion of sdcMicro objects as well as instructions on how to create sdcMicro objects, we refer to the section Objects of class sdcMicroObj. sdcMicro objects are used when doing SDC with sdcMicro. The function freq() displays the sample frequency for the keys constructed on a defined set of quasi-identifiers. Listing 5.1 corresponds to the data in Table 5.1.

Listing 5.1: Calculating 𝑓𝑘 using sdcMicro

# Frequency of the particular combination of key variables (keys) for each record in the sample
freq(sdcInitial, type = 'fk')
2 2 1 2 1 2 1 1 2 2

5 The code examples in this guide are based on sdcMicro objects. An sdcMicro object contains, amongst others, the data and identifies all the specified key variables. The code below creates a data.frame with the data from Table 5.1 and the sdcMicro object “sdcInitial” used in most examples in this section.

library(sdcMicro)

# Set up dataset
data <- as.data.frame(cbind(
  as.factor(c('Urban', 'Urban', 'Urban', 'Urban', 'Rural',
              'Urban', 'Urban', 'Urban', 'Urban', 'Urban')),
  as.factor(c('Female', 'Female', 'Female', 'Male', 'Female',
              'Male', 'Female', 'Male', 'Female', 'Female')),
  as.factor(c('Sec in', 'Sec in', 'Prim in', 'Sec com', 'Sec com',
              'Sec com', 'Prim com', 'Post-sec', 'Sec in', 'Sec in')),
  as.factor(c('Emp', 'Emp', 'Non-LF', 'Emp', 'Unemp',
              'Emp', 'Non-LF', 'Unemp', 'Non-LF', 'Non-LF')),
  as.factor(c('yes', 'yes', 'yes', 'yes', 'yes',
              'no', 'no', 'yes', 'no', 'yes')),
  c(180, 180, 215, 76, 186, 76, 180, 215, 186, 76)))

# Specify variable names
names(data) <- c('Residence', 'Gender', 'Educ', 'Lstat', 'Health', 'Weights')

# Set up sdcMicro object with specified quasi-identifiers and weight variable
sdcInitial <- createSdcObj(dat = data,
                           keyVars = c('Residence', 'Gender', 'Educ', 'Lstat'),
                           weightVar = 'Weights')

For sample data, it is more interesting to look at 𝐹𝑘, the population frequency of a combination of quasi-identifiers (key) 𝑘, which is the number of individuals in the population with the key that corresponds to key 𝑘. The population frequency is unknown if the microdata is a sample and not a census. Under certain assumptions, the expected value of the population frequencies can be computed using the sample design weight 𝑤𝑖 (in a simple sample, this is the inverse of the inclusion probability) for each individual 𝑖:

$$F_k = \sum_{i \,:\, \text{key of individual } i \text{ corresponds to key } k} w_i$$

𝐹𝑘 is the sum of the sample weights of all records with the same key 𝑘. Hence, like 𝑓𝑘, 𝐹𝑘 is the same for each record with key 𝑘. The risk of correct re-identification is the probability that the key is matched to the correct individual in the population. Since every individual in the sample with key 𝑘 corresponds to 𝐹𝑘 individuals in the population, the probability of correct re-identification is 1/𝐹𝑘. This is the probability of re-identification in the worst-case scenario and can be interpreted as disclosure risk. Individuals with the same key have the same frequencies, i.e., the frequency of the key.
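As a cross-check on these definitions, 𝑓𝑘 and 𝐹𝑘 can be recomputed directly in base R. The following is a minimal, self-contained sketch using the values from Table 5.1; it illustrates the formula above and is not part of the sdcMicro workflow:

# Quasi-identifiers and weights from Table 5.1
res   <- c('Urban', 'Urban', 'Urban', 'Urban', 'Rural',
           'Urban', 'Urban', 'Urban', 'Urban', 'Urban')
gen   <- c('Female', 'Female', 'Female', 'Male', 'Female',
           'Male', 'Female', 'Male', 'Female', 'Female')
educ  <- c('Sec in', 'Sec in', 'Prim in', 'Sec com', 'Sec com',
           'Sec com', 'Prim com', 'Post-sec', 'Sec in', 'Sec in')
lstat <- c('Emp', 'Emp', 'Non-LF', 'Emp', 'Unemp',
           'Emp', 'Non-LF', 'Unemp', 'Non-LF', 'Non-LF')
w     <- c(180, 180, 215, 76, 186, 76, 180, 215, 186, 76)

# One factor level per distinct key (combination of quasi-identifier values)
keys <- interaction(res, gen, educ, lstat, drop = TRUE)

# fk: number of records sharing each record's key
fk <- ave(rep(1, length(w)), keys, FUN = sum)

# Fk: sum of the sample weights over all records sharing the key
Fk <- ave(w, keys, FUN = sum)

fk      # 2 2 1 2 1 2 1 1 2 2
Fk      # 360 360 215 152 186 152 180 215 262 262
1 / Fk  # worst-case re-identification probability; note that the 'risk'
        # column in Table 5.1 is based on a Bayesian model and differs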

If 𝐹𝑘 = 1, the key 𝑘 is both a sample and a population unique and the disclosure risk would be 1. Population uniques are an important factor to consider when evaluating risk, and deserve special attention. Table 5.1 also shows 𝐹𝑘 for the example dataset. This is further discussed in the case studies in the Section Case Studies.

Both 𝑓𝑘, the sample frequency of key 𝑘 (i.e., the number of individuals in the sample with the combination of quasi-identifiers specified in key 𝑘), and 𝐹𝑘, the estimated population frequency of key 𝑘, can be displayed in sdcMicro. Listing 5.2 illustrates how to return lists of length 𝑛 of frequencies for all individuals. The frequencies are displayed for each individual and not for each key.

Listing 5.2: Calculating the sample and population frequencies using sdcMicro

# Sample frequency of individual’s key
freq(sdcInitial, type = 'fk')
2 2 1 2 1 2 1 1 2 2

# Population frequency of individual’s key
freq(sdcInitial, type = 'Fk')
360 360 215 152 186 152 180 215 262 262

In practice, this approach leads to conservative risk estimates, as it does not adequately take the sampling methods into account. In this case, the re-identification risk may be overestimated. If this overestimated risk is used, the data may be overprotected (i.e., information loss will be higher than necessary) when applying SDC measures. Instead, a Bayesian approach to risk measurement is recommended, where the posterior distribution of 𝐹𝑘 is used (see e.g., HDFG12) to estimate an individual risk measure 𝑟𝑘 for each key 𝑘.

The risk measure 𝑟𝑘 is, like 𝑓𝑘 and 𝐹𝑘, the same for all individuals sharing the same pattern of values of key variables and is referred to as individual risk. The values 𝑟𝑘 can also be interpreted as the probability of disclosure for the individuals or as the probability of a successful match with individuals chosen at random from an external data file with the same values of the key variables. This risk measure is based on certain assumptions6, which are strict and may lead to a relatively conservative risk measure. In sdcMicro, the risk measure 𝑟𝑘 is automatically computed when creating an sdcMicro object and saved in the “risk” slot7. Listing 5.3 shows how to retrieve the risk measures using sdcMicro for our example. The risk measures are also presented in Table 5.1.

6 The assumptions for this risk measure are strict and in many cases the estimated risk is higher than the actual risk. Among other assumptions, it is assumed that all individuals in the sample are also included in the external file used by the intruder to match against. If this is not the case, the risk is much lower; if an individual in the released file is not contained in the external file, the probability of a correct match is zero. Other assumptions are that the files contain no errors and that both sets of data were collected simultaneously, i.e., they contain the same information. These assumptions will often not hold, but they are necessary for the computation of a measure. A violation of the last assumption could occur if the datasets were collected at different points in time and records have changed. This can happen when people move or change jobs, which makes correct matching impossible. The assumptions reflect a worst-case scenario.

7 See the Section Objects of class sdcMicroObj for more information on slots and the sdcMicro object structure.

Listing 5.3: The individual risk slot in the sdcMicro object

sdcInitial@risk$individual

             risk  fk  Fk
 [1,] 0.005424520   2 360
 [2,] 0.005424520   2 360
 [3,] 0.025096439   1 215
 [4,] 0.012563425   2 152
 [5,] 0.028247279   1 186
 [6,] 0.012563425   2 152
 [7,] 0.029010932   1 180
 [8,] 0.025096439   1 215
 [9,] 0.007403834   2 262
[10,] 0.007403834   2 262

The main factors influencing the individual risk are the sample frequencies 𝑓𝑘 and the sampling design weights 𝑤𝑖. If an individual is at relatively high risk of disclosure (in our example, individuals 3, 5, 7 and 8 in Table 5.1 and Listing 5.3), the probability that a potential intruder correctly matches this individual with an external data file is high relative to the other individuals in the released data. In our example, the reason for the high risk is the fact that these individuals are sample uniques (𝑓𝑘 = 1). This risk is the worst-case scenario risk and does not imply that the person will be re-identified with certainty with this probability. For instance, if an individual included in the microdata is not included in the external data file, the probability of a correct match is zero. Nevertheless, the risk measure computed based on the frequencies will be positive.

5.5.2 𝑘-anonymity

The risk measure 𝑘-anonymity is based on the principle that, in a safe dataset, the number of individuals sharing the same combination of values (keys) of categorical quasi-identifiers should be higher than a specified threshold 𝑘. 𝑘-anonymity is a risk measure based on the microdata to be released, since it only takes the sample into account. An individual violates 𝑘-anonymity if the sample frequency count 𝑓𝑘 for the key 𝑘 is smaller than the specified threshold 𝑘. For example, if an individual has the same combination of quasi-identifiers as two other individuals in the sample, these individuals satisfy 3-anonymity but violate 4-anonymity. In the dataset in Table 5.1, six individuals satisfy 2-anonymity and four violate 2-anonymity. The individuals that violate 2-anonymity are sample uniques. The risk measure is the number of observations that violate 𝑘-anonymity for a certain value of 𝑘, which is

$$\sum_i I(f_k < k),$$

where 𝐼 is the indicator function and 𝑖 refers to the 𝑖th record. This is simply a count of the number of individuals with a sample frequency of their key lower than 𝑘. The count is higher for larger 𝑘, since any record that violates 𝑘-anonymity also violates (𝑘 + 1)-anonymity. The risk measure 𝑘-anonymity does not consider the sample weights, but it is important to consider the sample weights when determining the required level of 𝑘-anonymity. If the sample weights are large, one individual in the dataset represents more individuals in the target population, the probability of a correct match is smaller, and hence the required threshold can be lower. Large sample weights go together with smaller datasets. In a smaller dataset, the probability of finding another record with the same key is smaller than in a larger dataset. This probability is related to the number of records in the population with a particular key through the sample weights.

In sdcMicro we can display the number of observations violating a given 𝑘-anonymity threshold. In Listing 5.4, we use sdcMicro to calculate the number of violators for the thresholds 𝑘 = 2 and 𝑘 = 3. Both the absolute number of violators and the relative number as a percentage of the number of individuals in the sample are given. In the example, four observations violate 2-anonymity and all 10 observations violate 3-anonymity.

Listing 5.4: Using the print() function to display observations violating 𝑘-anonymity

print(sdcInitial, 'kAnon')

Number of observations violating
  - 2-anonymity: 4
  - 3-anonymity: 10
--------------------------
Percentage of observations violating
  - 2-anonymity: 40 %
  - 3-anonymity: 100 %

For other levels of 𝑘-anonymity, it is possible to compute the number of violating individuals by using the sample frequency counts in the sdcMicro object. The number of violators is the number of individuals with sample frequency counts smaller than the specified threshold 𝑘. In Listing 5.5, we show an example of how to calculate the number of violators for any threshold 𝑘 using the already-stored risk measures available after setting up an sdcMicro object in R. 𝑘 can be replaced with any required threshold. The choice of the required threshold that all individuals in the microdata file should satisfy depends on many factors and is discussed further in the Section Local suppression. In many institutions, typically required thresholds for 𝑘-anonymity are 3 and 5.

Listing 5.5: Computing 𝑘-anonymity violations for other values of 𝑘

sum(sdcInitial@risk$individual[,2] < k)
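As a usage sketch, for the example data and a required threshold of 𝑘 = 5 (the threshold value is chosen here for illustration only):

# Count records violating 5-anonymity; column 2 of the individual risk
# slot holds the sample frequencies fk
k <- 5
sum(sdcInitial@risk$individual[,2] < k)
# returns 10: every record in the example violates 5-anonymity,
# since the largest fk is 2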

It is important to note that missing values (‘NA’s in R8) are treated as if they were any other value. Two individuals with keys {‘Male’, NA, ‘Employed’} and {‘Male’, ‘Secondary complete’, ‘Employed’} share the same key, and similarly, {‘Male’, NA, ‘Employed’} and {‘Male’, ‘Secondary incomplete’, ‘Employed’} also share the same key. Therefore, the missing value in the first key is first interpreted as ‘Secondary complete’, and then as ‘Secondary incomplete’. This is illustrated in Table 5.2.

8 In sdcMicro it is important to use the standard missing value code NA instead of other codes, such as 9999 or strings. In the Section Missing values, we further discuss how to set other missing value codes to NA in R. This is necessary to ensure that the methods in sdcMicro function properly. When missing values have codes other than NA, the missing value codes are interpreted as a distinct factor level in the case of categorical variables.

Note: The sample frequency of the third record is 3, since it is regarded as sharing its key with both the first and the second record.

This principle is used when applying local suppression to achieve a certain level of 𝑘-anonymity (see the Section Local suppression) and is based on the fact that the value NA could replace any value.

Table 5.2: Example dataset to illustrate the effect of missing values on 𝑘-anonymity

No | Gender | Education level      | Labor status | 𝑓𝑘
1  | Male   | Secondary complete   | Employed     | 2
2  | Male   | Secondary incomplete | Employed     | 2
3  | Male   | NA                   | Employed     | 3
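The frequencies in Table 5.2 can be reproduced with a small base-R sketch that treats NA as a wildcard when comparing keys. This only illustrates the counting principle described above; it is not sdcMicro’s implementation:

# Keys from Table 5.2
keys2 <- data.frame(Gender = c('Male', 'Male', 'Male'),
                    Educ   = c('Sec com', 'Sec in', NA),
                    Lstat  = c('Emp', 'Emp', 'Emp'),
                    stringsAsFactors = FALSE)

# Two keys match if every variable agrees or is NA in either key
match_wild <- function(a, b) all(is.na(a) | is.na(b) | a == b)

# fk: for each record, the number of records (including itself) whose key matches
fk <- sapply(seq_len(nrow(keys2)), function(i)
        sum(sapply(seq_len(nrow(keys2)), function(j)
          match_wild(unlist(keys2[i, ]), unlist(keys2[j, ])))))
fk  # 2 2 3, as in Table 5.2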

If a dataset satisfies 𝑘-anonymity, an intruder will always find at least 𝑘 individuals with the same combination of quasi-identifiers. 𝑘-anonymity is often a necessary requirement for anonymization of a dataset before release, but it is not necessarily a sufficient requirement. The 𝑘-anonymity measure is based only on frequency counts and does not take (differences in) sample weights into account. Often 𝑘-anonymity is achieved by first applying recoding and subsequently applying local suppression, and in some cases by microaggregation, before using other risk measures and disclosure methods to further reduce disclosure risk. These methods are discussed in the Section Anonymization methods.

5.5.3 𝑙-diversity

𝑘-anonymity has been criticized for not being restrictive enough. Sensitive information might be disclosed even if the data satisfies 𝑘-anonymity. This might occur in cases where the data contains sensitive (non-identifying) categorical variables that have the same value for all individuals that share the same key. Examples of such sensitive variables are those containing information on an individual’s health status. Table 5.3 illustrates this problem by using the same data as previously used, but adding a sensitive variable, “health”. The first two individuals satisfy 2-anonymity for the key variables “residence”, “gender”, “education level” and “labor status”. This means that an intruder will find at least two individuals when matching to the released microdata set based on those four quasi-identifiers. Nevertheless, if the intruder knows that someone belongs to the sample and has the key {‘Urban’, ‘Female’, ‘Secondary incomplete’, ‘Employed’}, the health status is disclosed with certainty (‘yes’), because both observations with this key have the same value. This information is thus disclosed without the necessity of matching exactly to the individual. This is not the case for the individuals with the key {‘Urban’, ‘Male’, ‘Secondary complete’, ‘Employed’}. Individuals 4 and 6 have different values (‘yes’ and ‘no’) for “health”, and thus the intruder would not gain information about the health status from this dataset by matching an individual to one of these individuals.


Table 5.3: 𝑙-diversity illustration

No | Residence | Gender | Education level      | Labor status | Health | 𝑓𝑘 | 𝐹𝑘  | 𝑙-diversity
1  | Urban     | Female | Secondary incomplete | Employed     | yes    | 2  | 360 | 1
2  | Urban     | Female | Secondary incomplete | Employed     | yes    | 2  | 360 | 1
3  | Urban     | Female | Primary incomplete   | Non-LF       | yes    | 1  | 215 | 1
4  | Urban     | Male   | Secondary complete   | Employed     | yes    | 2  | 152 | 2
5  | Rural     | Female | Secondary complete   | Unemployed   | yes    | 1  | 186 | 1
6  | Urban     | Male   | Secondary complete   | Employed     | no     | 2  | 152 | 2
7  | Urban     | Female | Primary complete     | Non-LF       | no     | 1  | 180 | 1
8  | Urban     | Male   | Post-secondary       | Unemployed   | yes    | 1  | 215 | 1
9  | Urban     | Female | Secondary incomplete | Non-LF       | no     | 2  | 262 | 2
10 | Urban     | Female | Secondary incomplete | Non-LF       | yes    | 2  | 262 | 2

The concept of (distinct) 𝑙-diversity addresses this shortcoming of 𝑘-anonymity (see MKGV07). A dataset satisfies 𝑙-diversity if for every key 𝑘 there are at least 𝑙 different values for each of the sensitive variables. In the example, the first two individuals satisfy only 1-diversity, while individuals 4 and 6 satisfy 2-diversity. The required level of 𝑙-diversity depends on the number of possible values the sensitive variable can take. If the sensitive variable is a binary variable, the highest level of 𝑙-diversity that can be achieved is 2. A sample unique will always satisfy only 1-diversity.
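Distinct 𝑙-diversity can also be illustrated in base R: for each record, count the distinct values of the sensitive variable among the records sharing its key. The sketch below reuses the keys object from the 𝑓𝑘/𝐹𝑘 sketch in the Section Individual risk and is only a cross-check, not the sdcMicro implementation:

# Sensitive variable "Health" from Table 5.3
health <- c('yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'yes', 'no', 'yes')

# Distinct l-diversity: number of distinct sensitive values within each key
ldiv <- ave(as.integer(factor(health)), keys,
            FUN = function(x) length(unique(x)))
ldiv  # 1 1 1 2 1 2 1 1 2 2, reproducing the l-diversity column in Table 5.3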

To compute 𝑙-diversity for sensitive variables in sdcMicro, the function ldiversity() can be used. This is illustrated in Listing 5.6. As arguments, we specify the names of the sensitive variables9 in the file as well as a constant for recursive 𝑙-diversity10 and the code for missing values in the data. The output is saved in the “risk” slot of the sdcMicro object. The result shows the minimum, maximum, mean and quantiles of the 𝑙-diversity scores for all individuals in the sample. The output in Listing 5.6 reproduces the results based on the data in Table 5.3.

Listing 5.6: 𝑙-diversity function in sdcMicro

# Computing l-diversity
sdcInitial <- ldiversity(obj = sdcInitial, ldiv_index = c("Health"),
                         l_recurs_c = 2, missing = NA)

# Output for l-diversity
sdcInitial@risk$ldiversity

--------------------------
L-Diversity Measures
--------------------------
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 1.0     1.0     1.0     1.4     2.0     2.0

# l-diversity score for each record
sdcInitial@risk$ldiversity[,'Health_Distinct_Ldiversity']

[1] 1 1 1 2 1 2 1 1 2 2

𝑙-diversity is useful if the data contains categorical sensitive variables that are not quasi-identifiers themselves. It is not possible to select quasi-identifiers to calculate the 𝑙-diversity. 𝑙-diversity has to be calculated for each sensitive variable separately.

9 Alternatively, the sensitive variables can be specified when creating the sdcMicro object using the function createSdcObj() in the sensibleVar argument. This is further explained in the Section Objects of class sdcMicroObj. In that case, the argument ldiv_index does not have to be specified in the ldiversity() function, and the variables in the sensibleVar argument will automatically be used to compute 𝑙-diversity.

10 Besides distinct 𝑙-diversity, there are other 𝑙-diversity methods: entropy and recursive. Distinct 𝑙-diversity is most commonly used.


5.6 Special Uniques Detection Algorithm (SUDA)

The previously discussed risk measures depend on identifying key variables for which there may be information available from other sources or other datasets, and which, when combined with the current data, may lead to re-identification. In practice, however, it might not always be possible to conduct an inventory of all available datasets and their variables and thus assess all known external linkages and risks.

To overcome this, an alternative heuristic measure based on special uniques has been developed to determine the riskiness of a record, which leads to a SUDA metric or score (see ElMF02). These measures are based on the search for special uniques. To find these special uniques, algorithms called SUDA (Special Uniques Detection Algorithm) have been developed. SUDA algorithms are based on the concept of special uniqueness, which is introduced in the next subsection. Since this is a heuristic approach, its performance can only be tested on actual datasets, which is done in ElMF02 for UK census data. These tests have shown that the algorithm leads to good risk estimates for these test datasets.

5.6.1 Sample unique vs. special unique

The previous measures of risk focused on the uniqueness of the key of a record in the dataset. Table 5.4 reproduces the data from Table 5.1. The sample dataset has 10 records and four pre-determined quasi-identifiers {“Residence”, “Gender”, “Education level” and “Labor status”}. Given the four quasi-identifiers, we have seven distinct patterns in those key variables, or keys (e.g., {‘Urban’, ‘Female’, ‘Secondary incomplete’, ‘Employed’}). The sample frequency counts of the first and second records equal 2, because the two records share the same pattern (i.e., {‘Urban’, ‘Female’, ‘Secondary incomplete’, ‘Employed’}). Record 3 is a sample unique because it is the only individual in the sample who is a female living in an urban area who did not complete primary school and is not in the labor force. Similarly, records 5, 7 and 8 are sample uniques, because they possess distinct patterns with respect to the four key variables.

Table 5.4: Sample uniques and special uniques

No | Residence | Gender | Education level      | Labor status | Weight | 𝑓𝑘 | 𝐹𝑘  | risk
1  | Urban     | Female | Secondary incomplete | Employed     | 180    | 2  | 360 | 0.0054
2  | Urban     | Female | Secondary incomplete | Employed     | 180    | 2  | 360 | 0.0054
3  | Urban     | Female | Primary incomplete   | Non-LF       | 215    | 1  | 215 | 0.0251
4  | Urban     | Male   | Secondary complete   | Employed     | 76     | 2  | 152 | 0.0126
5  | Rural     | Female | Secondary complete   | Unemployed   | 186    | 1  | 186 | 0.0282
6  | Urban     | Male   | Secondary complete   | Employed     | 76     | 2  | 152 | 0.0126
7  | Urban     | Female | Primary complete     | Non-LF       | 180    | 1  | 180 | 0.0290
8  | Urban     | Male   | Post-secondary       | Unemployed   | 215    | 1  | 215 | 0.0251
9  | Urban     | Female | Secondary incomplete | Non-LF       | 186    | 2  | 262 | 0.0074
10 | Urban     | Female | Secondary incomplete | Non-LF       | 76     | 2  | 262 | 0.0074

In addition to the records 3, 5, 7 and 8 in Table 5.4 being sample uniques with respect to the key variable set {“Residence”, “Gender”, “Education level”, “Labor status”}, we can find unique patterns in these records without even having to consider the complete set of key variables. For instance, a unique pattern can be found in record 5 when we look only at the variables “Education level” and “Labor status” ({‘Secondary complete’, ‘Unemployed’}). While the values {‘Secondary complete’} and {‘Unemployed’} are not unique in the sample, their combination, {‘Secondary complete’, ‘Unemployed’}, makes record 5 unique. This variable subset is referred to as a Minimal Sample Unique (MSU), as no smaller subset of this set of variables is unique (in this case {‘Secondary complete’} and {‘Unemployed’}). It is an MSU of size 2. This holds as well for three other combinations in record 5, i.e., {‘Female’, ‘Unemployed’} and {‘Female’, ‘Secondary complete’}, which are also MSUs of size 2, and {‘Rural’} of size 1. In total, record 5 has four MSUs11. To determine whether a set is an MSU of size 𝑘, we check whether it fulfills the minimal requirement; it suffices to check the subsets of size 𝑘 − 1 of the set. If any of these subsets is also unique in the sample, the set found may be a sample unique, but it violates the minimal requirement and is hence not an MSU. The unique subset of size 𝑘 − 1 could itself be an MSU. In our example, to determine whether {‘Secondary complete’, ‘Unemployed’} is an MSU, we checked whether its subsets {‘Secondary complete’} and {‘Unemployed’} were not unique in the sample. By definition, only sample uniques can be special uniques.

11 There are more combinations of quasi-identifiers that make record 5 unique (e.g., {‘Rural’, ‘Female’} and {‘Female’, ‘Secondary complete’, ‘Unemployed’}). These combinations, however, are not considered MSUs because they do not fulfill the minimal subset requirement: they contain subsets that are themselves MSUs.
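This check can be made concrete with a few lines of base R, reusing the educ and lstat vectors from the sketch in the Section Individual risk (an illustration only):

# The pair {'Sec com', 'Unemp'} is unique in the sample...
sum(educ == 'Sec com' & lstat == 'Unemp')  # 1 (only record 5)

# ...while neither of its size-1 subsets is unique, so the pair is an MSU
sum(educ == 'Sec com')  # 3 (records 4, 5 and 6)
sum(lstat == 'Unemp')   # 2 (records 5 and 8)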

The SUDA algorithm identifies all the MSUs in the sample, which in turn are used to assign a SUDA score to each record. This score indicates how “risky” a record is. The potential risk of the records is determined based on two observations:

• The smaller the size of the MSU within a record (i.e., the fewer variables are needed to reach uniqueness), the greater the risk of the record

• The larger the number of MSUs possessed by a record, the greater the risk of the record

A record is defined as a special unique if it is a sample unique both on the complete set of quasi-identifiers (e.g., in the data in Table 5.4, the variables “Residence”, “Gender”, “Education level” and “Labor status”) and simultaneously has at least one MSU (ElSD98). Special uniques can be classified according to the number and size of subsets that are MSUs. Research has shown that special uniques are more likely to be population uniques than random uniques (ElMF02) and are thus relevant for risk assessment.

5.6.2 Calculating SUDA scores

The SUDA algorithm is used to search for MSUs in the data among the sample uniques to determine which sample uniques are also special uniques, i.e., have subsets that are also unique (see Elliot et al., 2005). First, the SUDA algorithm is used to identify the MSUs for each sample unique. To simplify the search, and because smaller subsets are more important for disclosure risk, the search is limited to a maximum subset size. Subsequently, a score is assigned to each individual, which ranks the individuals according to their level of risk.

For each MSU of size 𝑘 contained in a given record, a score is computed as $\prod_{i=k}^{M} (ATT - i)$, where 𝑀 is the user-specified maximum size of MSUs12, and 𝐴𝑇𝑇 is the total number of attributes or variables in the dataset. By definition, the smaller the size 𝑘 of the MSU, the larger the score for the MSU, which reflects greater risk (see EMMG05). The final SUDA score for each record is computed by adding the scores for each MSU in the record. In this way, records with more MSUs are assigned a higher SUDA score, which also reflects the higher risk. The SUDA score ranks the individuals according to their level of risk: the higher the SUDA score, the riskier the sample unique.

Calculating SUDA scores – a simplified example

To illustrate how SUDA scores are calculated, we compute the SUDA scores for the sample uniques in the data in Table 5.5, which replicates the data from Table 5.4. Record 5 contains four MSUs: {‘Rural’} of size 1, and {‘Secondary complete’, ‘Unemployed’}, {‘Female’, ‘Unemployed’} and {‘Female’, ‘Secondary complete’} of size 2. Suppose the maximum size of MSUs we search for in the data, 𝑀, is set at 3. Knowing that 𝐴𝑇𝑇, the number of selected key variables in the dataset, is 4, the score assigned to {‘Rural’} is computed as $\prod_{i=1}^{3} (4 - i) = 3 \cdot 2 \cdot 1 = 6$, and the score assigned to each of {‘Secondary complete’, ‘Unemployed’}, {‘Female’, ‘Unemployed’} and {‘Female’, ‘Secondary complete’} is $\prod_{i=2}^{3} (4 - i) = 2 \cdot 1 = 2$. The SUDA score for the fifth record in Table 5.5 is then 6 + 2 + 2 + 2 = 12, the sum of the scores of its four MSUs. The SUDA scores for the other sample uniques are computed accordingly13. Records that are not sample uniques (𝑓𝑘 > 1) cannot be special uniques and are assigned the score 0.

12 OECD, http://stats.oecd.org/glossary

13 The third record has one MSU, {‘Primary incomplete’}; the seventh record has one MSU, {‘Primary complete’}; and the eighth record has three MSUs, {‘Urban’, ‘Unemployed’}, {‘Male’, ‘Unemployed’} and {‘Post-secondary’}.
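The arithmetic of the per-MSU score can be written as a one-line helper in R. This is only a sketch of the scoring formula (the function name is ours), not the SUDA algorithm itself:

# Score of one MSU of size k: product over i = k..M of (ATT - i)
suda_msu_score <- function(k, M = 3, ATT = 4) prod(ATT - (k:M))

suda_msu_score(1)  # {'Rural'}, size 1: 3 * 2 * 1 = 6
suda_msu_score(2)  # each size-2 MSU:   2 * 1     = 2
6 + 2 + 2 + 2      # SUDA score of record 5: 12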


Table 5.5: Illustrating the calculation of SUDA and DIS-SUDA scores

No | Residence | Gender | Education level      | Labor status | Weight | 𝑓𝑘 | SUDA score | DIS-SUDA
1  | Urban     | Female | Secondary incomplete | Employed     | 180    | 2  | 0          | 0.0000
2  | Urban     | Female | Secondary incomplete | Employed     | 180    | 2  | 0          | 0.0000
3  | Urban     | Female | Primary incomplete   | Non-LF       | 215    | 1  | 6          | 0.0051
4  | Urban     | Male   | Secondary complete   | Employed     | 76     | 2  | 0          | 0.0000
5  | Rural     | Female | Secondary complete   | Unemployed   | 186    | 1  | 12         | 0.0107
6  | Urban     | Male   | Secondary complete   | Employed     | 76     | 2  | 0          | 0.0000
7  | Urban     | Female | Primary complete     | Non-LF       | 180    | 1  | 6          | 0.0051
8  | Urban     | Male   | Post-secondary       | Unemployed   | 215    | 1  | 10         | 0.0088
9  | Urban     | Female | Secondary incomplete | Non-LF       | 186    | 2  | 0          | 0.0000
10 | Urban     | Female | Secondary incomplete | Non-LF       | 76     | 2  | 0          | 0.0000

To estimate record-level disclosure risks, SUDA scores can be used in combination with the Data Intrusion Simulation (DIS) metric (ElMa03), a method for assessing disclosure risks for the entire dataset (i.e., file-level disclosure risks). Roughly speaking, the DIS-SUDA method distributes the file-level risk measure generated by the DIS metric between records according to the SUDA scores of each record. This way, SUDA scores are calibrated against a consistent measure to produce the DIS-SUDA scores, which provide the record-level disclosure risk. These scores are used to compute the conditional probability that a unique match found by an intruder between the sample unique in the released microdata and an external data source is also a correct match, and hence a successful disclosure. The DIS-SUDA measure can be computed in sdcMicro. Since the DIS score is a probability, its values are in the interval [0, 1]. A full description of the DIS-SUDA method is provided by ElMa03.

Note that unlike the risk methods discussed earlier, the DIS-SUDA score does not fully account for the sampling weights. Risk measures based on the previous methods (i.e., negative binomial models) will in general yield lower risks for records with greater sampling weight, given the same sample frequency count, than those measured using DIS-SUDA. Therefore, instead of replacing the risk measures introduced in the previous section, the SUDA scores and DIS-SUDA approach should be used as a complementary method. As mentioned earlier, DIS-SUDA is particularly useful in situations where taking an inventory of all already available datasets and their variables is difficult.

5.6.3 Application of SUDA, DIS-SUDA using sdcMicro

Both SUDA and DIS-SUDA scores can be computed using sdcMicro (TMKC14). Given that the search for MSUs with the SUDA algorithm can be computationally demanding, sdcMicro uses the improved SUDA2 algorithm, which more effectively locates the boundaries of the search space for MSUs (MaHK08).

SUDA scores can be calculated using the suda2() function in sdcMicro. It is important to specify the missing argument in suda2(): it should match the code for missing values in your dataset. In R this is most likely the standard missing value, NA. We mention this because the default missing value code in the sdcMicro suda2() function is -999, which will most likely need to be changed to NA for most R datasets. The scores are saved in the risk slot of the sdcMicro object. The syntax in Listing 5.7 shows how to retrieve the output.


Listing 5.7: Evaluating SUDA scores

# Evaluating SUDA scores for the specified variables
sdcInitial <- suda2(obj = sdcInitial, missing = NA)

# The results are saved in the risk slot of the sdcMicro object
# SUDA scores
sdcInitial@risk$suda2$score
[1] 0.00 0.00 1.75 0.00 3.25 0.00 1.75 2.75 0.00 0.00

# DIS-SUDA scores
sdcInitial@risk$suda2$disScore
[1] 0.000000000 0.000000000 0.005120313 0.000000000 0.010702061
[6] 0.000000000 0.005120313 0.008775093 0.000000000 0.000000000

# Summary of DIS-SUDA scores
sdcInitial@risk$suda2

Dis suda scores table:
- - - - - - - - - - -
  thresholds number
1        > 0      6
2      > 0.1      4
3      > 0.2      0
4      > 0.3      0
5      > 0.4      0
6      > 0.5      0
7      > 0.6      0
8      > 0.7      0
- - - - - - - - - - -

To compare DIS scores before and after applying SDC methods, it may be useful to use histograms or density plots of these scores. Listing 5.8 shows how to generate histograms of the SUDA scores summarized in Listing 5.7. The histogram is shown in Fig. 5.2. All outputs relate to the data used in the example. In our case, we have not applied any SDC method to the data yet and thus have only the plots for the initial values. Typically, after applying SDC methods, one would recalculate the SUDA scores and compare them to the original values. One way to quickly see the differences is to rerun these visualizations and compare them to the baseline for changes in risk.

Listing 5.8: Histogram and density plots of DIS-SUDA scores

# Plot a histogram of disScore
hist(sdcInitial@risk$suda2$disScore, main = 'Histogram of DIS-SUDA scores')

# Density plot
density <- density(sdcInitial@risk$suda2$disScore)
plot(density, main = 'Density plot of DIS-SUDA scores')

Fig. 5.2: Visualizations of DIS-SUDA scores

5.7 Risk measures for continuous variables

The principle of rareness or uniqueness of combinations of quasi-identifiers (keys) is not useful for continuous variables, because it is likely that all or many individuals will have unique keys. Therefore, other approaches are exploited for measuring the disclosure risk of continuous variables. These methods are based on uniqueness of the values in the neighborhood of the original values. The uniqueness is defined in different ways: in absolute terms (interval measure) or relative terms (record linkage). Most measures are a posteriori measures: they are evaluated after anonymization of the raw data, compare the treated data with the raw data and evaluate for each individual the distance between the values in the raw and the treated data. This means that these methods are not useful for identifying individuals at risk within the raw data, but rather show the distance/difference between the dataset before and after anonymization, and can therefore be interpreted as an evaluation of the anonymization method. For that reason, they resemble the information loss measures discussed in the Section Measuring utility and information loss. Finally, risk measures for continuous quasi-identifiers are also based on outlier detection. Outliers play an important role in the re-identification of these records.

5.7.1 Record linkage

Record linkage is an a posteriori method that evaluates the number of correct linkages when linking the perturbed values with the original values. The linking algorithm is based on the distance between the original and the perturbed values (i.e., distance-based record linkage). The perturbed values are matched with the closest individual. It is important to note that this method does not give information on the initial risk, but is rather a measure to evaluate the perturbation algorithm (i.e., it is designed to indicate the level of uncertainty introduced into the variable by counting the number of records that could be correctly matched).

Record linkage algorithms differ with respect to which distance measure is used. When a variable has very different scaling than other continuous variables in the dataset, rescaling the variables before using record linkage is recommended. Very different scales may lead to undesired results when measuring the multivariate distance between records based on several continuous variables. Since these methods are based on both the raw data and treated data, examples of their applications require the introduction of SDC methods and are therefore postponed to the case studies in the Section Case Studies.

Besides distance-based record linkage, another method for linking is probabilistic record linkage (see DoTo03). The literature shows, however, that results from distance-based record linkage are better than the results from probabilistic record linkage. Individuals in the treated data that are linked to the correct individuals in the raw data are considered at risk of disclosure.
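To make the idea concrete, the sketch below implements a bare-bones distance-based record linkage in R under stated assumptions: rawData and treatedData are hypothetical data frames holding the same continuous variables for the same records in the same row order, and rescaling uses the raw data's means and standard deviations. This is an illustration, not the procedure used in the case studies:

# Bare-bones distance-based record linkage (illustrative sketch)
distLinkage <- function(rawData, treatedData) {
  # Rescale both datasets with the raw data's means and standard deviations,
  # so no single variable dominates the multivariate distance
  means <- colMeans(rawData)
  sds <- apply(rawData, 2, sd)
  rawS <- scale(rawData, center = means, scale = sds)
  treatedS <- scale(treatedData, center = means, scale = sds)

  # For each treated record, find the closest raw record (Euclidean distance)
  closest <- apply(treatedS, 1, function(x) which.min(colSums((t(rawS) - x)^2)))

  # A record is correctly linked (at risk) if its closest raw record is itself;
  # return the share of correctly linked records
  mean(closest == seq_len(nrow(rawData)))
}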

5.7.2 Interval measure

Successful application of an SDC method should result in perturbed values that are not too close to their initial values; if a value is relatively close, re-identification may be relatively easy. In the application of interval measures, intervals are created around each perturbed value and then a determination is made as to whether the original value of that perturbed observation is contained in this interval. Original values that fall within the interval around their perturbed value are considered too close to the perturbed value, hence unsafe, and need more perturbation. Values that are outside of the intervals are considered safe. The size of the intervals is based on the standard deviation of the observations and a scaling parameter. This method is implemented in the function dRisk() in sdcMicro. Listing 5.9 shows how to display the risk value computed by sdcMicro by comparing the income variables before and after anonymization. "sdcObj" is an sdcMicro object and "compExp" is a vector containing the names of the income variables. The size of the intervals is $k$ times the standard deviation, where $k$ is a parameter of the function dRisk(). The larger $k$, the larger the intervals are, hence the larger the number of observations within the interval constructed around their original values and the higher the risk measure. The result 1 indicates that all (100 percent of) the observations are outside the interval of 0.1 times the standard deviation around the original values.

Listing 5.9: Example with the function dRisk()

dRisk(obj = sdcObj@origData[,compExp], xm = sdcObj@manipNumVars[,compExp], k = 0.1)
[1] 1

For most values, this is a satisfactory approach. It is not a sufficient measure for outliers, however. After perturbation, outliers will stay outliers and are easily re-identifiable, even if they are sufficiently far from their initial values. Therefore, outliers should be treated with caution.


5.7.3 Outlier detection

Outliers are important for measuring re-identification risk in continuous microdata. Continuous data are often skewed, especially right-skewed. This means that there are a few outliers with very high values relative to the other observations of the same variable. Examples are income in household data, where only a few individuals/households may have very high incomes, or turnover data for firms that are much larger than the other firms in the sample. In cases like these, even if these values are perturbed, it may still be easy to identify these outliers, since they will remain the largest values even after perturbation. (The perturbation will have created uncertainty as to the exact value, but because the value started out so much further away from other observations, it may still be easy to link to the high-income individual or very large firm.) Examples would be the only doctor in a geographical area with a high income or one single large firm in one industry type. Therefore, identifying outliers in continuous data is an important step when identifying individuals at high risk. In practice, identifying the values of a continuous variable that are larger than a predetermined $p\%$-percentile might help identify outliers, and thus units at greater risk of identification. The value of $p$ depends on the skewness of the data.

We can calculate the $p\%$-percentile of a continuous variable in R and show the individuals who have income larger than this percentile. Listing 5.10 provides an illustration for the 90th percentile.

Listing 5.10: Computing the 90th percentile of the variable income

# Compute the 90th percentile of the variable income
perc90 <- quantile(file[,'income'], 0.90, na.rm = TRUE)

# Show the IDs of observations with income values larger than the 90th percentile
file[(file[, 'income'] >= perc90), 'ID']

A second approach for outlier detection is an a posteriori measure comparing the treated and raw data. An interval is constructed around the perturbed values as described in the previous section. If the original values fall into the interval around the perturbed values, the perturbed values are considered unsafe, since they are too close to the original values. There are different ways to construct such intervals, such as rank-based intervals and standard deviation-based intervals. TeMe08 propose a robust alternative for these intervals. They construct the intervals based on the squared Robust Mahalanobis Distance (RMD) of the individual values. The intervals are scaled by the RMD such that outliers obtain larger intervals, and hence need a larger perturbation to be considered safe, than values that are not outliers. This method is implemented in sdcMicro in the function dRiskRMD(), which is an extension of the dRisk() function. This method is illustrated in the Section Case Studies.
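A call to dRiskRMD() can be sketched analogously to the dRisk() call in Listing 5.9, assuming the same hypothetical object "sdcObj" and variable vector "compExp"; consult the sdcMicro documentation for the tuning parameters available in your package version:

# RMD-based interval measure, analogous to the dRisk() example above
dRiskRMD(obj = sdcObj@origData[, compExp], xm = sdcObj@manipNumVars[, compExp])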

5.8 Global risk

To construct one aggregate risk measure at the global level for the complete dataset, we can aggregate the individual-level risk measures in several ways. Global risk measures should be used with caution: an acceptable global risk can hide some very high-risk records that are compensated for by many low-risk records.

5.8.1 Mean of individual risk measures

A straightforward way of aggregating the individual risk measures is taking the mean over all individuals in the sample. This is equivalent to summing, over all keys in the sample, the individual risk multiplied by the sample frequency of the key, and dividing by the sample size $n$:

$$R_1 = \frac{1}{n} \sum_i r_k = \frac{1}{n} \sum_k f_k r_k$$

where $r_k$ is the individual risk of key $k$ that the $i$-th individual shares (see the Section Categorical key variables and frequency counts). This measure is reported as global risk in sdcMicro, is stored in the "risk" slot and can be displayed as shown in Listing 5.11. It indicates that the average re-identification probability is 0.01582, or 1.582%.


Listing 5.11: Computation of the global risk measure

# Global risk (average re-identification probability)
sdcInitial@risk$global$risk
[1] 0.01582
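Since $R_1$ is simply the mean of the individual risks, the same value can be recomputed directly from the individual risk slot as a quick consistency check (assuming, as in Listing 5.13 below, that the first column of the individual risk matrix holds the individual risk):

# Mean of the individual risks; should reproduce sdcInitial@risk$global$risk
mean(sdcInitial@risk$individual[, 1])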

The global risk in the example data in Table 5.1 is 0.01582, which is the expected proportion of all individuals in the sample that could be re-identified by an intruder. Another way of expressing the global risk is the expected number of re-identifications, $n \cdot R_1$, which in the example is $10 \cdot 0.01582 = 0.1582$. The expected number of re-identifications is also saved in the sdcMicro object. Listing 5.12 shows how to display this.

Note: This global risk measure should be used with caution. The average risk can be relatively low, but a few individuals could have a very high probability of re-identification. An easy way to check for this is to look at the distribution of the individual risk values or the number of individuals with risk values above a certain threshold, as shown in the next section.

Listing 5.12: Computation of expected number of re-identifications

# Global risk (expected number of re-identifications)
sdcInitial@risk$global$risk_ER
[1] 0.1582

5.8.2 Count of individuals with risks larger than a certain threshold

All individuals belonging to the same key have the same individual risk, $r_k$. Another way of expressing the total risk in the sample is the total number of observations that exceed a certain threshold of individual risk. The threshold can be absolute (e.g., all individuals with a disclosure risk higher than 0.05, or 5%) or relative (e.g., all individuals with risks higher than the upper quartile of individual risk). Listing 5.13 shows how, using R, one would count the number of observations with an individual re-identification risk higher than 5%. In the example, no individual has a disclosure risk higher than 0.05.

Listing 5.13: Number of individuals with individual risk higher than the threshold 0.05

sum(sdcInitial@risk$individual[,1] > 0.05)
[1] 0
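A relative threshold works the same way; for instance, a minimal sketch counting the records whose individual risk exceeds the upper quartile of the individual risks:

# Count observations with individual risk above the upper quartile
indRisk <- sdcInitial@risk$individual[, 1]
sum(indRisk > quantile(indRisk, 0.75))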

These calculations can then be used to treat data for individuals whose risk values are above a predetermined threshold. We will see later that there are methods in sdcMicro, such as localSupp(), that can be used to suppress values of certain key variables for those individuals with risk above a specified threshold. This is explained further in the Section Local suppression.

5.9 Household risk

In many social surveys, the data have a hierarchical structure where an individual belongs to a higher-level entity (see the Section Levels of risk). Typical examples are households in social surveys or pupils in schools. Re-identification of one household member can lead to re-identification of the other household members, too. It is therefore easy to see that if we take the household structure into account, the re-identification risk is the risk that at least one of the household members is re-identified:

$$r_h = P(A_1 \cup A_2 \cup \ldots \cup A_J) = 1 - \prod_{j=1}^{J} \left(1 - P(A_j)\right),$$

where $A_j$ is the event that the $j$-th member of the household is re-identified and $P(A_j) = r_k$ is the individual disclosure risk of the $j$-th member. For example, if a household has three members with individual disclosure risks, based on their respective keys, of 0.02, 0.03 and 0.03, the household risk is

$$1 - (1 - 0.02)(1 - 0.03)(1 - 0.03) = 0.078.$$
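The same computation in R, as a minimal sketch of the formula above (the vector indRisk is illustrative):

# Household risk from the individual risks of the three household members
indRisk <- c(0.02, 0.03, 0.03)
1 - prod(1 - indRisk)  # 0.077918, i.e., approximately 0.078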

The hierarchical or household risk cannot be lower than the individual risk, and the household risk is always the same for all household members. The household risk should be used in cases where the data contain a hierarchical structure, i.e., where a household structure is present in the data. Using sdcMicro, if a household identifier is specified (in the argument hhId in the function createSdcObj()) while creating an sdcMicro object, the household risk will automatically be computed. Listing 5.14 shows how to display these risk measures.

Listing 5.14: Computation of household risk and expected number of re-identifications

# Household risk
sdcInitial@risk$global$hier_risk

# Household risk (expected number of re-identifications)
sdcInitial@risk$global$hier_risk_ER

Note: The size of a household is an important identifier in itself, especially for large households. Suppressing the actual size variable (e.g., number of household members), however, does not suffice to remove this information from the dataset: as long as a household ID is in the data, allowing individuals to be assigned to households, a simple count of the household members of a particular household will allow reconstructing this variable. We flag this for the reader's attention, as it is important. Further discussion of approaches to the SDC process that take the household structure into account, where it exists, can be found in the Section Anonymization of the quasi-identifier household size.

Recommended Reading Material on Risk Measurement

Elliot, Mark J., Anna Manning, Ken Mayes, John Gurd, and Michael Bane. 2005. "SUDA: A Program for Detecting Special Uniques." Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality. Geneva.

Hundepool, Anco, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, and Peter Paul de Wolf. 2012. Statistical Disclosure Control. Chichester: John Wiley & Sons Ltd. doi:10.1002/9781118348239.

Lambert, Diane. 1993. "Measures of Disclosure Risk and Harm." Journal of Official Statistics 9 (2): 313-331.

Machanavajjhala, Ashwin, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam. 2007. "L-diversity: Privacy Beyond K-anonymity." ACM Trans. Knowl. Discov. Data 1 (Article 3) (1556-4681). doi:10.1145/1217299.1217302. https://ptolemy.berkeley.edu/projects/truststc/pubs/465/L%20Diversity%20Privacy.pdf. Accessed July 6, 2018.

Templ, Matthias, Bernhard Meindl, Alexander Kowarik, and Shuang Chen. 2014. "Introduction to Statistical Disclosure Control (SDC)." http://www.ihsn.org/home/sites/default/files/resources/ihsn-working-paper-007-Oct27.pdf. August 1. Accessed July 6, 2018.


CHAPTER 6

Anonymization Methods

This section describes the most commonly used SDC methods. All methods can be implemented in R using the sdcMicro package. For every method, we discuss the types of data for which it is suitable, both in terms of data characteristics and type of data. Furthermore, options such as the specific parameters of each method are discussed, as well as their impacts.1 These findings are meant as guidance but should be used with caution, since every dataset has different characteristics and our findings may not always address your particular dataset. The last three sections cover the anonymization of variables and datasets with particular characteristics that deserve special attention. The Section Anonymization of geospatial variables deals with methods for anonymizing geographical data, such as GPS coordinates; the Section Anonymization of the quasi-identifier household size discusses the anonymization of data with a hierarchical (household) structure; and the Section Special case: census data describes the peculiarities of dealing with and releasing census microdata.

To determine which anonymization methods are suitable for specific variables and/or datasets, we begin by presenting some classifications of SDC methods.

6.1 Classification of SDC methods

SDC methods can be classified as non-perturbative and perturbative (see HDFG12).

• Non-perturbative methods reduce the detail in the data by generalization or suppression of certain values (i.e., masking) without distorting the data structure.

• Perturbative methods do not suppress values in the dataset but perturb (i.e., alter) values to limit disclosure risk by creating uncertainty around the true values.

Both non-perturbative and perturbative methods can be used for categorical and continuous variables.

We also distinguish between probabilistic and deterministic SDC methods.

• Probabilistic methods depend on a probability mechanism or a random number-generating mechanism. Every time a probabilistic method is used, a different outcome is generated. For these methods it is often recommended that a seed be set for the random number generator if you want to produce replicable results (see the sketch after this list).

1 We also show code examples in R, which are drawn from findings we gathered by applying these methods to a large collection of datasets.


• Deterministic methods follow a certain algorithm and produce the same results if applied repeatedly to the same data with the same set of parameters.
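As a minimal sketch of making a probabilistic method replicable, one can fix the seed of R's random number generator before the call; the addNoise() call below (a perturbative method discussed later in this chapter) is shown only as an illustration:

# Fix the random seed so that rerunning the script yields identical results
set.seed(12345)

# Any probabilistic SDC method called after this point is now replicable,
# e.g., noise addition
sdcInitial <- addNoise(sdcInitial, noise = 150)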

SDC methods for microdata intend to prevent identity and attribute disclosure. Different SDC methods are used for each type of disclosure control. Methods such as recoding and suppression are applied to quasi-identifiers to prevent identity disclosure, whereas top coding a quasi-identifier (e.g., income) or perturbing a sensitive variable prevents attribute disclosure.

As this practice guide is written around the use of the sdcMicro package, we discuss only SDC methods that are implemented in the sdcMicro package or can be easily implemented in R. These are the methods most commonly applied in the literature and used by most agencies experienced in applying SDC. Table 6.1 gives an overview of the SDC methods discussed in this guide, their classification, the types of data to which they are applicable, and their function names in the sdcMicro package.

Table 6.1: SDC methods and corresponding functions in sdcMicro

Method                | Classification of SDC method    | Data type                  | Function in sdcMicro
Global recoding       | non-perturbative, deterministic | continuous and categorical | globalRecode, groupVars
Top and bottom coding | non-perturbative, deterministic | continuous and categorical | topBotCoding
Local suppression     | non-perturbative, deterministic | categorical                | localSuppression, localSupp
PRAM                  | perturbative, probabilistic     | categorical                | pram
Microaggregation      | perturbative, probabilistic     | continuous                 | microaggregation
Noise addition        | perturbative, probabilistic     | continuous                 | addNoise
Shuffling             | perturbative, probabilistic     | continuous                 | shuffle
Rank swapping         | perturbative, probabilistic     | continuous                 | rankSwap

6.2 Non-perturbative methods

6.2.1 Recoding

Recoding is a deterministic method used to decrease the number of distinct categories or values for a variable. This is done by combining or grouping categories for categorical variables or constructing intervals for continuous variables. Recoding is applied to all observations of a certain variable and not only to those at risk of disclosure. There are two general types of recoding: global recoding, and top and bottom coding.

Global recoding

Global recoding combines several categories of a categorical variable or constructs intervals for continuous variables. This reduces the number of categories available in the data and potentially the disclosure risk, especially for categories with few observations, but also, importantly, it reduces the level of detail of information available to the analyst. To illustrate recoding, we use the following example. Assume that we have five regions in our dataset. Some regions are very small and, when combined with other key variables in the dataset, produce high re-identification risk for some individuals in those regions. One way to reduce risk would be to combine some of the regions by recoding them. We could, for example, make three groups out of the five, call them 'North', 'Central' and 'South', and re-label the values accordingly. This reduces the number of categories in the variable region from five to three.

Note: Any grouping should be a logical grouping and not a random joining of categories.


Examples would be grouping districts into provinces, municipalities into districts, or clean water categories together. Grouping all small regions together without geographical proximity is not necessarily the best option from the utility perspective. Table 6.2 illustrates this with a very simplified example dataset. Before recoding, two individuals (1 and 10) have distinct keys, whereas after recoding (grouping 'Region 1' and 'Region 2' into 'North', 'Region 3' into 'Central' and 'Region 4' and 'Region 5' into 'South'), the number of distinct keys is reduced to four and the frequency of every key is at least two, based on the three selected quasi-identifiers. The frequency counts of the keys $f_k$ are shown in the last column of each panel of Table 6.2. An intruder would find at least two individuals for each key and could no longer distinguish between individuals 1 – 3, individuals 4 and 6, individuals 5 and 7, and individuals 8 – 10, based on the selected key variables.

Table 6.2: Illustration of the effect of recoding on frequency counts of keys

           | Before recoding                          | After recoding
Individual | Region   | Gender | Religion   | f_k     | Region  | Gender | Religion   | f_k
 1         | Region 1 | Female | Catholic   | 1       | North   | Female | Catholic   | 3
 2         | Region 2 | Female | Catholic   | 2       | North   | Female | Catholic   | 3
 3         | Region 2 | Female | Catholic   | 2       | North   | Female | Catholic   | 3
 4         | Region 3 | Female | Protestant | 2       | Central | Female | Protestant | 2
 5         | Region 3 | Male   | Protestant | 2       | Central | Male   | Protestant | 2
 6         | Region 3 | Female | Protestant | 2       | Central | Female | Protestant | 2
 7         | Region 3 | Male   | Protestant | 2       | Central | Male   | Protestant | 2
 8         | Region 4 | Male   | Muslim     | 2       | South   | Male   | Muslim     | 3
 9         | Region 4 | Male   | Muslim     | 2       | South   | Male   | Muslim     | 3
10         | Region 5 | Male   | Muslim     | 1       | South   | Male   | Muslim     | 3
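A sketch of this region recoding using sdcMicro's groupVars() function (introduced in detail below); the object and category labels are illustrative and assume "region" is a factor quasi-identifier in the sdcMicro object:

# Group the five regions into 'North', 'Central' and 'South' as in Table 6.2
sdcInitial <- groupVars(obj = sdcInitial, var = c("region"),
    before = c("Region 1", "Region 2", "Region 3", "Region 4", "Region 5"),
    after  = c("North", "North", "Central", "South", "South"))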

Recoding is commonly the first step in an anonymization process. It can be used to reduce the number of unique combinations of values of key variables. This generally increases the frequency counts for most keys and reduces the risk of disclosure. The reduction in the number of possible combinations is illustrated in Table 6.3 with the quasi-identifiers "region", "marital status" and "age". Table 6.3 shows the number of categories of each variable and the number of theoretically possible combinations, which is the product of the numbers of categories of the quasi-identifiers, before and after recoding. "Age" is interpreted as a semi-continuous variable and treated as a categorical variable. The number of possible combinations, and hence the risk of re-identification, is greatly reduced by recoding. One should bear in mind that the number of possible combinations is a theoretical number; in practice, these may include very unlikely combinations, such as age = 3 and marital status = widow, and the actual number of combinations in a dataset may be lower.

Table 6.3: Illustration of the effect of recoding on the theoretically possible number of combinations in a dataset

Number of categories | Region | Marital status | Age | Possible combinations
before recoding      |     20 |              8 | 100 |                16,000
after recoding       |      6 |              6 |  15 |                   540

The main parameters for global recoding are the size of the new groups, as well as defining which values are grouped together into new categories.

Note: Care should be taken to choose new categories in line with the intended data use of the end users and to minimize the information loss resulting from recoding.

We illustrate this with three examples:

• Age variable: The categories of age should be chosen so that they still allow data users to make calculations relevant for the subject being studied. For example, if indicators need to be calculated for children of school-going ages 6 – 11 and 12 – 17, and age needs to be grouped to reduce risk, then care should be taken to create age intervals that still allow the calculations to be made. A satisfactory grouping could be, for example, 0 – 5, 6 – 11, 12 – 17, etc., whereas a grouping 0 – 10, 11 – 15, 16 – 18 would destroy the data utility for these users. While it is common practice to create intervals (groups) of equal width (size), it is also possible (if data users require this) to recode only part of the values and leave some values as they were originally. This could be done, for example, by recoding all ages above 20, but leaving those below 20 as they are. If SDC methods other than recoding will be used later or in a next step, then care should be taken when applying recoding to only part of the distribution, as this might increase the information loss due to the other methods, since the grouping does not protect the ungrouped values. Partial recoding followed by suppression methods such as local suppression may, for instance, lead to a higher number of suppressions than would be desired or necessary had the recoding been done for the entire value range (see the next section on local suppression). In the example above, the number of suppressions of values below 20 will likely be higher than for values in the recoded range. The disproportionately high number of suppressions in this range of values that are not recoded can lead to higher utility loss for these groups.

• Geographic variables: If the original data specify administrative level information in detail, e.g., down to municipality level, then those lower levels could potentially be recoded or aggregated into higher administrative levels, e.g., province, to reduce risk. In doing so, the following should be noted: grouping municipalities into abstract levels that intersect different provinces would make data analysis at the municipal or provincial level challenging. Care should be taken to understand what the user requires and the intention of the study. If a key component of the survey is to conduct analysis at the municipal level, then aggregating up to provincial level could damage the utility of the data for the user. Recoding should be applied if the level of detail in the data is not necessary for most data users and to avoid an extensive number of suppressions when using other SDC methods subsequently. If the users need information at a more detailed level, other methods such as perturbative methods might provide a better solution than recoding.

• Toilet facility: An example of a situation where a high level of detail might not be necessary and recoding may do very little harm to utility is the case of a detailed household toilet facility variable that lists responses for 20 types of toilets. Researchers may only need to distinguish between improved and unimproved toilet facilities and may not require the exact classification of up to 20 types. Detailed information on toilet types can be used to re-identify households, while recoding to two categories – improved and unimproved facilities – reduces the re-identification risk and, in this context, hardly reduces data utility. This approach can be applied to any variable with many categories where data users are not interested in the detail, but rather in some aggregate categories. Recoding addresses aggregation for the data users and at the same time protects the microdata. It is important to take stock of the aggregations used by data users.

Recoding should be applied only if removing the detailed information in the data will not harm most data users. If the users need information at a more detailed level, then recoding is not appropriate and other methods, such as perturbative methods, might work better.

In sdcMicro there are different options for global recoding. In the following paragraphs, we give examples of global recoding with the functions groupVars() and globalRecode(). The function groupVars() is generally used for categorical variables and the function globalRecode() for continuous variables. Finally, we discuss the use of rounding to reduce the detail in continuous variables.

Recoding a categorical variable using the sdcMicro function groupVars()

Assume that an object of class sdcMicro was created, which is called "sdcInitial"² (see the Section Objects of class sdcMicroObj on how to create objects of class sdcMicro). In Listing 6.1, the variable "sizeRes" has four different categories: 'capital, large city', 'small city', 'town', and 'countryside'. The first three are recoded or regrouped as 'urban' and the category 'countryside' is renamed 'rural'. In the function arguments, we specify the categories to be grouped (before) and the names of the categories after recoding (after). It is important that the vectors "before" and "after" have the same length. Therefore, we have to repeat 'urban' three times in the "after" vector to match the three different values that are recoded to 'urban'.

² Here the sdcMicro object "sdcInitial" contains a dataset with 2,500 individuals and 103 variables. We selected five quasi-identifiers: "sizeRes", "age", "gender", "region", and "ethnicity".


Note: The function groupVars() works only for variables of class factor.

We refer to the Section Classes in R on how to change the class of a variable.

Listing 6.1: Using the sdcMicro function groupVars() to recode a categorical variable

# Frequencies of sizeRes before recoding
table(sdcInitial@manipKeyVars$sizeRes)
## capital, large city          small city                town         countryside
##                 686                 310                 146                1358

# Recode urban
sdcInitial <- groupVars(obj = sdcInitial, var = c("sizeRes"),
                        before = c("capital, large city", "small city", "town"),
                        after = c("urban", "urban", "urban"))

# Recode rural
sdcInitial <- groupVars(obj = sdcInitial, var = c("sizeRes"),
                        before = c("countryside"), after = c("rural"))

# Frequencies of sizeRes after recoding
table(sdcInitial@manipKeyVars$sizeRes)
## urban rural
##  1142  1358

Fig. 6.1 illustrates the effect of recoding the variable "sizeRes" and shows the frequency counts before and after recoding. We see that the number of categories has been reduced from 4 to 2 and that the small categories ('small city' and 'town') have disappeared.

Fig. 6.1: Effect of recoding – frequency counts before and after recoding


Recoding a continuous variable using the sdcMicro function globalRecode()

Global recoding of numerical (continuous) variables can be achieved in sdcMicro by using the function globalRecode(), which allows specifying a vector with the break points between the intervals. Recoding a continuous variable changes it into a categorical variable. One can additionally specify a vector of labels for the new categories. By default, the labels are the intervals, e.g., "(0, 10]". Listing 6.2 shows how to recode the variable age into 10-year intervals for age values between 0 and 100.

Note: Values that fall outside the specified intervals are assigned a missing value (NA).

Therefore, the intervals should cover the entire value range of the variable.

Listing 6.2: Using the sdcMicro function globalRecode() to recode a continuous variable (age)

sdcInitial <- globalRecode(sdcInitial, column = c('age'), breaks = 10 * c(0:10))

# Frequencies of age after recoding
table(sdcInitial@manipKeyVars$age)
##  (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
##     462     483     344     368     294     214     172      94      26        3

Fig. 6.2 shows the effect of recoding the variable “age”.

Fig. 6.2: Age variable before and after recoding

Instead of creating intervals of equal width, we can also create intervals of unequal width. This is illustrated in Listing 6.3, where we use the age groups 1 – 5, 6 – 11, 12 – 17, 18 – 21, 22 – 25, 26 – 49, 50 – 64 and 65+. In this example, this is a useful step, since even after recoding in 10-year intervals, the categories with high age values have low frequencies. We chose the intervals by respecting relevant school age and employment age values (e.g., the retirement age is 65 in this example) such that the data can still be used for common research on education and employment. Fig. 6.3 shows the effect of recoding the variable "age".

Listing 6.3: Using globalRecode() to create intervals of unequal width

sdcInitial <- globalRecode(sdcInitial, column = c('age'),
                           breaks = c(0, 5, 11, 17, 21, 25, 49, 65, 100))

# Frequencies of age after recoding
table(sdcInitial@manipKeyVars$age)
## (0,5] (5,11] (11,17] (17,21] (21,25] (25,49] (49,65] (65,100]
##   192    317     332     134     142     808     350      185

Fig. 6.3: Age variable before and after recoding

Caution about using the globalRecode() function in sdcMicro: in the current implementation of sdcMicro, the intervals are defined as left-open. In mathematical terms, this means that, in our example, age 0 is excluded from the specified intervals. In interval notation, this is denoted as (0, 5] (as in the x-axis labels in Fig. 6.2 and Fig. 6.3 for the recoded variable). The interval (0, 5] is interpreted as running from 0 to 5, not including 0, but including 5. R recodes values that are not contained in any of the intervals as missing (NA). In our example, this implementation would set all age values of 0 (children under 1 year) to missing and could potentially mean a large data loss. The globalRecode() function allows constructing only left-open intervals. This may not be a desirable result and the loss of the zero ages from the data is clearly problematic for a real-world dataset.

To construct right-open intervals, e.g., in our example the age intervals [0, 14), [15, 65), [66, 100), we present two alternatives for global recoding:

• A work-around for semi-continuous variables³ that allows globalRecode() to be used is subtracting a small number from the interval boundaries, thus allowing the desired intervals to be created. In the following example, subtracting 0.1 from each boundary forces globalRecode() to include 0 in the lowest interval and allows for breaks where we want them. We set the upper interval boundary to be larger than the maximum value of the "age" variable. We can use the option labels to define clear labels for the new categories. This is illustrated in Listing 6.4.

³ This approach works only for semi-continuous variables, because in the case of continuous variables there might be values between the lower interval boundary and the lower interval boundary minus the small number. For example, using this for income, we would have an interval (9999, 19999] and the value 9999.5 would be misclassified as belonging to the interval [10000, 19999].

Listing 6.4: Constructing right-open intervals for semi-continuous variables using the built-in sdcMicro function globalRecode()

sdcInitial <- globalRecode(sdcInitial, column = c('age'),
                           breaks = c(-0.1, 14.9, 64.9, 99.9),
                           labels = c('[0,15)', '[15,65)', '[65,100)'))

• It is also possible to use R code to manually recode the variables without using sdcMicro functions. When using the built-in sdcMicro functions, the change in risk after recoding is automatically recalculated; if recoded manually, it is not. In this case, we need to take an extra step and recalculate the risk after manually changing the variables in the sdcMicro object. This approach is also valid for continuous variables and is illustrated in Listing 6.5.

Listing 6.5: Constructing intervals for semi-continuous and continuous variables using manual recoding in R

# Group age 0-14
sdcInitial@manipKeyVars$age[sdcInitial@manipKeyVars$age >= 0 &
                            sdcInitial@manipKeyVars$age < 15] <- 0

# Group age 15-64
sdcInitial@manipKeyVars$age[sdcInitial@manipKeyVars$age >= 15 &
                            sdcInitial@manipKeyVars$age < 65] <- 1

# Group age 65-100
sdcInitial@manipKeyVars$age[sdcInitial@manipKeyVars$age >= 65 &
                            sdcInitial@manipKeyVars$age <= 100] <- 2

# Add labels for the new values
sdcInitial@manipKeyVars$age <- ordered(sdcInitial@manipKeyVars$age,
                                       levels = c(0, 1, 2),
                                       labels = c("0-14", "15-64", "65-100"))

# Recalculate risk after manual manipulation
sdcInitial <- calcRisks(sdcInitial)

Top and bottom coding

Top and bottom coding are similar to global recoding, but instead of recoding all values, only the top and/or bottom values of the distribution or categories are recoded. This can be applied only to ordinal categorical variables and (semi-)continuous variables, since the values have to be at least ordered. Top and bottom coding is especially useful if the bulk of the values lies in the center of the distribution with the peripheral categories having only few observations (outliers). Examples are age and income; for these variables, there will often be only a few observations above certain thresholds, typically at the tails of the distribution. The fewer the observations within a category, the higher the identification risk. One solution could be grouping the values at the tails of the distribution into one category. This reduces the risk for those observations, and, importantly, does so without reducing the data utility for the other observations in the distribution.

Deciding where to apply the threshold and what observations should be grouped requires:

• Reviewing the overall distribution of the variable to identify at which point the frequencies drop below the desired number of observations and to identify outliers in the distribution. Fig. 6.4 shows the distribution of the age variable and suggests 65 (red vertical line) as the top code for age.

• Taking into account the intended use of the data and the purpose for which the survey was conducted. For example, if the data are typically used to measure labor force participation for those aged 15 to 64, then top and bottom coding should not interfere with the categories 15 to 64. Otherwise, the analyst would find it impossible to create the desired measures for which the data were intended. In the example, we take this into account and code all ages higher than 64.

Fig. 6.4: Utilizing the frequency distribution of variable age to determine threshold for top coding

Top and bottom coding can be easily done with the function topBotCoding() in sdcMicro. Top coding and bottom coding cannot be done simultaneously in sdcMicro. Listing 6.6 illustrates how to recode values of age higher than 64 and values of age lower than 5; these values are replaced by 65 and 5, respectively. To construct several top or bottom coding categories, e.g., age 65 – 80 and higher than age 80, one can use the groupVars() function in sdcMicro or manual recoding as described in the previous subsection.

Listing 6.6: Top coding and bottom coding in sdcMicro using the topBotCoding() function

# Top coding at age 65
sdcInitial <- topBotCoding(obj = sdcInitial, value = 65, replacement = 65,
                           kind = 'top', column = 'age')

# Bottom coding at age 5
sdcInitial <- topBotCoding(obj = sdcInitial, value = 5, replacement = 5,
                           kind = 'bottom', column = 'age')

Rounding

Rounding is similar to grouping, but is used for continuous variables. Rounding is useful to prevent exact matching with external data sources. In addition, it can be used to reduce the level of detail in the data. Examples are removing decimal figures or rounding to the nearest 1,000.
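As a minimal sketch in base R (the data frame file and the variable income are illustrative, as in Listing 5.10), rounding to the nearest 1,000 can be done with a negative digits argument:

# Round the income variable to the nearest 1,000
file[, 'income'] <- round(file[, 'income'], digits = -3)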

The next section discusses the method of local suppression. Recoding is often used before local suppression to reduce the number of necessary suppressions.


Recommended Reading Material on Recoding

Hundepool, Anco, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Rainer Lenz, Jane Naylor, Eric Schulte Nordholt, Giovanni Seri, and Peter Paul de Wolf. 2006. Handbook on Statistical Disclosure Control. ESSNet SDC. http://neon.vb.cbs.nl/casc/handbook.htm.

Hundepool, Anco, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, and Peter Paul de Wolf. 2012. Statistical Disclosure Control. Chichester: John Wiley & Sons Ltd. doi:10.1002/9781118348239.

Templ, Matthias, Bernhard Meindl, Alexander Kowarik, and Shuang Chen. 2014. Statistical Disclosure Control (SDCMicro). http://www.ihsn.org/home/software/disclosure-control-toolbox. Accessed June 9, 2018.

De Waal, A.G., and Willenborg, L.C.R.J. 1999. "Information Loss through Global Recoding and Local Suppression." Netherlands Official Statistics 14: 17-20. Special issue on SDC.

6.2.2 Local suppression

It is common in surveys to encounter values of certain variables or combinations of values of quasi-identifiers (keys) that are shared by very few individuals. When this occurs, the risk of re-identification for those respondents is higher than for the rest of the respondents (see the Section k-anonymity). Often local suppression is used after reducing the number of keys in the data by recoding the appropriate variables. Recoding reduces the number of necessary suppressions as well as the computation time needed for suppression. Suppression of values means that values of a variable are replaced by a missing value (NA in R). The Section k-anonymity discusses how missing values influence frequency counts and k-anonymity. It is important to note that not all values for all individuals of a certain variable are suppressed, which would be the case when removing a direct identifier, such as "name"; only certain values for a particular variable and a particular respondent or set of respondents are suppressed. This is illustrated in the following example and Table 6.4.

Table 6.4 presents a dataset with seven respondents and three quasi-identifiers. The combination {'female', 'rural', 'higher'} for the variables "gender", "region" and "education" is an unsafe combination, since it is unique in the sample. By suppressing either the value 'female' or 'higher', the respondent can no longer be distinguished from the other respondents, since that respondent then shares the same combination of key variables with at least three other respondents. Only the value in the unsafe combination of the single respondent at risk is suppressed, not the values of the same variable for the other respondents. The freedom to choose which value to suppress can be used to minimize the total number of suppressions and hence the information loss. In addition, if one variable is very important to the user, we can choose not to suppress values of this variable, unless strictly necessary. In the example, we can choose between suppressing the value 'female' or 'higher' to achieve a safe data file; we chose to suppress 'higher'. This choice should be made taking into account the needs of data users. In this example, we consider "gender" more important than "education".

Table 6.4: Local suppression illustration - sample data before and after suppression

   | Before local suppression      | After local suppression
ID | Gender | Region | Education   | Gender | Region | Education
 1 | female | rural  | higher      | female | rural  | NA/missing⁵
 2 | male   | rural  | higher      | male   | rural  | higher
 3 | male   | rural  | higher      | male   | rural  | higher
 4 | male   | rural  | higher      | male   | rural  | higher
 5 | female | rural  | lower       | female | rural  | lower
 6 | female | rural  | lower       | female | rural  | lower
 7 | female | rural  | lower       | female | rural  | lower

⁵ In R suppressed values are recoded NA, the standard missing value code.


Since continuous variables have a high number of unique values (e.g., income in dollars or age in years), k-anonymity and local suppression are not suitable for continuous variables or variables with a very high number of categories. A possible solution in those cases might be to first recode to produce fewer categories (e.g., recoding age into 10-year intervals or income into quintiles). Always keep in mind, though, what effect any recoding will have on the utility of the data.

The sdcMicro package includes two functions for local suppression: localSuppression() and localSupp(). The function localSuppression() is most commonly used and allows the use of suppression on specified quasi-identifiers to achieve a certain level of k-anonymity for these quasi-identifiers. The algorithm used seeks to minimize the total number of suppressions while achieving the required k-anonymity threshold. By default, the algorithm is more likely to suppress values of variables with many different categories or values, and less likely to suppress variables with fewer categories. For example, the values of a geographical variable with 12 different areas are more likely to be suppressed than the values of the variable "gender", which typically has only two categories. If variables with many different values are important for data utility and suppression is not desired for them, it is possible to rank variables by importance in the localSuppression() function and thus specify the order in which the algorithm will seek to suppress values within quasi-identifiers to achieve k-anonymity. The algorithm seeks to apply fewer suppressions to variables of high importance than to variables of lower importance. Nevertheless, suppressions in the variables with high importance might be inevitable to achieve the required level of k-anonymity.

In Listing 6.7, local suppression is applied to achieve the k-anonymity threshold of 5 on the quasi-identifiers "sizeRes", "age", "gender", "region" and "ethnicity"⁶. Without ranking the importance of the variables, the value of the variable "age" is most likely to be suppressed, since this is the variable with the most categories. The variable "age" has 10 categories after recoding. The variable "gender" is least likely to be suppressed, since it has only two different values: 'male' and 'female'. The other variables have 4 ("sizeRes"), 2 ("region"), and 8 ("ethnicity") categories. After applying the localSuppression() function, we display the number of suppressions per variable with the built-in print() function, using the option 'ls' for the local suppression output. As expected, the variable "age" has the most suppressions (80). In fact, among the other variables, only "ethnicity" also needed suppressions (8) to achieve the k-anonymity threshold of 5, making it the variable with the second highest number of suppressions. Subsequently, we undo and redo local suppression on the same data and reduce the number of suppressions on "age" by specifying the importance vector with high importance (little suppression) on the quasi-identifier "age". We also assign high importance to the variable "gender". This is done by specifying an importance vector. The values in the importance vector can range from 1 to k, the number of quasi-identifiers. In our example k is equal to 5. Variables with lower values in the importance vector have high importance and, when possible, receive fewer suppressions than variables with higher values.

To assign high importance to the variables "age" and "gender", we specify the importance vector as c(5, 1, 1, 5, 5), with the order corresponding to the order of the specified variables in the sdcMicro object. The effect is clear: there are no suppressions in the variables "age" and "gender". In turn, the other variables, especially "sizeRes" and "ethnicity", received many more suppressions. The total number of suppressed values has increased from 88 to 166.

Note: Fewer suppressions in one variable increase the number of necessary suppressions in other variables (cf. Listing 6.7).

Generally, the total number of suppressed values needed to achieve the required level of k-anonymity increases when specifying an importance vector, since the importance vector prevents the use of the optimal suppression pattern. The importance vector should be specified only in cases where the variables with many categories play an important role in data utility for the data users⁷.

⁶ Here the sdcMicro object "sdcInitial" contains a dataset with 2,500 individuals and 103 variables. We selected five quasi-identifiers: "sizeRes", "age", "gender", "region", and "ethnicity".

⁷ This can be assessed with utility measures.


Listing 6.7: Application of local suppression with and without importance vector

# Local suppression without importance vector
sdcInitial <- localSuppression(sdcInitial, k = 5)

print(sdcInitial, 'ls')
## KeyVar    | Suppressions (#) | Suppressions (%)
## sizeRes   |                0 |            0.000
## age       |               80 |            3.200
## gender    |                0 |            0.000
## region    |                0 |            0.000
## ethnicity |                8 |            0.320

# Undoing the suppressions
sdcInitial <- undolast(sdcInitial)

# Local suppression with importance vector to avoid suppressions
# in the second (age) and third (gender) variables
sdcInitial <- localSuppression(sdcInitial, importance = c(5, 1, 1, 5, 5), k = 5)
print(sdcInitial, 'ls')
## KeyVar    | Suppressions (#) | Suppressions (%)
## sizeRes   |               87 |            3.480
## age       |                0 |            0.000
## gender    |                0 |            0.000
## region    |               17 |            0.680
## ethnicity |               62 |            2.480

Fig. 6.5 demonstrates the effect of the required k-anonymity threshold and the importance vector on the data utility, using several labor market-related indicators from an I2D2⁸ dataset before and after anonymization. Fig. 6.5 displays the relative changes as a percentage of the initial value after re-computing the indicators with the data to which local suppression was applied. The indicators are the proportion of active females and males, and the number of females and males of working age. The values computed from the raw data were, respectively, 68%, 12%, 8,943 and 9,702. The vertical line at 0 is the benchmark of no change. The numbers indicate the required k-anonymity threshold (3 or 5) and the colors indicate the importance vector: red (no symbol) is no importance vector, blue (with * symbol) is high importance on the variable with the employment status information and dark green (with + symbol) is high importance on the age variable.

⁸ I2D2 is a dataset with data related to the labor market.

Fig. 6.5: Changes in labor market indicators after anonymization of I2D2 data

A higher k-anonymity threshold leads to greater information loss caused by local suppression (i.e., larger deviations from the original values of the indicators; the 5's are further away from the benchmark of no change than the corresponding 3's). Reducing the number of suppressions on the employment status variable by specifying an importance vector does not improve the indicators. Instead, reducing the number of suppressions on age greatly reduces the information loss. Since specific age groups have a large influence on the computation of these indicators (the rare cases are in the extremes and will be suppressed), high suppression rates on age distort the indicators. It is generally useful to compare utility measures (see the Section Measuring Utility and Information Loss) when specifying the importance vector, since the effects can be unpredictable.

The k-anonymity threshold to be set depends on several factors, among others: 1) the legal requirements for a safe data file; 2) other methods that will be applied to the data; 3) the number of suppressions and the related information loss resulting from higher thresholds; 4) the type of variable; 5) the sample weights and sample size; and 6) the release type (see the Section Release Types). Commonly applied levels for the k-anonymity threshold are 3 and 5.

8 I2D2 is a dataset with data related to the labor market.

Fig. 6.5: Changes in labor market indicators after anonymization of I2D2 data

Table 6.5 illustrates the influence of the importance vector and 𝑘-anonymity threshold on the running time, global risk after suppression and total number of suppressions required to achieve this 𝑘-anonymity threshold. The dataset contains about 63,000 individuals. The higher the 𝑘-anonymity threshold, the more suppressions are needed and the lower the risk after local suppression (expected number of re-identifications). In this particular example, the computation time is shorter for higher thresholds. This is due to the higher number of necessary suppressions, which reduces the difficulty of the search for an optimal suppression pattern.

The age variable is recoded in five-year intervals and has 20 age categories. This is the variable with the highest number of categories. Prioritizing the suppression of other variables leads to a higher total number of suppressions and a longer computation time.

Table 6.5: How importance vectors and 𝑘-anonymity thresholds affect running time and total number of suppressions

Threshold k-anonymity | Importance vector  | Total number of suppressions | Suppressions on age variable | Global risk after suppression | Running time
3                     | none (default)     | 6,676                        | 5,387                        | 293.0                         | 11.8
3                     | employment status  | 7,254                        | 5,512                        | 356.5                         | 13.1
3                     | age variable       | 8,175                        | 60                           | 224.6                         | 4.5
5                     | none (default)     | 9,971                        | 7,894                        | 164.6                         | 8.5
5                     | employment status  | 11,668                       | 8,469                        | 217.0                         | 10.2
5                     | age variable       | 13,368                       | 58                           | 123.1                         | 3.8

In cases where there are a large number of quasi-identifiers and the variables have many categories, the number of possible combinations increases rapidly (see 𝑘-anonymity). If the number of variables and categories is very large, the computation time of the localSuppression() algorithm can be very long (see the Section Computation time on computation time). Also, the algorithm may not reach a solution, or may come to a solution that will not meet the specified level of 𝑘-anonymity. Therefore, reducing the number of quasi-identifiers and/or categories before applying local suppression is recommended. This can be done by recoding variables or selecting some variables for other (perturbative) methods, such as PRAM. This is to ensure that the number of suppressions is limited and hence the loss of data is limited to only those values that pose most risk.

In some datasets, it might prove difficult to reduce the number of quasi-identifiers and, even after reducing the number of categories by recoding, the local suppression algorithm takes a long time to compute the required suppressions. A solution in such cases can be the so-called ‘all-𝑚 approach’ (see Wolf15). The all-𝑚 approach consists of applying the local suppression algorithm as described above to all possible subsets of size 𝑚 of the total set of quasi-identifiers. The advantage of this approach is that the partial problems are easier to solve and the computation time will be shorter. Caution should be applied since this method does not necessarily lead to 𝑘-anonymity in the complete set of quasi-identifiers. There are two possibilities to reach the same level of protection: 1) to choose a higher threshold for 𝑘 or 2) to re-apply the local suppression algorithm on the complete set of quasi-identifiers after using the all-𝑚 approach to achieve the required threshold. In the second case, the all-𝑚 approach leads to a shorter computation time at the cost of a higher total number of suppressions.

Note: The required level is not achieved automatically on the entire set of quasi-identifiers if the all-m approach isused.

Therefore, it is important to evaluate the risk measures carefully after using the all-𝑚 approach.
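For instance, such a check can be done directly on the sdcMicro object; a minimal sketch, assuming an existing object “sdcInitial” as in the examples above:

# Inspect k-anonymity violators and risk measures after the all-m approach
print(sdcInitial, 'kAnon')   # records violating 2-, 3- and 5-anonymity on the full key set
print(sdcInitial, 'risk')    # global risk measures for the complete set of key variables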

In sdcMicro the all-𝑚 approach is implemented in the ‘combs’ argument of the localSuppression() function. The value for 𝑚 is specified in the ‘combs’ argument and can also take on several values. The subsets of different sizes are then used sequentially in the local suppression algorithm. For example, if ‘combs’ is set to c(3,9), first all subsets of size 3 are considered and subsequently all subsets of size 9. Setting the last value in the ‘combs’ argument to the total number of key variables guarantees the achievement of 𝑘-anonymity for the complete dataset. It is also possible to specify different values for 𝑘 for each subset size in the ‘k’ argument. If we would want to achieve 5-anonymity on the subsets of size 3 and subsequently 3-anonymity on the subsets of size 9, we would set the ‘k’ argument to c(5,3). Listing 6.8 illustrates the use of the all-𝑚 approach in sdcMicro.

Listing 6.8: The all-𝑚 approach in sdcMicro

# Apply k-anonymity with threshold 5 to all subsets of two key variables and
# subsequently to the complete dataset
sdcInitial <- localSuppression(sdcInitial, k = 5, combs = c(2, 5))

# Apply k-anonymity with threshold 5 to all subsets of three key variables and
# subsequently with threshold 2 to the complete dataset
sdcInitial <- localSuppression(sdcInitial, k = c(5, 2), combs = c(3, 5))

Table 6.6 presents the results of using the all-𝑚 approach on a test dataset with 9 key variables and 4,000 records. The table shows the arguments ‘k’ and ‘combs’ of the localSuppression() function, the number of 𝑘-anonymity violators for different levels of 𝑘 as well as the total number of suppressions. We observe that the different combinations do not always lead to the required level of 𝑘-anonymity. For example, when setting 𝑘 = 3 and ‘combs’ to 3 and 7, there are still 15 records in the dataset (with a total of 9 quasi-identifiers) that violate 3-anonymity after local suppression. Due to the smaller sample size, the gains in running time are not yet apparent in this example, since rerunning the algorithm several times takes up time. A larger dataset would benefit more from the all-𝑚 approach, as the algorithm would take longer in the first place.




Table 6.6: Effect of the all-𝑚 approach on k-anonymity
(the violator counts refer to 𝑘-anonymity on the complete set of quasi-identifiers)

k    | combs | Violators k = 2 | Violators k = 3 | Violators k = 5 | Total number of suppressions | Running time (seconds)
Before local suppression | | 2,464 | 3,324 | 3,877 | 0     | 0.00
3    | .     | 0     | 0     | 1,766 | 2,264 | 17.08
5    | .     | 0     | 0     | 0     | 3,318 | 10.57
3    | 3     | 2,226 | 3,202 | 3,819 | 3,873 | 13.39
3    | 3, 7  | 15    | 108   | 1,831 | 6,164 | 46.84
3    | 3, 9  | 0     | 0     | 1,794 | 5,982 | 31.38
3    | 5, 9  | 0     | 0     | 1,734 | 6,144 | 62.30
5    | 3     | 2,047 | 3,043 | 3,769 | 3,966 | 12.88
5    | 3, 7  | 0     | 6     | 86    | 7,112 | 46.57
5    | 3, 9  | 0     | 0     | 0     | 7,049 | 24.13
5    | 5, 9  | 0     | 0     | 0     | 7,129 | 54.76
5, 3 | 3, 7  | 11    | 108   | 1,859 | 6,140 | 45.60
5, 3 | 3, 9  | 0     | 0     | 1,766 | 2,264 | 30.07
5, 3 | 5, 9  | 0     | 0     | 0     | 3,318 | 51.25

Often the dataset contains variables that are related to the key variables used for local suppression. Examples are a rural/urban variable related to region (in case regions are completely rural or urban) or variables that are only answered for specific categories (e.g., sector for those working, schooling-related variables for certain age ranges). In those cases, the variables rural/urban or sector might not be quasi-identifiers themselves, but could allow the intruder to reconstruct suppressed values in the quasi-identifiers region or employment status. For example, if region 1 is completely urban, and all other regions are only semi-urban or rural, a suppression in the variable region for a record in region 1 can simply be reconstructed from the rural/urban variable. Therefore, it is useful to also suppress the values in those linked variables that correspond to the suppressions in the key variables. Listing 6.9 illustrates how to suppress the values in the variable “rururb” corresponding to the suppressions in the region variable. All values of “rururb” that correspond to a suppressed value (NA) in the variable “region” are suppressed (set to NA).

Listing 6.9: Manually suppressing values in linked variables

# Suppress values of rururb in file if region is suppressed
file[is.na(sdcInitial@manipKeyVars$region) &
       !is.na(sdcInitial@origData$region), 'rururb'] <- NA

Alternatively, the linked variables can be specified when creating the sdcMicro object. The linked variables are called ghost variables. Any suppression in the key variable will lead to a suppression in the variables linked to that key variable. Listing 6.10 shows how to specify the linkage between “region” and “rururb” with ghost variables.

Listing 6.10: Suppressing values in linked variables by specifying ghostvariables

# Ghost (linked) variables are specified as a list of linkages
ghostVars <- list()

# Each linkage is a list, with the first element the key variable and
# the second element the linked variable(s)
ghostVars[[1]] <- list()
ghostVars[[1]][[1]] <- "region"
ghostVars[[1]][[2]] <- c("rururb")

# Create the sdcMicroObj
sdcInitial <- createSdcObj(file, keyVars = keyVars, numVars = numVars,
                           weightVar = weight, ghostVars = ghostVars)

# The manipulated ghost variables are in the slot manipGhostVars
sdcInitial@manipGhostVars

A simpler alternative to the localSuppression() function in sdcMicro is the localSupp() function. The localSupp() function can be used to suppress values of certain key variables of individuals with risks above a certain threshold. In this case, all values of the specified variable for respondents with a risk higher than the specified threshold will be suppressed. The risk measure used is the individual risk (see the Section Individual risk). This is useful if one variable has sensitive values that should not be released for individuals with high risks of re-identification. What is considered a high re-identification probability depends on legal requirements. In the following example, the values of the variable “education” are suppressed for all individuals whose individual risk is higher than 0.1, which is illustrated in Listing 6.11. For an overview of the individual risk values, it can be useful to look at the summary statistics of the individual risk values as well as the number of suppressions.

Listing 6.11: Application of built-in sdcMicro function localSupp()

# Summary statistics
summary(sdcInitial@risk$individual[,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## 0.05882 0.10000 0.14290 0.26480 0.33330 1.00000

# Number of individuals with individual risk higher than 0.1
sum(sdcInitial@risk$individual[,1] > 0.1)
## [1] 1863

# Local suppression
sdcInitial <- localSupp(sdcInitial, threshold = 0.1, keyVar = 'education')

6.3 Perturbative methods

Perturbative methods do not suppress values in the dataset, but perturb (alter) values to limit disclosure risk by creating uncertainty around the true values. An intruder is uncertain whether a match between the microdata and an external file is correct or not. Most perturbative methods are based on the principle of matrix masking, i.e., the altered dataset 𝑍 is computed as

𝑍 = 𝐴𝑋𝐵 + 𝐶

where 𝑋 is the original data, 𝐴 is a matrix used to transform the records, 𝐵 is a matrix to transform the variables and 𝐶 is a matrix with additive noise.
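As a minimal illustration of this decomposition (a sketch in base R with made-up toy data, not part of sdcMicro), simple additive noise corresponds to choosing 𝐴 and 𝐵 as identity matrices:

# Toy data: 3 records, 2 variables (hypothetical values)
X <- matrix(c(1200, 1500, 1800,
               900, 1100, 1600), nrow = 3)
A <- diag(nrow(X))   # identity: records are not transformed
B <- diag(ncol(X))   # identity: variables are not transformed
set.seed(1)
C <- matrix(rnorm(length(X), mean = 0, sd = 100), nrow = nrow(X))  # additive noise
Z <- A %*% X %*% B + C   # masked dataset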

Note: Risk measures based on frequency counts of keys are no longer valid after applying perturbative methods.

This can be seen in Table 6.7, which displays the same data before and after swapping some values. The swapped values are in italics. Both before and after perturbing the data, all observations violate 𝑘-anonymity at the level 3 (i.e., each key does not appear more than twice in the dataset). Nevertheless, the risk of correct re-identification of the records is reduced and hence the information contained in other (sensitive) variables is possibly not disclosed. With a certain probability, a match of the microdata with an external data file will be wrong. For example, an intruder would find one individual with the combination {‘male’, ‘urban’, ‘higher’}, which is a sample unique. However, this match is not correct, since the original dataset did not contain any individual with these characteristics; hence the matched individual cannot be a correct match. The intruder cannot know with certainty whether the information disclosed from other variables for that record is correct.

Table 6.7: Sample data before and after perturbation

             Original data                  After perturbing the data
ID | Gender | Region | Education  | Gender | Region | Education
1  | female | rural  | higher     | female | rural  | higher
2  | female | rural  | higher     | female | rural  | lower
3  | male   | rural  | lower      | male   | rural  | lower
4  | male   | rural  | lower      | female | rural  | lower
5  | female | urban  | lower      | male   | urban  | higher
6  | female | urban  | lower      | female | urban  | lower

One advantage of perturbative methods is that, depending on the level of perturbation, the information loss is reduced, since no values are suppressed. One disadvantage is that data users might have the impression that the data was not anonymized before release and will be less willing to participate in future surveys. Therefore, there is a need for reporting both for internal and external use (see the Section Step 11: Audit and Reporting).

An alternative to perturbative methods is the generation of synthetic data files with the same characteristics as the original data files. Synthetic data files are not discussed in these guidelines. For more information and an overview of the use of synthetic data as an SDC method, we refer to Drec11 and Section 3.8 in HDFG12. We discuss here five perturbative methods: Post Randomization Method (PRAM), microaggregation, noise addition, shuffling and rank swapping.

6.3.1 PRAM (Post RAndomization Method)

PRAM is a perturbative method for categorical data. This method reclassifies the values of one or more variables, such that an intruder who attempts to re-identify individuals in the data does so, with positive probability, with the wrong individual. This means that the intruder might be able to match several individuals between external files and the released data files, but cannot be sure whether these matches are with the correct individual.

PRAM is defined by the transition matrix 𝑃, which specifies the transition probabilities, i.e., the probability that a value of a certain variable stays unchanged or is changed to any of the other 𝑘 − 1 values. 𝑘 is the number of categories or factor levels within the variable to be PRAMmed. For example, if the variable region has 10 different regions, 𝑘 equals 10. In case of PRAM for a single variable, the transition matrix is of size 𝑘 × 𝑘. We illustrate PRAM with an example of the variable “region”, which has three different values: ‘capital’, ‘rural1’ and ‘rural2’. The transition matrix for applying PRAM to this variable is of size 3 × 3:

P = \begin{pmatrix} 1 & 0 & 0 \\ 0.05 & 0.8 & 0.15 \\ 0.05 & 0.15 & 0.8 \end{pmatrix}

The values on the diagonal are the probabilities that a value in the corresponding category is not changed. The value 1 at position (1,1) in the matrix means that all values ‘capital’ stay ‘capital’; this might be a useful decision, since most individuals live in the capital and no protection is needed. The value 0.8 at position (2,2) means that an individual with value ‘rural1’ will stay ‘rural1’ with probability 0.8. The values 0.05 and 0.15 in the second row of the matrix indicate that the value ‘rural1’ will be changed to ‘capital’ or ‘rural2’ with probability 0.05 and 0.15, respectively. If in the initial file we had 5,000 individuals with value ‘capital’ and respectively 500 and 400 with values ‘rural1’ and ‘rural2’, we expect after applying PRAM to have 5,045 individuals with capital, 460 with rural1 and 395 with rural2⁹. The recoding is done independently for each individual. We see that the tabulation of the variable “region” yields different results before and after PRAM, which are shown in Table 6.8. The deviation from the expectation is due to the fact that PRAM is a probabilistic method, i.e., the results depend on a probability-generating mechanism; consequently, the results can differ every time we apply PRAM to the same variables of a dataset.

9 The 5,045 is the expectation computed as 5,000 * 1 + 500 * 0.05 + 400 * 0.05.
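The expected tabulation can be computed directly from the transition matrix; a minimal sketch in base R, using the matrix and counts from the example above:

# Transition matrix P and original counts from the example
P <- matrix(c(1,    0,    0,
              0.05, 0.8,  0.15,
              0.05, 0.15, 0.8),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("capital", "rural1", "rural2"),
                            c("capital", "rural1", "rural2")))
counts <- c(capital = 5000, rural1 = 500, rural2 = 400)

# Expected counts after PRAM: 5,045, 460 and 395
drop(counts %*% P)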




Note: The number of changed values is larger than one might think when inspecting the tabulations in Table 6.8. Not all 5,052 individuals with value capital after PRAM had this value before PRAM, and the 457 individuals in rural1 after PRAM are not all included in the 500 individuals before PRAM. The number of changes is larger than the differences in the tabulation (cf. transition matrix).

Given that the transition matrix is known to the end users, there are several ways to correct statistical analyses of the data for the distortions introduced by PRAM; a simple example for univariate tabulations follows Table 6.8.

Table 6.8: Tabulation of variable “region” before and after PRAM

Value   | Tabulation before PRAM | Tabulation after PRAM
capital | 5,000                  | 5,052
rural1  | 500                    | 457
rural2  | 400                    | 391
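For a univariate tabulation, for instance, an unbiased estimate of the original counts can be obtained by inverting the transition matrix; a minimal sketch in base R, assuming the matrix 𝑃 from the example above has been published:

P <- matrix(c(1,    0,    0,
              0.05, 0.8,  0.15,
              0.05, 0.15, 0.8), nrow = 3, byrow = TRUE)

# Observed tabulation after PRAM (from Table 6.8)
t_obs <- c(5052, 457, 391)

# Since E[t_obs] = t(P) %*% t_orig, solve for the original counts
t_hat <- solve(t(P), t_obs)
round(t_hat)  # close to the original tabulation of 5,000 / 500 / 400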

One way to guarantee consistency between the tabulations before and after PRAM is to choose the transition matrix so that, in expectation, the tabulations before and after applying PRAM are the same for all variables.¹⁰ This method is called invariant PRAM and is implemented in sdcMicro in the function pram(). The method pram() determines the transition matrix that satisfies the requirements for invariant PRAM.

Note: Invariant PRAM does not guarantee that cross-tabulations of variables (unlike univariate tabulations) stay the same.

In Listing 6.12, we give an example of invariant PRAM using sdcMicro.¹¹ PRAM is a probabilistic method and the results can differ every time we apply PRAM to the same variables of a dataset. To overcome this and make the results reproducible, it is good practice to set a seed for the random number generator in R, so the same random numbers will be generated every time.¹² The number of changed records per variable is also shown.

Listing 6.12: Producing reproducible PRAM results by using set.seed()

# Set seed for random number generator
set.seed(123)

# Apply PRAM to all selected variables
sdcInitial <- pram(obj = sdcInitial)
## Number of changed observations:
## - - - - - - - - - - -
## ROOF != ROOF_pram : 75 (3.75%)
## TOILET != TOILET_pram : 200 (10%)
## WATER != WATER_pram : 111 (5.55%)
## ELECTCON != ELECTCON_pram : 99 (4.95%)
## FUELCOOK != FUELCOOK_pram : 152 (7.6%)
## OWNMOTORCYCLE != OWNMOTORCYCLE_pram : 42 (2.1%)
## CAR != CAR_pram : 168 (8.4%)
## TV != TV_pram : 170 (8.5%)
## LIVESTOCK != LIVESTOCK_pram : 52 (2.6%)

10 This means that the vector with the tabulation of the absolute frequencies of the different categories in the original data is an eigenvector of the transition matrix that corresponds to the unit eigenvalue.

11 In this example and the following examples in this section, the sdcMicro object “sdcInitial” contains a dataset with 2,000 individuals and 39 variables. We selected five categorical quasi-identifiers and 9 variables for PRAM: “ROOF”, “TOILET”, “WATER”, “ELECTCON”, “FUELCOOK”, “OWNMOTORCYCLE”, “CAR”, “TV”, and “LIVESTOCK”. These PRAM variables were selected according to the requirements of this particular dataset and for illustrative purposes.

12 The PRAM method in sdcMicro sometimes produces the following error: Error in factor(xpramed, labels = lev) : invalid ‘labels’; length 6 should be 1 or 5. Under some circumstances, changing the seed can solve this error.




Table 6.9 shows the tabulation of the variable after applying invariant PRAM. We can see that the deviations from the initial tabulations, which are in expectation 0, are smaller than with the transition matrix that does not fulfill the invariance property. The remaining deviations are due to the randomness.

Table 6.9: Tabulation of variable “region” before and after (invariant) PRAM

Value   | Tabulation before PRAM | Tabulation after PRAM | Tabulation after invariant PRAM
capital | 5,000                  | 5,052                 | 4,998
rural1  | 500                    | 457                   | 499
rural2  | 400                    | 391                   | 403

Table 6.10 presents the cross-tabulations with the variable gender. Before applying invariant PRAM, the share of males in the city is much higher than the share of females (about 60%). This property is not maintained after invariant PRAM (the shares of males and females in the city are roughly equal), although the univariate tabulations are maintained. One solution is to apply PRAM separately for the males and females in this example.¹³ This can be done by specifying the strata argument in the pram() function in sdcMicro (see below).

Table 6.10: Cross-tabulation of variable “region” and variable “gender” before and after invariant PRAM

          Tabulation before PRAM | Tabulation after invariant PRAM
Value   | male  | female         | male  | female
capital | 3,056 | 1,944          | 2,623 | 2,375
rural1  | 157   | 343            | 225   | 274
rural2  | 113   | 287            | 187   | 216

The pram() function in sdcMicro has several options.

Note: If no options are set and the PRAM method is applied to an sdcMicro object, all PRAM variables selected in the sdcMicro object are automatically used for PRAM and PRAM is applied within the selected strata (see the Section Objects of class sdcMicroObj on sdcMicro objects for more details).

Alternatively, PRAM can also be applied to variables that are not specified in the sdcMicro object as PRAM variables, such as key variables, which is shown in Listing 6.13. In that case, however, the risk measures that are automatically computed will no longer be correct, since the variables are perturbed. Therefore, if during the SDC process PRAM will be applied to some key variables, it is recommended to create a new sdcMicro object where the variables to be PRAMmed are selected as PRAM variables in the function createSdcObj() (a sketch of such a call follows Listing 6.13).

Listing 6.13: Selecting the variable “toilet” to apply PRAM

# Set seed for random number generator
set.seed(123)

# Apply PRAM only to the variable TOILET
sdcInitial <- pram(obj = sdcInitial, variables = c("TOILET"))
## Number of changed observations:
## - - - - - - - - - - -
## TOILET != TOILET_pram : 115 (5.75%)
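A minimal sketch of setting up such a new object, assuming a data frame “file” and hypothetical variable selections (the names below are placeholders, not taken from the example dataset):

# Declare the variables to be PRAMmed as PRAM variables when creating
# the sdcMicro object, so the automatic risk measures stay valid
sdcInitial <- createSdcObj(file,
                           keyVars   = c("region", "gender", "age"),  # hypothetical key variables
                           pramVars  = c("TOILET", "ROOF"),           # variables selected for PRAM
                           weightVar = "weight")                      # hypothetical weight variable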

13 This can also be achieved with multidimensional transition matrices. In that case, the probability is not specified for ‘male’ -> ‘female’, but for ‘male’ + ‘rural’ -> ‘female’ + ‘rural’ and for ‘male’ + ‘urban’ -> ‘female’ + ‘urban’. This is not implemented in sdcMicro but can be achieved by PRAMming the males and females separately. In the example here, this could be done by specifying gender as strata variable in the pram() function in sdcMicro.




The results of PRAM differ if it is applied simultaneously to several variables or subsequently to each variable separately. It is not possible to specify the entire transition matrix in sdcMicro, but we can set minimum values (between 0 and 1) for the diagonal entries. The diagonal entries specify the probability that a certain value stays the same after applying PRAM. Setting the minimum value to 1 will yield no changes in this category. By default, this value is 0.8, which applies to all categories. It is also possible to specify a vector with a value for each diagonal element of the transition matrix, i.e., for each category. In Listing 6.14, values of the first category are less likely to change than values of the other categories.

Note: The invariant PRAM method requires the transition matrix to have a unit eigenvalue. Not all sets of restrictions can therefore be used (e.g., the minimum value 1 on any of the categories).

Listing 6.14: Specifying minimum values for diagonal entries in the PRAM transition matrix

sdcInitial <- pram(obj = sdcInitial, variables = c("TOILET"),
                   pd = c(0.9, 0.5, 0.5, 0.5))
## Number of changed observations:
## - - - - - - - - - - -
## TOILET != TOILET_pram : 496 (24.8%)

In the invariant PRAM method, we can also specify the amount of perturbation by specifying the parameter alpha. This choice is reflected in the transition matrix. By default, the alpha value is 0.5. The larger alpha, the larger the perturbations. Alpha equal to zero leads to no changes. The maximum value for alpha is 1.
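A minimal sketch of setting this parameter, assuming the same sdcMicro object as in the previous listings:

# Apply invariant PRAM to the variable TOILET with a larger amount
# of perturbation than the default alpha of 0.5
sdcInitial <- pram(obj = sdcInitial, variables = c("TOILET"), alpha = 0.8)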

PRAM is especially useful when a dataset contains many variables and applying other anonymization methods, such as recoding and local suppression, would lead to significant information loss. Checks on risk and utility are important after PRAM.

To do statistical inference on variables to which PRAM was applied, the researcher needs knowledge about the PRAM method as well as about the transition matrix. The transition matrix, together with the random number seed, can, however, lead to disclosure through reconstruction of the non-perturbed values. Therefore, publishing the transition matrix but not the random seed is recommended.

A disadvantage of using PRAM is that very unlikely combinations can be generated, such as a 63-year-old who goes to school. Therefore, the PRAMmed variables need to be audited to prevent such combinations from happening in the released data file. In principle, the transition matrix can be designed in such a way that certain transitions are not possible (probability 0). For instance, for those that go to school, the age must range within 6 to 18 years and only such changes are allowed. In sdcMicro the transition matrix cannot be exactly specified. A useful alternative is constructing strata and applying PRAM within the strata. In this way, the changes between variables will only be applied within the strata. Listing 6.15 illustrates this by applying PRAM to the variable “toilet” within the strata generated by the variable “region”. This prevents changes in the variable “toilet” where toilet types in a particular region would be exchanged with those in other regions. For instance, in the capital region certain types of unimproved toilet types are not in use and therefore these combinations should not occur after PRAMming. Values are only changed with those that are available in the same stratum. Strata can be formed by any categorical variable, e.g., gender, age groups, education level.

Listing 6.15: Minimizing unlikely combinations by applying PRAM within strata

# Applying PRAM within the strata generated by the variable region
sdcInitial <- pram(obj = sdcInitial, variables = c("TOILET"),
                   strata_variables = c("REGION"))
## Number of changed observations:
## - - - - - - - - - - -
## TOILET != TOILET_pram : 179 (8.95%)




Recommended Reading Material on PRAM

Gouweleeuw, J. M., P. Kooiman, L.C.R.J. Willenborg, and P.P. de Wolf. “Post Randomization for Statistical Disclosure Control: Theory and Implementation.” Journal of Official Statistics 14, no. 4 (1998a): 463-478. Available at http://www.jos.nu/articles/abstract.asp?article=144463

Gouweleeuw, J. M., P. Kooiman, L.C.R.J. Willenborg, and Peter Paul de Wolf. “The Post Randomization Method for Protecting Microdata.” Qüestiió, Quaderns d’Estadística i Investigació Operativa 22, no. 1 (1998b): 145-156. Available at http://www.raco.cat/index.php/Questiio/issue/view/2250

Marés, Jordi, and Vicenç Torra. 2010. “PRAM Optimization Using an Evolutionary Algorithm.” In Privacy in Statistical Databases, by Josep Domingo-Ferrer and Emmanouil Magkos, 97-106. Corfú, Greece: Springer.

Warner, S.L. “Randomized Response: A Survey Technique for Eliminating Evasive Answer Bias.” Journal of the American Statistical Association 60, no. 309 (1965): 63-69.

6.3.2 Microaggregation

Microaggregation is most suitable for continuous variables, but can be extended in some cases to categorical variables.¹⁴ It is most useful where confidentiality rules have been predetermined (e.g., a certain threshold for 𝑘-anonymity has been set) that permit the release of data only if combinations of variables are shared by more than a predetermined threshold number of respondents (𝑘). The first step in microaggregation is the formation of small groups of individuals that are homogeneous with respect to the values of selected variables, such as groups with similar income or age. Subsequently, the values of the selected variables of all group members are replaced with a common value, e.g., the mean of that group. Microaggregation methods differ with respect to (i) how the homogeneity of groups is defined, (ii) the algorithms used to find homogeneous groups, and (iii) the determination of replacement values. In practice, microaggregation works best when the values of the variables in the groups are more homogeneous. When this is the case, the information loss due to replacing values with common values for the group will be smaller than in cases where groups are less homogeneous.

In the univariate case, and also for ordinal categorical variables, formation of homogeneous groups is straightforward: groups are formed by first ordering the values of the variable and then creating 𝑔 groups of size 𝑛ᵢ for all groups 𝑖 in 1, . . . , 𝑔. This maximizes the within-group homogeneity, which is measured by the within-groups sum of squares (𝑆𝑆𝐸):

SSE = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^T (x_{ij} - \bar{x}_i)

The lower the 𝑆𝑆𝐸, the higher the within-group homogeneity. The group sizes can differ amongst groups, but often groups of equal size are used to simplify the search.¹⁵

The function microaggregation() in sdcMicro can be used for univariate microaggregation. The argument ‘aggr’ specifies the group size. Forming groups is easier if all groups (except maybe the last group of remainders) have the same size. This is the case in the implementation in sdcMicro, as it is not possible to have groups of different sizes. Listing 6.16 shows how to use the function microaggregation() in sdcMicro.¹⁶ The default group size is 3, but the user can specify any desired group size. The choice of group size depends on the homogeneity within the groups and the required level of protection. In general, it holds that the larger the group, the higher the protection. A disadvantage of groups of equal sizes is that the data might be unsuitable for this. For instance, if two individuals have a low income (e.g., 832 and 966) and four individuals have a high income (e.g., 3,313, 3,211, 2,987, 3,088), the means of two groups of size three (e.g., (832 + 966 + 2,987) / 3 = 1,595 and (3,088 + 3,211 + 3,313) / 3 = 3,204) would represent neither the low nor the high income.

14 Microaggregation can also be used for categorical data, as long as there is a possibility to form groups and an aggregate replacement for the values in the group can be calculated. This is the case for ordinal variables.

15 Here all groups can have different sizes (i.e., number of individuals in a group). In practice, the search for homogeneous groups is simplified by imposing equal group sizes for all groups.

16 In this example and the following examples in this section, the sdcMicro object “sdcInitial” contains a dataset with 2,000 individuals and 39 variables. We selected five categorical quasi-identifiers and three continuous quasi-identifiers: “INC”, “EXP” and “WEALTH”.
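To make the grouping step concrete, here is a minimal base R sketch (not sdcMicro) that reproduces the two groups of size three from the income example above:

# Six hypothetical incomes from the example
x <- c(832, 966, 2987, 3088, 3211, 3313)

# Assign groups of size 3 in sorted order
g <- ceiling(rank(x, ties.method = "first") / 3)

# Replace each value by its group mean:
# 1,595 for the first group and 3,204 for the second
ave(x, g)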

Listing 6.16: Applying univariate microaggregation with sdcMicro function microaggregation()

sdcInitial <- microaggregation(obj = sdcInitial, variables = 'INC',
                               aggr = 3, method = "mafast", measure = "mean")

By default, the microaggregation function replaces values with the group mean. An alternative, more robust approach is to replace group values with the median. This can be specified in the argument ‘measure’ of the function microaggregation(). In cases where the median is chosen, one individual in every group keeps the same value if groups have odd sizes. In cases where there is a high degree of heterogeneity within the groups (this is often the case for larger groups), the median is preferred to preserve the information in the data. An example is income, where one outlier can lead to multiple outliers being created when using microaggregation. This is illustrated in Table 6.11. If we choose the mean as replacement for all values that are grouped with the outlier (6,045 in group 2), these records will be assigned values far from their original values. If we choose the median, the incomes of individuals 1 and 2 are not perturbed, but no value is an outlier. Of course, this might in itself present problems.

Note: If microaggregation alters outlying values, this can have a significant impact on the computation of some measures sensitive to outliers, such as the GINI index.

In the case where microaggregation is applied to categorical variables, the median is used to calculate the replacementvalue for the group.

Table 6.11: Illustrating the effect of choosing mean vs. median for microaggregation where outliers are concerned

ID | Group | Income | Microaggregation (mean) | Microaggregation (median)
1  | 1     | 2,300  | 2,245                   | 2,300
2  | 2     | 2,434  | 3,608                   | 2,434
3  | 1     | 2,123  | 2,245                   | 2,300
4  | 1     | 2,312  | 2,245                   | 2,300
5  | 2     | 6,045  | 3,608                   | 2,434
6  | 2     | 2,345  | 3,608                   | 2,434

In case of multiple variables that are candidates for microaggregation, one possibility is to apply univariate microaggregation to each of the variables separately. The advantage of univariate microaggregation is minimal information loss, since the changes in the variables are limited. The literature shows, however, that disclosure risk can be very high if univariate microaggregation is applied to several variables separately and no additional anonymization techniques are applied (DMOT02). To overcome this shortcoming, an alternative to univariate microaggregation is multivariate microaggregation.

Multivariate microaggregation is widely used in official statistics. The first step in multivariate aggregation is the creation of homogeneous groups based on several variables. Groups are formed based on multivariate distances between the individuals. Subsequently, the values of all variables for all group members are replaced with the same values. Table 6.12 illustrates this with three variables. We see that the grouping by income, expenditure and wealth leads to a different grouping than in the case in Table 6.11, where groups were formed based only on income.




Table 6.12: Illustration of multivariate microaggregation

             Before microaggregation     After microaggregation
ID | Group | Income | Exp   | Wealth | Income  | Exp     | Wealth
1  | 1     | 2,300  | 1,714 | 5.3    | 2,285.7 | 1,846.3 | 6.3
2  | 1     | 2,434  | 1,947 | 7.4    | 2,285.7 | 1,846.3 | 6.3
3  | 1     | 2,123  | 1,878 | 6.3    | 2,285.7 | 1,846.3 | 6.3
4  | 2     | 2,312  | 1,950 | 8.0    | 3,567.3 | 2,814.0 | 8.3
5  | 2     | 6,045  | 4,569 | 9.2    | 3,567.3 | 2,814.0 | 8.3
6  | 2     | 2,345  | 1,923 | 7.8    | 3,567.3 | 2,814.0 | 8.3

There are several multivariate microaggregation methods that differ with respect to the algorithm used for creating groups of individuals. There is a trade-off between the speed of the algorithm and within-group homogeneity, which is directly related to information loss. For large datasets, this is especially challenging. We discuss the Maximum Distance to Average Vector (MDAV) algorithm here in more detail. The MDAV algorithm was first introduced by DoTo05 and represents a good choice with respect to the trade-off between computation time and group homogeneity, computed by the within-group 𝑆𝑆𝐸. The MDAV algorithm is implemented in sdcMicro.

The algorithm computes an average record or centroid C, which contains the average values of all included variables. We select an individual A with the largest squared Euclidean distance from C, and build a group of 𝑘 records around A. The group of 𝑘 records is made up of A and the 𝑘 − 1 records closest to A measured by the Euclidean distance. Next, we select another individual B, with the largest squared Euclidean distance from individual A. With the remaining records, we build a group of 𝑘 records around B. In the same manner, we select an individual D with the largest distance from B and, with the remaining records, build a new group of 𝑘 records around D. The process is repeated until we have fewer than 2*𝑘 records remaining. The MDAV algorithm creates groups of equal size with the exception of maybe one last group of remainders. The microaggregated dataset is then computed by replacing each record in the original dataset by the average values of the group to which it belongs. Equal group sizes, however, may not be ideal for data characterized by greater variability. In sdcMicro multivariate microaggregation is also implemented in the function microaggregation(). Listing 6.17 shows how to choose the MDAV algorithm in sdcMicro; a simplified sketch of the grouping logic follows the listing.

Listing 6.17: Multivariate microaggregation with the Maximum Distance to Average Vector (MDAV) algorithm in sdcMicro

sdcInitial <- microaggregation(obj = sdcInitial,
                               variables = c("INC", "EXP", "WEALTH"),
                               method = "mdav")
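The following is a minimal base R sketch of the MDAV grouping step described above; it is an illustration only, not sdcMicro’s implementation, and it handles the final remainder group more crudely than the exact algorithm:

# Simplified MDAV-style grouping: returns a group label for each record
mdavGroups <- function(X, k) {
  X <- as.matrix(X)
  groups <- integer(nrow(X))
  remaining <- seq_len(nrow(X))
  g <- 0
  # Squared Euclidean distance from each remaining record to a reference point
  sqDist <- function(ref) colSums((t(X[remaining, , drop = FALSE]) - ref)^2)
  while (length(remaining) >= 2 * k) {
    centroid <- colMeans(X[remaining, , drop = FALSE])
    A <- remaining[which.max(sqDist(centroid))]      # record furthest from centroid
    grpA <- remaining[order(sqDist(X[A, ]))[1:k]]    # A and its k - 1 nearest records
    g <- g + 1
    groups[grpA] <- g
    remaining <- setdiff(remaining, grpA)
    if (length(remaining) >= k) {
      B <- remaining[which.max(sqDist(X[A, ]))]      # record furthest from A
      grpB <- remaining[order(sqDist(X[B, ]))[1:k]]  # B and its k - 1 nearest records
      g <- g + 1
      groups[grpB] <- g
      remaining <- setdiff(remaining, grpB)
    }
  }
  if (length(remaining) > 0) groups[remaining] <- g + 1  # remainder group
  groups
}

# Usage (hypothetical data frame "file" with the three continuous variables):
# g <- mdavGroups(file[, c("INC", "EXP", "WEALTH")], k = 3)
# aggregated <- apply(file[, c("INC", "EXP", "WEALTH")], 2, ave, g)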

It is also possible to group variables only within strata. This reduces the computation time and adds an extra layer of protection to the data, because of the greater uncertainty produced.¹⁷ In sdcMicro this can be achieved by specifying the strata variables, as shown in Listing 6.18.

Listing 6.18: Specifying strata variables for microaggregation

sdcInitial <- microaggregation(obj = sdcInitial,
                               variables = c("INC", "EXP", "WEALTH"),
                               method = "mdav", strata_variables = c("strata"))

Besides the method MDAV, there are a few other grouping methods implemented in sdcMicro (TeMK14). Table 6.13 gives an overview of these methods. Whereas the method ‘MDAV’ uses the Euclidean distance, the method ‘rmd’ uses the Mahalanobis distance instead. An alternative to these methods is sorting the respondents based on the first principal component (PC), which is the projection of all variables into a one-dimensional space maximizing the variance of this projection. The performance of this method depends on the share of the total variance in the data that is explained by the first PC. The ‘rmd’ method is computationally more intensive due to the computation of Mahalanobis distances, but provides better results with respect to group homogeneity. It is recommended for smaller datasets (TeMK14).

17 Also, the homogeneity in the groups will generally be lower, leading to larger changes, higher protection, but also more information loss, unless the strata variable correlates with the microaggregation variable.

Table 6.13: Grouping methods for microaggregation that are implemented in sdcMicro

Method / option in sdcMicro | Description
mdav        | grouping is based on classical (Euclidean) distance measures
rmd         | grouping is based on robust multivariate (Mahalanobis) distance measures
pca         | grouping is based on principal component analysis whereas the data are sorted on the first principal component
clustpppca  | grouping is based on clustering and (robust) principal component analysis for each cluster
influence   | grouping is based on clustering and aggregation is performed within clusters

In case of several variables to be used for microaggregation, looking first at the covariance or correlation matrix of these variables is recommended. If not all variables correlate well, but two or more sets of variables show high correlation, less information loss will occur when applying microaggregation separately to these sets of variables. In general, less information loss will occur when applying multivariate microaggregation if the variables are highly correlated. Replacing the values with the mean of the groups rather than other replacement values has the advantage that the overall means of the variables are preserved.
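A minimal sketch of such a check, assuming the continuous quasi-identifiers from the examples above are columns of the data frame “file”:

# Inspect the correlation structure before choosing variable sets
# for multivariate microaggregation
cor(file[, c("INC", "EXP", "WEALTH")], use = "pairwise.complete.obs")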

Recommended Reading Material on Microaggregation

Domingo-Ferrer, Josep, and Josep Maria Mateo-Sanz. 2002. “Practical data-oriented microaggregation for statistical disclosure control.” IEEE Transactions on Knowledge and Data Engineering 14 (2002): 189-201.

Hansen, Stephen Lee, and Sumitra Mukherjee. 2003. “A polynomial algorithm for optimal univariate microaggregation.” IEEE Transactions on Knowledge and Data Engineering 15 (2003): 1043-1044.

Hundepool, Anco, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Rainer Lenz, Jane Naylor, Eric Schulte Nordholt, Giovanni Seri, and Peter Paul de Wolf. 2006. Handbook on Statistical Disclosure Control. ESSNet SDC. http://neon.vb.cbs.nl/casc/handbook.htm

Hundepool, Anco, Josep Domingo-Ferrer, Luisa Franconi, Sarah Giessing, Eric Schulte Nordholt, Keith Spicer, and Peter Paul de Wolf. 2012. Statistical Disclosure Control. Chichester: John Wiley & Sons Ltd. doi:10.1002/9781118348239.

Templ, Matthias, Bernhard Meindl, Alexander Kowarik, and Shuang Chen. 2014, August. “International Household Survey Network (IHSN).” http://www.ihsn.org/home/software/disclosure-control-toolbox. (accessed July 9, 2018).

6.3.3 Noise addition

Noise addition, or noise masking, means adding or subtracting (small) values to the original values of a variable, and is most suited to protect continuous variables (see Bran02 for an overview). Noise addition can prevent exact matching of continuous variables. The advantages of noise addition are that the noise is typically continuous with mean zero, and exact matching with external files will not be possible. Depending on the magnitude of the noise added, however, approximate interval matching might still be possible.

When using noise addition to protect data, it is important to consider the type of data, the intended use of the data and the properties of the data before and after noise addition, i.e., the distribution (particularly the mean), covariance and correlation between the perturbed and original datasets.

Depending on the data, it may also be useful to check that the perturbed values fall within a meaningful range of values. Fig. 6.7 illustrates the changes in data distribution with increasing levels of noise. For data that has outliers, it is important to note that when the perturbed data distribution is similar to the original data distribution (e.g., at low noise levels), noise addition will not protect outliers. After noise addition, these outliers can generally still be detected as outliers and hence easily be identified. An example is a single very high income in a certain region. After perturbing this income value, the value will still be recognized as the highest income in that region and can thus be used for re-identification. This is illustrated in Fig. 6.6, where 10 original observations (open circles) and the anonymized observations (red triangles) are plotted. The tenth observation is an outlier. The values of the first nine observations are sufficiently protected by adding noise: their magnitude and order have changed, and exact or interval matching can be successfully prevented. The outlier is not sufficiently protected since, after noise addition, the outlier can still be easily identified. The fact that the absolute value has changed is not sufficient protection. On the other hand, at high noise levels, protection is higher even for the outliers, but the data structure is not preserved and the information loss is large, which is not an ideal situation. One way to circumvent the outlier problem is to add noise of larger magnitude to outliers than to the other values.

Fig. 6.6: Illustration of effect of noise addition to outliers

There are several noise addition algorithms. The simplest version of noise addition is uncorrelated additive normally distributed noise, where 𝑥ⱼ, the original values of variable 𝑗, are replaced by

z_j = x_j + \varepsilon_j, \quad \varepsilon_j \sim N(0, \sigma_{\varepsilon_j}^2), \quad \sigma_{\varepsilon_j} = \alpha \sigma_j

with 𝜎ⱼ the standard deviation of the original data. In this way, the mean and the covariances are preserved, but not the variances and correlation coefficient. If the level of noise added, 𝛼, is disclosed to the user, many statistics can be consistently estimated from the perturbed data. The added noise is proportional to the variance of the original variable. The magnitude of the noise added is specified by the parameter 𝛼, which specifies this proportion. The variance of the perturbed data is (1 + 𝛼²) times the variance of the original data. A decision on the magnitude of noise added should be informed by the legal situation regarding data privacy, data sensitivity and the acceptable levels of disclosure risk and information loss. In general, the level of noise is a function of the variance of the original variables, the level of protection needed and the desired value range after anonymization.¹⁸ An 𝛼 value that is too small will lead to insufficient protection, while an 𝛼 value that is too high will make the data useless for data users.
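As a minimal base R illustration of this formula (toy data, not sdcMicro):

# Toy income values (hypothetical)
x <- c(1200, 1500, 1800, 2500, 9000)
alpha <- 0.5

set.seed(1)
# Add normally distributed noise with standard deviation alpha * sd(x)
z <- x + rnorm(length(x), mean = 0, sd = alpha * sd(x))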

In sdcMicro noise addition is implemented in the function addNoise(). The algorithm and its parameters can be specified as arguments of the function addNoise(). Simple noise addition is implemented with the value “additive” for the argument ‘method’. Listing 6.19 shows how to use sdcMicro to add uncorrelated noise to expenditure variables, where the standard deviation of the added noise equals half the standard deviation of the original variables.¹⁹ Noise is added to all selected variables.

18 Common values for 𝛼 are between 0.5 and 2. The default value in the sdcMicro function addNoise() is 150, which is too large for most datasets; the level of noise should be set in the argument ‘noise’.

Listing 6.19: Uncorrelated noise addition

sdcInitial <- addNoise(obj = sdcInitial,
                       variables = c('TOTFOOD', 'TOTHLTH', 'TOTALCH', 'TOTCLTH',
                                     'TOTHOUS', 'TOTFURN', 'TOTTRSP', 'TOTCMNQ',
                                     'TOTRCRE', 'TOTEDUC', 'TOTHOTL', 'TOTMISC'),
                       noise = 0.5, method = "additive")

Fig. 6.7 shows the frequency distribution of a numeric continuous variable and the distribution before and after noise addition with different levels of noise (0.1, 0.5, 1, 2 and 5). The first plot shows the distribution of the original values. The histograms clearly show that noise of large magnitudes (high values of alpha) leads to a distribution of the data far from the original values. The distribution of the data changes to a normal distribution when the magnitude of the noise grows relative to the variance of the data. The mean in the data is preserved, but, with an increased level of noise, the variance of the perturbed data grows. After adding noise of magnitude 5, the distribution of the original data is completely destroyed.

Fig. 6.7: Frequency distribution of a continuous variable before and after noise addition

Fig. 6.8 shows the value range of a variable before adding noise (no noise) and after adding several levels of noise (𝛼 from 0.1 to 1.5 with 0.1 increments). In the figure, the minimum value, the 20th, 30th, 40th percentiles, the median, the 60th, 70th, 80th and 90th percentiles and the maximum value are plotted. The median (50th percentile) is indicated with the red “+” symbol. From Fig. 6.7 and Fig. 6.8, it is apparent that the range of values expands after noise addition, and the median stays roughly at the same level, as does the mean by construction. The larger the magnitude of noise added, the wider the value range. In cases where the variable should stay in a certain value range (e.g., only positive values, between 0 and 100), this can be a disadvantage of noise addition. For instance, expenditure variables typically have non-negative values, but adding noise to these variables can generate negative values, which are difficult to interpret. One way to get around this problem is to set any negative values to zero. This truncation of values below a certain threshold, however, will distort the distribution (mean and covariance matrix) of the perturbed data. This means that the characteristics that were preserved by noise addition, such as the conservation of the mean and covariance matrix, are destroyed and the user, even with knowledge of the magnitude of the noise, can no longer use the data for consistent estimation.

19 In this example and the following examples in this section, the sdcMicro object “sdcInitial” contains a dataset with 2,000 individuals and 39 variables. We selected five categorical quasi-identifiers and 12 continuous quasi-identifiers. These are the expenditure components “TFOODEXP”, “TALCHEXP”, “TCLTHEXP”, “THOUSEXP”, “TFURNEXP”, “THLTHEXP”, “TTRANSEXP”, “TCOMMEXP”, “TRECEXP”, “TEDUEXP”, “TRESTHOTEXP”, “TMISCEXP”.

Another way to avoid negative values is the application of multiplicative rather than additive noise. In that case, variables are multiplied by a random factor with expectation 1 and a positive variance. This will also lead to larger perturbations (in absolute value) of large initial values (outliers). If the variance of the noise added is small, there will be no or few negative factors and thus fewer sign changes than in the case of additive noise masking. Multiplicative noise masking is not implemented in sdcMicro, but can be relatively easily implemented in base R by generating a vector of random numbers and multiplying the data with this vector. For more information on multiplicative noise masking and the properties of the data after masking, we refer to KiWi03.
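A minimal base R sketch of this idea (the variable name is a placeholder, and the normal factor distribution and its standard deviation are assumptions to be tuned to the data):

set.seed(123)
x <- file$INC                              # hypothetical continuous variable
f <- rnorm(length(x), mean = 1, sd = 0.1)  # random factors with expectation 1
z <- x * f                                 # multiplicative noise masking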

Fig. 6.8: Noise levels and the impact on the value range (percentiles)

If two or more variables are selected for noise addition, correlated noise addition is preferred to preserve the correlation structure in the data. In this case, the covariance matrix of the noise Σ𝜀 is proportional to the covariance matrix of the original data Σ𝑋:

Σ𝜀 = 𝛼Σ𝑋

In the addNoise() function of the sdcMicro package, correlated noise addition can be used by specifying the methods ‘correlated’ or ‘correlated2’. The method ‘correlated’ assumes that the variables are approximately normally distributed. The method ‘correlated2’ is a version of the method ‘correlated’ that is robust against the normality assumption. Listing 6.20 shows how to use the ‘correlated2’ method. The normality of variables can be investigated in R with, for instance, a Jarque-Bera or Shapiro-Wilk test.²⁰

Listing 6.20: Correlated noise addition

sdcInitial <- addNoise(obj = sdcInitial,
                       variables = c('TOTFOOD', 'TOTHLTH', 'TOTALCH',
                                     'TOTCLTH', 'TOTHOUS', 'TOTFURN',
                                     'TOTTRSP', 'TOTCMNQ', 'TOTRCRE',
                                     'TOTEDUC', 'TOTHOTL', 'TOTMISC'),
                       noise = 0.5, method = "correlated2")

20 The Shapiro-Wilk test is implemented in the function shapiro.test() from the package stats in R. The Jarque-Bera test has several implementations in R, for example, in the function jarque.bera.test() from the package tseries.

In many cases, only the outliers have to be protected, or have to be protected more. The method ‘outdect’ adds noise only to the outliers, which is illustrated in Listing 6.21. The outliers are identified with univariate and robust multivariate procedures based on a robust Mahalanobis distance calculated by the MCD estimator (TMKC14). Nevertheless, noise addition is not the most suitable method for outlier protection.

Listing 6.21: Noise addition for outliers using the ‘outdect’ method

sdcInitial <- addNoise(obj = sdcInitial,
                       variables = c('TOTFOOD', 'TOTHLTH', 'TOTALCH',
                                     'TOTCLTH', 'TOTHOUS', 'TOTFURN',
                                     'TOTTRSP', 'TOTCMNQ', 'TOTRCRE',
                                     'TOTEDUC', 'TOTHOTL', 'TOTMISC'),
                       noise = 0.5, method = "outdect")

If noise addition is applied to variables that are a ratio of an aggregate, this structure can be destroyed by noise addition. Examples are income and expenditure data with many income and expenditure categories. The categories add up to total income or total expenditures. In the original data, the aggregates match the sum of the components. After adding noise to the components (e.g., different expenditure categories), however, the new aggregates will not necessarily match the sum of the categories anymore. One way to keep this structure is to add noise only to the aggregates and release the components as ratios of the perturbed aggregates. Listing 6.22 illustrates this by adding noise to the total of expenditures. Subsequently, the ratios of the initial expenditure categories are used for each individual to reconstruct the perturbed values for each expenditure category.

Listing 6.22: Noise addition to aggregates and their components

# Add noise to totals (income / expenditures)
sdcInitial <- addNoise(noise = 0.5, obj = sdcInitial, variables = c("EXP", "INC"),
                       method = "additive")

# Multiply anonymized totals with ratios to obtain anonymized components
compExp <- c("TOTFOOD", "TOTALCH", "TOTCLTH", "TOTHOUS", "TOTFURN",
             "TOTHLTH", "TOTTRSP", "TOTCMNQ", "TOTRCRE", "TOTEDUC",
             "TOTHOTL", "TOTMISC")

sdcInitial@manipNumVars[, compExp] <- sdcInitial@manipNumVars[, "HHEXP_N"] *
  sdcInitial@origData[, compExp] / sdcInitial@origData[, "HHEXP_N"]

# Recalculate risks after manually changing values in the sdcMicro object
sdcInitial <- calcRisks(sdcInitial)

Recommended Reading Material on Noise Addition

Brand, Ruth. 2002. “Microdata Protection through Noise Addition.” In Inference Control in Statistical Databases - From Theory to Practice, edited by Josep Domingo-Ferrer. Lecture Notes in Computer Science Series 2316, 97-116. Berlin Heidelberg: Springer. http://link.springer.com/chapter/10.1007%2F3-540-47804-3_8

Kim, Jay J, and William W Winkler. 2003. “Multiplicative Noise for Masking Continuous Data.” Research Report Series (Statistical Research Division, US Bureau of the Census). https://www.census.gov/srd/papers/pdf/rrs2003-01.pdf

Torra, Vicenç, and Isaac Cano. 2011. “Edit Constraints on Microaggregation and Additive Noise.” In Privacy and Security Issues in Data Mining and Machine Learning, edited by C. Dimitrakakis, A. Gkoulalas-Divanis, A. Mitrokotsa, V. S. Verykios, Y. Saygin. Lecture Notes in Computer Science Volume 6549, 1-14. Berlin Heidelberg: Springer. http://link.springer.com/book/10.1007/978-3-642-19896-0

Mivule, K. 2013. “Utilizing Noise Addition for Data Privacy, An Overview.” Proceedings of the International Conference on Information and Knowledge Engineering (IKE 2012), (pp. 65-71). Las Vegas, USA. http://arxiv.org/ftp/arxiv/papers/1309/1309.3958.pdf

6.3.4 Rank swapping

Data swapping is based on interchanging values of a certain variable across records. Rank swapping is one type of data swapping, which is defined for ordinal and continuous variables. For rank swapping, the values of the variable are first ordered. The possible number of values for a variable to swap with is constrained by the values in a neighborhood around the original value in the ordered values of the dataset. The size of this neighborhood can be specified, e.g., as a percentage of the total number of observations. This also means that a value can be swapped with the same or very similar values. This is especially the case if the neighborhood is small or there are only a few different values in the variable (ordinal variable). An example is the variable “education” with only a few categories: (‘none’, ‘primary’, ‘secondary’, ‘tertiary’). In these cases, rank swapping is not a suitable method.

If rank swapping is applied to several variables simultaneously, the correlation structure between the variables is preserved. Therefore, it is important to check whether the correlation structure in the data is plausible. Rank swapping is implemented in the function rankSwap() in sdcMicro. The variables that have to be swapped should be specified in the argument ‘variables’. By default, values below the 5th percentile and above the 95th percentile are top and bottom coded and replaced by their average value (see the Section Top and bottom coding). By specifying the options ‘TopPercent’ and ‘BottomPercent’ we can choose these percentiles. The argument ‘P’ defines the size of the neighborhood as a percentage of the sample size. If the value ‘p’ is 0.05, the neighborhood will be of size 0.05 * 𝑛, where 𝑛 is the sample size. Since rank swapping is a probabilistic method, i.e., the swapping depends on a random number generating mechanism, specifying a seed for the random number generator before using rank swapping is recommended to guarantee reproducibility of results. The seed can also be specified as a function argument in the function rankSwap(). Listing 6.23 shows how to apply rank swapping with sdcMicro. If the variables contain missing values (NA in R), the function rankSwap() will automatically recode those to the value specified in the ‘missing’ argument. This value should not be in the value range of any of the variables. After using the function rankSwap(), these values should be recoded to NA. This is shown in Listing 6.23.

Listing 6.23: Rank swapping using sdcMicro

# Check correlation structure between the variables
cor(file$TOTHOUS, file$TOTFOOD)
## [1] 0.3811335

# Set seed for random number generator
set.seed(12345)

# Apply rank swapping
rankSwap(sdcInitial, variables = c("TOTHOUS", "TOTFOOD"), missing = NA)
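As a sketch of the recoding step mentioned above, assuming the placeholder value -999 had been passed to the 'missing' argument and the treated variables were written back to a data frame called file (both the placeholder and the data frame name are hypothetical):

file$TOTHOUS[file$TOTHOUS == -999] <- NA  # recode placeholder back to NA
file$TOTFOOD[file$TOTFOOD == -999] <- NA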

Rank swapping has been found to yield good results with respect to the trade-off between information loss and data protection (DoTo01a). Rank swapping is not useful for variables with few different values or many missing values, since the swapping in that case will often not result in altered values. Also, if the intruder knows to whom the highest or lowest value of a specific variable belongs (e.g., income), the level of this variable will be disclosed after rank swapping, because the values themselves are not altered and all original values remain in the data. This can be solved by top and bottom coding the lowest and/or highest values.

Recommended Reading Material on Rank Swapping


Dalenius T. and Reiss S.P. 1978. "Data-swapping: A Technique for Disclosure Control (extended abstract)." In Proc. ASA Section on Survey Research Methods. American Statistical Association, Washington DC, 191-194.

Domingo-Ferrer J. and Torra V. 2001. "A Quantitative Comparison of Disclosure Control Methods for Microdata." In Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, edited by P. Doyle, J.I. Lane, J.J.M. Theeuwes, and L. Zayatz, 111-134. Amsterdam: North-Holland.

Hundepool A., Van de Wetering A., Ramaswamy R., Franconi F., Polettini S., Capobianchi A., De Wolf P.-P., Domingo-Ferrer J., Torra V., Brand R. and Giessing S. 2007. μ-Argus User's Manual version 4.1.

6.3.5 Shuffling

Shuffling, as introduced by MuSa06, is similar to swapping, but uses an underlying regression model for the variables to determine which values are swapped. Shuffling can be used for continuous variables and is a deterministic method. Shuffling maintains the marginal distributions in the shuffled data. Shuffling, however, requires a complete ranking of the data, which can be computationally very intensive for large datasets with several variables.

The method is explained in detail in MuSa06. The idea is to rank the individuals based on their original variables. Then a regression model is fitted with the variables to be protected as regressands and a set of variables that predict these variables well (i.e., are correlated with them) as regressors. This regression model is used to generate n synthetic (predicted) values for each variable that has to be protected. These generated values are also ranked, and each original value is replaced with another original value whose rank corresponds to the rank of the generated value. This means that all original values will be in the treated data. Table 6.14 presents a simplified example of the shuffling method; the regressors are not specified in this example.

Table 6.14: Simplified example of the shuffling method

ID   Income (orig)   Rank (orig)   Income (pred)   Rank (pred)   Shuffled values
1    2,300           2             2,466.56        4             2,345
2    2,434           6             2,583.58        7             2,543
3    2,123           1             2,594.17        8             2,643
4    2,312           3             2,530.97        6             2,434
5    6,045           10            5,964.04        10            6,045
6    2,345           4             2,513.45        5             2,365
7    2,543           7             2,116.16        1             2,123
8    2,854           9             2,624.32        9             2,854
9    2,365           5             2,203.45        2             2,300
10   2,643           8             2,358.29        3             2,312
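The core rank-matching step can be reproduced in a few lines of R. The sketch below uses the values from Table 6.14: the vectors orig and pred are simply the table's "Income (orig)" and "Income (pred)" columns, and rank-matching recovers the table's "Shuffled values" column.

# Rank-matching step of shuffling, using the values from Table 6.14
orig <- c(2300, 2434, 2123, 2312, 6045, 2345, 2543, 2854, 2365, 2643)
pred <- c(2466.56, 2583.58, 2594.17, 2530.97, 5964.04,
          2513.45, 2116.16, 2624.32, 2203.45, 2358.29)

# Replace each record by the original value whose rank equals
# the rank of the record's predicted value
shuffled <- sort(orig)[rank(pred)]
shuffled
## [1] 2345 2543 2643 2434 6045 2365 2123 2854 2300 2312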

The method 'ds' (the default method of data shuffling in sdcMicro) is recommended for use (TeMK14).21 A regression formula for the variables to be protected must be specified in the argument 'form'. At least two regressands should be specified, and the regressors should have predictive power for the variables to be predicted. This can be checked with goodness-of-fit measures such as the R-squared of the regression. The R-squared captures only linear relations, but these are also the only relations that are captured by the linear regression model used for shuffling. Following is an example of shuffling expenditure variables, which are predicted by total household expenditures and household size.

Listing 6.24: Shuffling using a specified regression equation

# Evaluate R-squared (goodness-of-fit) of the regression model
summary(lm(TOTFOOD + TOTALCH + TOTCLTH + TOTHOUS +
           TOTFURN + TOTHLTH + TOTTRSP + TOTCMNQ +
           TOTRCRE + TOTEDUC + TOTHOTL + TOTMISC ~ EXP + HHSIZE,
           data = file))

# Shuffling using the specified regression equation
sdcInitial <- shuffle(sdcInitial, method = 'ds',
                      form = TOTFOOD + TOTALCH + TOTCLTH + TOTHOUS +
                             TOTFURN + TOTHLTH + TOTTRSP + TOTCMNQ +
                             TOTRCRE + TOTEDUC + TOTHOTL + TOTMISC ~ EXP + HHSIZE)

21 In sdcMicro, several other methods for shuffling are implemented, including 'mvn' and 'mlm'. See the help file of the shuffle() function in sdcMicro for details on the methods 'ds', 'mvn' and 'mlm'.

Recommended Reading Material on Shuffling

K. Muralidhar and R. Sarathy. 2006. "Data Shuffling - A New Masking Approach for Numerical Data." Management Science, 52, 658-670.

6.3.6 Comparison of PRAM, rank swapping and shuffling

PRAM, rank swapping and shuffling are all perturbative methods, i.e., they change the values for individual records. Rank swapping and shuffling are mainly used for continuous variables, whereas PRAM is used for categorical variables. After rank swapping and shuffling, the original values are all contained in the treated dataset but might be assigned to other records. This implies that univariate tabulations are not changed. This also holds in expectation for PRAM, if a transition matrix is chosen that has the invariant property.

The choice of a method depends on the structure to be preserved in the data. In cases where a regression model fits the data well, data shuffling works very well, provided there are sufficient (continuous) regressors available. Rank swapping works well if there are sufficient categories in the variables. PRAM is preferred if the perturbation method should be applied to only one or a few variables; its advantages are the possibility of specifying restrictions on the transition matrix and of applying PRAM only within strata, which can be user defined.

6.4 Anonymization of geospatial variables

Recently, geospatial data has become increasingly popular with researchers and more widespread. Georeferenced data identifies the geographical location of each record with the help of a Geographical Information System (GIS), which uses, for instance, GPS (Global Positioning System) coordinates or address data. The advantages of geospatial data are manifold: 1) researchers can create their own geographical areas, such as the service area of a hospital; 2) it enables researchers to measure the proximity to facilities, such as schools; 3) researchers can use the data to extract geographical patterns; and 4) it enables linking of data from different sources (see, e.g., BCRZ13). However, geospatial data, due to the precise reference to a location, also pose a challenge to the privacy of the respondents.

One way to anonymize georeferenced data is to remove the GIS variables and instead leave in or create other geographical variables, such as province or region. However, this approach also removes the benefits of geospatial data. Another option is the geographical displacement of areas and/or records. BCRZ13 describe a geographical displacement procedure for a health dataset; this paper also includes the code in Python. HuDr15 propose three different strategies for generating synthetic geocodes.

Recommended Reading Material on Anonymization of Geospatial Data

C.R. Burgert, J. Colston, T. Roy and B. Zachary. 2013. "DHS Spatial Analysis Report No. 7 - Geographic Displacement Procedure and Georeferenced Data Release Policy for the Demographic and Health Surveys" (USAID). http://dhsprogram.com/pubs/pdf/SAR7/SAR7.pdf

J. Hu and J. Drechsler. 2015. "Generating Synthetic Geocoding Information for Public Release." http://www.iab.de/389/section.aspx/Publikation/k150601301


6.5 Anonymization of the quasi-identifier household size

The size of a household is an important identifier, especially for large households.22 Suppressing the actual size variable, if available (e.g., number of household members), does not suffice to remove this information from the dataset, since a simple count of the household members of a particular household allows this variable to be reconstructed as long as a household ID is in the data. In any case, households of a very large size or with a unique or special key (i.e., combination of values of quasi-identifiers) should be checked manually. One way to treat them is to remove these households from the dataset before release. Alternatively, the households can be split, but care should be taken to suppress or change values for these households to prevent an intruder from immediately understanding that these households have been split and reconstructing them by combining the two households with the same values.
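The following minimal sketch illustrates why suppressing the size variable alone does not help; it assumes a data frame named file with a household ID variable HID (both names are hypothetical):

# Household size can be reconstructed by counting the records per household ID
file$HHSIZE <- ave(rep(1, nrow(file)), file$HID, FUN = sum)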

6.6 Special case: census data

Census microdata are a special case because the user (and intruder) knows that all respondents are included in the dataset. Therefore, risk measures that use the sample weights and are based on uncertainty about the correctness of a match are no longer applicable. If an intruder has identified a sample unique and successfully matched it, there is no doubt whether the match is correct, as there would be in the case of a sample. One approach to releasing census microdata is to release a stratified sample of the census (1 - 5% of the total census).
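A minimal sketch of drawing such a stratified sample in base R, assuming a data frame census with a stratum variable REGION (both names are hypothetical):

# Draw a 5% sample within each region
set.seed(123)
idx <- unlist(lapply(split(seq_len(nrow(census)), census$REGION),
                     function(i) sample(i, size = round(0.05 * length(i)))))
censusSample <- census[idx, ]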

Note: After sampling, the anonymization process still has to be followed; sampling alone is not sufficient to guarantee confidentiality.

Several statistical offices release microdata based on census data. A few examples are:

• The British Office for National Statistics (ONS) released several files based on the 2011 census: 1) a microdata teaching file for educational purposes, a 1% sample of the total census with a limited set of variables; 2) two scientific use files with 5% samples, available to registered researchers who accept the terms and conditions of their use; and 3) two 10% samples, available in controlled research data centers for approved researchers and research goals. All these files have been anonymized prior to release.23

• The U.S. Census Bureau released two samples of the 2000 census: a 5% sample at the national level and a 1% sample at the state level. The national-level file is more detailed, but the most detailed geographical area has at least 400,000 people; this, however, allows representation of all states in the dataset. The state-level file has less detailed variables but a more detailed geographical structure, which allows representation of cities and larger counties in the dataset (the minimum size of a geographical area is 100,000). Both files have been anonymized by using data swapping, top coding, perturbation and reducing detail by recoding.24

References

22 Even if the dataset does not contain an explicit variable with household size, this information can be easily extracted from the data and should be taken into account. The Section Household structure shows how to create a variable "household size" based on the household IDs.

23 More information on census microdata at ONS is available on their website: http://www.ons.gov.uk/ons/guide-method/census/2011/census-data/census-microdata/index.html

24 More information on the anonymization of these files is available on the website of the U.S. Census Bureau: https://www.census.gov/population/www/cen2000/pums/index.html


CHAPTER 7

Measuring Utility and Information Loss

SDC is a trade-off between the risk of disclosure and the loss of data utility; it seeks to minimize the latter while reducing the risk of disclosure to an acceptable level. Data utility in this context means the usefulness of the anonymized data for statistical analyses by end users as well as the validity of these analyses when performed on the anonymized data. Disclosure risk and its measurement are defined in the Section Measuring Risk of this guide. In order to make a trade-off between minimizing disclosure risk and maximizing utility of the data for end users, it is necessary to measure the utility of the data after anonymization and compare it with the utility of the original data. This chapter describes measures that can be used to compare the data utility before and after anonymization or, alternatively, to quantify the information loss. Information loss is the inverse of data utility: the larger the data utility after anonymization, the smaller the information loss.

Note: If the microdata to be anonymized are based on a sample, the data will incur a sampling error. Other errors may also be present in the data, such as nonresponse error.

Note: The methods discussed here only measure the information loss caused by the anonymization process relative to the original sample data and do not attempt to measure the error caused by other sources.

Ideally, the information loss is evaluated with respect to the needs and uses of the end users of the microdata. However, different end users of anonymized data may have very diverse uses for the released data, and it might not be possible to collect an exhaustive list of these uses. Even if many uses can be identified, the characteristics required in the data for these uses can be contradictory (e.g., one user needs a detailed geographical level whereas another is interested in a detailed age structure and does not need a detailed geographical structure). Nevertheless, as pointed out earlier, only one anonymized dataset can be released for each dataset and each type of release, since releasing multiple anonymized datasets for different purposes may lead to unintended disclosure.1 Therefore, it is not possible to anonymize and release a file tailored to each user's needs.

Since collecting and taking into account all data uses is often impossible, we also look at general (or generic) information loss measures besides user- and data-specific information loss measures. These measures do not take into account the specific data use, but can be used as guiding measures for information loss and for evaluating whether a dataset is still analytically valid after anonymization. The main idea of such measures is to compare records between the original and treated datasets and to compare statistics computed from both datasets (HDFG12). Examples of such measures are the number of suppressions, the number of changed values, changes in contingency tables, and changes in means and covariance matrices.

1 It is possible to release data files for different groups of users, e.g., PUF and SUF. All information in the less detailed file, however, must also be included in the more detailed file to prevent unintended disclosure. Datasets released in data enclaves can be customized for the user, since the risk that they will be combined with other versions is zero.

Many of the SDC methods discussed earlier are parametric, in the sense that their outcome depends on parameters chosen by the user. Examples are the cluster size for microaggregation (see the Section Microaggregation) or the importance vector in local suppression (see the Section Local suppression). Data utility and information loss measures are useful for choosing these parameters by comparing the impact of different parameters on the information loss. Fig. 7.1 illustrates this by showing the trade-off between the disclosure risk and data utility of a hypothetical dataset. The triangle represents the original data with full utility and a certain level of disclosure risk, which is too high for release. The square represents no release of microdata: although there is no risk of disclosure, there is also no utility for users, since no data is released. The points in between represent the results of applying different SDC methods with different parameter specifications. We would select the SDC method corresponding to the point that maximizes the utility while keeping the disclosure risk at an acceptable level.

Fig. 7.1: The trade-off between risk and utility in a hypothetical dataset

In the following sections, we first present general utility measures independent of data use, and then give an example of a specific measure useful for measuring information loss with respect to specific data uses. Finally, we show how to visualize changes in the data caused by anonymization and discuss the selection of utility measures for a particular dataset.

7.1 General utility measures for continuous and categorical variables

General or generic measures of information loss can be divided into those comparing the actual values of the raw and anonymized data, and those comparing statistics computed from both datasets. All measures are a posteriori, since they measure utility after anonymization and require both the data before and after the anonymization process. General utility measures are different for categorical and continuous variables.


7.1.1 General utility measures for categorical variables

Number of missing values

An informative measure is to compare the number of missing values in the data. Missing values are often introduced after suppression, and more suppressions indicate a higher degree of information loss. After using the local suppression function on an sdcMicro object,2 the number of suppressions for each categorical key variable can be retrieved with the print() function, which is illustrated in Listing 7.1. The argument 'ls' in the print() function stands for local suppression. The output shows both the absolute and relative number of suppressions.

Listing 7.1: Using the print() function to retrieve the total number of suppressions for each categorical key variable

sdcInitial <- localSuppression(sdcInitial, k = 5, importance = NULL)

print(sdcInitial, 'ls')

## Local Suppression:
##  KeyVar | Suppressions (#) | Suppressions (%)
##  URBRUR |                0 |            0.000
##  REGION |               81 |            4.050
##  RELIG  |                0 |            0.000
## MARITAL |                0 |            0.000
## ---------------------------------------------------------------------------

More generally, it is possible to count and compare the number of missing values in the original data and in the treated data. This can be useful to see the proportional increase in the number of missing values. Missing values can also have other sources, such as nonresponse. Listing 7.2 shows how to display the number of missing values for each of the categorical key variables in an sdcMicro object. Here it is assumed that all missing values are coded 'NA'. If the missing values are coded differently, the alternative missing value code should be used instead. The results agree with the number of missing values introduced by local suppression in the previous example, but also show that the variable "RELIG" already has 1,000 missing values in the original data.

Listing 7.2: Displaying the number of missing values for each categorical key variable in an sdcMicro object

# Store the names of all categorical key variables in a vector
namesKeyVars <- names(sdcInitial@manipKeyVars)

# Matrix to store the number of missing values (NA) before and after anonymization
NAcount <- matrix(NA, nrow = 2, ncol = length(namesKeyVars))
colnames(NAcount) <- c(paste0('NA', namesKeyVars)) # column names
rownames(NAcount) <- c('initial', 'treated')       # row names

# NA count in all key variables (NOTE: only those coded NA are counted)
for(i in 1:length(namesKeyVars)) {
  NAcount[1, i] <- sum(is.na(sdcInitial@origData[, namesKeyVars[i]]))
  NAcount[2, i] <- sum(is.na(sdcInitial@manipKeyVars[, i]))
}

# Show results
NAcount
##         NAURBRUR NAREGION NARELIG NAMARITAL
## initial        0        0    1000        51
## treated        0       81    1000        51

2 Here the sdcMicro object "sdcInitial" contains a dataset with 2,500 individuals and 103 variables. We selected four categorical quasi-identifiers, "URBRUR", "REGION", "RELIG" and "MARITAL", and several continuous quasi-identifiers relating to income and expenditure. To illustrate the utility loss, we also applied several SDC methods to this sdcMicro object, such as local suppression, PRAM and additive noise.

Number of records changed

Another useful statistic is the number of records changed per variable. These can be counted in a similar way as the missing values and include suppressions (i.e., changes to missing/'NA' in R). The number of records changed gives a good indication of the impact of the anonymization methods on the data. Listing 7.3 illustrates how to compute the number of records changed for the PRAMmed variables.

Listing 7.3: Computing number of records changed per variable

# Store the names of all PRAMmed variables in a vector
namesPramVars <- names(sdcInitial@manipPramVars)

# Vector to save the number of records changed
recChanged <- rep(0, length(namesPramVars))
names(recChanged) <- c(paste0('RC', namesPramVars))

# Count number of records changed
for(j in 1:length(namesPramVars)) { # for all PRAMmed variables
  comp  <- sdcInitial@origData[namesPramVars[j]] !=
           sdcInitial@manipPramVars[namesPramVars[j]]
  temp1 <- sum(comp, na.rm = TRUE) # all changed values without NAs
  temp2 <- sum(is.na(comp))        # if NA, changed, unless NA initially
  temp3 <- sum(is.na(sdcInitial@origData[namesPramVars[j]]) +
               is.na(sdcInitial@manipPramVars[namesPramVars[j]]) == 2)
                                   # both NA, no change, but counted in temp2
  recChanged[j] <- temp1 + temp2 - temp3
}

# Show results
recChanged
## RCWATER  RCROOF RCTOILET
##     125      86      180

Comparing contingency tables

A useful way to measure information loss in categorical variables is to compare univariate tabulations and, more interestingly, contingency tables (also called cross tabulations or two-way tables) between pairs of variables. To maintain the analytical validity of a dataset, the contingency tables should stay approximately the same. The function table() produces contingency tables of one or more variables. Listing 7.4 creates a contingency table of the variables "REGION" and "URBRUR". We observe small differences between the tables before and after anonymization.

Listing 7.4: Comparing contingency tables of categorical variables

# Contingency table (cross tabulation) of the variables region and urban/rural
table(sdcInitial@origData[, c('REGION', 'URBRUR')]) # before anonymization
##       URBRUR
## REGION   1   2
##      1 235  89
##      2 261  73
##      3 295  76
##      4 304  71
##      5 121 139
##      6 100 236

table(sdcInitial@manipKeyVars[, c('REGION', 'URBRUR')]) # after anonymization
##       URBRUR
## REGION   1   2
##      1 235  89
##      2 261  73
##      3 295  76
##      4 304  71
##      5 105 130
##      6  79 201

DoTo01b propose a Contingency Table-Based Information Loss (CTBIL) measure, which quantifies the distance between the contingency tables in the original and treated data. Alternatively, visualizations of the contingency tables with mosaic plots can be used to compare the impact of anonymization methods on the tabulations and contingency tables (see the Section Mosaic plots).
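As a simple stand-in for such a distance measure (not the CTBIL measure itself), one can sum the absolute cell-wise differences between the two tabulations, assuming both variables have the same categories before and after anonymization:

# Total absolute change across all cells of the contingency table
tabBefore <- table(sdcInitial@origData[, c('REGION', 'URBRUR')])
tabAfter  <- table(sdcInitial@manipKeyVars[, c('REGION', 'URBRUR')])
sum(abs(tabBefore - tabAfter))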

7.1.2 General utility measures for continuous variables

Statistics: mean, covariance, correlation

The statistics characterizing the dataset should not change after the anonymization. Examples of such statistics are the mean, variance, and the covariance and correlation structure of the most important variables in the dataset. Other statistics characterizing the data include the principal components and the loadings. DoTo01b give an overview of statistics that can be considered. In order to evaluate the information loss caused by the anonymization, one should compare the appropriate statistics for continuous variables computed from the data before and after anonymization. There are several ways to evaluate the loss of utility with respect to changes in these statistics, for instance, by comparing means and (co-)variances in the data or comparing the (multivariate) distributions of the data. Especially changes in the correlations give valuable information on the validity of the data for regressions. Functions from the R base package or any other statistical package can be used for this. Following are a few examples in R.

To compute the mean of each numerical variable, we use the function colMeans(). To ignore missing values, it is necessary to use the option na.rm = TRUE. "numVars" is a vector with the names of the numerical variables. Listing 7.5 shows how to compute the means for all numerical variables. The untreated data is extracted from the 'origData' slot of the sdcMicro object and the anonymized data from the 'manipNumVars' slot, which contains the manipulated numerical variables. We observe small changes in each of the three variables.

Listing 7.5: Comparing the means of continuous variables

# untreated data
colMeans(sdcInitial@origData[, numVars], na.rm = TRUE)
##       INC    INCRMT   INCWAGE
##  479.7710  961.0295 1158.1330

# anonymized data
colMeans(sdcInitial@manipNumVars[, numVars], na.rm = TRUE)
##       INC    INCRMT   INCWAGE
##  489.6030  993.8512 1168.7561

In the same way, one can compute the covariance and correlation matrices of the numerical variables in the sdcMicro object from the untreated and anonymized data. This is shown in Listing 7.6. We observe that the variance of each variable (the diagonal elements in the covariance matrix) has increased through the anonymization. These functions also allow computing confidence intervals in the case of samples. The means and covariances of subsets in the data should also not differ. An example is the mean of income by gender, by age group or by region. These characteristics of the data are important for analysis.

Listing 7.6: Comparing covariance structure and correlation matrices of numeric variables

# untreated data
cov(sdcInitial@origData[, numVars])
##               INC    INCRMT  INCWAGE
## INC     1645926.1  586975.6  2378901
## INCRMT   586975.6 6984502.3  1664257
## INCWAGE 2378900.7 1664257.4 16169878

cor(sdcInitial@origData[, numVars])
##               INC    INCRMT   INCWAGE
## INC     1.0000000 0.1731200 0.4611241
## INCRMT  0.1731200 1.0000000 0.1566028
## INCWAGE 0.4611241 0.1566028 1.0000000

# anonymized data
cov(sdcInitial@manipNumVars[, numVars])
##               INC    INCRMT  INCWAGE
## INC     2063013.1  649937.5  2382447
## INCRMT   649937.5 8566169.1  1778985
## INCWAGE 2382447.4 1778985.1 19925870

cor(sdcInitial@manipNumVars[, numVars])
##               INC    INCRMT   INCWAGE
## INC     1.0000000 0.1546063 0.3715897
## INCRMT  0.1546063 1.0000000 0.1361665
## INCWAGE 0.3715897 0.1361665 1.0000000

DoTo01b propose several measures for the discrepancy between the covariance and correlation matrices. These measures are based on the mean squared error, the mean absolute error or the mean variation of the individual cells. We refer to DoTo01b for a complete overview of these measures.

IL1s information loss measure

Alternatively, we can compare the actual data and quantify the distance between the original dataset $X$ and the treated dataset $Z$. Here $X$ and $Z$ contain only continuous variables. YaWC02 introduce the distance measure IL1s, which is the sum of the absolute distances between the corresponding observations in the raw and anonymized datasets, standardized by the standard deviation of the variables in the original data. For the continuous variables in the dataset, the IL1s measure is defined as

$$IL1s = \frac{1}{pn} \sum_{j=1}^{p} \sum_{i=1}^{n} \frac{|x_{ij} - z_{ij}|}{\sqrt{2}\, S_j},$$

where $p$ is the number of continuous variables; $n$ is the number of records in the dataset; $x_{ij}$ and $z_{ij}$, respectively, are the values before and after anonymization for variable $j$ and individual $i$; and $S_j$ is the standard deviation of variable $j$ in the original data (YaWC02).
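Before turning to the built-in function, the formula can be computed directly in a few lines of R. This is a minimal sketch, assuming the (hypothetical) name vector numVars selects the continuous quasi-identifiers:

# Direct computation of IL1s from the formula
X <- as.matrix(sdcInitial@origData[, numVars])     # original values
Z <- as.matrix(sdcInitial@manipNumVars[, numVars]) # anonymized values
p <- ncol(X)
n <- nrow(X)
S <- apply(X, 2, sd, na.rm = TRUE)                 # std. dev. in the original data
IL1s <- sum(sweep(abs(X - Z), 2, sqrt(2) * S, "/"), na.rm = TRUE) / (p * n)
IL1s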

When using sdcMicro, the IL1s data utility measure can be computed for all numerical quasi-identifiers with the function dUtility(), which is illustrated in Listing 7.7. If required, the measure can also be computed on subsets of the complete set of numerical quasi-identifiers. The function is called dUtility(), but it returns a measure of information loss. The result is saved in the 'utility' slot of the sdcMicro object. Listing 7.7 also illustrates how to call the result.

Listing 7.7: Using dUtility() to compute the IL1s data utility measure in sdcMicro

# Evaluating IL1s measure for all variables in the sdcMicro object sdcInitial
sdcInitial <- dUtility(sdcInitial)

# Calling the result of IL1s
sdcInitial@utility$il1
## [1] 0.2203791

# IL1s for a subset of the numerical quasi-identifiers
subset <- c('INCRMT', 'INCWAGE', 'INCFARMBSN')
dUtility(obj = sdcInitial@origData[, subset], xm = sdcInitial@manipNumVars[, subset],
         method = 'IL1')
## [1] 0.5641103

The measure is useful for comparing different methods. The smaller the value of the measure, the closer the values are to the original values and the higher the utility.

Note: This measure is related to risk measures based on distance and intervals (see the Section Risk measures for continuous variables). The greater the distance between the original and anonymized values, the lower the data utility. Greater distance, however, also reduces the risk of re-identification.

Eigenvalues

Another way to evaluate the information loss is to compare the (robust) eigenvalues of the data before and after anonymization. Listing 7.8 illustrates how to use this approach with sdcMicro. Here "contVars" is a vector with the names of the continuous variables of interest, "obj" is the argument that specifies the untreated data and "xm" is the argument that specifies the anonymized data. The function's output is the difference in eigenvalues; therefore, the minimum value is 0. Again, the main use is to compare different methods. The greater the value, the greater the changes in the data and the information loss.

Listing 7.8: Using dUtility() to compute eigenvalues in sdcMicro

# Comparison of eigenvalues of continuous variables
dUtility(obj = sdcInitial@origData[, contVars],
         xm = sdcInitial@manipNumVars[, contVars], method = 'eigen')
## [1] 2.482948

# Comparison of robust eigenvalues of continuous variables
dUtility(obj = sdcInitial@origData[, contVars],
         xm = sdcInitial@manipNumVars[, contVars], method = 'robeigen')
## [1] -4.297621e+14

7.2 Utility measures based on the end user’s needs

Not all needs and uses of a certain dataset can be inventoried. Nevertheless, some types of data have similar uses or important characteristics, which can be evaluated before and after anonymization. Examples of such "benchmarking indicators" (TMKC14) are different for each dataset. Examples include poverty measures for income datasets and school attendance ratios. Often ideas for selecting such indicators come from the reports data users publish based on previously released microdata.

The approach is to compare the indicators calculated on the untreated data and on the data treated with different anonymization methods. If the differences between the indicators are not too large, the anonymized dataset can be released for use by researchers. It should be taken into account that indicators calculated on samples are estimates with a certain variance and confidence interval. Therefore, for sample data, it is more informative to compare the overlap of confidence intervals and/or to evaluate whether the point estimate calculated after anonymization is contained within the confidence interval of the original estimate. Examples of benchmark indicators and their confidence intervals, and how to compute these in R, are included in the case studies in these guidelines. Here we give the example of the GINI coefficient.

The GINI coefficient is a measure of statistical dispersion that is often used to measure inequality in income. A way to measure the information loss in income data is therefore to compare the income distributions, which can easily be done by comparing the GINI coefficients. Several R packages have functions to compute the GINI coefficient. We chose the laeken package, which computes the GINI coefficient as the area between the 45-degree line and the Lorenz curve. To use the gini() function, we first have to install and load the laeken library. To calculate the GINI coefficient for the variable income, we use the sample weights in the data. This is shown in Listing 7.9. The GINI coefficient of sample data is a random variable. Therefore, it is useful to construct a confidence interval around the coefficient to evaluate the significance of any change in the coefficient after anonymization. The gini() function computes a 1-alpha confidence interval for the GINI coefficient by using a bootstrap.

Listing 7.9: Computing the GINI coefficient from the income variable to determine income inequality

# Gini coefficient before anonymization
gini(inc = sdcInitial@origData[selInc, 'INC'],
     weights = curW[selInc], na.rm = TRUE)$value # before
## [1] 34.05928

# Gini coefficient after anonymization
gini(inc = sdcInitial@manipNumVars[selInc, 'INC'],
     weights = curW[selInc], na.rm = TRUE)$value # after
## [1] 67.13218
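For illustration, a naive bootstrap interval can also be computed by hand, resampling individuals and ignoring the survey design; incVec and wVec are hypothetical shorthands for the income and weight vectors used above:

incVec <- sdcInitial@origData[selInc, 'INC']
wVec   <- curW[selInc]

# Naive bootstrap of the GINI coefficient
set.seed(123)
giniBoot <- replicate(200, {
  idx <- sample(seq_along(incVec), replace = TRUE)
  gini(inc = incVec[idx], weights = wVec[idx], na.rm = TRUE)$value
})
quantile(giniBoot, probs = c(0.025, 0.975), na.rm = TRUE) # 95% interval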

7.3 Regression

Besides comparing covariance and correlation matrices, regressions are a useful tool to evaluate whether the structure in the data is maintained after anonymization. By comparing regression parameters, it is also possible to compare relations between non-continuous variables (e.g., by introducing dummy variables or regressions with ordinal variables). If it is known for what purpose and in what field the data are used, common regressions can be used to compare the change in coefficients and confidence intervals.

An example of using regression to evaluate the data utility in income data is the Mincer equation. The Mincer equation explains earnings as a function of education and experience while controlling for other variables. It is often used to evaluate the gender pay gap and gender wage inequality by including a gender dummy. Here we show how to evaluate the impact of anonymization methods on the gender coefficient. We regress the log income on a constant, a gender dummy, years of education, years of experience, years of experience squared and other factors influencing wage:

$$\ln(wage) = \beta_0 + \beta_1 gender + \beta_2 education + \beta_3 experience + \beta_4 experience^2 + \beta X$$

The parameter of interest here is $\beta_1$, the effect of gender on the log wage. $X$ is a matrix with several other factors influencing wage and $\beta$ the vector of coefficients of these factors. Listing 7.10 illustrates how to run a Mincer regression in R using the function lm() and how to evaluate the coefficients and the confidence intervals around them. We run the regression as specified for paid employees with a positive wage in the age group 15 - 65 years.

Listing 7.10: Estimating the Mincer equation (regression) to evaluate data utility before and after anonymization

# Mincer equation variables before anonymization
Mlwage   <- log(sdcMincer@origData$wage)  # log wage
Mempstat <- sdcMincer@origData$empstat == 'Paid employee'
                                          # TRUE if 'paid employee', else FALSE or NA
Mage     <- sdcMincer@origData$age        # age in years
Meducy   <- sdcMincer@origData$educy      # education in years
Mexp     <- sdcMincer@origData$exp        # experience in years
Mexp2    <- Mexp^2                        # squared experience
Mgender  <- sdcMincer@origData$gender     # gender dummy
Mwgt     <- sdcMincer@origData$wgt        # weight variable for regression
MfileB <- as.data.frame(cbind(Mlwage, Mempstat, Mage, Meducy, Mexp, Mexp2,
                              Mgender, Mwgt))

# Mincer equation variables after anonymization
Mlwage   <- log(sdcMincer@manipNumVars$wage)  # log wage
Mempstat <- sdcMincer@manipKeyVars$empstat == 'Paid employee'
                                          # TRUE if 'paid employee', else FALSE or NA
Mage     <- sdcMincer@manipKeyVars$age    # age in years
Meducy   <- sdcMincer@manipKeyVars$educy  # education in years
Mexp     <- sdcMincer@manipKeyVars$exp    # experience in years
Mexp2    <- Mexp^2                        # squared experience
Mgender  <- sdcMincer@manipKeyVars$gender # gender dummy
Mwgt     <- sdcMincer@origData$wgt        # weight variable for regression
MfileA <- as.data.frame(cbind(Mlwage, Mempstat, Mage, Meducy, Mexp, Mexp2,
                              Mgender, Mwgt))

# Specify regression formula
Mformula <- 'Mlwage ~ Meducy + Mexp + Mexp2 + Mgender'

# Regression Mincer equation
mincer1565B <- lm(Mformula,
                  data = subset(MfileB,
                                MfileB$Mage >= 15 & MfileB$Mage <= 65 &
                                MfileB$Mempstat == TRUE & MfileB$Mlwage != -Inf),
                  na.action = na.exclude, weights = Mwgt) # before
mincer1565A <- lm(Mformula,
                  data = subset(MfileA,
                                MfileA$Mage >= 15 & MfileA$Mage <= 65 &
                                MfileA$Mempstat == TRUE & MfileA$Mlwage != -Inf),
                  na.action = na.exclude, weights = Mwgt) # after

# The objects mincer1565B and mincer1565A contain the results of the
# regressions before and after anonymization
mincer1565B$coefficients # before
##  (Intercept)       Meducy         Mexp        Mexp2      Mgender
## 3.9532064886 0.0212367075 0.0255962570 -0.0005682651 -0.4931289413

mincer1565A$coefficients # after
##  (Intercept)       Meducy         Mexp        Mexp2      Mgender
## 4.0526250282 0.0141090329 0.0326711056 -0.0007605492 -0.5393641862

# Compute the 95 percent confidence intervals
confint(object = mincer1565B, level = 0.95) # before
##                    2.5 %        97.5 %
## (Intercept)  3.435759991  4.4706529860
## Meducy      -0.018860497  0.0613339120
## Mexp         0.004602597  0.0465899167
## Mexp2       -0.000971303 -0.0001652273
## Mgender     -0.658085143 -0.3281727396

confint(object = mincer1565A, level = 0.95) # after
##                   2.5 %        97.5 %
## (Intercept)  3.46800378  4.6372462758
## Meducy      -0.03305743  0.0612754964
## Mexp         0.01024867  0.0550935366
## Mexp2       -0.00119162 -0.0003294784
## Mgender     -0.71564602 -0.3630823543

If the new estimates fall within the original confidence interval and the new and original confidence intervals overlap greatly, the data can be considered valid for this type of regression after anonymization. Fig. 7.2 shows the point estimates and confidence intervals for the gender coefficient in this trade-off for a sample income dataset and several SDC methods and parameters. The red dot and confidence bar (at the top) correspond to the estimates for the untreated data, whereas the other confidence bars correspond to the respective SDC methods and different parameters. The anonymization reduces the number of expected re-identifications in the data (left axis), and the point estimates and confidence intervals vary greatly across the different SDC methods. We would choose a method that reduces the expected number of re-identifications while not changing the gender coefficient, with a large overlap between its confidence interval and the confidence interval estimated from the original data.

Fig. 7.2: Effect of anonymization on the point estimates and confidence interval of the gender coefficient in the Mincer equation


7.4 Assessing data utility with the help of data visualizations (in R)

The use of graphs and other visualization techniques is a good way to assess at a glance how much the data have changed after anonymization, and can aid the selection of appropriate anonymization techniques for the data. Visualizations are a useful tool to assess the impact of anonymization methods on data utility and can help in choosing among anonymization methods. The software package R provides several functions and packages that can help visualize the results of anonymization. This section lists a few of these functions and packages and provides code examples to illustrate how to implement them. We present the following visualizations:

• histograms and density plots

• boxplots

• mosaic plots

To make appropriate visualizations, we need to use the raw data and the anonymized data. When using an sdcMicro object for the anonymization process, the raw data are stored in the "origData" slot of the object and the anonymized variables are in the "manipKeyVars", "manipPramVars", "manipNumVars" and "manipStrataVar" slots. See the Section Objects of class sdcMicroObj for more information on sdcMicro objects, slots and how to access slots.

7.4.1 Histograms and density plots

Histograms and density plots are useful for quick comparisons of the distribution of a variable before and after anonymization. The advantage of histograms is that the results are exact; the visualization, however, depends on the bin width and the start point of the first bin. Histograms can be used for continuous and semi-continuous variables. Density plots display the kernel density of the data; the plot therefore depends on the kernel that is chosen and on whether the data fit the kernel well. Nevertheless, density plots are a good tool to illustrate the change of values and value ranges of continuous variables.

Histograms can be plotted with the function hist() and kernel densities with the functions plot() and density() in R. Listing 7.11 provides examples of how to use these functions to illustrate the changes in the variable "INC", an income variable. The function hist() needs the break points for the histogram as an argument. The results are shown in Fig. 7.3 and Fig. 7.4. The histograms and density plots give a clear indication of how the values have changed: the variability of the data has increased and the shape of the distribution has changed.

Note: The vertical axes of the histograms have different scales.

Listing 7.11: Plotting histograms and kernel densities

# Plot histograms
# Plot histogram before anonymization
hist(sdcInitial@origData$INC, breaks = (0:180)*1e2,
     main = "Histogram income - original data")

# Plot histogram after anonymization (noise addition)
hist(sdcInitial@manipNumVars$INC, breaks = (-20:190)*1e2,
     main = "Histogram income - anonymized data")

# Plot densities
# Plot original density curve
plot(density(sdcInitial@origData$INC), xlim = c(0, 8000), ylim = c(0, 0.006),
     main = "Density income", xlab = "income")
par(new = TRUE)

# Plot density curve after anonymization (noise addition)
plot(density(sdcInitial@manipNumVars$INC), xlim = c(0, 8000), ylim = c(0, 0.006),
     main = "Density income", xlab = "income")

Fig. 7.3: Histograms of income before and after anonymization

7.4.2 Box plots

Box plots give a quick overview of the changes in the spread and outliers of continuous variables before and after anonymization. Listing 7.12 shows how to generate box plots in R with the function boxplot(). The result in Fig. 7.5 shows an example for an expenditure variable after adding noise. The box plot clearly shows that the variability in the expenditure variable increased as a result of the anonymization methods applied.

Listing 7.12: Creating boxplots for continuous variables

boxplot(sdcObj@origData$TOTFOOD, sdcObj@manipNumVars$TOTFOOD,
        xaxt = 'n', ylab = "Expenditure")
axis(1, at = c(1, 2), labels = c('before', 'after'))

7.4.3 Mosaic plots

Univariate and multivariate mosaic plots are useful for showing changes in the tabulations of categorical variables, especially when comparing several "scenarios" next to one another. A scenario here refers to a choice of anonymization methods and their parameters. With mosaic plots we can, for instance, quickly see the effect of different levels of k-anonymity or differences in the importance vectors in the local suppression algorithm (see the Section Local suppression).

We illustrate the changes in tabulations with an example of the variable "WATER" before and after applying PRAM. We can use mosaic plots to quickly see the changes for each category. Listing 7.13 shows the code in R. The function mosaicplot() is available in base R. To plot a tabulation, the tabulation must first be made with the table() function.


Fig. 7.4: Density plots of income before and after anonymization

Fig. 7.5: Example of box plots of an expenditure variable before and after anonymization


To show the labels in the mosaic plot, we change the class of the variables to 'factor' (see the Section Classes in R). Looking at the mosaic plot in Fig. 7.6, we see that invariant PRAM has virtually no influence on the univariate distribution.

Listing 7.13: Creating univariate mosaic plots

# Collecting data of variable WATER before and after anonymization,
# assigning factor levels for labels in plot
dataWater <- t(cbind(table(factor(sdcHH@origData$WATER,
                                  levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                                  labels = c("Pipe (own tap)", "Public standpipe",
                                             "Borehole", "Wells (protected)",
                                             "Wells (unprotected)", "Surface water",
                                             "Rain water", "Vendor/truck", "Other"))),
                     table(factor(sdcHH@manipPramVars$WATER,
                                  levels = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                                  labels = c("Pipe (own tap)", "Public standpipe",
                                             "Borehole", "Wells (protected)",
                                             "Wells (unprotected)", "Surface water",
                                             "Rain water", "Vendor/truck", "Other")))))
rownames(dataWater) <- c("before", "after")

# Plotting mosaic plot
mosaicplot(dataWater, main = "", color = 2:10, las = 2)

Fig. 7.6: Mosaic plot to illustrate the changes in the WATER variable

We use the variables "gender" and "relationship status" to illustrate the use of mosaic plots for showing changes in univariate tabulations introduced by several sets of anonymization methods. Table 7.1 provides a description of the anonymization methods applied in each scenario. Scenario 0, the base scenario, shows the original categories of the gender and relationship status variables, while scenarios 1 to 6 show the shifts in the categories after applying different anonymization techniques. In total, we visualize the impact of six different sets of anonymization methods. We can use mosaic plots to quickly see which set of methods has what impact on the gender and relationship status variables, which can be used to select the best scenario. Looking at the mosaic plots in Fig. 7.7, we see that scenarios 2, 5 and 6 give the smallest changes for the gender variable and scenarios 3 and 4 for the relationship status variable.


Table 7.1: Description of anonymization methods by scenario

Scenario   Description of anonymization methods applied
0 (base)   Original data, no treatment
1          Recode age (five-year intervals), plus local suppression (required k = 3, high importance on water, toilet and literacy variables)
2          Recode age (five-year intervals), plus local suppression (required k = 5, no importance vector)
3          Recode age (five-year intervals), plus local suppression (required k = 3, high importance on toilet), while also recoding region, urban, education level and occupation variables
4          Recode age (five-year steps), plus local suppression (required k = 5, high importance on water, toilet and literacy), while also recoding region, urban, education level and occupation variables
5          Recode age (five-year intervals), plus local suppression (required k = 3, no importance vector), microaggregation (wealth index), while also recoding region, urban, education level and occupation variables
6          Recode age (five-year intervals), plus local suppression (required k = 3, no importance vector), PRAM literacy, while also recoding region, urban, education level and occupation variables

Fig. 7.7: Comparison of treated vs. untreated gender and relationship status variables with mosaic plots

As we discussed in the Section PRAM (Post RAndomization Method), invariant PRAM preserves the univariate distributions. Therefore, in this case it is more interesting to look at multivariate mosaic plots. Mosaic plots are also a powerful tool to show changes in cross tabulations/contingency tables. Listing 7.14 shows how to generate mosaic plots for two variables. To compare the changes, we need to compare two different plots. Fig. 7.8 and Fig. 7.9 illustrate that (invariant) PRAM does not preserve the two-way tables in this case.

Listing 7.14: Creating multivariate mosaic plots

# Before anonymization: contingency table and mosaic plot
ROOFTOILETbefore <- t(table(factor(sdcHH@origData$ROOF,
                                   levels = c(1, 2, 3, 4, 5, 9),
                                   labels = c("Concrete/cement/ \n brick/stone",
                                              "Wood", "Bamboo/thatch",
                                              "Tiles/shingles",
                                              "Tin/metal sheets", "Other")),
                            factor(sdcHH@origData$TOILET,
                                   levels = c(1, 2, 3, 4, 9),
                                   labels = c("Flush \n toilet",
                                              "Improved \n pit \n latrine",
                                              "Pit \n latrine", "No \n facility",
                                              "Other"))))
mosaicplot(ROOFTOILETbefore, main = "", las = 2, color = 2:6)

# After anonymization: contingency table and mosaic plot
ROOFTOILETafter <- t(table(factor(sdcHH@manipPramVars$ROOF,
                                  levels = c(1, 2, 3, 4, 5, 9),
                                  labels = c("Concrete/cement/ \n brick/stone",
                                             "Wood", "Bamboo/thatch",
                                             "Tiles/shingles",
                                             "Tin/metal sheets", "Other")),
                           factor(sdcHH@manipPramVars$TOILET,
                                  levels = c(1, 2, 3, 4, 9),
                                  labels = c("Flush \n toilet",
                                             "Improved \n pit \n latrine",
                                             "Pit \n latrine", "No \n facility",
                                             "Other"))))
mosaicplot(ROOFTOILETafter, main = "", las = 2, color = 2:6)

Fig. 7.8: Mosaic plot of the variables ROOF and TOILET before anonymization

Fig. 7.9: Mosaic plot of the variables ROOF and TOILET after anonymization

7.5 Choice of utility measure

Besides the users' requirements on the data, the utility measures should be chosen in accordance with the variable types and the anonymization methods employed. The employed utility measures can be a combination of both general and user-specific measures. As discussed earlier, different utility measures should be used for continuous and categorical data. Furthermore, some utility measures are not informative after certain anonymization methods have been applied. For example, after applying perturbative methods that interchange data values, comparing values directly is not useful, because it will give the impression of high levels of information loss. In such cases, it is more informative to look at means, covariances and benchmarking indicators that can be computed from the data. Furthermore, it is important not only to focus on the characteristics of variables one by one, but also on the interactions between variables. This can be done with cross tabulations and regressions. In general, when anonymizing sampled data, it is advisable to compute confidence intervals around estimates to interpret the magnitude of changes.

Recommended Reading Material on Measuring Utility and Information Loss

A.G. De Waal and L.C.R.J. Willenborg. 1999. "Information Loss through Global Recoding and Local Suppression." In Netherlands Official Statistics: Special Issue on SDC, 14, 17-20.

J. Domingo-Ferrer, J.M. Mateo-Sanz and V. Torra. 2001. "Comparing SDC Methods for Microdata on the Basis of Information Loss and Disclosure Risk." In Pre-proceedings of ETK-NTTS 2001 (vol. 2), 807-826. http://neon.vb.cbs.nl/casc/NTTSJosep.pdf

J. Domingo-Ferrer and V. Torra. 2001. "Disclosure Protection Methods and Information Loss for Microdata." In P. Doyle, J.I. Lane, J.J.M. Theeuwes and L. Zayatz (eds.), Theory and Practical Applications for Statistical Agencies, 91-110, Amsterdam. http://crises-deim.urv.cat/webCrises/publications/bcpi/cliatpasa01Disclosure.pdf

References


CHAPTER 8

SDC with sdcMicro in R: Setting Up Your Data and more

8.1 Installing R, sdcMicro and other packages

This guide is based on the software package sdcMicro, which is an add-on package for the statistical software R. Both R and sdcMicro, as well as other R packages, are freely available from the CRAN (Comprehensive R Archive Network) website for Linux, Mac and Windows (http://cran.r-project.org). This website also offers descriptions of packages. Besides the standard version of R, there is a more user-friendly user interface for R: RStudio. RStudio is also freely available for Linux, Mac and Windows (http://www.rstudio.com). The sdcMicro package depends on (i.e., uses) other R packages that must be installed on your computer before using sdcMicro; those are automatically installed when installing sdcMicro. For some functionalities, we use still other packages (such as foreign for reading data and some graphical packages); where this is the case, it is indicated in the appropriate section of this guide. R, RStudio, the sdcMicro package and its dependencies and other packages have regular updates, and it is strongly recommended to check for updates regularly: updating R itself requires installing a new version, while the installed packages can be updated with the update.packages() command or through the menu options in R or RStudio.
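For reference, installing and updating from within R looks as follows:

install.packages("sdcMicro") # installs sdcMicro and its dependencies from CRAN
update.packages()            # updates all installed packages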

When starting R or RStudio, it is necessary to specify each time which packages are being used by loading them. This loading of packages can be done either with the library() or the require() function. Both options are illustrated in Listing 8.1.

Listing 8.1: Loading required packages

library(sdcMicro) # loading the sdcMicro package
require(sdcMicro) # loading the sdcMicro package

All packages and functions are documented. The easiest way to access the documentation of a specific function is to use the built-in help, which generally gives an overview of the parameters of the function as well as some examples. The help for a specific function can be called with a question mark followed by the function name without any arguments. Listing 8.2 shows how to call the help file for the microaggregation() function of the sdcMicro package.1 The download page of each package on the CRAN website also provides a reference manual with a complete overview of the functions in the package.

1 Often it is also useful to search the internet for help on specific functions in R. There are many fora where R users discuss issues they encounter. One particularly useful site is stackoverflow.com.


Listing 8.2: Displaying help for functions

?microaggregation  # help for microaggregation function

When issues or bugs in the sdcMicro package are encountered, comments, remarks or suggestions can be posted for the developers of sdcMicro on their GitHub.

8.2 Read functions in R

The first step in the SDC process when using sdcMicro is to read the data into R and create a dataframe.2 R is compatible with most statistical data formats and provides read functions for most types of data. For those read functions, it is sometimes necessary to install additional packages and their dependencies in R. An overview of data formats, functions and the packages containing these functions is provided in Table 8.1. Most of these functions are also available as write functions (e.g., write_dta()) to save the anonymized data in the required format.3

Table 8.1: Packages and functions for reading data in R

Type/software       Extension     Package                 Function
SPSS                .sav          haven                   read_sav()
STATA (v. 5-14)     .dta          haven                   read_dta()
SAS                 .sas7bdat     haven                   read_sas()
CSV (e.g., Excel)   .csv          utils (base package)    read.csv()
Excel               .xls/.xlsx    readxl                  read_excel()

Most of these functions have options that specify how to handle missing values and variables with factor levels and value labels. Listing 8.3, Listing 8.4 and Listing 8.5 provide example code for reading in a STATA (.dta) file, a CSV (.csv) file and an SPSS (.sav) file.

Listing 8.3: Reading in a STATA file

setwd("/Users/World Bank")  # working directory with data file
fname = "data.dta"          # name of data file
library(haven)              # loads required package for read/write function for STATA files
file <- read_dta(fname)
# reads the data into the data frame (tbl) called file

Listing 8.4: Reading in a Excel file

setwd("/Users/World Bank")  # working directory with data file
fname = "data.csv"          # name of data file
file <- read.csv(fname, header = TRUE, sep = ",", dec = ".")
# reads the data into the data frame called file,
# the first line contains the variable names,
# fields are separated with commas, decimal points are indicated with '.'

Listing 8.5: Reading in a SPSS file

setwd("/Users/World Bank")  # working directory with data file
fname = "data.sav"          # name of data file
library(haven)              # loads required package for read/write function for SPSS files
file <- read_sav(fname)
# reads the data into the data frame called file

2 A dataframe is an object class in R, which is similar to a data table or matrix.
3 Not all functions are compatible with all versions of the respective software package. We refer to the help files of the read and write functions for more information.

The maximum data size in R is technically restricted. The maximum size depends on the R build (32-bit or 64-bit) and the operating system. Some SDC methods require long computation times for large datasets (see the Section on Computation time).

8.3 Missing values

The standard way missing values are represented in R is by the symbol 'NA', which is different from impossible values, such as division by zero or the log of a negative number, which are represented by the symbol 'NaN'. The value 'NA' is used for both numeric and categorical variables.4 Values suppressed by the localSuppression() routine are also replaced by the 'NA' symbol. Some datasets and statistical software might use different values for missing values, such as '999' or strings. It is possible to include arguments in read functions to specify how missing values in the dataset should be treated and automatically recode missing values to 'NA'. For instance, the function read.table() has the 'na.strings' argument, which replaces the specified strings with 'NA' values.
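As a minimal sketch of this recoding at read time (the file name and missing value codes are illustrative):

file <- read.table("data.txt", header = TRUE,
                   na.strings = c("999", ""))  # treat '999' and empty strings as NA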

Missing values can also be recoded after reading the data into R. This may be necessary if there are several different missing value codes in the data, different missing value codes for different variables, or if the read function for the datatype does not allow specifying the missing value codes. When preparing data, it is important to recode any missing values that are not coded as 'NA' to 'NA' in R before starting the anonymization process, both to ensure the correct measurement of risk (e.g., 𝑘-anonymity) and to ensure that many of the methods are correctly applied to the data. Listing 8.6 shows how to recode the value '99' to 'NA' for the variable "toilet".

Listing 8.6: Recoding missing values to NA

file[file[,'toilet'] == 99, 'toilet'] <- NA
# Recode missing value code 99 to NA for variable toilet
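If several missing value codes occur in several variables, a small loop avoids repeating this line for each variable. The following sketch assumes hypothetical missing value codes 98 and 99 and illustrative variable names:

missCodes <- c(98, 99)                      # missing value codes used in the raw data (assumed)
for (v in c('toilet', 'water', 'roof')) {   # variables to recode (illustrative names)
  file[file[, v] %in% missCodes, v] <- NA
}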

8.4 Classes in R

All objects in R are of a specific class, such as integer, character, matrix, factor or dataframe. The class of an object is an attribute from which the object inherits. To find out the class of an object, one can use the function class(). Functions in R might require objects or arguments of certain classes, or might have different functionality depending on the class of the argument. Examples are the write functions, which require dataframes, and most functions in the sdcMicro package, which require either dataframes or sdcMicro objects. The functionality of the functions in the sdcMicro package differs for dataframes and sdcMicro objects. It is easy to change the class attribute of an object with functions that start with "as.", followed by the name of the class (e.g., as.factor(), as.matrix(), as.data.frame()). Listing 8.7 shows how to check the class of an object and change the class to "data.frame". Before changing the class attribute of the object "file", it was of class "matrix". An important class defined and used in the sdcMicro package is the class named sdcMicroObj. This class is described in the next section.

Listing 8.7: Changing the class of an object in R

# Finding out the class of the object 'file'
class(file)
"matrix"

# Changing the class to data frame
file <- as.data.frame(file)

# Checking the result
class(file)
"data.frame"

4 This is regardless of the class of the variable in R. See the Section Classes in R for more on classes in R.

8.5 Objects of class sdcMicroObj

The sdcMicro package is built around objects5 of class sdcMicroObj, a class especially defined for the sdcMicro package. Each member of this class has a certain structure with slots that contain information regarding the anonymization process (see Table 8.2 for a description of all slots). Before evaluating risk and utility and applying SDC methods, creating an object of class sdcMicroObj is recommended. All examples in this guide are based on these objects. The function used to create an sdcMicro object is createSdcObj(). Most functions in the sdcMicro package, such as microaggregation() or localSuppression(), automatically use the required information (e.g., quasi-identifiers, sample weights) from the sdcMicro object if applied to an object of class sdcMicroObj.

The arguments of the function createSdcObj() allow one to specify the original data file and categorize the variables in this data file before the start of the anonymization process.

Note: For this, disclosure scenarios must already have been evaluated and quasi-identifiers selected. In addition, one must ensure there are no problems with the data, such as variables containing only missing values.

In Listing 8.8, we show all arguments of the function createSdcObj(), and first define vectors with the names of the different variables. This practice gives a better overview and later allows for quick changes in the variable choices if required. We choose the categorical quasi-identifiers (keyVars); the variables linked to the categorical quasi-identifiers that need the same suppression pattern (ghostVars, see the Section Local suppression); the numerical quasi-identifiers (numVars); the variables selected for applying PRAM (pramVars); a variable with sampling weights (weightVar); the clustering ID (hhId, e.g., a household ID, see the Section Household risk); a variable specifying the strata (strataVar); and the sensitive variables specified for the computation of 𝑙-diversity (sensibleVar, see the Section l-diversity).

Note: Most SDC methods in the sdcMicro package are automatically applied within the strata, if the 'strataVar' argument is specified.

Examples are local suppression and PRAM. Not all variables must be specified; e.g., if there is no hierarchical (household) structure, the argument 'hhId' can be omitted. The names of the variables correspond to the names of the variables in the dataframe containing the microdata to be anonymized. The selection of variables is important for the risk measures that are automatically calculated. Furthermore, several methods are by default applied to all variables of one sort, e.g., microaggregation to all key variables.6 After selecting these variables, we can create the sdcMicro object. To obtain a summary of the object, it is sufficient to write the name of the object.

Listing 8.8: Selecting variables and creating an object of class sdcMicroObj for the SDC process in R

# Select variables for creating sdcMicro object
# All variable names should correspond to the names in the data file

# selected categorical key variables
selectedKeyVars = c('region', 'age', 'gender', 'marital', 'empstat')

# selected linked variables (ghost variables)
selectedGhostVars = c('urbrur')

# selected numerical key variables
selectedNumVar = c('wage', 'savings')

# weight variable
selectedWeightVar = c('wgt')

# selected pram variables
selectedPramVars = c('roof', 'wall')

# household id variable (cluster)
selectedHouseholdID = c('idh')

# stratification variable
selectedStrataVar = c('strata')

# sensitive variables for l-diversity computation
selectedSensibleVar = c('health')

# creating the sdcMicro object with the assigned variables
sdcInitial <- createSdcObj(dat = file,
                           keyVars = selectedKeyVars,
                           ghostVars = selectedGhostVars,
                           numVars = selectedNumVar,
                           weightVar = selectedWeightVar,
                           pramVars = selectedPramVars,
                           hhId = selectedHouseholdID,
                           strataVar = selectedStrataVar,
                           sensibleVar = selectedSensibleVar)

# Summary of object
sdcInitial

## Data set with 4580 rows and 14 columns.
## --> Categorical key variables: region, age, gender, marital, empstat
## --> Numerical key variables: wage, savings
## --> Weight variable: wgt
## ---------------------------------------------------------------------------
##
## Information on categorical Key-Variables:
##
## Reported is the number, mean size and size of the smallest category for recoded variables.
## In parenthesis, the same statistics are shown for the unmodified data.
## Note: NA (missings) are counted as separate categories!
##
## Key Variable   Number of categories   Mean size             Size of smallest
## region         2 (2)                  2290.000 (2290.000)   646 (646)
## age            5 (5)                   916.000 (916.000)     16 (16)
## gender         3 (3)                  1526.667 (1526.667)    50 (50)
## marital        8 (8)                   572.500 (572.500)     26 (26)
## empstat        3 (3)                  1526.667 (1526.667)   107 (107)
## ---------------------------------------------------------------------------
##
## Infos on 2/3-Anonymity:
##
## Number of observations violating
## - 2-anonymity: 157
## - 3-anonymity: 281
##
## Percentage of observations violating
## - 2-anonymity: 3.428 %
## - 3-anonymity: 6.135 %
## ---------------------------------------------------------------------------
##
## Numerical key variables: wage, savings
##
## Disclosure risk is currently between [0.00%; 100.00%]
##
## Current Information Loss:
## IL1: 0.00
## Difference of Eigenvalues: 0.000%
## ---------------------------------------------------------------------------

5 Objects of class sdcMicroObj are S4 objects, which have slots or attributes and allow for object-oriented programming.
6 Unless otherwise specified in the arguments of the function.

Table 8.2 presents the names of the slots and their respective contents. The slot names can be listed using the function slotNames(), which is illustrated in Listing 8.9. Not all slots are used in all cases. Some slots are filled only after applying certain methods, e.g., evaluating a specific risk measure. Certain slots of the objects can be accessed by accessor functions (e.g., extractManipData() for extracting the anonymized data) or print functions (e.g., print()) with the appropriate arguments. The content of a slot can also be accessed directly with the '@' operator and the slot name. This is illustrated for the risk slot in Listing 8.9. This functionality can be practical to save intermediate results and compare the outcomes of different methods. Directly accessing the slots with the manipulated data (i.e., slot names starting with 'manip') is also useful for manual changes to the data during the SDC process, such as changing missing value codes or manual recoding. Within each slot there are generally several elements. Their names can be shown with the names() function and they can be accessed with the '$' operator. This is shown for the element with the individual risk in the risk slot.

Listing 8.9: Displaying slot names and accessing slots of an S4 object

# List names of all slots of sdcMicro object
slotNames(sdcInitial)

## [1] "origData"          "keyVars"           "pramVars"
## [4] "numVars"           "ghostVars"         "weightVar"
## [7] "hhId"              "strataVar"         "sensibleVar"
## [10] "manipKeyVars"     "manipPramVars"     "manipNumVars"
## [13] "manipGhostVars"   "manipStrataVar"    "originalRisk"
## [16] "risk"             "utility"           "pram"
## [19] "localSuppression" "options"           "additionalResults"
## [22] "set"              "prev"              "deletedVars"

# Accessing the risk slot
sdcInitial@risk

# List names within the risk slot
names(sdcInitial@risk)
## [1] "global"     "individual" "numeric"

# Two ways to access the individual risk within the risk slot
sdcInitial@risk$individual
get.sdcMicroObj(sdcInitial, "risk")$individual

Table 8.2: Slot names and slot description of sdcMicro object

Slot name           Content
origData            original data as specified in the dat argument of the createSdcObj() function
keyVars             indices of columns in origData with specified categorical key variables
pramVars            indices of columns in origData with specified PRAM variables
numVars             indices of columns in origData with specified numerical key variables
ghostVars           indices of columns in origData with specified ghost variables
weightVar           indices of columns in origData with specified weight variable
hhId                indices of columns in origData with specified cluster variable
strataVar           indices of columns in origData with specified strata variable
sensibleVar         indices of columns in origData with specified sensitive variables for l-diversity
manipKeyVars        manipulated categorical key variables after applying SDC methods (cf. keyVars slot)
manipPramVars       manipulated PRAM variables after applying PRAM (cf. pramVars slot)
manipNumVars        manipulated numerical key variables after applying SDC methods (cf. numVars slot)
manipGhostVars      manipulated ghost variables (cf. ghostVars slot)
manipStrataVar      manipulated strata variable (cf. strataVar slot)
originalRisk        global and individual risk measures before anonymization
risk                global and individual risk measures after applied SDC methods
utility             utility measures (il1 and eigen)
pram                details on PRAM after applying PRAM
localSuppression    number of suppressions per variable after local suppression
options             options specified
additionalResults   additional results
set                 list of slots currently in use (for internal use)
prev                information to undo one step with the undolast() function
deletedVars         variables deleted (direct identifiers)

There are two options to save the results after applying SDC methods:

• Overwriting the existing sdcMicro object, or

• Creating a new sdcMicro object. The original object will not be altered and can be used for comparing results. This is especially useful for comparing several methods and selecting the best option.

In both cases, the result of any function has to be re-assigned to an object with the '<-' operator. Both methods are illustrated in Listing 8.10.


Listing 8.10: Saving results of applying SDC methods

# Applying local suppression and reassigning the results to the same sdcMicro object
sdcInitial <- localSuppression(sdcInitial)

# Applying local suppression and assigning the results to a new sdcMicro object
sdc1 <- localSuppression(sdcInitial)

If the results are reassigned to the same sdcMicro object, it is possible to undo the last step in the SDC process. This is useful when changing parameters. The results of the last step, however, are lost after undoing that step.

Note: The undolast() function can be used to go only one step back, not several.

The result must also be reassigned to the same object. This is illustrated in Listing 8.11.

Listing 8.11: Undo last step in SDC process

# Undo last step in SDC process
sdcInitial <- undolast(sdcInitial)

8.6 Household structure

If the data has a hierarchical structure and some variables are measured on the higher hierarchical level and others on the lower level, the SDC process should be adapted accordingly (see also the Sections Household risk and Anonymization of the quasi-identifier household size). A common example in social survey data is a dataset with a household structure. Variables that are measured on the household level are, for example, household income, type of house and region. Variables measured on the individual level are, for example, age, education level and marital status. Some variables are measured on the individual level, but are nonetheless the same for all household members in almost all households. These variables should be treated as measured on the household level from the SDC perspective. An example is the variable religion for some countries.

The SDC process should be divided into two stages in cases where the data have a household structure. First, the variables on the higher (household) level should be anonymized; subsequently, the treated higher-level variables should be merged with the individual-level variables and anonymized jointly. In this section, we explain how to extract household variables from a file and merge them with the individual-level variables after treatment in R. We illustrate this process with an example of household and individual-level variables.

These steps are illustrated in Listing 8.12. We require both an individual ID and a household ID in the dataset; if they are lacking, they must be generated. The individual ID has to be unique for every individual in the dataset and the household ID has to be unique across households. The first step is to extract the household variables and save them in a new dataframe. We specify the variables that are measured at the household level in the string vector "HHVars" and extract only these variables, together with the household ID, from the dataset. This dataframe will have for each household the same number of entries as it has household members (e.g., if a household has four members, this household will appear four times in the file). We then keep only one record per household by removing duplicates based on the household ID, which is the same for all household members, but unique across households.

Listing 8.12: Create a household level file with unique records (remove duplicates)

# Create subset of file with the household ID and the variables measured at household level
HHVars <- c('HID', 'region', 'hhincome')
fileHH <- file[, HHVars]

# Remove duplicated rows based on the household ID / only every household once in fileHH
fileHH <- fileHH[!duplicated(fileHH$HID), ]

# Dimensions of fileHH (number of households)
dim(fileHH)

After anonymizing the household variables based on the dataframe "fileHH", we recombine the anonymized household variables with the original variables, which are measured on the individual level. We can extract the individual-level variables from the original dataset using "INDVars", a string vector with the individual-level variable names. For extracting the anonymized data from the sdcMicro object, we can use the extractManipData() function from the sdcMicro package. Next, we merge the data using the merge() function. The 'by' argument in the merge function specifies the variable used for merging, in this case the household ID, which has the same variable name in both datasets. All other variables should have different names in the two datasets. These steps are illustrated in Listing 8.13.

Listing 8.13: Merging anonymized household-level variables with individual-level variables

# Extract manipulated household level variables from the SDC object
HHmanip <- extractManipData(sdcHH)

# Create subset of file with only variables measured at individual level
fileIND <- file[, INDVars]

# Merge the files by using the household ID
fileCombined <- merge(HHmanip, fileIND, by = c('HID'))

The file fileCombined is used for the SDC process on the entire dataset. How to deal with data with household structure is illustrated in the case studies in the Section Case studies.
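A hypothetical continuation, sketching how fileCombined could then be used to set up a new sdcMicro object for the second anonymization stage (the key variables and the weight variable shown here are placeholders, not prescribed by this guide):

sdcCombined <- createSdcObj(dat = fileCombined,
                            keyVars = c('age', 'gender', 'region'),  # placeholder quasi-identifiers
                            weightVar = 'wgt')                       # placeholder weight variable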

As discussed in the Section Anonymization of the quasi-identifier household size, the size of a household can also be a quasi-identifier, even if the household size is not included in the dataset as a variable. For the purpose of evaluating the disclosure risk, it might be necessary to create such a variable by a headcount of the members of each household. Listing 8.14 shows how to generate a variable household size with values for each individual based on the household ID (HID). Two cases are shown: 1) the file sorted by household ID and 2) the file not sorted.

Listing 8.14: Generating the variable household size

# Sorted by HID
file$hhsize <- rep(unname(table(file$HID)), unname(table(file$HID)))

# Unsorted
file$hhsize <- rep(diff(c(1, 1 + which(diff(file$HID) != 0), length(file$HID) + 1)),
                   diff(c(1, 1 + which(diff(file$HID) != 0), length(file$HID) + 1)))

Note: In some cases, the order of the individuals within the households can provide information that could lead to re-identification.

An example is information on the relation to the household head. In many countries, the first individual in the household is the household head, the second the partner of the household head and the next few are children. Therefore, the line number within the household could correlate well with a variable that contains information on the relation to the household head. One way to avoid this unintended release of information is to change the order of the individuals within each household at random. Listing 8.15 illustrates a way to do this in R.


Listing 8.15: Changing the order of individuals within households

# List of household sizes by household
hhsize <- diff(c(1, 1 + which(diff(dataAnon$HID) != 0), length(dataAnon$HID) + 1))

# Line numbers randomly assigned within each household
set.seed(123)
dataAnon$INDID <- unlist(lapply(hhsize,
                                function(n){sample(1:n, n, replace = FALSE,
                                                   prob = rep(1/n, n))}))

# Order the file by HID and randomized INDID (line number)
dataAnon <- dataAnon[order(dataAnon$HID, dataAnon$INDID), ]

8.7 Randomizing order and numbering of individuals or households

Often the order and numbering of individuals, households, and also geographical units contains information that could be used by an intruder to re-identify records. For example, households with IDs that are close to one another in the dataset are likely to be geographically close as well. This is often the case in a census, but also in a household survey: households close to one another in the dataset likely share the same low-level geographical unit if the dataset is sorted in that way. Another example is a dataset that is alphabetically sorted by name. Here, removing the direct identifier name before release is not sufficient to guarantee that the name information cannot be used (e.g., the first record has a name which likely starts with 'a'). Therefore, it is often recommended to randomize the order of records in a dataset before release. Randomization can also be done within subsets of the dataset, e.g., within regions. If suppressions were made in the geographical variable used for creating the subsets, however, randomizing within those subsets can defeat the suppressions: the geographical variable is the same for all records in a subset, so a suppressed value can easily be derived (for instance, in cases where the geographical unit is included in the randomized ID). Therefore, if the variable used for the subsets has suppressed values, randomization should be done at the dataset level and not at the subset level.

Table 8.3 illustrates the need for and the process of randomizing the order of records in a dataset. The first three columns in Table 8.3 show the original dataset. Some suppressions were made in the variable "district", as shown in columns 4 to 6 ('NA' values); these columns also already show the randomized household IDs. The order of the records in columns 1-3 and columns 4-6 is unchanged. From the order of the records, it is easy to guess the two suppressed values: the records before and after each suppressed value have the same value for district, respectively 3 and 5. After reordering the dataset based on the randomized household IDs, it becomes impossible to reconstruct the suppressed values from the values of the neighboring records. Note that in this example the randomization was carried out within the regions and the region number is included in the household ID (first digit).


Table 8.3: Illustration of randomizing order of records in a dataset

Original dataset               Dataset with randomized         Dataset for release, ordered by
                               household IDs                   the new randomized household IDs
Household  Region  District    Randomized  Region  District    Randomized  Region  District
ID                             household                       household
                               ID                              ID
101        1       1           108         1       1           101         1       4
102        1       1           106         1       1           102         1       3
103        1       2           104         1       2           103         1       5
104        1       2           112         1       2           104         1       2
105        1       2           105         1       2           105         1       2
106        1       3           102         1       3           106         1       1
107        1       3           109         1       NA          107         1       3
108        1       3           107         1       3           108         1       1
109        1       4           101         1       4           109         1       NA
110        1       5           111         1       5           110         1       NA
111        1       5           110         1       NA          111         1       5
112        1       5           103         1       5           112         1       2
201        2       6           203         2       6           201         2       6
202        2       6           204         2       6           202         2       6
203        2       6           201         2       6           203         2       6
204        2       6           202         2       6           204         2       6

The randomization is easiest if done before or after the anonymization process with sdcMicro, directly on the dataset (data.frame in R). To randomize the order, we need an ID, such as an individual ID, household ID or geographical ID. If the dataset does not contain such an ID, it should be created first. Listing 8.16 shows how to randomize households. "HID" is the household ID and "regionid" is the region ID. First a randomized variable "HIDrandom" is generated alongside "HID". Then the file is sorted by region and the randomized ID, which changes the actual order of the records in the dataset. To make the randomization reproducible, it is advisable to set a seed for the random number generator.


Listing 8.16: Randomize order of households

n <- length(file$HID)  # number of households

set.seed(123)          # set seed
# generate random HID
file$HIDrandom <- sample(1:n, n, replace = FALSE, prob = rep(1/n, n))

# sort file by regionid and random HID
file <- file[order(file$regionid, file$HIDrandom), ]

# renumber the households in randomized order to 1-n
file$HIDrandom <- 1:n

8.8 Computation time

Some SDC methods can take a very long time to compute. For instance, local suppression with the function localSuppression() of the sdcMicro package in R can take days to execute on large datasets of more than 30,000 individuals that have many categorical quasi-identifiers. Our experiments reveal that computation time is a function of the following factors: the applied SDC method; the data size, i.e., the number of observations, the number of variables and the number of categories or factor levels of each categorical variable; the data complexity (e.g., the number of different combinations of values of key variables in the data); as well as the computer/server specifications.

Table 8.4 gives some indication of computation times for different methods on datasets of different size and complexity, based on findings from our experiments. The selected quasi-identifiers and categories for those variables in Table 8.4 are the same in both datasets being compared. Because it is impossible to predict the exact computation time, this table should be used only to illustrate how long computations may take. These methods were executed on a powerful server. Given the long computation times for some methods, it is recommended, where possible, to first test the SDC methods on a subset or sample of the microdata, and then choose the appropriate SDC methods. R provides functions to select subsets from a dataset (a sketch follows below). After setting up the code, it can then be run on the entire dataset on a powerful computer or server.
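A minimal sketch of drawing such a test subset in base R (the subset size of 5,000 records is arbitrary):

set.seed(123)                                 # for a reproducible subset
fileTest <- file[sample(nrow(file), 5000), ]  # random sample of 5,000 records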

Table 8.4: Computation times of different methods on datasets of different sizes

Method                                      Computation time (hours)
                                            5,000 observations    45,000 observations
Top coding age, local suppression (k=3)     11                    268
Recoding age, local suppression (k=3)        8                    143
Recoding age, local suppression (k=5)       10                    156

The number of categories and the product of the numbers of categories of all categorical quasi-identifiers give an idea of the number of potential combinations (keys). This is only an indication of the actual number of combinations, which influences the computation time needed to compute, for example, the frequencies of each key in the dataset. If there are many categories but not so many combinations (e.g., when the variables correlate), the computation time will be shorter.
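Both quantities are easy to compute; the following sketch reuses the vector selectedKeyVars from Listing 8.8:

# Upper bound: product of the numbers of categories of all quasi-identifiers
prod(sapply(file[, selectedKeyVars], function(x) length(unique(x))))

# Number of distinct keys (combinations) actually present in the data
nrow(unique(file[, selectedKeyVars]))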

Table 8.5 shows the number of categories for seven datasets with the same variables but of different complexities, all processed using the same script on 16 processors, in order of execution time. The table also shows an approximation of the number of unique combinations of quasi-identifiers, as indicated by the percentage of observations violating 𝑘-anonymity in each dataset pre-anonymization, in relation to processing time. The results in the table clearly indicate that both the number of observations (i.e., sample size) and the complexity of the data play a role in the execution time. Also, using the same script (and hence the same anonymization methods), the execution time can vary greatly; the longest running time is about 10 times longer than the shortest. Computer specifications also influence the computation time. This includes the processor, RAM and storage media.

Table 8.5: Number of categories (complexity), record uniqueness and computation times

Sample    Number of categories per quasi-identifier        Percentage of observations    Execution
size      (complexity)                                     violating k-anonymity         time in
(n)                                                        before anonymization          hours
          Water  Toilet  Occupation  Religion  Ethnicity  Region      k=3    k=5
20,014    10     4       70          5         7          6           74     88          53.72
66,285    15     6       39          4         0          24          40     49          67.19
60,747    13     6       70          8         9          4           35     45          74.47
26,601    19     6       84          10        10         10          77     87          108.84
38,089    17     6       30          5         56         9           70     81          198.90
35,820    19     7       67          6         NA         6           81     90          267.60
51,976    12     6       32          8         50         12          77     87          503.58

The large-scale experiment executed for this guide utilized 75 microdata files from 52 countries, using surveys on topics including health, labor, income and expenditure. By applying anonymization methods available in the sdcMicro package, at least 20 different anonymization scenarios7 were tested on each dataset. Most of the processing was done using a powerful server8 and up to 16 – 20 processors (cores) at a time. Other processing platforms included a laptop and desktop computers, each using four processors. Computation times were significantly shorter for datasets processed on the server, compared to those processed on the laptop and desktop.

The use of parallelization can improve performance even on a single computer with one multi-core processor. Since R does not use multiple cores unless instructed to do so, our anonymization programs allowed for parallelization such that jobs/scenarios for each dataset could be processed simultaneously through efficient allocation of tasks to different processors. Without parallelization, depending on the server/computer, only one core is used and the jobs run sequentially, which leads to significantly longer execution times. Note, however, that the parallelization itself also causes overhead, so the sum of the times it takes to run the tasks in parallel does not necessarily equal the time it would take to run them sequentially. The fact that the RAM is shared might also slightly reduce the gains of parallelization. If you want to compare the results of different methods on large datasets that require long computation times, parallel computing can be a solution.9

Appendix D zooms in on seven selected datasets from a health survey that were processed using the same parallelization program and anonymization methods. Note that the computation times in the appendix are only meant to create awareness of expected computation times, and may vary based on the type of computer used. In our case, although all datasets were anonymized using the parallelization program, computation times were significantly shorter for datasets processed on the server, compared to those processed on the laptop and desktop. Among the datasets processed on the server using the same number of processors (datasets 1, 2 and 6), some variation also exists in the computation times.

7 Here a scenario refers to a combination of SDC methods and their parameters.
8 The server has 512 GB RAM and four processors, each with 16 cores, translating to 64 cores in total.
9 The following website provides an overview of parallelization packages and solutions in R: http://cran.r-project.org/web/views/HighPerformanceComputing.html. Note: solutions are platform-dependent and therefore our solution is not presented further.


Note: Computation time in the table in Appendix D includes recalculating the risk after applying the anonymization methods, which is automatically done in sdcMicro when using standard methods/functions.

Using the function groupVars(), for instance, is not computationally intensive in itself, but can still take a long time if the dataset is large and risk measures have to be recalculated.

8.9 Common errors

In this section, we present a few common errors and their causes, which might be encountered when using the sdcMicro package in R for the anonymization of microdata:

• The class of a certain variable or object is not accepted by a function; e.g., a categorical variable of class numeric should first be recoded to the required class (e.g., factor or data.frame). The Section Classes in R shows how to do this.

• After making manual changes to variables, the risk measures do not change, since they are not updated automatically; they have to be recomputed manually using the function calcRisks() (see the sketch below).
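A minimal sketch of this recomputation on the sdcMicro object from Listing 8.8:

# Recompute risk measures after manual changes to the data in the sdcMicro object
sdcInitial <- calcRisks(sdcInitial)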


CHAPTER 9

The SDC Process

This section presents the SDC process in a step-by-step representation and can be used as guidance for the actual SDC process. It should be noted, however, that jumping between steps and returning to previous steps is often required during the actual SDC process, as it is not necessarily a linear step-by-step process. This guidance brings together the different parts of the SDC process as discussed in the previous sections and links to those sections. The case studies in the next section follow these steps. This presentation is adapted from HDFG12. Fig. 9.1 at the end of this section presents the entire process in a schematic way.

9.1 Step 1: Need for confidentiality protection

Before starting the SDC process for a microdata set, the need for confidentiality protection has to be determined. This is closely linked to the interpretation of the laws and regulations on this topic in the country in which the data originates, and is thus country-specific. A first step is to determine the statistical units in the dataset: if these are individuals, households or legal entities, such as companies, a need for disclosure control is likely. There are also examples of microdata for which there is no need for disclosure control, such as data with climate and weather observations or data with houses as statistical units. Even if the primary statistical units are not natural or legal persons, however, the data can still contain confidential information on natural or legal persons. For example, a dataset with houses as primary statistical units can also contain information on the persons living in these houses and their income, and a dataset on hospitalizations can include information about the hospitalized patients. In these cases, there is likely still a need for confidentiality protection. One option to solve this is to remove the information on the natural and legal persons from the datasets for release.

One dataset can also contain more than one type of statistical unit. The standard example here is a dataset containing both information on individuals and households. Another example is data on employees in enterprises. All types of statistical units present in the dataset have to be considered for the need of SDC. This is especially important in case the data has a hierarchical structure, such as individuals in households or employees in enterprises.

In addition, one has to evaluate whether the variables contained in the dataset are confidential or sensitive. Which variables are sensitive or confidential depends again on the applicable legislation and can differ substantially from country to country. If the dataset includes sensitive or confidential variables, SDC is likely required. The set of sensitive and confidential variables, together with the statistical units in the dataset, determines the need for statistical disclosure control.


9.2 Step 2: Data preparation and exploring data characteristics

After assessing the need for statistical disclosure control, we should prepare the data and, if there are multiple related data files, combine and consider them all. Then we explore the characteristics and structure of the data that are important for the users of the data. Compiling an inventory of these characteristics is important for assessing the utility of the data after anonymization and for producing an anonymized dataset that is useful for end users.

The first step in data preparation is classifying the variables as sensitive or non-sensitive, and removing direct identifiers such as full names, passport numbers, addresses, phone numbers and GPS coordinates. In the case of survey data, an inspection of the survey questionnaire is useful for classifying the variables. Furthermore, it is necessary to select the variables that contain relevant information for end users and should be included in the dataset for release. At this point, it can also be useful to remove variables other than direct identifiers from the microdata set to be released. An example is a variable with many missing values, e.g., a variable recorded only for a select group of individuals eligible for a particular survey module, and missing for the rest. Such variables can cause a high level of disclosure risk while adding little information for end users. Examples are variables relating to education (current grade), where a missing value indicates that the individual is not currently in school, or variables relating to childbirth, where a missing value indicates that the individual has not delivered a child in the reference period. Missing values in themselves can be disclosive, especially if they indicate that the variable is not applicable. Often variables with a majority of values missing are already deleted at this stage. Other variables that might be deleted at this stage are those too sensitive to be anonymized and released, or those not important to data users that could increase the risk of disclosure.

Relationships may exist among variables in a dataset for a variety of reasons. For instance, variables can be mutually exclusive in cases where several binary variables are used for each category. An individual not in the labor force will have a missing value for the sector in which this person is employed (or, more precisely, not applicable). Relationships may also exist if some variables are ratios, sums or other mathematical functions of other variables. Examples are the variable household size (as a count of individuals per household), or aggregate expenditure (as a sum of all expenditure components). A certain value in one variable may also reduce the number of possible or valid values for another variable; for example, the age of an individual attending primary education or the gender of an individual having delivered a child. These relationships are important for two reasons: 1) they can be used by intruders to reconstruct anonymized values. For example, if age is suppressed but another variable indicates that the individual is in school, it is still possible to infer a likely age range for that individual. Another example is an individual shown to be active in Sector B of the economy: even if the labor status of this individual is suppressed, it can be inferred that this person is employed. 2) the relationships in the original data should be maintained in the anonymized dataset and inconsistencies should be avoided (e.g., SDC methods should not create 58-year-old schoolboys or married 3-year-olds), or the dataset will be invalid for analysis. Another example is the case of expenditures per category, where it is important that the sum of the categories adds up to the total. One way to guarantee this is to anonymize the total and then recompute the sub-categories by applying their original shares to the anonymized total.
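As a sketch of this last approach in R (the variable names exp_food, exp_other, exp_total and exp_total_anon are hypothetical):

# Original shares of each component in the total expenditure
shares <- file[, c("exp_food", "exp_other")] / file$exp_total

# Rescale the components so they add up to the anonymized total
file[, c("exp_food", "exp_other")] <- shares * file$exp_total_anon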

It is also useful at this stage to consolidate variables that provide the same information where possible, so as to reduce the number of variables, reduce the likelihood of inconsistencies and minimize the variables an intruder can use to reconstruct the data. This is especially true if the microdata stems from an elaborate questionnaire and each variable represents one (sub-)question, leading to a dataset with hundreds of variables. As an example, we take a survey with several labor force variables indicating whether a person is in the labor force, employed or unemployed, and if employed, in what sector. The data in Table 9.1 illustrates this example. It is possible that each type of sector has its own binary variable. In that case, this set of variables can be summarized in two variables: one indicating whether a person is in the labor force and another indicating the employment status, as well as the respective sector if a person is employed (a sketch of this consolidation follows Table 9.1). These two variables contain all information contained in the previous five variables and make the anonymization process easier. If data users are used to a certain release format where including all five variables has been the norm, then it is possible to transform the variables back after the anonymization process rather than complicating the anonymization process by trying to treat more variables than is necessary. This approach also guarantees that the relationships between the variables are preserved (e.g., no individuals will be employed in several sectors).


Table 9.1: Illustration of merging variables without information loss for the SDC process

Before                                                         After
In labor force  Employed  Sector A  Sector B  Sector C        In labor force  Employed
Yes             Yes       Missing   Yes       Missing         Yes             B
No              No        Missing   Missing   Missing         No              No
Yes             Yes       Yes       Missing   Missing         Yes             A
Yes             Yes       Missing   Yes       Missing         Yes             B
Yes             Yes       Missing   Missing   Yes             Yes             C
Yes             No        Missing   Missing   Missing         Yes             No
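A sketch of this consolidation in R (the variable names inlf, employed, sectorA, sectorB and empSector are hypothetical; %in% is used so that missing sector values do not propagate into the result):

# Consolidate the binary labor force variables into one employment status/sector variable
file$empSector <- ifelse(file$inlf == "No" | file$employed == "No", "No",
                  ifelse(file$sectorA %in% "Yes", "A",
                  ifelse(file$sectorB %in% "Yes", "B", "C")))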

Besides relationships between variables, we also gather information about the survey methodology, such as strata, sampling methods, survey design and sample weights. This information is important in later stages, when estimating the disclosure risk and choosing the SDC methods. It is important to distinguish between a full census and a sample. For a full census, it is common practice to publish only a sample, as the risk of disclosure for a full census dataset is too high, given that we know that everyone in the country or institution is in the data (see also the Section Special case: census data). Strata and sample weights can disclose information about the area or group to which an individual belongs (e.g., the weights can be linked with the geographical area or a specific group in the case of stratified sampling); this should be taken into account in the SDC process and checked before release of the dataset.

9.3 Step 3: Type of release

The type of release is an important factor in determining the required level of anonymization, as well as the requirements end users have for the data (e.g., researchers need more detail than students, for whom a teaching file might be sufficient), and should be clarified before the start of the anonymization process. Data release or dissemination by statistical agencies and data producers is often guided by the applicable law and the dissemination strategies of the statistical agency, which specify the type of data that should be disseminated as well as in which fashion.

Generally, there exist three types of data release methods for different target groups (the Section Release types provides more information on different release types):

• PUF: The data is directly available to anyone interested, e.g., on the website of the statistical agency

• SUF: The data is available to accredited researchers, who have to file their research proposals beforehand and have to sign a contract; this is also known as licensed file or microdata under contract

• Available in a controlled research data center: only on-site access to data on special computers; this is also known as data enclave

There are other data access possibilities besides these, such as teaching files or files for other specific purposes. Obviously, the required level of protection depends on the type of release: a PUF file must be protected to a greater extent than a SUF file, which in turn has to be protected more than a file that is available only in an on-site facility, since the options an intruder has to use the data are most limited in the latter case.

Besides the applicable legislation, the choice of the type of release depends on the type of the data and the content.

Note: Not every microdata set is suitable for release in any release type, even after SDC.

Some data cannot be protected sufficiently; it might always contain information that is too sensitive to be published as SUF or PUF. In such cases, the data can be released in on-site facilities, or the number of variables can be reduced by removing problematic variables.


Generally, the release of two or more anonymized datasets from the same original, e.g., tailored for different end users, is problematic because it can lead to disclosure if the two were later obtained and merged by the same user. The information contained in one dataset that is not contained in the other can lead to unintended disclosure. An exception is the simultaneous release and anonymization of a microdata set as PUF and SUF files. In this case, the PUF file is constructed from the SUF file by further anonymization. In that way, all information in the PUF file is also contained in the SUF file, and the PUF file does not provide any additional information to users who have access to the SUF file.

Note: The anonymization process is an iterative process where steps can be revisited, whereas the publication of an anonymized dataset is a one-shot process.

Once the anonymized data is published, it is not possible to revoke it and publish another dataset of the same microdata file. This would in fact mean publishing more than one anonymized file from the same microdata set, since some users might have saved the previous file.

9.4 Step 4: Intruder scenarios and choice of key variables

After determining the release type of the data, the ways in which an individual in the microdata could (realistically) be identified by an intruder under that release type should be examined. For PUF and SUF releases the focus is on the use of external datasets from various sources. These possibilities are described in disclosure or intruder scenarios, which specify what data an intruder could possibly have access to and how this auxiliary data can be used for identity disclosure. This leads to the specification of quasi-identifiers, which are a set of variables that are available both in the dataset to be released and in auxiliary datasets, and therefore need protection.

Note: If the number of quasi-identifiers is high, it is recommended to reduce the set of quasi-identifiers by removing some variables from the dataset for release.

This is especially true for PUF releases. Disclosure scenarios can also help define the required level of anonymization.

Drafting disclosure scenarios requires the support of subject matter specialists, assuming the subject specialist is not the same as the person doing the anonymization. Auxiliary datasets may contain information on the identity of the individuals and allow identity disclosure. Examples of such auxiliary data files are population registers and electoral rolls, as well as data collected by specialized firms.

Note: External datasets can come from many sources (e.g., other institutions, private companies) and it is sometimes difficult to make a full list of external data sources.

In addition, not all external data sources are in the public domain. Nevertheless, proprietary data can be used by its owner to re-identify individuals and should be taken into account in the SDC process, even if the exact content is not known. Also, the variables or datasets may not coincide perfectly (e.g., different definitions, more or less detailed variables, different survey periods), but they should still be considered in the SDC process.

Disclosure scenarios include both identity and inferential disclosure. The disclosure risk depends on the type of release, i.e., different data users have different data available and may use the data in different ways for re-identification. For example, a user in a research data center cannot match with large external datasets, as (s)he is not permitted to take these into the data center. A user of a SUF is bound by an agreement specifying the use of the data and the consequences if the agreement is breached (see the Section Release types). Furthermore, it should be evaluated whether, in the case of a sample, possible intruders have knowledge as to which individuals are in the sample. This can be the case, for example, if it is known which schools were visited by the survey team. A few examples of disclosure scenarios are (see the Section Disclosure scenarios for more information):


• Matching: The intruder uses auxiliary data, e.g., data on region, marital status and age from a population register, and matches them to released microdata. Individuals from the two datasets that match and are unique are successfully identified. This principle is used as an assumption in several disclosure risk measures, such as 𝑘-anonymity and individual and global risk, as described in the Section Measuring Risk. This scenario can apply to both PUFs and SUFs.

• Spontaneous recognition: This scenario should be considered for SUF files, but is especially important for data available in research data centers, where outliers are present in the data and the data is often not strongly anonymized. The researcher might (unintentionally) recognize some individuals he knows (e.g., colleagues, neighbors, family members, public figures, famous persons or large companies) while inspecting the data. This is especially likely for rare combinations of values, such as outliers or unlikely combinations.

9.5 Step 5: Data key uses and selection of utility measures

In this step, we analyze the main uses of the data by the end users of the released microdata file. The data should be useful for the type of statistical analysis for which the data was collected and for which it is mostly used. The uses and requirements of data users will differ across release types. Contacting data users directly, or searching for scientific studies and papers that use similar data, can be useful when collecting this information and making this assessment. Alternatively, this information can be collected from the research proposals researchers submit when applying for microdata access (SUF), or user groups can be set up. Furthermore, it is important to understand what level of precision the data users need and what types of categories are used. For instance, in the case of global recoding of age in years, one could recode age in groups of 10 years, e.g., 0 – 9, 10 – 19, 20 – 29, . . . Many indicators relating to the labor market, however, use categories that span the range 15 – 65. Therefore, constructing categories that coincide with the categories used for the indicators keeps the data much more useful while at the same time reducing the risk of disclosure in a similar way. This knowledge is important for the selection of useful utility measures, which in turn are used for selecting appropriate SDC methods.

The uses of the data depend on the release type, too. Researchers using SUF files require a higher level of detail in the data than PUF users.

Note: Anonymization will always lead to information loss and a PUF file will have reduced utility. If certain users require a high level of detail, release types other than PUF should be considered, such as SUF or release through a research data center.

In the case of SUFs, it is easier to find the main uses of the data, since access is documented. One way to obtain information on the use of PUF files is to ask for a short description of the intended use of the data before supplying the data. This is, however, useful only if microdata has been released previously.

Statistics computed from the anonymized and released microdata file should produce analytical results that agree or almost agree with previously published statistics from the original data. If, for instance, a primary school enrollment rate was computed from these data and published, the released anonymized dataset should produce a very similar result to the officially published result. At the very least, the result should fall within the confidence region of the published result. It might be the case that not all published statistics can be generated from the published data. If this is the case, a choice should be made as to which indicators and statistics to focus on, and users should be informed as to which ones have been selected and why.
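
Such a check can be scripted directly. In the following base R sketch, the published rate, its standard error and the column names of the anonymized file are all hypothetical:

set.seed(1)
published <- 0.83   # hypothetical published enrollment rate
se        <- 0.012  # hypothetical standard error of the published rate

# enrollment rate re-computed from the anonymized file ('enrolled' is 0/1,
# 'weight' the survey weight; both invented here)
anon <- data.frame(enrolled = rbinom(1000, 1, 0.82),
                   weight   = runif(1000, 50, 150))
est <- weighted.mean(anon$enrolled, anon$weight)

# TRUE if the re-computed estimate falls within the 95% confidence region
abs(est - published) <= qnorm(0.975) * se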

As discussed in the Section Measuring Utility and Information Loss, it is necessary to compute general utility measures that compare the raw and anonymized data, taking into consideration the end users' needs for their analyses. In some cases the utility measures can give contradicting results; for example, a certain SDC method might lead to lower information loss for labor force figures but greater information loss for ratios relating to education. In such cases, the data uses might need to be ranked in order of importance, and it should be clearly documented for the user that the prioritization of certain metrics over others means that certain metrics are no longer valid.


This may be necessary, as it is not possible to release multiple files for different users. This problem occurs especially in multi-purpose studies. For more details on utility measures, refer to the Section Measuring Utility and Information Loss.

Note on Steps 6 to 10

The following Steps 6 through 10 should be repeated if the data has quasi-identifiers on different hierarchical levels, e.g., individual and household. In that case, variables on the higher hierarchical level should be anonymized first, and then merged with the lower-level untreated variables. Subsequently, the merged dataset should be anonymized. This approach guarantees consistency in the treated data. If we neglect this procedure, the values of variables measured on the higher hierarchical level could be treated differently for observations of the same unit. For instance, the variable "region" is the same for all household members. If the value 'rural' were suppressed for two members but not for the remaining three, this would lead to unintended disclosure; with the household ID, the variable region would be easy to reconstruct for the two suppressed values. The Sections Household risk and Household structure provide more details on how to deal with data with household structure in R and sdcMicro.
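
A minimal base R sketch of this split-treat-merge procedure on a toy data frame (the variable names follow the case study data; the values and the placeholder anonymization step are invented):

# toy data: two households with individual-level AGEYRS and
# household-level REGION and HHSIZE
dat <- data.frame(IDH    = c(1, 1, 2, 2, 2),
                  AGEYRS = c(34, 9, 51, 48, 17),
                  REGION = c(2, 2, 5, 5, 5),
                  HHSIZE = c(2, 2, 3, 3, 3))

# 1) one record per household with the household-level variables
hh <- dat[!duplicated(dat$IDH), c("IDH", "REGION", "HHSIZE")]

# 2) anonymize the household file first (placeholder for recoding,
#    suppression, etc.)
hhAnon <- hh

# 3) merge the treated household variables back onto the individuals;
#    the merged file is then anonymized in a second pass
merged <- merge(dat[, c("IDH", "AGEYRS")], hhAnon, by = "IDH")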

9.6 Step 6: Assessing disclosure risk

The next step is to evaluate the disclosure risk of the raw data. Here it is important to distinguish between sample data and census data. In the case of census data, it is possible to directly calculate the risk measures when assuming that the dataset covers the entire population. If working with a sample, or a sample of the census (which is the more common case when releasing sample data), we can use the models discussed in the Section Measuring Risk to estimate the risk in the population. The main inputs for the risk measurement are the set of quasi-identifiers determined from the disclosure scenarios in Step 4 and the thresholds for risk calculations (e.g., the level of 𝑘-anonymity or the threshold above which an individual is considered at risk). If the data has a hierarchical structure (e.g., a household structure), the risk should be measured taking this structure into account, as described in the Section Household risk.

The different risk measures described in the Section Measuring Risk each have advantages and disadvantages. Generally, 𝑘-anonymity, individual risk and global risk are used to produce an idea of the disclosure risk. These values can initially be very high but can often be reduced very easily after some simple but appropriate recoding (see Step 8: Choice and application of SDC methods). The thresholds shall be determined according to the release type. Always remember, though, that when using a sample, the risk measures based on the models presented in the literature offer a worst-case risk scenario and might therefore be an exaggeration of the real risks for some cases (see the Section Individual risk).
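
As an illustration, the following sketch computes these measures with sdcMicro on the demo dataset testdata shipped with the package; the key variables here simply follow the sdcMicro documentation and would in practice be derived from the disclosure scenarios in Step 4:

library(sdcMicro)
data("testdata")  # demo dataset included in sdcMicro

sdc <- createSdcObj(dat = testdata,
                    keyVars   = c("urbrur", "roof", "walls", "water",
                                  "electcon", "relat", "sex"),
                    weightVar = "sampling_weight",
                    hhId      = "ori_hid")

print(sdc, "kAnon")  # observations violating 2-, 3- and 5-anonymity
print(sdc, "risk")   # individual/global risk, incl. hierarchical risk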

9.7 Step 7: Assessing utility measures

To quantify the information loss due to the anonymization, we first compute the utility measures selected in Step 5 using the raw data. This creates a base for comparison with the results obtained when using the anonymized data, i.e., in Step 10.

Note: If the raw data is a sample, the utility measures are estimates with a variance, and it is therefore useful to construct confidence intervals in addition to the point estimates for the utility measures.
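
A simple percentile bootstrap is one way to obtain such intervals. The sketch below does this in base R for a GINI coefficient computed on invented expenditure data:

set.seed(123)
x <- rlnorm(500, meanlog = 9, sdlog = 1)  # stand-in for an expenditure variable

# sample GINI coefficient
gini <- function(v) {
  v <- sort(v)
  n <- length(v)
  sum((2 * seq_len(n) - n - 1) * v) / (n^2 * mean(v))
}

# percentile bootstrap: point estimate plus a 95% confidence interval
boots <- replicate(1000, gini(sample(x, replace = TRUE)))
c(estimate = gini(x), quantile(boots, c(0.025, 0.975)))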

9.8 Step 8: Choice and application of SDC methods

The choice of SDC methods depends on the need for data protection (as measured by the disclosure risk), the structure of the data and the types of variables.


The influence of different methods on the characteristics of the data important for the users, i.e., on the data utility, should also be taken into account when selecting the SDC methods. In practice, the choice of SDC methods is partially a trial-and-error process: after applying a chosen method, disclosure risk and data utility are measured and compared to other choices of methods and parameters. The choice of methods is bound by legislation on the one hand, and a trade-off between utility and risk on the other.

The classification of methods as presented in Table 6.1 gives a good overview for choosing the appropriate methods. Methods should be chosen according to the type of variable (continuous or categorical), the requirements of the users and the type of release. The anonymization of datasets with both continuous and categorical variables is discussed in the Section Classification of variables.

In general, for the anonymization of categorical variables, it is useful to restrict the number of suppressions by first applying global recoding and/or removing variables from the microdata set. When the number of suppressions required to achieve the required level of risk is sufficiently low, the few individuals at risk can be treated by suppression. These are generally outliers. It should be noted that possibly not all variables can be released and some must be removed from the dataset (see Step 2: Data preparation and exploring data characteristics). Recoding and minimal use of suppression ensure that figures already published from the raw data can be reproduced sufficiently well from the anonymized data. If suppression is applied without sufficient recoding, the number of suppressions can be very high and the structure of the data can change significantly. This is because suppression mainly affects combinations that are rare in the data.
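
A sketch of this recode-first, suppress-second sequence with sdcMicro, again on its demo dataset testdata (the key variables, breaks and importance ranking are illustrative only):

library(sdcMicro)
data("testdata")

sdc <- createSdcObj(dat = testdata,
                    keyVars   = c("urbrur", "water", "sex", "age"),
                    weightVar = "sampling_weight")

# first reduce detail: recode age into ten-year groups
sdc <- globalRecode(sdc, column = "age",
                    breaks = c(1, 9, 19, 29, 39, 49, 59, 69, 100),
                    labels = 1:8)

# then suppress the few remaining violations of 2-anonymity; variables
# with lower importance values receive fewer suppressions
sdc <- localSuppression(sdc, k = 2, importance = c(1, 2, 3, 4))

print(sdc, "ls")  # number of suppressions per key variable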

If the results of recoding and suppression do not achieve the required result, especially in cases where the number of selected quasi-identifiers is high, an alternative is to use perturbative methods. These can be used without prior recoding of variables. These methods, however, preserve the data structure only partially. The preferred method depends on the requirements of the users. We refer to the Section Anonymization Methods and especially the Section Perturbative methods for a discussion of perturbative methods implemented in sdcMicro.

Finally, the choice of SDC methods depends on the data used, since the same methods produce different results on different datasets. Therefore, the comparison of results with respect to risk and utility (Steps 9 and 10) is key to the choice made. Most methods are implemented in the sdcMicro package. Nevertheless, it is sometimes useful to use custom-made solutions. A few examples are presented in the Section Anonymization Methods.

9.9 Step 9: Re-measure risk

In this step, we re-evaluate the disclosure risk with the risk measures chosen under Step 6 after applying SDC methods. Besides these risk measures, it is also important to look at individuals with high risk and/or special characteristics, combinations of values or outliers in the data. If the risk is not at an acceptable level, Steps 6 to 10 should be repeated with different methods and/or parameters.

Note: Risk measures based on frequency counts (𝑘-anonymity, individual risk, global risk and household risk) cannot be used after applying perturbative methods since their risk estimates are not valid.

Perturbative methods protect the data by introducing uncertainty into the dataset rather than by increasing the frequencies of keys in the data; frequency-based risk measures computed on perturbed data will hence overestimate the risk.

9.10 Step 10: Re-measure utility

In this step, we re-measure the utility measures from Step 7 and compare these with the results from the raw data. Also, it is useful here to construct confidence intervals around the point estimates and compare these confidence intervals. The importance of the absolute value of a deviation can only be interpreted knowing the variance of the estimate. Besides these specific utility measures, general utility measures, as discussed in the Section Measuring Utility and Information Loss, should be evaluated. This is especially important if perturbative methods have been applied.


If the data does not meet the user requirements and deviations are too large, repeat Steps 6 to 10 with different methods and/or different parameters.

Note: Anonymization will always lead to at least some information loss.

9.11 Step 11: Audit and Reporting

After anonymization, it is important to check whether all relationships in the data identified in Step 2, such as variables that are sums of other variables or ratios, are preserved. Also, any unusual values caused by the anonymization should be detected. Examples of such anomalies are a negative income or a pupil in the twentieth grade of school. These can occur after applying perturbative SDC methods. Furthermore, it is necessary to check whether previously published indicators from the raw data are reproducible from the data to be released. If this is not the case, data users might question the credibility of the anonymized dataset.
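
Such audit checks are easy to script. A base R sketch on an invented anonymized extract (all names and values are hypothetical):

anon <- data.frame(INCWAGE = c(12000, 0, 45000),
                   INCRENT = c(0, 3000, 1000),
                   INCTOT  = c(12000, 3000, 46000),
                   EDYRS   = c(12, 8, 16))

# relationship check: the aggregate still equals the sum of its components
stopifnot(isTRUE(all.equal(anon$INCTOT, anon$INCWAGE + anon$INCRENT)))

# plausibility checks: no negative income, no impossible years of education
stopifnot(all(anon$INCWAGE >= 0), all(anon$EDYRS <= 18))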

An important step in the SDC process is reporting, both internal and external. Internal reporting contains the exact description of the anonymization methods used and their parameters, as well as the risk measures before and after anonymization. This allows replication of the anonymized dataset and is important for supervisory authorities/bodies to ensure the anonymization process is sufficient to guarantee anonymity according to the applicable legislation.

External reporting informs the user that the data has been anonymized, provides information for valid analysis of the data and explains the limitations of the data as a result of the anonymization. A brief description of the methods used can be included. The release of anonymized microdata should be accompanied by the usual metadata of the survey (survey weight, strata, survey methodology) as well as information on the anonymization methods that allows researchers to do valid analysis (e.g., amount of noise added, transition matrix for PRAM).

Note: Care should be taken that this information cannot be used for re-identification (e.g., no release of the random seed for PRAM).

The metadata must be updated to comply with the anonymized data. Variable descriptions or value labels might have changed as a result of the anonymization process. In addition, the information loss due to the anonymization process should be explained in detail to the users to make them aware of the limits to the validity of the data and their analyses.

9.12 Step 12: Data release

The last step in the SDC process is the actual release of the anonymized data. This step depends on the type of release chosen in Step 3. Changes made to the variables under Step 2, such as merging variables, can be undone to generate a dataset useful for users.

Recommended Reading Material on Risk Measurement

Dupriez, Olivier, and Ernie Boyko. 2010. Dissemination of Microdata Files; Principles, Procedures and Practices. IHSN Working Paper No. 005, International Household Survey Network (IHSN). http://www.ihsn.org/HOME/sites/default/files/resources/IHSN-WP005.pdf


Fig. 9.1: Overview of the SDC process


CHAPTER 10

Appendices

10.1 Appendix A: Overview of Case Study Variables

No.  Variable        Description                                                                                                    Type
1    REGION          Region                                                                                                         HH
2    DIST            District                                                                                                       HH
3    URBRUR          Area of residence                                                                                              HH
4    WGTHH           Individual weighting coefficient (country-specific weighting coefficient to derive individual-level indicators)  HH
5    WGTPOP          Population weighting coefficient (weighting coefficient to derive population-level indicators)                 HH
6    IDH             Household unique identification                                                                                HH
7    IDP             Individual identification                                                                                      HH
8    HHSIZE          Household members                                                                                              HH
9    GENDER          Sex                                                                                                            IND
10   REL             Relationship to household head                                                                                 IND
11   MARITAL         Marital status                                                                                                 IND
12   AGEYRS          Age in completed years                                                                                         IND
13   AGEMTH          Age of child in completed months                                                                               IND
14   RELIG           Religion of household head                                                                                     HH
15   ETHNICITY       Ethnicity                                                                                                      IND
16   LANGUAGE        Language                                                                                                       IND
17   MORBID          Morbidity last x weeks                                                                                         IND
18   MEASLES         Child immunized against Measles                                                                                IND
19   MEDATT          Sought medical attention                                                                                       IND
20   CHWEIGHTKG      Weight of the child (Kg)                                                                                       IND
21   CHHEIGHTCM      Height of the child (cms)                                                                                      IND
22   ATSCHOOL        Current school enrolment                                                                                       IND
23   EDUCY           Highest level of education completed                                                                           IND
24   EDYRS           Years of education                                                                                             IND
25   EDYRSCURRAT     Years of education for currently enrolled                                                                      IND
26   SCHTYP          Type of school attending                                                                                       IND
27   LITERACY        Literacy status                                                                                                IND
28   EMPTYP1         Type of employment, Primary job                                                                                IND
29   UNEMP1          Unemployed                                                                                                     IND
30   INDUSTRY1       1 digit industry classification, Primary job                                                                   IND
31   EMPCAT1         Employment categories, Primary job                                                                             IND
32   WHOURSWEEK1     Hours worked last week, Primary job                                                                            IND
33   OWNHOUSE        Ownership of dwelling unit                                                                                     HH
34   ROOF            Main material used for roof                                                                                    HH
35   TOILET          Main toilet facility                                                                                           HH
36   ELECTCON        Connection of electricity in dwelling                                                                          HH
37   FUELCOOK        Main cooking fuel                                                                                              HH
38   WATER           Main source of water                                                                                           HH
39   OWNAGLAND       Ownership of agricultural land                                                                                 HH
40   LANDSIZEHA      Land size owned by household (ha)                                                                              HH
41   OWNMOTORCYCLE   Ownership of motorcycle                                                                                        HH
42   CAR             Ownership of car                                                                                               HH
43   TV              Ownership of television                                                                                        HH
44   LIVESTOCK       Number of large-sized livestock owned                                                                          HH
45   INCRMT          Total amount of remittances received from remittance-sending members                                           HH
46   INCWAGE         Wage and salaries (annual)                                                                                     HH
47   INCBONSOCALL    Bonus and social allowance from wage job (annual)                                                              HH
48   INCFARMBSN      Gross income from household farm businesses (annual)                                                           HH
49   INCNFARMBSN     Gross income from household non-farm businesses (annual)                                                       HH
50   INCRENT         Rental income (annual)                                                                                         HH
51   INCFIN          Financial income from savings, loans, tax refunds, maturity payments on insurance                              HH
52   INCPENSN        Pension and other social assistance (annual)                                                                   HH
53   INCOTHER        Other income (annual)                                                                                          HH
54   INCTOTGROSSHH   Total gross household income (annual)                                                                          HH
55   FARMEMP         Farm employment                                                                                                HH
56   THOUSEXP        Total expenditure on housing                                                                                   HH
57   TFOODEXP        Total food expenditure                                                                                         HH
58   TALCHEXP        Total alcohol expenditure                                                                                      HH
59   TCLTHEXP        Total expenditure on clothing and footwear                                                                     HH
60   TFURNEXP        Total expenditure on furnishing                                                                                HH
61   THLTHEXP        Total expenditure on health                                                                                    HH
62   TTRANSEXP       Total expenditure on transport                                                                                 HH
63   TCOMMEXP        Total expenditure on communications                                                                            HH
64   TRECEXP         Total expenditure on recreation                                                                                HH
65   TEDUEXP         Total expenditure on education                                                                                 HH
66   TRESTHOTEXP     Total expenditure on restaurants and hotel                                                                     HH
67   TMISCEXP        Total miscellaneous expenditure                                                                                HH
68   TANHHEXP        Total annual nominal household expenditures                                                                    HH

10.2 Appendix B: Example of Blanket Agreement for SUF

Agreement between [providing agency] and [receiving agency] regarding the deposit and use of microdata

A. This agreement relates to the following microdatasets:

1. _______________________________________________________


2. _______________________________________________________

3. _______________________________________________________

4. _______________________________________________________

5. _______________________________________________________

B. Terms of the agreement:

As the owner of the copyright in the materials listed in section A, or as duly authorized by the owner of the copyright in the materials, the representative of [providing agency] grants the [receiving agency] permission for the datasets listed in section A to be used by [receiving agency] employees, subject to the following conditions:

1. Microdata (including subsets of the datasets) and copyrighted materials provided by the [providing agency] will not be redistributed or sold to other individuals, institutions or organisations without the [providing agency]'s written agreement. Non-copyrighted materials which do not contain microdata (such as survey questionnaires, manuals, codebooks, or data dictionaries) may be distributed without further authorization. The ownership of all materials provided by the [providing agency] remains with the [providing agency].

2. Data will be used for statistical and scientific research purposes only. They will be employed solely for reporting aggregated information, including modeling, and not for investigating specific individuals or organisations.

3. No attempt will be made to re-identify respondents, and there will be no use of the identity of any person or establishment discovered inadvertently. Any such discovery will be reported immediately to the [providing agency].

4. No attempt will be made to produce links between datasets provided by the [providing agency] or between [providing agency] data and other datasets that could identify individuals or organisations.

5. Any books, articles, conference papers, theses, dissertations, reports or other publications employing data obtained from the [providing agency] will cite the source, in line with the citation requirement provided with the dataset.

6. An electronic copy of all publications based on the requested data will be sent to the [providing agency].

7. The [providing agency] and the relevant funding agencies bear no responsibility for the data's use or for interpretation or inferences based upon it.

8. Data will be stored in a secure environment, with adequate access restrictions. The [providing agency] may at any time request information on the storage and dissemination facilities in place.

9. The [recipient agency] will provide an annual report on uses and users of the listed microdatasets to the [providing agency], with information on the number of researchers having accessed each dataset, and on the output of this research.

10. This access is granted for a period of [provide information on this period, or state that the agreement is open ended].

C. Communications:

The [receiving organisation] will appoint a contact person who will act as unique focal person for this agreement. Should the focal person be replaced, the [recipient agency] will immediately communicate the name and coordinates of the new contact person to the [providing agency]. Communications for administrative and procedural purposes may be made by email, fax or letter as follows:

Communications made by [providing agency] to [recipient agency] will be directed to:

Name of contact person:

Title of contact person:


Address of the recipient agency:

Email:

Tel:

Fax:

Communications made by [recipient agency] to [depositor agency] will be directed to:

Name of contact person:

Title of contact person:

Address of the recipient agency:

Email:

Tel:

Fax:

D. Signatories

The following signatories have read and agree with the Agreement as presented above:

Representative of the [providing agency]

Name ____________________________________________________

Signature _______________________________ Date ______________

Representative of the [recipient agency]

Name ____________________________________________________

Signature _______________________________ Date ______________

Source: DuBo10

10.3 Appendix C: Internal and External Reports for Case Studies

This appendix provides examples of internal and external reports on the anonymization process for the case studies in Chapter 11. The internal report consists of two parts: the first is for the anonymization of the household-level variables and the second is for the anonymization of the individual-level variables.

10.4 Case study 1 - Internal report

SDC report (adapted from the report function in sdcMicro)

The dataset consists of 10,574 observations (i.e., 10,574 individuals in 2,000 households).

Household-level variables

Anonymization methods applied to household-level variables:

• Removing households of size larger than 13 (29 households)

• Local suppression to achieve 2-anonymity, with importance vector to prevent suppressing values of the variables HHSIZE, REGION and URBRUR


• Recoding the variable LANDSIZEHA: rounding to one digit for values smaller than 1, rounding to zero digits for other values, grouping values 5-19 and 20-40, topcoding at 40

• PRAMming the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK

• Noise addition (level 0.01 and 0.05 for outliers) to the income and expenditure components, replacing aggregates by the sum of the perturbed components

Selected (key) variables:

• Modifications on categorical key variables: TRUE

• Modifications on continuous key variables: TRUE

• Modifications using PRAM: TRUE

• Local suppressions: TRUE

Disclosure risk (household-level variables):

Frequency analysis for categorical key variables:

Number of observations violating

2-Anonymity: 0 (unmodified data: 103)

3-Anonymity: 104 (unmodified data: 229)

5-Anonymity: 374 (unmodified data: 489)

Percentage of observations violating

2-Anonymity: 0% (unmodified data: 5.15%)

3-Anonymity: 5.28% (unmodified data: 11.45%)

5-Anonymity: 18.7% (unmodified data: 24.45%)

Disclosure risk categorical variables:

Expected Percentage of Re-identifications: 0.05161614% (~ 1.0 observations)

(unmodified data: 0.001820465% (~ 0.36 observations))

10 combinations of categories with highest risk:

      URBRUR  REGION  HHSIZE  OWNAGLAND  RELIG  fk  Fk
1     2       6       2       3          7      1   372.37
2     1       5       1       1          6      1   226.35
3     2       5       2       3          6      1   430.21
4     2       2       1       1          NA     1   173.05
5     2       6       1       1          5      1    80.05
6     1       6       1       3          5      1   343.27
7     2       5       1       2          NA     1   140.60
8     2       6       1       3          7      1   230.29
9     2       5       12      1          9      1   475.01
10    2       6       3       1          1      1   338.57

Disclosure risk continuous scaled variables:

Distance-based Disclosure Risk for Continuous Key Variables:

Disclosure Risk is between 0% and 100% in the modified data. In the original data, the risk is approximately 100%.


Data Utility (household-level variables):

29 households have been removed due to their household sizes

Frequencies categorical key variables

URBRUR

categories1  1     2    NA
orig         1316  684  0
categories2  1     2    NA
recoded      1299  666  6

REGION

categories1  1    2    3    4    5    6    NA
orig         324  334  371  375  260  336  0
categories2  1    2    3    4    5    6    NA
recoded      315  328  370  370  257  330  1

HHSIZE

categories1  1    2    3    4    5    6    7    8    9   10  11  12
orig         152  194  238  295  276  252  214  134  84  66  34  21
categories1  13   14   15   16   17   18   19   20   21  22  33
orig         11   6    6    5    4    2    1    2    1   1   1
categories2  1    2    3    4    5    6    7    8    9   10  11  12
recoded      152  194  238  295  276  252  214  134  84  66  34  21
categories2  13
recoded      10

OWNAGLAND

categories1  1    2    3    NA
orig         763  500  332  405
categories2  1    2    3    NA
recoded      735  482  310  444

RELIG

categories1  1    5    6    7  9    NA
orig         179  383  267  7  154  1010
categories2  1    5    6    7  9    NA
recoded      175  380  260  5  148  1003

Local suppressions

Number of local suppressions:

                       URBRUR  REGION  HHSIZE  OWNAGLAND  RELIG
absolute               6       1       1       48         16
relative (in percent)  0.304%  0.051%  0.051%  2.435%     0.812%


Data utility of continuous scaled key variables:

Univariate summary:

(rows with suffix .m refer to the modified data)

                 Min.     1st Qu   Median    Mean    3rd Qu  Max.
TANHHEXP         0        0.2      1         6.689   2.421   1214
TANHHEXP.m       0        0.2      1         3.427   2       40
TFOODEXP         498      15170    17090     24340   23260   353200
TFOODEXP.m       127.1    15100    17060     23410   22110   275300
TALCHEXP         0        8438     11890     12920   13070   127900
TALCHEXP.m       -209.7   8377     11880     12570   13030   124800
TCLTHEXP         0        0        0         401.7   0       85280
TCLTHEXP.m       -77.53   -13.59   6.42      404.7   30.69   85280
THOUSEXP         0        121      131       733.8   672.8   28400
THOUSEXP.m       -54.65   111.4    138.8     706.1   618.9   28410
TFURNEXP         0        1211     1340      2233    1970    197500
TFURNEXP.m       -39.54   1198     1340      2066    1933    49230
THLTHEXP         0        153.8    167       479.8   302     17780
THLTHEXP.m       -18.79   146.8    168.6     453.1   295.2   15720
TTRANSEXP        0        1        634       961     687     49650
TTRANSEXP.m      -80.58   26.66    627.1     917.2   692.4   49640
TCOMMEXP         0        146      241       1158    434     91920
TCOMMEXP.m       -115.2   139.1    238.3     1104    403.2   91920
TRECEXP          0        3        95        577.2   107     34000
TRECEXP.m        -61.27   21.35    92.28     555.4   128.8   33960
TEDUEXP          0        0        0         123.7   0       15880
TEDUEXP.m        -29.23   -5.06    1.213     121.8   9.748   15860
TRESHOTEXP       0        154      722       2730    784     240300
TRESHOTEXP.m     -396.1   190.5    671.6     2568    872     240400
TMISCEXP         0        0        467       875.1   528     63700
TMISCEXP.m       -93.39   0.7588   442.7     860.7   531.9   63680
INCTOTGROSSHH    0        444      1041      1148    1126    67420
INCTOTGROSSHH.m  -24.92   446      1041      1087    1124    14940
INCRMT           5000     12400    13390     30840   24200   683900
INCRMT.m         4069     9071     17000     33040   36680   570000
INCWAGE          0        0        0         1276    0       300000
INCWAGE.m        -295.1   -46.95   20.93     1261    114.4   300100
INCFARMBSN       0        9262     12950     23460   14570   683900
INCFARMBSN.m     -1466    9336     12980     23420   14750   684000
INCNFARMBSN      0        0        0         3809    3900    165400
INCNFARMBSN.m    -232.4   -10.69   142.6     3415    3846    160100
INCRENT          0        0        827.5     9166    7307    400000
INCRENT.m        -757.4   43.89    783.7     8637    7267    394800
INCFIN           0        0        0         1783    0       120000
INCFIN.m         -248.5   -56.57   11.54     1608    90.27   120000
INCPENSN         0        0        0         74.58   0       14400
INCPENSN.m       -20.2    -4.591   0.1964    76.62   5.796   14380
INCOTHER         0        0        0         331.3   0       60000
INCOTHER.m       -123.3   -24.78   -0.02617  331.1   26.75   60050
LANDSIZEHA       0        0        0         549.1   0       82300
LANDSIZEHA.m     -126.2   -21.91   3.4       486.7   30.88   79670


Information loss:

Criteria IL1: 0.01219892

Individual-level variables

• Modifications on categorical key variables: TRUE

• Modifications on continuous key variables: FALSE

• Modifications using PRAM: FALSE

• Local suppressions: TRUE

Disclosure risk (individual-level variables):

Anonymization methods applied to individual-level variables:

• Recoding AGEYRS from months to years for age under 1, and to ten-year intervals for age values between 15 and 65, topcoding age at 65

• Local suppression to achieve 2-anonymity

Frequency analysis for categorical key variables:

Number of observations violating

2-Anonymity: 0 (unmodified data: 998)

3-Anonymity: 0 (unmodified data: 1384)

5-Anonymity: 935 (unmodified data: 2194)

Percentage of observations violating

2-Anonymity: 0% (unmodified data: 9.91%)

3-Anonymity: 0% (unmodified data: 13.75%)

5-Anonymity: 6.23% (unmodified data: 21.79%)

Disclosure risk categorical variables:

Expected Percentage of Re-identifications: 0.02% (~ 2.66 observations)

(unmodified data: 0.24% (~23.98 observations))

Expected Percentage of Re-identifications (hierarchical risk): 0.1% (~ 15.34 observations)

(unmodified data: 1.26 % (~ 127.12 observations))

10 combinations of categories with highest risk:


      GENDER  REL  MARITAL  AGEYRS  EDUCY  EDYRSATCURR  ATSCHOOL  INDUSTRY1  fk  Fk
1     1       1    3        38      6      NA           0         9          1   73.31
2     1       1    3        20      1      NA           0         6          1   69.53
3     1       1    2        39      2      NA           0         5          1   54.63
4     1       1    1        36      6      NA           0         9          1   73.31
5     1       1    3        42      2      NA           0         1          1   39.58
6     0       1    6        74      1      NA           0         1          1   58.12
7     0       1    6        34      2      NA           0         1          1   57.40
8     1       1    1        26      4      NA           0         5          1   66.21
9     1       1    4        35      1      NA           0         10         1   57.13
10    1       6    1        12      1      NA           0         5          1   57.13

Data utility (individual-level variables):

Frequencies categorical key variables

GENDER

categories1  0     1     NA
orig         5197  4871  0
categories2  0     1     NA
recoded      5197  4871  0

REL

categories1  1     2     3     4   5    6   7    8   9   NA
orig         1970  1319  4933  57  765  89  817  51  63  4
categories2  1     2     3     4   5    6   7    8   9   NA
recoded      1698  1319  4933  52  765  54  817  40  63  327

MARITAL

categories1  1     2     3    4    5    6    NA
orig         3542  2141  415  295  330  329  3016
categories2  1     2     3    4    5    6    NA
recoded      3542  2141  415  295  330  329  3016

AGEYRS


categories1  0      1/12   2/12  3/12  4/12  5/12  6/12  7/12  8/12  9/12
orig         178    8      1     14    15    19    17    21    18    7
categories1  10/12  11/12  1     2     3     4     5     6     7     8
orig         5      8      367   340   332   260   334   344   297   344
categories1  9      10     11    12    13    14    15    16    17    18
orig         281    336    297   326   299   263   243   231   196   224
categories1  19     20     21    22    23    24    25    26    27    28
orig         202    182    136   146   150   137   128   139   117   152
categories1  29     30     31    32    33    34    35    36    37    38
orig         111    143    96    123   104   107   148   91    109   87
categories1  39     40     41    42    43    44    45    46    47    48
orig         89     93     58    78    72    64    84    74    48    60
categories1  49     50     51    52    53    54    55    56    57    58
orig         58     66     50    55    29    30    34    38    33    44
categories1  59     60     61    62    63    64    65    66    67    68
orig         35     36     25    33    21    15    30    18    13    29
categories1  69     70     71    72    73    74    75    76    77    78
orig         26     36     17    16    12    3     16    10    8     18
categories1  79     80     81    82    83    84    85    86    87    88
orig         11     13     5     2     7     7     7     3     2     2
categories1  89     90     91    92    93    95    NA
orig         4      4      3     1     1     1     188
categories2  0      1      2     3     4     5     6     7     8     9
recoded      311    367    340   332   260   334   344   297   344   281
categories2  10     11     12    13    14    20    30    40    50    60
recoded      336    297    326   299   263   1847  1220  889   554   314
categories2  65     NA
recoded      325    188

EDUCY

categories1  0     1     2     3    4    5   6    NA
orig         1582  4755  1062  330  139  46  104  2050
categories2  0     1     2     3    4    5   6    NA
recoded      1582  4755  1062  330  139  46  104  2050

EDYRSATCURR

categories1  0    1    2    3    4    5    6    7    8    9
orig         177  482  445  446  354  352  289  266  132  127
categories1  10   11   12   13   15   16   18   NA
orig         143  58   46   27   18   10   54   6642
categories2  0    1    2    3    4    5    6    7    8    9
recoded      177  482  445  446  354  352  289  266  132  127
categories2  10   11   12   13   15   16   18   NA
recoded      143  58   46   27   18   10   54   6642

ATSCHOOL


categories1  0     1     NA
orig         4696  3427  1945
categories2  0     1     NA
recoded      4696  3427  1945

INDUSTRY1

categories1  1     2   3    4  5   6    7   8   9   10   NA
orig         5300  16  153  2  93  484  95  17  70  292  3546
categories2  1     2   3    4  5   6    7   8   9   10   NA
recoded      5300  16  153  2  93  484  95  17  70  292  3546

Local suppressions

Number of local suppressions:

                       GENDER  REL    MARITAL  AGEYRS  EDUCY  EDYRSATCURR  ATSCHOOL  INDUSTRY1
absolute               0       323    0        0       0      0            0         0
relative (in percent)  0       3.21%  0        0       0      0            0         0

10.5 Case study 1 - External report

This case study microdata set has been treated to protect confidentiality. Several methods have been applied to protect the confidentiality: removing variables from the original dataset, removing records from the dataset, reducing detail in variables by recoding and top-coding, removing particular values of individuals at risk (local suppression) and perturbing values of certain variables.

Removing variables

The released microdata set has only a selected number of variables contained in the initial survey. Not all variables could be released in this SUF without breaching confidentiality rules.

Removing records

To protect confidentiality, records of households larger than 13 were removed. Thirty households out of a total of 2,000 households in the dataset were removed.

Reducing detail in variables by recoding and top-coding

The variable LANDSIZEHA was rounded to one digit for values smaller than 1, rounded to zero digits for other values, grouped for values 5-19 and 20-40 and topcoded at 40. The variable AGEYRS was recoded to ten-year age intervals for values in the age range 15 – 65.

Local suppression

Values of certain variables for particular households and individuals were deleted. In total, six values of the variable URBRUR, one of the REGION variable, 48 of the OWNAGLAND variable, 16 of the RELIG variable and 323 values of the variable REL were deleted.

Perturbing values


Uncertainty was introduced in the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK by using the PRAM method. This method changes a certain percentage of values of variables within each variable. Here invariant PRAM was used, which guarantees that the univariate tabulations stay unchanged. Multivariate tabulations may be changed. Unfortunately, the transition matrix cannot be published.

The income and expenditure variables were perturbed by adding noise (adding small random values to the original values). The noise added was 0.01 times the standard deviation in the original data and 0.05 for outliers. Noise was added to the components and the aggregates were recomputed to guarantee that the proportions of the different components did not change.

10.6 Case study 2 - Internal report

SDC report (adapted from the report function in sdcMicro)

This report describes the anonymization measures for the PUF release in addition to those already taken in the first case study. Therefore, this report should be read in conjunction with the internal report for case study 1. The original dataset consists of 10,574 observations (i.e., 10,574 individuals in 2,000 households). The dataset used for the anonymization of the PUF file is the anonymized SUF file from case study 1. This dataset consists of 10,068 observations in 1,970 households. The difference is due to the removal of large households and sensitive or identifying variables in the first case study.

Household-level variables

Anonymization methods applied to household-level variables:

• For SUF release (see case study 1):

– Removing households of size larger than 13 (29 households)

– Local suppression to achieve 2-anonymity, with importance vector to prevent suppressing values of the variables HHSIZE, REGION and URBRUR

• For PUF release:

– Remove variables OWNAGLAND, RELIG and LANDSIZEHA

– Local suppression to achieve 5-anonymity, with importance vector to prevent suppressing values of the variables HHSIZE and REGION

– PRAMming the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK

– Create deciles for aggregate income and expenditure (TANHHEXP and INCTOTGROSSHH) and replace the actual values with the mean of the corresponding decile. Replace income and expenditure components with the proportion of original totals.

Selected (key) variables:

categorical  URBRUR, REGION, HHSIZE
continuous   TANHHEXP, INCTOTGROSSHH
weight       WGTPOP
hhID         not defined
strata       not defined

• Modifications on categorical key variables: TRUE

• Modifications on continuous key variables: TRUE

• Modifications using PRAM: TRUE


• Local suppressions: TRUE

Disclosure risk (household-level variables):

Frequency analysis for categorical key variables:

Number of observations violating

2-Anonymity: 0 (PUF file: 0, unmodified data: 103)

3-Anonymity: 0 (PUF file: 18, unmodified data: 229)

5-Anonymity: 0 (PUF file: 92, unmodified data: 489)

Percentage of observations violating

2-Anonymity: 0.00% (PUF file: 0.00%, unmodified data: 5.15%)

3-Anonymity: 0.00% (PUF file: 0.91%, unmodified data: 11.45%)

5-Anonymity: 0.00% (PUF file: 4.67%, unmodified data: 24.45%)

Disclosure risk categorical variables:

Expected Percentage of Re-identifications: 0.0000526% (~ 0.10 observations)

(PUF file: 0.0000642% (~ 0.13 observations), unmodified data: 0.001820465% (~ 0.36 observations))

11 combinations of categories with highest risk in PUF file:

      URBRUR  REGION  HHSIZE  fk  Fk
1     2       4       1       7   1152.084
2     2       4       1       7   1152.084
3     2       2       9       2   2356.926
4     2       4       1       7   1152.084
5     2       4       1       7   1152.084
6     2       4       1       7   1152.084
7     2       5       12      2   2978.454
8     2       4       1       7   1152.084
9     2       4       1       7   1152.084
10    2       5       12      2   2978.454
11    2       2       9       2   2356.926

Disclosure risk continuous scaled variables:

Distance-based Disclosure Risk for Continuous Key Variables:

Disclosure Risk is between 0% and 100% in the modified data. In the original data, the risk is approximately 100%.

Data Utility (household-level variables):

Frequencies categorical key variables

URBRUR

categories1  1     2    NA
orig         1316  684  0
categories2  1     2    NA
recoded      1280  623  67

REGION


categories1  1    2    3    4    5    6    NA
orig         324  334  371  375  260  336  0
categories2  1    2    3    4    5    6    NA
recoded      311  325  369  370  253  329  13

HHSIZE

categories1  1    2    3    4    5    6    7    8    9   10  11  12
orig         152  194  238  295  276  252  214  134  84  66  34  21
categories1  13   14   15   16   17   18   19   20   21  22  33
orig         11   6    6    5    4    2    1    2    1   1   1
categories2  1    2    3    4    5    6    7    8    9   10  11  12
recoded      152  194  238  295  276  252  214  134  84  66  34  21
categories2  13
recoded      10

Local suppressions

Number of local suppressions:

                       URBRUR  REGION  HHSIZE
absolute               61      125     0
relative (in percent)  3.096%  0.609%  0.000%

Data utility of continuous scaled key variables:

Univariate summary:

(rows with suffix .m refer to the modified data)

                 Min.   1st Qu  Median  Mean    3rd Qu  Max.
TANHHEXP         498    15,170  17,090  24,340  23,260  353,230
TANHHEXP.m       827    14,700  17,060  23,420  22,750  83,963
INCTOTGROSSHH    5,000  12,400  13,390  30,840  24,200  683,900
INCTOTGROSSHH.m  6,353  12,390  13,400  30,250  24,240  149,561

Information loss:

Criteria IL1: 0.2422625

Disclosure risk (individual-level variables):

Anonymization methods applied to individual-level variables:

• For SUF release (see case study 1):

– Recoding AGEYRS from months to years for age under 1, and to ten-year intervals for age values between 15 and 65, topcoding age at 65

– Local suppression to achieve 2-anonymity

• For PUF release:

– Remove variable EDYRSCURRAT

– Recode REL to ‘Head’, ‘Spouse’, ‘Child’, ‘Other relative’, ‘Other’

– Recode MARITAL to ‘Never married’, ‘Married/Living together’, ‘Divorced/Separated/Widowed’


– Recode AGEYRS for values under 15 to 7

– Recode EDUCY to ‘No education’, ‘Pre-school/Primary not completed’, ‘Completed lower secondary or higher’

– Recode INDUSTRY1 to ‘Primary sector’, ‘Secondary sector’, ‘Tertiary sector’

Frequency analysis for categorical key variables:

Number of observations violating

2-Anonymity: 0 (PUF file: 0, unmodified data: 998)

3-Anonymity: 0 (PUF file: 167, unmodified data: 1384)

5-Anonymity: 0 (PUF file: 463, unmodified data: 2194)

Percentage of observations violating

2-Anonymity: 0.00% (PUF file: 0.00%, unmodified data: 9.91%)

3-Anonymity: 0.00% (PUF file: 1.66%, unmodified data: 13.75%)

5-Anonymity: 0.00% (PUF file: 4.60%, unmodified data: 21.79%)

Disclosure risk categorical variables:

Expected Percentage of Re-identifications: 0.00% (~0.41 observations)

(PUF file: 0.02 % (~ 1.69 observations), unmodified data: 0.24% (~23.98 observations))

Expected Percentage of Re-identifications (hierarchical risk): 0.02% (~2.29 observations)

(PUF file: 0.10 % (~ 9.57 observations), unmodified data: 1.26 % (~ 127.12 observations))

10 combinations of categories with highest risk:

      GENDER  REL  MARITAL  AGEYRS  EDUCY  INDUSTRY1  fk  Fk
1     1       1    2        50      1      7          2   324.9275
2     0       1    3        40      3      6          2   330.0521
3     0       1    6        60      0      3          2   350.5000
4     0       1    3        40      3      6          2   330.0521
5     1       1    2        30      4      5          2   253.7431
6     1       1    2        50      1      7          2   324.9275
7     0       1    6        50      1      6          2   255.6142
8     1       1    4        40      1      10         2   175.0797
9     1       1    4        40      1      10         2   175.0797
10    1       1    3        30      1      6          2   323.4879

Data utility (individual-level variables):

Frequencies categorical key variables

GENDER

categories1  0      1      NA
orig         5,197  4,871  0
categories2  0      1      NA
recoded      5,197  4,871  0

REL


categories1  1      2      3      4      5    6   7    8   9   NA
orig         1,970  1,319  4,933  57     765  89  817  51  63  4
categories2  1      2      3      7      9    NA
recoded      1,698  1,319  4,933  1,688  103  327

MARITAL

categories1  1      2      3    4    5    6    NA
orig         3,542  2,141  415  295  330  329  3,016
categories2  1      2      9    NA
recoded      3,542  2,851  659  3,016

AGEYRS

categories1  0      1/12   2/12   3/12  4/12  5/12  6/12  7/12  8/12  9/12
orig         178    8      1      14    15    19    17    21    18    7
categories1  10/12  11/12  1      2     3     4     5     6     7     8
orig         5      8      367    340   332   260   334   344   297   344
categories1  9      10     11     12    13    14    15    16    17    18
orig         281    336    297    326   299   263   243   231   196   224
categories1  19     20     21     22    23    24    25    26    27    28
orig         202    182    136    146   150   137   128   139   117   152
categories1  29     30     31     32    33    34    35    36    37    38
orig         111    143    96     123   104   107   148   91    109   87
categories1  39     40     41     42    43    44    45    46    47    48
orig         89     93     58     78    72    64    84    74    48    60
categories1  49     50     51     52    53    54    55    56    57    58
orig         58     66     50     55    29    30    34    38    33    44
categories1  59     60     61     62    63    64    65    66    67    68
orig         35     36     25     33    21    15    30    18    13    29
categories1  69     70     71     72    73    74    75    76    77    78
orig         26     36     17     16    12    3     16    10    8     18
categories1  79     80     81     82    83    84    85    86    87    88
orig         11     13     5      2     7     7     7     3     2     2
categories1  89     90     91     92    93    95    NA
orig         4      4      3      1     1     1     188
categories2  7      20     30     40    50    60    65    NA
recoded      4,731  1,847  1,220  889   554   314   325   188

EDUCY

categories1  0      1      2      3    4    5   6    NA
orig         1582   4755   1062   330  139  46  104  2050
categories2  0      1      2      3    4    5   6    NA
recoded      1,582  4,755  1,062  330  139  46  104  2,050

INDUSTRY1


categories1  1      2    3    4  5   6    7   8   9   10   NA
orig         5,300  16   153  2  93  484  95  17  70  292  3,546
categories2  1      2    3    NA
recoded      5,316  248  958  3,546

Local suppressions

Number of local suppressions:

                       GENDER  REL    MARITAL  AGEYRS  EDUCY  INDUSTRY1
absolute               0       0      0        91      0      0
relative (in percent)  0.00%   0.00%  0.00%    0.90%   0.00%  0.00%

10.7 Case study 2 - External report

This case study microdata set has been treated to protect confidentiality. Several methods have been applied to protect the confidentiality: removing variables from the original dataset, removing records from the dataset, reducing detail in variables by recoding and top-coding, removing particular values of individuals at risk (local suppression) and perturbing values of certain variables.

Removing variables

The released microdata set has only a selected number of variables contained in the initial survey. Not all variables could be released in this PUF without breaching confidentiality rules.

Removing records

To protect confidentiality, records of households larger than 13 were removed. Twenty-nine households out of a total of 2,000 households in the dataset were removed.

Reducing detail in variables by recoding and top-coding

The variable AGEYRS was recoded to ten-year age intervals for values in the age range 15 – 65 and bottom- and top-coded at 15 and 65. The variables REL, MARITAL, EDUCY and INDUSTRY1 were recoded to less detailed categories. The total income and expenditure variables were recoded to the mean of the corresponding deciles and the income and expenditure components to the proportion of the totals.

Local suppression

Values of certain variables for particular households and individuals were deleted. In total, 67 values of the variable URBRUR, 126 of the REGION variable, 91 of the AGEYRS variable and 323 values of the variable REL were deleted.

Perturbing values

Uncertainty was introduced in the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK by using the PRAM method. This method changes a certain percentage of values of variables within each variable. Here invariant PRAM was used, which guarantees that the univariate tabulations stay unchanged. Multivariate tabulations may be changed. Unfortunately, the transition matrix cannot be published.

10.8 Appendix D: Execution Times for Multiple Scenarios Tested using Selected Sample Data


Fig. 10.1: Description of anonymization scenarios


CHAPTER 11

Case Studies (Illustrating the SDC Process)

In order to evaluate the use of different SDC methods on different types of survey datasets, we compared the results of the different methods applied to 75 datasets from 52 countries representing six geographic regions: Latin America and the Caribbean (LAC), Sub-Saharan Africa (AFR), South Asia (SA), Europe and Central Asia (ECA), Middle East and North Africa (MENA) as well as East Asia and the Pacific (EAP). The datasets chosen were from a mix of datasets that are already publicly available, as well as data made available to the World Bank. The surveys used included, amongst others, household, demographic, and health surveys. The variables from these surveys used for the experiments were selected based on their relevance for users (e.g., for indicators, MDGs), their sensitivity, and their classification with respect to the SDC process.

The following case studies draw from knowledge gained from these experiments and try to incorporate the lessons learned. The case studies use synthetic data that mimic the structure of the survey types we used in our experiments and present the anonymization of a dataset similar to many surveys designed to measure household income and consumption, labor force participation and general demographic characteristics. The first case study creates a SUF, whereas in the second case study we take this SUF and treat it further to create a PUF.

11.1 Case study 1- SUF

This case study shows an example of how the anonymization process might be approached, particularly for a dataset with many continuous variables. We also show how this can be achieved using the open source and free sdcMicro package and R. A ready-to-run R script for this case study and the dataset are also available to reproduce the results and allow the user to adapt the code (see http://ihsn.org/home/projects/sdc-practice). Extracts of this code are presented in this section to illustrate several steps of the anonymization process.

Note: The choices of methods and parameters in this case study are based on this particular dataset and the results and choices might be different for other datasets.

The aim is to show the process, not to compare methods per se.

This example uses a dataset with a similar structure to that of a typical social survey with a focus on demographics, labor force participation and income and expenditure patterns.


The dataset has been compiled using observations from several datasets from different countries. They are considered synthetic data and as such are used only for illustrative purposes. The source datasets were already treated for disclosure control by their producers. This does not matter, as our concern is to illustrate the process only. The data from which we compiled our case study file came from surveys that contain many variables, but pay particular attention to labor force variables as well as household income and household expenditure variables. The variables in the demo dataset have already been pre-selected from the total set of variables available in the datasets. See Appendix A for the complete overview of all variables.

This case study follows the steps of the SDC process outlined in the Section The SDC Process.

11.1.1 Step 1: Need for disclosure control

The statistical units in this dataset are individuals and households. The household structure provides a hierarchical structure in the data, which should be taken into account when measuring risk and selecting anonymization methods.

The data contains variables with demographic information, income, expenditures, education variables and variables relating to the labor status of the individual. These variables include sensitive and confidential variables. The dataset is an example of a social survey and, due to the nature of the statistical units and the variables, disclosure control is needed before release of the microdata. This is the case regardless of the legal framework, which is not specified here, as this is a hypothetical dataset.

11.1.2 Step 2: Data preparation and exploring data characteristics

The first step is to explore the data. To analyze the data in R we first have to read the data into R. In our case, the data is saved in a STATA file (.dta file). To read STATA files, we need to load the R package foreign (see the Section Read functions in R on importing other data formats in R). We also load the sdcMicro package and several other packages used later for the computation of the utility measures. If these packages are not yet installed, you should do so before trying to load them. The R code for this case study demonstrates how to do this.

Listing 11.1: Loading required packages

# Load required packages
library(foreign)   # for read/write function for STATA files
library(sdcMicro)  # sdcMicro package with functions for the SDC process
library(laeken)    # for GINI
library(reldist)   # for GINI
library(bootstrap) # for bootstrapping
library(ineq)      # for Lorenz curves

After setting the working directory to the directory where the STATA file is stored, we load the data into the object called file. All output, unless otherwise specified, is saved in the working directory.

Listing 11.2: Loading the data

# Set working directory
setwd("C:/WorldBank/CaseStudy/")

# Specify file name
fname <- "case_1_data.dta"

# Read-in file
file <- read.dta(fname, convert.factors = F) # factors as numeric code

We check the number of variables, number of observations and variable names, as shown in Listing 11.3.


Listing 11.3: Number of individuals and variables and variable names

dim(file) # Dimensions of file (observations, variables)
## [1] 10574    68

colnames(file) # Variable names
##  [1] "REGION"        "DIST"          "URBRUR"        "WGTHH"
##  [5] "WGTPOP"        "IDH"           "IDP"           "HHSIZE"
##  [9] "GENDER"        "REL"           "MARITAL"       "AGEYRS"
## [13] "AGEMTH"        "RELIG"         "ETHNICITY"     "LANGUAGE"
## [17] "MORBID"        "MEASLES"       "MEDATT"        "CHWEIGHTKG"
## [21] "CHHEIGHTCM"    "ATSCHOOL"      "EDUCY"         "EDYRS"
## [25] "EDYRSCURRAT"   "SCHTYP"        "LITERACY"      "EMPTYP1"
## [29] "UNEMP1"        "INDUSTRY1"     "EMPCAT1"       "WHOURSWEEK1"
## [33] "OWNHOUSE"      "ROOF"          "TOILET"        "ELECTCON"
## [37] "FUELCOOK"      "WATER"         "OWNAGLAND"     "LANDSIZEHA"
## [41] "OWNMOTORCYCLE" "CAR"           "TV"            "LIVESTOCK"
## [45] "INCRMT"        "INCWAGE"       "INCBONSOCALL"  "INCFARMBSN"
## [49] "INCNFARMBSN"   "INCRENT"       "INCFIN"        "INCPENSN"
## [53] "INCOTHER"      "INCTOTGROSSHH" "FARMEMP"       "THOUSEXP"
## [57] "TFOODEXP"      "TALCHEXP"      "TCLTHEXP"      "TFURNEXP"
## [61] "THLTHEXP"      "TTRANSEXP"     "TCOMMEXP"      "TRECEXP"
## [65] "TEDUEXP"       "TRESTHOTEXP"   "TMISCEXP"      "TANHHEXP"

The dataset has 10,574 individuals in 2,000 households and contains 68 variables. The survey corresponds to a population of about 4.3 million individuals, which means that the sample is relatively small and the sample weights are high. This has an impact on the disclosure risk, as we will see in Steps 6a and 6b.

To get an overview of the values of the variables, we use tabulations and cross-tabulations for categorical variables and summary statistics for continuous variables. To include the number of missing values (NA or other), we use the option useNA = "ifany" in the table() function (see Listing 11.4).

In Table 11.1 the variables in the dataset are listed along with concise descriptions of the variables, the level at which they are collected (individual (IND), household (HH)), the measurement type (continuous, semi-continuous, categorical) and value ranges. Note that the dataset contains a selection of 68 variables (cf. Appendix A) of a total of 112 variables in the survey dataset. The variables have been preselected based on their relevance for data users. This reduces the total number of variables to consider in the anonymization process and makes the process easier. The numerical values for many of the categorical variables are codes that refer to values, e.g., in the variable URBRUR, 1 stands for rural and 2 for urban. More information on the meanings of coded values of the categorical variables is available in the R code for this case study.

We identified the following sensitive variables in the data: ETHNICITY, RELIGION, variables related to the labor force status of the individual and the variables containing information on income and expenditures of the household. Which variables are considered sensitive may vary across countries and datasets.

The case study dataset does not have any direct identifiers that, if they were present, would need to be removed at this stage. Examples of direct identifiers would be names, telephone numbers, geographical location coordinates, etc.

Listing 11.4: Tabulation of the variable ‘gender’ and summary statistics for the variable ‘total annual expenditures’ in R

# tabulation of variable GENDER (sex, categorical)
table(file$GENDER, useNA = "ifany")
##    0    1
## 5448 5126

# summary statistics for variable TANHHEXP (total annual household expenditures,
# continuous)
summary(file$TANHHEXP)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     498   15550   17290   28560   29720  353200

Table 11.1: Overview of variables in dataset

No.  Variable name    Description                                                      Level  Measurement      Values
1    IDH              Household ID                                                     HH     .                1-2,000
2    IDP              Individual ID                                                    IND    .                1-33
3    REGION           Region                                                           HH     categorical      1-6
4    DISTRICT         District                                                         HH     categorical      101-1105
5    URBRUR           Area of residence                                                HH     categorical      1, 2
6    WGTHH            Individual weighting coefficient                                 HH     weight           31.2-8495
7    WGTPOP           Population weighting coefficient                                 IND    weight           45.8-93452.2
8    HHSIZE           Household size                                                   HH     semi-continuous  1-33
9    GENDER           Gender                                                           IND    categorical      0, 1
10   REL              Relationship to household head                                   IND    categorical      1-9
11   MARITAL          Marital status                                                   IND    categorical      1-6
12   AGEYRS           Age in completed years                                           IND    semi-continuous  0-95 (under 1, 1/12 year increments)
13   AGEMTH           Age of child in completed months                                 IND    semi-continuous  1-1140
14   RELIG            Religion of household head                                       HH     categorical      1, 5-7, 9
15   ETHNICITY        Ethnicity of household head                                      HH     categorical      all missing values
16   LANGUAGE         Language of household head                                       HH     categorical      all missing values
17   MORBID           Morbidity last x weeks                                           IND    categorical      0, 1
18   MEASLES          Child immunized against measles                                  IND    categorical      0, 1, 9
19   MEDATT           Sought medical attention                                         IND    categorical      0, 1
20   CHWEIGHTKG       Weight of child (Kg)                                             IND    continuous       2-26.5
21   CHHEIGHTCM       Height of child (cms)                                            IND    continuous       7-140
22   ATSCHOOL         Currently enrolled in school                                     IND    categorical      0, 1
23   EDUCY            Highest level of education attended                              IND    categorical      1-6
24   EDYEARS          Years of education                                               IND    semi-continuous  0-18
25   EDYRSCURRAT      Years of education for currently enrolled                        IND    semi-continuous  1-18
26   SCHTYP           Type of school attending                                         IND    categorical      1-3, 9
27   LITERACY         Literacy                                                         IND    categorical      1-3
28   EMPTYP1          Type of employment                                               IND    categorical      1-9
29   UNEMP1           Unemployed                                                       IND    categorical      0, 1
30   INDUSTRY1        Industry classification (1-digit)                                IND    categorical      1-10
31   EMPCAT1          Employment categories                                            IND    categorical      11, 12, 13, 14, 21, 22
32   WHOURSLASTWEEK1  Hours worked last week                                           IND    continuous       0-154
33   OWNHOUSE         Ownership of dwelling                                            HH     categorical      0, 1
34   ROOF             Main material used for roof                                      IND    categorical      1-5, 9
35   TOILET           Main toilet facility                                             HH     categorical      1-4, 9
36   ELECTCON         Electricity                                                      HH     categorical      0-3
37   FUELCOOK         Main cooking fuel                                                HH     categorical      1-5, 9
38   WATER            Main source of water                                             HH     categorical      1-9
39   OWNAGLAND        Ownership of agricultural land                                   HH     categorical      1-3
40   LANDSIZEHA       Land size owned by household (ha) (agric and non-agric)          HH     continuous       0-1214
41   OWNMOTORCYCLE    Ownership of motorcycle                                          HH     categorical      0, 1
42   CAR              Ownership of car                                                 HH     categorical      0, 1
43   TV               Ownership of television                                          HH     categorical      0, 1
44   LIFESTOCK        Number of large-sized livestock owned                            HH     semi-continuous  0-25
45   INCRMT           Income - Remittances                                             HH     continuous
46   INCWAGE          Income - Wages and salaries                                     HH     continuous
47   INCBONSOCAL      Income - Bonuses and social allowances derived from wage jobs    HH     continuous
48   INCFARMBSN       Income - Gross income from household farm businesses             HH     continuous
49   INCNFARMBSN      Income - Gross income from household nonfarm businesses          HH     continuous
50   INCRENT          Income - Rent                                                    HH     continuous
51   INCFIN           Income - Financial                                               HH     continuous
52   INCPENSN         Income - Pensions/social assistance                              HH     continuous
53   INCOTHER         Income - Other                                                   HH     continuous
54   INCTOTGROSHH     Income - Total                                                   HH     continuous
55   FARMEMP
56   TFOODEXP         Total expenditure on food                                        HH     continuous
57   TALCHEXP         Total expenditure on alcoholic beverages, tobacco and narcotics  HH     continuous
58   TCLTHEXP         Total expenditure on clothing                                    HH     continuous
59   THOUSEXP         Total expenditure on housing                                     HH     continuous
60   TFURNEXP         Total expenditure on furnishing                                  HH     continuous
61   THLTHEXP         Total expenditure on health                                      HH     continuous
62   TTRANSEXP        Total expenditure on transport                                   HH     continuous
63   TCOMMEXP         Total expenditure on communication                               HH     continuous
64   TRECEXP          Total expenditure on recreation                                  HH     continuous
65   TEDUEXP          Total expenditure on education                                   HH     continuous
66   TRESHOTEXP       Total expenditure on restaurants and hotels                      HH     continuous
67   TMISCEXP         Total expenditure on miscellaneous spending                      HH     continuous
68   TANHHEXP         Total annual nominal household expenditures                      HH     continuous

It is always important to ensure that the relationships between variables in the data are preserved during the anonymization process and to explore and take note of these relationships before beginning the anonymization. In the final step of the anonymization process, an audit should be conducted, using these initial results, to check that these relationships are maintained in the anonymized dataset.

In our demo dataset, we identify several relationships between variables that need to be preserved during the anonymization process. The variables TANHHEXP and INCTOTGROSSHH represent the total annual nominal household expenditure and the total gross annual household income, respectively, and these variables are aggregations of existing income and expenditure components in the dataset.

The variables related to education are available only for individuals in the appropriate age groups and missing for other individuals. We make a similar observation for variables relating to children, such as height, weight and age in months. In addition, the household-level variables (cf. fourth column of Table 11.1) have the same values for all members in any particular household. The value of household size corresponds to the actual number of individuals belonging to that household in the dataset. As we proceed, we have to take care that these relationships and structures are preserved in the anonymization process.
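These relationships can be verified in R before starting the anonymization. The following check is an illustrative sketch only (the component vector repeats the expenditure components later defined as compExp in Listing 11.19):

# Audit sketch: verify relationships before anonymization
compExp <- c("TFOODEXP", "TALCHEXP", "TCLTHEXP", "THOUSEXP", "TFURNEXP",
             "THLTHEXP", "TTRANSEXP", "TCOMMEXP", "TRECEXP", "TEDUEXP",
             "TRESHOTEXP", "TMISCEXP")

# Total annual expenditures should equal the sum of the components
all.equal(file$TANHHEXP, rowSums(file[, compExp]))

# HHSIZE should equal the actual number of records per household
all(file$HHSIZE == table(file$IDH)[as.character(file$IDH)])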

When tabulating the variables, we notice that the variables RELIG, EMPTYP1 and LIVESTOCK have missing value codes different from the R standard missing value code NA. Before proceeding, we need to recode these to NA so R interprets them correctly. The missing value codes are 99999, 99 and 9999, respectively, for these three variables. These are genuine missing value codes and not caused by the variables being not applicable to the individual. Listing 11.5 shows how to make these changes.

Note: At the end of the anonymization process, and if desired by users, it is relatively easy to change these values back to their original missing value codes.


Listing 11.5: Recoding missing value codes

# Set different NA codes to R missing value NA
file[,'RELIG'][file[,'RELIG'] == 99999] <- NA
file[,'EMPTYP1'][file[,'EMPTYP1'] == 99] <- NA
file[,'LIVESTOCK'][file[,'LIVESTOCK'] == 9999] <- NA

We also take note that the variables LANGUAGE and ETHNICITY contain only missing values. Variables that contain only missing values should be removed from the dataset at this stage and excluded from the anonymization process. Removing these variables does not mean loss of data or reduction of data utility, since they did not contain any information. It is, however, necessary to remove them, because keeping them can lead to errors in some of the anonymization methods in R. It is always possible to add these variables back into the dataset to be released at the end of the anonymization process. It is useful to reduce the dataset to those variables and records relevant for the anonymization process. This guarantees the best results in R and fewer errors. In Listing 11.6 we drop the variables that contain only missing values.

Listing 11.6: Dropping variables with only missing values

# Drop variables containing only missings
file <- file[,!names(file) %in% c('LANGUAGE', 'ETHNICITY')]

We assume that the data are collected in a survey that uses simple sampling of households. The data contain two weight coefficients: WGTHH and WGTPOP. The relationship between the weights is WGTPOP = WGTHH * HHSIZE. WGTPOP is the sampling weight for the households and WGTHH is the sampling weight for the individuals to be used for disclosure risk calculations. WGTHH is used for computing individual-level indicators (such as education) and WGTPOP is used for population-level indicators (such as income indicators). There are no strata variables available in the data. We will use WGTPOP for the anonymization of the household variables and WGTHH for the anonymization of the individual-level variables.
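The stated relationship between the two weight variables can be verified with a one-line check (an illustrative addition, not part of the case study script):

# Verify the weight relationship WGTPOP = WGTHH * HHSIZE
all.equal(file$WGTPOP, file$WGTHH * file$HHSIZE)  # should return TRUE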

11.1.3 Step 3: Type of release

In this case study, we assume that the data will be released as a SUF, which will be available only under license to accredited researchers with approved research proposals (see the Section Conditions for SUFs for more information on the release of a SUF). Therefore, the accepted risk level is higher and a broader set of variables can be released than would be the case when releasing a PUF. Since we do not have an overview of the requirements of all users, we restrict the utility measures to a selected number of data uses (see Step 5).

11.1.4 Step 4: Intruder scenarios and choice of key variables

Next, we analyze possible intruder scenarios and select quasi-identifiers or key variables based on these scenarios. Since the dataset used in this case study is a demo dataset that does not stem from an existing country (and hence we do not have information on external data sources available to possible intruders) and the original data has already been anonymized, it is not possible to define exact disclosure scenarios. Instead, we draft intruder scenarios for this demo dataset based on some hypothetical assumptions about the availability of external data sources. We consider two types of disclosure scenarios: 1) matching to other publicly available datasets and 2) spontaneous recognition. The license under which the dataset will be distributed (SUF) prohibits matching to external resources. Still, this can happen. The more important scenario, however, is that of spontaneous recognition. We describe both scenarios in the following two paragraphs.

For the sake of illustration, we assume that population registers are available with the demographic variables gender, age, place of residence (region, urban/rural), religion and other variables such as marital status and variables relating to education and professional status that are also present in our dataset. In addition, we assume that there is a publicly available cadastral register on land ownership. Based on this analysis of available data sources, we select the variables REGION, URBRUR, HHSIZE, OWNAGLAND, RELIG, GENDER, REL (relationship to household head), MARITAL (marital status), AGEYRS, INDUSTRY1 and two variables relating to school attendance as categorical quasi-identifiers, and the expenditure and income variables as well as LANDSIZEHA as continuous quasi-identifiers. According to our assessment, these variables might enable an intruder to re-identify an individual or household in the dataset by matching with other available datasets.

Table 11.2 gives an overview of the selected quasi-identifiers and their levels of measurement.

The decision to release the dataset as a SUF means the level of anonymization will be relatively low and, consequently, the variables are more detailed and a scenario of spontaneous recognition is our main concern. Therefore, we should check for rare combinations or unusual patterns in the variables. Variables that may lead to spontaneous recognition in our sample are, amongst others, HHSIZE (household size), LANDSIZEHA as well as the income and expenditure variables. Large households and large land ownership are easily identifiable. The same holds for extreme outliers in wealth and expenditure variables, especially when combined with other identifying variables such as region. There might be only one or a few households in a certain region with a high income, such as the local doctor's. Variables that are easily observable and known by neighbors, such as ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK, may also need protection depending on what stands out in the community, since a researcher might be able to identify persons (s)he knows. This is called the nosy-neighbor scenario.
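A simple way to screen for such region-level outliers is to tabulate extremes by region. The following lines are an illustrative sketch (the choice of variables and the cutoff of 14 household members are assumptions for this demo dataset):

# Highest total gross household income observed in each region
tapply(file$INCTOTGROSSHH, file$REGION, max, na.rm = TRUE)

# Number of unusually large households (14 or more members) per region
table(file$REGION[file$HHSIZE >= 14])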

Table 11.2: List of selected quasi-identifiers

Name | Measurement
REGION (region) | Household, categorical
URBRUR (area of residence) | Household, categorical
HHSIZE (household size) | Household, categorical
OWNAGLAND (agricultural land ownership) | Household, categorical
RELIG (religion of household head) | Household, categorical
LANDSIZEHA (size of agr. and non-agr. land) | Household, continuous
TANHHEXP (total expenditures) | Household, continuous
TEXP (expenditures in category) | Household, continuous
INCTOTGROSSHH (total income) | Household, continuous
INC (income in category) | Household, continuous
GENDER (sex) | Individual, categorical
REL (relationship to household head) | Individual, categorical
MARITAL (marital status) | Individual, categorical
AGEYRS (age in completed years) | Individual, semi-continuous
EDYRSCURRAT (years of education for currently enrolled) | Individual, semi-continuous
EDUCY (highest level of education completed) | Individual, categorical
ATSCHOOL (currently enrolled in school) | Individual, categorical
INDUSTRY1 (industry classification) | Individual, categorical

11.1.5 Step 5: Data key uses and selection of utility measures

In this case study, our aim is to create a SUF that provides sufficient information for accredited researchers. We know that the primary use of these data will be to evaluate indicators relating to income and inequality. Examples are the Gini coefficient and indicators on what share of income is spent on which types of expenditures. Furthermore, we focus on some education indicators. Table 11.3 gives an overview of the utility measures we selected. Besides these utility measures, which are specific to the data uses, we also do standard checks, such as comparing tabulations, cross-tabulations and summary statistics before and after anonymization.


Table 11.3: Overview of selected utility measures

- Gini point estimates and confidence intervals for total expenditures
- Lorenz curves for total expenditures
- Mean monthly per capita total expenditures by area of residence
- Average share of components for expenditures
- Mean monthly per capita total income by area of residence
- Average share of components for income
- Net enrollment in primary education by gender

There are no published figures and statistics available that are calculated from this dataset because it is a demo. In general, the published figures should be re-computed based on the anonymized dataset and compared to the published figures in Step 11. Large differences would reduce the credibility of the anonymized dataset.

11.1.6 Hierarchical (household) structure

Our demo survey collects data on individuals in households. The household structure is important for data users and should be considered in the risk assessment. Since some variables are measured at the household level and thus have identical values for each household member, the values of the household variables should be treated in the same way for each household member (see the Section Anonymization of the quasi-identifier household size). Therefore, we first anonymize only the household variables. After this, we merge them with the individual-level variables and then anonymize the individual-level and household-level variables jointly.

Since the data has a hierarchical structure, Steps 6 through 10 are repeated twice: Steps 6a through 10a for the household-level variables and Steps 6b through 10b for the combined dataset. In this way, we ensure that household-level variable values remain consistent across household members for each household and that the household structure cannot be used to re-identify individuals. This is further explained in the Sections Levels of risk and Randomizing order and numbering of individuals or households.

Before continuing to Step 6a, we select the categorical key variables, continuous key variables and any variables selected for use in PRAM routines, as well as the household-level sampling weights. We extract these selected household variables and the households from the dataset and save them as fileHH. The choice of PRAM variables is further explained in Step 8a. Listing 11.7 illustrates how these steps are done in R (see also the Section Household structure).

Note: Some of the categorical variables in our dataset were not imported as factors from the Stata file. sdcMicro requires that these be converted to factors before proceeding.

Conversion of these variables to factors is also shown in Listing 11.7.

Listing 11.7: Selecting the variables for the household-level anonymization

### Select variables (household level)
# Key variables (household level)
selectedKeyVarsHH = c('URBRUR', 'REGION', 'HHSIZE', 'OWNHOUSE',
                      'OWNAGLAND', 'RELIG')

# Changing variables to class factor
file$URBRUR    <- as.factor(file$URBRUR)
file$REGION    <- as.factor(file$REGION)
file$OWNHOUSE  <- as.factor(file$OWNHOUSE)
file$OWNAGLAND <- as.factor(file$OWNAGLAND)
file$RELIG     <- as.factor(file$RELIG)

# Numerical variables
numVarsHH = c('LANDSIZEHA', 'TANHHEXP', 'TFOODEXP', 'TALCHEXP',
              'TCLTHEXP', 'THOUSEXP', 'TFURNEXP', 'THLTHEXP',
              'TTRANSEXP', 'TCOMMEXP', 'TRECEXP', 'TEDUEXP',
              'TRESHOTEXP', 'TMISCEXP', 'INCTOTGROSSHH', 'INCRMT',
              'INCWAGE', 'INCFARMBSN', 'INCNFARMBSN', 'INCRENT',
              'INCFIN', 'INCPENSN', 'INCOTHER')

# PRAM variables
pramVarsHH = c('ROOF', 'TOILET', 'WATER', 'ELECTCON',
               'FUELCOOK', 'OWNMOTORCYCLE', 'CAR', 'TV', 'LIVESTOCK')

# Sample weight (WGTPOP) (household)
weightVarHH = c('WGTPOP')

# All household level variables (household ID is IDH, cf. Table 11.1)
HHVars <- c('IDH', selectedKeyVarsHH, pramVarsHH, numVarsHH, weightVarHH)

We then extract these variables from file, the data frame in R that contains all variables. Every household has the same number of entries as it has members (e.g., a household of three is repeated three times in fileHH). Before analyzing the household-level variables, we select only one entry per household, as illustrated in Listing 11.8. This is further explained in the Section Household structure.

Listing 11.8: Taking a subset with only households

# Create subset of file with households and HH variables
fileHH <- file[,HHVars]

# Remove duplicated rows based on IDH, select uniques,
# one row per household in fileHH
fileHH <- fileHH[which(!duplicated(fileHH$IDH)),]

dim(fileHH)
## [1] 2000   39

The file fileHH contains 2,000 households and 39 variables. We are now ready to create our sdcMicro object with the corresponding variables we selected in Listing 11.7. For our case study, we will create an sdcMicro object called sdcHH based on the data in fileHH, which we will use for Steps 6a through 10a (see Listing 11.9).

Note: When the sdcMicro object is created, the sdcMicro package automatically calculates and stores the risk measures for the data.

This leads us to Step 6a.


Listing 11.9: Creating an sdcMicro object for the household variables

# Create initial SDC object for household level variables
sdcHH <- createSdcObj(dat = fileHH, keyVars = selectedKeyVarsHH,
                      pramVars = pramVarsHH, weightVar = weightVarHH,
                      numVars = numVarsHH)

numHH <- length(fileHH[,1]) # number of households

11.1.7 Step 6a: Assessing disclosure risk (household level)

As a first measure, we evaluate the number of households violating k-anonymity at the levels 2, 3 and 5.

Table 11.4 shows the number of violating households as well as the percentage of the total number of households. Listing 11.10 illustrates how to find these values with sdcMicro. The print() function in sdcMicro shows only the values for thresholds 2 and 3. Values for other thresholds can be calculated manually by summing up the frequencies smaller than the k-anonymity threshold, as shown in Listing 11.10.

Table 11.4: Number and proportion of households violating k-anonymity

k-anonymity level | Number of HH violating | Percentage of total number of HH
2 | 103 | 5.15 %
3 | 229 | 11.45 %
5 | 489 | 24.45 %

Listing 11.10: Showing number of households violating k-anonymity for levels 2, 3 and 5

# Number of observations violating k-anonymity (thresholds 2 and 3)
print(sdcHH)
## Infos on 2/3-Anonymity:
##
## Number of observations violating
## - 2-anonymity: 103
## - 3-anonymity: 229
##
## Percentage of observations violating
## - 2-anonymity: 5.150 %
## - 3-anonymity: 11.450 %
## --------------------------------------------------------------------------

# Calculate sample frequencies and count number of obs. violating k(5)-anonymity
kAnon5 <- sum(sdcHH@risk$individual[,2] < 5)

kAnon5
## [1] 489

# As percentage of total
kAnon5 / numHH
## [1] 0.2445

It is often useful to view the values for the household(s) that violate k-anonymity. This might help clarify which variables cause the uniqueness of these households; this can then be used later when choosing appropriate SDC methods. Listing 11.11 shows how to assess the values of the households violating 3- and 5-anonymity. It seems that, among the categorical key variables, the variable HHSIZE is responsible for many of the unique combinations and the origin of much of the risk. Having determined this, we can flag HHSIZE as a possible first variable to treat to obtain the required risk level. In practice, with a variable like HHSIZE, this will likely involve removing large households from the dataset to be released. As explained in the Section Anonymization of the quasi-identifier household size, recoding and local suppression are not valid options for the variable HHSIZE. The frequencies of household size in Table 11.7 show that there are few households with more than 13 household members. This makes these households easily identifiable based on the number of household members and at high risk of re-identification, also in the context of the nosy neighbor scenario.

Listing 11.11: Showing households that violate k-anonymity

# Show values of key variables of records that violate k-anonymity
fileHH[sdcHH@risk$individual[,2] < 3, selectedKeyVarsHH] # for 3-anonymity
fileHH[sdcHH@risk$individual[,2] < 5, selectedKeyVarsHH] # for 5-anonymity

We also assess the disclosure risk of the categorical variables with the individual and global risk measures, as described in the Sections Individual risk and Global risk. In fileHH every entry represents a household. Therefore, we use the individual non-hierarchical risk here, where the individual refers in this case to a household. fileHH contains only households and has no hierarchical structure. In Step 6b, we evaluate the hierarchical risk in file, the dataset containing both households and individuals. The individual and global risk measures automatically take into consideration the household weights, which we defined in Listing 11.7. In our file, the global risk measure calculated using the chosen key variables is 0.05%. This percentage is extremely low and corresponds to 1.03 expected re-identifications. The results are also shown in Listing 11.12. This low figure can be explained by the relatively small sample size of 0.25% of the total population. Furthermore, one should keep in mind that this risk measure is based only on the categorical quasi-identifiers at the household level. Listing 11.12 illustrates how to print the global risk measure.

Listing 11.12: Printing global risk measures

print(sdcHH, "risk")

## Risk measures:
##
## Number of observations with higher risk than the main part of the data: 0
## Expected number of re-identifications: 1.03 (0.05 %)

The global risk measure does not provide information about the spread of the individual risk measures. There might be a few households with relatively high risk, while the global (average) risk is low. It is therefore useful, as a next step, to inspect the observations with relatively high risk. The highest risk is 5.5% and only 14 households have a risk larger than 1%. Listing 11.13 shows how to display those households with risk over a certain threshold. Here the threshold is 0.01 (1%).

Listing 11.13: Observations with individual risk higher than 1%

# Observations with risk above certain threshold (0.01)
fileHH[sdcHH@risk$individual[, "risk"] > 0.01,]
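The summary figures quoted above can be reproduced from the same risk slot (an illustrative addition to the case study script):

# Maximum individual (household-level) re-identification risk
max(sdcHH@risk$individual[, "risk"])         # approx. 0.055 (5.5%)

# Number of households with individual risk above 1%
sum(sdcHH@risk$individual[, "risk"] > 0.01)  # 14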

Since the selected key variables at the household level are both categorical and numerical, the individual and global risk measures based on frequency counts do not completely reflect the disclosure risk of the entire dataset. Both categorical and continuous key variables are important for the data users, thus options like recoding the continuous variables (e.g., by creating quantiles of the income and expenditure variables) to make all of them categorical will likely not satisfy the data users' needs. We therefore avoid recoding continuous variables and assess the disclosure risk of the categorical and continuous variables separately. This approach can be partly justified by the fact that the continuous and the categorical quasi-identifiers would be matched against different external data sources and, as such, are unlikely to be used simultaneously for matching.

Continuous variables

To measure the risk of the continuous variables, we use an interval measure, which counts the number of anonymized values that are too close to their original values. See the Section Interval measure for more information on interval-based risk measures for continuous variables. This is an ex-post measure, meaning that the risk can be evaluated only after anonymization, and it measures whether the perturbation is sufficiently large. Since it is an ex-post measure, we can evaluate it only in Step 9a, after the variables have been treated. Evaluating this measure on the original data would yield a risk of 100%: all values would be too close to the original values since they would coincide with them, no matter how small the chosen intervals.

We also look at the distribution of LANDSIZEHA. High values of LANDSIZEHA are rare and can lead to re-identification. An example is a large landowner in a specific region. To evaluate the distribution of the variable LANDSIZEHA, we look at the percentiles. Each percentile represents approximately 20 households. In addition, we look at the values of the largest 50 plots. Listing 11.14 shows how to use R to display the quantiles and the largest land plots. Table 11.5 shows the 90th to 100th percentiles and Table 11.6 displays the largest 50 values of LANDSIZEHA. Based on these values, we conclude that values of LANDSIZEHA over 40 make the household very identifiable. These large households and households with large land plots need extra protection, as discussed in Step 8a.

Listing 11.14: Percentiles of LANDSIZEHA and listing the sizes of the largest 50 plots

# 1st - 100th percentiles of land size
quantile(fileHH$LANDSIZEHA, probs = (1:100)/100, na.rm = TRUE)

# Values of landsize for largest 50 plots
tail(sort(fileHH$LANDSIZEHA), n = 50)

Table 11.5: Percentiles 90-100 of the variable LANDSIZEHA

Percentile | 90   | 91   | 92   | 93    | 94    | 95    | 96    | 97    | 98    | 99     | 100
Value      | 6.00 | 8.00 | 8.09 | 10.12 | 10.12 | 10.12 | 12.14 | 20.23 | 33.83 | 121.41 | 1,214.08

Table 11.6: 50 largest values of the variable LANDSIZEHA

 12.14   15.00   15.37   15.78   16.19   20.00   20.23   20.23   20.23   20.23
 20.23   20.23   20.23   20.23   20.23   20.23   20.23   20.23   20.23   20.23
 20.23   20.23   20.50   30.35   32.38   40.47   40.47   40.47   40.47   40.47
 40.47   40.47   80.93   80.93   80.93   80.93  121.41  121.41  161.88  161.88
161.88  182.11  246.86  263.05  283.29  404.69  404.69  607.04  809.39 1214.08

11.1.8 Step 7a: Assessing utility measures (household level)

The utility of the data does not depend only on the household-level variables, but on the combination of household-level and individual-level variables. Therefore, it is not useful to evaluate all the utility measures selected in Step 5 at this stage, i.e., before anonymizing the individual-level variables. We restrict the initial measurement of utility to those measures that are based solely on the household variables. In our dataset, these are the measures related to income and expenditure and their distributions. The results are presented in Step 10a, together with the results after anonymization, which allows direct comparison. If, after the next anonymization step, it appears that the data utility has been significantly decreased by the suppression of some household-level variables, we can return to this step.
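As an example, the mean monthly per capita total expenditures by area of residence (one of the measures in Table 11.3) could be computed at this stage along the following lines. This is a sketch only; the exact indicator definition and weighting in the case study's own R script may differ:

# Mean monthly per capita total expenditures by area of residence (sketch)
mmePC <- function(dat) {
  pc <- dat$TANHHEXP / 12 / dat$HHSIZE   # per capita monthly expenditures
  tapply(seq_along(pc), dat$URBRUR, function(i)
    weighted.mean(pc[i], w = dat$WGTPOP[i], na.rm = TRUE))
}
mmePC(fileHH)   # baseline; re-run on the treated data in Step 10a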


11.1.9 Step 8a: Choice and application of SDC methods (household variables)

This step is divided into the anonymization of the variable HHSIZE, as this is a special case, the anonymization of the other selected categorical quasi-identifiers, and the anonymization of the selected continuous quasi-identifiers.

Variable HHSIZE

The variable HHSIZE poses a problem for the anonymization of the file, since suppressing it does not anonymize this variable: a simple headcount based on the household ID would allow the reconstruction of this variable. Table 11.7 shows the absolute frequencies of HHSIZE. The number of households of each size larger than 13 is 6 or fewer; these can be considered outliers with a higher risk of re-identification, as discussed in Step 6a. One way to deal with this is to remove all households of size 14 or larger from the dataset¹. Removing the 29 households of size 14 or larger reduces the number of 2-anonymity violations by 18, of 3-anonymity violations by 26 and of 5-anonymity violations by 29. This means that all removed households violated 5-anonymity due to the value of the variable HHSIZE, and many of them 2- or 3-anonymity. In addition, the average individual risk amongst the 29 households is 0.15%, which is almost three times higher than the average individual risk of all households. The impact on the global risk measure of removing these 29 households is, however, limited, due to the relatively small number of removed households in comparison to the total number of 2,000 households. Removing the households is primarily meant to protect these specific households, not to reduce the global risk.

Changes, such as removing records, cannot be done in the sdcMicro object. Listing 11.15 illustrates how to remove the households and recreate the sdcMicro object.

Table 11.7: Frequencies of variable HHSIZE (household size)

HHSIZE    | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9  | 10 | 11 | 12
Frequency | 152 | 194 | 238 | 295 | 276 | 252 | 214 | 134 | 84 | 66 | 34 | 21
HHSIZE    | 13  | 14  | 15  | 16  | 17  | 18  | 19  | 20  | 21 | 22 | 33
Frequency | 11  | 6   | 6   | 5   | 4   | 2   | 1   | 2   | 1  | 1  | 1

Listing 11.15: Removing households with large (rare) household sizes

# Tabulation of variable HHSIZE
table(sdcHH@manipKeyVars$HHSIZE)

# Remove large households (14 or more household members) from file and fileHH
file <- file[!file[,'HHSIZE'] >= 14,]

fileHHnew <- fileHH[!fileHH[,'HHSIZE'] >= 14,]

# Create new sdcMicro object based on the file without the removed households
sdcHH <- createSdcObj(dat = fileHHnew, keyVars = selectedKeyVarsHH,
                      pramVars = pramVarsHH, weightVar = weightVarHH,
                      numVars = numVarsHH)

Categorical variables

We are now ready to move on to the choice of SDC methods for the categorical variables on the household level in our dataset. As noted in our discussion of the methods, applying perturbative methods and local suppression may lead to a large loss of utility. The common approach is to apply recoding to the largest possible extent as a first step, to reach a prescribed level of risk and reduce the number of suppressions needed. Only after that should methods such as local suppression be applied. If this approach does not achieve the desired result, we can consider perturbative methods.

Since the file is to be released as a SUF, we can keep a higher level of detail in the data. The selected categorical key variables at the household level are not suitable for recoding at this point. Due to the relatively low risk of re-identification based on the five selected categorical household-level variables, it is possible in this case to use an option like local suppression to achieve our desired level of risk. Applying local suppression when the initial risk is relatively low will likely lead to suppression of only a few observations and thus limit the loss of utility. If, however, the data had been measured to have a relatively high risk, then applying local suppression without previous recoding would likely result in a large number of suppressions and greater information loss. Efforts such as recoding should be made first before suppressing in cases where risk is initially measured as high. Recoding will reduce risk with little information loss and thus reduce the number of suppressions, if local suppression is applied as an additional step. We apply local suppression to reach 2-anonymity. The choice of the low level of two is based on the overall low re-identification risk due to the high sample weights and the release as a SUF. High sample weights mean, ceteris paribus, a low level of re-identification risk. Achieving 2-anonymity is the same as removing sample uniques. This leads to 37 suppressions in the variable HHSIZE and 4 suppressions in the variable REGION (see the output in Listing 11.16). As explained earlier, suppression of the value of the variable HHSIZE does not lead to actual suppression of this information. Therefore, we redo the local suppression, but this time we tell sdcMicro to, if possible, not suppress HHSIZE but one of the other variables.

¹ Other methods and guidance on treating datasets where household size is a quasi-identifier are discussed in the Section Anonymization of the quasi-identifier household size.

In sdcMicro it is possible to tell the algorithm which variables are important and which are less important for making small changes (see also the Section Local suppression). To prevent HHSIZE from being suppressed, we set the importance of HHSIZE in the importance vector to the highest level (i.e., 1). Listing 11.16 shows how to apply local suppression with importance placed on the variable HHSIZE. The variable REGION is also the type of variable that should not have any suppressions. We therefore set the importance of REGION to 2 and the importance of URBRUR to 3. This defines the order in which the variables are considered for suppression by the algorithm. Instead of 37 suppressions in the variable HHSIZE, this leads to one suppressed value in the variable HHSIZE, and to 6, 1, 43 and 16 suppressions, respectively, for the variables URBRUR, REGION, OWNAGLAND and RELIG (which we set as less important). The importance is clearly reflected in the numbers of suppressions. The total number of suppressions is higher than without the importance vector (67 vs. 41), but 2-anonymity is achieved in the dataset with fewer suppressions in the variables HHSIZE and REGION. We remove the one household with the suppressed value of HHSIZE (13) to protect this household.

Note: In Listing 11.16 we use the undolast() function in sdcMicro to go one step back after we first applied local suppression with no importance vector.

The undolast() function restores the sdcMicro object to its previous state (i.e., before we applied local suppression), which allows us to rerun the same command, but this time with an importance vector set. The undolast() function can only be used to go one step back.

Listing 11.16: Local suppression with and without importance vector

# Local suppression without importance vector
sdcHH <- localSuppression(sdcHH, k = 2, importance = NULL)

print(sdcHH, "ls")
## Local Suppression:
##     KeyVar | Suppressions (#) | Suppressions (%)
##     URBRUR |                0 |            0.000
##     REGION |                4 |            0.203
##     HHSIZE |               37 |            1.877
##  OWNAGLAND |                0 |            0.000
##      RELIG |                0 |            0.000

sdcHH <- undolast(sdcHH)

# Local suppression with importance on HHSIZE (1), REGION (2) and URBRUR (3)
sdcHH <- localSuppression(sdcHH, k = 2, importance = c(3, 2, 1, 5, 5))

print(sdcHH, "ls")
## Local Suppression:
##     KeyVar | Suppressions (#) | Suppressions (%)
##     URBRUR |                6 |            0.304
##     REGION |                1 |            0.051
##     HHSIZE |                1 |            0.051
##  OWNAGLAND |               43 |            2.182
##      RELIG |               16 |            0.812

The variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK are not sensitive variables and were not selected as quasi-identifiers because we assumed that there are no external data sources containing this information that could be used for matching. Their values can be easily observed or be known to neighbors, however, and are therefore important, together with other variables, for the spontaneous recognition and nosy neighbor scenarios. To anonymize these variables, we want to introduce a low level of uncertainty in them. Therefore, we decide to use invariant PRAM for the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK, where we treat LIVESTOCK as a semi-continuous variable due to the low number of different values. The Section PRAM (Post RAndomization Method) provides more information on the PRAM method and its implementation in sdcMicro. Listing 11.17 illustrates how to apply PRAM. We choose the parameter pd, the lower bound for the probability that a value is not changed, to be relatively high at 0.8. We can choose a high value because the variables themselves are not sensitive and we only want to introduce a low level of change to minimize the utility loss. Because the distribution of many of the variables chosen for PRAM depends on REGION, we use the variable REGION as a strata variable. In this way the transition matrix is computed for each region separately. Because PRAM is a probabilistic method, we set a seed for the random number generator before applying PRAM to ensure reproducibility of the results.

Note: In practice, it is not advisable to set a seed of 12345, but rather a longer, more complex and less easily guessed sequence.

The seed should not be released, since it allows for reconstructing the original values if combined with the transition matrix. The transition matrix itself can be released: this allows for consistent statistical inference by correcting the statistical methods used, if the researcher has knowledge of the PRAM method (at this point sdcMicro does not allow the retrieval of the transition matrix).

Listing 11.17: Applying PRAM

# Pram
set.seed(12345)
sdcHH <- pram(sdcHH, strata_variables = "REGION", pd = 0.8)

## Number of changed observations:
## - - - - - - - - - - -
## ROOF != ROOF_pram : 98 (4.97%)
## TOILET != TOILET_pram : 151 (7.66%)
## WATER != WATER_pram : 167 (8.47%)
## ELECTCON != ELECTCON_pram : 90 (4.57%)
## FUELCOOK != FUELCOOK_pram : 113 (5.73%)
## OWNMOTORCYCLE != OWNMOTORCYCLE_pram : 41 (2.08%)
## CAR != CAR_pram : 172 (8.73%)
## TV != TV_pram : 137 (6.95%)
## LIVESTOCK != LIVESTOCK_pram : 149 (7.56%)

PRAM has changed values within the variables according to the invariant transition matrices. Since we used the invariant PRAM method (see the Section PRAM (Post RAndomization Method)), the absolute univariate frequencies remain approximately unchanged. This is not the case for the multivariate frequencies. In Step 10a we compare the changes in the multivariate frequencies of the PRAMmed variables.

Continuous variables


We have selected the income and expenditure variables and the variable LANDSIZEHA as numerical quasi-identifiers, as discussed in Step 4. In Step 5 we identified variables of high interest for the users of our data: many users use the data for measuring inequality and expenditure patterns.

Based on the risk evaluation in Step 6a, we decide to anonymize the variable LANDSIZEHA by top coding at the value 40 (cf. Table 11.5 and Table 11.6), rounding values smaller than 1 to one digit and values larger than 1 to zero digits. Rounding the values prevents exact matching with the available cadastral register. Furthermore, we group the values between 5 and 40 into the groups 5 – 19 and 20 – 39. After these steps, no household has a unique plot size and the number of households in the sample with the same plot size is increased to at least 7. This is shown by the tabulation of the variable LANDSIZEHA after manipulation in the last line of Listing 11.18. In addition, all outliers have been removed by top coding the values. This reduces the risk of spontaneous recognition, as discussed in Step 6a. How to recode values in R is introduced in the Section Recoding and, for this particular case, shown in Listing 11.18.

Listing 11.18: Anonymizing the variable LANDSIZEHA

# Rounding values of LANDSIZEHA to 1 digit for plots smaller than 1 and
# to 0 digits for plots larger than 1
sdcHH@manipNumVars$LANDSIZEHA[sdcHH@manipNumVars$LANDSIZEHA <= 1 &
  !is.na(sdcHH@manipNumVars$LANDSIZEHA)] <-
  round(sdcHH@manipNumVars$LANDSIZEHA[sdcHH@manipNumVars$LANDSIZEHA <= 1 &
    !is.na(sdcHH@manipNumVars$LANDSIZEHA)], digits = 1)

sdcHH@manipNumVars$LANDSIZEHA[sdcHH@manipNumVars$LANDSIZEHA > 1 &
  !is.na(sdcHH@manipNumVars$LANDSIZEHA)] <-
  round(sdcHH@manipNumVars$LANDSIZEHA[sdcHH@manipNumVars$LANDSIZEHA > 1 &
    !is.na(sdcHH@manipNumVars$LANDSIZEHA)], digits = 0)

# Grouping values of LANDSIZEHA into intervals 5-19, 20-39
sdcHH@manipNumVars$LANDSIZEHA[sdcHH@manipNumVars$LANDSIZEHA >= 5 &
  sdcHH@manipNumVars$LANDSIZEHA < 20 &
  !is.na(sdcHH@manipNumVars$LANDSIZEHA)] <- 13

sdcHH@manipNumVars$LANDSIZEHA[sdcHH@manipNumVars$LANDSIZEHA >= 20 &
  sdcHH@manipNumVars$LANDSIZEHA < 40 &
  !is.na(sdcHH@manipNumVars$LANDSIZEHA)] <- 30

# Topcoding values of LANDSIZEHA larger than 40
# (also recomputes risk after manual changes)
sdcHH <- topBotCoding(sdcHH, value = 40, replacement = 40,
                      kind = 'top', column = 'LANDSIZEHA')

# Results for LANDSIZEHA
table(sdcHH@manipNumVars$LANDSIZEHA)
##   0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9   1   2   3   4  13  30  40
## 188 109  55  30  24  65  22   7  31  16 154 258  53  60 113  18  25

For the expenditure and income variables we compared several methods, based on the actual case study data. As mentioned earlier, the main use of the data is to compute inequality measures, such as the Gini coefficient. Recoding these variables into percentiles creates difficulties for computing these measures or changes them to a large extent, and is hence not a suitable method. Often, income and expenditure variables that are released in a SUF are anonymized by top coding. This protects the outliers, which are the values most at risk. Top coding, however, destroys the inequality information in the data by removing high (and low) incomes. Therefore, we decide to use noise addition. To take into account the higher risk of outliers, we add a higher level of noise to them.

Adding noise can lead to a transformation of the shape of the distribution. Depending on the magnitude of the noise (see the Section Noise addition for the definition of the magnitude of noise), the values of income can also become negative. One way to solve this would be to cut off the values below zero and set them to zero. This would, however, destroy the properties preserved by noise addition (amongst others, the expected value of the mean; see also the Section Noise addition), so we chose to keep the negative values.

As mentioned before, the aggregates of income and expenditures are the sums of their components. Adding noise to each of the components might lead to a violation of this condition. One solution is therefore to add noise to the aggregates and remove the components. We prefer, however, to keep the components in the data and apply noise addition to each component separately. This allows us to apply a lower level of noise than when applying noise only to the aggregates. A noise level of 0.01 seems to be sufficient, with extra noise of 0.05 added to the outliers. The outliers are defined by a robust Mahalanobis distance (see the Section Noise addition). After adding noise to the components, we recompute the aggregates as the sum of the perturbed components.

Note: This result is only based on the actual case study dataset and is not necessarily true for other datasets.

The noise addition is shown in Listing 11.19. Before applying probabilistic methods such as noise addition, we set a seed for the random number generator. This allows us to reproduce the results.

Listing 11.19: Anonymizing continuous variables

# Add noise to income and expenditure variables by category

# Anonymize components
compExp <- c("TFOODEXP", "TALCHEXP", "TCLTHEXP", "THOUSEXP",
             "TFURNEXP", "THLTHEXP", "TTRANSEXP", "TCOMMEXP", "TRECEXP",
             "TEDUEXP", "TRESHOTEXP", "TMISCEXP")
set.seed(123)

# Add noise to expenditure variables
sdcHH <- addNoise(noise = 0.01, obj = sdcHH, variables = compExp, method = "additive")

# Add noise to outliers
sdcHH <- addNoise(noise = 0.05, obj = sdcHH, variables = compExp, method = "outdect")

# Sum over expenditure categories to obtain consistent totals
sdcHH@manipNumVars[,'TANHHEXP'] <- rowSums(sdcHH@manipNumVars[,compExp])

compInc <- c('INCRMT', 'INCWAGE', 'INCFARMBSN', 'INCNFARMBSN',
             'INCRENT', 'INCFIN', 'INCPENSN', 'INCOTHER')

# Add noise to income variables
sdcHH <- addNoise(noise = 0.01, obj = sdcHH, variables = compInc, method = "additive")

# Add noise to outliers
sdcHH <- addNoise(noise = 0.05, obj = sdcHH, variables = compInc, method = "outdect")

# Sum over income categories to obtain consistent totals
sdcHH@manipNumVars[,'INCTOTGROSSHH'] <- rowSums(sdcHH@manipNumVars[,compInc])

# Recalculate risks after manually changing values in sdcMicro object
calcRisks(sdcHH)

11.1.10 Step 9a: Re-measure risk

For the categorical variables, we conclude that we have achieved 2-anonymity in the data with local suppression. Only 104 households, or about 5% of the total number, violate 3-anonymity. Table 11.8 gives an overview of these risk measures. The global risk is reduced to 0.02% (an expected number of re-identifications of 0.36), which is extremely low. Therefore, we conclude that, based on the categorical variables, the data has been sufficiently anonymized. No household has a risk of re-identification higher than 0.01 (1%). By removing households with rare values (or outliers) of the variable HHSIZE, we have reduced the risk of spontaneous recognition of these households. The same reasoning applies to the recoding of the variable LANDSIZEHA and the PRAMming of the variables identified as important in the nosy neighbor scenario. An intruder cannot know with certainty whether a household that he recognizes in the data is the correct household, due to the noise.

Table 11.8: Number and proportion of households violating k-anonymity after anonymization

k-anonymity level | Number of HH violating | Percentage
2 | 0   | 0 %
3 | 104 | 5.28 %
5 | 374 | 18.70 %

These measures refer only to the categorical variables. To evaluate the risk of the continuous variables, we could use an interval measure or a closest neighbor algorithm. These risk measures are discussed in the Section Risk measures for continuous variables. We chose to use an interval measure, since exact value matching is not our largest concern, based on the assumed scenarios and external data sources. Instead, datasets with similar but not exactly matching values could be used for matching. Here the main concern is that the perturbed values are sufficiently far from the original values, which is measured with an interval measure.

Listing 11.20 shows how to evaluate the interval measure for each of the expenditure variables, which are contained in the vector compExp². The different values of the parameter k in the function dRisk() define the size of the interval around the original value, as explained in the Section Interval measure. The larger k, the larger the intervals, the higher the probability that a perturbed value lies in the interval around the original value and, hence, the higher the risk measure. The result is satisfactory with relatively small intervals (k = 0.01), but not when increasing the size of the intervals. In our case, k = 0.01 is sufficiently large, since we are looking at the components, not the aggregates. We have to pay special attention to the outliers. Here the value 0.01 for k is too small to assume that they are protected when outside this small interval. It would be necessary to check the outliers and their perturbed values, and there might be a need for a higher level of perturbation for outliers. We conclude that, from a risk perspective and based on the interval measure, the chosen levels of noise are acceptable. In the next step, we will look at the impact of the noise addition on data utility.

Listing 11.20: Measuring risk of re-identification of continuous variables

dRisk(sdcHH@origData[,compExp], xm = sdcHH@manipNumVars[,compExp], k = 0.01)
## [1] 0.0608828

dRisk(sdcHH@origData[,compExp], xm = sdcHH@manipNumVars[,compExp], k = 0.02)
## [1] 0.9025875

dRisk(sdcHH@origData[,compExp], xm = sdcHH@manipNumVars[,compExp], k = 0.05)
## [1] 1

11.1.11 Step 10a: Re-measure utility

None of the variables has been recoded and the original level of detail in the data is kept, except for the variable LANDSIZEHA. As described in Step 8a, local suppression has removed only a few values in the other variables, which has not greatly reduced the validity of the data.

The univariate frequency distributions of the variables ROOF, TOILET, WATER, ELECTCON, FUELCOOK, OWNMOTORCYCLE, CAR, TV and LIVESTOCK did not, by definition of the invariant PRAM method (see the Section PRAM (Post RAndomization Method)), change to a large extent. The tabulations are presented in Table 11.9 (the values 0 – 9 and NA in the first row are the values of the variables, and .m after a variable name refers to the values after anonymization).

² For illustrative purposes, we only show this evaluation for the expenditure variables. It can easily be copied for the income variables. The results are similar.

Note: Although the frequencies are almost the same, this does not mean that the values of particular households did not change.

Values have been swapped between households. This becomes apparent when looking at the multivariate frequencies of WATER with the variable URBRUR in Table 11.10. The multivariate frequencies of the PRAMmed variables with the variable URBRUR could be of interest for users, but these are not preserved. Since we applied PRAM within the regions, the multivariate frequencies of the PRAMmed variables with REGION are preserved.
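These cross-tabulations can be compared directly by extracting the treated data from the sdcMicro object; the comparison below is an illustrative addition to the case study script:

# Cross-tabulation of WATER by URBRUR before and after PRAM
manipHH <- extractManipData(sdcHH)        # anonymized household-level data
table(fileHHnew$WATER, fileHHnew$URBRUR)  # before
table(manipHH$WATER, manipHH$URBRUR)      # after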

Table 11.9: Univariate frequencies of the PRAMmed variables before and after anonymization

.               |    0 |    1 |   2 |   3 |   4 |   5 |   6 |  7 |  8 |   9 |  NA
ROOF            |      |   27 |   1 | 914 | 307 | 711 |     |    |    |  10 |   1
ROOF.m          |      |   25 |   1 | 907 | 319 | 712 |     |    |    |   6 |   1
TOILET          |      |   76 | 594 | 817 | 481 |     |     |    |    |   3 |
TOILET.m        |      |   71 | 597 | 816 | 483 |     |     |    |    |   4 |
WATER           |      |  128 | 323 | 304 | 383 | 562 | 197 | 18 | 21 |  35 |
WATER.m         |      |  134 | 319 | 308 | 378 | 573 | 188 | 16 | 21 |  34 |
ELECTCON        |  768 |  216 |   8 |   2 |     |     |     |    |    |     | 977
ELECTCON.m      |  761 |  218 |   8 |   3 |     |     |     |    |    |     | 981
FUELCOOK        |      | 1289 |  21 | 376 |  55 |  36 |     |    |    | 139 |  55
FUELCOOK.m      |      | 1284 |  22 | 383 |  50 |  39 |     |    |    | 143 |  50
OWNMOTORCYCLE   | 1883 |   86 |     |     |     |     |     |    |    |     |   2
OWNMOTORCYCLE.m | 1882 |   86 |     |     |     |     |     |    |    |     |   2
CAR             |  963 |   31 |     |     |     |     |     |    |    |     | 977
CAR.m           |  966 |   25 |     |     |     |     |     |    |    |     |
TV              | 1216 |  264 |     |     |     |     |     |    |    |     | 491
TV.m            | 1203 |  272 |     |     |     |     |     |    |    |     | 496

Table 11.10: Multivariate frequencies of the variable WATER with URBRUR before and after anonymization

.           |   1 |   2 |   3 |   4 |   5 |   6 |  7 |  8 |  9
WATER/URB   |  11 |  49 | 270 | 306 | 432 | 183 | 12 | 15 | 21
WATER/RUR   | 114 | 274 |  32 |  76 | 130 |  14 |  6 |  6 | 14
WATER/URB.m |  79 | 220 | 203 | 229 | 402 | 125 | 10 | 12 | 19
WATER/RUR.m |  54 |  98 | 105 | 147 | 169 |  63 |  6 |  9 | 15

For conciseness, we restrict ourselves to the analysis of the expenditure variables. The analysis of the income variables can be done in the same way and leads to similar results.

We look at the effect of anonymization on some indicators, as discussed in Step 5. Table 11.11 presents the point estimates and the bootstrapped confidence intervals of the Gini coefficient³ for the sum of the expenditure components. The calculation of the Gini coefficient and the confidence intervals is based on the positive expenditure values. We observe very small changes in the Gini coefficient, which are statistically negligible. We use a visualization to illustrate the impact of the anonymization on utility. Visualizations are discussed in the Section Assessing data utility with the help of data visualizations (in R) and the specific R code for this case study is available in the R script. The change in the inequality measures is illustrated in Fig. 11.1, which shows the Lorenz curves based on the positive expenditure values before and after anonymization.

³ To compute the Gini coefficient, bootstrap to construct the confidence intervals and plot the Lorenz curve, we used the R packages laeken, reldist, bootstrap and ineq.

Table 11.11: Gini point estimates and bootstrapped confidence intervals for the sum of expenditure components

.                 | before | after
Point estimate    | 0.510  | 0.508
Left bound of CI  | 0.476  | 0.476
Right bound of CI | 0.539  | 0.538

Fig. 11.1: Lorenz curve based on positive total expenditures values
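A minimal sketch of this comparison, using two of the packages named in the footnote (laeken for the Gini point estimate, ineq for the Lorenz curve); the bootstrapped confidence intervals, sampling weights and the exact plotting code of the case study script are omitted here:

library(laeken)   # gini()
library(ineq)     # Lc()

manipHH   <- extractManipData(sdcHH)
posBefore <- fileHHnew$TANHHEXP[fileHHnew$TANHHEXP > 0]
posAfter  <- manipHH$TANHHEXP[manipHH$TANHHEXP > 0]

# Gini point estimates before and after anonymization
gini(posBefore)$value
gini(posAfter)$value

# Lorenz curves before (solid) and after (dashed) anonymization
plot(Lc(posBefore))
lines(Lc(posAfter)$p, Lc(posAfter)$L, lty = 2)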

We compare the mean monthly expenditures (MME) and the mean monthly income (MMI) for the rural, urban and total population. The results are shown in Table 11.12. We observe that the chosen levels of noise add only small distortions to the MME and slightly larger changes to the MMI.

Table 11.12: Mean monthly expenditure and mean monthly income per capita by rural/urban

.         | before | after
MME rural | 400.5  | 398.5
MME urban | 457.3  | 459.9
MME total | 412.6  | 412.6
MMI rural | 397.1  | 402.2
MMI urban | 747.6  | 767.8
MMI total | 472.1  | 478.5

Table 11.13 shows the share of each of the components of the expenditure variables before and after anonymization.


Table 11.13: Shares of expenditure components

.      | TFOODEXP  | TALCHEXP | TCLTHEXP | THOUSEXP | TFURNEXP   | THLTHEXP
before | 0.58      | 0.01     | 0.03     | 0.09     | 0.02       | 0.03
after  | 0.59      | 0.01     | 0.03     | 0.09     | 0.02       | 0.03
.      | TTRANSEXP | TCOMMEXP | TRECEXP  | TEDUEXP  | TRESHOTEXP | TMISCEXP
before | 0.04      | 0.02     | 0.00     | 0.08     | 0.03       | 0.05
after  | 0.04      | 0.02     | 0.00     | 0.08     | 0.03       | 0.05
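One possible way to compute such shares (a sketch; the case study script may define the shares per household rather than as overall shares):

# Overall share of each expenditure component in total expenditures
manipHH     <- extractManipData(sdcHH)
shareBefore <- colSums(fileHHnew[, compExp], na.rm = TRUE) /
               sum(fileHHnew$TANHHEXP, na.rm = TRUE)
shareAfter  <- colSums(manipHH[, compExp], na.rm = TRUE) /
               sum(manipHH$TANHHEXP, na.rm = TRUE)
round(rbind(before = shareBefore, after = shareAfter), 2)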

Anonymization for the creation of a SUF will inevitably lead to some degree of utility loss. It is important to describe this loss in the external report, so that users are aware of the changes in the data. This is described in Step 11 and presented in Appendix C. Appendix C also shows summary statistics and tabulations of the household-level variables before and after anonymization.

Merging the household- and individual-level variables

The next step is to merge the treated household variables with the untreated individual variables for the anonymization of the individual-level variables. Listing 11.21 shows the steps to merge these files. This also includes the selection of the variables used in the anonymization of the individual-level variables. We create the sdcMicro object for the anonymization of the individual-level variables in the same way as for the household variables in Listing 11.9. Subsequently, we repeat Steps 6 through 10 for the individual-level variables.

Listing 11.21: Merging the files with household and individual-level variables and creating an sdcMicro object for the anonymization of the individual-level variables

### Select variables (individual level)
# Key variables (individual level)
selectedKeyVarsIND = c('GENDER', 'REL', 'MARITAL', 'AGEYRS',
                       'EDUCY', 'ATSCHOOL', 'INDUSTRY1') # list of selected key variables

# Sample weight (WGTHH, individual weight)
selectedWeightVarIND = c('WGTHH')

# Household ID
selectedHouseholdID = c('IDH')

# No strata

# Recombining anonymized HH datasets and individual level variables
indVars <- c("IDH", "IDP", selectedKeyVarsIND, "WGTHH") # IDH and all non-HH variables
fileInd <- file[indVars]                                # subset of file without HHVars

HHmanip <- extractManipData(sdcHH)            # manipulated household variables
HHmanip <- HHmanip[HHmanip[,'IDH'] != 1782,]  # drop household with suppressed HHSIZE

fileCombined <- merge(HHmanip, fileInd, by.x = c('IDH'))

fileCombined <- fileCombined[order(fileCombined[,'IDH'],
                                   fileCombined[,'IDP']),]

dim(fileCombined)

# SDC object with all variables and the treated HH variables for the
# anonymization of the individual-level variables
sdcCombined <- createSdcObj(dat = fileCombined, keyVars = selectedKeyVarsIND,
                            weightVar = selectedWeightVarIND,
                            hhId = selectedHouseholdID)

11.1.12 Step 6b: Assessing disclosure risk (individual level)

All key variables at the individual level are categorical. Therefore, we can use k-anonymity and the individual and global risk measures (see the Sections Individual risk and Global risk). The hierarchical risk is now of interest, given the household structure in the dataset fileCombined, which includes both household-level and individual-level variables. The numbers of individuals (absolute and relative) violating k-anonymity at the levels 2, 3 and 5 are shown in Table 11.14.

Note: k-anonymity does not consider the household structure and therefore underestimates the risk. For this reason, we are more interested in the individual and global hierarchical risk measures.

Table 11.14: k-anonymity violations

k-anonymity level | Number of individuals violating | Percentage
2 | 998   | 9.91 %
3 | 1,384 | 13.75 %
5 | 2,194 | 21.79 %

The global risk measures can be computed in R as illustrated in Listing 11.22. The global risk is 0.24%, which corresponds to 24 expected re-identifications. Accounting for the hierarchical structure, this rises to 1.26%, or 127 expected re-identifications. The global risk measures are low compared to the number of k-anonymity violators due to the low sampling weights. The high number of k-anonymity violators is mainly due to the very detailed age variable. The risk measures are based only on the individual-level variables, since we assume that the individual-level and household-level variables are not used simultaneously by an intruder. If we considered an intruder scenario where these variables are used simultaneously to re-identify individuals, the household-level variables would also have to be taken into account here. This would result in a high number of key variables.

Listing 11.22: Global risk of the individual-level variables

print(sdcCombined, 'risk')
## Risk measures:
##
## Number of observations with higher risk than the main part of the data: 0
## Expected number of re-identifications: 23.98 (0.24 %)
##
## Information on hierarchical risk:
## Expected number of re-identifications: 127.12 (1.26 %)

11.1.13 Step 7b: Assessing utility (individual level)

We evaluate the utility measures selected in Step 5, as well as some general utility measures. The values computed from the raw data are presented in Step 10b to allow for direct comparison with the values computed from the anonymized data.


11.1.14 Step 8b: Choice and application of SDC methods (individual level)

We use the same approach for the anonymization of the individual-level categorical key variables as for the household-level categorical variables described earlier: first use global recoding to limit the necessary number of suppressions, then apply local suppression and finally, if necessary, use perturbative methods.

The variable AGEYRS (i.e., age in years) has many different values (age in months for children aged 0 – 1 years and age in years for individuals over 1 year). This level of detail leads to a high level of re-identification risk, given external datasets with exact age as well as knowledge of the exact age of close relatives. We have to reduce the level of detail in the age variable by recoding the age values (see the Section Recoding). First, we recode the values from 15 to 65 in ten-year intervals. Since some indicators related to education are computed from the survey dataset, our first approach is not to recode the age range 0 – 15 years. For children under the age of 1 year, we reduce the level of detail and recode their age to 0 years. These recodes are shown in Listing 11.23. We also top-code age at 65 years, which protects individuals with high (rare) age values.

Listing 11.23: Recoding age in 10-year intervals in the range 15 – 65 and top-coding age over 65 years

# Recoding age and top coding age (top code 65), below that 10 year age
# groups, children aged under 1 are recoded 0 (previously in months)

sdcCombined@manipKeyVars$AGEYRS[sdcCombined@manipKeyVars$AGEYRS >= 0 &
                                sdcCombined@manipKeyVars$AGEYRS < 1] <- 0

sdcCombined@manipKeyVars$AGEYRS[sdcCombined@manipKeyVars$AGEYRS >= 15 &
                                sdcCombined@manipKeyVars$AGEYRS < 25] <- 20

...

sdcCombined@manipKeyVars$AGEYRS[sdcCombined@manipKeyVars$AGEYRS >= 55 &
                                sdcCombined@manipKeyVars$AGEYRS < 66] <- 60

# topBotCoding also recalculates risk based on manual recoding above
sdcCombined <- topBotCoding(obj = sdcCombined, value = 65,
                            replacement = 65, kind = 'top', column = 'AGEYRS')

table(sdcCombined@manipKeyVars$AGEYRS) # check results
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
##  311  367  340  332  260  334  344  297  344  281  336  297  326  299  263
##   20   30   40   50   60   65
## 1847 1220  889  554  314  325

These recodes already reduce the risk to 531 individuals violating 3-anonymity. We could recode the values of age in the lower range according to the age categories users require (e.g., 8 – 11 for education). There are, however, many different categories for different indicators, including education indicators, and recoding to suit one set of indicators would reduce the utility of the data for other users. Therefore, we decide to look first at the number of suppressions needed in local suppression after this limited recoding. If the number of suppressions is too high, we can go back and recode age in the range 1 – 14 years.
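If we did need to go back, the recoding could follow the same pattern as Listing 11.23. The following is a minimal sketch; the cut points (ages 1 – 5, 6 – 11 and 12 – 14) and replacement values are purely illustrative and not taken from the case study.

# Hypothetical recode of the lower age range into education-based groups
# (cut points and replacement values are illustrative only)
sdcCombined@manipKeyVars$AGEYRS[sdcCombined@manipKeyVars$AGEYRS >= 1 &
                                sdcCombined@manipKeyVars$AGEYRS <= 5] <- 3
sdcCombined@manipKeyVars$AGEYRS[sdcCombined@manipKeyVars$AGEYRS >= 6 &
                                sdcCombined@manipKeyVars$AGEYRS <= 11] <- 8
sdcCombined@manipKeyVars$AGEYRS[sdcCombined@manipKeyVars$AGEYRS >= 12 &
                                sdcCombined@manipKeyVars$AGEYRS <= 14] <- 13

# Recalculate the risk measures after the manual recode
sdcCombined <- calcRisks(sdcCombined)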

In Listing 11.24 we demonstrate how one might experiment with local suppression to find the best option. We use local suppression to achieve 2-anonymity (see the Section Local suppression). On the first attempt, we do not specify any importance vector; however, this leads to many suppressions in the variable AGEYRS (see Table 11.15, first row), which is undesirable from a utility point of view. Therefore, we decide to specify an importance vector to prevent suppressions in the variable AGEYRS. Suppressing the variable GENDER is also undesirable from a utility point of view; GENDER is the type of variable that should not have suppressions. We set GENDER as the variable with the second highest importance. After specifying the importance vector to prevent suppressions of the age variable, there are no age suppressions (see Table 11.15, second row). The total number of suppressions in the other variables, however, increased from 253 to 323 because of the importance vector. This is to be expected, because the algorithm without the importance vector minimizes the total number of suppressions by first suppressing values in variables with many categories – in this case, age. Specifying an importance vector prevents reaching this optimum and hence leads to a higher total number of suppressions. There is a trade-off between which variables are suppressed and the total number of suppressions. After specifying an importance vector, the variable REL has many suppressions (see Table 11.15, second row). We choose this second option.

Listing 11.24: Experimenting with different options in local suppression

# Copy of sdcMicro object to later undo steps
sdcCopy <- sdcCombined

# Importance vectors for local suppression (depending on utility measures)
impVec1 <- NULL # for optimal suppression
impVec2 <- rep(length(selectedKeyVarsIND), length(selectedKeyVarsIND))
impVec2[match('AGEYRS', selectedKeyVarsIND)] <- 1 # AGEYRS
impVec2[match('GENDER', selectedKeyVarsIND)] <- 2 # GENDER

# Local suppression without importance vector
sdcCombined <- localSuppression(sdcCombined, k = 2, importance = impVec1)

# Number of suppressions per variable
print(sdcCombined, "ls")
## Local Suppression:
##      KeyVar | Suppressions (#) | Suppressions (%)
##      GENDER |                0 |            0.000
##         REL |               34 |            0.338
##     MARITAL |                0 |            0.000
##      AGEYRS |              195 |            1.937
##       EDUCY |                0 |            0.000
## EDYRSCURRAT |                3 |            0.030
##    ATSCHOOL |                0 |            0.000
##   INDUSTRY1 |               21 |            0.209

# Number of suppressions for each value of AGEYRS
table(sdcCopy@manipKeyVars$AGEYRS) - table(sdcCombined@manipKeyVars$AGEYRS)
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 30 40 50 60 65
##  0  0  0  0  0  0  2  0  2  1  0  1  4  1  5 25 53 37 36 15 13

# Undo local suppression
sdcCombined <- undolast(sdcCombined)

# Local suppression with importance vector on AGEYRS and GENDER
sdcCombined <- localSuppression(sdcCombined, k = 2, importance = impVec2)

# Number of suppressions per variable
print(sdcCombined, "ls")
## Local Suppression:
##      KeyVar | Suppressions (#) | Suppressions (%)
##      GENDER |                0 |            0.000
##         REL |              323 |            3.208
##     MARITAL |                0 |            0.000
##      AGEYRS |                0 |            0.000
##       EDUCY |                0 |            0.000
## EDYRSCURRAT |                0 |            0.000
##    ATSCHOOL |                0 |            0.000
##   INDUSTRY1 |                0 |            0.000

# Number of suppressions for each value of the variable AGEYRS
table(sdcCopy@manipKeyVars$AGEYRS) - table(sdcCombined@manipKeyVars$AGEYRS)
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 20 30 40 50 60 65
##  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

Table 11.15: Number of suppressions by variable for different variations of local suppression

Local suppression options   | GENDER | REL | MARITAL | AGEYRS | EDUCY | EDYRSCURRAT | ATSCHOOL | INDUSTRY1
k = 2, no importance vector | 0      | 34  | 0       | 195    | 0     | 3           | 0        | 21
k = 2, importance on AGEYRS | 0      | 323 | 0       | 0      | 0     | 0           | 0        | 0

11.1.15 Step 9b: Re-measure risk (individual level)

We re-evaluate the risk measures selected in Step 6b. Table 11.16 shows that local suppression, not surprisingly, has reduced the number of individuals violating 2-anonymity to 0.

Table 11.16: k-anonymity violations

k-anonymity | Number of individuals violating | Percentage
2           | 0                               | 0.00 %
3           | 197                             | 1.96 %
5           | 518                             | 5.15 %

The hierarchical global risk was reduced to 0.11%, which corresponds to 11.3 expected re-identifications. The highest individual hierarchical re-identification risk is 1.21%. These risk levels would seem acceptable for a SUF.
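These figures can be read from the sdcMicro object. A brief sketch follows; it assumes, as set up in Listing 11.21, that a household ID was specified, in which case sdcMicro stores the hierarchical risk in a 'hier_risk' column of the individual risk matrix.

# Global and hierarchical risk after local suppression
print(sdcCombined, "risk")

# Highest individual hierarchical re-identification risk
# (the 'hier_risk' column is assumed to be present because hhId was set)
max(sdcCombined@risk$individual[, "hier_risk"], na.rm = TRUE)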

11.1.16 Step 10b: Re-measure utility (individual level)

We selected two utility measures for the individual-level variables: primary and secondary education enrollment, both also by gender. These two measures are sensitive to changes in the variables gender (GENDER), age (AGEYRS) and education (EDUCY and EDYRSCURRAT), and therefore give a good overview of the impact of the anonymization. As shown in Table 11.17, the anonymization did not change the results. The results of the tabulations in Appendix C confirm this. A small sketch for recomputing such an enrollment measure follows Table 11.17.

Table 11.17: Net enrollment in primary and secondary education by gender

.      | Primary education     | Secondary education
.      | Total   Male   Female | Total   Male   Female
Before | 72.6%   74.2%  70.9%  | 42.0%   44.8%  39.1%
After  | 72.6%   74.2%  70.9%  | 42.0%   44.8%  39.1%
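As an illustration of how such a measure can be recomputed from the anonymized data, the sketch below computes a weighted net primary enrollment rate by gender. The primary school age range 6 – 11 and the coding ATSCHOOL == 1 for "enrolled" are assumptions made for illustration, not definitions from the case study.

# Sketch: weighted net primary enrollment by gender from the anonymized data
datAnon <- extractManipData(sdcCombined)               # anonymized data
primAge <- datAnon$AGEYRS >= 6 & datAnon$AGEYRS <= 11  # hypothetical primary school ages
enrolled <- datAnon$ATSCHOOL == 1                      # hypothetical coding for 'enrolled'

for (g in unique(datAnon$GENDER)) {
  sel <- primAge & datAnon$GENDER == g
  rate <- sum(datAnon$WGTHH[sel & enrolled], na.rm = TRUE) /
          sum(datAnon$WGTHH[sel], na.rm = TRUE)
  cat("GENDER =", g, ":", round(100 * rate, 1), "%\n")
}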

11.1.17 Step 11: Audit and reporting

In the audit step, we check whether the data allow for the reproduction of published figures from the original dataset and whether relationships between variables and other data characteristics are preserved in the anonymization process. In short, we check whether the dataset is valid for analytical purposes. No figures were published from this dataset, so there are no figures that need to be reproducible from the anonymized data.

In Step 2, we explored the data characteristics and relationships between variables. These data characteristics and relationships have been largely preserved, since we took them into account when choosing the appropriate anonymization methods. The variables TANHHEXP and INCTOTGROSSHH remain the sums of their individual components, because we added noise to the components and reconstructed the aggregates by summing over the components. Initially, the income variables were all positive; this characteristic has been violated as a result of the noise addition. Since values of the variable AGEYRS were not perturbed, but only recoded and suppressed, we did not introduce unlikely combinations, such as a 60-year-old individual enrolled in primary education. Also, by separating the anonymization process into two parts, one for household-level variables and one for individual-level variables, the values of variables measured at the household level agree for all members of each household.
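A small sketch of such audit checks follows. The component list mirrors the expenditure components in this dataset; exact equality may need to be relaxed to a numerical tolerance.

# Audit sketch: released totals should equal the sum of their components
expComp <- c('TFOODEXP', 'TALCHEXP', 'TCLTHEXP', 'THOUSEXP',
             'TFURNEXP', 'THLTHEXP', 'TTRANSEXP', 'TCOMMEXP',
             'TRECEXP', 'TEDUEXP', 'TRESTHOTEXP', 'TMISCEXP')
dataAudit <- extractManipData(sdcCombined)
all.equal(dataAudit$TANHHEXP, rowSums(dataAudit[, expComp]))

# Count negative income values introduced by the noise addition
sum(dataAudit$INCTOTGROSSHH < 0, na.rm = TRUE)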

Furthermore, we drafted two reports, internal and external, on the anonymization of the case study dataset. The internal report includes the methods used, the risk before and after anonymization, as well as the reasons for the selected methods and their parameters. The external report focuses on the changes in the data and the loss in utility; the focus here should be on the number of suppressions as well as on the perturbative methods (PRAM). This is described in the previous steps.

Note: When creating a SUF, some loss of information is inevitable. It is very important to make users aware of these changes by describing them in a report that accompanies the data.

Appendix C provides examples of an internal and an external report on the anonymization process of this dataset. Depending on the users and readers of the reports, the content may differ. The code to this case study shows how to obtain the information for the reports. Some measures are also available in the standard reports generated with the report() function, as shown in Listing 11.25. The report() function only uses the data available in the sdcMicro object, which in the case of sdcHH does not contain all households.

Listing 11.25: Using the report() function for internal and external reports

# Create reports with sdcMicro report() function (household level)
report(sdcHH, internal = F) # external (brief) report
report(sdcHH, internal = T) # internal (extended) report

# Create reports with sdcMicro report() function (combined dataset)
report(sdcCombined, internal = F) # external (brief) report
report(sdcCombined, internal = T) # internal (extended) report

11.1.18 Step 12: Data release

The final step is the release of the anonymized dataset together with the external report. Listing 11.26 shows how to collect the data from the sdcMicro object with the extractManipData() function. Before releasing the file, we add an individual ID to the file (the line number within the household). We export the anonymized dataset as a STATA file. The Section Read functions in R presents functions for exporting files in other data formats.

Listing 11.26: Exporting the anonymized dataset

# Anonymized dataset
# Household variables and individual variables
# extractManipData() extracts all variables, not just the manipulated ones
dataAnon <- extractManipData(sdcCombined, ignoreKeyVars = F, ignorePramVars = F,
                             ignoreNumVars = F, ignoreStrataVar = F)

# Create STATA file
write.dta(data = dataAnon, file = 'Case1DataAnon.dta', convert.dates = TRUE)

11.2 Case study 2 - PUF

This case study is a continuation of case study 1 in the Section Case study 1- SUF. Case study 1 produced a SUF file. In this case study we use this SUF file to produce a PUF file of the same dataset, which can be freely distributed. The structure of the SUF and PUF releases will be the same. However, the PUF will contain fewer variables and less (detailed) information than the SUF. We refer to the Section Case study 1- SUF for a description of the dataset.

Note: It is also possible to directly produce a PUF from a dataset without first creating a SUF.

As in case study 1, we show how the creation of a PUF can be achieved using the open source and free sdcMicro package and R. A ready-to-run R script for this case study and the dataset are also available to reproduce the results and allow the user to adapt the code (see http://ihsn.org/home/projects/sdc-practice). Extracts of this code are presented in this section to illustrate several steps of the anonymization process.

Note: The choices of methods and parameters in this case study are based on this particular dataset; the results and choices might be different for other datasets.

This case study follows the steps of the SDC process outlined in The SDC Process.

11.2.1 Step 1: Need for disclosure control

The same reasoning as in case study 1 applies: the SUF dataset produced in case study 1 contains data on individuals and households, and some variables are confidential and/or sensitive. The decisions made in case study 1 are based on the disclosure scenarios for a SUF release. The anonymization applied for the SUF does not provide sufficient protection for a PUF release, and the SUF file cannot be released as a PUF without further treatment. Therefore, we have to repeat the SDC process with a different set of disclosure scenarios based on the characteristics of a PUF release (see Step 4). This leads to different risk measures, lower accepted risk levels and different SDC methods.

11.2.2 Step 2: Data preparation and exploring data characteristics

In order to guarantee consistency between the released PUF and SUF files, which is required to prevent intruders from using the datasets together (SUF users also have access to the PUF file), we have to use the anonymized SUF file to create the PUF file (see also the Section Step 3: Type of release). In this way all information in the PUF file is also contained in the SUF, and the PUF does not provide additional information to an intruder with access to the SUF. We load the required packages to read the data (foreign package for STATA files) and load the SUF dataset into “file” as illustrated in Listing 11.27. We also load the original data file (raw data) as “fileOrig”. We need the raw data to undo perturbative methods used in case study 1 (see Step 8) and to compare data utility measures (see Step 5). To evaluate the utility loss in the PUF, we have to compare the information in the anonymized PUF file with the information in the raw data. For an overview of the data characteristics and a description of the variables in both files, we refer to Step 2 of Case study 1- SUF.


Listing 11.27: Loading required packages and datasets

# Load required packages
library(foreign)  # for read/write function for STATA
library(sdcMicro) # sdcMicro package

# Set working directory - set to the path on your machine
setwd("/Users/CaseStudy2")

# Specify file name of SUF file from case study 1
fname <- "CaseDataAnon.dta"

# Specify file name of original dataset (raw data)
fnameOrig <- "CaseA.dta"

# Read-in files
file <- read.dta(fname, convert.factors = TRUE)         # SUF file case study 1
fileOrig <- read.dta(fnameOrig, convert.factors = TRUE) # original data

We check the number of variables and observations of both files, as well as the variable names of the SUF file, as shown in Listing 11.28. The SUF file has fewer records and fewer variables than the original data file, since we removed large households and several variables when generating the SUF file.

Listing 11.28: Number of individuals and variables and variable names

# Dimensions of file (observations, variables)
dim(file)
## [1] 10068 49

dim(fileOrig)
## [1] 10574 68

colnames(file) # Variable names
##  [1] "IDH"           "URBRUR"        "REGION"        "HHSIZE"
##  [5] "OWNAGLAND"     "RELIG"         "ROOF"          "TOILET"
##  [9] "WATER"         "ELECTCON"      "FUELCOOK"      "OWNMOTORCYCLE"
## [13] "CAR"           "TV"            "LIVESTOCK"     "LANDSIZEHA"
## [17] "TANHHEXP"      "TFOODEXP"      "TALCHEXP"      "TCLTHEXP"
## [21] "THOUSEXP"      "TFURNEXP"      "THLTHEXP"      "TTRANSEXP"
## [25] "TCOMMEXP"      "TRECEXP"       "TEDUEXP"       "TRESTHOTEXP"
## [29] "TMISCEXP"      "INCTOTGROSSHH" "INCRMT"        "INCWAGE"
## [33] "INCFARMBSN"    "INCNFARMBSN"   "INCRENT"       "INCFIN"
## [37] "INCPENSN"      "INCOTHER"      "WGTPOP"        "IDP"
## [41] "GENDER"        "REL"           "MARITAL"       "AGEYRS"
## [45] "EDUCY"         "EDYRSCURRAT"   "ATSCHOOL"      "INDUSTRY1"
## [49] "WGTHH"

To get an overview of the values of the variables, we use tabulations and cross-tabulations for categorical variables and summary statistics for continuous variables. To include the number of missing values (‘NA’ or other), we use the option useNA = “ifany” in the table() function. For some variables, these tabulations differ from the tabulations of the raw data, due to the anonymization of the SUF file.
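For instance, a minimal sketch using base R:

# Tabulation of a categorical variable, including missing values
table(file$REGION, useNA = "ifany")

# Cross-tabulation of two categorical variables
table(file$URBRUR, file$REGION, useNA = "ifany")

# Summary statistics for a continuous variable
summary(file$TANHHEXP)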

In Table 11.18 the variables in the dataset “file” are listed along with concise descriptions, the level at which they are collected (individual level (IND) or household level (HH)), the measurement type (continuous, semi-continuous or categorical) and value ranges. Note that the dataset contains a selection of 49 of the 68 variables in the raw dataset. The variables were preselected based on their relevance for data users; some variables were already removed while creating the SUF file. The numerical values for many of the categorical variables are codes that refer to values, e.g., in the variable “URBRUR”, 1 stands for ‘rural’ and 2 for ‘urban’. More information on the meanings of coded values of the categorical variables is available in the R code for this case study.

Any data cleaning, such as recoding missing value codes and removing empty variables, was already done in case study 1. The same holds for removing any direct identifiers; direct identifiers are not released in the SUF file.

We identified the following sensitive variables in the dataset: variables related to schooling and labor force status, as well as income- and expenditure-related variables. These variables need protection. Whether a variable is considered sensitive may depend on the release type, country and the dataset itself.

Table 11.18: Overview of the variables in the dataset

No. | Variable name | Description                                                     | Level | Measurement     | Values
1   | IDH           | Household ID                                                    | HH    | .               | 1-2,000
2   | IDP           | Individual ID                                                   | IND   | .               | 1-13
3   | REGION        | Region                                                          | HH    | categorical     | 1-6
4   | URBRUR        | Area of residence                                               | HH    | categorical     | 1, 2
5   | WGTHH         | Individual weighting coefficient                                | HH    | weight          | 31.2-8495.7
6   | WGTPOP        | Population weighting coefficient                                | IND   | weight          | 45.8-93452.2
7   | HHSIZE        | Household size                                                  | HH    | semi-continuous | 1-33
8   | GENDER        | Gender                                                          | IND   | categorical     | 0, 1
9   | REL           | Relationship to household head                                  | IND   | categorical     | 1-9
10  | MARITAL       | Marital status                                                  | IND   | categorical     | 1-6
11  | AGEYRS        | Age in completed years                                          | IND   | semi-continuous | 0-65
12  | RELIG         | Religion of household head                                      | HH    | categorical     | 1, 5-7, 9
13  | ATSCHOOL      | Currently enrolled in school                                    | IND   | categorical     | 0, 1
14  | EDUCY         | Highest level of education attended                             | IND   | categorical     | 1-6
15  | EDYRSCURRAT   | Years of education for currently enrolled                       | IND   | semi-continuous | 1-18
16  | INDUSTRY1     | Industry classification (1-digit)                               | IND   | categorical     | 1-10
17  | ROOF          | Main material used for roof                                     | HH    | categorical     | 1-5, 9
18  | TOILET        | Main toilet facility                                            | HH    | categorical     | 1-4, 9
19  | ELECTCON      | Electricity                                                     | HH    | categorical     | 0-3
20  | FUELCOOK      | Main cooking fuel                                               | HH    | categorical     | 1-5, 9
21  | WATER         | Main source of water                                            | HH    | categorical     | 1-9
22  | OWNAGLAND     | Ownership of agricultural land                                  | HH    | categorical     | 1-3
23  | LANDSIZEHA    | Land size owned by household (ha) (agric and non-agric)         | HH    | continuous      | 0-40
24  | OWNMOTORCYCLE | Ownership of motorcycle                                         | HH    | categorical     | 0, 1
25  | CAR           | Ownership of car                                                | HH    | categorical     | 0, 1
26  | TV            | Ownership of television                                         | HH    | categorical     | 0, 1
27  | LIVESTOCK     | Number of large-sized livestock owned                           | HH    | semi-continuous | 0-25
28  | INCRMT        | Income - Remittances                                            | HH    | continuous      |
29  | INCWAGE       | Income - Wages and salaries                                     | HH    | continuous      |
30  | INCFARMBSN    | Income - Gross income from household farm businesses            | HH    | continuous      |
31  | INCNFARMBSN   | Income - Gross income from household nonfarm businesses         | HH    | continuous      |
32  | INCRENT       | Income - Rent                                                   | HH    | continuous      |
33  | INCFIN        | Income - Financial                                              | HH    | continuous      |
34  | INCPENSN      | Income - Pensions/social assistance                             | HH    | continuous      |
35  | INCOTHER      | Income - Other                                                  | HH    | continuous      |
36  | INCTOTGROSSHH | Income - Total                                                  | HH    | continuous      |
37  | TFOODEXP      | Total expenditure on food                                       | HH    | continuous      |
38  | TALCHEXP      | Total expenditure on alcoholic beverages, tobacco and narcotics | HH    | continuous      |
39  | TCLTHEXP      | Total expenditure on clothing                                   | HH    | continuous      |
40  | THOUSEXP      | Total expenditure on housing                                    | HH    | continuous      |
41  | TFURNEXP      | Total expenditure on furnishing                                 | HH    | continuous      |
42  | THLTHEXP      | Total expenditure on health                                     | HH    | continuous      |
43  | TTRANSEXP     | Total expenditure on transport                                  | HH    | continuous      |
44  | TCOMMEXP      | Total expenditure on communication                              | HH    | continuous      |
45  | TRECEXP       | Total expenditure on recreation                                 | HH    | continuous      |
46  | TEDUEXP       | Total expenditure on education                                  | HH    | continuous      |
47  | TRESTHOTEXP   | Total expenditure on restaurants and hotels                     | HH    | continuous      |
48  | TMISCEXP      | Total expenditure on miscellaneous spending                     | HH    | continuous      |
49  | TANHHEXP      | Total annual nominal household expenditures                     | HH    | continuous      |

It is always important to ensure that the relationships between variables in the data are preserved during the anonymization process and to explore and take note of these relationships before beginning the anonymization. At the end of the anonymization process, before the release of the data, an audit should be conducted, using these initial results, to check that these relationships are maintained in the anonymized dataset (see Step 11).

In our dataset, we identify several relationships between variables that need to be preserved during the anonymization process. The variables “TANHHEXP” and “INCTOTGROSSHH” represent the total annual nominal household expenditure and the total gross annual household income, respectively; these variables are aggregations of existing income and expenditure components in the dataset.

The variables related to education are available only for individuals in the appropriate age groups and are missing for other individuals. In addition, the household-level variables (cf. fourth column of Table 11.18) have the same values for all members of any particular household. The value of household size corresponds to the actual number of individuals belonging to that household in the dataset. As we proceed, we have to take care that these relationships and structures are preserved in the anonymization process.

We assume that the data were collected in a survey that uses simple sampling of households. The data contain two weight coefficients: “WGTHH” and “WGTPOP”. The relationship between the weights is WGTPOP = WGTHH * HHSIZE. “WGTPOP” is the sampling weight for the households and “WGTHH” is the sampling weight for the individuals, to be used for disclosure risk calculations. “WGTHH” is used for computing individual-level indicators (such as education) and “WGTPOP” for population-level indicators (such as income indicators). There are no strata variables available in the data. We will use “WGTPOP” for the anonymization of the household variables and “WGTHH” for the anonymization of the individual-level variables.
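This relationship can be verified directly in R; a one-line sketch (TRUE is returned if the relationship holds up to numerical tolerance):

# Check that WGTPOP = WGTHH * HHSIZE holds in the raw data
all.equal(fileOrig$WGTPOP, fileOrig$WGTHH * fileOrig$HHSIZE)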

11.2.3 Step 3: Type of release

In this case study, we assume that the file will be released as a PUF, which will be freely available to anyone interested in the data (see the Section Conditions for PUFs for the conditions and more information on the release of PUFs). The PUF release is intended for users with lower information requirements (e.g., students) and researchers interested in the structure of the data and willing to do preliminary research. The PUF file can give a researcher an idea of whether it is worthwhile to apply for access to the SUF file; researchers willing to do more in-depth research will most likely apply for SUF access. Generally, users of a PUF file are not restricted by an agreement that prevents them from using the data to re-identify individuals; hence the accepted risk level is much lower than in the case of the SUF and the set of released variables is limited.

11.2.4 Step 4: Intruder scenarios and choice of key variables

Next, based on the release type, we reformulate the intruder scenarios for the PUF release. This leads to the selection of a set of quasi-identifiers. Since this case study is based on a demo dataset, we do not have a real context and cannot define exact disclosure scenarios. Therefore, we make hypothetical assumptions about possible disclosure scenarios. We consider two types of disclosure scenarios: 1) matching with other publicly available datasets and 2) spontaneous recognition. Since the dataset will be distributed as a PUF, there are de facto no restrictions on the use of the dataset by intruders.

For the sake of illustration, we assume that population registers are available with the demographic variables gender, age, place of residence (region, urban/rural), religion and other variables such as marital status and variables relating to education and professional status that are also present in our dataset. In addition, we assume that there is a publicly available cadastral register on land ownership. Based on this analysis of available data sources, we selected in case study 1 the variables “REGION”, “URBRUR”, “HHSIZE”, “OWNAGLAND”, “RELIG”, “GENDER”, “REL” (relationship to household head), “MARITAL” (marital status), “AGEYRS”, “INDUSTRY1” and two variables relating to school attendance as categorical quasi-identifiers, and the expenditure and income variables as well as LANDSIZEHA as continuous quasi-identifiers. According to our assessment, these variables might enable an intruder to re-identify an individual or household in the dataset by matching with other available datasets. The key variables for the PUF release generally coincide with the key variables for the SUF release. Possibly, more variables could be added, since a PUF user has more possibilities to match the data extensively and is not bound by any contract, as is the case for the SUF file. Equally, some key variables in the SUF file may not be released in the PUF file and, as a consequence, these variables are removed from the list of key variables.

Upon further consideration, this initial set of identifying variables is too large for a PUF release, as the number of possible combinations (keys) is very high and hence many respondents could be identified based on these variables. Therefore, we decide to limit the set of key variables by excluding variables from the dataset for the PUF release. The choice of variables to be removed is guided by the needs of the intended PUF users. Assuming the typical users are mainly interested in aggregate income and expenditure data, we can remove from the initial set of key variables “OWNAGLAND”, “RELIG” and “LANDSIZEHA” at the household level and “EDYRSCURRAT” and “ATSCHOOL” at the individual level.

Note: These variables will not be released in the PUF file.

We also remove the income and expenditure components from the list of key variables, since we reduce their information content by building proportions (see Step 8a). The list of the remaining key variables is presented in Table 11.19.

Table 11.19: Overview of selected key variables for PUF file

Variable name  | Variable description                 | Measurement level
REGION         | region                               | Household, categorical
URBRUR         | area of residence                    | Household, categorical
HHSIZE         | household size                       | Household, categorical
TANHHEXP       | total expenditure                    | Household, continuous
INCTOTGROSSHH  | total income                         | Household, continuous
GENDER         | gender                               | Individual, categorical
REL            | relationship to household head       | Individual, categorical
MARITAL        | marital status                       | Individual, categorical
AGEYRS         | age in completed years               | Individual, semi-continuous/categorical
EDUCY          | highest level of education completed | Individual, categorical
INDUSTRY1      | industry classification              | Individual, categorical

The decision to release the dataset as a PUF means the level of anonymization will be relatively high and, consequently, the variables are less detailed (e.g., after recoding) and a scenario of spontaneous recognition is less likely. Nevertheless, we should check for rare combinations or unusual patterns in the variables. Variables that may lead to spontaneous recognition in our sample are, amongst others, “HHSIZE” (household size) as well as “INCTOTGROSSHH” (aggregate income) and “TANHHEXP” (total expenditure). Large households and households with high income are easily identifiable, especially when combined with other identifying variables such as a geographical identifier (“REGION”). There might be only one or a few households/individuals in a certain region with a high income, such as the local doctor. Variables that are easily observable and known by neighbors, such as “ROOF”, “TOILET”, “WATER”, “ELECTCON”, “FUELCOOK”, “OWNMOTORCYCLE”, “CAR”, “TV” and “LIVESTOCK”, may also need protection depending on what stands out in the community, since a user might be able to identify persons (s)he knows. This is called the nosy neighbor scenario.

11.2.5 Step 5: Data key uses and selection of utility measures

A PUF file contains less information and is generally used by students as a teaching file, by researchers to get an idea about the data structure, and for simple analyses. Users generally have lower requirements than for a SUF file and the results of analyses may be less precise; a researcher interested in a more detailed dataset would have to apply for access to the SUF file. Therefore, we select more aggregate utility measures for the PUF file that reflect its intended use. Data-intensive measures, such as the Gini coefficient, cannot be computed from the PUF file. Besides the standard utility measures, such as tabulations, we evaluate the decile dispersion ratio and a regression with the income deciles as regressand.
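As an illustration, the decile dispersion ratio can be computed with a few lines of base R. The sketch below uses the raw data with one row per household and, for brevity, ignores the sampling weights:

# Sketch: decile dispersion ratio (mean income of the top decile
# divided by the mean income of the bottom decile), unweighted
incHH <- fileOrig[!duplicated(fileOrig$IDH), "INCTOTGROSSHH"]
cutpoints <- quantile(incHH, probs = c(0.1, 0.9), na.rm = TRUE)
mean(incHH[incHH >= cutpoints[2]], na.rm = TRUE) /
  mean(incHH[incHH <= cutpoints[1]], na.rm = TRUE)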

To measure the information loss, we should compare the initial data file before any anonymization (including the anonymization for the SUF) with the file after the anonymization for the PUF. Comparing the files directly before and after the PUF anonymization would underestimate the information loss, as this would omit the information loss due to the SUF anonymization. Therefore, in Step 2, we also loaded the raw dataset.

Hierarchical (household) structure

As noted in case study 1, the data have a household structure. For the SUF release, we protected large households by removing them from the dataset. Since some variables are measured at the household level and thus have identical values for each household member, the values of the household variables should be treated in the same way for each household member (see the Section Anonymization of the quasi-identifier household size). Therefore, we first anonymize only the household variables. After this, we merge them with the individual-level variables and then anonymize the individual-level and household-level variables jointly.

Since the data have a hierarchical structure, Steps 6 through 10 are repeated twice: Steps 6a through 10a for the household-level variables and Steps 6b through 10b for the combined dataset. In this way, we ensure that household-level variable values remain consistent across the members of each household and that the household structure cannot be used to re-identify individuals. This is further explained in the Sections Levels of risk and Household structure.

Before continuing to Step 6a, we select the categorical key variables, continuous key variables and any variables selected for use in PRAM routines, as well as the household-level sampling weights in R. We also collect the names of the variables that will not be released. The PRAM variables are the variables selected for the PRAM routine; their choice is further explained in Step 8a. We extract these selected household variables from the SUF dataset and save them as “fileHH”. Listing 11.29 illustrates how these steps are done in R (see also the Section Household structure).

Listing 11.29: Selecting the variables for the household-level anonymization

# Categorical key variables at household level
selectedKeyVarsHH <- c('URBRUR', 'REGION', 'HHSIZE')

# Continuous key variables
numVarsHH <- c('TANHHEXP', 'INCTOTGROSSHH')

# PRAM variables
pramVarsHH <- c('ROOF', 'TOILET', 'WATER', 'ELECTCON',
                'FUELCOOK', 'OWNMOTORCYCLE', 'CAR', 'TV', 'LIVESTOCK')

# Household weight
weightVarHH <- c('WGTPOP')

# Variables not suitable for release in PUF (HH level)
varsNotToBeReleasedHH <- c("OWNAGLAND", "RELIG", "LANDSIZEHA")

# Vector with names of all HH level variables
HHVars <- c('IDH', selectedKeyVarsHH, pramVarsHH, numVarsHH, weightVarHH)

# Create subset of file with only HH level variables
fileHH <- file[,HHVars]

Every household has the same number of entries as it has members (e.g., a household of three is repeated three times in “fileHH”). Before analyzing the household-level variables, we select only one entry per household, as illustrated in Listing 11.30. This is further explained in the Section Household structure. In the same way we extract “fileOrigHH” from “fileOrig”. “fileOrigHH” contains all variables from the raw data, but contains every household only once. We need “fileOrigHH” in Steps 8a and 10a for undoing some perturbative methods used in the SUF file and for computing utility measures from the raw data, respectively.

Listing 11.30: Taking a subset with only households

# Remove duplicated rows based on IDH, one row per household in fileHH
fileHH <- fileHH[which(!duplicated(fileHH$IDH)),]         # SUF file
fileOrigHH <- fileOrig[which(!duplicated(fileOrig$IDH)),] # original dataset

# Dimensions of fileHH
dim(fileHH)
## [1] 1970 16

dim(fileOrigHH)
## [1] 2000 68

The file “fileHH” contains 1,970 households and 16 variables. We are now ready to create our sdcMicro object with the corresponding variables we selected in Listing 11.29. For our case study, we create an sdcMicro object called “sdcHH” based on the data in “fileHH”, which we will use for Steps 6a – 10a (see Listing 11.31).

Note: When the sdcMicro object is created, the sdcMicro package automatically calculates and stores the risk measures for the data.

This leads us to Step 6a.

Listing 11.31: Creating an sdcMicro object for the household variables

# Create initial sdcMicro object for household level variables
sdcHH <- createSdcObj(dat = fileHH, keyVars = selectedKeyVarsHH,
                      pramVars = pramVarsHH, weightVar = weightVarHH,
                      numVars = numVarsHH)
numHH <- length(fileHH[,1]) # number of households

11.2.6 Step 6a: Assessing disclosure risk (household level)

Based on the key variables selected in the disclosure scenarios, we can evaluate the risk at the household level. The PUF risk measures show a lower risk level than in the SUF file after anonymization in case study 1. The reason is that the set of key variables is smaller, since some variables will not be released in the PUF file. Removing (key) variables reduces the risk; it is one of the most straightforward SDC methods.

As a first measure, we evaluate the number of households violating 𝑘-anonymity at the levels 2, 3 and 5. Table 11.20 shows the number of violating households as well as the percentage of the total number of households. Listing 11.32 illustrates how to find these values with sdcMicro. The print() function in sdcMicro shows only the values for thresholds 2 and 3. Values for other thresholds can be calculated manually by summing up the frequencies smaller than the 𝑘-anonymity threshold, as shown in Listing 11.32. The number of violators is already at a low level, due to the prior anonymization of the SUF file and the reduced set of key variables.

Table 11.20: Number and proportion of households violating 𝑘-anonymity

k-anonymity level | Number of HH violating | Percentage of total number of HH
2                 | 0                      | 0.0%
3                 | 18                     | 0.9%
5                 | 92                     | 4.7%

Listing 11.32: Showing the number of households violating 𝑘-anonymity for levels 2, 3 and 5

# Number of observations violating k-anonymity (thresholds 2 and 3)
print(sdcHH)
## Infos on 2/3-Anonymity:
##
## Number of observations violating
## - 2-anonymity: 0
## - 3-anonymity: 18
##
## Percentage of observations violating
## - 2-anonymity: 0.000 %
## - 3-anonymity: 0.914 %
## --------------------------------------------------------------------------

# Calculate sample frequencies and count number of obs. violating k(5)-anonymity
kAnon5 <- sum(sdcHH@risk$individual[,2] < 5)
kAnon5
## [1] 92

# As percentage of total
kAnon5 / numHH
## [1] 0.04670051

It is often useful to view the records of the household(s) that violate 𝑘-anonymity. This might help to find which variables cause the uniqueness of these households; this can then be used later when choosing appropriate SDC methods. Listing 11.33 shows how to access the values of the households violating 3- and 5-anonymity. Not surprisingly, the variable “HHSIZE” is responsible for many of the unique combinations and the origin of much of the risk. This is even the case after removing large households for the SUF release.

Listing 11.33: Showing records of households that violate 𝑘-anonymity

# Show values of key variables of records that violate k-anonymity
fileHH[sdcHH@risk$individual[,2] < 3, selectedKeyVarsHH] # for 3-anonymity
fileHH[sdcHH@risk$individual[,2] < 5, selectedKeyVarsHH] # for 5-anonymity

We also assess the disclosure risk of the categorical variables with the individual and global risk measures as described in the Sections Individual risk and Global risk. In “fileHH” every entry represents a household. Therefore, we use the individual non-hierarchical risk here, where the individual refers in this case to a household. “fileHH” is a subset of the complete dataset and contains only households; contrary to the complete dataset, it has no hierarchical structure. In Step 6b, we evaluate the hierarchical risk in the dataset “file”, which contains both households and individuals. The individual and global risk measures automatically take into account the household weights, which we defined in Listing 11.29. In our file, the global risk measure calculated using the chosen key variables is lower than 0.01% (the smallest reported value is 0.01%; in fact, the global risk is 0.0000642%). This percentage is extremely low and corresponds to 0.13 expected re-identifications. The results are also shown in Listing 11.34. This low figure can be explained by the relatively small sample size of 0.25% of the total population (see case study 1). Furthermore, one should keep in mind that this risk measure is based only on the categorical quasi-identifiers at the household level.

Listing 11.34: Printing global risk measures

print(sdcHH, "risk")
## Risk measures:
##
## Number of observations with higher risk than the main part of the data: 0
## Expected number of re-identifications: 0.13 (0.01 %)

The global risk measure does not provide information about the spread of the individual risk measures. There might be a few households with relatively high risk, while the global (average) risk is low. Therefore we check the highest individual risk, as shown in Listing 11.35; a short sketch for inspecting the full spread of the individual risks follows the listing. The individual risk of the household with the highest risk is 0.1%, which is still very low.

Listing 11.35: Determining the highest individual risk

# Highest individual risk
max(sdcHH@risk$individual[, "risk"])
## [1] 0.001011633
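A minimal sketch for inspecting the spread of the individual risks, using the same risk slot as in Listing 11.35 (the 1% threshold is an arbitrary illustrative value):

# Distribution of the individual (household-level) risk measures
summary(sdcHH@risk$individual[, "risk"])

# Number of households with individual risk above, e.g., 1%
sum(sdcHH@risk$individual[, "risk"] > 0.01)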

Since the selected key variables at the household level are both categorical and numerical, the individual and global risk measures based on frequency counts do not completely reflect the disclosure risk of the entire dataset. When generating the SUF file, we concluded that recoding continuous variables to make them all categorical would likely not satisfy the needs of the SUF users. For the PUF file it is acceptable to recode continuous variables, such as income and expenditures, since PUF content is typically less detailed. In Step 8a we will recode these variables into deciles and convert them into categorical variables. Therefore, we exclude these variables from the risk calculations for now and take them into account when re-measuring risk after anonymization.

11.2.7 Step 7a: Assessing utility measures (household level)

The utility of the data depends not only on the household-level variables, but on the combination of household-level and individual-level variables. Therefore, it is not useful to evaluate all the utility measures selected in Step 5 at this stage, i.e., before anonymizing the individual-level variables. We restrict the initial measurement of utility to those measures that are solely based on the household variables. In our dataset, these are the measures related to income and expenditure and their distributions. The results are presented in Step 10a, together with the results after anonymization, which allows direct comparison. If after the next anonymization step it appears that the data utility has been significantly decreased by the suppression of some household-level variables, we can return to this step.

Note: To analyze the utility loss, the utility measures before anonymization have to be calculated from the raw data and not from the anonymized SUF file.

Not all measures from case study 1 can be computed from the PUF file, since its information content is lower. The set of utility measures we use to evaluate the information loss in the PUF file consists of measures that need less detailed variables. This reflects the lower requirements a PUF user has on the dataset.


11.2.8 Step 8a: Choice and application of SDC methods (household level)

This step is divided into the anonymization of the categorical key variables and of the continuous key variables, since different methods are used for the two sets of variables. As already discussed in Step 4, we do not release all variables in the PUF file: at the household level, “RELIG” (religion of household head), “OWNAGLAND” (land ownership) and “LANDSIZEHA” (plot size in ha) are not released, in addition to the variables already removed for the SUF release. For the sake of illustration, we assume that the variable “RELIG” is too sensitive and the variables “OWNAGLAND” and “LANDSIZEHA” are too identifying.

Categorical variables

We are now ready to move on to the choice of SDC methods for the categorical variables at the household level in our dataset. In the SUF file we already recoded some of the key variables and used local suppression. We only have three categorical key variables at the household level: “URBRUR”, “REGION” and “HHSIZE”. These variables are not suitable for further recoding at this point, since their values cannot be grouped further: “URBRUR” has only two distinct categories and “REGION” has only six non-combinable regions. As noted before, the variable “HHSIZE” can be reconstructed by a headcount per household; therefore, recoding this variable alone does not lead to disclosure control.

Due to the relatively low risk of re-identification based on the three selected categorical household-level variables, it is possible in this case to use an option like local suppression to achieve our desired level of risk. Applying local suppression when the initial risk is relatively low will likely lead to only a few suppressions and thus limit the loss of utility. If, however, the data had been measured to have a relatively high risk, then applying local suppression without previous recoding would likely result in a large number of suppressions and greater information loss. In such cases, efforts such as recoding should be undertaken first: recoding reduces the risk with little information loss and thereby also reduces the number of suppressions needed if local suppression is applied as an additional step.

We apply local suppression to reach 5-anonymity. The chosen level of five is higher than in the SUF release and is based on the PUF release type. This leads to a total of 39 suppressions, all in the variable “HHSIZE”. As explained earlier, suppression of a value of the variable “HHSIZE” does not lead to actual suppression of this information. Therefore, we redo the local suppression, but this time we tell sdcMicro to, if possible, suppress one of the other variables rather than “HHSIZE”. Alternatively, we could remove households with suppressed values of the variable “HHSIZE”, remove large households or split households.

In sdcMicro it is possible to tell the algorithm which variables are more and which are less important when choosing values to suppress (see also the Section Local suppression). To prevent values of the variable “HHSIZE” from being suppressed, we set the importance of “HHSIZE” in the importance vector to the highest level (i.e., 1). We try two different importance vectors: the first where “REGION” is more important than “URBRUR”, and the second with the importance of “REGION” and “URBRUR” swapped. Listing 11.36 shows how to apply local suppression and put importance on the variable “HHSIZE”.

Note: In Listing 11.36 we use the undolast() function in sdcMicro to go one step back after we first applied local suppression with no importance vector.

The undolast() function restores the sdcMicro object to its previous state (i.e., before we applied local suppression), which allows us to rerun the same command, but this time with an importance vector set. The undolast() function can only be used to go one step back.

The suppression patterns of the three different options are shown in Table 11.21. The importance is clearly reflected in the number of suppressions per variable. The total number of suppressions is higher with an importance vector than without one (44 and 73 vs. 39), but 5-anonymity is achieved with no suppressions in the variable “HHSIZE”. This means that we do not have to remove or split households. The variable “REGION” is also the type of variable that should not have any suppressions. From that perspective we choose the third option: this leads to more suppressions in total, but none in “HHSIZE” and as few as possible in “REGION”.


Table 11.21: Number of suppressions by variable (and proportion of total) after local suppression with and without importance vector

Key variable | No importance vector | Importance HHSIZE, URBRUR, REGION | Importance HHSIZE, REGION, URBRUR
URBRUR       | 0 (0.0 %)            | 2 (0.1 %)                         | 61 (3.1 %)
REGION       | 0 (0.0 %)            | 42 (2.1 %)                        | 12 (0.6 %)
HHSIZE       | 39 (2.0 %)           | 0 (0.0 %)                         | 0 (0.0 %)

Listing 11.36: Local suppression with and without importance vector

# Local suppression to achieve 5-anonymity
sdcHH <- localSuppression(sdcHH, k = 5, importance = NULL) # no importance vector
print(sdcHH, "ls")
## Local Suppression:
## KeyVar | Suppressions (#) | Suppressions (%)
## URBRUR |                0 |            0.000
## REGION |                0 |            0.000
## HHSIZE |               39 |            1.980
## ---------------------------------------------------------------------------

# Undo suppressions to see the effect of an importance vector
sdcHH <- undolast(sdcHH)

# Redo local suppression, minimizing the number of suppressions in HHSIZE
sdcHH <- localSuppression(sdcHH, k = 5, importance = c(2, 3, 1))

print(sdcHH, "ls")
## Local Suppression:
## KeyVar | Suppressions (#) | Suppressions (%)
## URBRUR |                2 |            0.102
## REGION |               42 |            2.132
## HHSIZE |                0 |            0.000
## ---------------------------------------------------------------------------

# Undo suppressions to see the effect of a different importance vector
sdcHH <- undolast(sdcHH)

# Redo local suppression, minimizing the number of suppressions in HHSIZE and REGION
sdcHH <- localSuppression(sdcHH, k = 5, importance = c(3, 2, 1))

print(sdcHH, "ls")
## Local Suppression:
## KeyVar | Suppressions (#) | Suppressions (%)
## URBRUR |               61 |            3.096
## REGION |               12 |            0.609
## HHSIZE |                0 |            0.000
## ---------------------------------------------------------------------------

In case study 1 we applied invariant PRAM to the variables “ROOF”, “TOILET”, “WATER”, “ELECTCON”, “FUELCOOK”, “OWNMOTORCYCLE”, “CAR”, “TV” and “LIVESTOCK”, since these variables are not sensitive and were not selected as quasi-identifiers because we assumed that there are no external data sources containing this information that could be used for matching. The values can easily be observed or known by neighbors, however, and are therefore important, together with other variables, for the nosy neighbor scenario. For the PUF release we would like to increase the level of uncertainty by increasing the number of changes. Therefore, we redo PRAM with a different transition matrix. As discussed in the Section PRAM (Post RAndomization Method), the invariant PRAM method has the property that the univariate distributions do not change. To maintain this property, we reapply PRAM to the raw data, rather than to the already PRAMmed variables in the SUF file.

Listing 11.37 illustrates how to apply PRAM. We use the original values and replace the values in the sdcMicro object with them before applying PRAM. We choose the parameter ‘pd’, the lower bound for the probability that a value is not changed, to be relatively low at 0.6. This is lower than the 0.8 used in the SUF file and will lead to a higher number of changes (cf. Listing 11.17). This is acceptable for a PUF file and introduces more uncertainty, as required for a PUF release. Listing 11.37 also shows the number of changed records per variable. Because PRAM is a probabilistic method, we set a seed for the random number generator before applying PRAM to ensure reproducibility of the results.

Note: In some cases the choice of seed matters; a different seed leads to different results.

The seed should not be released, since, combined with the transition matrix, it would allow reconstruction of the original values. The transition matrix can be released: this allows for consistent statistical inference by correcting the statistical methods used, if the researcher has knowledge of the PRAM method (at this point sdcMicro does not allow the user to retrieve the transition matrix).

Listing 11.37: Applying PRAM

# PRAM
set.seed(10987)

# Replace PRAM variables in the sdcMicro object sdcHH with the original raw values
sdcHH@origData[,pramVarsHH] <- fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), pramVarsHH]
sdcHH@manipPramVars <- fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), pramVarsHH]

sdcHH <- pram(obj = sdcHH, pd = 0.6)
## Number of changed observations:
## - - - - - - - - - - -
## ROOF != ROOF_pram : 305 (15.48%)
## TOILET != TOILET_pram : 260 (13.2%)
## WATER != WATER_pram : 293 (14.87%)
## ELECTCON != ELECTCON_pram : 210 (10.66%)
## FUELCOOK != FUELCOOK_pram : 315 (15.99%)
## OWNMOTORCYCLE != OWNMOTORCYCLE_pram : 95 (4.82%)
## CAR != CAR_pram : 255 (12.94%)
## TV != TV_pram : 275 (13.96%)
## LIVESTOCK != LIVESTOCK_pram : 109 (5.53%)

PRAM has changed values within the variables according to the invariant transition matrices. Since we used the invariant PRAM method (see the Section PRAM (Post RAndomization Method)), the absolute univariate frequencies remain approximately unchanged. This is not the case for the multivariate frequencies. In Step 10a we compare the multivariate frequencies of the PRAMmed variables before and after anonymization.
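A quick sketch to illustrate this property, comparing frequencies computed from the raw values stored in the sdcMicro object (as set in Listing 11.37) with the PRAMmed values:

# Univariate frequencies are approximately preserved by invariant PRAM
table(sdcHH@origData$ROOF)      # raw values
table(sdcHH@manipPramVars$ROOF) # PRAMmed values

# Bivariate frequencies are not preserved
table(sdcHH@origData$ROOF, sdcHH@origData$TOILET)
table(sdcHH@manipPramVars$ROOF, sdcHH@manipPramVars$TOILET)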

Continuous variables

We have selected the variables “INCTOTGROSSHH” (total income) and “TANHHEXP” (total expenditure) as numerical quasi-identifiers, as discussed in Step 4. In Step 5 we identified variables of high interest for the users of our data: many users use the data for measuring inequality and expenditure patterns. The noise addition in the SUF file does not protect these variables sufficiently, especially because outliers are not protected. Therefore, we decide to recode total income and total expenditure into deciles.

As with PRAM, we want to compute the deciles from the raw data rather than from the perturbed values in the SUF file. We compute the deciles directly from the raw data and overwrite the values in the sdcMicro object. Subsequently, we compute the mean of each decile from the raw data and replace the values for total income and total expenditure with the mean of the respective decile. In this way the mean of both variables does not change. This approach can be interpreted as univariate microaggregation with very large groups (group size n/10) with the mean as replacement value (see the Section Microaggregation).

The information in the income and expenditure components is too sensitive to release in a PUF, and summing those variables would allow an intruder to reconstruct the totals. PUF users might, however, be interested in the shares. Therefore, we decide to keep the income and expenditure components as proportions of the raw totals, rounded to two digits. The anonymization of the income and expenditure variables is shown in Listing 11.38.

Listing 11.38: Anonymization of income and expenditure variables

# Create bands (deciles) for income and expenditure variables (aggregates)
# based on the original data
decExp <- as.numeric(cut(fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), "TANHHEXP"],
                         quantile(fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), "TANHHEXP"],
                                  (0:10)/10, na.rm = TRUE),
                         include.lowest = TRUE,
                         labels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)))
decInc <- as.numeric(cut(fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), "INCTOTGROSSHH"],
                         quantile(fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), "INCTOTGROSSHH"],
                                  (0:10)/10, na.rm = TRUE),
                         include.lowest = TRUE,
                         labels = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)))

# Mean values of deciles
decExpMean <- round(sapply(split(fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), "TANHHEXP"],
                                 decExp), mean))
decIncMean <- round(sapply(split(fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), "INCTOTGROSSHH"],
                                 decInc), mean))

# Replace with mean value of decile
sdcHH@manipNumVars$TANHHEXP      <- decExpMean[match(decExp, names(decExpMean))]
sdcHH@manipNumVars$INCTOTGROSSHH <- decIncMean[match(decInc, names(decIncMean))]

# Recalculate risks after manually changing values in sdcMicro object
calcRisks(sdcHH)

# Extract data from sdcHH
HHmanip <- extractManipData(sdcHH) # manipulated variables HH

# Keep components of expenditure and income as share of total,
# use original data since previous data was perturbed
compExp <- c('TFOODEXP', 'TALCHEXP', 'TCLTHEXP', 'THOUSEXP',
             'TFURNEXP', 'THLTHEXP', 'TTRANSEXP', 'TCOMMEXP',
             'TRECEXP', 'TEDUEXP', 'TRESTHOTEXP', 'TMISCEXP')
compInc <- c('INCRMT', 'INCWAGE', 'INCFARMBSN', 'INCNFARMBSN',
             'INCRENT', 'INCFIN', 'INCPENSN', 'INCOTHER')
HHmanip <- cbind(HHmanip, round(fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), compExp] /
                                fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), "TANHHEXP"], 2))
HHmanip <- cbind(HHmanip, round(fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), compInc] /
                                fileOrigHH[match(fileHH$IDH, fileOrigHH$IDH), "INCTOTGROSSHH"], 2))

11.2.9 Step 9a: Re-measure risk (household level)

For the categorical variables, we conclude that we have achieved 5-anonymity in the data with local suppression. 5-anonymity also implies 2- and 3-anonymity. The global risk, expressed as the expected number of re-identifications, stayed close to zero, which is very low. Therefore, we conclude that, based on the categorical variables, the data has been sufficiently anonymized. One should keep in mind that the anonymization methods applied here complement the ones used for the SUF.

Note: The methods selected in this case study alone would not be sufficient to protect the data set for a PUF release.

We have reduced the risk of spontaneous recognition of households by removing the variable "LANDSIZEHA" and PRAMming the variables identified as important in the nosy neighbor scenario. Due to the noise introduced in these variables, an intruder cannot know with certainty whether a household that (s)he recognizes in the data is the correct household.

These measures refer only to the categorical variables. To evaluate the risk of the continuous variables, we could use an interval measure or a closest-neighbor algorithm. These risk measures are discussed in the Section Risk measures for continuous variables. We chose an interval measure, since exact value matching is not our largest concern under the assumed scenarios and external data sources. Instead, datasets with similar, but not exactly matching, values could be used for matching. The main concern is therefore that the perturbed values are sufficiently far from the original values, which is what an interval measure captures.

Listing 11.39 shows how to evaluate the interval measure for the variables "INCTOTGROSSHH" and "TANHHEXP" (total income and expenditure). The different values of the parameter 𝑘 in the function dRisk() define the size of the interval around the original value as a function of the standard deviation, as explained in the Section Interval measure. The larger 𝑘, the larger the intervals, the higher the probability that a perturbed value lies in the interval around the original value, and hence the higher the risk measure. The results are satisfactory, especially keeping in mind that there are only 10 distinct values in the dataset (the means of the deciles). All outliers have been recoded. Looking at the proportions of the components, we do not detect any outliers (households with an unusually high or low spending pattern in one component).

Listing 11.39: Measuring risk of re-identification of continuous variables

# Risk evaluation continuous variables
dRisk(sdcHH@origData[,c("TANHHEXP", "INCTOTGROSSHH")],
      xm = sdcHH@manipNumVars[,c("TANHHEXP", "INCTOTGROSSHH")], k = 0.01)
## [1] 0.4619289

dRisk(sdcHH@origData[,c("TANHHEXP", "INCTOTGROSSHH")],
      xm = sdcHH@manipNumVars[,c("TANHHEXP", "INCTOTGROSSHH")], k = 0.02)
## [1] 0.642132

dRisk(sdcHH@origData[,c("TANHHEXP", "INCTOTGROSSHH")],
      xm = sdcHH@manipNumVars[,c("TANHHEXP", "INCTOTGROSSHH")], k = 0.05)
## [1] 0.8258883


11.2.10 Step 10a: Re-measure utility (household level)

The utility of the data has decreased compared to the raw data, mainly because variables were completely removed. Many of the utility measures used in case study 1 are not applicable to the PUF file. However, because we replaced the deciles with their means, the income and expenditure variables can still be used for arithmetic operations. The shares of the income and expenditure components can also still be used, since they are based on the raw data.

We select two additional utility measures: the decile dispersion ratio and the share of total consumption by the poorest decile. The decile dispersion ratio is the ratio of the average income of the top decile to the average income of the bottom decile. Listing 11.40 shows how to compute these measures from the raw data and from the household variables after anonymization. Table 11.22 presents the estimated values. The differences are small and mainly due to the removed households.

Table 11.22: Comparison of utility measures

                                             Raw data   PUF file
Decile dispersion ratio                      24.12      23.54
Share of consumption by the poorest decile   0.0034     0.0035

Listing 11.40: Computation of decile dispersion ratio and share of total consumption by the poorest decile

# Decile dispersion ratio
# raw data
mean(tail(sort(fileOrigHH$INCTOTGROSSHH), n = 200)) /
  mean(head(sort(fileOrigHH$INCTOTGROSSHH), n = 200))
## [1] 24.12152

mean(tail(sort(HHmanip$INCTOTGROSSHH), n = 197)) /
  mean(head(sort(HHmanip$INCTOTGROSSHH), n = 197))
## [1] 23.54179

# Share of total consumption by the poorest decile households
sum(head(sort(fileOrigHH$TANHHEXP), n = 200)) / sum(fileOrigHH$TANHHEXP)
## [1] 0.003411664

sum(head(sort(HHmanip$TANHHEXP), n = 197)) / sum(HHmanip$TANHHEXP)
## [1] 0.003530457

Merging the household- and individual-level variables

The next step is to merge the treated household variables with the untreated individual variables for the anonymization of the individual-level variables. Listing 11.41 shows the steps to merge these files. This also includes the selection of variables used in the anonymization of the individual-level variables. We create the sdcMicro object for the anonymization of the individual variables in the same way as for the household variables in Listing 11.31. Generally, at this stage, the household-level and individual-level variables should be combined and the quasi-identifiers at both levels used together (see the Section Levels of risk). Unfortunately, in our dataset this leads to long computation times. Therefore, we create two sdcMicro objects, one with all key variables ("sdcCombinedAll") and one with only the individual-level key variables ("sdcCombined"). In Step 6b we compare the risk measures for both cases, and in Step 8b we discuss alternative approaches to keeping the complete set of variables. We now repeat Steps 6-10 for the individual-level variables.


Listing 11.41: Merging the files with household- and individual-level variables and creating an sdcMicro object for the anonymization of the individual-level variables

### Select variables (individual level)
selectedKeyVarsIND = c('GENDER', 'REL', 'MARITAL',
                       'AGEYRS', 'EDUCY', 'INDUSTRY1') # list of selected key variables

# sample weight (WGTHH, individual weight)
selectedWeightVarIND = c('WGTHH')

# Household ID
selectedHouseholdID = c('IDH')

# Variables not suitable for release in PUF (IND level)
varsNotToBeReleasedIND <- c("ATSCHOOL", "EDYRSCURRAT")

# All individual level variables
INDVars <- c(selectedKeyVarsIND)

# Recombining anonymized HH data sets and individual level variables
indVars <- c("IDH", "IDP", selectedKeyVarsIND, "WGTHH") # HID and all non HH vars
fileInd <- file[indVars] # subset of file without HHVars
fileCombined <- merge(HHmanip, fileInd, by.x = c('IDH'))
fileCombined <- fileCombined[order(fileCombined[,'IDH'], fileCombined[,'IDP']),]

dim(fileCombined)
## [1] 10068 44

# SDC object with only IND level variables
sdcCombined <- createSdcObj(dat = fileCombined, keyVars = c(selectedKeyVarsIND),
                            weightVar = selectedWeightVarIND, hhId = selectedHouseholdID)

# SDC object with both HH and IND level variables
sdcCombinedAll <- createSdcObj(dat = fileCombined,
                               keyVars = c(selectedKeyVarsIND, selectedKeyVarsHH),
                               weightVar = selectedWeightVarIND, hhId = selectedHouseholdID)

11.2.11 Step 6b: Assessing disclosure risk (individual level)

As a first measure, we evaluate the number of records violating 𝑘-anonymity at the levels 2, 3 and 5. Table 11.23 shows the number of violating individuals as well as the percentage of the total number of records. The second and third columns refer to "sdcCombined" and the fourth and fifth columns to "sdcCombinedAll". We see that combining the individual-level and household-level variables leads to a large increase in the number of 𝑘-anonymity violators. The choice not to include the household-level variables is pragmatically driven by the computation time and can be justified by the different types of variables at the household and individual levels. One could assume that these variables are not available in the same dataset and can therefore not be used simultaneously by an intruder to re-identify individuals.


Table 11.23: Number of records violating 𝑘-anonymity

              sdcCombined                         sdcCombinedAll
k-anonymity   Number of records   Percentage of   Number of records   Percentage of
              violating           total records   violating           total records
2             0                   0.0 %           4,048               40.2 %
3             167                 1.7 %           6,107               60.7 %
5             463                 4.6 %           8,292               82.4 %
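The counts in Table 11.23 can be reproduced from both sdcMicro objects. The following is a minimal sketch, assuming (as in Listing 11.42 below) that the second column of the individual risk slot holds the sample frequencies of the key variable combinations:

# Sketch: count k-anonymity violators in both objects for k = 2, 3 and 5
sapply(c(2, 3, 5), function(k)
  c(sdcCombined    = sum(sdcCombined@risk$individual[, 2] < k),
    sdcCombinedAll = sum(sdcCombinedAll@risk$individual[, 2] < k)))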

The global hierarchical risk measure is 0.095%, which corresponds to approximately 10 expected re-identifications. We use the hierarchical risk measure here, since the re-identification of a single household member would also lead to the re-identification of the other members of the same household. This number is low compared to the number of 𝑘-anonymity violations, because the high sample weights already protect the data to a large extent. Only 24 observations have an individual hierarchical risk higher than 1%, with a maximum of 1.17%. This is mainly due to the lower sample weights of these records. Listing 11.42 shows how to retrieve these measures in R.

Listing 11.42: Risk measures before anonymization

numIND <- length(fileCombined[,1]) # number of individuals

# Number of observations violating k-anonymity
print(sdcCombined)
## Infos on 2/3-Anonymity:
##
## Number of observations violating
## - 2-anonymity: 0
## - 3-anonymity: 167
##
## Percentage of observations violating
## - 2-anonymity: 0.000 %
## - 3-anonymity: 1.659 %
## ---------------------------------------------------------------------------

# Use sample frequencies to count number of obs. violating 5-anonymity
kAnon5 <- sum(sdcCombined@risk$individual[,2] < 5)
kAnon5
## [1] 463

# As percentage of total
kAnon5 / numIND
## [1] 0.04598729

# Global risk on individual level
print(sdcCombined, 'risk')
## Risk measures:
##
## Number of observations with higher risk than the main part of the data: 0
## Expected number of re-identifications: 1.69 (0.02 %)
##
## Information on hierarchical risk:
## Expected number of re-identifications: 9.57 (0.10 %)
## ---------------------------------------------------------------------------

# Number of observations with relatively high risk
dim(fileCombined[sdcCombined@risk$individual[, "hier_risk"] > 0.01,])
## [1] 24 44

# Highest individual risk
max(sdcCombined@risk$individual[, "hier_risk"])
## [1] 0.01169091

11.2.12 Step 7b: Assessing utility measures (individual level)

We evaluate the utility measures discussed in Step 5 based on the raw data (before applying any anonymization measures). The results are presented in Step 10b together with the values after anonymization to allow for direct comparison.

11.2.13 Step 8b: Choice and application of SDC methods (individual level)

In this step, we discuss four different techniques used for anonymization: 1) removing variables from the dataset to be released, 2) recoding categorical variables to reduce the level of detail, 3) local suppression to achieve the required level of 𝑘-anonymity, and 4) randomization of the order of the records in the file. Finally, we discuss some alternative options for treating the household structure in the dataset.

Removing variables

In addition to the variables removed from the dataset for the SUF release (see case study 1), we further reduce the number of variables in the dataset to be released. This is normal practice for PUF releases. Sensitive or identifying variables are removed, which in turn allows other variables to be released at a more detailed level. In a PUF release, the set of key variables should be limited.

In our case, we decide to remove at the individual level the variable "EDYRSCURRAT", as this variable is too identifying (it reveals whether there are school-going children in the household). We keep the variable "EDUCY" (highest level of education attended) for information on education.

Note: As an alternative to removing the variables from the dataset, one could also set all their values to missing. This would allow the user to see the structure and variables contained in the SUF file.
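A minimal sketch of this alternative, assuming the data frame fileCombined and the vector varsNotToBeReleasedIND from Listing 11.41:

# Sketch: keep the columns in the file but blank out their values
fileCombined[, varsNotToBeReleasedIND] <- NA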

Recoding

As noted before, PUF users require a lower level of detail, and we can therefore recode the key variables even further to reduce the disclosure risk. The recoding done in case study 1 is not sufficient for a PUF release. Therefore, we recode most of the categorical key variables from Table 11.19 to reduce the risk and the number of suppressions required by local suppression. Table 11.24 gives an overview of the recodes made. All new categories are formed with the needs of the data user in mind. Listing 11.43 shows how to do this in R; it also shows the value labels and the univariate tabulations of these variables before and after recoding.


Table 11.24: Overview of recodes of categorical variables at individual level

REL (relation to household head): recode 'Father/Mother', 'Grandchild', 'Son/Daughter in law' and 'Other relative' to 'Other relative'; recode 'Domestic help' and 'Non-relative' to 'Other'

MARITAL (marital status): recode 'Married monogamous', 'Married polygamous' and 'Common law, union coutumiere, union libre, living together' to 'Married/living together'; recode 'Divorced/Separated' and 'Widowed' to 'Divorced/Separated/Widowed'

AGEYRS (age in completed years): recode values under 15 to 7 (other values have been recoded for the SUF)

EDUCY (highest level of education completed): recode 'Completed lower secondary (or post-primary vocational education) but less than completed upper secondary', 'Completed upper secondary (or extended vocational/technical education)', 'Post secondary technical' and 'University and higher' to 'Completed lower secondary or higher'

INDUSTRY1 (industry classification): recode to 'primary', 'secondary' and 'tertiary'

Listing 11.43: Recoding the categorical and continuous variables

# Recode REL (relation to household head)
table(sdcCombined@manipKeyVars$REL, useNA = "ifany")
##
##    1    2    3    4    5    6    7    8    9 <NA>
## 1698 1319 4933   52  765   54  817   40   63  327

# 1 - Head, 2 - Spouse, 3 - Child, 4 - Father/Mother, 5 - Grandchild,
# 6 - Son/Daughter in law, 7 - Other relative, 8 - Domestic help, 9 - Non-relative
sdcCombined <- groupVars(sdcCombined, var = "REL", before = c("4", "5", "6", "7"),
                         after = c("7", "7", "7", "7")) # other relative
sdcCombined <- groupVars(sdcCombined, var = "REL", before = c("8", "9"),
                         after = c("9", "9")) # other

table(sdcCombined@manipKeyVars$REL, useNA = "ifany")
##
##    1    2    3    7    9 <NA>
## 1698 1319 4933 1688  103  327

# Recode MARITAL (marital status)
table(sdcCombined@manipKeyVars$MARITAL, useNA = "ifany")
##
##    1    2    3    4    5    6 <NA>
## 3542 2141  415  295  330  329 3016

# 1 - Never married, 2 - Married monogamous, 3 - Married polygamous,
# 4 - Common law, union coutumiere, union libre, living together,
# 5 - Divorced/Separated, 6 - Widowed
sdcCombined <- groupVars(sdcCombined, var = "MARITAL", before = c("2", "3", "4"),
                         after = c("2", "2", "2")) # married/living together
sdcCombined <- groupVars(sdcCombined, var = "MARITAL", before = c("5", "6"),
                         after = c("9", "9")) # divorced/separated/widowed

table(sdcCombined@manipKeyVars$MARITAL, useNA = "ifany")
##
##    1    2    9 <NA>
## 3542 2851  659 3016

# Recode AGEYRS (0-15 years)
table(sdcCombined@manipKeyVars$AGEYRS, useNA = "ifany")
##
##    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14
##  311  367  340  332  260  334  344  297  344  281  336  297  326  299  263
##   20   30   40   50   60   65 <NA>
## 1847 1220  889  554  314  325  188

sdcCombined <- groupVars(sdcCombined, var = "AGEYRS",
                         before = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
                                    "10", "11", "12", "13", "14"),
                         after = rep("7", 15))

table(sdcCombined@manipKeyVars$AGEYRS, useNA = "ifany")
##
##    7   20   30   40   50   60   65 <NA>
## 4731 1847 1220  889  554  314  325  188

sdcCombined <- calcRisks(sdcCombined)

# Recode EDUCY (highest level of education completed)
table(sdcCombined@manipKeyVars$EDUCY, useNA = "ifany")
##
##    0    1    2    3    4    5    6 <NA>
## 1582 4755 1062  330  139   46  104 2050

# 0 - No education, 1 - Pre-school/Primary not completed,
# 2 - Completed primary, but less than completed lower secondary,
# 3 - Completed lower secondary (or post-primary vocational education)
#     but less than completed upper secondary,
# 4 - Completed upper secondary (or extended vocational/technical education),
# 5 - Post secondary technical, 6 - University and higher
sdcCombined <- groupVars(sdcCombined, var = "EDUCY", before = c("3", "4", "5", "6"),
                         after = c("3", "3", "3", "3")) # completed lower secondary or higher

table(sdcCombined@manipKeyVars$EDUCY, useNA = "ifany")
##
##    0    1    2    3 <NA>
## 1582 4755 1062  619 2050

# Recode INDUSTRY1 (industry classification)
table(sdcCombined@manipKeyVars$INDUSTRY1, useNA = "ifany")
##
##    1    2    3    4    5    6    7    8    9   10 <NA>
## 5300   16  153    2   93  484   95   17   70  292 3546

# 1 - Agriculture and Fishing, 2 - Mining, 3 - Manufacturing,
# 4 - Electricity and Utilities, 5 - Construction, 6 - Commerce,
# 7 - Transportation, Storage and Communication,
# 8 - Financial, Insurance and Real Estate,
# 9 - Services: Public Administration, 10 - Other Services, 11 - Unspecified
sdcCombined <- groupVars(sdcCombined, var = "INDUSTRY1", before = c("1", "2"),
                         after = c("1", "1")) # primary
sdcCombined <- groupVars(sdcCombined, var = "INDUSTRY1", before = c("3", "4", "5"),
                         after = c("2", "2", "2")) # secondary
sdcCombined <- groupVars(sdcCombined, var = "INDUSTRY1", before = c("6", "7", "8", "9", "10"),
                         after = c("3", "3", "3", "3", "3")) # tertiary

table(sdcCombined@manipKeyVars$INDUSTRY1, useNA = "ifany")
##
##    1    2    3 <NA>
## 5316  248  958 3546

Local suppression

The recoding has already reduced the risk considerably. We use local suppression to achieve the required level of 𝑘-anonymity. Generally, the required level of 𝑘-anonymity for PUF files is 3 or 5. In this case study, we require 5-anonymity. Listing 11.44 shows the suppression pattern without specifying an importance vector. All suppressions are made in the variable "AGEYRS". This is the variable with the highest number of distinct values and hence the one considered first by the algorithm. We tried different suppression patterns by specifying importance vectors, but decided that the pattern without an importance vector yields the best result; it is also the result with the lowest total number of suppressions. Less than 1 percent suppression in the age variable is acceptable. We could reduce this number further by additional recoding of the variable "AGEYRS".

Listing 11.44: Local suppression to reach 5-anonymity

# Local suppression without importance vector
sdcCombined <- localSuppression(sdcCombined, k = 5, importance = NULL)

# Number of suppressions per variable
print(sdcCombined, "ls")
## Local Suppression:
##     KeyVar | Suppressions (#) | Suppressions (%)
##     GENDER |                0 |            0.000
##        REL |                0 |            0.000
##    MARITAL |                0 |            0.000
##     AGEYRS |               91 |            0.904
##      EDUCY |                0 |            0.000
##  INDUSTRY1 |                0 |            0.000
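For reference, an importance vector can be passed to localSuppression() to steer which variables absorb the suppressions. The following is a minimal sketch with illustrative ranks (these are not the vectors actually tried in this case study); to our understanding, variables given a low rank receive fewer suppressions:

# Sketch with an illustrative importance vector for the key variables
# GENDER, REL, MARITAL, AGEYRS, EDUCY, INDUSTRY1 (low rank = fewer suppressions)
impVec <- c(3, 4, 5, 1, 2, 6)
sdcCombinedImp <- localSuppression(sdcCombined, k = 5, importance = impVec)
print(sdcCombinedImp, "ls")  # compare the resulting suppression pattern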

Randomization of order of records

The records in the dataset are ordered by region and household ID. Due to the way the household IDs were assigned, there is a certain geographical ordering of the households within each region. Intruders could use this structure to reconstruct suppressed values. To prevent this, we randomly reorder the records within the regions. Listing 11.45 shows how to do this in R. We first count the number of records per region.

Note: Some records have their region value suppressed, so we include the count of NAs.

Subsequently, we randomly draw household IDs in such a way that the regional division is respected. Finally, we sort the file by the new, randomized household ID ("IDH"). Households with suppressed values for "REGION" will be last in the reordered file. Before randomizing the order, we extract the data from the sdcMicro object "sdcCombined", as shown in Listing 11.45.


Listing 11.45: Randomizing the order of records within regions

# Randomize order of households in dataAnon and recode IDH to a random
# number (sort file by region)
set.seed(97254)

# Sort by region
dataAnon <- dataAnon[order(dataAnon$REGION),]

# Number of households per region
hhperregion <- table(dataAnon[match(unique(dataAnon$IDH), dataAnon$IDH), "REGION"],
                     useNA = "ifany")

# Randomized IDH (household ID)
randomHHid <- c(sample(1:hhperregion[1], hhperregion[1]),
                unlist(lapply(1:(length(hhperregion)-1),
                  function(i){sample((sum(hhperregion[1:i]) + 1):sum(hhperregion[1:(i+1)]),
                                     hhperregion[(i+1)])})))

dataAnon$IDH <- rep(randomHHid, table(dataAnon$IDH)[match(unique(dataAnon$IDH),
                                 as.numeric(names(table(dataAnon$IDH))))])

# Sort by IDH (and region)
dataAnon <- dataAnon[order(dataAnon$IDH),]

Alternative options for dealing with household structure

In Step 6b we compared the disclosure risk for two cases: one with only individual-level key variables and one with the individual-level and household-level key variables combined. We decided to use only the individual-level key variables to reduce the computation time and justified this choice by arguing that intruders cannot use household-level and individual-level variables simultaneously. This might not always be the case. Therefore, we explore other options to reduce the risk when taking both individual-level and household-level variables into account. We present two options: removing the household structure, and using options in the local suppression algorithm.

Removing household structure

We consider the risk emanating from the household structure in the dataset to be very high. We can remove the hierarchical household structure completely and treat all variables at the individual level. Besides removing the household ID ("IDH"), this entails treating variables that could be used to reconstruct households, for instance "REL" (relation to household head), "HHSIZE" (household size), and any of the household-level variables, such as income and expenditure. Not all household-level variables need to be treated, however. For example, "REGION" is a household-level variable, but the probability that it leads to the reconstruction of a household is low. We also need to reorder the records in the file, as they are sorted by household. Note that by removing the household structure, we interpret all variables as individual-level variables for measuring disclosure risk. This reduces the need for recoding and suppression, since the hierarchical risk disappears. The reason we did not opt for this approach is the loss of utility for the user: the household structure is an important feature of the data and should be kept in the PUF file. A rough sketch of this alternative follows.
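For illustration only (this alternative was not applied in the case study), the household structure could be stripped along the following lines; the variable selection is an illustrative assumption and the random shuffle is a simplified stand-in for the reordering discussed above:

# Sketch: drop the household ID and variables revealing household membership
varsHHstructure <- c("IDH", "REL", "HHSIZE")  # illustrative selection
fileNoHH <- fileCombined[, !(names(fileCombined) %in% varsHHstructure)]

# Reorder records so households can no longer be inferred from the sort order
set.seed(12345)  # illustrative seed
fileNoHH <- fileNoHH[sample(nrow(fileNoHH)), ]

# Treat all remaining variables as individual level for risk measurement
# (no hhId argument, so no hierarchical risk is computed)
sdcNoHH <- createSdcObj(dat = fileNoHH, keyVars = selectedKeyVarsIND,
                        weightVar = selectedWeightVarIND)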

Using different options for local suppression

The long running time is mainly due to the local suppression algorithm. In the Section Local suppression we discuss options to reduce the running time of local suppression in the case of many key variables. The all-𝑚 approach reduces the running time by first considering subsets of the complete set of key variables. This reduces the complexity of the problem and leads to lower computation times. However, the total number of suppressions made is likely to be higher. Also, unless explicitly specified, it is not guaranteed that the required level of 𝑘-anonymity is achieved on the complete set of key variables. It is therefore important to check the results, as in the sketch below.
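In sdcMicro, the all-𝑚 approach can be tried via the combs argument of localSuppression(), which enforces 𝑘-anonymity on subsets of the key variables of the given sizes. A minimal sketch with illustrative parameter values:

# Sketch: all-m approach; first 5-anonymity on all subsets of 3 key variables,
# then on the full set of 6 key variables (subset sizes are illustrative)
sdcCombinedAll <- localSuppression(sdcCombinedAll, k = 5, combs = c(3, 6))

# Always verify the k-anonymity achieved on the complete set of key variables
print(sdcCombinedAll)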


11.2.14 Step 9b: Re-measure risk

We re-evaluate the risk measures selected in Step 6. Table 11.25 shows that local suppression has, not surprisingly, reduced the number of individuals violating 5-anonymity to 0. The global hierarchical risk was reduced to 0.02%, which corresponds to approximately 2 correct re-identifications. The highest individual hierarchical re-identification risk is 0.2%. These risk levels are acceptable for a PUF release. Furthermore, the recoding has removed any unusual combinations in the data.

Note: The risk may be underestimated by excluding the household level variables.

Table 11.25: k-anonymity violations

k-anonymity   Number of records violating   Percentage
2             0                             0.0 %
3             0                             0.0 %
5             0                             0.0 %
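These figures can be re-derived from the sdcMicro object after suppression; a minimal sketch (calcRisks() is called defensively here, in case values were changed manually):

# Sketch: re-evaluate risk measures after local suppression
sdcCombined <- calcRisks(sdcCombined)
print(sdcCombined)           # k-anonymity violations
print(sdcCombined, "risk")   # global and hierarchical risk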

11.2.15 Step 10b: Re-measure utility

We compare (cross-)tabulations before and after anonymization; these are illustrated in the R code accompanying this case study. We note that due to the recoding in Step 8b, the detail in the variables is reduced. This reduces the number of necessary suppressions and is acceptable for a public use file.

11.2.16 Step 11: Audit and reporting

In the audit step, we check whether the data allow for the reproduction of published figures from the original dataset and whether relationships between variables and other data characteristics are preserved in the anonymization process. In short, we check whether the dataset is valid for analytical purposes. In this case, no published figures based on the dataset need to be reproducible from the anonymized data.

In Step 2, we explored the data characteristics and relationships between variables. These characteristics and relationships have been largely preserved, since we took them into account when choosing the anonymization methods. Since values of the variable "AGEYRS" were not perturbed, but only recoded and suppressed, we did not introduce unlikely combinations, such as a 60-year-old individual enrolled in primary education. Also, by separating the anonymization process into two parts, one for household-level variables and one for individual-level variables, the values of variables measured at the household level agree for all members of each household.
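A simple spot check for such implausible combinations is to cross-tabulate the recoded age and education variables; a minimal sketch using the manipulated key variables in the sdcMicro object:

# Sketch: cross-tabulate recoded age and education to spot implausible pairs,
# e.g., young children combined with high levels of completed education
table(sdcCombined@manipKeyVars$AGEYRS,
      sdcCombined@manipKeyVars$EDUCY, useNA = "ifany")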

Furthermore, we drafted two reports, one internal and one external, on the anonymization of the case study dataset. The internal report includes the methods used, the risk before and after anonymization, and the reasons for the selected methods and their parameters. The external report focuses on the changes in the data and the loss in utility. The focus there should be on the number of suppressions as well as on the perturbative methods (PRAM). This is described in the previous steps.

Note: When creating a PUF, some loss of information is inevitable, and it is very important to make users aware of these changes by describing them in a report that accompanies the data.

Appendix C provides examples of an internal and an external report on the anonymization process of this dataset. Depending on the users and readers of the reports, the content may differ.


Note: The report() function in sdcMicro is of limited use at this point, since it would only report on the SDC measures applied in this second case study. The report, however, should document the entire process, including the measures applied in case study 1.

11.2.17 Step 12: Data release

The final step is the release of the anonymized dataset together with the external report. Listing 11.46 shows how to export the anonymized dataset as a STATA file. The Section Read functions in R presents functions for exporting files in other data formats.

Listing 11.46: Exporting the anonymized PUF file

# Create STATA file (write.dta() is provided by the foreign package)
library(foreign)
write.dta(dataframe = dataAnon, file = 'Case2DataAnon.dta', convert.dates = TRUE)

