Timing, data access types and degree of anonymization in microdata dissemination
…Rajiv Ranjan
NISR/UNDP-Rwanda
Reflections on data confidentiality, privacy, and curation
Regional Workshop on Microdata Dissemination Policy
Kigali, Rwanda: 27 – 29 August 2014
Scheme of the presentation
• Confidentiality concerns
• Access issues
• Legal basis
• Assurance
• Challenges
• Harmony
• Governance
• Practices
Confidentiality
Caveat
Microdata dissemination must maintain confidentiality of individual units: people, households or enterprises.
Individual data collected by statistical agencies for statistical compilation, whether they refer to natural or legal persons, are to be strictly confidential and used exclusively for statistical purposes.
Principle 6
United Nations Fundamental Principles of Official Statistics
http://unstats.un.org/unsd/dnss/gp/fundprinciples.aspx
Legal basis in Rwanda
Source: Law on the organisation of statistical activities in Rwanda. Chapter VI: Statistical Confidentiality, Article 17: Prohibited dissemination of information (N° 45/2013 of 16/06/2013)
Data collected by the institutions of the national statistical system through surveys or any other method of collection are protected by statistical confidentiality. Statistical confidentiality implies that the dissemination of such data as well as statistical information which can be calculated from them, shall be conducted in a way that those who provided it are not identified whether directly or indirectly.
Access
Access benefits
• Fosters diversity of research
• Increases transparency and accountability
• Mitigates duplication of data collection work
• Increases the quality of data
https://unstats.un.org/unsd/accsub-public/microdata.pdf
Access assurance in Rwanda
The anonymous basic databases on individuals and other institutions shall be accessible to researchers who, however, shall be committed to: 1° make a written note, that they shall not communicate to any person the contents of such databases without the written authorization of the National Institute of Statistics of Rwanda; 2° give to the National Institute of Statistics of Rwanda, the findings of their research.
Source: Law on the organisation of statistical activities in Rwanda. Chapter VI: Statistical Confidentiality, Article 19: Accessibility to anonymous basic database not to be published (N° 45/2013 of 16/06/2013)
Challenges
Balancing act
Disclosure risks vs. information loss
• In practice, the more the disclosure risks are reduced, the lower will be the expected utility of the microdata sets.
• The objective remains to deal with the trade-off between disclosure risks and information loss.
Source: Chris Skinner: Statistical Disclosure Control for Survey Data: http://personal.lse.ac.uk/skinnecj/SDC%20for%20survey%20data%20S3RI.pdf
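The trade-off above can be sketched with a toy calculation. This is only an illustration under stated assumptions: the records are invented, the share of records with a unique key combination stands in for disclosure risk, and the share of distinct age values lost to recoding stands in for information loss. Neither proxy is taken from the cited source.

```python
from collections import Counter

# Hypothetical toy records: (age, district, occupation). Not real survey data.
RECORDS = [
    (23, "Gasabo", "teacher"),
    (27, "Gasabo", "nurse"),
    (31, "Kicukiro", "farmer"),
    (34, "Kicukiro", "farmer"),
    (38, "Nyarugenge", "trader"),
    (45, "Nyarugenge", "trader"),
    (47, "Gasabo", "teacher"),
    (52, "Kicukiro", "nurse"),
]


def recode_age(age, width):
    """Generalize an exact age into a band of the given width (width 1 = unchanged)."""
    low = (age // width) * width
    return (low, low + width - 1)


def share_unique(keys):
    """Assumed risk proxy: share of records whose key combination occurs only once."""
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


def evaluate(width):
    """Return (risk proxy, information-loss proxy) for a given age-band width."""
    # Age band and district act as the indirect identifiers; occupation is the payload.
    keys = [(recode_age(age, width), district) for age, district, _ in RECORDS]
    risk = share_unique(keys)
    # Assumed loss proxy: fraction of distinct age values merged away by recoding.
    distinct_before = len({age for age, _, _ in RECORDS})
    distinct_after = len({recode_age(age, width) for age, _, _ in RECORDS})
    loss = 1 - distinct_after / distinct_before
    return risk, loss


if __name__ == "__main__":
    for width in (1, 10, 100):
        risk, loss = evaluate(width)
        print(f"band width {width:>3}: risk proxy {risk:.3f}, loss proxy {loss:.3f}")
```

Wider age bands lower the risk proxy and raise the loss proxy, which is the balancing act the slide describes.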
Challenges
[Emerging mash-ups]
Datasets are being reused and combined with other datasets in ways never before thought possible, including for uses that go beyond the original intent.
[Growing motives]
While there are promising research efforts underway to protect privacy, far more advanced efforts are presently in use to re-identify seemingly “anonymous” data.
[Improved access]
Easier access to datasets has improved their discoverability, and the data could be used to re-identify previously de-identified datasets.
http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_5.1.14_final_print.pdf
Complicating the challenges
Disclosure risks vs. information loss
Machine readability, open standards and free for reuse
Post 2015
Images: (1) From the cover of ‘Open Data Now’, a book by Joel Gurin exploring how open data within public records will create new jobs, applications and other technology innovations. http://www.opendatanow.com (2) A project at PARIS21 on data revolution for post 2015 SDGs. http://www.paris21.org/node/1654
Harmony
Coexistence
“There is nothing inherently contradictory about hiding one piece of information while revealing another, so long as the information we want to hide is different from the information we want to disclose.”
- Felix T. Wu in Defining Privacy and Utility in Data Sets.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2031808
Though not easy, it is possible and desirable for openness and privacy to coexist.
Decision factors
Disclosure risks vs. information loss
Sensitivity of the dataset
Usage intent
Enabling dimensions
Governance
• Legal basis
• Policy backing
• Institutionalization
Practices
• Asserting user types
• Controlling release timing
• Categorizing access methods
• Varying the degree of anonymization
Tools & Methods (1) for anonymization
• Tools: sdcMicro, sdcMicroGUI
• Methods: deterministic, probabilistic
1: http://cran.r-project.org/web/packages/sdcMicro/vignettes/sdc_guidelines.pdf
Governance
Law: Law on the organisation of statistical activities in Rwanda (Feb 14, 2006)
Policy: Microdata Release Policy @ National Institute of Statistics of Rwanda
Institutionalization: Microdata Release Committee & Data curation team @ NISR
Practices
User types served
Govt. (Policy makers and researchers)
International development agencies
Research and academic institutions
Students and professors
Others (scientific researchers)
Release timing
Within 6 – 24 months after the 1st release of aggregated data from a survey/census
Examples (months, on a scale up to 24 months):
• DHS 2010: 7
• EICV(3) 2010-2011: 7
• Census 2012: ?
• Seasonal Agri Survey 2013: ?
EICV: Integrated Household Living Conditions Survey
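The release-timing rule can be checked with a little date arithmetic. This is a minimal sketch, assuming the rule is read as an inclusive window of 6 to 24 whole calendar months after the first release of aggregated data. The function names and the dates in the usage note are hypothetical, not actual NISR release dates.

```python
from datetime import date

# Window bounds in months, as stated on the slide; treating both ends as
# inclusive is an assumption of this sketch.
MIN_MONTHS, MAX_MONTHS = 6, 24


def months_between(start, end):
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the last month is not yet complete
    return months


def within_release_window(aggregate_release, microdata_release):
    """True if the microdata release falls inside the assumed 6 to 24 month window."""
    elapsed = months_between(aggregate_release, microdata_release)
    return MIN_MONTHS <= elapsed <= MAX_MONTHS
```

For instance, with made-up dates, `within_release_window(date(2012, 2, 15), date(2012, 9, 15))` corresponds to a 7-month gap and falls inside the window.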
Access methods
Web-based distribution
Types of files/access (total no. of studies = 24)
• Open access (no restriction)
• Direct access or Public Use Files (some restrictions on use, but no screening of users)
• Research Use Files (or Scientific Use Files, or Licensed Files)
• Availability only in an enclave
• No access authorized
• Data not available
• Data available from external repo: 4
Counts shown for the other categories: 16, 1, 3
Degree of anonymization
Removing or modifying the identifying variables contained in the microdata
The usual practice at NISR is to release microdata as Public Use Files.
For example, in EICV3, the methods applied for anonymizing data were:
• Suppressing/deleting direct identifiers (e.g. name of the head of household) and a few indirect identifiers (e.g. sub-national administrative boundaries)
• Generalizing/replacing (recoding) some indirect identifiers with less specific but semantically consistent groupings of observation values (e.g. place of birth, occupation)
• Perturbing/distorting some indirect identifiers by randomizing the values (e.g. clusters)
Integrated Household Living Conditions Survey (EICV): EICV3 was conducted in 2010-2011
Variations in the degree of anonymization (and resulting access files/types) may be considered depending on the sensitivity of the dataset and its intended use.
e.g.: Recoding (Occupation)
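The three kinds of method named on this slide (suppressing, generalizing/recoding, perturbing) can be sketched on toy records. This is an illustration only: the field names, the occupation groupings, and the random relabelling of cluster codes are assumptions made for the example, not the procedure NISR applied to EICV3, and the sketch does not use sdcMicro.

```python
import random

# Hypothetical household records; every name and value is made up.
HOUSEHOLDS = [
    {"head_name": "A. Example", "sector": "Sector-01",
     "occupation": "primary school teacher", "cluster": 101},
    {"head_name": "B. Example", "sector": "Sector-01",
     "occupation": "nurse", "cluster": 101},
    {"head_name": "C. Example", "sector": "Sector-02",
     "occupation": "secondary school teacher", "cluster": 102},
    {"head_name": "D. Example", "sector": "Sector-03",
     "occupation": "maize farmer", "cluster": 103},
]

# Hypothetical recoding table: specific occupations mapped to broader groups.
OCCUPATION_GROUPS = {
    "primary school teacher": "education",
    "secondary school teacher": "education",
    "nurse": "health",
    "midwife": "health",
}


def suppress(record, fields=("head_name", "sector")):
    """Suppression: drop direct identifiers and selected indirect identifiers."""
    return {k: v for k, v in record.items() if k not in fields}


def generalize(record, mapping=OCCUPATION_GROUPS):
    """Recoding: replace a specific occupation with a less specific group."""
    out = dict(record)
    out["occupation"] = mapping.get(record["occupation"], "other")
    return out


def perturb_clusters(records, rng):
    """Perturbation (assumed variant): randomly relabel cluster codes so the
    released code no longer matches the original, while records from the
    same cluster still share a code."""
    codes = sorted({r["cluster"] for r in records})
    shuffled = codes[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(codes, shuffled))
    return [dict(r, cluster=relabel[r["cluster"]]) for r in records]


def anonymize(records, seed=0):
    """Apply suppression, recoding and perturbation in sequence; input is not mutated."""
    rng = random.Random(seed)
    suppressed = [suppress(r) for r in records]
    recoded = [generalize(r) for r in suppressed]
    return perturb_clusters(recoded, rng)
```

Calling `anonymize(HOUSEHOLDS)` returns new records without `head_name` or `sector`, with occupations recoded to groups. A stronger or weaker variant could be chosen per dataset, in line with the slide's point about varying the degree of anonymization.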
@rajiv_r_in…
Thank you!
“87% of the U.S. population can be uniquely identified by date of birth + gender + zip”
Latanya Sweeney, CMU
latanyasweeney.org