Alternative Approaches to Data Dissemination and Data Sharing

Jerome Reiter, Duke University

jerry@stat.duke.edu


Two general settings

Agency seeks to release confidential data to the public.

Multiple agencies seek to improve analyses by sharing their confidential data.

For both settings, agencies seek strategies that:

i) do not reveal identities or sensitive attributes,

ii) are useful for a wide range of analyses,

iii) are easy for analysts and agencies to use.

Some alternative approaches

Remote access servers

Synthetic (i.e. simulated) data

Secure computation techniques

Definition of servers

Server is any system that

(i) allows users to submit queries for output from statistical analyses of microdata,

but

(ii) does not give direct access to microdata.

Table Servers / Model Servers

Queries and responses

Queries to model server:

Users request results from fitting a statistical model to the data.

Response from model server:

Answerable query: model output.

Unanswerable query: no results.

Model output also should include diagnostics.
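The query/response behavior above can be sketched as a toy model server. This is a minimal illustration, not a description of any production system: the function name `query_model_server`, the `MIN_ROWS` threshold, and the rule "refuse fits on small subsets" are all assumptions made for the example, and the simulated data stand in for confidential microdata.

```python
import random

MIN_ROWS = 50  # hypothetical disclosure rule: refuse fits on small subsets


def query_model_server(x, y, rows=None, min_rows=MIN_ROWS):
    """Fit y = a + b*x on the confidential data; return summaries or None."""
    if rows is not None:
        x = [x[i] for i in rows]
        y = [y[i] for i in rows]
    n = len(x)
    if n < min_rows:
        return None  # unanswerable query: no results
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or sst == 0:
        return None  # degenerate data: no results
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Only aggregate output and coarse diagnostics leave the server,
    # never the records or the raw residuals.
    return {
        "intercept": intercept,
        "slope": slope,
        "residual_variance": sse / (n - 2),
        "r_squared": 1.0 - sse / sst,
        "n": n,
    }


# Simulated stand-in for confidential microdata.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 0.5) for xi in x]

full = query_model_server(x, y)                  # answerable: model output
tiny = query_model_server(x, y, rows=range(10))  # unanswerable: None
```

A size threshold alone does not address the "smart query" risks (differencing over subsets, transformations); it only shows where such checks would sit.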

Challenges in developing model servers

Non-statistical: operation costs, server security, etc.

Statistical:
-- Disclosure risks from smart queries (e.g., subsets, transformations).
-- Inferential disclosure risks.
-- Enabling complex model fitting.

Synthetic data

Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that:

No unit in released data has sensitive data from actual unit in population.

Released data look like actual data.

Statistical procedures valid for original data are valid for released data.

Generating fully synthetic data

Randomly sample new units from sampling frame.

Impute survey variables for new units using models fit from observed data.

Repeat multiple times and release datasets.
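The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions: one frame variable `x`, one survey variable `y`, a linear imputation model, and plug-in parameter estimates (a full implementation would also draw the model parameters from their posterior distribution). The names `fit_line` and `fully_synthetic` and all simulated values are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, (sse / (n - 2)) ** 0.5


def fully_synthetic(frame_x, obs_x, obs_y, n_syn, m, rng):
    """Return m synthetic datasets, each a list of (x, y) pairs."""
    a, b, sd = fit_line(obs_x, obs_y)       # model fit from observed data
    datasets = []
    for _ in range(m):                       # repeat multiple times
        new_x = rng.sample(frame_x, n_syn)   # new units from the frame
        # Impute the survey variable for every sampled unit.
        datasets.append([(xi, a + b * xi + rng.gauss(0, sd)) for xi in new_x])
    return datasets


rng = random.Random(1)
frame_x = [rng.uniform(20, 70) for _ in range(5000)]  # known for all frame units
obs_x = rng.sample(frame_x, 300)                       # surveyed units
obs_y = [10 + 0.8 * xi + rng.gauss(0, 5) for xi in obs_x]

released = fully_synthetic(frame_x, obs_x, obs_y, n_syn=300, m=5, rng=rng)
```

Every released `y` is simulated, so no released record carries an observed survey value.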

Modification: Release partially synthetic data

Little (1993, JOS): create multiple, partially synthetic datasets for public release so that:

Released data comprise mix of observed and synthetic values.

Released data look like actual data.

Statistical procedures valid for original data are valid for released data.
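A minimal sketch of partial synthesis, assuming a single sensitive variable `y`, a linear model for `y` given `x`, and a hypothetical selection rule that flags units with `x >= 65` as high risk. Flagged units get a simulated `y`; all other records are released as observed. The function name and the simulated data are illustrative only.

```python
import random


def partially_synthetic(x, y, replace, m, rng):
    """Return m copies of (x, y) with y re-drawn only where replace[i] is True."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    sd = (sse / (n - 2)) ** 0.5
    datasets = []
    for _ in range(m):
        datasets.append([
            # Synthetic value for flagged units, observed value otherwise.
            (xi, a + b * xi + rng.gauss(0, sd)) if flag else (xi, yi)
            for xi, yi, flag in zip(x, y, replace)
        ])
    return datasets


rng = random.Random(2)
x = [rng.uniform(20, 70) for _ in range(400)]
y = [10 + 0.8 * xi + rng.gauss(0, 5) for xi in x]
replace = [xi >= 65 for xi in x]  # hypothetical rule for "selected units"

released = partially_synthetic(x, y, replace, m=3, rng=rng)
```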

Existing applications

Kennickell (1997, Record Linkage Techniques): Replace sensitive values for selected units.

Liu and Little (2002, JSM Proceedings): Replace values of key identifiers for selected units.

Abowd and Woodcock (2001, Confidentiality, Disclosure, and Data Access): Replace all values of sensitive variables.

Sample of research agenda

Implement and compare various data generation approaches on genuine data in production settings.

Evaluate risk/usefulness profiles on genuine data in production settings.

Develop packaged synthesizers for data disseminators to use.

Secure computations

Horizontally Partitioned: Agencies have different records but same variables.

Purely Vertically Partitioned: Agencies have same records but different variables.

Partially Overlapping, Vertically Partitioned: Agencies have different records and different variables, with some common records and variables.

Horizontally Partitioned Data: Secure Summation

Secure summation
-- shares sums without sharing data.
-- allows regressions, clustering, classification.
-- assumes semi-honest agencies.
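To see why sharing sums is enough for regression, consider least squares with horizontally partitioned data. The notation here is an assumption for illustration: agency $k = 1, \dots, K$ holds design matrix $X_k$ and response vector $y_k$ for its own records, and $X$, $y$ denote the stacked data that no single agency sees.

```latex
X^\top X = \sum_{k=1}^{K} X_k^\top X_k ,
\qquad
X^\top y = \sum_{k=1}^{K} X_k^\top y_k ,
\qquad
\hat{\beta} = \left( X^\top X \right)^{-1} X^\top y .
```

Each entry of $X^\top X$ and $X^\top y$ is a sum of agency-level quantities, so the pooled estimate $\hat{\beta}$ can be computed from securely summed totals without any agency revealing its records.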

Horizontal Partitioning: Secure summation

Obtain $\sum_i x_i$ without sharing individual values $x_i$.

1. Agency A generates a random number R and passes (x + R) to Agency B.
2. Agency B adds its x to this value and passes the sum to Agency C.
3. Process continues until all agencies have added their x.
4. Agency A subtracts R from the sum.
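The four steps can be simulated in a single process as below. This is a sketch under stated assumptions: values are non-negative integers, arithmetic is done modulo a bound `MODULUS` assumed larger than any possible total (a common way to make the masked values uninformative, not something stated on the slide), and all agencies follow the protocol. The function name `secure_sum` and the example values are hypothetical.

```python
import secrets

MODULUS = 2**64  # hypothetical bound, assumed larger than any possible total


def secure_sum(values, modulus=MODULUS):
    """Simulate the ring protocol; values[0] belongs to the initiating Agency A."""
    r = secrets.randbelow(modulus)       # Agency A's secret random number R
    running = (values[0] + r) % modulus  # step 1: A passes x + R
    passed = [running]                   # messages seen by the receiving agencies
    for x in values[1:]:                 # steps 2-3: each agency adds its x
        running = (running + x) % modulus
        passed.append(running)
    total = (running - r) % modulus      # step 4: A subtracts R
    return total, passed


counts = [120, 45, 310]                  # hypothetical agency-level values
total, passed = secure_sum(counts)       # total == 475
```

Each intermediate message in `passed` is masked by R, so a receiving agency learns nothing about the values added before it unless agencies collude.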

Purely vertical partitioning

Secure dot/matrix product
-- shares dot/matrix products without sharing data.
-- allows regressions, clustering, classification.
-- assumes semi-honest agencies.

Synthetic data approaches
-- share synthetic copies of data across agencies.
-- allow any analysis when distributions used to generate data are accurate.
-- generate public use data file.

A research agenda for secure computation methods

- How to specify models without viewing data?

- What if sophisticated models are needed?

- How to incorporate matching errors, differences in data quality and definitions?

- How to account for disclosure risks from models that “fit too well”?

Some References

Remote access servers

- Rowland (2003, NAS Panel on Data Access).
- Gomatam, Karr, Reiter, and Sanil (2005, Stat. Science).

Synthetic data

- Raghunathan, Reiter, and Rubin (2003, JOS).
- Reiter (2003, Surv. Meth.; 2005, JRSSA).

Secure computation

- Benaloh (1987, CRYPTO86).
- Karr, Lin, Sanil, and Reiter (2005, NISS tech. rep.).