Alternative Approaches to Data Dissemination and Data Sharing

Jerome Reiter, Duke University

jerry@stat.duke.edu

Two general settings

An agency seeks to release confidential data to the public.

Multiple agencies seek to improve analyses by sharing their confidential data.

For both settings, agencies seek strategies that:

i) do not reveal identities or sensitive attributes,

ii) are useful for a wide range of analyses,

iii) are easy for analysts and agencies to use.

Some alternative approaches

Remote access servers

Synthetic (i.e. simulated) data

Secure computation techniques

Definition of servers

A server is any system that

(i) allows users to submit queries for output from statistical analyses of microdata,

but

(ii) does not give direct access to microdata.

Table Servers / Model Servers

Queries and responses

Queries to model server:

Users request results from fitting a statistical model to the data.

Response from model server:

Answerable query: model output.

Unanswerable query: no results.

Model output also should include diagnostics.
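The query/response behaviour described above can be illustrated with a toy model server. This is only a sketch under stated assumptions: the class name `ModelServer`, the variable names, and the minimum-subset-size rule (`MIN_SUBSET_SIZE`) are hypothetical and are not taken from the talk. A real server would need far more careful disclosure checks.

```python
import random

MIN_SUBSET_SIZE = 30  # hypothetical threshold; the source does not specify a rule


class ModelServer:
    """Toy model server: returns regression output, never the microdata."""

    def __init__(self, microdata):
        self._microdata = microdata  # list of dicts, kept on the server

    def query(self, response, predictor, subset=None):
        rows = [r for r in self._microdata if subset is None or subset(r)]
        if len(rows) < MIN_SUBSET_SIZE:
            return None  # unanswerable query: no results
        x = [r[predictor] for r in rows]
        y = [r[response] for r in rows]
        n = len(rows)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((a - mean_x) ** 2 for a in x)
        syy = sum((b - mean_y) ** 2 for b in y)
        if sxx == 0 or syy == 0:
            return None  # degenerate fit: refuse rather than report
        slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
        intercept = mean_y - slope * mean_x
        sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
        # Answerable query: model output plus aggregate diagnostics only.
        return {
            "n": n,
            "intercept": intercept,
            "slope": slope,
            "r_squared": 1 - sse / syy,
            "residual_sd": (sse / (n - 2)) ** 0.5,
        }


# Simulated microdata for the demonstration (not real data).
rng = random.Random(0)
microdata = []
for _ in range(200):
    age = rng.randint(20, 65)
    microdata.append({"age": age, "income": 1000 + 50 * age + rng.gauss(0, 100)})

server = ModelServer(microdata)
answered = server.query("income", "age")  # full-data model: answered
refused = server.query("income", "age", subset=lambda r: r["age"] == 64)  # tiny subset
```

Here `answered` holds coefficients and summary diagnostics, while `refused` is `None` because the requested subset is too small under the assumed rule.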

Challenges in developing model servers

Non-statistical: operation costs, server security, etc.

Statistical:

-- Disclosure risks from smart queries (e.g., subsets, transformations).

-- Inferential disclosure risks.

-- Enabling complex model fitting.

Synthetic data

Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that:

No unit in released data has sensitive data from actual unit in population.

Released data look like actual data.

Statistical procedures valid for original data are valid for released data.

Generating fully synthetic data

Randomly sample new units from sampling frame.

Impute survey variables for new units using models fit from observed data.

Repeat multiple times and release datasets.
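The three steps above might look roughly like the following sketch. It assumes one frame covariate `x` known for every unit and one survey variable `y`, and it uses a simple linear model. The function names are hypothetical. The sketch plugs in point estimates of the model parameters, whereas a proper implementation would likely also draw the parameters from their posterior distribution.

```python
import random


def fit_line(x, y):
    """Least-squares fit of y on x; returns intercept, slope, residual sd."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, (sse / (n - 2)) ** 0.5


def make_fully_synthetic(frame_x, observed, m, n_synth, rng):
    """frame_x: covariate for every frame unit; observed: confidential (x, y) pairs."""
    intercept, slope, sigma = fit_line([x for x, _ in observed],
                                       [y for _, y in observed])
    datasets = []
    for _ in range(m):                        # repeat multiple times
        new_x = rng.sample(frame_x, n_synth)  # sample new units from the frame
        datasets.append(                      # impute y for the new units
            [(x, intercept + slope * x + rng.gauss(0, sigma)) for x in new_x]
        )
    return datasets


# Simulated frame and survey for the demonstration (not real data).
rng = random.Random(1)
frame_x = [rng.uniform(0, 10) for _ in range(5000)]
observed = [(x, 2.0 + 3.0 * x + rng.gauss(0, 1)) for x in rng.sample(frame_x, 300)]
synthetic = make_fully_synthetic(frame_x, observed, m=5, n_synth=300, rng=rng)
```

Every `y` in `synthetic` is simulated, so no released record carries a survey response from an actual unit.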

Modification: Release partially synthetic data

Little (1993, JOS): create multiple, partially synthetic datasets for public release so that:

Released data comprise mix of observed and synthetic values.

Released data look like actual data.

Statistical procedures valid for original data are valid for released data.
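As a rough illustration of a partially synthetic release, the sketch below replaces a sensitive value only for selected units and leaves all other values as observed. The field names (`region`, `income`), the selection rule (top 5% of incomes), and the within-region normal model are all assumptions made for the example. The crude model would not preserve the upper tail well in practice.

```python
import random
import statistics


def make_partially_synthetic(records, is_selected, m, rng):
    """Return m copies of records with 'income' synthesized for selected units."""
    incomes_by_region = {}
    for r in records:
        incomes_by_region.setdefault(r["region"], []).append(r["income"])
    # Simple synthesis model: a normal distribution fitted within each region.
    model = {g: (statistics.mean(v), statistics.stdev(v))
             for g, v in incomes_by_region.items()}
    datasets = []
    for _ in range(m):
        released = []
        for r in records:
            copy = dict(r)  # leave the confidential records untouched
            if is_selected(r):
                mu, sd = model[r["region"]]
                copy["income"] = rng.gauss(mu, sd)  # synthetic value
            released.append(copy)                   # otherwise observed value
        datasets.append(released)
    return datasets


# Simulated records for the demonstration (not real data).
rng = random.Random(2)
records = [
    {"region": rng.choice(["north", "south"]), "income": rng.lognormvariate(10, 0.5)}
    for _ in range(500)
]
cutoff = sorted(r["income"] for r in records)[475]  # top 5% treated as at risk
is_at_risk = lambda r: r["income"] >= cutoff
datasets = make_partially_synthetic(records, is_at_risk, m=3, rng=rng)
```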

Existing applications

Kennickell (1997, Record Linkage Techniques): Replace sensitive values for selected units.

Liu and Little (2002, JSM Proceedings): Replace values of key identifiers for selected units.

Abowd and Woodcock (2001, Confidentiality, Disclosure, and Data Access): Replace all values of sensitive variables.

Sample of research agenda

Implement and compare various data generation approaches on genuine data in production settings.

Evaluate risk/usefulness profile on genuine data in production setting.

Develop packaged synthesizers for data disseminators to use.

Secure computations

Horizontally Partitioned: Agencies have different records but same variables.

Purely Vertically Partitioned: Agencies have same records but different variables.

Partially Overlapping, Vertically Partitioned: Agencies have different records and different variables, with some common records and variables.

Horizontally Partitioned Data: Secure Summation

Secure summation

-- shares sums without sharing data

-- allows regressions, clustering, classifications

-- assumes agencies are semi-honest (they follow the protocol but may try to learn from what they receive)
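To see why sharing sums is enough for a regression, note that least-squares estimates depend on the data only through sums over records. The sketch below, for a single predictor, has each agency compute its local sums and then fits the pooled model from the totals. The function names are hypothetical, and the totals are simply added in the clear here as a stand-in for one run of a secure summation protocol per statistic.

```python
def local_statistics(records):
    """Sums one agency computes privately from its own (x, y) records."""
    return [
        len(records),
        sum(x for x, _ in records),
        sum(y for _, y in records),
        sum(x * x for x, _ in records),
        sum(x * y for x, y in records),
    ]


def pooled_regression(totals):
    """Least-squares intercept and slope from the pooled sums alone."""
    n, sx, sy, sxx, sxy = totals
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Three agencies with different records but the same variables (made-up values).
agencies = [
    [(1, 3.1), (2, 4.9)],
    [(3, 7.2), (4, 8.8), (5, 11.1)],
    [(6, 12.9)],
]

# Placeholder for secure summation: each total would be obtained without
# any agency revealing its own local sums.
totals = [sum(column) for column in zip(*(local_statistics(a) for a in agencies))]
intercept, slope = pooled_regression(totals)
```

The result matches what a single analyst would get from the stacked data, even though no agency's records leave that agency.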

Horizontal Partitioning: Secure summation

Obtain $\sum_i x_i$ without sharing individual values

1. Agency A adds a random number R to its x and passes (x + R) to Agency B.

2. Agency B adds its x to this value and passes the sum to Agency C.

3. Process continues until all agencies have added their x.

4. The last agency returns the total to Agency A, which subtracts R from the sum.
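A minimal simulation of the secure summation steps is sketched below. Working modulo a large public number (`MODULUS`) is an assumption added here, a common refinement so that the masked running total looks uniformly random to each recipient. The slides only mention adding and subtracting R. The function name is hypothetical and values are assumed to be non-negative integers whose total is below the modulus.

```python
import secrets

MODULUS = 2**61 - 1  # assumed public modulus, larger than any possible total


def secure_sum(private_values):
    """Simulate the ring protocol; private_values[k] is known only to agency k."""
    r = secrets.randbelow(MODULUS)               # Agency A's secret random R
    running = (private_values[0] + r) % MODULUS  # step 1: A passes x + R
    messages = [running]
    for x in private_values[1:]:                 # steps 2-3: each agency adds its x
        running = (running + x) % MODULUS
        messages.append(running)
    total = (running - r) % MODULUS              # step 4: A removes R
    return total, messages


values = [120, 45, 300]  # made-up private values for agencies A, B, C
total, messages = secure_sum(values)  # total == 465
```

Each entry of `messages` is what one agency passes on. Without knowing R, an agency sees only a masked running total. The protocol relies on the semi-honest assumption: for example, two agencies that collude can subtract the message one received from the message the other sent and recover the sum of the agencies between them.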

Purely vertical partitioning

Secure dot/matrix product

-- shares dot/matrix products without sharing data.

-- allows regressions, clustering, classification.

-- assumes semi-honest agencies.
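One way to see why dot and matrix products suffice for regression in the purely vertical case: suppose, as an illustrative assumption, that Agency A holds the columns $X_A$ and Agency B holds the columns $X_B$ for the same $n$ records, and that the response $y$ is held by one of them. The notation below is introduced only for this sketch.

```latex
X = \begin{pmatrix} X_A & X_B \end{pmatrix},
\qquad
X^\top X =
\begin{pmatrix}
X_A^\top X_A & X_A^\top X_B \\
X_B^\top X_A & X_B^\top X_B
\end{pmatrix},
\qquad
\hat\beta = (X^\top X)^{-1} X^\top y .
```

The diagonal blocks can be computed locally by each agency. Only the off-diagonal block $X_A^\top X_B$ (and its transpose), together with the cross-agency part of $X^\top y$, needs a secure matrix product.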

Synthetic data approaches

-- share synthetic copies of data across agencies.

-- allows any analysis when distributions used to generate data are accurate.

-- generates public use data file.

A research agenda for secure computation methods

- How to specify models without viewing data?

- What if sophisticated models needed?

- How to incorporate matching errors, differences in data quality and definitions?

- How to account for disclosure risks from models that “fit too well?”

Some References

Remote access servers

- Rowland (2003, NAS Panel on Data Access)

- Gomatam, Karr, Reiter, and Sanil (2005, Stat. Science)

Synthetic data

- Raghunathan, Reiter, and Rubin (2003, JOS)

- Reiter (2003, Surv. Meth.; 2005, JRSSA)

Secure computation

- Benaloh (1987, CRYPTO86)

- Karr, Lin, Sanil, and Reiter (2005, NISS tech. rep.)