Alternative Approaches to Data Dissemination and Data Sharing Jerome Reiter Duke University...

Alternative Approaches to Data Dissemination and Data Sharing

Jerome Reiter

Duke University

jerry@stat.duke.edu

Two general settings

Agency seeks to release confidential data to the public.

Multiple agencies seek to improve analyses by sharing their confidential data.

For both settings, agencies seek strategies that:

i) do not reveal identities or sensitive attributes,

ii) are useful for a wide range of analyses,

iii) are easy for analysts and agencies to use.

Some alternative approaches

Remote access servers

Synthetic (i.e. simulated) data

Secure computation techniques

Definition of servers

Server is any system that

(i) allows users to submit queries for output from statistical analyses of microdata,

(ii) does not give direct access to microdata.

Table Servers / Model Servers

Queries and responses

Queries to model server:

Users request results from fitting a statistical model to the data.

Response from model server:

Answerable query: model output.

Unanswerable query: no results.

Model output also should include diagnostics.
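The query and response pattern above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ModelServer`, the simple one-predictor regression, and the rule that a query is unanswerable when it touches fewer than `min_records` records are all hypothetical choices, not part of any actual server. The data values are made up.

```python
class ModelServer:
    """Toy model server: fits models on request, never returns microdata."""

    def __init__(self, microdata, min_records=5):
        self._microdata = microdata      # confidential rows stay inside the server
        self._min_records = min_records  # hypothetical answerability rule

    def fit(self, response, predictor, subset=lambda row: True):
        rows = [r for r in self._microdata if subset(r)]
        if len(rows) < self._min_records:
            return None                  # unanswerable query: no results
        xs = [r[predictor] for r in rows]
        ys = [r[response] for r in rows]
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        sst = sum((y - my) ** 2 for y in ys)
        if sxx == 0 or sst == 0:
            return None                  # degenerate fit, also refused
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
        # Answerable query: model output plus simple diagnostics.
        return {"intercept": intercept, "slope": slope,
                "n": len(rows), "r_squared": 1 - sse / sst}


# Illustrative microdata (made-up values).
ages = (20, 30, 40, 50, 60, 70)
incomes = (52, 68, 95, 108, 133, 149)
server = ModelServer([{"age": a, "income": y} for a, y in zip(ages, incomes)])

answered = server.fit("income", "age")
refused = server.fit("income", "age", subset=lambda row: row["age"] > 55)
```

Here `answered` is a dictionary of estimates and diagnostics, while `refused` is `None` because the requested subset contains only two records.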

Challenges in developing model servers

Non-statistical: Operation costs, server security, etc.

Statistical:

-- Disclosure risks from smart queries (e.g., subsets, transformations).

-- Inferential disclosure risks.

-- Enabling complex model fitting.

Synthetic data

Rubin (1993, JOS): create multiple, fully synthetic datasets for public release so that:

No unit in released data has sensitive data from actual unit in population.

Released data look like actual data.

Statistical procedures valid for original data are valid for released data.

Generating fully synthetic data

Randomly sample new units from sampling frame.

Impute survey variables for new units using models fit from observed data.

Repeat multiple times and release datasets.
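The three steps above might look as follows in a minimal Python sketch. Everything concrete here is an assumption made for illustration: the frame contains a single known covariate, the imputation model is a straight-line regression with normal errors, and the function names and data values are invented. The sketch also imputes from point estimates; a full implementation would typically draw the model parameters as well, so that the released datasets reflect parameter uncertainty.

```python
import math
import random
import statistics


def fit_line(pairs):
    """Least-squares line and residual standard deviation from observed data."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in pairs)
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    sse = sum((y - intercept - slope * x) ** 2 for x, y in pairs)
    sigma = math.sqrt(sse / (len(pairs) - 2))
    return intercept, slope, sigma


def synthesize(frame_x, observed, n_units, m, seed=0):
    """Return m fully synthetic datasets of (x, imputed y) pairs."""
    rng = random.Random(seed)
    intercept, slope, sigma = fit_line(observed)
    datasets = []
    for _ in range(m):                       # repeat multiple times
        units = rng.sample(frame_x, n_units)  # sample new units from the frame
        datasets.append(                     # impute the survey variable
            [(x, intercept + slope * x + rng.gauss(0, sigma)) for x in units])
    return datasets


# Illustrative values: frame covariate is age, survey variable is income.
frame_ages = list(range(20, 70))
observed_sample = [(25, 31.0), (35, 42.5), (45, 50.0), (55, 61.5), (65, 69.0)]
synthetic_datasets = synthesize(frame_ages, observed_sample, n_units=10, m=5)
```

Every released income is a model draw, so no released record carries an observed survey value.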

Modification: Release partially synthetic data

Little (1993, JOS): create multiple, partially synthetic datasets for public release so that:

Released data comprise mix of observed and synthetic values.

Released data look like actual data.

Statistical procedures valid for original data are valid for released data.
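A rough Python sketch of a partially synthetic release is given below. It rests on assumptions chosen only for illustration: the sensitive variable is called `income`, the units selected for replacement are those above a cutoff, and replacements are drawn from a normal distribution fit to the selected values. That model is deliberately crude, and the function name and data are hypothetical.

```python
import random
import statistics


def partially_synthesize(records, is_selected, m, seed=0):
    """Return m datasets mixing observed values and synthetic incomes."""
    rng = random.Random(seed)
    selected_values = [r["income"] for r in records if is_selected(r)]
    mu = statistics.mean(selected_values)
    sd = statistics.stdev(selected_values)   # needs at least two selected units
    releases = []
    for _ in range(m):
        release = []
        for r in records:
            if is_selected(r):
                # Sensitive value replaced by a draw from the fitted model.
                release.append({**r, "income": rng.gauss(mu, sd)})
            else:
                # Observed value released unchanged.
                release.append(dict(r))
        releases.append(release)
    return releases


# Illustrative records (made-up values).
records = [
    {"id": 1, "income": 40.0},
    {"id": 2, "income": 55.0},
    {"id": 3, "income": 150.0},
    {"id": 4, "income": 210.0},
    {"id": 5, "income": 480.0},
]
releases = partially_synthesize(records, lambda r: r["income"] > 100, m=3)
```

Each release keeps the observed incomes for units 1 and 2 and carries synthetic incomes for units 3 to 5.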

Existing applications

Kennickell (1997, Record Linkage Techniques): Replace sensitive values for selected units.

Liu and Little (2002, JSM Proceedings): Replace values of key identifiers for selected units.

Abowd and Woodcock (2001, Confidentiality, Disclosure, and Data Access): Replace all values of sensitive variables.

Sample of research agenda

Implement and compare various data generation approaches on genuine data in production settings.

Evaluate risk/usefulness profile on genuine data in production setting.

Develop packaged synthesizers for data disseminators to use.

Secure computations

Horizontally Partitioned: Agencies have different records but same variables.

Purely Vertically Partitioned: Agencies have same records but different variables.

Partially Overlapping, Vertically Partitioned: Agencies have different records and different variables, with some common records and variables.

Horizontally Partitioned Data: Secure Summation

Secure summation

-- shares sums without sharing data.

-- allows regressions, clustering, classifications.

-- assumes semi-honest.
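To see why shared sums are enough for a regression, consider a simple linear regression on horizontally partitioned data. The Python sketch below is an illustration under assumptions: the function names and data are invented, and the per-agency sums are added directly, whereas in practice each total would be obtained through secure summation so that no agency sees another's local sums.

```python
def local_sums(records):
    """Sufficient statistics an agency computes on its own (x, y) records."""
    n = len(records)
    sx = sum(x for x, _ in records)
    sy = sum(y for _, y in records)
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return (n, sx, sy, sxx, sxy)


def pooled_ols(all_local_sums):
    """Least-squares line computed from totals only, never from raw records."""
    n, sx, sy, sxx, sxy = (sum(parts) for parts in zip(*all_local_sums))
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Illustrative data: same variables, different records at each agency.
agency_a = [(1.0, 3.0), (2.0, 5.0)]
agency_b = [(3.0, 7.0), (4.0, 9.0)]
intercept, slope = pooled_ols([local_sums(agency_a), local_sums(agency_b)])
```

The result matches the fit on the pooled records (here intercept 1 and slope 2), yet only five totals cross agency boundaries.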

Horizontal Partitioning: Secure summation

Obtain the sum of the agencies' values x without sharing individual values.

1. Agency A passes (x + R) to 2nd agency, where R is a random number known only to Agency A.

2. Agency B adds its x to this value and passes sum to Agency C.

3. Process continues until all agencies have added their x.

4. Agency A subtracts R from the sum.
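The four steps can be simulated in a single process with the short Python sketch below. It is a sketch under assumptions, not a networked implementation: the function name and values are illustrative, and the arithmetic is done modulo a large number (with non-negative integer values below that bound) so that the masked running total looks uniformly random to each recipient. The modulus is a common choice for this protocol rather than something stated in the steps.

```python
import secrets

MODULUS = 2 ** 64  # assumption: the true total is a non-negative integer below this


def secure_sum(values, modulus=MODULUS):
    """Simulate the ring protocol; values[0] belongs to Agency A."""
    r = secrets.randbelow(modulus)        # Agency A's secret random number R
    running = (values[0] + r) % modulus   # step 1: A passes x + R onward
    for x in values[1:]:                  # steps 2-3: each agency adds its own x
        running = (running + x) % modulus
    return (running - r) % modulus        # step 4: A subtracts R from the sum


agency_values = [120, 45, 300]  # illustrative values, one per agency
total = secure_sum(agency_values)
```

Each agency sees only a masked running total. The protocol relies on the semi-honest assumption: two colluding agencies could recover the value of the agency between them by differencing what they sent and received.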

Purely vertical partitioning

Secure dot/matrix product

-- shares dot/matrix products without sharing data.

-- allows regressions, clustering, classification.

-- assumes semi-honest.

Synthetic data approaches

-- share synthetic copies of data across agencies.

-- allows any analysis when distributions used to generate data are accurate.

-- generates public use data file.

A research agenda for secure computation methods

- How to specify models without viewing data?

- What if sophisticated models needed?

- How to incorporate matching errors, differences in data quality and definitions?

- How to account for disclosure risks from models that “fit too well?”

Some References

Remote access servers

- Rowland (2003, NAS Panel on Data Access)

- Gomatam, Karr, Reiter, and Sanil (2005, Stat. Science)

Synthetic data

- Raghunathan, Reiter, and Rubin (2003, JOS)

- Reiter (2003, Surv. Meth.; 2005, JRSSA)

Secure computation

- Benaloh (1987, CRYPTO86)

- Karr, Lin, Sanil, and Reiter (2005, NISS tech. rep.)
