False promises of data anonymity jeopardize data access

Neil Walker
12th September 2016
JDRF/Wellcome Trust Diabetes and Inflammation Laboratory, University of Cambridge
[email protected]
ORCID: http://orcid.org/0000-0001-9796-7688

Page 2: Sci datacon neil-walker-2016-09-12

Contents

In the context of clinical trial data, I’ll discuss
• The proposal to share individual-level data
• False promises of anonymity
• Consequences
• Alternatives

Page 3: Sci datacon neil-walker-2016-09-12

The proposal

In a Jan 2016 editorial, the International Committee of Medical Journal Editors (ICMJE)

“Proposes to require authors to share with others the deidentified individual patient data (IPD) underlying the results presented in the article.”

Implementation delayed a year (post adoption) to allow new consent models

Page 4: Sci datacon neil-walker-2016-09-12

… response to …

Institute of Medicine (IoM) report (2015)

Page 5: Sci datacon neil-walker-2016-09-12

… which includes …

(p144, my emphasis): “De-identification is commonly used to protect the privacy of participants in a clinical trial (see also Appendix B). Various jurisdictions may differ on the degree to which the risk of re-identification must be reduced for the data to be considered sufficiently de-identified to justify more widespread sharing, particularly in the absence of specific informed consent of the data subjects.”

[Appendix B is 54 pages on "Concepts and Methods for De-identifying Clinical Trial Data", by Khaled El Emam and Bradley Malin]

Page 6: Sci datacon neil-walker-2016-09-12

… and cites and relies on …

Page 7: Sci datacon neil-walker-2016-09-12

… which is a poor implementation of …

Page 8: Sci datacon neil-walker-2016-09-12

Why poor?

ISTDB-2 has 19435 participants and 112 variables, e.g.:

Randomisation data;
HOSPNUM;Hospital number
RDELAY;Delay between stroke and randomisation in hours
RCONSC;Conscious state at randomisation (F - fully alert, D - drowsy, U - unconscious)
SEX;"M=male; F=female"
AGE;Age in years
RSLEEP;Symptoms noted on waking (Y/N)
RATRIAL;"Atrial fibrillation (Y/N); not coded for pilot phase - 984 patients"
...
COUNTRY;Abbreviated country code
CNTRYNUM;Country code
...

This should be enough people, right?

Page 9: Sci datacon neil-walker-2016-09-12

Let’s count the people from each country

$ cut -f82 IST_corrected.txt | sort | uniq -c | sort -nr

6257 UK
3437 ITAL
1631 SWIT
759 POLA
...
9 JAPA
2 FRAN
1 COUNTRY

NB: dataset superseded by ISTDB-3, currently embargoed due to "UK NHS Information Governance"
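
The count above already shows countries with only a handful of participants. As a minimal sketch (not part of the original analysis), the same pipeline can be narrowed to list only the values that fall below a chosen group size. It assumes a tab-separated file with one header row, as in the command above; the function name `small_groups` and the threshold are illustrative assumptions.

```shell
#!/bin/sh
# small_groups FILE COLUMN K   (hypothetical helper, for illustration only)
# Prints "count value" for every value in the given tab-separated column
# that occurs fewer than K times. The header row is skipped.
# Assumes values contain no spaces (true of abbreviated country codes).
small_groups() {
    file=$1; col=$2; k=$3
    tail -n +2 "$file" |                                  # drop the header row
        cut -f"$col" |                                    # keep one column
        sort | uniq -c |                                  # count each value
        awk -v k="$k" '$1 + 0 < k + 0 { print $1, $2 }' | # keep small groups
        sort -n                                           # smallest first
}

# Hypothetical usage against the file named on the slide:
# small_groups IST_corrected.txt 82 10
```

Any value this prints marks a group small enough that one further attribute, such as age or sex, may single out an individual.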

Page 10: Sci datacon neil-walker-2016-09-12

Is this just isolated sloppiness?

And noting a released dataset cannot be retrieved

Page 11: Sci datacon neil-walker-2016-09-12

Examples - "Anecdata" (from Daniel C Barth-Jones, to whom many thanks)

1. Governor Weld - identified in insurance dataset in 1997

2. Netflix - customers identified in a dataset released to improve recommendations

3. Y-Chromosome STR surname inference - demonstration from Yaniv Erlich's lab

4. PGP - subjects identified in (Open) Personal Genome Project

5. Washington State Hospital Discharge data - patients identified in data sold by hospital

6. NYC Taxi - celebrities identified in FOIL request

7. Mobile phone - theoretical identification from mobile phone location data

Page 12: Sci datacon neil-walker-2016-09-12

Failure modes?

• 1. and 4. are cases where too much data was released (Zipcodes, DOBs)

• 6. and 7. are breached by linking multiple records that are (probably) individually OK - though in 6. a key was hacked too

But all rely on data available outside the dataset to make the (often small number of) identifications - some of it not obvious
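
To make the role of outside data concrete, here is a minimal sketch of a linkage on quasi-identifiers. It is not the method used in any of the cases above. It assumes two hypothetical tab-separated files: a public register (name, zip, dob, sex) and a "de-identified" release (zip, dob, sex, diagnosis). The function name `link_records` and both file layouts are assumptions made for illustration.

```shell
#!/bin/sh
# link_records PUBLIC RELEASE   (hypothetical helper, for illustration only)
# PUBLIC:  name <TAB> zip <TAB> dob <TAB> sex
# RELEASE: zip  <TAB> dob <TAB> sex <TAB> diagnosis
# Prints "name <TAB> diagnosis" for each released record whose
# (zip, dob, sex) combination matches exactly one person in PUBLIC.
link_records() {
    awk -F'\t' '
        # First file: count how many people share each quasi-identifier key.
        NR == FNR { key = $2 FS $3 FS $4; n[key]++; name[key] = $1; next }
        # Second file: rebuild the same key from the released record.
        { key = $1 FS $2 FS $3 }
        # A unique match links the released record to a named person.
        n[key] == 1 { print name[key] "\t" $4 }
    ' "$1" "$2"
}

# Hypothetical usage:
# link_records public_register.tsv released_data.tsv
```

The release contains no names at all; the identification comes entirely from the second file, which is why removing direct identifiers alone gives no guarantee.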

Page 13: Sci datacon neil-walker-2016-09-12

So, de-identification hotly debated…

“There is no evidence that de-identification works either in theory or in practice and attempts to quantify its efficacy are unscientific and promote a false sense of security by assuming unrealistic, artificially constrained models of what an adversary might do.”

Page 14: Sci datacon neil-walker-2016-09-12

How does this jeopardise data access?

• And not just bad publicity, though that doesn’t help!

Image from Fast Company

Page 15: Sci datacon neil-walker-2016-09-12

Data access issue #1: where consent was not sought for data sharing

Data is being redacted e.g. from https://clinicalstudydatarequest.com

GSK’s exclusion criteria include:

Whether GSK consider it feasible to anonymise the data without compromising the privacy and confidentiality of research participants. For example, anonymisation of data from studies of rare diseases is more difficult to achieve and will be reviewed on a case-by-case basis.

Page 16: Sci datacon neil-walker-2016-09-12

Data access issue #2: where there is no experience of sharing data with consent

Should have lots of choices

Page 17: Sci datacon neil-walker-2016-09-12

Where is clinical data sharing now?

EBI and NIH like this …

Page 18: Sci datacon neil-walker-2016-09-12

Where should it be?

Aggregate Consented, anonymised

Page 19: Sci datacon neil-walker-2016-09-12

Understanding Society¹, at UK Data Archive

Downloads, 2014   | 3        | 285                     | 2510
Datasets          | 2        | 29                      | 3
Time to decision  | 3 months | 2 weeks                 | 1 day
Decision by       | DAC      | Staff, reporting to DAC | Registration, delegated by DAC

i.e. some people do it well

1. https://www.understandingsociety.ac.uk/documentation/getting-started

Page 20: Sci datacon neil-walker-2016-09-12

Data access issue #3: no elegant way to respond to a new attack

This paper led to all genotype summary statistics being placed behind firewalls

Page 21: Sci datacon neil-walker-2016-09-12

Data access issue #4: people take risks

STOP PRESS - September 7th 2016

NHGRI give up on access control?

https://www.genome.gov/director/

https://www.genome.gov/27566089/Workshop-on-Sharing-Aggregate-Genomic-Data

“NHGRI should recommend that NIH reconsider the policy for maintaining all genomic summary statistics under controlled access, and develop a default public access model based on transparent policy considerations for most genomics studies.”

Page 22: Sci datacon neil-walker-2016-09-12

The elephant in the room?

From Banksy’s Barely Legal show, LA, 2006

Page 23: Sci datacon neil-walker-2016-09-12

Anonymous data is seen as an asset to buy and sell

However, not all subjects will agree to data sharing: a recent (health-data-related) poll found that 17%

“objected to private companies having access to health data under any circumstances.”

(Ipsos MORI 2016)

Page 24: Sci datacon neil-walker-2016-09-12

So, to repeat: this is not a matter of consent or anonymise

Do both
