34
The Genomics Revolution: The Good, The Bad, and The Ugly (A Privacy Researcher's Perspective) Emiliano De Cristofaro University College London https://emilianodc.com

The Genomics Revolution: The Good, The Bad, and The Ugly (UEOP16 Keynote)

Embed Size (px)

Citation preview

Whole Genome Sequecing: Innovation Dream or Privacy Nightmare

The Genomics Revolution: The Good, The Bad, and The Ugly

(A Privacy Researcher's Perspective)Emiliano De CristofaroUniversity College Londonhttps://emilianodc.com

This Talk In aThe GoodRevolution in medicine and healthcareGenetic testing for the massesThe BadCollection of highly sensitive dataVery hard to anonymize / de-identifyThe UglyGreater good vs privacyEncryption might not be the answer2

Id like to start by talking about how the tremendous progress over the past few years in genome sequencing technologies and genomics has been empowering a revolution in medicine and healthcare, getting us closer to a new era of personalized and preventive medicine. And this progress is making it increasingly possible for a large number of individuals to have access to genetic testing. On the other hand, the very same progress raises a number of privacy challenges as well as ethical, societal, and legal issues, as the genome is really a treasure trove of highly sensitive information. 2

3

First, Id like to give you some background information about genomics and genome sequencing a task that Im going to outsource to this nice and short TedEd video3

History1970s:DNA sequencing starts1990:The Human Genome Project starts2003:First human genome fully sequenced2012:UK announces sequencing of 100K genomes2015:USA announces sequencing of 1M genomes

$$$$3B:Human Genome Project $250K:Illumina (2008)$5K:Complete Genomics (2009), Illumina (2011)$1K:Illumina (2014)4

It is quite interesting to me how costs of whole genome sequencing have dropped much faster than what Moores law would predict. From futuristic-sounding venture with the HGP costing 3B to sequence one genome to an affordable technology. Companies like Illumina have now they technology to fully sequence human genomes in a matter of hours for less than $1000.Far fetched, progress enables initiatives 4

How to read the genome?Genotyping

Testing for genetic differences using a set of markers

5Sequencing

Determining the full nucleotide order of an organisms genome

5Why is this such a big deal? Full genome sequencing is not the only way to read the genome. For decades we have used genotyping, i.e., in-vitro processes to test for known genetic differences using a set of markers. But scientists also hope that being able to study the entire genome sequence will help them understand what they dont know, how the genome as a whole workshow genes work together to direct the growth, development and maintenance of an entire organism.

6

So progress in sequencing can help researchers gain better understanding of diseases and their relationship with genetic features, but it is also already bearing fruits in the clinical setting as we hear more and more news of how doctors have been able to diagnose and/or cure patients thanks to whole genome sequencing6

7

Another important point Id like to make is that the genomics revolution is not only happening within the walls of our research labs and our hospitals. Take as an example the case of Angelina Jolie, who was diagnosed her choice of undergoing really put genetic testing in the spotlight, if not for the debate it generated7

8

Another important phenomenon is the emergence of direct-to-consumer genomics, i.e., genetic tests that are marketed directly to customers, without necessarily involving a doctor or a health professional. One of the most popular companies in this market is 23andme, a US company now operating in the UK as well, that provides individuals with reports on a number of genetic traits, inherited risk factors, etc.8

9

But not all data are created equal!11

From a computer science perspective, genome sequencing sort of means turning this (a double stranded polymer of nucleotides) into this data. This is the output of the illumina sequencing machine, containing the nucleotides making up the genome as well as some additional information. And, as with any kind of data, we need to store it somewhere and wed like to provide tools to search it and enable computation over it.11

Privacy Researchers PerspectiveTreasure trove of sensitive informationEthnic heritage, predisposition to diseases

Genome = the ultimate identifierHard to anonymize / de-identifySensitivity is perpetualCannot be revokedLeaking ones genome leaking relatives genome12

Just like 23andme customers malevolent actors might be interested prompting fears of discrimination by employers or health insurance providers12

The Greater GoodvsPrivacy?13

Examination of many common genetic variants in different individuals to see if any variant is associated with a trait.Genomic advances dependent on data sharing, sharing is an important asset for research13

The rise of a new research communityStudying privacy issues

Exploring techniques to protect privacy

14

De-AnonymizationMelissa Gymrek et al. Identifying Personal Genomes bySurname Inference. Science Vol. 339, No. 6117, 2013

15

demonstrated feasibility of re-identifying DNA donors from a public research database using information available from popular genealogy Web sites and other available information.15

AggregationRe-identification of aggregated dataStatistics from allele frequencies can be used to identify genetic trial participants [1]Presence of an individual in a group can be determined by using allele frequencies and his DNA profile [2]

[1] R. Wang et al. Learning Your Identity and Disease from Research Papers: Information Leaks in Genome Wide Association Study. CCS, 2009[2] N. Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics,200816

p-values, r-squares16

Kin Privacy

Quantifying how much privacy do relatives lose when ones genome is leaked?

M. Humbert et al., Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy. Proceedings of ACM CCS, 2013

Also read: Routes for breaching genetic privacyY. Erlich and A. Narayanan, Nature Review GeneticsVol. 15, No. 6, 2014

17

18

Another line of work aims to study individuals fears and concerns related to genetic tests. As I mentioned, these often include fears of being discriminated in the workplace or by health insurance providers. Researchers also study perceptions and reactions of people learning the outcome of genetic tests, e.g., being diagnosed with predisposition to cancer or alzheimers, or like with this case that was in the news a few months ago, of a guy who discovered having an unknown half brother, ultimately leading his parents to divorce.18

Human Aspects of Genome PrivacyDynamic Consent:Patients electronically control consent through time and receive information about the uses of their dataJane Kayes work at Oxford

Understanding fears and reactions, including:Survivors guiltFreedom to withdraw is crucial but poorly understoodInsurance carriers and big corporations most distrusted19

Ethnographic Studies in WGSSemi-structured interviews with 16 participantsAssessing perception of genetic tests, attitude toward WGS programs, as well as perception of privacy/ethical issues(Some) Preliminary resultsPreferred method is through doctors not companies (trust)Labor/healthcare discrimination top concernsDifferences in correlation with income and education

E. De Cristofaro. Users' Attitudes, Perception, and Concerns in the Era of Whole Genome Sequencing. (USEC 2014)20

The rise of a new research communityStudying privacy issues

Exploring techniques to protect privacy

21

Differential Privacy

Computing number/location of SNPs associated to diseaseSignificance/correlation between a SNP and a diseaseA. Johnson and V. Shmatikov. Privacy-Preserving Data Exploration inGenome-Wide Association Studies. Proceedings of KDD, 201322Genome Wide Association Studies (GWAS)

Consists in adding noise to a dataset with the goal of supporting statistical queries while preserving the privacy of the users whose information is contained in the dataset.

22

Computing on Encrypted GenomesGenomic datasets often used for association studies

Encrypt data & outsource to the cloudPerform private computation over encrypted dataUsing partial & fully homomorphic encryption

Examples:Pearson Goodness-of-Fit test, linkage disequilibriumEstimation Maximization, Cochran-Armitage TT, etc.23K. Lauter, A. Lopez-Alt, M. Naehrig.Private Computation on Encrypted Genomic Data

Computing on Encrypted Genomes24

L. Kamm, D. Bogdanov, S. Laur, J. Vilo. A new way to protect privacy in large- scale genome-wide association studies. Bioinformatics 29 (7): 886-893, 2013.

Consists in adding noise to a dataset with the goal of supporting statistical queries while preserving the privacy of the users whose information is contained in the dataset.

24

Private Personal Genomic TestsIndividuals retain control of their sequenced genomeAllow doctors/labs to run genetics tests, but:1. Genome never disclosed, only test output is2. Pharmas can keep test specifics confidential

two main approaches 25

(i) DNA sample(i) Clinical and Environmental data(ii) Encrypted SNPs(i) Encrypted clinical and environmental data(iii) Disease Risk ComputationCERTIFIED INSTITUTION (CI)MEDICAL UNIT (MU)STORAGE AND PROCESSING UNIT (SPU)PATIENT (P)

1. Using Semi-Trusted Parties26

Relying on a cloud storage providerData encrypted and stored at a Storage Process UnitAllow medical unit to privately test them26

1. Using Semi-Trusted PartiesAyday et al. (WPES13)Data is encrypted and stored at a Storage Process UnitDisease susceptibility testing

Ayday et al. (DPM13)Encrypting raw genomic data (short reads)Allowing medical unit to privately retrieve them

Danezis and De Cristofaro (WPES14)Regression for disease susceptibility

27

doctoror lab

genome

individualtest specificsSecure Function Evaluationtest resulttest resultPrivate Set Intersection (PSI)Authorized PSIPrivate Pattern MatchingHomomorphic EncryptionGarbled Circtuis[]Output reveals nothing beyond test resultPaternity/Ancestry TestingTesting of SNPs/Markers Compatibility Testing[]2. Users keep sequenced genomes28

Users keep sequenced genomes (encrypted)But still allow doctors/clinicians to run testsOnly disclose minimum amount of informationE.g. privacy-preserving version of paternity, ancestry, genealogy, personalized medicine tests28

2. Users keep sequenced genomesBaldi et al. (CCS11)Privacy-preserving version of a few genetic tests, based on private set operationsPaternity test, Personalized Medicine, Compatibility Tests(First work to consider fully sequenced genomes)De Cristofaro et al. (WPES12), extends the aboveFramework and prototype deployment on AndroidAdds Ancestry/Genealogy Testing

29

Open ProblemsWhere do we store genomes?Encryption cant guarantee security past 30-50 yrsReliability and availability issues?

CryptographyEfficiency overhead Data representation assumptionsHow much understanding required from users?

30

We all leave biological cells behindHair, saliva, etc., can be collected and sequenced?

Compare this attack to re-identifying millions of DNA donors or hacking into 23andmeThe former: expensive, prone to mistakes, only works against a handful of targeted victimsThe latter: very scalable

Why do we even careabout genome privacy?31

EpilogueWhole Genome SequencingA revolution in healthcareRaises worrisome privacy/ethical concerns

The Genomic Privacy research communityUnderstanding the privacy issuesPrivacy-preserving testing on whole genomesPossible using efficient crypto protocols & cross-discipline collaboration

A number of open research issues

32

For more info:http://genomeprivacy.org

Also:E. Ayday, E. De Cristofaro, J.P. Hubaux, G. Tsudik.Whole Genome Sequencing: Revolutionary Medicine or Privacy Nightmare?IEEE Computer Magazine

Thank you!Special thanks toE. Ayday, P. Baldi, R. Baronio, G. Danezis, S. Faber,P. Gasti, J-P. Hubaux, B. Malin, G. Tsudik