8/25/15  


HotSec 2013

Privacy Considerations of Genome Sequencing

E. Ayday, E. De Cristofaro, J.-P. Hubaux, G. Tsudik
With contributions from Z. Huang, M. Humbert, J.-L. Raisaro

Many thanks to geneticists J. Fellay, P. McLaren and A. Telenti

On Convergence…

"The last inch"

Digital medicine:
- Digital medical records
- Digital imaging
- Medical online social networks
- Genome sequencing
- Other 'omics data
- Wireless biosensors
- …

Telecom, Computing

Modern IT

…0100110100011…   …CGTTAATTCCGTA…

The Genomic Avalanche Is Coming…


Genetic Sequencing

Courtesy V. Mooser

GATTACA (1997 Movie)

Medical Use of Genetics
• Genetic disease risk tests help early diagnosis of serious diseases
• Pharmacogenomics => personalized medicine

The SNP
• The human genome is identical in most places for all people
• SNP (Single Nucleotide Polymorphism) => positions where some people have one nucleotide pair while others have another
• Linkage disequilibrium (LD): correlation between the alleles of SNPs (usually located close to each other)
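The LD definition can be made concrete with the standard pairwise measures D and r². The following is a minimal sketch, assuming two biallelic SNPs with known haplotype frequencies; the function name `ld_stats` and the example frequencies are illustrative and do not come from the slides.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r^2) for two biallelic SNPs.

    p_ab: frequency of haplotypes carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: marginal frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0  # squared correlation of the alleles
    return d, r2

# Independent SNPs: the joint frequency equals the product of the marginals.
print(ld_stats(0.25, 0.5, 0.5))
# Perfectly correlated SNPs: allele A always travels with allele B.
print(ld_stats(0.5, 0.5, 0.5))
```

An r² close to 1 means that knowing one SNP largely determines the other, which is why hiding a single SNP may not protect it if its neighbours are revealed.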

Key Concepts of Genomics

• Our genetic information is stored in the sequence of DNA, which is made of four nucleotides: A, T, G, and C
• The human genome is ~3 billion nucleotides long, and packaged into 23 pairs of chromosomes
• Genetic variants (including SNPs) are positions in the genome where people have different values
• Our collection of genetic variants is what makes each of us unique
• Modern techniques make it possible to determine the status of large numbers of SNPs very efficiently

From the Sample to the Full Genome Sequence

Samples => Sequencing machine (Illumina, Roche, Life Technologies, Oxford Nanopore, Pacific Biosciences,…) => Raw data (FASTQ) => SAM file (aligned reads) => Full genome

Deep / ultra-deep sequencing

Full genome
• Individual diagnosis, personalized medicine
• Statistics
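The raw-data stage of this pipeline uses FASTQ, where each read takes four lines: an `@`-prefixed identifier, the bases, a `+` separator, and one quality character per base. The sketch below reads such records and reports the mean quality per read. It assumes the common Phred+33 quality encoding and well-formed input with no blank lines; the helper name `read_fastq` and the two reads are synthetic.

```python
import io

def read_fastq(handle):
    """Yield (read_id, bases, qualities) from a FASTQ stream, four lines per record."""
    while True:
        header = handle.readline().strip()
        if not header:                         # end of input
            return
        bases = handle.readline().strip()
        handle.readline()                      # '+' separator line, ignored
        quals = handle.readline().strip()
        # Assumption: Phred+33 encoding, i.e. quality = ASCII code - 33.
        yield header[1:], bases, [ord(c) - 33 for c in quals]

# Synthetic two-read example, not real patient data.
sample = io.StringIO("@read1\nCGTTAATT\n+\nIIIIIIII\n@read2\nCCGTA\n+\n!!!!!\n")
records = list(read_fastq(sample))
for read_id, bases, quals in records:
    print(read_id, bases, sum(quals) / len(quals))
```

Even at this early stage the file already contains the individual's nucleotides in clear text, so the privacy concerns discussed in these slides apply well before a full genome is assembled.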

Threat
• Leakage of genomic data
• Revelation of privacy-sensitive data about the patient
  – Predisposition to disease, ethnicity, paternity or filiation, etc.
  – Denial of access to health insurance, mortgage, education, and employment
• Cross-layer attacks
  – Using privacy-sensitive information belonging to a victim, retrieved from different sources (e.g., online social networks)

Misconceptions about Genome Privacy (1/6)

Misconception 1: Genome privacy is hopeless, because all of us leave biological cells (hair, skin, droplets of saliva,…) wherever we go
• Those cells can be collected and used for DNA sequencing
• Hence trying to protect genome privacy is a lost battle
• What is wrong with this reasoning?
• Collecting human biological samples and sequencing them is expensive, illegal, prone to mistakes, and non-scalable! (even if sequencing techniques keep improving)

Misconceptions about Genome Privacy (2/6)

Misconception 2: Genome privacy is irrelevant, because genetics is non-deterministic
• Genetic data as such is of little relevance because other aspects (especially the environment, nutrition, etc.) also play a major role in the evolution of health
• Hence genetic data is of little value for an attacker
• What is wrong with this reasoning?
• In some cases (e.g., genes BRCA1 and BRCA2 for breast cancer), the disease probabilities are highly related to genetic data
• Paternity can be checked
• Environmental data can be obtained from various sources, including online social networks

Misconceptions about Genome Privacy (3/6)

Misconception 3: Genome privacy should be left to bioinformaticians
• Specialists of bioinformatics are trained in both biology (including genetics) and computer science
• Hence they are better prepared than us (computer scientists) to address those problems
• What is wrong with this reasoning?
• Genome privacy requires a strong background in information security (threat analysis, protocol security, cryptography,…)
• Such a culture is well developed among computer scientists, notably thanks to the challenge of Internet security
• Yet, it is not part of the traditional background of bioinformaticians
• Learning the basics of genetics is pretty straightforward for computer scientists, see e.g. "Evolutionary Analysis" by Freeman and Herron, 5th edition, Pearson, 2013

Misconceptions about Genome Privacy (4/6)

Misconception 4: Genome privacy will be guaranteed by legislation
• The usage of genetic data is strictly regulated, see e.g. the Genetic Information Nondiscrimination Act (GINA), 2008, in the US
• Legislation will act as a deterrent
• What is wrong with this reasoning?
• If genomic data can be stealthily accessed, potential employers, bankers, and other decision makers will be tempted to make use of it (as recruiters do today by checking Facebook profiles of candidates)
• Organized criminals (who are rarely deterred by laws) can misuse those data in multiple ways (blackmailing,…)

Misconceptions about Genome Privacy (5/6)

Misconception 5: Privacy Enhancing Technologies are a nuisance in the case of genetics: genetic data should be made available online to everyone to facilitate research, as done e.g. in the case of the Personal Genome Project
• Medical progress is faster if (anonymized) medical records are freely available online
• What is wrong with this reasoning?
• Medical confidentiality is a crucial component of the trust between patient and healthcare provider
• If the population becomes scared about leakage of genomic data, a severe backlash on genomics research (and thus personalized medicine) could follow

Sometimes, medical researchers tend to underestimate the constraints of clinical practice…

Misconceptions about Genome Privacy (6/6)

Misconception 6: Encrypting genomic data is superfluous because it is hard to identify a person from her variants
• Databases of genomes are usually anonymized
• Even in clear text, genomic data are so complicated that it is practically impossible to deanonymize them
• What is wrong with this reasoning?
• See counter-examples hereafter

Examples of Recent Research Results

• Deanonymization of genomes
• Quantification of Kin Genomic Privacy
• Efficient and Secure Testing of Genomes
• Android-based GenoDroid Framework
• Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data

Gymrek et al., "Identifying Personal Genomes by Surname Inference"

[Figure: the surname Smith associated with Y-chromosome data (Y), looked up on www.ysearch.org]

M. Gymrek, A. L. McGuire, D. Golan, E. Halperin, and Y. Erlich, "Identifying Personal Genomes by Surname Inference," Science, Jan. 2013.

Two Largest Public Genetic Genealogy Databases with Built-in Search Engines

• www.smgf.org (Sorenson Molecular Genealogy Foundation)
• www.ysearch.org

Publicly available: 135,000 surname/Y-STR records…

Y-STR: Short tandem repeat on the Y-chromosome (typically used in paternity tests)

What is the Likelihood to Recover a Surname?

Gymrek et al., "Identifying Personal Genomes by Surname Inference"

Empirical test on 900 surname/Y-STR haplotype records:
1. Start from the Y-STR of a real person
2. Query Ysearch and SMGF
3. Calculate the surname confidence score
4. Infer the surname
5. Compare the predicted surname to the true one

For US Caucasian males from the middle and upper classes: 12% successful recoveries

=> Moral: Deanonymization of online genomic data is easy (and will become easier)
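The query, score, and infer steps of this test can be illustrated with a toy lookup. The sketch below is not the scoring used by Gymrek et al.: the records are synthetic, the repeat counts are made up, and both the `min_match` threshold and the vote-share "confidence" are hypothetical stand-ins chosen only to show the shape of the attack.

```python
from collections import Counter

# Synthetic surname/Y-STR records (marker -> repeat count); not real data.
RECORDS = [
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}),
    ("Smith", {"DYS19": 14, "DYS390": 24, "DYS391": 10, "DYS393": 13}),
    ("Jones", {"DYS19": 15, "DYS390": 22, "DYS391": 10, "DYS393": 12}),
    ("Brown", {"DYS19": 14, "DYS390": 23, "DYS391": 11, "DYS393": 14}),
]

def match_fraction(query, record):
    """Fraction of the query's markers whose repeat count equals the record's."""
    shared = [m for m in query if m in record]
    if not shared:
        return 0.0
    return sum(query[m] == record[m] for m in shared) / len(shared)

def infer_surname(query, records, min_match=0.75):
    """Return (surname, confidence), or (None, 0.0) if no record is close enough.

    Hypothetical score: each sufficiently close record votes with its match
    fraction; confidence is the winning surname's share of the total vote.
    """
    votes = Counter()
    for surname, haplotype in records:
        frac = match_fraction(query, haplotype)
        if frac >= min_match:
            votes[surname] += frac
    if not votes:
        return None, 0.0
    surname, score = votes.most_common(1)[0]
    return surname, score / sum(votes.values())

# Y-STR profile extracted from an "anonymous" genome (synthetic).
target = {"DYS19": 14, "DYS390": 24, "DYS391": 11, "DYS393": 13}
print(infer_surname(target, RECORDS))
```

The point of the sketch is that the attacker needs no access to the victim's identity, only a public table linking surnames to Y-chromosome markers.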

Quantification of Kin Genomic Privacy (CCS 2013)

Correlated genetic information between family members => an individual sharing his/her genome threatens his/her (known) relatives' genomic privacy

Helping a Family to Decide what to Reveal

[Figure: framework overview. The actual genomic sequences of N relatives (X1 … XN, M loci each) go through the GPPM decision step, which yields the observed genomic sequences with some SNPs hidden. The adversary runs a reconstruction attack (inference) on the observed sequences, using background knowledge: familial relationships gathered from social networks or genealogy websites, the rules of meiosis, linkage disequilibrium values (matrix of pairwise joint probabilities Pij between SNP i and SNP j), and the minor allele frequency qi of each SNP i. The result feeds genomic-privacy quantification and health-privacy quantification.]

Actual genomic sequences (M loci):
X1: AG CT AA GC AT … AC
X2: AG CC AC GC AT … AA
…
XN: AG CT AA CC TT … AC

Observed genomic sequences (M loci):
X1: AG __ AA __ AT … __
X2: __ __ __ __ __ … __
…
XN: __ CT AA __ __ … AC

MAF: Minor Allele Frequency
GPPM: Genomic Privacy Protection Mechanism

Reconstruction Attacks

• A matrix (M loci, N relatives) contains the probability distribution (of BB, Bb and bb) for known and unknown values of alleles. Initialization is based on background knowledge
• The marginal probabilities for unknown values are computed by using sum-product (belief propagation) algorithms (next slide)
• LD is given by a sparse pairwise joint probability matrix L where L_{i,j} = Pr(X_i, X_j)
• m(k): mother's allele at SNP k; f(k): father's allele at SNP k
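The initialization of that matrix can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: unknown entries are seeded from the minor allele frequency using Hardy-Weinberg proportions (an assumption about what "background knowledge" means here), b is taken to be the minor allele, and the layout of one row per relative and one column per SNP is an arbitrary choice.

```python
GENOTYPES = ("BB", "Bb", "bb")

def hardy_weinberg_prior(maf):
    """Prior over (BB, Bb, bb) when b is the minor allele with frequency maf."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]

def init_matrix(observed, mafs):
    """Build a matrix of genotype distributions, one row per relative.

    observed[n][k] is a genotype string if relative n revealed SNP k, else None.
    Known entries become point masses; unknown entries get the population prior.
    """
    matrix = []
    for row in observed:
        dist_row = []
        for k, genotype in enumerate(row):
            if genotype is None:
                dist_row.append(hardy_weinberg_prior(mafs[k]))
            else:
                dist_row.append([1.0 if g == genotype else 0.0 for g in GENOTYPES])
        matrix.append(dist_row)
    return matrix

observed = [["BB", None], [None, "Bb"]]   # 2 relatives, 2 SNPs (synthetic)
matrix = init_matrix(observed, mafs=[0.3, 0.1])
print(matrix[0])
```

Belief propagation then refines the unknown entries by combining these priors with the family structure and the LD between SNPs.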

Factor Graph Example with a trio (3 individuals) and 3 SNPs in LD

[Figure: factor graph for mother (M), father (F) and child (C) with factor nodes f1 … f18, split into pedigree factor nodes and LD factor nodes. Example messages shown: mf1-v1, mf3-v1, mf10-v1, mf13-v1 and mv1-f1, mv1-f3, mv1-f10, mv1-f13; several messages are marked "=1".]

• Pedigree factor nodes: P(XC | XM, XF), assuming Mendelian inheritance
• P(X1), P(X2), P(X3): given by population allele frequencies
• LD factor nodes: P(X1X2), P(X1X3), P(X2X3), joint probabilities given by LD
• mf-v = messages from factors to variables = multiply the incoming messages with the factor function and sum out every variable except the one to which the message is sent
• mv-f = messages from variables to factors = multiply all incoming messages except the one coming from the factor to which mv-f is sent

M. Humbert, E. Ayday, J.-P. Hubaux, A. Telenti: Addressing the Concerns of the Lacks Family: Quantification of Kin Genomic Privacy, CCS 2013
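The two message rules can be traced on the smallest useful case: a trio at a single SNP, where the child's genotype is observed and the mother's is inferred. The sketch below is a simplified illustration, not the implementation from the CCS 2013 paper: it omits the LD factors, so the graph is a tree and one pass of messages is exact, whereas the graph on the slide contains loops and needs iterated messages. The prior values are assumed population genotype frequencies.

```python
from itertools import product

STATES = (0, 1, 2)  # number of copies of allele b: BB=0, Bb=1, bb=2

def mendel(xm, xf, xc):
    """P(child genotype xc | mother xm, father xf) under Mendelian inheritance."""
    pm, pf = xm / 2, xf / 2          # probability that each parent transmits b
    return {0: (1 - pm) * (1 - pf),
            1: pm * (1 - pf) + (1 - pm) * pf,
            2: pm * pf}[xc]

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def mother_marginal_sum_product(prior_m, prior_f, child_obs):
    # Variable-to-factor messages: product of the other incoming messages.
    msg_f_to_mendel = list(prior_f)                        # only the prior feeds F
    msg_c_to_mendel = [1.0 if c == child_obs else 0.0 for c in STATES]  # evidence
    # Factor-to-variable message: multiply the factor by the incoming messages
    # and sum over every variable except the recipient (here M).
    msg_mendel_to_m = [
        sum(mendel(xm, xf, xc) * msg_f_to_mendel[xf] * msg_c_to_mendel[xc]
            for xf, xc in product(STATES, STATES))
        for xm in STATES
    ]
    # Belief at M: product of all messages arriving at M, then normalize.
    return normalize([prior_m[xm] * msg_mendel_to_m[xm] for xm in STATES])

def mother_marginal_brute_force(prior_m, prior_f, child_obs):
    """Reference answer by direct enumeration of the joint distribution."""
    joint = [sum(prior_m[xm] * prior_f[xf] * mendel(xm, xf, child_obs)
                 for xf in STATES)
             for xm in STATES]
    return normalize(joint)

prior = [0.49, 0.42, 0.09]           # assumed population genotype frequencies
posterior = mother_marginal_sum_product(prior, prior, child_obs=2)
print(posterior)
```

With a child observed as bb, the mother cannot be BB, and the posterior shifts toward Bb and bb even though she revealed nothing herself, which is the kin privacy loss the slide describes.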

Efficient and Secure Testing of Genomes

Recent results [1] offer initial steps towards efficient and secure testing on whole genomes

– Privacy:
  • Individual retains control of own sequenced genome
  • Testing lab and individual perform a genomic test, with minimal mutual information disclosure:
    1. Only test outcome revealed to one or both parties
    2. Individual's genome remains private
    3. Lab keeps its test specifics private
  • Fast cryptographic protocols used for secure function evaluation
– Efficiency:
  • Maximizes pre-computation
  • Domain knowledge used to reduce "input size" to cryptographic layer (e.g., by emulating current in-vitro tests)

[Figure: the individual supplies the genome and the doctor or lab supplies the test specifics to a Secure Function Evaluation, which returns the test result to each side. Output reveals nothing beyond test result.]

Genomic tests:
• Paternity/Ancestry Testing
• Testing of SNPs/Markers
• Compatibility Testing
• […]

Cryptographic protocols:
• Private Set Intersection (PSI)
• Authorized PSI
• Cardinality-Only PSI
• […]

[1] P. Baldi, R. Baronio, E. De Cristofaro, P. Gasti, G. Tsudik. Countering GATTACA: Efficient and Secure Testing of Fully-Sequenced Human Genomes. CCS 2011.
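To give a feel for cardinality-only PSI, here is a toy sketch based on commutative exponentiation (a Diffie-Hellman-style construction). It is not the protocol of [1]: the modulus is a small Mersenne prime chosen for readability and offers no real security, both parties are simulated in one process, the shuffle uses a non-cryptographic generator, and the `"snpN:allele"` item encoding is hypothetical.

```python
import hashlib
import math
import random
import secrets

P = 2**127 - 1  # toy prime modulus (assumption); far too weak for real use

def hash_to_group(item):
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest, "big") % P

def keygen():
    """Pick a secret exponent coprime to P - 1 so that x -> x^k is one-to-one mod P."""
    while True:
        k = secrets.randbelow(P - 3) + 2
        if math.gcd(k, P - 1) == 1:
            return k

def blind(items, key):
    return [pow(hash_to_group(x), key, P) for x in items]

def psi_cardinality(client_items, server_items):
    """Return the size of the intersection; items in each list must be distinct."""
    a, b = keygen(), keygen()                    # client and server secrets
    client_blinded = blind(client_items, a)      # sent client -> server
    # Server adds its exponent and shuffles, so the client learns how many
    # items matched but not which ones.
    double_blinded = [pow(v, b, P) for v in client_blinded]
    random.shuffle(double_blinded)
    server_blinded = blind(server_items, b)      # sent server -> client
    # Client adds its own exponent; equal items collide as H(x)^(a*b).
    server_double = {pow(v, a, P) for v in server_blinded}
    return sum(v in server_double for v in double_blinded)

# Hypothetical marker encodings of the form "snpN:allele".
individual = ["snp1:A", "snp2:G", "snp3:T", "snp4:C"]
lab_test = ["snp2:G", "snp4:C", "snp9:A"]
print(psi_cardinality(individual, lab_test))
```

A count of shared markers is the kind of output a paternity or ancestry test needs, which is why revealing only the cardinality can be enough.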

Android-based GenoDroid Framework [2]

[Figure: pipeline stages Data Conversion, Test Dependent Processing, Cryptographic Pre-processing, Secure Computation, and Communication & Pairing, distributed across the Sequencing Center, Desktop and Smartphone; part of the pipeline is marked "Only done once".]

[2] E. De Cristofaro, S. Faber, P. Gasti, G. Tsudik. GenoDroid: Are Privacy-Preserving Genomic Tests Ready for Prime Time? WPES 2012.

Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data

Presented yesterday at HealthTech by E. Ayday

• Protect the privacy of users' genomic, clinical and environmental medical data at a centralized biobank.
• Protect the privacy of the medical unit's confidential data.
• Allow medical units to perform some computations on the encrypted data in a privacy-preserving fashion.
• Allow specialists to access only the genomic data they need (or are authorized for).
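Computing on encrypted data can be illustrated with an additively homomorphic scheme. The sketch below uses textbook Paillier encryption as an assumption, since these slides do not name the cryptosystem; the key is built from two small Mersenne primes for readability and is not secure, and the SNP values and integer weights are synthetic.

```python
import math
import secrets

# Toy key from two Mersenne primes; real deployments need much larger moduli.
P, Q = 2**31 - 1, 2**61 - 1
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)                                # valid because g = N + 1

def encrypt(m):
    """Paillier encryption with generator g = N + 1 and fresh randomness r."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if math.gcd(r, N) == 1:
            break
    return (1 + m * N) % N2 * pow(r, N, N2) % N2

def decrypt(c):
    return (pow(c, LAM, N2) - 1) // N * MU % N

def weighted_sum(ciphertexts, weights):
    """Compute Enc(sum of w_i * m_i) from the Enc(m_i) without decrypting."""
    acc = encrypt(0)
    for c, w in zip(ciphertexts, weights):
        acc = acc * pow(c, w, N2) % N2     # Enc(a) * Enc(m)^w = Enc(a + w * m)
    return acc

snps = [0, 1, 2, 1]          # synthetic risk-allele counts for four SNPs
weights = [3, 5, 2, 7]       # hypothetical integer risk weights
encrypted = [encrypt(s) for s in snps]
print(decrypt(weighted_sum(encrypted, weights)))
```

The party that stores the ciphertexts can evaluate the weighted sum without ever seeing a SNP value; only the holder of the decryption key learns the resulting score.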

Proposed Solution

[Figure: four parties, PATIENT (P), CERTIFIED INSTITUTION (CI), STORAGE AND PROCESSING UNIT (SPU) and MEDICAL UNIT (MU). Labelled flows: (i) DNA sample; (i) Clinical and Environmental data; (i) Encrypted clinical and environmental data; (ii) Encrypted SNPs; (iii) Disease Risk Computation.]

E. Ayday, J. L. Raisaro, P. J. McLaren, J. Fellay, and J.-P. Hubaux. Privacy-Preserving Computation of Disease Risk by Using Genomic, Clinical, and Environmental Data. USENIX Security Workshop on Health Information Technologies (HealthTech '13)

See also poster at the main conference: "Towards Quantifying and Preventing the Leakage of Genomic Data Using Privacy-Enhancing Technologies"

Conclusion
• Digital medicine is coming
• It will forever change the landscape of privacy protection
• Genomics is particularly relevant; ongoing huge research effort
• Highly sensitive data + huge amounts of data + complex correlations between data => Complex field, Big Data
• Very few researchers have addressed the topic of genome privacy => much more needs to be done in this field!!
• More information and pointers:
  http://sprout.ics.uci.edu/projects/privacy-dna/
  http://lca.epfl.ch/projects/genomic-privacy/
• Survey of the topic (available on the latter website): E. Ayday, E. De Cristofaro, J.-P. Hubaux, G. Tsudik, "The Chills and Thrills of Whole Genome Sequencing", EPFL-REPORT-186866, June 2013