www.viscovery.net

Privacy Preserving Data Mining: An approach to safely share and use sensible medical data

Gerhard Kranner, Viscovery
Biomax Symposium, June 24th, 2016, Munich


© 2016 Viscovery Software GmbH www.viscovery.net

Privacy protection vs knowledge gain

What is Privacy Preserving Data Mining?
Terms and standards
Risks, limits, and issues
Data mining without need of data disclosure
Data abstraction with perceptual maps
Connectome example


Privacy Preserving Data Mining

➢ "PPDM is the responsible use of data mining to extract useful knowledge from data without compromising data privacy."

This implies the need to
– Access, explore, and model sensitive data
– Share results, deploy analytical models

But, in doing so, to
– Observe legal and ethical standards
– In particular, preserve data confidentiality


Basic terms

Pseudonymization
– Replace identifying fields within each data record by pseudonyms (artificial codes)

De-identification
– Remove, mask, or generalize identifying information to prevent a person's identity from being connected with information

Anonymization
– Irreversibly remove the association between an identifying dataset and the data subject
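
A minimal sketch of pseudonymization, assuming a keyed hash (HMAC-SHA256) as the source of artificial codes. The field names, the sample record, and `SECRET_KEY` are hypothetical and not taken from the talk. Note that anyone holding the key can re-derive the code for a known identity, which is why this counts as pseudonymization and not anonymization.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored apart from the data.
SECRET_KEY = b"replace-with-a-secret-key"

# Assumed names of the identifying fields, for illustration only.
IDENTIFYING_FIELDS = ("name", "ssn")


def pseudonymize(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Return a copy of record with identifying fields replaced by one pseudonym."""
    identity = "|".join(str(record[f]) for f in IDENTIFYING_FIELDS)
    code = hmac.new(key, identity.encode("utf-8"), hashlib.sha256).hexdigest()[:16]
    out = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    out["pseudonym"] = code
    return out


patient = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "COPD"}
print(pseudonymize(patient))
```

The same person always receives the same code, so records can still be joined across tables without exposing the name.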


Common de-identification methods

Removal of identifiers
– Direct identifiers: name, address, social security number
– Quasi-identifiers: birthday, ZIP, sex
– Any links to identifying information

Data and/or output perturbation
– Add non-deterministic noise to attribute values
– Mask, modify, aggregate values systematically

Generalization (data binning, bucketing)
– Original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval
– Generalize all dates to year: 17th March 1983 → 1983
– Reduce zip codes to three digits: D-82152 → 821
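
The generalization rules on this slide can be sketched as follows. The function names are illustrative, and using the bin midpoint as the representative value is an assumption; the slide only says "a value representative of that interval".

```python
from datetime import date


def generalize_date(d: date) -> int:
    """Generalize a full date to its year."""
    return d.year


def generalize_zip(zip_code: str, digits: int = 3) -> str:
    """Keep only the leading digits of a postal code, dropping any country prefix."""
    only_digits = "".join(ch for ch in zip_code if ch.isdigit())
    return only_digits[:digits]


def bin_value(x: float, width: float) -> float:
    """Replace x by the midpoint of the fixed-width bin it falls into (assumed choice)."""
    lower = (x // width) * width
    return lower + width / 2


print(generalize_date(date(1983, 3, 17)))  # 1983
print(generalize_zip("D-82152"))           # 821
print(bin_value(37.0, 10.0))               # 35.0
```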


Example: Two-dimensional binning
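
The figure for this slide is not part of the transcript. As a stand-in, here is a minimal sketch of two-dimensional binning on a regular grid; the attribute pair (age, systolic blood pressure), the sample values, and the cell widths are invented for illustration.

```python
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count points per rectangular cell; cells are indexed by (column, row)."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // x_width), int(y // y_width))
        counts[cell] += 1
    return counts


# Hypothetical (age, systolic blood pressure) pairs.
points = [(34, 118), (37, 121), (52, 140), (58, 143), (59, 149)]
cells = bin_2d(points, x_width=10, y_width=20)
for cell, n in sorted(cells.items()):
    print(cell, n)
```

Only the cell counts would be released; the exact coordinates inside each cell are no longer recoverable.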


The HIPAA "Safe Harbor" Method

HIPAA Privacy Rule, USA, 2003: Provides mechanisms for using and disclosing health data responsibly without the need for patient consent

EITHER apply Expert Determination Method
OR remove or generalize 18 specific types of data:

(A) Names
(B) All geographic subdivisions, including street address, city, county, precinct, ZIP code, if the geographic unit contains fewer than 20,000 people…
(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89…
(D) Telephone numbers
(E) Fax numbers
(F) Email addresses
(G) Social security numbers
(H) Medical record numbers
(I) Health plan beneficiary numbers
(J) Account numbers
(K) Certificate/license numbers
(L) Vehicle identifiers and serial numbers, including license plate numbers
(M) Device identifiers and serial numbers
(N) Web Universal Resource Locators (URLs)
(O) Internet Protocol (IP) addresses
(P) Biometric identifiers, including finger and voiceprints
(Q) Full-face photographs and any comparable images
(R) Any other unique identifying number, characteristic, or code
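
A partial sketch of how a few of these rules might be applied to a single record. It covers only some of the 18 categories and is not a compliance tool. The record layout, the dropped field names, and the `area_population` argument (the population of the three-digit ZIP area, which the caller is assumed to look up) are hypothetical. Replacing the ZIP prefix of a small area by "000" follows the Safe Harbor rule text.

```python
from datetime import date

POPULATION_THRESHOLD = 20_000  # threshold named on the slide

# Assumed field names for some directly identifying categories.
DROPPED_FIELDS = {"name", "phone", "email", "ssn", "mrn"}


def safe_harbor_zip(zip_code: str, area_population: int) -> str:
    """Keep the first three ZIP digits only if the area exceeds the threshold."""
    if area_population > POPULATION_THRESHOLD:
        return zip_code[:3]
    return "000"


def safe_harbor_age(age: int) -> str:
    """Aggregate all ages over 89 into one category."""
    return "90+" if age > 89 else str(age)


def safe_harbor(record: dict, area_population: int) -> dict:
    """Apply the subset of rules sketched here to one record."""
    out = {k: v for k, v in record.items() if k not in DROPPED_FIELDS}
    out["zip"] = safe_harbor_zip(record["zip"], area_population)
    out["admission_date"] = record["admission_date"].year  # keep the year only
    out["age"] = safe_harbor_age(record["age"])
    return out


patient = {
    "name": "Jane Doe",
    "phone": "555-0100",
    "zip": "90210",
    "admission_date": date(2015, 3, 17),
    "age": 92,
    "diagnosis": "COPD",
}
print(safe_harbor(patient, area_population=50_000))
```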


Usual de-identification process

Source: NISTIR 8053, De-Identification of Personal Information, 2015


Limits and issues

Re-identification risk
– Cross-reference anonymous data with other data sources to re-identify the origin (linkage attack)
– May result in harms to individuals or groups

De-identification is of limited use
– Not robust against advanced re-identification methods
– Impossible in certain cases
– E.g., genetic data cannot be safely anonymized due to the huge amount of pattern information in bio-specimens, which allows the donors to be re-identified

→ Cannot be sure whether information is re-identifiable!


Implicit disclosure risk

Attribute disclosure
– Adversary derives sensitive information about a patient from released data in conjunction with disclosed information
– E.g., all patients in a list have a specific diagnosis

Inferential disclosure
– When information can be inferred with high confidence from statistical properties of released data
– E.g., infer the income of a data subject from the (publicly available) purchase price of a home
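
One way to screen a release for the attribute-disclosure case above is to look for groups of records that share the same quasi-identifiers and all carry the same sensitive value. The sketch below does this; the field names and sample records are invented.

```python
from collections import defaultdict


def homogeneous_groups(records, quasi_identifiers, sensitive):
    """Return groups whose members all share the same sensitive value."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    # A group with exactly one distinct sensitive value discloses that value
    # for everyone known to belong to the group.
    return {key: next(iter(vals)) for key, vals in groups.items() if len(vals) == 1}


released = [
    {"year": 1983, "zip3": "821", "diagnosis": "asthma"},
    {"year": 1983, "zip3": "821", "diagnosis": "asthma"},
    {"year": 1975, "zip3": "805", "diagnosis": "COPD"},
    {"year": 1975, "zip3": "805", "diagnosis": "diabetes"},
]
risky = homogeneous_groups(released, ("year", "zip3"), "diagnosis")
print(risky)
```

Here nobody is re-identified, yet knowing that a person was born in 1983 and lives in area 821 is enough to learn the diagnosis.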


Linkage attacks

Link records in datasets based on similarity between subsets of attributes
A combination of attributes makes it possible to discern records in each dataset (fingerprint information)
Use machine learning for pattern matching

→ Can link the identity of data subjects in a (released or public) dataset with confidential information contained in another dataset
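
A minimal sketch of a linkage attack, simplified to exact matching on quasi-identifiers; the slide speaks of similarity-based matching, which this does not implement. Both datasets and all field names are invented.

```python
def link(released, public, quasi_identifiers):
    """Pair each released record with the public record it matches uniquely, if any."""
    index = {}
    for p in public:
        key = tuple(p[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(p)
    matches = []
    for r in released:
        candidates = index.get(tuple(r[q] for q in quasi_identifiers), [])
        if len(candidates) == 1:  # unambiguous match: the record is re-identified
            matches.append((candidates[0]["name"], r["diagnosis"]))
    return matches


# De-identified release: no names, but quasi-identifiers remain.
released = [
    {"birth_year": 1983, "zip3": "821", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1975, "zip3": "805", "sex": "M", "diagnosis": "COPD"},
]
# Public dataset with names and the same quasi-identifiers.
public = [
    {"name": "A. Example", "birth_year": 1983, "zip3": "821", "sex": "F"},
    {"name": "B. Example", "birth_year": 1975, "zip3": "805", "sex": "M"},
    {"name": "C. Example", "birth_year": 1975, "zip3": "805", "sex": "M"},
]
print(link(released, public, ("birth_year", "zip3", "sex")))
```

The second released record stays ambiguous because two public records share its fingerprint.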


Linkage examples for re-identification

Movie ratings
– Dataset 1: 500,000 training records containing customer ratings of movies (1 to 5 stars) published by Netflix
– Dataset 2: Ratings of (personally) registered users at IMDb
– With only eight movie ratings and dates, 96% of released Netflix subscribers can be uniquely identified

Medical tests
– Only four consecutive laboratory test results of CHEM-7 (creatinine) uniquely distinguished 89.9% of test subjects in a sample of 61,280 patients

Credit card transactions
– Four distinct points in space and time were sufficient to uniquely specify 90% of the individuals in a sample of 1.1 million people
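
Figures like these are uniqueness rates: the share of records that are the only ones with a given combination of values. A minimal sketch of that calculation follows; the toy lab values and field names are invented and do not reproduce any of the cited studies.

```python
from collections import Counter


def unique_fraction(records, attributes):
    """Fraction of records whose values on `attributes` occur exactly once."""
    keys = [tuple(r[a] for a in attributes) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)


# Hypothetical consecutive test results (t1, t2, t3) for four patients.
labs = [
    {"t1": 1.0, "t2": 1.1, "t3": 0.9},
    {"t1": 1.0, "t2": 1.1, "t3": 1.2},
    {"t1": 0.8, "t2": 0.9, "t3": 0.9},
    {"t1": 1.0, "t2": 1.3, "t3": 1.0},
]
for attrs in (["t1"], ["t1", "t2"], ["t1", "t2", "t3"]):
    print(attrs, unique_fraction(labs, attrs))
```

Each added attribute makes more records unique, which is the effect the three examples above report at scale.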


Conclusion

De-identification should be applied
– Removal of direct identifiers is essential
– Must conform with legal regulations

However, even complete anonymization
– Only reduces matching accuracy
– Does not prevent re-identification

➢ Traditional de-identification is not sufficient to ensure privacy, yet it is detrimental to data mining!


Consequences

Need comprehensive strategies (Release Models) for the use of confidential data and results
– Observe data privacy
– Limit risk of re-identification
– Minimize information loss

Need technologies that support these strategies
– Level of disclosed information under control of application
– Ideal application: Provides complete conceptual information without disclosing original data


Release Models

Data Use Agreement (DUA) model
– Make de-identified data available under a legally binding data use agreement

Conceptual model
– Provide access only to aggregate data while prohibiting access to records containing data on an individual

Enclave model
– Keep data in a kind of segregated enclave that restricts export of original data; instead, accept queries from qualified users, run the queries on the data, and respond with results
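
The enclave model can be sketched as a small class that keeps the records private and answers only aggregate queries from qualified users. The class, its method, and the minimum group size are assumptions made for illustration; the slide does not prescribe an interface or a suppression rule.

```python
class Enclave:
    """Holds records privately and answers only aggregate queries."""

    def __init__(self, records, qualified_users, min_group_size=5):
        self._records = list(records)
        self._qualified = set(qualified_users)
        self._min = min_group_size  # assumed safeguard against tiny result groups

    def mean(self, user, field, predicate=lambda r: True):
        """Return the mean of `field` over matching records, or refuse."""
        if user not in self._qualified:
            raise PermissionError("user is not qualified to query the enclave")
        values = [r[field] for r in self._records if predicate(r)]
        if len(values) < self._min:
            raise ValueError("result suppressed: too few matching records")
        return sum(values) / len(values)


# Hypothetical patient records; they never leave the enclave object.
records = [{"age": age} for age in (50, 51, 52, 53, 54, 55)]
enclave = Enclave(records, qualified_users={"alice"})
print(enclave.mean("alice", "age"))  # 52.5
```

A query that narrows the selection to a handful of patients, for example `enclave.mean("alice", "age", lambda r: r["age"] > 53)`, is refused instead of answered.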


Role and purpose based access control

Source: Indumathi, InTech, 2012, http://dx.doi.org/10.5772/49982


PPDM by decoupling models from data

Represent original data in perceptual map
– Generates abstraction that directly shows data distribution
– Data statistics contained in microcluster ensemble

Perform data mining on the map
– Explore, visualize, and cluster data distribution
– Enhance model with predictive capabilities

Segregate map from original data
– Disclose map as conceptual repository for further exploration
– Deploy predictive model for use/integration in applications

Enable access to original data via map
– Achievable through Micro-Cluster Queries (MCQ)
– User authorization for MCQ under control of application
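
To make the idea of a microcluster ensemble concrete, the sketch below assigns each record to its nearest prototype and keeps only a count and a mean vector per microcluster. This is a generic illustration, not Viscovery's implementation and not the MCQ mechanism; the prototypes are given by hand here, and all names and values are hypothetical.

```python
def nearest(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda i: sum((p - v) ** 2 for p, v in zip(prototypes[i], x)),
    )


def summarize(records, prototypes):
    """Per-microcluster count and mean vector; the records can then be set aside."""
    dim = len(prototypes[0])
    counts = [0] * len(prototypes)
    sums = [[0.0] * dim for _ in prototypes]
    for x in records:
        i = nearest(prototypes, x)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += x[d]
    return [
        {"count": n, "mean": [s / n for s in row] if n else None}
        for n, row in zip(counts, sums)
    ]


prototypes = [(0.0, 0.0), (10.0, 10.0)]
records = [(1.0, 0.0), (0.0, 1.0), (9.0, 11.0), (11.0, 9.0), (10.0, 10.0)]
stats = summarize(records, prototypes)
print(stats)
```

The released object would be `stats` together with the prototypes; individual records appear in it only through their contribution to a count and a mean.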


Example: CIROCO data representation

Source: Vanfleteren et al., AJRCCM, 2013


CIROCO study: Model publication


CIROCO study: Diagnostic factors


CIROCO study: Aggregate statistics


Self-Organizing Maps (SOM)

SOMs represent data distributions in perceptual maps
– Able to create maps from big / complex data
– Original data can be "forgotten"
– Maintains essential distribution information
– Contains local data statistics in microclusters (cluster binning)

Released map is a conceptual repository to
– Visually explore data distributions
– Make complex distributions tangible
– Explore patterns and data dependencies
– Draw benefit from sensitive data without disclosing data
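
For readers unfamiliar with SOMs, here is a minimal textbook-style sketch: a one-dimensional map (a chain of nodes) trained with a decaying learning rate and a shrinking Gaussian neighborhood. It is not the patented Viscovery method; the parameter values, the chain topology, and the toy data are assumptions chosen to keep the example short.

```python
import math
import random


def train_som(data, n_nodes=4, epochs=50, seed=0):
    """Train a one-dimensional SOM (a chain of nodes) on a list of vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialize each node with a randomly chosen data vector.
    nodes = [list(rng.choice(data)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                          # decaying learning rate
        radius = max(0.5, (n_nodes / 2.0) * (1.0 - frac))  # shrinking neighborhood
        for x in rng.sample(data, len(data)):
            # Best-matching unit: the node closest to x.
            bmu = min(
                range(n_nodes),
                key=lambda i: sum((nodes[i][d] - x[d]) ** 2 for d in range(dim)),
            )
            for i in range(n_nodes):
                # Gaussian neighborhood on the chain pulls nearby nodes along.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    nodes[i][d] += rate * h * (x[d] - nodes[i][d])
    return nodes


# Two small, well-separated groups of hypothetical 2-D measurements.
data = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (0.9, 1.0), (1.0, 0.8), (0.8, 0.9)]
som = train_som(data)
for node in som:
    print([round(v, 2) for v in node])
```

After training, only the node vectors are needed to describe where the data lies; the training records themselves are not stored in the map.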


PPDM with Viscovery®

Workflow-oriented system for predictive modeling
– Explorative data mining, visual clustering
– Profiling, statistical analyses
– Classification, non-linear regression

Based on innovative, patented combination of
– Self-Organizing Maps (SOM)
– Multivariate statistics

Map can be segregated from original data
– Disclosure of map does not compromise privacy
– Can be integrated in operational systems (BioXM)
– Level of data disclosure under control of application


Viscovery® data flow (project mode)

[Diagram. Labels: Application, Application data, Preprocessing, De-identified data, Analytical Datamarts, Viscovery® SOMine, Modeling, Model data, Predictive Models, Results]


Viscovery® data flow (operational mode)

[Diagram. Labels: PPDM application with user access control, User interaction, De-identified data, Viscovery® One(2)One Engine, Predictive Model, Model name, Model loading, Model application, Model recall, Data record, Parameter name, Parameter value, Result]


Example: Mining the connectome

Connectome matrices of individual brains
– Source: http://umcd.humanconnectomeproject.org/
– De-identified, pseudonymized data (highly confidential)
– Connectivity Matrix + Diagnosis (Autism) + Personal data
– Draw conclusions about personality, mental disorders, …

Derive network measures
– Build network graph from each matrix
– Calculate network measures (on global or local level)
– E.g., Clustering Coefficient, Characteristic Path Length, Transitivity, Assortativity, Betweenness

Visualize, explore, cluster network data in Viscovery®
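
Two of the network measures named above can be sketched for a small graph as follows. The sketch assumes an undirected, unweighted graph given as a binary adjacency matrix with at least one edge; a real connectivity matrix is weighted and would first have to be thresholded. It is not the toolbox used in the talk, and the four-node example graph is invented.

```python
from collections import deque


def neighbors(adj, i):
    """Indices of nodes connected to node i (self-loops ignored)."""
    return [j for j, w in enumerate(adj[i]) if w and j != i]


def clustering_coefficient(adj):
    """Mean local clustering coefficient over all nodes."""
    total = 0.0
    for i in range(len(adj)):
        nb = neighbors(adj, i)
        k = len(nb)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a in nb for b in nb if a < b and adj[a][b])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def characteristic_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs (BFS)."""
    total, pairs = 0, 0
    for s in range(len(adj)):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in neighbors(adj, u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Triangle 0-1-2 with a fourth node attached to node 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
print(round(clustering_coefficient(adj), 3))      # 0.583
print(round(characteristic_path_length(adj), 3))  # 1.333
```

Each brain then reduces to a short vector of such measures, which is the kind of input a map-based analysis would work on.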


Diffusion Tensor Imaging data from the Human Connectome Project

Source: www.nimh.nih.gov/news/science-news/2012/brain-wiring-a-no-brainer.shtml


Diffusion Tensor Imaging (DTI)

Diffusion Gradients
– Directed flow of water molecules detected by MR indicating fiber tracts

Reconstructed Fiber Tracts
– Reconstructed fiber tracts indicate a potential anatomical connection between two brain areas

Connectivity Matrix
– Thickness of detected fibers between brain areas (color coded)


Topological graph of functional network

Source: Bullmore, Sporns 2009, Nature Reviews Neuroscience, Vol. 10


Calculation of network measures

Values are computed by Brain Connectivity Toolbox, Rubinov & Sporns, 2009

Source: http://umcd.humanconnectomeproject.org


Can network measures hold as biomarkers for brain diseases?


Stratification of autism patients

leveraging comprehensive clinical knowledge without compromising patient data privacy


www.viscovery.net

Learn more and visit us at ...

Viscovery Software GmbH

Kupelwiesergasse 27
A-1130 Wien

Tel. +43-1-532 [email protected]