ORNL is managed by UT-Battelle for the US Department of Energy Breaking the curse of dimensionality Explainable-AI and Evidence Mining as Applied to Systems Biology Dan Jacobson This research is supported by an INCITE award and uses resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE AC05-00OR22725.


Page 1: Breaking the curse of dimensionality

ORNL is managed by UT-Battelle for the US Department of Energy

Breaking the curse of dimensionality
Explainable-AI and Evidence Mining as Applied to Systems Biology
Dan Jacobson

This research is supported by an INCITE award and uses resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Page 2: Breaking the curse of dimensionality

Experimental Data Types
• Natural Variation
  – Genome Wide Association Studies
  – 28 million SNPs
  – ~140,000 Primary Phenotypes
• Morphology/Phenology
• Molecular
• Microbiomes & Metagenomes
• Omics & Meta-omics
  – Genomics, Transcriptomics, Proteomics, Metabolomics
• All publicly available Genomes
• Differential/Time Series Expression Studies
• Systems Biology Approach
  – Combining datasets across omics layers, sample sets, and species

Page 3: Breaking the curse of dimensionality

Traditional Results

Page 4: Breaking the curse of dimensionality

Integrated Vision: From Systems Biology to 3D Structural Interactions

Structures: 28 million compounds

Page 5: Breaking the curse of dimensionality

Integrated Vision: From Human Systems Biology to 3D Structural Interactions - Pharmacogenomics and Personalized Medicine

• Co-evolution
• 1000 Genomes Project
• Protein cross species
• Human RNA-seq
• Crystal Structures
• Protein-protein interaction
• Human interactome
• ENCODE
• Explainable-AI
  – iRF/TiRF
  – DNNs
  – MIPs
• Public GWAS data
• As available from collaboration with VA
  – Genetic data
  – Clinical phenotypes
  – Polypharmacy data

Structures: 28 million compounds

Page 6: Breaking the curse of dimensionality

Integrated Vision: From Human Systems Biology to 3D Structural Interactions - Pharmacogenomics and Personalized Medicine

Structures: 28 million compounds

Page 7: Breaking the curse of dimensionality

SNP Vectors Phenotype Vectors

Single QTL mapping: 28 million tests per phenotype

• SNP Matrix expansion from interpolation (ANL)
• Control for effects of population structure

Page 8: Breaking the curse of dimensionality
Page 9: Breaking the curse of dimensionality

140,000 Manhattan Plots???

Page 10: Breaking the curse of dimensionality

Network Theory
• Networks can be used to represent biological systems
  – Nodes
    • Represent any object (genes, SNPs, proteins, metabolites, species, microbiomes, etc.)
  – Edges
    • Represent a relationship between two nodes (correlation, co-occurrence, physical contact, etc.)
    • Relationships can be quantitative (represented by the thickness of the line)
• Integration and Visualization of Systems Biology Models
• Mathematical Structure
  – Allows the network to be computed upon
  – Millions of nodes
  – Trillions of edges
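As a minimal sketch of the node/edge representation described above, the snippet below stores a small weighted network as an adjacency dictionary and queries it. The node names and edge weights are invented for illustration and do not come from the talk.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected weighted network as a dict of dicts."""
    graph = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def neighbors_above(graph, node, threshold):
    """Return the neighbors of `node` whose edge weight is at least `threshold`."""
    return sorted(n for n, w in graph[node].items() if w >= threshold)


# Hypothetical edges: (node, node, weight), e.g. a correlation strength.
example_edges = [
    ("gene_A", "gene_B", 0.92),
    ("gene_A", "metabolite_X", 0.71),
    ("SNP_1", "gene_B", 0.55),
]

network = build_network(example_edges)
print(neighbors_above(network, "gene_A", 0.8))
```

Nodes of different types (genes, SNPs, metabolites) share one structure here, which is what lets heterogeneous data layers be combined and computed upon.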

Page 11: Breaking the curse of dimensionality

Deeper Discoveries in Systems Biology: The Balance Between Type 1 and Type 2 Error

Our ability to reconstruct the entirety of a complex biological system improves as more population-scale endo-, meso- and exo-phenotypes are measured and combined with deep layers of experimental data collected on individual genotypes.

Pleiotropic and Epistatic Network-Based Discovery: Integrated Networks for Target Gene Discovery. Deborah Weighill, Piet Jones, Manesh Shah, Priya Ranjan, Wellington Muchero, Jeremy Schmutz, Avinash Sreedasyam, David Macaya Sanz, Robert Sykes, Nan Zhao, Madhavi Martin, Stephen DiFazio, Timothy Tschaplinski, Gerald Tuskan, Daniel Jacobson. Front. Energy Res. - Bioenergy and Biofuels, DOI: 10.3389/fenrg.2018.00030

Page 12: Breaking the curse of dimensionality

GWAS: Single QTL Mapping
• Very Powerful
• Frequently does not capture a significant portion (often the majority) of the genetic signal
• Often does not find complete genetic architectures for complex phenotypes (Dementia, Alzheimer’s, Schizophrenia, Cardiovascular disease, PTSD, Suicide, Addiction, etc.)

Page 13: Breaking the curse of dimensionality

Epistatic Example: Transcription Initiation Complex

Page 14: Breaking the curse of dimensionality

The Need for Speed

[Figure: Pairwise Epistatic Compute Time per Phenotype. x-axis: SNPs (millions), 0 to 30; y-axis: CPU hours (millions), -50 to 250.]

4-way combinations = 2.4 x 10^20 CPU hours per phenotype
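To make the growth of the search space concrete, the sketch below counts the k-way SNP combinations that an exhaustive scan would have to visit, using the 28 million SNPs quoted earlier in the deck. It counts combinations only. It does not model CPU time, and the function name is an illustrative choice.

```python
from math import comb

N_SNPS = 28_000_000  # SNP count quoted in the deck


def n_combinations(n_features, order):
    """Number of distinct unordered feature sets of size `order`."""
    return comb(n_features, order)


for order in (1, 2, 3, 4):
    print(f"{order}-way combinations: {n_combinations(N_SNPS, order):.3e}")
```

Each increase in interaction order multiplies the count by roughly the number of SNPs divided by the new order, which is why exhaustive testing beyond pairs quickly becomes impractical.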

Page 15: Breaking the curse of dimensionality


Breaking the curse of dimensionality

10M Genetic Variants in >40k genes

Genes do not work in isolation: 10^170 potential interactions among variants

Linking genetic variants to phenotypes requires the exploration of an enormous space

To obtain accuracy and insight, we are developing procedures to detect interactions of any form or order at the same computational cost as main effects

Explainable-AI

Page 16: Breaking the curse of dimensionality


Machine and Deep Learning Algorithms

• Great at classification
• Essentially black boxes

• Don’t reveal the interactions between variables that lead to the classification

• Need Explainable AI

Page 17: Breaking the curse of dimensionality


Finding Higher Order Combinatorial Interactions in Complex Systems

• X matrix and Y vector
• Iterative Random Forests

Page 18: Breaking the curse of dimensionality

iRF Workflow

[Figure: iRF workflow diagram. Stage labels: Training Data, Test Data, Weighted Random Forest, Tree Ensemble, Node Filtering, Random Intersection Trees, Feature Interactions, Interactions (refined), Bootstrap Aggregation (Bagging), Interactions (stabilized), Feature Importance, Prediction Accuracy.]

[Example decision-tree splits shown in the figure: X2 > 0.9, X2 < 0.9, X2 > 0.3, X2 < 0.3, X3 > 0.1, X3 < 0.1, X7 > 0.1, X7 < 0.1, X9 > 0.5, X9 < 0.5, X10 < 0.5, X11 > 0.5, X11 < 0.5, X11 > 0.7, X11 < 0.7.]

[Example feature sets shown in the figure: {X2, X11}, {X11}, { }, {X2, X11, X7}, {X2, X11, X1}, {X9, X3}, {X9, X11, X10}, {X3, X9, X10}, {X2, X7, X11}.]

DOE Collaboration: Ben Brown - LBNL
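The idea of mining a tree ensemble for recurring feature sets can be illustrated with a small sketch. This is a simplified, deterministic stand-in and not the iRF implementation: Random Intersection Trees sample decision paths at random and intersect them sequentially, whereas this version intersects every pair of paths and counts the feature sets that survive. The path feature sets are hypothetical and only loosely modeled on the diagram labels.

```python
from collections import Counter
from itertools import combinations

# Hypothetical feature sets, one per decision path (root to leaf) in a forest.
paths = [
    {"X2", "X11", "X7"},
    {"X2", "X11", "X1"},
    {"X9", "X3"},
    {"X9", "X11", "X10"},
    {"X2", "X11"},
    {"X3", "X9", "X10"},
]


def count_shared_feature_sets(path_sets, min_size=2):
    """Intersect every pair of paths and count each surviving feature set."""
    counts = Counter()
    for a, b in combinations(path_sets, 2):
        shared = frozenset(a & b)
        if len(shared) >= min_size:
            counts[shared] += 1
    return counts


candidates = count_shared_feature_sets(paths)
for features, n in candidates.most_common():
    print(sorted(features), n)
```

Feature sets that keep surviving intersection are candidate interactions. In this toy input, X2 and X11 co-occur on several paths and therefore rank first.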

Page 19: Breaking the curse of dimensionality


iRF – X Matrix and 1 Y Vector

SNP Vectors Phenotype Vectors

Page 20: Breaking the curse of dimensionality


iRF – X Matrix and 1 Y Vector

SNP Vectors Phenotype Vectors

4-way combination = 1000 CPU hours per phenotype (140,000 phenotypes)
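The figures on this slide can be put side by side with the exhaustive 4-way estimate quoted earlier in the deck (2.4 x 10^20 CPU hours per phenotype). The arithmetic below is a reader's back-of-the-envelope check. The slides give the inputs but do not state the totals or the ratio, so both derived values are assumptions about how the numbers relate.

```python
IRF_CPU_HOURS_PER_PHENOTYPE = 1_000          # from this slide
N_PHENOTYPES = 140_000                       # from this slide
EXHAUSTIVE_CPU_HOURS_PER_PHENOTYPE = 2.4e20  # exhaustive 4-way estimate from the earlier slide

# Total iRF cost if every phenotype is run once.
total_irf_cpu_hours = IRF_CPU_HOURS_PER_PHENOTYPE * N_PHENOTYPES

# Ratio of exhaustive to iRF cost for a single phenotype.
reduction_factor = EXHAUSTIVE_CPU_HOURS_PER_PHENOTYPE / IRF_CPU_HOURS_PER_PHENOTYPE

print(f"iRF total across phenotypes: {total_irf_cpu_hours:,} CPU hours")
print(f"Per-phenotype reduction vs exhaustive search: {reduction_factor:.1e}x")
```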

Page 21: Breaking the curse of dimensionality


Tensor iterative Random Forests (TiRFs)

• Effectively build forests that can be mined for interactions within a multi-dimensional X, a multi-dimensional Y and interactions between multiple dimensions in X and Y, all at the same time.

Page 22: Breaking the curse of dimensionality


SNP Vectors Phenotype Vectors

Page 23: Breaking the curse of dimensionality


TiRF – X Matrix and Y Matrix Simultaneously

SNP Vectors Phenotype Vectors

Page 24: Breaking the curse of dimensionality


Clinical Genomics and Human Systems Biology: DOE & VA – MVP Champion

• ORNL
  – Clinical records: 23+ million patients, 20 years
  – 358,000 Genotypes
  – => 4 million genotypes

Page 25: Breaking the curse of dimensionality


VA Use Case: Polypharmacy

• Simultaneous use of multiple medications
• Of concern if 5 or more medications are used

Page 26: Breaking the curse of dimensionality


Why Worry About Polypharmacy?

• Drugs interact with each other: the more medications you take, the more interactions can occur
• Side effects add up and are more pronounced in older individuals
• Medications are approved by FDA based on short-term trials that typically exclude:
  – Those with other diagnoses on other medications
  – 65+ year olds

Page 27: Breaking the curse of dimensionality


Interaction Network
• Drug Set Simulations
  – For all set sizes 2 – 30
    • Create 20 million random sets of drugs for each set size
    • 58 million sets
  – Check for drug to drug edges amongst all possible pairs in each set for the shared target and shared pathway networks
    • 567 Billion interaction tests
• Clinical Data
  – Create drug sets from clinical records
  – Check for drug to drug edges amongst all possible pairs in each set for the shared target and shared pathway networks
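The simulation procedure above can be sketched at toy scale: draw random drug sets of a given size, then count how many drug pairs within each set are joined by an edge in an interaction network. The drug names, the edge list, and the function names are hypothetical. The real study used shared-target and shared-pathway networks and far larger sample counts.

```python
import random
from itertools import combinations

# Hypothetical drug list and drug-to-drug edges (e.g. shared target or pathway).
DRUGS = [f"drug_{i}" for i in range(20)]
EDGES = {
    frozenset(pair)
    for pair in [
        ("drug_0", "drug_1"),
        ("drug_0", "drug_2"),
        ("drug_3", "drug_4"),
        ("drug_5", "drug_9"),
    ]
}


def count_edges(drug_set, edges):
    """Count network edges among all possible pairs in one drug set."""
    return sum(
        1 for pair in combinations(sorted(drug_set), 2) if frozenset(pair) in edges
    )


def simulate_mean_edges(drugs, edges, set_size, n_sets, seed=0):
    """Mean number of interaction edges over `n_sets` random sets of `set_size` drugs."""
    rng = random.Random(seed)
    total = sum(count_edges(rng.sample(drugs, set_size), edges) for _ in range(n_sets))
    return total / n_sets


for size in (2, 5, 10):
    print(size, simulate_mean_edges(DRUGS, EDGES, size, n_sets=1000))
```

Running `count_edges` on drug sets built from clinical records instead of random draws gives the clinical counterpart to compare against the simulated baseline.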

Page 28: Breaking the curse of dimensionality

Drug Interaction: Simulation vs Clinical Practice

[Figure: Drug Interaction: Simulation vs Clinical Practice. x-axis: Drug Set Size (0 to 16); y-axis: Known Interactions (0 to 45); series: Non_HIV_Mean, Simulation.]

Page 29: Breaking the curse of dimensionality

Polypharmacy Morbidity & Mortality

[Figure: Polypharmacy Morbidity & Mortality. x-axis: Number of Drugs (2 to 16); y-axes: Interactions (0 to 45) and Hazard Ratio (1.00 to 2.60); series: HIV+ Interactions, Uninfected Interactions, Simulation, HIV+ HR, Uninfected HR, each with a linear trend line.]

Page 30: Breaking the curse of dimensionality

Drugs & Interactions

Outcomes (morbidity and mortality)

VA: Preliminary Results

• Polypharmacy
• Clinically relevant patterns
• iRF
• Steps toward automated phenotyping
• Interaction edges -> morbidity & mortality
• Diseaseome
• iRF on diagnostic codes
• 600,000 patients
• Relationships between all known human conditions
• Co-morbidity map
• Discovered 9th-order combinations
• 1.5 x 10^26 possible 9-way combinations

Page 31: Breaking the curse of dimensionality


Page 32: Breaking the curse of dimensionality


TiRF – Any Set of Matrices or Tensor Dimensions Simultaneously

• Spatial and temporal/longitudinal information
• Different Omics layers (genome, transcriptome, proteome, metabolome, microbiome…)
• Quantum chemical tensors

Page 33: Breaking the curse of dimensionality


Tensors: Matrices Cubes

Page 34: Breaking the curse of dimensionality

Tensors: Matrices Cubes Polytopes

Page 35: Breaking the curse of dimensionality


From data matrix to cube to polytopes.

Machine Learning: TiRF

Page 36: Breaking the curse of dimensionality

Tensor iterative Random Forests (TiRFs)
• Effectively build forests that can be mined for interactions within a multi-dimensional X, a multi-dimensional Y and interactions between multiple dimensions in X and Y, all at the same time.
• Applications in Systems Biology
  – Plants
  – Microbes
  – Humans, Mice
  – Drosophila
• Applications in Text Mining
  – Electronic Health Records
  – Scientific Literature
• Simulation Models
  – Combinatorial parameter sweeps (X), model output (Y)
• Any domain with a high-dimensional set of matrices

Iterative Deep Neural Networks (iDNNs)
• Unpacking the black box
• Discovering the interactions encoded in DNNs

Page 37: Breaking the curse of dimensionality


High Order Interactions:
Explainable AI: Machine and Deep Learning Integration

TiRFs

DNNs

MIPs

Page 38: Breaking the curse of dimensionality


High Order Interactions:
Explainable AI: Machine and Deep Learning Integration

TiRFs

DNNs

MIPs

Network Evolution

MENNDL

Page 39: Breaking the curse of dimensionality


Exposome

Page 40: Breaking the curse of dimensionality


High Order Interactions: Exposome – Adverse Outcome Networks
Explainable AI: Machine and Deep Learning Integration

TiRFs

DNNs

MIPs

Titan/Summit

Page 41: Breaking the curse of dimensionality


Collaborators
• Ben Brown
• Jerry Tuskan
• Steve DiFazio
• Wayne Joubert
• Amy Justice
• Edmon Begoli

Computational Infrastructure
• Oak Ridge Leadership Computing Facility (OLCF)
• Compute and Data Environment for Science (CADES)

Acknowledgements

Joint Genome Institute

Page 42: Breaking the curse of dimensionality


Acknowledgements

CBI
PMI
LDRD
VA
Oak Ridge Leadership Computing Facility (OLCF) at ORNL
Compute and Data Environment for Science (CADES) at ORNL
INCITE
Joint Genome Institute (JGI)

Oak Ridge National Laboratory (ORNL)
Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee, Knoxville

Page 43: Breaking the curse of dimensionality


Acknowledgements
• JAIL Team effort
  – Debbie Weighill
  – Piet Jones
  – Carissa Bleker
  – Armin Geiger
  – Marek Piatek
  – Ben Garcia
  – Ashley Cliff
  – Jonathon Romero
  – David Kainer
  – Annie Fouche
  – Sandra Truong
  – Ryan McCormick
  – Priya Ranjan
  – Manesh Shah
  – Doug Hyatt
  – Blake Wiley
  – Jesse Marks
  – Ian Hodge
  – Annabel Large
  – Chris Ellis

Page 44: Breaking the curse of dimensionality

Questions?

Machine Learning: TiRF