64
Joel Saltz MD, PhD Emory University February 2013 Integrative Multi-Scale Analyses

Integrative Multi-Scale Analyses

Embed Size (px)

DESCRIPTION

Integrative analyses of large scale spatio-temporal datasets play increasingly important roles in many areas of science and engineering. Our recent work in this area is motivated by application scenarios involving complementary digital microscopy, Radiology and "omic" analyses in cancer research. In these scenarios, our objective is to use a coordinated set of image analysis, feature extraction and machine learning methods to predict disease progression and to aid in targeting new therapies. We describe methods we have developed for extraction, management and analysis of features along with the systems software methods for optimizing execution on high end CPU/GPU platforms. We will also describe biomedical results obtained from these studies and extensions of the computational methods to broader application areas.

Citation preview

Page 1: Integrative Multi-Scale Analyses

Joel Saltz MD, PhDEmory UniversityFebruary 2013

Integrative Multi-Scale Analyses

Page 2: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics• Deep Integrative Biomedical Research• High End Computing/”Big Data” Computers,

Systems Software• Analysis of Patient Populations

Page 3: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Application Targets

• Multi-dimensional spatial-temporal datasets– Radiology and Microscopy Image Analyses

– Oil Reservoir Simulation/Carbon Sequestration/Groundwater Pollution Remediation

– Biomass monitoring and disaster surveillance using multiple types of satellite imagery

– Weather prediction using satellite and ground sensor data

– Analysis of Results from Large Scale Simulations

• Correlative and cooperative analysis of data from multiple sensor modalities and sources

• What-if scenarios and multiple design choices or initial conditions

Page 4: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Core Transformations

• Data Cleaning and Low Level Transformations• Data Subsetting, Filtering, Subsampling• Spatio-temporal Mapping and Registration• Object Segmentation • Feature Extraction, Object Classification• Spatio-temporal Aggregation• Change Detection, Comparison, and Quantification

Page 5: Integrative Multi-Scale Analyses

Emory In Silico Center for Brain Tumor Research (PI = Dan Brat, PD= Joel Saltz)

Page 6: Integrative Multi-Scale Analyses
Page 7: Integrative Multi-Scale Analyses
Page 8: Integrative Multi-Scale Analyses
Page 9: Integrative Multi-Scale Analyses

National Science Foundation Grand Challenge in Land Cover Dynamics

• Remote sensing analysis of high resolution satellite images.

• Databases of land cover dynamics are essential for global carbon models, biogeochemical cycling, hydrological modeling and ecosystem response modeling

• Maps of the world's tropical rain forest during the past three decades.

Larry Davis , Rama Chellappa , Joel Saltz , Alan Sussman , John Townshend

Page 10: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Analysis of Computational Data; Uncertainty Quantification, Comparisons with Experimental Results

Dimitri Mavriplis, Raja Das, Joel Saltz

Page 11: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics• Deep Integrative Biomedical Research• High End Computing/”Big Data” Computers,

Systems Software• Analysis of Patient Populations

Page 12: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Whole Slide Imaging: Scale

Data per slide: 500MB to 100GBRoughly 250-500M Slides/Year in USA

Total: 0.1-10 Exabytes/year

Page 13: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Pathology Computer Assisted Diagnosis

Shimada, Gurcan, Kong, Saltz

Page 14: Integrative Multi-Scale Analyses

Computerized Classification System for Grading Neuroblastoma

• Background Identification• Image Decomposition (Multi-

resolution levels)• Image Segmentation

(EMLDA)• Feature Construction (2nd

order statistics, Tonal Features)

• Feature Extraction (LDA) + Classification (Bayesian)

• Multi-resolution Layer Controller (Confidence Region)

No

YesImage Tile

InitializationI = L

Background? Label

Create Image I(L)

Segmentation

Feature Construction

Feature Extraction

Classification

Segmentation

Feature Construction

Feature Extraction

Classifier Training

Down-sampling

Training Tiles

Within ConfidenceRegion ?

I = I -1

I > 1?

Yes

Yes

No

No

TRAINING

TESTING

Page 15: Integrative Multi-Scale Analyses

Using TCGA Data to Study

Glioblastoma

Diagnostic Improvement

Molecular Classification

Predictors of Progression

Page 16: Integrative Multi-Scale Analyses

Digital Pathology

Neuroimaging

TCGA Network

Page 17: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Morphological Tissue Classification

Nuclei Segmentation

Cellular Features

Lee Cooper,Jun Kong

Whole Slide Imaging

Page 18: Integrative Multi-Scale Analyses

Oligodendroglioma Astrocytoma

Nuclear Qualities

Can we use image analysis of TCGA GBMs TO INFORM diagnostic criteria based on molecular or clinical endpoints?

Application: Oligodendroglioma Component in GBM

Page 19: Integrative Multi-Scale Analyses

Millions of Nuclei Defined by n Features

• Bottom-up analysis: let features define and drive the analysis

• Top-down analysis: use the features with existing diagnostic constructs

Page 20: Integrative Multi-Scale Analyses

TCGA Whole Slide Images

Jun Kong

Step 1:Nuclei

Segmentation

• Identify individual nuclei and their boundaries

Page 21: Integrative Multi-Scale Analyses

Nuclear Analysis Workflow

• Describe individual nuclei in terms of size, shape, and texture

Step 2:Feature

Extraction

Step 1:Nuclei

Segmentation

Page 22: Integrative Multi-Scale Analyses

Oligodendroglioma Astrocytoma

Nuclear Qualities

1 10

Step 3:Nuclei

Classification

Page 23: Integrative Multi-Scale Analyses

Oligo Astro

Representative Nuclei

Page 24: Integrative Multi-Scale Analyses

Comparison of Machine-based Classification to Human Based Classification

Separation of GBM, Oligo1, Oligo2 as Designated by Neuropathologists

Separation of GBM, Oligo1 and Oligo2 as Designated by Machine

Page 25: Integrative Multi-Scale Analyses

Survival Analysis

Human Machine

Page 26: Integrative Multi-Scale Analyses

Gene Expression Correlates of High Oligo-Astro Ratio on Machine-based Classification

Oligo Related Genes

Myelin Basic ProteinProteolipoproteinHoxD1

Nuclear features mostAssociated with Oligo Signature Genes:

Circularity (high)Eccentricity (low)

Page 27: Integrative Multi-Scale Analyses

Millions of Nuclei Defined by n Features

• Bottom-up analysis: let nuclear features define and drive the analysis

• Top-down analysis: analyze features in context of existing diagnostic constructs

Page 28: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Direct Study of Relationship Between Image Features vs Clinical Outcome, Response to Treatment, Molecular Information

Lee Cooper,Carlos Moreno

Page 29: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Consensus clustering of morphological signatures

Study includes 200 million nuclei taken from 480 slides corresponding to 167 distinct patients

Each possibility evaluated using 2000 iterations of K-means to quantify co-clustering

Nuclear Features Used to Classify GBMs

3 2 1

20 40 60 80 100 120 140 160

20

40

60

80

100

120

140

1602 3 4 5 6 725

30

35

40

45

50

# Clusters

Silh

ouet

te A

rea

0 0.5 1

1

2

3

Silhouette Value

Clu

ster

Page 30: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Clustering identifies three morphological groups• Analyzed 200 million nuclei from 162 TCGA GBMs (462 slides)• Named for functions of associated genes:

Cell Cycle (CC), Chromatin Modification (CM),

Protein Biosynthesis (PB)• Prognostically-significant (logrank p=4.5e-4)

Featu

re I

ndic

es

CC CM PB

10

20

30

40

500 500 1000 1500 2000 2500 3000

0

0.2

0.4

0.6

0.8

1

Days

Sur

viva

l

CC

CM

PB

Page 31: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Associations

Page 32: Integrative Multi-Scale Analyses

Molecular Correlates of MR Features Using TCGA Data

MRIs of TCGA GBMs reviewed by 3-6 neuroradiologists using VASARI feature set and In Vivo Imaging tools

MR Features compared to TCGA Transcriptional Classes and Genetic Alterations

David Gutman

Page 33: Integrative Multi-Scale Analyses

Capturing structured annotations and markups/ AIM Data Service

Page 34: Integrative Multi-Scale Analyses

VASARI Feature Set

Scott HwangChad HolderAdam Flanders

Page 35: Integrative Multi-Scale Analyses

Test ChiSquare DF P-ValueLog-Rank 12.4775 3 0.0059*Wilcoxon 10.0802 3 0.0179*

Prognostic Significance of Vasari FeaturesTests Between Groups: 0-33% vs. 34-95% Proportion enhancing

Page 36: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics• Deep Integrative Biomedical Research• High End Computing/”Big Data” Computers,

Systems Software• Analysis of Patient Populations

Page 37: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Titan – Peak Speed 30,000,000,000,000,000 floating point operations per second!

Page 38: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Page 39: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Extreme DataCutter Prototype

DataCutter

Pipeline of filters connected though logical streams

In transit processing

Flow control between filters and streams

Developed 1990s-2000s; led to IBM System S

Extreme DataCutter

Two level hierarchical pipeline framework

In transit processing

Coarse grained components coordinated by Manager that coordinates work on pipeline stages between nodes

Fine grained pipeline operations managed at the node level

Both levels employ filter/stream paradigm

Bottom line – everything ends up as DAGS

Page 40: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Extreme DataCutter – Two Level Model

Coarse Grained Level

Page 41: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Node Level Work Scheduling

Fine Grained Level

Page 42: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Brain Tumor Pipeline Scaling on Keeneland (100 Nodes)

Page 43: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Challenge: Structured/Unstructured Grid Calculations with Unpredictable Runtime Dependencies

Key Kernel in Distance Transform, Morphological Reconstruction, Delaney Triagulation

Page 44: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

“Speedup” relative to single CPU core

Page 45: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Large Scale Data Management

Represented by a complex data model capturing multi-faceted information including markups, annotations, algorithm provenance, specimen, etc.

Support for complex relationships and spatial query: multi-level granularities, relationships between markups and annotations, spatial and nested relationships

Highly optimized spatial query and analyses Implemented in a variety of ways including

optimized CPU/GPU, Hadoop/HDFS and IBM DB2

Page 46: Integrative Multi-Scale Analyses

Spatial Centric – Pathology Imaging “GIS”Point query: human marked point inside a nucleus

.

Window query: return markups contained in a rectangle

Spatial join query: algorithm validation/comparison

Containment query: nuclear featureaggregation in tumor regions

Page 47: Integrative Multi-Scale Analyses

Algorithm Validation: Intersection between Two Result Sets (Spatial Join)

PAIS: Example Queries

. .

Page 48: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

VLDB 2012

Change Detection, Comparison, and Quantification

Page 49: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Approach to Integrated Sensor Data Analysis Framework

• Abstract templates specify dataset geometry

• Templates describe collections of space-time regions

• Mapping to memory hierarchies provided by user defined mapping functions

• Leverages Parashar’s DataSpaces

Page 50: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics• Deep Integrative Biomedical Research• High End Computing/”Big Data” Computers,

Systems Software• Analysis of Patient Populations

Page 51: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

• Example Project: Find hot spots in readmissions within 30 days– What fraction of patients with a given principal diagnosis will be

readmitted within 30 days?– What fraction of patients with a given set of diseases will be readmitted

within 30 days?– How does severity and time course of co-morbidities affect readmissions?– Geographic analyses

• Compare and contrast with UHC Clinical Data Base– Repeat analyses across all UHC hospitals– Are we performing the same?– How are UHC-curated groupings of patients (e.g., product lines) useful?

• Need a repeatable process that we can apply identically to both local and UHC data

Clinical Phenotype Characterization and the Emory Analytic Information Warehouse

Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod

Page 52: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Overall System

I2b2 Web Server

I2b2 Database

Source data

Database Mapper

Source data

Source data

Data Processing

Metadata Manager

Metadata Repository

Query Specification

Investigator

Data Analyst

Data Analyst

Data Modeler

Investigator

Query toolsStudy-

specific Database

Investigator

Page 53: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

5-year Datasets from Emory and University Healthcare Consortium

• EUH, EUHM and WW (inpatient encounters)• Removed encounter pairs with chemotherapy and radiation

therapy readmit encounters (CDW data)

• Encounter location (down to unit for Emory)• Providers (Emory only)• Discharge disposition• Primary and secondary ICD9 codes• Procedure codes• DRGs• Medication orders (Emory only)• Labs (Emory only)• Vitals (Emory only)• Geographic information (CDW only + US Census and American

Community Survey)Analytic Information Warehouse

Page 54: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Using Emory & UHC Data to Find Associations With 30-day Readmits

• Problem: “Raw” clinical and administrative variables are difficult to use for associative data mining– Too many diagnosis codes, procedure codes– Continuous variables (e.g., labs) require interpretation– Temporal relationships between variables are implicit

• Solution: Transform the data into a much smaller set of variables using heuristic knowledge– Categorize diagnosis and procedure codes using code

hierarchies– Classify continuous variables using standard interpretations

(e.g., high, normal, low)– Identify temporal patterns (e.g., frequency, duration, sequence)– Apply standard data mining techniques

Analytic Information Warehouse

Page 55: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Derived Variables

• 30-day readmit• The 9 Emory Enhanced Risk Assessment Tool diagnosis categories• UHC product lines• Variables derived from a combination of codes and/or laboratory test results

– Obesity– Diabetes/uncontrolled diabetes– End-stage renal disease (ESRD)– Pressure ulcer– Sickle cell disease/sickle cell crisis

• Temporal variables derived over multiple encounters– Multiple MI– Multiple 30-day readmissions– Chemotherapy within 180 (or 365) days before surgery– Previous encounter within the last 90 (or 180) days

Page 56: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

30-Day Readmission Rates for Derived VariablesEmory Health Care

Page 57: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Geographic AnalysesUHC Medicine General Product Line (#15)

Analytic Information Warehouse

Page 58: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Predictive Modeling for Readmission

• Random forests (ensemble of decision trees)– Create a decision tree using a random subset of the

variables in the dataset– Generate a large number of such trees– All trees vote to classify each test example in a

training dataset– Generate a patient-specific readmission risk for each

encounter

• Rank the encounters by risk for a subsequent 30-day readmission

Sharath Cholleti

Page 59: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Emory Readmission Rates for High and Low Risk Groups Generated with Random Forest

Page 60: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Status of Clinical Phenotype Characterization

• Integrative dataset analysis can leverage patient information gathered over many encounters

• Temporal analyses can generate derived variables that appear to correlate with readmissions

• Predictive modeling has promise of providing decision support

• Data Analytics arm of the Emory New Care Model Initiative led by Greg Esper

• Ongoing analyses involve characterization of clinical phenotype in GWAS, biomarker and quality improvement efforts

• Co-lead (with Bill Hersh) of CTSA CER Informatics taskforce dedicated to this issue

Page 61: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

a.k.a “Big Data”

• Integrative Spatio-Temporal Analytics• Deep Integrative Biomedical Research• High End Computing/”Big Data” Computers,

Systems Software• Analysis of Patient Populations

Page 62: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Thanks to:• In silico center team: Dan Brat (Science PI), Tahsin Kurc, Ashish

Sharma, Tony Pan, David Gutman, Jun Kong, Sharath Cholleti, Carlos Moreno, Chad Holder, Erwin Van Meir, Daniel Rubin, Tom Mikkelsen, Adam Flanders, Joel Saltz (Director)

• Digital Pathology R01 (s): Foran and Saltz; Jun Kong, Sharath Cholleti, Fusheng Wang, Tony Pan, Tahsin Kurc, Ashish Sharma, David Gutman (Emory), Wenjin Chen, Vicky Chu, Jun Hu, Lin Yang, David J. Foran (Rutgers)

• Analytic Warehouse team: Andrew Post, Sharath Cholleti, Doris Gao, Michel Monsour, Himanshu Rathod

• In vivo imaging Emory team: Tony Pan, Ashish Sharma, Joel Saltz• NIH/in silico TCGA Imaging Group: Scott Hwang, Bob Clifford, Erich

Huang, Dima Hammoud, Manal Jilwan, Prashant Raghavan, Max Wintermark, David Gutman, Carlos Moreno, Lee Cooper, John Freymann, Justin Kirby, Arun Krishnan, Seena Dehkharghani, Carl Jaffe

• ACTSI Biomedical Informatics Program: Marc Overcash, Tim Morris, Tahsin Kurc, Alexander Quarshie, Circe Tsui, Adam Davis, Sharon Mason, Andrew Post, Alfredo Tirado-Ramos

• ORNL HPC collaboration: Scott Klasky, David Pugmire ORNL

Page 63: Integrative Multi-Scale Analyses

Cen

ter

for

Com

pre

hen

sive In

form

ati

cs

Thanks to

• National Cancer Institute• National Library of Medicine• National Science Foundation• Cardiovascular Research Grid (NHLBI)• Minority Health Grid (ARRA)• Emory Health Care• Kaiser Health Care• Winship Cancer Institute• Oak Ridge National Laboratory• Woodruff Health Sciences

Page 64: Integrative Multi-Scale Analyses

Thanks!