#CMIMI18
Data Engineering and Imaging Informatics for Precision Oncology
Ashish Sharma, PhD
Assistant Professor, Biomedical Informatics
Emory University School of Medicine
@_AshishSharma
Disclosures
None
Cancer has been progressively redefined over the past 20 years
Global Oncology Trends 2017. Report by the QuintilesIMS Institute
Increase In Number And Complexity Of Treatment
Global Oncology Trends 2017. Report by the QuintilesIMS Institute
How do Data Science & Engineering enable Precision Oncology?
Data Science
Algorithms
Data Engineering
Outline
Data Engineering
- Data for AI Development
- Processing Pipelines
- Scale (Cloudy Medicine)
Going from Bench to Bedside*
Good Algorithms
Big Data is not helpful for developing algorithms if data is not FAIR
Findable
Accessible
Interoperable
Reusable
"The FAIR Guiding Principles for Scientific Data Management and Stewardship." Scientific Data 3 (2016)
FAIR Data: The Cancer Imaging Archive
TCIA encourages and supports the cancer imaging open science community by hosting and managing Findable, Accessible, Interoperable, and Reusable (FAIR) images and associated/derived data.
Clark et al. J Digital Imaging 26.6 (2013): 1045-1057
~0.75 PB downloaded over a rolling 12-month window
TCIA is Not Just an Image Repository
• Radiology
• Digital Pathology
• Radiotherapy data
• Imaging features
• Labels, Segmentations, Features…
• Clinical data
• Links to genomic data
Hard to be FAIR
TCIA
This is where the electronic medical record gets a little complicated
Sadly, TCIA has multiple ways to store non-image data
• Often non-image data is difficult to reuse
• In some cases (e.g., NLST) it is used to create data cohorts
• Often it is difficult to conduct studies that make use of non-image data in an integrative manner
How do you build a FAIR repo: Requirements and Challenges
Clinical Data
• One uniform management strategy for all non-image (clinical) data
• Enhance data exploration, cohort identification, visual analytics
Imaging Features
• Featurebase for Radiomics and Pathomics features
• One data representation
Enhanced and automated data curation
• Non-image data, pathology data, feature sets
Enable efficient deployment and support cloud deployments
Platform for Imaging in Precision Medicine (PRISM)
• PRISM will evolve and containerize the TCIA technology stack to streamline its deployment and incorporate new tools for analysis and management of images and imaging features with clinical context to enrich TCIA’s datasets.
• Semantic integration of TCIA non-image data
• Tools for Pathology image data analysis and management
• Some new functionality will go into both TCIA and PRISM
• Freely available as containerized microservices and OSS
PRISM Architecture
Building upon PRISM at Emory
GOAL: Streamline access to imaging for research and quality studies
Joint between DBMI and Radiology
• Near-real-time replication of the PACS (ongoing)
• Extract metadata for research and quality studies (ongoing)
• Integrate with orders and reports
• Simplify access to images for research studies
• Secure storage, processing and de-identification (when required)
• Link with Data Warehouse, EMR…
• Co-located computing and storage
Imaging != Rad + RT
Hello Digital Pathology
Digital Pathology for Precision Oncology
• Image analysis and DL methods to extract features from images
• Link Rad/Path features to “omics”, outcome, and biological phenomena
• Identify trillions of objects: nuclei, glands, ducts, tumor niches…
• Support queries against ensembles of features (multiple algorithms/datasets)
• Analysis of integrated, spatially mapped structural/“omic” information to gain insight into cancer mechanisms and to choose the best intervention
● Deep learning based computational stain for staining tumor infiltrating lymphocytes (TILs)
● Computationally stained TILs correlate with pathologist eye and molecular estimates
● TIL patterns linked to tumor and immune molecular features, cancer type, and favorable outcomes
● Potentially guide treatment selection
● 4,759 subjects (TCGA) == 5,202 H&E slides; 13 cancer types
Saltz et al. Cell Reports 2018 doi.org/10.1016/j.celrep.2018.03.086
Quantitative Imaging Pathology - QuIP
Data Processing Pipelines
Challenges
Model Development, Training
• TensorFlow, Keras, PyTorch, MATLAB…
• Notebooks, IDEs…
Deployment (going to bedside)
• Data wrangling (without a human in the loop)
• On-demand deployment of algorithms
• Scalability
• Performance and latency
• Monitoring, testing and reliability
• User interfaces
Containers, Microservices and APIs
• No monoliths: think stages (preprocessing, segmentation, feature selection, classification, CNNs…)
• Stages should be independent and, if possible, stateless
• Helps in scaling, deployment and redundancy
Containers (an easy way to do it)
+ Encapsulate the code and immediate dependencies
+ Easy to share, adopt, deploy
- Security implications (Docker vs. Singularity)
• Situation gets better if using K8s
Check out Grunt from Panos, Brad Erickson and the Mayo team
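The "think stages, keep them stateless" advice on this slide can be sketched as a chain of small pure functions. This is only an illustration, not code from the talk: the stage names (`preprocess`, `segment`, `extract_features`), the fixed 0.5 threshold, and the `run_pipeline` helper are all hypothetical placeholders.

```python
from functools import reduce
from typing import Callable, Dict, List

# A stage is a pure function: it takes a record and returns a new record.
# It keeps no state between calls, so any replica can process any record.
Stage = Callable[[Dict], Dict]


def preprocess(record: Dict) -> Dict:
    """Hypothetical stage: rescale raw pixel values into the range [0, 1]."""
    peak = max(record["pixels"]) or 1  # avoid dividing by zero on blank input
    return {**record, "pixels": [p / peak for p in record["pixels"]]}


def segment(record: Dict) -> Dict:
    """Hypothetical stage: mark pixels above a fixed threshold as foreground."""
    return {**record, "mask": [p > 0.5 for p in record["pixels"]]}


def extract_features(record: Dict) -> Dict:
    """Hypothetical stage: compute one toy feature from the mask."""
    mask = record["mask"]
    return {**record, "foreground_fraction": sum(mask) / len(mask)}


def run_pipeline(stages: List[Stage], record: Dict) -> Dict:
    """Apply the stages in order; each could live in its own container."""
    return reduce(lambda acc, stage: stage(acc), stages, record)


result = run_pipeline(
    [preprocess, segment, extract_features],
    {"image_id": "example-001", "pixels": [0, 50, 100, 200]},
)
```

Because each stage only reads its input and returns a new record, a stage can be swapped, replicated, or redeployed without touching the others, which is the property the slide credits for easier scaling and redundancy.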
Simple, Effective Data Processing Design
Patient Data
PROs
• Real-time data streams
• Easy to test and maintain
• Easier to upgrade algorithm
• Easy to build dashboards and visual analytic tools
• Secure
CONs
• Data and processing are tightly coupled
• Hard to deploy multiple algorithms
• Reengineer similar systems for each new algorithm
• Deployments are not elastic
• No automatic failover
Streaming Architectures
Modular design achieved by decoupling data and processing
Data is streamed into a Kafka cluster
Algorithmic pipelines subscribe to topics and process data
Enables rapid prototyping and deployment of algorithms
Preserves the scalability and reliability gains
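The decoupling described on this slide can be illustrated with a small in-memory publish/subscribe sketch. It is a stand-in for a broker and deliberately does not use the Kafka client API; the `TopicBroker` class, the `patient-data` topic name, the message fields, and both example pipelines are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[Dict], None]


class TopicBroker:
    """In-memory stand-in for a message broker (NOT the Kafka client API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register an algorithmic pipeline for a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Dict) -> None:
        """Deliver a message to every pipeline subscribed to the topic."""
        for handler in self._subscribers.get(topic, []):
            handler(message)


broker = TopicBroker()
archived: List[Dict] = []
flagged: List[str] = []


def archive_pipeline(message: Dict) -> None:
    """Hypothetical pipeline: keep every message."""
    archived.append(message)


def threshold_pipeline(message: Dict) -> None:
    """Hypothetical pipeline: flag patients whose value exceeds 100."""
    if message["value"] > 100:
        flagged.append(message["patient_id"])


# Two independent algorithms consume the same stream; the producer
# does not know about either of them.
broker.subscribe("patient-data", archive_pipeline)
broker.subscribe("patient-data", threshold_pipeline)

broker.publish("patient-data", {"patient_id": "p1", "value": 80})
broker.publish("patient-data", {"patient_id": "p2", "value": 130})
```

Adding a third algorithm here is one more `subscribe` call, with no change to the code that produces the data. A real broker adds what this sketch lacks: persistence, partitioning, and delivery across machines.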
Scale and Integrate via Cloudy Pipelines
Why Cloud First
What about Local infrastructure?
Hybrid Infrastructure?
Scalable and Affordable Computing
- On-demand computing (lower capital expenditures)
Managed services that enable new design patterns for computing
- BigQuery/Redshift
- Serverless computing
- Data wrangling tools, e.g. Dataflow
Lower/Different barriers to adoption
- Work with APIs, not servers
- Local IT has to become cloud aware
Cloudy for Scalability & Redundancy
Patient Data
○ Leverage vendor services for scalability and redundancy
○ Deployed AISE on AWS Lambda and ML Engine
○ Deployment time < 1 day
○ Improves model development by allowing one to test, during development, with real-world scale and constraints
Ver 1.0
Ver 2.0
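Deploying an algorithm to a serverless platform mostly means packaging it as a stateless handler. Below is a minimal sketch of that shape, assuming the `handler(event, context)` convention of the AWS Lambda Python runtime and an API Gateway style response with `statusCode` and `body`. It is not the AISE deployment mentioned on the slide; the `samples` event field and the `score` placeholder model are hypothetical.

```python
import json


def score(samples):
    """Hypothetical placeholder model: the mean of the input samples."""
    return sum(samples) / len(samples)


def lambda_handler(event, context):
    """Stateless entry point: everything it needs arrives in `event`."""
    samples = event.get("samples", [])
    if not samples:
        return {"statusCode": 400, "body": json.dumps({"error": "no samples"})}
    return {"statusCode": 200, "body": json.dumps({"score": score(samples)})}


# Local invocation; the platform would supply a real context object.
response = lambda_handler({"samples": [1.0, 2.0, 3.0]}, None)
```

Since the handler holds no state between invocations, the platform can run as many copies as the load requires, which is where the scalability and redundancy on this slide come from.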
Processing at Scale
Hint: Docker is not the silver bullet
Cloudy pipelines can work (e.g. Google Genomics, DNAnexus, NCI Cloud Resources, Globus Genomics…)
1. Think multi-stage pipelines, not standalone executables/apps
2. Stages are containerized or exposed as API endpoints
3. Use workflow languages to author pipelines (CWL, WDL…)
4. Rely on orchestrators capable of running pipelines on local/cloud/hybrid infrastructure
Lessons from Genomics
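The four lessons above amount to declaring stages and their dependencies, then letting an orchestrator decide the execution order. The sketch below mimics that idea with Python's standard `graphlib` module. It is not CWL or WDL and not a real orchestrator; the stage names, the `DEPENDS_ON` table, and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Tuple


# Hypothetical stages. In practice each would be a container or an
# API endpoint rather than a local function.
def fetch(inputs: Dict) -> List[int]:
    return [3, 1, 2]


def normalize(inputs: Dict) -> List[float]:
    data = inputs["fetch"]
    peak = max(data)
    return [x / peak for x in data]


def summarize(inputs: Dict) -> Dict:
    values = inputs["normalize"]
    return {"mean": sum(values) / len(values)}


def report(inputs: Dict) -> str:
    count = len(inputs["normalize"])
    return f"n={count} mean={inputs['summarize']['mean']:.2f}"


STAGES: Dict[str, Callable] = {
    "fetch": fetch,
    "normalize": normalize,
    "summarize": summarize,
    "report": report,
}

# The "workflow description": each stage lists the stages it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "fetch": [],
    "normalize": ["fetch"],
    "summarize": ["normalize"],
    "report": ["normalize", "summarize"],
}


def run(stages: Dict[str, Callable],
        depends_on: Dict[str, List[str]]) -> Tuple[List[str], Dict]:
    """Run every stage after its dependencies, passing their outputs along."""
    order = list(TopologicalSorter(depends_on).static_order())
    outputs: Dict = {}
    for name in order:
        inputs = {dep: outputs[dep] for dep in depends_on[name]}
        outputs[name] = stages[name](inputs)
    return order, outputs


order, outputs = run(STAGES, DEPENDS_ON)
```

The pipeline author only edits `DEPENDS_ON`; where and in what order the stages run is the orchestrator's concern, which is what lets the same description run on local, cloud, or hybrid infrastructure.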
Processing at Scale
Reproducibility
• Share tools (via code or containers): GitHub + Docker Hub + Dockstore…
• Share and publish models: TensorFlow Hub; ModelHub.AI; …
Scale
• Early stages: Docker Swarm; Kubernetes etc. Technically getting there but hard to adopt
• Serverless: AWS Lambda; Google Cloud Functions; pywren
Integration with EMR
• FHIR, DICOMweb
Examples are for illustrative purposes and not endorsements
National Cancer Data Ecosystem Recommendations
Warren Kibbe “Data: Where Precision Oncology and Learning Health Meet”. SAMSI Workshop on Precision Medicine, August 16, 2018
https://gdc.cancer.gov
NCI Cancer Research Data Commons
https://cbiit.cancer.gov/ncip/cancer-data-commons
Final Words
• Need Big, FAIR Data that is representative of the population
• Develop algorithms but think about deployment and scale
• Partnerships and Teams of Techies and MDs; Academic and Industry
TEAM SCIENCE AT ITS BEST
Cloud computing, HPC and AI can have a transformative effect on medicine
Acknowledgements
● Emory DBMI Engineers, PostDocs and Students
● Fred Prior, PhD; TCIA Team, Dept. of Biomedical Informatics, Univ. of Arkansas for Medical Sciences
● Joel Saltz, MD PhD; Dept. of Biomedical Informatics, Stony Brook University
U01CA187013-06
UG3CA225021-01
14X138
U24CA215109-02
U24CA180924-05