On-Demand Service-Based Big Data Integration: Optimized for Research Collaboration

Pradeeban Kathiravelu1,2, Yiru Chen3, Ashish Sharma4, Helena Galhardas1, Peter Van Roy2, Luís Veiga1

The 3rd International Workshop on Data Management and Analytics for Medicine and Healthcare (DMAH), in conjunction with the 43rd International Conference on Very Large Data Bases.

Munich, Germany. September 1, 2017.

1 INESC-ID / Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal

2 Université catholique de Louvain, Louvain-la-Neuve, Belgium

3 Peking University, Beijing, China

4 Department of Biomedical Informatics, Emory University, Atlanta, USA

Introduction

● Scale and diversity of big data are rising.
– Geographically distributed data of exabytes.
– Structured, semi-structured, unstructured, or ill-formed data.
● Integration of data is crucial for data science.
● Sharing of integrated data and results.
– Mandatory for reproducible research.

Challenges in Medical Research for Big Data Integration

● Multiple types of data.
– Imaging, clinical, and genomic.
● Numerous data sources.
– No shared messaging protocol.

● Do we really need to integrate all the data?

A Story of Medical Data Researchers...

● Jim is interested in the effects of a medicine that treats brain tumors in patients of certain age groups.

Observation - 1

● Various sources.
– Service-based data access through APIs.
● Thanks to specifications such as HL7 FHIR.
● The researchers possess domain knowledge.
● Integrate on-demand.
– Avoid eager loading of binary data or its textual metadata.
– Use the researcher query as an input in loading data.
● Scalable storage in-house.
– Potential to load, integrate, index, and query unstructured data.
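The idea of using the researcher query as an input in loading data can be sketched as follows. This is only an illustration under stated assumptions: the `StudyMetadata` record, its fields, the in-memory `REMOTE_SOURCE`, and the example query are all hypothetical stand-ins, not part of any real data source API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class StudyMetadata:
    """Hypothetical metadata record; the field names are illustrative only."""
    study_uid: str
    modality: str
    patient_age: int


# In-memory stand-in for a remote, service-based source (an assumption).
REMOTE_SOURCE = [
    StudyMetadata("study-1", "MR", 34),
    StudyMetadata("study-2", "CT", 61),
    StudyMetadata("study-3", "MR", 67),
]


def load_on_demand(
    source: Iterable[StudyMetadata],
    query: Callable[[StudyMetadata], bool],
) -> Iterator[StudyMetadata]:
    """Return a lazy iterator over only the records matching the query."""
    return (record for record in source if query(record))


def example_query(record: StudyMetadata) -> bool:
    """Hypothetical researcher interest: MR studies of patients aged 60+."""
    return record.modality == "MR" and record.patient_age >= 60


# Nothing is read from the source until the iterator is consumed.
loaded = list(load_on_demand(REMOTE_SOURCE, example_query))
loaded_uids = [record.study_uid for record in loaded]
```

Only the matching subset is materialized; the rest of the source is never copied in-house.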

● Paula has overlapping research interests with Jim.

Observation - 2

● Load data only once per organization.
– Bandwidth and storage efficiency.
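A minimal sketch of loading data only once per organization, assuming a shared in-house store keyed by a data identifier. The `OrganizationStore` class, the `fake_fetch` function, and the identifiers are hypothetical and serve only to show the effect of overlapping interests.

```python
class OrganizationStore:
    """Hypothetical shared store: each item is fetched at most once."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable that downloads one item by UID
        self._loaded = {}        # UID -> already loaded item
        self.downloads = 0       # number of actual downloads performed

    def get(self, uid):
        # Download only when no researcher in the organization has
        # loaded this item before; otherwise reuse the stored copy.
        if uid not in self._loaded:
            self._loaded[uid] = self._fetch(uid)
            self.downloads += 1
        return self._loaded[uid]


def fake_fetch(uid):
    """Stand-in for a remote download (an assumption for this sketch)."""
    return f"metadata for {uid}"


store = OrganizationStore(fake_fetch)
jim_data = [store.get(uid) for uid in ("s1", "s2", "s3")]
paula_data = [store.get(uid) for uid in ("s2", "s3", "s4")]  # overlaps with Jim
```

Six items are requested in total, but the two overlapping ones are served from the store, so only four downloads take place.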

● Sharing the research data with researchers beyond organization boundaries.

Observation - 3

● Do not duplicate data!
– We "own" our interest; not the data.
● Point to the data in the data sources.
– Pointers to data like Dropbox Shared Links work well.
● Avoids outdated duplicate data.
● Easy to maintain.
● APIs
– Access the list of research data sets.
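Sharing pointers rather than copies can be illustrated with a small sketch. The `DataPointer` type, the `resolve` function, and the in-memory `sources` dictionary are assumptions made for this example; they do not describe the actual Óbidos data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPointer:
    """Hypothetical pointer: names a source and an item, holds no data."""
    source: str
    identifier: str


def resolve(pointer, sources):
    """Look up the current content behind a pointer at access time."""
    return sources[pointer.source][pointer.identifier]


# In-memory stand-in for remote data sources (an assumption).
sources = {"imaging-archive": {"series-7": "version 1"}}

# What is shared with a collaborator is the list of pointers, not the data.
shared = [DataPointer("imaging-archive", "series-7")]

before_update = [resolve(p, sources) for p in shared]
sources["imaging-archive"]["series-7"] = "version 2"  # the source is updated
after_update = [resolve(p, sources) for p in shared]
```

Because the pointer is resolved at access time, the collaborator sees the updated data without any re-sharing, which is why pointers avoid outdated duplicates.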

Problems

● How to...
– Load data from several service-based big data sources.
● Avoid duplicate downloads and near-duplicate data.
– Integrate disparate data and persist for future accesses.
– Share pointers to data internally and externally.

Óbidos: On-demand Big Data Integration, Distribution, and Orchestration System

● Researcher query → Narrow down the search space.
● Define subsets of data that are of interest.
– Exploiting the well-defined hierarchical structure of medical data.
● Medical Images (DICOM)
● Clinical data
● ..

Óbidos Approach

● Hybrid of virtual and materialized data integration approaches.
– Lazy load of metadata: Load the matching subset of metadata.
– Store integrated data and query results → scalable storage.
● Track already loaded data.
– Near duplicate detection.
– Download only updates (changesets).
● Efficient SQL queries on NoSQL storage.
● Share pointers to the datasets rather than the dataset itself.
● Generic design; implementation for medical research data.
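Tracking already loaded data and downloading only the updates can be sketched as a comparison of version markers. This is a simplified illustration: the `changeset` function and the use of integer versions per identifier are assumptions, and near duplicate detection is not covered here.

```python
def changeset(loaded_versions, source_versions):
    """Return the items that are new or changed at the source.

    Both arguments map a data identifier to a version marker. An item is
    part of the changeset when it was never loaded or when its version at
    the source differs from the version that was loaded.
    """
    return {
        uid: version
        for uid, version in source_versions.items()
        if loaded_versions.get(uid) != version
    }


# Hypothetical state: what is stored in-house vs. what the source has now.
loaded_versions = {"series-a": 1, "series-b": 1}
source_versions = {"series-a": 1, "series-b": 2, "series-c": 1}

to_download = changeset(loaded_versions, source_versions)
```

Here `series-a` is unchanged and skipped, while the updated `series-b` and the new `series-c` make up the changeset.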

Óbidos Architecture

Evaluation

● Evaluation Data:
– Clinical data and DICOM imaging collections of TCIA.
● Benchmark Óbidos against eager and lazy ETL.
– Performance of loading and querying data.
● Óbidos (inter- and intra-organization) against binary data sharing.
– Space/bandwidth efficiency of data sharing.

Workload Characterization: Various Entries in Evaluated Collections

Data load time: Change in total data volume (same query and same interest)

● Observation:
– Load time increases for eager and lazy ETL with total volume.
– Load time for Óbidos remains constant.
● Total volume of data is irrelevant for Óbidos.

Data load time: Change in studies of interest (same query and constant total data volume)

● Observation:
– Load time for eager and lazy ETL remains constant.
– Load time increases for Óbidos with the interest.
● Converges to the load time of lazy ETL.

Query completion time for the integrated data repository

● Observation:
– We assume the corresponding data is already loaded.
● Thus, lazy and eager ETL perform similarly.
– Indexed scalable NoSQL architecture of Óbidos → Better performance.

Efficiency in Sharing Medical Research Data

● Observation:
– Intra-organization, a constant-size UID is sufficient.
– Inter-organization, Óbidos pointers grow with the number of series.
– Traditional binary data sharing: shared data size = volume of the image series.
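The three sharing modes above can be compared with a small arithmetic sketch. The byte sizes below are hypothetical placeholders chosen only to show how each quantity scales; they are not measurements from the evaluation.

```python
# Hypothetical sizes, for illustration only (not measured values).
UID_BYTES = 64        # assumed size of one constant-size UID
POINTER_BYTES = 128   # assumed size of one pointer to an image series


def shared_bytes_intra(num_series):
    """Intra-organization: a single UID, independent of the series count."""
    return UID_BYTES


def shared_bytes_inter(num_series):
    """Inter-organization: one pointer per series, so linear growth."""
    return num_series * POINTER_BYTES


def shared_bytes_binary(series_volumes):
    """Binary sharing: the full volume of every image series is sent."""
    return sum(series_volumes)


# Example: 10 series of an assumed 50 MB each.
volumes = [50_000_000] * 10
intra = shared_bytes_intra(len(volumes))
inter = shared_bytes_inter(len(volumes))
binary = shared_bytes_binary(volumes)
```

Under these assumptions the pointer-based sizes depend only on the number of series, while the binary size depends on the volume of the images themselves.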

Conclusion

● Óbidos offers on-demand service-based big data integration.
– Fast and resource-efficient data analysis.
– SQL queries over NoSQL data store for the integrated data.
– Efficient data sharing without duplicating actual data.
● Future Work
– Consume data from repositories of domains beyond medical data.
● EUDAT
– Óbidos distributed virtual data warehouses.
● Leverage the proximity of the organizations in data integration and sharing.

Acknowledgements

● NCI QIN grant (1U01CA187013, Resources for Development and Validation Of Radiomic Analyses and Adaptive Therapy).
● Google Summer of Code (2014, 2015, and 2016).
● The Cancer Imaging Archive (TCIA).
● Tyk and API Umbrella Teams.

Thank you! Questions?
