On-Demand Service-Based Big Data Integration: Optimized for Research Collaboration

1/23

Pradeeban Kathiravelu1,2, Yiru Chen3, Ashish Sharma4, Helena Galhardas1, Peter Van Roy2, Luís Veiga1

The 3rd International Workshop on Data Management and Analytics for Medicine and Healthcare (DMAH), in conjunction with the 43rd International Conference on Very Large Data Bases.

Munich, Germany. September 1, 2017.

1 INESC-ID / Instituto Superior Técnico, Universidade de Lisboa, Lisboa, Portugal
2 Université catholique de Louvain, Louvain-la-Neuve, Belgium
3 Peking University, Beijing, China
4 Department of Biomedical Informatics, Emory University, Atlanta, USA

2/23

Introduction

● Scale and diversity of big data are rising.
  – Geographically distributed data of exabytes.
  – Structured, semi-structured, unstructured, or ill-formed data.

● Integration of data is crucial for data science.
● Sharing of integrated data and results.
  – Mandatory for reproducible research.

3/23

Challenges in Medical Research for Big Data Integration

● Multiple types of data.
  – Imaging, clinical, and genomic.
● Numerous data sources.
  – No shared messaging protocol.
● Do we really need to integrate all the data?

4/23

A Story of Medical Data Researchers...

5/23

● Jim is interested in the effects of a medicine that treats brain tumors in patients of certain age groups.

6/23

Observation - 1

● Various sources.
  – Service-based data access through APIs.
    ● Thanks to specifications such as HL7 FHIR.
● The researchers possess domain knowledge.
● Integrate on demand.
  – Avoid eager loading of binary data or its textual metadata.
  – Use the researcher query as an input in loading data.
● Scalable storage in-house.
  – Potential to load, integrate, index, and query unstructured data.
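A minimal sketch of using the researcher query as the input to loading, assuming a hypothetical in-memory catalog in place of a real service-based source. The names `StudyMetadata`, `REMOTE_CATALOG`, `fetch_matching_metadata`, and `OnDemandLoader` are illustrative and not part of Óbidos.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyMetadata:
    study_id: str
    modality: str
    patient_age: int


# Hypothetical stand-in for a remote data source reachable through an API.
REMOTE_CATALOG = [
    StudyMetadata("s1", "MR", 34),
    StudyMetadata("s2", "CT", 61),
    StudyMetadata("s3", "MR", 67),
]


def fetch_matching_metadata(min_age, max_age, modality):
    """Simulate an API call that filters at the source rather than returning everything."""
    return [
        m for m in REMOTE_CATALOG
        if m.modality == modality and min_age <= m.patient_age <= max_age
    ]


class OnDemandLoader:
    """Loads only the metadata that matches a researcher query."""

    def __init__(self):
        self.store = {}  # local integrated storage, keyed by study_id

    def query(self, min_age, max_age, modality):
        matches = fetch_matching_metadata(min_age, max_age, modality)
        for meta in matches:
            self.store[meta.study_id] = meta  # persist for later queries
        return [m.study_id for m in matches]


loader = OnDemandLoader()
result = loader.query(60, 70, "MR")
```

After this query, the local store holds only the study that matched; the other catalog entries were never loaded.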

7/23

● Paula has overlapping research interests with Jim.

8/23

Observation - 2

● Load data only once per organization.
  – Bandwidth and storage efficiency.
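One way to picture "load once per organization" is a shared store that records what has already been downloaded. The sketch below assumes a hypothetical `fake_download` function in place of a real data source, and `OrganizationStore` is an illustrative name rather than an Óbidos component.

```python
def fake_download(data_id):
    """Hypothetical stand-in for fetching data from a remote source."""
    return f"payload-for-{data_id}"


class OrganizationStore:
    """Shared storage that downloads each item at most once."""

    def __init__(self, download):
        self._download = download
        self._loaded = {}
        self.download_count = 0

    def get(self, data_id):
        if data_id not in self._loaded:
            self._loaded[data_id] = self._download(data_id)
            self.download_count += 1
        return self._loaded[data_id]


store = OrganizationStore(fake_download)

# Two researchers with overlapping interests share the same store.
jim_interest = ["study-a", "study-b"]
paula_interest = ["study-b", "study-c"]
for data_id in jim_interest + paula_interest:
    store.get(data_id)
```

Here four requests lead to three downloads, because the overlapping item is served from the shared store the second time.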

9/23

● Sharing the research data with researchers beyond organization boundaries.

10/23

Observation - 3

● Do not duplicate data!
  – We "own" our interest, not the data.
● Point to the data in the data sources.
  – Pointers to data, like Dropbox Shared Links, work well.
    ● Avoids outdated duplicate data.
    ● Easy to maintain.
● APIs
  – Access the list of research data sets.
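The idea of sharing pointers instead of copies can be sketched as follows. The `DataPointer` fields and the JSON exchange format are assumptions chosen for illustration; the slides do not specify how Óbidos encodes its pointers.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DataPointer:
    source: str      # hypothetical name of the data source
    collection: str  # hypothetical collection within that source
    series_uid: str  # identifier of the data item at the source


def share(pointers):
    """Serialize pointers only; no image or clinical data is copied."""
    return json.dumps([asdict(p) for p in pointers], sort_keys=True)


def receive(shared):
    """Rebuild the pointers on the receiving side."""
    return [DataPointer(**fields) for fields in json.loads(shared)]


interest = [
    DataPointer("example-archive", "brain-collection", "series-001"),
    DataPointer("example-archive", "brain-collection", "series-002"),
]
message = share(interest)
restored = receive(message)
```

The receiver resolves each pointer against the original source when needed, so the data is always read at its current state there.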

11/23

Problems

● How to..
  – Load data from several service-based big data sources.
    ● Avoid duplicate downloads and near-duplicate data.
  – Integrate disparate data and persist for future accesses.
  – Share pointers to data internally and externally.

12/23

Óbidos: On-demand Big Data Integration, Distribution, and Orchestration System

● Researcher query → narrow down the search space.
● Define subsets of data that are of interest.
  – Exploiting the well-defined hierarchical structure of medical data.
    ● Medical images (DICOM)
    ● Clinical data
    ● ..

13/23

Óbidos Approach

● Hybrid of virtual and materialized data integration approaches.
  – Lazy load of metadata: load the matching subset of metadata.
  – Store integrated data and query results → scalable storage.
● Track already loaded data.
  – Near-duplicate detection.
  – Download only updates (changesets).
● Efficient SQL queries on NoSQL storage.
● Share pointers to the datasets rather than the datasets themselves.
● Generic design; implementation for medical research data.
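Tracking loaded data and downloading only updates might look like the sketch below. It assumes each item carries a version number at the source, which the slides do not state, and it does not cover near-duplicate detection. `plan_sync` and `apply_sync` are illustrative names.

```python
def plan_sync(remote_versions, local_versions, interest):
    """Return the items of interest that are missing or stale locally."""
    needed = []
    for item_id in sorted(interest):
        remote = remote_versions.get(item_id)
        if remote is None:
            continue  # the source does not offer this item
        if local_versions.get(item_id) != remote:
            needed.append(item_id)
    return needed


def apply_sync(remote_versions, local_versions, interest):
    """Record the changed items as loaded and return what was fetched."""
    changed = plan_sync(remote_versions, local_versions, interest)
    for item_id in changed:
        local_versions[item_id] = remote_versions[item_id]
    return changed


remote = {"a": 1, "b": 2, "c": 1}  # assumed version per item at the source
local = {"a": 1, "b": 1}           # what has been loaded so far
interest = {"a", "b", "c", "z"}

first_pass = apply_sync(remote, local, interest)
second_pass = plan_sync(remote, local, interest)
```

The first pass fetches only the stale and the missing item; a repeated pass with unchanged sources fetches nothing.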

14/23

Óbidos Architecture

15/23

Evaluation

● Evaluation data:
  – Clinical data and DICOM imaging collections of TCIA.
● Benchmark Óbidos against eager and lazy ETL.
  – Performance of loading and querying data.
● Óbidos (inter- and intra-organization) against binary data sharing.
  – Space/bandwidth efficiency of data sharing.

16/23

Workload Characterization: Various Entries in Evaluated Collections

17/23

Data load time: change in total data volume (same query and same interest)

● Observation:
  – Load time increases for eager and lazy ETL with total volume.
  – Load time for Óbidos remains constant.
    ● Total volume of data is irrelevant for Óbidos.

18/23

Data load time: change in studies of interest (same query and constant total data volume)

● Observation:
  – Load time for eager and lazy ETL remains constant.
  – Load time increases for Óbidos with the interest.
    ● Converges to the load time of lazy ETL.

19/23

Query completion time for the integrated data repository

● Observation:
  – We assume the corresponding data is already loaded.
    ● Thus, lazy and eager ETL perform similarly.
  – Indexed scalable NoSQL architecture of Óbidos → better performance.

20/23

Efficiency in Sharing Medical Research Data

● Observation:
  – Intra-organization: a constant-size UID is sufficient.
  – Inter-organization: the size of the Óbidos pointers grows with the number of series.
  – Traditional binary data sharing: shared data size = volume of the image series.
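The three sharing modes above can be compared with a small calculation. The byte sizes for a UID and for a per-series pointer are assumptions picked for illustration, not measurements from the evaluation.

```python
UID_BYTES = 64      # assumed size of one dataset identifier
POINTER_BYTES = 64  # assumed size of one per-series pointer


def intra_org_share_bytes(series_sizes):
    """Within an organization, one constant-size UID identifies the dataset."""
    return UID_BYTES


def inter_org_share_bytes(series_sizes):
    """Across organizations, one pointer is shared per series."""
    return POINTER_BYTES * len(series_sizes)


def binary_share_bytes(series_sizes):
    """Traditional sharing transfers the image series themselves."""
    return sum(series_sizes)


# Hypothetical image series sizes in bytes.
sizes = [50_000_000, 120_000_000, 80_000_000]
summary = {
    "intra": intra_org_share_bytes(sizes),
    "inter": inter_org_share_bytes(sizes),
    "binary": binary_share_bytes(sizes),
}
```

Under these assumptions, pointer-based sharing depends only on how many series are shared, while binary sharing depends on how large they are.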

21/23

Conclusion

● Óbidos offers on-demand service-based big data integration.
  – Fast and resource-efficient data analysis.
  – SQL queries over NoSQL data store for the integrated data.
  – Efficient data sharing without duplicating actual data.
● Future Work
  – Consume data from repositories of domains beyond medical data.
    ● EUDAT
  – Óbidos distributed virtual data warehouses.
    ● Leverage the proximity of the organizations in data integration and sharing.

22/23

Acknowledgements

● NCI QIN grant (1U01CA187013, Resources for Development and Validation of Radiomic Analyses and Adaptive Therapy).
● Google Summer of Code (2014, 2015, and 2016).
● The Cancer Imaging Archive (TCIA).
● Tyk and API Umbrella teams.

23/23

Thank you! Questions?
