© Fraunhofer-Institut für Angewandte Informationstechnik FIT

HUMIT – Interactive Data Integration in a Data Lake System for the Life Sciences

PD Dr. Christoph Quix
Fraunhofer-Institut für Angewandte Informationstechnik FIT, Life Science Informatics
Head of Department, High Content Analysis & Information-intensive Instruments
[email protected]

Substitute Professor of "Data Science"
Head of the Research Group Big Data & Model Management
RWTH Aachen University

dbis.rwth-aachen.de/~quix/papers/ahm2016.pdf · 2016-06-03




Funding period: 2015-2018, funded by BMBF

Use Case Partners

Regulation requirements & Quality assurance

Coordinator / Technology Partner


High Content Screening

Systematic variation in parameters, e.g. by compound or sequence

Automatic analysis by substructure


Big Data in Life Sciences

• High-Content-Analysis
  - Systematic analysis of huge image sets
  - Automated image analysis
  - Metadata extraction from multimedia data
• Data management not only in life sciences → Scientific Data Management
  - Workflow integration


Zeta: Application Specific Platform

Directory Tree

Image Galleries

Time Line Animation

View Component

Overlays

Plugins

Plugin Toolbar


Example Configuration

Cell-Cycle Analysis

Registration → FB Detection → Segmentation → Tracking → Classification → Evaluation

Result:

Cell-ID  Position[x,y]  Mother-ID  Time-ID  Size  MeanIntensity  TotalIntensity  G phase  Mitosis  ImageName                        Well  Site  Wavelength
1        29,35          -1         t171     786   79             62753           1        0        SR100702Live_G12_w1_s1_t171.tif  G12   s1    w1
2        44,82          -1         t171     1107  40             44376           0        1        SR100702Live_G12_w1_s1_t171.tif  G12   s1    w1
3        63,465         -1         t171     1383  87             120778          1        0        SR100702Live_G12_w1_s1_t171.tif  G12   s1    w1
4        97,363         -1         t171     721   67             48396           1        0        SR100702Live_G12_w1_s1_t171.tif  G12   s1    w1


Metadata and data are managed in files and file names!

File name: TSA_HDAC1_2.png

TSA = trichostatin A; HDAC1 = Histone deacetylase 1; trichostatin A is an inhibitor of Histone deacetylase 1

Assay: cell cycle inhibition

Table name
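Recovering metadata that is only encoded in a file name could look roughly like the sketch below. The pattern `<experiment>_<well>_w<wavelength>_s<site>_t<time>.tif` is an assumption read off the image name SR100702Live_G12_w1_s1_t171.tif and the Well, Site, Wavelength and Time-ID columns of the cell-cycle result table; the function and field names are hypothetical and not part of HUMIT or Zeta.

```python
import re
from typing import Optional

# Assumed naming convention, inferred from one example file name only.
IMAGE_NAME = re.compile(
    r"^(?P<experiment>[^_]+)_(?P<well>[A-Z]\d+)_(?P<wavelength>w\d+)"
    r"_(?P<site>s\d+)_(?P<time_id>t\d+)\.tif$"
)


def parse_image_name(name: str) -> Optional[dict]:
    """Return the metadata encoded in a file name, or None if it does not match."""
    match = IMAGE_NAME.match(name)
    return match.groupdict() if match else None


meta = parse_image_name("SR100702Live_G12_w1_s1_t171.tif")
print(meta)

# A name following a different convention is not silently misread.
print(parse_image_name("TSA_HDAC1_2.png"))  # None
```

Each naming convention needs its own pattern, which is one reason file-name-based metadata is hard to integrate across projects.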


Agenda

1. Motivation: Data Management in the Life Sciences

2. Requirements for Scientific Data Management

3. Data Lake Architecture in HUMIT

4. Summary


Scientific Data

• Data collected during the work of scientists
  - Measurement results, test data, reports, analyses, …
• Various file formats
  - Excel, CSV, images/audio/video, text, XML, proprietary formats, …
• Heterogeneous semantics
  - Test vs. result data, own vs. other data, timeframe, …

Idea → Proposal → Experiment → Result → Report


Heterogeneity is unavoidable

Islands of data in separate projects and applications

Integrated data analysis requires huge manual effort

Traceability and reproducibility are difficult because of manual processes

Goal: From isolated data islands to (partially) integrated data landscapes


Requirements for Scientific Data Management

• Integration: Combined analysis of different data sources
• Traceability: Reproducibility of research results
• Evidence in lawsuits: IP protection
• Reusability: Accessibility for future usage
• Flexibility: Adapt to changes in the research processes
• Documentation

Semantics

Models



Data Lakes

• Maintain source data in its original structure
• Postpone (semantic) integration tasks
• Manage metadata about sources, mappings, and data quality
• Provide interfaces for uniform querying and interactive exploration of the data lake

James Dixon (Pentaho), https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/:

"If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."


HUMIT: Data Integration for High-Content Analysis

Integration based on the pay-as-you-go idea

Incremental extraction and integration of data

Interactive tools for exploration and querying of data, definition of semantic relationships and mappings, and data visualization

Separation of data storage and data processing/transformation: raw data is stored with metadata in a data lake and is thereby immediately available for data analysis; data integration and mapping are done later (ELT instead of ETL)
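The ELT idea can be illustrated with a small sketch: raw bytes are stored untouched together with descriptive metadata, and a mapping is applied only when the data is read. `TinyLake`, `LakeEntry` and the metadata keys are hypothetical names for an in-memory stand-in; they do not describe the actual HUMIT implementation.

```python
import csv
import io
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class LakeEntry:
    raw: bytes                                    # stored exactly as delivered
    metadata: dict = field(default_factory=dict)  # descriptive metadata only


class TinyLake:
    """Hypothetical in-memory stand-in for a data lake store."""

    def __init__(self):
        self.entries = {}

    def ingest(self, name, raw, **metadata):
        # Extract + Load: no parsing and no cleaning at this point.
        metadata.setdefault("ingested_at", datetime.now(timezone.utc).isoformat())
        metadata.setdefault("size_bytes", len(raw))
        self.entries[name] = LakeEntry(raw, metadata)

    def read_csv(self, name, mapping):
        # Transform: applied later, on read, with a mapping defined by then.
        text = self.entries[name].raw.decode("utf-8")
        rows = csv.DictReader(io.StringIO(text))
        return [{mapping.get(key, key): value for key, value in row.items()}
                for row in rows]


lake = TinyLake()
lake.ingest("cells.csv", b"Cell-ID,Size\n1,786\n2,1107\n", source="microscope-A")
rows = lake.read_csv("cells.csv", {"Cell-ID": "cell_id", "Size": "size"})
print(rows)
```

Because the raw bytes are never modified, a different mapping can be applied to the same entry later without re-ingesting it.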


Proposal for a Data Lake Architecture


Ingestion Layer

• Low effort for loading data (ELT instead of ETL)
• Support for the extraction of metadata and data
  - Degree of automation (especially for metadata extraction)?
  - Schema extraction for semi-structured data (JSON, XML)
  - Schema-on-read
  - Lazy loading
• Data quality control
  - Specify minimal requirements for ingested data
  - Complement and annotate extracted metadata
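Schema extraction for semi-structured data can be sketched as a recursive walk over a decoded JSON document. This is a simplified illustration of the idea, not the extraction method used in HUMIT; `infer_schema` and the sample record are assumptions made for the example.

```python
import json


def infer_schema(value):
    """Describe the structure of a decoded JSON value with nested dicts and type names."""
    if isinstance(value, dict):
        return {key: infer_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Summarise a list by the distinct schemas of its elements.
        schemas = []
        for item in value:
            schema = infer_schema(item)
            if schema not in schemas:
                schemas.append(schema)
        return {"list_of": schemas}
    if value is None:
        return "null"
    return type(value).__name__  # 'str', 'int', 'float' or 'bool'


record = json.loads(
    '{"well": "G12", "site": 1,'
    ' "cells": [{"id": 1, "size": 786}, {"id": 2, "size": 1107}]}'
)
schema = infer_schema(record)
print(schema)
# {'well': 'str', 'site': 'int', 'cells': {'list_of': [{'id': 'int', 'size': 'int'}]}}
```

The schema is derived when the data is read rather than declared before loading, which is the essence of schema-on-read.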


Storage Layer

• Choice of data storage
  - HDFS? NoSQL? RDBMS? A hybrid solution is required, but …
    - A uniform interface for data access
    - A uniform query language (→ query rewriting and data transformation)
• Metadata repository and metadata model
  - Manage schemata, mappings, data quality information, and data lineage
  - Close integration of data and metadata
• Data quality management
  - Monitor data quality of data stores
• Semantic enrichment of metadata
• Prepare data marts for specific data sets
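A metadata repository that keeps schemata, mappings, data quality information and lineage side by side might be modelled as in the sketch below. All class, field and dataset names are hypothetical; the slides do not specify the metadata model.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    name: str
    store: str                                        # e.g. "hdfs", "nosql", "rdbms"
    schema: dict = field(default_factory=dict)        # attribute name -> type name
    quality: dict = field(default_factory=dict)       # metric name -> measured value
    derived_from: list = field(default_factory=list)  # lineage: parent dataset names


class MetadataRepository:
    def __init__(self):
        self.datasets = {}
        # (source dataset, source attribute, target dataset, target attribute)
        self.mappings = []

    def register(self, meta):
        self.datasets[meta.name] = meta

    def add_mapping(self, source, source_attr, target, target_attr):
        self.mappings.append((source, source_attr, target, target_attr))

    def lineage(self, name):
        """Return all known ancestors of a dataset, nearest first."""
        ancestors, queue = [], list(self.datasets[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in ancestors:
                continue
            ancestors.append(parent)
            if parent in self.datasets:
                queue.extend(self.datasets[parent].derived_from)
        return ancestors


repo = MetadataRepository()
repo.register(DatasetMetadata("raw_images", "hdfs"))
repo.register(DatasetMetadata(
    "cell_table", "rdbms",
    schema={"Cell-ID": "int", "Size": "int"},
    quality={"completeness": 1.0},
    derived_from=["raw_images"],
))
repo.register(DatasetMetadata("cell_cycle_mart", "rdbms", derived_from=["cell_table"]))
repo.add_mapping("cell_table", "Cell-ID", "cell_cycle_mart", "cell_id")
print(repo.lineage("cell_cycle_mart"))  # ['cell_table', 'raw_images']
```

Recording the store type per dataset lets the repository sit on top of a hybrid storage setup without tying the metadata to one backend.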


Interaction Layer

• Explore & search in the data repository
  - Fewer direct queries (SQL), more Google-like queries
  - Query for metadata and data
• User interaction should be captured as metadata
  - Definition of exact queries
  - Identification of new data relationships
• Metadata & data quality management
  - Exploration of the data lake (what kind of information is available)
  - Capture semantic annotations of users
  - Provide data quality information to users & collect feedback
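A Google-like query over both metadata and data, with each interaction recorded as metadata, could be sketched as follows. `SearchableLake`, the dataset layout and the log fields are assumptions made for illustration; plain substring matching stands in for a real search index.

```python
from datetime import datetime, timezone


class SearchableLake:
    """Hypothetical keyword search over dataset names, metadata and data values."""

    def __init__(self, datasets):
        # datasets: name -> {"metadata": dict, "rows": list of dicts}
        self.datasets = datasets
        self.query_log = []  # user interaction captured as metadata

    def search(self, user, keywords):
        terms = [term.lower() for term in keywords.split()]
        hits = []
        for name, dataset in self.datasets.items():
            # Flatten name, metadata and data into one searchable text.
            parts = [name]
            parts += [f"{k} {v}" for k, v in dataset["metadata"].items()]
            parts += [f"{k} {v}" for row in dataset["rows"] for k, v in row.items()]
            haystack = " ".join(parts).lower()
            if all(term in haystack for term in terms):
                hits.append(name)
        self.query_log.append({
            "user": user,
            "keywords": keywords,
            "hits": hits,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return hits


lake = SearchableLake({
    "TSA_HDAC1_2.png": {
        "metadata": {"assay": "cell cycle inhibition", "compound": "trichostatin A"},
        "rows": [],
    },
    "cells_G12": {
        "metadata": {"well": "G12"},
        "rows": [{"Cell-ID": 1, "Mitosis": 0}],
    },
})
print(lake.search("alice", "trichostatin inhibition"))  # ['TSA_HDAC1_2.png']
print(lake.search("alice", "mitosis"))                  # ['cells_G12']
```

The query log is what later allows frequently used keyword combinations to be turned into exact queries or candidate relationships between datasets.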


Data Quality

• Comprehensive data quality management for a data lake is necessary
• Data quality management is more than just data cleaning
  - Goals, metrics, measurements, analysis, improvements
• Data quality needs to be checked already for ingested data
  - Minimal requirements for data sources (e.g., provide metadata or certain data items such as identifiers)
• Manage data quality information in the metadata repository and make it available to data users
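Checking minimal requirements at ingestion and recording a simple quality metric might look like the sketch below. The required metadata keys, the identifier column and the completeness metric are assumptions chosen for the example, not requirements defined by HUMIT.

```python
REQUIRED_METADATA = ("source", "format")  # assumed minimal metadata
REQUIRED_FIELDS = ("Cell-ID",)            # assumed identifier column


def check_ingest(metadata, rows):
    """Return a quality report; a source is accepted only without violations."""
    violations = [
        f"missing metadata: {key}" for key in REQUIRED_METADATA if key not in metadata
    ]
    for field_name in REQUIRED_FIELDS:
        if any(row.get(field_name) in (None, "") for row in rows):
            violations.append(f"missing identifier: {field_name}")

    # Completeness: share of non-empty values over all values.
    values = [value for row in rows for value in row.values()]
    filled = sum(1 for value in values if value not in (None, ""))
    completeness = filled / len(values) if values else 1.0

    return {
        "accepted": not violations,
        "violations": violations,
        "completeness": completeness,
    }


good = check_ingest(
    {"source": "microscope-A", "format": "csv"},
    [{"Cell-ID": "1", "Size": "786"}, {"Cell-ID": "2", "Size": ""}],
)
bad = check_ingest({"source": "microscope-A"}, [{"Cell-ID": "", "Size": "721"}])
print(good)
print(bad)
```

The returned report is the kind of data quality information that would be stored in the metadata repository and shown to data users.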



Summary

• Data management in the life sciences is often file-based, which limits reuse and reproducibility of experiments
• Making the data available in a data lake system provides query, search, and exploration features to the users
• The data lake is still an early concept and requires more research
• Within the HUMIT project, we are developing several components and the framework for a data lake system
  - Metadata extraction (→ CAiSE Forum 2016)
  - Constance – Data Lake Framework (→ SIGMOD 2016)
  - Data quality management (→ QDB workshop at VLDB 2016)
  - User interaction and data visualization