© Fraunhofer-Institut für Angewandte Informationstechnik FIT
HUMIT: Interactive Data Integration in a Data Lake System for the Life Sciences
PD Dr. Christoph Quix
Fraunhofer-Institut für Angewandte Informationstechnik FIT
Life Science Informatics
Head of Department, High Content Analysis & Information-intensive [email protected]
Acting Professor of "Data Science"
Head of the Research Group Big Data & Model Management
RWTH Aachen University
Funding period: 2015-2018, funded by BMBF
Use Case Partners
Regulation requirements & Quality assurance
Coordinator / Technology Partner
High Content Screening
Systematic variation in parameters,
e.g. by compound or sequence
Automatic analysis by substructure
Big Data in Life Sciences
• High-Content Analysis
• Systematic analysis of huge image sets
• Automated image analysis
• Metadata extraction from multimedia data
• Data management not only in life sciences → Scientific Data Management
• Workflow integration
Zeta: Application Specific Platform
Directory Tree
Image Galleries
Time Line Animation
View Component
Overlays
Plugins
Plugin Toolbar
Example Configuration
Cell-Cycle Analysis
Registration FB Detection Segmentation Tracking Classification Evaluation
Cell-ID  Position[x,y]  Mother-ID  Time-ID  Size  MeanIntensity  TotalIntensity  G phase  Mitosis  ImageName                        Well  Site  Wavelength
1        29,35          -1         t171     786   79             62753           1        0        SR100702Live_G12_w1_s1_t171.tif  G12   s1    w1
2        44,82          -1         t171     1107  40             44376           0        1        SR100702Live_G12_w1_s1_t171.tif  G12   s1    w1
3        63,465         -1         t171     1383  87             120778          1        0        SR100702Live_G12_w1_s1_t171.tif  G12   s1    w1
4        97,363         -1         t171     721   67             48396           1        0        SR100702Live_G12_w1_s1_t171.tif  G12   s1    w1
Result
Metadata and data are managed in files and file names!
File name: TSA_HDAC1_2.png
trichostatin A Histone deacetylase 1
is an inhibitor of
Assay: cell cycle inhibition
Table name
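As a rough illustration of how metadata ends up encoded in file names, the sketch below splits an image name like the one in the cell-cycle result table into its well, wavelength, site, and time-point parts. The pattern and the field name `prefix` are assumptions inferred from that single example, not a documented naming scheme.

```python
import re

# Assumed layout: <prefix>_<well>_<wavelength>_<site>_<time>.tif
# (inferred from one example file name; real naming schemes may differ).
IMAGE_NAME = re.compile(
    r"^(?P<prefix>.+)_(?P<well>[A-Z]\d+)_(?P<wavelength>w\d+)"
    r"_(?P<site>s\d+)_(?P<time>t\d+)\.tif$"
)


def parse_image_name(name):
    """Return the metadata fields encoded in an image file name, or None."""
    match = IMAGE_NAME.match(name)
    return match.groupdict() if match else None


meta = parse_image_name("SR100702Live_G12_w1_s1_t171.tif")
print(meta)
```

A name that does not follow the assumed layout, such as TSA_HDAC1_2.png, yields None, which shows how fragile file-name conventions are as a metadata store.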
Agenda
1. Motivation: Data Management in the Life Sciences
2. Requirements for Scientific Data Management
3. Data Lake Architecture in HUMIT
4. Summary
Scientific Data
• Data collected during the work of scientists
• Measurement results, test data, reports, analyses, …
• Various file formats
• Excel, CSV, images/audio/video, text, XML, proprietary formats, …
• Heterogeneous semantics
• Test vs. result data, own vs. other data, time frame, …
Idea Proposal Experiment Result Report
Heterogeneity is unavoidable
Islands of data in separate projects and applications
Integrated data analysis requires huge manual effort
Traceability and reproducibility are difficult because of manual processes
Goal:
From isolated data islands to
(partially) integrated data
landscapes
Requirements for Scientific Data Management
• Integration: Combined analysis of different data sources
• Traceability: Reproducibility of research results
• Evidence in lawsuits: IP protection
• Reusability: Accessibility for future use
• Flexibility: Adapt to changes in the research processes
• Documentation
Semantics
Models
Agenda
1. Motivation: Data Management in the Life Sciences
2. Requirements for Scientific Data Management
3. Data Lake Architecture in HUMIT
4. Summary
Data Lakes
• Maintain source data in its original structure
• Postpone (semantic) integration tasks
• Manage metadata about sources, mappings, and data quality
• Provide interfaces for uniform querying and interactive exploration of the data lake
James Dixon (Pentaho) https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
If you think of a datamart as a store of bottled water – cleansed and packaged
and structured for easy consumption – the data lake is a large body of water in
a more natural state. The contents of the data lake stream in from a source to
fill the lake, and various users of the lake can come to examine, dive in, or
take samples.
HUMIT: Data Integration for High-Content Analysis
Integration based on Pay-as-you-go Idea
Incremental extraction and integration of data
Interactive tools for exploration and querying
of data, definition of semantic relationships
and mappings, and data visualization
Separation of data storage and data processing/transformation;
raw data is stored with metadata in a Data Lake, thereby immediately
available for data analysis;
data integration and mapping done later (ELT instead of ETL)
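The ELT idea can be sketched in a few lines: raw data is stored untouched together with a metadata record, and parsing and mapping happen only when the data is read. This is a minimal in-memory sketch; the names `raw_store`, `catalog`, `ingest`, and `read_as_rows` are hypothetical and do not describe the HUMIT implementation.

```python
import csv
import io
from datetime import datetime, timezone

raw_store = {}  # object id -> raw bytes, kept exactly as delivered
catalog = {}    # object id -> metadata record


def ingest(object_id, payload, source):
    """Load step: store the raw bytes untouched and record basic metadata."""
    raw_store[object_id] = payload
    catalog[object_id] = {
        "source": source,
        "size_bytes": len(payload),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }


def read_as_rows(object_id, mapping):
    """Transform step, applied at read time: parse CSV and rename columns."""
    text = raw_store[object_id].decode("utf-8")
    for row in csv.DictReader(io.StringIO(text)):
        yield {mapping.get(key, key): value for key, value in row.items()}


ingest("plate1.csv", b"Cell-ID,Size\n1,786\n2,1107\n", source="microscope export")
rows = list(read_as_rows("plate1.csv", {"Cell-ID": "cell_id", "Size": "size"}))
print(rows)
```

Because the mapping is supplied by the reader, it can be refined later without reloading the source data, which is the pay-as-you-go aspect.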
Proposal for a Data Lake Architecture
Ingestion Layer
• Low effort for loading data (ELT instead of ETL)
• Support for the extraction of metadata and data
• Degree of automation (especially for metadata extraction)?
• Schema extraction for semi-structured data (JSON, XML)
• Schema-on-read
• Lazy loading
• Data quality control
• Specify minimal requirements for ingested data
• Complement and annotate extracted metadata
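Schema extraction for semi-structured data can be illustrated with a small recursive walk over a parsed JSON document. This is only a sketch of the general idea, not the extraction method used in HUMIT; the function name `extract_schema`, the `list_of` key, and the sample document are made up for illustration.

```python
import json


def extract_schema(value):
    """Describe the structure of a parsed JSON value as nested type names."""
    if isinstance(value, dict):
        return {key: extract_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Summarize a list by the distinct schemas of its elements.
        distinct = []
        for item in value:
            schema = extract_schema(item)
            if schema not in distinct:
                distinct.append(schema)
        return {"list_of": distinct}
    if value is None:
        return "null"
    return type(value).__name__


document = json.loads(
    '{"well": "G12", "site": 1,'
    ' "cells": [{"id": 1, "size": 786}, {"id": 2, "size": 1107}]}'
)
schema = extract_schema(document)
print(schema)
```

The extracted structure is metadata in its own right and could be stored alongside the raw file, so that the schema is only applied when the data is read.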
Storage Layer
• Choice of data storage
• HDFS? NoSQL? RDBMS? A hybrid solution is required, but …
• A uniform interface for data access
• A uniform query language (→ query rewriting and data transformation)
• Metadata repository and metadata model
• Manage schemata, mappings, data quality information, and data lineage
• Close integration of data and metadata
• Data quality management
• Monitor data quality of data stores
• Semantic enrichment of metadata
• Prepare data marts for specific data sets
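One way to picture a uniform access interface over a hybrid storage layer is an abstract base class with one implementation per backend. The sketch below uses a list of dictionaries as a stand-in for a NoSQL store and an in-memory SQLite database as a stand-in for an RDBMS. All class and function names are hypothetical, and real query rewriting would go well beyond this client-side filter.

```python
from abc import ABC, abstractmethod
import sqlite3


class DataStore(ABC):
    """Common read interface that each storage backend has to implement."""

    @abstractmethod
    def scan(self, collection):
        """Yield all records of a collection as dictionaries."""


class DocumentStore(DataStore):
    """Stand-in for a NoSQL store: collections are lists of dictionaries."""

    def __init__(self, collections):
        self._collections = collections

    def scan(self, collection):
        yield from self._collections[collection]


class RelationalStore(DataStore):
    """Stand-in for an RDBMS, backed by an in-memory SQLite database."""

    def __init__(self, connection):
        self._connection = connection
        self._connection.row_factory = sqlite3.Row

    def scan(self, collection):
        # Table names cannot be bound as parameters, so quote the identifier.
        quoted = '"' + collection.replace('"', '""') + '"'
        for row in self._connection.execute(f"SELECT * FROM {quoted}"):
            yield dict(row)


def select(store, collection, predicate):
    """Run the same filter against any backend through the shared interface."""
    return [record for record in store.scan(collection) if predicate(record)]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE cells (cell_id INTEGER, size INTEGER)")
connection.executemany("INSERT INTO cells VALUES (?, ?)", [(1, 786), (2, 1107)])

stores = [
    DocumentStore({"cells": [{"cell_id": 3, "size": 1383}, {"cell_id": 4, "size": 721}]}),
    RelationalStore(connection),
]
large = [r for s in stores for r in select(s, "cells", lambda rec: rec["size"] > 1000)]
print(large)
```

The caller never sees which backend holds the data; pushing the predicate down into each store's native query language would be the query rewriting step mentioned above.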
Interaction Layer
• Explore & search in data repository
• Less direct queries (SQL), more Google-like queries
• Query for metadata and data
• User interaction should be captured as metadata
• Definition of exact queries
• Identification of new data relationships
• Metadata & data quality management
• Exploration of the data lake (what kind of information is available)
• Capture semantic annotations of users
• Provide data quality information to users & collect feedback
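A keyword search over the metadata catalog, with every query logged as metadata, could look roughly like the sketch below. The catalog entries reuse the two file names from earlier slides; the metadata keys (`compound`, `target`, `assay`) and the function name `search` are illustrative assumptions, and a real system would use a proper search index rather than substring matching.

```python
catalog = {
    "TSA_HDAC1_2.png": {
        "compound": "trichostatin A",
        "target": "Histone deacetylase 1",
        "assay": "cell cycle inhibition",
    },
    "SR100702Live_G12_w1_s1_t171.tif": {
        "well": "G12",
        "site": "s1",
        "wavelength": "w1",
    },
}
interaction_log = []


def search(query):
    """Return ids of objects whose metadata contains every query keyword."""
    keywords = query.lower().split()
    hits = []
    for object_id, metadata in catalog.items():
        text = " ".join([object_id, *map(str, metadata.values())]).lower()
        if all(keyword in text for keyword in keywords):
            hits.append(object_id)
    # Record the interaction itself so it can later be used as metadata.
    interaction_log.append({"query": query, "hits": list(hits)})
    return hits


print(search("histone inhibition"))
```

The logged queries and their hits are one possible input for identifying new data relationships, since objects that are repeatedly retrieved together may be semantically related.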
Data Quality
• Comprehensive data quality management for a data lake is necessary
• Data quality management is more than just data cleaning
• Goals, metrics, measurements, analysis, improvements
• Data quality already needs to be checked when data is ingested
• Minimal requirements for data sources (e.g., provide metadata or certain data items such as identifiers)
• Manage data quality information in the metadata repository and make it available to data users
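The two checks named above, minimal requirements at ingestion and a measurable quality metric, can be sketched as follows. The required fields and the completeness metric are assumptions chosen for illustration; the slides do not specify which requirements or metrics HUMIT uses.

```python
REQUIRED_FIELDS = ("Cell-ID", "ImageName")  # assumed minimal requirements


def check_minimal_requirements(record):
    """Return the required fields that are missing or empty in a record."""
    return [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]


def completeness(records, field):
    """Metric: fraction of records with a non-empty value for the field."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


records = [
    {"Cell-ID": 1, "ImageName": "SR100702Live_G12_w1_s1_t171.tif", "Well": "G12"},
    {"Cell-ID": 2, "ImageName": "", "Well": "G12"},
    {"Cell-ID": 3, "ImageName": "SR100702Live_G12_w1_s1_t171.tif"},
    {"Cell-ID": 4, "ImageName": "SR100702Live_G12_w1_s1_t171.tif", "Well": None},
]

# Records that violate the minimal requirements, keyed by their identifier.
rejected = {
    r["Cell-ID"]: check_minimal_requirements(r)
    for r in records
    if check_minimal_requirements(r)
}
# Quality information that would be stored in the metadata repository.
quality_metadata = {"completeness": {"Well": completeness(records, "Well")}}
print(rejected, quality_metadata)
```

Storing `quality_metadata` next to the data lets users judge a source before they build an analysis on it.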
Agenda
1. Motivation: Data Management in the Life Sciences
2. Requirements for Scientific Data Management
3. Data Lake Architecture in HUMIT
4. Summary
Summary
• Data management in the life sciences is often file-based, which limits reuse and reproducibility of experiments
• Making the data available in a data lake system provides query, search, and exploration features to the users
• The data lake concept is still at an early stage and requires more research
• Within the HUMIT project, we are developing several components and the framework for a data lake system
• Metadata extraction (→ CAiSE Forum 2016)
• Constance: Data Lake Framework (→ SIGMOD 2016)
• Data quality management (→ QDB workshop at VLDB 2016)
• User interaction and data visualization