Master Thesis
Integration of biomedical and semantic knowledge for enabling systems biology approaches
Karl Kugler
UMIT – University for Health Sciences, Medical Informatics and Technology, Institute for Biomedical Engineering
Hall, August 2007
Eduard Wallnöfer Zentrum 1, A-6060 Hall, Österreich/Austria www.umit.at
Thesis adviser and examiner: Univ.-Prof. Dr. Bernhard Tilg
Univ.-Prof. Dr. Armin Graber
Co-Examiner: Dipl.-Ing. Dr. Bernhard Pfeifer
Accepted by the examination committee on
Executive Summary
Introduction: With the adoption of high-throughput technologies, the amount of available
biomedical data has grown considerably over the last years. This growth has become a challenge
in research fields such as genomics, where the use of cDNA microarrays led to the need for tools
to manage the vast amount of information[1]. Gardner identifies the problem as “Too often, the
data generated by the automated technologies gather in vast silos that are impressive in scale but
limited in usefulness to the organization“[2]. Integrating information from diverse sources poses
several challenges, which are presented in the following sections. This work introduces basic
principles and techniques used to bring biomedical data into an integrated and manageable state
and to address these challenges.
Objective: This work presents a framework providing conversion and import features, built for
the IMGuS project, which focuses on prostate cancer systems biology by combining several
-omics technologies.
Methods: The presented framework uses methods of data warehousing and semantic integration,
which are described in a dedicated section of this work. A conceptual overview of integration
techniques and current projects is given as well.
Results: The imgus-etl-framework for LINDA provides easily extensible features needed for the
processes and tasks involved in data integration, and especially in data warehousing. The
presented repository schema was designed so that the data warehouse can be extended with
additional -omics data and with information from external databases such as KEGG[3].
Conclusion: The introduction of the imgus-etl-framework should help to maximize the
involvement of the system users who produce the data, and thereby increase the rate of
performed updates. In the future, a graphical user interface might even allow user-specific
conversion and import processes to be created through a simple drag-and-drop application.
Zusammenfassung
Problemstellung: Durch die Etablierung von Hochdurchsatzverfahren stieg die Anzahl der
verfügbaren biomedizinischen Daten in den letzten Jahren stark an. Dieser Anstieg führte
für Forschungsgebiete, wie zum Beispiel die Genomics, wo durch die Verwendung von cDNA
Microarrays die Erstellung neuer Datenmanagementtools für die enorme Datenmenge
notwendig wurde[1], zu einigen Herausforderungen. Gardner beschreibt das Problem mit den
Worten “Too often, the data generated by the automated technologies gather in vast silos that
are impressive in scale but limited in usefulness to the organization“[2]. Die
Problemstellungen, die sich für die Integration von biologischen Daten aus verschiedenen
Quellen ergeben, werden vorgestellt. Des Weiteren werden grundlegende Prinzipien und
Techniken präsentiert, die genutzt werden, um diese Daten in einen integrierbaren und
verwaltbaren Zustand zu überführen.
Zielsetzung: In dieser Arbeit wird ein Framework vorgestellt, das die Umwandlung und den
Import von Daten ermöglichen soll. Dieses Framework wurde im Rahmen des IMGuS
Projektes, das auf die Zusammenführung von –omics für die Untersuchung von Prostatakrebs
hinarbeitet, erstellt.
Methoden: Das vorgestellte Framework verwendet Methoden des Data Warehousing und der
semantischen Integration, die in einem eigenen Kapitel dieser Arbeit vorgestellt werden.
Eine Übersicht über die verschiedenen Techniken der Datenintegration und aktuelle Projekte
ist ebenfalls enthalten.
Ergebnisse: Das imgus-etl-framework für LINDA bietet leicht zu erweiternde
Funktionalitäten, die für die Aufgaben der Datenintegration, vor allem in Data Warehouses,
von Nöten sind. Das vorgestellte Schema des Repository wurde mit dem Hintergrund einer
Erweiterbarkeit für zusätzliche –omics Daten erstellt, und soll in Zukunft auch Daten von
Datenbanken wie KEGG[3] enthalten.
Konklusion: Durch die Einführung des imgus-etl-framework soll die Benutzerinteraktion auf
Seiten der Datenproduzenten maximiert werden, wodurch sich auch die Anzahl der
Aktualisierungen entsprechend erhöhen soll. Künftig soll es eine grafische
Benutzeroberfläche erlauben, benutzerspezifische Transformationen und Importe mittels einer
simplen Drag-and-Drop Anwendung zu gestalten.
Table of Contents
1 INTRODUCTION
2 METHODS
2.1 DIAL (DATA INTEGRATION, ANALYSIS AND LOGISTICS)
2.1.1 DATA INTEGRATION
2.1.2 DATA ANALYSIS
2.1.2.1 Knowledge Discovery in Databases
2.1.2.2 Data Mining
2.1.2.3 KDD and Data Mining
2.1.3 DATA LOGISTICS
2.2 DATA INTEGRATION
2.2.1 OVERVIEW
2.2.2 REQUIREMENTS FOR DATA INTEGRATION
2.2.3 STEPS IN DATA INTEGRATION
2.2.4 INTEGRATION APPROACHES
2.2.4.1 Hernandez/Kambhampati-Classification of Integration Approaches
2.2.4.2 Leser/Naumann-Classification of Integration Approaches
2.2.4.3 Conclusion
2.2.5 CHALLENGES IN DATA INTEGRATION
2.2.5.1 Technical Challenges
2.2.5.2 Semantic Integration
2.2.5.3 Exponential growth rate of data amount
2.2.5.4 Human Resources
2.3 DATA WAREHOUSING
2.3.1 ARCHITECTURE
2.3.1.1 Back Room - Data Management
2.3.1.2 Front Room – Data Access
2.3.2 METADATA
2.3.3 KEEPING UP-TO-DATE
2.4 SEMANTIC INTEGRATION
2.4.1 ONTOLOGIES
2.4.1.1 Creating an Ontology
2.4.1.2 Web Ontology Language
2.5 CURRENT PROJECTS
2.5.1 SRS
2.5.2 DISCOVERYLINK
2.5.3 BIOMEDIATOR
2.5.4 ALADIN
2.5.5 ATLAS
2.5.6 BIOWAREHOUSE
2.5.7 COMPARISON
3 RESULTS
3.1 PROJECT IMGUS SETTINGS
3.2 LINDA REPOSITORY
3.2.1 DESIGN OF THE IMGUS-ETL-FRAMEWORK FOR LINDA
3.2.2 INFRASTRUCTURE
3.2.3 ARCHITECTURE
3.2.4 INTERFACES
3.2.5 IMPORT AND MAPPING COMPONENTS
3.3 IMPLEMENTATION OF THE IMGUS-ETL-FRAMEWORK FOR LINDA
3.3.1 FILE CONVERSION
3.3.1.1 Horizontal into Vertical Representation
3.3.2 FILE IMPORT
3.3.2.1 Talend Open Studio Import Classes
3.3.3 ONTOLOGIES
3.3.4 CONVERSION AND IMPORT
4 DISCUSSION
4.1 DATA INTEGRATION
4.2 SWITCH TO JAVA
4.3 INTERFACE USAGE
4.4 REPOSITORY
4.5 IMGUS-ETL-FRAMEWORK
4.6 DEPLOYMENT STAGES
4.7 FUTURE WORK
LIST OF FIGURES
LIST OF TABLES
BIBLIOGRAPHY
CURRICULUM VITAE
EXPRESSION OF THANKS
STATEMENT
1 INTRODUCTION
In 2001 the human genome was published by Lander[4], and a wave of enthusiasm swept
through the life science community. Researchers began to dream of a breakthrough in
personalized medicine within the next decades. Several years later, some scientists realized that
the genome alone could not bring the breakthrough everyone was expecting, as Quakenbush[5]
describes in 2007; already in 2003, van Beek had compared the repeatedly announced
breakthrough to airplane safety instructions one has heard too often[6]. Here we find calls for
bringing the gathered data into a manageable state, something Liu et al declared necessary for
molecular biology laboratories as well, following the introduction of gene expression studies
using cDNA microarrays[1]. A similar conclusion was reached at a workshop on genome
informatics issues[7], held by the Department of Energy as early as 1993: “More support for
complex, multi-database queries will require major efforts toward improving the integration and
interoperability of community databases. … Without an API, researchers must spend excessive
time manually identifying, extracting, and formatting data from community databases before
further analyses can begin.” Gardner addresses the problem as “Too often, the data generated by
the automated technologies gather in vast silos that are impressive in scale but limited in
usefulness to the organization“[2]. One of his conclusions is almost identical to the one above,
as he states that the way this data is managed needs to be rethought. Nagarajan et al even
consider the integration of data from the various sources the biggest challenge in the analysis of
the huge amount of available information[8].
Taking this into account, what makes the information manageable? One solution, provided by
van Beek, focuses on data integration, analysis and logistics[6]. The next sections will introduce
concepts and basics of data integration.
But first it is necessary to understand why transforming biomedical information into a
manageable and integrated state is crucial for further steps in biomedical research. One has to
distinguish between qualitative and quantitative biological research. A group performing
experiments on its own, producing qualitative data, has no need for automated data integration.
When the need to compare results with other groups arises, a scientist may integrate her data
manually, which means she might transform information locally, perhaps even with small
hand-made scripts on demand. But as the distribution of information becomes more and more
fragmented, both thematically and spatially[9], the need for automated data
integration arises. This fragmentation leads to the need for a common understanding and
definition of concepts. Creating ontologies and common standards, which is one paradigm in the
field of medical informatics, is a process that has begun in biomedical informatics over the last
couple of years[[10],[11]]. Large projects like the Gene Ontology[12] (GO) or the MicroArray
Gene Expression Object Model[13] (MAGE-OM) help to simplify the exchange of data and
knowledge. However, the use of given standards must not limit the representation of new
research results, as Almeida et al state[14].
When integrating biological data one must always keep in mind that it is not enough to store the
data in a proper way; the data's meaning has to be defined on a semantic level, and how such
data will be queried afterwards has to be considered as well. Not thinking about how to make
data harvestable might render the whole project useless, as scientists should not spend their
working days entering complicated queries, but rather perform experiments at their work
benches and explore the gained data. The options for entering user queries range from a simple
form on a web page to sophisticated scripting languages like Icarus, the scripting language of
the Sequence Retrieval System SRS[15].
When working with such amounts of data, one needs to think about integrating it and making it
accessible, as well as about how to analyse the data that is in the focus of interest. Van Beek[6]
refers to this set of tasks as DIAL (Data Integration, Analysis and Logistics), which is a quite
useful description. In a following section the steps performed in DIAL will be described, in
order to understand the process of creating new knowledge from several distributed sources.
As the integration of data comprises a wide range of different research approaches, it will be
discussed with regard to its requirements, which are of a technical and a semantic nature, and
the individual steps needed for a systematic integration approach will be described. Three main
categories can be identified when speaking about data integration. Navigational or linkage-based
approaches connect information or data entities by simply providing a traceable path between
them. Mediator-based integration approaches hide the underlying data sources from the user's
point of view, as they present one interface for querying multiple data stores behind it. The third
approach uses data warehouses, whose repositories store physical copies of the data from the
sources to be integrated.
Nowadays there are many projects aiming at providing a solution for integrating biomedical
data; some of these projects will be introduced in a later section, in order to provide a
comprehensive view of the current research field.
The last sections will describe parts of the IMGuS project, which was initiated in order to
enable a systems biology view on prostate cancer. The project aims at integrating data from
various -omics techniques, such as metabolomics, genomics or phenomics. This work focuses on
how to import the given data sets into the IMGuS data warehouse.
2 METHODS
Data integration is a major field in current biomedical research, as an enormous amount of
data is produced[[16],[17]]. Due to advances in biological methods, such as high performance
liquid chromatography, efficient data integration technologies might become a key issue in
disciplines like proteomics[18]. This section provides an overview of current technologies and
projects aiming at integrating data from various biological sources.
2.1 DIAL (Data Integration, Analysis and Logistics)
The Centre for Medical Systems Biology started a project targeting the harvesting of medical
knowledge by data integration, analysis and logistics, a set of tasks called DIAL[6]. This
acronym describes very well the whole process of handling biological and medical
information in order to gain new knowledge.
This section will describe the three parts, integration, analysis and logistics, in order to present
an overview of how modern biomedical science works. As data integration is covered in a
section of its own, it will only be described briefly here, whereas data analysis and logistics will
be presented in more depth.
2.1.1 Data Integration
Data integration might be considered as combining several different data sources in such a way
that someone accessing them gets a single representational view of the whole set[19]. As one
might easily imagine, this is quite a strong constraint, since today's biomedical information
repositories are becoming more and more distributed. To achieve these objectives, the systems
and algorithms involved have to be adapted to deal with these conditions.
Before automatic data integration became an issue, scientists had to manually find the needed
data sources, analyse their representational formats and then manually integrate the different
results[20]; but as more and more data sources appear on the internet, an automated approach
becomes essential.
As Schönbach et al state, data integration is the prerequisite for the later step of data
analysis[11]. This is especially true for the systems biology field, where a
systematic, integrated view of the individual is the basis for applying a methodological
approach towards gaining new information and knowledge, as stated by Palsson[21]. Integrating
data by informatics means was not always mandatory, since until recent years a scientist could
very well integrate her information with other sources manually[22], by mining other data
sources or by incorporating knowledge from the literature. Today, as most information is widely
distributed over the internet, non-automated data integration becomes increasingly impossible.
Given the amount of data that high-throughput assays create, an automated approach is
mandatory[[1],[17],[23],[16]].
2.1.2 Data Analysis
Having a huge amount of integrated data stored in one place does not by itself produce any
output; this data silo needs to be analyzed in order to obtain information or knowledge.
Mathematical tests, explorative or descriptive statistics, or methods of data mining and machine
learning, such as neural networks, may therefore be applied to this data. The steps and tasks
involved are called data analysis. Berthold and Hand[24] define data analysis as “…the process
of computing various summaries and derived values from the given set of data”. They further
point out that simply applying tools to a given data set amounts to a “cookbook fallacy” rather
than data analysis. In order to gain real advantage from integrated data, one needs to know how
to apply methods that really fit the given problem, otherwise meaningless or even false results
have to be expected[25].
For analysing data there are two opposite approaches. One is the classical approach of having
an idea, formulating a theory and then testing a hypothesis. The other is a branch of research
that tries to discover new knowledge by exploring the existing data without formulating a
hypothesis in advance and then trying to falsify it[6]. The idea of generating new hypotheses
through data analysis, and afterwards proving these theories in classical workbench laboratory
work, seems to be gaining relevance[5]; but as the number of new hypotheses created by modern
high-throughput analysis approaches increases, proving all of these fresh hypotheses might turn
out to be difficult[6].
2.1.2.1 Knowledge Discovery in Databases
One of the basic concepts used when creating new knowledge from databases is knowledge
discovery in databases, often abbreviated as KDD. On an abstract level, KDD may be regarded
as “making sense of data”, or, in a more formal definition: “KDD is the nontrivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data”, as described by Fayyad et al[25]. Fayyad et al
furthermore describe KDD as deriving understandable (summarized) reports from an amount of
basic data too large to be understood directly. This definition matches the characterization of
data analysis introduced above; KDD may therefore be regarded as a subfield of data analysis.
The goals of data mining may be classified into two groups: verification and discovery[26].
Verification aims at proving a user's hypothesis, whereas discovery describes the attempt to find
new patterns in a given data set. Discovery can be further divided into two sub-goals: prediction,
the goal of discovering parameters that may be used to predict future behaviour, and description,
which aims at finding patterns that describe the data sets in an understandable way.
2.1.2.2 Data Mining
Data mining can be considered as the application of algorithms within KDD, which makes data
mining a part of the KDD process[25]. More precisely, data mining is the selection and
application of algorithms to data sets in order to find useful patterns. These patterns might be
classifications, clusters or a model that represents the underlying data sets. Data mining is one of
the steps Fayyad et al defined in their KDD model, which will be shown later. The data mining
component includes the iterative and repeated application of data mining methods and
algorithms[26], combined with interaction by the user. Many of these methods originate from
the scientific fields of machine learning, pattern recognition and statistics. Several goals can be
defined for using data mining as a discovery tool (a small illustration follows the list below):
- Classification: Automatically mapping each data entry to exactly one predefined class
- Regression: Mapping an entry to a real-valued prediction variable and discovering
functional relationships between features
- Clustering: Identifying a finite set of categories or clusters that describe the data set
- Summarization: Finding a compact description for a subset of data
- Dependency Modelling: Modelling a graph that describes significant dependencies
between variables
- Change and Deviation Detection: Detecting the most significant changes in a data set
from former measurement entries or standard values
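To make two of these goals more concrete, the following minimal sketch (in Java, with invented class, method and variable names that are not taken from the imgus-etl-framework or any cited tool) illustrates summarization and change/deviation detection on a handful of hypothetical expression values:

```java
import java.util.List;

/** Minimal sketch of two data mining goals: summarization and deviation detection. */
public class MiningGoalsSketch {

    /** Summarization: a compact description of a data subset (here simply the mean). */
    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Change and deviation detection: flag values far away from the mean. */
    static void reportDeviations(List<Double> values, double threshold) {
        double m = mean(values);
        double squares = 0.0;
        for (double v : values) {
            squares += (v - m) * (v - m);
        }
        double sd = Math.sqrt(squares / values.size());
        for (double v : values) {
            double z = (sd == 0.0) ? 0.0 : (v - m) / sd;
            if (Math.abs(z) > threshold) {
                System.out.printf("deviating value %.2f (z-score %.2f)%n", v, z);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical expression measurements of one gene across samples.
        List<Double> expression = List.of(7.1, 6.9, 7.3, 7.0, 12.4, 7.2);
        System.out.printf("summary (mean): %.2f%n", mean(expression));
        reportDeviations(expression, 2.0); // flag values more than two standard deviations away
    }
}
```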
2.1.2.3 KDD and Data Mining
Fayyad et al created a model of the KDD process, including the role of data mining within it:
Figure 1 - The steps performed in KDD as defined by Fayyad et al[25]
The initial step, here labelled “Data”, is to understand the domain of the application and to
gather the current knowledge, facts and information. In a second step the “Target Data” is
selected by filtering out those objects that are needed for the further creation of knowledge.
After that the target data is pre-processed, resulting in “Preprocessed Data”, by filtering out
invalid and missing values or noise. In the next step the data is reduced to representative features
and fewer dimensions, e.g. using information gain methods, leaving the “Transformed Data” as
the working set. With this step the preparation and selection of data is finished.
Next, one has to decide which data-mining methods will be applied to the working set in order
to reach the goal of the KDD process. Given this choice, one may select the appropriate
data-mining algorithms and their parameters. After that the data-mining methods are applied to
the working set, searching for patterns of interest. These patterns may later be evaluated and
interpreted, which may cause a return to one of the earlier steps. Finally, after the patterns have
been successfully evaluated and interpreted, the newly created knowledge can be used.
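As a rough illustration of these steps, the sketch below (with purely hypothetical record names and values, and deliberately trivial selection, cleaning and mining rules) walks a tiny data set from raw “Data” to a first pattern:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Minimal sketch of the KDD steps described by Fayyad et al; all names are illustrative only. */
public class KddPipelineSketch {

    record Measurement(String geneId, double value, boolean valid) {}

    public static void main(String[] args) {
        // "Data": the raw, domain-level input.
        List<Measurement> data = List.of(
                new Measurement("BRCA1", 7.2, true),
                new Measurement("TP53", Double.NaN, false),
                new Measurement("KLK3", 11.4, true));

        // Selection -> "Target Data": keep only the objects needed for the question at hand.
        List<Measurement> targetData = data.stream()
                .filter(m -> !m.geneId().isEmpty())
                .collect(Collectors.toList());

        // Preprocessing -> "Preprocessed Data": filter out invalid and missing values.
        List<Measurement> preprocessed = targetData.stream()
                .filter(m -> m.valid() && !Double.isNaN(m.value()))
                .collect(Collectors.toList());

        // Transformation -> "Transformed Data": reduce each object to the representative feature.
        Function<Measurement, Double> feature = Measurement::value;
        List<Double> transformed = preprocessed.stream().map(feature).collect(Collectors.toList());

        // Data mining: search the working set for a pattern (here simply the maximum value).
        double pattern = transformed.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);

        // Evaluation and interpretation: turn the pattern into candidate knowledge.
        System.out.println("highest expression value in the working set: " + pattern);
    }
}
```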
2.1.3 Data Logistics
Defining the term “logistics” as “the planning, implementation, and coordination of the details
of a business or other operation”[27], data logistics may be regarded as dealing with data on a
meta level, aiming mainly at delivering the right type of data at the right time.
Jablonski et al split data logistics into two main tasks: data transportation and data
transformation[28]. With their Process Based Data Logistics (PBDL) approach, they indirectly
support the task of data transportation by applying workflow management methods, whereas
data transformation is performed using XML-based ontology wrapping between the different
formats.
2.2 Data Integration
As mentioned above, data integration is one of the primary prerequisites for enabling a
structured approach to the widely distributed data that today's biological and biomedical
disciplines produce. Data integration was defined by Lenzerini in 2002 as “the problem of
combining data residing at different sources, and providing the user with a unified view of these
data”[19].
Hernandez and Kambhampati define three goals that should be enabled by data integration
approaches: gathering knowledge from a huge amount of data, formulating a hypothesis and
finally verifying this hypothesis, which they consider the primary use of bioinformatic
integration systems[22].
When working with biological data, special challenges such as the variety of data and
representational heterogeneity have to be taken into account[22]. An interesting distinction is
made by Leser and Naumann[29], who separate “data-focused” and “schema-focused”
integration approaches. Data-focused systems provide a high standard of data by manually
maintaining the entries, whereas schema-focused systems act more like a biological
“integration middleware”.
2.2.1 Overview
This section will show why it is necessary to use data integration when performing systems
biology research. There are several different approaches to this, from linkage-based data storage
to holistic biomedical data warehouses. All of these techniques try to enable the scientist to gain
information and knowledge from experimental or in silico data, existing knowledge, or even
measurements from animal models or patients. Another benefit of combining information from
different sources is that redundant or overlapping data sets can be used to verify or cross-validate
other entries[30].
Over the last decades the amount of biological data to be stored has increased exponentially, as
described by the EMBL statistics[31]. This growth leads to the need to bring this information
into a manageable state, so databases were developed containing this information, mostly in
proprietary formats. Worldwide biological research and development thus had to face the
challenge of becoming thematically and spatially more and more fragmented,
as described by Leser and Rieger[9]. This fragmentation caused biological databases to spread
at high speed.
The current number of registered biological databases is 968, an increase of 13% compared to
2006, as described in Galperin's “The Molecular Biology Database Collection”[32]. Taking this
situation into account, one can easily see the need to integrate this huge amount of data in order
to enable a complete systems biology approach.
Today several approaches to integrating data at the storage level can be distinguished.
Hernandez and Kambhampati define three types of integration approaches[22]. These methods
will be explained in order to illustrate their strengths and weaknesses.
Something to keep in mind when integrating data from diverse sources is that when merging
information from objects considered similar enough to be joined, possible differences might be
blurred, making it impossible for a biologist to know exactly which object he is looking at. This
problem does not occur in an approach where information coming from different sources is kept
separate and only a linking structure is established[33].
When talking about the integration of data, it is fundamental to know about the challenges this
approach presents. Since the information may come from several sources and may even
represent different (experimental) views of a data set, it is crucial to keep track of the technical
and semantic requirements an integrating service has to meet.
2.2.2 Requirements for Data Integration
When integrating data from biological sources, several requirements have to be met to enable
valuable use. Some of these requirements originate from the historical development in business
economics, from which the idea of integrated data, or “federated databases”, derives.
Nevertheless, they apply to data integration approaches in biomedical environments, as defined
by Leser and Rieger[9], Haas et al[34] and Ibrahim and Schwinger[35]:
- Transparency: Masking the data sources from the user. The user does not need to
know what the underlying semantic and technical implementations are; accessing the
top-level resource delegates a query to the underlying systems. The data sources must
also not be forced to show a given behaviour, as they should act completely
independently.
- Completeness: The data represented by the integrated system should be the complete
data held by the underlying systems.
- Semantic Correctness and Non-existence of Redundancy: The schema represented by
the global system has to be semantically correct, and the addressing of its elements has
to be unambiguously defined. The individual data sources may even contain conflicting
elements.
In a biological environment some additional requirements have to be met:
- Actuality: For some biological questions it is necessary to inspect the most recent data.
When working with a data integration system that copies entries into its own physical
storage, it is quite challenging to keep the stored information up to date.
- System Performance: With the high-throughput measurements used in the -omics field,
the size of the gathered information creates the need for efficient algorithms to manage
and explore this huge amount of data[[11],[36]]. In order to enable an algorithm to
access the data, good system performance allowing efficient and optimized access to the
stored data is crucial.
- Data Integration: Integrating data in a biological context means recognising and
merging duplicate information from various sources. Linkages between objects have to
be detected, and possible contradictions within data sets have to be resolved.
2.2.3 Steps in Data Integration
Given the requirements for successful data integration, it is necessary to think about how the
data integration should be performed in order to guarantee that these formal requirements are
fulfilled. A study performed by Seligman et al in 2002 on behalf of the MITRE Corporation[37]
defines eight tasks that need to be performed when integrating data[38]. This subsection presents
these steps, which are more fine-grained than most other definitions of data integration tasks. In
order to focus on biomedical data integration, aspects of this field of application are added to the
task descriptions.
- Gathering knowledge about sources: Each data source has to be understood in
schematic, representational and semantic terms. This might be difficult, as not every
source in a biomedical environment is documented well enough to be completely
characterized.
- Gathering knowledge about the application target: The interfaces and views that a user
or a user-side system will have on the specified application targets have to be designed and
implemented. Therefore it is crucial to have end users on board who can help in
understanding what these goals may look like.
- Identifying semantic correspondences: This might be considered as semantic
harmonizing, trying to merge logical entities from the different data sources that
correspond to the same real-world objects[39]. In the biological sciences this can be
difficult, as many data sources have created their own ontological systems, each fitting
their own needs perfectly.
- Creation of attribute transformations: Having identified the needed attributes and their
required representation, the transformation of these attributes has to be considered next.
Some attributes need syntactic transformation; other target attributes are aggregations or
calculations based on attributes provided by the data sources.
- Specifying data combination rules: When combining data vectors from different sources,
it has to be specified how this combination takes place. A much bigger challenge than
merging the vectors is the handling of duplicates: determining which vector contains the
“true” information is difficult, and omitting the conflicting entity might cause problems
when interpreting the new data (a small sketch of such a combination rule follows this list).
- Creating the logical mapping: Having performed the tasks above, the mapping from the
source data to the user-facing data can be defined. The way this is done depends on the
integration approach; for example, in a mediated integration the mapping could be
expressed as an SQL view.
- Cleaning the data: Incorrect values in the data have to be detected and corrected. This
is an important step, since especially in biomedical applications wrong data entries
could lead to wrong statistics and results[40]. Rahm and Do define several requirements
for data cleaning approaches[40]: the cleansing should be complete, meaning that all
errors and inconsistencies in both the source data and the integrated data are removed;
the cleaning approach should require minimal manual interaction by the user and should
be extendable to further data sources; and, last but not least, it should be combined with
the schema-related data transformations. It has to be kept in mind that it is difficult to
decide which data can be deleted, since even incomplete or potentially incorrect entries
might be of interest for research purposes[22]. A survey done by Schönbach et al showed
that about 30% of 145 data sources contained an error that could have caused further
trouble[11].
- Implementation of a user-friendly access environment: Making the integrated data
accessible to the user in a way that allows efficient work and useful results. In most
cases this will mean providing a GUI or web-based interface to the user
community.
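The following sketch illustrates one possible data combination rule with duplicate handling, as referred to in the list above. The record layout, the accession-based matching and the “curated source wins” rule are illustrative assumptions and do not describe the rules used in the IMGuS project:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of a data combination rule with simple duplicate handling. */
public class CombinationRuleSketch {

    record SourceRecord(String accession, String description, String source, boolean curated) {}

    /** Combine records from several sources; duplicates are resolved by a simple rule. */
    static Map<String, SourceRecord> combine(List<SourceRecord> records) {
        Map<String, SourceRecord> merged = new LinkedHashMap<>();
        for (SourceRecord r : records) {
            SourceRecord existing = merged.get(r.accession());
            if (existing == null) {
                merged.put(r.accession(), r);
            } else if (r.curated() && !existing.curated()) {
                // Rule: a manually curated entry wins over an automatically derived one.
                merged.put(r.accession(), r);
            } else if (!r.description().equals(existing.description())) {
                // Conflicting duplicates are reported instead of being silently dropped.
                System.out.println("conflict for " + r.accession() + ": '"
                        + existing.description() + "' vs '" + r.description() + "'");
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<SourceRecord> records = List.of(
                new SourceRecord("P07288", "Prostate-specific antigen", "source A", true),
                new SourceRecord("P07288", "Kallikrein-3", "source B", false),
                new SourceRecord("Q9Y6K1", "DNA methyltransferase 3A", "source B", false));
        combine(records).values().forEach(System.out::println);
    }
}
```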
2.2.4 Integration approaches
During the last couple of years, three main approaches to data integration have become apparent,
as described by Hernandez and Kambhampati[22]:
- Navigational Integration
- Mediator-based Integration
- Warehouse Integration
Leser and Naumann[29] provide a classification as well, but instead of taking the classic
technical view, they focus on the grade of integration:
- Data-focused
- Schema-focused
The first classification will from here on be called the “Hernandez/Kambhampati classification”,
the second one the “Leser/Naumann classification”. In this section these two classifications and
their classes will be presented with their basic concepts, as well as their pros and cons.
2.2.4.1 Hernandez/Kambhampati-Classification of Integration Approaches
As mentioned above, the three types of integration approaches classified by Hernandez and
Kambhampati are: navigational integration, mediator-based integration and warehouse
integration. This classification represents a methodical distinction of the integration techniques,
based on where the access to and combination of the data takes place. As the first two
approaches, navigational and mediator-based integration, leave the data at their sources and do
not store a physical copy of the information in a central repository, they are called “virtual”
integration approaches. Data warehousing, on the other hand, stores transformed copies of the
data in a central repository and is therefore called a “materialized” integration. It is important to
keep this distinction in mind when thinking about read and write access to these data sets.
2.2.4.1.1 Navigational Integration
This approach is also called “link-based” integration, meaning that the integrated data is still
distributed over several sources and still has several different forms of representation, but is
connected via a linking model. The idea may be compared to the linking system used by the
WWW: the information is fragmented over many servers, but by establishing links a connection
between the desired pieces of information is created. One idea is to store the linkage information
as a pair of keys containing the ID of the target database and the accession number of the
dataset, as described by Leser and Naumann[29]. Such links between different data sources
might be maintained by hand or established automatically, and are often referred to as
“cross-references”[41].
The weaknesses of this method are the combinatorial explosion of possible links between the
data sources and the rather simplistic semantic model, which might lead to a high number of
false negative or false positive links between the data sources[2].
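A cross-reference of this kind can be sketched as a simple pair of database identifier and accession number; the type names below are invented for illustration, and the example values merely show what such a link might look like:

```java
import java.util.List;

/** Minimal sketch of link-based integration: cross-references as (database, accession) pairs. */
public class CrossReferenceSketch {

    /** A cross-reference: the target database and the accession of the linked entry. */
    record CrossRef(String targetDatabase, String accession) {}

    record GeneEntry(String symbol, List<CrossRef> crossRefs) {}

    public static void main(String[] args) {
        GeneEntry klk3 = new GeneEntry("KLK3", List.of(
                new CrossRef("UniProtKB", "P07288"),
                new CrossRef("KEGG", "hsa:354")));

        // Navigation means following such links into the other data source;
        // the data itself stays in the source databases.
        for (CrossRef ref : klk3.crossRefs()) {
            System.out.println(klk3.symbol() + " -> " + ref.targetDatabase() + ":" + ref.accession());
        }
    }
}
```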
2.2.4.1.2 Mediator-based Integration
“Define an object that encapsulates how a set of objects interact. Mediator promotes loose
coupling by keeping objects from referring to each other explicitly, and it lets you vary their
interaction independently.”
This definition of the mediator design pattern by Gamma et al[42] states that the mediator
provides access to an encapsulated, internal representation of objects without showing the
outside world what the internal linkage structure looks like. Replacing the term “object” with
“database” in this definition explains how the mediated integration approach works. As this
approach merges the various data sources, it is often referred to as the “federated databases”
approach[30].
Figure 2 - In this approach data sources are integrated by links between the objects: a tuple in database 1
contains linkage information pointing to a primary key in database 2.
Figure 3 - The mediator (virtual database) hides the data sources from user access. For each data source a
wrapper has to be implemented to delegate the queries and access the data.
The query sent to the system by a user is transformed by a mediating level, so that the several
databases that may lie behind the mediator interface can be queried. The mapping of the
obtained information is provided by the mediator as well.
In order to access one of the underlying data sources, a wrapper has to be used. A wrapper is
composed of two components: one component sends a query to the data source in order to
retrieve the information, while the second transforms the obtained information into the expected
output format[43]. This means that when the integration of n data sources is planned, n wrappers
have to be implemented in the worst case. Data sources need not be databases; they can also be
flat files, which can likewise be accessed by a specific wrapper.
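The two-part structure of a wrapper can be sketched as a small Java interface. The type names and the flat-file format below are assumptions made for illustration; they are not taken from an existing mediator system:

```java
import java.util.List;

/** Minimal sketch of the two components of a wrapper in mediator-based integration. */
interface SourceWrapper {

    /** Component 1: delegate the mediator query to the underlying data source. */
    String sendQuery(String mediatorQuery);

    /** Component 2: transform the source-specific result into the mediator's common format. */
    List<String> transformResult(String rawResult);

    /** Convenience method combining both components. */
    default List<String> query(String mediatorQuery) {
        return transformResult(sendQuery(mediatorQuery));
    }
}

/** Example wrapper for a hypothetical pipe-separated flat-file source. */
class FlatFileWrapper implements SourceWrapper {

    @Override
    public String sendQuery(String mediatorQuery) {
        // A real wrapper would scan the flat file; here a canned line is returned.
        return "KLK3|prostate specific antigen";
    }

    @Override
    public List<String> transformResult(String rawResult) {
        // Transform the pipe-separated line into the mediator's record format.
        String[] fields = rawResult.split("\\|");
        return List.of("symbol=" + fields[0], "name=" + fields[1]);
    }
}

class WrapperDemo {
    public static void main(String[] args) {
        SourceWrapper wrapper = new FlatFileWrapper();
        wrapper.query("find gene KLK3").forEach(System.out::println);
    }
}
```

In the worst case, one such implementation is needed per data source, which reflects the remark above about n wrappers for n sources.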
There exist two different approaches to providing a view on the mediator database[44]. The
first is called “Global-as-View” (GAV) and the other is referred to as “Local-as-View” (LAV).
Both concepts are briefly described below.
2.2.4.1.2.1 Global-as-View GAV
In the GAV concept the global schema may be compared to an ordinary view in a database
system, and unfolding a query is quite trivial. However, any change in the information sources,
or the addition of another information source, requires redesigning the global view. This makes
GAV of little use for systems where changes in the data sources occur frequently.
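As a minimal sketch of the GAV idea (with invented table and column names), an element of the global schema is simply defined as a view over the source schemas; unfolding a query then amounts to expanding this view definition, while adding a new source forces the view to be rewritten:

```java
/** Minimal sketch of a Global-as-View mapping; table and column names are made up. */
public class GavMappingSketch {

    /** The global relation gene_annotation is defined directly over the two sources. */
    static final String GENE_ANNOTATION_VIEW =
            "CREATE VIEW gene_annotation AS "
            + "SELECT g.symbol, g.chromosome, p.uniprot_acc "
            + "FROM source_a.genes g "
            + "JOIN source_b.proteins p ON p.gene_symbol = g.symbol";

    public static void main(String[] args) {
        // A query against gene_annotation is answered by expanding this view definition;
        // integrating a third source would require changing the view itself.
        System.out.println(GENE_ANNOTATION_VIEW);
    }
}
```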
2.2.4.1.2.2 Local-as-View LAV
Adapting to changing data sources is easier in Local-as-View environments, because here a
global schema exists independently of the schemas the sources provide. For a changed or an
added source, only a source description has to be modified or implemented. The drawback of
LAV, however, is that the reformulation and transformation of queries is a non-trivial task, thus
resulting in low performance. This problem is addressed as “answering queries using
views”[45], because the query needs to access the local sources through their own local views.
2.2.4.1.2.3 Combination of the approaches
As both approaches have their weaknesses, several projects aim at combining GAV and LAV in
order to get the best results by exploiting their respective strengths. These projects may be
referred to as Global-Local-as-View or GLAV, as introduced by Friedman et al in 1999[46].
Other research groups are also trying to get the best out of these two approaches; e.g. Lacroix
presented a wrapper using the “search view” approach in order to create an intermediate-level
mechanism[43].
2.2.4.1.3 Data Warehouse Integration
This approach integrates the data from various sources by copying the information into a
central repository, which makes the integrated and queried information part of the data
warehouse rather than of the sources. This means that the data uploaded to the warehouse may,
and in most cases must, be transformed to fit the warehouse schema. Or in other words: “A data
warehouse can therefore be seen as a set of materialized views defined over the sources”, as
stated by Theodoratos and Sellis[47]. The big advantage of using data warehousing techniques is
that, by working only with copies of the original data sets, the information originally produced
by the research does not need to be changed; when this information is uploaded to the data
warehouse, the transformation takes place on local copies.
2.2.4.2 Leser/Naumann-Classification of Integration Approaches
In their work “(Almost) Hands-Off Information Integration for the Life Sciences”, Leser and
Naumann separate two major types of data integration projects. They distinguish by the grade of
integration, defining the first group of projects as “data-focused”, which describes projects that
are maintained manually and thereby provide a high quality standard of information. The second
group are the so-called “schema-focused” projects, which focus on providing a global schema
for data storage.
2.2.4.2.1 Data-focused Integration
Data-focused projects can be considered the most successful projects in the biological field so
far, as they provide high data quality by administering the information manually. Since this
maintenance is performed by experts, one can be sure that the information provided meets the
needs of good scientific practice. As these types of projects mainly focus on data quality and on
the completeness of the entered information entities, basic database
demands may play a minor role. Since these projects have to be administered manually by
experts, the costs of keeping such projects consistent are quite high. Examples of such projects
are Swiss-Prot[[48],[49]] or OMIM[50].
2.2.4.2.2 Schema-focused Integration
Projects that are “schema-focused” make extensive use of database technology, as they try to
fit data to a given schema in order to achieve a high degree of automation. Such projects are not
yet very successful in the life sciences, at least in terms of attracting attention, which, as Leser
and Naumann believe, is caused by their schema-centricity[29]. Creating a global schema for
information originating from different groups can be a difficult task, especially mapping the
semantics of the entries. And as some kind of abstraction is necessary to create a global schema,
biologists might distrust these steps.
2.2.4.3 Conclusion
Having introduced the three integration approaches as defined by Hernandez and
Kambhampati, it is impossible to say that one approach is the only one that works; instead, one
can see the strengths and weaknesses each of these approaches has. This subsection tries to give
a brief summary of the pros and cons, but the selection of an approach has to be made separately
for each project in which a need for data integration arises.
Following a navigational integration approach might help when working with loose webs of
data for which no relational schema is provided, where a data web is characterized as a set of
pages and the links between those pages[46]. Taking this definition to a more abstract level, data
webs could be sets of information entities providing information about the links between those
entities. This can be a useful approach if the information provided by a data source is only
reachable by following one or more pathways, or if the data source providing the information
allows no automated parsing or information recognition. This does not free a user from the need
to manually combine the results found with this approach. It has, however, the advantage that a
user can undoubtedly identify the source certain information comes from.
One advantage of using a data warehouse is total control over the data that is used. Since every
piece of information has to be loaded into the data warehouse by a process, this process can
check whether the information to be imported meets the requirements. Another advantage is that
all the time that could be lost waiting for a slow data source on the network is saved if the
complete data is stored in one central repository. Furthermore, the
data sources that are often needed for the daily business of research work, or even for
life-critical tasks in a clinical environment, are kept safe from denial-of-service effects that
might be caused by huge batches of data being downloaded for a real-time survey, as can happen
in mediator-based integration approaches[51].
Another advantage is the possibility to perform all query optimization that might be necessary
on the local system, whereas mediator-based projects might lack the required information about
the current query execution environment of the data sources providing the needed biological
information. Furthermore, the more sources are integrated over the internet in real time, the
higher the probability that one of these sources is not available at the time it is needed; and even
if all the sources send the information to answer a query, this might mean an enormous amount
of data being transferred over the network, which may cause the network to be overloaded[30].
One of the major drawbacks of using data warehouses is that potentially outdated data might be
used if updates of the data set are not performed regularly. This cannot happen with a
mediator-based system, since all information is gathered in real time from the underlying data
sources.
2.2.5 Challenges in Data Integration
Integrating data from several sources always brings challenges of a technical nature, but when
integrating biomedical data there are even more challenges to be kept in mind and mastered.
These challenges may be technical as well as of a semantic nature[9], and even educating the
staff who integrate and work with this data may be considered part of the challenge[52].
To understand the way data integration is performed, one needs to understand what problems
and challenges this task involves. This section sets out the current demands of modern biological
data integration.
2.2.5.1 Technical Challenges
Some of the challenges data integration projects have to face are of a technical nature; a
solution to these problems is mandatory for every integration project.
2.2.5.1.1 Various Data Formats
When exchanging or integrating data, one needs to define how, syntactically, this exchange
happens. Some projects define their own file formats for exchange, e.g. ASCII files containing
nested commands and information. With the spread of XML as a quasi-standard for exchanging
text-based files, this problem is becoming less important. Achard et al suggested the use of XML
and XML interchange data dumps to replace the then-current flat file exchange as early as
2000[53]. Regarding the usage and spread of SBML and other similar XML-based exchange
formats, it seems this has by now come true.
2.2.5.1.2 Various Access Languages
Once the integrated data is stored on a system, a user must be able to access this information in
order to query for entries of interest for his research. Enabling easy access that fulfils the user's
demands is crucial to make such software usable. Projects like BioKleisli/K2[54] or
DiscoveryLink[55] created their own query languages in order to enable user access. In the case
of BioKleisli, the language resulting from the first design approach was so complicated that in a
commercial follow-up project a whole new, more SQL-like, query language had to be
constructed.
2.2.5.2 Semantic Integration
Leser and Rieger define semantic heterogeneity as a two-level challenge[9]: on the semantic
level (“What is a gene?”) and on the data level (“Are two genes identical?”). As bioinformatics
is, compared to medical informatics, a quite young discipline, definitions and ontologies still
undergo a constant process of change. Even a fundamental concept like the gene is ambiguously
defined. It may happen that two different sources have different wordings for the same concept,
or, even worse, the same wording for two different concepts. This may easily lead to data
inconsistencies[22]. Some data sources do not even have well-documented descriptions of their
content and schema.
With the evolution of biomedical data integration, the need to combine the existing ontologies
becomes more and more obvious, as Rosse and Mejino Jr. describe for the process of designing
new ontologies in the areas of medical informatics and even more in biomedical
informatics[10]. By sticking to large ontologies like the Gene Ontology[12], which is
meanwhile even part of the Unified Medical Language System[56], the integration process may
be simplified.
As semantic integration is one of the most active research fields in the domain of data
integration[[57],[2]], a later section will introduce the concepts and findings in more depth. One
of the most important tasks in developing semantic integration methods, the design and
development of ontologies, will be presented in this later section as well. It has to be kept in
mind that using established community standards might in some cases lead to the problem of
not being able to represent new findings in the standardized way[14].
2.2.5.3 Exponential growth rate of data amount
As described above, biological databases grow in two dimensions: the first dimension is the
number of databases itself, the second the number of entries in the databases. The number of
database entries in Swiss-Prot increases at an exponential rate (Figure 4). Similar growth rates
can be observed in other biological databases.
As Figure 4 shows, even the manually administered Swiss-Prot database grows at an exponential
rate. It is easy to imagine that projects that do not need any manual editing may grow at an even
higher rate. Figure 5 shows the growth of GenBank from 1985 to 2006.
Figure 4 - Swiss-Prot release 53.0 contained 269,293 entries on 29-May-2007 and keeps growing at an
exponential rate. Figure taken from [58].
The storage itself is not a big problem yet, since other databases contain much larger amounts
of data; but if the growth remains exponential it might become difficult in the future. Something
that already has to be dealt with today is the need for efficient algorithms that can handle such
amounts of data in order to analyse the information, as is shown in Ning et al[60] or Enright et
al[61].
2.2.5.4 Human Resources
Speaking of efficient algorithms leads to an aspect that seems to have faded from the spotlight:
in order to keep the pace of method development and daily work bench practice up with the
velocity of information growth, specialists in these areas are needed. Heidorn et al suggest
training bioinformaticians in information management and data integration skills, in order to
keep the scientists focused on research issues rather than on data integration problems[52]. A
“biological informatician” should be able to support local research groups, as well as to develop
tools and integration methods for a global science approach.
Figure 5 - The growth rate of GenBank from 1985 until 2006 shows exponential behaviour. Figure taken
from [59].
But the need to be trained and skilled in these new disciplines does not only arise for those
integrating the huge amounts of information. Physicians and other staff directly involved in
patient care need to be aware of the possibilities opening up in research and diagnostics. This
creates the need to adapt the available tools in such a way that staff without a biomedical
informatics background can understand and work with these applications as well[62]. The
National Institutes of Health recently presented a Roadmap aimed at bridging the clinical
research process with laboratory results, thereby taking into account the speed at which new
scientific results are being found in the life sciences and other biological fields[63].
2.3 Data Warehousing
As introduced before, a data warehouse may be regarded as a materialized view over several
distributed sources of data. Data warehouses have become more and more established as tools
for knowledge gain in the biomedical field over the last years. They have been used to support
work in healthcare[64] and chemoinformatics[65]. As they originate from the business and
financial sciences, a data warehouse may also be defined as “a collection of technologies aimed
at enabling the knowledge worker (executive, manager and analyst) to make better and faster
decisions”, as Jarke et al do[66]. This is a quite business-oriented view, but it can easily be
transferred to the biomedical domain. Most of the following definitions and facts in this section
are taken from Kimball's and Caserta's book “The Data Warehouse ETL Toolkit”[67].
One of the currently most used definitions of a data warehouse, based on its properties, is “A
Data Warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in
support of the management's decision making process” by Inmon[68]. In the next paragraphs
these features will be examined and their usefulness in a biomedical setting discussed:
- Subject-oriented: A data warehouse has to be focused on a specified target, in terms of
research area, in order to enable a result-oriented approach. In a biological setting this
means one has to decide which subject of research a data warehouse should support
(e.g. cancer of the bladder). Schönbach et al even distinguish two groups of collections
of biomedical data: subject-oriented data warehouses and general-purpose
databases[11].
- Integrated: As mentioned above, the integration of data coming from various
fragmented sources enables a holistic view of a posed question. This integration of
information is crucial nowadays, since there are several hundred databases available
containing relevant information on biological matters.
- Time-varying: Object information stored in a data warehouse is not deleted when newer
information is added. A timeline or history of this data can thus be inspected, enabling a
scientist to reproduce the evolution of information on an object.
- Non-volatile: The information kept in the data warehouse repository is stored
permanently and will not be deleted.
One important property that has to be added to the above definition is:
- Read-only: The data stored in the data warehouse is only read by users. No write access
is allowed from the outside world, except for updating the stored information by adding
new entries from other data sources[69].
The data stored in a data warehouse is often of a multidimensional kind, as the focus of interest
depends on the posed questions[70]. To create an analogy to the example presented by
Chaudhuri and Dayal: in a data warehouse containing biological sample information, dimensions
of interest may be the time of sample acquisition, the type of biological material a sample
consists of (tissue, urine, blood), or the group a specimen belongs to (disease, control,
medication A, medication B). In many cases these dimensions are hierarchically structured.
One of the key targets in implementing a data warehouse has to be achieving a high quality of
data, since data quality is one of the major factors correlating significantly with end user
satisfaction, as a survey performed by Shin in 2003 shows[71]. It is noteworthy that the ability to
locate data (grouping the ability to locate data and metadata and the level of detail in defining
the data) scored second place in this user satisfaction ranking. Regarding the vast amount of data
stored in a data warehouse, and taking the completeness of user documentation on metadata into
account, this implies that a data warehouse implementation has to provide comprehensive
documentation in order to achieve a high degree of user acceptance and be successful.
2.3.1 Architecture
A data warehouse may be divided into two entities, both physically and logically. One entity,
the so-called back room, holds and manages the data, while the other entity, referred to as the
front room, enables data access. This distinction is crucial for understanding how a data
warehouse works and how it is organized.
Figure 6 - A data warehouse may be considered as consisting of a back room component and a front room
component. While the back room is responsible for data integration and storage, the front room has to
enable access to the data. Graphic taken from the IMGuS presentation at DILS 2007[72].
2.3.1.1 Back Room - Data Management
The back room is often described as the data management or data preparation component. It
contains the data and prepares and delivers data for queries, but it does not accept any user
queries from the outside, since this is a task of the front room. The back room is often referred to
as a “staging area”, which in this context may be understood as permanently storing the
information on a physical medium like a disk.
2.3.1.1.1 The ETL-System
The Extract-Transform-Load (ETL) system may be considered the basic concept of a data
warehouse back room. The ETL process extracts the needed data from the source systems,
transforms it into the needed representation by performing aggregations and other mutations on
the extracted data, then loads the results into the data warehouse repository, and finally presents
the stored data in a user-friendly representation format. Putting these steps together, it is possible
to state that ETL is responsible for data integration in the data warehouse approach. The sources
of the data to be integrated may be flat files or data
coming from a real database system, which is important as many of the public databases
provide complete SQL dumps of their contents while others offer just a flat file representation.
A more formal definition of what the ETL system is responsible for was given by Simitsis et
al[73] (a small sketch of such a pipeline follows the list):
- Identifying relevant data within the data sources
- Extracting this information
- Customizing and Integrating this information into a common format
- Cleaning this data
- Storing this cleansed data in the data warehouse
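The sketch below runs a toy record set through these steps. The flat-file layout, the probe-to-gene columns and the console "load" step are illustrative assumptions and do not reflect the actual imgus-etl-framework classes:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Minimal sketch of an ETL run over a small flat-file extract. */
public class EtlSketch {

    record ExpressionRecord(String probeId, String geneSymbol, double value) {}

    public static void main(String[] args) {
        // Extract: read raw lines from a (here hard-coded) flat-file source.
        List<String> rawLines = List.of(
                "1007_s_at\tDDR1\t7.3",
                "1053_at\tRFC2\tNA",   // contains a missing value
                "117_at\tHSPA6\t5.1");

        // Transform and clean: parse the lines and skip records with missing values.
        List<ExpressionRecord> cleaned = rawLines.stream()
                .map(line -> line.split("\t"))
                .filter(fields -> !fields[2].equals("NA"))
                .map(fields -> new ExpressionRecord(fields[0], fields[1], Double.parseDouble(fields[2])))
                .collect(Collectors.toList());

        // Load: store the cleansed records in the warehouse repository
        // (represented here by console output instead of a database insert).
        cleaned.forEach(r ->
                System.out.println("INSERT " + r.probeId() + " " + r.geneSymbol() + " " + r.value()));
    }
}
```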
2.3.1.2 Front Room – Data Access
The front room component enables a user or client application to access the data held in the
warehouse. The main task of the front room is mapping the huge amount of low-level data
usually stored in a data warehouse to another, more valuable form[25]. This more valuable form
may be more useful, more abstract or more compact. For accessing the data, techniques like data
marts and Online Analytical Processing (OLAP) may be established as the way a user or a
reporting tool accesses the stored information. The front room is responsible for managing
queries as well.
The front room activities are often referred to as Business Intelligence (BI). The term implies that non-trivial operations take place here. If business intelligence is defined as "…the process of turning data into information and then into knowledge", as Golfarelli et al do[74], and, at the same time, the repository containing the data of the data warehouse is regarded as a database, the terms BI and KDD may be treated as equivalent in this context.
The front room may provide techniques of data mining, text mining or classical statistical
methods. These may be performed on Data Marts and OLAP cubes. These two technologies
shall be presented in order to understand how BI makes use of data provided by the back
room.
2.3.1.2.1 Data Marts
A data mart is commonly defined as a subset of a data warehouse. It contains the same data, but filtered and aggregated so that it only contains data based on a certain business process or, as some consider it, a department-based view. Publications differ on what data it really may contain, but for the current work the above definition suffices. Data marts are often introduced for performance or security reasons, or when the need arises to restructure some parts of the existing data in the warehouse, e.g. when applying BI methods.
2.3.1.2.2 Online Analytical Processing OLAP
While in classical relational database systems the concept of transactions plays an important role, a transaction-proof infrastructure is of minor importance in data warehouses[70]. In operational environments the term Online Transaction Processing (OLTP) describes the focus on efficient transaction handling[75]. In contrast, OLAP systems need to provide a more analytical access to and view on the data, focusing on decision support.
Typical OLAP operations, as Chaudhuri and Dayal[70] describe them, are (illustrated by the sketch after this list):
- rollup, which increases the level of aggregation
- drill-down, a decrease in the level of aggregation or an increase in the level of information detail
- slice_and_dice, information selection and projection
- pivot, re-orienting the multidimensional view of the data
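The difference between roll-up and drill-down can be illustrated with a small aggregation example. The sketch below is purely illustrative and not part of LINDA; it aggregates a few invented measurement values once per (group, material) cell, i.e. at the detailed level, and once rolled up to the group level only.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative roll-up vs. drill-down on an in-memory fact list (invented data, not part of LINDA).
public class OlapSketch {
    public static void main(String[] args) {
        // Facts: patient group, biological material, measured value.
        String[][] facts = {
            {"disease", "tissue", "1.2"}, {"disease", "serum", "0.8"},
            {"control", "tissue", "0.9"}, {"control", "serum", "1.1"},
            {"disease", "tissue", "1.4"}
        };

        // Detailed level (drill-down view): aggregate per group AND material.
        Map<String, Double> detailed = new LinkedHashMap<String, Double>();
        // Rolled-up level: aggregate per group only (one aggregation level higher).
        Map<String, Double> rolledUp = new LinkedHashMap<String, Double>();

        for (String[] f : facts) {
            double v = Double.parseDouble(f[2]);
            detailed.merge(f[0] + "/" + f[1], v, Double::sum);
            rolledUp.merge(f[0], v, Double::sum);
        }
        System.out.println("drill-down: " + detailed); // {disease/tissue=2.6, disease/serum=0.8, ...}
        System.out.println("roll-up:    " + rolledUp); // {disease=3.4, control=2.0}
    }
}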
2.3.2 Metadata
A data warehouse would not be manageable without taking into account information about the data itself, defining all elements and how they work together. Several institutes and consortia are working on defining standard sets of metadata. These sets are already in use in several biomedical approaches[76]. BioRegistry, for example, is a project aiming at creating a metadata repository for biological databases[36].
In a data warehouse, metadata may be divided into two sets: one set contains information about how to extract and load data from the different sources and is referred to as back room metadata, while descriptive information about the stored data is labelled front room metadata. Back room metadata can be split into three logical blocks:
- Business metadata: This contains information about the meaning of data in the domain
of the business or science field.
- Technical metadata: This kind of metadata represents mainly the physical aspects of
the handled data.
- Process metadata: Metadata about the ETL-process, like statistics on loading time,
failures and successes in row loading.
Front room metadata may consist of security and access information, labelling specifications
or user-specific settings.
2.3.3 Keeping up-to-date
Because a data warehouse does not directly contain information produced by experiments or other data-generating methods, but instead needs to be fed by the ETL process, which imports external data from the data sources, keeping a data warehouse up-to-date is a non-trivial task. Bouzeghoub and Peralta define data freshness as one of the key features of a data warehousing system[41], citing a survey performed by Shin in 2003[71]. They further split data freshness into currency[77] and timeliness[78], where currency is a measure for the time needed to extract information from a source and present it to a user, while timeliness expresses the rate at which a data set changes, either by adding new data or by updating existing entries.
The update policy of a data warehouse needs proper design decisions, since decisions based on the data stored in the repository need to be correct. As mentioned above, a data warehouse contains non-volatile data, but what happens if erroneous data has been integrated into the data warehouse? As Kimball argues, a data warehouse should represent a business, not the system the data originates from, so he states that invalid data has to be corrected by either negating or updating the wrong fact, or by deleting it and reloading the correct information instead[67].
In order to keep a database up-to-date, it must be ensured that the information contained in the repository is renewed on a regular basis. This can be done either by completely re-importing the data source as one big entity, which may cause major problems since the amount of data to be transferred may be gigantic, or, more cleverly, by updating only the portions that have changed since the last update. The latter approach is referred to as a delta update, but many of the public biological databases do not offer this feature yet[30]. The following table lists some data sources and their update routines as used in the Atlas project[79], including information about the frequency of the updates.
Data Source Update Frequency Update Type
GenBank Sequence Daily delta
GenBank Sequence Release full
GenBank Refseq Daily delta
GenBank Refseq Release full
NCBI Taxonomy Release full
HomoloGene Daily full
OMIM Daily full
Gene Daily full
LocusLink Daily full
UniProt Bi-weekly full
HPRD Release full
MINT Release full
DIP Release full
BIND Release full
GO Release full
Table 1 - The data sources as used by the Atlas project and their update properties as shown in Shah et al[79]. The middle column shows how often the data source is updated in the Atlas data warehouse, while the right column shows whether the complete data source is re-imported or just the changes to the currently stored version.
2.4 Semantic Integration
As stated above, data integration contains the challenging task of integrating several data sources that do not necessarily share the same semantic space, which means that they do not have mutual definitions of terms and concepts. This makes combining data coming from the different sources quite tricky. This diversity in representation and meaning makes semantic integration one of the most challenging tasks in integrating biomedical information[80].
Sharing the same idea of terms and concepts starts with terms as simple as "body weight". Imagine two physicians classifying patients into the three groups "normal body weight", "above normal body weight" and "below normal body weight". In some cases patients would end up in different classes, as the subjective interpretation of the term "body weight" might differ between the two physicians. A case like this could easily be solved by using absolute measurements of the body weight in kilograms1 or aggregated data like the body mass index. But what if even more complex concepts need to be addressed? Two different data sources containing the word "COLD" could easily lead to major problems, as one data source may use COLD as an abbreviation for "chronic obstructive lung disorder", whereas the other uses the term cold to express a temperature[2]. In other cases two different words might be used to address the same concept (e.g. "high blood pressure" and "hypertension"). In order to avoid these problems a domain needs to be semantically defined. These problems are often referred to as semantic heterogeneity[81], or as Rosenthal et al simply state: "For meaningful information exchange or integration, providers and consumers need compatible semantics between source and target systems"[82]. The term meaningful should be emphasized, since it describes why having a common understanding is necessary when integrating various data.
One of the solutions for the problems mentioned above is the usage of so-called ontologies[2], as an ontology contains a formal representation of all concepts used in a domain and describes the relationships between these concepts.
1 Having the absolute body weight in kilograms would surely not be enough, since the body height has to be taken into account in order to gain information about a person's obesity status
2.4.1 Ontologies
In order to cover a specific domain in terms of semantics, all concepts and their relationships need to be covered. An ontology does exactly this by defining all concepts a domain contains and additionally describing the relationships these concepts may have. Buccella et al[81] use a definition by Gruber, who, as they point out, introduced ontologies into computer science as an "explicit specification of a conceptualization" with his approach "Ontolingua"[83]. A more specific definition is found in Schulze-Kremer, who describes an ontology as a "Concise and unambiguous description of principle relevant entities with their potential, valid relations to each other"[84].
It is crucial to keep in mind that an ontology is not a model of an application domain, as it does not contain any hypotheses; neither can it be used as a database schema directly, since it does not contain any type information, but it can be used as a starting point when defining a new schema[85].
One advantage of an ontology-based integration approach is that an ontology provides a vocabulary that is normally stable enough to be used as a conceptual interface for a database schema, while at the same time not depending on the database schema itself[81]. A second advantage is that by using an ontology-based integration approach the target of being "meaningful", as stated above, is automatically reached, since an ontology explicitly aims at providing meaningful concepts and relationships.
One way of describing the format of an ontology is to refer to the concept and relationship blocks as triplets of the type concept-relationship-concept, or subject-predicate-object, called assertions[2] (a small code sketch of such assertions follows below). Basic relationships are is-a and part-of, but ontologies aiming at being usable should contain more sophisticated relations targeting temporal (transformation_of, derives_from) or spatial (located_in, contained_in) connections[86].
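The triple structure of such assertions can be expressed very compactly. The following sketch is illustrative only; the class and relation names are chosen for the example and do not come from any particular ontology.

import java.util.ArrayList;
import java.util.List;

// Illustrative subject-predicate-object assertions (class and relation names are examples only).
public class TripleSketch {

    static class Assertion {
        final String subject, predicate, object;
        Assertion(String s, String p, String o) { subject = s; predicate = p; object = o; }
        public String toString() { return subject + " --" + predicate + "--> " + object; }
    }

    public static void main(String[] args) {
        List<Assertion> assertions = new ArrayList<Assertion>();
        assertions.add(new Assertion("exon", "part_of", "gene"));                 // compositional
        assertions.add(new Assertion("metabolite", "is_a", "small molecule"));    // taxonomic
        assertions.add(new Assertion("mature_mRNA", "derives_from", "pre_mRNA")); // temporal
        for (Assertion a : assertions) {
            System.out.println(a);
        }
    }
}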
The Open Biomedical Ontologies (OBO) consortium is one instance trying to achieve a common standard in the design and usage of ontologies in the biomedical informatics field[87]. They provide a library containing ontologies for usage across the many different fields of life science, including the Gene Ontology[12] and other ontologies for cell and sequence information.
2.4.1.1 Creating an ontology
Sometimes existing ontologies cannot be used in a project because they do not support the specific needs of a certain domain or user group; in that case an ontology has to be newly built. Buccella and Cechich define three major stages when creating a new ontology in order to integrate various data sets[81]:
- The first step is building a shared vocabulary, which starts by analysing the given information sources in order to find the terms, or so-called primitives, which are used to build the new ontology. The information sources are checked on a global level, which means that a global view across all data sources is taken into account. Analysing the information implies checking how and where information is stored and what a stored data entry means (defining its semantics). When analysing this information it is crucial to keep the above-mentioned problems of semantic heterogeneity in mind, in order to obtain an unambiguous set of entities.
- In the second stage a local approach is applied on the data sources. This stage is
similar to the first one, but focuses solely on the source in isolation, not taking into
account any linkage to another source. Thus having defined the local terms, a local
ontology may be created.
- Having created a global ontology and the various local ontologies, the mapping between those two levels has to be established in the third stage. This may be a simple mapping of terms ("function of gene" to "gene function"), a mapping of types (dates to timestamps) or a more sophisticated use of formulas (mapping from degrees Fahrenheit to degrees Celsius); a small sketch of such mappings follows this list.
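A minimal sketch of this third stage might look as follows. The term pairs and the unit formula are taken from the examples above, while the class and method names are hypothetical and not part of any of the cited projects.

import java.util.HashMap;
import java.util.Map;

// Illustrative local-to-global mappings for the third stage (class and method names are hypothetical).
public class MappingSketch {

    // Simple mapping of wordings from a local ontology to the global one.
    static final Map<String, String> TERM_MAP = new HashMap<String, String>();
    static {
        TERM_MAP.put("function of gene", "gene function");
    }

    // Mapping of values by formula: degrees Fahrenheit to degrees Celsius.
    static double fahrenheitToCelsius(double f) {
        return (f - 32.0) * 5.0 / 9.0;
    }

    public static void main(String[] args) {
        System.out.println(TERM_MAP.get("function of gene")); // gene function
        System.out.println(fahrenheitToCelsius(98.6));        // 37.0
    }
}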
When creating a new vocabulary, Schulze-Kremer sees the basic challenge in having good definitions of concepts, as ambiguous or insufficiently detailed descriptions could easily lead to problems in later use[85]. In his paper several common problems are listed: definitions that only state what a concept is not, too broad or too narrow definitions, self-reflexive definitions, and verifying a scope rather than defining a concept. Schulze-Kremer further suggests documenting the design criteria and the formal notation of the ontology itself.
2.4.1.2 Web Ontology Language
The Web Ontology Language[88], often read in the abbreviated form OWL, is an ontology language provided by the World Wide Web Consortium[89] as a recommendation. OWL provides the technology to exchange information and its semantics via networks, as it was intentionally designed to be part of the semantic web[90], or, to be more precise, to support intelligent agents[91] that need to exchange information automatically.
It is possible to define classes and subclasses and then apply set operators like union or intersection, or to define properties and sub-properties. Objects can then be defined, classified and linked to properties containing the object's individual values. OWL distinguishes between two basic types of properties, which are both instances of built-in OWL classes (a small illustration follows the list):
- Object properties, that link an object to another object
- Datatype properties, that link an object to a data type
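As an illustration of these two property types, the following sketch builds a tiny OWL model with the Jena ontology API. Jena is an external library that is not used in this work, and the namespace as well as the class and property names are invented for the example only.

import com.hp.hpl.jena.ontology.DatatypeProperty;
import com.hp.hpl.jena.ontology.Individual;
import com.hp.hpl.jena.ontology.ObjectProperty;
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// Tiny OWL model illustrating object vs. datatype properties (Jena API; names are invented).
public class OwlSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/demo#";          // invented namespace
        OntModel model = ModelFactory.createOntologyModel();

        OntClass gene = model.createClass(ns + "Gene");
        OntClass protein = model.createClass(ns + "Protein");

        // Object property: links an object to another object.
        ObjectProperty encodes = model.createObjectProperty(ns + "encodes");

        // Datatype property: links an object to a data type (here a string value).
        DatatypeProperty symbol = model.createDatatypeProperty(ns + "hasSymbol");

        Individual klk3 = model.createIndividual(ns + "KLK3", gene);
        Individual psa = model.createIndividual(ns + "PSA", protein);
        klk3.addProperty(encodes, psa);
        klk3.addProperty(symbol, "KLK3");

        model.write(System.out, "RDF/XML");              // serialize the model
    }
}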
2.5 Current Projects
In the last years several research communities have implemented various techniques targeting the integration of data and information from different sources. Some companies have jumped on this train as well, and today the range of available solutions spans from free open source applications to commercial products. In the sector of biomedical research, countless projects have been established to make progress in integrating the huge amount of information gained in daily workbench research. High-throughput measurement technologies have caused the amount of harvested data in the omics fields to explode, while the ability to transform this huge amount of information into interpretable and valuable knowledge is lacking[36]. Or, as Ideker et al state, the vast amount of data gathered by new methods like microarrays might not be useful for research on a single cell, but by applying an integrated data approach on these data sets it might be possible to perform in silico biology that could later be verified by workbench research[92]. None of the projects presented in this section targets the integration of sources for one certain domain only; instead they provide a framework to integrate various public or private data sources. There exist numerous other projects that have been designed to integrate information in order to answer specific questions, by integrating information coming from a regional base, like the biobank presented by Muilu et al[93] or Columba[33].
The following section presents a brief overview of existing projects, not aiming at completeness, but rather at giving examples for the different integration approaches described above. These range from projects supporting a link-based navigation through given data sources to projects converting data from different sources to fit a new schematic representation in order to be stored in a data warehouse repository. It is important to note that most of the projects aiming at integrating data from biomedical sources cannot be unambiguously classified as belonging to exactly one integration approach, but rather mix the techniques in order to get the best result.
The Sequence Retrieval System, as introduced by Etzold in 1996[15], will be presented, as it is one of the most popular query interfaces and data integration projects in the biological research fields, followed by IBM's DiscoveryLink[94]. A project by Leser and Naumann named "ALADIN" will be described briefly, as this is one of their current projects aiming at an automated data integration architecture[29]. After ALADIN, two more projects targeting a data warehousing integration approach will be shown with Atlas[79] and BioWarehouse[30].
2.5.1 SRS
The Sequence Retrieval System SRS[15] is one of the most widely distributed querying tools for biological sources, as it provides an easy-to-use graphical user interface that enables a user to access a wide range of biological database and flat file resources[51]. Almost every data source can be integrated into SRS, as it uses text file representations of data sources to access the information. In order to make a data source accessible, its meta information has to be declared in the Icarus scripting language that is part of SRS. This declaration contains information about the data object as well as how to parse it.
It is possible to establish bidirectional linkages between data sources, which can be weighted or even combined with logical operators (AND, OR and NOT). Having a high rate of cross references in a set of data sources, this set might be considered a kind of domain knowledge base[95]. The two main strengths of SRS are surely the way new data sources can be added, by generating a flat file and describing its content using Icarus, and the simplified generation of queries.
2.5.2 DiscoveryLink
DiscoveryLink[94] is an IBM product based on the fusion of DataJoiner[96] and Garlic[97], both of which were developed by IBM as well. The components of DataJoiner provide query optimization, a complete query execution module and the technology for federating the different data sources, whereas the Garlic component enables the integration of new data sources.
As wrapping data from the data sources is one of the main concepts of DiscoveryLink, the creators tried to implement a wrapping technology that allows a maximum number of integrable data sources with a minimum of effort for implementing the actual wrapper. These wrappers are implemented in C++ and usually support more than one data source, if those sources share the same API.
When a query is sent to the system, a query processor distributes the query to the several wrappers according to the information provided by the source descriptions. A global sum of execution costs is calculated afterwards, and according to this information an execution plan is created. After each wrapper has executed its tasks, a global result is aggregated[22].
2.5.3 BioMediator
The BioMediator[98] project is an approach using federated databases in order to integrate biomedical information; it supports features like querying for specific data instances or browsing through properties. It allows defining a user-specific mediated schema in order not to overload the user with a too broad view on information that is not needed to answer questions belonging to a certain scientific domain.
Figure 7 - The three main stages of the BioMediator project (as presented in Shaker et al[99]), showing the syntactic and semantic wrapping components as well as the query processing component that enables external access to the system.
The linked information is presented as a graph of nodes and edges, with the nodes representing the data source instances from the mediated schema and the edges being the relationships between those entities. This graph can be queried using the Path Querying Language PQL[100], a technology enabling the definition of queries and constraints between the federated databases. One of the main components is the source knowledge base (SKB), which stores information about how to wrap the information coming from a certain data source. In order to achieve the goal of providing a specific view for user-specific domains, an SKB has to be defined separately for each domain.
The wrapping itself is split into wrapping on a syntactic level first and, in a second step, converting the source's semantic information; the wrappers in the second step are referred to as metawrappers[101]. The syntactic transformation is called data acquisition and the semantic transformation data translation.
The authors themselves state that they intend to expand the integration platform they currently provide into a complete distributed network of peers[102].
2.5.4 ALADIN
The target of the ALADIN integration approach, introduced by Leser and Naumann[29], is to automate a large proportion of the integration process, which is already implied by the project's full name, as ALADIN is an acronym for "Almost Automatic Data Integration". This high degree of automatic integration is reached by using methods to automatically detect links between objects from various data sources. The information gained by the linkage detection is stored in a global, materialized repository, a data warehouse, and can be accessed via the classic methods of KDD.
In biological data sources there are two types of objects:
- Primary objects: objects that contain the most useful information; they mostly represent the basic concepts of a scientific field (genes, DNA sequences, …)
- Secondary objects: nested information containers linked to a primary object, referred to as "annotations". E.g. the sequence string of a protein or the functional annotations of a gene are considered secondary objects.
In most cases, linkage within a biological data source is only established between primary and secondary objects, whereas between sources there is often heavy linkage via the primary objects' identifiers. There might also be duplicates scattered among the different sources; these duplicates have to be discovered and flagged.
The automated detection of relationships between primary objects originating from different
sources is established using techniques from data integration, text mining, information
retrieval and data mining. “Guessing” these relationships is considered as a main feature of
ALADIN as it might help to find unseen connections, but it might as well produce false
negative (no link is detected) or false positive (a wrong link is discovered) results.
2.5.5 Atlas
Atlas is a data warehousing project from the University of British Columbia that aims at integrating information from several biological sources[79]. The data sources it can integrate may be categorized into four groups, namely sequences, molecular interactions, gene-related resources and ontologies, which are stored in a MySQL database. The developers designed their database schema according to the above-mentioned categories of importable data.
As in BioWarehouse and other data warehousing solutions, loaders play a major role in integrating data from the external sources. For Atlas, loaders are implemented for sequences and molecular interactions, whereas the other information is simply imported using database dumps provided by the originating sources.
A user may access the data stored in Atlas directly via SQL, using the API provided by the Atlas framework, or with some end-user applications.
2.5.6 BioWarehouse
Lee et al introduced a bioinformatics data warehouse approach in 2006, called
BioWarehouse[30]. BioWarehouse is an open source toolkit for constructing a data
warehouse integrating several biological data sources, published under the Mozilla Public
License[103], and is currently available for Oracle and MySQL.
Figure 8 - The main data types used in BioWarehouse and their possible connections, as shown in Lee et
al[30] A complete ER-diagram may be accessed at
http://www.biomedcentral.com/content/supplementary/1471-2105-7-170-S1.jpeg.
One of the key features is the usage of a so called warehouse identifier WID, used to uniquely
identify a data object stored in the warehouse. As WIDs are given to any type of concepts,
e.g. genes, proteins or reactions, a linkage between a gene object and a reaction object is
possible.
The main data types used in the BioWarehouse toolkit are: Taxon, BioSource, NucleicAcid,
Subsequence, Gene, Protein, Feature, Reaction, Chemical and Pathway. Each of these basic
types contains information typical to this type, but as well metadata like change history and
source of origin.
In order to add information to the warehouse, tools named "loaders" are necessary, as they are in charge of loading and transforming the data coming from a data source to fit the BioWarehouse schema of the corresponding main type. Duplicates, which may show up when importing from several sources, are not merged by the ETL system; instead, if two objects referring to the same concept are imported from the sources, two data objects are stored in the warehouse. BioWarehouse also allows storing information about literature references linked to a data set, representing a biological source, within the repository.
2.5.7 Comparison
All of the above-mentioned projects have their strengths and weaknesses, and each one of them has proved to be useful in certain biomedical enquiries. The table below lists all of the above-mentioned projects, providing information about the homepage and their integration approach according to Hernandez and Kambhampati. It is not always possible to clearly define whether a project uses a purely navigational or a mediator-based approach, so the classification was based on how the most characterising features can be classified.
Name               Homepage                                                                                           Integration Approach
SRS[15]            http://srs.ebi.ac.uk                                                                               Navigational
DiscoveryLink[94]  http://www-304.ibm.com/jct09002c/us/en/university/scholars/products/lifesciences/discoverylink/   Mediator
BioMediator[98]    http://www.biomediator.org                                                                         Mediator
ALADIN[29]         http://www.informatik.hu-berlin.de/forschung/gebiete/wbi/research/projects/aladin/                 Data Warehouse
BioWarehouse[30]   http://biowarehouse.ai.sri.com/                                                                    Data Warehouse
Atlas[79]          http://bioinformatics.ubc.ca/atlas/                                                                Data Warehouse
Table 2 - An overview of some data integration projects, described above, showing the project name and the reference where it was first published, as well as the project's homepage and the type of integration approach (Navigational, Mediator or Data Warehouse)
3 RESULTS
This section provides an overview of the various findings and implementations that were
made for the IMGuS project. The project setting, the infrastructure as well as some details on
the newly created ETL-framework will be part of the following subsections. Source code will
only be provided if it is necessary to present insights on development details.
3.1 Project IMGuS settings
The IMGuS project was set up in order to identify molecular signatures that might help to stratify patients who are susceptible to curative treatment of prostate cancer. For this purpose several –omics techniques are combined and their knowledge is integrated. This data integration is done using a data warehouse approach, called life science integrative data warehouse (LINDA)[104]. The integrated data is later used for systems biology approaches and methods.
Several academic and commercial partners are working together in the IMGuS project, each providing their specific expertise:
1. Department of Urology – Innsbruck Medical University
(biobank, probes and phenomic data)
2. Biocrates life sciences GmbH
(metabolomics)
3. Institute of Analytical Chemistry and Radiochemistry, University of Innsbruck
(proteomics)
4. German Cancer Research Centre Heidelberg
(genomics)
5. Max Planck Institute for Molecular Genomics Berlin
(modelling)
6. University for Health Sciences Medical Informatics and Technology - UMIT
(IT infrastructure and data warehousing)
The following figure demonstrates how the project participants interact and what data sets or services are provided. The Department of Urology takes samples of the patients and provides additional phenomic information, like anamnesis or medical therapies. These samples are later processed with metabolomic, proteomic and genomic analytical approaches. Using an Electronic Data Capture system, the results of the various techniques are imported into the data warehouse provided by UMIT. Using systems biology methods, the existing data is then processed and queried in order to gain the ability to build models.
Figure 9 - The IMGuS project participants provide individual services and data sets that are captured and integrated in order to enable a systems biology approach. Graphic taken from IMGuS presentation at DILS 2007[72].
In order to enable communication and data access, a web platform was implemented, allowing all project partners to integrate and access the project data. It provides Electronic Data Capture (EDC) conforming to Good Clinical Practice (GCP) in order to upload the data gained by IMGuS-related studies. The data is then, in a later process, integrated into LINDA by using ETL tools like Talend Open Solutions[105] or native scripts.
Figure 10 - The upper figure was taken from[106] and shows a screenshot of the ad-hoc query builder
tool, which is part of the IMGuS project.
To access the integrated data, an ad-hoc query builder tool was implemented, allowing easy-to-use query generation and execution[106]. This allows users to query the database without having profound knowledge of how to compose SQL statements. The basic concept of the ad-hoc query builder is to use metadata instead of technical details that might not be understandable for a user. The result sets gained by using the ad-hoc query builder can then be used to perform data mining, statistical analysis or other KDD methods.
3.2 LINDA Repository
As mentioned above, the repository is the entity storing the information put into the data warehouse. For the IMGuS project, a database using PostgreSQL as database management system was used. In the first phase the existing repository for LINDA was redesigned in order to better fit the users' needs. The existing tables were reconsidered and re-created into a more linkage-oriented schema.
Figure 11 - The LINDA repository was redesigned in order to better fit the user needs and provide more flexibility in storing and importing data sets. Relations starting with "g_" are used to store genomic data, whereas relations having the prefix "_m" contain data coming from metabolomic approaches. By restructuring the single data records into a more fragmented form, the creation of more specialized and individual data marts was enabled.
The repository may be split into two main parts. One part contains mostly administrative information, like information about the biological source a certain sample comes from (tissue, serum, …) or information on the data source the information is stored in. The other part contains the pure measurement information.
This distinction and separation allows the repository to be easily extended in order to store data from other biomedical sources in the future.
3.2.1 Design of the imgus-etl-framework for LINDA
The existing infrastructure of the IMGuS project led to the decision to recreate the ETL-system that loads the data originating from the various sources into the data warehouse. As users were not able to upload the data on their own, the decision was made to create a framework that could later be adapted for use via a simple graphical user interface, in order to enable a more effective use of human resources. Up to this time, users had to take their data files to a member of the backend development team, who would pre-process these files manually or by using an awk[107] script, if a pre-processing step was necessary. Later the provided data file would be imported using a Talend[105] file executing a PERL[108] script. This caused delays in getting the data into the warehouse and bound the developer to a process step that could easily be automated. Re-implementing these components also had the advantage of being able to recreate the single scripts in Java, where possible, which eased the creation of a framework.
This section describes what the new ETL-framework looks like, how the newly constructed components are structured and how they interact. A description of the implementation process will be part of the next section.
3.2.2 Infrastructure
The infrastructure used for the IMGuS ETL process was divided into three sections. Each of these sections may be used by different user groups. These groups may be divided into three categories: back room developers (BRD), front room developers (FRD) and clinical scientists or biologists (CSB). BRDs are involved in providing the components needed for data import, transformation and storage, whereas FRDs develop KDD algorithms used to extract information from the repository. CSBs browse through the data sets and use the applications provided by the FRDs in order to gain new knowledge from the existing data. They might be regarded as users or consumers, as they do not directly interact with the data sets, but rather get a view on a composition and aggregation provided by the front room tool set.
This distinction also had an influence on the infrastructure, as each of these stages has a completely different degree of stability and constancy. While on the BRD level schematic changes and improvements may occur on a regular basis in order to evolve, those changes need to be tested and verified before being deployed to the FRD development stage and the CSB productive stage.
Performance is another issue, since data import processes might consume much of the bandwidth a network connection provides, and cleaning and transformation steps need CPU time if big data sets have to be processed. Considering these circumstances, the decision was made to provide three databases, one for each of the three usage levels. In order to get a real benefit out of this, deployment criteria needed to be defined.
The basic stage deploys a stable schema for storing the data, as well as the tools and applications needed to import this data from the various sources. The second stage allows developing and deploying stable KDD algorithms and tools that have been created and tested on the basic resources provided by the first layer. In the third stage, clinical scientists and biologists use the provided applications to query the integrated data in order to verify existing theories or search for new findings. This new knowledge may then later be made available to the scientific community.
The following figure shows the three stages and the deployment steps between them.
Figure 12 - Three stages of deployment could be identified: back room development, front room development and clinical science or biology. Each of these stages has its own development areas, but depends on the underlying stage to provide stable functions and correctly processed data.
3.2.3 Architecture
The new ETL framework can be regarded as consisting of two main features. One is the conversion of given file or database formats in order to fit the structure needed by the subsequent loading processes; the other is the loading itself, which stores the data in the repository.
Figure 13 - The figure above shows the main components of the new ETL-framework. The conversion and loading components are the main features the framework provides. As can easily be seen, the loading component contains Java source provided by Talend Open Studio. The repository is not part of the framework and has just been added to the graphic for better understanding.
The right part of the figure above shows the ETL-component embedded in the system environment, excluding any parts of the front room. As can be seen, data coming from external sources may be accessed in a file representation, including simple flat files and more complex XML files, or as a database, which in most cases means accessing and reading a database dump.
The loading component makes use of Java code and archives created with the Talend Open Studio (TOS) tool, which is used in order to provide a simple mechanism for importing data. These files can later be re-used in the ETL component in order to read, modify and store the data. As data transformation consumes about 70-80% of the time needed to build a data warehouse[11], the conversion and transformation steps, including the Java classes created by TOS, were the first software components to be designed and planned, in order to enable early user interaction while establishing the warehouse.
3.2.4 Interfaces
In order to make the framework easy to extend, a set of interfaces was designed, allowing custom components for import and conversion purposes to be added easily. This subsection shortly describes the basic concepts of these interfaces, their interaction and usage.
Figure 14 - Several interfaces have been designed in order to enable an easy-to-use and extensible framework for the ETL process. The figure above shows the basic interfaces and their sample implementations for a metabolomic domain.
Three main interfaces can be identified, namely IConversion, IImport and IImgus. IConversion is the main interface for all classes implementing a data transformation algorithm, whereas IImport is the parent interface of all classes providing a data-set-to-database import function. IImgus is used as the basic interface for distinguishing the various biomedical domains (metabolomics, genomics, phenomics, …) that are part of the IMGuS data warehouse.
The interface IConversion provides the method:
public void doConversion(InputStream is, OutputStream os, IConversionType ict);
This method allows the simple conversion of a java.io.InputStream into a java.io.OutputStream, whereby the kind of conversion to be applied to the input is determined by an implementation of the IConversionType interface.
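A minimal implementation of this interface might look like the following sketch. Only the doConversion signature is taken from the framework; the concrete conversion logic (upper-casing each line) and the class name are invented for illustration and do not correspond to an actual framework component.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

// Illustrative IConversion implementation; the conversion logic is invented.
public class UpperCaseConversion implements IConversion {

    public void doConversion(InputStream is, OutputStream os, IConversionType ict) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(is));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(os))) {
            String line;
            while ((line = in.readLine()) != null) {
                // A real implementation would dispatch on the given IConversionType;
                // here every line is simply written back in upper case.
                out.println(line.toUpperCase());
            }
        } catch (IOException e) {
            throw new RuntimeException("conversion failed", e);
        }
    }
}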
In order to provide a standard implementation of this, some more classes were then introduced:
Figure 15 - In order to enable a standardized simplified use of the interfaces used for conversion two more
classes are provided. ConversionImpl, which is a standard class for data conversion, and ConversionFile,
which allows simple file conversion by providing the convert(File in, File out, IConversionType ict)
method with two file parameters.
These two additional classes allow a standardized usage of the conversion methods, reducing the need to implement a separate conversion implementation for every data type. A user can simply make use of an input file by writing his individual conversion type and providing this class with these two parameters, as well as his desired output file.
A similar approach was taken for the import tasks, where, starting from the main interface IImport, several other interfaces and classes were created in order to reach a high level of possible reuse.
Figure 16 - For the import tasks several interfaces and classes were designed and implemented. The figure above shows one example of the use of reusable classes for importing from CSV files.
3.2.5 Import and Mapping components
As the data provided by the various sources exhibited every aspect of the data heterogeneity mentioned above, the import and mapping components turned out to be a non-trivial task. In order to enable the import of these data, which vary in representational aspects and content, a solution using Talend Open Studio jobs and Java classes was created.
Figure 17 - The import and mapping components can be divided into three main layers (reading, mapping, writing); the layers of data providing and storing were added to the figure above for readability.
As shown in the figure above, three main layers were identified (a small sketch of the mapping layer follows this list):
- Reading: Here the data files are read in, and their underlying schema (normally provided as the heading line of a CSV file) is used to create an input stream. As files within the same biomedical domain may differ in their representation, it was necessary to create a reading component for every single file type that should later be loaded into the data warehouse.
- Mapping: As the above-mentioned file schemas would only fit the corresponding database schema in a minority of cases, the mapping layer was introduced in order to make the incoming data file stream compatible with the database table schema it would later be stored in.
- Writing: After the data stream coming from the input file has been mapped to the corresponding database table schema, it can be written directly to the repository, using the component fitting the target database schema.
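A possible shape of the mapping layer is sketched below. The CSV headers, target column names and class names are invented examples and do not reflect the actual LINDA schema.

import java.util.HashMap;
import java.util.Map;

// Illustrative header-to-column mapping for the mapping layer (all names are invented).
public class MappingLayerSketch {

    // Maps a CSV header as delivered by a partner to the repository column it feeds.
    static final Map<String, String> HEADER_TO_COLUMN = new HashMap<String, String>();
    static {
        HEADER_TO_COLUMN.put("Sample ID", "sample_id");
        HEADER_TO_COLUMN.put("Material", "biological_source");
        HEADER_TO_COLUMN.put("Measurement", "measurement_value");
    }

    // Translates one parsed CSV row (header -> value) into a row keyed by database columns.
    static Map<String, String> mapRow(Map<String, String> csvRow) {
        Map<String, String> dbRow = new HashMap<String, String>();
        for (Map.Entry<String, String> cell : csvRow.entrySet()) {
            String column = HEADER_TO_COLUMN.get(cell.getKey());
            if (column != null) {               // headers without a mapping are dropped
                dbRow.put(column, cell.getValue());
            }
        }
        return dbRow;
    }
}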
3.3 Implementation of the imgus-etl-framework for LINDA
The following section describes some issues that came up during the implementation phase of the imgus-etl-framework. In this phase of the project not all information from the several biomedical domains could be integrated completely. But to prove the correct functioning of the several components, the decision was made that it would be sufficient to fully implement the complete functionality for a subset of these domains.
One of the first targets to be achieved was to enable users to import files on their own, since up to this time every file import had to be done by one of the data warehouse developers, which would cost enormous resources in terms of time and money. So the decision was taken to enable a user to store his data into the repository using various tools. One task that is described here separately from the file import was to convert the given files into a format which could later be read and parsed by the import components.
3.3.1 File Conversion
Some of the files provided by the project partners were given in a format that could not directly be read by the importing components. In order to fix this issue, several conversion components had to be implemented.
3.3.1.1 Horizontal into Vertical Representation
One of the major disadvantages was the fact that most of the data provided by external sources was distributed in a horizontal way (figure below). This form of representation caused several problems with the usage of the ETL tool, so a new vertical representation had to be introduced.
Figure 18 - Some data was presented in a horizontal way, meaning that after some columns defining the data set, several features were represented. The sample above shows a sample data set from a metabolomic approach. "NA" stands for "Not Available", which means that the value for this certain measurement could not be retrieved or stored in the data file.
In order to get a vertical representation of the data (figure below), a small application was written in Java, including an algorithm that transforms the data into the desired form of representation.
Figure 19 - After converting the horizontal data set, two new fields can be seen, as they represent the newly established column headers.
In order to enable efficient pre-processing, several data cleansing operations, like replacing "NA" or "null" values, were included in this entity. So a user may already start to clean the provided data in this early step; a minimal sketch of such a conversion follows below.
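The following sketch shows the core idea of such a horizontal-to-vertical conversion, including the removal of "NA" and "null" values. Identifier handling, the example data and the class name are simplified assumptions, so the actual framework component may differ.

import java.util.ArrayList;
import java.util.List;

// Illustrative horizontal-to-vertical (wide-to-long) conversion with "NA" cleansing.
public class WideToLongSketch {

    // header: identifying columns first, then one column per measured feature
    // rows:   one line per sample
    static List<String[]> toVertical(String[] header, List<String[]> rows, int idColumns) {
        List<String[]> vertical = new ArrayList<String[]>();
        for (String[] row : rows) {
            for (int col = idColumns; col < header.length; col++) {
                if ("NA".equals(row[col]) || "null".equals(row[col])) {
                    continue;                  // data cleansing: skip missing values
                }
                // output: sample id, feature name (former column header), measurement
                vertical.add(new String[] { row[0], header[col], row[col] });
            }
        }
        return vertical;
    }

    public static void main(String[] args) {
        String[] header = { "SampleID", "Group", "Alanine", "Glycine" };
        List<String[]> rows = new ArrayList<String[]>();
        rows.add(new String[] { "S01", "disease", "42.1", "NA" });

        for (String[] v : toVertical(header, rows, 2)) {
            System.out.println(v[0] + "\t" + v[1] + "\t" + v[2]);  // S01  Alanine  42.1
        }
    }
}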
3.3.2 File Import
In order to enable an easy usage of file import, the imgus-etl-framework provides import
components that can be used to import the data sets into the repository.
3.3.2.1 Talend Open Studio Import Classes
The import of the data provided in CSV format was done using Java classes designed and provided by Talend Open Studio (TOS)[105].
TOS allows an easy-to-use graphical arrangement of components typically used for data import and integration tasks, put together in entities called jobs. In an earlier stage several jobs already existed as PERL code, but as TOS allowed the creation of Java code, introduced in Talend Open Studio v2.0, the decision was made to redesign all existing PERL jobs using a Java code output.
Figure 20 - Talend Open Studio allows the simple creation of data transformation jobs. The figure above shows a sample job for the import of metabolomic data, provided as a CSV file, into a PostgreSQL database. This workflow includes reading the CSV file, mapping to the SQL schema and a logging process that allows inspecting the progress of the process.
3.3.3 Ontologies
In order to enable correct semantic connectivity between the concepts stored in the repository, an ontology section was inserted. It was almost completely taken from the BioSQL project[109], as this seemed to fit this project best.
Figure 21 - The ontology section for the IMGuS project was taken mostly from the BioSQL[109] project. It allows adding information on terms and relationships from other ontologies, as well as creating custom, question-specific entries.
The table term contains objects or concepts like "gene", "exon", "metabolite" and "measurement", as well as the wordings for relationships like "is part of", "originates from" and "is composed of". As synonyms are common in biomedical research domains, an extra table containing these synonyms, called term_synonym, was added. As terms in most cases originate from existing ontologies, the table ontology provides the possibility to store information on those external sources.
The relationships themselves are composed by combining three of the terms in the form of:
Subject . Predicate . Object
, where ‘.’ represents the operator for concatenation.
Each of these three components is part of the term relation and stored in a table called relationship. This table may also reference the ontology relation, as some of the used relationships may be taken from external ontologies.
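Populating these tables could, for example, be done with plain JDBC against the PostgreSQL repository. The column names used below (term_id, name, subject_id, predicate_id, object_id) are assumptions made for this example and may deviate from the actual BioSQL-derived schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustrative JDBC inserts into the term and relationship tables.
// Column names and connection details are assumptions for this example.
public class OntologyLoaderSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/linda", "etl", "secret")) {

            // Insert the three terms of one assertion: "exon", "is part of", "gene".
            try (PreparedStatement term = con.prepareStatement(
                    "INSERT INTO term (term_id, name) VALUES (?, ?)")) {
                String[][] terms = { {"1", "exon"}, {"2", "is part of"}, {"3", "gene"} };
                for (String[] t : terms) {
                    term.setInt(1, Integer.parseInt(t[0]));
                    term.setString(2, t[1]);
                    term.executeUpdate();
                }
            }

            // Store the relationship as subject.predicate.object references to the term table.
            try (PreparedStatement rel = con.prepareStatement(
                    "INSERT INTO relationship (subject_id, predicate_id, object_id) "
                  + "VALUES (?, ?, ?)")) {
                rel.setInt(1, 1);   // exon
                rel.setInt(2, 2);   // is part of
                rel.setInt(3, 3);   // gene
                rel.executeUpdate();
            }
        }
    }
}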
3.3.4 Conversion and Import
The new imgus-etl-framework was developed and tested with metabolomic data provided by Biocrates life science AG. This data contained mass spectrometric measurements for prostate cancer, including samples from tissue and serum, as part of the IMGuS project.
The functionality of the horizontal-to-vertical transformation classes could be verified on two metabolomic data sets, "ProstateSerumData" and "ProstateTissueSamples", by diffing the output against the output of the awk script used before the Java implementation. The diff showed no differences between the two files.
ProstateSerumData contained 115 columns, of which 3 were used to identify the data entry and 112 contained metabolomic information, and 319 data entries, which means that 35840 entry pairs (metabolite + measurement) had to be processed. ProstateTissueSamples contained 352 columns, 4 of them identifying, with 36 data entry lines, summing up to 12876 pairs. To process ProstateSerumData the imgus-etl-framework needed 156-297 milliseconds; to process the ProstateTissueSamples file it took 62-156 milliseconds.
data file              main headers  data headers  sum headers  data lines  ex time (min)  ex time (max)
ProstateSerumData      3             112           115          319         156 ms         297 ms
ProstateTissueSamples  4             348           352          36          62 ms          156 ms
Table 3 - In order to test the horizontal/vertical conversion component, a Java program was written that performed the conversion of each file 100 times; the maximum and minimum execution times of these runs were taken (ex time (min/max)).
4 DISCUSSION
This master thesis aimed at providing a back room environment for the IMGuS project. In order to achieve this target, the existing infrastructure was evaluated, redesigned and re-implemented where necessary. All required features and components were at least planned and designed, and for all components a proof of concept was provided by implementing them for the metabolomic domain.
4.1 Data integration
For the integration of data that is distributed in dimensions like time, space and representation, it could be shown that using a data warehouse approach can help in bringing together the information and knowledge provided by the diverse –omics techniques. For the IMGuS project, an integrated approach was designed and implemented using LINDA in combination with the EDC, the imgus-etl-framework and the query builder. By using the framework, the rate of performed updates should increase, thereby increasing the rate at which new knowledge is found, as the time between data production, data integration and data mining should decrease.
4.2 Switch to Java
Before the imgus-etl-framework was implemented, the existing ETL system was a mixture of PERL and awk scripts, which forced a developer to understand both of these languages, but had several pros, especially in terms of flexibility and speed. By making the decision to switch to Java, the creation of a "one application for all tasks" architecture was eased. The painful task of transforming the existing scripting tools into Java code fitting the imgus-etl-framework specification should be well worth the effort in terms of maintainability and extensibility.
4.3 Interface usage
The interfaces IConversion and IImport, in combination with the IConversionType interface, should provide all the stability needed to enable a high degree of reusability within the currently used -omics fields, but should prove flexible enough to be used for dealing with additional data sets, like phenomic data, that should be added to the IMGuS project in one of the next project steps.
4.4 Repository
By redesigning the LINDA repository the extensibility was improved; data coming from new biomedical data sources can now be easily added to the existing schema. The schema itself might have become a bit more complex to understand for the front room developers, but this drawback can be compensated by expanding the usage of data marts. Those data marts can easily be created from the existing repository. The usage of data marts can have advantages both in the stage of KDD development and in the stage of knowledge gain itself. KDD development could be made more efficient by using subsets of the complete data warehouse stored in a data mart, whereas a data mart could as well be used to filter and aggregate certain domains in the clinical and biological research stage.
4.5 imgus-etl-framework
It could be shown that by using a set of interfaces and combining them with intelligent technologies, which might even be provided by external tools like Talend Open Studio, an efficient, easy-to-extend framework can be created. By distinguishing between conversion and import tasks, the degree of reusability could be extended even further, allowing faster adaptation to new data sources.
Migrating the complete framework to Java allows building an all-in-one application that can easily be put into a graphical user environment in one of the next project steps. It also allows using existing biomedical applications, like BioJava[110], at a very early step.
Figure 22 - The use of data marts could help
in the front room development stage and the
clinical and biological research stage.
4.6 Deployment Stages
By introducing the three-stage deployment schema presented above, a stable environment should be enabled, as every person in research and development can rely on the stability of the tools and resources provided by the stage below. By separating back room development from front room development, the development of new methods and tools is completely independent of activities that happen in other stages. This should make the complete system more stable and allow a guided migration workflow if changes need to be propagated. Future developments will show if this separation proves its usability.
4.7 Future Work
As the IMGuS project evolves steadily, it will be necessary to adapt the imgus-etl-framework and the LINDA repository to these changes according to the new requirements. It will also be one of the next targets to add information from biological databases like KEGG[3] in order to enable the biomedical research staff to directly put their findings into the context of the existing state of the art.
An integration of literature databases should also be part of further project proceedings. It is planned to integrate information from bibliographic data sources, as PubMed alone offers more than 16.5 million medicine- and biology-related citations, originating from more than 19,000 life science journals[111]. These entries feature abstracts in most cases, and some even include links to the full text articles. Linking scientific information from this source could improve the process of knowledge discovery, since the researcher could get access to this information on the fly while performing queries and interpreting the results. A possible solution for integrating this source could be to import PubMed entries into the data warehouse, or to simply store the linkage information and provide the user with it.
In order to enable a higher degree of user interaction, the ETL framework should later be put into a graphical user interface, which should allow composing complete conversion and import processes via a simple drag-and-drop interface. This would allow users to create their own data conversion and import processes.
As several IMGuS project partners provide phenomic data, the extension of the back room in order to fit this data will be one of the next project steps, as integrating information about anamnesis, the medical history and certain medical procedures could lead to very interesting results. The existing repository will allow adapting to these changes without touching any of the existing relations. The import process can be extended and realized by using the interfaces IConversion and IImport presented above.
LIST OF FIGURES
Figure 1 - The steps performed in KDD as defined by Fayyad et al[25]
Figure 6 - A data warehouse may be considered as consisting of a back room component and a front room component. While the back room is responsible for data integration and storage, the front room has to enable access to the data. Graphic taken from IMGuS presentation at DILS 2007[72].
Figure 7 - The three main stages of the BioMediator project (as presented in Shaker et al[99])
Figure 8 - The main data types used in BioWarehouse and their possible connections, as shown in Lee et al[30]. A complete ER-diagram may be accessed at http://www.biomedcentral.com/content/supplementary/1471-2105-7-170-S1.jpeg.
Figure 9 - The IMGuS project participants provide individual services and data sets that are captured and integrated in order to enable a systems biology approach. Graphic taken from IMGuS presentation at DILS 2007[72].
Figure 10 - Screenshot of the ad-hoc query builder tool, which is part of the IMGuS project; taken from[106].
Figure 11 - The LINDA repository was redesigned in order to better fit the user needs and provide more flexibility in storing and importing data sets. Relations starting with "g_" are used to store genomic data, whereas relations having the prefix "_m" contain data coming from metabolomic approaches. By restructuring the single data records into a more fragmented form, the creation of more specialized and individual data marts was enabled.
Figure 12 - Three stages of deployment could be identified: back room development, front room development and clinical science or biology. Each of these stages has its own development areas, but depends on the underlying stage to provide stable functions and correctly processed data.
Figure 13 - The main components of the new ETL-framework. The conversion and loading components are the main features the framework provides. The loading component contains Java source provided by Talend Open Studio. The repository is not part of the framework and has just been added to the graphic for better understanding.
Figure 14 - Several interfaces have been designed in order to enable an easy-to-use and extensible framework for the ETL process. The figure shows the basic interfaces and their sample implementations for a metabolomic domain.
Figure 15 - In order to enable a standardized, simplified use of the interfaces used for conversion, two more classes are provided: ConversionImpl, which is a standard class for data conversion, and ConversionFile, which allows simple file conversion by providing the convert(File in, File out, IConversionType ict) method with two file parameters.
Figure 17 - The import and mapping components can be divided into three main layers (reading, mapping, writing); the layers of data providing and storing were added to the figure for readability.
Figure 18 - Some data was presented in a horizontal way, meaning that after some columns defining the data set, several features were represented. The sample shows a sample data set from a metabolomic approach. "NA" stands for "Not Available", which means that the value for this certain measurement could not be retrieved or stored in the data file.
Figure 19 - After converting the horizontal data set, two new fields can be seen, as they represent the newly established column headers.
Figure 21 - The ontology section for the IMGuS project was taken mostly from the BioSQL[109] project.
LIST OF TABLES
Table 1 - The data sources used by the Atlas project and their update properties, as shown in Shah et al[79]. The column in the middle shows how often the data source is updated in the Atlas data warehouse, while the column on the right shows whether the complete data source is re-imported or just the changes to the currently stored version. ........................ 29
Table 2 - An overview of some of the data integration projects described above, showing the project name and the reference in which it was first published, as well as the project's homepage and the type of integration approach (Navigational, Mediator or Data Warehouse). ...... 40
Table 3 - In order to test the horizontal/vertical conversion component, a Java program was written that performed the conversion of the file 100 times; the maximum and minimum execution times of these runs were recorded (ex time (min/max)). ............................. 55
BIBLIOGRAPHY
[1] Liu, C. L., Prapong, W., Natkunam, Y., Alizadeh, A., Montgomery, K., Gilks, C. B. & van de Rijn, M. (2002). Software Tools for High-Throughput Analysis and Archiving of Immunohistochemistry Staining Data Obtained with Tissue Microarrays. Am J Pathol, 161(5), 1557-1565.
[2] Gardner, S. P. (2005). Ontologies and semantic data integration. Drug Discovery Today, 10(14), 1001-1007.
[3] Kanehisa, M. & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucl. Acids Res., 28(1), 27-30.
[4] Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., Fitzhugh, W., Funke, R. & Morgan, M. J. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921.
[5] Quackenbush, J. (2007). Extracting biology from high-dimensional biological data. J Exp
Biol, 210(Pt 9), 1507-1517.
[6] Beek, J. H. G. M. v. (2004). Data integration and analysis for medical systems biology: Conference Reviews. Comp. Funct. Genomics, 5(2), 201-204.
[7] DOE Department of Energy (1993). Report of the Invitational DOE Workshop on Genome
Informatics.
[8] Nagarajan, R., Ahmed, M. & Phatak, A. (2004). Database Challenges in the Integration of
Biomedical Data Sets.
[9] Leser, U. & Rieger, P. (2003). Integration molekularbiologischer Daten. Datenbank-
Spektrum, 6, 56-66.
[10] Rosse, C. & Mejino, J. L. V. (2003). A reference ontology for biomedical informatics: the Foundational Model of Anatomy. Journal of Biomedical Informatics, 36(6), 478-500.
[11] Schönbach, C., Kowalski-Saunders, P. & Brusic, V. (2000). Data Warehousing in Molecular Biology. Briefings in Bioinformatics, 1(1), 190-198.
[12] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. & Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1), 25-29.
[13] Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., Sherlock, G., Ball, C., Lepage, M., Swiatek, M., Marks, W. L., Goncalves, J., Markel, S., Iordan, D., Shojatalab, M., Pizarro, A., White, J., Hubley, R., Deutsch, E., Senger, M., Aronow, B. J., Robinson, A., Bassett, D., Stoeckert, C. J. & Brazma, A. (2002). Design and implementation of microarray gene expression markup language (MAGE-ML). Genome
Biology, 3, research00.
[14] Almeida, J. S., Chen, C., Gorlitsky, R., Stanislaus, R., Aires-De-Sousa, M., Eleutério, P., Carriço, J., Maretzek, A., Bohn, A., Chang, A., Zhang, F., Mitra, R., Mills, G. B., Wang, X. & Deus, H. F. (2006). Data integration gets 'Sloppy'. Nature Biotechnology, 24(9), 1070-1071.
[15] Etzold, T., Ulyanov, A. & Argos, P. (1996). SRS: information retrieval system for molecular biology data banks. Methods Enzymol, 266, 114-128.
[16] Kemmeren, P., Kockelkorn, T. T. J. P., Bijma, T., Donders, R. & Holstege, F. C. P. (2005). Predicting gene function through systematic analysis and quality assessment of high-throughput data. Bioinformatics, 21(8), 1644-1652.
[17] Muller, P. Y., Janovjak, H., Miserez, A. R. & Dobbie, Z. (2002). Processing of Gene Expression Data Generated by Quantitative Real-Time RT-PCR. BioTechniques, 32(6), 2-7.
[18] Hancock, W. S., Wu, S. L., Stanley, R. R. & Gombocz, E. A. (2002). Publishing large proteome datasets: scientific policy meets emerging technologies. Trends in Biotechnology, 20(12), 39-44.
[19] Lenzerini, M. (2002). Data Integration: A Theoretical Perspective.
[20] Halevy, A., Rajaraman, A. & Ordille, J. (2006). Data integration: the teenage years. VLDB Endowment.
[21] Palsson, B. (2000). The challenges of in silico biology. Nat Biotechnol, 18(11), 1147-1150.
[22] Hernandez, T. & Kambhampati, S. (2004). Integration of biological sources: current systems and challenges ahead. SIGMOD Rec., 33(3), 51-60.
[23] Bork, P. (2000). Powers and Pitfalls in Sequence Analysis: The 70% Hurdle. Genome
Res., 10(4), 398-400.
[24] Berthold, M. R. & Hand, D. J. (2003). Intelligent Data Analysis. Springer.
[25] Fayyad, U. M., Piatetsky-Shapiro, G. & Smyth, P. (1996). From data mining to knowledge discovery: an overview., 1-34.
[26] Fayyad, U. M., Piatetsky-Shapiro, G. & Smyth, P. (1996). Knowledge Discovery and
Data Mining: Towards a Unifying Framework.
[27] Dictionary.com (2007). Definition of “logistics”.
[28] Jablonski, S., Lay, R., Meiler, C., Müller, S. & Hümmer, W. (2005). Data logistics as a means of integration in healthcare applications. New York, NY, USA: ACM Press.
[29] Leser, U. & Naumann, F. (2005). (Almost) Hands-Off Information Integration for the
Life Sciences.
[30] Lee, T., Pouliot, Y., Wagner, V., Gupta, P., Calvert, D. S., Tenenbaum, J. & Karp, P. (2006). BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics,
7(1).
[31] EBI European Bioinformatics Institute (2007). The EMBL Nucleotide Sequence
Database, statistics.
[32] Galperin, M. Y. (2007). The Molecular Biology Database Collection: 2007 update. Nucleic Acids Res, 35(Database issue).
[33] Trißl, S., Rother, K., Müller, H., Koch, I., Steinke, T., Preissner, R., Frömmel, C. & Leser, U. (2004). Columba: Multidimensional Data Integration of Protein Annotations. Lecture Notes in Computer Science, 2994, 156.
[34] Haas, L. M., Lin, E. T. & Roth, M. T. (2002). Data integration through database federation. IBM Systems Journal, 41(4), 578-596.
[35] Ibrahim, I. K. & Schwinger, W. (2001). Data Integration in Digital Libraries:
Approaches and Challenges.
[36] Perco, P., Rapberger, R., Siehs, C., Lukas, A., Oberbauer, R., Mayer, G. & Mayer, B. (2006). Transforming omics data into context: Bioinformatics on genomics and proteomics raw data. Electrophoresis, 27(13), 2659-2675.
[37] MITRE Corporation (2007). MITRE Corporation.
[38] Seligman, L. J., Rosenthal, A., Lehner, P. E. & Smith, A. (2002). Data Integration: Where Does the Time Go? IEEE Data Eng. Bull., 25(3), 3-10.
[39] Rahm, E. & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334-350.
[40] Rahm, E. & Do, H. H. (2000). Data Cleaning: Problems and Current Approaches. IEEE
Data Eng. Bull., 23(4), 3-13.
[41] Cohen-Boulakia, S., Lair, S., Stransky, N., Graziani, S., Radvanyi, F., Barillot, E. & Froidevaux, C. (2004). Selecting biomedical data sources according to user preferences. Bioinformatics, 20(1), 86-93.
[42] Gamma, E., Helm, R., Johnson, R. & Vlissides, J. (1995). Design patterns: elements of
reusable object-oriented software. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
[43] Lacroix, Z. (2002). Biological data integration: wrapping data and tools. IEEE Trans Inf
Technol Biomed, 6(2), 123-128.
[44] Xu, L. & Embley, D. W. (2004). Combining the Best of Global-as-View and Local-as-View for Data Integration.
[45] Levy, A. Y., Mendelzon, A. O. & Sagiv, Y. (1995). Answering queries using views
(extended abstract). New York, NY, USA: ACM Press.
[46] Friedman, M., Levy, A. Y. & Millstein, T. D. (1999). Navigational Plans For Data
Integration.
[47] Theodoratos, D. & Sellis, T. K. (1997). Data Warehouse Configuration.
[48] Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28(1), 45-48.
[49] Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S. & Schneider, M. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res.,
31(1), 365-370.
[50] Hamosh, A., Scott, A. F., Amberger, J., Bocchini, C., Valle, D. & McKusick, V. A. (2002). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl. Acids Res., 30(1), 52-55.
[51] Wong, L. (2002). Technologies for integrating biological data. Brief Bioinform, 3(4), 389-404.
[52] Heidorn, B. P., Palmer, C. L. & Wright, D. (2007). Biological information specialists for biological informatics. Journal of Biomedical Discovery and Collaboration, 2, 1+.
[53] Achard, F., Vaysseix, G. & Barillot, E. (2001). XML, bioinformatics and data integration. Bioinformatics, 17(2), 115-125.
[54] Davidson, S. B., Overton, G. C., Tannen, V. & Wong, L. (1997). BioKleisli: A Digital Library for Biomedical Researchers. Int. J. on Digital Libraries, 1(1), 36-53.
[55] Haas, L. M., Kossmann, D., Wimmers, E. L. & Yang, J. (1997). Optimizing Queries
Across Diverse Data Sources.
[56] Lindberg, D. A. B., Humphreys, B. L. & McCray, A. T. (1993). The Unified Medical Language System. Methods of Information in Medicine, 32(4), 281-291.
[57] Noy, N. F. (2004). Semantic integration: a survey of ontology-based approaches. SIGMOD Rec., 33(4), 65-70.
[58] UniProtKB/Swiss-Prot (2007). UniProtKB/Swiss-Prot Release 53.0 statistics.
[59] Wikipedia (2007). Wikipedia Genbank.
[60] Ning, Z., Cox, A. J. & Mullikin, J. C. (2001). SSAHA: A Fast Search Method for Large DNA Databases. Genome Res., 11(10), 1725-1729.
[61] Enright, A. J., Van Dongen, S. & Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucl. Acids Res., 30(7), 1575-1584.
[62] Tarczy-Hornoch, P., Markey, M. K., Smith, J. A. & Hiruki, T. (2007). Bio*Medical informatics and genomic medicine: Research and training. Journal of Biomedical Informatics, 40(1), 1-4.
[63] NIH National Institutes of Health (2007). Re-engineering the Clinical Research
Enterprise.
[64] Gibson, G. (1999). What works. Data warehouse: decision support solution reduces patient admissions, saves payer millions. Health Management Technology, 20(4).
[65] Heyer, K. I. (1999). The development cycle of a pharmaceutical discovery chemi-informatics system. Medicinal Research Reviews, 19(3), 209-221.
[66] Jarke, M. (2003). Fundamentals of data warehouses. Berlin [u.a.]: Springer.
[67] Kimball, R. & Caserta, J. (2004). The data warehouse ETL toolkit. Indianapolis, Ind: Wiley.
[68] Inmon, W. H. (2002). Building the data warehouse. New York, N.Y. [u.a.]: Wiley.
[69] Moody, D. L. & Kortink, M. A. R. (2000). From enterprise models to dimensional
models: a methodology for data warehouse and data mart design.
[70] Chaudhuri, S. & Dayal, U. (1997). An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1), 65-74.
[71] Shin, B. (2003). An Exploratory Investigation of System Success Factors in Data Warehousing. J. AIS, 4.
[72] Pfeifer, B. (2007). A Life Science Data Warehouse System to enable Systems Biology in Prostate Cancer.
[73] Simitsis, A., Vassiliadis, P. & Sellis, T. (2005). Optimizing ETL Processes in Data
Warehouses. Washington, DC, USA: IEEE Computer Society.
[74] Golfarelli, M., Rizzi, S. & Cella, I. (2004). Beyond data warehousing: what's next in
business intelligence?. New York, NY, USA: ACM Press.
[75] Datta, A., Moon, B. & Thomas, H. (1998). A Case for Parallelism in Data Warehousing
and OLAP. Washington, DC, USA: IEEE Computer Society.
[76] Silva MR (2004). Bioinformatics, the Clearing-House Mechanism and the Convention on Biological Diversity. Biodiversity Informatics.
[77] Segev, A. & Fang, W. (1990). Currency-Based Updates to Distributed Materialized
Views. Washington, DC, USA: IEEE Computer Society.
[78] Wang, R. Y. & Strong, D. M. (1996). Beyond accuracy: what data quality means to data consumers. J. Manage. Inf. Syst., 12(4), 5-33.
[79] Shah, S., Huang, Y., Xu, T., Yuen, M., Ling, J. & Ouellette, B. F. F. (2005). Atlas – a data warehouse for integrative bioinformatics. BMC Bioinformatics, 6(1), 34.
[80] Buttler, D., Coleman, M., Critchlow, T., Fileto, R., Han, W., Pu, C., Rocco, D. & Xiong, L. (2002). Querying multiple bioinformatics information sources: can semantic web research help? SIGMOD Rec., 31(4), 59-64.
[81] Buccella, A., Cechich, A. & Brisaboa, N. R. (2005). Ontology-Based Data Integration. In L. C. Rivero, J. H. Doorn & V. E. Ferraggine (Eds.), Encyclopedia of Database Technologies and Applications (pp. 450-456). Idea Group.
[82] Rosenthal, A., Seligman, L. J. & Renner, S. (2004). From semantic integration to semantics management: case studies and a way forward. SIGMOD Record, 33(4), 44-50.
[83] Gruber, T. R. (1992). Ontolingua: A Mechanism to Support Portable Ontologies.
[84] Schulze-Kremer, S. (1998). Ontologies for Molecular Biology.
[85] Schulze-Kremer, S. (2002). Ontologies for molecular biology and bioinformatics. In Silico Biology, 2, 17.
[86] Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A. L. & Rosse, C. (2005). Relations in biomedical ontologies. Genome
Biol, 6(5).
[87] OBO Open Biomedical Ontologies (2007). Open Biomedical Ontologies.
[88] Dean, M. & Schreiber, G. (Eds.) (2004). OWL Web Ontology Language Reference. W3C Recommendation.
[89] World Wide Web Consortium (2007). World Wide Web Consortium.
[90] Berners-Lee, T., Hendler, J. & Lassila, O. (2001). The Semantic Web. Scientific American.
[91] Horrocks, I., Schneider, P. P. & van Harmelen, F. (2003). From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 1(1), 7-26.
[92] Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R. & Hood, L. (2001). Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network. Science, 292(5518), 929-934.
[93] Muilu, J., Peltonen, L. & Litton, J. (2007). The federated database – a basis for biobank-based post-genome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. European Journal of Human Genetics, aop(current).
[94] Haas, L. M., Schwarz, P. M., Kodali, P., Kotlar, E., Rice, J. E. & Swope, W. C. (2001). DiscoveryLink: a system for integrated access to life sciences data sources. IBM Syst. J.,
40(2), 489-511.
[95] Zdobnov, E., Lopez, R., Apweiler, R. & Etzold, T. (2002). The EBI SRS server: recent developments. Bioinformatics, 18(2), 139-148.
[96] IBM (2007). Data Joiner.
[97] Carey, M. J., Haas, L. M., Schwarz, P. M., Arya, M., Cody, W. F., Fagin, R., Flickner, M., Luniewski, A. W., Niblack, W., Petkovic, D., Thomas, J., Williams, J. H. & Wimmers, E. L. (1995). Towards heterogeneous multimedia information systems: the Garlic approach. ride,
00, 124.
[98] Donelson, L., Tarczy-Hornoch, P., Mork, P., Dolan, C., Mitchell, J., Barrier, M. & Mei, H. (2003). The BioMediator System as a Data Integration Tool to Answer Diverse Biologic Queries. Medinfo.
[99] Shaker, R., Mork, P., Brockenbrough, J. S., Donelson, L. & Tarczy-Hornoch, P. (2004). The BioMediator System as a Tool for Integrating Biologic Databases on the Web.
[100] Mork, P., Shaker, R., Halevy, A. & Tarczy-Hornoch, P. (2002). PQL: A declarative
query language over dynamic biological schemata.
[101] Shaker, R., Mork, P., Barclay, M. & Tarczy-Hornoch, P. (2002). A rule driven bidirectional translation system for remapping queries and result sets between a mediated schema and heterogeneous data sources.
[102] Mork, P., Shaker, R. & Tarczy-Hornoch, P. (2005). The Multiple Roles of Ontologies in
the BioMediator Data Integration System.
[103] Mozilla Foundation (2007). Mozilla Public License Version 1.1.
[104] Pfeifer, B., Baumgartner, C., Aschaber, J., Hanser, F., Dreiseitl, S., Modre, R., Schreier, G. & Tilg, B. (2007). A Life Science Data Warehouse System to enable Systems Biology in Prostate Cancer.
[105] Talend (2007). Talend open data solutions.
[106] Lorünser, G. (2006). Konzeption eines auf Metadaten basierenden Ad-hoc-Query Builders. Hall.
[107] Aho, A. V., Kernighan, B. W. & Weinberger, P. J. (1987). The AWK programming
language. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
[108] Wall, L. (1987). Perl – Practical Extraction and Report Language.
[109] Open Bioinformatics Foundation (2007). BioSQL project.
[110] Pocock, M., Down, T. & Hubbard, T. (2000). BioJava: open source components for bioinformatics. SIGBIO Newsl., 20(2), 10-12.
[111] Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church, D. M., DiCuccio, M., Edgar, R., Federhen, S., Geer, L. Y., Kapustin, Y., Khovayko, O., Landsman, D., Lipman, D. J., Madden, T. L., Maglott, D. R., Ostell, J., Miller, V., Pruitt, K. D., Schuler, G. D., Sequeira, E., Sherry, S. T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R. L. & Tatusova, T. A. (2007). Database resources of the National Center for Biotechnology Information. Nucl. Acids Res., 35(suppl_1), D5-12.
CURRICULUM VITAE
Personal data
Name: Karl Kugler
Address: Metzentaler 21, 6094 Axams
Date of birth: 18 December 1981, Innsbruck
Nationality: Austrian
Education
10/2002 – 10/2005 University of Health Sciences, Medical Informatics and Technology; Innsbruck, Austria – Bachelor of Science in Biomedical Informatics (B.Sc.)
09/1993 – 06/2001 Bundesgymnasium Sillgasse; Innsbruck, Austria – General University-Level Graduation
09/1992 – 07/1993 Gymnasium der Abtei Schlierbach; Schlierbach, Austria
09/1988 – 07/1992 Volksschule; Kematen/Krems, Austria – Primary School
Practical experience
1998 – 1999 Hotline and Customer Services at Modern Business Systems; Innsbruck
2000 – 2003 Hotline and Customer Services at k2-design edv-systeme; Axams
03/2003 -10/2003
Project participation: “Data Mining in Clinical, Genomic, Proteomic, Metabolic and Medical Image Databases“ at the Department of Database Systems; UMIT
08/2003 Internship at the Institut für Klinische Chemie und Pathobiochemie am Klinikum Rechts der Isar der TU München; Munich, Germany
02/2004 – 01/2005
Project participation: “Finding and Calling Webservices using Axis” at the Institute for Information Systems; UMIT
09/2004 – 10/2004
Internship at Biocrates life sciences GmbH; Innsbruck
09/2004 – 02/2005
Trainer and first level support for SAP IS-H at the TILAK; Innsbruck
since 03/2005
Bioinformatics Department at Biocrates life sciences GmbH; Innsbruck
Others
since 10/2001 Member of the Austrian Red Cross; Innsbruck – Emergency Medical Technician, on-scene commander, and ongoing administrative support for voluntary members
EXPRESSION OF THANKS
Some lines to say “thank you” should be placed in here, in order to honour all the people who accompanied me through the years of my bachelor and master courses. It was not always an easy time, but I cannot say that I regret any of the lessons learned, and most of it was a great time I would not want to miss.
First of all I want to thank Rektor Univ.-Prof. Dr. Bernhard Tilg and all his staff at the Institute for Biomedical Engineering for supporting me in writing this master thesis. Rektor Tilg would lend me an ear whenever something needed to be discussed. The same goes for Univ.-Prof. Dr. Armin Graber, who proved that he is much more than just a great boss.
Special thanks go out to Dipl.-Ing. Dr. Bernhard Pfeifer, who was great to work with, took all the time needed to support me and showed that working in an academic environment can be a whole lot of fun as well. I hope that many more students will have the pleasure of either being taught by him or even getting the chance to work with him. They will learn not only from his intelligence and problem-solving skills, but even more from his warm-hearted manner.
Further expressions of thanks go out to my friends and colleagues at the Austrian Red Cross,
who would cheer me up, whenever it seemed like there was an unsolvable problem ahead.
But there are two people who deserve most of all to hear a “thank you” after these five years, as they helped me out in times of financial shortcomings, pushed me ahead in times of lacking motivation and were always there for me: Thank you mum, thank you dad!
STATEMENT
I hereby declare that I have completed this work independently and that I have used no aids other than those mentioned.
Hall, …………………………………….
……………………………………
Karl Kugler