Master Thesis
Integration of biomedical and semantic knowledge for enabling systems biology approaches
Karl Kugler
UMIT – University for Health Sciences, Medical Informatics and Technology, Institute for Biomedical Engineering
Hall, August 2007
Eduard Wallnöfer Zentrum 1, A-6060 Hall, Österreich/Austria www.umit.at
Thesis adviser and examiner: Univ.-Prof. Dr. Bernhard Tilg
Univ.-Prof. Dr. Armin Graber
Co-Examiner: Dipl.-Ing. Dr. Bernhard Pfeifer
Accepted by the examination committee on
Executive Summary
Introduction: With the adoption of high-throughput technologies, the amount of available
biomedical data has grown considerably over the last years. This growth has become a challenge
in research fields such as genomics, where the use of cDNA microarrays led to the need for tools
to manage the vast amount of information[1]. Gardner identifies the problem as “Too often, the
data generated by the automated technologies gather in vast silos that are impressive in scale but
limited in usefulness to the organization“[2]. Integrating information from diverse sources poses
several challenges, which are presented in the following sections. This work introduces basic
principles and techniques used to bring biomedical data into an integrated and manageable state
and to address these challenges.
Objective: This work presents a framework providing conversion and import features, built for
the IMGuS project, which focuses on prostate cancer systems biology by combining several
-omics technologies.
Methods: The presented framework uses methods of data warehousing and semantic integration,
which are described in a dedicated section of this work. A conceptual overview of integration
techniques and current projects is given as well.
Results: The imgus-etl-framework for LINDA provides easily extensible features needed for the
processes and tasks involved in data integration, and especially in data warehousing. The
presented repository schema was designed so that the data warehouse can be extended with
additional -omics data and with information from external databases such as KEGG[3].
Conclusion: The introduction of the imgus-etl-framework should help to maximize the
involvement of the system users who produce the data, and thereby increase the rate of
performed updates. In the future, a graphical user interface might even allow user-specific
conversion and import processes to be created through a simple drag-and-drop application.
Zusammenfassung
Problemstellung: Durch die Etablierung von Hochdurchsatzverfahren stieg die Anzahl der
verfügbaren biomedizinischen Daten in den letzten Jahren stark an. Dieser Anstieg führte
für Forschungsgebiete, wie zum Beispiel die Genomics, wo durch die Verwendung von cDNA
Microarrays die Erstellung neuer Datenmanagementtools für die enorme Datenmenge
notwendig wurde[1], zu einigen Herausforderungen. Gardner beschreibt das Problem mit den
Worten “Too often, the data generated by the automated technologies gather in vast silos that
are impressive in scale but limited in usefulness to the organization“[2]. Die
Problemstellungen, die sich für die Integration von biologischen Daten aus verschiedenen
Quellen ergeben, werden vorgestellt. Des Weiteren werden grundlegende Prinzipien und
Techniken präsentiert, die genutzt werden, um diese Daten in einen integrierbaren und
verwaltbaren Zustand zu überführen.
Zielsetzung: In dieser Arbeit wird ein Framework vorgestellt, das die Umwandlung und den
Import von Daten ermöglichen soll. Dieses Framework wurde im Rahmen des IMGuS
Projektes, das auf die Zusammenführung von –omics für die Untersuchung von Prostatakrebs
hinarbeitet, erstellt.
Methoden: Das vorgestellte Framework verwendet Methoden des Data Warehousing und der
semantischen Integration, die in einem eigenen Kapitel dieser Arbeit vorgestellt werden.
Eine Übersicht über die verschiedenen Techniken der Datenintegration und aktuelle Projekte
ist ebenfalls enthalten.
Ergebnisse: Das imgus-etl-framework für LINDA bietet leicht zu erweiternde
Funktionalitäten, die für die Aufgaben der Datenintegration, vor allem in Data Warehouses,
von Nöten sind. Das vorgestellte Schema des Repository wurde mit dem Hintergrund einer
Erweiterbarkeit für zusätzliche –omics Daten erstellt, und soll in Zukunft auch Daten von
Datenbanken wie KEGG[3] enthalten.
Konklusion: Durch die Einführung des imgus-etl-framework soll die Benutzerinteraktion auf
Seiten der Datenproduzenten maximiert werden, wodurch sich auch die Anzahl der
Aktualisierungen entsprechend erhöhen soll. Künftig soll es eine grafische
Benutzeroberfläche erlauben, benutzerspezifische Transformationen und Importe mittels einer
simplen Drag-and-Drop Anwendung zu gestalten.
Table of Contents
1 INTRODUCTION
2 METHODS
2.1 DIAL (DATA INTEGRATION, ANALYSIS AND LOGISTICS)
2.1.1 DATA INTEGRATION
2.1.2 DATA ANALYSIS
2.1.2.1 Knowledge Discovery in Databases
2.1.2.2 Data Mining
2.1.2.3 KDD and Data Mining
2.1.3 DATA LOGISTICS
2.2 DATA INTEGRATION
2.2.1 OVERVIEW
2.2.2 REQUIREMENTS FOR DATA INTEGRATION
2.2.3 STEPS IN DATA INTEGRATION
2.2.4 INTEGRATION APPROACHES
2.2.4.1 Hernandez/Kambhampati-Classification of Integration Approaches
2.2.4.2 Leser/Naumann-Classification of Integration Approaches
2.2.4.3 Conclusion
2.2.5 CHALLENGES IN DATA INTEGRATION
2.2.5.1 Technical Challenges
2.2.5.2 Semantic Integration
2.2.5.3 Exponential growth rate of data amount
2.2.5.4 Human Resources
2.3 DATA WAREHOUSING
2.3.1 ARCHITECTURE
2.3.1.1 Back Room - Data Management
2.3.1.2 Front Room – Data Access
2.3.2 METADATA
2.3.3 KEEPING UP-TO-DATE
2.4 SEMANTIC INTEGRATION
2.4.1 ONTOLOGIES
2.4.1.1 Creating an Ontology
2.4.1.2 Web Ontology Language
2.5 CURRENT PROJECTS
2.5.1 SRS
2.5.2 DISCOVERYLINK
2.5.3 BIOMEDIATOR
2.5.4 ALADIN
2.5.5 ATLAS
2.5.6 BIOWAREHOUSE
2.5.7 COMPARISON
3 RESULTS
3.1 PROJECT IMGUS SETTINGS
3.2 LINDA REPOSITORY
3.2.1 DESIGN OF THE IMGUS-ETL-FRAMEWORK FOR LINDA
3.2.2 INFRASTRUCTURE
3.2.3 ARCHITECTURE
3.2.4 INTERFACES
3.2.5 IMPORT AND MAPPING COMPONENTS
3.3 IMPLEMENTATION OF THE IMGUS-ETL-FRAMEWORK FOR LINDA
3.3.1 FILE CONVERSION
3.3.1.1 Horizontal into Vertical Representation
3.3.2 FILE IMPORT
3.3.2.1 Talend Open Studio Import Classes
3.3.3 ONTOLOGIES
3.3.4 CONVERSION AND IMPORT
4 DISCUSSION
4.1 DATA INTEGRATION
4.2 SWITCH TO JAVA
4.3 INTERFACE USAGE
4.4 REPOSITORY
4.5 IMGUS-ETL-FRAMEWORK
4.6 DEPLOYMENT STAGES
4.7 FUTURE WORK
LIST OF FIGURES
LIST OF TABLES
BIBLIOGRAPHY
CURRICULUM VITAE
EXPRESSION OF THANKS
STATEMENT
1 INTRODUCTION
In 2001 the human genome was published by Lander[4], and a wave of enthusiasm swept
through the life science community. Researchers began to dream of a breakthrough in
personalized medicine within the next decades. Several years later, some scientists realized that
the genome alone could not bring the breakthrough everyone was expecting, as Quakenbush[5]
describes in 2007; already in 2003, van Beek had compared the repeatedly announced
breakthrough to airplane safety instructions one has heard too often[6]. Here we find calls for
bringing the gathered data into a manageable state, something Liu et al declared necessary for
molecular biology laboratories as well, following the introduction of gene expression studies
using cDNA microarrays[1]. A similar conclusion was reached at a workshop on genome
informatics issues[7], held by the Department of Energy as early as 1993: “More support for
complex, multi-database queries will require major efforts toward improving the integration and
interoperability of community databases. … Without an API, researchers must spend excessive
time manually identifying, extracting, and formatting data from community databases before
further analyses can begin.” Gardner addresses the problem as “Too often, the data generated by
the automated technologies gather in vast silos that are impressive in scale but limited in
usefulness to the organization“[2]. One of his conclusions is almost identical to the one above,
as he states that the way this data is managed needs to be rethought. Nagarajan et al even
consider the integration of data from the various sources the biggest challenge in the analysis of
the huge amount of available information[8].
Taking this into account, what makes the information manageable? One solution, provided by
van Beek, focuses on data integration, analysis and logistics[6]. The next sections will introduce
concepts and basics of data integration.
But first it is necessary to understand why transforming biomedical information into a
manageable and integrated state is crucial for further steps in biomedical research. One has to
distinguish between qualitative and quantitative biological research. A group performing
experiments on its own, producing qualitative data, has no need for automated data integration.
When the need to compare results with other groups arises, a scientist may integrate her data
manually, which means she might transform information locally, perhaps even with small
hand-made scripts on demand. But as the distribution of information becomes more and more
fragmented, both thematically and spatially[9], the need for automated data
integration arises. This fragmentation leads to the need for a common understanding and
definition of concepts. Creating ontologies and common standards, which is one paradigm in the
field of medical informatics, is a process that has begun in biomedical informatics over the last
couple of years[[10],[11]]. Large projects like the Gene Ontology[12] (GO) or the MicroArray
Gene Expression Object Model[13] (MAGE-OM) help to simplify the exchange of data and
knowledge. However, the use of given standards must not limit the representation of new
research results, as Almeida et al state[14].
When integrating biological data one must always keep in mind that it is not enough to store the
data in a proper way; the data's meaning has to be defined on a semantic level, and how such
data will be queried afterwards has to be considered as well. Not thinking about how to make
data harvestable might render the whole project useless, as scientists should not spend their
working days entering complicated queries, but rather perform experiments at their work
benches and explore the gained data. The options for entering user queries range from a simple
form on a web page to sophisticated scripting languages like Icarus, the scripting language of
the Sequence Retrieval System SRS[15].
When working with such amounts of data, one needs to think about integrating it and making it
accessible, as well as about how to analyse the data that is in the focus of interest. Van Beek[6]
refers to this set of tasks as DIAL (Data Integration, Analysis and Logistics), which is a quite
useful description. In a following section the steps performed in DIAL will be described, in
order to understand the process of creating new knowledge from several distributed sources.
As the integration of data comprises a wide range of different research approaches, it will be
discussed with regard to its requirements, which are of a technical and a semantic nature, and
the individual steps needed for a systematic integration approach will be described. Three main
categories can be identified when speaking about data integration. Navigational or linkage-based
approaches connect information or data entities by simply providing a traceable path between
them. Mediator-based integration approaches hide the underlying data sources from the user's
point of view, as they present one interface for querying multiple data stores behind it. The third
approach uses data warehouses, whose repositories store physical copies of the data from the
sources to be integrated.
Nowadays there are many projects aiming at providing a solution for integrating biomedical
data; some of these projects will be introduced in a later section, in order to provide a
comprehensive view of the current research field.
The last sections will describe parts of the IMGuS project, which was initiated in order to
enable a systems biology view on prostate cancer. The project aims at integrating data from
various -omics techniques, such as metabolomics, genomics or phenomics. This work focuses on
how to import the given data sets into the IMGuS data warehouse.
2 METHODS
Data integration is a major field in current biomedical research, as an enormous amount of
data is produced[[16],[17]]. Due to advances in biological methods, such as high performance
liquid chromatography, efficient data integration technologies might become a key issue in
disciplines like proteomics[18]. This section provides an overview of current technologies and
projects aiming at integrating data from various biological sources.
2.1 DIAL (Data Integration, Analysis and Logistics)
The Centre for Medical Systems Biology started a project targeting the harvesting of medical
knowledge by data integration, analysis and logistics, a set of tasks called DIAL[6]. This
acronym describes very well the whole process of handling biological and medical
information in order to gain new knowledge.
This section will describe the three parts, integration, analysis and logistics, in order to present
an overview of how modern biomedical science works. As data integration is covered in a
section of its own, it will only be described briefly here, whereas data analysis and logistics will
be presented in more depth.
2.1.1 Data Integration
Data integration might be considered as combining several different data sources in such a way
that someone accessing them gets a single representational view of the whole set[19]. As one
might easily imagine, this is quite a strong constraint, since today's biomedical information
repositories are becoming more and more distributed. To achieve these objectives, the systems
and algorithms involved have to be adapted to deal with these conditions.
Before automatic data integration became an issue, scientists had to manually find the needed
data sources, analyse their representational formats and then manually integrate the different
results[20]; but as more and more data sources appear on the internet, an automated approach
becomes essential.
As Schönbach et al state, data integration is the prerequisite for the later step of data
analysis[11]. This is especially true for the systems biology field, where a
systematic, integrated view of the individual is the basis for applying a methodological
approach towards gaining new information and knowledge, as stated by Palsson[21]. Integrating
data by informatics means was not always mandatory, since until recent years a scientist could
very well integrate her information with other sources manually[22], by mining other data
sources or by incorporating knowledge from the literature. Today, as most information is widely
distributed over the internet, non-automated data integration becomes increasingly impossible.
Given the amount of data that high-throughput assays create, an automated approach is
mandatory[[1],[17],[23],[16]].
2.1.2 Data Analysis
Having a huge amount of integrated data stored in one place does not by itself produce any
output; this data silo needs to be analyzed in order to obtain information or knowledge.
Mathematical tests, explorative or descriptive statistics, or methods of data mining and machine
learning, such as neural networks, may therefore be applied to this data. The steps and tasks
involved are called data analysis. Berthold and Hand[24] define data analysis as “…the process
of computing various summaries and derived values from the given set of data”. They further
point out that simply applying tools to a given data set amounts to a “cookbook fallacy” rather
than data analysis. In order to gain real advantage from integrated data, one needs to know how
to apply methods that really fit the given problem, otherwise meaningless or even false results
have to be expected[25].
For analysing data there are two opposite approaches. One is the classical approach of having
an idea, formulating a theory and then testing a hypothesis. The other is a branch of research
that tries to discover new knowledge by exploring the existing data without formulating a
hypothesis in advance and then trying to falsify it[6]. The idea of generating new hypotheses
through data analysis, and afterwards proving these theories in classical workbench laboratory
work, seems to be gaining relevance[5]; but as the number of new hypotheses created by modern
high-throughput analysis approaches increases, proving all of these fresh hypotheses might turn
out to be difficult[6].
2.1.2.1 Knowledge Discovery in Databases
One of the basic concepts used when creating new knowledge from databases is knowledge
discovery in databases, often abbreviated as KDD. On an abstract level, KDD may be regarded
as “making sense of data”, or, in a more formal definition: “KDD is the nontrivial process of
identifying valid, novel, potentially useful, and
ultimately understandable patterns in data”, as described by Fayyad et al[25]. Fayyad et al
furthermore describe KDD as deriving understandable (summarized) reports from an amount of
basic data too large to be understood directly. This definition matches the characterization of
data analysis introduced above; KDD may therefore be regarded as a subfield of data analysis.
The goals of data mining may be classified into two groups: verification and discovery[26].
Verification aims at proving a user's hypothesis, whereas discovery describes the attempt to find
new patterns in a given data set. Discovery can be further divided into two sub-goals: prediction,
the goal of discovering parameters that may be used to predict future behaviour, and description,
which aims at finding patterns that describe the data sets in an understandable way.
2.1.2.2 Data Mining
Data mining can be considered as the application of algorithms within KDD, which makes data
mining a part of the KDD process[25]. More precisely, data mining is the selection and
application of algorithms to data sets in order to find useful patterns. These patterns might be
classifications, clusters or a model that represents the underlying data sets. Data mining is one of
the steps Fayyad et al defined in their KDD model, which will be shown later. The data mining
component includes the iterative and repeated application of data mining methods and
algorithms[26], combined with interaction by the user. Many of these methods originate from
the scientific fields of machine learning, pattern recognition and statistics. Several goals can be
defined for using data mining as a discovery tool (a small illustration follows the list below):
- Classification: Automatically mapping each data entry to exactly one predefined class
- Regression: Mapping an entry to a real-valued prediction variable and discovering
functional relationships between features
- Clustering: Identifying a finite set of categories or clusters that describe the data set
- Summarization: Finding a compact description for a subset of data
- Dependency Modelling: Modelling a graph that describes significant dependencies
between variables
- Change and Deviation Detection: Detecting the most significant changes in a data set
from former measurement entries or standard values
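To make two of these goals more concrete, the following minimal sketch (in Java, with invented class, method and variable names that are not taken from the imgus-etl-framework or any cited tool) illustrates summarization and change/deviation detection on a handful of hypothetical expression values:

```java
import java.util.List;

/** Minimal sketch of two data mining goals: summarization and deviation detection. */
public class MiningGoalsSketch {

    /** Summarization: a compact description of a data subset (here simply the mean). */
    static double mean(List<Double> values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    /** Change and deviation detection: flag values far away from the mean. */
    static void reportDeviations(List<Double> values, double threshold) {
        double m = mean(values);
        double squares = 0.0;
        for (double v : values) {
            squares += (v - m) * (v - m);
        }
        double sd = Math.sqrt(squares / values.size());
        for (double v : values) {
            double z = (sd == 0.0) ? 0.0 : (v - m) / sd;
            if (Math.abs(z) > threshold) {
                System.out.printf("deviating value %.2f (z-score %.2f)%n", v, z);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical expression measurements of one gene across samples.
        List<Double> expression = List.of(7.1, 6.9, 7.3, 7.0, 12.4, 7.2);
        System.out.printf("summary (mean): %.2f%n", mean(expression));
        reportDeviations(expression, 2.0); // flag values more than two standard deviations away
    }
}
```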
2.1.2.3 KDD and Data Mining
Fayyad et al created a model of the KDD process, including the role of data mining within it:
Figure 1 - The steps performed in KDD as defined by Fayyad et al[25]
The initial step, here labelled “Data”, is to understand the domain of the application and to
gather the current knowledge, facts and information. In a second step the “Target Data” is
selected by filtering out those objects that are needed for the further creation of knowledge.
After that the target data is pre-processed, resulting in “Preprocessed Data”, by filtering out
invalid and missing values or noise. In the next step the data is reduced to representative features
and fewer dimensions, e.g. using information gain methods, leaving the “Transformed Data” as
the working set. With this step the preparation and selection of data is finished.
Next, one has to decide which data-mining methods will be applied to the working set in order
to reach the goal of the KDD process. Given this choice, one may select the appropriate
data-mining algorithms and their parameters. After that the data-mining methods are applied to
the working set, searching for patterns of interest. These patterns may later be evaluated and
interpreted, which may cause a return to one of the earlier steps. Finally, after the patterns have
been successfully evaluated and interpreted, the newly created knowledge can be used.
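As a rough illustration of these steps, the sketch below (with purely hypothetical record names and values, and deliberately trivial selection, cleaning and mining rules) walks a tiny data set from raw “Data” to a first pattern:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Minimal sketch of the KDD steps described by Fayyad et al; all names are illustrative only. */
public class KddPipelineSketch {

    record Measurement(String geneId, double value, boolean valid) {}

    public static void main(String[] args) {
        // "Data": the raw, domain-level input.
        List<Measurement> data = List.of(
                new Measurement("BRCA1", 7.2, true),
                new Measurement("TP53", Double.NaN, false),
                new Measurement("KLK3", 11.4, true));

        // Selection -> "Target Data": keep only the objects needed for the question at hand.
        List<Measurement> targetData = data.stream()
                .filter(m -> !m.geneId().isEmpty())
                .collect(Collectors.toList());

        // Preprocessing -> "Preprocessed Data": filter out invalid and missing values.
        List<Measurement> preprocessed = targetData.stream()
                .filter(m -> m.valid() && !Double.isNaN(m.value()))
                .collect(Collectors.toList());

        // Transformation -> "Transformed Data": reduce each object to the representative feature.
        Function<Measurement, Double> feature = Measurement::value;
        List<Double> transformed = preprocessed.stream().map(feature).collect(Collectors.toList());

        // Data mining: search the working set for a pattern (here simply the maximum value).
        double pattern = transformed.stream().mapToDouble(Double::doubleValue).max().orElse(0.0);

        // Evaluation and interpretation: turn the pattern into candidate knowledge.
        System.out.println("highest expression value in the working set: " + pattern);
    }
}
```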
2.1.3 Data Logistics
Defining the term “logistics” as “the planning, implementation, and coordination of the details
of a business or other operation”[27], data logistics may be regarded as dealing with data on a
meta level, aiming mainly at delivering the right type of data at the right time.
Jablonski et al split data logistics into two main tasks: data transportation and data
transformation[28]. With their Process Based Data Logistics (PBDL) approach, they indirectly
support the task of data transportation by applying workflow management methods, whereas
data transformation is performed using XML-based ontology wrapping between the different
formats.
2.2 Data Integration
As mentioned above, data integration is one of the primary prerequisites for enabling a
structured approach to the widely distributed data that today's biological and biomedical
disciplines produce. Data integration was defined by Lenzerini in 2002 as “the problem of
combining data residing at different sources, and providing the user with a unified view of these
data”[19].
Hernandez and Kambhampati define three goals that should be enabled by data integration
approaches: gathering knowledge from a huge amount of data, formulating a hypothesis and
finally verifying this hypothesis, which they consider the primary use of bioinformatic
integration systems[22].
When working with biological data, special challenges such as the variety of data and
representational heterogeneity have to be taken into account[22]. An interesting distinction is
made by Leser and Naumann[29], who separate “data-focused” and “schema-focused”
integration approaches. Data-focused systems provide a high standard of data by manually
maintaining the entries, whereas schema-focused systems act more like a biological
“integration middleware”.
2.2.1 Overview
This section will show why it is necessary to use data integration when performing systems
biology research. There are several different approaches to this, from linkage-based data storage
to holistic biomedical data warehouses. All of these techniques try to enable the scientist to gain
information and knowledge from experimental or in silico data, existing knowledge, or even
measurements from animal models or patients. Another benefit of combining information from
different sources is that redundant or overlapping data sets can be used to verify or cross-validate
other entries[30].
Over the last decades the amount of biological data to be stored has increased exponentially, as
described by the EMBL statistics[31]. This growth leads to the need to bring this information
into a manageable state, so databases were developed containing this information, mostly in
proprietary formats. Worldwide biological research and development thus had to face the
challenge of becoming thematically and spatially more and more fragmented,
as described by Leser and Rieger[9]. This fragmentation caused biological databases to spread
at high speed.
The current number of registered biological databases is 968, an increase of 13% compared to
2006, as described in Galperin's “The Molecular Biology Database Collection”[32]. Taking this
situation into account, one can easily see the need to integrate this huge amount of data in order
to enable a complete systems biology approach.
Today several approaches to integrating data at the storage level can be distinguished.
Hernandez and Kambhampati define three types of integration approaches[22]. These methods
will be explained in order to illustrate their strengths and weaknesses.
Something to keep in mind when integrating data from diverse sources is that when merging
information from objects considered similar enough to be joined, possible differences might be
blurred, making it impossible for a biologist to know exactly which object he is looking at. This
problem does not occur in an approach where information coming from different sources is kept
separate and only a linking structure is established[33].
When talking about the integration of data, it is fundamental to know about the challenges this
approach presents. Since the information may come from several sources and may even
represent different (experimental) views of a data set, it is crucial to keep track of the technical
and semantic requirements an integrating service has to meet.
2.2.2 Requirements for Data Integration
When integrating data from biological sources, several requirements have to be met to enable
valuable use. Some of these requirements originate from the historical development in business
economics, from which the idea of integrated data, or “federated databases”, derives.
Nevertheless, they apply to data integration approaches in biomedical environments, as defined
by Leser and Rieger[9], Haas et al[34] and Ibrahim and Schwinger[35]:
- Transparency: Masking the data sources from the user. The user does not need to
know what the underlying semantic and technical implementations are; accessing the
top-level resource delegates a query to the underlying systems. The data sources must
also not be forced to show a given behaviour, as they should act completely
independently.
- Completeness: The data represented by the integrated system should be the complete
data held by the underlying systems.
- Semantic Correctness and Non-existence of Redundancy: The schema represented by
the global system has to be semantically correct, and the addressing of its elements has
to be unambiguously defined. The individual data sources may even contain conflicting
elements.
In a biological environment some additional requirements have to be met:
- Actuality: For some biological questions it is necessary to inspect the most recent data.
When working with a data integration system that copies entries into its own physical
storage, it is quite challenging to keep the stored information up to date.
- System Performance: With the high-throughput measurements used in the -omics field,
the size of the gathered information creates the need for efficient algorithms to manage
and explore this huge amount of data[[11],[36]]. In order to enable an algorithm to
access the data, good system performance allowing efficient and optimized access to the
stored data is crucial.
- Data Integration: Integrating data in a biological context means recognising and
merging duplicate information from various sources. Linkages between objects have to
be detected, and possible contradictions within data sets have to be resolved.
2.2.3 Steps in Data Integration
Given the requirements for successful data integration, it is necessary to think about how the
data integration should be performed in order to guarantee that these formal requirements are
fulfilled. A study performed by Seligman et al in 2002 on behalf of the MITRE Corporation[37]
defines eight tasks that need to be performed when integrating data[38]. This subsection presents
these steps, which are more fine-grained than most other definitions of data integration tasks. In
order to focus on biomedical data integration, aspects of this field of application are added to the
task descriptions.
- Gathering knowledge about sources: Each data source has to be understood in
schematic, representational and semantic terms. This might be difficult, as not every
source in a biomedical environment is documented well enough to be completely
characterized.
- Gathering knowledge about the application target: The interfaces and views that a user
or a user-side system will have on the specified application targets have to be designed and
implemented. Therefore it is crucial to have end users on board who can help in
understanding what these goals may look like.
- Identifying semantic correspondences: This might be considered as semantic
harmonizing, trying to merge logical entities from the different data sources that
correspond to the same real-world objects[39]. In the biological sciences this can be
difficult, as many data sources have created their own ontological systems, each fitting
their own needs perfectly.
- Creation of attribute transformations: Having identified the needed attributes and their
required representation, the transformation of these attributes has to be considered next.
Some attributes need syntactic transformation; other target attributes are aggregations or
calculations based on attributes provided by the data sources.
- Specifying data combination rules: When combining data vectors from different sources,
it has to be specified how this combination takes place. A much bigger challenge than
merging the vectors is the handling of duplicates: determining which vector contains the
“true” information is difficult, and omitting the conflicting entity might cause problems
when interpreting the new data (a small sketch of such a combination rule follows this list).
- Creating the logical mapping: Having performed the tasks above, the mapping from the
source data to the user-facing data can be defined. The way this is done depends on the
integration approach; for example, in a mediated integration the mapping could be
expressed as an SQL view.
- Cleaning the data: Incorrect values in the data have to be detected and corrected. This
is an important step, since especially in biomedical applications wrong data entries
could lead to wrong statistics and results[40]. Rahm and Do define several requirements
for data cleaning approaches[40]: the cleansing should be complete, meaning that all
errors and inconsistencies in both the source data and the integrated data are removed;
the cleaning approach should require minimal manual interaction by the user and should
be extendable to further data sources; and, last but not least, it should be combined with
the schema-related data transformations. It has to be kept in mind that it is difficult to
decide which data can be deleted, since even incomplete or potentially incorrect entries
might be of interest for research purposes[22]. A survey done by Schönbach et al showed
that about 30% of 145 data sources contained an error that could have caused further
trouble[11].
- Implementation of a user-friendly access environment: Making the integrated data
accessible to the user in a way that allows efficient work and useful results. In most
cases this will mean providing a GUI or web-based interface to the user
community.
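The following sketch illustrates one possible data combination rule with duplicate handling, as referred to in the list above. The record layout, the accession-based matching and the “curated source wins” rule are illustrative assumptions and do not describe the rules used in the IMGuS project:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Minimal sketch of a data combination rule with simple duplicate handling. */
public class CombinationRuleSketch {

    record SourceRecord(String accession, String description, String source, boolean curated) {}

    /** Combine records from several sources; duplicates are resolved by a simple rule. */
    static Map<String, SourceRecord> combine(List<SourceRecord> records) {
        Map<String, SourceRecord> merged = new LinkedHashMap<>();
        for (SourceRecord r : records) {
            SourceRecord existing = merged.get(r.accession());
            if (existing == null) {
                merged.put(r.accession(), r);
            } else if (r.curated() && !existing.curated()) {
                // Rule: a manually curated entry wins over an automatically derived one.
                merged.put(r.accession(), r);
            } else if (!r.description().equals(existing.description())) {
                // Conflicting duplicates are reported instead of being silently dropped.
                System.out.println("conflict for " + r.accession() + ": '"
                        + existing.description() + "' vs '" + r.description() + "'");
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<SourceRecord> records = List.of(
                new SourceRecord("P07288", "Prostate-specific antigen", "source A", true),
                new SourceRecord("P07288", "Kallikrein-3", "source B", false),
                new SourceRecord("Q9Y6K1", "DNA methyltransferase 3A", "source B", false));
        combine(records).values().forEach(System.out::println);
    }
}
```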
2.2.4 Integration approaches
During the last couple of years, three main approaches to data integration have become apparent,
as described by Hernandez and Kambhampati[22]:
- Navigational Integration
- Mediator-based Integration
- Warehouse Integration
Leser and Naumann[29] provide a classification as well, but instead of taking the classic
technical view, they focus on the grade of integration:
- Data-focused
- Schema-focused
The first classification will from here on be called the “Hernandez/Kambhampati classification”,
the second one the “Leser/Naumann classification”. In this section these two classifications and
their classes will be presented with their basic concepts, as well as their pros and cons.
2.2.4.1 Hernandez/Kambhampati-Classification of Integration Approaches
As mentioned above, the three types of integration approaches classified by Hernandez and
Kambhampati are: navigational integration, mediator-based integration and warehouse
integration. This classification represents a methodical distinction of the integration techniques,
based on where the access to and combination of the data takes place. As the first two
approaches, navigational and mediator-based integration, leave the data at their sources and do
not store a physical copy of the information in a central repository, they are called “virtual”
integration approaches. Data warehousing, on the other hand, stores transformed copies of the
data in a central repository and is therefore called a “materialized” integration. It is important to
keep this distinction in mind when thinking about read and write access to these data sets.
2.2.4.1.1 Navigational Integration
This approach is also called “link-based” integration, meaning that the integrated data is still
distributed over several sources and still has several different forms of representation, but is
connected via a linking model. The idea may be compared to the linking system used by the
WWW: the information is fragmented over many servers, but by establishing links a connection
between the desired pieces of information is created. One idea is to store the linkage information
as a pair of keys containing the ID of the target database and the accession number of the
dataset, as described by Leser and Naumann[29]. Such links between different data sources
might be maintained by hand or established automatically, and are often referred to as
“cross-references”[41].
The weaknesses of this method are the combinatorial explosion of possible links between the
data sources and the rather simplistic semantic model, which might lead to a high number of
false negative or false positive links between the data sources[2].
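A cross-reference of this kind can be sketched as a simple pair of database identifier and accession number; the type names below are invented for illustration, and the example values merely show what such a link might look like:

```java
import java.util.List;

/** Minimal sketch of link-based integration: cross-references as (database, accession) pairs. */
public class CrossReferenceSketch {

    /** A cross-reference: the target database and the accession of the linked entry. */
    record CrossRef(String targetDatabase, String accession) {}

    record GeneEntry(String symbol, List<CrossRef> crossRefs) {}

    public static void main(String[] args) {
        GeneEntry klk3 = new GeneEntry("KLK3", List.of(
                new CrossRef("UniProtKB", "P07288"),
                new CrossRef("KEGG", "hsa:354")));

        // Navigation means following such links into the other data source;
        // the data itself stays in the source databases.
        for (CrossRef ref : klk3.crossRefs()) {
            System.out.println(klk3.symbol() + " -> " + ref.targetDatabase() + ":" + ref.accession());
        }
    }
}
```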
2.2.4.1.2 Mediator-based Integration
“Define an object that encapsulates how a set of objects interact. Mediator promotes loose
coupling by keeping objects from referring to each other explicitly, and it lets you vary their
interaction independently.”
This definition of the mediator design pattern by Gamma et al[42] states that the mediator
provides access to an encapsulated, internal representation of objects without showing the
outside world what the internal linkage structure looks like. Replacing the term “object” with
“database” in this definition explains how the mediated integration approach works. As this
approach merges the various data sources, it is often referred to as the “federated databases”
approach[30].
Figure 2 - In this approach data sources are integrated by links between the objects: a tuple in database 1
contains linkage information pointing to a primary key in database 2.
Figure 3 - The mediator (virtual database) hides the data sources from user access. For each data source a
wrapper has to be implemented to delegate the queries and access the data.
The query sent to the system by a user is transformed by a mediating level, so that the several
databases that may lie behind the mediator interface can be queried. The mapping of the
obtained information is provided by the mediator as well.
In order to access one of the underlying data sources, a wrapper has to be used. A wrapper is
composed of two components: one component sends a query to the data source in order to
retrieve the information, while the second transforms the obtained information into the expected
output format[43]. This means that when the integration of n data sources is planned, n wrappers
have to be implemented in the worst case. Data sources need not be databases; they can also be
flat files, which can likewise be accessed by a specific wrapper.
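The two-part structure of a wrapper can be sketched as a small Java interface. The type names and the flat-file format below are assumptions made for illustration; they are not taken from an existing mediator system:

```java
import java.util.List;

/** Minimal sketch of the two components of a wrapper in mediator-based integration. */
interface SourceWrapper {

    /** Component 1: delegate the mediator query to the underlying data source. */
    String sendQuery(String mediatorQuery);

    /** Component 2: transform the source-specific result into the mediator's common format. */
    List<String> transformResult(String rawResult);

    /** Convenience method combining both components. */
    default List<String> query(String mediatorQuery) {
        return transformResult(sendQuery(mediatorQuery));
    }
}

/** Example wrapper for a hypothetical pipe-separated flat-file source. */
class FlatFileWrapper implements SourceWrapper {

    @Override
    public String sendQuery(String mediatorQuery) {
        // A real wrapper would scan the flat file; here a canned line is returned.
        return "KLK3|prostate specific antigen";
    }

    @Override
    public List<String> transformResult(String rawResult) {
        // Transform the pipe-separated line into the mediator's record format.
        String[] fields = rawResult.split("\\|");
        return List.of("symbol=" + fields[0], "name=" + fields[1]);
    }
}

class WrapperDemo {
    public static void main(String[] args) {
        SourceWrapper wrapper = new FlatFileWrapper();
        wrapper.query("find gene KLK3").forEach(System.out::println);
    }
}
```

In the worst case, one such implementation is needed per data source, which reflects the remark above about n wrappers for n sources.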
There exist two different approaches to providing a view on the mediator database[44]. The
first is called “Global-as-View” (GAV) and the other is referred to as “Local-as-View” (LAV).
Both concepts are briefly described below.
2.2.4.1.2.1 Global-as-View GAV
In the GAV concept the global schema may be compared to an ordinary view in a database
system, and unfolding a query is quite trivial. However, any change in the information sources,
or the addition of another information source, requires redesigning the global view. This makes
GAV of little use for systems where changes in the data sources occur frequently.
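As a minimal sketch of the GAV idea (with invented table and column names), an element of the global schema is simply defined as a view over the source schemas; unfolding a query then amounts to expanding this view definition, while adding a new source forces the view to be rewritten:

```java
/** Minimal sketch of a Global-as-View mapping; table and column names are made up. */
public class GavMappingSketch {

    /** The global relation gene_annotation is defined directly over the two sources. */
    static final String GENE_ANNOTATION_VIEW =
            "CREATE VIEW gene_annotation AS "
            + "SELECT g.symbol, g.chromosome, p.uniprot_acc "
            + "FROM source_a.genes g "
            + "JOIN source_b.proteins p ON p.gene_symbol = g.symbol";

    public static void main(String[] args) {
        // A query against gene_annotation is answered by expanding this view definition;
        // integrating a third source would require changing the view itself.
        System.out.println(GENE_ANNOTATION_VIEW);
    }
}
```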
2.2.4.1.2.2 Local-as-View LAV
Adapting to changing data sources is easier in Local-as-View environments, because here a
global schema exists independently of the schemas the sources provide. For a changed or an
added source, only a source description has to be modified or implemented. The drawback of
LAV, however, is that the reformulation and transformation of queries is a non-trivial task, thus
resulting in low performance. This problem is addressed as “answering queries using
views”[45], because the query needs to access the local sources through their own local views.
2.2.4.1.2.3 Combination of the approaches
As both approaches have their weaknesses, several projects aim at combining GAV and LAV in
order to get the best results by exploiting their respective strengths. These projects may be
referred to as Global-Local-as-View or GLAV, as introduced by Friedman et al in 1999[46].
Other research groups are also trying to get the best out of these two approaches; e.g. Lacroix
presented a wrapper using the “search view” approach in order to create an intermediate-level
mechanism[43].
2.2.4.1.3 Data Warehouse Integration
This approach integrates the data from various sources by copying the information into a
central repository, which makes the integrated and queried information part of the data
warehouse rather than of the sources. This means that the data uploaded to the warehouse may,
and in most cases must, be transformed to fit the warehouse schema. Or in other words: “A data
warehouse can therefore be seen as a set of materialized views defined over the sources”, as
stated by Theodoratos and Sellis[47]. The big advantage of using data warehousing techniques is
that, by working only with copies of the original data sets, the information originally produced
by the research does not need to be changed; when this information is uploaded to the data
warehouse, the transformation takes place on local copies.
2.2.4.2 Leser/Naumann-Classification of Integration Approaches
In their work “(Almost) Hands-Off Information Integration for the Life Sciences”, Leser and
Naumann separate two major types of data integration projects. They distinguish by the grade of
integration, defining the first group of projects as “data-focused”, which describes projects that
are maintained manually and thereby provide a high quality standard of information. The second
group are the so-called “schema-focused” projects, which focus on providing a global schema
for data storage.
2.2.4.2.1 Data-focused Integration
Data-focused projects can be considered the most successful projects in the biological field so
far, as they provide high data quality by administering the information manually. Since this
maintenance is performed by experts, one can be sure that the information provided meets the
needs of good scientific practice. As these types of projects mainly focus on data quality and on
the completeness of the entered information entities, basic database
demands may play a minor role. Since these projects have to be administered manually by
experts, the costs of keeping such projects consistent are quite high. Examples of such projects
are Swiss-Prot[[48],[49]] or OMIM[50].
2.2.4.2.2 Schema-focused Integration
Projects that are “schema-focused” make extensive use of database technology, as they try to
fit data to a given schema in order to achieve a high degree of automation. Such projects are not
yet very successful in the life sciences, at least in terms of attracting attention, which, as Leser
and Naumann believe, is caused by their schema-centricity[29]. Creating a global schema for
information originating from different groups can be a difficult task, especially mapping the
semantics of the entries. And as some kind of abstraction is necessary to create a global schema,
biologists might distrust these steps.
2.2.4.3 Conclusion
Having introduced the three integration approaches as defined by Hernandez and
Kambhampati, it is impossible to say that one approach is the only one that works; instead, one
can see the strengths and weaknesses each of these approaches has. This subsection tries to give
a brief summary of the pros and cons, but the selection of an approach has to be made separately
for each project in which a need for data integration arises.
Following a navigational integration approach might help when working with loose webs of
data for which no relational schema is provided, where a data web is characterized as a set of
pages and the links between those pages[46]. Taking this definition to a more abstract level, data
webs could be sets of information entities providing information about the links between those
entities. This can be a useful approach if the information provided by a data source is only
reachable by following one or more pathways, or if the data source providing the information
allows no automated parsing or information recognition. This does not free a user from the need
to manually combine the results found with this approach. It has, however, the advantage that a
user can undoubtedly identify the source certain information comes from.
One advantage of using a data warehouse is total control over the data that is used. Since every
piece of information has to be loaded into the data warehouse by a process, this process can
check whether the information to be imported meets the requirements. Another advantage is that
all the time that could be lost waiting for a slow data source on the network is saved if the
complete data is stored in one central repository. Furthermore, the
data sources that are often needed for the daily business of research work, or even for
life-critical tasks in a clinical environment, are kept safe from denial-of-service effects that
might be caused by huge batches of data being downloaded for a real-time survey, as can happen
in mediator-based integration approaches[51].
Another advantage is the possibility to perform all query optimization that might be necessary
on the local system, whereas mediator-based projects might lack the required information about
the current query execution environment of the data sources providing the needed biological
information. Furthermore, the more sources are integrated over the internet in real time, the
higher the probability that one of these sources is not available at the time it is needed; and even
if all the sources send the information to answer a query, this might mean an enormous amount
of data being transferred over the network, which may cause the network to be overloaded[30].
One of the major drawbacks of using data warehouses is that potentially outdated data might be
used if updates of the data set are not performed regularly. This cannot happen with a
mediator-based system, since all information is gathered in real time from the underlying data
sources.
2.2.5 Challenges in Data Integration
Integrating data from several sources always brings challenges of a technical nature, but when
integrating biomedical data there are even more challenges to be kept in mind and mastered.
These challenges may be technical as well as of a semantic nature[9], and even educating the
staff who integrate and work with this data may be considered part of the challenge[52].
To understand the way data integration is performed, one needs to understand what problems
and challenges this task involves. This section sets out the current demands of modern biological
data integration.
2.2.5.1 Technical Challenges
Some of the challenges data integration projects have to face are of a technical nature; a
solution to these problems is mandatory for every integration project.
2.2.5.1.1 Various Data Formats
When exchanging or integrating data, one needs to define how, syntactically, this exchange
happens. Some projects define their own file formats for exchange, e.g. ASCII files containing
nested commands and information. With the spread of XML as a quasi-standard for exchanging
text-based files, this problem is becoming less important. Achard et al suggested the use of XML
and XML interchange data dumps to replace the then-current flat file exchange as early as
2000[53]. Regarding the usage and spread of SBML and other similar XML-based exchange
formats, it seems this has by now come true.
2.2.5.1.2 Various Access Languages
Once the integrated data is stored on a system, a user must be able to access this information in
order to query for entries of interest for his research. Enabling easy access that fulfils the user's
demands is crucial to make such software usable. Projects like BioKleisli/K2[54] or
DiscoveryLink[55] created their own query languages in order to enable user access. In the case
of BioKleisli, the language resulting from the first design approach was so complicated that in a
commercial follow-up project a whole new, more SQL-like, query language had to be
constructed.
2.2.5.2 Semantic Integration
Leser and Rieger define semantic heterogeneity as a two-level challenge[9]: on the semantic
level (“What is a gene?”) and on the data level (“Are two genes identical?”). As bioinformatics
is, compared to medical informatics, a quite young discipline, definitions and ontologies still
undergo a constant process of change. Even a fundamental concept like the gene is ambiguously
defined. It may happen that two different sources have different wordings for the same concept,
or, even worse, the same wording for two different concepts. This may easily lead to data
inconsistencies[22]. Some data sources do not even have well-documented descriptions of their
content and schema.
With the evolution of biomedical data integration, the need to combine the existing ontologies
becomes more and more obvious, as Rosse and Mejino Jr. describe for the process of designing
new ontologies in the areas of medical informatics and even more in biomedical
informatics[10]. By sticking to large ontologies like the Gene Ontology[12], which is
meanwhile even part of the Unified Medical Language System[56], the integration process may
be simplified.
As semantic integration is one of the most active research fields in the domain of data
integration[[57],[2]], a later section will introduce the concepts and findings in more depth. One
of the most important tasks in developing semantic integration methods, the design and
development of ontologies, will be presented in this later section as well. It has to be kept in
mind that using established community standards might in some cases lead to the problem of
not being able to represent new findings in the standardized way[14].
2.2.5.3 Exponential growth rate of data amount
As described above, biological databases grow in two dimensions: the first dimension is the
number of databases itself, the second the number of entries in the databases. The number of
database entries in Swiss-Prot increases at an exponential rate (Figure 4). Similar growth rates
can be observed in other biological databases.
As Figure 4 shows, even the manually administered Swiss-Prot database grows at an exponential
rate. It is easy to imagine that projects that do not need any manual editing may grow at an even
higher rate. Figure 5 shows the growth of GenBank from 1985 to 2006.
Figure 4 - Swiss-Prot release 53.0 contained 269,293 entries on 29-May-2007 and keeps growing at an
exponential rate. Figure taken from [58].
The storage itself is not a big problem yet, since other databases contain much larger amounts
of data; but if the growth remains exponential it might become difficult in the future. Something
that already has to be dealt with today is the need for efficient algorithms that can handle such
amounts of data in order to analyse the information, as is shown in Ning et al[60] or Enright et
al[61].
2.2.5.4 Human Resources
Speaking of efficient algorithms leads to an aspect that seems to have faded from the spotlight:
in order to keep the pace of method development and daily work bench practice up with the
velocity of information growth, specialists in these areas are needed. Heidorn et al suggest
training bioinformaticians in information management and data integration skills, in order to
keep the scientists focused on research issues rather than on data integration problems[52]. A
“biological informatician” should be able to support local research groups, as well as to develop
tools and integration methods for a global science approach.
Figure 5 - The growth rate of GenBank from 1985 until 2006 shows exponential behaviour. Figure taken
from [59].
But the need to be trained and skilled in these new disciplines does not only arise for those
integrating the huge amounts of information. Physicians and other staff directly involved in
patient care need to be aware of the possibilities opening up in research and diagnostics. This
creates the need to adapt the available tools in such a way that staff without a biomedical
informatics background can understand and work with these applications as well[62]. The
National Institutes of Health recently presented a Roadmap aimed at bridging the clinical
research process with laboratory results, thereby taking into account the speed at which new
scientific results are being found in the life sciences and other biological fields[63].
2.3 Data Warehousing
As introduced before, a data warehouse may be regarded as a materialized view over several
distributed sources of data. Data warehouses have become more and more established as tools
for knowledge gain in the biomedical field over the last years. They have been used to support
work in healthcare[64] and chemoinformatics[65]. As they originate from the business and
financial sciences, a data warehouse may also be defined as “a collection of technologies aimed
at enabling the knowledge worker (executive, manager and analyst) to make better and faster
decisions”, as Jarke et al do[66]. This is a quite business-oriented view, but it can easily be
transferred to the biomedical domain. Most of the following definitions and facts in this section
are taken from Kimball's and Caserta's book “The Data Warehouse ETL Toolkit”[67].
One of the currently most used definitions of a data warehouse, based on its properties, is “A
Data Warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data in
support of the management's decision making process” by Inmon[68]. In the next paragraphs
these features will be examined and their usefulness in a biomedical setting discussed:
- Subject-oriented: A data warehouse has to be focused on a specified target, in terms of
research area, in order to enable a result-oriented approach. In a biological setting this
means one has to decide which subject of research a data warehouse should support
(e.g. cancer of the bladder). Schönbach et al even distinguish two groups of collections
of biomedical data: subject-oriented data warehouses and general-purpose
databases[11].
- Integrated: As mentioned above, the integration of data coming from various
fragmented sources enables a holistic view of a posed question. This integration of
information is crucial nowadays, since there are several hundred databases available
containing relevant information on biological matters.
- Time-varying: Object information stored in a data warehouse is not deleted when newer
information is added. A timeline or history of this data can thus be inspected, enabling a
scientist to reproduce the evolution of information on an object.
- Non-volatile: The information kept in the data warehouse repository is stored
permanently and will not be deleted.
One important property that has to be added to the above definition is:
- Read-only: The data stored in the data warehouse is only read by users. No write access
is allowed from the outside world, except for updating the stored information by adding
new entries from other data sources[69].
The data stored in a data warehouse is often of a multidimensional kind, as the focus of interest
depends on the posed questions[70]. To create an analogy to the example presented by
Chaudhuri and Dayal: in a data warehouse containing biological sample information, dimensions
of interest may be the time of sample acquisition, the type of biological material a sample
consists of (tissue, urine, blood), or the group a specimen belongs to (disease, control,
medication A, medication B). In many cases these dimensions are hierarchically structured.
One of the key targets in implementing a data warehouse has to be achieving a high quality of
data, since data quality is one of the major factors correlating significantly with end user
satisfaction, as a survey performed by Shin in 2003 shows[71]. It is noteworthy that the ability to
locate data (grouping the ability to locate data and metadata and the level of detail in defining
the data) scored second place in this user satisfaction ranking. Regarding the vast amount of data
stored in a data warehouse, and taking the completeness of user documentation on metadata into
account, this implies that a data warehouse implementation has to provide comprehensive
documentation in order to achieve a high degree of user acceptance and be successful.
2.3.1 Architecture
A data warehouse may be divided into two entities, both physically and logically. One entity,
the so-called back room, holds and manages the data, while the other entity, referred to as the
front room, enables data access. This distinction is crucial for understanding how a data
warehouse works and how it is organized.
Figure 6 - A data warehouse may be considered as consisting of a back room component and a front room
component. While the back room is responsible for data integration and storage, the front room has to
enable access to the data. Graphic taken from the IMGuS presentation at DILS 2007[72].
2.3.1.1 Back Room - Data Management
The back room is often described as the data management or data preparation component. It
contains the data and prepares and delivers data for queries, but it does not accept any user
queries from the outside, since this is a task of the front room. The back room is often referred to
as a “staging area”, which in this context may be understood as permanently storing the
information on a physical medium like a disk.
2.3.1.1.1 The ETL-System
The Extract-Transform-Load (ETL) system may be considered the basic concept of a data
warehouse back room. The ETL process extracts the needed data from the source systems,
transforms it into the needed representation by performing aggregations and other mutations on
the extracted data, then loads the results into the data warehouse repository, and finally presents
the stored data in a user-friendly representation format. Putting these steps together, it is possible
to state that ETL is responsible for data integration in the data warehouse approach. The sources
of the data to be integrated may be flat files or data
coming from a real database system, which is important as many of the public databases
provide complete SQL dumps of their contents while others offer just a flat file representation.
A more formal definition of what the ETL system is responsible for was given by Simitsis et
al[73] (a small sketch of such a pipeline follows the list):
- Identifying relevant data within the data sources
- Extracting this information
- Customizing and Integrating this information into a common format
- Cleaning this data
- Storing this cleansed data in the data warehouse
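The sketch below runs a toy record set through these steps. The flat-file layout, the probe-to-gene columns and the console "load" step are illustrative assumptions and do not reflect the actual imgus-etl-framework classes:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Minimal sketch of an ETL run over a small flat-file extract. */
public class EtlSketch {

    record ExpressionRecord(String probeId, String geneSymbol, double value) {}

    public static void main(String[] args) {
        // Extract: read raw lines from a (here hard-coded) flat-file source.
        List<String> rawLines = List.of(
                "1007_s_at\tDDR1\t7.3",
                "1053_at\tRFC2\tNA",   // contains a missing value
                "117_at\tHSPA6\t5.1");

        // Transform and clean: parse the lines and skip records with missing values.
        List<ExpressionRecord> cleaned = rawLines.stream()
                .map(line -> line.split("\t"))
                .filter(fields -> !fields[2].equals("NA"))
                .map(fields -> new ExpressionRecord(fields[0], fields[1], Double.parseDouble(fields[2])))
                .collect(Collectors.toList());

        // Load: store the cleansed records in the warehouse repository
        // (represented here by console output instead of a database insert).
        cleaned.forEach(r ->
                System.out.println("INSERT " + r.probeId() + " " + r.geneSymbol() + " " + r.value()));
    }
}
```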
2.3.1.2 Front Room – Data Access
The front room component enables a user or client application to access the data held in the
warehouse. The main task of the front room is mapping the huge amount of low-level data
usually stored in a data warehouse to another, more valuable form[25]. This more valuable form
may be more useful, more abstract or more compact. For accessing the data, techniques like data
marts and Online Analytical Processing (OLAP) may be established as the way a user or a
reporting tool accesses the stored information. The front room is responsible for managing
queries as well.
The front room activities are often referred to as Business Intelligence (BI). The term implies that non-trivial operations take place here. If business intelligence is defined as "…the process of turning data into information and then into knowledge", as Golfarelli et al do[74], and, at the same time, the repository containing the data of the data warehouse is regarded as a database, the terms BI and KDD may be treated as equivalent in this context.
The front room may provide techniques of data mining, text mining or classical statistical
methods. These may be performed on Data Marts and OLAP cubes. These two technologies
shall be presented in order to understand how BI makes use of data provided by the back
room.
2.3.1.2.1 Data Marts
A data mart is commonly defined as a subset of a data warehouse. It contains the same data, but filtered and aggregated so that it only contains data based on a certain business process or, as some consider it, a department-based view. Publications differ on what data it really may contain, but for the current work the above definition suffices. Data marts are often introduced for performance or security reasons, or when the need arises to restructure some parts of the existing data in the warehouse, e.g. when applying BI methods.
2.3.1.2.2 Online Analytical Processing OLAP
While in classical relational database systems the concept of transactions plays an important role, a transaction-proof infrastructure is of minor importance in data warehouses[70]. In operational environments the term Online Transaction Processing (OLTP) describes the focus on efficient transaction handling[75]. In contrast, OLAP systems need to provide a more analytical access to and view on the data, focusing on decision support.
Typical OLAP operations, as Chaudhuri and Dayal[70] describe them, are (illustrated by the sketch after this list):
- rollup, which increases the level of aggregation
- drill-down, a decrease in the level of aggregation or an increase in the level of information detail
- slice_and_dice, information selection and projection
- pivot, re-orienting the multidimensional view of the data
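The difference between roll-up and drill-down can be illustrated with a small aggregation example. The sketch below is purely illustrative and not part of LINDA; it aggregates a few invented measurement values once per (group, material) cell, i.e. at the detailed level, and once rolled up to the group level only.

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative roll-up vs. drill-down on an in-memory fact list (invented data, not part of LINDA).
public class OlapSketch {
    public static void main(String[] args) {
        // Facts: patient group, biological material, measured value.
        String[][] facts = {
            {"disease", "tissue", "1.2"}, {"disease", "serum", "0.8"},
            {"control", "tissue", "0.9"}, {"control", "serum", "1.1"},
            {"disease", "tissue", "1.4"}
        };

        // Detailed level (drill-down view): aggregate per group AND material.
        Map<String, Double> detailed = new LinkedHashMap<String, Double>();
        // Rolled-up level: aggregate per group only (one aggregation level higher).
        Map<String, Double> rolledUp = new LinkedHashMap<String, Double>();

        for (String[] f : facts) {
            double v = Double.parseDouble(f[2]);
            detailed.merge(f[0] + "/" + f[1], v, Double::sum);
            rolledUp.merge(f[0], v, Double::sum);
        }
        System.out.println("drill-down: " + detailed); // {disease/tissue=2.6, disease/serum=0.8, ...}
        System.out.println("roll-up:    " + rolledUp); // {disease=3.4, control=2.0}
    }
}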
2.3.2 Metadata
A data warehouse would not be manageable without taking into account information about the data itself, defining all elements and how they work together. Several institutes and consortia are working on defining standard sets of metadata. These sets are already in use in several biomedical approaches[76]. BioRegistry, for example, is a project aiming at creating a metadata repository for biological databases[36].
In a data warehouse, metadata may be divided into two sets: one set contains information about how to extract and load data from the different sources and is referred to as back room metadata, while descriptive information about the stored data is labelled front room metadata. Back room metadata can be split into three logical blocks:
- Business metadata: This contains information about the meaning of data in the domain
of the business or science field.
- Technical metadata: This kind of metadata represents mainly the physical aspects of
the handled data.
- Process metadata: Metadata about the ETL-process, like statistics on loading time,
failures and successes in row loading.
Front room metadata may consist of security and access information, labelling specifications
or user-specific settings.
2.3.3 Keeping up-to-date
Because a data warehouse does not directly contain information produced by experiments or other data-generating methods, but instead needs to be fed by the ETL process, which imports external data from the data sources, keeping a data warehouse up-to-date is a non-trivial task. Bouzeghoub and Peralta define data freshness as one of the key features of a data warehousing system[41], citing a survey performed by Shin in 2003[71]. They further split data freshness into currency[77] and timeliness[78], where currency is a measure for the time needed to extract information from a source and present it to a user, while timeliness expresses the rate at which a data set changes, either by adding new data or by updating existing entries.
The update policy of a data warehouse needs proper design decisions, since decisions based on the data stored in the repository need to be correct. As mentioned above, a data warehouse contains non-volatile data, but what happens if erroneous data has been integrated into the data warehouse? As Kimball argues, a data warehouse should represent a business, not the system the data originates from, so he states that invalid data has to be corrected by either negating or updating the wrong fact, or by deleting it and reloading the correct information instead[67].
In order to keep a database up-to-date, it must be ensured that the information contained in the repository is renewed on a regular basis. This can be done either by completely re-importing the data source as one big entity, which may cause major problems since the amount of data to be transferred may be gigantic, or, more cleverly, by updating only the portions that have changed since the last update. The latter approach is referred to as a delta update, but many of the public biological databases do not offer this feature yet[30]. The following table lists some data sources and their update routines as used in the Atlas project[79], including information about the frequency of the updates.
Data Source Update Frequency Update Type
GenBank Sequence Daily delta
GenBank Sequence Release full
GenBank Refseq Daily delta
GenBank Refseq Release full
NCBI Taxonomy Release full
HomoloGene Daily full
OMIM Daily full
Gene Daily full
LocusLink Daily full
UniProt Bi-weekly full
HPRD Release full
MINT Release full
DIP Release full
BIND Release full
GO Release full
Table 1 - The data sources as used by the Atlas project and their update properties as shown in Shah et al[79]. The middle column shows how often the data source is updated in the Atlas data warehouse, while the right column shows whether the complete data source is re-imported or just the changes to the currently stored version.
2.4 Semantic Integration
As stated above, data integration contains the challenging task of integrating several data sources that do not necessarily share the same semantic space, which means that they do not have mutual definitions of terms and concepts. This makes combining data coming from the different sources quite tricky. This diversity in representation and meaning makes semantic integration one of the most challenging tasks in integrating biomedical information[80].
Sharing the same idea of terms and concepts starts with terms as simple as "body weight". Imagine two physicians classifying patients into the three groups "normal body weight", "above normal body weight" and "below normal body weight". In some cases patients would end up in different classes, as the subjective interpretation of the term "body weight" might differ between the two physicians. A case like this could easily be solved by using absolute measurements of the body weight in kilograms1 or aggregated data like the body mass index. But what if even more complex concepts need to be addressed? Two different data sources containing the word "COLD" could easily lead to major problems, as one data source may use COLD as an abbreviation for "chronic obstructive lung disorder", whereas the other uses the term cold to express a temperature[2]. In other cases two different words might be used to address the same concept (e.g. "high blood pressure" and "hypertension"). In order to avoid these problems a domain needs to be semantically defined. These problems are often referred to as semantic heterogeneity[81], or as Rosenthal et al simply state: "For meaningful information exchange or integration, providers and consumers need compatible semantics between source and target systems"[82]. The term meaningful should be emphasized, since it describes why having a common understanding is necessary when integrating various data.
One of the solutions for the problems mentioned above is the usage of so-called ontologies[2], as an ontology contains a formal representation of all concepts used in a domain and describes the relationships between these concepts.
1 Having the absolute body weight in kilograms would surely not be enough, since the body height has to be taken into account in order to gain information about a person's obesity status
2.4.1 Ontologies
In order to cover a specific domain in terms of semantics, all concepts and their relationships need to be covered. An ontology does exactly this by defining all concepts a domain contains and additionally describing the relationships these concepts may have. Buccella et al[81] use a definition by Gruber, who, as they point out, introduced ontologies into computer science as an "explicit specification of a conceptualization" with his approach "Ontolingua"[83]. A more specific definition is found in Schulze-Kremer, who describes an ontology as a "Concise and unambiguous description of principle relevant entities with their potential, valid relations to each other"[84].
It is crucial to keep in mind that an ontology is not a model of an application domain, as it does not contain any hypotheses; neither can it be used as a database schema directly, since it does not contain any type information, but it can be used as a starting point when defining a new schema[85].
One advantage of an ontology-based integration approach is that an ontology provides a vocabulary that is normally stable enough to be used as a conceptual interface for a database schema, while at the same time not depending on the database schema itself[81]. A second advantage is that by using an ontology-based integration approach the target of being "meaningful", as stated above, is automatically reached, since an ontology explicitly aims at providing meaningful concepts and relationships.
One way of describing the format of an ontology is to refer to the concept and relationship blocks as triplets of the type concept-relationship-concept, or subject-predicate-object, called assertions[2] (a small code sketch of such assertions follows below). Basic relationships are is-a and part-of, but ontologies aiming at being usable should contain more sophisticated relations targeting temporal (transformation_of, derives_from) or spatial (located_in, contained_in) connections[86].
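The triple structure of such assertions can be expressed very compactly. The following sketch is illustrative only; the class and relation names are chosen for the example and do not come from any particular ontology.

import java.util.ArrayList;
import java.util.List;

// Illustrative subject-predicate-object assertions (class and relation names are examples only).
public class TripleSketch {

    static class Assertion {
        final String subject, predicate, object;
        Assertion(String s, String p, String o) { subject = s; predicate = p; object = o; }
        public String toString() { return subject + " --" + predicate + "--> " + object; }
    }

    public static void main(String[] args) {
        List<Assertion> assertions = new ArrayList<Assertion>();
        assertions.add(new Assertion("exon", "part_of", "gene"));                 // compositional
        assertions.add(new Assertion("metabolite", "is_a", "small molecule"));    // taxonomic
        assertions.add(new Assertion("mature_mRNA", "derives_from", "pre_mRNA")); // temporal
        for (Assertion a : assertions) {
            System.out.println(a);
        }
    }
}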
The Open Biomedical Ontologies (OBO) consortium is one instance trying to achieve a common standard in the design and usage of ontologies in the biomedical informatics field[87]. They provide a library containing ontologies for usage across the many different fields of life science, including the Gene Ontology[12] and other ontologies for cell and sequence information.
2.4.1.1 Creating an ontology
Sometimes existing ontologies cannot be used in a project because they do not support the specific needs of a certain domain or user group; in that case an ontology has to be newly built. Buccella and Cechich define three major stages when creating a new ontology in order to integrate various data sets[81]:
- The first step is building a shared vocabulary, which starts by analysing the given information sources in order to find the terms, or so-called primitives, which are used to build the new ontology. The information sources are checked on a global level, which means that a global view across all data sources is taken into account. Analysing the information implies checking how and where information is stored and what a stored data entry means (defining its semantics). When analysing this information it is crucial to keep the above-mentioned problems of semantic heterogeneity in mind, in order to obtain an unambiguous set of entities.
- In the second stage a local approach is applied on the data sources. This stage is
similar to the first one, but focuses solely on the source in isolation, not taking into
account any linkage to another source. Thus having defined the local terms, a local
ontology may be created.
- Having created a global ontology and the various local ontologies, the mapping between those two levels has to be established in the third stage. This may be a simple mapping of terms ("function of gene" to "gene function"), a mapping of types (dates to timestamps) or a more sophisticated use of formulas (mapping from degrees Fahrenheit to degrees Celsius); a small sketch of such mappings follows this list.
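A minimal sketch of this third stage might look as follows. The term pairs and the unit formula are taken from the examples above, while the class and method names are hypothetical and not part of any of the cited projects.

import java.util.HashMap;
import java.util.Map;

// Illustrative local-to-global mappings for the third stage (class and method names are hypothetical).
public class MappingSketch {

    // Simple mapping of wordings from a local ontology to the global one.
    static final Map<String, String> TERM_MAP = new HashMap<String, String>();
    static {
        TERM_MAP.put("function of gene", "gene function");
    }

    // Mapping of values by formula: degrees Fahrenheit to degrees Celsius.
    static double fahrenheitToCelsius(double f) {
        return (f - 32.0) * 5.0 / 9.0;
    }

    public static void main(String[] args) {
        System.out.println(TERM_MAP.get("function of gene")); // gene function
        System.out.println(fahrenheitToCelsius(98.6));        // 37.0
    }
}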
When creating a new vocabulary, Schulze-Kremer sees the basic challenge in having good definitions of concepts, as ambiguous or insufficiently detailed descriptions could easily lead to problems in later use[85]. In his paper several common problems are listed: definitions that only state what a concept is not, too broad or too narrow definitions, self-reflexive definitions, and verifying a scope rather than defining a concept. Schulze-Kremer further suggests documenting the design criteria and the formal notation of the ontology itself.
2.4.1.2 Web Ontology Language
The Web Ontology Language[88], often read in the abbreviated form OWL, is an ontology language provided by the World Wide Web Consortium[89] as a recommendation. OWL provides the technology to exchange information and its semantics via networks, as it was intentionally designed to be part of the semantic web[90], or, to be more precise, to support intelligent agents[91] that need to exchange information automatically.
It is possible to define classes and subclasses and then apply set operators like union or intersection, or to define properties and sub-properties. Objects can then be defined, classified and linked to properties containing the object's individual values. OWL distinguishes between two basic types of properties, which are both instances of built-in OWL classes (a small illustration follows the list):
- Object properties, that link an object to another object
- Datatype properties, that link an object to a data type
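As an illustration of these two property types, the following sketch builds a tiny OWL model with the Jena ontology API. Jena is an external library that is not used in this work, and the namespace as well as the class and property names are invented for the example only.

import com.hp.hpl.jena.ontology.DatatypeProperty;
import com.hp.hpl.jena.ontology.Individual;
import com.hp.hpl.jena.ontology.ObjectProperty;
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// Tiny OWL model illustrating object vs. datatype properties (Jena API; names are invented).
public class OwlSketch {
    public static void main(String[] args) {
        String ns = "http://example.org/demo#";          // invented namespace
        OntModel model = ModelFactory.createOntologyModel();

        OntClass gene = model.createClass(ns + "Gene");
        OntClass protein = model.createClass(ns + "Protein");

        // Object property: links an object to another object.
        ObjectProperty encodes = model.createObjectProperty(ns + "encodes");

        // Datatype property: links an object to a data type (here a string value).
        DatatypeProperty symbol = model.createDatatypeProperty(ns + "hasSymbol");

        Individual klk3 = model.createIndividual(ns + "KLK3", gene);
        Individual psa = model.createIndividual(ns + "PSA", protein);
        klk3.addProperty(encodes, psa);
        klk3.addProperty(symbol, "KLK3");

        model.write(System.out, "RDF/XML");              // serialize the model
    }
}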
2.5 Current Projects
In the last years several research communities have implemented various techniques targeting the integration of data and information from different sources. Some companies have jumped on this train as well, and today the range of available solutions spans from free open source applications to commercial products. In the sector of biomedical research, countless projects have been established to make progress in integrating the huge amount of information gained in daily workbench research. High-throughput measurement technologies have caused the amount of harvested data in the omics fields to explode, while the ability to transform this huge amount of information into interpretable and valuable knowledge is lacking[36]. Or, as Ideker et al state, the vast amount of data gathered by new methods like microarrays might not be useful for research on a single cell, but by applying an integrated data approach on these data sets it might be possible to perform in silico biology that could later be verified by workbench research[92]. None of the projects presented in this section targets the integration of sources for one certain domain only; instead they provide a framework to integrate various public or private data sources. There exist numerous other projects that have been designed to integrate information in order to answer specific questions, by integrating information coming from a regional base, like the biobank presented by Muilu et al[93] or Columba[33].
The following section presents a brief overview of existing projects, not aiming at completeness, but rather at giving examples for the different integration approaches described above. These range from projects supporting a link-based navigation through given data sources to projects converting data from different sources to fit a new schematic representation in order to be stored in a data warehouse repository. It is important to note that most of the projects aiming at integrating data from biomedical sources cannot be unambiguously classified as belonging to exactly one integration approach, but rather mix the techniques in order to get the best result.
The Sequence Retrieval System, as introduced by Etzold in 1996[15], will be presented, as it is one of the most popular query interfaces and data integration projects in the biological research fields, followed by IBM's DiscoveryLink[94]. A project by Leser and Naumann named "ALADIN" will be described briefly, as this is one of their current projects aiming at an automated data integration architecture[29]. After ALADIN, two more projects targeting a data warehousing integration approach will be shown with Atlas[79] and BioWarehouse[30].
2.5.1 SRS
The Sequence Retrieval System SRS[15] is one of the most widely distributed querying tools for biological sources, as it provides an easy-to-use graphical user interface that enables a user to access a wide range of biological database and flat file resources[51]. Almost every data source can be integrated into SRS, as it uses text file representations of data sources to access the information. In order to make a data source accessible, its meta information has to be declared in the Icarus scripting language that is part of SRS. This declaration contains information about the data object as well as how to parse it.
It is possible to establish bidirectional linkages between data sources, which can be weighted or even combined with logical operators (AND, OR and NOT). Having a high rate of cross references in a set of data sources, this set might be considered a kind of domain knowledge base[95]. The two main strengths of SRS are surely the way new data sources can be added, by generating a flat file and describing its content using Icarus, and the simplified generation of queries.
2.5.2 DiscoveryLink
DiscoveryLink[94] is an IBM product based on the fusion of DataJoiner[96] and Garlic[97], both of which were developed by IBM as well. The components of DataJoiner provide query optimization, a complete query execution module and the technology for federating the different data sources, whereas the Garlic component enables the integration of new data sources.
As wrapping data from the data sources is one of the main concepts of DiscoveryLink, the creators tried to implement a wrapping technology that allows a maximum number of integrable data sources with a minimum of effort for implementing the actual wrapper. These wrappers are implemented in C++ and usually support more than one data source, if those sources share the same API.
When a query is sent to the system, a query processor distributes the query to the several wrappers according to the information provided by the source descriptions. A global sum of execution costs is calculated afterwards, and according to this information an execution plan is created. After each wrapper has executed its tasks, a global result is aggregated[22].
2.5.3 BioMediator
The BioMediator[98] project is an approach using federated databases in order to integrate biomedical information; it supports features like querying for specific data instances or browsing through properties. It allows defining a user-specific mediated schema in order not to overload the user with a too broad view on information that is not needed to answer questions belonging to a certain scientific domain.
Figure 7 - The three main stages of the BioMediator project (as presented in Shaker et al[99]), showing the syntactic and semantic wrapping components as well as the query processing component that enables external access to the system.
The linked information is presented as a graph of nodes and edges, with the nodes representing the data source instances from the mediated schema and the edges being the relationships between those entities. This graph can be queried using the Path Querying Language PQL[100], a technology enabling the definition of queries and constraints between the federated databases. One of the main components is the source knowledge base (SKB), which stores information about how to wrap the information coming from a certain data source. In order to achieve the goal of providing a specific view for user-specific domains, an SKB has to be defined separately for each domain.
The wrapping itself is split into wrapping on a syntactic level first and, in a second step, converting the source's semantic information; the wrappers in the second step are referred to as metawrappers[101]. The syntactic transformation is called data acquisition and the semantic transformation data translation.
The authors themselves state that they intend to expand the integration platform they currently provide into a complete distributed network of peers[102].
2.5.4 ALADIN
The target of the ALADIN integration approach, introduced by Leser and Naumann[29], is to automate a large proportion of the integration process, which is already implied by the project's full name, as ALADIN is an acronym for "Almost Automatic Data Integration". This high degree of automatic integration is reached by using methods to automatically detect links between objects from various data sources. The information gained by the linkage detection is stored in a global, materialized repository, a data warehouse, and can be accessed via the classic methods of KDD.
In biological data sources there are two types of objects:
- Primary objects: objects that contain the most useful information; they mostly represent the basic concepts of a scientific field (genes, DNA sequences, …)
- Secondary objects: nested information containers linked to a primary object, referred to as "annotations". E.g. the sequence string of a protein or the functional annotations of a gene are considered secondary objects.
In most cases, linkage within a biological data source is only established between primary and secondary objects, whereas between sources there is often heavy linkage via the primary objects' identifiers. There might also be duplicates scattered among the different sources; these duplicates have to be discovered and flagged.
The automated detection of relationships between primary objects originating from different
sources is established using techniques from data integration, text mining, information
retrieval and data mining. “Guessing” these relationships is considered as a main feature of
ALADIN as it might help to find unseen connections, but it might as well produce false
negative (no link is detected) or false positive (a wrong link is discovered) results.
2.5.5 Atlas
Atlas is a data warehousing project from the University of British Columbia that aims at integrating information from several biological sources[79]. The data sources it can integrate may be categorized into four groups, namely sequences, molecular interactions, gene-related resources and ontologies, which are stored in a MySQL database. The developers designed their database schema according to the above-mentioned categories of importable data.
As in BioWarehouse and other data warehousing solutions, loaders play a major role in integrating data from the external sources. For Atlas, loaders are implemented for sequences and molecular interactions, whereas the other information is simply imported using database dumps provided by the originating sources.
A user may access the data stored in Atlas directly via SQL, using the API provided by the Atlas framework, or with some end-user applications.
2.5.6 BioWarehouse
Lee et al introduced a bioinformatics data warehouse approach in 2006, called
BioWarehouse[30]. BioWarehouse is an open source toolkit for constructing a data
warehouse integrating several biological data sources, published under the Mozilla Public
License[103], and is currently available for Oracle and MySQL.
Figure 8 - The main data types used in BioWarehouse and their possible connections, as shown in Lee et
al[30] A complete ER-diagram may be accessed at
http://www.biomedcentral.com/content/supplementary/1471-2105-7-170-S1.jpeg.
One of the key features is the usage of a so called warehouse identifier WID, used to uniquely
identify a data object stored in the warehouse. As WIDs are given to any type of concepts,
e.g. genes, proteins or reactions, a linkage between a gene object and a reaction object is
possible.
The main data types used in the BioWarehouse toolkit are: Taxon, BioSource, NucleicAcid,
Subsequence, Gene, Protein, Feature, Reaction, Chemical and Pathway. Each of these basic
types contains information typical to this type, but as well metadata like change history and
source of origin.
In order to add information to the warehouse, tools named "loaders" are necessary, as they are in charge of loading and transforming the data coming from a data source to fit the BioWarehouse schema of the corresponding main type. Duplicates, which may show up when importing from several sources, are not merged by the ETL system; instead, if two objects referring to the same concept are imported from the sources, two data objects are stored in the warehouse. BioWarehouse also allows storing information about literature references linked to a data set, representing a biological source, within the repository.
2.5.7 Comparison
All of the above-mentioned projects have their strengths and weaknesses, and each one of them has proved to be useful in certain biomedical enquiries. The table below lists all of the above-mentioned projects, providing information about the homepage and their integration approach according to Hernandez and Kambhampati. It is not always possible to clearly define whether a project uses a purely navigational or a mediator-based approach, so the classification was based on how the most characterising features can be classified.
Name               Homepage                                                                                           Integration Approach
SRS[15]            http://srs.ebi.ac.uk                                                                               Navigational
DiscoveryLink[94]  http://www-304.ibm.com/jct09002c/us/en/university/scholars/products/lifesciences/discoverylink/   Mediator
BioMediator[98]    http://www.biomediator.org                                                                         Mediator
ALADIN[29]         http://www.informatik.hu-berlin.de/forschung/gebiete/wbi/research/projects/aladin/                 Data Warehouse
BioWarehouse[30]   http://biowarehouse.ai.sri.com/                                                                    Data Warehouse
Atlas[79]          http://bioinformatics.ubc.ca/atlas/                                                                Data Warehouse
Table 2 - An overview of some data integration projects, described above, showing the project name and the reference where it was first published, as well as the project's homepage and the type of integration approach (Navigational, Mediator or Data Warehouse)
3 RESULTS
This section provides an overview of the various findings and implementations that were
made for the IMGuS project. The project setting, the infrastructure as well as some details on
the newly created ETL-framework will be part of the following subsections. Source code will
only be provided if it is necessary to present insights on development details.
3.1 Project IMGuS settings
The IMGuS project was set up in order to identify molecular signatures that might help to stratify patients who are susceptible to curative treatment of prostate cancer. For this purpose several –omics techniques are combined and their knowledge is integrated. This data integration is done using a data warehouse approach, called life science integrative data warehouse (LINDA)[104]. The integrated data is later used for systems biology approaches and methods.
Several academic and commercial partners are working together in the IMGuS project, each providing their specific expertise:
1. Department of Urology – Innsbruck Medical University
(biobank, probes and phenomic data)
2. Biocrates life sciences GmbH
(metabolomics)
3. Institute of Analytical Chemistry and Radiochemistry, University of Innsbruck
(proteomics)
4. German Cancer Research Centre Heidelberg
(genomics)
5. Max Planck Institute for Molecular Genomics Berlin
(modelling)
6. University for Health Sciences Medical Informatics and Technology - UMIT
(IT infrastructure and data warehousing)
The following figure demonstrates how the project participants interact and what data sets or services are provided. The Department of Urology takes samples of the patients and provides additional phenomic information, like anamnesis or medical therapies. These samples are later processed with metabolomic, proteomic and genomic analytical approaches. Using an Electronic Data Capture system, the results of the various techniques are imported into the data warehouse provided by UMIT. Using systems biology methods, the existing data is then processed and queried in order to gain the ability to build models.
Figure 9 - The IMGuS project participants provide individual services and data sets that are captured and integrated in order to enable a systems biology approach. Graphic taken from IMGuS presentation at DILS 2007[72].
In order to enable communication and data access, a web platform was implemented, allowing all project partners to integrate and access the project data. It provides Electronic Data Capture (EDC) conforming to Good Clinical Practice (GCP) in order to upload the data gained by IMGuS-related studies. The data is then, in a later process, integrated into LINDA by using ETL tools like Talend Open Solutions[105] or native scripts.
Figure 10 - The upper figure was taken from[106] and shows a screenshot of the ad-hoc query builder
tool, which is part of the IMGuS project.
To access the integrated data, an ad-hoc query builder tool was implemented, allowing easy-to-use query generation and execution[106]. This allows users to query the database without having profound knowledge of how to compose SQL statements. The basic concept of the ad-hoc query builder is to use metadata instead of technical details that might not be understandable for a user. The result sets gained by using the ad-hoc query builder can then be used to perform data mining, statistical analysis or other KDD methods.
3.2 LINDA Repository
As mentioned above, the repository is the entity storing the information put into the data warehouse. For the IMGuS project, a database using PostgreSQL as database management system was used. In the first phase the existing repository for LINDA was redesigned in order to better fit the users' needs. The existing tables were reconsidered and re-created into a more linkage-oriented schema.
Figure 11 - The LINDA repository was redesigned in order to better fit the user needs and provide more flexibility in storing and importing data sets. Relations starting with "g_" are used to store genomic data, whereas relations having the prefix "_m" contain data coming from metabolomic approaches. By restructuring the single data records into a more fragmented form, the creation of more specialized and individual data marts was enabled.
The repository may be split into two main parts. One part contains mostly administrative information, like information about the biological source a certain sample comes from (tissue, serum, …) or information on the data source the information is stored in. The other part contains the pure measurement information.
This distinction and separation allows the repository to be easily extended in order to store data from other biomedical sources in the future.
3.2.1 Design of the imgus-etl-framework for LINDA
The existing infrastructure of the IMGuS project led to the decision to recreate the ETL-system that loads the data originating from the various sources into the data warehouse. As users were not able to upload the data on their own, the decision was made to create a framework that could later be adapted for use via a simple graphical user interface, in order to enable a more effective use of human resources. Up to this time, users had to take their data files to a member of the backend development team, who would pre-process these files manually or by using an awk[107] script, if a pre-processing step was necessary. Later the provided data file would be imported using a Talend[105] file executing a PERL[108] script. This caused delays in getting the data into the warehouse and bound the developer to a process step that could easily be automated. Re-implementing these components also had the advantage of being able to recreate the single scripts in Java, where possible, which eased the creation of a framework.
This section describes what the new ETL-framework looks like, how the newly constructed components are structured and how they interact. A description of the implementation process will be part of the next section.
3.2.2 Infrastructure
The infrastructure used for the IMGuS ETL process was divided into three sections. Each of these sections may be used by different user groups. These groups may be divided into three categories: back room developers (BRD), front room developers (FRD) and clinical scientists or biologists (CSB). BRDs are involved in providing the components needed for data import, transformation and storage, whereas FRDs develop KDD algorithms used to extract information from the repository. CSBs browse through the data sets and use the applications provided by the FRDs in order to gain new knowledge from the existing data. They might be regarded as users or consumers, as they do not directly interact with the data sets, but rather get a view on a composition and aggregation provided by the front room tool set.
This distinction also had an influence on the infrastructure, as each of these stages has a completely different degree of stability and constancy. While on the BRD level schematic changes and improvements may occur on a regular basis in order to evolve, those changes need to be tested and verified before being deployed to the FRD development stage and the CSB productive stage.
Performance is another issue, since data import processes might consume much of the bandwidth a network connection provides, and cleaning and transformation steps need CPU time if big data sets have to be processed. Considering these circumstances, the decision was made to provide three databases, one for each of the three usage levels. In order to get a real benefit out of this, deployment criteria needed to be defined.
The basic stage deploys a stable schema for storing the data, as well as the tools and applications needed to import this data from the various sources. The second stage allows developing and deploying stable KDD algorithms and tools that have been created and tested on the basic resources provided by the first layer. In the third stage, clinical scientists and biologists use the provided applications to query the integrated data in order to verify existing theories or search for new findings. This new knowledge may then later be made available to the scientific community.
The following figure shows the three stages and the deployment steps between them.
Figure 12 - Three stages of deployment could be identified: back room development, front room development and clinical science or biology. Each of these stages has its own development areas, but depends on the underlying stage to provide stable functions and correctly processed data.
3.2.3 Architecture
The new ETL framework can be regarded as consisting of two main features. One is the conversion of given file or database formats in order to fit the structure needed by the subsequent loading processes; the other is the loading itself, which stores the data in the repository.
Figure 13 - The figure above shows the main components of the new ETL-framework. The conversion and loading components are the main features the framework provides. As can easily be seen, the loading component contains Java source provided by Talend Open Studio. The repository is not part of the framework and has just been added to the graphic for better understanding.
The right part of the figure above shows the ETL-component embedded in the system environment, excluding any parts of the front room. As can be seen, data coming from external sources may be accessed in a file representation, including simple flat files and more complex XML files, or as a database, which in most cases means accessing and reading a database dump.
The loading component makes use of Java code and archives created with the Talend Open Studio (TOS) tool, which is used in order to provide a simple mechanism for importing data. These files can later be re-used in the ETL component in order to read, modify and store the data. As data transformation consumes about 70-80% of the time needed to build a data warehouse[11], the conversion and transformation steps, including the Java classes created by TOS, were the first software components to be designed and planned, in order to enable early user interaction while establishing the warehouse.
3.2.4 Interfaces
In order to make the framework easy to extend, a set of interfaces was designed, allowing custom components for import and conversion purposes to be added easily. This subsection shortly describes the basic concepts of these interfaces, their interaction and usage.
Figure 14 - Several interfaces have been designed in order to enable an easy-to-use and extensible framework for the ETL process. The figure above shows the basic interfaces and their sample implementations for a metabolomic domain.
Three main interfaces can be identified, namely IConversion, IImport and IImgus. IConversion is the main interface for all classes implementing a data transformation algorithm, whereas IImport is the parent interface of all classes providing a data-set-to-database import function. IImgus is used as the basic interface for distinguishing the various biomedical domains (metabolomics, genomics, phenomics, …) that are part of the IMGuS data warehouse.
The interface IConversion provides the method:
public void doConversion(InputStream is, OutputStream os, IConversionType ict);
This method allows the simple conversion of a java.io.InputStream into a java.io.OutputStream, whereby the kind of conversion to be applied to the input is determined by an implementation of the IConversionType interface.
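A minimal implementation of this interface might look like the following sketch. Only the doConversion signature is taken from the framework; the concrete conversion logic (upper-casing each line) and the class name are invented for illustration and do not correspond to an actual framework component.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

// Illustrative IConversion implementation; the conversion logic is invented.
public class UpperCaseConversion implements IConversion {

    public void doConversion(InputStream is, OutputStream os, IConversionType ict) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(is));
             PrintWriter out = new PrintWriter(new OutputStreamWriter(os))) {
            String line;
            while ((line = in.readLine()) != null) {
                // A real implementation would dispatch on the given IConversionType;
                // here every line is simply written back in upper case.
                out.println(line.toUpperCase());
            }
        } catch (IOException e) {
            throw new RuntimeException("conversion failed", e);
        }
    }
}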
In order to provide a standard implementation of this, some more classes were then introduced:
Figure 15 - In order to enable a standardized simplified use of the interfaces used for conversion two more
classes are provided. ConversionImpl, which is a standard class for data conversion, and ConversionFile,
which allows simple file conversion by providing the convert(File in, File out, IConversionType ict)
method with two file parameters.
These two additional classes allow a standardized usage of the conversion methods, reducing the need to implement a separate conversion implementation for every data type. A user can simply make use of an input file by writing his individual conversion type and providing this class with these two parameters, as well as his desired output file.
A similar approach was taken for the import tasks, where, starting from the main interface IImport, several other interfaces and classes were created in order to reach a high level of possible reuse.
Figure 16 - For the import tasks several interfaces and classes were designed and implemented. The figure above shows one example of the use of reusable classes for importing from CSV files.
3.2.5 Import and Mapping components
As the data provided by the various sources exhibited every aspect of the data heterogeneity mentioned above, the import and mapping components turned out to be a non-trivial task. In order to enable the import of these data, which vary in representational aspects and content, a solution using Talend Open Studio jobs and Java classes was created.
Figure 17 - The import and mapping components can be divided into three main layers (reading, mapping, writing); the layers of data providing and storing were added to the figure above for readability.
As shown in the figure above, three main layers were identified (a small sketch of the mapping layer follows this list):
- Reading: Here the data files are read in, and their underlying schema (normally provided as the heading line of a CSV file) is used to create an input stream. As files within the same biomedical domain may differ in their representation, it was necessary to create a reading component for every single file type that should later be loaded into the data warehouse.
- Mapping: As the above-mentioned file schemas would only fit the corresponding database schema in a minority of cases, the mapping layer was introduced in order to make the incoming data file stream compatible with the database table schema it would later be stored in.
- Writing: After the data stream coming from the input file has been mapped to the corresponding database table schema, it can be written directly to the repository, using the component fitting the target database schema.
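A possible shape of the mapping layer is sketched below. The CSV headers, target column names and class names are invented examples and do not reflect the actual LINDA schema.

import java.util.HashMap;
import java.util.Map;

// Illustrative header-to-column mapping for the mapping layer (all names are invented).
public class MappingLayerSketch {

    // Maps a CSV header as delivered by a partner to the repository column it feeds.
    static final Map<String, String> HEADER_TO_COLUMN = new HashMap<String, String>();
    static {
        HEADER_TO_COLUMN.put("Sample ID", "sample_id");
        HEADER_TO_COLUMN.put("Material", "biological_source");
        HEADER_TO_COLUMN.put("Measurement", "measurement_value");
    }

    // Translates one parsed CSV row (header -> value) into a row keyed by database columns.
    static Map<String, String> mapRow(Map<String, String> csvRow) {
        Map<String, String> dbRow = new HashMap<String, String>();
        for (Map.Entry<String, String> cell : csvRow.entrySet()) {
            String column = HEADER_TO_COLUMN.get(cell.getKey());
            if (column != null) {               // headers without a mapping are dropped
                dbRow.put(column, cell.getValue());
            }
        }
        return dbRow;
    }
}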
3.3 Implementation of the imgus-etl-framework for LINDA
The following section describes some issues that came up during the implementation phase of the imgus-etl-framework. In this phase of the project not all information from the several biomedical domains could be integrated completely. But to prove the correct functioning of the several components, the decision was made that it would be sufficient to fully implement the complete functionality for a subset of these domains.
One of the first targets to be achieved was to enable users to import files on their own, since up to this time every file import had to be done by one of the data warehouse developers, which would cost enormous resources in terms of time and money. So the decision was taken to enable a user to store his data into the repository using various tools. One task that is described here separately from the file import was to convert the given files into a format which could later be read and parsed by the import components.
3.3.1 File Conversion
Some of the files provided by the project partners were given in a format that could not directly be read by the importing components. In order to fix this issue, several conversion components had to be implemented.
3.3.1.1 Horizontal into Vertical Representation
One of the major disadvantages was the fact that most of the data provided by external sources was distributed in a horizontal way (figure below). This form of representation caused several problems with the usage of the ETL tool, so a new vertical representation had to be introduced.
Figure 18 - Some data was presented in a horizontal way, meaning that after some columns defining the data set, several features were represented. The sample above shows a sample data set from a metabolomic approach. "NA" stands for "Not Available", which means that the value for this certain measurement could not be retrieved or stored in the data file.
In order to get a vertical representation of the data (figure below), a small application was written in Java, including an algorithm that transforms the data into the desired form of representation.
Figure 19 - After converting the horizontal data set, two new fields can be seen, as they represent the newly established column headers.
In order to enable efficient pre-processing, several data cleansing operations, like replacing "NA" or "null" values, were included in this entity. So a user may already start to clean the provided data in this early step; a minimal sketch of such a conversion follows below.
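The following sketch shows the core idea of such a horizontal-to-vertical conversion, including the removal of "NA" and "null" values. Identifier handling, the example data and the class name are simplified assumptions, so the actual framework component may differ.

import java.util.ArrayList;
import java.util.List;

// Illustrative horizontal-to-vertical (wide-to-long) conversion with "NA" cleansing.
public class WideToLongSketch {

    // header: identifying columns first, then one column per measured feature
    // rows:   one line per sample
    static List<String[]> toVertical(String[] header, List<String[]> rows, int idColumns) {
        List<String[]> vertical = new ArrayList<String[]>();
        for (String[] row : rows) {
            for (int col = idColumns; col < header.length; col++) {
                if ("NA".equals(row[col]) || "null".equals(row[col])) {
                    continue;                  // data cleansing: skip missing values
                }
                // output: sample id, feature name (former column header), measurement
                vertical.add(new String[] { row[0], header[col], row[col] });
            }
        }
        return vertical;
    }

    public static void main(String[] args) {
        String[] header = { "SampleID", "Group", "Alanine", "Glycine" };
        List<String[]> rows = new ArrayList<String[]>();
        rows.add(new String[] { "S01", "disease", "42.1", "NA" });

        for (String[] v : toVertical(header, rows, 2)) {
            System.out.println(v[0] + "\t" + v[1] + "\t" + v[2]);  // S01  Alanine  42.1
        }
    }
}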
3.3.2 File Import
In order to enable an easy usage of file import, the imgus-etl-framework provides import
components that can be used to import the data sets into the repository.
3.3.2.1 Talend Open Studio Import Classes
The import of the data provided in CSV format was done using Java classes designed and provided by Talend Open Studio (TOS)[105].
TOS allows an easy-to-use graphical arrangement of components typically used for data import and integration tasks, put together in entities called jobs. In an earlier stage several jobs already existed as PERL code, but as TOS allowed the creation of Java code, introduced in Talend Open Studio v2.0, the decision was made to redesign all existing PERL jobs using a Java code output.
Figure 20 - Talend Open Studio allows the simple creation of data transformation jobs. The figure above shows a sample job for the import of metabolomic data, provided as a CSV file, into a PostgreSQL database. This workflow includes reading the CSV file, mapping to the SQL schema and a logging process that allows inspecting the progress of the process.
3.3.3 Ontologies
In order to enable correct semantic connectivity between the concepts stored in the repository, an ontology section was inserted. It was almost completely taken from the BioSQL project[109], as this seemed to fit this project best.
Figure 21 - The ontology section for the IMGuS project was taken mostly from the BioSQL[109] project. It allows adding information on terms and relationships from other ontologies, as well as creating custom, question-specific entries.
The table term contains objects or concepts like "gene", "exon", "metabolite" and "measurement", as well as the wordings for relationships like "is part of", "originates from" and "is composed of". As synonyms are common in biomedical research domains, an extra table containing these synonyms, called term_synonym, was added. As terms in most cases originate from existing ontologies, the table ontology provides the possibility to store information on those external sources.
The relationships themselves are composed by combining three of the terms in the form of:
Subject . Predicate . Object
, where ‘.’ represents the operator for concatenation.
Each of these three components is part of the term relation and stored in a table called relationship. This table may also reference the ontology relation, as some of the used relationships may be taken from external ontologies.
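Populating these tables could, for example, be done with plain JDBC against the PostgreSQL repository. The column names used below (term_id, name, subject_id, predicate_id, object_id) are assumptions made for this example and may deviate from the actual BioSQL-derived schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Illustrative JDBC inserts into the term and relationship tables.
// Column names and connection details are assumptions for this example.
public class OntologyLoaderSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://localhost/linda", "etl", "secret")) {

            // Insert the three terms of one assertion: "exon", "is part of", "gene".
            try (PreparedStatement term = con.prepareStatement(
                    "INSERT INTO term (term_id, name) VALUES (?, ?)")) {
                String[][] terms = { {"1", "exon"}, {"2", "is part of"}, {"3", "gene"} };
                for (String[] t : terms) {
                    term.setInt(1, Integer.parseInt(t[0]));
                    term.setString(2, t[1]);
                    term.executeUpdate();
                }
            }

            // Store the relationship as subject.predicate.object references to the term table.
            try (PreparedStatement rel = con.prepareStatement(
                    "INSERT INTO relationship (subject_id, predicate_id, object_id) "
                  + "VALUES (?, ?, ?)")) {
                rel.setInt(1, 1);   // exon
                rel.setInt(2, 2);   // is part of
                rel.setInt(3, 3);   // gene
                rel.executeUpdate();
            }
        }
    }
}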
3.3.4 Conversion and Import
The new imgus-etl-framework was developed and tested with metabolomic data provided by Biocrates life science AG. This data contained mass spectrometric measurements for prostate cancer, including samples from tissue and serum, as part of the IMGuS project.
The functionality of the horizontal-to-vertical transformation classes could be verified on two metabolomic data sets, "ProstateSerumData" and "ProstateTissueSamples", by diffing the output against the output of the awk script used before the Java implementation. The diff showed no differences between the two files.
ProstateSerumData contained 115 columns, of which 3 were used to identify the data entry and 112 contained metabolomic information, and 319 data entries, which means that 35840 entry pairs (metabolite + measurement) had to be processed. ProstateTissueSamples contained 352 columns, 4 of them identifying, with 36 data entry lines, summing up to 12876 pairs. To process ProstateSerumData the imgus-etl-framework needed 156-297 milliseconds; to process the ProstateTissueSamples file it took 62-156 milliseconds.
data file              main headers  data headers  sum headers  data lines  ex time (min)  ex time (max)
ProstateSerumData      3             112           115          319         156 ms         297 ms
ProstateTissueSamples  4             348           352          36          62 ms          156 ms
Table 3 - In order to test the horizontal/vertical conversion component, a Java program was written that performed the conversion of each file 100 times; the maximum and minimum execution times of these runs were taken (ex time (min/max)).
4 DISCUSSION
This master thesis aimed at providing a back room environment for the IMGuS project. In order to achieve this target, the existing infrastructure was evaluated, redesigned and re-implemented where necessary. All required features and components were at least planned and designed, and for all components a proof of concept was provided by implementing them for the metabolomic domain.
4.1 Data integration
For the integration of data that is distributed in dimensions like time, space and representation, it could be shown that using a data warehouse approach can help in bringing together the information and knowledge provided by the diverse –omics techniques. For the IMGuS project, an integrated approach was designed and implemented using LINDA in combination with the EDC, the imgus-etl-framework and the query builder. By using the framework, the rate of performed updates should increase, thereby increasing the rate at which new knowledge is found, as the time between data production, data integration and data mining should decrease.
4.2 Switch to Java
Before the imgus-etl-framework was implemented, the existing ETL system was a mixture of PERL and awk scripts, which forced a developer to understand both of these languages, but had several pros, especially in terms of flexibility and speed. By making the decision to switch to Java, the creation of a "one application for all tasks" architecture was eased. The painful task of transforming the existing scripting tools into Java code fitting the imgus-etl-framework specification should be well worth the effort in terms of maintainability and extensibility.
4.3 Interface usage
The interfaces IConversion and IImport, in combination with the IConversionType interface, should provide all the stability needed to enable a high degree of reusability within the currently used -omics fields, but should prove flexible enough to be used for dealing with additional data sets, like phenomic data, that should be added to the IMGuS project in one of the next project steps.
4.4 Repository
By redesigning the LINDA repository the extensibility was improved; data coming from new biomedical data sources can now be easily added to the existing schema. The schema itself might have become a bit more complex to understand for the front room developers, but this drawback can be compensated by expanding the usage of data marts. Those data marts can easily be created from the existing repository. The usage of data marts can have advantages both in the stage of KDD development and in the stage of knowledge gain itself. KDD development could be made more efficient by using subsets of the complete data warehouse stored in a data mart, whereas a data mart could as well be used to filter and aggregate certain domains in the clinical and biological research stage.
4.5 imgus-etl-framework
It could be shown that by using a set of interfaces and combining them with intelligent technologies, which might even be provided by external tools like Talend Open Studio, an efficient, easy-to-extend framework can be created. By distinguishing between conversion and import tasks, the degree of reusability could be extended even further, allowing faster adaptation to new data sources.
Migrating the complete framework to Java allows building an all-in-one application that can easily be put into a graphical user environment in one of the next project steps. It also allows using existing biomedical applications, like BioJava[110], at a very early step.
Figure 22 - The use of data marts could help
in the front room development stage and the
clinical and biological research stage.
4.6 Deployment Stages
By introducing the three-stage deployment schema presented above, a stable environment should be enabled, as every person in research and development can rely on the stability of the tools and resources provided by the stage below. By separating back room development from front room development, the development of new methods and tools is completely independent of activities that happen in other stages. This should make the complete system more stable and allow a guided migration workflow if changes need to be propagated. Future developments will show if this separation proves its usability.
4.7 Future Work
As the IMGuS project evolves steadily, it will be necessary to adapt the imgus-etl-framework and the LINDA repository to these changes according to the new requirements. It will also be one of the next targets to add information from biological databases like KEGG[3] in order to enable the biomedical research staff to directly put their findings into the context of the existing state of the art.
An integration of literature databases should also be part of further project proceedings. It is planned to integrate information from bibliographic data sources, as PubMed alone offers more than 16.5 million medicine- and biology-related citations, originating from more than 19,000 life science journals[111]. These entries feature abstracts in most cases, and some even include links to the full text articles. Linking scientific information from this source could improve the process of knowledge discovery, since the researcher could get access to this information on the fly while performing queries and interpreting the results. A possible solution for integrating this source could be to import PubMed entries into the data warehouse, or to simply store the linkage information and provide the user with it.
In order to enable a higher degree of user interaction, the ETL framework should later be put into a graphical user interface, which should allow composing complete conversion and import processes via a simple drag-and-drop interface. This would allow users to create their own data conversion and import processes.
As several IMGuS project partners provide phenomic data, the extension of the back room in order to fit this data will be one of the next project steps, as integrating information about anamnesis, the medical history and certain medical procedures could lead to very interesting results. The existing repository will allow adapting to these changes without touching any of the existing relations. The import process can be extended and realized by using the interfaces IConversion and IImport presented above.
LIST OF FIGURES
Figure 1 - The steps performed in KDD as defined by Fayyad et al[25]
Figure 6 - A data warehouse may be considered as consisting of a back room component and a front room component. While the back room is responsible for data integration and storage, the front room has to enable access to the data. Graphic taken from IMGuS presentation at DILS 2007[72].
Figure 7 - The three main stages of the BioMediator project (as presented in Shaker et al[99])
Figure 8 - The main data types used in BioWarehouse and their possible connections, as shown in Lee et al[30]. A complete ER-diagram may be accessed at http://www.biomedcentral.com/content/supplementary/1471-2105-7-170-S1.jpeg.
Figure 9 - The IMGuS project participants provide individual services and data sets that are captured and integrated in order to enable a systems biology approach. Graphic taken from IMGuS presentation at DILS 2007[72].
Figure 10 - Screenshot of the ad-hoc query builder tool, which is part of the IMGuS project; taken from[106].
Figure 11 - The LINDA repository was redesigned in order to better fit the user needs and provide more flexibility in storing and importing data sets. Relations starting with "g_" are used to store genomic data, whereas relations having the prefix "_m" contain data coming from metabolomic approaches. By restructuring the single data records into a more fragmented form, the creation of more specialized and individual data marts was enabled.
Figure 12 - Three stages of deployment could be identified: back room development, front room development and clinical science or biology. Each of these stages has its own development areas, but depends on the underlying stage to provide stable functions and correctly processed data.
Figure 13 - The main components of the new ETL-framework. The conversion and loading components are the main features the framework provides. The loading component contains Java source provided by Talend Open Studio. The repository is not part of the framework and has just been added to the graphic for better understanding.
Figure 14 - Several interfaces have been designed in order to enable an easy-to-use and extensible framework for the ETL process. The figure shows the basic interfaces and their sample implementations for a metabolomic domain.
Figure 15 - In order to enable a standardized, simplified use of the interfaces used for conversion, two more classes are provided: ConversionImpl, which is a standard class for data conversion, and ConversionFile, which allows simple file conversion by providing the convert(File in, File out, IConversionType ict) method with two file parameters.
Figure 17 - The import and mapping components can be divided into three main layers (reading, mapping, writing); the layers of data providing and storing were added to the figure for readability.
Figure 18 - Some data was presented in a horizontal way, meaning that after some columns defining the data set, several features were represented. The sample shows a sample data set from a metabolomic approach. "NA" stands for "Not Available", which means that the value for this certain measurement could not be retrieved or stored in the data file.
Figure 19 - After converting the horizontal data set, two new fields can be seen, as they represent the newly established column headers.
Figure 21 - The ontology section for the IMGuS project was taken mostly from the BioSQL[109] project.
LIST OF TABLES
Table 1 - The data sources used by the Atlas project and their update properties, as shown in Shah et al[79]. The column in the middle shows how often the data source is updated in the Atlas data warehouse, while the column on the right shows whether the complete data source is re-imported or just the changes to the currently stored version. ........................ 29
Table 2 - An overview of some of the data integration projects described above, showing the project name and the reference in which it was first published, as well as the project's homepage and the type of integration approach (Navigational, Mediator or Data Warehouse). ...... 40
Table 3 - In order to test the horizontal/vertical conversion component, a Java program was written that performed the conversion of the file 100 times; the maximum and minimum execution times of these runs were recorded (ex time (min/max)). ............................. 55
BIBLIOGRAPHY
[1] Liu, C. L., Prapong, W., Natkunam, Y., Alizadeh, A., Montgomery, K., Gilks, C. B. & van de Rijn, M. (2002). Software Tools for High-Throughput Analysis and Archiving of Immunohistochemistry Staining Data Obtained with Tissue Microarrays. Am J Pathol, 161(5), 1557-1565.
[2] Gardner, S. P. (2005). Ontologies and semantic data integration. Drug Discovery Today, 10(14), 1001-1007.
[3] Kanehisa, M. & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucl. Acids Res., 28(1), 27-30.
[4] Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., Fitzhugh, W., Funke, R. & Morgan, M. J. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921.
[5] Quackenbush, J. (2007). Extracting biology from high-dimensional biological data. J Exp
Biol, 210(Pt 9), 1507-1517.
[6] Beek, J. H. G. M. v. (2004). Data integration and analysis for medical systems biology: Conference Reviews. Comp. Funct. Genomics, 5(2), 201-204.
[7] DOE Department of Energy (1993). Report of the Invitational DOE Workshop on Genome
Informatics.
[8] Nagarajan, R., Ahmed, M. & Phatak, A. (2004). Database Challenges in the Integration of
Biomedical Data Sets.
[9] Leser, U. & Rieger, P. (2003). Integration molekularbiologischer Daten. Datenbank-
Spektrum, 6, 56-66.
[10] Rosse, C. & Mejino, J. L. V. (2003). A reference ontology for biomedical informatics: the Foundational Model of Anatomy. Journal of Biomedical Informatics, 36(6), 478-500.
[11] Schönbach, C., Kowalski-Saunders, P. & Brusic, V. (2000). Data Warehousing in Molecular Biology. Briefings in Bioinformatics, 1(1), 190-198.
[12] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M. & Sherlock, G. (2000). Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1), 25-29.
[13] Spellman, P. T., Miller, M., Stewart, J., Troup, C., Sarkans, U., Chervitz, S., Bernhart, D., Sherlock, G., Ball, C., Lepage, M., Swiatek, M., Marks, W. L., Goncalves, J., Markel, S., Iordan, D., Shojatalab, M., Pizarro, A., White, J., Hubley, R., Deutsch, E., Senger, M., Aronow, B. J., Robinson, A., Bassett, D., Stoeckert, C. J. & Brazma, A. (2002). Design and implementation of microarray gene expression markup language (MAGE-ML). Genome
Biology, 3, research00.
[14] Almeida, J. S., Chen, C., Gorlitsky, R., Stanislaus, R., Aires-De-Sousa, M., Eleutério, P., Carriço, J., Maretzek, A., Bohn, A., Chang, A., Zhang, F., Mitra, R., Mills, G. B., Wang, X. & Deus, H. F. (2006). Data integration gets 'Sloppy'. Nature Biotechnology, 24(9), 1070-1071.
[15] Etzold, T., Ulyanov, A. & Argos, P. (1996). SRS: information retrieval system for molecular biology data banks. Methods Enzymol, 266, 114-128.
[16] Kemmeren, P., Kockelkorn, T. T. J. P., Bijma, T., Donders, R. & Holstege, F. C. P. (2005). Predicting gene function through systematic analysis and quality assessment of high-throughput data. Bioinformatics, 21(8), 1644-1652.
[17] Muller, P. Y., Janovjak, H., Miserez, A. R. & Dobbie, Z. (2002). Processing of Gene Expression Data Generated by Quantitative Real-Time RT-PCR. BioTechniques, 32(6), 2-7.
[18] Hancock, W. S., Wu, S. L., Stanley, R. R. & Gombocz, E. A. (2002). Publishing large proteome datasets: scientific policy meets emerging technologies. Trends in Biotechnology, 20(12), 39-44.
[19] Lenzerini, M. (2002). Data Integration: A Theoretical Perspective.
[20] Halevy, A., Rajaraman, A. & Ordille, J. (2006). Data integration: the teenage years. VLDB Endowment.
[21] Palsson, B. (2000). The challenges of in silico biology. Nat Biotechnol, 18(11), 1147-1150.
[22] Hernandez, T. & Kambhampati, S. (2004). Integration of biological sources: current systems and challenges ahead. SIGMOD Rec., 33(3), 51-60.
[23] Bork, P. (2000). Powers and Pitfalls in Sequence Analysis: The 70% Hurdle. Genome
Res., 10(4), 398-400.
[24] Berthold, M. R. & Hand, D. J. (2003). Intelligent Data Analysis. Springer.
[25] Fayyad, U. M., Piatetsky-Shapiro, G. & Smyth, P. (1996). From data mining to knowledge discovery: an overview., 1-34.
[26] Fayyad, U. M., Piatetsky-Shapiro, G. & Smyth, P. (1996). Knowledge Discovery and
Data Mining: Towards a Unifying Framework.
[27] Dictionary.com (2007). Definition of “logistics”.
[28] Jablonski, S., Lay, R., Meiler, C., Müller, S. & Hümmer, W. (2005). Data logistics as a means of integration in healthcare applications. New York, NY, USA: ACM Press.
[29] Leser, U. & Naumann, F. (2005). (Almost) Hands-Off Information Integration for the
Life Sciences.
[30] Lee, T., Pouliot, Y., Wagner, V., Gupta, P., Calvert, D. S., Tenenbaum, J. & Karp, P. (2006). BioWarehouse: a bioinformatics database warehouse toolkit. BMC Bioinformatics,
7(1).
[31] EBI European Bioinformatics Institute (2007). The EMBL Nucleotide Sequence
Database, statistics.
[32] Galperin, M. Y. (2007). The Molecular Biology Database Collection: 2007 update. Nucleic Acids Res, 35(Database issue).
[33] Trißl, S., Rother, K., Müller, H., Koch, I., Steinke, T., Preissner, R., Frömmel, C. & Leser, U. (2004). Columba: Multidimensional Data Integration of Protein Annotations. Lecture Notes in Computer Science, 2994, 156.
[34] Haas, L. M., Lin, E. T. & Roth, M. T. (2002). Data integration through database federation. IBM Systems Journal, 41(4), 578-596.
[35] Ibrahim, I. K. & Schwinger, W. (2001). Data Integration in Digital Libraries:
Approaches and Challenges.
[36] Perco, P., Rapberger, R., Siehs, C., Lukas, A., Oberbauer, R., Mayer, G. & Mayer, B. (2006). Transforming omics data into context: Bioinformatics on genomics and proteomics raw data. Electrophoresis, 27(13), 2659-2675.
[37] MITRE Corporation (2007). MITRE Corporation.
[38] Seligman, L. J., Rosenthal, A., Lehner, P. E. & Smith, A. (2002). Data Integration: Where Does the Time Go? IEEE Data Eng. Bull., 25(3), 3-10.
[39] Rahm, E. & Bernstein, P. A. (2001). A survey of approaches to automatic schema matching. The VLDB Journal, 10(4), 334-350.
[40] Rahm, E. & Do, H. H. (2000). Data Cleaning: Problems and Current Approaches. IEEE
Data Eng. Bull., 23(4), 3-13.
[41] Cohen-Boulakia, S., Lair, S., Stransky, N., Graziani, S., Radvanyi, F., Barillot, E. & Froidevaux, C. (2004). Selecting biomedical data sources according to user preferences. Bioinformatics, 20(1), 86-93.
[42] Gamma, E., Helm, R., Johnson, R. & Vlissides, J. (1995). Design patterns: elements of
reusable object-oriented software. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
[43] Lacroix, Z. (2002). Biological data integration: wrapping data and tools. IEEE Trans Inf
Technol Biomed, 6(2), 123-128.
[44] Xu, L. & Embley, D. W. (2004). Combining the Best of Global-as-View and Local-as-View for Data Integration.
[45] Levy, A. Y., Mendelzon, A. O. & Sagiv, Y. (1995). Answering queries using views
(extended abstract). New York, NY, USA: ACM Press.
[46] Friedman, M., Levy, A. Y. & Millstein, T. D. (1999). Navigational Plans For Data
Integration.
[47] Theodoratos, D. & Sellis, T. K. (1997). Data Warehouse Configuration.
[48] Bairoch, A. & Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucl. Acids Res., 28(1), 45-48.
[49] Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O'Donovan, C., Phan, I., Pilbout, S. & Schneider, M. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res.,
31(1), 365-370.
[50] Hamosh, A., Scott, A. F., Amberger, J., Bocchini, C., Valle, D. & McKusick, V. A. (2002). Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucl. Acids Res., 30(1), 52-55.
[51] Wong, L. (2002). Technologies for integrating biological data. Brief Bioinform, 3(4), 389-404.
[52] Heidorn, B. P., Palmer, C. L. & Wright, D. (2007). Biological information specialists for biological informatics. Journal of Biomedical Discovery and Collaboration, 2, 1+.
[53] Achard, F., Vaysseix, G. & Barillot, E. (2001). XML, bioinformatics and data integration. Bioinformatics, 17(2), 115-125.
[54] Davidson, S. B., Overton, G. C., Tannen, V. & Wong, L. (1997). BioKleisli: A Digital Library for Biomedical Researchers. Int. J. on Digital Libraries, 1(1), 36-53.
[55] Haas, L. M., Kossmann, D., Wimmers, E. L. & Yang, J. (1997). Optimizing Queries
Across Diverse Data Sources.
[56] Lindberg, D. A. B., Humphreys, B. L. & McCray, A. T. (1993). The Unified Medical Language System. Methods of Information in Medicine, 32(4), 281-291.
[57] Noy, N. F. (2004). Semantic integration: a survey of ontology-based approaches. SIGMOD Rec., 33(4), 65-70.
[58] UniProtKB/Swiss-Prot (2007). UniProtKB/Swiss-Prot Release 53.0 statistics.
[59] Wikipedia (2007). Wikipedia Genbank.
[60] Ning, Z., Cox, A. J. & Mullikin, J. C. (2001). SSAHA: A Fast Search Method for Large DNA Databases. Genome Res., 11(10), 1725-1729.
[61] Enright, A. J., Van Dongen, S. & Ouzounis, C. A. (2002). An efficient algorithm for large-scale detection of protein families. Nucl. Acids Res., 30(7), 1575-1584.
[62] Tarczy-Hornoch, P., Markey, M. K., Smith, J. A. & Hiruki, T. (2007). Bio*Medical informatics and genomic medicine: Research and training. Journal of Biomedical Informatics, 40(1), 1-4.
[63] NIH National Institutes of Health (2007). Re-engineering the Clinical Research
Enterprise.
[64] Gibson, G. (1999). What works. Data warehouse: decision support solution reduces patient admissions, saves payer millions. Health Management Technology, 20(4).
[65] Heyer, K. I. (1999). The development cycle of a pharmaceutical discovery chemi-informatics system. Medicinal Research Reviews, 19(3), 209-221.
[66] Jarke, M. (2003). Fundamentals of data warehouses. Berlin [u.a.]: Springer.
[67] Kimball, R. & Caserta, J. (2004). The data warehouse ETL toolkit. Indianapolis, Ind: Wiley.
[68] Inmon, W. H. (2002). Building the data warehouse. New York, N.Y. [u.a.]: Wiley.
[69] Moody, D. L. & Kortink, M. A. R. (2000). From enterprise models to dimensional
models: a methodology for data warehouse and data mart design.
[70] Chaudhuri, S. & Dayal, U. (1997). An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1), 65-74.
[71] Shin, B. (2003). An Exploratory Investigation of System Success Factors in Data Warehousing. J. AIS, 4.
[72] Pfeifer, B. (2007). A Life Science Data Warehouse System to enable Systems Biology in Prostate Cancer.
[73] Simitsis, A., Vassiliadis, P. & Sellis, T. (2005). Optimizing ETL Processes in Data
Warehouses. Washington, DC, USA: IEEE Computer Society.
[74] Golfarelli, M., Rizzi, S. & Cella, I. (2004). Beyond data warehousing: what's next in
business intelligence?. New York, NY, USA: ACM Press.
[75] Datta, A., Moon, B. & Thomas, H. (1998). A Case for Parallelism in Data Warehousing
and OLAP. Washington, DC, USA: IEEE Computer Society.
[76] Silva MR (2004). Bioinformatics, the Clearing-House Mechanism and the Convention on Biological Diversity. Biodiversity Informatics.
[77] Segev, A. & Fang, W. (1990). Currency-Based Updates to Distributed Materialized
Views. Washington, DC, USA: IEEE Computer Society.
[78] Wang, R. Y. & Strong, D. M. (1996). Beyond accuracy: what data quality means to data consumers. J. Manage. Inf. Syst., 12(4), 5-33.
[79] Shah, S., Huang, Y., Xu, T., Yuen, M., Ling, J. & Ouellette, B. F. F. (2005). Atlas – a data warehouse for integrative bioinformatics. BMC Bioinformatics, 6(1), 34.
[80] Buttler, D., Coleman, M., Critchlow, T., Fileto, R., Han, W., Pu, C., Rocco, D. & Xiong, L. (2002). Querying multiple bioinformatics information sources: can semantic web research help? SIGMOD Rec., 31(4), 59-64.
[81] Buccella, A., Cechich, A. & Brisaboa, N. R. (2005). Ontology-Based Data Integration. In L. C. Rivero, J. H. Doorn & V. E. Ferraggine (Eds.), Encyclopedia of Database Technologies and Applications (pp. 450-456). Idea Group.
[82] Rosenthal, A., Seligman, L. J. & Renner, S. (2004). From semantic integration to semantics management: case studies and a way forward. SIGMOD Record, 33(4), 44-50.
[83] Gruber, T. R. (1992). Ontolingua: A Mechanism to Support Portable Ontologies.
[84] Schulze-Kremer, S. (1998). Ontologies for Molecular Biology.
[85] Schulze-Kremer, S. (2002). Ontologies for molecular biology and bioinformatics. In Silico Biology, 2, 17.
[86] Smith, B., Ceusters, W., Klagges, B., Köhler, J., Kumar, A., Lomax, J., Mungall, C., Neuhaus, F., Rector, A. L. & Rosse, C. (2005). Relations in biomedical ontologies. Genome
Biol, 6(5).
[87] OBO Open Biomedical Ontologies (2007). Open Biomedical Ontologies.
[88] Dean, M. & Schreiber, G. (Eds.) (2004). OWL Web Ontology Language Reference. W3C Recommendation.
[89] World Wide Web Consortium (2007). World Wide Web Consortium.
[90] Berners-Lee, T., Hendler, J. & Lassila, O. (2001). The Semantic Web. Scientific American.
[91] Horrocks, I., Schneider, P. P. & van Harmelen, F. (2003). From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 1(1), 7-26.
[92] Ideker, T., Thorsson, V., Ranish, J. A., Christmas, R., Buhler, J., Eng, J. K., Bumgarner, R., Goodlett, D. R., Aebersold, R. & Hood, L. (2001). Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network. Science, 292(5518), 929-934.
[93] Muilu, J., Peltonen, L. & Litton, J. (2007). The federated database – a basis for biobank-based post-genome studies, integrating phenome and genome data from 600,000 twin pairs in Europe. European Journal of Human Genetics, aop(current).
[94] Haas, L. M., Schwarz, P. M., Kodali, P., Kotlar, E., Rice, J. E. & Swope, W. C. (2001). DiscoveryLink: a system for integrated access to life sciences data sources. IBM Syst. J.,
40(2), 489-511.
[95] Zdobnov, E., Lopez, R., Apweiler, R. & Etzold, T. (2002). The EBI SRS server: recent developments. Bioinformatics, 18(2), 139-148.
[96] IBM (2007). Data Joiner.
[97] Carey, M. J., Haas, L. M., Schwarz, P. M., Arya, M., Cody, W. F., Fagin, R., Flickner, M., Luniewski, A. W., Niblack, W., Petkovic, D., Thomas, J., Williams, J. H. & Wimmers, E. L. (1995). Towards heterogeneous multimedia information systems: the Garlic approach. ride,
00, 124.
[98] Donelson, L., Tarczy-Hornoch, P., Mork, P., Dolan, C., Mitchell, J., Barrier, M. & Mei, H. (2003). The BioMediator System as a Data Integration Tool to Answer Diverse Biologic Queries. Medinfo.
[99] Shaker, R., Mork, P., Brockenbrough, J. S., Donelson, L. & Tarczy-Hornoch, P. (2004). The BioMediator System as a Tool for Integrating Biologic Databases on the Web.
[100] Mork, P., Shaker, R., Halevy, A. & Tarczy-Hornoch, P. (2002). PQL: A declarative
query language over dynamic biological schemata.
[101] Shaker, R., Mork, P., Barclay, M. & Tarczy-Hornoch, P. (2002). A rule driven bidirectional translation system for remapping queries and result sets between a mediated schema and heterogeneous data sources.
[102] Mork, P., Shaker, R. & Tarczy-Hornoch, P. (2005). The Multiple Roles of Ontologies in
the BioMediator Data Integration System.
[103] Mozilla Foundation (2007). Mozilla Public License Version 1.1.
[104] Pfeifer, B., Baumgartner, C., Aschaber, J., Hanser, F., Dreiseitl, S., Modre, R., Schreier, G. & Tilg, B. (2007). A Life Science Data Warehouse System to enable Systems Biology in Prostate Cancer.
[105] Talend (2007). Talend open data solutions.
[106] Lorünser, G. (2006). Konzeption eines auf Metadaten basierenden Ad-hoc-Query Builders. Hall.
[107] Aho, A. V., Kernighan, B. W. & Weinberger, P. J. (1987). The AWK programming
language. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
[108] Wall, L. (1987). Perl – Practical Extraction and Report Language.
[109] Open Bioinformatics Foundation (2007). BioSQL project.
[110] Pocock, M., Down, T. & Hubbard, T. (2000). BioJava: open source components for bioinformatics. SIGBIO Newsl., 20(2), 10-12.
[111] Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church, D. M., DiCuccio, M., Edgar, R., Federhen, S., Geer, L. Y., Kapustin, Y., Khovayko, O., Landsman, D., Lipman, D. J., Madden, T. L., Maglott, D. R., Ostell, J., Miller, V., Pruitt, K. D., Schuler, G. D., Sequeira, E., Sherry, S. T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R. L. & Tatusova, T. A. (2007). Database resources of the National Center for Biotechnology Information. Nucl. Acids Res., 35(suppl_1), D5-12.
CURRICULUM VITAE
Personal data
Name: Karl Kugler
Address: Metzentaler 21, 6094 Axams
Date of birth: 18 December 1981, Innsbruck
Nationality: Austrian
Education
10/2002 – 10/2005 University of Health Sciences, Medical Informatics and Technology; Innsbruck, Austria – Bachelor of Science in Biomedical Informatics (B.Sc.)
09/1993 – 06/2001 Bundesgymnasium Sillgasse; Innsbruck, Austria – General University-Level Graduation
09/1992 – 07/1993 Gymnasium der Abtei Schlierbach; Schlierbach, Austria
09/1988 – 07/1992 Volksschule; Kematen/Krems, Austria – Primary School
Practical experience
1998 – 1999 Hotline and Customer Services at Modern Business Systems; Innsbruck
2000 – 2003 Hotline and Customer Services at k2-design edv-systeme; Axams
03/2003 -10/2003
Project participation: “Data Mining in Clinical, Genomic, Proteomic, Metabolic and Medical Image Databases“ at the Department of Database Systems; UMIT
08/2003 Internship at the Institut für Klinische Chemie und Pathobiochemie am Klinikum Rechts der Isar der TU München; Munich, Germany
02/2004 – 01/2005
Project participation: “Finding and Calling Webservices using Axis” at the Institute for Information Systems; UMIT
09/2004 – 10/2004
Internship at Biocrates life sciences GmbH; Innsbruck
09/2004 – 02/2005
Trainer and first level support for SAP IS-H at the TILAK; Innsbruck
since 03/2005
Bioinformatics Department at Biocrates life sciences GmbH; Innsbruck
Others
since 10/2001 Member of the Austrian Red Cross; Innsbruck – Emergency Medical Technician, on-scene commander, and ongoing administrative support for voluntary members
EXPRESSION OF THANKS
Some lines to say “thank you” should be placed in here, in order to honour all the people who accompanied me through the years of my bachelor and master courses. It was not always an easy time, but I cannot say that I regret any of the lessons learned, and most of it was a great time I would not want to miss.
First of all I want to thank Rektor Univ.-Prof. Dr. Bernhard Tilg and all his staff at the Institute for Biomedical Engineering for supporting me in writing this master thesis. Rektor Tilg would lend me an ear whenever something needed to be discussed. The same goes for Univ.-Prof. Dr. Armin Graber, who proved that he is much more than just a great boss.
Special thanks go out to Dipl.-Ing. Dr. Bernhard Pfeifer, who was great to work with, took all the time needed to support me and showed that working in an academic environment can be a whole lot of fun as well. I hope that many more students will have the pleasure of either being taught by him or even getting the chance to work with him. They will learn not only from his intelligence and problem-solving skills, but even more from his warm-hearted manner.
Further expressions of thanks go out to my friends and colleagues at the Austrian Red Cross,
who would cheer me up, whenever it seemed like there was an unsolvable problem ahead.
But there are two people who deserve most of all to hear a “thank you” after these five years, as they helped me out in times of financial shortcomings, pushed me ahead in times of lacking motivation and were always there for me: Thank you mum, thank you dad!
STATEMENT
I hereby declare that I have completed this work independently and that I have used no aids other than those mentioned.
Hall, …………………………………….
……………………………………
Karl Kugler