State of the Nation in Data Integration for Bioinformatics

Carole Goble and Robert Stevens

Presented by: Daya Wimalasuriya

2

Outline

Introduction

Why integration of bioinformatics sources is especially hard

Use of traditional data integration techniques in bioinformatics

Use of new integration techniques

Where to go from here?

3

Introduction

As a discipline, bioinformatics is based on a range of diverse, complex and distributed data resources (900 or more).

Because of the existence of such a large number of sources, data integration is critically important in this field.

Integration of these sources is a challenging task.

It is said that bioinformaticians should be a little ashamed of this situation.

It has been stated that a “Bioinformatics Nation” should be developed from the current set of competing “Princely States”.

4

Introduction (contd.)

Why so many sources?

  The Web makes it (too) easy to publish data.

  Being a resource provider is one way to make a reputation.

  Each new sub-discipline develops its own data representations skewed to its biases.

  Each type of data has many resources which have many overlaps.

  In comparison, areas such as particle physics have few centralized data resources.

5

Introduction (contd.)

Additional problems of these sources

  Each group is highly autonomous and routinely creates different data resources and designs.

  Different interfaces are provided by different groups (e.g., “flat files”, XML, APIs).

  Users are often independent and decoupled from data providers.

  Many groups don’t have the expertise or the resources needed to survive (only about 18% have a sustained future).

6

Special Challenges of DI in BI

When compared with fields such as astronomy and particle physics, bioinformatics data are not very large.

The problem is the complexity of data, arising from several factors:

  describing a sample and its originating context

  diversity of sources of a sample

  large number of inter-links, etc.

7

How to Handle the Complexity of Data?

Common, shared identities and names:

  “A biologist would rather share their toothbrush than their gene name”

Shared semantics

  Ontologies can help, but political and theoretical wrangling hinders their development.

Shared and stable access mechanisms

  In 2007, BioMART altered its interfaces four times, breaking any client software that used them.
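The need for shared identities and names can be illustrated with a minimal sketch. The synonym table, the record fields and the function names below are assumptions made for illustration, not material from the talk: two hypothetical sources refer to the same gene under different names, and each name is mapped to one canonical identifier before the records are merged.

```python
# Hypothetical synonym table: every alias maps to one canonical identifier.
# In practice such a table would have to come from an agreed naming authority.
SYNONYMS = {
    "TP53": "TP53",
    "P53": "TP53",
    "LFS1": "TP53",
    "BRCA1": "BRCA1",
    "RNF53": "BRCA1",
}

def canonical(name):
    """Return the canonical identifier for a gene name, or None if unknown."""
    return SYNONYMS.get(name.strip().upper())

def merge_records(*sources):
    """Merge records from several sources, keyed by canonical identifier."""
    merged = {}
    unresolved = []
    for source in sources:
        for record in source:
            key = canonical(record["name"])
            if key is None:
                # No shared name: the record cannot be integrated safely.
                unresolved.append(record["name"])
                continue
            fields = {k: v for k, v in record.items() if k != "name"}
            merged.setdefault(key, {}).update(fields)
    return merged, unresolved

# Two illustrative sources that use different names for the same gene.
source_a = [{"name": "p53", "chromosome": "17"}]
source_b = [
    {"name": "LFS1", "function": "tumour suppressor"},
    {"name": "XYZ9", "function": "unknown"},
]

merged, unresolved = merge_records(source_a, source_b)
print(merged)
print(unresolved)
```

Records whose names are missing from the table are set aside rather than guessed at, which is where most of the manual reconciliation effort would go.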

8

How to Handle the Complexity of Data? (contd.)

Adhering to standards

  a blue collar science

Explicitly stating collection policies and governance

Balancing “curation” with ease of use

These issues have to be handled while keeping the freedom of rapid innovation.

9

Different Data Integration Techniques used with Bioinformatics Sources

10

Traditional Data Integration Techniques

Link Integration (Search):

  Directly cross-references a data entry in a data source with another entry in another data source.

  Implemented using hyperlinks.

  Widely used by bioinformatics systems such as SRS, Entrez and Integr8.

  This technique actually represents interlinks created in a haphazard manner.

  Vulnerable to name changes, updates, etc.
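A minimal sketch of link integration and of its vulnerability to name changes. The two in-memory "databases", their identifiers and the function names are all made up for illustration; a real system would follow hyperlinks between web resources.

```python
# Two toy databases; entries in one cross-reference entries in the other
# by identifier, much as a hyperlink would. All identifiers are invented.
protein_db = {
    "PROT:001": {"name": "example kinase", "xref": "GENE:100"},
    "PROT:002": {"name": "example receptor", "xref": "GENE:200"},
}
gene_db = {
    "GENE:100": {"symbol": "EXK1"},
    # The provider renamed GENE:200 to GENE:201, so PROT:002's link dangles.
    "GENE:201": {"symbol": "EXR1"},
}

def follow_link(entry_id):
    """Follow the cross-reference of a protein entry into the gene database."""
    target = protein_db[entry_id]["xref"]
    return gene_db.get(target)  # None when the link is broken

def broken_links():
    """List protein entries whose cross-reference no longer resolves."""
    return sorted(
        pid for pid, entry in protein_db.items() if entry["xref"] not in gene_db
    )

print(follow_link("PROT:001"))
print(broken_links())
```

Nothing in the linking source is notified when the target changes, so broken links are only found by checking for them.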

11

Traditional Data Integration Techniques (contd.)

Data Warehousing (Materialization):

  Data are extracted, cleaned and stored in a separate, integrated database.

  Some people believe that this is the only data integration technique that actually works.

  Requires a pre-determined encompassing model.

  Involves a high initial cost as well as high maintenance costs; hard to change; commonly decoupled from data providers.

  Often results in “data mortuaries”.
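The extract, clean and store steps of warehousing can be sketched with Python's built-in sqlite3 module. The source records, field names and the single-table model are assumptions for illustration; the point is that the integrated model (here the gene table) has to be fixed before any data are loaded.

```python
import sqlite3

# Hypothetical extracts from two sources with different field names and units.
source_a = [{"id": "G1", "symbol": " abc1 ", "length_bp": 1200}]
source_b = [
    {"gene": "G2", "name": "DEF2", "length_kb": 3.5},
    {"gene": "G1", "name": "ABC1", "length_kb": 1.2},  # overlaps with source A
]

def clean_a(row):
    """Map a source A record onto the warehouse model."""
    return (row["id"], row["symbol"].strip().upper(), int(row["length_bp"]))

def clean_b(row):
    """Map a source B record onto the warehouse model, converting kb to bp."""
    length_bp = int(round(row["length_kb"] * 1000))
    return (row["gene"], row["name"].strip().upper(), length_bp)

def build_warehouse():
    """Create the integrated database and load the cleaned records into it."""
    conn = sqlite3.connect(":memory:")
    # The pre-determined encompassing model.
    conn.execute(
        "CREATE TABLE gene (id TEXT PRIMARY KEY, symbol TEXT, length_bp INTEGER)"
    )
    rows = [clean_a(r) for r in source_a] + [clean_b(r) for r in source_b]
    # INSERT OR IGNORE keeps the first copy of an overlapping entry.
    conn.executemany("INSERT OR IGNORE INTO gene VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn

conn = build_warehouse()
result = conn.execute(
    "SELECT id, symbol, length_bp FROM gene ORDER BY id"
).fetchall()
print(result)
```

The warehouse is a copy: once loaded, it no longer sees updates in the sources until the whole process is run again, which is one way a warehouse drifts into a “data mortuary”.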

12

Traditional Data Integration Techniques (contd.)

View Integration (Mediation):

  Data remain in the source databases, but a virtual warehouse is constructed using mappings.

  Uses models such as Global-As-View (GAV) and Local-As-View (LAV).

  Popular among database theorists and vendors.

  Developing a global model is costly, mappings are often brittle, and the result is a complex environment.

  Automated processes are necessary to make this method practically useful.
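A minimal sketch of the Global-As-View idea, under the assumption of two tiny in-memory sources with invented schemas. The global relation "gene" is defined as a set of mapping functions over the sources, and a query on the global schema is answered by unfolding it over those mappings at query time.

```python
# Two autonomous sources with their own schemas (contents are made up).
source_a = [{"acc": "A1", "gene_symbol": "ABC1", "species": "human"}]
source_b = [{"identifier": "B7", "label": "DEF2", "taxon": "mouse"}]

# Global-As-View: each global relation is defined as a query over the sources.
# These mappings are the brittle part; a schema change in a source breaks them.
def gene_from_a():
    for r in source_a:
        yield {"id": r["acc"], "symbol": r["gene_symbol"], "organism": r["species"]}

def gene_from_b():
    for r in source_b:
        yield {"id": r["identifier"], "symbol": r["label"], "organism": r["taxon"]}

GLOBAL_VIEWS = {"gene": [gene_from_a, gene_from_b]}

def query(relation, **conditions):
    """Answer a query on the global schema by unfolding it over the sources."""
    results = []
    for view in GLOBAL_VIEWS[relation]:
        for row in view():
            if all(row.get(k) == v for k, v in conditions.items()):
                results.append(row)
    return results

print(query("gene", organism="mouse"))

# Data stay in the sources, so an update is visible without reloading anything.
source_b.append({"identifier": "B8", "label": "GHI3", "taxon": "mouse"})
print(len(query("gene", organism="mouse")))
```

In Local-As-View the direction is reversed: each source is described as a view over the global schema, which makes adding sources easier but query answering harder.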

13

Traditional Data Integration Techniques (contd.)

Integration Application (Ad hoc methods):

  Applications specifically designed for a particular integration task.

  Generally uses a combination of integration techniques and provides more options for the user.

  Avoids the “Big I Challenge”.

  Workflows coordinate a transient workflow between data services and analytical tools and expose the integration methods.
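A workflow of this kind can be sketched as a list of steps whose intermediate results are recorded. The two step functions are stand-ins invented for illustration (a data service returning a made-up sequence and a simple analysis tool); the recorded trace is one simple way of exposing how the integration was done.

```python
def fetch_sequence(gene_id):
    """Stand-in for a data service call; returns a made-up DNA sequence."""
    return {"G1": "ATGGCC", "G2": "ATGTTTAAA"}[gene_id]

def gc_content(seq):
    """Stand-in for an analytical tool: fraction of G and C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def run_workflow(steps, value):
    """Pass a value through each step in order and record what was done."""
    trace = []
    for step in steps:
        value = step(value)
        trace.append((step.__name__, value))
    return value, trace

result, trace = run_workflow([fetch_sequence, gc_content], "G1")
print(result)
for name, output in trace:
    print(name, "->", output)
```

Nothing is stored after the run, which matches the transient nature of the integration; only the trace documents what was combined.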

14

New Data Integration Techniques

Service Oriented Architectures:

  Include technologies such as CORBA and Web Services.

  Data integration has to be achieved by “plumbing” these services.

  CORBA is generally considered too heavy despite its technical sophistication.

  SOAP-based web services have shown promise but have problems such as poor documentation.

  These services are necessary to do away with the widespread practice of “screen scraping”.
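The difference between screen scraping and calling a service can be sketched as follows. Both payloads are invented, and JSON is used here only as a convenient stand-in for a structured service response (the slide discusses SOAP, which is XML-based).

```python
import json
import re

# The same record as it might appear on a web page and from a service.
# Both payloads are invented for illustration.
html_v1 = "<tr><td>Gene</td><td><b>ABC1</b></td></tr>"
html_v2 = "<tr><td>Gene</td><td><span class='sym'>ABC1</span></td></tr>"  # redesign
service_response = '{"gene": {"symbol": "ABC1", "organism": "human"}}'

def scrape_symbol(html):
    """Screen scraping: depends on the exact markup of the page."""
    match = re.search(r"<b>(\w+)</b>", html)
    return match.group(1) if match else None

def service_symbol(payload):
    """Service call: depends only on the published field names."""
    return json.loads(payload)["gene"]["symbol"]

print(scrape_symbol(html_v1))   # works on the old page layout
print(scrape_symbol(html_v2))   # fails silently after the redesign
print(service_symbol(service_response))
```

The scraper breaks on a purely cosmetic change to the page, whereas the service client breaks only if the provider changes the interface itself.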

15

New Data Integration Techniques (contd.)

Mashups:

  A Web 2.0 idea based on taking data from more than one web-based resource to build a new web-based application (e.g., combining a feed of earthquake measurements with Google Maps).

  Delivered through the Web, open and light.

  Platforms such as Microsoft Popfly and Yahoo! Pipes are already available.

  Has been used in applications such as tracking the spread of avian flu.

16

Mashups (contd.)

Emphasizes the role of the user in creating a specific, light-touch, on-demand integration.

Relies on the existence of APIs and light-weight tools for development.

Preferred by bioinformaticians over heavier general engineering solutions.

Just as vulnerable as other integration techniques to identity clashes and concept ambiguities.
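A minimal sketch of a mashup in the spirit of the earthquake example: a feed of measurements is joined with a second resource that supplies coordinates, producing markers that a map widget could display. The feed contents, place names and function names are invented; a real mashup would fetch both resources over the Web.

```python
import json

# Resource 1: a feed of earthquake measurements (made-up data).
quake_feed = json.loads(
    '[{"place": "Town A", "magnitude": 5.1},'
    ' {"place": "Town B", "magnitude": 3.2},'
    ' {"place": "Town C", "magnitude": 6.0}]'
)

# Resource 2: a gazetteer mapping place names to (latitude, longitude).
gazetteer = {"Town A": (10.0, 20.0), "Town B": (-5.5, 30.25)}

def build_markers(feed, places, min_magnitude=0.0):
    """Join the two resources on place name and return map markers."""
    markers = []
    for event in feed:
        coords = places.get(event["place"])
        if coords is None or event["magnitude"] < min_magnitude:
            continue  # unknown place name, or too small to show
        lat, lon = coords
        label = f'{event["place"]}: M{event["magnitude"]}'
        markers.append({"lat": lat, "lon": lon, "label": label})
    return markers

markers = build_markers(quake_feed, gazetteer, min_magnitude=4.0)
print(markers)
```

The strongest event, at "Town C", is dropped because the gazetteer does not know that name, a small instance of the identity clashes mentioned above.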

17

New Data Integration Techniques (contd.)

Semantic Web applications are also expected to help the integration of bioinformatics sources.

They are called “smashups”.

Smashups should use ontologies and support reasoning.

The communication mechanisms may be the same as those used by mashups (e.g., AJAX on the client side).

18

Architecture of Smashups

19

Requirements of Smashups

Simple and stable APIs that can be used by third parties

Publishing data as RDF and supporting SPARQL endpoints (or supporting conversion to these formats from databases)

Clarifying the semantics

Tackling the problem of object reconciliation.
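The idea of publishing data as RDF and querying it can be sketched without any external library. The code below is a toy stand-in, not SPARQL itself: data are held as (subject, predicate, object) triples, and a query is a list of triple patterns whose variables start with "?". The "ex:" names and all data values are invented for illustration.

```python
# Data from two hypothetical sources, published as
# (subject, predicate, object) triples.
triples = {
    ("ex:gene1", "rdf:type", "ex:Gene"),
    ("ex:gene1", "ex:symbol", "ABC1"),
    ("ex:gene1", "ex:encodes", "ex:prot9"),
    ("ex:prot9", "rdf:type", "ex:Protein"),
    ("ex:prot9", "ex:name", "example kinase"),
}

def match(pattern, bindings=None):
    """Yield variable bindings for one triple pattern."""
    bindings = bindings or {}
    for triple in triples:
        new = dict(bindings)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield new

def select(patterns):
    """Join several triple patterns, like a basic graph pattern in SPARQL."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [b for s in solutions for b in match(pattern, s)]
    return solutions

# Which genes encode which named proteins?
rows = select([
    ("?g", "rdf:type", "ex:Gene"),
    ("?g", "ex:encodes", "?p"),
    ("?p", "ex:name", "?name"),
])
print(rows)
```

The join only works because both sources use the same identifier, ex:prot9, for the protein, which is exactly the object reconciliation problem listed above.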

20

Where to go from here?

Better naming standards and the handling of object reconciliation are essential for the success of any data integration technique in bioinformatics.

Most promising techniques:

  Web Services

  Mashups (Web 2.0 Applications)

  Smashups (Semantic Web Applications)

21

Thank You!

Questions?
