The Global Bio Grid

Andrew Grimshaw

University of Virginia

Virginia Center for Grid Research

January 2006

• Why Bio Grids?

• Grid Basics

• The Global Bio Grid

In ten years the world will be very different.

Think back ten years.

• No web

• Wide-spread internet was new

• Human Genome Project still far from completion

• Science (biology) done primarily in individual labs

Today

• Billions a year in e-commerce

• Internet everywhere

• Broadband to your home

• Wireless becoming pervasive

• Pervasive devices are proliferating (e.g., motes)

• Sequencing of organisms is a daily event; bioinformatics is hitting the mainstream

Tomorrow

• $1000/sequence for humans – becomes standard clinical practice

• “Biology is becoming an information science” (Large Scale Biomedical Science: Exploring Strategies for Future Research, Institute of Medicine, National Research Council, 2003)

• Global interconnected networks – grids

• Provide transparent, secure access to data, applications, and on-demand compute.

• Research using not just your data, but all trusted data, not just your applications, but any trusted application.

• Implications for progress are significant.

There are a number of “catches”

• So much data!

• So many organizations with so little trust!

• So much complexity!

An IT guy’s view

• Data is all over, in many different forms, with lots of different policies

• Need to get the right data in the right place at the right time

• Ontology problem – how do we compare and integrate the databases?

• Need to understand semantics and transform data automatically

• Semantics

• Knowledge discovery – “mining”

This is where grids enter the picture

(we do the plumbing)

Some lessons learned

• 10+ years in academic and commercial grids

• All/most problems are not technical

• Users don’t want change!

• Too many grids are technology-centric

• Must keep “activation energy” low

• Need a user-centric approach

• There are at least four classes of users

• Wide variance in computational savvy

A grid enables users to collaborate securely by sharing processing, applications, work flows and processes, and data across heterogeneous systems and administrative domains for collaboration, faster application execution, and easier access to data.

What is a Grid?

A grid is all about gathering together resources and making them accessible to users and applications.

The emphasis is on secure access to a wide variety of resources.

Characteristics of Grid systems

• Numerous resources

• Ownership by mutually distrustful organizations & individuals

• Potentially faulty resources

• Different security requirements & policies required

• Resources are heterogeneous

• Geographically separated

• Different resource management policies

• Connected by heterogeneous, multi-level networks


What grids are not

• The solution to all problems

• Clusters of machines

• SETI@home

• Any one particular technology

User’s view

[Figure: users at Site 0 through Site 3, with clusters and HPSS storage, connected through the grid. Through the grid, users run programs, access data, collaborate, and provide shared services.]

Grid Computing Scenarios

Desktop Cycle Aggregation

• Limited acceptance in commercial enterprises

Cluster Grids

• Single owner, department, project

• Single domain, file system

• LAN connection

Campus/Enterprise Grids

• Multiple owners, domains

• Multiple file systems

• WAN connection

Partner Grids

• Multiple owners, sites, domains

• Multiple file systems

• Internet connectivity

Legion Grid Software – Compute and Data Grid

Standards

• Global Grid Forum – ggf.org

• OGSA – Open Grid Services Architecture

• Web-Services based IPC

• WSRF and possibly others

• OGSA-BES – Basic Execution Service

• OGSA-ByteIO – file I/O

• WS-Naming – abstract name to EPR

• RNS-lite – Resource Name Space
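The naming items in the list above (an abstract name mapped to an EPR, and a resource name space) can be illustrated with a toy directory. This is a minimal sketch of the idea only, not the WS-Naming or RNS interfaces; the class names, paths, and URLs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointReference:
    """Hypothetical stand-in for an endpoint reference (EPR)."""
    address: str        # where the service can currently be reached
    abstract_name: str  # location-independent identity of the resource


class ResourceNamespace:
    """Toy directory that maps human-readable paths to endpoint references."""

    def __init__(self):
        self._entries = {}

    def bind(self, path, epr):
        # Binding an existing path again replaces the old endpoint.
        self._entries[path] = epr

    def lookup(self, path):
        try:
            return self._entries[path]
        except KeyError:
            raise LookupError(f"no resource bound at {path!r}") from None

    def list_under(self, prefix):
        prefix = prefix.rstrip("/") + "/"
        return sorted(p for p in self._entries if p.startswith(prefix))


ns = ResourceNamespace()
ns.bind("/bio/db/uniprot",
        EndpointReference("https://site-a.example/uniprot", "urn:example:uniprot"))
ns.bind("/bio/compute/cluster1",
        EndpointReference("https://site-b.example/bes", "urn:example:cluster1"))

# The database moves to another site; callers keep using the same path.
ns.bind("/bio/db/uniprot",
        EndpointReference("https://site-c.example/uniprot", "urn:example:uniprot"))
```

In this sketch, the point of the indirection is that users and applications hold on to the path or abstract name, while the address behind it is free to change.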

The Global Bio Grid

• Federated access to multiple

  • Data sources

    • Public databases

    • Commercial databases

    • In-house databases, annotations, etc.

  • Application suites (including processes and workflows)

  • Compute resources

• Shared among collaborative research teams

  • Multiple research locations

  • Virtual organizations

• Built on evolving computing standards (GGF, I3C, WS-*)

GBG concept

Global Bio Grid

• Datagrid using Avaki DG technology

  • Working on ADG available free for “.edu”

  • UVA, NCBIO, U-Texas, Texas Tech

  • Already operational

  • Flat file and relational

  • Working on an OGSA-compliant implementation

• Compute grid at UVA on-line

  • 64 dual-processor Opterons available

  • Sunfires

  • Hundreds of Windows machines

  • Legion 1.8 based – moving towards OGSA-compliant services

• Applications

  • Biomarker

  • Searching PubMed

  • Hospital info integration

Three resource classes illustrate the Grid-effect

• Data

• Processing

• Applications

Data

• Suppose you have collaborators with critical databases (clinical, protein, other) that you need to use.

• You use a number of databases that change on a regular basis.

• You want to “mine” heterogeneous data sets (relational, flat-file, XML, …) in different locations – say in a hospital.

• You want to produce, consume, or share derivative data products, e.g., the result of a set of joins and data transformation steps.

• This applies to business data (BI/EII) as well as life science data.
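A derivative data product of the kind described above can be as simple as a join between a relational source and a flat file. The following is a minimal sketch using only the Python standard library; the schema, the file contents, and every identifier in it are hypothetical and exist only to show the shape of the operation.

```python
import csv
import io
import sqlite3

# Relational source (hypothetical schema and rows).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE samples (sample_id TEXT, protein_id TEXT)")
db.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("s1", "P1"), ("s2", "P2"), ("s3", "P1")],
)

# Flat-file source (hypothetical contents), held in memory for the example.
flat_file = io.StringIO("protein_id,annotation\nP1,kinase\nP2,transporter\n")
annotations = {
    row["protein_id"]: row["annotation"] for row in csv.DictReader(flat_file)
}

# Derivative product: each sample joined with the annotation of its protein.
derived = [
    (sample_id, protein_id, annotations.get(protein_id))
    for sample_id, protein_id in db.execute(
        "SELECT sample_id, protein_id FROM samples ORDER BY sample_id"
    )
]
```

In a grid setting the two sources would live at different institutions; the join logic stays the same, which is why transparent access to the underlying data matters.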

[Figure: a research institution (Biology, Biochemistry, APP 1, APP 2) and partner institutions holding SEQ_1, SEQ_2, and SEQ_3, joined with public databases (PDB, NCBI, EMBL) in a shared Data layer.]

DataGrid: Unifying fabric for data access

• Transparent access to multiple DBs

• Multiple domains

• Highly secure, flexible access control

• Automatic cache management and coherence
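Two of the DataGrid properties listed above, transparent access to multiple databases and automatic cache coherence, can be sketched together. This is an illustration under simplifying assumptions, not how Avaki or any real data grid works: `Source`, `DataGrid`, the version-stamp scheme, and the sample record are all hypothetical, and a real system would avoid contacting the source on every read.

```python
class Source:
    """Hypothetical remote data source that keeps a version stamp per record."""

    def __init__(self, records):
        self._records = dict(records)
        self._versions = {key: 1 for key in self._records}
        self.fetches = 0  # counts full record transfers

    def version(self, key):
        return self._versions[key]

    def fetch(self, key):
        self.fetches += 1
        return self._records[key]

    def update(self, key, value):
        self._records[key] = value
        self._versions[key] += 1


class DataGrid:
    """Single logical namespace over many sources, with a coherent read cache."""

    def __init__(self):
        self._mounts = {}
        self._cache = {}  # path -> (version, value)

    def mount(self, name, source):
        self._mounts[name] = source

    def read(self, path):
        name, key = path.strip("/").split("/", 1)
        source = self._mounts[name]
        current = source.version(key)
        cached = self._cache.get(path)
        if cached is not None and cached[0] == current:
            return cached[1]  # cache entry is still coherent
        value = source.fetch(key)
        self._cache[path] = (current, value)
        return value


pdb = Source({"1abc": "structure v1"})
grid = DataGrid()
grid.mount("pdb", pdb)

first = grid.read("/pdb/1abc")
second = grid.read("/pdb/1abc")     # served from the cache
pdb.update("1abc", "structure v2")  # the source changes underneath
third = grid.read("/pdb/1abc")      # stale entry detected, record refetched
```

The caller only ever sees paths such as `/pdb/1abc`; which institution hosts the record and whether the bytes came from the cache are hidden behind `read`.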

Three Concrete Examples

• KDS – “data mining” on widely separated data sets such as PubMed.

• “Map” UniProt datasets into data grid

  • Researchers no longer need to spend time downloading the latest datasets

• Extended Hospital

Extended Hospital

[Figure: a hospital, made up of department domains that each hold their own data plus a data warehouse, linked to insurance companies, emergency vehicles, research, clinics / large practices, non-related hospitals, and authorized family.]

Processing

• Classic high-throughput computing

• Suppose you have thousands of computationally intensive jobs to run

  • SW, CHARMm, Sequest, a.out

• Your usage is bursty – you need a lot over a short period of time, but often have idle resources

• You wish you had more!

[Figure: the same institutions and public databases (PDB, NCBI, EMBL, SEQ_1) as the Data layer, now with a Processing layer of Cluster 1 through Cluster N.]

Compute Grid: Shared access to processing

• Flexible, location-independent access to virtually unlimited processing, on-demand

• Scheduling, usage, management policies

• System detects, recovers from job failures

• Heterogeneous platform support

• Usage accounting, as required
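The high-throughput pattern behind a compute grid, many independent jobs fanned out over shared workers with failed jobs detected and rerun, can be sketched on a single machine. The code below is a minimal illustration using Python's standard thread pool; it is not Legion or OGSA-BES, and `flaky_align` is a hypothetical job that fails once to exercise the recovery path.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(job, arg, max_attempts=3):
    """Run job(arg); on failure, resubmit until max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job(arg), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller


def run_batch(job, args, workers=4):
    """Fan a batch of independent jobs out over a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_with_retry, job, a) for a in args]
        return [f.result() for f in futures]  # results in submission order


_seen = set()


def flaky_align(seq):
    """Hypothetical job: fails the first time it sees a sequence containing 'N'."""
    if "N" in seq and seq not in _seen:
        _seen.add(seq)
        raise RuntimeError("node lost")
    return len(seq)


# Each result is (job output, number of attempts it took).
results = run_batch(flaky_align, ["ACGT", "ACNNGT", "GG"])
```

On a real grid the workers would be clusters at different sites and the retry might land on a different machine, but the contract seen by the user is the same: submit a batch, get every result back.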

Concrete Examples

• Biomarkers project wants to run Sequest-2 using public databases

• Charmm/Amber

• Gnomad (Altman et al.)

• BLAST, FASTA, …

• Autodock

Applications

• Suppose you want to use applications or workflows developed, maintained, and supported by others – without the hassle of installing all of them on your gear.

• Suppose you want to couple multiple applications developed at different institutions together.

[Figure: the same institutions and public databases, now with three shared layers: Data (PDB, NCBI, EMBL, SEQ_N), Processing (Cluster 1 through Cluster N), and Applications (APP 1 through APP N).]

Grid users share applications, employing multiple data & processing resources

• Flexible binary management

• No need to recompile applications

• Securely share applications

  • Restrict who gains access

  • Restrict where apps run
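The two sharing restrictions named above, who may use an application and where it may run, amount to a policy check made before a job is placed. A minimal sketch follows, assuming a simple allow-list model; `SharedApp`, `pick_site`, and the user and site names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SharedApp:
    """An application published to the grid with two allow-lists."""
    name: str
    allowed_users: set = field(default_factory=set)
    allowed_sites: set = field(default_factory=set)

    def may_run(self, user, site):
        # Both conditions must hold: the user is trusted and so is the site.
        return user in self.allowed_users and site in self.allowed_sites


def pick_site(app, user, candidate_sites):
    """Return the first candidate site where this user may run the app, or None."""
    for site in candidate_sites:
        if app.may_run(user, site):
            return site
    return None


app = SharedApp("sequest", allowed_users={"alice"}, allowed_sites={"uva", "partner-a"})

site_for_alice = pick_site(app, "alice", ["public-cluster", "partner-a", "uva"])
site_for_bob = pick_site(app, "bob", ["public-cluster", "partner-a", "uva"])
```

Here `site_for_alice` is `"partner-a"` (the first permitted site in the candidate list) and `site_for_bob` is `None`, since the owner never granted that user access.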

[Figure: the same institutions and public databases, connected through shared Data, Processing, and Applications layers.]

Better Research, Faster

• Secure, wide-area access to global breadth of consistent, current data

• Access to vast processing power

• Ability to securely share proprietary data and applications, as needed

Evolution in action

• 50’s: Bare metal programming

• 60’s to 80’s: Batch OS, multi-user timeshare

• Today: Low-level network programming

• Now & future: Grid & WS

Summary


• Grids will have a huge impact on the life sciences

• Prototype GBG operational

• Applications are underway

• We’re always looking for new applications
