Discovery Net : A UK e-Science Pilot Project
for Grid-based Knowledge Discovery Services
Patrick Wendel, Imperial College London
Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003
Why Discovery Net?
Data Challenge: distributed, heterogeneous and large-scale data sets; novel and real-time data sources
Resource Challenge: novel specialised data analysis components/services continually being published and made available; computational resources provided
Information Challenge: data cleaning, normalisation and calibration; new data needs to be related to existing data
Knowledge Challenge: collaborative, interactive and people-intensive; result interpretation and validation in relation to existing knowledge; knowledge sharing is key
What is Discovery Net
Goal: construct an infrastructure for global-wide knowledge discovery services
Key Technologies: Grid and distributed computing; workflow and service composition; data mining & visualisation; data access & information structuring; high-throughput screening devices (real-time)
Discovery Net: Unifying the World’s Knowledge
Data Integration: Dynamic Real Time Construction of “Data Grids”
Application Integration: Component and Service-based Integration
People Integration: Global-wide Discovery Groupware
Knowledge Integration: Multi-subjects and Multi-modality Integrative Analysis to Cross Validate and Annotate Related Discovery Work
What is Discovery Net
[Figure: using distributed resources to turn scientific information (literature, databases, operational data, images, instrument data) into scientific discovery, via real-time integration, dynamic application integration, workflow construction and interactive visual analysis]
Discovery Net Layer Model (Life Science Application)
[Figure: layer model built on high-performance, Grid-enabled transfer protocols (GSI-FTP, DSTP, ...) over a Grid-enabled infrastructure (GSI), with deployment through Web/Grid services (OGSA)]
D-Net Middleware: provides execution logic for distributed knowledge discovery and access to distributed resources
Computation & Data Resources: distributed databases, compute servers and scientific devices
D-Net Clients: end-user applications and user interfaces allowing scientists to construct and drive knowledge discovery activities
A Knowledge Grid based on D-Net Servers
[Figure: each DNet server exposes data access & storage, InfoGrid, components, computation and deployment layers through the DNet API; DNet servers interconnect across the Internet, serving participating clients, thin DNet clients and web clients; workflows are exchanged as DPML/XML, and knowledge discovery services draw on computational services and data sources (WWW, RDBMS)]
Several types of clients for different usages (from thin web client to participating client)
Current implementation is based on Java distributed objects (EJB), moving towards Web/Grid services
But deployment and API access are already through standard Web/Grid services
Goal: Plug & Play Data Sources, Analysis Components and Knowledge Discovery Processes
Discovery Process Management
Workflow-based service composition: the data-flow approach fits the knowledge discovery process and allows scientists to develop processes themselves
Towards a standard workflow representation for discovery informatics: the Discovery Process Markup Language (DPML):
Contains component data-flow graphs, but also records collaboration information (user, changes) and execution constraints (location, parameterisation)
Becomes a key intellectual property: discovery processes can be stored, reused, audited, refined and deployed in various forms
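The slides describe DPML as an XML representation that records a data-flow graph plus collaboration and execution information. A minimal sketch of what such a document might look like; all element and attribute names here are hypothetical illustrations, not the actual DPML schema:

```xml
<!-- Hypothetical DPML-style fragment; element names are illustrative only. -->
<workflow name="genome-annotation">
  <!-- Component data-flow graph: nodes with execution constraints -->
  <node id="fetch" component="SequenceFetch" location="dnet://server-a"/>
  <node id="blast" component="BlastSearch" location="dnet://server-b">
    <param name="e-value">1e-5</param>
  </node>
  <edge from="fetch" to="blast"/>
  <!-- Collaboration information: who changed what, and when -->
  <history>
    <change user="pwendel" date="2003-09-18">added BLAST node</change>
  </history>
</workflow>
```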
D-Net Workflow for Genome Annotation:
16 services executing across the Internet
InfoGrid: Dynamic Data Integration
[Figure: integrative analysis across subject areas — Gene (sequence, expression, function, ...), Protein/Targets (sequence, structure, location, function, ...), Chemistry (structures, libraries, catalogues, synthetic pathways, ...), Biological Screening (activity, protocols, toxicology, metabolic pathways, ...), Clinical (trials, patients, ...), Journals (journals, project reports, patents, ...)]
Dynamic Data Integration = On-demand access to heterogeneous data sources + information structuring
Towards a Dynamic Information Integration Methodology:
Specialised information source access: InfoGrid allows users to register, locate and connect to various specialised information sources
On-the-fly integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC, HTTP/XML/XPath, Web Service)
Easy maintenance: wrappers/drivers for new data sources can be added through a clean API
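The wrapper/driver idea above can be sketched as a registry behind a single interface; this is a hedged illustration, not the actual Discovery Net API — the interface and class names are assumptions:

```java
// Hypothetical sketch of an InfoGrid-style wrapper API; names are illustrative.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each data source (JDBC, HTTP/XML, proprietary format) hides behind one interface.
interface InfoSourceWrapper {
    String name();
    List<Map<String, String>> query(String expression);
}

// A registry lets users register and locate wrappers by name at run time,
// so new sources can be added without changing client code.
class WrapperRegistry {
    private final Map<String, InfoSourceWrapper> wrappers = new HashMap<>();

    public void register(InfoSourceWrapper w) {
        wrappers.put(w.name(), w);
    }

    public InfoSourceWrapper locate(String name) {
        InfoSourceWrapper w = wrappers.get(name);
        if (w == null) throw new IllegalArgumentException("unknown source: " + name);
        return w;
    }
}
```

A new source then only needs an `InfoSourceWrapper` implementation; nothing else in the system changes.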
Dynamic Application Integration Services
Dynamic Application Integration = On-demand access and composition of remote analysis components
Towards Dynamic Component Integration:
Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type)
Execution service: allows users to control the execution of components in distributed environments
Easy maintenance: new components can be added through a clean API
[Figure: example components exposed through the D-NET API — regression, clustering, classification, gene function prediction, homology search, promoter prediction]
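The component service described above can be sketched as a name-based registry with remote execution stubbed out; this is an assumption-laden illustration, not the real D-NET API, and the class name and method signatures are invented:

```java
// Hypothetical sketch of a component service; remote dispatch is stubbed
// as a local function call for illustration.
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ComponentService {
    private final Map<String, Function<String, String>> components = new HashMap<>();

    // Register a component under a name (in the real system: a Java
    // component interface or a Web Service port type).
    public void register(String name, Function<String, String> impl) {
        components.put(name, impl);
    }

    // Locate and execute a registered component; the real execution service
    // would dispatch this call to a remote host and track its progress.
    public String execute(String name, String input) {
        Function<String, String> c = components.get(name);
        if (c == null) throw new IllegalArgumentException("unknown component: " + name);
        return c.apply(input);
    }
}
```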
Discovery Deployment
[Figure: a discovery process in DPML deployed as batch processing, a discovery service, a report, or a discovery component]
Discovery Deployment = On-demand rapid application construction and publishing
Towards a Dynamic Deployment of Knowledge Discovery Procedures:
Deployment engine: allows users to build and publish applications based on DPML code coordinating remotely executed components, as a web page, Web/Grid service, or command-line tool
Easy maintenance: new discovery procedures are described in DPML, a standardised representation of composed discovery procedures
Storage & reporting servers: allow users to share DPML procedures and to generate workflow audit reports
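The key design point above is that one stored DPML process can be published under several forms without changing the process itself. A minimal sketch of that idea, with invented names (this is not the actual deployment engine):

```java
// Hypothetical sketch: one DPML document, many published forms.
import java.util.EnumSet;

class DeployedProcess {
    enum Form { WEB_PAGE, GRID_SERVICE, COMMAND_LINE }

    private final String dpml; // the stored workflow document
    private final EnumSet<Form> publishedAs = EnumSet.noneOf(Form.class);

    DeployedProcess(String dpml) {
        this.dpml = dpml;
    }

    // Publishing adds a form; the underlying DPML drives every form identically.
    void publish(Form form) {
        publishedAs.add(form);
    }

    boolean isPublishedAs(Form form) {
        return publishedAs.contains(form);
    }
}
```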
Knowledge Integration & Interpretation
Dynamic Knowledge Interpretation = cross-reference and verify analysis results against background knowledge
Towards a Knowledge Integration Framework: multi-subject data analysis
Specialised client interfaces: interactive analysis and dynamic component interaction
Result annotation, structuring and storage: information source query, result browsing, sharing and markup
[Figure: life science example application combining sequence analysis, text mining, genetic analysis and pathway analysis]
Workflow execution
Component execution location resolution:
Users have a list of known resources
A component can explicitly require execution on a particular resource
A component can choose from a set of proposed resources (and could use Grid resource information systems and network weather information to decide where to go)
For unconstrained components, a simple "near the data" execution policy:
If there is a single input data location, execute there; otherwise fall back to the original execution location
Allows usual DPKD workflows to be designed; handles data management and transfer (serialisation, Java-based, FTP-based)
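The "near the data" placement rule above is simple enough to sketch directly; the class and method names here are hypothetical, not the actual scheduler API:

```java
// Sketch of the "near the data" execution policy for unconstrained components.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NearDataPolicy {
    // If all input data sits on a single resource, execute the component
    // there; otherwise fall back to the component's original location.
    static String resolve(List<String> inputDataLocations, String originalLocation) {
        Set<String> distinct = new HashSet<>(inputDataLocations);
        if (distinct.size() == 1) {
            return distinct.iterator().next();
        }
        return originalLocation;
    }
}
```

Moving computation to the data avoids transferring large data sets when a component's inputs are already co-located.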
Discovery Net and Grid technologies
Cluster/campus Grid level: partial or complete workflow execution on Condor / SGE; task farming on a subset of the workflow
Global Grid: GSI integration (Java CoG Kit); GSI-FTP transfer functionality (Java CoG Kit); OGSA Grid Service access to functionalities (GT3); potential use of GRIS or NWS in component implementations
Globus scheduler? Unicore? SRB?
Discovery Net Application Testbeds
Life science testbed (gene sequencing, protein chips): high-throughput real-time genome annotation — analyse and interpret new sequences using existing distributed bioinformatics tools and databases
Environmental modelling (GUSTO pollution sensor units with wireless connectivity: SO2, benzene, ...): high-throughput real-time pollution monitoring — analyse and interpret time-resolved correlations among remote stations and with other environmental data sets
Geo-hazard prediction (multi-spectral, multi-temporal satellite imagery): real-time geo-hazard prediction — analyse and interpret satellite images with other data sets to generate thematic knowledge
Case Study: SC2002 HPC Challenge
[Figure: SC2002 genome annotation workflow. Nucleotide-level annotation (RepeatMasker, GRAIL, GENSCAN, E-PCR, BLAST) identifies genes, gene markers, tRNAs/rRNAs, non-translated RNAs, regulatory regions, repetitive elements, segmental duplications, SNP variations and literature references. Protein-level annotation (BLAST, 3D-PSSM, motif search, DSC, PREDATOR against PFAM, InterPro, SMART, SWISS-PROT) identifies homologues, classifies proteins into families, and provides functional characterisation, domain 3-D structure, fold prediction and secondary structure. Process-level annotation relates the organism's chromosomes and genes to cell cycle, metabolism, drugs and biological processes (cell death, embryogenesis) via ontologies, pathway maps and gene maps (GO, AmiGO, GenNav, KEGG, CSNDB), with source data from NCBI, EMBL, TIGR and SNP databases fed by high-throughput sequencers into a virtual chip.]
D-Net based global collaborative real-time genome annotation: 15 databases, 21 applications
How It Works
Nucleotide annotation workflows: download a sequence from the reference server, execute the distributed annotation workflow (against NCBI, EMBL, TIGR, SNP, InterPro, SMART, SWISS-PROT, GO, KEGG), save results to the Distributed Annotation Server, and inspect them in the interactive editor & visualisation
Done by hand: ~1800 clicks, 500 web accesses, 200 copy/paste operations, 3 weeks of work
With Discovery Net: 1 workflow and a few seconds' execution
Conclusion and Future Work
Towards an open integration platform that enables scientists to conduct their KD activities: several levels of integration are required; the platform enables use of available resources
Future work: evolution towards cost model integration (performance, value, QoS); semantics-based service retrieval and composition; other useful standards? (OGSA-DAI?)