Discovery Net : A UK e-Science Pilot Project
for Grid-based Knowledge Discovery Services
Patrick Wendel, Imperial College London
Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003
Why Discovery Net?
Data Challenge: distributed, heterogeneous and large-scale data sets; novel and real-time data sources
Resource Challenge: novel specialised data analysis components/services continually being published and made available; computational resources provided
Information Challenge: data cleaning, normalisation and calibration; new data needs to be related to existing data
Knowledge Challenge: collaborative, interactive and people-intensive; result interpretation and validation in relation to existing knowledge; knowledge sharing is key
What is Discovery Net
Goal: construct an infrastructure for global-wide knowledge discovery services
Key Technologies: Grid and distributed computing; workflow and service composition; data mining & visualisation; data access & information structuring; high-throughput screening devices (real-time)
Discovery Net: Unifying the World’s Knowledge
Data Integration: Dynamic Real Time Construction of “Data Grids”
Application Integration: Component and Service-based Integration
People Integration: Global-wide Discovery Groupware
Knowledge Integration: Multi-subjects and Multi-modality Integrative Analysis to Cross Validate and Annotate Related Discovery Work
What is Discovery Net
[Figure: using distributed resources to turn scientific information (literature, databases, operational data, images, instrument data) into scientific discovery, via real-time integration, dynamic application integration, workflow construction and interactive visual analysis]
Discovery Net Layer Model (Life Science Application)
[Figure: layer model built on high-performance, Grid-enabled transfer protocols (GSI-FTP, DSTP, ...) over a Grid-enabled infrastructure (GSI), with deployment through Web/Grid services (OGSA)]
D-Net Middleware: provides execution logic for distributed knowledge discovery and access to distributed resources
Computation & Data Resources: distributed databases, compute servers and scientific devices
D-Net Clients: end-user applications and user interfaces allowing scientists to construct and drive knowledge discovery activities
A Knowledge Grid based on D-Net Servers
[Figure: each DNet server exposes data access & storage, InfoGrid, components, computation and deployment layers through the DNet API; DNet servers interconnect across the Internet, serving participating clients, thin DNet clients and web clients; workflows are exchanged as DPML/XML, and knowledge discovery services draw on computational services and data sources (WWW, RDBMS)]
Several types of clients for different usages (from thin web client to participating client)
Current implementation is based on Java distributed objects (EJB), moving towards Web/Grid services
But deployment and API access are already through standard Web/Grid services
Goal: Plug & Play Data Sources, Analysis Components and Knowledge Discovery Processes
Discovery Process Management
Workflow-based service composition: the data-flow approach fits the knowledge discovery process and allows scientists to develop processes themselves
Towards a standard workflow representation for discovery informatics: the Discovery Process Markup Language (DPML):
Contains component data-flow graphs, but also records collaboration information (user, changes) and execution constraints (location, parameterisation)
Becomes a key intellectual property: discovery processes can be stored, reused, audited, refined and deployed in various forms
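The slides describe DPML as an XML representation that records a data-flow graph plus collaboration and execution information. A minimal sketch of what such a document might look like; all element and attribute names here are hypothetical illustrations, not the actual DPML schema:

```xml
<!-- Hypothetical DPML-style fragment; element names are illustrative only. -->
<workflow name="genome-annotation">
  <!-- Component data-flow graph: nodes with execution constraints -->
  <node id="fetch" component="SequenceFetch" location="dnet://server-a"/>
  <node id="blast" component="BlastSearch" location="dnet://server-b">
    <param name="e-value">1e-5</param>
  </node>
  <edge from="fetch" to="blast"/>
  <!-- Collaboration information: who changed what, and when -->
  <history>
    <change user="pwendel" date="2003-09-18">added BLAST node</change>
  </history>
</workflow>
```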
D-Net Workflow for Genome Annotation:
16 services executing across the Internet
InfoGrid: Dynamic Data Integration
[Figure: integrative analysis across subject areas — Gene (sequence, expression, function, ...), Protein/Targets (sequence, structure, location, function, ...), Chemistry (structures, libraries, catalogues, synthetic pathways, ...), Biological Screening (activity, protocols, toxicology, metabolic pathways, ...), Clinical (trials, patients, ...), Journals (journals, project reports, patents, ...)]
Dynamic Data Integration = On-demand access to heterogeneous data sources + information structuring
Towards a Dynamic Information Integration Methodology:
Specialised information source access: InfoGrid allows users to register, locate and connect to various specialised information sources
On-the-fly integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC, HTTP/XML/XPath, Web Service)
Easy maintenance: wrappers/drivers for new data sources can be added through a clean API
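The wrapper/driver idea above can be sketched as a registry behind a single interface; this is a hedged illustration, not the actual Discovery Net API — the interface and class names are assumptions:

```java
// Hypothetical sketch of an InfoGrid-style wrapper API; names are illustrative.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each data source (JDBC, HTTP/XML, proprietary format) hides behind one interface.
interface InfoSourceWrapper {
    String name();
    List<Map<String, String>> query(String expression);
}

// A registry lets users register and locate wrappers by name at run time,
// so new sources can be added without changing client code.
class WrapperRegistry {
    private final Map<String, InfoSourceWrapper> wrappers = new HashMap<>();

    public void register(InfoSourceWrapper w) {
        wrappers.put(w.name(), w);
    }

    public InfoSourceWrapper locate(String name) {
        InfoSourceWrapper w = wrappers.get(name);
        if (w == null) throw new IllegalArgumentException("unknown source: " + name);
        return w;
    }
}
```

A new source then only needs an `InfoSourceWrapper` implementation; nothing else in the system changes.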
Dynamic Application Integration Services
Dynamic Application Integration = On-demand access and composition of remote analysis components
Towards Dynamic Component Integration:
Component service: allows users to register, locate and remotely execute components (Java component interface or Web Service port type)
Execution service: allows users to control the execution of components in distributed environments
Easy maintenance: new components can be added through a clean API
[Figure: example components exposed through the D-NET API — regression, clustering, classification, gene function prediction, homology search, promoter prediction]
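The component service described above can be sketched as a name-based registry with remote execution stubbed out; this is an assumption-laden illustration, not the real D-NET API, and the class name and method signatures are invented:

```java
// Hypothetical sketch of a component service; remote dispatch is stubbed
// as a local function call for illustration.
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ComponentService {
    private final Map<String, Function<String, String>> components = new HashMap<>();

    // Register a component under a name (in the real system: a Java
    // component interface or a Web Service port type).
    public void register(String name, Function<String, String> impl) {
        components.put(name, impl);
    }

    // Locate and execute a registered component; the real execution service
    // would dispatch this call to a remote host and track its progress.
    public String execute(String name, String input) {
        Function<String, String> c = components.get(name);
        if (c == null) throw new IllegalArgumentException("unknown component: " + name);
        return c.apply(input);
    }
}
```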
Discovery Deployment
[Figure: a discovery process in DPML deployed as batch processing, a discovery service, a report, or a discovery component]
Discovery Deployment = On-demand rapid application construction and publishing
Towards a Dynamic Deployment of Knowledge Discovery Procedures:
Deployment engine: allows users to build and publish applications based on DPML code coordinating remotely executed components, as a web page, Web/Grid service, or command-line tool
Easy maintenance: new discovery procedures are described in DPML, a standardised representation of composed discovery procedures
Storage & reporting servers: allow users to share DPML procedures and to generate workflow audit reports
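The key design point above is that one stored DPML process can be published under several forms without changing the process itself. A minimal sketch of that idea, with invented names (this is not the actual deployment engine):

```java
// Hypothetical sketch: one DPML document, many published forms.
import java.util.EnumSet;

class DeployedProcess {
    enum Form { WEB_PAGE, GRID_SERVICE, COMMAND_LINE }

    private final String dpml; // the stored workflow document
    private final EnumSet<Form> publishedAs = EnumSet.noneOf(Form.class);

    DeployedProcess(String dpml) {
        this.dpml = dpml;
    }

    // Publishing adds a form; the underlying DPML drives every form identically.
    void publish(Form form) {
        publishedAs.add(form);
    }

    boolean isPublishedAs(Form form) {
        return publishedAs.contains(form);
    }
}
```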
Knowledge Integration & Interpretation
Dynamic Knowledge Interpretation = cross-reference and verify analysis results against background knowledge
Towards a Knowledge Integration Framework: multi-subject data analysis
Specialised client interfaces: interactive analysis and dynamic component interaction
Result annotation, structuring and storage: information source query, result browsing, sharing and markup
[Figure: life science example application combining sequence analysis, text mining, genetic analysis and pathway analysis]
Workflow execution
Component execution location resolution:
Users have a list of known resources
A component can explicitly require execution on a particular resource
A component can choose from a set of proposed resources (and could use Grid resource information systems and network weather information to decide where to go)
For unconstrained components, a simple "near the data" execution policy:
If there is a single input data location, execute there; otherwise fall back to the original execution location
Allows usual DPKD workflows to be designed; handles data management and transfer (serialisation, Java-based, FTP-based)
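The "near the data" placement rule above is simple enough to sketch directly; the class and method names here are hypothetical, not the actual scheduler API:

```java
// Sketch of the "near the data" execution policy for unconstrained components.
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NearDataPolicy {
    // If all input data sits on a single resource, execute the component
    // there; otherwise fall back to the component's original location.
    static String resolve(List<String> inputDataLocations, String originalLocation) {
        Set<String> distinct = new HashSet<>(inputDataLocations);
        if (distinct.size() == 1) {
            return distinct.iterator().next();
        }
        return originalLocation;
    }
}
```

Moving computation to the data avoids transferring large data sets when a component's inputs are already co-located.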
Discovery Net and Grid technologies
Cluster/campus Grid level: partial or complete workflow execution on Condor / SGE; task farming on a subset of the workflow
Global Grid: GSI integration (Java CoG Kit); GSI-FTP transfer functionality (Java CoG Kit); OGSA Grid Service access to functionalities (GT3); potential use of GRIS or NWS in component implementations
Globus scheduler? Unicore? SRB?
Discovery Net Application Testbeds
Life science testbed (gene sequencing, protein chips): high-throughput real-time genome annotation — analyse and interpret new sequences using existing distributed bioinformatics tools and databases
Environmental modelling (GUSTO pollution sensor units with wireless connectivity: SO2, benzene, ...): high-throughput real-time pollution monitoring — analyse and interpret time-resolved correlations among remote stations and with other environmental data sets
Geo-hazard prediction (multi-spectral, multi-temporal satellite imagery): real-time geo-hazard prediction — analyse and interpret satellite images with other data sets to generate thematic knowledge
Case Study: SC2002 HPC Challenge
[Figure: SC2002 genome annotation workflow. Nucleotide-level annotation (RepeatMasker, GRAIL, GENSCAN, E-PCR, BLAST) identifies genes, gene markers, tRNAs/rRNAs, non-translated RNAs, regulatory regions, repetitive elements, segmental duplications, SNP variations and literature references. Protein-level annotation (BLAST, 3D-PSSM, motif search, DSC, PREDATOR against PFAM, InterPro, SMART, SWISS-PROT) identifies homologues, classifies proteins into families, and provides functional characterisation, domain 3-D structure, fold prediction and secondary structure. Process-level annotation relates the organism's chromosomes and genes to cell cycle, metabolism, drugs and biological processes (cell death, embryogenesis) via ontologies, pathway maps and gene maps (GO, AmiGO, GenNav, KEGG, CSNDB), with source data from NCBI, EMBL, TIGR and SNP databases fed by high-throughput sequencers into a virtual chip.]
D-Net based global collaborative real-time genome annotation: 15 databases, 21 applications
How It Works
Nucleotide annotation workflows: download a sequence from the reference server, execute the distributed annotation workflow (against NCBI, EMBL, TIGR, SNP, InterPro, SMART, SWISS-PROT, GO, KEGG), save results to the Distributed Annotation Server, and inspect them in the interactive editor & visualisation
Done by hand: ~1800 clicks, 500 web accesses, 200 copy/paste operations, 3 weeks of work
With Discovery Net: 1 workflow and a few seconds' execution
Conclusion and Future Work
Towards an open integration platform that enables scientists to conduct their KD activities: several levels of integration are required; the platform enables use of available resources
Future work: evolution towards cost model integration (performance, value, QoS); semantics-based service retrieval and composition; other useful standards? (OGSA-DAI?)