
Page 1

Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State University (www.multiscalecomputing.org)

Cluster 2004
Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz

A Distributed Data Management Middleware for Data-Driven Application Systems

The Mobius Project

Shannon Hastings, The Ohio State University

Page 2

• Overview
  – Motivation
  – Related Work
  – Mobius Middleware
    • Services
    • Protocol
    • Architecture
    • Results
  – Community Presence

Page 3

• Motivation
  – Increasing numbers of applications wish to store and query distributed and disparate data
    • Large-scale data discovery, versioning, storage, retrieval, and linking in a distributed, multi-institutional/global environment.
    • Management of data model evolution and data translation.
    • Support for on-the-fly creation of model-driven data worlds.
    • Support for indexing and storage of large-scale data sets in distributed environments.
  – Approach: create a data storage and management middleware that is:
    • Distributed and Scalable – a service-based grid architecture with the ability to run on networked collections of PCs and PC clusters
    • Searchable – data instances conform to published data models and can be queried by applications.
    • Shareable – generic protocols enable multiple applications to concurrently store, retrieve, query, and share data in the system.

Page 4

• Related Work/Projects
  – OGSA-DAI
  – DQP
  – Chimera
  – Storage Resource Broker (SRB)
  – Distributed Parallel Storage Services (DPSS)
  – IBP
  – ActiveProxy-G

Page 5

• Mobius Overview
  – The Mobius project identifies, defines, and builds a set of services and protocols enabling the management and integration of both data and data definitions.
  – We break everything into two components:
    • The protocol definitions.
    • The service implementations of those definitions.
  – Mobius intends to conform to applicable standards set forth by the GGF and attempts to leverage relevant existing grid technologies.
    • Currently supports the GGF/DAIS XML Realization Proposed Specification

Page 6

• Mobius Core Services
  – Mako (federated ad hoc storage services)
  – Global Model Exchange (GME)
  – Data Translation Service framework (DTS)
• Mobius Extension Services
  – VMako (a single virtual service view of a federation of Makos)
  – Other higher-level query services (semantic query, inference services, etc.)
  – Data replication, transportation, and distributed query services
• Other Needed Grid Services
  – Namespace management
  – Service naming
  – Data replication
  – Security

Page 7

• Mako Service (Data Resource)
  – Exposes existing data services as XML data services through a set of well-defined interfaces based on the Mako protocol (GGF/DAIS XML Realization Specification).
  – Configurable protocol handling
    • The administrator specifies handlers that process incoming packets based on their type.
    • Specific implementations of handlers can easily be written and integrated (see the dispatch sketch below).

[Figure: Mako Architecture. TCP, SSL, and GSI listeners accept XML packets from the network and hand them to a Packet Router, which dispatches them to handlers backed by one or more data resources. Supported interfaces: Create Collection, Remove Collection, Submit Schema, Submit XML, Retrieve XML, Remove XML, Submit XPath, Submit XQuery, Get Collections, Get Service Data, Perform XUpdate.]

The Packet Router parses the XML packet and determines whether the Mako has an implementation of the interface supporting that packet type; if it does, the packet is passed to that handler for processing.

Example data services: Oracle, eXist, Mako DB, MySQL, Oracle XML DB.
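To make the dispatch pattern above concrete, here is a minimal Java sketch of a packet router; PacketHandler, PacketRouter, and the use of the root element name as the packet type are illustrative assumptions, not the actual Mobius classes or protocol.

import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Element;

// Hypothetical sketch of the handler-dispatch pattern: handlers are registered
// per packet type, and incoming XML packets are routed to the matching handler.
interface PacketHandler {
    Element handle(Element packet) throws Exception;
}

class PacketRouter {
    private final Map<String, PacketHandler> handlers = new HashMap<>();

    // An administrator (or a configuration file) registers a handler per packet type.
    void register(String packetType, PacketHandler handler) {
        handlers.put(packetType, handler);
    }

    // Determine the packet type (here, simply its root element name) and dispatch.
    Element route(Element packet) throws Exception {
        PacketHandler handler = handlers.get(packet.getLocalName());
        if (handler == null) {
            throw new UnsupportedOperationException(
                "No handler registered for packet type: " + packet.getLocalName());
        }
        return handler.handle(packet);
    }
}

In this picture, a TCP, SSL, or GSI listener would parse an incoming XML packet into a DOM element and hand it to route(); the registered handlers would in turn talk to a backend such as MakoDB or an XML:DB database.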

Page 8

• Data Resource Support
  – “Mako DB”
    • In-house XML database.
    • Optimized for federated ad hoc usage of XML.
    • Plugs into the Mako framework.
  – XML:DB Support
    • Built-in support for XML databases that implement the XML:DB API (a usage sketch follows below).
  – Other Data Resources
    • Easily integrated by implementing a small set of handlers for them.
    • In the future these handlers will be publicly available.
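Since the slide mentions built-in support for the XML:DB API, here is a rough sketch of what storing a document through that vendor-neutral Java API typically looks like; the eXist driver class name and the collection URI are assumptions used only for illustration.

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.modules.XMLResource;

// Rough sketch of storing an XML document through the vendor-neutral XML:DB API.
// Driver class and collection URI are illustrative (eXist-style) values.
public class XmlDbSubmitExample {
    public static void main(String[] args) throws Exception {
        // Register an XML:DB-compliant driver (eXist shown here as an example).
        Database db = (Database) Class.forName("org.exist.xmldb.DatabaseImpl")
                                      .getDeclaredConstructor().newInstance();
        DatabaseManager.registerDatabase(db);

        // Open a collection and store a small document in it.
        Collection col = DatabaseManager.getCollection(
                "xmldb:exist://localhost:8080/exist/xmlrpc/db/patients");
        XMLResource res = (XMLResource) col.createResource("patient1.xml",
                                                           XMLResource.RESOURCE_TYPE);
        res.setContent("<patient><lastName>Smith</lastName></patient>");
        col.storeResource(res);
        col.close();
    }
}

Retrieval and XPath queries go through the same API, for example via the collection's XPathQueryService.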

Page 9

• Other Features
  – Data Validation
  – Element Referencing
  – Lazy Retrieval
  – Distributed Document Object Model (DOM)
  – Large binary node attachments

Page 10

• Mako Data Referencing Use Case
  – Shows retrieval of data that is federated across multiple Mako services.

[Figure: Mako Data Referencing. A client requests Study 2 from Mako2; the study document references patient documents stored on Mako1.]

Mako1 holds Patients 1–4; Patient 3 expanded:
  <patient>
    <lastName>Smith</lastName>
    <firstName>Mike</firstName>
    ...
  </patient>

Mako2 holds Studies 1–4; Study 2 expanded:
  <study name="myStudy">
    <patients>
      <mako:DocumentReference serviceName="Mako1" document="Patient 2"/>
      <mako:DocumentReference serviceName="Mako1" document="Patient 3"/>
    </patients>
  </study>

Response to the client's "Request Study 2":
  <study name="myStudy">
    <patients>
      <patient>
        <lastName>Doe</lastName>
        <firstName>John</firstName>
        ...
      </patient>
      <patient>
        <lastName>Smith</lastName>
        <firstName>Mike</firstName>
        ...
      </patient>
    </patients>
  </study>

The Mako protocol allows the referenced pieces of data to be resolved at request time by the Mako receiving the request, or lazily by the client (see the client-side sketch below).
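A client that chooses the lazy option could resolve the mako:DocumentReference elements itself, roughly as in the sketch below; the namespace URI and the fetchDocument() helper (standing in for a Retrieve XML request to the named Mako) are hypothetical, not part of the actual Mobius client API.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch of client-side (lazy) resolution of mako:DocumentReference elements.
public class ReferenceResolver {
    static final String MAKO_NS = "urn:mako"; // illustrative namespace URI

    public static void resolve(Document studyDoc) throws Exception {
        NodeList refs = studyDoc.getElementsByTagNameNS(MAKO_NS, "DocumentReference");
        // Iterate backwards: replacing nodes shrinks the live NodeList.
        for (int i = refs.getLength() - 1; i >= 0; i--) {
            Element ref = (Element) refs.item(i);
            String service = ref.getAttribute("serviceName");
            String docId = ref.getAttribute("document");

            // Fetch the referenced document from the remote Mako and splice it
            // into the tree in place of the reference element.
            Document remote = fetchDocument(service, docId);
            Node imported = studyDoc.importNode(remote.getDocumentElement(), true);
            ref.getParentNode().replaceChild(imported, ref);
        }
    }

    // Placeholder: would issue a Retrieve XML packet to the Mako named 'service'.
    static Document fetchDocument(String service, String docId) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse("http://" + service + "/documents/" + docId); // illustrative URL
    }
}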

Page 11

• Virtual Mako
  – Simplifies the client-side complexity of interfacing with multiple Makos by presenting a single virtualized interface to a collection of federated Makos.
    • Acts as a data integration point for distributed queries
    • Pluggable algorithms for XML instance ingestion/distribution
    • Protocol request broadcast and response aggregation (sketched below)
    • Supports all services a standard Mako supports
    • Maps a virtual collection to a number of remote standard collections

[Figure: a client issues a request on Collection 1 to the Virtual Mako, which forwards it as remote requests on Collections A, B, and C hosted on standard Makos, then aggregates their responses into a single response for the client.]
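The broadcast-and-aggregate behavior can be sketched as a simple parallel fan-out; the MakoClient interface and the virtual-to-remote collection mapping below are illustrative stand-ins for the real protocol machinery.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the Virtual Mako broadcast/aggregation pattern: a request on a
// virtual collection is fanned out to the remote collections it maps to, and
// the responses are merged. MakoClient is a hypothetical stand-in.
interface MakoClient {
    List<String> query(String collection, String xpath) throws Exception;
}

class VirtualMakoSketch {
    record Target(MakoClient mako, String remoteCollection) {}

    private final List<Target> targets;                      // virtual -> remote mapping
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    VirtualMakoSketch(List<Target> targets) { this.targets = targets; }

    // Broadcast the query to every mapped remote collection and aggregate results.
    List<String> query(String xpath) throws Exception {
        List<Future<List<String>>> futures = new ArrayList<>();
        for (Target t : targets) {
            Callable<List<String>> task = () -> t.mako().query(t.remoteCollection(), xpath);
            futures.add(pool.submit(task));
        }
        List<String> aggregated = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            aggregated.addAll(f.get());                      // blocks until each response arrives
        }
        return aggregated;
    }
}

On ingestion, a real Virtual Mako would additionally apply its pluggable distribution algorithm to decide which remote collection each submitted XML instance lands in.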

Page 12

• GME: the need for global data definition management
  – What is a “global data definition” (global model)?
    • my “LabResult” != your “LabResult”
  – Promote the creation and evolution of standard definitions of data types.
  – For data communication between multiple institutions, the institutions must agree on a common structure or on a mapping between structures.
  – Allow for sharing and discovery of data definitions in a grid environment.
• GME enables
  – publishing, retrieving, querying, referencing, and versioning of data models in a distributed, hierarchical domain environment

Page 13

• Global Model Exchange
  – Manages the global model
    • handles the issues presented above
  – Provides a submission and discovery protocol
  – DNS-like architecture
    • Hierarchical division into authority and subordinate domains
    • Model resolution query messaging
    • Local cache with TTL (see the resolution sketch below)
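A hedged sketch of the DNS-style resolution path with a TTL cache follows; the class, the domain-naming convention, and the parent/subordinate lookup calls are illustrative assumptions, not the actual GME protocol.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of DNS-style model resolution in a GME node: consult a
// local TTL cache, then locally published models, otherwise forward the query
// toward the root authority, which walks back down the subordinate hierarchy.
class GmeNodeSketch {
    record Cached(String schema, long expiresAt) {}

    private final String domain;                                    // e.g. "bmi.osu.edu" (assumed convention)
    private final GmeNodeSketch parent;                             // null at the root
    private final Map<String, GmeNodeSketch> subordinates = new HashMap<>();
    private final Map<String, String> published = new HashMap<>();  // modelName -> schema text
    private final Map<String, Cached> cache = new HashMap<>();
    private final long ttlMillis;

    GmeNodeSketch(String domain, GmeNodeSketch parent, long ttlMillis) {
        this.domain = domain; this.parent = parent; this.ttlMillis = ttlMillis;
        if (parent != null) parent.subordinates.put(domain, this);
    }

    void publish(String modelName, String schema) { published.put(modelName, schema); }

    // Entry point used by clients of this GME node.
    String resolve(String modelDomain, String modelName) {
        String key = modelDomain + ":" + modelName;
        Cached hit = cache.get(key);
        if (hit != null && hit.expiresAt() > System.currentTimeMillis()) return hit.schema();

        String schema = modelDomain.equals(domain)
                ? published.get(modelName)                                   // we are the authority
                : (parent != null ? parent.resolve(modelDomain, modelName)   // forward toward the root
                                  : resolveDown(modelDomain, modelName));    // root: descend
        if (schema != null) cache.put(key, new Cached(schema, System.currentTimeMillis() + ttlMillis));
        return schema;
    }

    // Walk down the authority hierarchy to the subordinate owning the domain.
    private String resolveDown(String modelDomain, String modelName) {
        if (modelDomain.equals(domain)) return published.get(modelName);
        for (Map.Entry<String, GmeNodeSketch> e : subordinates.entrySet()) {
            if (modelDomain.equals(e.getKey()) || modelDomain.endsWith("." + e.getKey())) {
                return e.getValue().resolveDown(modelDomain, modelName);
            }
        }
        return null;                                                 // unknown domain
    }
}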

Page 14

Users and services publish schemas to the authoritative GME of that namespace. Any other services or users, from the same or different organizations, who have the appropriate authority can then reference, use, alter, etc., the data definitions of that schema.

Page 15

• Data Translation Service
  – Issues
    • How do I translate one data type to another?
    • How do I convert an old version of a data type to a newer one?
  – A protocol and service framework should exist for mapping one data instance or data definition to another.
  – Allows two protocol-disjoint services to communicate
  – Enables translation between changing data types

[Figure: mapping services (e.g., “A to B”, “C1 to C2”) register with a Data Translation Service registry. A user who wants to convert data from type A to type B (1) discovers a suitable mapping service, (2) sends the A data to it, and (3) receives B data back. A minimal sketch of this flow follows.]
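The registration, discovery, and translation steps in the figure might look roughly like the following sketch; the Translator interface, the registry keying, and the toy A-to-B mapping are all illustrative assumptions rather than the DTS protocol itself.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of the data-translation flow: mapping services register under a
// (fromType, toType) key, a user discovers a suitable mapping service,
// sends data of type A, and receives data of type B.
interface Translator {
    String translate(String sourceInstance);   // e.g. XML in, XML out
}

class TranslationRegistry {
    private final Map<String, Translator> translators = new HashMap<>();

    // Mapping services register the (from -> to) translations they offer.
    void register(String fromType, String toType, Translator t) {
        translators.put(fromType + "->" + toType, t);
    }

    // Step 1: discovery. Steps 2 and 3 happen when the caller invokes translate().
    Optional<Translator> discover(String fromType, String toType) {
        return Optional.ofNullable(translators.get(fromType + "->" + toType));
    }
}

class DtsExample {
    public static void main(String[] args) {
        TranslationRegistry registry = new TranslationRegistry();
        // A hypothetical "A to B" mapping service registers itself.
        registry.register("A", "B", src -> src.replace("<a>", "<b>").replace("</a>", "</b>"));

        // A user wanting to convert A data to B data discovers and invokes it.
        String bData = registry.discover("A", "B")
                               .map(t -> t.translate("<a>hello</a>"))
                               .orElseThrow(() -> new IllegalStateException("no A->B mapping registered"));
        System.out.println(bData);   // prints <b>hello</b>
    }
}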

Page 16

• Mobius Application Examples (www.projectmobius.org)
  – GridPACS
    • Distributed image management and analysis application.
  – BlockMan
    • Large-scale data de-clustering and querying application.

[Figure: GridPACS architecture. A GME and a VMako front a set of Mako servers deployed on four clusters (DC, Rogue, Mob, and OSUmed); the cluster labels in the figure read 5 nodes/2 TB, 40 nodes/16 TB, 8 nodes/14 TB, and 24 nodes/7.2 TB.]

Page 17

• Experimental Results
  – System configuration:
    • A 24-node cluster: 933 MHz CPU, 512 MB memory, 300 GB disk, Fast Ethernet.
    • A 5-node cluster: dual 2.4 GHz Xeon, 2 GB memory, 360 GB disk, Gigabit switch.
    • An 8-node cluster: dual 64-bit AMD, 8 GB memory, 1.5 TB disk, Gigabit switch.
  – One Mako server (with MakoDB) on each node of a cluster; one GME server in the environment.
  – Four aspects of the middleware evaluated:
    • Model-driven database creation using MakoDB
    • Data submission, query, and retrieval using the Mako protocol with three different backends
    • Multi-server response time trials with varying numbers of readers and writers
    • Multi-cluster trials with varying numbers of reader and writer clients
  – Results for our in-house data management system, MakoDB, are very promising.
  – We stress tested the middleware and found that it scales well as more clients interact with it.

Page 18

Model-driven, on-demand database creation using the MakoDB backend.

• Highly nested models as well as highly flat models scale relative to each other in the system as we vary their respective heights and widths.

Page 19

• Mako Single-Server Submission and Retrieval Experimental Results
  – Performed against three Mako-exposed data services
    • MakoDB
    • eXist (Mako XML:DB service)
    • Xindice (Mako XML:DB service)
  – Two experimental models (a generation sketch follows below)
    • Model A (flat)
      – Contains a root element with ten children,
      – each with a child containing text.
    • Model B (tall and wide)
      – Designed to show the cost of both depth and width.
      – Contains 10 elements, each containing 10 child elements.
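For concreteness, documents of the two shapes could be generated roughly as in the sketch below; the element names, text sizes, and fan-out parameters are assumptions for illustration, not the generator actually used in the experiments.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Rough sketch of generating test documents with the two shapes described above.
public class ModelDocGenerator {

    // Model A (flat): a root with N children, each holding one text-bearing child.
    static Document modelA(int children, int textBytes) throws Exception {
        Document doc = newDoc();
        Element root = doc.createElement("modelA");
        doc.appendChild(root);
        for (int i = 0; i < children; i++) {
            Element child = doc.createElement("child");
            Element leaf = doc.createElement("value");
            leaf.setTextContent("x".repeat(textBytes));   // filler text controls document size
            child.appendChild(leaf);
            root.appendChild(child);
        }
        return doc;
    }

    // Model B (tall and wide): a root with 10 elements, each containing 10 children.
    static Document modelB(int textBytes) throws Exception {
        Document doc = newDoc();
        Element root = doc.createElement("modelB");
        doc.appendChild(root);
        for (int i = 0; i < 10; i++) {
            Element branch = doc.createElement("branch");
            for (int j = 0; j < 10; j++) {
                Element leaf = doc.createElement("leaf");
                leaf.setTextContent("x".repeat(textBytes));
                branch.appendChild(leaf);
            }
            root.appendChild(branch);
        }
        return doc;
    }

    private static Document newDoc() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }
}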

Page 20

Submission Results

[Charts: Model A and Model B XML submission experiments. Submission time in milliseconds (log scale, 10 to 100,000) versus document size (10 KB, 100 KB, 1 MB, 10 MB) for makodb, exist, and xindice.]

Page 21

Retrieval Results

[Charts: Model A and Model B XML retrieval experiments. Retrieval time in milliseconds (log scale, 10 to 100,000) versus document size (10 KB, 100 KB, 1 MB, 10 MB) for makodb, exist, and xindice.]

Page 22

• XPath Query Single-Server Results
  – Medical image model
    • demographic information and other metadata
    • base64-encoded image
    • base64-encoded thumbnail
  – The experiment queried for a set of thumbnails in the system based on varied metadata values (an illustrative query follows the chart below).

[Chart: XPath query experiment. Query time in milliseconds (0 to 6,000) versus number of documents (0 to 1,000) for makodb, exist, and xindice.]
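The kind of query the experiment issues (select thumbnails whose metadata matches a value) can be illustrated with the standard javax.xml.xpath API; the element names in the expression are assumptions about the medical image model, and the local file stands in for a Mako collection.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustration of querying for base64-encoded thumbnails of images whose
// metadata matches a given value; element names are assumed, not the real model.
public class ThumbnailQueryExample {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("images.xml"));   // stand-in for a Mako collection

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList thumbs = (NodeList) xpath.evaluate(
                "//image[metadata/modality='MR']/thumbnail",
                doc, XPathConstants.NODESET);

        for (int i = 0; i < thumbs.getLength(); i++) {
            String base64Thumbnail = thumbs.item(i).getTextContent();
            System.out.println("thumbnail length (base64 chars): " + base64Thumbnail.length());
        }
    }
}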

Page 23

Client scaling tests.

Multi-server scaling results are better than single-server scaling results, showing that client response times gradually improve as more servers are added, until hardware saturation takes over.

Page 24

• Multi-client, multi-cluster response time trials.

Page 25

• Conclusions
  – A searchable, scalable, distributed, shareable middleware to support the data management needs of data-intensive applications.
  – Preliminary results show that it provides an efficient and flexible system for caching, sharing, and asynchronously exchanging data in a distributed environment, and that it accommodates several backend storage choices.

Page 26

Mobius in the Community

• NIH-CaBIG
  – Mobius has been adopted as a core middleware component of the NIH Cancer Grid effort.
• OGSA-DAI
  – Mobius has been selected as a principal sister project to the e-Science OGSA-DAI effort.
• GGF
  – Active members of the Data Access and Integration Services Working Group (DAIS-WG)
• Biological Research Technology Transfer (BRTT)
  – Mobius is currently funded under a State of Ohio technology transfer grant.
• Ohio State University Comprehensive Cancer Center
  – Mobius is being used as the data management backbone across the clinical labs and patient care centers of the cancer center.

Page 27

End of Talk