
Page 1

Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State University (www.multiscalecomputing.org)

Cluster 2004
Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz

A Distributed Data Management Middleware for Data-Driven Application Systems

The Mobius Project

Shannon Hastings, The Ohio State University

Page 2

• Overview
  – Motivation
  – Related Work
  – Mobius Middleware
    • Services
    • Protocol
    • Architecture
    • Results
  – Community Presence

Page 3

• Motivation
  – Increasing numbers of applications wish to store and query distributed and disparate data
    • Large-scale data discovery, versioning, storage, retrieval, and linking in a distributed, multi-institutional/global environment.
    • Management of data model evolution and data translation.
    • Support for on-the-fly creation of model-driven data worlds.
    • Support for indexing and storage of large-scale data sets in distributed environments.
  – Approach: create a data storage and management middleware that is:
    • Distributed and Scalable – a service-based grid architecture with the ability to run on networked collections of PCs and PC clusters
    • Searchable – data instances conform to published data models and can be queried by applications.
    • Shareable – generic protocols enable multiple applications to concurrently store, retrieve, query, and share data in the system.

Page 4

• Related Work/Projects
  – OGSA-DAI
  – DQP
  – Chimera
  – Storage Resource Broker (SRB)
  – Distributed Parallel Storage Services (DPSS)
  – IBP
  – ActiveProxy-G

Page 5

• Mobius Overview
  – The Mobius project identifies, defines, and builds a set of services and protocols enabling the management and integration of both data and data definitions.
  – We break everything into two components:
    • The protocol definitions.
    • The service implementations of those definitions.
  – Mobius intends to conform to applicable standards set forth by the GGF and attempts to leverage relevant existing grid technologies.
    • Currently supports the GGF/DAIS XML Realization Proposed Specification

Page 6

• Mobius Core Services
  – Mako (federated ad hoc storage services)
  – Global Model Exchange (GME)
  – Data Translation Service framework (DTS)
• Mobius Extension Services
  – VMako (a single virtual service view of a federation of Makos)
  – Other higher-level query services (semantic query, inference services, etc.)
  – Data replication, transportation, and distributed query services
• Other Needed Grid Services
  – Namespace management
  – Service naming
  – Data replication
  – Security

Page 7

• Mako Service (Data Resource)
  – Exposes existing data services as XML data services through a set of well-defined interfaces based on the Mako protocol (GGF/DAIS XML Realization Specification).
  – Configurable protocol handling
    • The administrator specifies handlers that process incoming packets based on their type.
    • Specific implementations of handlers can easily be written and integrated (see the dispatch sketch below).

[Figure: Mako Architecture. TCP, SSL, and GSI listeners accept XML packets from the network and hand them to a Packet Router, which dispatches them to handlers backed by one or more data resources. Supported interfaces: Create Collection, Remove Collection, Submit Schema, Submit XML, Retrieve XML, Remove XML, Submit XPath, Submit XQuery, Get Collections, Get Service Data, Perform XUpdate.]

The Packet Router parses the XML packet and determines whether the Mako has an implementation of the interface supporting that packet type; if it does, the packet is passed to that handler for processing.

Example data services: Oracle, eXist, Mako DB, MySQL, Oracle XML DB.
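To make the dispatch pattern above concrete, here is a minimal Java sketch of a packet router; PacketHandler, PacketRouter, and the use of the root element name as the packet type are illustrative assumptions, not the actual Mobius classes or protocol.

import java.util.HashMap;
import java.util.Map;
import org.w3c.dom.Element;

// Hypothetical sketch of the handler-dispatch pattern: handlers are registered
// per packet type, and incoming XML packets are routed to the matching handler.
interface PacketHandler {
    Element handle(Element packet) throws Exception;
}

class PacketRouter {
    private final Map<String, PacketHandler> handlers = new HashMap<>();

    // An administrator (or a configuration file) registers a handler per packet type.
    void register(String packetType, PacketHandler handler) {
        handlers.put(packetType, handler);
    }

    // Determine the packet type (here, simply its root element name) and dispatch.
    Element route(Element packet) throws Exception {
        PacketHandler handler = handlers.get(packet.getLocalName());
        if (handler == null) {
            throw new UnsupportedOperationException(
                "No handler registered for packet type: " + packet.getLocalName());
        }
        return handler.handle(packet);
    }
}

In this picture, a TCP, SSL, or GSI listener would parse an incoming XML packet into a DOM element and hand it to route(); the registered handlers would in turn talk to a backend such as MakoDB or an XML:DB database.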

Page 8

• Data Resource Support
  – “Mako DB”
    • In-house XML database.
    • Optimized for federated ad hoc usage of XML.
    • Plugs into the Mako framework.
  – XML:DB Support
    • Built-in support for XML databases that implement the XML:DB API (a usage sketch follows below).
  – Other Data Resources
    • Easily integrated by implementing a small set of handlers for them.
    • In the future these handlers will be publicly available.
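Since the slide mentions built-in support for the XML:DB API, here is a rough sketch of what storing a document through that vendor-neutral Java API typically looks like; the eXist driver class name and the collection URI are assumptions used only for illustration.

import org.xmldb.api.DatabaseManager;
import org.xmldb.api.base.Collection;
import org.xmldb.api.base.Database;
import org.xmldb.api.modules.XMLResource;

// Rough sketch of storing an XML document through the vendor-neutral XML:DB API.
// Driver class and collection URI are illustrative (eXist-style) values.
public class XmlDbSubmitExample {
    public static void main(String[] args) throws Exception {
        // Register an XML:DB-compliant driver (eXist shown here as an example).
        Database db = (Database) Class.forName("org.exist.xmldb.DatabaseImpl")
                                      .getDeclaredConstructor().newInstance();
        DatabaseManager.registerDatabase(db);

        // Open a collection and store a small document in it.
        Collection col = DatabaseManager.getCollection(
                "xmldb:exist://localhost:8080/exist/xmlrpc/db/patients");
        XMLResource res = (XMLResource) col.createResource("patient1.xml",
                                                           XMLResource.RESOURCE_TYPE);
        res.setContent("<patient><lastName>Smith</lastName></patient>");
        col.storeResource(res);
        col.close();
    }
}

Retrieval and XPath queries go through the same API, for example via the collection's XPathQueryService.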

Page 9

• Other Features
  – Data Validation
  – Element Referencing
  – Lazy Retrieval
  – Distributed Document Object Model (DOM)
  – Large binary node attachments

Page 10

• Mako Data Referencing Use Case
  – Shows retrieval of data that is federated across multiple Mako services.

[Figure: Mako Data Referencing. A client requests Study 2 from Mako2; the study document references patient documents stored on Mako1.]

Mako1 holds Patients 1–4; Patient 3 expanded:
  <patient>
    <lastName>Smith</lastName>
    <firstName>Mike</firstName>
    ...
  </patient>

Mako2 holds Studies 1–4; Study 2 expanded:
  <study name="myStudy">
    <patients>
      <mako:DocumentReference serviceName="Mako1" document="Patient 2"/>
      <mako:DocumentReference serviceName="Mako1" document="Patient 3"/>
    </patients>
  </study>

Response to the client's "Request Study 2":
  <study name="myStudy">
    <patients>
      <patient>
        <lastName>Doe</lastName>
        <firstName>John</firstName>
        ...
      </patient>
      <patient>
        <lastName>Smith</lastName>
        <firstName>Mike</firstName>
        ...
      </patient>
    </patients>
  </study>

The Mako protocol allows the referenced pieces of data to be resolved at request time by the Mako receiving the request, or lazily by the client (see the client-side sketch below).
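A client that chooses the lazy option could resolve the mako:DocumentReference elements itself, roughly as in the sketch below; the namespace URI and the fetchDocument() helper (standing in for a Retrieve XML request to the named Mako) are hypothetical, not part of the actual Mobius client API.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Sketch of client-side (lazy) resolution of mako:DocumentReference elements.
public class ReferenceResolver {
    static final String MAKO_NS = "urn:mako"; // illustrative namespace URI

    public static void resolve(Document studyDoc) throws Exception {
        NodeList refs = studyDoc.getElementsByTagNameNS(MAKO_NS, "DocumentReference");
        // Iterate backwards: replacing nodes shrinks the live NodeList.
        for (int i = refs.getLength() - 1; i >= 0; i--) {
            Element ref = (Element) refs.item(i);
            String service = ref.getAttribute("serviceName");
            String docId = ref.getAttribute("document");

            // Fetch the referenced document from the remote Mako and splice it
            // into the tree in place of the reference element.
            Document remote = fetchDocument(service, docId);
            Node imported = studyDoc.importNode(remote.getDocumentElement(), true);
            ref.getParentNode().replaceChild(imported, ref);
        }
    }

    // Placeholder: would issue a Retrieve XML packet to the Mako named 'service'.
    static Document fetchDocument(String service, String docId) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse("http://" + service + "/documents/" + docId); // illustrative URL
    }
}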

Page 11

• Virtual Mako
  – Simplifies the client-side complexity of interfacing with multiple Makos by presenting a single virtualized interface to a collection of federated Makos.
    • Acts as a data integration point for distributed queries
    • Pluggable algorithms for XML instance ingestion/distribution
    • Protocol request broadcast and response aggregation (sketched below)
    • Supports all services a standard Mako supports
    • Maps a virtual collection to a number of remote standard collections

[Figure: a client issues a request on Collection 1 to the Virtual Mako, which forwards it as remote requests on Collections A, B, and C hosted on standard Makos, then aggregates their responses into a single response for the client.]
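The broadcast-and-aggregate behavior can be sketched as a simple parallel fan-out; the MakoClient interface and the virtual-to-remote collection mapping below are illustrative stand-ins for the real protocol machinery.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the Virtual Mako broadcast/aggregation pattern: a request on a
// virtual collection is fanned out to the remote collections it maps to, and
// the responses are merged. MakoClient is a hypothetical stand-in.
interface MakoClient {
    List<String> query(String collection, String xpath) throws Exception;
}

class VirtualMakoSketch {
    record Target(MakoClient mako, String remoteCollection) {}

    private final List<Target> targets;                      // virtual -> remote mapping
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    VirtualMakoSketch(List<Target> targets) { this.targets = targets; }

    // Broadcast the query to every mapped remote collection and aggregate results.
    List<String> query(String xpath) throws Exception {
        List<Future<List<String>>> futures = new ArrayList<>();
        for (Target t : targets) {
            Callable<List<String>> task = () -> t.mako().query(t.remoteCollection(), xpath);
            futures.add(pool.submit(task));
        }
        List<String> aggregated = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            aggregated.addAll(f.get());                      // blocks until each response arrives
        }
        return aggregated;
    }
}

On ingestion, a real Virtual Mako would additionally apply its pluggable distribution algorithm to decide which remote collection each submitted XML instance lands in.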

Page 12

• GME: the need for global data definition management
  – What is a “global data definition” (global model)?
    • my “LabResult” != your “LabResult”
  – Promote the creation and evolution of standard definitions of data types.
  – For data communication between multiple institutions, the institutions must agree on a common structure or on a mapping between structures.
  – Allow for sharing and discovery of data definitions in a grid environment.
• GME enables
  – publishing, retrieving, querying, referencing, and versioning of data models in a distributed, hierarchical domain environment

Page 13

• Global Model Exchange
  – Manages the global model
    • handles the issues presented above
  – Provides a submission and discovery protocol
  – DNS-like architecture
    • Hierarchical division into authority and subordinate domains
    • Model resolution query messaging
    • Local cache with TTL (see the resolution sketch below)
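A hedged sketch of the DNS-style resolution path with a TTL cache follows; the class, the domain-naming convention, and the parent/subordinate lookup calls are illustrative assumptions, not the actual GME protocol.

import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of DNS-style model resolution in a GME node: consult a
// local TTL cache, then locally published models, otherwise forward the query
// toward the root authority, which walks back down the subordinate hierarchy.
class GmeNodeSketch {
    record Cached(String schema, long expiresAt) {}

    private final String domain;                                    // e.g. "bmi.osu.edu" (assumed convention)
    private final GmeNodeSketch parent;                             // null at the root
    private final Map<String, GmeNodeSketch> subordinates = new HashMap<>();
    private final Map<String, String> published = new HashMap<>();  // modelName -> schema text
    private final Map<String, Cached> cache = new HashMap<>();
    private final long ttlMillis;

    GmeNodeSketch(String domain, GmeNodeSketch parent, long ttlMillis) {
        this.domain = domain; this.parent = parent; this.ttlMillis = ttlMillis;
        if (parent != null) parent.subordinates.put(domain, this);
    }

    void publish(String modelName, String schema) { published.put(modelName, schema); }

    // Entry point used by clients of this GME node.
    String resolve(String modelDomain, String modelName) {
        String key = modelDomain + ":" + modelName;
        Cached hit = cache.get(key);
        if (hit != null && hit.expiresAt() > System.currentTimeMillis()) return hit.schema();

        String schema = modelDomain.equals(domain)
                ? published.get(modelName)                                   // we are the authority
                : (parent != null ? parent.resolve(modelDomain, modelName)   // forward toward the root
                                  : resolveDown(modelDomain, modelName));    // root: descend
        if (schema != null) cache.put(key, new Cached(schema, System.currentTimeMillis() + ttlMillis));
        return schema;
    }

    // Walk down the authority hierarchy to the subordinate owning the domain.
    private String resolveDown(String modelDomain, String modelName) {
        if (modelDomain.equals(domain)) return published.get(modelName);
        for (Map.Entry<String, GmeNodeSketch> e : subordinates.entrySet()) {
            if (modelDomain.equals(e.getKey()) || modelDomain.endsWith("." + e.getKey())) {
                return e.getValue().resolveDown(modelDomain, modelName);
            }
        }
        return null;                                                 // unknown domain
    }
}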

Page 14

Users and services publish schemas to the authoritative GME of that namespace. Any other services or users, from the same or different organizations, who have the appropriate authority can then reference, use, alter, etc., the data definitions of that schema.

Page 15

• Data Translation Service
  – Issues
    • How do I translate one data type to another?
    • How do I convert an old version of a data type to a newer one?
  – A protocol and service framework should exist for mapping one data instance or data definition to another.
  – Allows two protocol-disjoint services to communicate
  – Enables translation between changing data types

[Figure: mapping services (e.g., “A to B”, “C1 to C2”) register with a Data Translation Service registry. A user who wants to convert data from type A to type B (1) discovers a suitable mapping service, (2) sends the A data to it, and (3) receives B data back. A minimal sketch of this flow follows.]
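The registration, discovery, and translation steps in the figure might look roughly like the following sketch; the Translator interface, the registry keying, and the toy A-to-B mapping are all illustrative assumptions rather than the DTS protocol itself.

import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Sketch of the data-translation flow: mapping services register under a
// (fromType, toType) key, a user discovers a suitable mapping service,
// sends data of type A, and receives data of type B.
interface Translator {
    String translate(String sourceInstance);   // e.g. XML in, XML out
}

class TranslationRegistry {
    private final Map<String, Translator> translators = new HashMap<>();

    // Mapping services register the (from -> to) translations they offer.
    void register(String fromType, String toType, Translator t) {
        translators.put(fromType + "->" + toType, t);
    }

    // Step 1: discovery. Steps 2 and 3 happen when the caller invokes translate().
    Optional<Translator> discover(String fromType, String toType) {
        return Optional.ofNullable(translators.get(fromType + "->" + toType));
    }
}

class DtsExample {
    public static void main(String[] args) {
        TranslationRegistry registry = new TranslationRegistry();
        // A hypothetical "A to B" mapping service registers itself.
        registry.register("A", "B", src -> src.replace("<a>", "<b>").replace("</a>", "</b>"));

        // A user wanting to convert A data to B data discovers and invokes it.
        String bData = registry.discover("A", "B")
                               .map(t -> t.translate("<a>hello</a>"))
                               .orElseThrow(() -> new IllegalStateException("no A->B mapping registered"));
        System.out.println(bData);   // prints <b>hello</b>
    }
}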

Page 16

• Mobius Application Examples (www.projectmobius.org)
  – GridPACS
    • Distributed image management and analysis application.
  – BlockMan
    • Large-scale data de-clustering and querying application.

[Figure: GridPACS architecture. A GME and a VMako front a set of Mako servers deployed on four clusters (DC, Rogue, Mob, and OSUmed); the cluster labels in the figure read 5 nodes/2 TB, 40 nodes/16 TB, 8 nodes/14 TB, and 24 nodes/7.2 TB.]

Page 17

• Experimental Results
  – System configuration:
    • A 24-node cluster: 933 MHz CPU, 512 MB memory, 300 GB disk, Fast Ethernet.
    • A 5-node cluster: dual 2.4 GHz Xeon, 2 GB memory, 360 GB disk, Gigabit switch.
    • An 8-node cluster: dual 64-bit AMD, 8 GB memory, 1.5 TB disk, Gigabit switch.
  – One Mako server (with MakoDB) on each node of a cluster; one GME server in the environment.
  – Four aspects of the middleware evaluated:
    • Model-driven database creation using MakoDB
    • Data submission, query, and retrieval using the Mako protocol with three different backends
    • Multi-server response time trials with varying numbers of readers and writers
    • Multi-cluster trials with varying numbers of reader and writer clients
  – Results for our in-house data management system, MakoDB, are very promising.
  – We stress tested the middleware and found that it scales well as more clients interact with it.

Page 18

Model-driven, on-demand database creation using the MakoDB backend.

• Highly nested models as well as highly flat models scale relative to each other in the system as we vary their respective heights and widths.

Page 19

• Mako Single-Server Submission and Retrieval Experimental Results
  – Performed against three Mako-exposed data services
    • MakoDB
    • eXist (Mako XML:DB service)
    • Xindice (Mako XML:DB service)
  – Two experimental models (a generation sketch follows below)
    • Model A (flat)
      – Contains a root element with ten children,
      – each with a child containing text.
    • Model B (tall and wide)
      – Designed to show the cost of both depth and width.
      – Contains 10 elements, each containing 10 child elements.
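For concreteness, documents of the two shapes could be generated roughly as in the sketch below; the element names, text sizes, and fan-out parameters are assumptions for illustration, not the generator actually used in the experiments.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Rough sketch of generating test documents with the two shapes described above.
public class ModelDocGenerator {

    // Model A (flat): a root with N children, each holding one text-bearing child.
    static Document modelA(int children, int textBytes) throws Exception {
        Document doc = newDoc();
        Element root = doc.createElement("modelA");
        doc.appendChild(root);
        for (int i = 0; i < children; i++) {
            Element child = doc.createElement("child");
            Element leaf = doc.createElement("value");
            leaf.setTextContent("x".repeat(textBytes));   // filler text controls document size
            child.appendChild(leaf);
            root.appendChild(child);
        }
        return doc;
    }

    // Model B (tall and wide): a root with 10 elements, each containing 10 children.
    static Document modelB(int textBytes) throws Exception {
        Document doc = newDoc();
        Element root = doc.createElement("modelB");
        doc.appendChild(root);
        for (int i = 0; i < 10; i++) {
            Element branch = doc.createElement("branch");
            for (int j = 0; j < 10; j++) {
                Element leaf = doc.createElement("leaf");
                leaf.setTextContent("x".repeat(textBytes));
                branch.appendChild(leaf);
            }
            root.appendChild(branch);
        }
        return doc;
    }

    private static Document newDoc() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }
}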

Page 20

Submission Results

[Charts: Model A and Model B XML submission experiments. Submission time in milliseconds (log scale, 10 to 100,000) versus document size (10 KB, 100 KB, 1 MB, 10 MB) for makodb, exist, and xindice.]

Page 21

Retrieval Results

[Charts: Model A and Model B XML retrieval experiments. Retrieval time in milliseconds (log scale, 10 to 100,000) versus document size (10 KB, 100 KB, 1 MB, 10 MB) for makodb, exist, and xindice.]

Page 22

• XPath Query Single-Server Results
  – Medical image model
    • demographic information and other metadata
    • base64-encoded image
    • base64-encoded thumbnail
  – The experiment queried for a set of thumbnails in the system based on varied metadata values (an illustrative query follows the chart below).

[Chart: XPath query experiment. Query time in milliseconds (0 to 6,000) versus number of documents (0 to 1,000) for makodb, exist, and xindice.]
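The kind of query the experiment issues (select thumbnails whose metadata matches a value) can be illustrated with the standard javax.xml.xpath API; the element names in the expression are assumptions about the medical image model, and the local file stands in for a Mako collection.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Illustration of querying for base64-encoded thumbnails of images whose
// metadata matches a given value; element names are assumed, not the real model.
public class ThumbnailQueryExample {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("images.xml"));   // stand-in for a Mako collection

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList thumbs = (NodeList) xpath.evaluate(
                "//image[metadata/modality='MR']/thumbnail",
                doc, XPathConstants.NODESET);

        for (int i = 0; i < thumbs.getLength(); i++) {
            String base64Thumbnail = thumbs.item(i).getTextContent();
            System.out.println("thumbnail length (base64 chars): " + base64Thumbnail.length());
        }
    }
}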

Page 23

Client scaling tests.

Multi-server scaling results are better than single-server scaling results, showing that client response times gradually improve as more servers are added, until hardware saturation takes over.

Page 24

• Multi-client, multi-cluster response time trials.

Page 25

• Conclusions
  – A searchable, scalable, distributed, shareable middleware to support the data management needs of data-intensive applications.
  – Preliminary results show that it provides an efficient and flexible system for caching, sharing, and asynchronously exchanging data in a distributed environment, and that it accommodates several backend storage choices.

Page 26

Mobius in the Community

• NIH-CaBIG
  – Mobius has been adopted as a core middleware component of the NIH Cancer Grid effort.
• OGSA-DAI
  – Mobius has been selected as a principal sister project to the e-Science OGSA-DAI effort.
• GGF
  – Active members of the Data Access and Integration Services Working Group (DAIS-WG)
• Biological Research Technology Transfer (BRTT)
  – Mobius is currently funded under a State of Ohio technology transfer grant.
• Ohio State University Comprehensive Cancer Center
  – Mobius is being used as the data management backbone across the clinical labs and patient care centers of the cancer center.

Page 27

End of Talk