Upload
datacenters
View
214
Download
0
Embed Size (px)
Citation preview
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
A Distributed Data Management Middleware for Data-Driven Application Systems
The Mobius Project
Shannon HastingsThe Ohio State University
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Overview– Motivation– Related Work– Mobius Middleware
• Services• Protocol• Architecture• Results
– Community Presence
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Motivation– Increasing numbers of applications that wish to store and query distributed and
disparate data• Large-scale data discovery, versioning, storage, retrieval, and linking in a
distributed multi-institutional/global environment.• Management of data model evolution and data translation.• Support for on-the-fly creation of model driven data worlds. • Support indexing and storage of large-scale data sets in distributed
environments.– Approach: create a data storage and management middleware that is:
• Distributed and Scaleable – service based grid architecture with ability to run on networked collections of PCs and PC clusters
• Searchable – data instances conform to published data models and can be queried by applications.
• Shareable – generic protocols enabling multiple applications to concurrently store, retrieve, query, and share data in the system.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Related Work/Projects
– OGSA-DAI – DQP– Chimera– Storage Resource Broker (SRB)– Distributed Parallel Storage Services (DPSS)– IBP– ActiveProxy-G
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Mobius Overview– The Mobius project identifies, defines, and builds a set of
services and protocols enabling the management and integration of both data and data definitions.
– We break everything into two components:• The protocol definitions.• The service implementation of those definitions.
– Mobius intends to conform to all (maybe) standards set forth by the GGF and attempts to leverage relevant existing grid technologies.
• Currently support the GGF/DAIS XML Realization Proposed Specification
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Mobius Core Services– Mako (Federated Adhoc Storage Services)– Global Model Exchange (GME)– Data Translation Service Framework (DTS)
• Mobius Extension Services– VMako (Single virtual service view of a federation of Makos)– Other Higher level query services (semantic query, inference services etc.)– Data Replication, Transportation, and Distributed Query Services.
• Other Needed Grid Services – Namespace Management– Service Naming– Data Replication– Security
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Mako Service (Data Resource)
– Exposes existing data services as XML data services through a set of well defined interfaces based on the Mako protocol. (GGF/DAIS XML Realization Specification).
– Configurable Protocol Handling• Admin specifies handlers to
process incoming packets based on type.
• Specific implementations of handler can easily be written and integrated.
Mako Architecture
TCP ListenerSSL Listener
GSI Listener
Internet /Network
Supported Interfaces
Create C
ollectionR
emoveC
ollectionS
ubmit S
chema
Subm
it XM
LR
etrieve XM
LR
emove X
ML
Subm
it XP
athS
ubmit X
Query
Get C
ollectionsG
et Service D
ataP
erfrom X
Update
Packet Router
XML XML
DataResource
DataResource
DataResource
XML Packet
The Packet Router parses the XMLpacket and determines if the Makohas an implementation of the interfacefor supporting that packet type, if it doesthe packets is passed to that Handler forprocessing.
Example Data Services:OracleExistMako DBmysqlOracle XML DB
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Data Resource Support– “Mako DB”
• In house XML database.• Optimized for federated ad hoc usage of XML.• Plugs into Mako framework.
– XML DB Support• Built in support for XML databases that support the XML
DB API.– Other Data Resources
• Easily integrated, by implementing a small set of handlers for them
• In the future these handlers will be publicly available.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Other Features
– Data Validation– Element Referencing– Lazy Retrieval– Distributed Document Object Model (DOM)– Large binary node attachments
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Mako Data Referencing Use Case
– Shows retrieving data that is federated across multiple Mako services.
Mako Data Referencing
Patient 1Patient 2Patient 3 (Expanded) <patient> <lastName>Smith</lastName> <firstName>Mike</firstName> ….. </patient>Patient 4
Mako1Mako2
Study 1Study 2 (Expanded) <study name=”myStudy”/> <patients> <mako:DocumentReference serviceName=”Mako1” document=”Patient 2"/> <mako:DocumentReference serviceName=”Mako1” document=”Patient 3"/> </patients> </study>Study 3Study 4
Request Study 2
<study name=”myStudy”/> <patients> <patient> <lastName>Doe</lastName> <firstName>John</firstName> ….. </patient> <patient> <lastName>Smith</lastName> <firstName>Mike</firstName> ….. </patient </patients> </study>
The Mako Protocol allows pieces ofdata being referenced to be resolved atrequest time by the Mako retrieving therequest, or it can be done lazily by the
client.
Client
Response
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
Mako
• Virtual Mako– Simplifies client-side complexity of
interfacing with multiple Makos by presenting a single virtualized interface to a collection of federated Makos.
• Acts as a data integration point for distributed queries
• Pluggable algorithms for XML instance ingestion/distribution
• Protocol request broadcast and response aggregation
• Supports all services a standard Mako supports
• Maps a Virtual Collection to a number of remote standard Collections
VirtualMako
Request on Collection 1
Response
Virtual Mako
Mako
Remote Requeston Collection A
Responses
Collection A Collection B
Remote Requeston Collection B
Remote Requeston Collection C
…
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• GME: Need for global data definition management!
– What is “global data definition” (Global Model)?• my “LabResult” != your “LabResult”
– Promote creation and evolution of standard definitions of data types. – For data communication between multiple institutions they must agree
on a common structure or a mapping between structures. – Allow for sharing and discovery of data definitions in a grid
environment.
• GME Enables– publish, retrieve, query, reference, and version data models in a
distributed domain hierarchical environment
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Global Model Exchange– Manages the Global Model
• handles presented issues– Provides submission and
discovery protocol– DNS like architecture
• Hierarchical authority subordinate domain division
• Model resolution query messaging
• Local cache with TTL.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
Users/Services publish schemas to the authoritative GME of that namespace. Then any other services/users from similar or different organizations who have the authority will be able to reference, use, alter, etc the data definitions of that schema.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Data Translation Service– Issues
• How do I translate one data type to another?
• How do I convert an old version of a data type to a newer one?
– Protocol and service framework for handling the mapping of one data instance or data definition to another should exist.
– Allows two protocol disjoint services to communicate
– Enables translating between changing data types.
A TO BMapping Service
C1 TO C2
Mapping Service
Data Translation
ServiceRegistry
Registration Registration
User(Wants to convert data
from type A to B)
1) Discovery
2) A
Dat
a
3) B
Dat
a
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Mobius Application Examples (www.projectmobius.org
)
– GridPACS• Distributed image
management and analysis application.
– BlockMan• Large-scale data de-
clustering and querying application.
GME
GridPACS Architecture
MakoMakoMakoMako
VMako
MakoMakoMakoMako
MakoMakoMakoMakoMakoMakoMakoMakoMakoMakoMakoMako
DC Cluster
Rogue Cluster
Mob Cluster
5 nodes, 2 TB40 nodes, 16 TB
8 nodes, 14 TB
MakoMakoMakoMako
OSUmed Cluster
24 nodes, 7.2 TB
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Experimental Results– System Configuration:
• a 24 node cluster, 933 Mhz CPU, 512MB memory, 300GB disk, Fast Ethernet. • A 5-node cluster, dual Xeon 2.4Ghz, 2GB memory, 360GB disk. Gigabit Switch. • An 8-node cluster, dual 64-bit AMD, 8GB memory, 1.5TB disk. Gigabit Switch.
– One Mako server (with MakoDB) on each node of a cluster. One GME server in the environment.
– Four aspects of middleware evaluated:• Model driven database creation using MakoDB• Data submission, query, and retrieval using Mako protocol with 3 different
backends.• Multi-server response time trials with varying number of readers and writers.• Multi-cluster trials with varying numbers of reader and writer clients.
– Results for our in house data management “MakoDB” are very promising.– We stressed tested the middleware and found very good scale as more
clients try to interact.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
Model driven, on-demand database creation using MakoDB backend.
• highly nested models and well as highly flat models scale relative to each other in the system as we vary there respective heights and widths.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Mako Single Server Submission and Retrieval Experimental Results– Performed against three Mako exposed data services
• MakoDB• Exist (Mako XMLDB service)• Xindice (Mako XMLDB service)
– Two Experimental Models• Model A (flat)
– Contains a root element with ten children,– Each with a child containing text.
• Model B (tall and wide)– Designed to show the cost of both depth and width. It– Contains 10 elements, which each containing 10 child elements.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
Submission Results
Model A XML Submission Experiment
10
100
1000
10000
100000
10K 100K 1MB 10MB
Document Size
mill
isec
onds makodb
existxindice
Model B XML Submission Experiment
10
100
1000
10000
100000
10K 100K 1MB 10MB
Document Sizem
illis
econ
ds
makodbexistxindice
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
Retrieval Results
Model A XML Retrieval Experiment
10
100
1000
10000
100000
10K 100K 1MB 10MB
Document Size
mill
isec
onds
makodbexistxindice
Model B XML Retrieval Experiment
10
100
1000
10000
100000
10K 100K 1MB 10MB
Document Size
mill
isec
onds
makodbexistsxindice
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• XPath Query Single Server Results– Medical Image Model
• demographical information and other metadata.
• base64 encoded image• base64 encoded thumbnail
– Experiment queried for a set of thumbnails in the system based on varied metadata values.
XPath Query Experiment
0
1000
2000
3000
4000
5000
6000
0 200 400 600 800 1000
Number of Documentsm
illis
econ
ds
makodbexistxindice
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
Client scaling tests.
Multi-server scaling results are better in than single server scaling results. Showing that the client response times will gradually get better as more servers are added until HW saturation has taken over.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Multi-client Multi-cluster response time trials.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
• Conclusions– A searchable, scaleable , distributed, shareable
middleware to support data management needs of data intensive applications.
– Preliminary results show it provides an efficient and flexible system for caching, sharing, and asynchronously exchanging data in a distributed environment and is flexible to several backend storage choices.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
Mobius in the Community
• NIH-CaBIG– Mobius has been adopted as a core middleware component of the NIH Cancer Grid
effort.• OGSA-DAI
– Mobius has been selected as a principle sister project to the E-Sciences project OGSA-DAI effort.
• GGF– Active members of Data Access and Integration Services Working Group (DAIS-
WG)• Biological Research Technology Transfer (BRTT)
– Mobius is currently funded under a State of Ohio technology transfer grant.• Ohio State University Comprehensive Cancer Center
– Mobius is being used as the data management backbone across the clinical labs and patient care centers of the cancer center.
Multiscale Computing Laboratory, Department of Biomedical Informatics, Ohio State Universitywww.multiscalecomputing.org
Cluster 2004Stephen Langella, Shannon Hastings, Scott Oster, Tahsin Kurc, Joel Saltz
End of TalkEnd of Talk