Federating Archives in the DELAMAN Network
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE/SRB
Storage Resource Broker
• Build a shared collection
• Authenticate users independently of the storage systems
• Control access independently of the storage systems
• Organize the file name space independently of the storage systems
• Manage context (metadata) independently of content (files)
• Maintain consistency between context and operations on content
Distributed Data Management
Using Data Grids
Storage Resource Broker
• Generic distributed data management technology
  • Data grids - sharing
  • Digital libraries - publication
  • Persistent archives - preservation
• Federated server architecture / thin client
  • 250,000 lines of “C” code
  • Supports all major compute and storage platforms
• All requirements listed on the following Scenario slides are supported
Scenario 1 - Data Migration
• Provide URIDs (logical file names) that are independent of storage system
• Provide metadata for each file
• Support browse and discovery on collection hierarchy
• Support access interfaces to the data
• Support registration of existing files into a shared collection
• Single sign-on environment
  • GSI / challenge response / tickets
Managing Distributed Data
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Naming conventions provided by storage systems
Data Grids Provide a Level of Indirection for Each Naming Convention
Storage Repository
• Storage location
• User name
• File name
• File context (creation date,…)
• Access constraints
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space (URID)
• Logical context (metadata)
• Control/consistency constraints
Data Collection
Data Access Methods (C library, Unix, Web Browser)
Data is organized as a shared collection
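The level of indirection described above can be sketched as a small catalog that maps a logical file name to one or more physical replicas. This is a minimal illustration and not SRB code: the `Replica` and `Catalog` classes, the resource names, and the paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # logical resource name (hypothetical, e.g. "tape-archive")
    physical_path: str  # path as known to that storage system


@dataclass
class Catalog:
    """Toy metadata catalog: logical file name -> list of physical replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name, replica):
        # Registration adds a replica; the logical name itself never changes.
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name):
        # Logical-to-physical mapping: callers only ever use the logical name.
        return list(self.entries.get(logical_name, []))

    def migrate(self, logical_name, old_resource, new_replica):
        # Moving data to a new storage system only rewrites the mapping.
        self.entries[logical_name] = [
            new_replica if r.resource == old_resource else r
            for r in self.entries[logical_name]
        ]


catalog = Catalog()
name = "/demo/recordings/session01.wav"  # made-up logical file name
catalog.register(name, Replica("local-disk", "/data/a/0001.wav"))
catalog.migrate(name, "local-disk", Replica("tape-archive", "/archive/vol7/0001.wav"))
print(catalog.resolve(name))
```

After the migration the logical name is unchanged, so anything that refers to the file by that name keeps working.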
Provide Context for Data
• Properties of files
  • Provenance - source
  • Descriptive attributes
  • State information resulting from operations on files
• Organize properties as metadata in a collection hierarchy
  • Define operations on file properties
  • Manage state information - location, replicas, containers, checksums
• Separate context management from content management
  • Maintain consistency of context as operations are done on content
• Support context management
  • Schema extension, automated SQL generation, bulk metadata load
  • Metadata extraction through a remote procedure parsing the file
Federated Server Architecture
[Figure: federated server diagram. Labels: Read Application, SRB agent (two), SRB server (two), MCAT; numbered steps 1-6.]
• Request is made by logical name or attribute condition
• MCAT: 1. Logical-to-physical mapping 2. Identification of replicas 3. Access & audit control
• Peer-to-peer brokering
• Server(s) spawning
• Data access
• Parallel data access to replicas R1, R2 (steps 5/6)
Storage Resource Broker - Data Grid
[Figure: layered architecture diagram. Recoverable labels:]
• Application
• Access interfaces: C, C++, Java libraries; Linux I/O; Unix shell; DLL / Python, Perl; Java, NT browser; Kepler actors; OAI, WSDL, WSRF; HTTP, DSpace, OpenDAP
• Federation management
• Consistency & metadata management / authorization, authentication, audit
• Logical name space; latency management; data transport; metadata transport
• Catalog abstraction
  • Databases: DB2, Oracle, Sybase, Postgres, mySQL, Informix
• Storage repository virtualization
  • Archives: tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS
  • Databases: DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix
  • File systems: Unix, NT, Mac OSX
  • ORB
Scenario 2 - Data Exchange
• Support access controls on the URIDs
  • Java administration GUI to support owner control of access controls
  • Can delegate permission to set access controls
  • Access controls apply on all replicas independent of storage system
• Support latency management for moving files across wide area networks
  • Parallel I/O, replication, staging, aggregation of data / metadata / I/O commands
• Support integrity validation
  • Manage checksums for each file
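Integrity validation with per-file checksums can be illustrated as follows. This is a minimal sketch: the choice of SHA-256 is an assumption (the slides do not say which checksum SRB uses), and the function names and sample data are hypothetical.

```python
import hashlib


def checksum(data: bytes) -> str:
    # SHA-256 is an assumed choice; any stable digest would illustrate the idea.
    return hashlib.sha256(data).hexdigest()


def verify_replicas(recorded: str, replicas: dict) -> list:
    """Return the names of replicas whose content no longer matches the
    checksum recorded in the catalog."""
    return [name for name, data in replicas.items() if checksum(data) != recorded]


original = b"field recording, session 01"
recorded = checksum(original)  # stored as metadata when the file is registered

# One replica has been damaged in transit (last byte changed).
replicas = {"site-a": original, "site-b": original[:-1] + b"X"}
corrupted = verify_replicas(recorded, replicas)
print(corrupted)
```

Because the checksum is kept with the logical file rather than with any one storage system, the same check applies to every replica.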
Latency Management - Bulk Operations
• Bulk register
  • Create a logical name for a file
• Bulk load
  • Create a copy of the file on a data grid storage repository
• Bulk unload
  • Provide containers to hold small files and pointers to each file location
• Bulk delete
  • Mark as deleted in metadata catalog
  • After specified interval, delete file
• Bulk metadata load
  • Support parsing of metadata from a remote file at remote storage
• Requests for bulk operations for access control setting, …
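The container idea mentioned above (holding many small files in one object, with a pointer to each file's location) can be sketched as concatenation plus an offset index. This is an illustration only; the function names and the in-memory layout are assumptions, not the SRB container format.

```python
def build_container(files: dict):
    """Concatenate small files into one blob.

    Returns (container_bytes, index), where index maps each logical name
    to an (offset, length) pointer into the container.
    """
    blob = bytearray()
    index = {}
    for name, data in files.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


def read_from_container(container: bytes, index: dict, name: str) -> bytes:
    # A single file is recovered by slicing at its recorded pointer.
    offset, length = index[name]
    return container[offset:offset + length]


# Hypothetical small files keyed by logical name.
small_files = {"/coll/a.txt": b"alpha", "/coll/b.txt": b"bravo!", "/coll/c.txt": b""}
container, index = build_container(small_files)
print(index)
print(read_from_container(container, index, "/coll/b.txt"))
```

Moving one container across a wide area network costs one transfer instead of one per file, which is the latency saving the slide refers to.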
Scenario 3 - Community Access
• Within the shared collection, the digital entities are owned and managed by the data grid
  • Files, URLs, SQL commands, database binary large objects can be registered into the shared collection
• Access controls for
  • Files / metadata / storage systems
• Access controls are defined for multiple roles
  • Schema extension, create new metadata
  • Modify metadata
  • Add annotations
  • Turn on audit trails
  • Write data
  • Read data
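One way to picture role-based access controls on logical names is a set of permission flags per (logical name, user) pair. This is a minimal sketch under assumptions: the `Permission` names paraphrase the roles listed above, and the class and its methods are hypothetical rather than an SRB interface.

```python
from enum import Flag, auto


class Permission(Flag):
    READ = auto()
    WRITE = auto()
    AUDIT = auto()
    ANNOTATE = auto()
    MODIFY_METADATA = auto()
    EXTEND_SCHEMA = auto()


class AccessControlList:
    """Grants are keyed by logical name, so they hold for every replica."""

    def __init__(self):
        self._grants = {}  # (logical_name, user) -> Permission

    def grant(self, logical_name, user, permission):
        key = (logical_name, user)
        self._grants[key] = self._grants.get(key, Permission(0)) | permission

    def allowed(self, logical_name, user, permission):
        granted = self._grants.get((logical_name, user), Permission(0))
        # Every requested flag must be present in the grant.
        return (granted & permission) == permission


acl = AccessControlList()
acl.grant("/coll/report.pdf", "alice", Permission.READ | Permission.ANNOTATE)
print(acl.allowed("/coll/report.pdf", "alice", Permission.ANNOTATE))
print(acl.allowed("/coll/report.pdf", "alice", Permission.WRITE))
```

Keying grants on the logical name rather than on a physical path is what lets the same control follow the file across storage systems.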
Scenario 4 - Explorative Studies
• Uniform access mechanisms to data across all storage systems
  • Support for queries on databases
  • Support for formatting results (XML, HTML)
  • Support audit trails, encryption
• Support user-defined collection hierarchy
  • Soft links (build a logical collection of pointers to data within the data grid)
• Support for multiple types of discovery
  • By URID (Logical File Name)
  • By query on metadata (may be unique to a single file)
  • By GUID (handle system)
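Two of the discovery modes above, lookup by logical name and query on metadata, can be sketched over an in-memory collection. The collection contents, attribute names, and function names are hypothetical; discovery by GUID through a handle system is not modeled here.

```python
def find_by_name(collection, logical_name):
    """Discovery by URID: direct lookup of one logical file name."""
    return collection.get(logical_name)


def find_by_metadata(collection, **conditions):
    """Discovery by attribute query: every condition must match exactly."""
    return sorted(
        name
        for name, attributes in collection.items()
        if all(attributes.get(key) == value for key, value in conditions.items())
    )


# Hypothetical collection: logical names and attributes are made up.
collection = {
    "/demo/audio/s01.wav": {"type": "audio", "year": 2003},
    "/demo/audio/s02.wav": {"type": "audio", "year": 2004},
    "/demo/text/notes.txt": {"type": "text", "year": 2004},
}

print(find_by_name(collection, "/demo/text/notes.txt"))
print(find_by_metadata(collection, type="audio"))
print(find_by_metadata(collection, type="audio", year=2004))
```

A sufficiently specific attribute query narrows to a single file, which is the sense in which metadata "may be unique to a single file".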
Scenario 5 - Education
• SRB is used to build digital libraries
  • Assemble class material
  • Manage student reports
  • Display material through web browsers
• Federation of digital libraries
  • Controlled sharing across independent data grids or digital libraries
  • Support for cross-registration of logical name spaces
  • Authentication done by “home” data grid
  • Access controls managed by both data grids
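The federation rules above can be sketched with two toy grids: the user's home grid checks the credential, and the remote grid decides whether the cross-registered user may read an object. This is a simplification under assumptions; the `DataGrid` class, the `user@zone` naming, and the shared-secret check are hypothetical and do not reflect SRB's actual protocol.

```python
class DataGrid:
    """Toy data grid (zone) with its own users and access controls."""

    def __init__(self, zone):
        self.zone = zone
        self._secrets = {}         # local user -> shared secret
        self.remote_users = set()  # cross-registered "user@zone" names
        self.readers = {}          # logical name -> set of "user@zone"

    def add_user(self, user, secret):
        self._secrets[user] = secret

    def authenticate(self, user, secret):
        # Only the home grid ever checks the user's secret.
        return user in self._secrets and self._secrets[user] == secret

    def cross_register(self, qualified_user):
        self.remote_users.add(qualified_user)

    def grant_read(self, logical_name, qualified_user):
        self.readers.setdefault(logical_name, set()).add(qualified_user)


def federated_read(home, remote, user, secret, logical_name):
    """Return True if `user` from `home` may read `logical_name` in `remote`."""
    qualified = f"{user}@{home.zone}"
    if not home.authenticate(user, secret):
        return False  # authentication is done by the home grid
    if qualified not in remote.remote_users:
        return False  # user was never cross-registered with the remote grid
    return qualified in remote.readers.get(logical_name, set())


grid_a, grid_b = DataGrid("A"), DataGrid("B")
grid_a.add_user("alice", "s3cret")
grid_b.cross_register("alice@A")
grid_b.grant_read("/B/coll/lecture.mov", "alice@A")
print(federated_read(grid_a, grid_b, "alice", "s3cret", "/B/coll/lecture.mov"))
```

The remote grid never sees or stores the user's secret; it only trusts the home grid's answer and applies its own access controls.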
Federation
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection B
Data Access Methods (Web Browser, DSpace, OAI-PMH)
Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints
Data Collection A
Access controls and consistency constraints on cross registration of digital entities
Scenario 6 - Updating Resources
• Maintain system level metadata
  • Owner of registered file
  • Creation time, modification time, size, audit trails
  • Replica locations
• Support for synchronization of replicas
  • Can modify a replica, subsequent reads are to the modified copy
  • Can synchronize copies to the modified version
• Support for physical file containers
  • Aggregate small files before storage
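Replica synchronization as described above can be sketched by tracking which replicas are out of date after a write. The `ReplicatedFile` class and resource names are hypothetical; this is an illustration of the bookkeeping, not SRB's implementation.

```python
class ReplicatedFile:
    """Toy logical file with several replicas and a record of stale copies."""

    def __init__(self, content, resources):
        self.replicas = {resource: content for resource in resources}
        self.dirty = set()  # resources whose copy is out of date

    def write(self, resource, content):
        # Modify one replica; every other copy becomes stale.
        self.replicas[resource] = content
        self.dirty = set(self.replicas) - {resource}

    def read(self):
        # Subsequent reads go to a copy that is not marked stale.
        for resource, content in self.replicas.items():
            if resource not in self.dirty:
                return content
        raise RuntimeError("no current replica available")

    def synchronize(self):
        # Bring every stale copy up to the modified version.
        current = self.read()
        for resource in self.dirty:
            self.replicas[resource] = current
        self.dirty.clear()


f = ReplicatedFile(b"v1", ["site-a", "site-b"])
f.write("site-b", b"v2")
print(f.read())       # content of the modified copy
f.synchronize()
print(f.replicas)
```

Keeping the stale set in the catalog is an example of state information that must stay consistent with operations on content.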
Scenario 7 - Web-based Editions
• Support for digital library interfaces on top of the data grid
  • Transana - technology to manipulate, edit, and manage classroom video (University of Wisconsin)
  • DSpace - digital library system to manage ingestion of material into a collection
  • OAI-PMH - Open Archives Initiative protocol for metadata harvesting
  • OpenDAP - Data Access Protocol that supports both semantic and structural manipulation of registered files
  • Windows browser, Web browser, Java, WSDL interfaces
  • Collaborating on development of portlet interface
[Figure repeated: Storage Resource Broker - Data Grid architecture diagram]
Scenario 8 - Unconnected Editions
• Ability to download data from shared collection to local resource
  • Support for PCs, workstations, supercomputers
• Generalization of anonymous FTP
  • Can issue a ticket permitting a limited number of read accesses, valid for a specified time interval
  • Can set public access to a sub-collection
  • Can restrict access by user name/domain/zone
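A ticket that permits a limited number of reads within a time interval can be sketched as follows. The `Ticket` class, its fields, and the path-prefix rule are assumptions made for illustration; they are not the SRB ticket format.

```python
import time
from dataclasses import dataclass


@dataclass
class Ticket:
    collection: str    # sub-collection the ticket applies to
    reads_left: int    # remaining permitted read accesses
    expires_at: float  # expiry time, in seconds since the epoch

    def redeem(self, logical_name, now=None):
        """Consume one read if the ticket is still valid for this object."""
        now = time.time() if now is None else now
        if now >= self.expires_at:
            return False  # time interval has passed
        if self.reads_left <= 0:
            return False  # read budget is used up
        prefix = self.collection.rstrip("/") + "/"
        if not logical_name.startswith(prefix):
            return False  # object lies outside the ticketed sub-collection
        self.reads_left -= 1
        return True


# Hypothetical ticket: two reads on one sub-collection, valid for one hour.
ticket = Ticket("/public/editions", reads_left=2, expires_at=time.time() + 3600)
print(ticket.redeem("/public/editions/volume1.pdf"))
print(ticket.redeem("/private/drafts/volume2.pdf"))
```

Unlike anonymous FTP, the holder needs no account, yet the grant is bounded in both count and time.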
Local Archives
• Maintain files in local file system
  • Register existence of the files into the data grid
  • Issue synchronization command to replicate into the archive
• Maintain a data grid on the local system
  • Entire environment can be installed on a Mac in 15 minutes (Perl install script)
  • Use data grid federation to synchronize name spaces, files, metadata from local data grid to archives data grid
Scenario 9 - Collaborative Commentary
• Comments can be added by owner
  • Annotations can be added by authorized persons
  • Annotations marked by person name, date
  • Can restrict annotation right by group
• Can choose to create explicit metadata attributes to manage comments
  • Can store multiple comments per object
  • Can search across metadata
• Or can use digital library interfaces to manage comments
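Annotations that carry a person name and date, with the right to annotate restricted to a group, can be sketched as below. The class names, the group membership set, and the search method are hypothetical and only illustrate the bookkeeping.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Annotation:
    author: str
    written_on: date
    text: str


class AnnotatedObject:
    """Toy digital entity that accepts annotations from an authorized group."""

    def __init__(self, logical_name, annotators):
        self.logical_name = logical_name
        self.annotators = set(annotators)  # group allowed to annotate
        self.annotations = []              # multiple comments per object

    def annotate(self, author, text, written_on):
        if author not in self.annotators:
            raise PermissionError(f"{author} may not annotate {self.logical_name}")
        note = Annotation(author, written_on, text)
        self.annotations.append(note)
        return note

    def search(self, term):
        # Case-insensitive search across the stored comments.
        term = term.lower()
        return [a for a in self.annotations if term in a.text.lower()]


obj = AnnotatedObject("/coll/transcript.txt", annotators={"alice", "carol"})
obj.annotate("alice", "Check the gloss in line 12", date(2004, 11, 2))
obj.annotate("carol", "Speaker change is missing here", date(2004, 11, 3))
print(obj.search("gloss"))
```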
Sites Using the SRB
• CiteSeer, Penn State
• City Univ. of New York
• Geospatial Environment, UCSD
• Drexel University
• EOSDIS Distributed Active, NASA Goddard
• Georgia Tech
• Kentucky State Libraries & Archives
• Library of Congress
• Los Alamos National Lab
• NASA Ames
• NASA Goddard Space Flight Center
• NCSA Grid Computing
• NIH (NCI Center for Bioinformatics)
• Penn State University
• Pittsburgh Supercomputing Center
• Purdue University, Indiana
• Stanford University
• TACC, University of Texas
• Texas A & M
• UC Santa Cruz
• UCLA
• UCSD Neuroscience
• University of Maryland
• University of Michigan, CAC department
• University of New Mexico
• University of Washington
• University of Wisconsin
• USC
• Yale University
• Academia Sinica, Taiwan
• ASCC, Computing Centre, Taiwan
• Australian National University
• Bedford Oceanography, Canada
• Bioinformatics Institute, Singapore
• CSIRO, Australia
• Data Storage Institute, Singapore
• EGEE, French National Center
• GeoForschungsZentrum, Germany
• James Cook University, Australia
• KEK High Energy Physics, Japan
• Max Planck Institute, Netherlands
• Parallab, Norway
• South Australian Advanced Computing
• UIB (Parallab), Norway
• University of Amsterdam
• University of Cambridge, Astronomy
• University of Cambridge, e-Science
• University of Edinburgh
• University of Genoa, Italy
• University of Hong Kong
• University of Manchester
• University of Oslo
• University of Southampton
• York Univ (UK)
Storage Resource Broker Collections at SDSC (11/2/2004)

Project | GBs of data stored | Number of files | Number of users

Data Grid
NSF/ITR - National Virtual Observatory | 53,858 | 9,536,698 | 80
NSF - National Partnership for Advanced Computational Infrastructure | 24,738 | 5,754,890 | 380
Hayden Planetarium - Evolution of the Solar System visualizations | 7,201 | 113,600 | 178
NSF/NPACI - Joint Center for Structural Genomics | 5,228 | 652,031 | 50
NSF/NPACI - Biology and Environmental collections | 8,851 | 33,340 | 67
NSF - TeraGrid, ENZO Cosmology simulations | 121,550 | 1,096,947 | 3,247
NIH - Biomedical Informatics Research Network | 6,002 | 4,107,508 | 214

Digital Library
NLM - Digital Embryo image collection | 720 | 45,365 | 23
NSF/NPACI - Long Term Ecological Reserve | 253 | 8,436 | 36
NSF/NPACI - Grid Portal | 2,211 | 51,227 | 407
NIH - Alliance for Cell Signaling microarray data | 856 | 62,291 | 21
NSF - National Science Digital Library SIO Explorer collection | 2,080 | 808,901 | 27
NSF/NPACI - Transana education research video collection | 92 | 2,387 | 26
NSF/ITR - Southern California Earthquake Center | 91,040 | 1,791,494 | 62

Persistent Archive
UCSD Libraries archive | 128 | 204,828 | 29
NARA - Research Prototype Persistent Archive | 166 | 316,813 | 58
NSF - National Science Digital Library persistent archive | 3,571 | 26,908,350 | 122

TOTAL | 328 TB | 51 million | 4,900
Generic Infrastructure
• SDSC developed the Storage Resource Broker (SRB) to support access to distributed data
  • Effort started in 1996 as a DARPA-funded project
  • Now supports over 30 national/international projects
• Development team of 12 staff is led by
  • Michael Wan, data management systems
  • Arcot Rajasekar, information management systems
• Supported by over 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)
SDSC SRB Team (left to right)
• Arun Jagatheesan
• George Kremenek
• Sheau-Yen Chen
• Arcot Rajasekar (SRB development lead)
• Reagan Moore (SRB PI)
• Michael Wan (SRB architect)
• Roman Olschanowsky (BIRN)
• Bing Zhu
• Charlie Cowart
• Lucas Gilbert
• Tim Warnock
• Wayne Schroeder (SRB product)
• Adam Birnbaum (SRB production)
• Antoine De Torcy
• Vicky Rowley (BIRN)
• Marcio Faerman (SCEC)
• Students & emeritus
  • Erik Vandekieft
  • Reena Mathew
  • Xi (Cynthia) Sheng
  • Allen Ding
  • Grace Lin
  • Qiao Xin
  • Daniel Moore
  • Ethan Chen
  • Jon Weinburg
Data Grid Capabilities
• Data manipulation
  • Containers
  • Parallel I/O
  • Firewall interactions
• Resource interactions
  • Fault tolerance
  • Load leveling
  • Replication
• HIPAA security requirements
  • Authentication of all users
  • Access controls on data and metadata
  • Audit trails
  • Data encryption
  • Centralized control
• Application interfaces
  • C library, Shell commands, Java, Perl, Python, WSDL, workflow
Data Management System Features
• Data grid for managing distributed data
  • Latency management for bulk analyses of collections
  • Infrastructure independent name spaces for describing data, resources, users, and state information
• Digital library for managing data context
  • Curation services for managing collections
  • Descriptive metadata for discovery
• Persistent archive to manage technology evolution
  • Interoperability mechanisms between heterogeneous storage systems and user access mechanisms
BIRN - Biomedical Informatics Research Network Data Grid
[Figure: network diagram. Recoverable labels: Duke, UCLA, Cal Tech, Wash U., Harvard, NIH/NCRR Centers for Imaging and Computing, Cal-(IT)2, NPACI/SDSC, “Deep Web”, “Surface Web”, Wireless “Pad” Web Interface]
Integrating Cyber Infrastructure to Link:
• Advanced Imaging Instruments
• Data Intensive Computing
• Multi-Scale Brain Databases
Digital Library
• Collection hierarchy for organizing data
  • User-defined metadata
  • Collection level metadata
• Metadata manipulation
  • Schema extension
  • Bulk metadata processing
  • Queries on metadata
  • Access controls on metadata
  • Views on collections
• Digital library APIs
  • DSpace, Fedora, OAI-PMH, web browsers
  • METS metadata XML schema
Southern California Earthquake Center
Store seismic data
• Managing over 90 TBs, over 1.7 million files
• Store community models for seismic velocity
• Data distributed between USC, SDSC
SCEC community digital library
• Storage Resource Broker data grid technology
• NMI portal interface
• Digital library services to display seismograms
• Visualizations of seismic waves at the surface
• Visualization of seismic wave propagation through the volume
[Figure: SCEC Community Library interface. Labels: Select Scenario (Fault Model, Source Model), Select Receiver (Lat/Lon), Output Time History Seismograms]
Virtual Observatory Architecture
[Figure: layered architecture diagram. Recoverable labels:]
• Portals, User Interfaces, Tools
• Tools shown: OpenSkyQuery, SkyQuery, VOPlot, OASIS, conVOT, Topcat, Mirage, Aladin, DIS
• Discover, Compute, Publish, Collaborate
• Compute Services: data mining, visualization, image, source detection, crossmatch
• Registry Layer: Digital Library, other registries (XML, DC, METS); OAI, ADS
• Data Services: Semantics (UCD); SIAP, SSAP; VOTable; FITS, GIF, …; bulk access interfaces to data
• Workflow (pipelines); Virtual Data; Authentication & Authorization
• My Space storage services; Databases, Persistency, Replication
• HTTP Services (stateless, registered); SOAP Services (self-describing); Grid Services (persistent, authenticated)
• Grid Middleware: SRB, Globus, OGSA; SOAP, GridFTP
• Existing Data Centers; Disks, Tapes, CPUs, Fiber
National Virtual Observatory
Provide access to large star catalogs and large image sky surveys
• 2MASS
• SDSS
• DPOSS
• USNO-B
• Macho
National Science Digital Library
Web Interface to Persistent Archive
Preserve educational material that has been registered into a central repository at Cornell through URLs
• Crawl web and retrieve material, 10 levels of indirection
• Convert internal URLs into data grid handles
• Aggregate files into containers for storage
• Preserve using SRB data grid technology
• Currently housing over 26 million files
National Archives and Records Administration - Research Prototype Persistent Archive
Federation of Three Independent Data Grids (NARA, U Md, SDSC), each with its own MCAT
• NARA: principal copy stored at NARA with complete metadata catalog
• U Md: replicated copy at U Md for improved access, load balancing and disaster recovery
• SDSC: deep archive at SDSC, no user access, but complete copy
Demonstrate preservation environment
• Authenticity
• Integrity
• Management of technology evolution
• Mitigation of risk of data loss
  • Replication of data
  • Federation of catalogs
• Management of preservation metadata
• Scalability
  • EAP collection
  • 350,000 files
  • 1.2 TBs in size
For More Information
Reagan W. Moore
San Diego Supercomputer Center
http://www.npaci.edu/DICE
http://www.npaci.edu/DICE/SRB
http://www.npaci.edu/dice/srb/mySRB/mySRB.html