
Security Requirements for Shared Collections

Storage Resource Broker

Reagan W. Moore

[email protected]

http://www.sdsc.edu/srb

Abstract

• National and international collaborations create shared collections that span multiple administrative domains.
• Shared collections are built using:
  • Data virtualization, the management of collection properties independently of the storage system.
  • Trust virtualization, the ability to assert authenticity and authorization independently of the underlying storage resources.
• What security requirements are implied by trust virtualization?
• How are data integrity challenges met?
• How do you implement a “Deep Archive”?

State of the Art Technology

• Grid - workflow virtualization
  • Manage execution of jobs (processes) independently of compute servers
• Data grid - data virtualization
  • Manage properties of a shared collection independently of storage systems
• Semantic grid - information virtualization
  • Reason across inferred attributes derived from multiple collections

Shared Collections

• The purpose of the SRB data grid is to enable the creation of a shared collection between two or more institutions
  • Register digital entities into the shared collection (see the sketch after this list)
  • Assign owner, access controls
  • Assign descriptive, provenance metadata
  • Manage audit trails, versions, replicas, backups
  • Manage location mapping, containers, checksums
  • Manage verification, synchronization
  • Manage federation with other shared collections
  • Manage interactions with storage systems
  • Manage interactions with APIs
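A hedged command-line sketch of the first two steps above, registering a digital entity and granting access; the collection path, user and domain names, and the Schmod argument order are assumptions for illustration only.

# Register a file into the shared collection (path is illustrative):
Sput report.pdf /home/collection.domain/shared/
# Grant read access to a collaborator; user "alice" in domain "partnerdomain" is hypothetical,
# and the access-type/argument order shown here is assumed:
Schmod read alice partnerdomain /home/collection.domain/shared/report.pdf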

Trust Virtualization

• The collection owns the data that is registered into the data grid
• The data grid is a set of servers, installed at each storage repository
• Servers are installed under an SRB ID created for the shared collection
• All accesses to the data stored under the SRB ID are through the data grid
• Authentication and authorization are controlled by the data grid, independently of the remote storage system

Federated Server Architecture

[Diagram: a client issues Sget with a logical name or attribute condition; the SRB server/agent consults the MCAT for (1) logical-to-physical mapping, (2) identification of replicas, and (3) access and audit control; peer-to-peer brokering spawns server(s) on the storage holding replicas R1/R2, which return the data to the client over parallel data access streams.]

Authentication / Authorization

• Collection-owned data
  • At each remote storage system, an account ID is created under which the data grid stores files
• User authenticates to the data grid
• Data grid checks access controls
• Data grid server authenticates to a remote data grid server
• Remote data grid server authenticates to the remote storage repository
• SRB servers return the data to the client

Authentication Mechanisms

• Grid Security Infrastructure
  • PKI certificates
• Challenge-response mechanism
  • No passwords sent over the network
• Ticket
  • Valid for a specified time period or number of accesses
• Generic Security Service API
  • Authentication of server to remote storage

Trust Implementation

• For authentication to work across multiple administrative domains
  • Require collection-managed names for users
• For authorization to work across multiple administrative domains
  • Require collection-managed names for files
• The result is that access controls remain invariant. They do not change as data is moved to different storage systems under shared collection control

Communication

• Single port number for control messages
  • Specify the desired control port
• Parallel I/O streams are received on multiple port numbers
  • Specify a range of port numbers
• To deal with firewalls
  • Parallel I/O streams are initiated from within the firewall (see the sketch below)
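The two parallel-transfer modes shown on the following slides can be selected from the command line. A minimal, hedged sketch, assuming an SRB session that has already been initialized (e.g. with Sinit); the file and collection names are illustrative only.

# Server-initiated parallel transfer (Sput -m): the SRB storage server opens the data
# connections back to the client, so it works when the server is behind a firewall.
Sput -m largefile.dat /home/collection.domain/data/

# Client-initiated parallel transfer (Sput -M): the client opens the data connections
# to the server, so it works when the client is behind a firewall.
Sput -M largefile.dat /home/collection.domain/data/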

Parallel Mode Data Transfer – Server Initiated (server behind firewall)

[Diagram: the client issues Sput -m, sending srbObjPut plus its socket address, port, and cookie; the SRB server consults the MCAT for logical-to-physical mapping, replica identification, and access/audit control, then brokers peer-to-peer to the storage server, which connects back to the client and carries out the parallel data transfer to resource R.]

Parallel Mode Data Transfer – Client Initiated (client behind firewall)

[Diagram: the client issues Sput -M (srbObjPut); after the MCAT lookups (logical-to-physical mapping, replica identification, access/audit control), the storage server returns its socket address, port, and cookie; the client connects to the server and carries out the parallel data transfer to resource R.]

Bulk Load Operation (client behind firewall)

[Diagram: the client issues Sput -b; the MCAT resource query returns the resource location; bulk data transfer threads ship 8 MB buffers to the storage server, which stores the data in a temporary file, unfolds the temporary file onto resource R, and bulk-registers the files through parallel bulk registration threads.]
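A rough command-line sketch of the bulk load path above, assuming an initialized session; the wildcard and collection name are illustrative.

# Bulk load many small files in one operation (Sput -b): the files are packed into
# 8 MB buffers, shipped by bulk data transfer threads, unfolded from a temporary
# file on the server, and bulk-registered into the MCAT.
Sput -b *.dcm /home/collection.domain/images/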

Third Party Data Transfer (server1 behind a firewall, or server1 and server2 behind the same firewall)

[Diagram: the client issues Scp (srbObjCopy); after the MCAT lookups, server1 sends dataPut with its socket address, port, and cookie; server2 connects to server1 and the data transfer proceeds directly between the two storage servers, without passing through the client.]
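A hedged sketch of the third-party copy shown above; both arguments are SRB logical paths and the names are illustrative. The data moves server-to-server, not through the client.

# Third-party copy (Scp / srbObjCopy): server2 connects to server1 and transfers the
# data directly between the two storage resources.
Scp /home/collection.domain/data/run42.dat /home/collection.domain/archive/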

Example International Institutions (2005)

Project — Institution
• Data management project — British Antarctic Survey, UK
• eMinerals — Cambridge e-Science Centre, UK
• Sickkids Hospital — Toronto, Canada
• Welsh e-Science Centre — Cardiff University, UK
• Australian Partnership for Advanced Computing Data Grid — Victoria, Australia
• Australian Partnership for Advanced Computing Data Grid — Commonwealth Scientific and Industrial Research Organisation, Australia
• Australian Partnership for Advanced Computing Data Grid — University of Technology, Australia
• Center for Advanced Studies, Research, and Development — Italy
• LIACS (Leiden Institute of Computer Science) — Leiden University, The Netherlands
• Australian Partnership for Advanced Computing Data Grid — Melbourne, Australia
• Monash E-Research Grid — Monash University, Australia
• Computational Modelling — University of Queensland, Australia
• Virtual Tissue Bank — Osaka University, Japan
• Cybermedia Center — Osaka University, Japan
• Belfast e-Science Centre — Queen's University, UK
• Information Technology Department — Sejong University, South Korea
• Nanyang Centre for Supercomputing — Singapore
• National University (Biology data grid) — Singapore
• Protein structure prediction — Taiwan University, Taiwan

KEK Data Grid

• Japan
• Taiwan
• South Korea
• Australia
• Poland
• US

• A functioning, general-purpose international data grid for high-energy physics


BaBar High-Energy Physics

• Stanford Linear Accelerator
• Lyon, France
• Rome, Italy
• San Diego
• RAL, UK

• A functioning international data grid for high-energy physics

Manchester-SDSC mirror

Moved over 100 TB of data

BIRN - Biomedical Informatics Research Network

• Link 17 research hospitals and universities to promote sharing of data
  • Install 1-2 terabyte disk caches at each site
  • Register all material into a central catalog
  • Manage access controls on data, metadata, and resources
  • Manage information about each image
  • Audit sharing of data by institution
• Extend the system as needed
  • Add disk space dynamically
  • Add new metadata attributes
  • Add new sites

NIH BIRN SRB Data Grid

• Biomedical Informatics Research Network
  • Access and analyze biomedical image data
  • Data resources distributed throughout the country
  • Medical schools and research centers across the US
• Stable, high-performance grid-based environment
  • Coordinate data sharing
  • Federate collections
  • Support data mining and analysis

HIPAA Patient Confidentiality

• Access controls on data
• Access controls on metadata
• Audit trails
• End-to-end encryption (manage keys)
• Localization of data to specific storage
• Access controls do not change when data is moved to another storage system under data grid control

Storage Resource Broker Collections at SDSC (12/6/2005)

Collection (GBs of data stored / number of files / users with ACLs)

Data Grid
• NSF/ITR - National Virtual Observatory: 93,252 GB / 11,189,215 files / 100 users
• NSF - National Partnership for Advanced Computational Infrastructure: 34,319 GB / 7,235,387 files / 380 users
• Static collections - Hayden Planetarium: 8,013 GB / 161,352 files / 227 users
• Pzone - public collections: 16,815 GB / 9,628,452 files / 68 users
• NSF/NPACI - Biology and Environmental collections: 103,849 GB / 124,169 files / 67 users
• NSF/NPACI - Joint Center for Structural Genomics: 15,731 GB / 1,577,260 files / 55 users
• NSF - TeraGrid, ENZO Cosmology simulations: 187,684 GB / 2,415,397 files / 3,267 users
• NIH - Biomedical Informatics Research Network: 12,812 GB / 9,078,456 files / 337 users

Digital Library
• NSF/NPACI - Long Term Ecological Reserve: 236 GB / 34,594 files / 36 users
• NSF/NPACI - Grid Portal: 2,620 GB / 53,048 files / 460 users
• NIH - Alliance for Cell Signaling microarray data: 733 GB / 94,686 files / 21 users
• NSF - National Science Digital Library SIO Explorer collection: 2,452 GB / 1,068,842 files / 27 users
• NSF/ITR - Southern California Earthquake Center: 134,779 GB / 2,958,799 files / 73 users

Persistent Archive
• NHPRC Persistent Archive Testbed (Kentucky, Ohio, Michigan, Minnesota): 96 GB / 387,825 files / 28 users
• UCSD Libraries archive: 4,147 GB / 408,050 files / 29 users
• NARA - Research Prototype Persistent Archive: 2,029 GB / 1,393,722 files / 58 users
• NSF - National Science Digital Library persistent archive: 4,651 GB / 45,724,945 files / 136 users

TOTAL: 624 TB / 93 million files / 5,369 users

Infrastructure Independence

• Data virtualization
  • Management of name spaces independently of the storage repositories
    • Global name spaces
    • Persistent identifiers
    • Collection-based ownership of files
  • Support for access operations independently of the storage repositories
    • Separation of access methods from storage protocols
  • Support for integrity and authenticity operations

Separation of Access Method from Storage Protocols

[Diagram: Access Method → Access Operations → Data Grid → Storage Operations → Storage Protocol → Storage System. The data grid maps from the operations used by the access method to a standard set of operations used to interact with the storage system.]

Data Grid Operations

• File access
  • Open, close, read, write, seek, stat, synch, …
  • Audit, versions, pinning, checksums, synchronize, …
  • Parallel I/O and firewall interactions
  • Versions, backups, replicas
• Latency management
  • Bulk operations
    • Register, load, unload, delete, …
  • Remote procedures
    • HDF5, data filtering, file parsing, replicate, aggregate
• Metadata management
  • SQL generation, schema extension, XML import and export, browsing, queries
• GGF, “Operations for Access, Management, and Transport at Remote Sites”

Examples of Extensibility

• The 3 fundamental APIs are C library, shell commands, and Java
  • Other access mechanisms are ported on top of these interfaces
• API evolution
  • Initial access through the C library and Unix shell commands
  • Added inQ Windows browser (C++ library)
  • Added mySRB Web browser (C library and shell commands)
  • Added Java (Jargon)
  • Added Perl/Python load libraries (shell commands)
  • Added WSDL (Java)
  • Added OAI-PMH, OpenDAP, DSpace digital library (Java)
  • Added Kepler actors for dataflow access (Java)
  • Added GridFTP version 3.3 (C library)

Examples of Extensibility

• Storage repository driver evolution
  • Initially supported the Unix file system
  • Added archival access - UniTree, HPSS
  • Added FTP/HTTP
  • Added database blob access
  • Added database table interface
  • Added Windows file system
  • Added project archives - dCache, Castor, ADS
  • Added Object Ring Buffer, Datascope
  • Added GridFTP version 3.3
• Database management evolution
  • Postgres
  • DB2
  • Oracle
  • Informix
  • Sybase
  • mySQL (most difficult port - no locks, no views, limited SQL)

Storage Resource Broker 3.3.1

[Architecture diagram: client interfaces (C library, Java, Unix shell, Linux I/O, C++, DLL/Python, Perl, Windows, NT browser, Kepler actors, OAI, WSDL, (WSRF), HTTP, DSpace, OpenDAP, GridFTP) sit above the data grid core - logical name space, latency management, data transport, metadata transport, federation management, and consistency & metadata management / authorization, authentication, audit. Below the core, the storage repository abstraction connects to applications, ORBs, file systems (Unix, NT, Mac OS X), and archives (tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS), while the database abstraction connects to databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix).]

Logical Name Spaces

Storage Repository
• Storage location
• User name
• File name
• File context (creation date, …)
• Access constraints

Data Access Methods (C library, Unix, Web browser)

Data access occurs directly between the application and the storage repository, using the names required by the local repository.

Logical Name Spaces

Storage Repository
• Storage location
• User name
• File name
• File context (creation date, …)
• Access constraints

Data Grid
• Logical resource name space
• Logical user name space
• Logical file name space
• Logical context (metadata)
• Control/consistency constraints

Data Collection

Data Access Methods (C library, Unix, Web browser)

Data is organized as a shared collection.

Logical Resource Names

• A logical resource name represents multiple physical resources
• Writing to a logical resource name can result in (see the sketch below):
  • Replication - write completes when each physical resource has a copy
  • Load leveling - write completes when the next physical resource in the list has a copy
  • Fault tolerance - write completes when “k” of “n” resources have a copy
  • Single copy - write is done to the first disk at the same IP address, then a disk anywhere, then tape
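A hedged sketch of writing to a logical resource name; the -S flag for naming the target resource, the resource name, and the collection path are assumptions for illustration.

# "demoLogicalResc" is a hypothetical logical resource. If it is configured for
# replication, the write completes only after every physical resource behind it
# holds a copy of the file.
Sput -S demoLogicalResc results.dat /home/collection.domain/experiments/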

Federation Between Data Grids

[Diagram: two data grids, each with its own logical resource name space, logical user name space, logical file name space, logical context (metadata), and control/consistency constraints, manage Data Collection A and Data Collection B; data access methods (Web browser, DSpace, OAI-PMH) sit above them. Access controls and consistency constraints govern cross-registration of digital entities between the collections.]

Authentication across Zones

• Follow the Shibboleth model
• A user is assigned a “home” zone
• User identity is Home-zone:Domain:User-ID
• All authentications are done by the “home” zone

User Authentication

[Diagram: user Z1:D1:U1 connects to zone Z2; the challenge/response exchange is relayed to the user's home zone Z1, which performs the validation of the challenge response.]

User Authentication

[Diagram: the same exchange involving a third zone, Z3; the challenge response is forwarded between zones so that validation is still performed only by the home zone Z1.]

Deep Archive

[Diagram: three zones (Z1, Z2, Z3). Data from the remote zone Z1 is registered into a staging zone under Z2:D2:U2, then registered into the deep archive zone under Z3:D3:U3. The deep archive pulls data through the firewall using server-initiated I/O; remote zones have no access to the deep archive (PVN).]

Types of Risk

• Media failure
  • Replicate data onto multiple media
• Vendor-specific systemic errors
  • Replicate data onto multiple vendor products
• Operational error
  • Replicate data onto a second administrative domain
• Natural disaster
  • Replicate data to a geographically remote site
• Malicious user
  • Replicate data to a deep archive
• Bit flips because of the finite bit-error rate

Commands Using Checksum

• Registering checksum values into the MCAT at the time of upload (see the sketch below)
  • Sput -k - compute the checksum of the local source file and register it with the MCAT
  • Sput -K - checksum verification mode
    • After upload, compute the checksum by reading back the uploaded file
    • Compare it with the checksum generated locally
• Existing SRB files
  • Schksum - compute and register a checksum if one does not already exist
  • Srsync - synchronize, computing the checksum if it does not exist
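A hedged command sketch of the checksum workflow above; file and collection names are illustrative, and the s: prefix marking the SRB-side path in the Srsync line is an assumption.

# Upload and register the locally computed checksum in the MCAT:
Sput -k results.dat /home/collection.domain/experiments/
# Upload, read the stored copy back, and verify it against the local checksum:
Sput -K results.dat /home/collection.domain/experiments/
# Compute and register a checksum for a file already in the collection:
Schksum /home/collection.domain/experiments/results.dat
# Synchronize a local file with its SRB copy, computing checksums where missing
# (s: prefix for the SRB path is assumed):
Srsync results.dat s:/home/collection.domain/experiments/results.dat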

How Many Replicas?

• Three sites minimize risk
• Primary site
  • Supports interactive user access to data
• Secondary site
  • Supports interactive user access when the first site is down
  • Provides a 2nd media copy, located at a remote site, using a different vendor product and independent administrative procedures
• Deep archive
  • Provides a 3rd media copy and a staging environment for data ingestion; no user access

NARA Persistent Archive

[Diagram: three federated MCATs at NARA, U Md, and SDSC. Original data resides at NARA and is replicated to U Md; the replicated copy at U Md provides improved access, load balancing, and disaster recovery; the deep archive at SDSC allows no user access and pulls data from NARA.]

Demonstrate a preservation environment
• Authenticity
• Integrity
• Management of technology evolution
• Mitigation of risk of data loss
  • Replication of data
  • Federation of catalogs
  • Management of preservation metadata
• Scalability
  • Types of data collections
  • Size of data collections

Federation of Three Independent Data Grids

[Diagram: a taxonomy of zone federation styles - Free Floating, Occasional Interchange, Replicated Data, Nomadic, Replicated Catalog, Snow Flake, Master-Slave, Archival, Replication Zones, Peer-to-Peer Zones, and Hierarchical Zones - characterized by how user-IDs, resources, and metadata are shared between zones: partial or complete user-ID sharing (or one shared user-ID), partial, complete, or no resource sharing, system-managed replication, system-set access controls, system-controlled partial or complete metadata synchronization (or no metadata synch), connection from any zone, hierarchical zone organization, and super administrator zone control.]

For More Information on the Storage Resource Broker Data Grid

Reagan W. Moore

[email protected]

http://www.sdsc.edu/srb/