Introduction to Data Management
Dr Jens Jensen
Head of Data Services Group,
Leader of Storage and Data Management
and
Scientific Computing Dept GridPP more
STFC ...
Scientific data management:
– Large data volumes (10s of PB)
– Distributed user base
– Need for high performance transfers
– Need for data security (or not)
– Scalability
Transfer Protocols
– GridFTP (http://www.ogf.org/documents/GFD.20.pdf)
Also known as “gsiftp” (GSI = Globus (Grid) Security Infrastructure, cf. RFC 3820)
– HTTP(S)
– WebDAV (RFC 4918)
GridFTP – based on FTP
Ancient protocol... RFCs 114 (1971), 141 (1971), 172 (1971), 265 (1971), 354 (1972), 542 (1973), 765 (1980), 959 (1985)
Splitting control and data connection
Extensions: RFC 2228, 2773 (security), 2640 (internationalisation), 3659 (misc.), 2389, 5797 (FEAT)
Control connection: port 21 (FTP), 2811 (GridFTP)
Client Server
Data connections and firewalls (active vs passive mode (PASV))
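To make the split between control and data connections concrete: in passive mode the server answers PASV on the control connection with a "227" reply that tells the client where to open the data connection. The six numbers encode an IPv4 address and a port (RFC 959). A minimal sketch, in which the function name and the sample reply are illustrative assumptions:

```python
import re
from typing import Tuple


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Extract (host, port) from an FTP '227' passive-mode reply.

    The reply carries six decimal numbers h1,h2,h3,h4,p1,p2:
    the host is h1.h2.h3.h4 and the port is p1 * 256 + p2.
    """
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if not reply.startswith("227") or match is None:
        raise ValueError(f"not a valid PASV reply: {reply!r}")
    numbers = [int(group) for group in match.groups()]
    if any(number > 255 for number in numbers):
        raise ValueError(f"field out of range in PASV reply: {reply!r}")
    host = ".".join(str(number) for number in numbers[:4])
    port = numbers[4] * 256 + numbers[5]
    return host, port


if __name__ == "__main__":
    # Hypothetical reply; a real server chooses its own address and port.
    print(parse_pasv_reply("227 Entering Passive Mode (192,168,1,2,19,137)"))
```

Because the client opens the data connection in passive mode, a firewall in front of the client only has to allow outgoing connections. In active mode the server connects back to the client, which client-side firewalls often block.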
GridFTP – extensions to FTP
GSI security (later RFC 3820)
Striping (and EBLOCK mode)
TCP buffer size control/negotiation?
Data channel authentication (DCAU)
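Why TCP buffer size matters on wide-area links: a single TCP stream can keep at most one buffer's worth of data in flight, so the buffer needs to be about the bandwidth-delay product of the path. The sketch below is a rough sizing aid only. The function names, the example link figures, and the even split across parallel streams are assumptions for illustration, not part of the GridFTP specification.

```python
def bandwidth_delay_product_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to fill a path of the given speed and RTT."""
    # bits/s * (ms / 1000) gives bits in flight; divide by 8 for bytes.
    return bandwidth_bits_per_s * rtt_ms // 8000


def per_stream_buffer_bytes(bandwidth_bits_per_s: int, rtt_ms: int,
                            parallel_streams: int) -> int:
    """Buffer per stream if the load is shared evenly (an idealised assumption)."""
    if parallel_streams < 1:
        raise ValueError("need at least one stream")
    bdp = bandwidth_delay_product_bytes(bandwidth_bits_per_s, rtt_ms)
    return -(-bdp // parallel_streams)  # ceiling division


if __name__ == "__main__":
    # Assumed example path: 10 Gbit/s with a 100 ms round-trip time.
    print(bandwidth_delay_product_bytes(10_000_000_000, 100))
    print(per_stream_buffer_bytes(10_000_000_000, 100, 4))
```

Default operating system buffers are often far smaller than this on long, fast paths, which is one reason a transfer protocol may want to control or negotiate the buffer size.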
The Grid....
Ad-hoc transfers between GridFTP endpoints
Initial user ingest? scp?
Hands on with GridFTP: uberftp (cf. ftp)
Moving data in (and to, and from) the Grid
“Manually,” with GridFTP
Portals – e.g. NGS portal
GlobusOnline
FTS (as of 3.0, tbc)
The gLite grid – daily TLA dose
EMI – European Middleware Initiative
UMD – Unified Middleware Distribution
EGI – European Grid Infrastructure
IGE – Infrastructure for Globus in Europe
NGI – National Grid Initiative
The gLite grid – component TLAs
SE – Storage Element
SRM – Storage Resource Manager
LFC – LCG File Catalogue
FTS – File Transfer Service
BDII – Berkeley Database Information Index (LDAP)
[Diagram: LFC; Storage Element (SRM, GridFTP, BDII); FTS]
SRM (OGF GFD.129)
– control interface
– support for “spaces” (reserved areas)
– retention policies (replica, output, custodial)
– access latencies (offline, nearline, online)
– storage “type” - permanent, volatile
LFN – Logical File Name (optional)
Resolved by LFC into
GUID – Globally Unique Identifier
Resolved by LFC into
SURL – Storage URL (or Site URL)
Resolved by SE into
TURL – Transfer URL (e.g. gsiftp)
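The resolution chain above can be sketched with toy lookup tables. Everything here is hypothetical: the dictionaries stand in for the LFC and the SE, and the hostnames, GUID, and paths are invented for illustration. The sketch also assumes that one GUID may map to several SURLs (one per replica).

```python
from typing import Dict, List

# Toy stand-ins for the catalogue (LFC) and the storage element (SE).
LFC_LFN_TO_GUID: Dict[str, str] = {
    "lfn:my_stuff": "guid:example-0001",
}
LFC_GUID_TO_SURLS: Dict[str, List[str]] = {
    "guid:example-0001": ["srm://se.example.org/dteam/my_stuff"],
}
SE_SURL_TO_TURL: Dict[str, str] = {
    "srm://se.example.org/dteam/my_stuff":
        "gsiftp://disk01.example.org/pool/dteam/my_stuff",
}


def resolve_to_turls(name: str) -> List[str]:
    """Follow LFN -> GUID -> SURL -> TURL; the LFN step is optional."""
    guid = name if name.startswith("guid:") else LFC_LFN_TO_GUID[name]
    surls = LFC_GUID_TO_SURLS[guid]              # catalogue lookup
    return [SE_SURL_TO_TURL[surl] for surl in surls]  # SE hands out TURLs


if __name__ == "__main__":
    print(resolve_to_turls("lfn:my_stuff"))
    print(resolve_to_turls("guid:example-0001"))
```

The point of the indirection is that the logical name stays stable while the SE is free to choose, per request, which protocol and which disk server actually serves the file.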
gLite - Summary of basic data commands
lcg-cp <srcfile> <dstfile>
Copy to/from SE, or between SEs (no LFC)
lcg-cr <srcfile> <dst>
Copy file into SE, and register in LFC (guid)
lcg-del <guid>
lcg-rep <src> <dst>
Replicate
Exercises
Lots of small files (10^5, 10^6)
Large files (10^8-10^12)
Migration
Format migration, checksumming
Who can copy data? Write/Modify?
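For the checksumming part of the migration exercise, one possible starting point is to compare a digest of the source file with a digest of its copy, reading in chunks so that large files do not have to fit in memory. This is a sketch only: the function names are illustrative, and SHA-256 is an assumed default rather than the algorithm any particular storage system uses.

```python
import hashlib


def file_checksum(path: str, algorithm: str = "sha256",
                  chunk_size: int = 1024 * 1024) -> str:
    """Return the hex digest of a file, read chunk by chunk."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(source_path: str, copy_path: str,
                 algorithm: str = "sha256") -> bool:
    """True if both files have the same digest under the chosen algorithm."""
    return (file_checksum(source_path, algorithm)
            == file_checksum(copy_path, algorithm))
```

With very many small files the per-file overhead of opening and hashing dominates, whereas with very large files the read bandwidth does, which ties the checksumming question back to the two file-size regimes in the exercise.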
Exercises
How is scientific data mgmt different?
– How do research disciplines differ?
– What are the interdisciplinary benefits?
How do grids and clouds differ...?
Can we trust the grids/clouds?
Who leads the way? HEP? Industry?
Storage Accounting - static
Ongoing work...
– Distributed storage systems
– Temporary file copies created
– Scheduled deletions
– Inaccessible free spaces, reserved space
– Filesystem/tape overheads
– Timeliness and accuracy
– Impact of compression
GridFTP today
GridFTP – workhorse of WAN grid data (OGF standard)
The need for GSI (non-TLS)
Numerous LAN protocols...
… moving towards more common standards? (eg HTTP)
lcg-cr --vo dteam \
  -l lfn:my_stuff \
  -d srm-dteam.gridpp.rl.ac.uk \
  file://`pwd`/foo.tmp
guid:921ac0b8-82aa-61dc-0192-6effece
Subsequent access and replication is by GUID
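When scripting many uploads, the same command line can be assembled programmatically. The sketch below only builds the argument list and picks the GUID out of captured output. It uses just the options shown in the example above (`--vo`, `-l`, `-d`), and the helper names are hypothetical. Actually running the command would need a configured grid client and a valid proxy, which is outside the scope of this sketch.

```python
import os
from typing import List, Optional


def build_lcg_cr_command(vo: str, lfn: str, destination_se: str,
                         local_path: str) -> List[str]:
    """Assemble an lcg-cr argument list mirroring the example above."""
    return [
        "lcg-cr",
        "--vo", vo,
        "-l", f"lfn:{lfn}",
        "-d", destination_se,
        "file://" + os.path.abspath(local_path),
    ]


def extract_guid(output: str) -> Optional[str]:
    """Return the first output line that starts with 'guid:', if any."""
    for line in output.splitlines():
        if line.strip().startswith("guid:"):
            return line.strip()
    return None


if __name__ == "__main__":
    print(build_lcg_cr_command("dteam", "my_stuff",
                               "srm-dteam.gridpp.rl.ac.uk", "foo.tmp"))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell quoting problems with unusual file names.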
Data Security
Data security is like data security everywhere...
Except that the devil is in the detail
And the details are always different...
Data Security – Confidentiality
• In flight, or at rest
• The performance issue
• And the time issue
• Who can “activate” it?
Data Security – Availability
LOCKSS again... clouds are good at this.
Somebody already thought about the difficult stuff...?
Liability, SLAs,...
Referencing Data
DOIs for data
– DONA – Digital Objects Numbering Authority
Granularity?
Licences, permissions
Implementing data policies
Cloud Data – Cost
Clouds are elastic
– Elasticity is good for (rapid) growth
– Or shrinkth
Elasticity can be expensive, though
– Compared to “traditional” data centre
– Or in-house (but don’t underestimate this!)
Different cost models (Hybrids!)
Infrastructure Security
End-to-end security
Authentication and authorisation
Developing a threat model
Protecting credentials
Usability of security
Anonymised??
Infrastructure
Federated identity and single sign-on
Integration with existing infrastructures
Accounting
– Securely... Anonymously?
– And billing
The Role of Standards
Standards promote interoperation
– And maturity (sometimes)
Interoperation solves problems
– Sometimes
– E.g. eggs and baskets
Standards peer reviewed
Other Data Services
iRODS – “data grid”
– Successor to SRB
– Server side workflows: rules, microservices
Safety Deposit Box
– Commercial product from Tessella
– Data preservation
NGS data services
NGS portal – https://portal.ngs.ac.uk/
http://www.ngs.ac.uk/tools/vbrowser
Databases: Oracle, MySQL
EU Funded Data Projects
EUDAT (www.eudat.eu)
– Collaborative iRODS based infrastructure
– Multidisciplinary, scalable, long tail
SCIDIP-ES (earth science, www.scidip-es.eu)
SCAPE (www.scape-project.eu)
PANDATA (neutron/synchrotron, pan-data.eu)
New Stuff?
More mature approach to clouds?
CCN – Content Centric Networking
RAID --> ECC, “object” storage
Exercises
How is scientific data mgmt different?
– How do research disciplines differ? How much can be shared?
– What are the interdisciplinary benefits?
How do grids and clouds differ...?
Can we trust the grids/clouds?
Who leads the way? HEP? Industry?