Introduction to Data Management

Dr Jens Jensen

Head of Data Services Group, Scientific Computing Dept

and

Leader of Storage and Data Management, GridPP

STFC ...

Scientific data management:

– Large data volumes (10s of PB)

– Distributed user base

– Need for high performance transfers

– Need for data security (or not)

– Scalability

Data in “the Grid”?

[Diagram: “The Grid”, with data stores labelled “Data”]

Data in “the Cloud”?

[Diagram: “The Cloud”, with data stores labelled “Data”]

Transfer Protocols

– GridFTP (http://www.ogf.org/documents/GFD.20.pdf)

Aka “gsiftp” (GSI = Globus (Grid) Security Infrastructure, cf RFC 3820)

– HTTP(S)

– WebDAV (RFC 4918)

GridFTP – based on FTP

Ancient protocol... RFCs 114 (1971), 141 (1971), 172 (1971), 265 (1971), 354 (1972), 542 (1973), 765 (1980), 959 (1985)

Splitting control and data connection

Extensions: RFC 2228, 2773 (security), 2640 (internationalisation), 3659 (misc.), 2389, 5797 (FEAT)

Control connection: port 21 (FTP), 2811 (GridFTP)

[Diagram: Client and Server]

Data connections and firewalls (active vs passive mode (PASV))
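The difference between active and passive mode comes down to who opens the data connection. In passive mode the server replies to PASV with a listening address encoded as six decimal numbers (RFC 959); in active mode the client sends the same encoding in a PORT command. A minimal sketch of that encoding, with hypothetical helper names and example addresses:

```python
import re
from typing import Tuple


def parse_pasv_reply(reply: str) -> Tuple[str, int]:
    """Extract (host, port) from an RFC 959 '227' reply to PASV."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError(f"not a PASV reply: {reply!r}")
    h1, h2, h3, h4, p1, p2 = (int(g) for g in match.groups())
    # The port is sent as two bytes, high byte first.
    return f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2


def format_port_command(host: str, port: int) -> str:
    """Build the PORT command a client sends in active mode."""
    return f"PORT {host.replace('.', ',')},{port // 256},{port % 256}"


host, port = parse_pasv_reply("227 Entering Passive Mode (192,0,2,10,195,80)")
print(host, port)
print(format_port_command(host, port))
```

Because the data port is chosen at run time, a firewall has to allow either the whole range the server may pick (passive) or incoming connections to the client (active).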

(Grid)FTP - “3rd party copying”
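Third-party copying works because the control and data connections are separate: the client holds a control connection to each server and tells them to open the data connection between themselves. The sketch below only lists the classic FTP command sequence; the function name, the "src"/"dst" labels and the paths are hypothetical, and a real GridFTP transfer adds security and mode negotiation on top.

```python
def third_party_copy_commands(src_path, dst_path, pasv_host, pasv_port):
    """Return (server, command) pairs in the order the client sends them.

    'dst' is put into passive mode first; the address it reports in its
    227 reply (pasv_host, pasv_port) is then handed to 'src' via PORT.
    """
    port_arg = f"{pasv_host.replace('.', ',')},{pasv_port // 256},{pasv_port % 256}"
    return [
        ("dst", "PASV"),                 # destination listens for data
        ("src", f"PORT {port_arg}"),     # source is told where to connect
        ("dst", f"STOR {dst_path}"),     # destination stores what arrives
        ("src", f"RETR {src_path}"),     # source sends the file
    ]


for server, command in third_party_copy_commands(
        "/data/a", "/data/b", "192.0.2.20", 50000):
    print(f"{server}: {command}")
```

The file contents never pass through the client, which is why this pattern suits transfers between two well-connected storage sites.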

GridFTP – extensions to FTP

GSI security (later RFC 3820)

Striping (and EBLOCK mode)

TCP buffer size control/negotiation?

Data channel authentication (DCAU)
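Striping and parallel streams rely on extended block (EBLOCK) mode, in which every block carries its own offset so that blocks can arrive in any order on any data connection. The sketch below assumes the header layout described in GFD.20 (an 8-bit descriptor followed by a 64-bit byte count and a 64-bit offset, in network byte order) and an end-of-data flag value of 8; both should be checked against the specification, and the function names are hypothetical.

```python
import struct

# Assumed header layout: descriptor (1 byte), byte count (8), offset (8).
HEADER = struct.Struct(">BQQ")
EOD = 0x08  # assumed flag value: last block on this data connection


def make_block(offset, payload, flags=0):
    """Prefix a payload with an extended-block-style header."""
    return HEADER.pack(flags, len(payload), offset) + payload


def reassemble(blocks, total_size):
    """Place each block at its stated offset, whatever the arrival order."""
    out = bytearray(total_size)
    for block in blocks:
        _flags, count, offset = HEADER.unpack_from(block)
        out[offset:offset + count] = block[HEADER.size:HEADER.size + count]
    return bytes(out)


data = b"abcdefghij"
# Second half arrives first, as it might over two parallel streams.
blocks = [make_block(5, data[5:], EOD), make_block(0, data[:5], EOD)]
print(reassemble(blocks, len(data)))
```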

The Grid....

Ad-hoc transfers between GridFTP endpoints

Initial user ingest? scp?

Hands on with GridFTP: uberftp (cf ftp)

Moving data in (and to, and from) the Grid

“Manually,” with GridFTP

Portals – e.g. NGS portal

GlobusOnline

FTS (as of 3.0, tbc)

The gLite grid – daily TLA dose

EMI – European Middleware Initiative

UMD – Unified Middleware Distribution

EGI – European Grid Infrastructure

IGE – Infrastructure for Globus in Europe

NGI – National Grid Initiative

The gLite grid – component TLAs

SE – Storage Element

SRM – Storage Resource Manager

LFC – LCG File Catalogue

FTS – File Transfer Service

BDII – Berkeley Database Information Index (LDAP)

[Diagram: LFC, FTS, and a Storage Element with SRM, GridFTP and BDII]

SRM (OGF GFD.129)

– control interface

– support for “spaces” (reserved areas)

– retention policies (replica, output, custodial)

– access latencies (offline, nearline, online)

– storage “type” - permanent, volatile

LFN – Logical File Name (optional)

Resolved by LFC into

GUID – Globally Unique Identifier

Resolved by LFC into

SURL – Storage URL (or Site URL)

Resolved by SE into

TURL – Transfer URL (eg gsiftp)
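The resolution chain above can be sketched as three lookups. Everything in this example is hypothetical: the catalogue contents, the host name and the function names are made up for illustration, and the last step is a simplification, since a real SE returns whatever TURL it chooses rather than just swapping the URL scheme.

```python
# Hypothetical catalogue contents, for illustration only.
LFN_TO_GUID = {"lfn:my_stuff": "guid:example-0001"}
GUID_TO_SURLS = {"guid:example-0001": ["srm://se.example.org/dteam/my_stuff"]}


def lfc_lfn_to_guid(lfn):
    """LFC step 1: logical file name to GUID."""
    return LFN_TO_GUID[lfn]


def lfc_guid_to_surls(guid):
    """LFC step 2: GUID to the SURLs of all registered replicas."""
    return GUID_TO_SURLS[guid]


def se_surl_to_turl(surl, protocol="gsiftp"):
    """SE step (simplified stand-in): SURL to a TURL for one protocol."""
    if not surl.startswith("srm://"):
        raise ValueError(f"not a SURL: {surl!r}")
    return protocol + "://" + surl[len("srm://"):]


def resolve(lfn):
    """Follow LFN -> GUID -> SURL -> TURL, taking the first replica."""
    guid = lfc_lfn_to_guid(lfn)
    surl = lfc_guid_to_surls(guid)[0]
    return se_surl_to_turl(surl)


print(resolve("lfn:my_stuff"))
```

A GUID can map to several SURLs, one per replica, which is what makes replication invisible to anyone working with the LFN.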

gLite - Summary of basic data commands

lcg-cp <srcfile> <dstfile>

Copy to/from SE, or between SEs (no LFC)

lcg-cr <srcfile> <dst>

Copy file into SE, and register in LFC (guid)

lcg-del <guid>

lcg-rep <src> <dst>

Replicate

Exercises

Lots of small files (10^5, 10^6)

Large files (10^8-10^12)

Migration

Format migration, checksumming

Who can copy data? Write/Modify?
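For the checksumming exercise, a file checksum can be computed in fixed-size chunks so that even very large files never have to fit in memory. Adler-32 is used here purely as an example algorithm, and the function name and chunk size are assumptions:

```python
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 hex digits."""
    checksum = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in to continue the checksum.
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"
```

Comparing the value computed before and after a copy or migration shows whether the bytes arrived unchanged.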

Exercises

How is scientific data mgmt different?

– How do research disciplines differ?

– What are the interdisciplinary benefits?

How do grids and clouds differ?

Can we trust the grids/clouds?

Who leads the way? HEP? Industry?

Storage Accounting - static

Ongoing work...

– Distributed storage systems

– Temporary file copies created

– Scheduled deletions

– Inaccessible free spaces, reserved space

– Filesystem/tape overheads

– Timeliness and accuracy

– Impact of compression

GridFTP today

GridFTP – workhorse of WAN grid data (OGF standard)

The need for GSI (non-TLS)

Numerous LAN protocols...

… moving towards more common standards? (eg HTTP)

lcg-cr --vo dteam \
    -l lfn:my_stuff \
    -d srm-dteam.gridpp.rl.ac.uk \
    file://`pwd`/foo.tmp

guid:921ac0b8-82aa-61dc-0192-6effece

Subsequent access and replication is by GUID

Data Security

Data security is like data security everywhere...

Except that the devil is in the detail

And the details are always different...

Data Security – Confidentiality

[Diagram: several data stores, each labelled “Data”]

• In flight, or at rest

• The performance issue

• And the time issue

• Who can “activate” it?

Data Security – Availability

LOCKSS again... clouds are good at this.

Somebody already thought about the difficult stuff...?

Liability, SLAs,...

Data Security – Availability

DDoS Intentional

Botnets Unintentional

Referencing Data

DOIs for data

– DONA – Digital Object Numbering Authority

Granularity?

Licences, permissions

Implementing data policies

Cloud Data – Cost

Clouds are elastic

Elasticity is good for (rapid) growth

Or shrinkth

Elasticity can be expensive, though

Compared to “traditional” data centre

Or in-house (but don’t underestimate this!)

Different cost models (Hybrids!)

Infrastructure Security

End-to-end security

Authentication and authorisation

Developing a threat model

Protecting credentials

Usability of security

Anonymised??

Infrastructure

Federated identity and single sign-on

Integration with existing infrastructures

Accounting

Securely... Anonymously?

And billing

The Role of Standards

Standards promote interoperation

And maturity (sometimes)

Interoperation solves problems

Sometimes

E.g. eggs and baskets

Standards peer reviewed

Other Data Services

iRODS – “data grid”

Successor to SRB

Server side workflows: rules, microservices

Safety Deposit Box

Commercial product from Tessella

Data preservation

NGS data services

NGS portal – https://portal.ngs.ac.uk/

http://www.ngs.ac.uk/tools/vbrowser

Databases: Oracle, MySQL

EU Funded Data Projects

EUDAT (www.eudat.eu)

Collaborative iRODS based infrastructure

Multidisciplinary, scalable, long tail

SCIDIP-ES (earth science)

www.scidip-es.eu

SCAPE (www.scape-project.eu)

PANDATA (neutron/synchrotron)

pan-data.eu

New Stuff?

More mature approach to clouds?

CCN – Content Centric Networking

RAID --> ECC, “object” storage

Exercises

Lots of small files (10^5, 10^6)

Large files (10^8-10^12)

Migration

Format migration, checksumming

Who can copy data? Write/Modify?

Exercises

How is scientific data mgmt different?

– How do research disciplines differ? How much can be shared?

– What are the interdisciplinary benefits?

How do grids and clouds differ?

Can we trust the grids/clouds?

Who leads the way? HEP? Industry?

References

www.ngs.ac.uk

www.ogf.org

UMD user guide https://edms.cern.ch/document/722398/

GridPP storage and data management group

– http://www.gridpp.ac.uk/wiki/Grid_Storage