National Science Foundation Cooperative Agreement: OCI-0940841 Reagan Moore, PI Mary Whitton, Project Manager


Policy Topics

• Policy-based Data Management
• Practical Policy Working Group outcomes
  – Data Center policies
• Applications
  – DataNet Federation Consortium analyzed 175 policies for
    • Data sharing (research collaborations)
    • SILS Digital library (personal collections)
    • RDA Practical Policy (data centers)
    • UNC-CH Protected data (secure medical workspace)
    • Odum/Dataverse (archive)
    • NSF data management plans (publication)
  – Science Observatory Network (real-time sensor data)
  – PECE/RPI (anthropology)
  – NOAA NCDC (archive)


Policy-based Data Management


Summary of the Problem

Practical Policy

Assertion or assurance that is enforced about a (data) collection (data set, digital object, file) by the creators of the collection

Computer actionable policies are used to:
• Enforce data management
• Automate administrative tasks
• Validate compliance with assessment criteria
• Automate scientific data processing and analyses

Users are motivated by issues related to scale and distribution.
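One way to picture a computer actionable policy is as a named rule that pairs a condition with an action and is evaluated against each object in a collection. The Python sketch below is illustrative only: the `Policy` class, the `apply_policies` function, and the metadata keys are hypothetical names chosen for this example and are not part of iRODS or any other system named in these slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of one object in a collection.
Record = Dict[str, object]


@dataclass
class Policy:
    """A named rule: when `condition` holds for a record, run `action` on it."""
    name: str
    condition: Callable[[Record], bool]
    action: Callable[[Record], None]


def apply_policies(records: List[Record], policies: List[Policy]) -> List[str]:
    """Apply every policy to every record and return a log of what fired."""
    log = []
    for record in records:
        for policy in policies:
            if policy.condition(record):
                policy.action(record)
                log.append(f"{policy.name}: {record['name']}")
    return log


def set_default_access(record: Record) -> None:
    """Action: mark the record as private."""
    record["access"] = "private"


# Example policy (an assumption for illustration): objects that have no
# access setting are made private.
default_private = Policy(
    name="default-access",
    condition=lambda r: "access" not in r,
    action=set_default_access,
)

records = [{"name": "a.csv"}, {"name": "b.csv", "access": "public"}]
print(apply_policies(records, [default_private]))
```

The returned log doubles as a simple record of which policies were enforced on which objects, which is the kind of state a later compliance check could read.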


Practical Policy Working Group


• Practical Policy members represented
  – 11 types of data management systems
  – 30 institutions
  – 2 testbeds
    • iRODS: Renaissance Computing Institute, DataNet Federation Consortium (DFC)
    • GPFS: Institute of Physics of the Academy of Sciences, CESNET; Garching Computing Centre (RZG)
• Published two documents
  – Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Templates”, February 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC.
  – Moore, R., R. Stotzka, C. Cacciari, P. Benedikt, “Practical Policy Implementations”, February 2015, http://dx.doi.org/10.15497/83E1B3F9-7E17-484A-A466-B3E5775121CC.

Policy Templates


INLS 624

Data Center Policies

• Contextual metadata extraction
  – Automate extraction of metadata from files
• Data access control
  – Automate application of appropriate access controls
• Data backup
  – Automate creation of replicas
• Data format control
  – Automate identification of data format
• Data retention
  – Apply a retention period
• Disposition
  – Apply a disposition policy at end of retention period
• Integrity (including replication)
  – Verify integrity and replace bad copies
• Notification
  – Manage events about changes to the collection
• Restricted searching
  – Manage searches on collection
• Storage cost reports
  – Generate cost report
• Use agreements
  – Manage use agreements before data are retrieved
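As an illustration of how two of these policies fit together, the retention and disposition entries can be sketched as a date calculation followed by a check. This is a minimal Python sketch under stated assumptions: the function names and the disposition labels ("review", "delete", "archive") are hypothetical and do not come from the Practical Policy documents.

```python
from datetime import date, timedelta
from typing import Optional


def retention_expiry(ingested: date, retention_days: int) -> date:
    """Return the date on which the retention period ends."""
    return ingested + timedelta(days=retention_days)


def disposition(ingested: date, retention_days: int, today: date,
                at_expiry: str = "review") -> Optional[str]:
    """Return the disposition action once retention has ended, else None.

    `at_expiry` is a hypothetical label such as "review", "delete", or "archive".
    """
    if today >= retention_expiry(ingested, retention_days):
        return at_expiry
    return None


ingested = date(2015, 2, 1)
print(retention_expiry(ingested, 365))
print(disposition(ingested, 365, date(2015, 6, 1)))
print(disposition(ingested, 365, date(2016, 3, 1), "archive"))
```

Keeping the expiry calculation separate from the disposition decision lets the same retention period drive different end-of-life actions per collection.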


Digital Library Management


LifeTime Library Policies

• Requirements
  – Enable students to create a personal digital collection
  – Provide pedagogy mechanisms for experimenting with:
    • Naming - File names
    • Arrangement - Organization in collections
    • Description - Tags and metadata
    • Access controls - Sharing and publication
    • Ingestion - Controlled loading of data
    • Distribution - Storage locations


Student Experiences

• Students invariably:
  – Changed their minds about the purpose of the collection
  – Changed their minds about the description
    • Term definitions tended to drift over the semester
  – Changed their minds about the arrangement
    • Added new collections for additional types of data
• Resulting collections had:
  – 1,000 to 10,000 files
  – 2 gigabytes to 150 gigabytes in size
  – 4 to 10 metadata attributes per file


Protected Data


Protected Data Management

• UNC-CH has published an administrator’s guide for the management of protected data. This includes:
  – PII: Personally Identifiable Information
  – PHI: Protected Health Information
  – PCI: Payment Card Industry information
• The question is whether each of the tasks specified in the guide can be automated as policies enforced by the data grid.
• See Chapter 6 of the Policy Examples Workbook
  – This specifies 51 tasks that should be managed by the administrator


Protected Data Tasks

1. Check for presence of PII on ingestion
2. Check for viruses on ingestion
3. Check passwords for required attributes
4. Encrypt data on ingestion
5. Encrypt data transfers
6. Federation - control data copies (access control)
7. Federation - manage remote data grid interactions (update rule base)
8. Federation - periodically copy data
9. Federation - manage data retrieval (update access controls)
10. Generate checksum on ingestion
11. Generate report of corrections to data sets or access controls
12. Generate report for cost (time) required to audit events
13. Generate report of types of protected assets present within a collection
14. Generate report of all security and corruption events
15. Generate report of the policies that are applied to the collections
16. List all storage systems being used
17. List persons who can access a collection
18. List staff by position and required training courses
19. List versions of technology that are being used
20. Maintain document on independent assessment of software
21. Maintain log of all software changes, OS upgrades
22. Maintain log of disclosures
23. Maintain password history on user name
24. Parse event trail for all accessed systems
25. Parse event trail for all persons accessing collection
26. Parse event trail for all unsuccessful attempts to access data
27. Parse event trail for changes to policies
28. Parse event trail for inactivity
29. Parse event trail for updates to rule bases
30. Parse event trail to correlate data accesses with client actions
31. Provide test environment to verify policies on new systems
32. Provide test system for evaluating a recovery procedure
33. Provide training courses for users
34. Replicate data sets on ingestion
35. Replicate iCAT periodically
36. Set access approval flag
37. Set access controls
38. Set access restriction until approval flag is set
39. Set approval flag per collection for enabling bulk download
40. Set asset protection classifier for data sets based on type of PII
41. Set flag for whether tickets can be used on files in a collection
42. Set lockout flag and period on user name - counting number of tries
43. Set password update flag on user name
44. Set retention period for data reviews
45. Set retention period on ingestion
46. Track systems by type (server, laptop, router, …)
47. Verify approval flags within a collection
48. Verify files have not been corrupted
49. Verify presence of required replicas
50. Verify that no controlled data collections have public or anonymous access
51. Verify that protected assets have been encrypted
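To make tasks 1 and 40 concrete, the following Python sketch scans text for two kinds of identifiers on ingestion and derives a protection label from what it finds. It is an assumption-laden illustration, not the method in the UNC-CH guide or the Policy Examples Workbook: the patterns are deliberately simple, and the names `PII_PATTERNS`, `find_pii_types`, and `classify_on_ingestion`, as well as the label format, are hypothetical.

```python
import re
from typing import Dict, List

# Illustrative patterns only; real PII detection needs far more care.
PII_PATTERNS: Dict[str, "re.Pattern[str]"] = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def find_pii_types(text: str) -> List[str]:
    """Return the sorted names of the PII patterns that match the text."""
    return sorted(name for name, pattern in PII_PATTERNS.items()
                  if pattern.search(text))


def classify_on_ingestion(text: str) -> str:
    """Map detected PII types to a hypothetical asset protection label."""
    found = find_pii_types(text)
    return "protected:" + ",".join(found) if found else "unrestricted"


print(classify_on_ingestion("Contact jane.doe@example.org, SSN 123-45-6789"))
print(classify_on_ingestion("temperature 21.5 C"))
```

A label of this kind could then drive other tasks in the list, such as setting access restrictions or requiring encryption for objects that are not "unrestricted".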


Task Automation

• There are some unifying requirements across tasks:
  – Checking material for PII, viruses
  – Management of passwords
  – Generation of log files for all actions done
  – Creation of state information to track processes
  – Management of encryption
  – Management of access controls
  – Generation of audit trails
  – Parsing of events to demonstrate compliance over time
  – Verification that processes were correctly applied

• Many of these requirements can also be applied to digital libraries and research collaborations
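The "parsing of events" requirement can be illustrated with a small Python sketch that counts unsuccessful access attempts per user. The event-trail line format, the `DENIED` status value, and the function name are assumptions made for this example; an actual data grid audit trail will have its own format.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical event-trail format: "<ISO timestamp> <user> <action> <path> <status>"
EVENTS = [
    "2015-02-01T10:00:00 alice read /zone/a.dat OK",
    "2015-02-01T10:01:00 bob read /zone/a.dat DENIED",
    "2015-02-01T10:02:00 bob read /zone/b.dat DENIED",
    "2015-02-01T10:03:00 alice write /zone/a.dat OK",
]


def failed_attempts(lines: Iterable[str]) -> Dict[str, int]:
    """Count unsuccessful access attempts per user in an event trail."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip malformed lines rather than guess their meaning
        _timestamp, user, _action, _path, status = fields
        if status == "DENIED":
            counts[user] += 1
    return dict(counts)


print(failed_attempts(EVENTS))
```

The same per-user counts could feed a lockout policy, or be reported over time as evidence that access controls are being enforced.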


Preservation


Cross-Disciplinary Data Discovery and Geographically Distributed Preservation

DFC April 2013 NSF Review Slide 19


Archive Policies

• The Dataverse network has about 800 gigabytes of data that may contain protected information.
• An archive is needed with independent management of the material to ensure recovery in the case of a disaster.
  – Digital objects and provenance metadata must be re-loadable into Dataverse.
  – Assessment criteria need to be evaluated to verify integrity.
  – Access controls must be enforced on restricted data.
  – Dataverse naming convention must be retained.
• Approach is to replicate the data holdings into an iRODS data grid.


Policies

• See Chapter 5 of the Policy Examples Workbook
  – Odum preservation policies
• Preservation tasks include:
  – Staging files between Dataverse and iRODS
  – Checking data for presence of protected information
  – Periodic verification of integrity and replicas
  – Verification of access controls
  – Reports on usage statistics
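The periodic verification of integrity and replicas can be sketched as a comparison of each stored copy against a checksum registered at ingestion, plus a count of good copies. This Python sketch assumes SHA-256 checksums and in-memory byte strings standing in for stored replicas; the function names and the report keys are hypothetical and are not taken from the Odum preservation policies.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def check_replicas(replicas: Dict[str, bytes], expected_checksum: str,
                   required_copies: int) -> Dict[str, object]:
    """Compare each replica to the registered checksum and count good copies.

    `replicas` maps a hypothetical storage location name to the bytes held there.
    """
    bad: List[str] = sorted(loc for loc, data in replicas.items()
                            if sha256_hex(data) != expected_checksum)
    good = len(replicas) - len(bad)
    return {
        "bad_replicas": bad,
        "good_copies": good,
        "compliant": good >= required_copies,
    }


original = b"survey data"
registered = sha256_hex(original)
report = check_replicas({"disk": original, "tape": b"survey dat\x00"},
                        registered, required_copies=2)
print(report)
```

A report that lists bad replicas by location gives a follow-up step what it needs to replace each bad copy from a good one.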


NSF Data Management Plans


• The National Science Foundation has mandated that every project provide a 2-page description of how data will be managed.

• Each NSF directorate published guidelines on what the data management plan should include.

• An analysis of 12 sets of requirements identified 38 data management tasks that could be automated.

• See Chapter 7 of the Policy Templates Workbook


NSF DMP Requirements


Science Observatory Network


Real-Time Sensor Data

• Harvest sensor data from the Antelope Real Time Sensor orb.
  – Manages environmental, oceanic, seismic data
  – More than 3,000 sensors across the US
• Register each sensor as an independent collection
  – Retrieve the most recent sensor data
  – Harvest sensor data periodically
  – Transform to JSON, netCDF
  – Provide access to archived data
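The "Transform to JSON" step can be pictured with the Python sketch below, which serializes one harvested reading. It does not use the Antelope API; the function name, the sensor identifier, and the field names (`sensor`, `time`, `values`) are assumptions for illustration, and the netCDF transformation is not shown.

```python
import json
from datetime import datetime, timezone
from typing import Dict


def reading_to_json(sensor_id: str, epoch_seconds: float,
                    values: Dict[str, float]) -> str:
    """Serialize one sensor reading as a JSON document with an ISO 8601 UTC time."""
    timestamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    record = {
        "sensor": sensor_id,
        "time": timestamp.isoformat(),
        "values": values,
    }
    return json.dumps(record, sort_keys=True)


print(reading_to_json("station-001", 0, {"temperature_c": 12.5}))
```

Storing the time in UTC with an explicit offset avoids ambiguity when readings from sensors in different time zones are archived together.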


PECE / RPI


Collection Management Policies

• Contextual metadata extraction
• Data access control
• Data backup
• Data format control
• Data retention
• Disposition
• Integrity (including replication)
• Notification
• Restricted searching
• Storage cost reports
• Use agreements


NOAA NCDC


NOAA Climatic Data Center

• Manages an archive of climate data records received from multiple sources
  – Uses a staging area to
    • Check input data for viruses
    • Manage ingestion into a tape archive
• Challenges
  – Needed a way to improve security
    • Eliminate direct access to storage within the NOAA firewall
  – Needed a way to automate management of each file
    • Verify archival storage before file is deleted
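The last requirement, verifying archival storage before a staged file is deleted, can be sketched as a copy, a checksum comparison, and only then a delete. This Python sketch uses local directories to stand in for the staging area and the archive; the function names and paths are hypothetical and it is not the NCDC or iRODS implementation.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_then_delete(staged: Path, archive_dir: Path) -> Path:
    """Copy a staged file to the archive and delete the staged copy only
    after the archived copy's checksum matches."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = archive_dir / staged.name
    shutil.copy2(staged, archived)
    if file_sha256(archived) != file_sha256(staged):
        archived.unlink()  # discard the bad copy and keep the staged file
        raise IOError(f"checksum mismatch for {staged.name}; staged file kept")
    staged.unlink()
    return archived


with tempfile.TemporaryDirectory() as tmp:
    staging = Path(tmp) / "staging"
    staging.mkdir()
    sample = staging / "record.dat"
    sample.write_bytes(b"climate data record")
    stored = archive_then_delete(sample, Path(tmp) / "archive")
    print(stored.exists(), sample.exists())
```

Ordering the steps this way means a failed or corrupted transfer never leaves the system without a good copy of the file.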


[Figure: NCDC ingest diagram. Labels: External Providers; FTP/FTPS; NCDC External Firewall; DMZ Landing Zone: Open for data delivery; FTP Load Balance; ftp1, ftp2, ftp3, ftp4, ftp5; DMZ Firewall; FTP PUSH/PULL; iRODS Secure Ingest; iRODS DMZ Grid; iRODS NCDC Grid; NCDC Internal Network; ingest1, ingest2; Disk Cache; Tape; HDSS. Path labels as extracted: /DMZ/Archive, /NR2/NR3, /NCDC, /NR2/Ingest, /NR3/NR2, /Archive, /NR3.]

iRODS is:
• Secure authentication
• Security via Obscurity (one to bind them)
• Uses a pull mechanism to move data into NCDC grid
• A virtual management tool (clean-up)
• Scope is entire grid


www.datafed.org
www.irods.org

Policy Examples Workbook
Policy Templates Workbook