

Rolling Deck to Repository II: Getting Control of Provenance and Quality

AGU Poster IN43A-1169
AGU Fall Meeting, December 17, 2008
http://rvdata.us/

Stephen P. Miller 1, Dru Clark 1, Caryn Neiswender 1, Robert A. Arko 2, Cynthia L. Chandler 3

I. A Paradigm Shift

Current Epoch
1. Gather all existing data
2. Go to sea
3. Merge new data with old
4. Try to figure out why things don’t agree (iterative process)
5. Publish in online journal
6. Exchange data

Prior Epoch
1. Go to sea
2. Work up your own data
3. Publish in a journal
4. Exchange reprints

II. Provenance and Quality Control

Provenance: Track events throughout data life cycle: acquisition, QC, editing, merging, calibration, archiving

Quality Control: Check data (and metadata) values according to established criteria, flag or repair, and report findings

Make provenance and quality control information readily available for wide range of users, over decades, with “Institutional Quality Control Certificate”

V. Current and Pending Development

Case Study 2: Multibeam Seafloor Mapping Data

Multibeam systems depend on accurate navigation, vertical reference and sound velocity data. NSF may support mandatory roll and pitch bias tests, at least on an annual basis. A major barrier to interdisciplinary re-use of data comes from a lack of understanding of the quality of a data file as it is exchanged among users, institutions and repositories. Current practice can lead to the propagation of artifacts, or at best wasteful duplication of QC effort.

The R2R project is working toward standard MB-System-based QC tools and reporting methods, to be recorded in an Institutional Quality Certificate (XML) that may travel with the swath file throughout its life cycle.

Tools

To evaluate data against established criteria, the R2R project will include the development, testing and deployment of standardized tools.

One tool under active development by Scripps Institution of Oceanography Shipboard Technical Services is navd, which uses complex algorithms to select the “best” navigation data from multiple GPS data streams.

Quality Control Certificate

The Quality Control Certificate will include information about both quality and provenance. This certificate will use XML formatting, along with controlled vocabularies for quality tests and data processing activities. Once certified by the appropriate authority, the Quality Control Certificate will inform data consumers about the quality and history of a data object.
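As an illustration only, a certificate of this kind could be serialized with Python's standard xml.etree.ElementTree module. The element and attribute names below (qc_certificate, record, and so on) and the two-term vocabulary are assumptions made for this sketch; the poster does not specify the R2R schema or its controlled vocabularies.

```python
import xml.etree.ElementTree as ET

# Hypothetical controlled vocabulary; the real R2R terms are not given in the poster.
RECORD_TYPES = {"provenance", "quality"}


def build_certificate(file_info, records):
    """Return an XML string describing one data file and its certificate records."""
    root = ET.Element("qc_certificate")  # element names are assumptions
    file_el = ET.SubElement(root, "file")
    for key, value in file_info.items():
        ET.SubElement(file_el, key).text = str(value)

    for rec in records:
        # Reject record types that are not in the controlled vocabulary
        if rec["type"] not in RECORD_TYPES:
            raise ValueError(f"unknown record type: {rec['type']}")
        rec_el = ET.SubElement(root, "record", type=rec["type"], name=rec["name"])
        ET.SubElement(rec_el, "date").text = rec["date"]
        ET.SubElement(rec_el, "authority").text = rec["authority"]
        ET.SubElement(rec_el, "institution").text = rec["institution"]
        ET.SubElement(rec_el, "discussion").text = rec["discussion"]
    return ET.tostring(root, encoding="unicode")


xml_text = build_certificate(
    {"filename": "SB.19990202.edp.mb32", "filesize": 4032914},
    [{
        "type": "quality",
        "name": "version",
        "date": "2002-05-16",
        "authority": "Clark, Dru",
        "institution": "SIO GDC",
        "discussion": "Part of standard multibeam QC review by GDC",
    }],
)
print(xml_text)
```

Checking each record type against a fixed set is one simple way to enforce a controlled vocabulary at write time; a production schema would more likely be validated with an XML schema language.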

Case Study 1: Navigation Data

Almost every data stream and sampling event depends on accurate navigation. Even in the era of modern GPS systems, artifacts arise from instrument and data transfer errors, signal blockage, combining data from diverse receivers, semi-automatic or manual errors in intermediate file management, inappropriate resampling, or conversions to other formats.

Improvements are needed to automatically detect and flag unrealistic values and outliers. Standard graphical tools would help to avoid embarrassing track lines over land.
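One simple way to detect unrealistic navigation values is to flag fixes that lie outside valid coordinate ranges or that imply a speed no ship could reach. The sketch below is not the navd algorithm, which the poster does not describe; the function names, the 20-knot threshold and the sample positions are all assumptions chosen for illustration.

```python
import math
from datetime import datetime

EARTH_RADIUS_M = 6371000.0  # mean Earth radius


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two positions given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def flag_fixes(fixes, max_speed_knots=20.0):
    """Return one flag per fix: 'good', or 'bad' if out of range or too fast.

    fixes: list of (datetime, lat_deg, lon_deg) tuples, sorted by time.
    The threshold is a hypothetical default, not an R2R criterion.
    """
    max_speed_ms = max_speed_knots * 0.514444  # knots to m/s
    flags = []
    last_good = None
    for t, lat, lon in fixes:
        ok = -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0
        if ok and last_good is not None:
            t0, lat0, lon0 = last_good
            dt = (t - t0).total_seconds()
            # Reject non-increasing times and jumps faster than the threshold
            ok = dt > 0 and haversine_m(lat0, lon0, lat, lon) / dt <= max_speed_ms
        flags.append("good" if ok else "bad")
        if ok:
            last_good = (t, lat, lon)
    return flags


# Made-up track with one position jump in the third fix
fixes = [
    (datetime(2008, 12, 17, 0, 0, 0), 32.700, -117.200),
    (datetime(2008, 12, 17, 0, 1, 0), 32.701, -117.202),
    (datetime(2008, 12, 17, 0, 2, 0), 33.500, -117.204),
    (datetime(2008, 12, 17, 0, 3, 0), 32.703, -117.206),
]
print(flag_fixes(fixes))  # ['good', 'good', 'bad', 'good']
```

Each fix is compared with the last fix that passed, so a single jump does not cause the following good fix to be rejected. The sketch trusts the first in-range fix; a real tool would need a more careful way to establish the starting position.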

Sample Quality Control Certificate

filename: SB.19990202.edp.mb32
filesize: 4032914
checksum: 63758953c5a96e3f91d8f2049b9ed149
date of current file: 2007-07-26

| Record Type | Record Name | Record Date | Authority Name | Authority Institution | Description | Discussion |
| provenance | acquisition | 1999-02-02 | Charters, James | SIO SOMTS | Original acquisition of data that led to version in this file | Original SB 2000 data provided depths assuming 1500 m/sec |
| provenance | processed | 2001-10-12 | Peckman, Uta | SIO GDC | This file contains values that have been transformed by some sort of algorithm or filtering | Converted to true depths with correct svp, and reprocessed with correct pitch-, roll- and yaw-biases removed |
| provenance | access release | 2002-03-09 | Peckman, Uta | SIO GDC | This file may be released to the public, according to the right-to-use statement in the accompanying metadata or supporting archive web site | Proprietary hold has now been released by original data owner, chief scientist Hubert Staudigel |
| quality | version | 2002-05-16 | Clark, Dru | SIO GDC | Certifies that this file is the best available version at current time | Part of standard multibeam QC review by GDC |
| quality | metadata | 2002-05-16 | Clark, Dru | SIO GDC | Certifies that metadata associated with this file are free of errors | Part of standard multibeam QC review by GDC |
| provenance | published in SIOExplorer | 2002-07-06 | Clark, Dru | SIO GDC | This file has been archived in the SIOExplorer Digital Library, along with supporting metadata, http://SIOExplorer.ucsd.edu | Published in original version of SIOExplorer digital library |
| provenance | submitted to NGDC | 2002-08-15 | Smith, Stuart | SIO GDC | This file submitted to National Geophysical Data Center (NGDC) repository, www.ngdc.noaa.gov | Part of bulk transfer of multibeam cruises to NGDC |
| provenance | published in SIOExplorer | 2007-07-26 | Clark, Dru | SIO GDC | This file has been archived in the SIOExplorer Digital Library, along with supporting metadata, http://SIOExplorer.ucsd.edu | Re-published in revised version of SIOExplorer collection |
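The filename, filesize and checksum fields of the sample certificate let a recipient confirm that a file is the one the certificate describes. The sketch below shows one way such a check might look. The 32-character hexadecimal checksum is consistent with MD5, but the poster does not name the algorithm, so MD5 is an assumption here, as is the function name verify_file.

```python
import hashlib
import os


def verify_file(path, expected_size, expected_md5):
    """Compare a file's size and MD5 digest with the values in a certificate.

    Returns a list of problems; an empty list means both checks passed.
    """
    problems = []
    actual_size = os.path.getsize(path)
    if actual_size != expected_size:
        problems.append(f"filesize {actual_size} != {expected_size}")

    # Read in chunks so that large swath files do not have to fit in memory
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_md5.lower():
        problems.append("checksum mismatch")
    return problems
```

With the values from the sample certificate, a call such as verify_file("SB.19990202.edp.mb32", 4032914, "63758953c5a96e3f91d8f2049b9ed149") would return an empty list only if the local copy matches the certified file under these assumptions.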

IV. Case Studies

III. Background

It Takes a Team: Researchers, Students, Technicians, Data Managers

Preparation
Cruise Level Metadata: The Who, When and Where of a research cruise
Data Gathering: Scientists gather data in preparation for this cruise

Data Acquisition
Start with correct metadata

Submission to Repository

Retrieve for Research or Cruise Planning

Reprocessed Data Submitted to Repository

Rolling Deck to Repository (R2R) Project Overview

NSF-supported research vessels collectively produce an enormous volume and diversity of scientific data. With today’s rapidly rising ship costs, and the current trend toward greater re-use of shipboard data, it is imperative that the community takes positive, cost-effective, systematic steps to ensure greater data access.

The NSF Division of Ocean Sciences Data and Sample Policy (pub. NSF 04-004) states, “Principal Investigators are required to submit all environmental data collected to the designated National Data Centers as soon as possible, but no later than two (2) years after the data are collected. Inventories (metadata) of all marine environmental data collected should be submitted to the designated National Data Centers within sixty (60) days after the observational period/cruise.” However, procedures for such submissions are poorly established, require lengthy follow-up with investigators, and yield documentation of variable quality. As the volume and diversity of data collected by the fleet increase, this problem will only grow worse.
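The two deadlines in the policy quoted above reduce to simple date arithmetic. The sketch below is only an illustration: it measures both deadlines from the cruise end date, which is an assumption (the policy refers to when the data are collected and to the end of the observational period), and the function name is hypothetical.

```python
from datetime import date, timedelta


def submission_deadlines(cruise_end):
    """Return (metadata_due, data_due) for a cruise ending on cruise_end."""
    # Inventories (metadata): within sixty days
    metadata_due = cruise_end + timedelta(days=60)
    # Data: no later than two years
    try:
        data_due = cruise_end.replace(year=cruise_end.year + 2)
    except ValueError:
        # Cruise ended on Feb 29 and the target year has no Feb 29
        data_due = cruise_end.replace(year=cruise_end.year + 2, day=28)
    return metadata_due, data_due


print(submission_deadlines(date(2008, 12, 17)))
# (datetime.date(2009, 2, 15), datetime.date(2010, 12, 17))
```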

The R2R approach provides a “direct pipeline” from operating institutions to a central shoreside facility. Working directly with ship operators, we will ensure more complete and consistent data collection, quality control, and reporting.

This modernized system will transition the U.S. academic research fleet from a collection of independent expeditionary platforms into an integrated ocean observing system – a network of ships and submersibles around the world that routinely report a standard suite of underway data and documentation to a central repository. The streamlined R2R system will facilitate data discovery and integration, quality assessment, cruise planning, compliance with funding agency data policies, and long-term data preservation.

R2R Poster Series

Rolling Deck to Repository I: Designing a Database Infrastructure, AGU Poster # IN43A-1168
Rolling Deck to Repository II: Getting Control of Provenance and Quality, AGU Poster # IN43A-1169
Rolling Deck to Repository III: Shipboard Event Logging, AGU Poster # IN43A-1170

R2R Project Leads

1 Scripps Institution of Oceanography: Stephen P. Miller
2 Lamont-Doherty Earth Observatory: Robert A. Arko
3 Woods Hole Oceanographic Institution: Cynthia L. Chandler

The Rolling Deck to Repository Project acknowledges support from the National Science Foundation (NSF), Oceanographic Instrumentation and Technical Services (OITS) Program