11/9/2016

Implementing a Data Quality Strategy to simplify access to data

Kelsey Druken

Kelsey Druken, Claire Trenham, Lesley Wyborn, Ben Evans

National Computational Infrastructure, Canberra

eResearch 2016

nci.org.au

National Computational Infrastructure

• The diverse data collections are co-located with a Petascale HPC and Cloud facility with a:
  • Top 50 Supercomputer (1.2 Pflops)
  • HPC Cloud (3000 node)
  • Digital Laboratories

• Dynamic subsets are actively encouraged, and can be accessed via data services

• Processing times have decreased dramatically: new large data sets can be generated or analysed in minutes or hours instead of months

National Computational Infrastructure

• NCI hosts one of Australia’s largest repositories (10+ PBytes) of research data collections

• Spanning datasets from climate, coasts, oceans and geophysics through to astronomy, bioinformatics and the social sciences

National Computational Infrastructure

1. Climate/ESS Model Assets and Data Products

2. Earth and Marine Observations and Data Products

3. Geoscience Collections

4. Terrestrial Ecosystems Collections

5. Water Management and Hydrology Collections

Data Collections                                                    Approx. Capacity
CMIP5, CORDEX, ACCESS Models                                        5 Pbytes
Satellite Earth Obs: LANDSAT, Himawari-8, Sentinel, MODIS, INSAR    2 Pbytes
Digital Elevation, Bathymetry, Onshore/Offshore Geophysics          1 Pbyte
Seasonal Climate                                                    700 Tbytes
Bureau of Meteorology Observations                                  350 Tbytes
Bureau of Meteorology Ocean-Marine                                  350 Tbytes
Terrestrial Ecosystem                                               290 Tbytes
Reanalysis products                                                 100 Tbytes

Key Challenges

• Application of community-agreed data standards to the broad set of Earth systems and environmental data that are being used

• Within these disciplines, data span a wide range of:
  - Gridded
  - Non-gridded (e.g., trajectories/profiles, point data)
  - Coordinate reference projections
  - Resolutions

How data collections are accessed

Collections are being accessed and utilised from a broad range of options:
• Direct access on filesystem
• Web and data services
• Data portals
• Virtual labs (e.g., virtual desktops)

[Figure: eReefs online analysis portal]

National Environmental Research Data Interoperability Platform (NERDIP)

[Figure: NERDIP layered architecture diagram. Labels recovered from the diagram:]

• Workflow Engines, Virtual Laboratories (VLs), Science Gateways: Climate & Weather Science Lab, Biodiversity & Climate Change VL, VGL, AGDC VL, VHIRL, Globe Claritas, eMAST Speddexes, All Sky Virtual Observatory, eReefs, Open Nav Surface
• Data Portals: AuScope Portal, TERN Portal, AODN/IMOS Portal, ANDS/RDA Portal, Digital Bathymetry & Elevation Portal, Data.gov.au
• Tools: Models (Fortran, C, C++, MPI, OpenMP); Python, R, MatLab, IDL; Visualisation (Drishti); Ferret, NCO, GDL, GDAL, GRASS, QGIS
• Services Layer: OGC WFS, OGC SWE, OGC W*PS, OGC WCS, OGC WMS, OGC W*TS, CS-W, RDF, LD, OpenDAP, Direct Access, Vocab Service, PROV Service, fast “whole-of-library” catalogue
• API Layers: NetCDF-4 (Climate, Weather, Oceans, EO, Bathy), HDF5 ??, GDAL
• Data Conventions: netCDF-CF
• Other formats, shown in brackets: [SEG-Y], [Airborne Geophysics], [FITS], [LAS LiDAR], [HDF4-EOS]
• ISO 19115, ACDD, RIF-CS, DCAT, etc.
• HP Data Library Layer: HDF5
• Lustre, Other Storage (e.g., HDFS)

Data Quality Strategy (DQS)

Data Quality Strategy (DQS): The Goal?

Provide seamless programmatic access through standardisation of both data and services

The Goal

• Combining data
• Visualising
• How can we enable this type of easy access and use?

Motivation: Data Management Maturity Program

DMM Capability – 25 Processes to Perform, Manage, Define

1. Data Management Strategy Process Area
   1. Data Management Strategy
   2. Communications
   3. Data Management Function
   4. Grant Strategy/Business Case
   5. Funding

2. Data Governance Process Area
   6. Governance Management
   7. Vocabulary/Glossary
   8. Metadata Management

3. Data Quality Process Area
   9. Data Quality Strategy
   10. Data Profiling
   11. Data Quality Assessment
   12. Data Cleansing and Curation

4. Data Operations Process Area
   13. Data Requirements Definition
   14. Data Lifecycle Management
   15. Contribution / Provider Management

5. Platform and Architecture Process Area
   16. Architectural Standards
   17. Architectural Approach
   18. Data Management Platform
   19. Data Integration / Data Linking
   20. Data Archiving and Preservation

6. Infrastructure Support Practices
   21. Measurement and Analysis
   22. Process Management
   23. Process Quality Assurance
   24. Risk Management
   25. Configuration Management

Please see the eResearch presentation on this work:

Lesley Wyborn (NCI)
Tuesday (today) @ 5p
Grand Ballroom 1/2

Data Quality Strategy (DQS)

Data Quality Strategy (DQS): What does it involve?

1. Underlying HPD file format
2. Close collaboration with data custodians and managers
   • Planning, designing, or reassessing the data collections
3. Quality control through compliance with recognised community standards
4. Data assurance through demonstrated functionality across common platforms, tools, and services

National Computational Infrastructure

[Slide repeats the data collections list and capacity table shown earlier, annotated: NetCDF common data format]

NetCDF collection overview

[Figure: FORMAT, by collection]

The motivation: reduce ‘none’

Existing Community Standards

CF Conventions
• Climate and Forecast Conventions and Metadata: http://cfconventions.org/

ACDD
• Attribute Convention for Data Discovery: http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery

Together, these two standards define several categories of metadata that support usage, discoverability, and understanding of the data contents.

Many levels of metadata

[Figure: a Collection contains Datasets, and each dataset contains many Data files; each data file carries file-level and variable-level metadata]

Collection & dataset-levels (e.g., parent-child metadata)
• ISO-19115, ANZLIC, etc.

File (granule)-level
Contains 2 types of metadata:
(1) Variable-level (CF-Convention)
(2) Global file-level (ACDD**)

**Can link to collection/dataset metadata

Self-contained file format

Data file: Global-level and Variable(s)-level

Global-level: traditional metadata information
(i.e., global attributes relevant to all variables/file contents)
• What is the data
• How was it produced
• Who/where/when
• Contact information
• Instruments, sources
• Version history
STANDARD: Attribute Convention for Data Discovery (ACDD)

Variable(s)-level: data content information
(i.e., specific to each variable and the physical structure of the file)
• Coordinate variable(s) definition (value, name, size)
• Variable attributes (units, description, etc.)
• Type of data (gridded, discrete, ragged, etc.)
STANDARD: Climate and Forecast (CF) Conventions
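
The two metadata levels above can be illustrated with a minimal Python sketch. It is not an ACDD or CF checker: the file is represented by plain dictionaries rather than a real netCDF file, the attribute lists are small illustrative subsets of the two conventions, and the names `example_file`, `ACDD_GLOBAL`, `CF_VARIABLE` and `missing_metadata` are hypothetical.

```python
# Stand-in for one data file's metadata. Plain dicts are used instead of a
# real netCDF file so the sketch runs without any netCDF library installed.
example_file = {
    "global_attributes": {  # global file-level metadata (ACDD)
        "title": "Example gridded air temperature",  # hypothetical values
        "summary": "Illustrative file used to show the two metadata levels.",
        "Conventions": "CF-1.6, ACDD-1.3",
    },
    "variables": {  # variable-level metadata (CF)
        "tas": {"standard_name": "air_temperature", "units": "K"},
        "lat": {"standard_name": "latitude", "units": "degrees_north"},
        "lon": {"units": "degrees_east"},
    },
}

# Small illustrative subsets, NOT the full ACDD or CF attribute lists.
ACDD_GLOBAL = ("title", "summary", "keywords", "Conventions")
CF_VARIABLE = ("units", "standard_name")


def missing_metadata(file_meta):
    """Report which of the listed attributes are absent at each level."""
    global_attrs = file_meta.get("global_attributes", {})
    report = {
        "global": [name for name in ACDD_GLOBAL if name not in global_attrs],
        "variables": {},
    }
    for var_name, attrs in file_meta.get("variables", {}).items():
        gaps = [name for name in CF_VARIABLE if name not in attrs]
        if gaps:
            report["variables"][var_name] = gaps
    return report


report = missing_metadata(example_file)
print(report)
```

Keeping the global and per-variable checks separate mirrors the split between file-level and variable-level metadata, so a report can point a data manager at the level that needs attention.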

Compliance checker

• Want to adopt or utilise existing community checkers if possible
• Two main options:
  • UK Reading (CF-Convention document links to this one)
  • IOOS (growing fast, designed to be modified and extended)

Our own modifications
• Needed our own wrapper to enable collection-level scans
• Tailor our output and reporting

Compliance checker

Summarised version of the compliance status.

The breakdown: compliance scores and also a measure of consistency across the collection

Providing an attack plan for improvements: make it easy for data managers to efficiently address and meet baseline compliance
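
A collection-level summary of this kind could be sketched as follows. This is an illustration only, not the NCI wrapper or the API of any existing checker: the per-file results, the `BASELINE` threshold, and the function `summarise_collection` are all hypothetical, and consistency is measured here simply as the share of files that agree on one attribute value.

```python
from collections import Counter

# Hypothetical per-file results, as a wrapper might collect them after running
# a checker over every file in a collection: checks passed, checks run, and the
# value of one global attribute (None when the attribute is absent).
file_results = {
    "a.nc": {"passed": 18, "total": 20, "Conventions": "CF-1.6"},
    "b.nc": {"passed": 20, "total": 20, "Conventions": "CF-1.6"},
    "c.nc": {"passed": 10, "total": 20, "Conventions": None},
    "d.nc": {"passed": 19, "total": 20, "Conventions": "CF-1.6"},
}

BASELINE = 0.9  # assumed baseline compliance score, chosen for illustration


def summarise_collection(results, attribute):
    """Aggregate per-file results into collection-level figures."""
    scores = {name: r["passed"] / r["total"] for name, r in results.items()}
    value_counts = Counter(r.get(attribute) for r in results.values())
    common_value, common_count = value_counts.most_common(1)[0]
    return {
        "mean_score": sum(scores.values()) / len(scores),
        # Files below the baseline form the list of improvements to tackle.
        "needs_attention": sorted(n for n, s in scores.items() if s < BASELINE),
        "most_common_value": common_value,
        # Share of files that agree on the most common attribute value.
        "consistency": common_count / len(results),
    }


summary = summarise_collection(file_results, "Conventions")
for key, value in summary.items():
    print(key, value)
```

Reporting the files below baseline alongside the scores is one way to turn a scan into a to-do list for a data manager.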

The result: “win-win” for all

• Working with data custodians and managers
• Common goal
  • Enabling access to organised, performant, and interoperable data
• Progressive improvement in the quality of the datasets across the different subject domains
• “win-win”:
  • the ease by which the users can access, utilise, and combine datasets

Functionality tests

• Has now evolved to be not only a QC step in our data publishing process but also a Quality Assurance one
• Extend to test “usability” across a wide spectrum of scientific tools and data services
  • Commonly used libraries (e.g., netCDF, HDF, GDAL, etc.)
  • Accessibility by data servers (e.g., THREDDS, Hyrax, GeoServer)
  • Validation against scientific analysis and programming platforms (e.g., Python, Matlab, R, QGIS)
  • Visualization tools (e.g., ParaView, IDV, WMS-viewers)
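
The idea of running one file through a series of functionality tests can be sketched with a small Python harness. The two checks below are deliberately simple stand-ins that use only the standard library; real tests would instead open the file with the netCDF or GDAL libraries or request it through a data server. The names `CHECKS` and `run_functionality_tests` are hypothetical.

```python
import os
import tempfile

# HDF5 files (and therefore NetCDF-4 files) normally begin with this signature.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def can_open(path):
    """Stand-in check: the file exists and can be read."""
    with open(path, "rb") as handle:
        handle.read(1)
    return True


def has_hdf5_signature(path):
    """Stand-in check: the file starts with the HDF5 signature bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# Each entry maps a test name to a callable taking a path. Checks for specific
# libraries, servers or tools would be registered here in the same way.
CHECKS = {"opens": can_open, "hdf5_signature": has_hdf5_signature}


def run_functionality_tests(path, checks):
    """Run every check; a check that raises an error counts as a failure."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check(path))
        except Exception:
            results[name] = False
    return results


# Demonstration with throwaway files rather than real collection data.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.nc")
    with open(good, "wb") as handle:
        handle.write(HDF5_SIGNATURE + b"\x00" * 8)
    bad = os.path.join(tmp, "bad.nc")
    with open(bad, "wb") as handle:
        handle.write(b"not a netCDF file")

    good_report = run_functionality_tests(good, CHECKS)
    bad_report = run_functionality_tests(bad, CHECKS)
    missing_report = run_functionality_tests(os.path.join(tmp, "missing.nc"), CHECKS)

print(good_report, bad_report, missing_report)
```

Recording a failure instead of stopping at the first error lets one pass over a file report on every tool or service at once.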

Functionality tests

Primary motivation:

Positive experience for our users.

Expectation that advertised collections and services are usable.

Bonus results

• Feedback to the local and international communities
• The more we test and test, the more we learn
• Functionality tests lead to reference and training material for our user community

Please see the eResearch presentation:

“A learner-centred approach to specialised user training”
Claire Trenham (NCI)
Wednesday (tomorrow) @ 2:25p
Grand Ballroom 1/2

Bonus: User reference material

Summary/Future Work

What’s next?

• Automating and extending these measures and tests across our full collection

• What about the broader file formats?

• Staying connected and working with international communities

• E.g., NSF Funded “Advancing netCDF-CF for the Geoscience Community” (EarthCube)

https://www.earthcube.org/content/advancing-netcdf-cf-geoscience-community

Questions?

Thanks for listening.

Contact information:

Kelsey [email protected]

Claire [email protected]

Ben [email protected]

Lesley [email protected]