SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDEXES), Tim Pugh Bureau of Meteorology for ACEAS Grand 2014

Page 1: SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDEXES). Tim Pugh, ACEAS Grand 2014

SPEDDEXES: An open-source, community developed approach to enhancing the way ‘Big Data’ is managed, discovered and shared by ecosystem scientists

Evans, Bradley John*; Guru, Siddeswara; Allen, Stuart; Beckett, Duan; de Wit, Roald; Duursma, Daisy; Erwin, Tim; Evans, Ben; Fuchs, David; Hodge, Jonathan; Ip, Alex; King, Edward; Lewis, Adam; Paget, Matthew; Porter, David; Prentice, Iain Colin; Pugh, Tim; Scarth, Peter; Sixsmith, Joshua; Sun, Yi; Trevithick, Rebecca; Whitley, Rhys

SPatially Explicit Data Discovery, EXtraction and Evaluation Service

Page 2

SPatially Explicit Data Discovery, EXtraction and Evaluation Service

SPATIALLY and temporally EXPLICIT research data infrastructure to interrogate data streams on the National Computational Infrastructure

DATA DISCOVERY for TERN or any datasets which can be read by the platform

EXTRACTION AND EVALUATION drive advances in ecosystem science, impact assessment and land management

Community success story: convolution of ideas from ...

Government (Fed and State), CSIRO, NCI, INTERSECT and Universities

Page 3

Ever-growing need

The Spatially Explicit Data Discovery, Extraction and Evaluation Service (SPEDDEXES) was developed to address the ever-growing need to better manage and access the Big Data available to Australian ecosystem sciences today.

Coupled Model Intercomparison Project for Climate Experiments
• CMIP-5 consists of ~23 international models
• CMIP-5 international data repository is >2 PB
• CMIP-6 contributions expected to be 10x CMIP-5

Page 4

[Figure: bar chart, "Scientific Data for Research (NCI RDSI node) by 2015". Y-axis: Data Volumes (TB), logarithmic scale from 10 to 10000. Categories: Climate/Weather; Earth & Marine Observations; Geoscience Collections; Terrestrial Ecosystem; Water Mgmt, Hydrology. Bar labels: 5419, 1928, 326, 176, 140.]

Page 5

New approach for Big Data

It is no longer practical, let alone affordable, to continue to do data-intensive ecosystem science in the copy-and-work paradigm; a new approach to working with Big Data is required.

Think about network data access, not file downloads…

Cross-disciplinary use of file formats and services…

Open-source server technology and file formats…

Work with big data in a high performance facility

Page 6

Two key issues

The SPEDDEXES concept and tools address two key issues.
• Firstly, create a self-describing data archive which adheres to international standards and community conventions.
• Secondly, have data providers adopt community standards to enable data catalogue and data access services for easier use, management, and sustainability.

Page 7

SPEDDEXES architecture

Connecting data to applications through the use of open-source middleware services and web technologies
1. an Open-source Project for a Network Data Access Protocol (OPeNDAP)
2. the Open Geospatial Consortium (OGC) web services and the Web Map Service (WMS) and protocol
3. the Thematic Real-time Environmental Distributed Data Services (THREDDS) Data Server (TDS), an implementation of OPeNDAP and WMS
4. an Environmental Research Division's Data Access Program (ERDDAP) service to aggregate data sources and provide search and data download services
5. ZOO Web Processing Service (ZOO WPS) for server-side processing
6. a JavaScript web interface with search, visualization and subset download functionalities (a.k.a. SPEDDEXES-UI).
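To make the WMS part of this stack concrete, here is a minimal sketch of how a client might assemble a WMS 1.3.0 GetMap request. The endpoint, layer name and bounding box are hypothetical placeholders, not actual SPEDDEXES services; only the query parameter names come from the OGC WMS standard.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def wms_getmap_url(endpoint, layer, bbox, width=512, height=512,
                   crs="CRS:84", image_format="image/png"):
    """Build an OGC WMS 1.3.0 GetMap URL for a single layer.

    bbox is (min_lon, min_lat, max_lon, max_lat), the axis order used by CRS:84.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value asks the server for its default style
        "CRS": crs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and layer, roughly covering Australia.
url = wms_getmap_url("https://example.org/thredds/wms/demo/ndvi.nc",
                     "ndvi", (112.0, -44.0, 154.0, -10.0))

# Decode the query string again to show what the server would receive.
query = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(query["REQUEST"], query["BBOX"])
```

The point of the sketch is that a map image is requested by URL from the server holding the data, so the client never downloads the underlying file.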

Page 8

Seeking climatic data

ERDDAP Service
- Catalogue
- File (csv,…)
- RSS notify
- Rich user interface

NCAR Data Service
- Catalogue
- OPeNDAP
- WMS

NCI Data Service
- Catalogue
- OPeNDAP
- WMS

TERN Data Service
- Catalogue
- OPeNDAP
- WMS
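As an illustration of the "File (csv,…)" download offered by an ERDDAP service, the sketch below builds a griddap request for a CSV subset. The server address, dataset ID and variable name are hypothetical, and the dimension order (time, latitude, longitude) is an assumption that has to match the real dataset.

```python
def griddap_csv_url(server, dataset_id, variable, time, lat_range, lon_range):
    """Build an ERDDAP griddap URL that returns a CSV subset of one variable.

    Values in parentheses are dimension values (not array indices).
    Assumes the variable's dimensions are ordered time, latitude, longitude.
    """
    constraint = (
        f"{variable}[({time})]"
        f"[({lat_range[0]}):({lat_range[1]})]"
        f"[({lon_range[0]}):({lon_range[1]})]"
    )
    return f"{server.rstrip('/')}/griddap/{dataset_id}.csv?{constraint}"


# Hypothetical server, dataset and variable.
csv_url = griddap_csv_url(
    "https://example.org/erddap", "demo_rainfall", "rain",
    "2010-01-01T00:00:00Z", (-44, -10), (112, 154),
)
print(csv_url)
```

Some HTTP clients require the brackets in such a URL to be percent-encoded before the request is sent.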

Page 9

THREDDS and Discovery Systems

[Diagram: THREDDS services with a data server (netCDF, HDF, GRIB, …) communicate with discovery systems. Components: Metadata Generator, Catalog, Metadata Harvester, Metadata Repository, Discovery System. Arrow labels: Writes, Reads, References, Searches.]
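A metadata harvester of the kind named in this diagram could read a THREDDS catalog roughly as follows. This is a minimal sketch using only the Python standard library; the sample catalog, host name and dataset paths are invented for illustration, and real catalogs may also contain nested catalogRef entries and compound services that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Namespace of THREDDS InvCatalog 1.0 documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Invented sample catalog; a real harvester would fetch catalog.xml over HTTP.
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="demo">
  <service name="dap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="Demo collection">
    <dataset name="NDVI 2010" ID="demo/ndvi_2010" urlPath="demo/ndvi_2010.nc"/>
    <dataset name="NDVI 2011" ID="demo/ndvi_2011" urlPath="demo/ndvi_2011.nc"/>
  </dataset>
</catalog>
""".strip()


def harvest(catalog_xml, host):
    """Return one record per dataset that exposes a urlPath."""
    root = ET.fromstring(catalog_xml)
    service = root.find("t:service[@serviceType='OPENDAP']", NS)
    base = service.get("base")
    records = []
    # Only datasets with a urlPath are directly accessible; containers are skipped.
    for ds in root.iterfind(".//t:dataset[@urlPath]", NS):
        records.append({
            "id": ds.get("ID"),
            "name": ds.get("name"),
            "opendap_url": host + base + ds.get("urlPath"),
        })
    return records


records = harvest(SAMPLE_CATALOG, "https://example.org")
for record in records:
    print(record["name"], record["opendap_url"])
```

The resulting records are what a harvester might write to a metadata repository for a discovery system to search.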

Page 12

Trans-disciplinary science
• To publish, catalogue and access self-documented data for enhancing trans-disciplinary, big ecosystem-data science within interoperable data services and protocols.

Integrity of Science
• Ease of access to data enhances the scientist’s workflow and ensures more accurate and repeatable science, which can be conducted with less effort.

Integrity of Data
• The data repository services ensure data integrity, digital object identifiers, data discovery and catalogue searches.

Page 13

For further information:

Brad Evans
Director ~ TERN [email protected]

Tim Pugh
Australian Bureau of Meteorology
Centre for Australian Weather and Climate [email protected]

Page 14

Self-describing data

An open-source GeoSciences file format is the Network Common Data Form (netCDF) from Unidata (http://www.unidata.ucar.edu).

NetCDF goals that support data archives:
• Portable: byte-order neutral
• Efficient: random access
• Appendable data arrays
• Metadata within the file for global and variable attributes

Metadata conventions provide community standards for …
• self-describing (CF) metadata conventions
• data discovery (Unidata ACDD) conventions
• community-specific metadata (e.g. IMOS, TERN AusCover)
• http://www.auscover.org.au/userdocs/metadata
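To show what following a discovery convention means in practice, the sketch below checks a file's global attributes for the attributes that ACDD marks as highly recommended. The attribute list reflects my reading of ACDD 1.3 and should be checked against the convention itself; the example attributes are invented, and in real use they would be read from a netCDF file rather than typed into a dictionary.

```python
# Assumption: the "highly recommended" global attributes of ACDD 1.3.
ACDD_HIGHLY_RECOMMENDED = ("title", "summary", "keywords", "Conventions")


def missing_discovery_attributes(global_attrs):
    """Return the highly recommended attributes that are absent or blank."""
    return [
        name for name in ACDD_HIGHLY_RECOMMENDED
        if not str(global_attrs.get(name, "")).strip()
    ]


# Invented global attributes standing in for those of a netCDF file.
attrs = {
    "title": "Demo monthly rainfall",
    "Conventions": "CF-1.6, ACDD-1.3",
    "institution": "Example Institute",
}
missing = missing_discovery_attributes(attrs)
print(missing)
```

A data provider could run a check like this before publishing, so that catalogue services downstream find the metadata they expect.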

Page 15

Fundamental Objective of OPeNDAP

The fundamental objective of OPeNDAP and OPeNDAP Inc. is to facilitate internet access to scientific data.

This is done by:
• Providing a protocol (DAP) to access data over the internet,
• Hiding the format (and organization) in which the data are stored from the user, and
• Providing subsetting (and other) capabilities for the data at the server

OPeNDAP is based on a multi-tier architecture

OPeNDAP software is open source
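Server-side subsetting can be pictured as nothing more than a URL. The following sketch composes DAP2 request URLs: a response suffix selects what is returned (for example dds for the structure or dods for binary data), and a constraint expression after the question mark selects variables and index ranges written as [start:stride:stop]. The dataset URL and variable names are hypothetical.

```python
def dap2_url(dataset_url, response="dods", variables=None):
    """Compose a DAP2 request URL.

    variables maps a variable name to a list of (start, stride, stop)
    index triples, one per dimension; an empty list requests the whole variable.
    """
    projections = []
    for name, slices in (variables or {}).items():
        hyperslab = "".join(
            f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
        )
        projections.append(name + hyperslab)
    url = f"{dataset_url}.{response}"
    if projections:
        url += "?" + ",".join(projections)
    return url


# Hypothetical dataset served over OPeNDAP.
base = "https://example.org/thredds/dodsC/demo/file.nc"

structure_url = dap2_url(base, "dds")                      # dataset structure only
subset_url = dap2_url(base, "dods", {"var1": [(0, 1, 10)]})  # first 11 values of var1
multi_url = dap2_url(base, "dods", {"var1": [], "var2": []})  # two whole variables

print(structure_url)
print(subset_url)
print(multi_url)
```

Only the requested slice crosses the network, which is the practical difference from downloading the whole file.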

Page 16

THREDDS Data Server (TDS)

TDS is THREDDS Data Server
• THREDDS is Thematic Real-time Environmental Distributed Data Services
• Middleware to bridge the gap between data providers and data users
• THREDDS Data Server (TDS), a web server that provides catalog, metadata, and data access services for scientific datasets.
• The TDS is open source, 100% Java, and runs inside the open source Tomcat Servlet container.

Unidata’s Common Data Model
• merges the OPeNDAP, netCDF, and HDF5 data models to create a common API for scientific data
• implemented by the NetCDF Java library
• reads netCDF, OPeNDAP, HDF5, HDF4, GRIB 1 & 2, BUFR, NEXRAD 2 & 3, GEMPAK, MCIDAS, GINI, among others
• A pluggable framework allows other developers to add readers for their own specialized formats.
• provides standard APIs for geo-referencing coordinate systems, and specialized queries for scientific feature types like Grid, Point, and Radial datasets

Page 17

Spectrum of Use Cases

Application Data Representation
• OGC data model: domain specific; geospatial, 1-D, 2-D
• DAP2 data model: domain neutral; n-D, time series
• **DAP4 data model: domain neutral; new data types and data structures; streaming, compressed, chunked
• Common Data Model (CDM): domain specific
• Future data model: domain neutral??

Application Types
• Programmatic / Language API: FORTRAN, C/C++, JAVA, Python, NetCDF, Java NetCDF
• Programmatic / Tools: NetCDF, NCO, PyDAP; Custom Tools: OPeNDAP crawler, ocean_prep
• Interactive Data Viewer: IDV, Panoply, IDL, MATLAB, iPython (matplotlib), NCL, web browser (metadata)
• Interactive Analysis: MATLAB, IDL, iPython, NCL; Custom Application: Inundation Modeller
• Web Application: Live Access Server; IMOS Data Portal (WMS); Custom Java Servlet

Programming
• DAP2 Legacy Code: existing tools
• DAP2 New Code: new tools
• **DAP4 programming: legacy code support
• **DAP4 programming: new data model and protocols; streaming support
• **DAP4 programming: asynchronous access modes, server-side processing

Data Access Protocol
• Metadata Request: das, dds, ddx
• ASCII/Binary Data Request: simple data representation
• DAP Binary Object Request
• NcML Data Request: aggregation, virtual data sets
• **DAP4: server-side operations, async access mode, new data model, posting

Syntax
• Return data set info: file.nc.dds - readable; file.nc.ddx - XML; file.nc.asc - ASCII data return
• Select variables: file.nc.dods?var1,var2,var3
• Subset arrays: file.nc.dods?var1[0:1:10]
• Return file translations: file.nc.netcdf - NetCDF file
• Server-side operations: file.nc?GEOLOC(); async access mode??

Clients
• Programmatic Access: Tsunami inundation modeller, NetCDF, NCO, PyDAP, PyNetCDF, MATLAB, IDL, …
• Interactive Access: web browser - Catalog; MATLAB, IDL, Python, Panoply, …
• Data Library & Catalog Service: metadata harvesting; directory listings; remote THREDDS services
• Web Service: Java servlet, Java applet; Geospatial Information Service; OPeNDAP data service
• Analysis Service: Live Access Server

Service Capabilities
• DAP2 response: metadata, dods, ASCII / Binary
• **DAP4 Response: async access mode, server-side, streaming
• NcML: Aggregation service; Virtual Data Set Service; Remote Data Access
• Metadata Conversion and RDF: metadata definitions, translations (-> ISO), semantics, ontology; CF->ISO, CF->WMS, CF->WCS
• Layered Services: Catalogue service; WMS, WCS services; Authentication; Conformance checks; CF metadata check; ISO metadata check

**The DAP4 features listed are my estimation and not the official specification
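The metadata request listed under Data Access Protocol returns plain text that a client can inspect before asking for any data. As a rough illustration, the sketch below parses a simplified DDS response into variable names, types and shapes. The sample response is invented, and the parser handles only plain array declarations; real responses can also contain Grid and Structure blocks.

```python
import re

# Matches declarations such as "Float32 rain[time = 12][lat = 90];"
DECLARATION = re.compile(r"^\s*(\w+)\s+(\w+)((?:\[\w+ = \d+\])+);", re.MULTILINE)
DIMENSION = re.compile(r"\[(\w+) = (\d+)\]")

# Invented, simplified DDS text of the kind returned for file.nc.dds
SAMPLE_DDS = """Dataset {
    Float32 lat[lat = 90];
    Float32 lon[lon = 180];
    Float32 rain[time = 12][lat = 90][lon = 180];
} file.nc;
"""


def parse_dds(text):
    """Map each array variable to its type and its named dimension sizes."""
    variables = {}
    for dtype, name, dims in DECLARATION.findall(text):
        variables[name] = {
            "type": dtype,
            "shape": {dim: int(size) for dim, size in DIMENSION.findall(dims)},
        }
    return variables


layout = parse_dds(SAMPLE_DDS)
for name, info in layout.items():
    print(name, info["type"], info["shape"])
```

Knowing the shape up front lets a client choose index ranges for a subset request without downloading the arrays themselves.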