SPatially Explicit Data Discovery, Extraction and Evaluation Services (SPEDDEXES), Tim Pugh Bureau of Meteorology for ACEAS Grand 2014
SPEDDEXES: An open-source, community developed approach to enhancing the way ‘Big Data’ is managed, discovered and shared by ecosystem scientists
Evans, Bradley John*; Guru, Siddeswara; Allen, Stuart; Beckett, Duan; de Wit, Roald; Duursma, Daisy; Erwin, Tim; Evans, Ben; Fuchs, David; Hodge, Jonathan; Ip, Alex; King, Edward; Lewis, Adam; Paget, Matthew; Porter, David; Prentice, Iain Colin; Pugh, Tim; Scarth, Peter; Sixsmith, Joshua; Sun, Yi; Trevithick, Rebecca; Whitley, Rhys
SPatially Explicit Data Discovery, EXtraction and Evaluation Service
SPATIALLY and temporally EXPLICIT research data infrastructure to interrogate data streams on the National Computing Infrastructure
DATA DISCOVERY for TERN or any datasets which can be read by the platform
EXTRACTION AND EVALUATION drives advances in ecosystem science, impact assessment and land management
Community success story: convolution of ideas from ...
Government (Fed and State), CSIRO, NCI, INTERSECT and Universities
Ever growing need
The Spatially Explicit Data Discovery, Extraction and Evaluation Service (SPEDDEXES) was developed to address the ever growing need to better manage and access the Big Data available to Australian ecosystem sciences today.
Coupled Model Intercomparison Project for Climate Experiments
• CMIP-5 consists of ~23 international models
• CMIP-5 international data repository is >2 PB
• CMIP-6 contributions are expected to be 10x those of CMIP-5
[Figure: Scientific Data for Research (NCI RDSI node) by 2015. Bar chart of data volumes (TB), logarithmic axis from 10 to 10,000, for five categories: Climate/Weather; Earth & Marine Observations; Geoscience Collections; Terrestrial Ecosystem; Water Mgmt, Hydrology. Bar labels as extracted: 5419, 1928, 326, 176, 140.]
New approach for Big Data
It is no longer practical, let alone affordable, to continue to do data-intensive ecosystem science in the copy-and-work paradigm; a new approach to working with Big Data is required.
Think about network data access, not file downloads…
Cross-disciplinary use of file formats and services…
Open-source server technology and file formats…
Work with big data in a high performance facility
Two key issues
The SPEDDEXES concept and tools address two key issues.
• Firstly, create a self-describing data archive which adheres to international standards and community conventions.
• Secondly, have data providers adopt community standards to enable data catalogue and data access services for easier utility, management, and sustainability.
SPEDDEXES architecture
Connecting data to applications through the use of open-source middleware services and web technologies
1. the Open-source Project for a Network Data Access Protocol (OPeNDAP)
2. the Open Geospatial Consortium (OGC) web services, including the Web Map Service (WMS) protocol
3. the THREDDS (Thematic Real-time Environmental Distributed Data Services) Data Server (TDS), an implementation of OPeNDAP and WMS
4. an Environmental Research Division's Data Access Program (ERDDAP) service to aggregate data sources and provide search and data download services
5. ZOO Web Processing Service (ZOO WPS) for server-side processing
6. a JavaScript web interface with search, visualization and subset download functionalities (a.k.a. SPEDDEXES-UI).
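As an illustration of how a web client might talk to the WMS layer of this stack, the sketch below builds a WMS 1.3.0 GetMap request URL. The endpoint, layer name and bounding box are hypothetical placeholders, not actual SPEDDEXES services; only the standard GetMap query parameters are assumed.

```python
from urllib.parse import urlencode


def wms_getmap_url(endpoint, layer, bbox, width=512, height=256,
                   image_format="image/png"):
    """Build a WMS 1.3.0 GetMap URL for a single layer.

    bbox is (min_lon, min_lat, max_lon, max_lat). CRS:84 is used because
    its axis order is longitude, latitude.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value asks the server for its default style
        "CRS": "CRS:84",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return endpoint + "?" + urlencode(params)


# Hypothetical endpoint and layer name, for illustration only.
url = wms_getmap_url(
    "https://example.org/thredds/wms/demo/file.nc",
    "tas",
    (112.0, -44.0, 154.0, -10.0),
)
print(url)
```

The returned URL can be handed to any HTTP client or used directly as an image source in a browser-based viewer.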
Seeking climatic data
ERDDAP Service: Catalogue; File (csv, …); RSS notify; Rich user interface
NCAR Data Service: Catalogue; OPeNDAP; WMS
NCI Data Service: Catalogue; OPeNDAP; WMS
TERN Data Service: Catalogue; OPeNDAP; WMS
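A catalogue search against an ERDDAP service can be expressed as a plain URL. The following is a minimal sketch, assuming the ERDDAP full-text search endpoint (`search/index.<fileType>` with `searchFor`, `page` and `itemsPerPage` parameters); the base URL and search phrase are hypothetical.

```python
from urllib.parse import urlencode


def erddap_search_url(base_url, search_for, file_type="csv", items_per_page=20):
    """Build an ERDDAP full-text search URL that returns matching datasets."""
    params = {
        "page": 1,
        "itemsPerPage": items_per_page,
        "searchFor": search_for,
    }
    return f"{base_url.rstrip('/')}/search/index.{file_type}?{urlencode(params)}"


# Hypothetical ERDDAP base URL, for illustration only.
search_url = erddap_search_url("https://example.org/erddap/", "air temperature")
print(search_url)
```

Requesting the result as `csv` or `json` rather than HTML keeps the response easy to consume from scripts.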
THREDDS and Discovery Systems
[Diagram: a THREDDS data server (THREDDS services over netCDF, HDF, GRIB … files) communicates with discovery systems. Components: Metadata Generator, Catalog, Metadata Harvester, Metadata Repository, Discovery System. Arrow labels: Reads, References, Writes, Searches.]
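The harvesting step can be sketched as reading a THREDDS catalog document and turning each dataset entry into an OPeNDAP access URL. This is a minimal illustration, not the harvester used by any particular discovery system: the catalog content, host name and dataset paths are made up, and real catalogs may nest further catalogs or spell the `serviceType` value differently.

```python
import xml.etree.ElementTree as ET

# Namespace of THREDDS InvCatalog 1.0 documents.
NS = {"t": "http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0"}

# Made-up catalog, standing in for a response from a THREDDS server.
SAMPLE_CATALOG = """
<catalog xmlns="http://www.unidata.ucar.edu/namespaces/thredds/InvCatalog/v1.0" name="demo">
  <service name="dap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
  <dataset name="Demo collection">
    <dataset name="tas_2001.nc" urlPath="demo/tas_2001.nc"/>
    <dataset name="tas_2002.nc" urlPath="demo/tas_2002.nc"/>
  </dataset>
</catalog>
"""


def harvest_opendap_urls(catalog_xml, host):
    """Map each dataset name that has a urlPath to its OPeNDAP URL."""
    root = ET.fromstring(catalog_xml.strip())
    service = root.find("t:service[@serviceType='OPENDAP']", NS)
    if service is None:
        raise ValueError("catalog declares no OPENDAP service")
    base = service.get("base")
    return {
        ds.get("name"): host + base + ds.get("urlPath")
        for ds in root.iterfind(".//t:dataset[@urlPath]", NS)
    }


# Hypothetical host name.
urls = harvest_opendap_urls(SAMPLE_CATALOG, "https://example.org")
for name, url in urls.items():
    print(name, url)
```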
Trans-disciplinary science
• To publish, catalogue and access self-documented data for enhancing trans-disciplinary, big ecosystem-data science within interoperable data services and protocols.
Integrity of Science
• Ease of access to data enhances the scientist’s workflow and ensures more accurate and repeatable science, which can be conducted with less effort.
Integrity of Data
• The data repository services ensure data integrity, digital object identifiers, data discovery and catalogue searches.
For further information:
Brad Evans, Director ~ TERN [email protected]
Tim Pugh, Australian Bureau of Meteorology, Centre for Australian Weather and Climate [email protected]
Self-describing data
An open-source geosciences file format is the Network Common Data Form (netCDF) from Unidata (http://www.unidata.ucar.edu).
NetCDF design goals that support data archives:
• Portable: byte-order neutral
• Efficient: random access
• Appendable data arrays
• Metadata within the file for global and variable attributes
Metadata conventions provide community standards for …
• self-describing (CF) metadata conventions
• data discovery (Unidata ACDD) conventions
• community-specific metadata (e.g. IMOS, TERN AusCover)
• http://www.auscover.org.au/userdocs/metadata
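To make the idea of convention-based, self-describing metadata concrete, the sketch below checks a file's global attributes for a small set of discovery attributes. The attribute list (`Conventions`, `title`, `summary`, `keywords`) is assumed here to follow the "highly recommended" group of ACDD; the function name and the example attribute values are hypothetical, and a real check would read the attributes from the netCDF file itself.

```python
# Assumed minimal set of discovery attributes (ACDD "highly recommended").
REQUIRED_GLOBAL_ATTRS = ("Conventions", "title", "summary", "keywords")


def missing_discovery_attrs(global_attrs):
    """Return the required global attributes that are absent or blank."""
    return [
        name
        for name in REQUIRED_GLOBAL_ATTRS
        if not str(global_attrs.get(name, "")).strip()
    ]


# Example global attributes as they might be read from a file header.
attrs = {
    "Conventions": "CF-1.6, ACDD-1.3",
    "title": "Demo monthly air temperature",
    "summary": "",
}
missing = missing_discovery_attrs(attrs)
print(missing)
```

A data provider could run such a check before publishing, so that catalogue and search services always find the fields they index.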
Fundamental Objective of OPeNDAP
The fundamental objective of OPeNDAP and OPeNDAP Inc. is to facilitate internet access to scientific data.
This is done by:
• Providing a protocol (DAP) to access data over the internet,
• Hiding the format (and organization) in which the data are stored from the user, and
• Providing subsetting (and other) capabilities for the data at the server
OPeNDAP is based on a multi-tier architecture
OPeNDAP software is open source
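Server-side subsetting in DAP2 is requested through the URL itself: a response suffix selects what is returned, and a constraint expression after `?` selects variables and index ranges. The helper below is a minimal sketch of that URL construction; the function name, dataset URL and variable names are hypothetical, and some HTTP clients may require the square brackets to be percent-encoded.

```python
def dap2_url(dataset_url, response="dods", variables=None):
    """Build a DAP2 request URL.

    response  -- suffix such as "dds", "das", "ascii" or "dods"
    variables -- optional dict mapping a variable name to a list of
                 (start, stride, stop) index triples, one per dimension;
                 an empty list requests the whole variable
    """
    projection = []
    for name, slabs in (variables or {}).items():
        hyperslab = "".join(f"[{start}:{stride}:{stop}]"
                            for start, stride, stop in slabs)
        projection.append(name + hyperslab)
    url = f"{dataset_url}.{response}"
    if projection:
        url += "?" + ",".join(projection)
    return url


# Hypothetical dataset URL and variable names.
base = "https://example.org/thredds/dodsC/demo/file.nc"
print(dap2_url(base, "dds"))
print(dap2_url(base, "ascii", {"tas": [(0, 1, 10)], "lat": []}))
```

Because only the requested slice crosses the network, the client never needs a local copy of the full file.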
THREDDS Data Server (TDS)
TDS is the THREDDS Data Server
• THREDDS is Thematic Real-time Environmental Distributed Data Services
• Middleware to bridge the gap between data providers and data users
• THREDDS Data Server (TDS), a web server that provides catalog, metadata, and data access services for scientific datasets.
• The TDS is open source, 100% Java, and runs inside the open source Tomcat Servlet container.
Unidata’s Common Data Model
• merges the OPeNDAP, netCDF, and HDF5 data models to create a common API for scientific data
• implemented by the NetCDF Java library
• reads netCDF, OPeNDAP, HDF5, HDF4, GRIB 1 & 2, BUFR, NEXRAD 2 & 3, GEMPAK, MCIDAS, GINI, among others
• A pluggable framework allows other developers to add readers for their own specialized formats.
• provides standard APIs for geo-referencing coordinate systems, and specialized queries for scientific feature types like Grid, Point, and Radial datasets
Spectrum of Use Cases
Application Data Representation
• OGC data model: domain specific; geospatial, 1-D, 2-D
• DAP2 data model: domain neutral; n-D, time series
• **DAP4 data model: domain neutral; new data types and data structures; streaming, compressed, chunked
• Common Data Model (CDM): domain specific
• Future data model: domain neutral??
Application Types
• Programmatic / Language API: FORTRAN, C/C++, Java, Python, NetCDF, Java NetCDF
• Programmatic / Tools: NetCDF, NCO, PyDAP; custom tools: OPeNDAP crawler, ocean_prep
• Interactive Data Viewer: IDV, Panoply, IDL, MATLAB, iPython (matplotlib), NCL, web browser (metadata)
• Interactive Analysis: MATLAB, IDL, iPython, NCL; custom application: Inundation Modeller
• Web Application: Live Access Server; IMOS Data Portal (WMS); custom Java servlet
Programming
• DAP2 legacy code: existing tools
• DAP2 new code: new tools
• **DAP4 programming: legacy code support
• **DAP4 programming: new data model and protocols; streaming support
• **DAP4 programming: asynchronous access modes, server-side processing
Data Access Protocol
• Metadata request: das, dds, ddx
• ASCII/Binary data request: simple data representation
• DAP binary object request
• NcML data request: aggregation, virtual data sets
• **DAP4: server-side operations, async access mode, new data model, posting
Syntax
• Return data set info: file.nc.dds - readable; file.nc.ddx - XML; file.nc.asc - ASCII data return
• Select variables: file.nc.dods?var1,var2,var3
• Subset arrays: file.nc.dods?var1[0:1:10]
• Return file translations: file.nc.netcdf - NetCDF file
• Server-side operations: file.nc?GEOLOC(); async access mode??
Clients
• Programmatic access: tsunami inundation modeller, NetCDF, NCO, PyDAP, PyNetCDF, MATLAB, IDL, …
• Interactive access: web browser (catalog), MATLAB, IDL, Python, Panoply, …
• Data library & catalog service: metadata harvesting; directory listings; remote THREDDS services
• Web service: Java servlet, Java applet; Geospatial Information Service; OPeNDAP data service
• Analysis service: Live Access Server
Service Capabilities
• DAP2 response: metadata, dods, ASCII / Binary
• **DAP4 response: async access mode, server-side, streaming
• NcML: aggregation service; virtual data set service; remote data access
• Metadata conversion and RDF: metadata definitions, translations (-> ISO), semantics, ontology; CF->ISO, CF->WMS, CF->WCS
• Layered services: catalogue service; WMS, WCS services; authentication; conformance checks; CF metadata check; ISO metadata check
**DAP4 features listed are my estimation, not the official specification.
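A DAP2 array subset is written as var[start:stride:stop], and unlike a Python slice its stop index is inclusive. The sketch below emulates that selection on a local list to show which elements a server would return; the function name and sample data are hypothetical, and real servers apply the same rule per dimension.

```python
def apply_hyperslab(values, start, stride, stop):
    """Emulate a one-dimensional DAP2 [start:stride:stop] selection.

    The stop index is inclusive, so it is shifted by one for Python slicing.
    """
    if stride < 1 or not (0 <= start <= stop < len(values)):
        raise ValueError("hyperslab out of range")
    return values[start:stop + 1:stride]


data = list(range(100, 120))  # stand-in for a 20-element variable
every_value = apply_hyperslab(data, 0, 1, 10)
every_fifth = apply_hyperslab(data, 0, 5, 10)
print(len(every_value))  # 11, because index 10 is included
print(every_fifth)       # [100, 105, 110]
```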