Informatics and Computational Challenges for Satellite Monitoring of Global Biodiversity Mark...
If you can't read please download the document
Informatics and Computational Challenges for Satellite Monitoring of Global Biodiversity Mark SchildhauerRyan Pavlick NCEAS, UCSBNASA/JPL NASA/NCEAS Workshop,
Informatics and Computational Challenges for Satellite
Monitoring of Global Biodiversity Mark SchildhauerRyan Pavlick
NCEAS, UCSBNASA/JPL NASA/NCEAS Workshop, Dec 10, 2014
Slide 2
Analytical challenges Ecology and Biodiversity Sciences:
inherently multi-disciplinary: bio + earth critical,
societally-relevant environmental questions typically not local at
regional if not global scale analyses become far more robust and
efficient with faster access to wider range and larger volumes of
DATA 2 From: Halpern et al. A Global Map of Human Impact on Marine
Science Ecosystems, Science 15 February 2008: DOI:
10.1126/science.1149345
Slide 3
Good news more and more data There is a growing deluge of
environmental data to assist in these investigations
Slide 4
4 Fundamental Problem: Big Data Ecological/biodiversity data
are Big Data: globally distributed, voluminous highly heterogeneous
in structure and content rapidly growing! i.e., the 3 Vs: Volume
Variety Velocity
Slide 5
Informatics challenges Discovering and integrating data across
scales micro (and nano) to global aligning heterogeneous schema and
themes: land-use/land-cover, geology, soils, atmosphere, hydrology,
oceanography genes to ecosystems human sciences: culture
&traditions, demographics, economics, governance dealing with
volume: TB PB ++ satellite images sensors (aerial and ground-based)
observational data access and storage even GBs are problems at
Desktop!!! Documenting effects of climate change on forest
composition Large amounts of relevant data E.g., over 25,000 data
sets are available in the Knowledge Network for Biocomplexity
repository (KNB
http://knb.ecoinormatic.org)http://knb.ecoinormatic.org 5
Slide 6
Environmental Data the status quo Distributed: stewarded by
many groups, individuals Under-documented: sparsely and
inconsistently documented; jargon and acronyms; critical details
about data natural language (journals, white papers Inaccessible:
varying degrees and mechanisms of presentation via FTP, Web, etc.
Heterogeneous: broad range of relevant topics (semantics), lots of
different data formats (structure), data access protocols (syntax),
data models, etc.
Slide 7
Data collected by thousands of trained field scientists
providing invaluable on-the-ground, in situ information: fine-grain
detail on biodiversity but highly idiosyncratic approaches with
methods, naming of measurements Also, there is the long tail of
dark data* in ecology/biodiversity sciences * Heidorn, P Bryan.
2008 DOI: 10.1353/lib.0.0036
Slide 8
Ground-truthing & Observational Data AGGREGATORS are KEY:
Plant Occurrences and Vegetation Plots BIEN, Turboveg, sPlot, CTFS,
GBIF, Natural History Museum Collections, Map of Life Plant
Functional Traits (PFT) TRY, BIEN 8
Slide 9
Ground-truthing & Observational Data AGGREGATORS are KEY:
Sensor data NEON Genomic data iPlant Remote sensing and global
climate data NASA DAACs, IPCC... others... and MANY independent,
dark-tail, in situ data sets 9
Slide 10
Several Existing Resources eScience 201010
Slide 11
Geospatial Data Need better discovery, access to, and
integration of remote-sensing data with ground- truthing and
observational (in situ) data!! 11
for Preservation: ARCHIVES Archives should be permanent,
reliable, powerful, comprehensive, AND useful (usable) NSF DataNet
program: data stewardship and interoperability; exploring models
for sustainability federating major earth science data archives
distributed framework (shared responsibility) API (new groups can
participate, and are welcome!) Data, metadata, ontologies 13
Slide 14
for Discovery, Integration: SHARED KNOWLEDGE MODELS Consistency
and rigor in terminology Standardized protocols, methods when
possible Semantics approaches Ontologies for terminologies
Ontologies to describe data schemas Machine-assisted discovery,
reasoning, integration 14
Slide 15
Environmental Data the status quo Under-documented: sparsely
and inconsistently documented; jargon and acronyms; critical
details about data natural language (journals, white papers)
measurements: MAT, MAR, LL, LMA, LNA, PET, PLNTHT, VPD, VSWIR
techniques: SMLR, PSLR projects and models: LOPEX, ACCP, PROSPECT
instruments: AVIRIS, CASI, HyspIRI
Slide 16
Advance consistency and rigor in terminology and data
descriptions Standardize protocols, methods when possible
Development Tasks: Ontologies for domains Ontologies to describe
data schemas Mechanisms to bind data with Knowledge Models
Machine-assisted discovery, reasoning, integration Observational
data model as foundational template 16 Semantics approaches to
support machine- processing of data
Slide 17
Metadata-based Data Integration Metadata standards are step in
right direction Expose data in standard schema for transfer Dublin
Core ISO 19115 (geospatial metadata) and OGC Darwin Core
(biodiversity specimen metadata) EML (Ecological Metadata Language)
GeoSciML Can map one format to another to resolve minor differences
(but this gets arduous) And these still allow for terminological
inconsistencies, and dont support well hierarchy, synonymy, complex
relationships
Simple Darwin Core (2013)-- dwc:Occurrence
dwc:Eventdcterms:Location detected_during to_taxon happened_at
dwc:Identification dwc:Taxon basis_for dwc:MaterialSample
documented_by derived_from basis_for Can formalize in RDF: leads to
greater clarity of how concepts related; Conversion to triple
format can enable basic graph traversal
Slide 20
RDFS-based inferencing ENVO:Tropical Broadleaf Forest Biome! **
BENEFITS: Enhanced searching along subsumption hierarchies (classes
or properties) Formalized descriptions
Slide 21
Observational Data Model Implemented as an OWL-DL ontology
Provides basic concepts for describing observations Specific
extension points for domain-specific terms 21 Entity Characteristic
Observation Measurement Protocol Standard + precision : decimal +
method : anyType 1..1 * * * * 0..1 1..1 * * Value 1..1 * * Context
ObservedEntity
Slide 22
Semantic annotation 22 Attribute mappings
Slide 23
23 Open Open Science Scientists should communicate the data
they collect and the models they create, to allow free and open
access, and in ways that are intelligible, assessable and usable
for other specialists in the same or linked fields wherever they
are in the world. Where data justify it, scientists should make
them available in an appropriate data repository. Science as an
open enterprise, The Royal Society Science Policy Centre report
02/12
Slide 24
Why Open Science now? Technology is available to do it
(Internet + Web + Semantics + FLOPS) Growing politicization of
science: need for transparency Importance of large-
scale/interdisciplinary science Efficiencies in re-using or sharing
available data, code A return to fundamental premise of science:
objective, repeatable, transparent, general 24
Slide 25
Open Science: Open Data: repositories (e.g. NASA DAACs, NSFs
DataONE) Open Source: code and algorithms (e.g. Python, R) Open
Access: journals (e.g PLoS) Open Notebook: blog++ (e.g. iPython)
25
Slide 26
Open Data Rapid, highly affordable access to ALL the data
supporting scientific findings 26
Slide 27
Open Source Easy, fast, low-to-no (cost) barriers to languages,
code, libraries/packages, algorithms, and frameworks for
accomplishing analyses Multi-platform, scalable 27
Slide 28
Open Access (OA) Rapid, highly affordable access to the latest
scientific findings Issues: Peer-review process Copyright (IP
issues) Costs 28
Slide 29
Researchers still struggling to Discover relevant datasets
Access and integrate these its getting more difficult as volume,
diversity and complexity of data increase and Data Quality is
always a concern!!! 29 The (sad) status quo
Slide 30
Steps towards global-scale, Open Biodiversity Science?
Aggregators (coordinated, international): service providers who
assemble and harmonize distributed data for the scientific
community Better services and interfaces: requires standardization
of metadata and semantics overcome limitations of desktop
tools/frameworks all FOSS Cross-scale, cross data-type, integration
genomic, organismal, observational/ecological, sensors,
remote-sensing Must train researchers in the use of these new
tools, data types, and frameworks!! 30