Informatics and Computational Challenges for Satellite Monitoring of Global Biodiversity
Mark Schildhauer (NCEAS, UCSB) and Ryan Pavlick (NASA/JPL)
NASA/NCEAS Workshop, Dec 10, 2014

Analytical challenges
Ecology and Biodiversity Sciences:
- inherently multi-disciplinary: bio + earth
- critical, societally relevant environmental questions are typically not local but at regional, if not global, scale
- analyses become far more robust and efficient with faster access to a wider range and larger volumes of DATA
From: Halpern et al., A Global Map of Human Impact on Marine Ecosystems, Science, 15 February 2008. DOI: 10.1126/science.1149345

Good news: more and more data
There is a growing deluge of environmental data to assist in these investigations.

Fundamental Problem: Big Data
Ecological/biodiversity data are Big Data:
- globally distributed, voluminous
- highly heterogeneous in structure and content
- rapidly growing!
i.e., the 3 Vs: Volume, Variety, Velocity

Informatics challenges
- Discovering and integrating data across scales: micro (and nano) to global
- Aligning heterogeneous schema and themes:
  - land-use/land-cover, geology, soils, atmosphere, hydrology, oceanography
  - genes to ecosystems
  - human sciences: culture & traditions, demographics, economics, governance
- Dealing with volume: TB, PB, ++
  - satellite images
  - sensors (aerial and ground-based)
  - observational data
- Access and storage: even GBs are problems on the desktop!
Documenting effects of climate change on forest composition:
- Large amounts of relevant data
- E.g., over 25,000 data sets are available in the Knowledge Network for Biocomplexity repository (KNB, http://knb.ecoinformatics.org)

Environmental Data: the status quo
- Distributed: stewarded by many groups and individuals
- Under-documented: sparsely and inconsistently documented; jargon and acronyms; critical details about data are in natural language (journals, white papers)
- Inaccessible: varying degrees and mechanisms of presentation via FTP, Web, etc.
- Heterogeneous: broad range of relevant topics (semantics), lots of different data formats (structure), data access protocols (syntax), data models, etc.

Data collected by thousands of trained field scientists provide invaluable on-the-ground, in situ information:
- fine-grain detail on biodiversity
- but highly idiosyncratic approaches to methods and naming of measurements
Also, there is the long tail of dark data* in ecology/biodiversity sciences.
* Heidorn, P. Bryan. 2008. DOI: 10.1353/lib.0.0036

Ground-truthing & Observational Data: AGGREGATORS are KEY
- Plant Occurrences and Vegetation Plots: BIEN, Turboveg, sPlot, CTFS, GBIF, Natural History Museum Collections, Map of Life
- Plant Functional Traits (PFT): TRY, BIEN

Ground-truthing & Observational Data: AGGREGATORS are KEY
- Sensor data: NEON
- Genomic data: iPlant
- Remote sensing and global climate data: NASA DAACs, IPCC
- ... others ...
- and MANY independent, dark-tail, in situ data sets

Several Existing Resources
eScience 2010

Geospatial Data
Need better discovery of, access to, and integration of remote-sensing data with ground-truthing and observational (in situ) data!

Informatics Challenges
- Preservation
- Discovery/Integration
- Attribution

for Preservation: ARCHIVES
- Archives should be permanent, reliable, powerful, comprehensive, AND useful (usable)
- NSF DataNet program: data stewardship and interoperability; exploring models for sustainability
- Federating major earth science data archives
  - distributed framework (shared responsibility)
  - API (new groups can participate, and are welcome!)
  - data, metadata, ontologies

for Discovery, Integration: SHARED KNOWLEDGE MODELS
- Consistency and rigor in terminology
- Standardized protocols and methods when possible
- Semantics approaches
  - ontologies for terminologies
  - ontologies to describe data schemas
  - machine-assisted discovery, reasoning, integration

Environmental Data: the status quo
Under-documented: sparsely and inconsistently documented; jargon and acronyms; critical details about data are in natural language (journals, white papers)
- measurements: MAT, MAR, LL, LMA, LNA, PET, PLNTHT, VPD, VSWIR
- techniques: SMLR, PSLR
- projects and models: LOPEX, ACCP, PROSPECT
- instruments: AVIRIS, CASI, HyspIRI

Semantics approaches to support machine-processing of data
- Advance consistency and rigor in terminology and data descriptions
- Standardize protocols and methods when possible
Development Tasks:
- Ontologies for domains
- Ontologies to describe data schemas
- Mechanisms to bind data with Knowledge Models
- Machine-assisted discovery, reasoning, integration
- Observational data model as foundational template

Metadata-based Data Integration
- Metadata standards are a step in the right direction
- Expose data in a standard schema for transfer:
  - Dublin Core
  - ISO 19115 (geospatial metadata) and OGC
  - Darwin Core (biodiversity specimen metadata)
  - EML (Ecological Metadata Language)
  - GeoSciML
- Can map one format to another to resolve minor differences (but this gets arduous)
- And these still allow for terminological inconsistencies, and don't support hierarchy, synonymy, or complex relationships well
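The format-to-format mapping mentioned on this slide can be sketched as a simple field crosswalk. This is a minimal illustration under stated assumptions, not part of any standard's tooling: the source keys are Darwin Core term names, but the target field names, the sample record, and the `crosswalk` helper are all hypothetical.

```python
# Hypothetical crosswalk from Darwin Core term names to an invented local schema.
DWC_TO_LOCAL = {
    "scientificName": "species",
    "decimalLatitude": "lat",
    "decimalLongitude": "lon",
    "eventDate": "date",
}


def crosswalk(record, mapping):
    """Rename the keys of `record` using `mapping`; collect keys with no mapping."""
    converted, unmapped = {}, []
    for key, value in record.items():
        if key in mapping:
            converted[mapping[key]] = value
        else:
            unmapped.append(key)
    return converted, unmapped


# Made-up record, for illustration only.
dwc_record = {
    "scientificName": "Quercus agrifolia",
    "decimalLatitude": 34.41,
    "decimalLongitude": -119.85,
    "eventDate": "2014-12-10",
    "habitat": "oak woodland",  # has no counterpart in the local schema
}

converted, unmapped = crosswalk(dwc_record, DWC_TO_LOCAL)
print(converted)
print(unmapped)
```

Every pair of formats needs its own table like `DWC_TO_LOCAL`, which is why this approach becomes arduous, and a plain rename cannot express hierarchy or synonymy between terms.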

Semantic Data Integration
W3C Semantic standards (RDF, OWL) provide greater expressivity, formalization, enhanced search, and reasoning:
- Class/subclass subsumption
- Axioms/properties: reflexivity, transitivity, domain/range

Simple Darwin Core (2013)
[Diagram: classes dwc:Occurrence, dwc:Event, dcterms:Location, dwc:Identification, dwc:Taxon, dwc:MaterialSample; relationships detected_during, happened_at, to_taxon, basis_for, documented_by, derived_from]
- Can formalize in RDF: leads to greater clarity about how concepts are related
- Conversion to triple format can enable basic graph traversal
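As a rough sketch of what "basic graph traversal" over triples might look like, the example below stores a few subject-predicate-object triples and walks them breadth-first. The relationship names are taken from the slide, but which nodes each relationship connects is an assumption made here for illustration, and the instance identifiers (`occ:1`, `event:1`, and so on) are hypothetical.

```python
from collections import defaultdict, deque

# Illustrative triples only. Predicate names come from the slide; the
# subject/object each one links is assumed, and the identifiers are invented.
TRIPLES = [
    ("occ:1", "detected_during", "event:1"),
    ("event:1", "happened_at", "loc:1"),
    ("occ:1", "basis_for", "ident:1"),
    ("ident:1", "to_taxon", "taxon:1"),
]


def build_index(triples):
    """Index triples by subject so outgoing edges can be looked up quickly."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def reachable(index, start):
    """Breadth-first traversal: map each reachable node to its predicate path."""
    paths = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for predicate, obj in index.get(node, []):
            if obj not in paths:
                paths[obj] = paths[node] + [predicate]
                queue.append(obj)
    return paths


index = build_index(TRIPLES)
paths = reachable(index, "occ:1")
for node, path in paths.items():
    print(node, " -> ".join(path))
```

Starting from one occurrence, the traversal reaches both its location and its taxon without any table-specific join logic, which is the kind of benefit the triple format is meant to offer.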

RDFS-based inferencing
Example class: ENVO:Tropical Broadleaf Forest Biome
BENEFITS:
- Enhanced searching along subsumption hierarchies (classes or properties)
- Formalized descriptions
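Searching along a subsumption hierarchy can be illustrated with a small pure-Python sketch. The class hierarchy below is invented for the example and has not been checked against ENVO; the dataset names and annotations are hypothetical, and each class is given a single parent to keep the code short (real ontologies allow several).

```python
# Illustrative subclass assertions (child -> parent); NOT checked against ENVO.
SUBCLASS_OF = {
    "tropical broadleaf forest biome": "forest biome",
    "temperate broadleaf forest biome": "forest biome",
    "forest biome": "terrestrial biome",
    "grassland biome": "terrestrial biome",
}

# Hypothetical datasets, each annotated with one class.
ANNOTATIONS = {
    "dataset_a": "tropical broadleaf forest biome",
    "dataset_b": "grassland biome",
    "dataset_c": "forest biome",
}


def superclasses(term):
    """Return the term followed by every ancestor reachable via subclass links."""
    chain = [term]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def search(query):
    """Find datasets annotated with the query class or any of its subclasses."""
    return sorted(
        name for name, term in ANNOTATIONS.items() if query in superclasses(term)
    )


print(search("forest biome"))
print(search("terrestrial biome"))
```

A plain keyword match on "forest biome" would find only the dataset annotated with exactly that class; following the subclass links also returns the dataset annotated with the more specific tropical class.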

Observational Data Model
- Implemented as an OWL-DL ontology
- Provides basic concepts for describing observations
- Specific extension points for domain-specific terms
[Diagram: UML classes Entity, Characteristic, Observation, Measurement, Protocol, Standard, Value, Context, ObservedEntity; attributes precision : decimal, method : anyType; cardinality labels not recoverable]
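To make the concepts named on this slide more concrete, here is a loose sketch of them as Python dataclasses. It is not the ontology itself: the class names follow the slide, but how the classes relate, where `precision` and `method` attach, and the cardinalities are assumptions, since the diagram did not survive extraction. The sample values are made up.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Any, List, Optional


@dataclass
class Measurement:
    """One measured characteristic of an observed entity (structure assumed)."""
    characteristic: str                  # what was measured, e.g. "Height"
    standard: str                        # unit or reference standard, e.g. "Meter"
    value: Any                           # the recorded value
    precision: Optional[Decimal] = None  # slide: "+ precision : decimal"
    method: Any = None                   # slide: "+ method : anyType"
    protocol: Optional[str] = None       # name of the protocol followed, if any


@dataclass
class Observation:
    """An observation of one entity, optionally made in the context of others."""
    entity: str
    measurements: List[Measurement] = field(default_factory=list)
    context: List["Observation"] = field(default_factory=list)


# Made-up example: a tree observed within a plot.
plot = Observation(entity="Plot")
tree = Observation(
    entity="Tree",
    measurements=[
        Measurement("Height", "Meter", 12.5, precision=Decimal("0.1")),
    ],
    context=[plot],
)

print(tree.entity, [m.characteristic for m in tree.measurements])
print("observed in context of:", [c.entity for c in tree.context])
```

Domain-specific terms would plug in where the plain strings are used here for entity, characteristic, and standard, which is one way to read the slide's "extension points".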

Semantic annotation: attribute mappings

Open Science
"Scientists should communicate the data they collect and the models they create, to allow free and open access, and in ways that are intelligible, assessable and usable for other specialists in the same or linked fields wherever they are in the world. Where data justify it, scientists should make them available in an appropriate data repository."
Science as an open enterprise, The Royal Society Science Policy Centre report 02/12

Why Open Science now?
- Technology is available to do it (Internet + Web + Semantics + FLOPS)
- Growing politicization of science: need for transparency
- Importance of large-scale/interdisciplinary science
- Efficiencies in re-using or sharing available data and code
- A return to the fundamental premise of science: objective, repeatable, transparent, general

Open Science:
- Open Data: repositories (e.g. NASA DAACs, NSF's DataONE)
- Open Source: code and algorithms (e.g. Python, R)
- Open Access: journals (e.g. PLoS)
- Open Notebook: blog++ (e.g. IPython)

Open Data
Rapid, highly affordable access to ALL the data supporting scientific findings

Open Source
- Easy, fast, low-to-no (cost) barriers to languages, code, libraries/packages, algorithms, and frameworks for accomplishing analyses
- Multi-platform, scalable

Open Access (OA)
Rapid, highly affordable access to the latest scientific findings
Issues:
- Peer-review process
- Copyright (IP issues)
- Costs

The (sad) status quo
Researchers are still struggling to:
- Discover relevant datasets
- Access and integrate these
It's getting more difficult as the volume, diversity, and complexity of data increase, and Data Quality is always a concern!

Steps towards global-scale, Open Biodiversity Science?
- Aggregators (coordinated, international): service providers who assemble and harmonize distributed data for the scientific community
- Better services and interfaces:
  - requires standardization of metadata and semantics
  - overcome limitations of desktop tools/frameworks
  - all FOSS
- Cross-scale, cross data-type integration: genomic, organismal, observational/ecological, sensors, remote-sensing
- Must train researchers in the use of these new tools, data types, and frameworks!
