Kerstin Lehnert, Lamont-Doherty Earth Observatory of Columbia University, Palisades, NY 10964. Making Small Data BIG: Success and Challenges in the Earth Sciences

Making Small Data BIG (UT Austin, March 2016)


Page 1: Making Small Data BIG (UT Austin, March 2016)

Kerstin Lehnert
Lamont-Doherty Earth Observatory of Columbia University
Palisades, NY 10964

Making Small Data BIG: Success and Challenges in the Earth Sciences

Page 2: Making Small Data BIG (UT Austin, March 2016)

“Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality”, February 27, 2012, by R "Ray" Wang
http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/

What Makes Data BIG?

The sixth ‘V’: Value

3/22/2016

Making Small Data BIG: Success and Challenges in the Earth Sciences

Page 3: Making Small Data BIG (UT Austin, March 2016)

‘Small’ Earth Science Research Data

• heterogeneous
• customized & optimized for research questions
• lack of data standards
• culture of data ‘hoarding’
• lack of data infrastructure (facilities)


Page 4: Making Small Data BIG (UT Austin, March 2016)


The Value of Small Research Data


“While the data volumes are small when viewed individually, in total they represent a very significant portion of the country’s scientific output.”

“The long tail is a breeding ground for new ideas and never before attempted science.”

(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)

Page 5: Making Small Data BIG (UT Austin, March 2016)

Small data: Pieces of a puzzle …


Page 6: Making Small Data BIG (UT Austin, March 2016)


… that form a picture

Page 7: Making Small Data BIG (UT Austin, March 2016)

Big Pictures from Small Data


The PetDB Synthesis

The map shows data from >300 publications. Symbols are locations of rock samples. Color is scaled to the 87Sr/86Sr isotope ratio in the rocks.

Page 8: Making Small Data BIG (UT Austin, March 2016)

Big Pictures from Small Data


Page 9: Making Small Data BIG (UT Austin, March 2016)

Small Data – Big Effort or What it takes to generate a few kilobytes of data …


“Understanding where the dust that's in the atmosphere and oceans comes from can help scientists estimate its impact on earth's climate system.”

Bess Koffman, Michael Kaplan, Steven Goldstein, Gisela Winckler (LDEO), Natalie Mahowald (Cornell)

http://blogs.ei.columbia.edu/2014/03/13/did-new-zealand-dust-influence-the-last-ice-age/

Example #1: Did New Zealand Dust Influence the Last Ice Age?

Page 10: Making Small Data BIG (UT Austin, March 2016)

Find funding and go into the field …


Page 11: Making Small Data BIG (UT Austin, March 2016)

Prepare samples in the repository/lab


Page 12: Making Small Data BIG (UT Austin, March 2016)

Analyze Samples in the Lab


Page 13: Making Small Data BIG (UT Austin, March 2016)

The few kilobytes of data


Note how few data points this study generated (the yellow dots) relative to the effort involved, which ranged from collecting samples in New Zealand to operating expensive equipment in the lab.

Page 14: Making Small Data BIG (UT Austin, March 2016)


Small Data – Big Effort (example #2)


Example #2: Do convergent margin volcanoes really represent continental crust?

“As it is crucial to understand the extent and origin of the compositional difference between central Aleutian lavas and plutons through time and space, this project will map and sample plutonic rocks exposed on the central Aleutians and their coeval volcanic host rocks.”

“Results and the samples acquired in this study will help to answer fundamental questions of continental crust formation, and shed light on the formation mechanisms of plutons and volcanics in arcs.”

http://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=135851&org=NSF

Page 15: Making Small Data BIG (UT Austin, March 2016)


Small Data - Big Investment or What it takes to generate a few kilobytes of data


Anticipated Data:
• ~ 250 samples
• ~ 200 major element analyses
• ~ 150 trace element analyses
• 50 U/Pb zircon geochronology
• 30 Ar-Ar ages
• 80 Sr, Nd, Hf and Pb isotope analyses

• 4 scientists (3 institutions)
• 5 weeks on remote islands
• a boat (with crew)
• a helicopter

Page 16: Making Small Data BIG (UT Austin, March 2016)

Outcomes so far: ca. 500 samples


Page 17: Making Small Data BIG (UT Austin, March 2016)

Even bigger investments for small data …


Page 18: Making Small Data BIG (UT Austin, March 2016)

Sharing Small Data


Page 19: Making Small Data BIG (UT Austin, March 2016)

Small Data have small value as long as …
• They are widely dispersed in the literature (past & present).
• They are not openly accessible.
• They lack sufficient and standardized metadata.
• They are never published (“dark data”).


Page 20: Making Small Data BIG (UT Austin, March 2016)

Growing the Value of Small Data


[Diagram labels: small data; findable; identification, persistence; accessible; protection, protocols; context, provenance; re-usable; harmonized, machine-readable; interoperable; BIG DATA; Data Curation Standards; Generic Repositories; Domain-specific Data Standards; Community Data Collections; Value]

Page 21: Making Small Data BIG (UT Austin, March 2016)

Growing the Value of Small Data


[Diagram labels: small data; findable; identification, persistence; accessible; protection, protocols; context, provenance; re-usable; harmonized, machine-readable; interoperable; BIG DATA; Data Curation Standards; Domain-specific Data Standards; Domain Repositories; Value]

Page 22: Making Small Data BIG (UT Austin, March 2016)

Domain-specific Data Facilities


[Diagram labels: Science Community; Domain-specific Data Facility; Libraries, Archives; CI, Computer Science; Publishers, editors; Funding Agencies; Data Facilities; Registries; Data curation services; CI development; Disciplinary Expertise; Data Curation IT/CS Expertise]

Discipline-specific data services
• Context & provenance metadata
• Semantics
• Workflows

Page 23: Making Small Data BIG (UT Austin, March 2016)


IEDA: Interdisciplinary Earth Data Alliance


Data Services for the Solid Earth Sciences

www.iedadata.org

Page 24: Making Small Data BIG (UT Austin, March 2016)


IEDA: A Multi-Disciplinary Data Facility

www.iedadata.org

• Solid Earth Observational Data
  • High-T Geochemistry
  • Low-T Geochemistry
  • Petrology
  • Marine Geophysics & Geology
  • Geochronology

• Cross-disciplinary tools & services
  • Sample registry SESAR
  • IEDA Data Browser
  • Portals (GeoPRISMs, USAP-DCC, etc.)
  • GeoMapApp
  • Interoperability


Page 25: Making Small Data BIG (UT Austin, March 2016)


Small Data Gone BIG

IEDA Repositories
• >720,000 files
• 59 TB
• 4 × 10^6 samples

IEDA Syntheses
• 19 × 10^6 analytical values in EarthChem
• 2.79 × 10^6 miles of data from 875 cruises in the Global Multi-Resolution Topography (GMRT)


Page 26: Making Small Data BIG (UT Austin, March 2016)

IEDA: Impact on Science


Page 27: Making Small Data BIG (UT Austin, March 2016)


EarthChem Data Systems

[Diagram labels: Investigators; EarthChem Data Managers; External Systems; EarthChem Library; PetDB, SedDB; EarthChem Portal; Data Publication & Preservation; Data Mining & Analysis; Metadata Catalog; Data & Metadata]

FINDABLE & ACCESSIBLE
• DOI registration
• Long-term archiving
• CC license
• Guidelines for data reporting (community endorsed)
• QC by data managers

RE-USABLE & INTEROPERABLE
• Data & metadata harmonization
• Standards-compliant data model
• Service Oriented Architecture (ECP)
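To illustrate what "data & metadata harmonization" can mean in practice, here is a minimal Python sketch. It is not EarthChem's actual workflow: the field names, the alias table, and the unit conversion table are all hypothetical, chosen only to show two small datasets with different column names and units being mapped onto one shared schema.

```python
# Hypothetical harmonization sketch; field names, aliases, and units are
# illustrative assumptions, not EarthChem's real data model.

# Conversion factors from each reporting unit to parts per million (ppm).
PPM_PER_UNIT = {"ppm": 1.0, "wt%": 10_000.0, "ppb": 0.001}

# Map each source's own column names onto one shared vocabulary.
FIELD_ALIASES = {
    "sample": "sample_id",
    "Sample ID": "sample_id",
    "Sr": "sr",
    "Sr_conc": "sr",
}


def harmonize(record, unit):
    """Return a copy of the record with shared field names and Sr in ppm."""
    out = {}
    for key, value in record.items():
        out[FIELD_ALIASES.get(key, key)] = value
    out["sr"] = float(out["sr"]) * PPM_PER_UNIT[unit]
    out["sr_unit"] = "ppm"
    return out


dataset_a = [{"sample": "A-1", "Sr": "120"}]          # reported in ppm, as text
dataset_b = [{"Sample ID": "B-7", "Sr_conc": 0.015}]  # reported in wt%

merged = [harmonize(r, "ppm") for r in dataset_a]
merged += [harmonize(r, "wt%") for r in dataset_b]
```

After this step both records share the same keys and unit, so they can be searched and plotted together, which is the sense in which small datasets become part of a larger synthesis.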


Page 28: Making Small Data BIG (UT Austin, March 2016)


DOI to allow proper citation

Link to publications

Link to funding source


ECL: Discovery & Access


Page 29: Making Small Data BIG (UT Austin, March 2016)

Data Synthesis: PetDB


Global compilation of geochemical data for igneous rocks from the ocean floor & mantle xenoliths

• > 2,200 data sets/publications
• > 84,000 samples
• > 3.2 million observed values

http://www.earthchem.org/petdb

Page 30: Making Small Data BIG (UT Austin, March 2016)

Data Synthesis: EarthChem Portal


Data from
• >13,000 publications
• >850,000 samples

Total: >19.6 million analytical values

Partner Databases:
• PetDB
• SedDB
• GEOROC
• USGS
• MetPetDB
• GANSEKI

Page 31: Making Small Data BIG (UT Austin, March 2016)


PetDB Data Mining: Search & Filter


Filter by method or concentration
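Filtering by method or concentration can be sketched in a few lines of Python. This is only an illustration of the idea, not the PetDB interface: the `Analysis` record, its fields, and the `filter_analyses` helper are hypothetical names, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class Analysis:
    """One analytical value for one sample (hypothetical record layout)."""
    sample_id: str
    element: str
    method: str
    value_ppm: float


def filter_analyses(analyses, method=None, min_ppm=None, max_ppm=None):
    """Keep analyses that match the method and concentration bounds, if given."""
    kept = []
    for a in analyses:
        if method is not None and a.method != method:
            continue
        if min_ppm is not None and a.value_ppm < min_ppm:
            continue
        if max_ppm is not None and a.value_ppm > max_ppm:
            continue
        kept.append(a)
    return kept


# Made-up example values, for illustration only.
records = [
    Analysis("S1", "Sr", "ICP-MS", 95.0),
    Analysis("S2", "Sr", "XRF", 130.0),
    Analysis("S3", "Sr", "ICP-MS", 210.0),
]

by_method = filter_analyses(records, method="ICP-MS")
by_range = filter_analyses(records, min_ppm=100.0, max_ppm=250.0)
```

Such filters only work across thousands of publications because the analytical method and the units were recorded consistently for every value during curation.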

Page 32: Making Small Data BIG (UT Austin, March 2016)


Page 33: Making Small Data BIG (UT Austin, March 2016)

Big Value!
• 500 - 800 downloads per quarter
• >600 citations in the literature
• many fundamental new discoveries & insights
  • Disciplinary
  • Multi-disciplinary
  • Unanticipated purposes
• new scientific approaches
  • Statistical rather than hypothetical


Page 34: Making Small Data BIG (UT Austin, March 2016)

Geosamples Status: Access
• Many samples and collections are not ‘online’.
• Repositories lack resources & expertise to develop & maintain digital collection catalogs.
• Samples often only described in publications.
• Existing online catalogs are not connected or federated.
• No easy way to search for samples.


February 25, 2016

DFG Rundgespräch Geochemical Databases

Page 35: Making Small Data BIG (UT Austin, March 2016)

Growing the Value of Samples
• Linking physical samples to the digital data generated by their study.
• Reproducibility! Access to the physical samples is required to verify & reproduce observations.
• Re-usability! Access to information about samples is required for proper evaluation & interpretation of sample-based data.
• Broad sharing of physical samples for use & re-use.
  • Samples are often expensive to collect (drilling, remote locations).
  • Many samples are unique and irreplaceable.
  • Re-analysis augments utility of existing data.
  • Samples often serve in ways that the collectors and repositories could not have imagined.


Page 36: Making Small Data BIG (UT Austin, March 2016)

Big Value for Samples: The IGSN
• Discovery & Access for Re-use and Reproducibility
• Sample Citation
• Data Integration
• Sample Management


IGSN = International Geo Sample Number

Page 37: Making Small Data BIG (UT Austin, March 2016)


IGSN Adoption: Sample Repositories


Page 38: Making Small Data BIG (UT Austin, March 2016)


IGSN Adoption: Publishers

“… AGU Publications also strongly encourages use of other identifiers in our journal papers. International Geo Sample Numbers (IGSNs) uniquely identify items, such as a rock sample, a piece of coral, or a vial of water taken from the natural environment, and provide important, consistent information about these samples.”

Hanson, B. (2016), AGU opens its journals to author identifiers, Eos, 97, doi:10.1029/2016EO043183.

Published on 7 January 2016.

Page 39: Making Small Data BIG (UT Austin, March 2016)


IGSN: Linking Samples, Data, & Publications


Page 40: Making Small Data BIG (UT Austin, March 2016)

The Challenges


• Technical
• Organizational
• Social/cultural

Page 41: Making Small Data BIG (UT Austin, March 2016)


The BIG Challenge: Scalability
• Limitation of resources versus diversity of data
  • Need best practices for all small data communities
  • Need flexibility and performance of database schemas & search applications
  • Need tools for investigators to improve quality of submitted data
  • Need tools for data managers to support (semi-automate?) QC workflow
  • Repository standards/certification
  • Inclusion of legacy data (data rescue)

How can we grow small data across the Geosciences?


Page 42: Making Small Data BIG (UT Austin, March 2016)

Partnerships, Alliances, Collaborations


Page 43: Making Small Data BIG (UT Austin, March 2016)

Partnership with Publishers
Coalition for Publishing Data in the Earth & Space Sciences

• Joint initiative of Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice.
• Alignment of data policies across different publishers
• Advancing integration of publication and data submission workflows
• Support for authors and editors to comply with publishers’ data policies
  • e.g., online community directory of appropriate Earth science community repositories that meet leading standards on curation, quality, and access

Increases development and enforcement of data best practices
Reduces effort of metadata QC
Increases flow of small data into repositories

www.copdess.org

Page 44: Making Small Data BIG (UT Austin, March 2016)

IT Collaborations
• Cross-disciplinary development of community data model ODM2 (Observation Data Model)
• Collaboration with commercial software engineering


Page 45: Making Small Data BIG (UT Austin, March 2016)


EarthCube
• Advances coordination, collaboration, and integration
  • Community governance
  • Integrative Activities
• Fosters new data communities
  • Research Coordination Networks
• Develops and adapts new technologies to structure, transform, integrate, document, harmonize data & metadata
  • Building Blocks


Page 46: Making Small Data BIG (UT Austin, March 2016)

Alliance of (Small) Data Providers

The Alliance Testbed Project
“Interdisciplinary Earth Data Alliance as a Model for Integrating EarthCube Technology Resources and Engaging the Broad Community”

• Design & develop the organizational and technical architecture of a data facility that operates as an alliance of scientifically related data communities

• Sharing data services and infrastructure that support trusted data curation and interdisciplinary science.


Page 47: Making Small Data BIG (UT Austin, March 2016)

IEDA’s Service Structure


Page 48: Making Small Data BIG (UT Austin, March 2016)

Alliance Testbed Project: Approach
• Build on and transition existing infrastructure of an established data facility (IEDA) to provide shared data services for all Alliance partners
  • Data Submission Hub
  • Trusted repository services (DOI registration, long-term preservation)
• Deploy newly developed EC technologies to align and integrate with EC architecture
  • CINERGI: pipeline for harvesting, improving, unifying, and re-publishing metadata records assembled by Alliance partners
  • GeoWS: mechanism for Alliance partners to exchange data with data discovery, search, and visualization tools across the Alliance
  • GeoLink: Vocabulary services to support the Data Submission Hub


Page 49: Making Small Data BIG (UT Austin, March 2016)


ATP Participants
• Data Facility: IEDA
  • Including existing IEDA Partners: MGDS, EarthChem, SESAR, Geochron, ASP@UTIG, LEPR
• Community Data Collection: MetPetDB
• New data communities: Mineral Physics, Deep Seafloor Processes
• New data provider: IcePod
• EarthCube Building Blocks: CINERGI, GeoLink, GeoWS
• Stakeholder Alignment: WayMark Systems


Page 50: Making Small Data BIG (UT Austin, March 2016)


Conclusions
• Small data grow BIG when properly curated, documented, harmonized, and integrated.
• Domain-specific data facilities are essential to ensure quality of data for trusted re-use & community engagement.
• Current approaches are not sufficiently scalable.
• Partnerships and collaborations help address the challenges.
  • Integration with publications will augment the flow of data into repositories and data products.
  • Partnerships among long-tail data communities allow sharing of data publication & preservation infrastructure while supporting domain-specific data curation.
  • Community-wide initiatives such as EarthCube help solve the entire range of social, technical, and organizational challenges.
