Kerstin Lehnert, Lamont-Doherty Earth Observatory of Columbia University, Palisades, NY, 10964
Making Small Data BIG: Success and Challenges in the Earth Sciences
What Makes Data BIG?
Source: “Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality”, February 27, 2012, by R "Ray" Wang, http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/
The sixth ‘V’: Value
3/22/2016
Making Small Data BIG: Success and Challenges in the Earth Sciences
‘Small’ Earth Science Research Data
• heterogeneous
• customized & optimized for research questions
• lack of data standards
• culture of data ‘hoarding’
• lack of data infrastructure (facilities)
The Value of Small Research Data
“While the data volumes are small when viewed individually, in total they represent a very significant portion of the country’s scientific output.”
“The long tail is a breeding ground for new ideas and never before attempted science.”
(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)
Small data: Pieces of a puzzle …
… that form a picture
Big Pictures from Small Data
The PetDB Synthesis
Map shows data from >300 publications. Symbols are locations of rock samples. Color is scaled to the 87Sr/86Sr isotope ratio in the rocks.
Big Pictures from Small Data
Small Data – Big Effort, or: What it takes to generate a few kilobytes of data …
Example #1: Did New Zealand Dust Influence the Last Ice Age?
“Understanding where the dust that's in the atmosphere and oceans comes from can help scientists estimate its impact on earth's climate system.”
Bess Koffman, Michael Kaplan, Steven Goldstein, Gisela Winckler (LDEO), Natalie Mahowald (Cornell)
http://blogs.ei.columbia.edu/2014/03/13/did-new-zealand-dust-influence-the-last-ice-age/
Find funding and go into the field …
Prepare samples in the repository/lab
Analyze Samples in the Lab
The few kilobytes of data
Note how few data points this study generated (the yellow dots) relative to the effort, which ranged from collecting samples in New Zealand to operating expensive equipment in the lab.
Small Data – Big Effort (example #2)
Example #2: Do convergent margin volcanoes really represent continental crust?
“As it is crucial to understand the extent and origin of the compositional difference between central Aleutian lavas and plutons through time and space, this project will map and sample plutonic rocks exposed on the central Aleutians and their coeval volcanic host rocks.”
“Results and the samples acquired in this study will help to answer fundamental questions of continental crust formation, and shed light on the formation mechanisms of plutons and volcanics in arcs.”
http://www.nsf.gov/discoveries/disc_summ.jsp?cntn_id=135851&org=NSF
Small Data – Big Investment, or: What it takes to generate a few kilobytes of data
Anticipated Data:
• ~ 250 samples
• ~ 200 major element analyses
• ~ 150 trace element analyses
• 50 U/Pb zircon geochronology
• 30 Ar-Ar ages
• 80 Sr, Nd, Hf and Pb isotope analyses

• 4 scientists (3 institutions)
• 5 weeks on remote islands
• a boat (with crew)
• a helicopter
Outcomes so far: ca. 500 samples
Even bigger investments for small data …
Sharing Small Data
Small Data have small value as long as …
• They are widely dispersed in the literature (past & present).
• They are not openly accessible.
• They lack sufficient and standardized metadata.
• They are never published (“dark data”).
Growing the Value of Small Data
[Figure: the value of data grows on the way from "small data" to "BIG DATA"]
• findable: identification, persistence
• accessible: protection, protocols
• context, provenance
• re-usable: harmonized, machine-readable
• interoperable
Figure labels: Data Curation Standards, Generic Repositories, Domain-specific Data Standards, Community Data Collections
Growing the Value of Small Data
[Figure: the value of data grows on the way from "small data" to "BIG DATA"]
• findable: identification, persistence
• accessible: protection, protocols
• context, provenance
• re-usable: harmonized, machine-readable
• interoperable
Figure labels: Data Curation Standards, Domain-specific Data Standards, Domain Repositories, Domain-specific Data Facilities
Domain-specific Data Facility
[Diagram: a domain-specific data facility and the groups it connects]
Connected groups: Science Community; Libraries, Archives; CI, Computer Science; Publishers, editors; Funding Agencies; Data Facilities; Registries
Discipline-specific data services
• Context & provenance metadata
• Semantics
• Workflows
Data curation services, CI development
Expertise: disciplinary, data curation, IT/CS
IEDA: Interdisciplinary Earth Data Alliance
Data Services for the Solid Earth Sciences
www.iedadata.org
IEDA: A Multi-Disciplinary Data Facility
www.iedadata.org
• Solid Earth Observational Data
  • High-T Geochemistry
  • Low-T Geochemistry
  • Petrology
  • Marine Geophysics & Geology
  • Geochronology
• Cross-disciplinary tools & services
  • Sample registry SESAR
  • IEDA Data Browser
  • Portals (GeoPRISMs, USAP-DCC, etc.)
  • GeoMapApp
  • Interoperability
Small Data Gone BIG
IEDA Repositories
• >720,000 files
• 59 TB
• 4 × 10^6 samples
IEDA Syntheses
• 19 × 10^6 analytical values in EarthChem
• 2.79 × 10^6 miles of data from 875 cruises in the Global Multi-Resolution Topography (GMRT)
IEDA: Impact on Science
EarthChem Data Systems
[Diagram: EarthChem data systems]
• EarthChem Library: Data Publication & Preservation
• PetDB, SedDB, EarthChem Portal: Data Mining & Analysis
• Participants: Investigators, EarthChem Data Managers, External Systems
• Flows shown: Data, Data & Metadata, Metadata Catalog
FINDABLE & ACCESSIBLE
• DOI registration
• Long-term archiving
• CC license
• Guidelines for data reporting (community endorsed)
• QC by data managers
RE-USABLE & INTEROPERABLE
• Data & metadata harmonization
• Standards-compliant data model
• Service Oriented Architecture (ECP)
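As a rough illustration of what "data & metadata harmonization" can involve, the sketch below maps two small, differently formatted data tables onto one common layout. The column names, units, sample names, and values are all hypothetical, and the layout is an assumption made for this example, not the EarthChem data model.

```python
# Hypothetical records as two investigators might report them.
# All sample names, column names, and values are invented for illustration.
dataset_a = [{"Sample": "A-1", "SiO2 (wt%)": 50.2, "Sr_ppm": 120.0}]
dataset_b = [{"sample_id": "B-7", "sio2_wt_pct": 48.9, "Sr (ug/g)": 95.0}]

# Map each source column to (common variable name, common unit, conversion factor).
COLUMN_MAP = {
    "SiO2 (wt%)": ("SiO2", "wt%", 1.0),
    "sio2_wt_pct": ("SiO2", "wt%", 1.0),
    "Sr_ppm": ("Sr", "ppm", 1.0),
    "Sr (ug/g)": ("Sr", "ppm", 1.0),  # ug/g is numerically equal to ppm by mass
}
ID_COLUMNS = ("Sample", "sample_id")


def harmonize(records, source):
    """Convert wide per-sample rows into long rows with a common schema."""
    rows = []
    for record in records:
        # Find whichever identifier column this dataset uses.
        sample = next(record[c] for c in ID_COLUMNS if c in record)
        for column, value in record.items():
            if column in COLUMN_MAP:
                variable, unit, factor = COLUMN_MAP[column]
                rows.append({
                    "sample": sample,
                    "variable": variable,
                    "value": value * factor,
                    "unit": unit,
                    "source": source,  # keep provenance with every value
                })
    return rows


harmonized = harmonize(dataset_a, "publication A") + harmonize(dataset_b, "publication B")
for row in harmonized:
    print(row)
```

Once every value carries the same variable name, unit, and source, tables from many publications can be concatenated and queried together, which is the step that turns many small tables into one large synthesis.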
ECL: Discovery & Access
• DOI to allow proper citation
• Link to publications
• Link to funding source
Data Synthesis: PetDB
Global compilation of geochemical data for igneous rocks from the ocean floor & mantle xenoliths
• > 2,200 data sets/publications
• > 84,000 samples
• > 3.2 million observed values
http://www.earthchem.org/petdb
Data Synthesis: EarthChem Portal
Data from
• >13,000 publications
• >850,000 samples
Total: >19.6 million analytical values
Partner Databases:
• PetDB
• SedDB
• GEOROC
• USGS
• MetPetDB
• GANSEKI
PetDB Data Mining: Search & Filter
Filter by method or concentration
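Filtering by method or concentration can be sketched on a harmonized table as follows. This is a minimal illustration under assumed field names (`variable`, `value`, `method`); the records are invented, and the function is not the PetDB search interface.

```python
# Hypothetical harmonized records; sample names, methods, and values are invented.
records = [
    {"sample": "S1", "variable": "Sr", "value": 120.0, "unit": "ppm", "method": "ICP-MS"},
    {"sample": "S2", "variable": "Sr", "value": 95.0, "unit": "ppm", "method": "XRF"},
    {"sample": "S3", "variable": "Sr", "value": 310.0, "unit": "ppm", "method": "ICP-MS"},
    {"sample": "S1", "variable": "SiO2", "value": 50.2, "unit": "wt%", "method": "XRF"},
]


def filter_records(rows, variable, method=None, min_value=None, max_value=None):
    """Keep rows for one variable, optionally restricted by method and value range."""
    selected = []
    for row in rows:
        if row["variable"] != variable:
            continue
        if method is not None and row["method"] != method:
            continue
        if min_value is not None and row["value"] < min_value:
            continue
        if max_value is not None and row["value"] > max_value:
            continue
        selected.append(row)
    return selected


# Filter by analytical method, then by concentration range.
by_method = filter_records(records, "Sr", method="ICP-MS")
by_range = filter_records(records, "Sr", min_value=90.0, max_value=200.0)
print([r["sample"] for r in by_method])
print([r["sample"] for r in by_range])
```

Such filters only work across publications when method names and units have been harmonized first, which is why the curation step precedes data mining.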
Big Value!
• 500 - 800 downloads per quarter
• >600 citations in the literature
• many fundamental new discoveries & insights
  • Disciplinary
  • Multi-disciplinary
  • Unanticipated purposes
• new scientific approaches
  • Statistical rather than hypothetical
Geosamples Status: Access
• Many samples and collections are not ‘online’.
• Repositories lack resources & expertise to develop & maintain digital collection catalogs.
• Samples often only described in publications.
• Existing online catalogs are not connected or federated.
• No easy way to search for samples.
February 25, 2016
DFG Rundgespräch Geochemical Databases
Growing the Value of Samples
• Linking physical samples to the digital data generated by their study.
• Reproducibility! Access to the physical samples is required to verify & reproduce observations.
• Re-usability! Access to information about samples is required for proper evaluation & interpretation of sample-based data.
• Broad sharing of physical samples for use & re-use.
• Samples are often expensive to collect (drilling, remote locations).
• Many samples are unique and irreplaceable.
• Re-analysis augments utility of existing data.
• Samples often serve in ways that the collectors and repositories could not have imagined.
Big Value for Samples: The IGSN
• Discovery & Access for Re-use and Reproducibility
• Sample Citation
• Data Integration
• Sample Management
IGSN = International Geo Sample Number
IGSN Adoption: Sample Repositories
IGSN Adoption: Publishers
“… AGU Publications also strongly encourages use of other identifiers in our journal papers. International Geo Sample Numbers (IGSNs) uniquely identify items, such as a rock sample, a piece of coral, or a vial of water taken from the natural environment, and provide important, consistent information about these samples.”
Hanson, B. (2016), AGU opens its journals to author identifiers, Eos, 97, doi:10.1029/2016EO043183. Published on 7 January 2016.
IGSN: Linking Samples, Data, & Publications
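How a shared sample identifier links samples, data, and publications can be sketched as a simple index keyed by that identifier. All identifiers below are placeholders invented for this example (they are not real IGSNs or DOIs), and the record layout is an assumption, not the IGSN metadata schema.

```python
from collections import defaultdict

# Placeholder identifiers; none of these refer to real samples, datasets, or papers.
analyses = [
    {"igsn": "EXAMPLE0001", "variable": "87Sr/86Sr", "dataset_doi": "10.0000/example-dataset-1"},
    {"igsn": "EXAMPLE0001", "variable": "Sr", "dataset_doi": "10.0000/example-dataset-2"},
    {"igsn": "EXAMPLE0002", "variable": "Sr", "dataset_doi": "10.0000/example-dataset-2"},
]
publications = [
    {"doi": "10.0000/example-paper-1", "samples": ["EXAMPLE0001", "EXAMPLE0002"]},
    {"doi": "10.0000/example-paper-2", "samples": ["EXAMPLE0001"]},
]


def link_by_sample(analyses, publications):
    """Index datasets and publications by the sample identifier they share."""
    links = defaultdict(lambda: {"datasets": set(), "publications": set()})
    for analysis in analyses:
        links[analysis["igsn"]]["datasets"].add(analysis["dataset_doi"])
    for publication in publications:
        for igsn in publication["samples"]:
            links[igsn]["publications"].add(publication["doi"])
    return dict(links)


links = link_by_sample(analyses, publications)
for igsn in sorted(links):
    print(igsn, sorted(links[igsn]["datasets"]), sorted(links[igsn]["publications"]))
```

The join works only because every record cites the same unambiguous identifier for a sample; with free-text sample names, the same lookup would require manual matching.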
The Challenges
• Technical
• Organizational
• Social/cultural
The BIG Challenge: Scalability
• Limitation of resources versus diversity of data
• Need best practices for all small data communities
• Need flexibility and performance of database schemas & search applications
• Need tools for investigators to improve quality of submitted data
• Need tools for data managers to support (semi-automate?) QC workflow
• Repository standards/certification
• Inclusion of legacy data (data rescue)
How can we grow small data across the Geosciences?
Partnerships, Alliances, Collaborations
Partnership with Publishers: Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)
• Joint initiative of Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice.
• Alignment of data policies across different publishers
• Advancing integration of publication and data submission workflows
• Support for authors and editors to comply with publishers’ data policies
  • e.g., online community directory of appropriate Earth science community repositories that meet leading standards on curation, quality, and access
Increases development and enforcement of data best practices
Reduces effort of metadata QC
Increases flow of small data into repositories
www.copdess.org
IT Collaborations
• Cross-disciplinary development of community data model ODM2 (Observation Data Model)
• Collaboration with commercial software engineering
EarthCube
• Advances coordination, collaboration, and integration
  • Community governance
  • Integrative Activities
• Fosters new data communities
  • Research Coordination Networks
• Develops and adapts new technologies to structure, transform, integrate, document, harmonize data & metadata
  • Building Blocks
Alliance of (Small) Data Providers
The Alliance Testbed Project
“Interdisciplinary Earth Data Alliance as a Model for Integrating EarthCube Technology Resources and Engaging the Broad Community”
• Design & develop the organizational and technical architecture of a data facility that operates as an alliance of scientifically related data communities
• Sharing data services and infrastructure that support trusted data curation and interdisciplinary science.
IEDA’s Service Structure
Alliance Testbed Project: Approach
• Build on and transition existing infrastructure of an established data facility (IEDA) to provide shared data services for all Alliance partners
  • Data Submission Hub
  • Trusted repository services (DOI registration, long-term preservation)
• Deploy newly developed EC technologies to align and integrate with EC architecture
  • CINERGI: pipeline for harvesting, improving, unifying, and re-publishing metadata records assembled by Alliance partners
  • GeoWS: mechanism for Alliance partners to exchange data with data discovery, search, and visualization tools across the Alliance
  • GeoLink: Vocabulary services to support the Data Submission Hub
ATP Participants
• Data Facility: IEDA
  • Including existing IEDA Partners: MGDS, EarthChem, SESAR, Geochron, ASP@UTIG, LEPR
• Community Data Collection: MetPetDB
• New data communities: Mineral Physics, Deep Seafloor Processes
• New data provider: IcePod
• EarthCube Building Blocks: CINERGI, GeoLink, GeoWS
• Stakeholder Alignment: WayMark Systems
Conclusions
• Small data grows BIG when properly curated, documented, harmonized, and integrated.
• Domain-specific data facilities are essential to ensure quality of data for trusted re-use & community engagement.
• Current approaches are not sufficiently scalable.
• Partnerships and collaborations help address the challenges.
  • Integration with publications will augment the flow of data into repositories and data products.
  • Partnerships among long-tail data communities allow sharing of data publication & preservation infrastructure while supporting domain-specific data curation.
  • Community-wide initiatives such as EarthCube help solve the entire range of social, technical, and organizational challenges.