

Publishing and Visualizing Large-Scale Semantically-enabled Earth Science Resources on the Web

Benno Lee¹ (leeb5@rpi.edu), Sumit Purohit² (sumit.purohit@pnnl.gov), William Smith² ([email protected]), Jesse Weaver² ([email protected]), Alan Chappell² ([email protected]), Patrick West¹ ([email protected]), Peter Fox¹ ([email protected])

(¹ Rensselaer Polytechnic Institute, Troy, NY 12180, United States) (² Pacific Northwest National Laboratory, Richland, WA, United States)

Poster: IN33C-3785

Glossary:
• RDESC – Resource Discovery for Extreme Scale Collaboration
• RPI – Rensselaer Polytechnic Institute
• TWC – Tetherless World Constellation at Rensselaer Polytechnic Institute

Acknowledgments: Eric Rozell – RPI Master’s student, now with Microsoft

Sponsors:

Department of Energy

Abstract

The volume and variety of data generated in science is rapidly increasing. Geophysical science is no exception in that various independent projects produce disparate, heterogeneous datasets. While researchers typically make this data available to others, there is a need to make these valuable resources more discoverable and understandable to user communities in order to accelerate scientific research. The cost of making data discoverable and understandable depends on how the original data was curated, transformed, generated, and published. User interfaces and visualizations that support exploration and interaction with the data further enhance understanding of available content.

This presentation describes research and development conducted under the Resource Discovery for Extreme Scale Collaboration (RDESC) project. As part of RDESC we curate, clean, publish, and visualize scientific data following Linked Data principles. Towards enabling discovery and understandability, we curated data from multiple, interdisciplinary science domains and represented the metadata using standard Semantic Web and Web technologies. As a result of this transformation, we generated some 1.4 billion RDF triples that describe these previously existing data resources. These efforts led to our formulation of a number of suggested best practices for data publishers to reduce the cost and barriers to making data discoverable and understandable to research communities. Additionally, we developed a set of tools that provide scalable visualizations of this large-scale metadata to enhance the understandability for prospective users of the data resources.

Resource splash pages dynamically generated using the twsparql module

TWC S2S Faceted browser interface allowing search for collected resources

Overall architecture of RDESC: curating information with different curation systems, translating the information into a semantic representation, storing the semantic information in different triple stores, and visualizing the information in different ways.

• RDESC – Concepts and Properties added by the RDESC project
• FOAF – Friend of a Friend
• DC – Dublin Core terms
• Schema.org – common set of schemas for structured data and markup for the web
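As a rough illustration of how the vocabularies in this legend can combine to describe a single resource, the sketch below serializes a handful of triples as N-Triples using only the Python standard library. The `example.org` resource IRIs, the title, the author name, and the choice of properties are hypothetical and are not taken from the RDESC Knowledge Store; no terms from the RDESC ontology itself are shown, since the poster does not list them.

```python
# Namespace IRIs for the standard vocabularies named in the legend.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
DCTERMS = "http://purl.org/dc/terms/"
SCHEMA = "http://schema.org/"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(text):
    """Quote a plain string literal, escaping the characters N-Triples reserves."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Hypothetical resources, used only for illustration.
dataset = "http://example.org/dataset/1"
author = "http://example.org/person/1"

triples = [
    (iri(dataset), iri(RDF + "type"), iri(SCHEMA + "Dataset")),
    (iri(dataset), iri(DCTERMS + "title"), literal("Example sea surface temperature dataset")),
    (iri(dataset), iri(DCTERMS + "creator"), iri(author)),
    (iri(author), iri(RDF + "type"), iri(FOAF + "Person")),
    (iri(author), iri(FOAF + "name"), literal("Example Author")),
]

ntriples = to_ntriples(triples)
print(ntriples)
```

Reusing widely adopted terms such as `dcterms:title` and `foaf:name` means generic Linked Data tools can interpret the description without project-specific knowledge, while project-specific concepts would sit alongside them in the project's own namespace.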

RDESC web site http://rdesc.org using simple, standard web technologies

Total number of triples currently used by rdesc.org: 230,743,316

Total number of triples available: 1.4 billion

Web Presence

RDESC Ontology resolvable at http://rdesc.org/2014/

Triple stores: Virtuoso, Stardog

Take Away:
• Multiple sources of data curated into a seamless Semantic Knowledge Store for searching, browsing, and visualization
• Information represented in a common semantic information model using RDFs
• Research into the use of various semantic technologies with billions of triples – storage, search, browse, visualization
• Best practices showing the importance of providing rich information, context, and experience with existing metadata

Future Work:
• Trying different content management systems with the large number of triples
• Distributed/Federated system

Semantically represented information is either flattened and pushed into Apache Solr (left) or retrieved directly from the RDESC Knowledge Store (right).
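One plausible reading of "flattened" is that all triples sharing a subject are collapsed into a single record with multi-valued fields, which is the shape a search index such as Solr expects. The sketch below shows that grouping step only; the field-naming rule (the local name of each predicate), the sample triples, and the `example.org` IRIs are assumptions for illustration, and the actual RDESC flattening and the upload to Solr are not shown.

```python
import re
from collections import defaultdict


def local_name(iri):
    """Return the part of an IRI after its last '#' or '/'."""
    return re.split(r"[#/]", iri)[-1]


def flatten(triples):
    """Group (subject, predicate, object) triples into one flat record per subject.

    Every field is a list, because a subject may carry several values
    for the same predicate.
    """
    docs = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        docs[subject][local_name(predicate)].append(obj)
    return [{"id": subject, **fields} for subject, fields in docs.items()]


# Hypothetical input triples, written as plain strings.
sample_triples = [
    ("http://example.org/dataset/1",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://schema.org/Dataset"),
    ("http://example.org/dataset/1",
     "http://purl.org/dc/terms/title",
     "Example sea surface temperature dataset"),
    ("http://example.org/dataset/1",
     "http://purl.org/dc/terms/subject",
     "ocean"),
    ("http://example.org/dataset/1",
     "http://purl.org/dc/terms/subject",
     "temperature"),
    ("http://example.org/person/1",
     "http://xmlns.com/foaf/0.1/name",
     "Example Author"),
]

flat_docs = flatten(sample_triples)
for doc in flat_docs:
    print(doc)
```

Flattening trades the graph structure for fast keyword and facet search; links between resources survive only as IRI strings in the fields, which is why the direct route to the knowledge store remains useful when relationships matter.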

From either Solr or the S2S faceted browser, resources are displayed within the content management system.
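The core idea behind a faceted browser can be sketched in a few lines: count how many resources carry each value of a field, then narrow the result set as the user selects values. This is a minimal illustration over in-memory records, not the S2S or Solr implementation; the record fields and values are hypothetical.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many times each value of `field` occurs across the records."""
    return Counter(value for doc in docs for value in doc.get(field, []))


def select(docs, **constraints):
    """Keep only the records that carry every requested field value."""
    return [
        doc
        for doc in docs
        if all(value in doc.get(field, []) for field, value in constraints.items())
    ]


# Hypothetical flattened records with multi-valued fields.
records = [
    {"id": "r1", "type": ["Dataset"], "subject": ["ocean", "temperature"]},
    {"id": "r2", "type": ["Dataset"], "subject": ["atmosphere"]},
    {"id": "r3", "type": ["Model"], "subject": ["ocean"]},
]

print(facet_counts(records, "type"))
ocean_datasets = select(records, type="Dataset", subject="ocean")
print([doc["id"] for doc in ocean_datasets])
```

Recomputing the counts on the narrowed set after each selection is what lets the interface show only the values that still lead to results.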

The difference between limited provided information (left) and semantically rich information (right).


Use Cases

• A postdoctoral researcher wants to find resources (papers, graphics, data) representing the defined Essential Climate Variables.
• A climate modeler is looking for other models, model components, and full models, along with their authors/developers, so that they can compare models and discuss proper use, approaches, and ideas with them.
• A climate modeler is looking for an existing coupled system model run and output matching or similar to certain initial/scenario conditions.
• A climate modeler needs to find a compute/storage resource capable of running their code and scenario, and needs to know how to run it.
• A climate modeler needs to find relevant observational data from a specific class of instrument in a form they can use to compare with their model runs.
• A developer (or climate analyst) needs to find a software tool (or service) to perform some manipulation or analysis function on one or more existing climate datasets.
• A policy analyst/decision support user is looking for data for a particular area, time period, and spatial resolution in a format they can use.
• A model developer wants to make their model run results and code available to others and be confident that they can be discovered without the author having to perform extensive annotation.
• A climate analyst wants access to datasets that have been run for a specific test case so that they can analyze them.

RDESC Information Model