Addressing the Challenges of Multi-Domain Data Integration with the SemantEco Framework. Evan W. Patton, Patrice Seyed, Deborah L. McGuinness. Presented at AGU Fall Meeting 2013

Addressing the Challenges of Multi-Domain Data Integration

with the SemantEco Framework

Evan W. Patton, Patrice Seyed, Deborah L. McGuinness

Presented at AGU Fall Meeting 2013

2

Overview

• Motivation & History

• The SemantEco Ontology

• The SemantEco Pipeline

• Domain Extensions

• Performance Analysis

• Lessons Learned

• Conclusions & Future Work

3

The Problem

• Real-Life Motivating Example:

– In 2009, in Bristol County, Rhode Island, children became ill with symptoms such as diarrhea. The cause was found to be water polluted with E. coli, and citizens were asked to boil water until the issue was resolved.

– Public concerns: “When did the contamination begin?”, “How did this happen?”, “How can we keep it from happening again?”

– We need environmental informatics systems that can automatically integrate and analyze water quality data.

4

The Problem

1. Raw data come from multiple sources and in different formats – difficult to integrate and query.

2. Semantics of the water quality data are not explicitly encoded in the data – machines can’t process the data automatically.

3. Large amount of data due to large spatial region, long time span, and large number of pollutants and regulated limits – analysis can be time consuming and complex.

5

SemantEco Ontology

escim:Measurement:
    SubClassOf: repr:Measurement
        unit:hasUnit exactly 1 unit:Unit
        escim:ofCharacteristic exactly 1 escim:Characteristic
        escim:hasValue exactly 1 xsd:decimal

water:WaterMeasurement:
    SubClassOf: escim:Measurement

6

SemantEco Ontology

ThresholdViolation:
    SubClassOf: escim:Measurement

ColiformRegulationViolation:
    SubClassOf: ThresholdViolation
    IntersectionOf:
        water:WaterMeasurement
        escim:ofCharacteristic escim:FecalColiform
        escim:hasValue some xsd:decimal[> 400]
        unit:hasUnit escim:MPN_per_mL
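The class definition above can be read as a conjunction of conditions on a single measurement. The following is a minimal procedural sketch of that reading, not the framework's implementation: the `Measurement` record, the `is_water` flag (standing in for membership in water:WaterMeasurement), and the function name are all hypothetical. An OWL reasoner would derive the same classification declaratively and under open-world semantics, which this closed-world check does not model.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Measurement:
    """Hypothetical record mirroring the properties used on the slide."""
    characteristic: str     # e.g. "escim:FecalColiform"
    value: Decimal          # escim:hasValue
    unit: str               # unit:hasUnit, e.g. "escim:MPN_per_mL"
    is_water: bool = True   # stands in for water:WaterMeasurement membership


# Threshold taken from the slide: xsd:decimal[> 400]
COLIFORM_LIMIT = Decimal("400")


def is_coliform_violation(m: Measurement) -> bool:
    """Return True when every condition of the intersection holds."""
    return (
        m.is_water
        and m.characteristic == "escim:FecalColiform"
        and m.unit == "escim:MPN_per_mL"
        and m.value > COLIFORM_LIMIT
    )
```

Note that the comparison is strict, so a reading of exactly 400 is not classified as a violation, matching the `> 400` facet in the axiom.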

7

Limitations

• How to show more than just water data?

• How to incorporate additional datasets with minimal modifications to queries?

• How to provide facets along different dimensions of the data?

8

Modular Approach

• SemantEco Framework employs modules to add functionality, data, domains

• Modules can be hot-deployed, making the system extensible/upgradable at runtime

• Application-level functionality and provenance are captured by a set of core classes that can be repurposed for different applications
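To illustrate what runtime-extensible modules can look like, here is a small sketch of a registry that accepts and drops modules while the application keeps serving requests. It is an assumption-laden illustration only: `ModuleRegistry`, `deploy`, `undeploy`, and `handle` are hypothetical names, a module is modelled as a plain function, and the slides do not describe the framework's actual deployment mechanism.

```python
from typing import Callable, Dict

# Hypothetical interface: a module takes a request dict and returns its
# contribution to the response.
Module = Callable[[dict], dict]


class ModuleRegistry:
    """Keeps the set of active modules; it can change between requests."""

    def __init__(self) -> None:
        self._modules: Dict[str, Module] = {}

    def deploy(self, name: str, module: Module) -> None:
        """Add or replace a module without restarting the application."""
        self._modules[name] = module

    def undeploy(self, name: str) -> None:
        """Remove a module; unknown names are ignored."""
        self._modules.pop(name, None)

    def handle(self, request: dict) -> dict:
        """Ask every currently deployed module for its contribution."""
        # Copy the items so a module deployed mid-request cannot break iteration.
        return {name: module(request) for name, module in list(self._modules.items())}
```

A usage sketch: deploy a water module, serve a request, then deploy an air module and serve the next request with both active, with no restart in between.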

9

10

SemantEco Data Pipeline

User Request

Ontology

Data Load

Forward Inference

Query Answering

Module Processing

12

SemantEco Query Pipeline

SELECT ?measure ?characteristic ?value ?unit ?time
WHERE {
  <#site1> a pol:PollutedSite ;    # Inferred by regulation ontology
    escim:hasMeasurement ?measure .
  ?measure a escim:Measurement ;
    escim:ofCharacteristic ?characteristic ;
    escim:hasValue ?value ;
    unit:hasUnit ?unit ;
    time:inXSDDateTime ?time .
}

13

SemantEco Query Pipeline

SELECT ?measure ?characteristic ?value ?unit ?time
WHERE {
  <#site1> a escim:MeasurementSite ;
    escim:hasMeasurement ?measure .
  ?measure a escim:Measurement ;
    escim:ofCharacteristic ?characteristic ;
    escim:ofCharacteristic escim:FecalColiform ;    # Added by Characteristics Module
    escim:hasValue ?value ;
    unit:hasUnit ?unit ;
    time:inXSDDateTime ?time .
}

14

SemantEco Query Pipeline

SELECT ?measure ?characteristic ?value ?unit ?time
WHERE {
  <#site1> a escim:MeasurementSite ;
    escim:hasMeasurement ?measure .
  ?measure a escim:Measurement ;
    escim:ofCharacteristic ?characteristic ;
    escim:ofCharacteristic escim:FecalColiform ;
    escim:hasValue ?value ;
    unit:hasUnit ?unit ;
    time:inXSDDateTime ?time .
  FILTER( ?time < xsd:date("2009-09-08") )    # Added by Time Module
}
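The three query slides show one base query being refined step by step, with each module contributing a constraint. A rough sketch of that idea follows, assuming a hypothetical `QueryBuilder` class and two hypothetical module functions; the real framework's query API is not shown in the slides. Prefix declarations are omitted, and values are interpolated without escaping, so this is illustrative only.

```python
from typing import List


class QueryBuilder:
    """Hypothetical holder for the base measurement query from the slides."""

    def __init__(self, site: str) -> None:
        self.site = site
        # Predicate-object pairs attached to ?measure.
        self.measure_patterns: List[str] = [
            "a escim:Measurement",
            "escim:ofCharacteristic ?characteristic",
            "escim:hasValue ?value",
            "unit:hasUnit ?unit",
            "time:inXSDDateTime ?time",
        ]
        self.filters: List[str] = []

    def to_sparql(self) -> str:
        """Serialize the current state as a SPARQL SELECT query."""
        lines = [
            "SELECT ?measure ?characteristic ?value ?unit ?time",
            "WHERE {",
            f"  <{self.site}> a escim:MeasurementSite ;",
            "    escim:hasMeasurement ?measure .",
            "  ?measure " + " ;\n    ".join(self.measure_patterns) + " .",
        ]
        lines += [f"  FILTER( {expr} )" for expr in self.filters]
        lines.append("}")
        return "\n".join(lines)


def characteristics_module(query: QueryBuilder, characteristic: str) -> None:
    """Restrict results to one characteristic, as on the second query slide."""
    query.measure_patterns.append(f"escim:ofCharacteristic {characteristic}")


def time_module(query: QueryBuilder, before_date: str) -> None:
    """Restrict results to measurements before a date, as on the third slide."""
    query.filters.append(f'?time < xsd:date("{before_date}")')
```

Calling `characteristics_module(q, "escim:FecalColiform")` and then `time_module(q, "2009-09-08")` on `q = QueryBuilder("#site1")` yields a query with the same constraints as the final slide. Because each module only appends to shared state, modules stay independent of one another and of the order in which they are applied.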

15

Domain Extensions

• Air quality data from the Environmental Protection Agency

• Bird species count data from Avian Knowledge Network eBird database

• Fish species count data from Santa Barbara Long Term Ecological Research (LTER) group

16

Performance Analysis

• Analysis on San Francisco, CA 94107, one of the largest datasets in the system

• Water data: 74920 triples

• Species data: 5605 triples

• Time to completion:

– Water: 0:03.790

– Species: 0:00.813

– Combined: 2:14.015

17

Performance Redux

• Partitioning of knowledge base as a function of declared domain

– Water domain: 0:03.778

– Bird domain: 0:00.632

– “Combined”: 0:11.126

• Transformation of the regulation ontology into a ruleset executed with a traditional RETE engine improves performance by ~66%
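The partitioning idea can be sketched as grouping triples by domain so that inference runs over each smaller group instead of the full knowledge base. The sketch below is hypothetical throughout: the slides do not say how a triple's declared domain is determined, so `domain_from_prefix` simply assumes it can be read from the subject's prefix, and the function names are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as prefixed names


def domain_from_prefix(triple: Triple) -> str:
    """Assumed domain lookup: the prefix of the subject, e.g. 'water'."""
    return triple[0].split(":", 1)[0]


def partition_by_domain(
    triples: Iterable[Triple],
    domain_of: Callable[[Triple], str] = domain_from_prefix,
) -> Dict[str, List[Triple]]:
    """Group triples so each domain can be loaded and reasoned over separately."""
    partitions: Dict[str, List[Triple]] = defaultdict(list)
    for triple in triples:
        partitions[domain_of(triple)].append(triple)
    return dict(partitions)


def infer_per_partition(
    partitions: Dict[str, List[Triple]],
    infer: Callable[[List[Triple]], List[Triple]],
) -> Dict[str, List[Triple]]:
    """Run an inference step on each partition independently."""
    return {domain: infer(triples) for domain, triples in partitions.items()}
```

Any reasoner or rule engine could be passed as `infer`; it sees only one domain's triples at a time, which is the property the timings above rely on.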

18

Lessons Learned

• Best to represent thresholding as rules to improve inference performance

• Smartly partition data when disjointness is not or cannot be ontologically explicit

• Design modules to high-level ontologies for reusability in other domains

• Inferencing keeps queries simple to understand at the cost of making debugging more complex

19

Conclusions

• Semantic technologies and toolchains support integrating data from multiple domains and presenting it in a single portal

• Ontologies can assist in machine interpretation of data for non-expert end users

• Semantic software design is critical to a flexible, robust platform for building integrative applications

• The SemantEco framework has shown itself to be reusable and extensible in both semantic environmental and ecological monitoring, and is proving to be useful more broadly

20

Future Work

• Cross-domain query answering

• Support for moving average regulations

• Support for registering ontology transformations for Input/Output

• Applications of the framework in other scientific domains (e.g. health)

• Expand our collaborations on semantic monitoring (contact [email protected])

21

Acknowledgements

• Dr. A. Patrice Seyed

• Prof. Deborah L. McGuinness, Advisor

• Drs. F. Joshua Dein & R. Sky Bristol (USGS)

• Students: Ping Wang, Jin Guang Zheng, Theodora Kampelou, Linyun Fu, Matthew Ma, Chen Wang, Lynn Zheng, Robin Liu, Katherine Chastin, Brendan Ashby, & Irene Khan

• National Science Foundation

– Graduate Research Fellowship (EWP)

– DataONE Initiative (APS)

• RPI Tetherless World Constellation

• Contact [email protected] or [email protected] for collaboration opportunities