Data Grid Research Group, Dept. of Computer Science and Engineering, The Ohio State University, Columbus, Ohio 43210, USA

David Chiu & Gagan Agrawal

Enabling Ad Hoc Queries over Low-Level Scientific Data Sets

D. Chiu & G. Agrawal. Enabling Ad Hoc Queries over Low-Level Scientific Data Sets

SSDBM ’09

Presentation Outline

• Motivation
  ‣ Current Trends in Scientific Data Management
  ‣ Problem Discussion

• Data Registration and Indexing
  ‣ Metadata Extraction
  ‣ Transformation

• Service Composition

• Conclusion

Increased tremendously over the years

Scientific Data Sets

• The collection of scientific data has increased over the years with new instruments, simulations, etc.

• Data sets are stored in repositories around the globe

• Just within U.S. entities in the geospatial domain
  ‣ NOAA: oceanic, climate, water quality, ...
  ‣ NASA: ozone, air quality, tropical, ...
  ‣ NRCS: land quality, watershed, ...

Data Repositories

[Figure: Mass Storage Systems (MSS); Web or Data Grid Infrastructure]

Scientific Data Sets

• Data sets are typically low level, i.e.,
  ‣ Unstructured or semi-structured

    0101071895  0.34 -2.45  0.50 -0.65 -0.62 -0.71  0.00 -0.96
    0101071896 -1.71  0.49  0.27 -0.79 -1.53  0.60  0.09 -2.21
    0101071897 -0.53  0.14  4.32  1.95 -1.55 -1.68 -1.32 -0.69
    0101071898  1.90 -2.64 -1.70  1.11 -2.18 -1.08 -0.53 -0.25
    0101071899  0.44  0.97  1.65 -0.71 -2.02 -2.10 -0.50 -2.03
    0101071900 -1.65  1.19 -1.34  0.57 -1.37  7.00 -0.48 -1.77
    . . .

• However, data is well-documented
  ‣ Accompanying XML-based metadata describing data sets is typically required in today’s repositories

Data Repositories

[Figure: Mass Storage Systems (MSS); Grid/Web Services & portals; Web or Data Grid Infrastructure]

Data Repositories on a Global Scale

[Figure: repositories in the US, EU, AU, ...]

What Do the Users Want?

[Figure: repositories in the US, EU, AU, ...]

I don’t care where data is located.

I also want to share my own data with others!

Don’t just give me the data, but...

- Transform it
- Manipulate it
- Compose it with other processes and data sets

And do this with the least amount of work required from me!

System Goals

• To enable queries over low-level data sets, which involves:
  ‣ identification of relevant data sets
  ‣ automatic planning for the composition of dependent services (processes) for derivation

• ... while being non-intrusive to existing schemes, i.e.,
  ‣ avoids requiring a standardized format for storing data sets
  ‣ accommodates heterogeneous metadata
  ‣ this system should fit into existing MSS and scientific computing infrastructures (Data Grid & the Web)

That’s good and all, but...

Challenges

• Not without challenges...
  ‣ dealing with metadata from multiple entities
  ‣ efficiently identifying relevant data sets
  ‣ planning and executing accurate service compositions on the spot

• And without question, the need for DOMAIN KNOWLEDGE & SEMANTICS

The AUSPICE System

AUSPICE: Automatic Service Planning and Execution in Cloud/Grid Environments

The Semantics Layer

A Need for Domain Level Knowledge

• Assume the following service retrieves a satellite image pertaining to (x, y) at a resolution given by r

    get_sat_image(double x, double y, double r)

[Figure: the concepts longitude, latitude, and grid_size are linked to the parameters x, y, and r by inputsTo edges; the service is linked to the concept satellite image by an outputsTo edge]

• Questions to ask the system:
  ‣ How to deduce that this service can be used?
  ‣ How to determine what information is needed for input?
  ‣ Did the user provide enough information to invoke this service?
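The three questions above can be made concrete with a small sketch. This is an illustration only, not the AUSPICE implementation: the `Service` class and the helper functions are hypothetical names, and the pairing of x, y, r with longitude, latitude, grid_size is read off the slide's diagram.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """A registered service with the domain concept bound to each parameter."""
    name: str
    inputs: dict   # parameter name -> domain concept (the inputsTo edges)
    output: str    # domain concept the service produces (the outputsTo edge)


# Assumed bindings for the slide's example service.
GET_SAT_IMAGE = Service(
    name="get_sat_image",
    inputs={"x": "longitude", "y": "latitude", "r": "grid_size"},
    output="satellite image",
)


def services_for(concept, registry):
    """Question 1: which registered services can derive this concept?"""
    return [s for s in registry if s.output == concept]


def missing_inputs(service, query):
    """Questions 2 and 3: which required concepts are absent from the query?"""
    return sorted(set(service.inputs.values()) - set(query))


def bind_arguments(service, query):
    """Map the query's concept values onto the service's parameters."""
    return {param: query[concept] for param, concept in service.inputs.items()}
```

A query that supplies only longitude and latitude would be reported as missing `grid_size`, so the system knows it cannot yet invoke the service.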

In the Semantics Layer

Applying Domain Information

Domain concepts can be derived from executing a service

Domain concepts can also be derived from retrieving an existing data set

Service parameters represent different domain concepts

Data Registration Service

Indexing Data Sets

• Handling heterogeneous metadata

• For instance, just within the geospatial domain,

Country    Metadata Standard
US         CSDGM
AU, NZ     ANZLIC
EU         ???
CDN        ???
...        ...

• Metadata Transformation

[Figure: metadata transformed to a spatial index]

• Metadata to DB transformations

[Figure: transformed metadata inserted into the database]
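One way to picture the extraction and insert steps is the sketch below. It is a hedged illustration under several assumptions: the sample record is made up, the table layout and function names are hypothetical, only a CSDGM extractor is shown (using the standard's `bounding` element with `westbc`, `eastbc`, `northbc`, `southbc`), and a plain SQLite table with bounding-box columns stands in for a real spatial index.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A trimmed, invented CSDGM record; only the title and bounding box are used.
CSDGM_SAMPLE = """
<metadata>
  <idinfo>
    <citation><citeinfo><title>example coastline tile</title></citeinfo></citation>
    <spdom>
      <bounding>
        <westbc>-83.2</westbc>
        <eastbc>-82.8</eastbc>
        <northbc>40.2</northbc>
        <southbc>39.8</southbc>
      </bounding>
    </spdom>
  </idinfo>
</metadata>
"""


def extract_csdgm(xml_text):
    """Pull the attributes to be indexed out of one CSDGM document."""
    root = ET.fromstring(xml_text.strip())
    box = root.find("idinfo/spdom/bounding")
    return {
        "title": root.findtext("idinfo/citation/citeinfo/title"),
        "west": float(box.findtext("westbc")),
        "east": float(box.findtext("eastbc")),
        "south": float(box.findtext("southbc")),
        "north": float(box.findtext("northbc")),
    }


# One extractor per metadata standard; other standards would add entries here.
EXTRACTORS = {"CSDGM": extract_csdgm}


def open_index():
    """Create an in-memory table that plays the role of the spatial index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datasets (location TEXT, title TEXT,"
        " west REAL, east REAL, south REAL, north REAL)"
    )
    return conn


def register(conn, location, standard, xml_text):
    """Extract the common attributes and insert them for one data set."""
    rec = EXTRACTORS[standard](xml_text)
    conn.execute(
        "INSERT INTO datasets VALUES (?, ?, ?, ?, ?, ?)",
        (location, rec["title"], rec["west"], rec["east"], rec["south"], rec["north"]),
    )


def find_covering(conn, lon, lat):
    """Return locations of data sets whose bounding box contains the point."""
    rows = conn.execute(
        "SELECT location FROM datasets"
        " WHERE west <= ? AND east >= ? AND south <= ? AND north >= ?",
        (lon, lon, lat, lat),
    )
    return [row[0] for row in rows]
```

Keeping one extractor per standard means the index schema stays the same no matter which metadata format a repository uses, which matches the goal of accommodating heterogeneous metadata without imposing a storage format.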

In the Semantics Layer

Applying Domain Information

Data registration simplifies identification process within

Indexing Services

• Services (inputs, outputs) are also registered in much the same way

The Planning Layer

Service Composition: An Example

A subset of the ontology (unrolled)

The Planning Layer

Service Composition

begin compSrvc(concept, Q[...])
    W := ()

    // perform DFS starting from concept
    let v := concept be the currently visited node

    if v is a data type then
        W := (W, index.getData(v, Q))
    else // v is a service
        let (p1, ..., pn) be v’s params
        // recursive call on each pi
        W := (W, (v, compSrvc(p1, Q), ..., compSrvc(pn, Q)))
    end if

    return W
end
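The pseudocode above can be rendered as a short runnable sketch. This is a simplified reading, not the authors' code: it returns the plan directly instead of accumulating W, it assumes the unrolled ontology is acyclic, and the ontology entries, index contents, file names, and the `get_data` helper are all hypothetical.

```python
# Hypothetical unrolled ontology: each node is a data type or a service
# whose parameters name further nodes.
ONTOLOGY = {
    "extract_shoreline": {"kind": "service",
                          "params": ["coastal_image", "interpolate_level"]},
    "interpolate_level": {"kind": "service", "params": ["gauge_readings"]},
    "coastal_image": {"kind": "data"},
    "gauge_readings": {"kind": "data"},
}

# Hypothetical result of data registration: (data type, region) -> files.
INDEX = {
    ("coastal_image", "lake_erie"): ["img_17.dat"],
    ("gauge_readings", "lake_erie"): ["gauge_03.dat"],
}


def get_data(data_type, query):
    """Stand-in for index.getData(v, Q): look up registered data sets."""
    return INDEX.get((data_type, query["region"]), [])


def comp_srvc(v, query, ontology, lookup):
    """Depth-first expansion of node v into a nested execution plan."""
    node = ontology[v]
    if node["kind"] == "data":
        # Leaf: the plan is whatever data sets the index returns.
        return lookup(v, query)
    # Service: plan is the service name followed by one sub-plan per parameter.
    sub_plans = [comp_srvc(p, query, ontology, lookup) for p in node["params"]]
    return (v, *sub_plans)


plan = comp_srvc("extract_shoreline", {"region": "lake_erie"}, ONTOLOGY, get_data)
```

Each nested tuple names a service followed by the plans for its parameters, so the plan can be executed bottom-up: leaves are data retrievals, inner nodes are service invocations.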

The Planning Layer

Service Composition: An Example

[Figure: ontology (unrolled) alongside a derived execution plan; callout: this is what data registration provides]

Planning Times

Conclusion

• The AUSPICE System...
  ‣ unifies heterogeneous metadata
  ‣ extracts certain metadata attributes and indexes low-level data sets and services for fast access from distributed repositories
  ‣ automatically composes these services and data sets to answer user queries

• Questions - Comments?
  ‣ David Chiu chiud@cse.ohio-state.edu
  ‣ Gagan Agrawal agrawal@cse.ohio-state.edu
