Data Grid Research Group Dept. of Computer Science and Engineering The Ohio State University Columbus, Ohio 43210, USA David Chiu & Gagan Agrawal Enabling Ad Hoc Queries over Low-Level Scientific Data Sets


Data Grid Research Group
Dept. of Computer Science and Engineering
The Ohio State University
Columbus, Ohio 43210, USA

David Chiu & Gagan Agrawal

Enabling Ad Hoc Queries over Low-Level Scientific Data Sets


D. Chiu & G. Agrawal. Enabling Ad Hoc Queries over Low-Level Scientific Data Sets

SSDBM ’09

Presentation Outline

• Motivation
  ‣ Current Trends in Scientific Data Management
  ‣ Problem Discussion

• Data Registration Indexing
  ‣ Metadata Extraction
  ‣ Transformation

• Service Composition

• Conclusion


Increased tremendously over the years

Scientific Data Sets

• The collection of scientific data has increased over the years with new instruments, simulations, etc.

• Data sets are stored in repositories around the globe

• Just within U.S. entities in the geospatial domain
  ‣ NOAA: oceanic, climate, water quality, ...
  ‣ NASA: ozone, air quality, tropical, ...
  ‣ NRCS: land quality, watershed, ...


Data Repositories

Web or Data Grid Infrastructure

Mass Storage Systems (MSS)


Scientific Data Sets

• Data sets are typically low level, i.e.,
  ‣ Unstructured or semi-structured

0101071895  0.34 -2.45  0.50 -0.65 -0.62 -0.71  0.00 -0.96
0101071896 -1.71  0.49  0.27 -0.79 -1.53  0.60  0.09 -2.21
0101071897 -0.53  0.14  4.32  1.95 -1.55 -1.68 -1.32 -0.69
0101071898  1.90 -2.64 -1.70  1.11 -2.18 -1.08 -0.53 -0.25
0101071899  0.44  0.97  1.65 -0.71 -2.02 -2.10 -0.50 -2.03
0101071900 -1.65  1.19 -1.34  0.57 -1.37  7.00 -0.48 -1.77
. . .

• However, data is well-documented
  ‣ Accompanying XML-based metadata describing data sets is typically required in today’s repositories
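To make the "low level" point concrete, here is a minimal Python sketch that reads rows like the sample above. The slide does not say what the columns mean, so the sketch assumes only that the first field is a record key and the remaining fields are floating-point values. Knowing what each value represents is precisely what the accompanying metadata is for.

```python
from typing import NamedTuple


class Record(NamedTuple):
    key: str             # first field, kept as text to preserve leading zeros
    values: list[float]  # remaining fields; their meaning is NOT known from the file


def parse_record(line: str) -> Record:
    """Split one whitespace-delimited row into a key and its numeric values."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError(f"expected a key and at least one value: {line!r}")
    return Record(fields[0], [float(f) for f in fields[1:]])


# Two rows copied from the sample on the slide.
sample = """\
0101071895 0.34 -2.45 0.50 -0.65 -0.62 -0.71 0.00 -0.96
0101071896 -1.71 0.49 0.27 -0.79 -1.53 0.60 0.09 -2.21
"""

records = [parse_record(line) for line in sample.splitlines() if line.strip()]
```

The parser recovers numbers but no semantics, which is why the system leans on metadata rather than on the files themselves.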


Data Repositories

Mass Storage Systems (MSS)

Grid/Web Services & portals

Web or Data Grid Infrastructure


Data Repositories on a Global Scale

US, EU, AU, ...


What Do the Users Want?

US, EU, AU, ...

I don’t care where data is located.

I also want to share my own data with others!

Don’t just give me the data, but...

- Transform it
- Manipulate it
- Compose it with other processes and data sets

And do this with the least amount of work required from me!


System Goals

• To enable queries over low level data sets, which involves:
  ‣ identification of relevant data sets
  ‣ automatic planning for the composition of dependent services (processes) for derivation

• ... while being non-intrusive to existing schemes, i.e.,
  ‣ avoids a standardized format for storing data sets
  ‣ accommodates heterogeneous metadata
  ‣ this system should fit into existing MSS and scientific computing infrastructures (Data Grid & the Web)


That’s good and all, but...

Challenges

• Not without challenges...
  ‣ dealing with metadata from multiple entities
  ‣ efficiently identifying relevant data sets
  ‣ planning and executing accurate service compositions on the spot


• And without question, the need for DOMAIN KNOWLEDGE & SEMANTICS


The AUSPICE System

AUSPICE: Automatic Service Planning and Execution in Cloud/Grid Environments


The Semantics Layer

A Need for Domain Level Knowledge

• Assume the following service retrieves a satellite image pertaining to (x, y) at the resolution given by r

get_sat_image(double x, double y, double r)

• Questions to ask the system:
  ‣ How to deduce that this service can be used?
  ‣ How to determine what information is needed for input?
  ‣ Did the user provide enough information to invoke this service?

Ontology annotations for the service:
  ‣ longitude inputsTo x
  ‣ latitude inputsTo y
  ‣ grid_size inputsTo r
  ‣ get_sat_image outputsTo satellite image
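The three questions on this slide can be answered mechanically once a service carries such annotations. The Python sketch below is one way to picture it, not the AUSPICE implementation: the `Service` class and the three helper functions are hypothetical names, and only `get_sat_image`, its parameters, and the concept names come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    inputs: dict[str, str]  # parameter name -> domain concept (the inputsTo links)
    output: str             # domain concept produced (the outputsTo link)


# Annotation of the slide's example service.
GET_SAT_IMAGE = Service(
    name="get_sat_image",
    inputs={"x": "longitude", "y": "latitude", "r": "grid_size"},
    output="satellite image",
)


def services_for(concept: str, registry: list[Service]) -> list[Service]:
    """Q1: which services can be used to derive the requested concept?"""
    return [s for s in registry if s.output == concept]


def missing_concepts(service: Service, query: dict[str, float]) -> list[str]:
    """Q2/Q3: which required concepts did the user's query not supply?"""
    return [c for c in service.inputs.values() if c not in query]


def bind_arguments(service: Service, query: dict[str, float]) -> dict[str, float]:
    """Map each parameter to the query value of its annotated concept."""
    missing = missing_concepts(service, query)
    if missing:
        raise ValueError(f"{service.name} cannot be invoked, missing: {missing}")
    return {param: query[concept] for param, concept in service.inputs.items()}
```

For example, a query that supplies only longitude and latitude makes `missing_concepts` return `["grid_size"]`, so the system knows the service is relevant but not yet invocable.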


In the Semantics Layer

Applying Domain Information

Domain concepts can be derived from executing a service

Domain concepts can also be derived from retrieving an existing data set

Service parameters represent different domain concepts


Data Registration Service

Indexing Data Sets

• Handling heterogeneous metadata

• For instance, just within the geospatial domain,

Country    Metadata Standards
US         CSDGM
AU, NZ     ANZLIC
EU         ???
CDN        ???
...        ...
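One way to cope with several metadata standards is to keep a small table of extraction rules per standard and map every document onto the same common attributes. The sketch below illustrates that idea with Python's `xml.etree.ElementTree`. Every element path in `EXTRACTION_RULES` is an illustrative assumption, as are both sample documents; real paths would have to be taken from each standard's schema.

```python
import xml.etree.ElementTree as ET

# standard -> {common attribute -> element path below the document root}.
# All paths are assumptions made for this sketch, not quoted from the standards.
EXTRACTION_RULES = {
    "CSDGM": {
        "west": "idinfo/spdom/bounding/westbc",
        "east": "idinfo/spdom/bounding/eastbc",
        "north": "idinfo/spdom/bounding/northbc",
        "south": "idinfo/spdom/bounding/southbc",
    },
    "ANZLIC": {
        "west": "extent/west",
        "east": "extent/east",
        "north": "extent/north",
        "south": "extent/south",
    },
}


def extract_common(xml_text: str, standard: str) -> dict[str, float]:
    """Pull the common attributes out of one metadata document."""
    root = ET.fromstring(xml_text)
    record = {}
    for attribute, path in EXTRACTION_RULES[standard].items():
        node = root.find(path)
        if node is None or node.text is None:
            raise ValueError(f"{standard} document lacks {path}")
        record[attribute] = float(node.text)
    return record


# Hypothetical metadata documents in two different layouts.
csdgm_doc = (
    "<metadata><idinfo><spdom><bounding>"
    "<westbc>-84.8</westbc><eastbc>-80.5</eastbc>"
    "<northbc>42.0</northbc><southbc>38.4</southbc>"
    "</bounding></spdom></idinfo></metadata>"
)
anzlic_doc = (
    "<record><extent>"
    "<west>112.9</west><east>153.6</east>"
    "<north>-10.0</north><south>-43.7</south>"
    "</extent></record>"
)
```

Supporting another standard then means adding one more entry to the rule table, which leaves the original metadata files untouched.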


• Metadata Transformation
  ‣ transform to spatial index
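The slides do not say which spatial index is used, so the following Python sketch substitutes a simple uniform grid as a stand-in. It assumes each data set has already been reduced to a bounding box `(west, south, east, north)` in degrees. The class name, the cell size, and the data set identifiers are all hypothetical, and boxes that cross the antimeridian are not handled.

```python
import math
from collections import defaultdict

Box = tuple[float, float, float, float]  # (west, south, east, north)


def _intersects(a: Box, b: Box) -> bool:
    """True when two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


class GridIndex:
    """Uniform grid: each cell lists the data sets whose box touches it."""

    def __init__(self, cell_deg: float = 10.0):
        self.cell = cell_deg
        self.cells: dict[tuple[int, int], set[str]] = defaultdict(set)
        self.boxes: dict[str, Box] = {}

    def _cells_for(self, box: Box):
        west, south, east, north = box
        x0, x1 = math.floor(west / self.cell), math.floor(east / self.cell)
        y0, y1 = math.floor(south / self.cell), math.floor(north / self.cell)
        return ((x, y) for x in range(x0, x1 + 1) for y in range(y0, y1 + 1))

    def insert(self, dataset_id: str, box: Box) -> None:
        self.boxes[dataset_id] = box
        for cell in self._cells_for(box):
            self.cells[cell].add(dataset_id)

    def query(self, box: Box) -> list[str]:
        """Collect candidates from the touched cells, then test exactly."""
        candidates: set[str] = set()
        for cell in self._cells_for(box):
            candidates |= self.cells.get(cell, set())
        return sorted(d for d in candidates if _intersects(self.boxes[d], box))


index = GridIndex()
index.insert("ohio_water", (-84.8, 38.4, -80.5, 42.0))
index.insert("au_land", (112.9, -43.7, 153.6, -10.0))
hits = index.query((-83.1, 39.9, -82.9, 40.1))  # a small window inside the first box
```

A query only inspects the grid cells its window touches, so lookup cost depends on the window size rather than on the number of registered data sets.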


• Metadata to DB transformations
  ‣ insert
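The insert step can be pictured as writing one row per registered data set into a relational table. Below is a minimal sketch using Python's built-in `sqlite3` module. The table layout, the column names, the `register` and `find` helpers, and the file URL are assumptions made for illustration; the slides do not show the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE datasets (
        url     TEXT PRIMARY KEY,  -- where the low-level file lives
        concept TEXT NOT NULL,     -- domain concept the data set represents
        west REAL, south REAL, east REAL, north REAL
    )
    """
)


def register(conn, url, concept, box):
    """Insert the attributes extracted from one metadata document."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO datasets VALUES (?, ?, ?, ?, ?, ?)",
            (url, concept, *box),
        )


def find(conn, concept, lon, lat):
    """Data sets of the given concept whose bounding box contains a point."""
    rows = conn.execute(
        "SELECT url FROM datasets "
        "WHERE concept = ? AND west <= ? AND ? <= east "
        "AND south <= ? AND ? <= north ORDER BY url",
        (concept, lon, lon, lat, lat),
    )
    return [row[0] for row in rows]


# Hypothetical registration and lookup.
register(conn, "file:///data/ohio_water.dat", "water quality", (-84.8, 38.4, -80.5, 42.0))
matches = find(conn, "water quality", -83.0, 40.0)
```

Only the extracted attributes go into the database; the data file itself stays where it is and is referenced by its location.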


In the Semantics Layer

Applying Domain Information

Data registration simplifies identification process within


Indexing Services

• Services (inputs, outputs) are also registered in much the same way


The Planning Layer

Service Composition: An Example

A subset of the ontology (unrolled)


The Planning Layer

Service Composition

begin compSrvc(concept, Q[...])
    W := ()

    // perform DFS starting from concept
    let v := concept be the currently visited node

    if v is a data type then
        W := (W, index.getData(v, Q))
    else  // v is a service
        let (p1, ..., pn) be v's params
        // recursive call on each pi
        W := (W, (v, compSrvc(p1, Q), ..., compSrvc(pn, Q)))
    end if

    return W
end
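The pseudocode above translates almost line for line into Python. The sketch below is an illustrative rendering under several assumptions: the ontology is a plain dictionary and is acyclic ("unrolled"), the index is a small in-memory stub, and every service, data type, and file name is hypothetical. Unlike the pseudocode, it returns the plan for a node directly instead of accumulating it in W.

```python
class DictIndex:
    """Stand-in for the data registration index."""

    def __init__(self, entries):
        self.entries = entries  # data type -> [(region, url), ...]

    def get_data(self, data_type, query):
        """Return the registered files of this type that match the query."""
        return [
            url
            for region, url in self.entries.get(data_type, [])
            if region == query["region"]
        ]


def comp_srvc(node, query, ontology, index):
    """Depth-first composition: data nodes resolve to files, services recurse."""
    kind, params = ontology[node]
    if kind == "data":
        return index.get_data(node, query)
    # node is a service: build (service, plan for p1, ..., plan for pn)
    return (node, *(comp_srvc(p, query, ontology, index) for p in params))


# Hypothetical ontology subset: node -> (kind, parameters).
ONTOLOGY = {
    "extract_shoreline": ("service", ["terrain_model", "interpolate_water_level"]),
    "interpolate_water_level": ("service", ["gauge_readings"]),
    "terrain_model": ("data", []),
    "gauge_readings": ("data", []),
}

index = DictIndex(
    {
        "terrain_model": [("lake_erie", "ctm_erie.dat"), ("lake_huron", "ctm_huron.dat")],
        "gauge_readings": [("lake_erie", "gauge_erie.dat")],
    }
)

plan = comp_srvc("extract_shoreline", {"region": "lake_erie"}, ONTOLOGY, index)
```

The nested tuple mirrors the order in which services would be invoked: inner entries must be resolved before the service that encloses them can run.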


The Planning Layer

Service Composition: An Example

Ontology (unrolled)

A Derived Execution Plan

This is what data registration provides


Planning Times


Conclusion

• The AUSPICE System...
  ‣ unifies heterogeneous metadata
  ‣ extracts certain metadata attributes and indexes low level data sets and services for fast access from distributed repositories
  ‣ automatically composes these services and data sets to answer user queries

• Questions - Comments?
  ‣ David Chiu
  ‣ Gagan Agrawal