16
Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e- Science 1 Grid-based Data Stream Processing in e-Science Richard Kuntschke 1 , Tobias Scholl 1 , Sebastian Huber 1 , Alfons Kemper 1 , Angelika Reiser 1 , Hans-Martin Adorf 2 , Gerard Lemson 3 , and Wolfgang Voges 3 1 Lehrstuhl Informatik III: 1 Datenbanksysteme 1 Fakultät für Informatik 1 Technische Universität München 2 Max-Planck-Institut 2 für Astrophysik 3 Max-Planck-Institut 3 für extraterrestrische 3 Physik

Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

Embed Size (px)

Citation preview

Page 1: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science 1

Grid-basedData Stream Processingin e-Science

Richard Kuntschke1, Tobias Scholl1, Sebastian Huber1,Alfons Kemper1, Angelika Reiser1,Hans-Martin Adorf2, Gerard Lemson3, and Wolfgang Voges3

1Lehrstuhl Informatik III:1Datenbanksysteme1Fakultät für Informatik1Technische Universität München

2Max-Planck-Institut2für Astrophysik

3Max-Planck-Institut3für extraterrestrische3Physik

Page 2: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

2

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Important Challenges in e-Science

In general: Large and exponentially growing amounts of data Distributed data archives No unique identifiers Uncertainty

In astrophysics:Spectral Energy Distributions (SEDs)

Used to classify celestial objects (active galactic nuclei, brown dwarfs, neutron stars, ...)

Generation requires spatial (astrometric) matching

Page 3: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

3

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Spatial (Astrometric) Matching

Current solutions … … load all data into main memory

Uses a lot of memory Infeasible if memory size is insufficient

… process all data at once and deliver the complete result at the end Inefficient No results until all processing has completed

Page 4: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

4

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Our Contributions

StarGlobe Grid-based P2P Data

Stream Management System implemented on top of Globus

In-network processing Early filtering Parallelization Pipelining Load-balancing

Mobile user-defined operators

Astrophysical Example Workflow Astrometric matching Performance evaluation

Page 5: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

5

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

The StarGlobe Architecture

Super-Peer BackboneQuery 1

Stream 0

Publish

Subscribe

filter

transform

Load mobile operators

Fct-Provider

filter

transform

Stream 1

Publish

Query 2

Subscribe

Page 6: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

6

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Traditional Approach: Bring Data to Code

union

NN_10

T... ... ...... ... ...... ... ...

Data-Prov. BT

... ... ...

... ... ...

... ... ...

Data-Prov. CT

... ... ...

... ... ...

... ... ...

Data-Prov. DT

... ... ...

... ... ...

... ... ...

Data-Prov. A

Page 7: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

7

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

New Approach: Bring Code to Data

T... ... ...... ... ...... ... ...

Data-Prov. A

scan

NN_10

T... ... ...... ... ...... ... ...

Data-Prov. B

scan

NN_10

T... ... ...... ... ...... ... ...

Data-Prov. C

scan

NN_10

T... ... ...... ... ...... ... ...

Data-Prov. D

scan

NN_10

union

NN_10

Fct-Provider

NN_10

Page 8: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

8

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Mobile User-Defined Operators

Load user-defined operators from function provider servers in the network

Common interface for integrating external operators

Push-based iterator

Flexibility

Page 9: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

9

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

StreamIterator Interface

open(Config, StreamWriter) Configuration parameters Writer for result stream

next(StreamIteratorEvent) Next element in input stream Writing output to result stream using

StreamWriter.write() close()

Page 10: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

10

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Communication betweenStreamProcessor and StreamIterator

StreamIterator

StreamHandler 1

StreamHandler 2

StreamHandler n

StreamWriter

StreamProcessor

...

XML InputStream 1

XML InputStream 2

XML InputStream n

XML OutputStream

...

Item 1 Item 2 Item n Result Item

Page 11: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

11

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Astrophysical Example Workflow

peer-10

peer-9 peer-8

peer-6 peer-4 peer-5peer-7

peer-2peer-1peer-0 peer-3

Input ListRASS-BSC

2MASS FIRST USNOB1

NVSS GSC-2

SED assembly

Page 12: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

12

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Distributed Query Evaluation Planplan-10

at peer-10

plan-8at peer-8

plan-5at peer-5enrichσ-5

transform-5

stream-5

χ²filter-2

join-2

plan-4at peer-4enrichσ-4

transform-4

stream-4

plan-9at peer-9

χ²filter-3

join-3

plan-7at peer-7

plan-2at peer-2enrichσ-2

transform-2

stream-2

plan-3at peer-3enrichσ-3

transform-3

stream-3

χ²filter-1

join-1

plan-6at peer-6

χ²filter-0

join-0

plan-1at peer-1enrichσ-1

transform-1

stream-1

χ²filter-4

join-4

display

plan-0at peer-0enrichσ-0

transform-0

stream-0

Page 13: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

13

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Distributed Query Evaluation Plan

plan-6at peer-6

χ²filter-0

join-0

plan-1at peer-1enrichσ-1

transform-1

stream-1

plan-0at peer-0enrichσ-0

transform-0

stream-0

Page 14: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

14

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Distributed Query Evaluation Plan

plan-10at peer-10

χ²filter-4

join-4

display

Page 15: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

15

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Evaluation of Early Filtering

Page 16: Lehrstuhl Informatik III: Datenbanksysteme Grid-based Data Stream Processing in e-Science 1 Richard Kuntschke 1, Tobias Scholl 1, Sebastian Huber 1, Alfons

16

Lehrstuhl Informatik III: Datenbanksysteme

Grid-based Data Stream Processing in e-Science

Conclusion

Synergies between research in computer science and other scientific disciplines, e.g., astrophysics

StarGlobe Handling large data volumes efficiently

Early filtering, parallelization, pipelining Returning first results early on

Pipelining Flexible support of domain-specific application logic

Mobile user-defined operators

Results also applicable to other domains