Grid-based Data Stream Processing in e-Science

Richard Kuntschke¹, Tobias Scholl¹, Sebastian Huber¹, Alfons Kemper¹, Angelika Reiser¹, Hans-Martin Adorf², Gerard Lemson³, and Wolfgang Voges³

¹ Lehrstuhl Informatik III: Datenbanksysteme, Fakultät für Informatik, Technische Universität München
² Max-Planck-Institut für Astrophysik
³ Max-Planck-Institut für extraterrestrische Physik

Important Challenges in e-Science

In general:
- Large and exponentially growing amounts of data
- Distributed data archives
- No unique identifiers
- Uncertainty

In astrophysics:
- Spectral Energy Distributions (SEDs)
  - Used to classify celestial objects (active galactic nuclei, brown dwarfs, neutron stars, ...)
  - Generation requires spatial (astrometric) matching

Spatial (Astrometric) Matching

Current solutions …
- … load all data into main memory
  - Uses a lot of memory
  - Infeasible if memory size is insufficient
- … process all data at once and deliver the complete result at the end
  - Inefficient
  - No results until all processing has completed
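
The difference between delivering one complete result at the end and delivering matches as they are found can be sketched in Python. This is a minimal illustration only, not one of the existing tools and not the system presented in these slides: the function names, the `(id, ra, dec)` record layout, and the flat-sky distance test are all assumptions made for the example.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record layout: (object id, right ascension, declination) in degrees.
Source = Tuple[str, float, float]


def close_enough(a: Source, b: Source, radius_deg: float) -> bool:
    """Crude flat-sky proximity test; a real matcher would use spherical geometry."""
    return (a[1] - b[1]) ** 2 + (a[2] - b[2]) ** 2 <= radius_deg ** 2


def match_batch(inputs: Iterable[Source], catalog: List[Source],
                radius_deg: float) -> List[Tuple[Source, Source]]:
    """Batch style: materialize all inputs, return everything at the end."""
    all_inputs = list(inputs)  # the whole input is held in memory
    return [(i, c) for i in all_inputs for c in catalog
            if close_enough(i, c, radius_deg)]


def match_streaming(inputs: Iterable[Source], catalog: List[Source],
                    radius_deg: float) -> Iterator[Tuple[Source, Source]]:
    """Streaming style: consume one input at a time, emit matches immediately."""
    for i in inputs:
        for c in catalog:
            if close_enough(i, c, radius_deg):
                yield (i, c)


def demo_inputs(consumed: List[str]) -> Iterator[Source]:
    """Input stream that records which elements have been read so far."""
    for src in [("in-1", 10.0, 20.0), ("in-2", 50.0, -5.0), ("in-3", 80.0, 30.0)]:
        consumed.append(src[0])
        yield src


catalog = [("cat-a", 10.0001, 20.0001), ("cat-b", 80.0, 30.0002)]
consumed: List[str] = []
# Ask the streaming matcher for its first result only.
first_match = next(match_streaming(demo_inputs(consumed), catalog, 0.001))
```

After the last line, `consumed` contains only the first input element: the streaming variant produced a match without reading the rest of the input, whereas `match_batch` cannot return anything before the input is exhausted.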

Our Contributions

- StarGlobe
  - Grid-based P2P Data Stream Management System implemented on top of Globus
  - In-network processing
    - Early filtering
    - Parallelization
    - Pipelining
    - Load-balancing
  - Mobile user-defined operators
- Astrophysical Example Workflow
  - Astrometric matching
  - Performance evaluation

The StarGlobe Architecture

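[Figure: StarGlobe architecture. Streams (Stream 0, Stream 1) are published into a super-peer backbone, and queries (Query 1, Query 2) subscribe to them. Peers apply filter and transform operators along the way and load mobile operators from a function provider (Fct-Provider).]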

Traditional Approach: Bring Data to Code

[Figure: data providers A, B, C, and D each hold a table T. The tables are shipped to a single site, which computes their union and then applies NN_10.]

New Approach: Bring Code to Data

[Figure: data providers A, B, C, and D each scan their table T and apply NN_10 locally; the NN_10 operator is loaded from a function provider (Fct-Provider). The local results are combined by a union, followed by a final NN_10.]
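
Why the local NN_10 operators do not change the final answer can be checked with a small Python sketch. It assumes that NN_10 denotes a 10-nearest-neighbor operator relative to a query position, which the slides do not spell out; the function names, the one-dimensional positions, and the synthetic provider tables are hypothetical.

```python
import heapq
import random
from typing import Iterable, List, Tuple


def nn_k(rows: Iterable[float], target: float, k: int = 10) -> List[float]:
    """Return the k rows closest to target (assumed meaning of NN_10 for k=10)."""
    return heapq.nsmallest(k, rows, key=lambda x: abs(x - target))


def bring_data_to_code(tables: List[List[float]], target: float,
                       k: int = 10) -> Tuple[List[float], int]:
    """Ship every table in full, union centrally, then apply NN_k once."""
    union = [row for table in tables for row in table]
    shipped = len(union)  # rows that had to cross the network
    return nn_k(union, target, k), shipped


def bring_code_to_data(tables: List[List[float]], target: float,
                       k: int = 10) -> Tuple[List[float], int]:
    """Apply NN_k at each provider, ship only local winners, union, NN_k again."""
    partials = [nn_k(table, target, k) for table in tables]
    union = [row for part in partials for row in part]
    shipped = len(union)  # at most k rows per provider
    return nn_k(union, target, k), shipped


# Four synthetic provider tables (A to D) with 1000 positions each.
rng = random.Random(42)
tables = [[rng.uniform(0.0, 360.0) for _ in range(1000)] for _ in "ABCD"]

central, shipped_central = bring_data_to_code(tables, target=180.0)
local, shipped_local = bring_code_to_data(tables, target=180.0)
```

Every row in the global top 10 is also in the top 10 of its own provider, so both variants return the same rows. In this setup the central variant ships 4000 rows and the local variant ships 40.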

Mobile User-Defined Operators

- Load user-defined operators from function provider servers in the network
- Common interface for integrating external operators
- Push-based iterator
- Flexibility

StreamIterator Interface

- open(Config, StreamWriter)
  - Configuration parameters
  - Writer for result stream
- next(StreamIteratorEvent)
  - Next element in input stream
  - Writing output to result stream using StreamWriter.write()
- close()
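
The three calls above can be illustrated with a Python rendition of a push-based operator. Only the names `StreamIterator`, `StreamWriter`, `StreamIteratorEvent`, `open`, `next`, `close`, and `write` come from the slide. Everything else is an assumption for the sake of a runnable sketch: the event fields, the use of a plain dictionary in place of `Config`, and the `MagnitudeFilter` operator are hypothetical and do not reflect the actual StarGlobe classes.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class StreamIteratorEvent:
    """Assumed event shape: one item taken from one input stream."""
    item: Dict[str, Any]
    stream_index: int = 0


class StreamWriter:
    """Stand-in result writer that collects output items in a list."""

    def __init__(self) -> None:
        self.items: List[Dict[str, Any]] = []

    def write(self, item: Dict[str, Any]) -> None:
        self.items.append(item)


class StreamIterator(ABC):
    """Push-based operator interface: the caller drives open, next, close."""

    @abstractmethod
    def open(self, config: Dict[str, Any], writer: StreamWriter) -> None: ...

    @abstractmethod
    def next(self, event: StreamIteratorEvent) -> None: ...

    @abstractmethod
    def close(self) -> None: ...


class MagnitudeFilter(StreamIterator):
    """Hypothetical user-defined operator: keep items at or below a magnitude limit."""

    def open(self, config: Dict[str, Any], writer: StreamWriter) -> None:
        self.limit = float(config["max_magnitude"])
        self.writer: Optional[StreamWriter] = writer

    def next(self, event: StreamIteratorEvent) -> None:
        # Output is pushed to the writer; next() itself returns nothing.
        if event.item["magnitude"] <= self.limit and self.writer is not None:
            self.writer.write(event.item)

    def close(self) -> None:
        self.writer = None


writer = StreamWriter()
op = MagnitudeFilter()
op.open({"max_magnitude": 15.0}, writer)
for mag in (12.5, 17.0, 14.9):
    op.next(StreamIteratorEvent({"magnitude": mag}))
op.close()
```

In contrast to a pull-based iterator, `next` receives the input element as an argument and returns no value; zero, one, or many results per input element are emitted through the writer.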

Communication between StreamProcessor and StreamIterator

[Figure: the StreamProcessor receives XML input streams 1 to n. StreamHandler 1 to n pass the items (Item 1 to Item n) to the StreamIterator, and the StreamWriter writes each result item to the XML output stream.]
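
One way to read this interaction is sketched below in Python. The class names `StreamProcessor`, `StreamHandler`, `StreamWriter`, and the open/next/close protocol are taken from the slides; the method names `handle` and `run`, the tuple used as an event, the list standing in for the XML output stream, and the `UnionIterator` operator are assumptions for illustration and not the actual StarGlobe implementation.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List, Tuple


class StreamWriter:
    """Serializes result items; a list stands in for the XML output stream."""

    def __init__(self) -> None:
        self.output: List[str] = []

    def write(self, element: ET.Element) -> None:
        self.output.append(ET.tostring(element, encoding="unicode"))


class UnionIterator:
    """Toy operator: forwards every item and tags it with its source stream."""

    def open(self, config: dict, writer: StreamWriter) -> None:
        self.writer = writer

    def next(self, event: Tuple[int, ET.Element]) -> None:
        index, element = event
        element.set("source", str(index))
        self.writer.write(element)

    def close(self) -> None:
        pass


class StreamHandler:
    """Parses items of one XML input stream and pushes them into the iterator."""

    def __init__(self, index: int, iterator: UnionIterator) -> None:
        self.index = index
        self.iterator = iterator

    def handle(self, xml_item: str) -> None:
        self.iterator.next((self.index, ET.fromstring(xml_item)))


class StreamProcessor:
    """Owns one handler per input stream and one writer for the output stream."""

    def __init__(self, iterator: UnionIterator, n_inputs: int) -> None:
        self.writer = StreamWriter()
        self.iterator = iterator
        self.handlers = [StreamHandler(i, iterator) for i in range(1, n_inputs + 1)]

    def run(self, arrivals: Iterable[Tuple[int, str]]) -> List[str]:
        """Feed (stream number, XML item) pairs in arrival order."""
        self.iterator.open({}, self.writer)
        for stream_index, xml_item in arrivals:
            self.handlers[stream_index - 1].handle(xml_item)
        self.iterator.close()
        return self.writer.output


processor = StreamProcessor(UnionIterator(), n_inputs=2)
output = processor.run([
    (1, '<obj id="a"/>'),
    (2, '<obj id="b"/>'),
    (1, '<obj id="c"/>'),
])
```

The operator itself never touches the input streams directly: each handler converts raw XML into an event, and the single writer is the only path to the output stream.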

Astrophysical Example Workflow

[Figure: example workflow on a network of peers peer-0 to peer-10. Labels: input list, RASS-BSC, 2MASS, FIRST, USNOB1, NVSS, GSC-2, SED assembly.]

Distributed Query Evaluation Plan

[Figure: query evaluation plan distributed over the peers, with operators listed in data-flow order.]
- plan-0 to plan-5 at peer-0 to peer-5: stream-i, transform-i, enrichσ-i (for i = 0 to 5)
- plan-6 at peer-6: join-0, χ²filter-0
- plan-7 at peer-7: join-1, χ²filter-1
- plan-8 at peer-8: join-2, χ²filter-2
- plan-9 at peer-9: join-3, χ²filter-3
- plan-10 at peer-10: join-4, χ²filter-4, display

Distributed Query Evaluation Plan

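[Figure: detail of the plan. stream-0 and stream-1 each pass through a transform and an enrichσ operator (transform-0, enrichσ-0 and transform-1, enrichσ-1); join-0 combines the two branches and χ²filter-0 filters the joined result.]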
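
The operator chain in this plan fragment can be sketched as a pipeline of Python generators. The slides give only the operator names, so the semantics used here are assumptions: `transform` is taken to map catalog-specific columns to a common schema, `enrich_sigma` to attach a positional uncertainty σ, `join` to pair candidates inside a coarse search window, and `chi2_filter` to keep pairs whose squared offset divided by the combined variance stays below a threshold. Column names, σ values, and the threshold are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, float]
Pair = Tuple[Row, Row]


def transform(rows: Iterable[Row], ra_key: str, dec_key: str) -> Iterator[Row]:
    """Map catalog-specific column names to a common (ra, dec) schema (assumed role)."""
    for row in rows:
        yield {"ra": row[ra_key], "dec": row[dec_key]}


def enrich_sigma(rows: Iterable[Row], sigma: float) -> Iterator[Row]:
    """Attach a positional uncertainty to each row (assumed role of enrichσ)."""
    for row in rows:
        yield {**row, "sigma": sigma}


def join(left: Iterable[Row], right: List[Row], window: float) -> Iterator[Pair]:
    """Pair each left row with the right rows inside a square search window."""
    for l in left:
        for r in right:
            if abs(l["ra"] - r["ra"]) <= window and abs(l["dec"] - r["dec"]) <= window:
                yield l, r


def chi2_filter(pairs: Iterable[Pair], threshold: float) -> Iterator[Pair]:
    """Keep pairs whose offset is small relative to the combined uncertainty."""
    for l, r in pairs:
        d2 = (l["ra"] - r["ra"]) ** 2 + (l["dec"] - r["dec"]) ** 2
        chi2 = d2 / (l["sigma"] ** 2 + r["sigma"] ** 2)
        if chi2 <= threshold:
            yield l, r


# Two toy input streams with different column names.
stream_0 = [{"RA": 10.0, "DEC": 20.0}, {"RA": 40.0, "DEC": -3.0}]
stream_1 = [{"alpha": 10.001, "delta": 20.0}, {"alpha": 40.02, "delta": -3.0}]

left = enrich_sigma(transform(stream_0, "RA", "DEC"), sigma=0.002)
# The right side is materialized because this toy join is a nested loop.
right = list(enrich_sigma(transform(stream_1, "alpha", "delta"), sigma=0.002))
matches = list(chi2_filter(join(left, right, window=0.05), threshold=9.0))
```

Both candidate pairs pass the coarse window in `join`, but only the pair with the 0.001 degree offset survives the χ² test; the pair with the 0.02 degree offset is dropped. Because every stage is a generator, a left-hand row flows through all operators before the next one is read, which mirrors the pipelining idea of the plan.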

Evaluation of Early Filtering

Conclusion

- Synergies between research in computer science and other scientific disciplines, e.g., astrophysics
- StarGlobe
  - Handling large data volumes efficiently
    - Early filtering, parallelization, pipelining
  - Returning first results early on
    - Pipelining
  - Flexible support of domain-specific application logic
    - Mobile user-defined operators
- Results also applicable to other domains