Lehrstuhl Informatik III: Datenbanksysteme
Grid-based Data Stream Processing in e-Science
Richard Kuntschke¹, Tobias Scholl¹, Sebastian Huber¹, Alfons Kemper¹, Angelika Reiser¹, Hans-Martin Adorf², Gerard Lemson³, and Wolfgang Voges³
¹ Lehrstuhl Informatik III: Datenbanksysteme, Fakultät für Informatik, Technische Universität München
² Max-Planck-Institut für Astrophysik
³ Max-Planck-Institut für extraterrestrische Physik
Important Challenges in e-Science
In general:
- Large and exponentially growing amounts of data
- Distributed data archives
- No unique identifiers
- Uncertainty
In astrophysics:
- Spectral Energy Distributions (SEDs)
  - Used to classify celestial objects (active galactic nuclei, brown dwarfs, neutron stars, ...)
  - Generation requires spatial (astrometric) matching
Spatial (Astrometric) Matching
Current solutions …
- … load all data into main memory
  - Uses a lot of memory
  - Infeasible if memory size is insufficient
- … process all data at once and deliver the complete result at the end
  - Inefficient
  - No results until all processing has completed
Our Contributions
- StarGlobe
  - Grid-based P2P Data Stream Management System implemented on top of Globus
  - In-network processing
    - Early filtering
    - Parallelization
    - Pipelining
    - Load-balancing
  - Mobile user-defined operators
- Astrophysical Example Workflow
  - Astrometric matching
  - Performance evaluation
The StarGlobe Architecture
[Figure: super-peer backbone. Labels: Stream 0 (publish), Stream 1 (publish), Query 1 (subscribe), Query 2 (subscribe), filter, transform, load mobile operators, Fct-Provider.]
Traditional Approach: Bring Data to Code
[Figure: data providers A, B, C, and D, each holding a table T. Operators shown: union, NN_10.]
New Approach: Bring Code to Data
[Figure: data providers A, B, C, and D, each holding a table T with its own scan and NN_10 operators. Further labels: union, NN_10, Fct-Provider (NN_10).]
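The difference between the two approaches can be illustrated with a minimal sketch. It assumes, purely for illustration, that NN_10 selects the 10 nearest neighbors of a query point; the slides do not define the operator. The names `nn_k`, `providers`, and `query`, as well as the random test data, are hypothetical. Under that assumption, running NN_10 at each provider and once more over the union gives the same answer as running it over all shipped data, while far fewer tuples cross the network.

```python
import heapq
import math
import random

def nn_k(points, query, k=10):
    """Return the k points closest to query (assumed meaning of NN_10)."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, query))

random.seed(0)
# Four hypothetical data providers A-D, each holding a table T of 2-D points.
providers = {
    name: [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(1000)]
    for name in "ABCD"
}
query = (50.0, 50.0)

# Bring data to code: ship every tuple, then apply union and NN_10 centrally.
shipped_all = [p for table in providers.values() for p in table]
central = nn_k(shipped_all, query)

# Bring code to data: run NN_10 at each provider, ship only the local results,
# then apply union and NN_10 once more.
shipped_local = [p for table in providers.values() for p in nn_k(table, query)]
distributed = nn_k(shipped_local, query)

print("tuples shipped:", len(shipped_all), "vs", len(shipped_local))
print("same result:", central == distributed)
```

The second NN_10 after the union is needed because the global nearest neighbors may all come from a single provider; each local result is only a superset of that provider's contribution.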
Mobile User-Defined Operators
- Load user-defined operators from function provider servers in the network
- Common interface for integrating external operators
  - Push-based iterator
- Flexibility
StreamIterator Interface
- open(Config, StreamWriter)
  - Configuration parameters
  - Writer for result stream
- next(StreamIteratorEvent)
  - Next element in input stream
  - Writing output to result stream using StreamWriter.write()
- close()
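A push-based operator with this shape might look as follows. This is a sketch in Python for brevity, not the StarGlobe code: only the method names `open`, `next`, `close`, and `write` come from the slide. The `ThresholdFilter` operator, the dictionary-based configuration and events, and the list-backed `StreamWriter` are assumptions made for illustration.

```python
from abc import ABC, abstractmethod

class StreamWriter:
    """Hypothetical stand-in for the result stream: collects items in a list."""
    def __init__(self):
        self.items = []

    def write(self, item):
        self.items.append(item)

class StreamIterator(ABC):
    """Push-based operator interface with the three calls named on the slide."""
    @abstractmethod
    def open(self, config, writer): ...

    @abstractmethod
    def next(self, event): ...

    @abstractmethod
    def close(self): ...

class ThresholdFilter(StreamIterator):
    """Hypothetical operator: forwards events whose value is >= config['min']."""
    def open(self, config, writer):
        self.minimum = config["min"]   # configuration parameters
        self.writer = writer           # writer for the result stream

    def next(self, event):
        # Called once per input element; output is pushed to the writer.
        if event["value"] >= self.minimum:
            self.writer.write(event)

    def close(self):
        self.writer = None

writer = StreamWriter()
op = ThresholdFilter()
op.open({"min": 5}, writer)
for value in [3, 7, 5, 1]:
    op.next({"value": value})  # the caller pushes each element into the operator
op.close()
print([item["value"] for item in writer.items])
```

In contrast to a pull-based iterator, the operator never asks for input; the caller drives it by invoking `next` whenever an element arrives, which suits unbounded streams.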
Communication between StreamProcessor and StreamIterator
[Figure: StreamProcessor with XML input streams 1 to n and one XML output stream; StreamHandlers 1 to n and a StreamWriter connect it to the StreamIterator. Items 1 to n flow in, a result item flows out.]
Astrophysical Example Workflow
[Figure: network of peers peer-0 to peer-10. Labels: Input List, RASS-BSC, 2MASS, FIRST, USNOB1, NVSS, GSC-2, SED assembly.]
Distributed Query Evaluation Plan
[Figure: one plan per peer, plan-0 at peer-0 to plan-10 at peer-10. The plans at peer-0 to peer-5 each contain stream-i, transform-i, and enrichσ-i. plan-6 at peer-6 contains join-0 and χ²filter-0; plan-9 at peer-9 contains join-3 and χ²filter-3. Remaining labels: plan-7 at peer-7, plan-8 at peer-8, plan-10 at peer-10, join-1, χ²filter-1, join-2, χ²filter-2, join-4, χ²filter-4, display.]
Distributed Query Evaluation Plan
[Figure: detail of the plan. stream-0, transform-0, enrichσ-0 and stream-1, transform-1, enrichσ-1 feed join-0, followed by χ²filter-0.]
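The operator chain in this plan fragment can be sketched as a pipeline of generators. The slides name the operators but do not define them, so everything below is an assumption made for illustration: that transform maps rows to a common schema, that enrichσ attaches a positional error, that join pairs sources within a coarse radius, and that the χ² filter keeps pairs whose squared separation is small relative to the combined error. The catalog rows, error values, radius, and threshold are invented.

```python
import math

def transform(rows):
    """Assumed role: map catalog rows to a common (id, ra, dec) schema in degrees."""
    for row_id, ra, dec in rows:
        yield {"id": row_id, "ra": ra, "dec": dec}

def enrich_sigma(sources, sigma_arcsec):
    """Assumed role: attach a per-catalog positional error in arcseconds."""
    for s in sources:
        yield {**s, "sigma": sigma_arcsec}

def separation_arcsec(a, b):
    """Angular separation using the small-angle approximation."""
    mean_dec = math.radians((a["dec"] + b["dec"]) / 2)
    dra = (a["ra"] - b["ra"]) * math.cos(mean_dec)
    ddec = a["dec"] - b["dec"]
    return math.hypot(dra, ddec) * 3600.0

def join(left, right, radius_arcsec):
    """Nested-loop join: pair sources closer than a coarse search radius."""
    right = list(right)
    for a in left:
        for b in right:
            d = separation_arcsec(a, b)
            if d <= radius_arcsec:
                yield a, b, d

def chi2_filter(pairs, threshold):
    """Keep pairs whose separation is plausible given both positional errors."""
    for a, b, d in pairs:
        chi2 = d * d / (a["sigma"] ** 2 + b["sigma"] ** 2)
        if chi2 <= threshold:
            yield a["id"], b["id"], chi2

# Invented sample rows: (id, right ascension, declination) in degrees.
stream_0 = [("x1", 10.0, 20.0), ("x2", 11.0, 21.0)]
stream_1 = [("o1", 10.0002, 20.0001), ("o2", 11.002, 21.0), ("o3", 50.0, 50.0)]

left = enrich_sigma(transform(stream_0), sigma_arcsec=1.0)
right = enrich_sigma(transform(stream_1), sigma_arcsec=0.5)
candidates = list(join(left, right, radius_arcsec=10.0))
matches = list(chi2_filter(candidates, threshold=9.0))
print(len(candidates), [(a, b) for a, b, _ in matches])
```

Because each stage is a generator, a match can be emitted as soon as its inputs have arrived, which mirrors the pipelining the slides emphasize. A real system would replace the nested-loop join with a spatial index.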
Evaluation of Early Filtering
Conclusion
- Synergies between research in computer science and other scientific disciplines, e.g., astrophysics
- StarGlobe
  - Handling large data volumes efficiently
    - Early filtering, parallelization, pipelining
  - Returning first results early on
    - Pipelining
  - Flexible support of domain-specific application logic
    - Mobile user-defined operators
- Results also applicable to other domains