Lehrstuhl Informatik III: Datenbanksysteme
Grid-based Data Stream Processing in e-Science
Richard Kuntschke (1), Tobias Scholl (1), Sebastian Huber (1), Alfons Kemper (1), Angelika Reiser (1), Hans-Martin Adorf (2), Gerard Lemson (3), and Wolfgang Voges (3)
(1) Lehrstuhl Informatik III: Datenbanksysteme, Fakultät für Informatik, Technische Universität München
(2) Max-Planck-Institut für Astrophysik
(3) Max-Planck-Institut für extraterrestrische Physik
Important Challenges in e-Science
In general:
- Large and exponentially growing amounts of data
- Distributed data archives
- No unique identifiers
- Uncertainty
In astrophysics: Spectral Energy Distributions (SEDs)
- Used to classify celestial objects (active galactic nuclei, brown dwarfs, neutron stars, ...)
- Generation requires spatial (astrometric) matching
Spatial (Astrometric) Matching
Current solutions ...
- ... load all data into main memory
  - Uses a lot of memory
  - Infeasible if memory size is insufficient
- ... process all data at once and deliver the complete result at the end
  - Inefficient
  - No results until all processing has completed
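By contrast, a streaming matcher only needs the input list in memory and can emit each match as soon as the corresponding catalog record arrives. The following is a minimal Python sketch of that idea, not StarGlobe's implementation. It assumes positions given as (id, right ascension, declination) in degrees and a fixed match radius; the names `angular_separation_deg` and `stream_match` are hypothetical.

```python
import math
from typing import Iterable, Iterator, List, Tuple

# (object id, right ascension, declination), angles in degrees (assumed layout)
Position = Tuple[str, float, float]


def angular_separation_deg(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle distance between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def stream_match(input_list: List[Position],
                 catalog: Iterable[Position],
                 radius_deg: float) -> Iterator[Tuple[str, str, float]]:
    """Yield (source id, catalog id, separation) for every pair within radius_deg.

    The catalog is consumed one record at a time, so memory use does not
    grow with the catalog size and results appear before the scan ends.
    """
    for cat_id, cat_ra, cat_dec in catalog:
        for src_id, src_ra, src_dec in input_list:
            d = angular_separation_deg(src_ra, src_dec, cat_ra, cat_dec)
            if d <= radius_deg:
                yield (src_id, cat_id, d)
```

Because `stream_match` is a generator, a consumer can start processing the first match while the rest of the catalog is still being read. The nested loop over the input list is the simplest possible choice; a spatial index would be the natural refinement for long input lists.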
Our Contributions
- StarGlobe: Grid-based P2P Data Stream Management System implemented on top of Globus
  - In-network processing
  - Early filtering
  - Parallelization
  - Pipelining
  - Load-balancing
  - Mobile user-defined operators
- Astrophysical Example Workflow
  - Astrometric matching
  - Performance evaluation
The StarGlobe Architecture
[Figure: super-peer backbone with published streams (Stream 0, Stream 1), subscribing queries (Query 1, Query 2), filter and transform operators, and a function provider (Fct-Provider) from which mobile operators are loaded.]
Traditional Approach: Bring Data to Code
[Figure: the tables T of data providers A, B, C, and D feed a single union operator followed by NN_10.]
New Approach: Bring Code to Data
[Figure: each data provider (A, B, C, D) applies scan and NN_10 to its own table T; the partial results feed a union operator followed by a final NN_10. The NN_10 operator is supplied by the function provider (Fct-Provider).]
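The two plans can be compared with a small Python sketch. It assumes that NN_10 stands for "the 10 nearest neighbors of a reference position", which the slides do not spell out, and it uses Euclidean distance on plain (id, x, y) rows for simplicity. All function names are hypothetical.

```python
import heapq
import math
from itertools import chain


def nearest_neighbors(rows, target, k=10):
    """Return the k rows closest to target, nearest first; rows are (id, x, y)."""
    return heapq.nsmallest(
        k, rows, key=lambda r: math.hypot(r[1] - target[0], r[2] - target[1])
    )


def bring_data_to_code(providers, target, k=10):
    """Traditional plan: ship every row to one site, then union and NN_k."""
    shipped = list(chain.from_iterable(providers.values()))
    return nearest_neighbors(shipped, target, k), len(shipped)


def bring_code_to_data(providers, target, k=10):
    """New plan: NN_k runs at each provider, so at most k rows leave each site."""
    partial = [nearest_neighbors(rows, target, k) for rows in providers.values()]
    shipped = list(chain.from_iterable(partial))
    return nearest_neighbors(shipped, target, k), len(shipped)
```

Both functions return the same neighbors, because every row in the global top k is necessarily in the top k of its own provider. The second return value counts the rows that had to be transferred, which is where the two plans differ.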
Mobile User-Defined Operators
Load user-defined operators from function provider servers in the network
Common interface for integrating external operators
Push-based iterator
Flexibility
StreamIterator Interface
- open(Config, StreamWriter)
  - Configuration parameters
  - Writer for result stream
- next(StreamIteratorEvent)
  - Next element in input stream
  - Writing output to result stream using StreamWriter.write()
- close()
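A push-based operator built on this interface could look like the following Python sketch. Only the method names `open`, `next`, `close`, and `StreamWriter.write` come from the slide; the Python rendering, the fields of `StreamIteratorEvent`, the configuration key `max_magnitude`, and the `MagnitudeFilter` operator are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class StreamWriter:
    """Stand-in for the writer of the result stream; collects items in a list."""

    def __init__(self):
        self.items = []

    def write(self, item):
        self.items.append(item)


class StreamIteratorEvent:
    """Carries the next element of an input stream (field name is assumed)."""

    def __init__(self, item):
        self.item = item


class StreamIterator(ABC):
    """Push-based iterator: the engine calls next() once per arriving element."""

    @abstractmethod
    def open(self, config, writer): ...

    @abstractmethod
    def next(self, event): ...

    @abstractmethod
    def close(self): ...


class MagnitudeFilter(StreamIterator):
    """Hypothetical user-defined operator: keeps items up to a magnitude limit."""

    def open(self, config, writer):
        self.limit = config["max_magnitude"]  # configuration parameter (assumed key)
        self.writer = writer                  # writer for the result stream

    def next(self, event):
        if event.item["magnitude"] <= self.limit:
            self.writer.write(event.item)

    def close(self):
        self.writer = None
```

In contrast to a pull-based iterator, the operator never asks for input. The caller drives it by invoking `open` once, `next` for each element, and `close` at the end, and the operator decides per element whether to write anything to the result stream.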
Communication between StreamProcessor and StreamIterator
[Figure: the StreamProcessor connects XML input streams 1 to n, via StreamHandlers 1 to n, to the StreamIterator (items 1 to n), and connects the StreamIterator, via the StreamWriter, to the XML output stream (result items).]
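The interplay of these components might be sketched as follows in Python. The class names follow the figure labels, but their methods, the event layout, and the `TaggingUnion` operator are hypothetical; the slides do not show how the real StreamProcessor schedules its handlers.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# Assumed event layout: which input stream the item came from, plus the parsed item.
StreamIteratorEvent = namedtuple("StreamIteratorEvent", ["stream_index", "element"])


class StreamWriter:
    """Serializes result items onto the (here: in-memory) XML output stream."""

    def __init__(self):
        self.output = []

    def write(self, element):
        self.output.append(ET.tostring(element, encoding="unicode"))


class TaggingUnion:
    """Hypothetical operator: forwards every item, tagged with its source stream."""

    def open(self, config, writer):
        self.writer = writer

    def next(self, event):
        event.element.set("source", str(event.stream_index))
        self.writer.write(event.element)

    def close(self):
        pass


class StreamHandler:
    """Parses items of one XML input stream and pushes them to the iterator."""

    def __init__(self, stream_index, iterator):
        self.stream_index = stream_index
        self.iterator = iterator

    def handle(self, xml_text):
        self.iterator.next(StreamIteratorEvent(self.stream_index, ET.fromstring(xml_text)))


class StreamProcessor:
    """Owns one handler per input stream and a single writer for the output."""

    def __init__(self, iterator, n_inputs):
        self.iterator = iterator
        self.writer = StreamWriter()
        self.handlers = [StreamHandler(i, iterator) for i in range(1, n_inputs + 1)]

    def run(self, arrivals):
        """arrivals: iterable of (stream_index, xml_text) in arrival order."""
        self.iterator.open({}, self.writer)
        for stream_index, xml_text in arrivals:
            self.handlers[stream_index - 1].handle(xml_text)
        self.iterator.close()
        return self.writer.output
```

The operator itself never touches XML parsing or serialization. Handlers and writer hide both ends of the stream, which is what allows externally supplied operators to be plugged in behind a common interface.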
Astrophysical Example Workflow
[Figure: workflow across peers peer-0 to peer-10. Data sources shown: Input List RASS-BSC, 2MASS, FIRST, USNOB1, NVSS, GSC-2. Final step: SED assembly.]
Distributed Query Evaluation Plan
[Figure: one subplan per peer; operators listed as labeled in the figure.]
plan-0 at peer-0: enrich, σ-0, transform-0, stream-0
plan-1 at peer-1: enrich, σ-1, transform-1, stream-1
plan-2 at peer-2: enrich, σ-2, transform-2, stream-2
plan-3 at peer-3: enrich, σ-3, transform-3, stream-3
plan-4 at peer-4: enrich, σ-4, transform-4, stream-4
plan-5 at peer-5: enrich, σ-5, transform-5, stream-5
plan-6 at peer-6: χ² filter-0, join-0
plan-7 at peer-7: χ² filter-1, join-1
plan-8 at peer-8: χ² filter-2, join-2
plan-9 at peer-9: χ² filter-3, join-3
plan-10 at peer-10: χ² filter-4, join-4, display
Distributed Query Evaluation Plan
plan-6 at peer-6: χ² filter-0, join-0
plan-1 at peer-1: enrich, σ-1, transform-1, stream-1
plan-0 at peer-0: enrich, σ-0, transform-0, stream-0
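To make this part of the plan more concrete, the following Python sketch chains two leaf subplans into a join and a χ² filter using generators. The slides name the operators but not their semantics, so everything here is an assumption: transform maps rows to a common schema, σ is a selection, enrich attaches a positional uncertainty, the join pairs items within a coarse box, and the χ² filter keeps pairs whose squared offset divided by the summed variances stays below a threshold. The distance is computed on a flat sky without the cos(dec) correction.

```python
def leaf_plan(stream, in_region, sigma):
    """Leaf subplan (e.g. plan-0 or plan-1), evaluated lazily item by item."""
    for obj_id, ra, dec in stream:                    # stream-i
        item = {"id": obj_id, "ra": ra, "dec": dec}   # transform-i: common schema (assumed)
        if not in_region(item):                       # sigma-i: selection at the source
            continue
        item["sigma"] = sigma                         # enrich: positional error (assumed)
        yield item


def join(left, right, radius):
    """Nested-loop join: pair items whose coordinates differ by at most radius."""
    right_items = list(right)  # the right input is materialized once
    for l in left:
        for r in right_items:
            if abs(l["ra"] - r["ra"]) <= radius and abs(l["dec"] - r["dec"]) <= radius:
                yield l, r


def chi2_filter(pairs, threshold):
    """Keep pairs whose offset is consistent with the combined uncertainties."""
    for l, r in pairs:
        offset_sq = (l["ra"] - r["ra"]) ** 2 + (l["dec"] - r["dec"]) ** 2
        chi2 = offset_sq / (l["sigma"] ** 2 + r["sigma"] ** 2)
        if chi2 <= threshold:
            yield l["id"], r["id"], chi2


def plan_6(stream_0, stream_1, in_region, radius, threshold):
    """Subplan at the joining peer: join the two leaf outputs, then filter."""
    left = leaf_plan(stream_0, in_region, sigma=0.001)
    right = leaf_plan(stream_1, in_region, sigma=0.001)
    return chi2_filter(join(left, right, radius), threshold)
```

Since the selection runs inside each leaf subplan, rows outside the region of interest are dropped before they reach the joining peer, which is the effect the early-filtering strategy aims for.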
Distributed Query Evaluation Plan
plan-10 at peer-10: χ² filter-4, join-4, display
Evaluation of Early Filtering
Conclusion
- Synergies between research in computer science and other scientific disciplines, e.g., astrophysics
- StarGlobe
  - Handling large data volumes efficiently: early filtering, parallelization, pipelining
  - Returning first results early on: pipelining
  - Flexible support of domain-specific application logic: mobile user-defined operators
- Results also applicable to other domains