28
Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke, Angelika Reiser, Alfons Kemper 3 rd IEEE International Conference on e-Science and Grid Computing Bangalore, India December 10 th – 13 th 2007

Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

Embed Size (px)

Citation preview

Page 1: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

Lehrstuhl Informatik III: Datenbanksysteme

Community Training: Partitioning Schemes in Good Shape for Federated Data Grids

Tobias Scholl, Richard Kuntschke, Angelika Reiser, Alfons Kemper

3rd IEEE International Conference on e-Science and Grid ComputingBangalore, IndiaDecember 10th – 13th 2007

Page 2: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

2Community Training

Lehrstuhl Informatik III: Datenbanksysteme

The AstroGrid-D Project

German Astronomy Community Grid http://www.gac-grid.org/

funded by the German Ministry of Education and Research

part of the D-Grid

Page 3: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

3Community Training

Lehrstuhl Informatik III: Datenbanksysteme

The Multiwavelength Milky Way

http://adc.gsfc.nasa.gov/mw/

Page 4: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

4Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Data-intensive e-Science Applications

Many e-science application areas astrophysics geosciences climatology

Combination of various, globally distributed data sources

Increasing popularity (within community and public domain): more users

Scalability issues with current approaches

Page 5: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

5Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Up-coming Data-intensive Applications

LOFAR

LHC

Data rates Terabytes a day/night Petabytes a year

LOFAR LSST Pan-STARRS LHC

Page 6: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

6Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Current Sharing in Data Grids

Images from: “Data-Intensive Grid Computing” by Srikumar Venugopal

Data autonomy Policies allow partners to access data Each institution ensures

Availability (replication) Scalability

Various organizational structures: Centralized Hierarchical Federated Hybrid

Page 7: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

7Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Community-Driven Data Distribution

Page 8: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

8Community Training

Lehrstuhl Informatik III: Datenbanksysteme

The Training Phase

Extract data from the archives Compute partitioning schemes Compare different partitioning schemes

Standard Quadtrees Median-based Quadtrees Zones

Page 9: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

9Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Quadtrees

Well-known index structure

Recursive decomposition

Adaptive to data resolution

Page 10: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

10Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Quadtree-based Schemes: Splitting Variants

Center splitting Always bisects each

dimension congruent subareas Splitting points stored

or computed

Median heuristics Compute median for

each dimension independently

O(n) median algorithm

Splitting points stored

Page 11: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

11Community Training

Lehrstuhl Informatik III: Datenbanksysteme

The Zones Index(J. Gray, M. Nieto-Santisteban, A. Szalay) Index structure for databases Specific spatial clustering in zones Optimized for

points-in-region queries self-match, cross-match queries

Equi-distant partitioning Declination coordinate Zone(ra, dec) = floor((dec + 90.0) / h)

Implemented directly in SQLImage by J. Gray et al.

Page 12: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

12Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Evaluation Setup

2 data sets: skewed and uniform Size of data sample: 0.01%, 0.1%, 1%, 10% Number of partitions:

4, 16, 64, 256, 1024, 4096, 8192, 16384, 32768, 65536, 131072 (24 – 217)

Page 13: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

13Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Skewed Training Data (Dskew

)

Page 14: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

14Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Comparing Partitioning Schemes

Duration Average data population Variance in partition population Empty partitions Size of the training set Baseline comparison

Page 15: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

15Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Average Data Population

0

10

20

30

40

50

60

70

80

90

100

1310726553632768163848192409610242566416

aver

age

data

pop

ulat

ion

(%)

# partitions

center splitting (skewed data)median splitting (skewed data)center splitting (uniform data)

median splitting (uniform data)

10% training sample, Dskew

avg(# objects in partitions)

# objects in biggest

partition

Page 16: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

16Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Evolution of the Partitioning Scheme

4096 partitions 8192 partitions 16384 partitions

Page 17: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

17Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Empty Leaves

0

1

2

3

4

5

6

7

8

1310726553632768163848192409610242566416

em

pty

pa

rtiti

on

s (%

)

# partitions

center splittingmedian splittingzones

10% training sample, Dskew

Page 18: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

18Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Size of the Training Set

1

4

16

64

256

1024

4096

16384

1310726553632768163848192409610242566416 0

5

10

15

20

25

30

35

sam

ple

ra

tio

em

pty

pa

rtiti

ons

(%)

# partitions

sample ratiocenter splitting

median splitting

0.1% training sample, Dskew

# objects in training set

# partitions to be created

Page 19: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

19Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Standard Quadtree vs. Median Heuristics

Page 20: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

20Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Evaluation – Summary

Quadtrees: good adaption to data distribution Quadtrees: Trade-off between data load

balancing and uniformly shaped regions Median-based heuristics: best data load

balancing even for skewed data sets Zone Index: good for uniform data sets Training set needs to be sufficiently large in

order not to artificially create empty partitions

Page 21: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

21Community Training

Lehrstuhl Informatik III: Datenbanksysteme

HiSbase

Peer-to-Peer layer assigns data partitions to peers

Higher flexibility New peers are

integrated seamlessly

Page 22: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

22Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Query Submission (at Peer d)

Determine relevant regions: [1,3]

Select coordinator:Region 1

Send CoordinateQuery-message to id1:{[1,3], SQL}

Message gets routed to Peer a.

01 2

3 4

5 6

Page 23: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

23Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Query Coordination (at Peer a)

Peer a is coordinator Contact relevant

regions Collect intermediate

results Send complete result

to client

select …

select …

Page 24: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

24Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Prototype Implementation

Java-based prototype FreePastry library (Pastry implementation)

Rice University MPI-SWS

Presentations at BTW 2007, Aachen, Germany VLDB 2007, Vienna, Austria

Deployed in various settings LAN WAN (AstroGrid-D, PlanetLab)

Page 25: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

25Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Summary

Training phase Community-driven Data Grids

Domain-specific partitioning scheme Partitioning scheme supports

Data skew Region-based queries

Framework for comparing partitioning schemes

Various measures with regard to data load balancing

Page 26: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

26Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Ongoing Work

Database-driven comparison 0.1% of a Petabyte still is 1 Terabyte Feasibility of median-based techniques

Workload-aware data partitioning Heterogeneous data nodes

Page 27: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

27Community Training

Lehrstuhl Informatik III: Datenbanksysteme

Get in Touch

Database systems group, TU München Web site: http://www-db.in.tum.de E-mail: [email protected]

HiSbase http://www-db.in.tum.de/research/projects/hisbase/

Data stream management “Grid-based Data Stream Processing in e-Science”

(e-Science '06) http://www.gac-grid.de/project-products/Software/

DataStreamManagement.html

Page 28: Lehrstuhl Informatik III: Datenbanksysteme Community Training: Partitioning Schemes in Good Shape for Federated Data Grids Tobias Scholl, Richard Kuntschke,

28Community Training

Lehrstuhl Informatik III: Datenbanksysteme

AstroGrid-D Research Demo

Finding Galaxy Clusters using Grid Computing Technology

Room:Banquet

Wednesday3:30 pm to6:00 pm