
Lehrstuhl Informatik III: Datenbanksysteme

Community Training: Partitioning Schemes in Good Shape for Federated Data Grids

Tobias Scholl, Richard Kuntschke, Angelika Reiser, Alfons Kemper

3rd IEEE International Conference on e-Science and Grid Computing, Bangalore, India, December 10th – 13th, 2007


The AstroGrid-D Project

German Astronomy Community Grid http://www.gac-grid.org/

funded by the German Ministry of Education and Research

part of the D-Grid


The Multiwavelength Milky Way

http://adc.gsfc.nasa.gov/mw/


Data-intensive e-Science Applications

Many e-science application areas: astrophysics, geosciences, climatology

Combination of various, globally distributed data sources

Increasing popularity (within community and public domain): more users

Scalability issues with current approaches


Up-coming Data-intensive Applications

Data rates: terabytes a day/night, petabytes a year

LOFAR, LSST, Pan-STARRS, LHC


Current Sharing in Data Grids

Images from: “Data-Intensive Grid Computing” by Srikumar Venugopal

Data autonomy: policies allow partners to access data

Each institution ensures availability (replication) and scalability

Various organizational structures: centralized, hierarchical, federated, hybrid


Community-Driven Data Distribution


The Training Phase

Extract data from the archives

Compute partitioning schemes

Compare different partitioning schemes: standard quadtrees, median-based quadtrees, zones


Quadtrees

Well-known index structure

Recursive decomposition

Adaptive to data resolution
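The recursive decomposition can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the names `QuadNode`, `build`, and `leaves`, and the `capacity` and `max_depth` parameters, are assumptions made for the example. A region is split into four quadrants at its center until it holds at most `capacity` points, so dense areas end up with finer cells.

```python
from dataclasses import dataclass, field


@dataclass
class QuadNode:
    # Bounding box of the region covered by this node.
    x0: float
    y0: float
    x1: float
    y1: float
    points: list = field(default_factory=list)    # filled for leaves only
    children: list = field(default_factory=list)  # empty for leaves


def build(points, x0, y0, x1, y1, capacity=4, max_depth=16):
    """Recursively decompose a region until each leaf holds <= capacity points."""
    node = QuadNode(x0, y0, x1, y1)
    if len(points) <= capacity or max_depth == 0:
        node.points = list(points)
        return node
    # Center splitting: bisect both dimensions.
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    buckets = [[], [], [], []]
    for x, y in points:
        # Quadrant index: bit 0 = right half, bit 1 = upper half.
        buckets[(x >= cx) + 2 * (y >= cy)].append((x, y))
    bounds = [(x0, y0, cx, cy), (cx, y0, x1, cy),
              (x0, cy, cx, y1), (cx, cy, x1, y1)]
    node.children = [build(b, *bb, capacity, max_depth - 1)
                     for b, bb in zip(buckets, bounds)]
    return node


def leaves(node):
    """Return all leaf nodes (the partitions) below a node."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


sample = [(0.1, 0.1), (0.2, 0.2), (0.15, 0.12), (0.9, 0.9)]
tree = build(sample, 0.0, 0.0, 1.0, 1.0, capacity=1)
partitions = leaves(tree)
```

In this toy input three of the four points lie close together, so only the lower-left corner is subdivided further while the other quadrants stay coarse.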


Quadtree-based Schemes: Splitting Variants

Center splitting: always bisects each dimension, yielding congruent subareas; splitting points are stored or computed

Median heuristics: compute the median for each dimension independently (O(n) median algorithm); splitting points are stored
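The two ways of choosing a splitting point can be contrasted in a few lines of Python. This is a hedged sketch: the slides do not say which O(n) median algorithm is used, so quickselect (linear time in expectation) is swapped in here, and the function names `quickselect`, `center_split`, and `median_split` are assumptions for illustration.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based); expected O(n) time.

    Requires a non-empty sequence and 0 <= k < len(values).
    """
    values = list(values)
    while True:
        pivot = random.choice(values)
        lows = [v for v in values if v < pivot]
        pivots = [v for v in values if v == pivot]
        highs = [v for v in values if v > pivot]
        if k < len(lows):
            values = lows
        elif k < len(lows) + len(pivots):
            return pivot
        else:
            k -= len(lows) + len(pivots)
            values = highs


def center_split(x0, y0, x1, y1):
    """Center splitting: the point depends only on the region bounds,
    so it can be recomputed instead of stored."""
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def median_split(points):
    """Median heuristics: median of each dimension, computed independently.

    For an even number of points the upper median is used (an assumption).
    The result depends on the data, so it has to be stored.
    """
    mid = len(points) // 2
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (quickselect(xs, mid), quickselect(ys, mid))
```

For example, `median_split([(1, 10), (2, 30), (100, 20)])` yields `(2, 20)`, whereas center splitting of the same bounding region would ignore where the points actually lie.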

The Zones Index (J. Gray, M. Nieto-Santisteban, A. Szalay)

Index structure for databases

Specific spatial clustering in zones

Optimized for points-in-region queries and self-match, cross-match queries

Equi-distant partitioning of the declination coordinate: Zone(ra, dec) = floor((dec + 90.0) / h)

Implemented directly in SQL

Image by J. Gray et al.
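The zone formula translates directly into code. The sketch below is in Python for illustration only (the slides state that the original is implemented in SQL); `zone` and `group_by_zone` are hypothetical names, and `h` is taken to be the zone height in degrees. Right ascension does not enter the formula, so it is not a parameter of `zone` here.

```python
import math
from collections import defaultdict


def zone(dec, h):
    """Zone number of a declination (degrees) for zone height h (degrees)."""
    return math.floor((dec + 90.0) / h)


def group_by_zone(objects, h):
    """Group (ra, dec) pairs by their zone number."""
    zones = defaultdict(list)
    for ra, dec in objects:
        zones[zone(dec, h)].append((ra, dec))
    return dict(zones)
```

With `h = 1.0`, an object at declination 12.3 falls into zone 102, and the south pole (declination -90) falls into zone 0.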

Evaluation Setup

2 data sets: skewed and uniform

Size of data sample: 0.01%, 0.1%, 1%, 10%

Number of partitions: 4, 16, 64, 256, 1024, 4096, 8192, 16384, 32768, 65536, 131072 (2^4 – 2^17)

Skewed Training Data (Dskew)

Comparing Partitioning Schemes

Duration, average data population, variance in partition population, empty partitions, size of the training set, baseline comparison
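Several of these measures can be computed from nothing more than the list of per-partition object counts. The following is a minimal sketch; the function name `load_balance_measures` and the dictionary keys are assumptions, and the use of the population variance (rather than the sample variance) is a choice made for this example, not something the slides specify.

```python
from statistics import mean, pvariance


def load_balance_measures(populations):
    """Summarize how evenly objects are spread over partitions.

    populations: number of objects in each partition (one entry per partition).
    """
    return {
        "average": mean(populations),        # average data population
        "variance": pvariance(populations),  # variance in partition population
        "empty": sum(1 for n in populations if n == 0),  # empty partitions
        "largest": max(populations),         # objects in the biggest partition
    }


measures = load_balance_measures([0, 2, 4, 2])
```

A scheme that balances load well keeps the variance and the size of the largest partition low and produces few empty partitions.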

Average Data Population

[Figure: avg(# objects in partitions) and # objects in biggest partition vs. # partitions (16 to 131072); 10% training sample, Dskew; series: center splitting (skewed data), median splitting (skewed data), center splitting (uniform data), median splitting (uniform data)]

Evolution of the Partitioning Scheme

[Figure panels: 4096 partitions, 8192 partitions, 16384 partitions]

Empty Leaves

[Figure: empty leaves vs. # partitions (16 to 131072); 10% training sample, Dskew; series: center splitting, median splitting, zones]

Size of the Training Set

[Figure: size of the training set vs. # partitions (16 to 131072); 0.1% training sample, Dskew; series: sample ratio, center splitting, median splitting; labels: # objects in training set, # partitions to be created]

Standard Quadtree vs. Median Heuristics

Evaluation – Summary

Quadtrees: good adaptation to data distribution

Quadtrees: trade-off between data load balancing and uniformly shaped regions

Median-based heuristics: best data load balancing even for skewed data sets

Zone Index: good for uniform data sets

Training set needs to be sufficiently large in order not to artificially create empty partitions

HiSbase

Peer-to-Peer layer assigns data partitions to peers

Higher flexibility

New peers are integrated seamlessly

Query Submission (at Peer d)

Determine relevant regions: [1,3]

Select coordinator: Region 1

Send CoordinateQuery message to id1: {[1,3], SQL}

Message gets routed to Peer a.
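The submission steps can be sketched in Python as follows. Everything here is illustrative and not taken from the HiSbase prototype (which is Java-based): the region layout, the function names, the dictionary-shaped message, and the policy of choosing the lowest-numbered relevant region as coordinator are all assumptions. The slides only show that Region 1 is chosen from [1,3], not how.

```python
def overlaps(a, b):
    """True if two axis-aligned boxes (x0, y0, x1, y1) intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1


def relevant_regions(regions, query_box):
    """Ids of all regions whose area intersects the query box."""
    return sorted(rid for rid, box in regions.items()
                  if overlaps(box, query_box))


def coordinate_query_message(regions, query_box, sql):
    """Build the message sent towards the coordinator region."""
    ids = relevant_regions(regions, query_box)
    if not ids:
        raise ValueError("query does not touch any region")
    coordinator = ids[0]  # assumption: pick the first relevant region
    message = {"type": "CoordinateQuery", "regions": ids, "sql": sql}
    return coordinator, message


# Hypothetical 2x2 layout of regions 0..3.
regions = {0: (0, 0, 1, 1), 1: (1, 0, 2, 1), 2: (0, 1, 1, 2), 3: (1, 1, 2, 2)}
coordinator, message = coordinate_query_message(
    regions, (1.2, 0.5, 1.8, 1.5), "select ...")
```

In this layout the query box touches regions 1 and 3, so the message carries `[1, 3]` and is addressed to region 1; the peer-to-peer layer would then route it to whichever peer is responsible for that region.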

Query Coordination (at Peer a)

Peer a is coordinator

Contact relevant regions

Collect intermediate results

Send complete result to client

select …

Prototype Implementation

Java-based prototype

FreePastry library (Pastry implementation): Rice University, MPI-SWS

Presentations at BTW 2007, Aachen, Germany, and VLDB 2007, Vienna, Austria

Deployed in various settings: LAN, WAN (AstroGrid-D, PlanetLab)

Summary

Training phase

Community-driven Data Grids

Domain-specific partitioning scheme

Partitioning scheme supports data skew and region-based queries

Framework for comparing partitioning schemes

Various measures with regard to data load balancing

Ongoing Work

Database-driven comparison

0.1% of a Petabyte still is 1 Terabyte

Feasibility of median-based techniques

Workload-aware data partitioning

Heterogeneous data nodes

Get in Touch

Database systems group, TU München

Web site: http://www-db.in.tum.de

E-mail: scholl@in.tum.de

HiSbase: http://www-db.in.tum.de/research/projects/hisbase/

Data stream management: “Grid-based Data Stream Processing in e-Science” (e-Science '06), http://www.gac-grid.de/project-products/Software/DataStreamManagement.html

AstroGrid-D Research Demo

Finding Galaxy Clusters using Grid Computing Technology

Room: Banquet

Wednesday, 3:30 pm to 6:00 pm
