Spatial Partitioning Techniques in ... - Spatial Hadoopspatialhadoop.cs.umn.edu/publications/ExpPartitionPoster.pdf · Hadoop Distributed File System (HDFS)... Spatial Applications

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5 10 15 20 25 35

Speedup on MapReduce with Number of Queries

1 5 10 15 20

0.1

1

10

Range Query Speedup

STR STR+ Z-curve K-d Quad Hilbert

0

1

2

3

4

5

6

7

8

1 2 3 4

Q1 (Dead Space) with block size (B)

Hilbert K-d tree Quad-tree

STR+ STR Z-Curve

Spatial Partitioning Techniques in SpatialHadoop

Ahmed Eldawy Louai Alarabi Mohamed F. MokbelDepartment of Computer Science and Engineering University of Minnesota

University of Minnesota This work is supported in part by NSF under Grants IIS-0952977 and IIS-1218168

Big Spatial Data in HDFS

Hadoop Distributed File System (HDFS)

...

Spatial Applications

K-d Tree Z-Curve Hilbert Curve

Experimental Setup

Satellite DataMedical ImagesSpace ImagesSocial Networks

Spatial Partitioning Techniques

Grid Quad-tree STR/STR+

DatasetsSummary of techniquesDataset Size Records

All Objects 88GB 250M

Buildings 25GB 109M

Roads 23GB 70M

Lakes 8.6GB 9M

Cities 1.4GB 170K

Q1= (P.width×P.height)

Quality Measures

Measures the amount

of dead area

Q3= (P.width+P.height)

Measures the squareness

of partitions

Q2= (Partitions overlap)

Measures the overlap

between partitions

Q4= (P.size)/ (Block size)

Disk utilization

Q5=Std Dev(P.count)

Skewness of partitions

(Load balance)

Range Query

Spatial Join

Throughput, running time

Running time

Quality Measures

Indexing Time

Range Query

Spatial Join

High correlation

between Q1 and

the performance

of spatial join

Bitonic behavior due to the tradeo

between quality and load balance

The block size with peak performance

depends on the query size

Single machine is preferable over

MapReduce for small queries

Single Machineis better

Quad tree performancedegrades with largequery areas due toexcessive numberof partitions

The speedup of range query is

capped by number of simultaneous

queries the system runs

Brain simulation, Climate changesEvent detection and analysis, GIS applicationVolunteer GIS applications

Q1 0.75 0.89 0.89 1.47 1.47 2.71 0.50 0.71 0.78 0.71 1.07 1.81

Q1 Quad STR+ K-d STR Hilbert Zcurve Quad STR+ K-d STR Hilbert Zcurve

0.52 Quad 276 440 463 675 749 1446 628 919 888 895 1054 1633

0.66 STR+ 419 381 439 577 613 1048 801 740 834 763 960 1346

0.71 K-d 465 404 416 560 664 1058 787 836 735 782 935 1321

0.71 STR 409 401 437 570 599 1031 834 765 795 733 928 1321

1.03 Hilbert 512 464 491 612 630 1100 1248 1085 1026 1030 1184 1646

1.88 Zcurve 740 589 608 837 780 1246 1979 1587 1519 1519 1644 2223Ro

ad

s (

23G

B)

Lakes (8.6GB) Buildings (25GB)

0.00

0.20

0.40

0.60

0.80

1.00

1.20

Q1 Q2 Q3 Q4 Q5

Quality Measures

Grid Quad-tree STR STR+ K-d Tree Z-Curve Hilbert

0.00

0.50

1.00

1.50

2.00

2.50

3.00

0.01 0.02 0.05 0.1 0.2 0.5 1

Q1 (Dead space) with sample ratio ( )

Grid STR STR+ Quad K-d Tree Zcurve Hilbert

0.01

0.10

1.00

10.00

100.00

0.01 0.02 0.05 0.1 0.2 0.5 1

Q5 (Load balance) with sample ratio ( )

Grid STR STR+ Quad K-d Tree Zcurve Hilbert

0

0.2

0.4

0.6

0.8

1

1 2 3 4

Q4 (Utilization) with block size (B)

Hilbert K-d tree Quad-tree

STR+ STR Z-Curve0

500

1000

1500

2000

2500

5 10 15 20

Indexing Time

Sampling

str str+ zcurve

kdtree quadtree hilbert

2.4

2.6

2.8

3

3.2

32.mb 64.mb 128.mb 256.mb

Block Size

STR STR+ Z-Curve K-d Tree Quad-tree Hilbert

0.00

10.00

20.00

30.00

40.00

32MB 64MB 128MB 256MB

Range query throughput with Block Size (B)

0.0001 0.01 1

Tradeo between Q1 and Q4

with HDFS block size

Partition techniques are

relatively stable across datasets

... except for Q5 which

improves with sample ratio

0

2

4

6

8

10

12

14

Cities Lakes Roads Buildings All

Q1- All Datasets

hilbert kdtree quadtree str+ str zcurve

Distribution-basedtechniques degradewith large datasetsdue to expansionof partitions

1% sample is as good

as 100% for Q1-Q4 ...

Sampling is not as scalable

as partitioning

Distribution-basedtechniques almostreach 0 skewness

Technique Partitioning Boundary Objects

Grid Space Replication

Quad-tree Space Replication

STR Data Distribution

STR+ Data Replication

K-d tree Data Replication

Z-Curve SFC Distribution

Hilbert SFC Distribution

Documents

Spatial Partitioning Techniques in ... - Spatial Hadoopspatialhadoop.cs.umn.edu/publications/ExpPartitionPoster.pdf · Hadoop Distributed File System (HDFS)... Spatial Applications