Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
0
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5 10 15 20 25 35
Speedup on MapReduce with Number of Queries
1 5 10 15 20
0.1
1
10
Range Query Speedup
STR STR+ Z-curve K-d Quad Hilbert
0
1
2
3
4
5
6
7
8
1 2 3 4
Q1 (Dead Space) with block size (B)
Hilbert K-d tree Quad-tree
STR+ STR Z-Curve
Spatial Partitioning Techniques in SpatialHadoop
Ahmed Eldawy Louai Alarabi Mohamed F. MokbelDepartment of Computer Science and Engineering University of Minnesota
University of Minnesota This work is supported in part by NSF under Grants IIS-0952977 and IIS-1218168
Big Spatial Data in HDFS
Hadoop Distributed File System (HDFS)
...
Spatial Applications
K-d Tree Z-Curve Hilbert Curve
Experimental Setup
Satellite DataMedical ImagesSpace ImagesSocial Networks
Spatial Partitioning Techniques
Grid Quad-tree STR/STR+
DatasetsSummary of techniquesDataset Size Records
All Objects 88GB 250M
Buildings 25GB 109M
Roads 23GB 70M
Lakes 8.6GB 9M
Cities 1.4GB 170K
Q1= (P.width×P.height)
Quality Measures
Measures the amount
of dead area
Q3= (P.width+P.height)
Measures the squareness
of partitions
Q2= (Partitions overlap)
Measures the overlap
between partitions
Q4= (P.size)/ (Block size)
Disk utilization
Q5=Std Dev(P.count)
Skewness of partitions
(Load balance)
Range Query
Spatial Join
Throughput, running time
Running time
Quality Measures
Indexing Time
Range Query
Spatial Join
High correlation
between Q1 and
the performance
of spatial join
Bitonic behavior due to the tradeo
between quality and load balance
The block size with peak performance
depends on the query size
Single machine is preferable over
MapReduce for small queries
Single Machineis better
Quad tree performancedegrades with largequery areas due toexcessive numberof partitions
The speedup of range query is
capped by number of simultaneous
queries the system runs
Brain simulation, Climate changesEvent detection and analysis, GIS applicationVolunteer GIS applications
Q1 0.75 0.89 0.89 1.47 1.47 2.71 0.50 0.71 0.78 0.71 1.07 1.81
Q1 Quad STR+ K-d STR Hilbert Zcurve Quad STR+ K-d STR Hilbert Zcurve
0.52 Quad 276 440 463 675 749 1446 628 919 888 895 1054 1633
0.66 STR+ 419 381 439 577 613 1048 801 740 834 763 960 1346
0.71 K-d 465 404 416 560 664 1058 787 836 735 782 935 1321
0.71 STR 409 401 437 570 599 1031 834 765 795 733 928 1321
1.03 Hilbert 512 464 491 612 630 1100 1248 1085 1026 1030 1184 1646
1.88 Zcurve 740 589 608 837 780 1246 1979 1587 1519 1519 1644 2223Ro
ad
s (
23G
B)
Lakes (8.6GB) Buildings (25GB)
0.00
0.20
0.40
0.60
0.80
1.00
1.20
Q1 Q2 Q3 Q4 Q5
Quality Measures
Grid Quad-tree STR STR+ K-d Tree Z-Curve Hilbert
0.00
0.50
1.00
1.50
2.00
2.50
3.00
0.01 0.02 0.05 0.1 0.2 0.5 1
Q1 (Dead space) with sample ratio ( )
Grid STR STR+ Quad K-d Tree Zcurve Hilbert
0.01
0.10
1.00
10.00
100.00
0.01 0.02 0.05 0.1 0.2 0.5 1
Q5 (Load balance) with sample ratio ( )
Grid STR STR+ Quad K-d Tree Zcurve Hilbert
0
0.2
0.4
0.6
0.8
1
1 2 3 4
Q4 (Utilization) with block size (B)
Hilbert K-d tree Quad-tree
STR+ STR Z-Curve0
500
1000
1500
2000
2500
5 10 15 20
Indexing Time
Sampling
str str+ zcurve
kdtree quadtree hilbert
2.4
2.6
2.8
3
3.2
32.mb 64.mb 128.mb 256.mb
Block Size
STR STR+ Z-Curve K-d Tree Quad-tree Hilbert
0.00
10.00
20.00
30.00
40.00
32MB 64MB 128MB 256MB
Range query throughput with Block Size (B)
0.0001 0.01 1
Tradeo between Q1 and Q4
with HDFS block size
Partition techniques are
relatively stable across datasets
... except for Q5 which
improves with sample ratio
0
2
4
6
8
10
12
14
Cities Lakes Roads Buildings All
Q1- All Datasets
hilbert kdtree quadtree str+ str zcurve
Distribution-basedtechniques degradewith large datasetsdue to expansionof partitions
1% sample is as good
as 100% for Q1-Q4 ...
Sampling is not as scalable
as partitioning
Distribution-basedtechniques almostreach 0 skewness
Technique Partitioning Boundary Objects
Grid Space Replication
Quad-tree Space Replication
STR Data Distribution
STR+ Data Replication
K-d tree Data Replication
Z-Curve SFC Distribution
Hilbert SFC Distribution