SpatialHadoop: A MapReduce Framework for Spatial Data
Authors: Ahmed Eldawy, Mohamed F. Mokbel
Publication: ICDE 2015
Contents
• 0. Abstract
• 1. Background
• 2. Related Work
• 3. Architecture
• 4-7. Four layers
• 8. Experiments
Abstract
• SpatialHadoop: a full-fledged MapReduce framework with native support for spatial data
• It is a comprehensive extension to Hadoop that injects spatial data awareness into each Hadoop layer:
1. Language: Pigeon
2. Storage: two-level spatial index
3. MapReduce: SpatialFileSplitter, SpatialRecordReader
4. Operations: range query, kNN, spatial join
Background
• Motivations:
1. Hadoop: a solution for scalable processing of huge datasets
2. The recent explosion of spatial data
• Present: researchers and practitioners worldwide have started to take advantage of the MapReduce environment to support large-scale spatial data:
- Industry: GIS tools on Hadoop
- Academia: Parallel-Secondo, MD-HBase, Hadoop-GIS
Background
• Drawback: these systems treat Hadoop as a black box and are therefore limited by the limitations of the existing Hadoop system.
• Take Hadoop-GIS as an example:
1. Hadoop treats spatial data as non-spatial data, without additional support
2. It supports only a uniform grid index, applicable only to uniformly distributed data
3. MapReduce programs cannot access the constructed spatial index
• Parallel-Secondo, MD-HBase and the ESRI tools on Hadoop suffer from similar drawbacks.
Background
• SpatialHadoop:
1. Built inside the Hadoop base code
2. Able to support a set of spatial index structures
3. Users can develop a myriad of spatial functions, including range queries, kNN and spatial join
• Difference:
Background
• SpatialHadoop has four main layers:
1. Language layer: Pigeon
2. Storage layer: two-level index structure
3. MapReduce layer: SpatialFileSplitter, SpatialRecordReader
4. Operations layer: encapsulates a dozen spatial operations
Related Work
• Existing work can be classified into two categories:
1. Specific spatial operations
2. Systems
Related Work
• Specific operations:
1. R-tree construction
2. Range query
3. kNN query
4. All-NN query
5. Reverse NN query
6. Spatial join
7. kNN join
Related Work
• Systems:
1. Hadoop-GIS
2. MD-HBase
3. Parallel-Secondo
Architecture
• Three types of users:
- Casual user
- Developer
- System admin
• Four layers:
- Language
- Operations
- MapReduce
- Storage
Architecture
• The language layer: Pigeon, a high-level SQL-like language that supports OGC-compliant spatial data types (e.g., Point and Polygon) and operations (e.g., Overlaps and Touches)
• The storage layer: a two-level index structure of global and local indexing, implementing three standard indexes: Grid file, R-tree and R+-tree
Architecture
• The MapReduce layer
- SpatialFileSplitter: uses the global index to prune file blocks that do not contribute to the answer
- SpatialRecordReader: uses the local index to retrieve a partial answer from each block
• The operations layer: encapsulates the implementation of various spatial operations that take advantage of the spatial indexes and the new components in the MapReduce layer
Language Layer
• Background: a set of declarative SQL-like languages has been proposed: HiveQL, Pig Latin, SCOPE and YSmart
• Pigeon: an extension to the Pig Latin language, adding spatial data types, functions and operations that conform to the OGC standard.
Language Layer
• Data types: overrides bytearray to support spatial data types such as Point, LineString and Polygon
  lakes = LOAD 'lakes' AS (id:int, area:polygon);
• Spatial functions: provides spatial functions including aggregate functions (e.g., Union), predicates (e.g., Overlaps) and others (e.g., Buffer)
  houses_with_distance = FOREACH houses GENERATE id, Distance(house_loc, sc_loc);
• kNN query: a new KNN statement
  nearest_houses = KNN houses WITH_K=100 USING Distance(house_loc, query_loc);
Language Layer
Pigeon overrides the following two Pig Latin statements:
• FILTER: accepts a spatial predicate and calls the corresponding procedure for range queries
  houses_in_range = FILTER houses BY Overlaps(house_loc, query_range);
• JOIN: accepts spatial files and forwards them to the corresponding spatial join procedure
  lakes_states = JOIN lakes BY lakes_boundary states BY states_boundary PREDICATE = Overlaps;
Storage Layer
• Background:
1. Input files in Hadoop are non-indexed heap files
2. SpatialHadoop adds index structures inside HDFS
Indexing in SpatialHadoop is the key to its superior performance over Hadoop.
• Challenges:
1. Traditional index structures are optimized for procedural programs
2. A file in HDFS can only be written sequentially, while traditional indexes are constructed incrementally
Storage Layer
• Existing techniques for spatial indexing in Hadoop:
1. Build only: an R-tree is constructed with a MapReduce approach but queried outside MapReduce using other techniques
2. Custom on-the-fly indexing: a non-standard index is created and discarded with each query execution
3. Indexing in HDFS: supports only range queries on trajectory data, which is quite limited
Storage Layer
• Overview:
Storage Layer
• How the challenges are overcome:
1. Local indexes can be processed in parallel
2. The small size of each local index allows it to be bulk loaded in memory and written to a file in an append-only manner
• Generic way of building an index:
1. Partitioning
2. Local indexing
3. Global indexing
Storage Layer
• Partitioning
Main goals: blocks that fit in HDFS blocks, spatial locality, load balancing
Three steps:
1. Calculate the number of partitions n
2. Decide partition boundaries
3. Physical partitioning
Storage Layer
• 1. Calculate the number of partitions n

  n = ⌈S(1 + α) / B⌉

S: input file size
B: HDFS block capacity (64 MB)
α: overhead ratio, set to 0.2 by default, reserving room for the local index within each block
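As a quick sanity check, the formula above can be computed directly; a minimal Python sketch (the 1 GB example input is hypothetical, not from the paper):

```python
from math import ceil

def num_partitions(file_size, block_size, alpha=0.2):
    """n = ceil(S * (1 + alpha) / B): the overhead ratio alpha reserves
    extra room so that each partition's local index still fits in a block."""
    return ceil(file_size * (1 + alpha) / block_size)

# e.g. a 1 GB file with 64 MB blocks and the default 0.2 overhead
print(num_partitions(1024 * 2**20, 64 * 2**20))  # 20
```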
Storage Layer
• 2. Partition boundaries
- Decide the spatial area covered by each single partition, defined by a rectangle
- Boundaries are calculated differently according to the underlying index being constructed, to accommodate the data distribution
- The output of this step is a set of n rectangles representing the boundaries of the n partitions
Storage Layer
• 3. Physical partitioning
- Initiate a MapReduce job that physically partitions the input file
- The challenge here is deciding what to do with objects with spatial extents (e.g., polygons) that overlap more than one partition
- At the end, for each record r assigned to a partition p, the map function writes an intermediate pair <p, r>. Such pairs are then grouped by p and sent to the reduce function for the next phase
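The map side of the physical-partitioning step can be sketched as follows (a minimal Python sketch; `rect_overlaps` and the first-match fallback are illustrative assumptions, not the paper's exact assignment policy):

```python
def rect_overlaps(a, b):
    """Axis-aligned rectangles given as (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def partition_map(record_mbr, partitions, replicate=True):
    """Return the partition ids a record is assigned to; the job would
    emit one <p, r> pair per id.  With replication (Grid file, R+-tree)
    the record goes to every partition it overlaps; without it (R-tree)
    we keep only the first match as a stand-in for best-fit assignment."""
    hits = [i for i, p in enumerate(partitions) if rect_overlaps(record_mbr, p)]
    return hits if replicate else hits[:1]

# two side-by-side partitions and a record straddling their border
parts = [(0, 0, 5, 5), (5, 0, 10, 5)]
print(partition_map((4, 1, 6, 2), parts))         # [0, 1]
print(partition_map((4, 1, 6, 2), parts, False))  # [0]
```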
Storage Layer
• Local indexing
- Purpose: build the requested index structure (e.g., Grid or R-tree) as a local index on the data contents of each physical partition
- Building the requested index structure is realized as a reduce function that takes the records assigned to each partition, stores them in a spatial index, and writes that index to a local index file
- Each local index has to fit in one HDFS block, for two reasons:
(1) It allows spatial operations written as MapReduce programs to access local indexes such that each local index is processed in one map task
(2) It ensures that the local index is treated by the Hadoop load balancer as one unit when it relocates blocks across machines
Storage Layer
• Global indexing
- Build the requested structure as a global index that indexes all partitions.
- Process:
1. Initiate an HDFS concat command to concatenate all local indexes into one file
2. The master node builds an in-memory global index that indexes all file blocks, using their rectangular boundaries as the index key
Storage Layer
• Global indexing (ctd.)
• The global index is:
1. Constructed using bulk loading
2. Kept in main memory at all times
3. Lazily reconstructed if the master node fails and restarts
Storage Layer - Grid file
• Definition: a simple flat index that partitions the data according to a grid such that the records overlapping each grid cell are stored in one file block as a single partition, assuming data is uniformly distributed
• Partitioning:
- 1. Calculate the number of partitions n
- 2. Create a uniform grid of ⌈√n⌉ × ⌈√n⌉ cells over the space domain and take the boundaries of the grid cells as partition boundaries
- 3. A record r with a spatial extent is replicated to every grid cell it overlaps
Storage Layer - Grid file
• Local indexing: the records of each grid cell are just written to a heap file without building any local indexes
• Global indexing: concatenate all these files and build the global index, which is a two-dimensional directory table pointing to the corresponding blocks in the concatenated file
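The two-dimensional directory table can be pictured with a small Python sketch (the `directory` contents, cell layout and offsets here are hypothetical illustrations):

```python
def cell_of(point, space_mbr, grid_size):
    """Map a point to its (col, row) cell in a uniform grid over space_mbr."""
    x, y = point
    x1, y1, x2, y2 = space_mbr
    col = min(int((x - x1) / (x2 - x1) * grid_size), grid_size - 1)
    row = min(int((y - y1) / (y2 - y1) * grid_size), grid_size - 1)
    return col, row

# hypothetical directory: grid cell -> offset of its block in the
# concatenated index file
directory = {(0, 0): 0, (1, 0): 64, (0, 1): 128, (1, 1): 192}
print(directory[cell_of((7.5, 2.0), (0, 0, 10, 10), 2)])  # 64
```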
R-tree
• An R-tree is a height-balanced tree, similar to a B-tree, with index records in its leaf nodes containing pointers to data objects
• Spatial databases: tuples (representing spatial objects) + identifiers
• In an R-tree:
- Leaf node entry: <I, identifier>
- Non-leaf node entry: <I, child-pointer>
where I is an n-dimensional rectangle
R-tree
• Properties:
(M: the maximum number of entries that will fit in one node)
(m: a parameter specifying the minimum number of entries in a node)
1. Every leaf node contains between m and M index records unless it is the root
2. For each index record (I, identifier) in a leaf node, I is the smallest rectangle that spatially contains the n-dimensional data object represented by the indicated tuple
3. Every non-leaf node has between m and M children unless it is the root
R-tree
• Properties (ctd.):
4. For each entry (I, child-pointer) in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node
5. The root node has at least two children unless it is a leaf
6. All leaves appear on the same level
Storage Layer - (R-tree)
• Partitioning
- To compute partition boundaries, a random sample of the input file is bulk loaded into an in-memory R-tree using the Sort-Tile-Recursive (STR) algorithm
- (details)
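The STR bulk loading used above can be sketched compactly in Python: sort the sample by x, cut it into roughly √n vertical slabs, sort each slab by y and cut again. This is an illustrative sketch of the standard algorithm, not SpatialHadoop's code:

```python
from math import ceil, sqrt

def str_partitions(sample, n):
    """Sketch of STR (Sort-Tile-Recursive) tiling over a point sample,
    yielding n partition boundaries as (x1, y1, x2, y2) rectangles."""
    s = ceil(sqrt(n))            # slabs per axis
    pts = sorted(sample)         # sort by x (then y)
    slab = ceil(len(pts) / s)
    bounds = []
    for i in range(0, len(pts), slab):
        strip = sorted(pts[i:i + slab], key=lambda p: p[1])  # sort by y
        cell = ceil(len(strip) / s)
        for j in range(0, len(strip), cell):
            chunk = strip[j:j + cell]
            xs = [p[0] for p in chunk]
            ys = [p[1] for p in chunk]
            bounds.append((min(xs), min(ys), max(xs), max(ys)))
    return bounds

# a 4x4 point grid split into n = 4 partitions of 4 points each
print(str_partitions([(x, y) for x in range(4) for y in range(4)], 4))
```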
Storage Layer - (R-tree)
• Local indexing:
- The records of each partition are bulk loaded into an R-tree using the STR algorithm, then dumped into a file
- Each block in a local index file is annotated with the minimum bounding rectangle (MBR) of its contents
- Partitions might end up overlapping, similar to traditional R-tree nodes
• Global indexing:
- Concatenate all local index files and create the global index by bulk loading all blocks into an R-tree, using their MBRs as the index key
R+-tree
• Differences from the R-tree:
- Nodes are not guaranteed to be at least half filled
- The entries of any internal node do not overlap
- An object ID may be stored in more than one leaf node
• Advantages:
- Point query performance improves
- A single path is followed and fewer nodes are visited than with the R-tree
Storage Layer - (R+-tree)
• Definition: the R+-tree is a variation of the R-tree where nodes at each level are kept disjoint, while records overlapping multiple nodes are replicated to each node to ensure efficient query answering
• Similar to the R-tree except for three changes:
- 1. In the physical partitioning step, each record is replicated to every partition it overlaps
- 2. In the local indexing phase, the records of each partition are inserted into an R+-tree, which is then dumped to a local index file
- 3. The global index is constructed based on the partition boundaries computed in the partitioning phase rather than the MBRs of the contents, as boundaries should remain disjoint
MapReduce Layer
• Comparison:
• Hadoop:
- 1. The input file goes through a FileSplitter that divides it into n splits, where n is set by the MapReduce program based on the number of available slave nodes.
- 2. Then, each split goes through a RecordReader that extracts records as key-value pairs, which are passed to the map function
• SpatialHadoop:
- 1. SpatialFileSplitter, an extended splitter that exploits the global index(es) on the input file(s) to prune early the file blocks that do not contribute to the answer
- 2. SpatialRecordReader, which reads a split originating from spatially indexed input file(s) and exploits the local indexes to process it efficiently
MapReduce Layer
• Comparison (ctd.)
MapReduce Layer
• SpatialFileSplitter
• Takes:
- 1. One or two input files
- 2. A filter function
• One input file:
- The SpatialFileSplitter applies the filter function to the global index of the input file to select the file blocks, based on their MBRs, that should be processed by the job
- For example, a range query job provides a filter function that prunes file blocks whose MBRs are completely outside the query range. For each selected file block, the SpatialFileSplitter creates a file split, to be processed later by the SpatialRecordReader
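The pruning filter described above amounts to an MBR overlap test against the global index; a minimal Python sketch (the `(block_id, mbr)` list standing in for the global index is a hypothetical structure):

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap; rectangles are (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def range_filter(global_index, query):
    """Global-filter step: keep only the blocks whose MBR overlaps the
    query range; each surviving block would become one file split."""
    return [bid for bid, mbr in global_index if overlaps(mbr, query)]

blocks = [(0, (0, 0, 5, 5)), (1, (5, 0, 10, 5)), (2, (0, 5, 5, 10))]
print(range_filter(blocks, (4, 4, 6, 6)))  # [0, 1, 2]
```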
MapReduce Layer
• SpatialFileSplitter (ctd.)
• Two input files, similar to one input file with two subtle differences:
- 1. The filter function is applied to two global indexes, each corresponding to one input file
- 2. The output of the SpatialFileSplitter is a combined split that contains a pair of file ranges (i.e., file offsets and lengths) corresponding to the two blocks selected by the filter function
MapReduce Layer
• SpatialRecordReader
The SpatialRecordReader takes either a split or a combined split and parses it to generate key-value pairs to be passed to the map function. It parses the block to extract the local index, which acts as an access method to all records in the block.
MapReduce Layer
• SpatialRecordReader (ctd.)
The record reader sends all the records to the map function, indexed by the local index, with two main benefits:
- 1. It allows the map function to process all records together, which is shown to make it more powerful and flexible
- 2. The local index is harnessed when processing the block, making it more efficient than scanning over all records
Operations Layer
• Spatial indexing (storage layer) + spatial functionality (MapReduce layer) = the possibility of efficient realizations of a myriad of spatial operations
• Three basic spatial operations:
- Range query
- k nearest neighbors (kNN)
- Spatial join
Operations Layer – Range Query
• Definition: a range query takes a set of spatial records R and a query area A as input, and returns the set of records in R that overlap A
• Two range query techniques, depending on whether there is replication:
- No replication (R-tree)
- Replication (Grid or R+-tree)
Operations Layer – Range Query
• No replication: each record is stored in exactly one partition
• Range query algorithm:
Step 1 - global filter
- The range filter function is passed to the SpatialFileSplitter
- Blocks that are completely inside the query area are copied directly to the output
- Blocks that partially overlap the query area are sent for further processing in the second step
Operations Layer – Range Query
• Step 2 - local filter
- The SpatialRecordReader reads a block that needs to be processed and extracts its local index
- It sends the index to the map function, which exploits the local index with a traditional range query algorithm to return the matching records
Operations Layer – Range Query
• Replication: some records are replicated across partitions
• The range query algorithm is similar to the no-replication one, except:
- (1) In the global filter step, blocks that are completely contained in the query area A still have to be further processed
- (2) The output of the local filter goes through an additional duplicate avoidance step to ensure that duplicates are removed from the final answer
Operations Layer – Range Query
• Duplicate avoidance step
- For each candidate record produced by the local filter step, we compute its intersection with the query area. A record is added to the final result only if the top-left corner of the intersection is inside the partition boundaries.
- Since partitions are disjoint, it is guaranteed that only one partition contains that point. The output of the duplicate avoidance step gives the final answer of the range query; hence, no reduce function is needed
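The per-record test above can be sketched in a few lines of Python (an illustrative sketch; the min-corner convention for "top-left" is an assumption about the coordinate system):

```python
def report_record(record_mbr, query, partition):
    """Reference-point duplicate avoidance: report a candidate only from
    the partition containing the top-left (here: min) corner of the
    intersection of the record with the query area.  Rectangles are
    (x1, y1, x2, y2); partitions are disjoint, so exactly one reports."""
    ix = max(record_mbr[0], query[0])
    iy = max(record_mbr[1], query[1])
    return partition[0] <= ix < partition[2] and partition[1] <= iy < partition[3]

# a record replicated into two partitions is reported by only one of them
print(report_record((4, 1, 6, 2), (3, 0, 10, 10), (0, 0, 5, 5)))   # True
print(report_record((4, 1, 6, 2), (3, 0, 10, 10), (5, 0, 10, 5)))  # False
```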
Operations Layer - kNN
• Definition: a kNN query takes a set of spatial points P, a query point Q, and an integer k as input, and returns the k closest points in P to Q
• The kNN query algorithm in SpatialHadoop:
- (1) Initial answer
- (2) Correctness check
- (3) Answer refinement
Operations Layer - kNN
• Initial answer
- First locate the partition that includes Q by feeding the SpatialFileSplitter a filter function that selects only the overlapping partition
- The selected partition goes through the SpatialRecordReader to exploit its local index with a traditional kNN algorithm, producing the initial k answers
Operations Layer - kNN
• Correctness check
- Draw a test circle C centered at Q with a radius equal to the distance from Q to its kth furthest neighbor
- If C does not overlap any partition other than the one containing Q, the initial answer is considered final; otherwise, proceed to the answer refinement step.
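The circle-vs-partition test behind the correctness check can be sketched as follows (an illustrative sketch; the function names and the example partition layout are assumptions):

```python
def circle_overlaps_rect(cx, cy, r, rect):
    """True iff the circle centered at (cx, cy) with radius r intersects
    the axis-aligned rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = rect
    nx = min(max(cx, x1), x2)  # closest point of the rectangle to the center
    ny = min(max(cy, y1), y2)
    return (nx - cx) ** 2 + (ny - cy) ** 2 <= r ** 2

def answer_is_final(q, kth_dist, partitions, home):
    """Correctness check: the initial kNN answer is final iff the test
    circle overlaps no partition other than q's home partition."""
    return all(not circle_overlaps_rect(q[0], q[1], kth_dist, p)
               for i, p in enumerate(partitions) if i != home)

parts = [(0, 0, 5, 5), (5, 0, 10, 5)]
print(answer_is_final((2, 2), 1.0, parts, 0))  # True: circle stays inside
print(answer_is_final((2, 2), 4.0, parts, 0))  # False: circle spills over
```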
Operations Layer - kNN
• Answer refinement
- Run a range query to get all points inside the MBR of the test circle C
- A scan over the range query result is executed to produce the closest k points as the final answer
Operations Layer – Spatial join
• Definition: a spatial join takes two sets of spatial records R and S and a spatial join predicate θ (e.g., overlaps) as input, and returns the set of all pairs <r, s> where r ∈ R, s ∈ S, and θ is true for <r, s>
• SJMR algorithm, the MapReduce version of partition-based spatial-merge join (PBSM):
- Employs a map function that partitions input records according to a uniform grid
- A reduce function then joins the records in each partition
Operations Layer – Spatial join
• Distributed join:
- (Preprocessing, if needed)
- Global join
- Local join
- Duplicate avoidance
Operations Layer – Spatial join
• Global join: this step produces all pairs of file blocks with overlapping MBRs
- The SpatialFileSplitter module is fed with the overlapping filter function to exploit the two spatially indexed input files.
- Then, a traditional spatial join algorithm is applied over the two global indexes to produce the overlapping pairs of partitions.
- The SpatialFileSplitter finally creates a combined split for each pair of overlapping blocks
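The block-pairing step can be sketched as a join over the two global indexes (a minimal Python sketch; the `(block_id, mbr)` lists are hypothetical stand-ins for the global indexes, and a real implementation would use a spatial join rather than this nested loop):

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap; rectangles are (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def global_join(index_r, index_s):
    """Global-join step: pair up the blocks of the two files whose MBRs
    overlap; each pair would become one combined split for the
    SpatialRecordReader."""
    return [(r, s) for r, mr in index_r for s, ms in index_s if overlaps(mr, ms)]

r_blocks = [('r0', (0, 0, 5, 5)), ('r1', (5, 0, 10, 5))]
s_blocks = [('s0', (4, 0, 6, 5))]
print(global_join(r_blocks, s_blocks))  # [('r0', 's0'), ('r1', 's0')]
```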
Operations Layer – Spatial join
• Local join: this step joins the records in the two blocks of a combined split to produce pairs of overlapping records
- The SpatialRecordReader reads the combined split, extracts the records and local indexes from its two blocks, and sends all of them to the map function for processing.
- The map function exploits the two local indexes to speed up joining the two sets of records in the combined split.
- The result of the local join may contain duplicates, due to records overlapping multiple blocks
Operations Layer – Spatial join
• Duplicate avoidance: employs the reference-point duplicate avoidance technique
- For each detected overlapping pair of records, the intersection of their MBRs is first computed.
- Then, the overlapping pair is reported as a final answer only if the top-left corner (i.e., the reference point) of the intersection falls in the overlap of the MBRs of the two processed blocks
Experiments
• Compared to standard Hadoop
• All experiments are conducted on an Amazon EC2 cluster of up to 100 nodes. The default cluster size is 20 'small' instances
• Datasets:
- TIGER
- OSM
- NASA
- SYNTH
Experiments – Range Query
• SYNTH
Experiments – Range Query
• TIGER
Experiments - kNN
• SYNTH
Experiments - kNN
• TIGER
Experiments – Spatial join
Experiments – Index creation
Thank you!