SpatialHadoop: A MapReduce Framework for Spatial Data
Authors: Ahmed Eldawy, Mohamed F. Mokbel
Publication: ICDE 2015
Contents
• 0. Abstract
• 1. Background
• 2. Related Work
• 3. Architecture
• 4-7. Four layers
• 8. Experiments
Abstract
• SpatialHadoop: a full-fledged MapReduce framework with native support for spatial data
• It is a comprehensive extension to Hadoop that injects spatial data awareness into each Hadoop layer:
1. Language: Pigeon
2. Storage: two-level spatial index
3. MapReduce: SpatialFileSplitter, SpatialRecordReader
4. Operations: range query, kNN, spatial join
Background
• Motivations:
1. Hadoop: a solution for scalable processing of huge datasets
2. The recent explosion of spatial data
• Present: researchers and practitioners worldwide have started to take advantage of the MapReduce environment to support large-scale spatial data:
- Industry: GIS tools on Hadoop
- Academia: Parallel-Secondo, MD-HBase, Hadoop-GIS
Background
• Drawback: these systems treat Hadoop as a black box and are therefore limited by the limitations of the existing Hadoop system.
• Take Hadoop-GIS as an example:
1. Hadoop treats spatial data as non-spatial data, without additional support
2. It supports only a uniform grid index, applicable only to uniformly distributed data
3. MapReduce programs cannot access the constructed spatial index
• Parallel-Secondo, MD-HBase and the ESRI tools on Hadoop suffer from similar drawbacks.
Background
• SpatialHadoop:
1. Built inside the Hadoop base code
2. Able to support a set of spatial index structures
3. Users can develop a myriad of spatial functions, including range queries, kNN and spatial join
• Difference:
Background
• SpatialHadoop has four main layers:
1. Language layer: Pigeon
2. Storage layer: two-level index structure
3. MapReduce layer: SpatialFileSplitter, SpatialRecordReader
4. Operations layer: encapsulates a dozen spatial operations
Related Work
• Existing work can be classified into two categories:
1. Specific spatial operations
2. Systems
Related Work
• Specific operations:
1. R-tree construction
2. Range query
3. kNN query
4. All-NN query
5. Reverse NN query
6. Spatial join
7. kNN join
Related Work
• Systems:
1. Hadoop-GIS
2. MD-HBase
3. Parallel-Secondo
Architecture
• Three types of users:
- Casual user
- Developer
- System admin
• Four layers:
- Language
- Operations
- MapReduce
- Storage
Architecture
• The language layer: Pigeon, a high-level SQL-like language that supports OGC-compliant spatial data types (e.g., Point and Polygon) and operations (e.g., Overlaps and Touches)
• The storage layer: a two-level index structure of global and local indexing, implementing three standard indexes: Grid file, R-tree and R+-tree
Architecture
• The MapReduce layer
- SpatialFileSplitter: uses the global index to prune file blocks that do not contribute to the answer
- SpatialRecordReader: uses the local index to retrieve a partial answer from each block
• The operations layer: encapsulates the implementation of various spatial operations that take advantage of the spatial indexes and the new components in the MapReduce layer
Language Layer
• Background: a set of declarative SQL-like languages has been proposed: HiveQL, Pig Latin, SCOPE and YSmart
• Pigeon: an extension to the Pig Latin language, adding spatial data types, functions and operations that conform to the OGC standard.
Language Layer
• Data types: overrides bytearray to support spatial data types such as Point, LineString and Polygon
  lakes = LOAD 'lakes' AS (id:int, area:polygon);
• Spatial functions: provides spatial functions including aggregate functions (e.g., Union), predicates (e.g., Overlaps) and others (e.g., Buffer)
  houses_with_distance = FOREACH houses GENERATE id, Distance(house_loc, sc_loc);
• kNN query: a new KNN statement
  nearest_houses = KNN houses WITH_K=100 USING Distance(house_loc, query_loc);
Language Layer
Pigeon overrides the following two Pig Latin statements:
• FILTER: accepts a spatial predicate and calls the corresponding procedure for range queries
  houses_in_range = FILTER houses BY Overlaps(house_loc, query_range);
• JOIN: accepts spatial files and forwards them to the corresponding spatial join procedure
  lakes_states = JOIN lakes BY lakes_boundary states BY states_boundary PREDICATE = Overlaps;
Storage Layer
• Background:
1. Input files in Hadoop are non-indexed heap files
2. SpatialHadoop adds index structures inside HDFS
Indexing in SpatialHadoop is the key to its superior performance over Hadoop.
• Challenges:
1. Traditional index structures are optimized for procedural programs
2. A file in HDFS can only be written sequentially, while traditional indexes are constructed incrementally
Storage Layer
• Existing techniques for spatial indexing in Hadoop:
1. Build only: an R-tree is constructed with a MapReduce approach but queried outside MapReduce using other techniques
2. Custom on-the-fly indexing: a non-standard index is created and discarded with each query execution
3. Indexing in HDFS: supports only range queries on trajectory data, which is quite limited
Storage Layer
• Overview:
Storage Layer
• How the challenges are overcome:
1. Local indexes can be processed in parallel
2. The small size of each local index allows it to be bulk loaded in memory and written to a file in an append-only manner
• Generic way of building an index:
1. Partitioning
2. Local indexing
3. Global indexing
Storage Layer
• Partitioning
Main goals: blocks that fit in HDFS blocks, spatial locality, load balancing
Three steps:
1. Calculate the number of partitions n
2. Decide partition boundaries
3. Physical partitioning
Storage Layer
• 1. Calculate the number of partitions n

  n = ⌈S(1 + α) / B⌉

S: input file size
B: HDFS block capacity (64 MB)
α: overhead ratio, set to 0.2 by default, reserving room for the local index within each block
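As a quick sanity check, the formula above can be computed directly; a minimal Python sketch (the 1 GB example input is hypothetical, not from the paper):

```python
from math import ceil

def num_partitions(file_size, block_size, alpha=0.2):
    """n = ceil(S * (1 + alpha) / B): the overhead ratio alpha reserves
    extra room so that each partition's local index still fits in a block."""
    return ceil(file_size * (1 + alpha) / block_size)

# e.g. a 1 GB file with 64 MB blocks and the default 0.2 overhead
print(num_partitions(1024 * 2**20, 64 * 2**20))  # 20
```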
Storage Layer
• 2. Partition boundaries
- Decide the spatial area covered by each single partition, defined by a rectangle
- Boundaries are calculated differently according to the underlying index being constructed, to accommodate the data distribution
- The output of this step is a set of n rectangles representing the boundaries of the n partitions
Storage Layer
• 3. Physical partitioning
- Initiate a MapReduce job that physically partitions the input file
- The challenge here is deciding what to do with objects with spatial extents (e.g., polygons) that overlap more than one partition
- At the end, for each record r assigned to a partition p, the map function writes an intermediate pair <p, r>. Such pairs are then grouped by p and sent to the reduce function for the next phase
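The map side of the physical-partitioning step can be sketched as follows (a minimal Python sketch; `rect_overlaps` and the first-match fallback are illustrative assumptions, not the paper's exact assignment policy):

```python
def rect_overlaps(a, b):
    """Axis-aligned rectangles given as (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def partition_map(record_mbr, partitions, replicate=True):
    """Return the partition ids a record is assigned to; the job would
    emit one <p, r> pair per id.  With replication (Grid file, R+-tree)
    the record goes to every partition it overlaps; without it (R-tree)
    we keep only the first match as a stand-in for best-fit assignment."""
    hits = [i for i, p in enumerate(partitions) if rect_overlaps(record_mbr, p)]
    return hits if replicate else hits[:1]

# two side-by-side partitions and a record straddling their border
parts = [(0, 0, 5, 5), (5, 0, 10, 5)]
print(partition_map((4, 1, 6, 2), parts))         # [0, 1]
print(partition_map((4, 1, 6, 2), parts, False))  # [0]
```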
Storage Layer
• Local indexing
- Purpose: build the requested index structure (e.g., Grid or R-tree) as a local index on the data contents of each physical partition
- Building the requested index structure is realized as a reduce function that takes the records assigned to each partition, stores them in a spatial index, and writes that index to a local index file
- Each local index has to fit in one HDFS block, for two reasons:
(1) It allows spatial operations written as MapReduce programs to access local indexes such that each local index is processed in one map task
(2) It ensures that the local index is treated by the Hadoop load balancer as one unit when it relocates blocks across machines
Storage Layer
• Global indexing
- Build the requested structure as a global index that indexes all partitions.
- Process:
1. Initiate an HDFS concat command to concatenate all local indexes into one file
2. The master node builds an in-memory global index that indexes all file blocks, using their rectangular boundaries as the index key
Storage Layer
• Global indexing (ctd.)
• The global index is:
1. Constructed using bulk loading
2. Kept in main memory at all times
3. Lazily reconstructed if the master node fails and restarts
Storage Layer - Grid file
• Definition: a simple flat index that partitions the data according to a grid such that the records overlapping each grid cell are stored in one file block as a single partition, assuming data is uniformly distributed
• Partitioning:
- 1. Calculate the number of partitions n
- 2. Create a uniform grid of ⌈√n⌉ × ⌈√n⌉ cells over the space domain and take the boundaries of the grid cells as partition boundaries
- 3. A record r with a spatial extent is replicated to every grid cell it overlaps
Storage Layer - Grid file
• Local indexing: the records of each grid cell are just written to a heap file without building any local indexes
• Global indexing: concatenate all these files and build the global index, which is a two-dimensional directory table pointing to the corresponding blocks in the concatenated file
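The two-dimensional directory table can be pictured with a small Python sketch (the `directory` contents, cell layout and offsets here are hypothetical illustrations):

```python
def cell_of(point, space_mbr, grid_size):
    """Map a point to its (col, row) cell in a uniform grid over space_mbr."""
    x, y = point
    x1, y1, x2, y2 = space_mbr
    col = min(int((x - x1) / (x2 - x1) * grid_size), grid_size - 1)
    row = min(int((y - y1) / (y2 - y1) * grid_size), grid_size - 1)
    return col, row

# hypothetical directory: grid cell -> offset of its block in the
# concatenated index file
directory = {(0, 0): 0, (1, 0): 64, (0, 1): 128, (1, 1): 192}
print(directory[cell_of((7.5, 2.0), (0, 0, 10, 10), 2)])  # 64
```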
R-tree
• An R-tree is a height-balanced tree, similar to a B-tree, with index records in its leaf nodes containing pointers to data objects
• Spatial databases: tuples (representing spatial objects) + identifiers
• In an R-tree:
- Leaf node entry: <I, identifier>
- Non-leaf node entry: <I, child-pointer>
where I is an n-dimensional rectangle
R-tree
• Properties:
(M: the maximum number of entries that will fit in one node)
(m: a parameter specifying the minimum number of entries in a node)
1. Every leaf node contains between m and M index records unless it is the root
2. For each index record (I, identifier) in a leaf node, I is the smallest rectangle that spatially contains the n-dimensional data object represented by the indicated tuple
3. Every non-leaf node has between m and M children unless it is the root
R-tree
• Properties (ctd.):
4. For each entry (I, child-pointer) in a non-leaf node, I is the smallest rectangle that spatially contains the rectangles in the child node
5. The root node has at least two children unless it is a leaf
6. All leaves appear on the same level
Storage Layer - (R-tree)
• Partitioning
- To compute partition boundaries, a random sample of the input file is bulk loaded into an in-memory R-tree using the Sort-Tile-Recursive (STR) algorithm
- (details)
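The STR bulk loading used above can be sketched compactly in Python: sort the sample by x, cut it into roughly √n vertical slabs, sort each slab by y and cut again. This is an illustrative sketch of the standard algorithm, not SpatialHadoop's code:

```python
from math import ceil, sqrt

def str_partitions(sample, n):
    """Sketch of STR (Sort-Tile-Recursive) tiling over a point sample,
    yielding n partition boundaries as (x1, y1, x2, y2) rectangles."""
    s = ceil(sqrt(n))            # slabs per axis
    pts = sorted(sample)         # sort by x (then y)
    slab = ceil(len(pts) / s)
    bounds = []
    for i in range(0, len(pts), slab):
        strip = sorted(pts[i:i + slab], key=lambda p: p[1])  # sort by y
        cell = ceil(len(strip) / s)
        for j in range(0, len(strip), cell):
            chunk = strip[j:j + cell]
            xs = [p[0] for p in chunk]
            ys = [p[1] for p in chunk]
            bounds.append((min(xs), min(ys), max(xs), max(ys)))
    return bounds

# a 4x4 point grid split into n = 4 partitions of 4 points each
print(str_partitions([(x, y) for x in range(4) for y in range(4)], 4))
```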
Storage Layer - (R-tree)
• Local indexing:
- The records of each partition are bulk loaded into an R-tree using the STR algorithm, then dumped into a file
- Each block in a local index file is annotated with the minimum bounding rectangle (MBR) of its contents
- Partitions might end up overlapping, similar to traditional R-tree nodes
• Global indexing:
- Concatenate all local index files and create the global index by bulk loading all blocks into an R-tree, using their MBRs as the index key
R+-tree
• Differences from the R-tree:
- Nodes are not guaranteed to be at least half filled
- The entries of any internal node do not overlap
- An object ID may be stored in more than one leaf node
• Advantages:
- Point query performance improves
- A single path is followed and fewer nodes are visited than with the R-tree
Storage Layer - (R+-tree)
• Definition: the R+-tree is a variation of the R-tree where nodes at each level are kept disjoint, while records overlapping multiple nodes are replicated to each node to ensure efficient query answering
• Similar to the R-tree except for three changes:
- 1. In the physical partitioning step, each record is replicated to every partition it overlaps
- 2. In the local indexing phase, the records of each partition are inserted into an R+-tree, which is then dumped to a local index file
- 3. The global index is constructed based on the partition boundaries computed in the partitioning phase rather than the MBRs of the contents, as boundaries should remain disjoint
MapReduce Layer
• Comparison:
• Hadoop:
- 1. The input file goes through a FileSplitter that divides it into n splits, where n is set by the MapReduce program based on the number of available slave nodes.
- 2. Then, each split goes through a RecordReader that extracts records as key-value pairs, which are passed to the map function
• SpatialHadoop:
- 1. SpatialFileSplitter, an extended splitter that exploits the global index(es) on the input file(s) to prune early the file blocks that do not contribute to the answer
- 2. SpatialRecordReader, which reads a split originating from spatially indexed input file(s) and exploits the local indexes to process it efficiently
MapReduce Layer
• Comparison (ctd.)
MapReduce Layer
• SpatialFileSplitter
• Takes:
- 1. One or two input files
- 2. A filter function
• One input file:
- The SpatialFileSplitter applies the filter function to the global index of the input file to select the file blocks, based on their MBRs, that should be processed by the job
- For example, a range query job provides a filter function that prunes file blocks whose MBRs are completely outside the query range. For each selected file block, the SpatialFileSplitter creates a file split, to be processed later by the SpatialRecordReader
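The pruning filter described above amounts to an MBR overlap test against the global index; a minimal Python sketch (the `(block_id, mbr)` list standing in for the global index is a hypothetical structure):

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap; rectangles are (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def range_filter(global_index, query):
    """Global-filter step: keep only the blocks whose MBR overlaps the
    query range; each surviving block would become one file split."""
    return [bid for bid, mbr in global_index if overlaps(mbr, query)]

blocks = [(0, (0, 0, 5, 5)), (1, (5, 0, 10, 5)), (2, (0, 5, 5, 10))]
print(range_filter(blocks, (4, 4, 6, 6)))  # [0, 1, 2]
```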
MapReduce Layer
• SpatialFileSplitter (ctd.)
• Two input files, similar to one input file with two subtle differences:
- 1. The filter function is applied to two global indexes, each corresponding to one input file
- 2. The output of the SpatialFileSplitter is a combined split that contains a pair of file ranges (i.e., file offsets and lengths) corresponding to the two blocks selected by the filter function
MapReduce Layer
• SpatialRecordReader
The SpatialRecordReader takes either a split or a combined split and parses it to generate key-value pairs to be passed to the map function. It parses the block to extract the local index, which acts as an access method to all records in the block.
MapReduce Layer
• SpatialRecordReader (ctd.)
The record reader sends all the records to the map function, indexed by the local index, with two main benefits:
- 1. It allows the map function to process all records together, which is shown to make it more powerful and flexible
- 2. The local index is harnessed when processing the block, making it more efficient than scanning over all records
Operations Layer
• Spatial indexing (storage layer) + spatial functionality (MapReduce layer) = the possibility of efficient realizations of a myriad of spatial operations
• Three basic spatial operations:
- Range query
- k nearest neighbors (kNN)
- Spatial join
Operations Layer – Range Query
• Definition: a range query takes a set of spatial records R and a query area A as input, and returns the set of records in R that overlap A
• Two range query techniques, depending on whether there is replication:
- No replication (R-tree)
- Replication (Grid or R+-tree)
Operations Layer – Range Query
• No replication: each record is stored in exactly one partition
• Range query algorithm:
Step 1 - global filter
- The range filter function is passed to the SpatialFileSplitter
- Blocks that are completely inside the query area are copied directly to the output
- Blocks that partially overlap the query area are sent for further processing in the second step
Operations Layer – Range Query
• Step 2 - local filter
- The SpatialRecordReader reads a block that needs to be processed and extracts its local index
- It sends the index to the map function, which exploits the local index with a traditional range query algorithm to return the matching records
Operations Layer – Range Query
• Replication: some records are replicated across partitions
• The range query algorithm is similar to the no-replication one, except:
- (1) In the global filter step, blocks that are completely contained in the query area A still have to be further processed
- (2) The output of the local filter goes through an additional duplicate avoidance step to ensure that duplicates are removed from the final answer
Operations Layer – Range Query
• Duplicate avoidance step
- For each candidate record produced by the local filter step, we compute its intersection with the query area. A record is added to the final result only if the top-left corner of the intersection is inside the partition boundaries.
- Since partitions are disjoint, it is guaranteed that only one partition contains that point. The output of the duplicate avoidance step gives the final answer of the range query; hence, no reduce function is needed
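The per-record test above can be sketched in a few lines of Python (an illustrative sketch; the min-corner convention for "top-left" is an assumption about the coordinate system):

```python
def report_record(record_mbr, query, partition):
    """Reference-point duplicate avoidance: report a candidate only from
    the partition containing the top-left (here: min) corner of the
    intersection of the record with the query area.  Rectangles are
    (x1, y1, x2, y2); partitions are disjoint, so exactly one reports."""
    ix = max(record_mbr[0], query[0])
    iy = max(record_mbr[1], query[1])
    return partition[0] <= ix < partition[2] and partition[1] <= iy < partition[3]

# a record replicated into two partitions is reported by only one of them
print(report_record((4, 1, 6, 2), (3, 0, 10, 10), (0, 0, 5, 5)))   # True
print(report_record((4, 1, 6, 2), (3, 0, 10, 10), (5, 0, 10, 5)))  # False
```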
Operations Layer - kNN
• Definition: a kNN query takes a set of spatial points P, a query point Q, and an integer k as input, and returns the k closest points in P to Q
• The kNN query algorithm in SpatialHadoop:
- (1) Initial answer
- (2) Correctness check
- (3) Answer refinement
Operations Layer - kNN
• Initial answer
- First locate the partition that includes Q by feeding the SpatialFileSplitter a filter function that selects only the overlapping partition
- The selected partition goes through the SpatialRecordReader to exploit its local index with a traditional kNN algorithm, producing the initial k answers
Operations Layer - kNN
• Correctness check
- Draw a test circle C centered at Q with a radius equal to the distance from Q to its kth furthest neighbor
- If C does not overlap any partition other than the one containing Q, the initial answer is considered final; otherwise, proceed to the answer refinement step.
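The circle-vs-partition test behind the correctness check can be sketched as follows (an illustrative sketch; the function names and the example partition layout are assumptions):

```python
def circle_overlaps_rect(cx, cy, r, rect):
    """True iff the circle centered at (cx, cy) with radius r intersects
    the axis-aligned rectangle (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = rect
    nx = min(max(cx, x1), x2)  # closest point of the rectangle to the center
    ny = min(max(cy, y1), y2)
    return (nx - cx) ** 2 + (ny - cy) ** 2 <= r ** 2

def answer_is_final(q, kth_dist, partitions, home):
    """Correctness check: the initial kNN answer is final iff the test
    circle overlaps no partition other than q's home partition."""
    return all(not circle_overlaps_rect(q[0], q[1], kth_dist, p)
               for i, p in enumerate(partitions) if i != home)

parts = [(0, 0, 5, 5), (5, 0, 10, 5)]
print(answer_is_final((2, 2), 1.0, parts, 0))  # True: circle stays inside
print(answer_is_final((2, 2), 4.0, parts, 0))  # False: circle spills over
```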
Operations Layer - kNN
• Answer refinement
- Run a range query to get all points inside the MBR of the test circle C
- A scan over the range query result is executed to produce the closest k points as the final answer
Operations Layer – Spatial join
• Definition: a spatial join takes two sets of spatial records R and S and a spatial join predicate θ (e.g., overlaps) as input, and returns the set of all pairs <r, s> where r ∈ R, s ∈ S, and θ is true for <r, s>
• SJMR algorithm, the MapReduce version of partition-based spatial-merge join (PBSM):
- Employs a map function that partitions input records according to a uniform grid
- A reduce function then joins the records in each partition
Operations Layer – Spatial join
• Distributed join:
- (Preprocessing, if needed)
- Global join
- Local join
- Duplicate avoidance
Operations Layer – Spatial join
• Global join: this step produces all pairs of file blocks with overlapping MBRs
- The SpatialFileSplitter module is fed with the overlapping filter function to exploit the two spatially indexed input files.
- Then, a traditional spatial join algorithm is applied over the two global indexes to produce the overlapping pairs of partitions.
- The SpatialFileSplitter finally creates a combined split for each pair of overlapping blocks
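The block-pairing step can be sketched as a join over the two global indexes (a minimal Python sketch; the `(block_id, mbr)` lists are hypothetical stand-ins for the global indexes, and a real implementation would use a spatial join rather than this nested loop):

```python
def overlaps(a, b):
    """Axis-aligned rectangle overlap; rectangles are (x1, y1, x2, y2)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def global_join(index_r, index_s):
    """Global-join step: pair up the blocks of the two files whose MBRs
    overlap; each pair would become one combined split for the
    SpatialRecordReader."""
    return [(r, s) for r, mr in index_r for s, ms in index_s if overlaps(mr, ms)]

r_blocks = [('r0', (0, 0, 5, 5)), ('r1', (5, 0, 10, 5))]
s_blocks = [('s0', (4, 0, 6, 5))]
print(global_join(r_blocks, s_blocks))  # [('r0', 's0'), ('r1', 's0')]
```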
Operations Layer – Spatial join
• Local join: this step joins the records in the two blocks of a combined split to produce pairs of overlapping records
- The SpatialRecordReader reads the combined split, extracts the records and local indexes from its two blocks, and sends all of them to the map function for processing.
- The map function exploits the two local indexes to speed up joining the two sets of records in the combined split.
- The result of the local join may contain duplicates, due to records overlapping multiple blocks
Operations Layer – Spatial join
• Duplicate avoidance: employs the reference-point duplicate avoidance technique
- For each detected overlapping pair of records, the intersection of their MBRs is first computed.
- Then, the overlapping pair is reported as a final answer only if the top-left corner (i.e., the reference point) of the intersection falls in the overlap of the MBRs of the two processed blocks
Experiments
• Compared to standard Hadoop
• All experiments are conducted on an Amazon EC2 cluster of up to 100 nodes. The default cluster size is 20 'small' instances
• Datasets:
- TIGER
- OSM
- NASA
- SYNTH
Experiments – Range Query
• SYNTH
Experiments – Range Query
• TIGER
Experiments - kNN
• SYNTH
Experiments - kNN
• TIGER
Experiments – Spatial join
Experiments – Index creation
Thank you!