Efficient Query Routing in Distributed Spatial Databases
Roger Zimmermann, Wei-Shinn Ku, and Wei-Cheng Chu
Computer Science Department
University of Southern California
Presenter: Xunfei Jiang
Introduction

Problem
- Current architectures rely on a centralized data repository.
- Applications utilize and integrate data sets that are remotely accessible and under different administrative control.

Solution: combine spatial databases with Web services
- Data is maintained by specific entities or organizations.
- The correct data set for a specific calculation can be downloaded automatically, without manual user intervention.
- Updates and changes to the data are instantly available to remote applications.

Proposed middleware design
- Based on distributed R-tree and Quad-tree index structures.
- Access to the data is public and available through a Web services interface.
- Requests are sent only to the specific repositories that most likely have relevant data.
Motivation
- Autonomy
- Cooperative and efficient query processing
- Decentralization
Figure 1: The proposed distributed spatial database infrastructure with middleware utilizing replicated spatial index structures (either R-trees or Quad-trees).
- Nodes cooperate to decide which other nodes contain potentially relevant data for a query and which do not.
- Because the data archives are dispersed, the query access method and routing mechanism are expected to be fully decentralized.
- The overall system must cooperatively execute each request and return all relevant data.
Baseline Algorithm: EQR (Exhaustive Query Routing)
- When a query q arrives at a specific node, it is forwarded to all other nodes.
- Messages generated by queries: M = 2 × Q × (N − 1)
  - Q: the total number of queries; N: the number of nodes; the factor 2 accounts for the same number of result messages being generated.
- Shortcomings of EQR: it generates a lot of message traffic and scales poorly.
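The EQR message count above can be made concrete with a few lines of Python (a minimal sketch; the function name is ours, not from the paper):

```python
def eqr_messages(num_queries, num_nodes):
    """Exhaustive query routing: each query is forwarded to the other
    N-1 nodes, and each forward triggers one result message back."""
    return 2 * num_queries * (num_nodes - 1)

# 1,000 queries over 10 nodes already generate 18,000 messages.
print(eqr_messages(1000, 10))  # 18000
```

The linear factor (N − 1) per query is what makes EQR scale poorly as nodes are added.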
Query Routing with Spatial Indexing

How to reduce the query space
- Recursively partition the key space into a set of equivalence classes.
- The R-tree and Quad-tree algorithms generate tree-structured indices that partition the overall space into successively smaller areas at lower levels of the index hierarchy.
- Both have been successfully used in the core engines of spatial database systems.
Quad-tree vs. R-tree

Structure
- Quad-tree: each internal node has exactly four children.
- R-tree: an extension of the B-tree; the upper and lower bounds on the number of entries in each internal node are pre-specified (usually 2 and 4).

Characteristics
- Quad-tree: decomposes space into adaptable cells. Each cell (or bucket) has a maximum capacity; when the maximum capacity is reached, the bucket splits. The tree directory follows the spatial decomposition of the Quad-tree.
- R-tree: splits space into rectangles. When a rectangle contains more than the maximum number of children, an adjustment must be made. Object nodes are at the lowest level.
A novel method: use R-trees and Quad-trees as index structures across multiple spatial databases
- Insert the MBR of each archive's data set into a global R-tree or Quad-tree.
- Distribute copies of the global index structure to each archive (avoiding a centralized index server).
- An archive can intersect each query rectangle with the archive MBRs stored in the global index; the query is then forwarded only to archives whose MBR overlaps the query rectangle.

Cons: additional cost of synchronizing the global index structures.
Pros: reduces the overhead. Because the global index structures manage only bounding rectangles, changes to the data set of any individual archive result in index updates only if the MBR changes, which is very infrequent.
Example: an archive manages 1,000 two-dimensional spatial data objects; its MBR is defined by at most four of them.
- Insertions or deletions confined within an MBR do not affect the global index; only changes that stretch or shrink the MBR need to be propagated.
- Estimated number of messages when a global index is used:
MT = MQ + MU = [2 × Q × (N − 1) × SQ] + [U × (N − 1) × SU]
In one of our experiments, SU values ranged from 0.001 to 0.058.
Symbol  Description
Q       Number of queries
U       Number of global index updates
N       Number of nodes
M       Number of messages
MT      Total number of messages
MQ      Number of messages generated by queries
MU      Number of messages generated by updates
SQ      Selectivity of queries
SU      Selectivity of global index updates
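Using the symbols above, the MT estimate can be sketched in Python (the function name and sample selectivity values are illustrative, not from the paper):

```python
def index_routing_messages(q, u, n, s_q, s_u):
    """Estimated traffic with a replicated global index: MT = MQ + MU.
    s_q: fraction of nodes whose MBR overlaps an average query window;
    s_u: fraction of local updates that actually change an MBR."""
    m_q = 2 * q * (n - 1) * s_q   # query forwards plus result messages
    m_u = u * (n - 1) * s_u       # MBR-change broadcasts
    return m_q + m_u

# With 10 nodes, 1,000 queries, 1,000 updates, s_q = 0.3, s_u = 0.05:
# 5,400 query messages + 450 update messages, versus 18,000 for EQR.
print(index_routing_messages(1000, 1000, 10, 0.3, 0.05))
```

Because SU is small in practice (0.001 to 0.058 in the experiments), the update term MU stays almost negligible next to the query savings.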
Assumptions: every archive in the distributed environment
- hosts a database engine (storing, retrieving, querying);
- holds a directory, termed the server list, with entries denoting the network location (e.g., IP address) and the minimum bounding rectangle (MBR) of every spatial database. From the directory information, each server computes the corresponding R-tree (or Quad-tree) global index data structure.

Updates
- The local data structure is updated after receiving MBR update messages.
- Local MBR changes due to data insertions or deletions initiate update messages to all the other database servers.
In the R-tree based design, the upper and lower bounds on the number of entries that fit in one internal tree node are pre-specified: M is the maximum number of entries that fit in one node, and m ≤ M/2 is the parameter specifying the minimum number of entries in a node.
The R-Tree Based Design: Index Initialization and Topology Maintenance

A new server (A) joins the spatial database:
1) A sends its information (IP address and MBR) as an update message to an existing server B.
2) B updates its local R-tree index and replies with the current system information.
3) A constructs its own R-tree index.
4) A broadcasts an update message with its hostname and MBR to all the other servers except B.

Archive A departs:
- It broadcasts an update message to announce that it is removing itself from the topology.
- The other servers delete the leaving server's MBR from their R-trees.

A server fails:
- The node that first detects the unresponsive system broadcasts a removal message to everyone.
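The join steps above can be modeled as a toy in-memory sketch (class and method names are hypothetical; the real system exchanges these messages through Web-service calls, and index reconstruction is omitted):

```python
class SpatialServer:
    """Toy model of a database server's directory (the server list)."""

    def __init__(self, addr, mbr):
        self.addr = addr
        self.mbr = mbr                  # (x1, y1, x2, y2)
        self.server_list = {addr: mbr}  # network location -> MBR

    def receive_update(self, addr, mbr):
        # A real server would also patch its local R-tree index here.
        self.server_list[addr] = mbr

    def join(self, bootstrap, live_servers):
        # 1) send (address, MBR) as an update message to existing server B
        bootstrap.receive_update(self.addr, self.mbr)
        # 2) B replies with the current system information
        self.server_list.update(bootstrap.server_list)
        # 3) construct the local R-tree from server_list (omitted)
        # 4) broadcast to all other servers except B
        for srv in live_servers:
            if srv.addr not in (self.addr, bootstrap.addr):
                srv.receive_update(self.addr, self.mbr)

b = SpatialServer("B", (0, 0, 1, 1))
c = SpatialServer("C", (2, 2, 3, 3)); c.join(b, [b])
a = SpatialServer("A", (4, 4, 5, 5)); a.join(b, [b, c])
```

After the two joins, every server's directory lists all three MBRs, which is the invariant the replicated global index relies on.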
The R-Tree Based Design: Query Routing
- Clients do not need to contact all the servers to obtain comprehensive query results.
- Queries sent to a server are automatically forwarded and yield accurate spatial results from the complete data set.
- The queried server determines through its local R-tree whether any of the other archives in the collective potentially have relevant data, i.e., whether the query rectangle and an archive's MBR intersect.
- Forwarded queries are flagged to show that they originated from a server rather than a client, which avoids query loops.
- The results of forwarded queries are returned to the initially contacted server, which aggregates them and returns the set to the client.
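The routing decision itself reduces to a rectangle-intersection filter plus the forwarded flag; a minimal sketch (function and variable names are ours, and the server list stands in for the local R-tree):

```python
def route_query(server_list, local_addr, query_rect, forwarded=False):
    """Return the addresses the query should be forwarded to: only
    servers whose MBR intersects the query rectangle. A query that was
    itself forwarded is never re-forwarded (prevents query loops)."""
    if forwarded:
        return []
    x1, y1, x2, y2 = query_rect
    targets = []
    for addr, (mx1, my1, mx2, my2) in server_list.items():
        if addr == local_addr:
            continue  # answered locally, not forwarded
        # axis-aligned rectangle intersection test
        if mx1 <= x2 and x1 <= mx2 and my1 <= y2 and y1 <= my2:
            targets.append(addr)
    return targets

servers = {"A": (0, 0, 10, 10), "B": (20, 20, 30, 30), "C": (5, 5, 15, 15)}
print(route_query(servers, "A", (0, 0, 8, 8)))  # ['C']
```

In the full design the same check is done by descending the replicated R-tree rather than scanning the flat list, but the pruning criterion is identical.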
The R-Tree Based Design: R-Tree Index Update
- Each spatial database server must process data object update requests (insertions and deletions) from local users.
- If an update changes the MBR of the local R-tree index, the new MBR is broadcast to all the other servers in the system for tree index synchronization.
The Quad-tree Based Design: Quad-tree Index Update
Slight differences from the R-tree based design arise for tree index updates. If an object insertion or deletion changes the MBR boundary, the Quad-tree model checks whether the MBR variation affects the Quad-tree structure:
- Changed: the update is propagated to all the other servers as usual for tree index synchronization.
- Unchanged: the MBR update is not broadcast.
- As the number of servers increases, update traffic remains constant: Table 2 illustrates that approximately 4.2% to 5.8% of all insert or delete operations result in an MBR change.
- When both the server and update numbers increase linearly, the activity per server is relatively constant; Table 3 shows the experimental results.
- Conclusion: the update message traffic to synchronize distributed Quad-trees is much lower than for R-trees.
Nearest Neighbor Queries
Recent research applied to our system: a branch-and-bound R-tree traversal algorithm that efficiently answers both NN and k-NN queries.

Two metrics:
- Minimum distance (MINDIST): the minimum Euclidean distance between the query point and the nearest edge of an MBR; the optimistic choice.
- Minimum of the maximum possible distance (MINMAXDIST): the minimum of the maximum distances between the query point and points on each of the axes of an MBR; the pessimistic choice.
[Figure: MINDIST and MINMAXDIST between a query point and an MBR, with the MAXDIST to each of the edges x1, x2, y1, and y2 shown.]
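The two metrics from the figure can be computed directly; a minimal sketch following the standard branch-and-bound R-tree formulation (function names are ours; `lo`/`hi` are the MBR's lower-left and upper-right corners):

```python
import math

def mindist(p, lo, hi):
    # Minimum Euclidean distance from point p to the MBR [lo, hi];
    # zero contribution on any axis where p lies inside the interval.
    return math.sqrt(sum(max(l - x, 0.0, x - u) ** 2
                         for x, l, u in zip(p, lo, hi)))

def minmaxdist(p, lo, hi):
    # For each axis k, assume a data point must lie on the nearer face
    # along k; combine with the farther side on every other axis, then
    # take the minimum over k.
    d = len(p)
    rm = [lo[i] if p[i] <= (lo[i] + hi[i]) / 2 else hi[i] for i in range(d)]
    rM = [lo[i] if p[i] >= (lo[i] + hi[i]) / 2 else hi[i] for i in range(d)]
    total = sum((p[i] - rM[i]) ** 2 for i in range(d))
    best = min((p[k] - rm[k]) ** 2 + total - (p[k] - rM[k]) ** 2
               for k in range(d))
    return math.sqrt(best)

# Query point at the origin, MBR from (1,1) to (3,3):
print(mindist((0, 0), (1, 1), (3, 3)))     # sqrt(2), nearest corner
print(minmaxdist((0, 0), (1, 1), (3, 3)))  # sqrt(10)
```

MINDIST never exceeds MINMAXDIST for the same point and MBR, which is what makes the pair usable as optimistic and pessimistic pruning bounds.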
Nearest Neighbor Queries
- The NN search algorithm implements an ordered depth-first traversal based on the values of MINDIST and MINMAXDIST.
- It begins from the R-tree root node and proceeds down the tree hierarchy. At a leaf node, a distance computation function is invoked to compute the actual distance between the query point and the candidate DB objects.
- The algorithm iterates with three search-pruning strategies until it finds the NN object.
Distributed design
- Every server maintains a local R-tree; the NN search algorithm is executed on the local R-tree to compute both the MINDIST and MINMAXDIST values.
- A Web service interface is created at each node to access these distance values across multiple archives and to remotely obtain the distance between the search point and a candidate nearest data point.
- To answer a NN query, a server needs to send several distance query messages to other servers in the system during the branch-and-bound process.
- With the three search-pruning strategies proposed in "Nearest Neighbor Queries" and a slightly modified search algorithm, NN queries can be executed efficiently.
EXPERIMENTAL VALIDATION
We implemented the tree-based designs in a simulator to evaluate the performance of our approach.

Index structure
- R-tree algorithm
- MX-CIF Quad-tree algorithm
- The index tree search complexity is the same as for these algorithms.
- In a distributed environment, the search complexity is dominated by the communication overhead between servers.

Focus of the simulation: quantifying the query routing traffic generated by queries and updates.

Data sets: a synthetic spatial data set and a real-world spatial data set.
Simulator Implementation
- The leaf nodes represent specific server MBRs and contain forwarding pointers (i.e., the host names and IP addresses) to the remote servers.
- The leaf node of the MBR of the local data set directly points to the local database.
- If a query window intersects several server MBRs, the query is forwarded to each of them.
- The simulator counts all the messages generated through the query forwarding mechanism and all the return messages containing result data sets.
- Additionally, tree update information must be broadcast to all servers.
Simulator Implementation: Event Generation
- Two types of events: queries and updates.
- Data updates could be either insertion or deletion requests.
- Both types were generated according to a Poisson distribution, with the inter-arrival rates λQ and λU specified independently.
- Simulated time: ten hours.
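Poisson arrivals with independent rates reduce to exponential inter-arrival times; a minimal sketch of such an event generator (function name, rates per hour, and seeding are our assumptions, not the paper's simulator):

```python
import random

def generate_events(rate_q, rate_u, sim_hours=10, seed=42):
    """Generate (timestamp, kind) events over the simulated period,
    drawing exponential inter-arrival gaps so that queries and updates
    each form a Poisson process with its own rate (per hour)."""
    rng = random.Random(seed)
    events = []
    for kind, lam in (("query", rate_q), ("update", rate_u)):
        t = rng.expovariate(lam)
        while t < sim_hours:
            events.append((t, kind))
            t += rng.expovariate(lam)
    events.sort()  # merge the two streams into one timeline
    return events

timeline = generate_events(rate_q=100, rate_u=10)
```

Over ten simulated hours this yields roughly 1,000 queries and 100 updates, matching the independent λQ and λU described above.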
Simulator Implementation: Query Parameter Generation
Query windows were created dynamically based on two parameters (see Table 4):
- Mean query window size (QWS-μ): the mean percentage of the global geographical area used for the query window, based on a normal distribution.
- Deviation (QWS-σ): provides a variation range bounded by one QWS-σ deviation, so the query window area differs for each query event.

Example: with QWS-μ and QWS-σ, the simulator first chose the query window size, then randomly selected a point (x1, y1) as one corner coordinate and a value x2 inside the global boundary as the x-value of the opposite corner across the diagonal of the query window. Based on the window size, y2 was calculated.
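The window-generation steps above can be sketched as follows (a simplified reading: names are ours, the global boundary is assumed to be a 100 × 100 square, and boundary clipping of y2 is omitted):

```python
import random

def make_query_window(qws_mu, qws_sigma,
                      bound=(0.0, 0.0, 100.0, 100.0), rng=None):
    """Draw a query window whose area is a normally distributed
    percentage of the global area (clamped to one sigma), pick a random
    corner (x1, y1) and an x2; y2 then follows from the target area."""
    rng = rng or random.Random()
    gx1, gy1, gx2, gy2 = bound
    global_area = (gx2 - gx1) * (gy2 - gy1)
    pct = min(max(rng.gauss(qws_mu, qws_sigma), qws_mu - qws_sigma),
              qws_mu + qws_sigma)
    area = global_area * pct / 100.0
    x1 = rng.uniform(gx1, gx2)
    y1 = rng.uniform(gy1, gy2)
    x2 = rng.uniform(gx1, gx2)
    while abs(x2 - x1) < 1e-9:       # avoid a degenerate window
        x2 = rng.uniform(gx1, gx2)
    y2 = y1 + area / abs(x2 - x1)    # boundary clipping omitted here
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
```

Each call yields a different window whose area stays within one QWS-σ of the mean, which is the property the experiments vary from 1% to 50%.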
Simulator Implementation: Synthetic Data Generation
- Each borehole data item has a spatial location attribute: longitude and latitude.
- Randomly generate N data center points, C0, C1, ..., CN−1, located inside a global boundary. Ci = (xi, yi) is the geographical center of all the borehole data managed by an individual spatial database server.
- For each Ci, B associated boreholes pj are generated according to a normal distribution: the borehole points are denser near the center point and sparser as the distance to the center point increases.
- The generator limited the maximal distance of a borehole from its center to two standard deviations.
- After all the borehole points were created, the MBR of each database server was computed.
- Figure 3 illustrates the boreholes managed by ten servers and their respective MBRs.
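The generation steps above can be sketched as a small rejection-sampling generator (names, boundary size, and sigma are illustrative assumptions, not the paper's parameters):

```python
import random

def generate_boreholes(n_servers, per_server, bound=100.0, sigma=5.0, seed=1):
    """Place N random center points inside the global boundary; around
    each, draw boreholes from a normal distribution, rejecting points
    farther than two sigma so density falls off away from the center.
    Returns one dict per server with its center, points, and MBR."""
    rng = random.Random(seed)
    servers = []
    for _ in range(n_servers):
        cx, cy = rng.uniform(0, bound), rng.uniform(0, bound)
        pts = []
        while len(pts) < per_server:
            x, y = rng.gauss(cx, sigma), rng.gauss(cy, sigma)
            if (x - cx) ** 2 + (y - cy) ** 2 <= (2 * sigma) ** 2:
                pts.append((x, y))
        xs, ys = zip(*pts)
        mbr = (min(xs), min(ys), max(xs), max(ys))  # per-server MBR
        servers.append({"center": (cx, cy), "points": pts, "mbr": mbr})
    return servers
```

The two-sigma cutoff keeps each server's MBR compact, which in turn limits MBR overlap between servers, the property the routing performance depends on.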
Experiments: Synthetic Data
The accumulated traffic of queries and updates for the tree-based designs and the exhaustive query routing mechanism.
Experiments: Kobe Data Experiment
Data set
- Geotechnical data provided by Kobe University, Japan: 4,000 boreholes of Kobe county.
- The K-means algorithm [7] was used to cluster the Kobe data points in Euclidean space and assign them to database servers; the data set was divided into ten clusters (see Figure 5).

Experiment parameters
- Both the R-tree and Quad-tree index structures were used with different query window sizes (ranging from 1% to 50%).
- We also generated a synthetic data set with the same parameters (10 servers, 400 boreholes per server).
[7] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
Performance
Performance improved with both synthetic and real-world data sets.
- EQR: the worst case. OQR (Optimal Query Routing): defined as the best case.
- The tree-based designs reduce inter-server message traffic by 60% to 70% compared with exhaustive query routing (with query window sizes of 10% to 20%).
- Figure 7 shows the relationship between EQR (upper bound), the two tree-based designs, and OQR (lower bound) with different query window sizes.
- Normalized y-scale: the accumulated message counts of EQR and the tree-based designs are divided by the message count of OQR.
Network performance improvement rate (NPIR): NPIR = (TEQR − Ttree) / TEQR
TEQR: Total of accumulated messages for EQR
Ttree: Total of accumulated messages for the tree-based designs
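The NPIR definition on the slide is cut off; assuming it is the fraction of EQR's message total that the tree-based design saves (the natural reading of "improvement rate" given TEQR and Ttree), it can be computed as:

```python
def npir(t_eqr, t_tree):
    """Network performance improvement rate: fraction of EQR's total
    message count saved by the tree-based design (assumed definition;
    the exact formula is missing from the slide)."""
    return (t_eqr - t_tree) / t_eqr

# A 60-70% traffic reduction corresponds to NPIR values of 0.6-0.7.
print(npir(18000, 5850))
```

Under this reading, the reported 60% to 70% traffic reductions map directly to NPIR values of 0.6 to 0.7.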
Experiment
- Increasing the number of servers from 10 to 1000: NPIR stays roughly constant.
- Increasing the query window size from 1% to 50%: NPIR declines.
- Conclusion: the tree-based designs scale well to large distributed systems.
The best condition: no overlap between any server MBRs; the tree-based designs can reduce inter-server traffic by up to 90%.
The worst condition: significant MBR overlap; performance decreases to the same level as, or slightly worse (because of the update costs) than, EQR.
A system designer needs to consider the characteristics of the data set before opting for the tree-based query routing algorithms.
Related Work
Large-scale distributed data management systems: P2P (peer-to-peer) systems
- Key characteristics: dynamic topology, heterogeneity, self-organization.
- The query processing and routing approaches of some of the initial P2P systems focused on a centralized index server (e.g., Napster) or a flooding mechanism (e.g., Gnutella); these are either not very scalable or inefficient.
- Distributed hash tables (DHTs) achieve massive scalability and efficient query forwarding. Pastry [11], Chord [16], and CAN [9] provide a mechanism to perform object location within a potentially very large overlay network of nodes connected to the Internet, but they are unsuitable for range queries.
- Some techniques adapt DHT mechanisms for range queries. Harwood and Tanin [5] introduce a method to hash spatial content over P2P networks: space is divided in a Quadtree-like manner, and the central points of each square, denoted control points, are hashed to a Chord ring. Spatial objects and queries are resolved to spatial regions whose control points are then hashed onto the DHT ring.
- A distributed catalog service can locate XML path data; range queries are supported via wildcards in XML strings (i.e., "*") but may require a scan of some of the data.
Conclusion
Presented an architecture to efficiently route and execute spatial queries based on globally distributed and replicated index structures (R-tree and Quad-tree).
Performed extensive simulations with both synthetic and real data sets and observed that
- the update message traffic to keep the replicated indices synchronized is negligible;
- the overall query message traffic is significantly reduced, only slightly higher than what an optimal distribution algorithm with global knowledge could achieve.
Future Work
- The current metric, the number of messages, does not capture the parallelism achieved within the system.
- Future work: measure the response time and the query throughput.
Thank you!