Efficient Query Routing in Distributed Spatial Databases
Roger Zimmermann, Wei-Shinn Ku, and Wei-Cheng Chu
Computer Science Department
University of Southern California
Presenter: Xunfei Jiang
Introduction

Problem
- Current architectures rely on a centralized data repository.
- Applications utilize and integrate data sets that are remotely accessible and under different administrative control.

Solution: combine spatial databases with Web services
- Data is maintained by specific entities or organizations.
- The correct data set for a specific calculation can be downloaded automatically, without manual user intervention.
- Updates and changes to the data are instantly available to remote applications.

Proposed middleware design
- Based on distributed R-tree and Quad-tree index structures.
- Access to the data is public and available through a Web services interface.
- Requests are sent only to the specific repositories that most likely have relevant data.
Motivation
- Autonomy
- Cooperative and efficient query processing
- Decentralization
Figure 1: The proposed distributed spatial database infrastructure with middleware utilizing replicated spatial index structures (either R-trees or Quad-trees).
- Nodes cooperate to decide which other nodes contain potentially relevant data for a query and which do not.
- Because the data archives are dispersed, the query access method and routing mechanism are expected to be fully decentralized.
- The overall system must cooperatively execute each request and return all relevant data.
Baseline Algorithm: EQR (Exhaustive Query Routing)
- When a query q arrives at a specific node, it is forwarded to all other nodes.
- Messages generated by queries: M = 2 × Q × (N − 1)
  - Q: the total number of queries; N: the number of nodes; the factor 2 accounts for the same number of result messages being generated.
- Shortcomings of EQR: it generates a lot of message traffic and scales poorly.
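The EQR message count above can be made concrete with a few lines of Python (a minimal sketch; the function name is ours, not from the paper):

```python
def eqr_messages(num_queries, num_nodes):
    """Exhaustive query routing: each query is forwarded to the other
    N-1 nodes, and each forward triggers one result message back."""
    return 2 * num_queries * (num_nodes - 1)

# 1,000 queries over 10 nodes already generate 18,000 messages.
print(eqr_messages(1000, 10))  # 18000
```

The linear factor (N − 1) per query is what makes EQR scale poorly as nodes are added.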
Query Routing with Spatial Indexing

How to reduce the query space
- Recursively partition the key space into a set of equivalence classes.
- The R-tree and Quad-tree algorithms generate tree-structured indices that partition the overall space into successively smaller areas at lower levels of the index hierarchy.
- Both have been successfully used in the core engines of spatial database systems.
Quad-tree vs. R-tree

Structure
- Quad-tree: each internal node has exactly four children.
- R-tree: an extension of the B-tree; the upper and lower bounds on the number of entries in each internal node are pre-specified (usually 2 and 4).

Characteristics
- Quad-tree: decomposes space into adaptable cells. Each cell (or bucket) has a maximum capacity; when the maximum capacity is reached, the bucket splits. The tree directory follows the spatial decomposition of the Quad-tree.
- R-tree: splits space into rectangles. When a rectangle contains more than the maximum number of children, an adjustment must be made. Object nodes are at the lowest level.
A novel method: use R-trees and Quad-trees as index structures across multiple spatial databases
- Insert the MBR of each archive's data set into a global R-tree or Quad-tree.
- Distribute copies of the global index structure to each archive (avoiding a centralized index server).
- An archive can intersect each query rectangle with the archive MBRs stored in the global index; the query is then forwarded only to archives whose MBR overlaps the query rectangle.

Cons: additional cost of synchronizing the global index structures.
Pros: reduces the overhead. Because the global index structures manage only bounding rectangles, changes to the data set of any individual archive result in index updates only if the MBR changes, which is very infrequent.
Example: an archive manages 1,000 two-dimensional spatial data objects; its MBR is defined by at most four of them.
- Insertions or deletions confined within an MBR do not affect the global index; only changes that stretch or shrink the MBR need to be propagated.
- Estimated number of messages when a global index is used:
MT = MQ + MU = [2 × Q × (N − 1) × SQ] + [U × (N − 1) × SU]
In one of our experiments, SU values ranged from 0.001 to 0.058.
Symbol  Description
Q       Number of queries
U       Number of global index updates
N       Number of nodes
M       Number of messages
MT      Total number of messages
MQ      Number of messages generated by queries
MU      Number of messages generated by updates
SQ      Selectivity of queries
SU      Selectivity of global index updates
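Using the symbols above, the MT estimate can be sketched in Python (the function name and sample selectivity values are illustrative, not from the paper):

```python
def index_routing_messages(q, u, n, s_q, s_u):
    """Estimated traffic with a replicated global index: MT = MQ + MU.
    s_q: fraction of nodes whose MBR overlaps an average query window;
    s_u: fraction of local updates that actually change an MBR."""
    m_q = 2 * q * (n - 1) * s_q   # query forwards plus result messages
    m_u = u * (n - 1) * s_u       # MBR-change broadcasts
    return m_q + m_u

# With 10 nodes, 1,000 queries, 1,000 updates, s_q = 0.3, s_u = 0.05:
# 5,400 query messages + 450 update messages, versus 18,000 for EQR.
print(index_routing_messages(1000, 1000, 10, 0.3, 0.05))
```

Because SU is small in practice (0.001 to 0.058 in the experiments), the update term MU stays almost negligible next to the query savings.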
Assumptions: every archive in the distributed environment
- hosts a database engine (storing, retrieving, querying);
- holds a directory, termed the server list, with entries denoting the network location (e.g., IP address) and the minimum bounding rectangle (MBR) of every spatial database. From the directory information, each server computes the corresponding R-tree (or Quad-tree) global index data structure.

Updates
- The local data structure is updated after receiving MBR update messages.
- Local MBR changes due to data insertions or deletions initiate update messages to all the other database servers.
In the R-tree based design, the upper and lower bounds on the number of entries that fit in one internal tree node are pre-specified: M is the maximum number of entries that fit in one node, and m ≤ M/2 is the parameter specifying the minimum number of entries in a node.
The R-Tree Based Design: Index Initialization and Topology Maintenance

A new server (A) joins the spatial database:
1) A sends its information (IP address and MBR) as an update message to an existing server B.
2) B updates its local R-tree index and replies with the current system information.
3) A constructs its own R-tree index.
4) A broadcasts an update message with its hostname and MBR to all the other servers except B.

Archive A departs:
- It broadcasts an update message to announce that it is removing itself from the topology.
- The other servers delete the leaving server's MBR from their R-trees.

A server fails:
- The node that first detects the unresponsive system broadcasts a removal message to everyone.
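The join steps above can be modeled as a toy in-memory sketch (class and method names are hypothetical; the real system exchanges these messages through Web-service calls, and index reconstruction is omitted):

```python
class SpatialServer:
    """Toy model of a database server's directory (the server list)."""

    def __init__(self, addr, mbr):
        self.addr = addr
        self.mbr = mbr                  # (x1, y1, x2, y2)
        self.server_list = {addr: mbr}  # network location -> MBR

    def receive_update(self, addr, mbr):
        # A real server would also patch its local R-tree index here.
        self.server_list[addr] = mbr

    def join(self, bootstrap, live_servers):
        # 1) send (address, MBR) as an update message to existing server B
        bootstrap.receive_update(self.addr, self.mbr)
        # 2) B replies with the current system information
        self.server_list.update(bootstrap.server_list)
        # 3) construct the local R-tree from server_list (omitted)
        # 4) broadcast to all other servers except B
        for srv in live_servers:
            if srv.addr not in (self.addr, bootstrap.addr):
                srv.receive_update(self.addr, self.mbr)

b = SpatialServer("B", (0, 0, 1, 1))
c = SpatialServer("C", (2, 2, 3, 3)); c.join(b, [b])
a = SpatialServer("A", (4, 4, 5, 5)); a.join(b, [b, c])
```

After the two joins, every server's directory lists all three MBRs, which is the invariant the replicated global index relies on.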
The R-Tree Based Design: Query Routing
- Clients do not need to contact all the servers to obtain comprehensive query results.
- Queries sent to a server are automatically forwarded and yield accurate spatial results from the complete data set.
- The queried server determines through its local R-tree whether any of the other archives in the collective potentially have relevant data, i.e., whether the query rectangle and an archive's MBR intersect.
- Forwarded queries are flagged to show that they originated from a server rather than a client, which avoids query loops.
- The results of forwarded queries are returned to the initially contacted server, which aggregates them and returns the set to the client.
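The routing decision itself reduces to a rectangle-intersection filter plus the forwarded flag; a minimal sketch (function and variable names are ours, and the server list stands in for the local R-tree):

```python
def route_query(server_list, local_addr, query_rect, forwarded=False):
    """Return the addresses the query should be forwarded to: only
    servers whose MBR intersects the query rectangle. A query that was
    itself forwarded is never re-forwarded (prevents query loops)."""
    if forwarded:
        return []
    x1, y1, x2, y2 = query_rect
    targets = []
    for addr, (mx1, my1, mx2, my2) in server_list.items():
        if addr == local_addr:
            continue  # answered locally, not forwarded
        # axis-aligned rectangle intersection test
        if mx1 <= x2 and x1 <= mx2 and my1 <= y2 and y1 <= my2:
            targets.append(addr)
    return targets

servers = {"A": (0, 0, 10, 10), "B": (20, 20, 30, 30), "C": (5, 5, 15, 15)}
print(route_query(servers, "A", (0, 0, 8, 8)))  # ['C']
```

In the full design the same check is done by descending the replicated R-tree rather than scanning the flat list, but the pruning criterion is identical.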
The R-Tree Based Design: R-Tree Index Update
- Each spatial database server must process data object update requests (insertions and deletions) from local users.
- If an update changes the MBR of the local R-tree index, the new MBR is broadcast to all the other servers in the system for tree index synchronization.
The Quad-tree Based Design: Quad-tree Index Update
Slight differences from the R-tree based design arise for tree index updates. If an object insertion or deletion changes the MBR boundary, the Quad-tree model checks whether the MBR variation affects the Quad-tree structure:
- Changed: the update is propagated to all the other servers as usual for tree index synchronization.
- Unchanged: the MBR update is not broadcast.
- As the number of servers increases, update traffic remains constant: Table 2 illustrates that approximately 4.2% to 5.8% of all insert or delete operations result in an MBR change.
- When both the server and update numbers increase linearly, the activity per server is relatively constant; Table 3 shows the experimental results.
- Conclusion: the update message traffic to synchronize distributed Quad-trees is much lower than for R-trees.
Nearest Neighbor Queries
Recent research applied to our system: a branch-and-bound R-tree traversal algorithm that efficiently answers both NN and k-NN queries.

Two metrics:
- Minimum distance (MINDIST): the minimum Euclidean distance between the query point and the nearest edge of an MBR; the optimistic choice.
- Minimum of the maximum possible distance (MINMAXDIST): the minimum of the maximum distances between the query point and points on each of the axes of an MBR; the pessimistic choice.
[Figure: MINDIST and MINMAXDIST between a query point and an MBR, with the MAXDIST to each of the edges x1, x2, y1, and y2 shown.]
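The two metrics from the figure can be computed directly; a minimal sketch following the standard branch-and-bound R-tree formulation (function names are ours; `lo`/`hi` are the MBR's lower-left and upper-right corners):

```python
import math

def mindist(p, lo, hi):
    # Minimum Euclidean distance from point p to the MBR [lo, hi];
    # zero contribution on any axis where p lies inside the interval.
    return math.sqrt(sum(max(l - x, 0.0, x - u) ** 2
                         for x, l, u in zip(p, lo, hi)))

def minmaxdist(p, lo, hi):
    # For each axis k, assume a data point must lie on the nearer face
    # along k; combine with the farther side on every other axis, then
    # take the minimum over k.
    d = len(p)
    rm = [lo[i] if p[i] <= (lo[i] + hi[i]) / 2 else hi[i] for i in range(d)]
    rM = [lo[i] if p[i] >= (lo[i] + hi[i]) / 2 else hi[i] for i in range(d)]
    total = sum((p[i] - rM[i]) ** 2 for i in range(d))
    best = min((p[k] - rm[k]) ** 2 + total - (p[k] - rM[k]) ** 2
               for k in range(d))
    return math.sqrt(best)

# Query point at the origin, MBR from (1,1) to (3,3):
print(mindist((0, 0), (1, 1), (3, 3)))     # sqrt(2), nearest corner
print(minmaxdist((0, 0), (1, 1), (3, 3)))  # sqrt(10)
```

MINDIST never exceeds MINMAXDIST for the same point and MBR, which is what makes the pair usable as optimistic and pessimistic pruning bounds.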
Nearest Neighbor Queries
- The NN search algorithm implements an ordered depth-first traversal based on the values of MINDIST and MINMAXDIST.
- It begins from the R-tree root node and proceeds down the tree hierarchy. At a leaf node, a distance computation function is invoked to compute the actual distance between the query point and the candidate DB objects.
- The algorithm iterates with three search-pruning strategies until it finds the NN object.
Distributed design
- Every server maintains a local R-tree; the NN search algorithm is executed on the local R-tree to compute both the MINDIST and MINMAXDIST values.
- A Web service interface is created at each node to access these distance values across multiple archives and to remotely obtain the distance between the search point and a candidate nearest data point.
- To answer a NN query, a server needs to send several distance query messages to other servers in the system during the branch-and-bound process.
- With the three search-pruning strategies proposed in "Nearest Neighbor Queries" and a slightly modified search algorithm, NN queries can be executed efficiently.
EXPERIMENTAL VALIDATION
We implemented the tree-based designs in a simulator to evaluate the performance of our approach.

Index structure
- R-tree algorithm
- MX-CIF Quad-tree algorithm
- The index tree search complexity is the same as for these algorithms.
- In a distributed environment, the search complexity is dominated by the communication overhead between servers.

Focus of the simulation: quantifying the query routing traffic generated by queries and updates.

Data sets: a synthetic spatial data set and a real-world spatial data set.
Simulator Implementation
- The leaf nodes represent specific server MBRs and contain forwarding pointers (i.e., the host names and IP addresses) to the remote servers.
- The leaf node of the MBR of the local data set directly points to the local database.
- If a query window intersects several server MBRs, the query is forwarded to each of them.
- The simulator counts all the messages generated through the query forwarding mechanism and all the return messages containing result data sets.
- Additionally, tree update information must be broadcast to all servers.
Simulator Implementation: Event Generation
- Two types of events: queries and updates.
- Data updates could be either insertion or deletion requests.
- Both types were generated according to a Poisson distribution, with the inter-arrival rates λQ and λU specified independently.
- Simulated time: ten hours.
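Poisson arrivals with independent rates reduce to exponential inter-arrival times; a minimal sketch of such an event generator (function name, rates per hour, and seeding are our assumptions, not the paper's simulator):

```python
import random

def generate_events(rate_q, rate_u, sim_hours=10, seed=42):
    """Generate (timestamp, kind) events over the simulated period,
    drawing exponential inter-arrival gaps so that queries and updates
    each form a Poisson process with its own rate (per hour)."""
    rng = random.Random(seed)
    events = []
    for kind, lam in (("query", rate_q), ("update", rate_u)):
        t = rng.expovariate(lam)
        while t < sim_hours:
            events.append((t, kind))
            t += rng.expovariate(lam)
    events.sort()  # merge the two streams into one timeline
    return events

timeline = generate_events(rate_q=100, rate_u=10)
```

Over ten simulated hours this yields roughly 1,000 queries and 100 updates, matching the independent λQ and λU described above.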
Simulator Implementation: Query Parameter Generation
Query windows were created dynamically based on two parameters (see Table 4):
- Mean query window size (QWS-μ): the mean percentage of the global geographical area used for the query window, based on a normal distribution.
- Deviation (QWS-σ): provides a variation range bounded by one QWS-σ deviation, so the query window area differs for each query event.

Example: with QWS-μ and QWS-σ, the simulator first chose the query window size, then randomly selected a point (x1, y1) as one corner coordinate and a value x2 inside the global boundary as the x-value of the opposite corner across the diagonal of the query window. Based on the window size, y2 was calculated.
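The window-generation steps above can be sketched as follows (a simplified reading: names are ours, the global boundary is assumed to be a 100 × 100 square, and boundary clipping of y2 is omitted):

```python
import random

def make_query_window(qws_mu, qws_sigma,
                      bound=(0.0, 0.0, 100.0, 100.0), rng=None):
    """Draw a query window whose area is a normally distributed
    percentage of the global area (clamped to one sigma), pick a random
    corner (x1, y1) and an x2; y2 then follows from the target area."""
    rng = rng or random.Random()
    gx1, gy1, gx2, gy2 = bound
    global_area = (gx2 - gx1) * (gy2 - gy1)
    pct = min(max(rng.gauss(qws_mu, qws_sigma), qws_mu - qws_sigma),
              qws_mu + qws_sigma)
    area = global_area * pct / 100.0
    x1 = rng.uniform(gx1, gx2)
    y1 = rng.uniform(gy1, gy2)
    x2 = rng.uniform(gx1, gx2)
    while abs(x2 - x1) < 1e-9:       # avoid a degenerate window
        x2 = rng.uniform(gx1, gx2)
    y2 = y1 + area / abs(x2 - x1)    # boundary clipping omitted here
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))
```

Each call yields a different window whose area stays within one QWS-σ of the mean, which is the property the experiments vary from 1% to 50%.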
Simulator Implementation: Synthetic Data Generation
- Each borehole data item has a spatial location attribute: longitude and latitude.
- Randomly generate N data center points, C0, C1, ..., CN−1, located inside a global boundary. Ci = (xi, yi) is the geographical center of all the borehole data managed by an individual spatial database server.
- For each Ci, B associated boreholes pj are generated according to a normal distribution: the borehole points are denser near the center point and sparser as the distance to the center point increases.
- The generator limited the maximal distance of a borehole from its center to two standard deviations.
- After all the borehole points were created, the MBR of each database server was computed.
- Figure 3 illustrates the boreholes managed by ten servers and their respective MBRs.
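The generation steps above can be sketched as a small rejection-sampling generator (names, boundary size, and sigma are illustrative assumptions, not the paper's parameters):

```python
import random

def generate_boreholes(n_servers, per_server, bound=100.0, sigma=5.0, seed=1):
    """Place N random center points inside the global boundary; around
    each, draw boreholes from a normal distribution, rejecting points
    farther than two sigma so density falls off away from the center.
    Returns one dict per server with its center, points, and MBR."""
    rng = random.Random(seed)
    servers = []
    for _ in range(n_servers):
        cx, cy = rng.uniform(0, bound), rng.uniform(0, bound)
        pts = []
        while len(pts) < per_server:
            x, y = rng.gauss(cx, sigma), rng.gauss(cy, sigma)
            if (x - cx) ** 2 + (y - cy) ** 2 <= (2 * sigma) ** 2:
                pts.append((x, y))
        xs, ys = zip(*pts)
        mbr = (min(xs), min(ys), max(xs), max(ys))  # per-server MBR
        servers.append({"center": (cx, cy), "points": pts, "mbr": mbr})
    return servers
```

The two-sigma cutoff keeps each server's MBR compact, which in turn limits MBR overlap between servers, the property the routing performance depends on.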
Experiments: Synthetic Data
The accumulated traffic of queries and updates for the tree-based designs and the exhaustive query routing mechanism.
Experiments: Kobe Data Experiment
Data set
- Geotechnical data provided by Kobe University, Japan: 4,000 boreholes of Kobe county.
- The K-means algorithm [7] was used to cluster the Kobe data points in Euclidean space and assign them to database servers; the data set was divided into ten clusters (see Figure 5).

Experiment parameters
- Both the R-tree and Quad-tree index structures were used with different query window sizes (ranging from 1% to 50%).
- We also generated a synthetic data set with the same parameters (10 servers, 400 boreholes per server).
[7] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.
Performance
Performance improved with both synthetic and real-world data sets.
- EQR: the worst case. OQR (Optimal Query Routing): defined as the best case.
- The tree-based designs reduce inter-server message traffic by 60% to 70% compared with exhaustive query routing (with query window sizes of 10% to 20%).
- Figure 7 shows the relationship between EQR (upper bound), the two tree-based designs, and OQR (lower bound) with different query window sizes.
- Normalized y-scale: the accumulated message counts of EQR and the tree-based designs are divided by the message count of OQR.
Network performance improvement rate (NPIR): NPIR = (TEQR − Ttree) / TEQR
TEQR: Total of accumulated messages for EQR
Ttree: Total of accumulated messages for the tree-based designs
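The NPIR definition on the slide is cut off; assuming it is the fraction of EQR's message total that the tree-based design saves (the natural reading of "improvement rate" given TEQR and Ttree), it can be computed as:

```python
def npir(t_eqr, t_tree):
    """Network performance improvement rate: fraction of EQR's total
    message count saved by the tree-based design (assumed definition;
    the exact formula is missing from the slide)."""
    return (t_eqr - t_tree) / t_eqr

# A 60-70% traffic reduction corresponds to NPIR values of 0.6-0.7.
print(npir(18000, 5850))
```

Under this reading, the reported 60% to 70% traffic reductions map directly to NPIR values of 0.6 to 0.7.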
Experiment
- Increasing the number of servers from 10 to 1000: NPIR stays roughly constant.
- Increasing the query window size from 1% to 50%: NPIR declines.
- Conclusion: the tree-based designs scale well to large distributed systems.
The best condition: no overlap between any server MBRs; the tree-based designs can reduce inter-server traffic by up to 90%.
The worst condition: significant MBR overlap; performance decreases to the same level as, or slightly worse (because of the update costs) than, EQR.
A system designer needs to consider the characteristics of the data set before opting for the tree-based query routing algorithms.
Related Work
Large-scale distributed data management systems: P2P (peer-to-peer) systems
- Key characteristics: dynamic topology, heterogeneity, self-organization.
- The query processing and routing approaches of some of the initial P2P systems focused on a centralized index server (e.g., Napster) or a flooding mechanism (e.g., Gnutella); these are either not very scalable or inefficient.
- Distributed hash tables (DHTs) achieve massive scalability and efficient query forwarding. Pastry [11], Chord [16], and CAN [9] provide a mechanism to perform object location within a potentially very large overlay network of nodes connected to the Internet, but they are unsuitable for range queries.
- Some techniques adapt DHT mechanisms for range queries. Harwood and Tanin [5] introduce a method to hash spatial content over P2P networks: space is divided in a Quadtree-like manner, and the central points of each square, denoted control points, are hashed to a Chord ring. Spatial objects and queries are resolved to spatial regions whose control points are then hashed onto the DHT ring.
- A distributed catalog service can locate XML path data; range queries are supported via wildcards in XML strings (i.e., "*") but may require a scan of some of the data.
Conclusion
Presented an architecture to efficiently route and execute spatial queries based on globally distributed and replicated index structures (R-tree and Quad-tree).
Performed extensive simulations with both synthetic and real data sets and observed that
- the update message traffic to keep the replicated indices synchronized is negligible;
- the overall query message traffic is significantly reduced, only slightly higher than what an optimal distribution algorithm with global knowledge could achieve.
Future Work
- The current metric, the number of messages, does not capture the parallelism achieved within the system.
- Future work: measure the response time and the query throughput.
Thank you!