View
215
Download
2
Embed Size (px)
Citation preview
An Incremental Refining Spatial Join Algorithm for Estimating
Query Results in GIS
Wan D. Bae, Shayma Alkobaisi, Scott T. LeuteneggerDepartment of Computer Science
University of Denver
{wbae, salkobai, leut}@cs.du.edu
Outline
• Introduction• Motivation• Spatial Join Estimation • IRSJ Algorithm
– Sampling
– Joining
– Statistics
• Experiments• Conclusion
Introduction
• GIS data is used to describe the geometry and location of geographic phenomena.
• GIS data is represented in two ways:– Raster: divides the world into cells. – Vector: defines features based on coordinate-
based structures (e.g. point, line, polygon).
• This paper targets vector data.
Introduction Cont.
• Geographic or spatial queries are applied to spatially indexed databases.– e.g. containment and intersection.
• We focus on finding the number of intersections of two spatial datasets (spatial joins).– e.g. number of roads that intersect rivers in the US.
Spatial Joins
• Spatial joins relate two data sets that share locations in space.
• The processing of spatial queries can be accelerated when some spatial indexing such as R-tree exists.
• Spatial joins of two R-trees can be done by applying synchronized tree traversals on both R-tree nodes to find intersecting items.
Spatial Joins Cont.
R: Rivers S: Cities
R1 R2 S1 S2
r1 r2 r4r3 r5 s1 s2 s3 s4
R2
R1 r1
r2r3
r4
r5
s1
s2 s3
s4
S1
S2
Motivation
• GIS supports very large data sets finding exact answers to spatial queries can be very time consuming.
• In GIS data analysis a fast estimation of the final result that has error bounded to 2%-10% can do the job.
• So, provide an approximate answer through an incremental refining process.
• Thus, allow for more interactive data exploration.
Examples
1. “What are the intersections of mineral plants and radiometric ages areas in the US?”
2. “Where do mineral resources intersect geochemical sediments in the US?”
Examples Cont.
Locations of geochemical sediments in CO Locations of mineral resources in CO
Intersections
Spatial Join Estimation
• Parametric: uses some properties of data distribution to present a formula for the estimation. (e.g. power law, fractal dimension).
• Histograms: keep certain information for different regions of the data to be used when a query is given. (e.g. Geometric Histogram, Euler Histogram).
Spatial Join Estimation Cont.
• Sampling: uses smaller data sets (samples) to calculate an estimate of the final result by applying the join on the sample.
Incremental Refining Join Process
Dataset 1 (R) Dataset 2 (S)
Sampling
Samples
WQ
# intersections (intermediate
result)
Statistics
Final Estimation w/ CI
User
ReportIncremental Process
Random Sampling
• We assume that both data sets are indexed using R-trees.
• Samples are chosen from one R-tree called the outer relation R.
• Samples are used as window queries to query the inner relation S.
• Randomness:– Acceptance/Rejection method: inclusion probability is
proportional to some parameter of the item sampled.
Tuple and Page Level Sampling
• Tuple-level: – A page is selected at random from R and one tuple
(MBR) of that page is chosen at random.
• Page-level: – A page is selected at random from R and all tuples
(MBRs) of that page are used as a sample.
Window Query
• The chosen MBR from one data set (R-tree) serves as a window query to the other data set to find the intersections.
• The query returns all the objects from the second data set that overlaps with the query window.
• The number of intersections found is used in the process of finding an approximate answer to the query.
Window Query Example
R: Rivers S: Cities
R1 R2 S1 S2
r1 r2 r4r3 r5 s1 s2 s3 s4
R2
R1 r1
r2r3
r4
r5
s1
s2 s3
s4
S1
S2
Estimated Value and Confidence Interval
• Estimated Value: the statistic computed from sample information.
• Population Proportion: fraction indicating the part of the sample having a particular interest.
• Confidence interval: an interval that estimates a population parameter within a range of possible values at specified probability.– The specified probability is called the level of confidence.
N
nN
n
ppzE c
1
ˆ1
IRSJt Algorithm
1. C 0; CI 0 {count, confidence interval}2. repeat3. for i = 0 to k do4. L Choose leaf from R at random5. M MBR of a randomly chosen tuple within L6. I number of intersections of a Window Query
(M,S)7. C C + I8. end for
9. CI Compute confidence interval using C10. EV Compute estimated value using C
11. until The desired confidence interval Cf attained
Experiments (settings and data sets)
• IRSJ compared to full R-tree join.• Confidence level set to 95%.• Varied buffer size and data size.• Data sets:
– Synthetic: U x S, S x U, U x U (# of tuples in each relation varied from 100,000 to 600,000).
– Real: from the U.S. Geological Survey: 1. Mineral Resources in the US 2005 (300,432 tuples).
2. Geochemistry of unconsolidated sediments in the US 2001 (199,850 tuples).
Experiments Cont.(Synthetic data results)
Estimated Value
U-600K x S-400K
Synthetic Cont.
Confidence Interval
U-600K x S-400K
Synthetic Cont.
I/Os with 10% buffer
Synthetic Cont.
R-tree join Ratio to IRSJt
Experiments Cont.(Real data results)
Estimated Value
Real Cont.
Confidence Interval
Real Cont.
Buffer Size
CI = 10 CI = 5 CI = 3 CI=2 CI=1 R-join
I/Os 5%
10%
233
182
403
333
770
752
1896
1164
8159
3795
27756
24756
Node Accesses
5%
10%
780
668
2591
2136
6933
9028
18299
19671
87233
85822
262200
262200
I/O and Node Accesses of IRSJt and a full R-tree join
Conclusion
• Proposed Incremental Refining Spatial Join:– Page-level
– Tuple-level
• Experimental results showed:– IRSJ provides a reasonably accurate estimate in much
earlier stages than the exact answer obtained by full R-tree join.
– IRSJt performs better than IRSJp.
– As the data size increased, the improvement of IRSJt over full R-tree join increased.
Dataset 1 (R) Dataset 2 (S)
Sampling
Samples
WQ
# intersections (intermediate
result)
UserIncremental Joining Process
Statistics
Report
Final Estimation w/ CI
Dataset
Output / Input
Process
R: Rivers S: Cities
L1 L2 S1 S2
r1 r2 r4r3 r5 s1 s2 s3 s4
L2
L1 r1
r2r3
r4
r5
s1
s2 s3
s4
S1
S2
r3
R: Rivers S: Cities
L1 L2 S1 S2
r1 r2 r4r3 r5 s1 s2 s3 s4
L2
L1 r1
r2r3
r4
r5
s1
s2 s3
s4
S1
S2
r3
R: Rivers
L1 L2
r1 r2 r4r3 r5
L2
L1 r1
r2r3
r4
r5
R: Rivers
L1
r1
r3
L2 r2
r4
L1 L2 L3 L4 L5 L6
r6 r9 r10 r12
L3
L4
L5
L6r6
r9
r10
r12
r5
r11
r7
r8
r1 r2r3 r4 r5 r11 r7 r8
ST1 ST2 ST3
R: Rivers
L1
r1
r3
L2 r2
r4
L1 L2 L3 L4 L5 L6
r6 r9 r10 r12
L3
L4
L5
L6r6
r9
r10
r12
r5
r11
r7
r8
r1 r2r3 r4 r5 r11 r7 r8
Dataset 1 (R) Sampling Samples
WQDataset 2 (S)
# intersections (intermediate
result)
Statistics Final Estimation w/ CI
Report
User
Stop
End
YesNo
Dataset Output / Input Process Transition step
L1
r1
r3
L3
L4
L6
r6
r9
r10
r12
r5
r11
r7
r8
R1 R2 R3 R4
L1 L2 L3 L4 L5 L6 L7 L8
R: Rivers
r1 r3
L5
L2
r2
r4
r2 r4 r6 r9 r10 r12
L7
L8
R1
R2 R4
R3
ST1 ST2
L1 L2 L3 L4
R1 R2 R3 R4
L1 L2 L3 L4 L5 L6 L7 L8
r2
ST1 ST2
L1 L2 L3 L4
R: Rivers
r1 r3 r4 r5 r6 r7 r8 r10 r11 r12 r13 r16 r17 r9 r14
L5 L6 L7 L8
r15 r18
R1 R2 R3 R4