An Incremental Refining Spatial Join Algorithm for Estimating Query Results in GIS Wan D. Bae, Shayma Alkobaisi, Scott T. Leutenegger Department of Computer

An Incremental Refining Spatial Join Algorithm for Estimating

Query Results in GIS

Wan D. Bae, Shayma Alkobaisi, Scott T. LeuteneggerDepartment of Computer Science

University of Denver

{wbae, salkobai, leut}@cs.du.edu

Outline

• Introduction• Motivation• Spatial Join Estimation • IRSJ Algorithm

– Sampling

– Joining

– Statistics

• Experiments• Conclusion

Introduction

• GIS data is used to describe the geometry and location of geographic phenomena.

• GIS data is represented in two ways:– Raster: divides the world into cells. – Vector: defines features based on coordinate-

based structures (e.g. point, line, polygon).

• This paper targets vector data.

Introduction Cont.

• Geographic or spatial queries are applied to spatially indexed databases.– e.g. containment and intersection.

• We focus on finding the number of intersections of two spatial datasets (spatial joins).– e.g. number of roads that intersect rivers in the US.

Spatial Joins

• Spatial joins relate two data sets that share locations in space.

• The processing of spatial queries can be accelerated when some spatial indexing such as R-tree exists.

• Spatial joins of two R-trees can be done by applying synchronized tree traversals on both R-tree nodes to find intersecting items.

Spatial Joins Cont.

R: Rivers S: Cities

R1 R2 S1 S2

r1 r2 r4r3 r5 s1 s2 s3 s4

R2

R1 r1

r2r3

r4

r5

s1

s2 s3

s4

S1

S2

Motivation

• GIS supports very large data sets finding exact answers to spatial queries can be very time consuming.

• In GIS data analysis a fast estimation of the final result that has error bounded to 2%-10% can do the job.

• So, provide an approximate answer through an incremental refining process.

• Thus, allow for more interactive data exploration.

Examples

1. “What are the intersections of mineral plants and radiometric ages areas in the US?”

2. “Where do mineral resources intersect geochemical sediments in the US?”

Examples Cont.

Locations of geochemical sediments in CO Locations of mineral resources in CO

Intersections

Spatial Join Estimation

• Parametric: uses some properties of data distribution to present a formula for the estimation. (e.g. power law, fractal dimension).

• Histograms: keep certain information for different regions of the data to be used when a query is given. (e.g. Geometric Histogram, Euler Histogram).

Spatial Join Estimation Cont.

• Sampling: uses smaller data sets (samples) to calculate an estimate of the final result by applying the join on the sample.

Incremental Refining Join Process

Dataset 1 (R) Dataset 2 (S)

Sampling

Samples

WQ

# intersections (intermediate

result)

Statistics

Final Estimation w/ CI

User

ReportIncremental Process

Random Sampling

• We assume that both data sets are indexed using R-trees.

• Samples are chosen from one R-tree called the outer relation R.

• Samples are used as window queries to query the inner relation S.

• Randomness:– Acceptance/Rejection method: inclusion probability is

proportional to some parameter of the item sampled.

Tuple and Page Level Sampling

• Tuple-level: – A page is selected at random from R and one tuple

(MBR) of that page is chosen at random.

• Page-level: – A page is selected at random from R and all tuples

(MBRs) of that page are used as a sample.

Window Query

• The chosen MBR from one data set (R-tree) serves as a window query to the other data set to find the intersections.

• The query returns all the objects from the second data set that overlaps with the query window.

• The number of intersections found is used in the process of finding an approximate answer to the query.

Window Query Example

R: Rivers S: Cities

R1 R2 S1 S2

r1 r2 r4r3 r5 s1 s2 s3 s4

R2

R1 r1

r2r3

r4

r5

s1

s2 s3

s4

S1

S2

Estimated Value and Confidence Interval

• Estimated Value: the statistic computed from sample information.

• Population Proportion: fraction indicating the part of the sample having a particular interest.

• Confidence interval: an interval that estimates a population parameter within a range of possible values at specified probability.– The specified probability is called the level of confidence.

N

nN

n

ppzE c

1

ˆ1

IRSJt Algorithm

1. C 0; CI 0 {count, confidence interval}2. repeat3. for i = 0 to k do4. L Choose leaf from R at random5. M MBR of a randomly chosen tuple within L6. I number of intersections of a Window Query

(M,S)7. C C + I8. end for

9. CI Compute confidence interval using C10. EV Compute estimated value using C

11. until The desired confidence interval Cf attained

Experiments (settings and data sets)

• IRSJ compared to full R-tree join.• Confidence level set to 95%.• Varied buffer size and data size.• Data sets:

– Synthetic: U x S, S x U, U x U (# of tuples in each relation varied from 100,000 to 600,000).

– Real: from the U.S. Geological Survey: 1. Mineral Resources in the US 2005 (300,432 tuples).

2. Geochemistry of unconsolidated sediments in the US 2001 (199,850 tuples).

Experiments Cont.(Synthetic data results)

Estimated Value

U-600K x S-400K

Synthetic Cont.

Confidence Interval

U-600K x S-400K

Synthetic Cont.

I/Os with 10% buffer

Synthetic Cont.

R-tree join Ratio to IRSJt

Experiments Cont.(Real data results)

Estimated Value

Real Cont.

Confidence Interval

Real Cont.

Buffer Size

CI = 10 CI = 5 CI = 3 CI=2 CI=1 R-join

I/Os 5%

10%

233

182

403

333

770

752

1896

1164

8159

3795

27756

24756

Node Accesses

5%

10%

780

668

2591

2136

6933

9028

18299

19671

87233

85822

262200

262200

I/O and Node Accesses of IRSJt and a full R-tree join

Conclusion

• Proposed Incremental Refining Spatial Join:– Page-level

– Tuple-level

• Experimental results showed:– IRSJ provides a reasonably accurate estimate in much

earlier stages than the exact answer obtained by full R-tree join.

– IRSJt performs better than IRSJp.

– As the data size increased, the improvement of IRSJt over full R-tree join increased.

Dataset 1 (R) Dataset 2 (S)

Sampling

Samples

WQ


result)

UserIncremental Joining Process

Statistics

Report

Final Estimation w/ CI

Dataset

Output / Input

Process

R: Rivers S: Cities

L1 L2 S1 S2

r1 r2 r4r3 r5 s1 s2 s3 s4

L2

L1 r1

r2r3

r4

r5

s1

s2 s3

s4

S1

S2

r3

R: Rivers S: Cities

L1 L2 S1 S2

r1 r2 r4r3 r5 s1 s2 s3 s4

L2

L1 r1

r2r3

r4

r5

s1

s2 s3

s4

S1

S2

r3

R: Rivers

L1 L2

r1 r2 r4r3 r5

L2

L1 r1

r2r3

r4

r5

R: Rivers

L1

r1

r3

L2 r2

r4

L1 L2 L3 L4 L5 L6

r6 r9 r10 r12

L3

L4

L5

L6r6

r9

r10

r12

r5

r11

r7

r8

r1 r2r3 r4 r5 r11 r7 r8

ST1 ST2 ST3

R: Rivers

L1

r1

r3

L2 r2

r4

L1 L2 L3 L4 L5 L6

r6 r9 r10 r12

L3

L4

L5

L6r6

r9

r10

r12

r5

r11

r7

r8

r1 r2r3 r4 r5 r11 r7 r8

Dataset 1 (R) Sampling Samples

WQDataset 2 (S)


result)

Statistics Final Estimation w/ CI

Report

User

Stop

End

YesNo

Dataset Output / Input Process Transition step

L1

r1

r3

L3

L4

L6

r6

r9

r10

r12

r5

r11

r7

r8

R1 R2 R3 R4

L1 L2 L3 L4 L5 L6 L7 L8

R: Rivers

r1 r3

L5

L2

r2

r4

r2 r4 r6 r9 r10 r12

L7

L8

R1

R2 R4

R3

ST1 ST2

L1 L2 L3 L4

R1 R2 R3 R4

L1 L2 L3 L4 L5 L6 L7 L8

r2

ST1 ST2

L1 L2 L3 L4

R: Rivers

r1 r3 r4 r5 r6 r7 r8 r10 r11 r12 r13 r16 r17 r9 r14

L5 L6 L7 L8

r15 r18

R1 R2 R3 R4

Documents

An Incremental Refining Spatial Join Algorithm for Estimating Query Results in GIS Wan D. Bae, Shayma Alkobaisi, Scott T. Leutenegger Department of Computer