Approximate String Search in Spatial Databases

Bin Yao, Florida State University

Kun Hou, Florida State University

Marios Hadjieleftheriou, AT&T Labs Research

Feifei Li, Florida State University

Introduction

Approximate string search is important in data cleaning, data integration, and online search engines.

It is also important in spatial databases, for example in map services and strategic planning of resources.

Our work: approximate string search in spatial databases (Spatial Approximate String (SAS) queries).

Example of a SAS range query

String similarity metric: edit distance with threshold τ.
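The slides name the metric but do not spell it out. As a minimal sketch, assuming the usual unit-cost Levenshtein definition (one insertion, deletion, or substitution costs 1), the edit distance can be computed with a two-row dynamic program. The function name `edit_distance` is illustrative and not taken from the talk.

```python
def edit_distance(a, b):
    """Unit-cost Levenshtein distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# A string matches the query when its distance is within the threshold tau.
tau = 2
matches = edit_distance("theatre", "theater") <= tau
```

With τ = 2 the pair theatre / theater qualifies, because two substitutions turn one into the other.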

A straightforward solution

R-tree solution:

Problem: it utilizes only the spatial dimension for pruning.

Background: q-grams based edit distance pruning

Lemma: For strings σ1 and σ2 of length |σ1| and |σ2|, if ε(σ1, σ2) = τ, then |Gσ1 ∩ Gσ2| ≥ max(|σ1|, |σ2|) − 1 − (τ − 1) ∗ q.

Example: q = 2, τ = 2, σ1: theatre; σ2: theater.

Gσ1 = {#t, th, he, ea, at, tr, re, e$}, Gσ2 = {#t, th, he, ea, at, te, er, r$}.

The two sets share 5 q-grams, which meets the bound 7 − 1 − (2 − 1) ∗ 2 = 4.
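The lemma turns into a cheap filter that can run before any edit distance computation. The sketch below assumes the padding convention visible in the example (q − 1 copies of `#` in front and of `$` behind) and treats the q-grams as a set; the function names are illustrative.

```python
def qgrams(s, q=2):
    """Set of q-grams of s after padding with q-1 '#' and q-1 '$'."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def passes_count_filter(s1, s2, tau, q=2):
    """Necessary condition from the lemma: enough q-grams in common."""
    needed = max(len(s1), len(s2)) - 1 - (tau - 1) * q
    return len(qgrams(s1, q) & qgrams(s2, q)) >= needed


common = qgrams("theatre") & qgrams("theater")
keep = passes_count_filter("theatre", "theater", tau=2)
```

A pair that fails the filter cannot be within edit distance τ, so it can be discarded without running the dynamic program. A pair that passes still needs the exact check.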

Background: estimating set resemblance

Set resemblance of two sets A and B: ρ(A, B) = |A ∩ B| / |A ∪ B|.

The min-wise signature: for any set X, any x ∈ X, and π ∈ F (a family of min-wise independent permutations),

Pr(min{π(X)} = π(x)) = 1 / |X|.

s(A) = {min{π1(A)}, ..., min{πℓ(A)}},
s(A ∪ B) = {min{π1(A), π1(B)}, ..., min{πℓ(A), πℓ(B)}}.

Unbiased estimator for ρ(A, B):

ρ(A, B) = Pr(min{π(A)} = min{π(B)}).

From the signatures this is estimated as

ρ̂(A, B) = |{i | min{πi(A)} = min{πi(B)}}| / ℓ.
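A minimal sketch of these definitions follows. It assumes integer set elements and replaces the ideal min-wise independent permutations with random linear hash functions h(x) = (a·x + b) mod P, which only approximate that property. The modulus `P`, the seed, and all function names are choices made for this example.

```python
import random

P = 2_147_483_647  # prime modulus for the linear hashes (an assumed choice)


def make_hashes(count, seed=0):
    """Draw `count` pairs (a, b) defining h(x) = (a * x + b) mod P."""
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(count)]


def signature(items, hashes):
    """Min-wise signature: the minimum hash value of the set under each hash."""
    return [min((a * x + b) % P for x in items) for a, b in hashes]


def merge(sig_a, sig_b):
    """Signature of the union, obtained as the element-wise minimum."""
    return [min(x, y) for x, y in zip(sig_a, sig_b)]


def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


hashes = make_hashes(64, seed=1)
sig_a = signature({1, 2, 4}, hashes)
sig_b = signature({2, 3}, hashes)
sig_union = merge(sig_a, sig_b)
```

The useful property for an index is that `merge` never needs the original sets: the signature of a union is recoverable from the signatures of its parts.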

Background: estimating set resemblance

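Set A = {1, 2, 4}

hashes   1  2  4
h1       1  3  4
h2       3  1  2
h3       2  4  3
h4       4  2  1

Signature of A: 1, 1, 2, 1

Set B = {2, 3}

hashes   2  3
h1       3  2
h2       1  4
h3       4  1
h4       2  3

Signature of B: 2, 1, 1, 2

ρ(A, B) = 1/4, estimated as the fraction of signature positions that agree (only the second of the four).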
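The example can be replayed in a few lines. The dictionary below merges the two hash tables from the slide into one table per hash function; the variable and function names are illustrative.

```python
from fractions import Fraction

# Hash values read off the slide, one mapping per hash function.
HASHES = [
    {1: 1, 2: 3, 3: 2, 4: 4},  # h1
    {1: 3, 2: 1, 3: 4, 4: 2},  # h2
    {1: 2, 2: 4, 3: 1, 4: 3},  # h3
    {1: 4, 2: 2, 3: 3, 4: 1},  # h4
]


def signature(items):
    """Minimum hash value of the set under each of the four hash functions."""
    return [min(h[x] for x in items) for h in HASHES]


def estimate(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    agree = sum(x == y for x, y in zip(sig_a, sig_b))
    return Fraction(agree, len(sig_a))


sig_a = signature({1, 2, 4})
sig_b = signature({2, 3})
rho = estimate(sig_a, sig_b)
```

Here the estimate happens to equal the true resemblance, since A ∩ B = {2} and A ∪ B = {1, 2, 3, 4}. With only four hash functions that agreement is not guaranteed in general.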

The MHR-tree: basic idea

Lemma 2: Let Gu be the union of the q-grams of the strings in the subtree of node u. For a SAS query (r, σ, τ), if |Gu ∩ Gσ| < |σ| − 1 − (τ − 1) ∗ q, then the subtree of node u does not contain any element from As.

The MHR-tree: union of q-grams (q = 2)

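[Figure: an R-tree in which every node stores its MBR together with the union of the q-grams found in its subtree.]

Leaf level: ab, cd, ab, ef, gh, ij, kl, nr

Internal nodes: {ab, cd, ef} and {gh, ij, kl, nr}

Root: {ab, cd, ef, gh, ij, kl, nr}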

The MHR-tree

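Min-wise signature with linear hashing R-tree (MHR-tree)

[Figure: an R-tree in which every node stores its MBR together with a min-wise signature.]

Leaf-level signatures: (2, 1, 3), (4, 1, 2), (5, 2, 1), (7, 2, 5), (6, 2, 1), ..., (1, 6, 3), (2, 7, 4), (8, 3, 2)

Internal-node signatures: (2, 1, 1), (1, 2, 2), ...

Root signature: (1, 1, 1)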
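Because the signature of a union is the element-wise minimum of the parts' signatures, a node's entry can be built from its children alone. The sketch below shows that bottom-up step for one parent. The `Node` class and the MBR coordinates are hypothetical; only the three signatures come from the figure.

```python
from dataclasses import dataclass


@dataclass
class Node:
    mbr: tuple  # (xmin, ymin, xmax, ymax)
    sig: tuple  # min-wise signature of the q-grams below this node


def make_parent(children):
    """Parent entry: bounding box of the child MBRs, element-wise min of signatures."""
    mbr = (min(c.mbr[0] for c in children),
           min(c.mbr[1] for c in children),
           max(c.mbr[2] for c in children),
           max(c.mbr[3] for c in children))
    sig = tuple(min(column) for column in zip(*(c.sig for c in children)))
    return Node(mbr, sig)


# Signatures taken from the figure; the rectangles are made up for illustration.
leaves = [Node((0, 0, 1, 1), (2, 1, 3)),
          Node((1, 2, 3, 4), (4, 1, 2)),
          Node((2, 0, 5, 1), (5, 2, 1))]
parent = make_parent(leaves)
```

The resulting parent signature is (2, 1, 1), one of the internal-node signatures listed above, and every node stores only a fixed-length signature instead of a growing q-gram set.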

Query algorithms for the MHR-tree

Range-MHR(MHR-tree R, Range r, String σ, int τ)

Follow the range query algorithm on the R-tree, starting with the root of R in the queue. For each node u taken from the queue:

  If u is a leaf node
    For every point p stored in u
      If p is contained in r and |Gp ∩ Gσ| ≥ max(|σp|, |σ|) − 1 − (τ − 1) ∗ q and ε(σp, σ) ≤ τ, then insert p in A;
  Else
    For every child node wi of u
      If r and MBR(wi) intersect and |Gwi ∩ Gσ| ≥ |σ| − 1 − (τ − 1) ∗ q, insert wi into the queue;

Return A
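The traversal can be sketched as follows. This is a simplified variant, not the MHR-tree itself: each node keeps the exact set of q-grams in its subtree, whereas the MHR-tree keeps a min-wise signature and estimates the intersection size. The `Node` layout, the helper names, and the tiny example tree are all assumptions of this sketch.

```python
from collections import deque
from dataclasses import dataclass, field


def qgrams(s, q=2):
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    mbr: tuple                                    # (xmin, ymin, xmax, ymax)
    grams: set                                    # exact q-grams below this node
    children: list = field(default_factory=list)  # empty for leaf nodes
    points: list = field(default_factory=list)    # (x, y, string), leaves only


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def range_query(root, r, sigma, tau, q=2):
    g_sigma, answers, queue = qgrams(sigma, q), [], deque([root])
    while queue:
        u = queue.popleft()
        if not u.children:  # leaf: spatial test, count filter, exact check
            for x, y, s in u.points:
                bound = max(len(s), len(sigma)) - 1 - (tau - 1) * q
                inside = r[0] <= x <= r[2] and r[1] <= y <= r[3]
                if (inside and len(qgrams(s, q) & g_sigma) >= bound
                        and edit_distance(s, sigma) <= tau):
                    answers.append((x, y, s))
        else:  # internal node: prune children on space and on q-grams
            bound = len(sigma) - 1 - (tau - 1) * q
            for w in u.children:
                if intersects(r, w.mbr) and len(w.grams & g_sigma) >= bound:
                    queue.append(w)
    return answers


# Hypothetical two-leaf tree used only to exercise the sketch.
leaf1 = Node((1, 1, 2, 2), qgrams("theatre") | qgrams("theater"),
             points=[(1, 1, "theatre"), (2, 2, "theater")])
leaf2 = Node((8, 8, 9, 9), qgrams("cinema") | qgrams("theatre"),
             points=[(8, 8, "cinema"), (9, 9, "theatre")])
root = Node((1, 1, 9, 9), leaf1.grams | leaf2.grams, children=[leaf1, leaf2])
result = range_query(root, (0, 0, 5, 5), "theatre", tau=2)
```

In this example `leaf2` is pruned by the spatial test alone. A query string sharing no q-grams with a subtree is pruned by the second condition even when the rectangles overlap.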

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩ Gσ| using s(Gu) and σ.

σ: crab; Gσ: {#c, cr, ra, ab, b$}; s(Gσ): 3, 1, 4, 5; s(Gu): 3, 1, 3, 5.

Let G = Gu ∪ Gσ; then s(G) = s(Gu ∪ Gσ) = 3, 1, 3, 5.

ρ(G, Gσ) = |{i | min{hi(G)} = min{hi(Gσ)}}| / ℓ = 3/4.

ρ(G, Gσ) = |G ∩ Gσ| / |G ∪ Gσ| = |(Gu ∪ Gσ) ∩ Gσ| / |(Gu ∪ Gσ) ∪ Gσ| = |Gσ| / |Gu ∪ Gσ|.

|Gu ∪ Gσ| = |Gσ| / ρ(G, Gσ) = 5 / (3/4) = 20/3.

ρ(Gu, Gσ) = |{i | min{hi(Gu)} = min{hi(Gσ)}}| / ℓ = 3/4.

|Gu ∩ Gσ| = ρ(Gu, Gσ) ∗ |Gu ∪ Gσ| = 3/4 ∗ 20/3 = 5.
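The derivation above can be followed step by step in code, using exact fractions so the arithmetic matches the slide. The function name is illustrative, and returning 0 when no signature position agrees is a choice made for this sketch, since the union size cannot be estimated in that case.

```python
from fractions import Fraction


def estimate_intersection(sig_u, sig_sigma, size_sigma):
    """Estimate |Gu ∩ Gσ| from s(Gu), s(Gσ) and the known size |Gσ|."""
    ell = len(sig_sigma)
    # s(G) for G = Gu ∪ Gσ is the element-wise minimum of the two signatures.
    sig_g = [min(a, b) for a, b in zip(sig_u, sig_sigma)]
    rho_g = Fraction(sum(a == b for a, b in zip(sig_g, sig_sigma)), ell)
    if rho_g == 0:
        return Fraction(0)  # no agreement at all: treat the overlap as empty
    union_size = size_sigma / rho_g  # |Gu ∪ Gσ| = |Gσ| / ρ(G, Gσ)
    rho_u = Fraction(sum(a == b for a, b in zip(sig_u, sig_sigma)), ell)
    return rho_u * union_size        # |Gu ∩ Gσ| = ρ(Gu, Gσ) · |Gu ∪ Gσ|


# Values from the slide: σ = crab has 5 q-grams.
estimate = estimate_intersection((3, 1, 3, 5), (3, 1, 4, 5), 5)
```

Only |Gσ| is known exactly, because the query string is available at query time. Everything about Gu comes from its signature.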

Duplicate q-grams in strings

Issue 1: duplicate q-grams in one string (for both query and data).

Example: number each occurrence of a q-gram.
s: aabbaabb, {1#a, 1aa, 1ab, 1bb, 1ba, 2aa, 2ab, 2bb, 1b$}
q: aabcaa, {1#a, 1aa, 1ab, 1bc, 1ca, 2aa, 1a$}

Issue 2: duplicate q-grams between strings.

Do not distinguish q-grams from different nodes.
node 1: pizz (#p, pi, iz, zz, z$);
node 2: zza (#z, zz, za, a$);
parent node 3: union of signatures corresponding to (#p, pi, iz, zz, z$, #z, za, a$).
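The numbering scheme in the example for Issue 1 can be sketched as below: each q-gram is prefixed with how many times it has occurred so far in the same string, so repeated q-grams become distinct set elements. The function name is illustrative.

```python
from collections import Counter


def numbered_qgrams(s, q=2):
    """q-grams of s in order, each prefixed by its occurrence count so far."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    seen = Counter()
    result = []
    for i in range(len(padded) - q + 1):
        gram = padded[i:i + q]
        seen[gram] += 1
        result.append(f"{seen[gram]}{gram}")
    return result


data_grams = numbered_qgrams("aabbaabb")
query_grams = numbered_qgrams("aabcaa")
shared = set(data_grams) & set(query_grams)

# Issue 2: across strings the numbering restarts, so "zz" in pizz and in zza
# collapses to the single element "1zz" in the parent's union.
parent_union = set(numbered_qgrams("pizz")) | set(numbered_qgrams("zza"))
```

For the two example strings the shared elements are 1#a, 1aa, 1ab and 2aa, so the second occurrence of aa is counted separately from the first.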

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the string selectivity estimator (VSol [Mazeika et al. 2007], based on min-wise signatures of inverted lists of q-grams).

[Figure: buckets b1, b2, b3 with string estimators VSol1, VSol2, VSol3; a query range r; Θ(b2) and Θ(b2, r) marked for bucket b2.]

|Abi| = ni · (Θ(bi, r) / Θ(bi)) · (ρ^i_LM / ni) = (Θ(bi, r) / Θ(bi)) · ρ^i_LM.

(ρ^i_LM estimates the number of strings in bi that are similar to the query string.)

Improvements: minimum number of neighborhoods principle, spatial uniformity principle.
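A small numeric sketch of the per-bucket estimate follows. The slides do not define Θ; the code assumes Θ(b) is the area of bucket b and Θ(b, r) the area of its overlap with the query range r, with axis-aligned rectangular buckets. Summing the per-bucket values into one total is also an assumption here, and the string estimates are made-up numbers standing in for the VSol output ρ^i_LM.

```python
def area(rect):
    """Area of (xmin, ymin, xmax, ymax); empty rectangles count as 0."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def overlap(a, b):
    """Intersection rectangle of a and b (may be empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def estimate_answers(buckets, r):
    """Sum over buckets of Θ(b, r) / Θ(b) times the bucket's string estimate."""
    total = 0.0
    for rect, rho_lm in buckets:
        if area(rect) > 0:
            total += area(overlap(rect, r)) / area(rect) * rho_lm
    return total


# Hypothetical buckets: (rectangle, estimated number of similar strings).
buckets = [((0, 0, 10, 10), 8.0),
           ((10, 0, 20, 10), 4.0),
           ((0, 10, 20, 20), 6.0)]
expected = estimate_answers(buckets, r=(5, 0, 15, 10))
```

The ratio of areas is only a good proxy when points are spread evenly inside a bucket, which is what the spatial uniformity principle asks of the partitioning.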

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

[Figure: the strings toy, boy, coy, too, min, men, mine grouped into neighborhoods for τ = 1; the builds show η = 4 and then η = 2.]

A smaller η gives a more accurate estimator.

Spatial uniformity principle:

The partitioning metric

Neighborhood and uniformity quality of b: ∆(b) = ηb · nb · ∑(i = 1, ..., d) Xi, where {X1, ..., Xd} are the side lengths of b in each dimension.

Our goal: minimize ∑(i = 1, ..., k) ∆(bi), where k is the number of buckets specified by the user.

Two algorithms: the greedy algorithm; the adaptive R-tree algorithm.
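The metric is simple enough to state directly in code. In this sketch a bucket is described by its neighborhood count η, its point count n, and a d-dimensional box given as a pair of corner tuples. That representation and the function names are assumptions of the example.

```python
def side_lengths(rect):
    """Side lengths X1..Xd of a box given as (lows, highs)."""
    lows, highs = rect
    return [hi - lo for lo, hi in zip(lows, highs)]


def delta(eta_b, n_b, rect):
    """Quality of one bucket: eta_b * n_b * (sum of its side lengths)."""
    return eta_b * n_b * sum(side_lengths(rect))


def partition_cost(buckets):
    """Objective to minimize: the sum of delta over all k buckets."""
    return sum(delta(eta, n, rect) for eta, n, rect in buckets)


# Two hypothetical 2-d buckets: (eta, n, (lows, highs)).
buckets = [(2, 7, ((0.0, 0.0), (3.0, 4.0))),
           (1, 3, ((5.0, 5.0), (6.0, 6.0)))]
cost = partition_cost(buckets)
```

A bucket scores well when it holds few string neighborhoods, few points, and covers a compact region, so the objective rewards both principles at once.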

The greedy algorithm

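[Figure: buckets b1, b2 and a candidate b3; P and Θ(P) are marked for the rest of the space; later builds add b4 and b5.]

Cost of the candidate bucket b3: η3 · n3 · Π(b3).

Estimated cost of the remaining k − 3 buckets: (k − 3) · (η / (k − 3) · |P| / (k − 3) · (Θ(P) / (k − 3))^(1/d) · d).

U(bi) = ηi · ni · Π(bi) + η / (k − i) · |P| · (Θ(P) / (k − i))^(1/d) · d

ηi: the number of neighborhoods of strings in bi.
ni: the number of points in bi.
Π(bi): the perimeter of bi.
η: the number of neighborhoods of strings for the remaining points.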
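The score U(bi) can be evaluated as in the sketch below. It assumes that |P| is the number of points not yet assigned to a bucket and that Θ(P) is the area they cover, which is how the figure labels read but is not stated in the text. The function and parameter names are illustrative.

```python
def greedy_utility(eta_i, n_i, perimeter_i, eta_rest, n_rest, area_rest, k, i, d=2):
    """U(b_i): cost of candidate bucket i plus an estimate for the k - i buckets left."""
    remaining = k - i
    if remaining <= 0:
        raise ValueError("the candidate index i must be smaller than k")
    own_cost = eta_i * n_i * perimeter_i
    # Assume the rest is split evenly: each remaining bucket gets eta/(k-i)
    # neighborhoods, |P|/(k-i) points, and a cube of volume Theta(P)/(k-i),
    # whose d side lengths sum to d * (Theta(P)/(k-i)) ** (1/d).
    side_sum = d * (area_rest / remaining) ** (1.0 / d)
    rest_cost = (eta_rest / remaining) * n_rest * side_sum
    return own_cost + rest_cost


# Hypothetical numbers for a third bucket out of k = 6.
score = greedy_utility(eta_i=2, n_i=10, perimeter_i=8,
                       eta_rest=6, n_rest=90, area_rest=48.0, k=6, i=3)
```

A greedy procedure would compute this score for each candidate bucket and keep the one with the smallest value before moving on to the next bucket.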

The adaptive R-tree algorithm

[Figure: an R-tree drawn from the root down, with level 3 highlighted; k = 6; a candidate bucket b3 and a region r are marked.]

U′(bi) = ηi · ni · Π(bi) + η / (k − i) · n · (r / (k − i))^(1/d) · d.

For b3 the two terms are η3 · n3 · Π(b3) and η / (k − 3) · n · (r / (k − 3))^(1/d) · d.

ηi: the number of neighborhoods of strings in bi.
ni: the number of points in bi.
Π(bi): the perimeter of bi.
η: the number of neighborhoods of strings for the remaining points.

Experiment setup

All experiments were executed on a Linux machine with an Intel Xeon CPU at 2 GHz and 2 GB of memory.

Real data sets: the road networks of two states in the USA (Texas, California); each point is associated with the names of the state, county and town.

Synthetic data sets: uniform points and random clustered points, with strings from the real data sets assigned randomly to the generated points.

The default experimental parameters are summarized below.

Symbol   Definition                             Default Value
θ        query area percentage of data space    3%
N        size of points set                     2,000,000
ℓ        signature length                       50
τ        edit distance threshold                2
d        dimensionality                         2

SAS range queries: impact of the signature size

SAS range queries: query performance

Selectivity estimation: relative errors

[Plots: CA data set; TX data set.]

Selectivity estimation: cost of the adaptive estimator

Conclusions

We designed the MHR-tree for spatial approximate string queries.

We designed a novel selectivity estimator for SAS range queries, which takes into account both the spatial and string distributions.

Future work includes examining spatial approximate sub-string queries, and using the KMV synopsis to improve the performance.

The End

Thank you

Q and A