Approximate String Search in Spatial Databaseslifeifei/papers/SAS.pdfApproximate string search is...

View
12
Download
0
Category

Documents

Preview:

Citation preview

1-1

Approximate String Search in Spatial Databases

Bin Yao

Florida StateUniversity

Kun Hou

Florida StateUniversity

Marios Hadjieleftheriou

AT&T Labs Research

Feifei Li

Florida StateUniversity

2-1

Introduction

Approximate string search is important

data cleaning

data integration

online search engine

2-2

Introduction

Approximate string search is important

Also, spatial database

data cleaning

data integration

online search engine

map service

strategic planning of resources

2-3

Introduction

Approximate string search is important

Also, spatial database

Our work: the approximate string search in spatial database(Spatial Approximate String (SAS) queries).

data cleaning

data integration

online search engine

map service

strategic planning of resources

3-1

Example of a SAS range query

String similarity metric: edit distance with threshold τ .

3-2

Example of a SAS range query

String similarity metric: edit distance with threshold τ .

3-3

Example of a SAS range query

String similarity metric: edit distance with threshold τ .

4-1

A straightforward solution

R-tree solution:

4-2

A straightforward solution

R-tree solution:

4-3

A straightforward solution

R-tree solution:

4-4

A straightforward solution

R-tree solution:

Problem: only utilize the spatial dimension for the pruning.

5-1

Background: q-grams based edit distance pruning

Lemma: For strings σ1 and σ2 of length |σ1| and |σ2|, ifε(σ1, σ2) = τ , then |Gσ1 ∩Gσ2 | ≥ max(|σ1|, |σ2|)− 1− (τ −1) ∗ q.

5-2

Background: q-grams based edit distance pruning

Example: q = 2, τ = 2σ1 : theatre; σ2 : theater.

Lemma: For strings σ1 and σ2 of length |σ1| and |σ2|, ifε(σ1, σ2) = τ , then |Gσ1 ∩Gσ2 | ≥ max(|σ1|, |σ2|)− 1− (τ −1) ∗ q.

5-3

Background: q-grams based edit distance pruning

Example: q = 2, τ = 2σ1 : theatre; σ2 : theater.

Lemma: For strings σ1 and σ2 of length |σ1| and |σ2|, ifε(σ1, σ2) = τ , then |Gσ1 ∩Gσ2 | ≥ max(|σ1|, |σ2|)− 1− (τ −1) ∗ q.

Gσ1{#t, th, he, ea, at, tr, re, e$},Gσ2{#t, th, he, ea, at, te, er, r$}.

5-4

Background: q-grams based edit distance pruning

Example: q = 2, τ = 2σ1 : theatre; σ2 : theater.

Lemma: For strings σ1 and σ2 of length |σ1| and |σ2|, ifε(σ1, σ2) = τ , then |Gσ1 ∩Gσ2 | ≥ max(|σ1|, |σ2|)− 1− (τ −1) ∗ q.

Gσ1{#t, th, he, ea, at, tr, re, e$},Gσ2{#t, th, he, ea, at, te, er, r$}.

6-1

Background: estimating set resemblance

Set resemblance: two sets A and Bρ(A,B) = |A∩B|

|A∪B| .

6-2

Background: estimating set resemblance

Set resemblance: two sets A and Bρ(A,B) = |A∩B|

|A∪B| .

The min-wise signature:Any set X, any x ∈ X,π ∈ F (a family of min-wise independentpermutations),

Pr(min{π(X)}) = π(x)) = 1|X| .

6-3

Background: estimating set resemblance

Set resemblance: two sets A and Bρ(A,B) = |A∩B|

|A∪B| .

s(A) = {min{π1(A)}, . . . ,min{π`(A)}},s(A ∪B) = {min{π1(A), π1(B)}, . . . ,min{π`(A), π`(B)}}.

The min-wise signature:Any set X, any x ∈ X,π ∈ F (a family of min-wise independentpermutations),

Pr(min{π(X)}) = π(x)) = 1|X| .

6-4

Background: estimating set resemblance

Unbiased estimator for ρ(A,B):

ρ(A,B) = Pr(min{π(A)} = min{π(B)}).

Set resemblance: two sets A and Bρ(A,B) = |A∩B|

|A∪B| .

s(A) = {min{π1(A)}, . . . ,min{π`(A)}},s(A ∪B) = {min{π1(A), π1(B)}, . . . ,min{π`(A), π`(B)}}.

The min-wise signature:Any set X, any x ∈ X,π ∈ F (a family of min-wise independentpermutations),

Pr(min{π(X)}) = π(x)) = 1|X| .

6-5

Background: estimating set resemblance

Unbiased estimator for ρ(A,B):

ρ(A,B) = Pr(min{π(A)} = min{π(B)}).

Set resemblance: two sets A and Bρ(A,B) = |A∩B|

|A∪B| .

s(A) = {min{π1(A)}, . . . ,min{π`(A)}},s(A ∪B) = {min{π1(A), π1(B)}, . . . ,min{π`(A), π`(B)}}.

ρ(A,B) =|{i|min{πi(A)} = min{πi(B)}}|

`.

The min-wise signature:Any set X, any x ∈ X,π ∈ F (a family of min-wise independentpermutations),

Pr(min{π(X)}) = π(x)) = 1|X| .

7-1

Background: estimating set resemblance

hashes 1 2 4h1 1 3 4h2 3 1 2h3 2 4 3h4 4 2 1

Set A = {1, 2, 4}

7-2

Background: estimating set resemblance

hashes 1 2 4h1 1 3 4h2 3 1 2h3 2 4 3h4 4 2 1

Set A = {1, 2, 4}

signatureof A

1121

7-3

Background: estimating set resemblance

hashes 1 2 4h1 1 3 4h2 3 1 2h3 2 4 3h4 4 2 1

Set A = {1, 2, 4}

signatureof A

1121

Set B = {2, 3}hashes 2 3h1 3 2h2 1 4h3 4 1h4 2 3

signatureof B

2112

7-4

Background: estimating set resemblance

hashes 1 2 4h1 1 3 4h2 3 1 2h3 2 4 3h4 4 2 1

Set A = {1, 2, 4}

signatureof A

1121

Set B = {2, 3}hashes 2 3h1 3 2h2 1 4h3 4 1h4 2 3

signatureof B

2112

7-5

Background: estimating set resemblance

hashes 1 2 4h1 1 3 4h2 3 1 2h3 2 4 3h4 4 2 1

Set A = {1, 2, 4}

signatureof A

1121

Set B = {2, 3}hashes 2 3h1 3 2h2 1 4h3 4 1h4 2 3

signatureof B

2112

ρ(A,B) = 1/4

8-1

The MHR-tree: basic idea

8-2

The MHR-tree: basic idea

Lemma 2 Let Gu be the set for the union of q-grams of strings inthe subtree of node u. For a SAS query (r, σ, τ), if |Gu∩Gσ| <|σ|− 1− (τ − 1) ∗ q, then the subtree of node u does not containany element from As.

9-1

The MHR-tree: union of q-grams (q = 2)

· · ·ab cd ab ef gh ij kl nr

9-2

The MHR-tree: union of q-grams (q = 2)

· · ·

· · ·

ab cd ab ef gh ij kl nr

ab, cd, ef gh, ij, kl, nr

9-3

The MHR-tree: union of q-grams (q = 2)

· · ·

· · ·

· · ·

ab cd ab ef gh ij kl nr

ab, cd, ef, gh, ij, kl, nr

ab, cd, ef gh, ij, kl, nr

9-4

The MHR-tree: union of q-grams (q = 2)

· · ·

· · ·

· · ·

ab cd ab ef gh ij kl nr

ab, cd, ef, gh, ij, kl, nr

ab, cd, ef gh, ij, kl, nr

MBR

10-1

The MHR-tree

Min-wise signature with linear hashing R-tree (MHR-tree)

10-2

The MHR-tree

2, 1, 3 4, 1, 2 5, 2, 1 7, 2, 5 6, 2, 1 · · ·1, 6, 3 2, 7, 4 8, 3, 2

Min-wise signature with linear hashing R-tree (MHR-tree)

10-3

The MHR-tree

2, 1, 3 4, 1, 2 5, 2, 1 7, 2, 5 6, 2, 1 · · ·1, 6, 3 2, 7, 4 8, 3, 2

Min-wise signature with linear hashing R-tree (MHR-tree)

10-4

The MHR-tree

2, 1, 3 4, 1, 2 5, 2, 1 7, 2, 5 6, 2, 1 · · ·1, 6, 3 2, 7, 4 8, 3, 2

2, 1, 1 1, 2, 2 · · ·

Min-wise signature with linear hashing R-tree (MHR-tree)

10-5

The MHR-tree

2, 1, 3 4, 1, 2 5, 2, 1 7, 2, 5 6, 2, 1 · · ·1, 6, 3 2, 7, 4 8, 3, 2

2, 1, 1 1, 2, 2 · · ·

Min-wise signature with linear hashing R-tree (MHR-tree)

10-6

The MHR-tree

2, 1, 3 4, 1, 2 5, 2, 1 7, 2, 5 6, 2, 1 · · ·1, 6, 3 2, 7, 4 8, 3, 2

2, 1, 1 1, 2, 2

1, 1, 1

· · ·

Min-wise signature with linear hashing R-tree (MHR-tree)

10-7

The MHR-tree

2, 1, 3 4, 1, 2 5, 2, 1 7, 2, 5 6, 2, 1 · · ·1, 6, 3 2, 7, 4 8, 3, 2

2, 1, 1 1, 2, 2

1, 1, 1

· · ·

Min-wise signature with linear hashing R-tree (MHR-tree)

MBR

11-1

Query algorithms for the MHR-tree

Range-MHR(MHR-tree R, Range r, String σ, int τ)

Follow the range query algorithm on R-tree,

If u is a leaf nodeFor every point p ∈ up

If p is contained in r and |Gp ∩Gσ| ≥ max(|σp|, |σ|)−1− (τ − 1) ∗ q and ε(σp, σ) < τ then Insert p in A;

ElseFor every child node wi of u

If r and MBR(wi) intersect, and |Gwi ∩Gσ| ≥ |σ| −1− (τ − 1) ∗ qinsert wi into queue;

Return A

11-2

Query algorithms for the MHR-tree

Range-MHR(MHR-tree R, Range r, String σ, int τ)

Follow the range query algorithm on R-tree,

If u is a leaf nodeFor every point p ∈ up

If p is contained in r and |Gp ∩Gσ| ≥ max(|σp|, |σ|)−1− (τ − 1) ∗ q and ε(σp, σ) < τ then Insert p in A;

ElseFor every child node wi of u

If r and MBR(wi) intersect, and |Gwi ∩Gσ| ≥ |σ| −1− (τ − 1) ∗ qinsert wi into queue;

Return A

11-3

Query algorithms for the MHR-tree

Range-MHR(MHR-tree R, Range r, String σ, int τ)

Follow the range query algorithm on R-tree,

If u is a leaf nodeFor every point p ∈ up

If p is contained in r and |Gp ∩Gσ| ≥ max(|σp|, |σ|)−1− (τ − 1) ∗ q and ε(σp, σ) < τ then Insert p in A;

ElseFor every child node wi of u

If r and MBR(wi) intersect, and |Gwi ∩Gσ| ≥ |σ| −1− (τ − 1) ∗ qinsert wi into queue;

Return A

11-4

Query algorithms for the MHR-tree

Range-MHR(MHR-tree R, Range r, String σ, int τ)

Follow the range query algorithm on R-tree,

If u is a leaf nodeFor every point p ∈ up

If p is contained in r and |Gp ∩Gσ| ≥ max(|σp|, |σ|)−1− (τ − 1) ∗ q and ε(σp, σ) < τ then Insert p in A;

ElseFor every child node wi of u

If r and MBR(wi) intersect, and |Gwi ∩Gσ| ≥ |σ| −1− (τ − 1) ∗ qinsert wi into queue;

Return A

11-5

Query algorithms for the MHR-tree

Range-MHR(MHR-tree R, Range r, String σ, int τ)

Follow the range query algorithm on R-tree,

If u is a leaf nodeFor every point p ∈ up

If p is contained in r and |Gp ∩Gσ| ≥ max(|σp|, |σ|)−1− (τ − 1) ∗ q and ε(σp, σ) < τ then Insert p in A;

ElseFor every child node wi of u

If r and MBR(wi) intersect, and |Gwi ∩Gσ| ≥ |σ| −1− (τ − 1) ∗ qinsert wi into queue;

Return A

12-1

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

12-2

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

σ: crab Gσ: #c,cr,ra,ab,b$s(Gσ): 3, 1, 4, 5 s(Gu): 3, 1, 3, 5

12-3

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

σ: crab Gσ: #c,cr,ra,ab,b$s(Gσ): 3, 1, 4, 5 s(Gu): 3, 1, 3, 5Let G = Gu ∪Gσ, s(G) = s(Gu ∪Gσ) = 3, 1, 3, 5

12-4

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

σ: crab Gσ: #c,cr,ra,ab,b$s(Gσ): 3, 1, 4, 5 s(Gu): 3, 1, 3, 5Let G = Gu ∪Gσ, s(G) = s(Gu ∪Gσ) = 3, 1, 3, 5ρ(G,Gσ) = |{i|min{hi(G)}=min{hi(Gσ)}}|

` = 3/4.

12-5

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

σ: crab Gσ: #c,cr,ra,ab,b$s(Gσ): 3, 1, 4, 5 s(Gu): 3, 1, 3, 5Let G = Gu ∪Gσ, s(G) = s(Gu ∪Gσ) = 3, 1, 3, 5ρ(G,Gσ) = |{i|min{hi(G)}=min{hi(Gσ)}}|

` = 3/4.

ρ(G,Gσ) = |G∩Gσ||G∪Gσ| = |(Gu∪Gσ)∩Gσ|

|(Gu∪Gσ)∪Gσ| = |Gσ||Gu∪Gσ| .

12-6

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

σ: crab Gσ: #c,cr,ra,ab,b$s(Gσ): 3, 1, 4, 5 s(Gu): 3, 1, 3, 5Let G = Gu ∪Gσ, s(G) = s(Gu ∪Gσ) = 3, 1, 3, 5ρ(G,Gσ) = |{i|min{hi(G)}=min{hi(Gσ)}}|

` = 3/4.

ρ(G,Gσ) = |G∩Gσ||G∪Gσ| = |(Gu∪Gσ)∩Gσ|

|(Gu∪Gσ)∪Gσ| = |Gσ||Gu∪Gσ| .

12-7

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

σ: crab Gσ: #c,cr,ra,ab,b$s(Gσ): 3, 1, 4, 5 s(Gu): 3, 1, 3, 5Let G = Gu ∪Gσ, s(G) = s(Gu ∪Gσ) = 3, 1, 3, 5ρ(G,Gσ) = |{i|min{hi(G)}=min{hi(Gσ)}}|

` = 3/4.

ρ(G,Gσ) = |G∩Gσ||G∪Gσ| = |(Gu∪Gσ)∩Gσ|

|(Gu∪Gσ)∪Gσ| = |Gσ||Gu∪Gσ| .

|Gu ∪Gσ| = |Gσ|ρ(G,Gσ) = 5/(3/4) = 20/3.

12-8

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

σ: crab Gσ: #c,cr,ra,ab,b$s(Gσ): 3, 1, 4, 5 s(Gu): 3, 1, 3, 5Let G = Gu ∪Gσ, s(G) = s(Gu ∪Gσ) = 3, 1, 3, 5ρ(G,Gσ) = |{i|min{hi(G)}=min{hi(Gσ)}}|

` = 3/4.

ρ(G,Gσ) = |G∩Gσ||G∪Gσ| = |(Gu∪Gσ)∩Gσ|

|(Gu∪Gσ)∪Gσ| = |Gσ||Gu∪Gσ| .

|Gu ∪Gσ| = |Gσ|ρ(G,Gσ) = 5/(3/4) = 20/3.

ρ(Gu, Gσ) = |{i|min{hi(Gu)}=min{hi(Gσ)}}|` = 3/4.

12-9

Query algorithms for the MHR-tree

Key issue: estimating |Gu ∩Gσ| using s(Gu) and σ.

σ: crab Gσ: #c,cr,ra,ab,b$s(Gσ): 3, 1, 4, 5 s(Gu): 3, 1, 3, 5Let G = Gu ∪Gσ, s(G) = s(Gu ∪Gσ) = 3, 1, 3, 5ρ(G,Gσ) = |{i|min{hi(G)}=min{hi(Gσ)}}|

` = 3/4.

ρ(G,Gσ) = |G∩Gσ||G∪Gσ| = |(Gu∪Gσ)∩Gσ|

|(Gu∪Gσ)∪Gσ| = |Gσ||Gu∪Gσ| .

|Gu ∪Gσ| = |Gσ|ρ(G,Gσ) = 5/(3/4) = 20/3.

ρ(Gu, Gσ) = |{i|min{hi(Gu)}=min{hi(Gσ)}}|` = 3/4.

|Gu ∩Gσ| = ρ(Gu, Gσ) ∗ |Gu ∪Gσ| = 3/4 ∗ 20/3 = 5.

13-1

Duplicate q-grams in strings

Issue 1: duplicate q-grams in one string (for both query and data).

13-2

Duplicate q-grams in strings

Issue 1: duplicate q-grams in one string (for both query and data).

example:s: aabbaabb, {1#a,1aa, 1ab, 1bb, 1ba, 2aa, 2ab, 2bb, 1b$}q: aabcaa, {1#a, 1aa, 1ab, 1bc, 1ca, 2aa, 1a$}

13-3

Duplicate q-grams in strings

Issue 1: duplicate q-grams in one string (for both query and data).

example:s: aabbaabb, {1#a,1aa, 1ab, 1bb, 1ba, 2aa, 2ab, 2bb, 1b$}q: aabcaa, {1#a, 1aa, 1ab, 1bc, 1ca, 2aa, 1a$}

13-4

Duplicate q-grams in strings

Issue 1: duplicate q-grams in one string (for both query and data).

example:s: aabbaabb, {1#a,1aa, 1ab, 1bb, 1ba, 2aa, 2ab, 2bb, 1b$}q: aabcaa, {1#a, 1aa, 1ab, 1bc, 1ca, 2aa, 1a$}

13-5

Duplicate q-grams in strings

Issue 1: duplicate q-grams in one string (for both query and data).

example:s: aabbaabb, {1#a,1aa, 1ab, 1bb, 1ba, 2aa, 2ab, 2bb, 1b$}q: aabcaa, {1#a, 1aa, 1ab, 1bc, 1ca, 2aa, 1a$}

Issue 2: duplicate q-grams between strings.

13-6

Duplicate q-grams in strings

Issue 1: duplicate q-grams in one string (for both query and data).

example:s: aabbaabb, {1#a,1aa, 1ab, 1bb, 1ba, 2aa, 2ab, 2bb, 1b$}q: aabcaa, {1#a, 1aa, 1ab, 1bc, 1ca, 2aa, 1a$}

Issue 2: duplicate q-grams between strings.

Do not distinguish q-grams from different nodes.node 1: pizz (#p, pi, iz, zz, z$);node 2: zza (#z, zz, za, a$)parent node 3:union of signatures corresponding to (#p, pi, iz, zz, z$, #z,za, a$)

14-1

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

14-2

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

14-3

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

b1

b2

b3

14-4

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

b1

b2

b3

Θ(b2)

14-5

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

b1

b2

b3

Θ(b2)

14-6

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

V Sol1b1

b2

b3

V Sol2

V Sol3

Θ(b2)

14-7

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

V Sol1b1

b2

b3

V Sol2

V Sol3

r

Θ(b2)

14-8

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

V Sol1b1

b2

b3

V Sol2

V Sol3

r

Θ(b2)

Θ(b2, r)

14-9

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

V Sol1b1

b2

b3

V Sol2

V Sol3

r

Θ(b2)

Θ(b2, r)

|Abi | = niΘ(bi,r)Θ(bi)

ρiLMni

= Θ(bi,r)Θ(bi)

ρiLM . (ρiLM estimates the number

of similar strings with query string in bi.)

14-10

Selectivity estimation for SAS range queries

Combine the range query selectivity estimator with the stringselectivity estimator (V Sol [Mazeika et al.2007] based on min-wise signatures of inverted lists of q-grams).

V Sol1b1

b2

b3

V Sol2

V Sol3

r

Θ(b2)

Θ(b2, r)

|Abi | = niΘ(bi,r)Θ(bi)

ρiLMni

= Θ(bi,r)Θ(bi)

ρiLM . (ρiLM estimates the number

of similar strings with query string in bi.)

Improvements:Minimum number of neighborhoods principle,Spatial uniformity principle.

15-1

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

15-2

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

toyτ = 1

boycoy

min

men

mine

too

15-3

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

toyτ = 1

boycoy

min

men

mine

tooη = 4

15-4

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

toyτ = 1

boycoy

min

men

mine

tooη = 2

15-5

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

toyτ = 1

boycoy

min

men

mine

tooη = 2

A smaller η give a more accurate estimator.

15-6

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

Spatial uniformity principle:

toyτ = 1

boycoy

min

men

mine

tooη = 2

A smaller η give a more accurate estimator.

15-7

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

Spatial uniformity principle:

toyτ = 1

boycoy

min

men

mine

tooη = 2

A smaller η give a more accurate estimator.

15-8

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

Spatial uniformity principle:

toyτ = 1

boycoy

min

men

mine

tooη = 2

A smaller η give a more accurate estimator.

15-9

Two improvements for selectivity estimation

Minimum number of neighborhoods principle:

Spatial uniformity principle:

toyτ = 1

boycoy

min

men

mine

tooη = 2

A smaller η give a more accurate estimator.

16-1

The partitioning metric

Neighborhood and uniformity quality of b:∆(b) = ηbnb

∑1,...,dXi,

{X1, . . . , Xd}: the side lengths of b in each dimension.

16-2

The partitioning metric

Neighborhood and uniformity quality of b:∆(b) = ηbnb

∑1,...,dXi,

{X1, . . . , Xd}: the side lengths of b in each dimension.

Our goal:Minimize

∑ki=1 ∆(bi), where k is the number of buckets specified

by the user.

16-3

The partitioning metric

Neighborhood and uniformity quality of b:∆(b) = ηbnb

∑1,...,dXi,

{X1, . . . , Xd}: the side lengths of b in each dimension.

The greedy algorithm;The adaptive R-tree algorithm.

Our goal:Minimize

∑ki=1 ∆(bi), where k is the number of buckets specified

by the user.

17-1

The greedy algorithm

b1

b2

b3

17-2

The greedy algorithm

b1

b2

b3 ?

17-3

The greedy algorithm

b1

b2

b3

η3 · n3 ·Π(b3)

?

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.

Π(bi): the perimeter of bi.

17-4

The greedy algorithm

b1

b2

b3

η3 · n3 ·Π(b3)

?

P

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.

Π(bi): the perimeter of bi.

17-5

The greedy algorithm

b1

b2

b3

η3 · n3 ·Π(b3)

?

Θ(P )

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.

Π(bi): the perimeter of bi.

17-6

The greedy algorithm

b1

b2

b3

η3 · n3 ·Π(b3)

?

k − 3 · ( ηk−3 ·

|P |k−3 ·

(Θ(P )k−3

)1/d

· d)

Θ(P )

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.η: number of neighborhoods of strings for remaining points.

Π(bi): the perimeter of bi.

k − 3 buckets.

17-7

The greedy algorithm

b1

b2

b3

η3 · n3 ·Π(b3)

?

U(bi) = ηi · ni ·Π(bi) + ηk−i · |P | ·

(Θ(P )k−i

)1/d

· d(Π(bi) is the perimeter of bucket bi)

k − 3 · ( ηk−3 ·

|P |k−3 ·

(Θ(P )k−3

)1/d

· d)

Θ(P )

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.η: number of neighborhoods of strings for remaining points.

Π(bi): the perimeter of bi.

k − 3 buckets.

17-8

The greedy algorithm

b1

b2

b3

η3 · n3 ·Π(b3)

U(bi) = ηi · ni ·Π(bi) + ηk−i · |P | ·

(Θ(P )k−i

)1/d

· d(Π(bi) is the perimeter of bucket bi)

k − 3 · ( ηk−3 ·

|P |k−3 ·

(Θ(P )k−3

)1/d

· d)

Θ(P )

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.η: number of neighborhoods of strings for remaining points.

Π(bi): the perimeter of bi.

k − 3 buckets.

17-9

The greedy algorithm

b1

b2

b3

U(bi) = ηi · ni ·Π(bi) + ηk−i · |P | ·

(Θ(P )k−i

)1/d

· d(Π(bi) is the perimeter of bucket bi)

17-10

The greedy algorithm

b1

b2

b3

b4

17-11

The greedy algorithm

b1

b2

b3

b4 b5

18-1

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

R-tree root

18-2

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

18-3

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

18-4

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

b3

18-5

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

b3

?

18-6

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

η3 · n3 ·Π(b3)

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.Π(bi): the perimeter of bi.

18-7

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

η3 · n3 ·Π(b3) ηk−3 · n · (

rk−3 )1/d · d

r

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.Π(bi): the perimeter of bi.η: number of neighborhoods of strings for remaining points.

18-8

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

U ′(bi) = ηi · ni ·Π(bi) + ηk−i · n · (

rk−i )

1/d · d.

η3 · n3 ·Π(b3) ηk−3 · n · (

rk−3 )1/d · d

r

ηi: the number of neighborhoods of strings in bi.ni: the number of points in bi.Π(bi): the perimeter of bi.η: number of neighborhoods of strings for remaining points.

18-9

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

18-10

The adaptive R-tree algorithm

k = 6

· · ·· · · · · · · · · · · ·

level3

R-tree root

19-1

Experiment setup

All experiments were executed on a Linux machine with an IntelXeon CPU at 2GHz and 2GB of memory.

19-2

Experiment setup

All experiments were executed on a Linux machine with an IntelXeon CPU at 2GHz and 2GB of memory.Real data sets:The road-networks for states (Texas, California) in USA, each pointis associated with the names of the state, county and town.Synthetic data sets: uniform points and random clustered points.Assign strings from real data sets randomly to the point generated.

19-3

Experiment setup

All experiments were executed on a Linux machine with an IntelXeon CPU at 2GHz and 2GB of memory.Real data sets:The road-networks for states (Texas, California) in USA, each pointis associated with the names of the state, county and town.Synthetic data sets: uniform points and random clustered points.Assign strings from real data sets randomly to the point generated.

The default experimental parameters are summarized below.

Symbol Definition Default Valueθ query area percentage of data space 3%N size of points set 2,000,000l signature length 50τ edit distance threshold 2d dimensionality 2

20-1

SAS range queries: impact of the signature size

20-2

SAS range queries: impact of the signature size

20-3

SAS range queries: impact of the signature size

20-4

SAS range queries: impact of the signature size

21-1

SAS range queries: query performance

21-2

SAS range queries: query performance

21-3

SAS range queries: query performance

21-4

SAS range queries: query performance

22-1

Selectivity estimation: relative errors

CA data set

22-2

Selectivity estimation: relative errors

TX data set

22-3

Selectivity estimation: relative errors

22-4

Selectivity estimation: relative errors

22-5

Selectivity estimation: relative errors

22-6

Selectivity estimation: relative errors

23-1

Selectivity estimation: cost of the adaptive estimator

23-2

Selectivity estimation: cost of the adaptive estimator

24-1

Conclusions

We designed MHR-tree for spatial approximate string queries.

24-2

Conclusions

We designed MHR-tree for spatial approximate string queries.

We designed novel selectivity estimator for SAS range queries,which take into account both the spatial and string distribu-tions.

24-3

Conclusions

We designed MHR-tree for spatial approximate string queries.

We designed novel selectivity estimator for SAS range queries,which take into account both the spatial and string distribu-tions.

Future work includes examining spatial approximate sub-stringqueries, and using the KMV synopsis to improve the perfor-mance.

25-1

The End

THANK Y OU

Q and A

Recommended

Geo-spatial Search Engine

Documents

Search, Wage Posting, and Urban Spatial Structure

Documents

Spatial Mismatch, Search E¤ort and Urban Spatial Structure

Documents

Geo/Spatial Search with MySQL

Documents

String Search - Computer Science Department at Princeton University

Documents

Efficient String Matching : An Aid to Bibliographic Search

Documents

The String B-Tree: A New Data Structure for String Search ... · PDF fileThe String B-Tree: A New Data Structure for String Search in External Memory and Its Applications PAOLO FERRAGINA

Documents

Exact String Search

Documents

Unraveling the (search) string - Library Assessment

Documents

radhasundar.weebly.com€¦ · Web viewUsing regular expression you can search a particular string inside a another string, you can replace one string by another string and you

Documents

String Sorts Tries Substring Search: KMP, BM, RK

Documents

Spatial Variation in Search Engine Queries

Documents

Efficient Approximate Search on String Collections Part I

Documents

Modeling Residential and Spatial Search Behaviour

Documents

Spatial Language Understanding for Object Search in

Documents

Efficient Approximate Search on String Collections Part IIyaobin/teaching/2012db2/lect10.pdf · Efficient Approximate Search on String Collections ... Prefix filter (CGK06) ... Those

Documents

E cient Edit Distance based String Similarity Search using …leser/searchjoincompetition... · E cient Edit Distance based String Similarity Search using Deletion Neighborhoods Shashwat

Documents

Appendix 1 : The Search String for the Systematic Review...Appendix 1 : The Search String for the Systematic Review . 1. DEVELOPING THE SEARCH STRATEGY The search strategy was required

Documents

Boyer-Moore String Search Algorithm - univie.ac.at...Boyer-Moore String Search Algorithm Michael Sedlmair Boyer, Robert S., and Moore, J. Strother. "A fast string searching algorithm."

Documents

A Method to Support Search String Building in Systematic

Documents