Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently. Xiaochun Yang, Bin Wang, Chen Li. Northeastern University, China


Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently

Xiaochun Yang, Bin Wang, Chen Li

Northeastern University, China

2

Approximate selection queries

Keanu Reeves

Samuel Jackson

Schwarzenegger

Samuel Jackson

Schwarrzenger

Query errors: limited knowledge about data; typos; limited input devices (cell phones)

Data errors: typos; Web data; OCR

Applications: spellchecking; query relaxation; …

Similarity functions: edit distance; Jaccard; cosine; …

3

Performance is a big issue

  • Answer queries interactively
  • Many queries on a server

5 ms/query: 200 queries/second
20 ms/query: 50 queries/second

4

Outline

  • Motivation
  • Tightening lower bound of common strings
  • Effects of adding a gram on index and queries
  • Cost-based construction of gram dictionary
  • Experiments

5

q-grams

String: bingon

2-grams: bi, in, ng, go, on

6

q-gram inverted lists

2-grams

id  string
1   bingo
2   bioinng
3   bitingin
4   biting
5   boing
6   going

D0

gram  string ids
bi    1, 2, 3, 4
bo    5
gi    3
go    1, 6
in    1, 2, 3, 3, 4, 5, 6
io    2
it    3, 4
ng    1, 2, 3, 4, 5, 6
nn    2
oi    2, 5, 6
ti    3, 4

7

Query processing

2-grams

id  string
1   bingo
2   bioinng
3   bitingin
4   biting
5   boing
6   going

ED(bingon, ?)≤1

D0

gram  string ids
bi    1, 2, 3, 4
bo    5
gi    3
go    1, 6
in    1, 2, 3, 3, 4, 5, 6
io    2
it    3, 4
ng    1, 2, 3, 4, 5, 6
nn    2
oi    2, 5, 6
ti    3, 4

# of common grams >= 3
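The query on this slide can be replayed with a small sketch. It builds the 2-gram inverted lists for the six strings, applies the count filter (query grams minus k * q, here 5 - 1 * 2 = 3), and verifies the surviving candidates with edit distance. The function names (`grams`, `build_index`, `edit_distance`, `search`) are illustrative assumptions, not taken from the talk, and list entries are counted with multiplicity, which can only admit extra candidates, never drop a true answer.

```python
from collections import Counter, defaultdict

def grams(s, q):
    """All overlapping q-grams of s, in order (no padding characters)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q):
    """Map each gram to the list of string ids (1-based) containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings, start=1):
        for g in grams(s, q):
            index[g].append(sid)
    return index

def edit_distance(a, b):
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def search(query, k, strings, index, q):
    """Return (candidate ids after the count filter, ids with ED <= k)."""
    query_grams = grams(query, q)
    threshold = len(query_grams) - k * q  # lower bound on common grams
    if threshold <= 0:
        # The filter cannot prune anything; every string is a candidate.
        candidates = list(range(1, len(strings) + 1))
    else:
        counts = Counter()
        for g in query_grams:
            counts.update(index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    answers = [sid for sid in candidates
               if edit_distance(query, strings[sid - 1]) <= k]
    return candidates, answers

STRINGS = ["bingo", "bioinng", "bitingin", "biting", "boing", "going"]
INDEX = build_index(STRINGS, 2)
CANDIDATES, ANSWERS = search("bingon", 1, STRINGS, INDEX, 2)
print(CANDIDATES, ANSWERS)
```

In this sketch string 5 (boing) shares only two grams with the query and is pruned; of the remaining candidates only string 1 (bingo) is within edit distance 1.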

8

VGRAM: variable-length grams [VLDB07]

[2,3]-gram dictionary

String: bingon

Grams: bi, bin, bo, gi, go, in, ing, io, it, ng, nn, oi, ti

[Figure: the gram dictionary drawn as a trie, nodes n1 to n33, with # marking leaf nodes]
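As a rough illustration of how a string could be split into variable-length grams with this dictionary, the sketch below scans left to right, takes the longest dictionary gram starting at each position (falling back to the qmin-gram), and drops a gram that is fully covered by an earlier one. This is an assumed greedy formulation written for this transcript; the name `vgram_decompose`, the 0-based positions, and the use of a Python set in place of the trie are all illustrative choices rather than details from the slides.

```python
def vgram_decompose(s, dictionary, qmin, qmax):
    """Return (position, gram) pairs for s using a greedy longest-match scan."""
    result = []
    max_end = 0  # exclusive end of the right-most gram emitted so far
    for p in range(len(s) - qmin + 1):
        # Default to the qmin-gram, then look for a longer dictionary gram.
        gram = s[p:p + qmin]
        for length in range(min(qmax, len(s) - p), qmin, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        end = p + len(gram)
        # Skip a gram that lies entirely inside an earlier, longer gram.
        if end > max_end:
            result.append((p, gram))
            max_end = end
    return result

DICTIONARY = {"bi", "bin", "bo", "gi", "go", "in", "ing",
              "io", "it", "ng", "nn", "oi", "ti"}
VGRAMS = vgram_decompose("bingon", DICTIONARY, 2, 3)
print(VGRAMS)  # [(0, 'bin'), (1, 'ing'), (3, 'go'), (4, 'on')]
```

Under these assumptions bingon yields four grams instead of the five fixed-length 2-grams: ng is covered by ing, and on is kept as a qmin-gram even though it is not in the dictionary.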

9

Adopting VGRAM in algorithms

[Figure: VGRAM with the gram dictionary (trie) and a string (bingon) as inputs; labeled outputs: grams, lower bound]

# of common grams >= 3

10

Contributions of this study

  • Tightening lower bounds using dynamic programming
  • Cost-based quantitative approach
      • Analyze and estimate query performance when adding each gram
      • Automatically find high-quality grams

[Figure: string collection, high-quality grams, gram dictionary]

11

Outline

  • Motivation
  • Tightening lower bound of common strings
  • Effects of adding a gram on index and queries
  • Cost-based construction of gram dictionary
  • Experiments

12

Calculating lower bound

Fixed length (q):

If ed(s1, s2) <= k, then # of common grams >= (# of grams of s1) - k * q

b i i n d i n g

13

Calculating lower bound

Variable lengths:

lower bound = # of grams of s1 - NAG(s1, k)

b i i n d i n g
1 2 3 2 3 2 1 1

14

Too pessimistic?

k-Max: summation of the k largest values

b i i n d i n g
1 2 3 2 3 2 1 1

NAG(s, 2) = 3 + 3 = 6
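The k-Max estimate is simple enough to state as code. The sketch below takes the per-character counts shown on the slide and sums the k largest; the name `k_max_bound` is an illustrative assumption. Because the two largest counts may refer to overlapping sets of grams, this sum can overestimate how many distinct grams k edit operations really affect, which is the pessimism the slide title asks about.

```python
def k_max_bound(affected_counts, k):
    """Sum of the k largest per-position counts (an upper bound on NAG)."""
    return sum(sorted(affected_counts, reverse=True)[:k])

B = [1, 2, 3, 2, 3, 2, 1, 1]  # counts for b i i n d i n g, from the slide
print(k_max_bound(B, 2))      # 6
```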

15

Tightening lower bound

Dynamic programming: tightening NAG(s, k)

Subproblems: NAG(s[1, j], i)

[Figure: string s with positions 1 to j and edit operation op_i]

16

Dynamic programming

Recurrence function

[Figure: string s with positions 1 to j; labels B[j], op_i, op_(i-1)]

17

Dynamic programming

        b  i  i  n  d  i  n  g
        1  2  3  2  3  2  1  1

k=0  0  0  0  0  0  0  0  0  0
k=1  0  1  2  3  3  3  3  3  3
k=2  0  1  2  3  4  5  5  5  5

NAG vector
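The transcript does not preserve the recurrence itself, so the following is only one possible formulation, not the authors' function. It assumes that an edit at position j affects a contiguous range of gram indices and that these ranges move rightward with j. The ranges in `INTERVALS` are hypothetical: they were chosen solely so that their sizes equal the counts 1 2 3 2 3 2 1 1 on the slide, and the real gram positions for biinding are not recoverable from the transcript. Under those assumptions the sketch counts each affected gram once, and it happens to reproduce the k=0, k=1, and k=2 rows shown above.

```python
def nag_table(intervals, k):
    """P[i][j]: max distinct grams affected by at most i edits in the first j positions.

    intervals[j-1] = (lo, hi) is the inclusive range of gram indices affected
    by an edit at position j; lo and hi are assumed non-decreasing in j.
    """
    n = len(intervals)
    size = [hi - lo + 1 for lo, hi in intervals]
    # last[i][j]: best total when position j is the right-most edited position.
    last = [[0] * (n + 1) for _ in range(k + 1)]
    P = [[0] * (n + 1) for _ in range(k + 1)]
    for i in range(1, k + 1):
        for j in range(1, n + 1):
            lo, hi = intervals[j - 1]
            best = size[j - 1]  # position j is the only edit
            if i > 1:
                for jp in range(1, j):
                    prev_hi = intervals[jp - 1][1]
                    # Grams of j that the previous edit has not already counted.
                    newly = max(hi - max(lo - 1, prev_hi), 0)
                    best = max(best, last[i - 1][jp] + newly)
            last[i][j] = best
            P[i][j] = max(P[i][j - 1], best)
    return P

# Hypothetical gram ranges for b i i n d i n g (sizes 1 2 3 2 3 2 1 1).
INTERVALS = [(0, 0), (0, 1), (0, 2), (2, 3), (2, 4), (4, 5), (5, 5), (5, 5)]
TABLE = nag_table(INTERVALS, 2)
for row in TABLE:
    print(row)
```

With these inputs the last entry of the k=2 row is 5, compared with 6 from k-Max, which illustrates how counting overlapping grams only once tightens the bound.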

18

Outline

  • Motivation
  • Tightening lower bound of common strings
  • Effects of adding a gram on index and queries
  • Cost-based construction of gram dictionary
  • Experiments

19

Effects on inverted lists

Gram dictionary before: ab, bc

Add gram: abc

Gram dictionary after: ab, bc, abc

String: --abc----ab----bc--

20

Effects on query performance

  • Decrease the query’s inverted lists
  • Change the lower bound
  • Change the # of candidates

21

Effects on query’s inverted lists

Gram dictionary before: ab, bc

Add gram: abc

Gram dictionary after: ab, bc, abc

Adding a new gram abc either leaves the query’s inverted lists unchanged or decreases them.

Query Q, three cases:

- - - - - - - - - - - - -
- - - - - a b - - - - - -
- - - - - a b c - - - - -

22

Effects on lower bound

Query: Q, ED(Q, ?) ≤ 1

Query Q: - - - - a b c d - - - - -

23

Effects on # of candidates

A change in the lower bound changes the # of candidates.

Gram dictionary before: ab, bc

Add gram: abc

Gram dictionary after: ab, bc, abc

Query Q: - - - - a b c d - - - -

24

Outline

  • Motivation
  • Tightening lower bound of common strings
  • Effects of adding a gram on index and queries
  • Cost-based construction of gram dictionary
  • Experiments

25

Construct a gram dictionary [VLDB07]

qmin=2

qmax=4

26

Cost-based construction

qmin=2

27

Outline

  • Motivation
  • Tightening lower bound of common strings
  • Effects of adding a gram on index and queries
  • Cost-based construction of gram dictionary
  • Experiments

28

Data sets

Environment: GNU C++, Dell GX620 PC with an Intel Pentium 2.40 GHz dual-core CPU, 2 GB memory, 250 GB disk, Ubuntu (Linux) OS. Index structures were assumed to be in memory.

Data set         # of strings   Min length   Max length   Avg length   # of injected edit operations (range)
Article Titles   277,000        6            207          66           [1,6]
Movie Titles     855,000        8            249          35           [1,3]
Actor Names      1,200,000      4            74           17           [1,2]

29

Effect of Tightening Lower Bound

Data set: 1M actor names. Gram dictionary construction: 100,000 sample strings, 5,000 queries, qmin = 4

30

Comparison with algorithm Prune [VLDB07]

Dataset: 1M article titles

Prune: qmin=5, qmax=7, T=2000, LargeFirst policy

GramGen: 1% sampling ratio, 2000 queries (qmin=5 automatically determined)

31

Choosing qmin

Construct gram dictionary: (a) 3,000 queries, (b) sample ratio=2%

32

Conclusions

  • Tightening lower bound
      • Dynamic programming
  • Analysis of how adding a gram affects
      • Index structure
      • Performance of queries
  • Efficient algorithm
      • Automatically generating a high-quality gram dictionary

33

Thank you

Questions or Comments?

34

Related work

  • Approximate string matching
      • q-grams, q-samples
      • Inside DBMS
      • Substring matching
  • Set similarity join
  • Estimation
      • Selectivity of SQL LIKE substring queries
      • Approximate string answers