67
Approximate Queries on String Collections Xi h Y Xiaochun Yang Institute of Computer Software and Theory School of Information Science and Engineering Northeastern University, CHINA ang c@mail ne ed cn 1 yangxc@mail.neu.edu.cn

Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Approximate Queries on String Collections

Xi h YXiaochun YangInstitute of Computer Software and Theory

School of Information Science and EngineeringNortheastern University, CHINA

ang c@mail ne ed cn1

[email protected]

Page 2: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

OutlineOutline

• What is approximate query?• Q-gram based algorithmsQ gram based algorithms• Our research results

– VGRAM [VLDB’07]– Gram selection [SIGMOD’08][ ]

2

Page 3: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Example: a movie databaseExample: a movie database

Star Title Year Genre

The user doesn’t know the exact spelling!

Star Title Year GenreKeanu Reeves The Matrix 1999 Sci-FiSamuel Jackson Iron man 2008 Sci-FiSamuel Jackson Iron man 2008 Sci FiSchwarzenegger The Terminator 1984 Sci-FiSamuel Jackson The man 2006 Crime

3

Page 4: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Gap between Queries and Data

Errors in the queryThe user doesn’t remember a string exactlyThe user unintentionally types a wrong stringLimited input device

Query: Schwarrzenger. Data : Schwarzenegger

4… …

Page 5: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Data integration and cleansing

Errors in the database:Data often is not clean by itself, especially true in data integration and cleansingintegration and cleansing

Typos, Web data, OCR

Relation R Relation S

StarKeanu Reeves

StarKeanu Reeves

Samuel JacksonSchwarzenegger

Samuel L. JacksonSchwarzenegger

5

SchwarzeneggerSamuel Jackson

gg

Samuel L. Jackson

Page 6: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Search engineg

6

Page 7: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Spell checkingSpell checking

77

Page 8: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Approximate queryApproximate query

• Approximate search• Approximate join• Approximate join

8

Page 9: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Approximate searchApproximate search

Collection of strings s

Keanu ReevesStar

Search

S l J k

Schwarzenegger

Samuel Jackson

Query qSamuel Jackson

…Samule Jackson

Output: strings s that satisfy Sim(q,s)≤δ

9

Page 10: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Approximate joinApproximate join

Collection RJenny Stamatopoulou

Collection SPanos Ipirotis

John Paul McDougal

Aldridge Rodriguez

Jonh Smith

Panos Ipeirotis

John Smith

Jenny Stamatopulou

John P. McDougal

⋈…

g

Al Dridge Rodriguez

Output: string pair that satisfy Sim(r,s)≤δ

… Al Dridge Rodriguez

10

Page 11: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Similarity functionsSimilarity functions

• Edit distance• Jaccard similarityJaccard similarity• Hamming distance• Cosine similarity•• …

11

Page 12: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Performance is a big issuePerformance is a big issue

• Answer queries interactively• Many queries on a serverMany queries on a server

5ms/query 20ms/queryResponse time 5ms/query 20ms/query200 queries/second 50 queries/secondThroughput

12

Page 13: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

OutlineOutline

• What is approximate query?• Q-gram based algorithmsQ gram based algorithms• Our research results

– VGRAM [VLDB’07]– Gram selection [SIGMOD’08][ ]

13

Page 14: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

“q-grams” of stringsq grams of strings

u n i v e r s a lu n i v e r s a l

2-grams V(s,q)

(un, ni, iv, ve, er, rs, sa, al)

14String with length L → L - q + 1 q-grams

Page 15: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Similarity between gram setsSimilarity between gram sets

Set Sim Join

GG Gram setGram set

mcrosoft…… …

microsoft…… … SR

……

……

……

……

15

Page 16: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

b d i d li i dq-gram based inverted lists index

4ath

id strings0 rich

chckic

201 30 1 2 41

23

stickstichstuck 2 3

01 4

2-grams icrist

0 1 2 4

34

stuckstatic

tatit

41 2 4

16

tuuc

33

Page 17: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Approximate query processingApproximate query processing

Q “ hti k” ED( hti k ?)≤1# of common grams >= 3

• Query: “shtick”, ED(shtick, ?)≤1ti ic cksh ht ti ic ck

ath

4

id strings0 rich

chckic

201 30 1 2 41

23

stickstichstuck

2-grams icrist 2 3

01 4

0 1 2 4

34

stuckstatic

tatit

41 2 4

17

tuuc

33

Page 18: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

2-grams -> 3-grams?2 grams > 3 grams?

Q “ hti k” ED( hti k ?)≤1# of common grams >= 1

• Query: “shtick”, ED(shtick, ?)≤1tic icksht hti tic ick

atiich

420

id strings0 rich

ickricsta

10

id strings0 richid strings0 rich123

stickstichstuck

3-grams stastistu

2413

123

stickstichstuck

123

stickstichstuck3

4stuckstatic

stutattic

341 42

34

stuckstatic

34

stuckstatic

18tucuck

33

Page 19: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Observation 1: skew distributions of gram frequencies• DBLP: 276 699 article titlesDBLP: 276,699 article titles• Popular 5-grams: ation (>114K times), tions, ystem, catio

19

Page 20: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Observation 2: dilemma of choosing “q”

I i “ ” i• Increasing “q” causing:– Longer grams Shorter lists – Smaller # of common grams of similar strings

4ath

id strings0 rich

chckic

201 30 1 2 41

23

stickstichstuck 2 3

01 4

2-grams icrist

0 1 2 4

34

stuckstatic

tatit

41 2 4

20

tuuc

33

Page 21: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

MotivationMotivation

• Small index size (memory)• Small running time• Small running time

– Scan matched inverted lists– Calculate ED(query, candidate)

21

Page 22: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

What we got?What we got?

• VLDB 2007– VGRAM: Improving Performance of Approximate p g pp

Queries on String Collections Using Variable-Length Gramsg

• SIGMOD 2008C B d V i bl L h G S l i– Cost-Based Variable-Length-Gram Selection for String Collections to Support A i Q i Effi i lApproximate Queries Efficiently

22

Page 23: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

VGRAM: Main idea [VLDB07]VGRAM: Main idea [VLDB07]

i h i bl l h (b d• Grams with variable lengths (between qmin and qmax)– zebra

• ze(123)– corrasion

• co(5213), cor(859), corr(171)

• Advantages– Reduce index size☺Reduce index size ☺– Reducing running time ☺

Adoptable by many algorithms☺23

– Adoptable by many algorithms ☺

Page 24: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

ChallengesChallenges

• Generating variable-length grams?• Constructing a high quality gram dictionary?• Constructing a high-quality gram dictionary?• Relationship between string similarity and their

i il igram-set similarity?• Adopting VGRAM in existing algorithms?p g g g

24

Page 25: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Challenge 1: String Variable-length grams?Challenge 1: String Variable length grams?

Fi d l th 2• Fixed-length 2-gramsu n i v e r s a l

• Variable-length grams[2 4] gram dictionary

nii

[2,4]-gram dictionary

u n i v e r s a l ivrsal

i25

univers

Page 26: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

R i di i iRepresenting gram dictionary as a trie

Fi d l th 2• Fixed-length 2-gramsu n i v e r s a l

• Variable-length grams[2 4] gram dictionary

nii

[2,4]-gram dictionary

u n i v e r s a l ivrsal

i26

univers

Page 27: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

l ti

Challenge 2: Constructing gram dictionary

• selecting grams – Pruning trie using a frequency threshold T (e.g., 2)

27

Page 28: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Challenge 2: Constructing gram dictionary

l ti• selecting grams – Pruning trie using a frequency threshold T (e.g., 2)

28

Page 29: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Final gram dictionaryFinal gram dictionary

29Final grams

Page 30: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

• VGRAM– Main idea– Decomposing strings to grams

Choosing good grams– Choosing good grams– Effect of edit operations on grams– Adopting vgram in existing algorithms

• ExperimentsExperiments

30

Page 31: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Challenge 3: Edit operation’s effect on grams

u n i v e r s a lFixed length: q

k operations could affect k * q grams

31

Page 32: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Deletion affects variable-length grams

Not affected Not affectedAff dNot affected Not affectedAffected

i-qmax+1 i+qmax- 1Deletion

ie et o

32

Page 33: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Grams affected by a deletionGrams affected by a deletion

Aff d?Affected?

i-qmax+1 i+qmax- 1Deletion

i

[2,4]-gramsniivru n i v e r s a l

Deletion

saluni

u n i v e r s a l

Aff t d?33

versAffected?

Page 34: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

# f ff t d b h ti# of grams affected by each operation

Deletion/substitution Insertion

u n i v e r s a l

0 1 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 1 0_ u _ n _ i _ v _ e _ r _ s _ a _ l _

34

Page 35: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

# f ff t d b k ti# of grams affected by k operations

• k-Max: Summation of k largest valuesNAG(s 2)=3+3=6NAG(s,2) 3+3 6

1 2 3 2 3 2 1 1b i i n d i n gb i i n d i n g

Too pessimistic?35

Too pessimistic?

Page 36: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Tightening NAG(s k) [SIGMOD’08]Tightening NAG(s,k) [SIGMOD 08]

• Dynamic programming: tightening NAG(s,k)– Subproblems: NAG(s[1 j] i)Subproblems: NAG(s[1,j], i)

opi

String sj1

36

Page 37: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Dynamic programmingDynamic programming

• Recurrence function

B[ j ]

opiopiopi-1

String sj1

37

Page 38: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Dynamic programmingDynamic programming

1 2 3 2 3 2 1 1b i i n d i n gb i i n d i n g

0 0 0 0 0 0 0 0 00 1 2 3 3 3 3 3 3

k=0k=1 NAG vector

380 1 2 3 4 5 5 5 5k=2

Page 39: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Lower bound on # of common gramsLower bound on # of common grams

Fi d l h ( )

i l

Fixed length (q)

u n i v e r s a l

If ed(s1,s2) <= k, then their # of common grams >=:(|s1|- q + 1) – k * q

Variable lengths: # of grams of s1 – NAG(s1,k)

39

g g ( , )

Page 40: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Challenge 4: adopting VGRAMChallenge 4: adopting VGRAM

Easily adoptable by many algorithms

Basic interfaces:• String s grams• String s1 s2 such that ed(s1 s2) <= k• String s1, s2 such that ed(s1,s2) < k

min # of their common grams

VGRAMstring grams

l b d40

gram dictionary lower bound

Page 41: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Example: algorithm using inverted listsExample: algorithm using inverted lists

Q “ h i k” ED( h i k ?) 1• Query: “shtick”, ED(shtick, ?)≤1sh ht tick

2-4 grams2-gramstick

1 3…ck

Lower bound = 31 3

…cki

2 4 grams2 grams

1 2 4

1 20 4ic…ti

401

2icich

1 2 4ti… 2 4

1

…tictick

id strings0 richid strings0 richid strings0 rich

…012

richstickstich

012

richstickstich

012

richstickstich

41

Lower bound = 134

stuckstatic

34

stuckstatic

34

stuckstatic

Page 42: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Gram selection [SIGMOD’08]Gram selection [SIGMOD 08]

C t b d tit ti hCost-based quantitative approachAnalyze and estimate query performance when adding each gramAutomatically find high-quality grams

High quality gramGram

dictionaryString

collection

High quality gram

42

dictionary

Page 43: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

OutlineOutline

• Effects of adding a gram on index and queries• Cost based construction of gram dictionary• Cost-based construction of gram dictionary• Experiments

43

Page 44: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Effects on inverted listsEffects on inverted lists

b

Gram dictionaryabGram dictionary

abbc

add gram abc bcabcabc

string --abc----ab----bc--

44

Page 45: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Effects on query performanceEffects on query performance

• Decrease query’s inverted list• Change lower boundChange lower bound• Change # of candidates

45

Page 46: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Effects on query’s inverted listsEffects on query s inverted lists

b

Gram dictionaryabGram dictionary

abbc

add gram abc bcabcabc

Query QQuery Q - - - - - - - - - - - - -- - - - - a b - - - - - -- - - - - a b c - - - - -

46Adding a new gram abc will not change or decrease the query’s inverted lists

Page 47: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Effects on lower boundEffects on lower bound

• Query: Q, ED(Q, ?)≤1

Query Q - - - - a b c d - - - - -

- - - - a b c d - - - - -Query Q

47

Page 48: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Effects on lower boundid strings12

bingobi iEffects on lower bound2

34

bioinngbitinginbiting

• Query: “bingon”, ED(bingon, ?)≤1 56

gboinggoing

Dictionary

Gram set: VG(Q) |VG(Q)| NAG(Q,1) Lower bound

D0 {bi, in, ng, go, on} 5 2 3D1 (+ ing) {bi, ing, go, on} 4 2 21 ( g) { , g, g , }D2 (+ bin) {bin, ing, go, on} 4 2 2

48

Page 49: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Effects on # of candidatesEffects on # of candidates

• Change lower bound change # of candidates

Gram dictionary

ab add gram abc

Gram dictionaryabbc

Gram dictionary

bcg bc

abc

Query Q

b d b d- - - - a b c d - - - - - - - - a b c d - - - -

49

Page 50: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Effects of queriesid strings12

bingobi iEffects of queries 2

34

bioinngbitinginbiting

56

gboinggoing

Query Q

Dictionary Gram setVG(Q)

Scanned list size

CandidatesQ VG(Q) list size

D0 {bi, in, ng, go, on} 19 1,2,3,4,6bi D (+ i ) {bi i } 11 1 3 4 6bingon D1 (+ ing) {bi, ing, go, on} 11 1,3,4,6

D2 (+ bin) {bin, ing, go, on} 8 1,6D0 {bi, it, tt, ti, in, ng} 21 3,4

bitting D1 (+ ing) {bi, it, tt, ti, ing } 13 3,4

50

1

D2 (+ bin) {bi, it, tt, ti, ing } 12 1,3,4

Page 51: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

OutlineOutline

• Effects of adding a gram on index and queries• Cost-based construction of gram dictionaryg y• Experiments

51

Page 52: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

C t t di tiConstruct a gram dictionary [VLDB’07]

qmin=2

q =4qmax=4

52

Page 53: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

C b iCost-base construction [SIGMOD’08]

qmin=2

53

Page 54: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Construct a gram dictionaryid strings12

bingobioinngConstruct a gram dictionary2

345

bioinngbitinginbitingboing

qmin=2

i nbo

n1

tg

6g

going

i

in2

on4

on3

i o tnn5

g nn6

in7

i

g

…8n n9 n13n10 n11 n14n12 n15 n16 n17 n18

tn o123

33……

123

123

123

21nn19 n20 34

34

34

34

2 234

^

34

3456

3456

343

434

11 22 56

54

6 6

Page 55: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Experiments -- Data setsExperiments Data sets

Data set String # Length Range of # of g g ginjected edit operations

Min Max Avg

Article Titles 277,000 6 207 66 [1,6]Movie Titles 855,000 8 249 35 [1,3]Movie Titles 855,000 8 249 35 [1,3]Actor Names 1,200,000 4 74 17 [1,2]

Environment:GNU C++ Dell GX620 PC with an Intel Pentium 2 40Hz Dual Core CPU

55

GNU C++, Dell GX620 PC with an Intel Pentium 2.40Hz Dual Core CPU, 2GB memory, 250GB disk, Ubuntu (Linux) O.S. Index structure were assumed to be in memory

Page 56: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Improving algorithm ProbeCount [VLDB’07]

56Dataset 1: Person name

Page 57: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Improving algorithm ProbeCluster[VLDB’07]

57Dataset 1: Person name

Page 58: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Improving algorithm PartEnum[VLDB’07]

58Dataset 1: Person name

Page 59: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Comparison with algorithm PruneComparison with algorithm Prune [SIGMOD’08]

59

Dataset: 1M article titlesPrune: qmin=5, qmax=7, T=2000, LargeFirst policyGramGen: 1% sampling ratio, 2000 queries, (qmin=5 automatically determined)

Page 60: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Comparison with algorithm PruneComparison with algorithm Prune [SIGMOD’08]

60Prune: qmin=5, qmax=7, T=2000, LargeFirst policyGramGen: 1% sampling ratio, 2000 queries, (qmin=5 automatically determined)

Page 61: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

ConclusionsConclusions

• VGRAM + gram selection– variable-lengthva ab e e g– high-quality

Ad t bl i i ti l ith• Adoptable in existing algorithms– Reduce index size– Reduce running time

61

Page 62: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

How to do researchHow to do research

U i tifi th d• Use scientific method

62

Page 63: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

How to do researchHow to do research

• Real application -- Make your approach usefulReal application Make your approach useful

63

Page 64: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

How to do researchHow to do research

Work hard

M k diffMake a difference

Enjoy your work

64

Page 65: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

How to do researchHow to do research

• Focus on detailFocus on detail

• Be efficient

• Beat the problem to death!• Beat the problem to death!

65

Page 66: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Can we run faster?

66

Page 67: Approximate Queries on String Collections · 2008-10-07 · Approximate Queries on String Collections Xi h YXiaochun Yang Institute of Computer Software and Theory School of Information

Thank youy

67