Substring statistics : Algorithm

Kyoji Umemura and Kenneth W. Church
CICLing 2009
April 14, 2009

[Figure: log-scale plot for NTCIR2G (any ngrams), with axes df2/df and df/N.]

There are N(N-1)/2 substrings in a corpus of length N. Substring statistics are useful for finding word boundaries in Japanese.

How it works

[Figure: log-scale plot for NTCIR2G (any ngrams), with axes df2/df and df/N.]

First prepare a table of all statistics. Then obtain each statistic using binary search.

Result: Preparation: O(N), Access: O(log(N))

Size of Table is O(N), not O(N^2)

There are N(N-1)/2 substrings in a corpus of length N. N(N-1)/2 is too large for memory. But many of the substrings have the same statistics.

Strings with the same statistics are grouped into a class.

Saving Preparation Time

If each table entry required O(N) computation to obtain its value, the preparation time would be O(N^2), because the table size is O(N).

Instead, obtain the statistics of a shorter substring from the statistics of longer substrings which share the same beginning.

Definition of cf(x): x is a string

[Figure: example corpus of six documents, #1 to #6, made up of the words "abc" and "ab".]

cf(x): the number of occurrences of the string (not word) x in the corpus. The word "abc" contains the string "ab".

Ex. cf(abc) = 5
Ex. cf(ab) = 11

Definition of tf(d,x): x is a string

[Figure: example corpus of six documents, #1 to #6, made up of the words "abc" and "ab".]

tf(d,x): the number of occurrences of the string x in document #d.

Ex. tf(1, abc) = 2
Ex. tf(6, ab) = 4

Definition of df(x): x is a string

[Figure: example text corpus of six documents, #1 to #6, made up of the words "abc" and "ab".]

df(x): the number of documents which contain the string x at least once. The word "abc" contains the string "ab".

Ex. df(abc) = 4
Ex. df(ab) = 5

Definition of df2(x)

[Figure: example corpus of six documents, #1 to #6, made up of the words "abc" and "ab".]

df2(x): the number of documents which contain the string x twice or more.

Ex. df2(abc) = 1
Ex. df2(ab) = 3

Definition of dfk(x)

[Figure: example corpus of six documents, #1 to #6, made up of the words "abc" and "ab".]

dfk(x): the number of documents which contain the string x k times or more.

Ex. df3(ab) = 2
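
The definitions of cf, tf, df and dfk can be checked with a short Python sketch. The exact layout of the six documents on the slides is not recoverable from the text, so the corpus below is a hypothetical one, chosen only so that it reproduces the example values quoted on the slides. The helper names are illustrative as well.

```python
def count_occurrences(text: str, x: str) -> int:
    """Count occurrences of x in text as a string, not a word (overlaps allowed)."""
    count, pos = 0, text.find(x)
    while pos != -1:
        count += 1
        pos = text.find(x, pos + 1)
    return count

# Hypothetical corpus (an assumption): six documents built from "abc" and "ab".
CORPUS = {
    1: "abc abc ab",
    2: "ab abc",
    3: "ab",
    4: "abc",
    5: "c",
    6: "ab ab abc ab",
}

def tf(d: int, x: str) -> int:
    """Occurrences of string x in document d."""
    return count_occurrences(CORPUS[d], x)

def cf(x: str) -> int:
    """Occurrences of string x in the whole corpus."""
    return sum(tf(d, x) for d in CORPUS)

def dfk(k: int, x: str) -> int:
    """Number of documents that contain x k times or more."""
    return sum(1 for d in CORPUS if tf(d, x) >= k)

def df(x: str) -> int:
    """Number of documents that contain x at least once."""
    return dfk(1, x)

print(cf("abc"), cf("ab"))            # 5 11
print(tf(1, "abc"), tf(6, "ab"))      # 2 4
print(df("abc"), df("ab"))            # 4 5
print(dfk(2, "abc"), dfk(2, "ab"))    # 1 3
print(dfk(3, "ab"))                   # 2
```

Each call rescans the corpus, which is exactly the cost that the table built in the following slides avoids.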

Suffix Array 1

[Figure: the suffixes of an example text, suffix[0] to suffix[6], sorted in dictionary order.]

Suffix array: the suffixes of the text, sorted in dictionary order.

Suffix Array 2

[Figure: the same suffixes, suffix[0] to suffix[6], each expressed by one integer.]

A suffix is expressed by one integer.

Space: O(N)
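
As a minimal sketch of the idea, the following Python builds a suffix array by plain sorting. The example text on the slide is not recoverable, so the text "abcab" is an assumption, and the simple sort stands in for whatever construction the authors' tool actually uses.

```python
def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes, sorted in dictionary order.

    Each suffix is stored as one integer (its start position), so the array
    itself takes O(N) space. Sorting by slices is simple but not the fastest
    construction; it is used here only for clarity.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

text = "abcab"  # hypothetical example text
suffix = build_suffix_array(text)
for rank, start in enumerate(suffix):
    print(f"suffix[{rank}] = {start}  ({text[start:]})")
```

For "abcab" the sorted suffixes are "ab", "abcab", "b", "bcab" and "cab", so the array holds the start positions 3, 0, 4, 1, 2.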

Classes: sets of strings which start the same suffixes

common[i]: the common prefix that suffix[i] shares with suffix[i+1]
outgoing(i,j) = max(common[i-1], common[j])
inner(i,j) = min_{k=i..j-1} common[k]

[Figure: example suffix array, suffix[0] to suffix[5], with three classes marked: "a" and "aa"; "aab"; and "aabb", "aabbc" and "aabbcc".]

The set of strings "aabb", "aabbc" and "aabbcc" is a class because each of them starts suffix[1] to suffix[2] and no other suffix.

Members of a class share statistics

For a class C, any x and y in C satisfy the following, because they start the same suffixes.

cf(x) = cf(y)
tf(d,x) = tf(d,y) for all d
df(x) = df(y)
df2(x) = df2(y)
dfn(x) = dfn(y) for all n

LCP (Length of Common Prefix)

[Figure: example suffix array, suffix[0] to suffix[5].]

lcp[0] = 2, since suffix[0] and suffix[1] have the common prefix "aa", and the length of "aa" is 2.
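
A small sketch of how the LCP array could be computed directly from its definition follows. The text "abcab" is a hypothetical example (the slide's own text is not recoverable), and the character-by-character comparison is chosen for readability, not speed.

```python
def common_prefix_length(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def build_lcp(text: str, suffix: list[int]) -> list[int]:
    """lcp[i] = length of the common prefix of suffix[i] and suffix[i+1]."""
    return [
        common_prefix_length(text[suffix[i]:], text[suffix[i + 1]:])
        for i in range(len(suffix) - 1)
    ]

text = "abcab"  # hypothetical example text
suffix = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
lcp = build_lcp(text, suffix)
print(lcp)  # [2, 0, 1, 0]
```

Here lcp[0] is 2 because the first two sorted suffixes, "ab" and "abcab", share the prefix "ab".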

Classes on Suffix Array

common[i]: the common prefix that suffix[i] shares with suffix[i+1] (its length is lcp[i])
outgoing(i,j) = max(common[i-1], common[j])
inner(i,j) = min_{k=i..j-1} common[k]

A class corresponds to a suffix interval which satisfies ...

Ex. for the interval [1,4]:
outgoing(1,4) = max(lcp[0], lcp[4]) = 2
inner(1,4) = min_{k=1..3} lcp[k] = 3

[Figure: the example suffix array, suffix[0] to suffix[5], marked with the intervals [0,0], [1,1], [2,2], [3,3], [4,4], [5,5], [1,2], [3,4], [1,4] and [0,4].]
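
The interval test can be tried out with the sketch below. The slide elides the exact condition, so the code assumes the usual one for lcp-delimited intervals, inner(i,j) > outgoing(i,j), which agrees with the worked example (3 > 2). The LCP values are inferred from the example classes, except lcp[3] = 5, which is an assumption (any value above 3 gives the same intervals). Boundaries outside the array are treated as 0, which is also an assumption.

```python
# LCP array for the six example suffixes; lcp[3] = 5 is an assumed value.
LCP = [2, 6, 3, 5, 0]

def lcp_at(lcp: list[int], k: int) -> int:
    """lcp[k], or 0 when k falls outside the array (assumed boundary value)."""
    return lcp[k] if 0 <= k < len(lcp) else 0

def outgoing(lcp: list[int], i: int, j: int) -> int:
    """Longest prefix shared with a suffix just outside the interval [i, j]."""
    return max(lcp_at(lcp, i - 1), lcp_at(lcp, j))

def inner(lcp: list[int], i: int, j: int) -> int:
    """Length of the prefix shared by every suffix in [i, j]; requires i < j."""
    return min(lcp[i:j])

def is_class(lcp: list[int], i: int, j: int) -> bool:
    """Assumed class condition for a non-trivial interval: inner > outgoing."""
    return i < j and inner(lcp, i, j) > outgoing(lcp, i, j)

def nontrivial_classes(lcp: list[int]) -> list[tuple[int, int]]:
    """Brute-force check of every interval [i, j] with i < j."""
    n = len(lcp) + 1  # number of suffixes
    return [(i, j) for i in range(n) for j in range(i + 1, n) if is_class(lcp, i, j)]

print(outgoing(LCP, 1, 4), inner(LCP, 1, 4))  # 2 3
print(nontrivial_classes(LCP))                # [(0, 4), (1, 2), (1, 4), (3, 4)]
```

The single-suffix intervals [0,0] to [5,5] are trivial classes and are not tested here. For a class [i,j], the member strings are the prefixes whose length is greater than outgoing(i,j) and at most inner(i,j); for [1,2] that gives lengths 4 to 6, i.e. "aabb", "aabbc" and "aabbcc".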

Upper and Lower Class

If C1 contains all the suffixes of C2, C1 is an upper class of C2, and C2 is a lower class of C1.

[Figure: an upper class and a lower class drawn as nested intervals on the suffix array.]

Property of Class

Classes form one tree. Therefore the total number of classes is less than 2N-1, where N is the size of the corpus.

[Figure: upper and lower classes drawn as nested intervals on the suffix array.]

The number of occurrences of a string is obtained from longer strings

[Figure: corpus of six documents, #1 to #6, containing three occurrences of "abc" and six occurrences of "abx".]

cf(abc) = 3
cf(abx) = 6
cf(ab) = 3 + 6 = 9

The number of documents is not

[Figure: corpus of six documents, #1 to #6, containing three occurrences of "abc" and six occurrences of "abx".]

cf(abc) = 3
cf(abx) = 6
cf(ab) = 3 + 6 = 9

df(abc) = 3
df(abx) = 4
df(ab) = 3 + 4? No: df(ab) = 5
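
The contrast between cf and df can be reproduced with a few lines of Python. The placement of "abc" and "abx" in the six documents is not recoverable from the slide, so the corpus below is a hypothetical arrangement that yields the same counts.

```python
# Hypothetical corpus (an assumption): 3 x "abc" and 6 x "abx" over six documents.
CORPUS = ["abc abx", "abc abx abx", "abc", "abx abx", "abx", "zzz"]

def cf(x: str) -> int:
    """Occurrences of x in the corpus (str.count is enough: no self-overlaps here)."""
    return sum(doc.count(x) for doc in CORPUS)

def df(x: str) -> int:
    """Number of documents that contain x at least once."""
    return sum(1 for doc in CORPUS if x in doc)

# Every "ab" here continues as "abc" or "abx", so occurrence counts add up.
print(cf("abc"), cf("abx"), cf("ab"))  # 3 6 9
# Documents containing both "abc" and "abx" would be counted twice by a sum.
print(df("abc"), df("abx"), df("ab"))  # 3 4 5
```

The sum works for cf because every occurrence of "ab" is the start of exactly one longer string. It fails for df because the first two documents contain both "abc" and "abx".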

Numbering a string in a document by suffix order

The occurrences of a string within one document are numbered in suffix order.

[Figure: document #1 and its suffixes at suffix[11], suffix[23], suffix[29], suffix[89] and suffix[94] of the suffix array; the suffix at suffix[89] begins with "abx".]

For string "ab", the numbering of suffix[89] (abx..) is 3.

For string "a", the numbering of suffix[89] (abx..) is 4.

k-frequency of string x

tfk(d,x): the number of occurrences of string x in document d whose numbering is k or larger.

cfk(x): the sum of tfk(d,x) over all documents d in the corpus.

Property: for a class C, any x and y in C satisfy tfk(d,x) = tfk(d,y), because they start the same suffixes.

k-frequency and document frequency

Property: df_k(x) = cf_k(x) - cf_{k+1}(x)

[Figure: documents #1, #2 and #3, whose occurrences of "ab" (including those inside "abc") are numbered within each document; the numbers 1 to 6 each appear three times, 7 appears twice, and 8 appears once.]

cf8(ab) = 1
cf7(ab) = 3
cf6(ab) = 6

df7(ab) = cf7(ab) - cf8(ab) = 3 - 1 = 2
df6(ab) = cf6(ab) - cf7(ab) = 6 - 3 = 3
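
The relation between k-frequency and document frequency can be verified numerically. The sketch below assumes, from the numbering shown in the figure, that the three documents contain "ab" 8, 7 and 6 times; which document holds which count is an assumption and does not affect the result.

```python
# Occurrences of "ab" per document, inferred from the numbering in the figure.
TF_AB = {1: 8, 2: 7, 3: 6}

def tfk(k: int, tf: int) -> int:
    """Occurrences numbered k or larger when a document has tf occurrences (1..tf)."""
    return max(0, tf - k + 1)

def cfk(k: int) -> int:
    """Sum of tfk over all documents."""
    return sum(tfk(k, tf) for tf in TF_AB.values())

def dfk(k: int) -> int:
    """Document frequency via the property df_k = cf_k - cf_{k+1}."""
    return cfk(k) - cfk(k + 1)

def dfk_direct(k: int) -> int:
    """Document frequency straight from its definition, for comparison."""
    return sum(1 for tf in TF_AB.values() if tf >= k)

print(cfk(8), cfk(7), cfk(6))  # 1 3 6
print(dfk(7), dfk(6))          # 2 3
```

The property holds because a document with at least k occurrences contributes exactly one more occurrence to cf_k than to cf_{k+1}, namely the one numbered k.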

Enumerating classes by scanning the suffix array and LCP

Scan the LCP of each suffix of the suffix array, from beginning to end. Class records on the stack are incomplete classes.

1. LCP increases: beginning of a class. Assign the start point and the lcp of the class, and push the class record onto the stack.
2. LCP decreases: end of a class. Pop the class record and assign the end point of the class.

[Figure: suffix array (suffix[0] to suffix[3]) and stack; a class is detected over suffix[1] to suffix[2], and another at suffix[2]; the stack is drawn against suffix position.]

Note: more care is needed for classes sharing a start or an end.
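
The scan can be sketched as follows. This is a generic stack-based enumeration of lcp intervals, not necessarily the authors' implementation. The LCP values are the ones inferred from the "aab" example, with lcp[3] = 5 as an assumed value. Classes that share a start are handled by letting a newly pushed record inherit the start of the last record popped; single-suffix classes and the empty-string class are not listed.

```python
def enumerate_classes(lcp: list[int]) -> list[tuple[int, int, int]]:
    """Return non-trivial classes as (start, end, lcp of class), in completion order.

    lcp[i] is the common prefix length of suffix[i] and suffix[i+1].
    """
    classes = []
    stack = []  # incomplete classes as (start, lcp of class), ordered by start
    n = len(lcp) + 1  # number of suffixes
    for i in range(n):
        value = lcp[i] if i < len(lcp) else 0  # sentinel 0 closes everything
        start = i
        # LCP decreased: every class with a larger lcp ends at suffix i.
        while stack and stack[-1][1] > value:
            start, class_lcp = stack.pop()
            classes.append((start, i, class_lcp))
        # LCP increased: a new class begins, possibly sharing a popped start.
        if value > 0 and (not stack or stack[-1][1] < value):
            stack.append((start, value))
    return classes

LCP = [2, 6, 3, 5, 0]  # lcp[3] = 5 is an assumed value
for start, end, class_lcp in enumerate_classes(LCP):
    print(f"class[{start},{end}] lcp={class_lcp}")
```

For this input the scan reports the intervals [1,2], [3,4], [1,4] and [0,4]. Each suffix position pushes at most one record, so the number of reported classes is bounded by the number of suffixes.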

Data structure for k-frequency: document link

[Figure: suffixes beginning with "abcx", "abcy" and "abcz" from documents #234 and #235, at suffix positions between 12301 and 12536. The suffixes of each document are connected by document links, and each chain ends with -1.]

Marking suffixes whose numbering is 3 or more for "abcx", "abcy" and "abcz"

[Figure: the suffixes of documents #234 and #235 with their document links, as on the previous slide, with marks for the classes "abcx", "abcy" and "abcz".]

Marking suffixes whose numbering is 3 or more for "abc"

[Figure: the suffixes of documents #234 and #235 with their document links, with marks for the class "abc".]

Counting k-frequency

• Decide the range by the document link.
• Count up the variables of the classes in the stack whose start is smaller than the range.

[Figure: class records drawn against suffix position, labelled "number of record in class", with a Range marked.]

Reducing computation by Class*

Simple method: count up the variables in all incomplete classes which contain the specified range.

→ multiple counting for one suffix

There is a way to get the same result by single counting with propagation.

Suffixes marked for "abcx", "abcy" and "abcz" are also marked in "abc"

[Figure: the suffixes of documents #234 and #235 with their document links, with marks for the classes "abcx", "abcy", "abcz" and "abc".]

Ranges given by the document link show that some suffixes belong to "abc", and not to "abcx", "abcy"...

[Figure: the suffixes of documents #234 and #235 with their document links, with marks for the classes "abcx", "abcy", "abcz" and "abc".]

Definition: Class* of an interval

• The class which has the smallest interval among the classes which contain the interval.
• i.e. the lowest class among the classes which contain the interval.

[Figure: class records in the stack drawn against suffix position, with a Range marked.]

Reducing computation by Class*

Simple method: count up the variables in all incomplete classes which contain the specified range.

→ multiple counting for one suffix

Note the following property.

Property: classes form a tree structure. If a suffix has number K for one class, the suffix has number K or more in the upper class.

So: count up the variables of the Class* of the range, then propagate the count to the upper classes.

How to find Class*

• Class* is a class on the stack
• whose start is smaller than the range start, and is the closest one.

[Figure: class records in the stack drawn against suffix position, with a Range marked.]

How to find Class*

• Using binary search, search for the class with the appropriate start position. Note: classes in the stack are ordered by start.

[Figure: class records in the stack drawn against suffix position, with a Range marked.]
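
A minimal sketch of this search, assuming the stack is represented only by the list of class start positions (ordered by start, as noted on the slide). The function name and the list layout are illustrative. The comparison is strict ("smaller than"), following the slide's wording; whether an equal start should also qualify is not stated in the source.

```python
import bisect

def find_class_star(stack_starts: list[int], range_start: int):
    """Index of the class on the stack whose start is smaller than range_start
    and closest to it, or None if no such class exists.

    stack_starts must be in non-decreasing order (bottom of stack first).
    """
    index = bisect.bisect_left(stack_starts, range_start) - 1
    return index if index >= 0 else None

# Hypothetical stack: classes starting at suffix positions 0, 1, 1 and 3.
starts = [0, 1, 1, 3]
print(find_class_star(starts, 2))  # 2 (the upper of the two classes starting at 1)
print(find_class_star(starts, 0))  # None
```

When several classes on the stack share a start, the search returns the one nearest the top of the stack, which is the lowest class among them.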

Preparation

Corpus: abcabcabc abcd abcde bcde

Tables

class          df   df2   S (longest string)   strings in class
class[4,8]     3    1     "abc"                a, ab, abc
class[5,6]     1    1     "abcabc"             abca, abcab, abcabc
class[7,8]     2    0     "abcd"               abcd
class[9,14]    4    1     "bc"                 b, bc
class[10,11]   1    1     "bcabc"              bca, bcab, bcabc
...

Sort the class table by dictionary order of the longest string of each class.

Access to table

Query: "abc" ??

Corpus: abcabcabc abcd abcde bcde

class          df   df2   S (longest string)   strings in class
class[4,8]     3    1     "abc"                a, ab, abc
class[5,6]     1    1     "abcabc"             abca, abcab, abcabc
class[7,8]     2    0     "abcd"               abcd
class[9,14]    4    1     "bc"                 b, bc
class[10,11]   1    1     "bcabc"              bca, bcab, bcabc
...

Result: df=3 df2=1

Binary search to find the first class that matches the query. Note: other orderings are also possible.
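
The lookup can be sketched with Python's bisect module. The table holds only the five rows shown on the slide (the slide's table continues with "..."), and the matching rule, "the first class whose longest string starts with the query", is an interpretation of the slide's wording.

```python
import bisect

# Rows from the slide, sorted by the longest string S: (S, df, df2).
TABLE = [
    ("abc", 3, 1),
    ("abcabc", 1, 1),
    ("abcd", 2, 0),
    ("bc", 4, 1),
    ("bcabc", 1, 1),
]
KEYS = [row[0] for row in TABLE]

def lookup(query: str):
    """Return (S, df, df2) of the first class whose longest string starts with
    the query, or None if this partial table has no such class."""
    i = bisect.bisect_left(KEYS, query)  # O(log N) comparisons
    if i < len(TABLE) and KEYS[i].startswith(query):
        return TABLE[i]
    return None

print(lookup("abc"))   # ('abc', 3, 1)
print(lookup("ab"))    # ('abc', 3, 1)
print(lookup("abca"))  # ('abcabc', 1, 1)
print(lookup("b"))     # ('bc', 4, 1)
```

Taking the first match is sufficient because every other class whose longest string starts with the query is a lower class, and its longest string extends the longest string of the query's own class, so it sorts later.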

Conclusion

• Various document frequencies can be computed from the k-frequency of a string.
• The k-frequency of a short string is the sum of the k-frequencies of longer strings; that is not the case for document frequency.
• The k-frequency is efficiently computed using the suffix array and the document link.

Download & Try

• http://www.ss.cs.tut.ac.jp/umemura/cicling2009/
• Try ./show-class for your favorite string.
  • suggesting short string with patterns
  • verify the various frequencies