King’s College London, University of London

MSc in Advanced Software Engineering

Approximate Indexing: Gapped Suffix Array

KyungHoon Park

Agenda

Research Objective

Gapped suffix array

Application

Going beyond gSA

Q&A

Research Objective

Main questions

1. Given an already constructed suffix array, can a gapped suffix array be constructed in O(n) time?

2. What are the limitations of the gapped suffix array? How can they be overcome?

Research aims

1. To fully understand and implement the suffix array and the LCP array.

2. To implement a gapped suffix array from the suffix array in O(n) time.

3. To study and implement the gapped suffix array paper.

4. To investigate whether the approach can be extended to multiple gapped suffix arrays, and to research other limitations.

Gapped Suffix Array

Main questions

1. Given an already constructed suffix array, can a gapped suffix array be constructed in O(n) time?

2. What are the limitations of the gapped suffix array? How can they be overcome?

Definitions

T = t1 t2 … tn (text) and P = p1 p2 … pm (search string) are strings of symbols over a finite alphabet

m = length of search string

n = length of text

k = number of mistakes allowed (Hamming distance)

Suffix Array

T = mississippi

i    T[i..]        SA[i]  T[SA[i]..]    LCP[i]
0    mississippi   10     i             0
1    ississippi    7      ippi          1
2    ssissippi     4      issippi       1
3    sissippi      1      ississippi    4
4    issippi       0      mississippi   0
5    ssippi        9      pi            0
6    sippi         8      ppi           1
7    ippi          6      sippi         0
8    ppi           3      sissippi      2
9    pi            5      ssippi        1
10   i             2      ssissippi     3
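The table above can be reproduced with a short sketch. This is an illustration only: it sorts the suffixes directly rather than using the skew algorithm named later in the deck, and the function names `suffix_array` and `lcp_array` are assumptions, not part of the original implementation. The LCP array is computed with Kasai's method.

```python
def suffix_array(text):
    # Naive construction: sort the start positions by the suffix they begin.
    # A stand-in for a linear-time builder such as the skew algorithm.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # Kasai's method: lcp[r] is the length of the longest common prefix of
    # the suffixes at SA ranks r-1 and r (lcp[0] is 0 by convention).
    n = len(text)
    rank = [0] * n
    for r, p in enumerate(sa):
        rank[p] = r
    lcp = [0] * n
    h = 0
    for p in range(n):
        r = rank[p]
        if r > 0:
            q = sa[r - 1]
            while p + h < n and q + h < n and text[p + h] == text[q + h]:
                h += 1
            lcp[r] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


sa = suffix_array("mississippi")
lcp = lcp_array("mississippi", sa)
```

Here `sa` and `lcp` hold the SA and LCP columns of the table, using the same 0-based positions.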

Gapped Suffix Array

1. First introduced by Crochemore and Tischler (2010).

2. Constructed after the SA.

3. An SA in which a fixed range of positions in every suffix is treated as a gap, to provide an approximate index.

4. The range of the gap is defined before constructing the gapped suffix array.

Gapped Suffix Array

T = mississippi, (1, 2)-gSA(3, 1)

i    T[i..]        SA[i]  gSA[i]  gapped suffix at gSA[i]
0    mississippi   10     10      i#
1    ississippi    7      7       i#pi
2    ssissippi     4      4       i#sippi
3    sissippi      1      1       i#sissippi
4    issippi       0      0       m#ssissippi
5    ssippi        9      9       p#
6    sippi         8      8       p#i
7    ippi          6      5       s#ippi
8    ppi           3      2       s#issippi
9    pi            5      6       s#ppi
10   i             2      3       s#ssippi

Definition

(g0, g1)-gSA(m, k)

gSA = gapped suffix array

g0 = start position of the gap (the first gapped character)

g1 = end position of the gap (the first character after the gap)

m = length of search string

k = Hamming distance
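As a cross-check of the definition, a (g0, g1)-gSA can be built naively by sorting the suffixes on the characters before the gap and then on the suffix that starts where the gap ends. The sketch below is not the linear-time construction the deck describes; the helper names `gapped_suffix` and `naive_gsa` are assumptions made for illustration, and g1 is taken to be the first position after the gap.

```python
def gapped_suffix(text, i, g0, g1):
    # Render the suffix starting at i with positions g0..g1-1 replaced by '#'.
    return text[i:i + g0] + "#" * (g1 - g0) + text[i + g1:]


def naive_gsa(text, g0, g1):
    # Sort start positions by (characters before the gap, suffix after the gap).
    # Python's sort is comparison based, so this is not linear time.
    return sorted(range(len(text)),
                  key=lambda i: (text[i:i + g0], text[i + g1:]))


gsa = naive_gsa("mississippi", 1, 2)
rows = [gapped_suffix("mississippi", i, 1, 2) for i in gsa]
```

For T = mississippi and gap (1, 2), `gsa` is the gSA column of the slide and `rows` lists the gapped suffixes in that order, starting with `i#` and ending with `s#ssippi`.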

Flow of constructing the gSA

1. Constructing the SA
   • Skew algorithm

2. Defining the limitations
   • Number of mistakes k
   • Range of the gap

3. Constructing the gSA
   • Sorting based on GRANK and HRANK

Limitations of gSA

1. The Hamming distance, the pattern length and the gap range must be defined prior to construction.

2. A gSA cannot cover every approximate match allowed by the defined k.
   Example: k = 2, gap = (1,3), P = coat
   c##t, ##at, co## (supported)
   #o#t, c#a# (not supported)

3. A gSA cannot support multiple gaps.
   Example: coach -> c#a#h

Constructing gSA - #1. GRANK

Example: m = 3, gap range = (1,2)

i       0  1  2  3  4  5  6  7  8  9  10
T[i]    m  i  s  s  i  s  s  i  p  p  i
GRANK   5  1  8  8  1  8  8  1  6  6  1

GRANK contains the ranks of the factors of T with length up to g0, that is, the rank of the part of each suffix that precedes the gap starting at position g0.
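One way to obtain the GRANK row above, sketched under assumptions: ranks are 1-based, and every suffix receives the SA rank of the first suffix that shares its first g0 characters. The function name `grank` is hypothetical, and the direct prefix comparison stands in for an LCP-based test.

```python
def grank(text, g0):
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive SA, for illustration
    ranks = [0] * n
    for r, p in enumerate(sa):
        prev = sa[r - 1] if r > 0 else None
        if prev is not None and text[p:p + g0] == text[prev:prev + g0]:
            # Same first g0 characters as the previous suffix: share its rank.
            ranks[p] = ranks[prev]
        else:
            # First suffix of a new group: its 1-based SA rank names the group.
            ranks[p] = r + 1
    return ranks


g = grank("mississippi", 1)
```

The result is indexed by text position, like the GRANK row of the slide.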

Constructing gSA - #2. HRANK

HRANK contains the ranks of the suffixes that start at the end of the gap.

Because the suffix array has already been constructed before the gapped suffix array, the rank of the suffix that starts where the gap ends can be looked up directly in the inverse suffix array (ISA).

HRANK[r] = ISA[SA[r]+g1]
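A minimal sketch of the HRANK formula, assuming that ISA is the inverse of SA with 1-based ranks and that rank 0 is used when SA[r] + g1 falls past the end of the text. Both conventions are inferred from the example values in this deck, and the function name `hrank` is hypothetical.

```python
def hrank(text, g1):
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive SA, for illustration
    isa = [0] * n
    for r, p in enumerate(sa):
        isa[p] = r + 1  # 1-based rank of the suffix starting at p
    # HRANK[r] = ISA[SA[r] + g1], or 0 if nothing is left after the gap.
    return [isa[p + g1] if p + g1 < n else 0 for p in sa]


h = hrank("mississippi", 2)
```

The result is indexed by SA rank r, so `h[r]` belongs to the suffix starting at SA[r].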

GRANK & HRANK

For example, with the gap (1,2), the fourth suffix, sissippi, splits into a GRANK part, the gap and an HRANK part as follows.

s        i      s s i p p i
GRANK    Gap    HRANK

Radix sorting the suffixes on the pair (GRANK, HRANK) then produces the gSA in linear time.
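The radix sort step can be sketched as two stable counting passes, first on HRANK and then on GRANK. The input arrays below are the values given in this deck for T = mississippi and gap (1,2); the helper name `counting_pass` is an assumption, and the sketch makes no claim to match the original C# implementation.

```python
def counting_pass(items, key, size):
    # Stable bucket pass: items with equal keys keep their relative order.
    buckets = [[] for _ in range(size)]
    for item in items:
        buckets[key(item)].append(item)
    return [item for bucket in buckets for item in bucket]


sa = [10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2]      # suffix array
grank = [5, 1, 8, 8, 1, 8, 8, 1, 6, 6, 1]    # indexed by text position
hrank = [0, 6, 8, 9, 11, 0, 1, 7, 10, 2, 3]  # indexed by SA rank

n = len(sa)
order = list(range(n))                                     # SA ranks 0..n-1
order = counting_pass(order, lambda r: hrank[r], n + 1)    # least significant key
order = counting_pass(order, lambda r: grank[sa[r]], n + 1)  # most significant key
gsa = [sa[r] for r in order]
```

Each pass touches every suffix once and uses n + 1 buckets, so the sort itself is linear in n.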

Example of (1,2)-gSA(3,1)

GRANK is listed per text position i; HRANK is listed per SA row, that is, for the suffix starting at SA[i].

i    T[i..]        SA[i]  gSA[i]  gapped suffix at gSA[i]   GRANK  HRANK
0    mississippi   10     10      i#                        5      0
1    ississippi    7      7       i#pi                      1      6
2    ssissippi     4      4       i#sippi                   8      8
3    sissippi      1      1       i#sissippi                8      9
4    issippi       0      0       m#ssissippi               1      11
5    ssippi        9      9       p#                        8      0
6    sippi         8      8       p#i                       8      1
7    ippi          6      5       s#ippi                    1      7
8    ppi           3      2       s#issippi                 6      10
9    pi            5      6       s#ppi                     6      2
10   i             2      3       s#ssippi                  1      3

Search in (1,2)-gSA(3,1)

For example, if P = mis (p0, p1, p2), three searches are needed:

- search mi (p0, p1) in the SA
- search is (p1, p2) in the SA
- search m#s (p0, p2) in the gSA

P = cot, (1,2)-gSA(3,1)

Pattern variant   Searching array
c#t               (1,2)-gSA(3,1)
#ot               SA
co#               SA
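The three searches can be sketched as follows for m = 3 and k = 1. This is an illustrative version under stated assumptions: both arrays are built naively, the truncated sort keys are materialised as lists so that the standard `bisect` functions can be used, and the function name `approx_search_m3_k1` is hypothetical. The deck's own search uses binary search with LCP instead.

```python
from bisect import bisect_left, bisect_right


def approx_search_m3_k1(text, pattern):
    # Start positions where pattern (length 3) matches with at most 1 mismatch.
    assert len(pattern) == 3
    n, m, g0, g1 = len(text), 3, 1, 2
    sa = sorted(range(n), key=lambda i: text[i:])
    gsa = sorted(range(n), key=lambda i: (text[i:i + g0], text[i + g1:]))
    sa_keys = [text[i:i + 2] for i in sa]                      # sorted like sa
    gsa_keys = [(text[i:i + g0], text[i + g1:i + m]) for i in gsa]

    hits = set()
    # co#: the first two characters match; the window must fit in the text.
    lo, hi = bisect_left(sa_keys, pattern[:2]), bisect_right(sa_keys, pattern[:2])
    hits.update(i for i in sa[lo:hi] if i + m <= n)
    # #ot: the last two characters match one position after the window start.
    lo, hi = bisect_left(sa_keys, pattern[1:]), bisect_right(sa_keys, pattern[1:])
    hits.update(i - 1 for i in sa[lo:hi] if i >= 1)
    # c#t: the first and last characters match around the gap.
    key = (pattern[:g0], pattern[g1:])
    lo, hi = bisect_left(gsa_keys, key), bisect_right(gsa_keys, key)
    hits.update(gsa[lo:hi])
    return sorted(hits)


result = approx_search_m3_k1("mississippi", "mis")
```

For T = mississippi and P = mis this reports positions 0 (mis, exact) and 3 (sis, one mismatch).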

Application

Platform and Language

1. Language: C#

2. Platform: Microsoft .NET Framework v4.0

Algorithms

1. Construction of suffix array with LCP
   - Radix sort
   - Skew algorithm

2. Construction of gapped suffix array with gLCP
   - Radix sort

3. Approximate string search
   - Pattern analysis
   - Binary search with LCP

Gapped Suffix Array

Going beyond gSA

Main questions

1. Given an already constructed suffix array, can a gapped suffix array be constructed in O(n) time?

2. What are the limitations of the gapped suffix array? How can they be overcome?

Limitation of gSA

Assuming k = 1 and the gap ends at position m - 1:

P = coat, (2,3)-gSA(4,1)

Pattern variant   Searching array
#oat              SA
c#at              Cannot support
co#t              gSA(4,1)
coa#              SA

P = coast, (3,4)-gSA(5,1)

Pattern variant   Searching array
#oast             SA
c#ast             Cannot support
co#st             Cannot support
coa#t             gSA(5,1)
coas#             SA

Countermeasure

P = coat, (2,3)-gSA(4,1)

Pattern variant   Searching array
#oat              SA
c#at              gSA(3,1)
co#t              gSA(4,1)
coa#              SA

P = coast, (3,4)-gSA(5,1)

Pattern variant   Searching array
#oast             SA
c#ast             gSA(3,1)
co#st             gSA(4,1)
coa#t             gSA(5,1)
coas#             SA

Countermeasure

P = cot: c#t, #ot, co#
gSA(3, 1) requires: SA, gSA(3, 1)

P = coat: #oat, c#at, co#t, coa#
gSA(4, 1) requires: SA, gSA(3, 1), gSA(4, 1)

P = coast: #oast, c#ast, co#st, coa#t, coas#
gSA(5, 1) requires: SA, gSA(3, 1), gSA(4, 1), gSA(5, 1)

P = coasts: #oasts, c#asts, co#sts, coa#ts, coas#s, coast#
gSA(6, 1) requires: SA, gSA(3, 1), gSA(4, 1), gSA(5, 1), gSA(6, 1)

In general, gSA(m, 1) requires: SA, gSA(3, 1) … gSA(m-2, 1), gSA(m-1, 1), gSA(m, 1)

Theorem: If the length of the gap is 1, the required count of gSAs is |m - 2|, and both construction and search can be performed in linear time.
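The countermeasure tables and the count m - 2 can be reproduced with a small sketch. The mapping from gap position to array (gap at 0-based position j handled by gSA(j+2, 1), first and last positions handled by the SA) is read off the tables in this deck; the function names `single_gap_plan` and `required_gsas` are hypothetical.

```python
def single_gap_plan(pattern):
    # Pair each single-gap variant of the pattern with the array that serves it.
    m = len(pattern)
    plan = []
    for j in range(m):
        variant = pattern[:j] + "#" + pattern[j + 1:]
        if j == 0 or j == m - 1:
            array = "SA"               # gap at either end: plain SA search
        else:
            array = f"gSA({j + 2},1)"  # interior gap, as in the slide tables
        plan.append((variant, array))
    return plan


def required_gsas(m):
    # gSA(3,1) .. gSA(m,1): m - 2 arrays in addition to the SA, for m >= 3.
    return [f"gSA({j},1)" for j in range(3, m + 1)]


plan = single_gap_plan("coast")
needed = required_gsas(6)
```

For P = coast the plan lists `c#ast`, `co#st` and `coa#t` against gSA(3,1), gSA(4,1) and gSA(5,1), and `required_gsas(6)` has 4 entries, in line with the theorem.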

Total count of required gSAs

gSA(m, p)   Required gapped suffix arrays
gSA(3,1)    SA, gSA(3,1)
gSA(4,1)    SA, gSA(3,1), gSA(4,1)
gSA(4,2)    SA, gSA(3,1), gSA(4,2)
gSA(5,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1)
gSA(5,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2)
gSA(5,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,3)
gSA(6,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1)
gSA(6,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,2)
gSA(6,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,3)
gSA(6,4)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,4)
gSA(7,1)    SA, gSA(3,1), gSA(4,1), gSA(5,1), gSA(6,1), gSA(7,1)
gSA(7,2)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(6,1), gSA(6,2), gSA(7,2)
gSA(7,3)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(7,3)
gSA(7,4)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,4)
gSA(7,5)    SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), gSA(5,3), gSA(6,1), gSA(6,2), gSA(6,3), gSA(6,4), gSA(7,5)

gC = total count of required gSAs

$$gC = \sum_{i=1}^{p-1} \begin{cases} k - i & \text{if } k - i > 0 \\ 1 & \text{otherwise} \end{cases}$$

Multiple gaps, for various m

P = coat: ##at, #o#t, #oa#, c##t, c#a#, co##
gSA(4,2) requires: SA, gSA(3,1), gSA(4,2)

P = coast: ##ast, #o#st, #oa#t, #oas#, c##st, c#a#t, c#as#, co##t, co#s#, coa##
gSA(5,2) requires: SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,2), (1,2)(3,4)-gSA(5,2)

P = coasts: ##asts, #o#sts, #oa#ts, #oas#s, #oast#, c##sts, c#a#ts, c#as#s, c#ast#, co##ts, co#s#s, co#st#, coa##s, coa#t#, coas##
gSA(6,2) requires: SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(6,2), (1,2)(4,5)-gSA(6,2), (2,3)(4,5)-gSA(6,2)

P = coasts: ###sts, ##a#ts, ##as#s, ##ast#, #o##ts, #o#s#s, #o#st#, #oa##s, #oa#t#, #oas##, c###ts, c##s#s, c##st#, c#a##s, c#a#t#, c#as##, co###s, co##t#, co#s##, coa###
gSA(6,3) requires: SA, gSA(3,1), gSA(4,1), gSA(4,2), gSA(5,1), gSA(5,2), (1,2)(3,4)-gSA(5,2), gSA(5,3), gSA(6,3), (1,3)(4,5)-gSA(6,3), (1,2)(3,5)-gSA(6,3)
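The variant lists above are all ways of replacing exactly k positions of P with a gap symbol, so their number is the binomial coefficient C(m, k). A minimal sketch that enumerates them, with the function name `gapped_variants` chosen here for illustration:

```python
from itertools import combinations


def gapped_variants(pattern, k, gap="#"):
    # Every way of replacing exactly k positions of the pattern with the gap symbol.
    variants = []
    for positions in combinations(range(len(pattern)), k):
        chars = list(pattern)
        for p in positions:
            chars[p] = gap
        variants.append("".join(chars))
    return variants


two_gaps = gapped_variants("coat", 2)
three_gaps = gapped_variants("coasts", 3)
```

`two_gaps` lists the six variants of coat in the same order as the slide, and `three_gaps` has 20 entries for coasts with three gaps.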

Two approaches to support the multiple gaps

The first is to perform an index search only up to the first gap of the search pattern, and after that to compare every remaining character individually.

The second is to keep constructing additional multiple gapped suffix arrays, following the method above.

First approach

Pattern: c # a # t
r = gSA(3,1)[i], T[r]
Compared individually: T[r+2], T[r+3], T[r+4]

Pattern: c # a s # s
r = gSA(3,1)[i], T[r]
Compared individually: T[r+3], T[r+4], T[r+5]
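The first approach can be sketched as a binary search for the fragment before the first gap, followed by a character-by-character check of the rest of the pattern. As a simplification, this sketch looks up the first fragment in a naively built plain SA rather than in a gSA with gLCP; the function name `search_with_gaps` is hypothetical, and the gap symbol `#` is assumed not to occur in the text.

```python
from bisect import bisect_left, bisect_right


def search_with_gaps(text, pattern, gap="#"):
    n, m = len(text), len(pattern)
    fm = pattern.find(gap)          # length of the first fragment
    if fm == -1:
        fm = m                      # no gap at all: the whole pattern is indexed
    head = pattern[:fm]
    sa = sorted(range(n), key=lambda i: text[i:])
    keys = [text[i:i + fm] for i in sa]   # truncated suffixes, sorted like sa
    lo, hi = bisect_left(keys, head), bisect_right(keys, head)
    hits = []
    for r in sa[lo:hi]:
        if r + m > n:
            continue                # pattern would run past the end of the text
        # Compare the remaining characters one by one, skipping gap positions.
        if all(pattern[j] == gap or text[r + j] == pattern[j]
               for j in range(fm, m)):
            hits.append(r)
    return sorted(hits)


result = search_with_gaps("mississippi", "s#s#i")
```

For T = mississippi the pattern `s#s#i` matches only at position 3 (sissi). Every candidate that shares the first fragment is checked individually, which is the source of the linear factor in the worst case.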

Worst case for searching with it

Let fm be the length of the first fragment.

Binary search of the first fragment with gLCP: O(log n + fm)

Comparison of the rest of the pattern: O((m - fm) n)

Total: O((m - fm) n + log n + fm)

Summary

Further work

The gapped suffix array only supports searching for specific patterns.

Supporting approximate indexing in all situations will require more research and development into multiple gapped suffix arrays.

The future task is to study the multiple gapped suffix array and its efficiency.

Conclusion

The theory of Crochemore and Tischler (2010) that a gSA can be constructed in linear time has been put into practice and confirmed.

In addition, the potential of multiple gSAs was examined, with the conclusion that this area requires more research.

Q&A