Compressed suffix arrays and suffix trees with applications to text indexing and string matching

Page 1

Compressed suffix arrays and suffix trees with applications to text indexing and string matching

Page 2

Jeffrey Scott Vitter, Roberto Grossi

Page 3

Agenda
A (very) short review of suffix arrays
Introduction
Problem definition
Information theory reasoning
  – Simple solution, round 2
Compressed suffix arrays in ½·n loglog n + O(n) bits and O(loglog n) access time
Rank and Select problem definitions
  – Rank data structure
Compressed suffix arrays in ε^-1·n + O(n) bits and O(log^ε n) access time
Select data structure (if time permits)

Page 4

Short review of suffix arrays
A suffix array is a sorted array of the suffixes of a string S, represented by an array of pointers to the suffixes of S.
For example, the string "telaviv" and its corresponding suffix array:

S0  telaviv
S1  elaviv
S2  laviv
S3  aviv
S4  viv
S5  iv
S6  v

Suffix array: 3 1 5 2 0 6 4
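The definition above can be checked with a minimal Python sketch. It builds the suffix array naively by sorting the suffix start positions (0-based, as on the slide). The function name `suffix_array` is an illustrative choice, and the naive sort is only meant for small examples.

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in sorted order (0-based)."""
    # Sorting by the suffix text itself is simple but slow; fine for a short string.
    return sorted(range(len(s)), key=lambda i: s[i:])


SA = suffix_array("telaviv")
print(SA)
```

Each entry is a pointer to a suffix: entry 0 points at "aviv", the smallest suffix.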

Page 5

Introduction
Succinct data structures branch
DNA genome strings (small alphabet, large strings)
Mainly a theoretical article

Page 6

Problem definition
The algorithm is composed of two phases
  – compression
  – lookup
Compress:
  – given a suffix array Sa, compress it to get its succinct representation
lookup(i):
  – given the compressed representation, return Sa[i]

Page 7

Some definitions
We will deal (at first) with a binary alphabet
  – Σ = {a, b}
We will add a special end-of-string symbol #
and set the order between the characters to be
  – a < # < b (*)
Basic RAM model
  – log(n) word size
  – word lookup and arithmetic in constant time

Page 8

Information theory reasoning

String  Suffix array
aaaa#   1 2 3 4 5
aaab#   1 2 3 5 4
aaba#   1 4 2 5 3
aabb#   1 2 5 4 3
abaa#   3 4 1 5 2
abab#   1 3 5 2 4
abba#   4 1 5 3 2
abbb#   1 5 4 3 2
baaa#   2 3 4 5 1
baab#   2 3 5 1 4
baba#   4 2 5 3 1
babb#   2 5 1 4 3
bbaa#   3 4 5 2 1
bbab#   3 5 2 4 1
bbba#   4 5 3 2 1
bbbb#   5 4 3 2 1

Page 9

Information theory reasoning (2)
Suffix array size: n log(n) bits
One-to-one correspondence between the suffix array and the string
  – construction details
Number of possible suffix arrays: 2^(n-1)
  – perfect compression: n bits (the string itself)
  – the cost of a lookup is then Ω(n), see previous lecture
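The counting argument can be checked on the 4-symbol strings of the previous slide. The sketch below is an illustration, not part of the original slides: it enumerates every binary string of length 4 followed by #, builds its suffix array under the order a < # < b, and collects the distinct arrays. Positions are 1-based, and the helper names are hypothetical.

```python
from itertools import product

ORDER = {"a": 0, "#": 1, "b": 2}  # the slides' order a < # < b


def suffix_array(text):
    """1-based suffix array of text under the order a < # < b."""
    def key(p):
        return [ORDER[c] for c in text[p - 1:]]
    return tuple(sorted(range(1, len(text) + 1), key=key))


def distinct_suffix_arrays(m):
    """Suffix arrays of all binary strings of length m followed by '#'."""
    return {suffix_array("".join(w) + "#") for w in product("ab", repeat=m)}


ARRAYS = distinct_suffix_arrays(4)
print(len(ARRAYS))  # number of distinct suffix arrays for n = 5
```

With n = 5 symbols including #, the 2^(n-1) = 16 strings give 16 different suffix arrays, so the string itself (n bits) is already an optimal encoding of the array.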

Page 10

"Simple" solution, round 2: a different approach
Let's pack together each log n bits to create a new alphabet.
So the text length will be n/log n and the pattern length will be m/log n.
The suffix array will take O(n) bits
Searching becomes hard (alignment)
  – the text is aligned but the pattern isn't: log n cases

Page 11

"Simple" solution, round 2
The text isn't aligned: the pattern occurs k bits to the right of a word boundary
We need to append k bits to the pattern and check it
So we need to check 2^k cases
k ~ log n => n different cases to check
Assuming we know how much to pad!

Page 12

General framework
Abstract data type optimization [Jacobson '89]
# distinct data structures = C(n) => each data structure occupies O(log C(n)) bits.
This doesn't guarantee the time complexity of the supported operations

Page 13

Compressed suffix arrays in ½·n loglog n + O(n) bits and O(loglog n) access time
The method is recursive in nature
  – it takes advantage of the suffixes
Let Sa_0 be the uncompressed suffix array
and N_0 be its size (assume a power of 2)
In phase k of the compression we start with Sa_k of size $N_k = N_0 / 2^k$ and create Sa_{k+1} of size $N_{k+1} = N_0 / 2^{k+1}$
Sa_{k+1} holds a permutation of {1..N_{k+1}}

Page 14

Sa_{k+1} construction
Create the B_k bit vector
  – B_k[i] = 1 iff Sa_k[i] is even
Create the rank vector
  – rank_k(j) counts the number of 1 bits in the first j bits of B_k
Create the Ψ_k vector
  – it stores the 0-to-1 companion relation:

$$\Psi_k(i) = \begin{cases} j & \text{if } Sa_k[i] \text{ is odd and } Sa_k[j] = Sa_k[i] + 1 \\ i & \text{otherwise} \end{cases}$$

Store the even values from Sa_k, each divided by 2, in Sa_{k+1}
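One phase of the construction can be sketched as follows. This is a minimal illustration under stated assumptions: `compress_phase` is a hypothetical name, the input is a permutation of 1..N_k with N_k even, and plain Python lists stand in for the succinct structures introduced later. The example input is the Sa_0 of the 32-character example text used in the following slides.

```python
def compress_phase(sa):
    """From Sa_k build B_k, rank_k, Psi_k and Sa_{k+1}.

    The returned lists are ordinary 0-based Python lists that hold the
    slides' 1-based values. Assumes len(sa) is even.
    """
    pos = {v: i + 1 for i, v in enumerate(sa)}        # value -> 1-based index
    bits = [1 if v % 2 == 0 else 0 for v in sa]       # B_k[i] = 1 iff Sa_k[i] is even
    rank, ones = [], 0
    for b in bits:                                    # rank_k(j) = ones in B_k[1..j]
        ones += b
        rank.append(ones)
    # Psi_k(i) = i for even entries, otherwise the index holding Sa_k[i] + 1
    psi = [i + 1 if v % 2 == 0 else pos[v + 1] for i, v in enumerate(sa)]
    nxt = [v // 2 for v in sa if v % 2 == 0]          # Sa_{k+1}
    return bits, rank, psi, nxt


SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
B0, RANK0, PSI0, SA1 = compress_phase(SA0)
print(SA1)
```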

Page 15

An example
The 32-character string T:
abbabbabbabbabaaabababbabbbabba#

Page 16

An example

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Text     a  b  b  a  b  b  a  b  b  a  b  b  a  b  a  a
Sa_0    15 16 31 13 17 19 28 10  7  4  1 21 24 32 14 30
B_0      0  1  0  0  0  0  1  1  0  1  0  0  1  1  1  1
Rank_0   0  1  1  1  1  1  2  3  3  4  4  4  5  6  7  8
Ψ_0      2  2 14 15 18 23  7  8 28 10 30 31 13 14 15 16

Page 17

Example (continued)

i       16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
Text     a  a  b  a  b  a  b  b  a  b  b  b  a  b  b  a  #
Sa_0    30 12 18 27  9  6  3 20 23 29 11 26  8  5  2 22 25
B_0      1  1  1  0  0  1  0  1  0  0  0  1  1  0  1  1  0
Rank_0   8  9 10 10 10 11 11 12 12 12 12 13 14 14 15 16 16
Ψ_0     16 17 18  7  8 21 10 23 13 16 17 27 28 21 30 31 27

Page 18

How to compute Sa_k from Sa_{k+1}
Lemma 1
  – Given the suffix array Sa_k, let B_k, rank_k, Ψ_k and Sa_{k+1} be the result of the transformation performed by phase k. We can reconstruct Sa_k from Sa_{k+1} by the following formula:

$$Sa_k[i] = 2 \cdot Sa_{k+1}[\mathrm{rank}_k(\Psi_k(i))] + (B_k[i] - 1)$$

  – Let's split into 2 cases:
      Sa_k[i] is even (B_k[i] = 1)
      Sa_k[i] is odd (B_k[i] = 0)
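The two cases of Lemma 1 can be written out as a short derivation. This is a sketch of the argument using only the definitions of B_k, rank_k, Ψ_k and Sa_{k+1} from the construction slide; it relies on the fact that the even entries of Sa_k keep their relative order in Sa_{k+1}.

```latex
% Even case: B_k[i] = 1 and \Psi_k(i) = i.
% rank_k(i) counts the even entries among the first i, so the entry
% Sa_k[i] was stored (halved) at index rank_k(i) of Sa_{k+1}.
\begin{align*}
2\,Sa_{k+1}[\mathrm{rank}_k(i)] + (1 - 1)
  &= 2 \cdot \frac{Sa_k[i]}{2} = Sa_k[i]
\end{align*}
% Odd case: B_k[i] = 0 and \Psi_k(i) = j with Sa_k[j] = Sa_k[i] + 1, which is even.
\begin{align*}
2\,Sa_{k+1}[\mathrm{rank}_k(j)] + (0 - 1)
  &= 2 \cdot \frac{Sa_k[i] + 1}{2} - 1 = Sa_k[i]
\end{align*}
```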

Page 19

Example (continued)

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Sa_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
Rank_1   1  2  2  3  4  5  5  5  6  6  6  7  7  8  8  8
Ψ_1      1  2  9  4  5  6  1  6  9 12 14 12  2 14  4  5

i        1  2  3  4  5  6  7  8
Sa_2     4  7  1  6  8  3  5  2
B_2      1  0  0  1  1  0  0  1
Rank_2   1  1  1  2  3  3  3  4
Ψ_2      1  5  8  4  5  1  4  8

i        1  2  3  4
Sa_3     2  3  4  1

Page 20

Compress
  – We keep l = O(loglog n) levels
  – All levels except Sa_l are saved implicitly
  – For each of the levels 0..l-1 we save B_j, rank_j, Ψ_j
  – rank_j and Ψ_j are stored implicitly
  – The size of Sa_l, for l = loglog n, is

$$N_l \log N_l = \frac{n}{2^{l}} \log \frac{n}{2^{l}} = \frac{n}{\log n} \log \frac{n}{\log n} \le \frac{n}{\log n} \cdot \log n = O(n) \text{ bits}$$

Page 21

lookup
Just compute Sa_k[i] recursively from Sa_{k+1}, using Lemma 1
Recursion depth: loglog n
All the data structures used have O(1) access time
O(loglog n) lookup cost
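The recursive lookup can be sketched in Python as below. It is an illustration only: plain lists replace the succinct B_k, rank_k and Ψ_k structures, so it shows the recursion of Lemma 1 rather than the space bound. The names `build_levels` and `lookup` are hypothetical, and the example uses the Sa_0 of the 32-character text with l = 3 levels.

```python
def build_levels(sa0, levels):
    """Keep (B_k, rank_k, Psi_k) for k < levels and the explicit last array."""
    implicit, sa = [], list(sa0)
    for _ in range(levels):
        pos = {v: i + 1 for i, v in enumerate(sa)}
        bits = [1 - v % 2 for v in sa]                        # 1 iff the entry is even
        rank = [sum(bits[:j + 1]) for j in range(len(bits))]  # naive prefix counts
        psi = [i + 1 if b else pos[v + 1]
               for i, (v, b) in enumerate(zip(sa, bits))]
        implicit.append((bits, rank, psi))
        sa = [v // 2 for v in sa if v % 2 == 0]               # next level
    return implicit, sa


def lookup(implicit, sa_last, i, k=0):
    """Return Sa_k[i] for 1-based i by applying Lemma 1 recursively."""
    if k == len(implicit):
        return sa_last[i - 1]                                 # explicit last level
    bits, rank, psi = implicit[k]
    j = rank[psi[i - 1] - 1]                                  # rank_k(Psi_k(i))
    return 2 * lookup(implicit, sa_last, j, k + 1) + (bits[i - 1] - 1)


SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
IMPLICIT, SA3 = build_levels(SA0, 3)
print([lookup(IMPLICIT, SA3, i) for i in range(1, 33)] == SA0)
```

Only Sa_3 is kept explicitly; every entry of Sa_0 is recovered through three applications of the formula.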

Page 22

How the data is stored
The B_k bit vector is stored explicitly
  – O(N_k) space
  – O(1) lookup
  – O(N_k) preprocessing time
The rank_k vector is stored implicitly using Jacobson's rank data structure
  – O(N_k (loglog N_k) / log N_k) space
  – O(1) lookup
  – O(N_k) preprocessing time
The Ψ_k vector is stored implicitly (using rank and select)
  – O(1) access time
  – using $n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right) + O\left(\frac{n}{2^k \log\log n}\right)$ bits
  – $O\left(N_k + 2^{2^k}\right)$ preprocessing time

Page 23

Ψ_k vector representation
For each $1 \le i \le N_k / 2$:
  – let j be the index of the i-th 1 in B_k
  – consider the $2^k$ symbols of T in positions $2^k \cdot (Sa_k[j] - 1), \dots, 2^k \cdot Sa_k[j] - 1$
  – this symbol pattern precedes the $(2^k \cdot Sa_k[j])$-th suffix in T
For each $2^k$-symbol pattern we keep an ordered list of the indices $j \in [1, N_k]$ corresponding to it

Page 24

Let's take a look

Level 0
list a: {2,14,15,18,23,28,30,31} (8 entries)
list b: {7,8,10,13,16,17,21,27} (8 entries)

Level 1
list aa: {} (0 entries)
list ab: {9} (1 entry)
list ba: {1,6,12,14} (4 entries)
list bb: {2,4,5} (3 entries)

Level 2
list aaaa: {}   list aaab: {}   list aaba: {}      list aabb: {}
list abaa: {}   list abab: {}   list abba: {5,8}   list abbb: {}
list baaa: {}   list baab: {}   list baba: {1}     list babb: {4}
list bbaa: {}   list bbab: {}   list bbba: {}      list bbbb: {}
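The lists can be recomputed from the text and the arrays Sa_k. The following sketch assumes the slides' 1-based positions and uses hypothetical helper names. It groups each even entry j of Sa_k by the 2^k symbols that precede suffix 2^k·Sa_k[j] in T, then concatenates the lists in lexicographic order of their patterns.

```python
T = "abbabbabbabbabaaabababbabbbabba#"
SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
SA2 = [4, 7, 1, 6, 8, 3, 5, 2]


def psi_lists(text, sa_k, k):
    """Map each 2^k-symbol pattern to the ordered indices j of its even entries."""
    width = 2 ** k
    lists = {}
    for j, v in enumerate(sa_k, start=1):
        if v % 2 == 0:                                   # B_k[j] = 1
            start = width * (v - 1)                      # 1-based first position
            pattern = text[start - 1:start - 1 + width]  # positions start..start+width-1
            lists.setdefault(pattern, []).append(j)
    return lists


def concatenated(lists):
    """L_k: the lists joined in lexicographic order of their patterns.

    The patterns contain only 'a' and 'b', so plain string order suffices.
    """
    return [j for pattern in sorted(lists) for j in lists[pattern]]


LISTS1 = psi_lists(T, SA1, 1)
LISTS2 = psi_lists(T, SA2, 2)
print(LISTS1, concatenated(LISTS1))
```

Empty lists are simply absent from the dictionary; they contribute nothing to the concatenation.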

Page 27

So what can we do with all the lists?
Concatenate them in lexicographic order of their patterns to form the list L_k
L_1 = {9,1,6,12,14,2,4,5}
Let's see how we can compute Ψ_k(i)
  – If B_k[i] = 1 (Sa_k[i] is even), it is simply i
  – Otherwise,
  – because all the prefix patterns are saved in sorted order,
  – the L_k list holds, up to point i, one entry for each odd suffix up to and including i: h = i - rank_k(i)
  – So we can look up the h-th entry in L_k, and it gives us the answer

Page 28

Simple example
L_2 = {5,8,1,4}
Rank_2 = {1,1,1,2,3,3,3,4}
B_2 = {1,0,0,1,1,0,0,1}
Ψ_2 = {1,5,8,4,5,1,4,8}
Ψ_2(3) = ?
rank_2(3) = 1, h = 3 - 1 = 2, L_2[2] = 8
Ψ_2(3) = 8
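The rule for reading Ψ_k out of L_k can be sketched as follows. It is a minimal illustration: `psi_from_list` is a hypothetical name, and the prefix sum stands in for the constant-time rank structure. The data is level 2 of the running example, with L_2 = 5, 8, 1, 4 taken from the level-2 lists (abba, baba, babb).

```python
def psi_from_list(bits, l_k, i):
    """Psi_k(i) for 1-based i, from B_k and the concatenated list L_k."""
    if bits[i - 1] == 1:          # Sa_k[i] is even: Psi_k(i) = i
        return i
    rank_i = sum(bits[:i])        # rank_k(i); a real structure answers this in O(1)
    h = i - rank_i                # number of odd entries among the first i
    return l_k[h - 1]             # the h-th entry of L_k (1-based)


B2 = [1, 0, 0, 1, 1, 0, 0, 1]
L2 = [5, 8, 1, 4]
PSI2 = [psi_from_list(B2, L2, i) for i in range(1, 9)]
print(PSI2)
```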

Page 29

Ψ_k vector representation
Lemma 2
Given s integers in sorted order, each containing w bits, where s < 2^w, we can store them with at most
s(2 + w - floor(log s)) + O(s / loglog s) bits
so that retrieving the h-th integer takes constant time

Page 30

Ψ_k vector representation
Take the first z = floor(log s) bits of each integer, creating the integers q_1..q_s
It's easy to see that q_1 ≤ q_i ≤ q_{i+1} < s (we take the most significant bits, after all)
The remaining w - z bits of each integer will be r_i

[Figure: an integer S_i split into its high bits q_i and its low bits r_i]

Page 31

Ψ_k vector representation
Store the r_i in a simple array: (w - z)·s bits
Store q_1..q_s in a table Q supporting select and rank in constant time
The table Q is implemented in the following way
Instead of saving the numbers themselves,
we store q_1, q_2 - q_1, q_3 - q_2, …, q_s - q_{s-1}
in unary representation (0^i 1)
and add a select data structure

Page 32

Ψ_k vector representation
To get q_i we simply do select(i)
and count the number of zeros before the i-th 1:
q_i = select(i) - rank(select(i))

Page 33

Ψ_k vector representation
The size of the Q table is:
the unary string has at most s + 2^z ≤ 2s bits, plus the select overhead of O(s / loglog s) bits
So we can output S_i easily:
S_i = q_i · 2^(w-z) + r_i
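The encoding of Lemma 2 can be tried on a few numbers. The sketch below is an assumption-laden illustration: a linear scan stands in for the constant-time select structure, lists of bits stand in for packed words, and the names `encode`, `select` and `access` are hypothetical. Since the i-th 1 has exactly i ones up to and including it, `select(i) - i` equals the slide's select(i) - rank(select(i)).

```python
def encode(values, w):
    """Split sorted w-bit integers into unary-coded high parts and raw low parts."""
    s = len(values)
    z = s.bit_length() - 1                   # z = floor(log2 s) high bits go into q_i
    low_bits = w - z
    r = [v & ((1 << low_bits) - 1) for v in values]
    q = [v >> low_bits for v in values]
    unary, prev = [], 0
    for qi in q:                             # gap q_i - q_{i-1} written as 0^gap 1
        unary.extend([0] * (qi - prev) + [1])
        prev = qi
    return unary, r, low_bits


def select(unary, i):
    """1-based position of the i-th 1 (linear scan in this sketch)."""
    seen = 0
    for p, bit in enumerate(unary, start=1):
        seen += bit
        if bit and seen == i:
            return p
    raise IndexError(i)


def access(unary, r, low_bits, i):
    """Recover the i-th integer: q_i is the number of zeros before the i-th 1."""
    q_i = select(unary, i) - i
    return (q_i << low_bits) + r[i - 1]


VALUES = [3, 9, 10, 22, 27, 30]              # sorted, w = 5 bits, s = 6 < 2^5
UNARY, R, LOW = encode(VALUES, 5)
print(UNARY, [access(UNARY, R, LOW, i) for i in range(1, 7)])
```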

Page 34

Ψ_k vector representation
Lemma 3
We can store the concatenated list L_k used for Ψ_k in n(1/2 + 3/2^(k+1)) + O(n / (2^k loglog n)) bits, so that accessing the h-th element takes constant time, with preprocessing time O(n/2^k + 2^(2^k))
There are 2^(2^k) lists; number them (even the empty ones)

Page 35

Ψ_k vector representation
Lemma 3 (continued)
Each integer x in the lists, 1 ≤ x ≤ N_k, is transformed into a new integer x' by prepending the binary representation of its list number
The bit size of x' is 2^k + log N_k

Page 36

Ψ_k vector representation
Lemma 3 (continued)
After concatenating all the lists, we have N_k/2 sorted numbers of 2^k + log N_k bits each
Using Lemma 2 we get
  – O(1) access time
  – a space bound of n(1/2 + 3/2^(k+1)) + O(n / (2^k loglog n)) bits
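The space bound follows by plugging the list parameters into Lemma 2. The derivation below is a sketch that assumes n is a power of 2, so that the floor in Lemma 2 is exact; s and w are the symbols of Lemma 2.

```latex
% s = N_k / 2 integers of w = 2^k + \log N_k bits each, with N_k = n / 2^k.
\begin{align*}
s &= \frac{n}{2^{k+1}}, \qquad w = 2^k + \log n - k, \qquad \log s = \log n - k - 1 \\
s\,(2 + w - \log s) &= \frac{n}{2^{k+1}}\,\bigl(2^k + 3\bigr)
  = n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right) \\
O\!\left(\frac{s}{\log\log s}\right) &= O\!\left(\frac{n}{2^k \log\log n}\right)
\end{align*}
```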

Page 37

Sum it up (space complexity)
The number of levels is set to l = loglog n
Sa_l size is O(n) bits
B_k size is O(N_k) bits, with O(N_k) preprocessing time
Rank_k size is O(N_k (loglog N_k) / log N_k) bits, with O(N_k) preprocessing time
Ψ_k size is $n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right) + O\left(\frac{n}{2^k \log\log n}\right)$ bits, with $O(N_k + 2^{2^k})$ preprocessing time
The total space is

$$\frac{n \log n}{2^{l}} + \sum_{k=0}^{l-1} \left[ n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right) + N_k + O\left(\frac{n}{2^k \log\log n}\right) \right] \le \frac{1}{2}\, n \log\log n + 6n + O\left(\frac{n}{\log\log n}\right)$$

The preprocessing time is $\sum_{k=0}^{l-1} O\left(N_k + 2^{2^k}\right) = O(n)$
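The constant 6 in the total can be traced term by term. This is a sketch of the accounting under the assumption l = loglog n exactly (so 2^l = log n); the per-structure bounds are the ones listed on the slide.

```latex
\begin{align*}
\text{Sa}_l:&\quad \frac{n}{\log n}\,\log\frac{n}{\log n} \le n \\
B_k:&\quad \sum_{k=0}^{l-1} \frac{n}{2^k} < 2n \\
\Psi_k:&\quad \sum_{k=0}^{l-1} n\left(\frac{1}{2} + \frac{3}{2^{k+1}}\right)
   < \frac{l}{2}\,n + 3n = \frac{1}{2}\,n\log\log n + 3n \\
\text{total}:&\quad \frac{1}{2}\,n\log\log n + (1 + 2 + 3)\,n
   + O\!\left(\frac{n}{\log\log n}\right)
\end{align*}
% The rank structures and the select overheads are lower-order terms.
% Preprocessing: 2^{2^{l-1}} = 2^{(\log n)/2} = \sqrt{n}, so the sum is O(n).
```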

Page 38

Rank data structure
Due to Jacobson
Given a bit vector of length n, Rank[i] is the number of 1 bits up to position i
Multilevel approach
We slice the bit string into chunks of log² n bits
Between each two chunks we keep a rank counter
Each chunk is divided into sub-chunks of ½·log n bits,
and a counter is kept between each two sub-chunks
At the bottom level a simple lookup table is used

Page 39

Rank

[Figure: a rank query walking the log² n-bit chunks, the ½·log n-bit sub-chunks and the lookup table; in the example the output is 14 + 3 + 1]

Page 40

Rank analysis
First level: $n / \log^2 n$ chunks, each having a $\log n$-bit counter, for a total of $O(n / \log n)$ bits
We have $2n / \log n$ sub-chunks, each having an $O(\log\log n)$-bit counter, for a total of $O(n \log\log n / \log n)$ bits
The lookup table takes $2^{\frac{1}{2}\log n} \cdot \log n \cdot \log\log n = O(\sqrt{n}\, \log n \log\log n) = o(n)$ bits
Total space: $O(n \log\log n / \log n) = o(n)$ bits
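The two counter levels can be sketched in Python. This illustration makes several simplifying assumptions: the class name `RankDirectory` is hypothetical, chunk and sub-chunk lengths are rounded so that a chunk is a whole number of sub-chunks, and the final step scans the sub-chunk where the slides use a lookup table.

```python
from math import log2


class RankDirectory:
    """Two levels of counters over a list of bits."""

    def __init__(self, bits):
        n = max(len(bits), 4)
        self.bits = bits
        self.sub = max(1, int(log2(n)) // 2)               # about 1/2 log n bits
        self.chunk = self.sub * max(1, 2 * int(log2(n)))   # about log^2 n bits
        self.chunk_count = []   # ones before each chunk
        self.sub_count = []     # ones between the chunk start and each sub-chunk
        total = in_chunk = 0
        for p in range(0, len(bits), self.sub):
            if p % self.chunk == 0:
                self.chunk_count.append(total)
                in_chunk = 0
            self.sub_count.append(in_chunk)
            ones = sum(bits[p:p + self.sub])
            total += ones
            in_chunk += ones

    def rank(self, i):
        """Number of 1 bits among the first i bits (i is 1-based, inclusive)."""
        if i == 0:
            return 0
        s = (i - 1) // self.sub                 # sub-chunk holding bit i
        c = (i - 1) // self.chunk               # chunk holding bit i
        tail = sum(self.bits[s * self.sub:i])   # stands in for the table lookup
        return self.chunk_count[c] + self.sub_count[s] + tail


BITS = [1 if i % 3 == 0 or i % 7 == 0 else 0 for i in range(100)]
RD = RankDirectory(BITS)
print(RD.rank(50))
```

A query adds one chunk counter, one sub-chunk counter and the count inside a single sub-chunk, which mirrors the three-term sum in the figure.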

Page 41

Compressed suffix arrays in ε^-1·n + O(n) bits and O(log^ε n) access time
To break the space barrier we need to save fewer levels => longer lookups
Let's save only 3 levels: Sa_0, Sa_l' and Sa_l
l = ceil(loglog n), l' = ceil(½·loglog n)
We use a dictionary data structure, which can tell whether an element is a member of the dictionary and supports a rank query, in O(1) time for both queries
The space complexity of the dictionary is: [Equation: dictionary space bound in bits; not recoverable from the extracted text]
We keep in 2 dictionaries, D_0 and D_l', which items pass to the next level (Sa_0 -> Sa_l' and Sa_l' -> Sa_l)

Page 42

The Ψ'_k function
We define the Ψ'_k function, which maps each 1 to its companion 0:

$$\Psi'_k(i) = \begin{cases} j & \text{if } Sa_k[i] \text{ is even, } Sa_k[i] \ne N_k \text{ and } Sa_k[j] = Sa_k[i] + 1 \\ i & \text{otherwise} \end{cases}$$

Let's define the φ_k function to be

$$\varphi_k(i) = \begin{cases} j & \text{if } Sa_k[i] \ne N_k \text{ and } Sa_k[j] = Sa_k[i] + 1 \\ i & \text{otherwise} \end{cases}$$

We just need to merge the indexes in L_k and L'_k

Page 43

Example

Level 1, odd lists
list aa: {} (0 entries)
list ab: {9} (1 entry)
list ba: {1,6,12,14} (4 entries)
list bb: {2,4,5} (3 entries)

Level 1, even lists
list aa: {10} (1 entry)
list ab: {8,11,13} (3 entries)
list ba: {7,16} (2 entries)
list bb: {3} (1 entry)

The merging gives
{10, 8,9,11,13, 6*, 1,6,7,12,14,16, 2,3,4,5}
(6* is the entry with Sa_1[6] = N_1 = 16, which maps to itself)

i        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Sa_1     8 14  5  2 12 16  7 15  6  9  3 10 13  4  1 11
B_1      1  1  0  1  1  1  0  0  1  0  0  1  0  1  0  0
Ψ_1      1  2  9  4  5  6  1  6  9 12 14 12  2 14  4  5
Ψ'_1    10  8  3 11 13  6  7  8  7 10 11 16 13  3 15 16
φ_1     10  8  9 11 13  6  1  6  7 12 14 16  2  3  4  5
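The φ_k row can be recomputed directly from the definition. The sketch below is a minimal illustration with a hypothetical function name; it also shows that φ_k agrees with Ψ_k on the odd entries, which is why merging the odd and even lists is enough.

```python
def phi(sa):
    """phi_k for a permutation Sa_k of 1..N_k, as a list of 1-based indices."""
    n = len(sa)
    pos = {v: i + 1 for i, v in enumerate(sa)}       # value -> 1-based index
    # The entry holding N_k maps to itself; every other entry maps to the
    # index that holds the next text position.
    return [i + 1 if v == n else pos[v + 1] for i, v in enumerate(sa)]


SA1 = [8, 14, 5, 2, 12, 16, 7, 15, 6, 9, 3, 10, 13, 4, 1, 11]
PSI1 = [1, 2, 9, 4, 5, 6, 1, 6, 9, 12, 14, 12, 2, 14, 4, 5]
PHI1 = phi(SA1)
print(PHI1)
```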

Page 44

The φ_k function implementation
Lemma 4: We can store the concatenated list used for φ_k
  – k = 0: in n + O(n / loglog n) bits
  – k > 0: in n(1 + 1/2^(k-1)) + O(n / (2^k loglog n)) bits, with preprocessing time O(n/2^k + 2^(2^k))
  – If k > 0, simply use Lemma 3
  – k = 0:
      Encode a and # as 0, and b as 1
      Create an n-bit vector, named L
      L[f] = 0 iff the list for φ_0 is a or # at position f
      We add a select and a select_0 data structure on top of it: O(n / loglog n) bits
      We also keep the number of 0s in L as c_0
      The query φ_0(j) is done in the following way:
        if j = c_0, return select_0(c_0)
        if j < c_0, return select_0(j)
        if j > c_0, return select(j - c_0)

Page 45

The lookup algorithm
To find Sa[i], we start walking the φ_0 function: i, i', i'', i''', … where Sa_0[i] + 1 = Sa_0[i'], …
until reaching an entry found in the dictionary D_0
  – let s be the walk length
  – and r the entry's rank in the dictionary (how many items have already passed to the next level?)
Using r we start walking the next level
  – let s' be the walk length
  – and r' the entry's rank in the dictionary
We return the following result:

$$Sa[i] = 2^{l'} \left( 2^{\,l - l'} \cdot Sa_l[r'] - s' \right) - s$$

The walk length is max(s, s') < 2^l' = O(sqrt(log n))
So the query time is O(sqrt(log n))
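The walk can be tried on the running example with levels 0, l' = 2 and l = 3. The sketch below is an illustration under stated assumptions: Python dicts stand in for the constant-time dictionaries D_0 and D_l', plain lists stand in for the compressed φ lists, and all function names are hypothetical.

```python
def phi(sa):
    """phi_k as a list of 1-based indices (the entry holding N_k maps to itself)."""
    n = len(sa)
    pos = {v: i + 1 for i, v in enumerate(sa)}
    return [i + 1 if v == n else pos[v + 1] for i, v in enumerate(sa)]


def sample(sa, step):
    """Keep the entries divisible by step (divided by step), in order.

    Also return the dictionary mapping each kept 1-based index to its rank.
    """
    kept, ranks = [], {}
    for i, v in enumerate(sa, start=1):
        if v % step == 0:
            kept.append(v // step)
            ranks[i] = len(kept)
    return kept, ranks


def walk(phi_k, ranks, i):
    """Follow phi_k from i until a dictionary member; return (rank, walk length)."""
    steps = 0
    while i not in ranks:
        i = phi_k[i - 1]
        steps += 1
    return ranks[i], steps


def lookup(i, phi0, d0, phi_mid, d_mid, sa_last, l_mid, l_last):
    r, s = walk(phi0, d0, i)            # level 0 -> level l'
    r2, s2 = walk(phi_mid, d_mid, r)    # level l' -> level l
    return 2 ** l_mid * (2 ** (l_last - l_mid) * sa_last[r2 - 1] - s2) - s


SA0 = [15, 16, 31, 13, 17, 19, 28, 10, 7, 4, 1, 21, 24, 32, 14, 30,
       12, 18, 27, 9, 6, 3, 20, 23, 29, 11, 26, 8, 5, 2, 22, 25]
SA2, D0 = sample(SA0, 4)                # l' = 2: entries divisible by 2^2
SA3, D2 = sample(SA2, 2)                # l = 3: one more halving
RESULT = [lookup(i, phi(SA0), D0, phi(SA2), D2, SA3, 2, 3) for i in range(1, 33)]
print(RESULT == SA0)
```

Every step of a walk increases the text position by one, so each walk stops after fewer than 2^l' steps at a position that survives to the next kept level.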

Page 46

The general multilevel build
For every 0 < ε < 1,
assume εl is an integer, so 2^(εl) < 2·log^ε n
Create all the levels 0, εl, 2εl, …, l
The number of levels is ε^-1 + 1 => lookup of O(log^ε n)

Page 47

The general multilevel build
Summing the φ_k lists of Lemma 4 over the kept levels k = 0, εl, 2εl, …, together with the explicit Sa_l, the space is

$$(1 + \varepsilon^{-1})\, n + O\left(\frac{n}{\log^{\varepsilon} n}\right) + O\left(\frac{n}{\log\log n}\right) = (1 + \varepsilon^{-1})\, n + O\left(\frac{n}{\log\log n}\right) \text{ bits}$$

The dictionaries' space is a lower-order term: $O\left(\frac{n}{\log\log n}\right)$ bits

Page 48

Select data structure
select(i) returns the position of the i-th 1 bit in the string
Same idea as rank, a bit more complicated
Multilevel approach
At the first level we record the position of every (log n · loglog n)-th 1 bit
  – total space O(n / loglog n)
Between each two such bits we keep the following data
If the distance between them is r > (log n · loglog n)²
  – we keep the absolute positions of all the 1 bits between them: log² n · loglog n bits, which is at most r / loglog n
  – otherwise we keep the relative position of every (log r · loglog n)-th 1 bit
Total space per range is at most r / loglog n bits, and the ranges sum to at most n
Then we keep one more level (the same notions)
  – the block size comes down to (loglog n)^4

Page 49

Select data structure
After that, we keep a lookup table
For every bit pattern of length log n / d (d ≥ 2) we save
  – the number of 1 bits
  – the location of the i-th 1 bit in the pattern
As before, the space is O(n^(1/d) · log n · loglog n)
The lookup is then very simple: just walk the levels,
get a block, and query it using the lookup table
Space complexity: O(n / loglog n)
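The sampling idea of the first level can be sketched in Python. This is a deliberately simplified illustration, not the multilevel structure from the slides: it records every t-th 1 bit and then scans forward, where the real structure uses further levels and a lookup table to finish in constant time. The function names and the parameter `t` are hypothetical.

```python
def build_select(bits, t):
    """Record the 1-based position of every t-th 1 bit."""
    samples, seen = [], 0
    for p, bit in enumerate(bits, start=1):
        if bit:
            seen += 1
            if seen % t == 0:
                samples.append(p)
    return samples


def select(bits, samples, t, i):
    """1-based position of the i-th 1 bit, for 1 <= i <= number of ones."""
    block = i // t                           # full groups of t ones before or at i
    p = samples[block - 1] if block else 0   # position of the (block*t)-th 1, or 0
    seen = block * t
    if seen == i:
        return p
    for q in range(p, len(bits)):            # 0-based index q is position q + 1
        seen += bits[q]
        if seen == i:
            return q + 1
    raise IndexError(i)


BITS = [1 if i % 3 == 0 or i % 7 == 0 else 0 for i in range(60)]
SAMPLES = build_select(BITS, 4)
ONES = [p for p, bit in enumerate(BITS, start=1) if bit]
print(select(BITS, SAMPLES, 4, 5))
```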