
Page 1: Boosting Textual Compression in Optimal Linear Time

Boosting Textual Compression in Optimal Linear Time

Article by Ferragina, Giancarlo, Manzini and Sciortino.

Presentation by: Maor Itzkovitch.

Page 2: Boosting Textual Compression in Optimal Linear Time

Disclaimer

The author of this presentation, henceforth referred to as “The Author”, should not be held accountable for any mental illness, confusion, disorientation, or general lack of will to live caused, directly or indirectly, by prolonged exposure to this material.

Page 3: Boosting Textual Compression in Optimal Linear Time

Introduction

A boosting technique, in very informal terms, can be seen as a method that, when applied to a particular class of algorithms, yields improved algorithms in terms of one or more parameters characterizing their performance in the class.

General boosting techniques have a deep significance for Computer Science. Using such techniques, one can, informally, take a good algorithm and, by applying the boosting technique to it, get a very high-quality algorithm, again in terms of the parameters characterizing the nature of the problem.

Page 4: Boosting Textual Compression in Optimal Linear Time

Introduction (cont)

Over the past weeks, I am sure we have all been convinced of the importance of textual compression to our field of study. If so, we would like to come up with a boosting technique that improves existing textual compression algorithms while sustaining the smallest possible loss in the algorithm's asymptotic time and space complexity. In general, such efficient boosting techniques are very hard to come by. In this class I will present one such boosting technique for improving textual compression algorithms.

Page 5: Boosting Textual Compression in Optimal Linear Time

Presentation Outline

For a change, this presentation will begin with the results of the boosting technique; only then will I elaborate further.

As with all previous presentations, I will have to introduce many new definitions, and repeat a few that we have already seen. It’s not going to be easy, so bear with me.

Once the new definitions are all clear, we will see the pseudocode for the boosting technique. Assuming that the definitions are indeed clear, the technique itself is quite straightforward.

To conclude this presentation, I will show some remaining open problems.

Page 6: Boosting Textual Compression in Optimal Linear Time

Statement of Results

Let s be a string over a finite alphabet Σ. For any $k \geq 0$, let $H_k(s)$ denote the k-th order empirical entropy of s and let $H_k^*(s)$ denote the k-th order modified empirical entropy of s, both of which will be defined soon enough. Also, let us recall the Burrows-Wheeler Transform that, given a string s, computes a permutation of that string, hereby denoted BWT(s).

Let us consider a compression algorithm A that compresses any string z# in at most $\lambda |z| H_0(z) + \eta |z| + \mu$ bits, where λ, η and μ are constants independent of z, and # is a special symbol not appearing elsewhere in z. A general outline of the boosting technique is shown in the next slide.

Page 7: Boosting Textual Compression in Optimal Linear Time

Statement of Results (cont)

Here are the three major steps of the technique:

1. Compute $\hat{s} = BWT(s^R)$.
2. Using the suffix tree of $s^R$, greedily partition $\hat{s}$ so that a suitably defined objective function is minimized.
3. Compress each substring of the partition, separately, using algorithm A.

Page 8: Boosting Textual Compression in Optimal Linear Time

Statement of Results (cont)

We will show that, for any $k \geq 0$, the length in bits of the string resulting from the boosting is bounded by

$\lambda |s| H_k(s) + \eta |s| + \log_2 |s| + g_k$

If we rely on the stronger assumption that A compresses every string z# in at most $\lambda |z| H_0^*(z) + \mu$ bits, then the following improved bound can be achieved:

$\lambda |s| H_k^*(s) + \log_2 |s| + g_k$

Page 9: Boosting Textual Compression in Optimal Linear Time

Definitions

Let s be a string over the alphabet $\Sigma = \{a_1, \ldots, a_h\}$ and, for each $a_i \in \Sigma$, let $n_i$ be the number of occurrences of $a_i$ in s. We will assume the symbols are numbered so that $n_i \leq n_{i+1}$. The zeroth order empirical entropy of s is:

$H_0(s) = \sum_{i=1}^{h} \frac{n_i}{|s|} \log \frac{|s|}{n_i}$

For any string w, we denote by $w_s$ the string of single symbols following the occurrences of w in s, taken from left to right. For example, if s = mississippi and w = si then $w_s$ = sp. We define the k-th order empirical entropy as:

$H_k(s) = \frac{1}{|s|} \sum_{w \in \Sigma^k} |w_s| \, H_0(w_s)$
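Since the rest of the presentation builds on these two quantities, here is a small Python sketch of both definitions; the function names and the final example are mine, not the article's, and the implementation is the naive one (fine for toy strings).

import math
from itertools import product

def h0(x):
    """Zeroth order empirical entropy H_0(x), in bits per symbol."""
    if not x:
        return 0.0
    n = len(x)
    return sum((x.count(c) / n) * math.log2(n / x.count(c)) for c in set(x))

def following(w, s):
    """w_s: the symbols following each occurrence of w in s, left to right."""
    return "".join(s[i + len(w)] for i in range(len(s) - len(w))
                   if s[i:i + len(w)] == w)

def hk(s, k):
    """H_k(s) = (1/|s|) * sum over w in Sigma^k of |w_s| * H_0(w_s)."""
    if k == 0:
        return h0(s)
    sigma = sorted(set(s))
    return sum(len(following(w, s)) * h0(following(w, s))
               for w in ("".join(t) for t in product(sigma, repeat=k))) / len(s)

s = "mississippi"
print(following("si", s))      # 'sp', the slide's example
print(h0(s), hk(s, 1))         # H_0 and H_1 of mississippi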

Page 10: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Now we shall define the zeroth order modified empirical entropy:

$H_0^*(s) = \begin{cases} 0 & \text{if } |s| = 0 \\ (1 + \lfloor \log |s| \rfloor)/|s| & \text{if } |s| \neq 0 \text{ and } H_0(s) = 0 \\ H_0(s) & \text{otherwise} \end{cases}$

To define the k-th order modified empirical entropy, I will introduce the notion of a suffix cover: we say that a set $\mathcal{S}_k$ of strings over Σ of length at most k is a suffix cover of $\Sigma^k$, and write $\mathcal{S}_k \sqsubseteq \Sigma^k$, if every string in $\Sigma^k$ has a unique suffix in $\mathcal{S}_k$. For example, if $\Sigma = \{a, b\}$ and k = 3, then both $\{a, b\}$ and $\{a, ab, abb, bbb\}$ are suffix covers of $\Sigma^3$.
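A small Python sketch of $H_0^*$ and of the "unique suffix" test for a candidate suffix cover may help; the helper names are mine, and the checks mirror the slide's example.

import math
from itertools import product

def h0(x):
    if not x:
        return 0.0
    n = len(x)
    return sum((x.count(c) / n) * math.log2(n / x.count(c)) for c in set(x))

def h0_star(x):
    """Modified zeroth order empirical entropy H_0^*(x)."""
    if len(x) == 0:
        return 0.0
    if h0(x) == 0.0:                                  # |x| > 0, single distinct symbol
        return (1 + math.floor(math.log2(len(x)))) / len(x)
    return h0(x)

def is_suffix_cover(cover, sigma, k):
    """True iff every string in Sigma^k has exactly one suffix in `cover`."""
    return all(sum("".join(t).endswith(w) for w in cover) == 1
               for t in product(sigma, repeat=k))

print(is_suffix_cover({"a", "b"}, "ab", 3))                 # True
print(is_suffix_cover({"a", "ab", "abb", "bbb"}, "ab", 3))  # True
print(is_suffix_cover({"a", "bb"}, "ab", 3))                # False: 'aab' has no suffix in the set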

Page 11: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

We now define, for every suffix cover $\mathcal{S}_k \sqsubseteq \Sigma^k$:

$H^*_{\mathcal{S}_k}(s) = \frac{1}{|s|} \sum_{w \in \mathcal{S}_k} |w_s| \, H_0^*(w_s)$

Now we can finally define the k-th order modified empirical entropy of s:

$H_k^*(s) = H^*_{\mathcal{S}_k^*}(s) = \min_{\mathcal{S}_k \sqsubseteq \Sigma^k} H^*_{\mathcal{S}_k}(s)$

for some optimal suffix cover $\mathcal{S}_k^*$ attaining the minimum.
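To make the minimization concrete, the sketch below evaluates $H^*_{\mathcal{S}}(s)$ for two concrete suffix covers $\mathcal{S}$ of $\Sigma^2$; $H_2^*(s)$ is the smallest such value over all suffix covers of $\Sigma^2$. The helpers repeat the earlier snippets so the block runs on its own.

import math

def h0(x):
    if not x:
        return 0.0
    n = len(x)
    return sum((x.count(c) / n) * math.log2(n / x.count(c)) for c in set(x))

def h0_star(x):
    if len(x) == 0:
        return 0.0
    if h0(x) == 0.0:
        return (1 + math.floor(math.log2(len(x)))) / len(x)
    return h0(x)

def following(w, s):
    return "".join(s[i + len(w)] for i in range(len(s) - len(w))
                   if s[i:i + len(w)] == w)

def h_star_for_cover(s, cover):
    """H*_S(s) = (1/|s|) * sum over w in the cover S of |w_s| * H_0^*(w_s)."""
    return sum(len(following(w, s)) * h0_star(following(w, s)) for w in cover) / len(s)

s = "mississippi"
sigma = sorted(set(s))                              # ['i', 'm', 'p', 's']
cover1 = set(sigma)                                 # the single symbols: a suffix cover of Sigma^2
cover2 = {a + b for a in sigma for b in sigma}      # all of Sigma^2: also a suffix cover
print(h_star_for_cover(s, cover1), h_star_for_cover(s, cover2))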

Page 12: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Three more notations worth mentioning, briefly, are $w_s^{@}$, $H_k^{*@}(s)$ and $H_k^{@}(s)$, where $w_s^{@}$ is the string of single characters preceding the occurrences of w in s, taken from left to right, $H_k^{*@}(s) = H_k^*(s^R)$ and $H_k^{@}(s) = H_k(s^R)$.

I also wish to introduce the notion of a prefix cover, which is analogous to the notion of a suffix cover, just with prefixes instead of suffixes. That is, $\mathcal{P}_k$ is a prefix cover of $\Sigma^k$ if every string in $\Sigma^k$ has a unique prefix in $\mathcal{P}_k$.

Page 13: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Let us recall that BWT(s) constructs a matrix whose rows are the cyclic shifts of s$ sorted in lexicographical order, and returns the last column of that matrix, denoted $\hat{s} = BWT(s)$.

Let w be a substring of s. Then, by the matrix's construction, all of the rows prefixed by w are consecutive (because the matrix is sorted in lexicographical order). This means that the single symbols preceding every occurrence of w in s are grouped together in a set of consecutive positions of the string $\hat{s}$. We denote this substring by $\hat{s}[w]$. It is easy to see that $\hat{s}[w]$ is a permutation of $w_s^{@}$.

Page 14: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Example: the BWT matrix for the string t = mississippi.

Let w = s. The four occurrences of s in t correspond to the last four rows of the BWT matrix. Then $\hat{t}[s] = ssii$, and that is indeed a permutation of $w_t^{@} = isis$, the symbols preceding the occurrences of s in t.
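The example is easy to reproduce with a naive Python sketch of the BWT matrix (sorting all rotations is fine for toy strings; real implementations use suffix arrays). The helper names are mine.

def bwt_matrix(s):
    """All cyclic shifts of s + '$', sorted lexicographically."""
    t = s + "$"
    return sorted(t[i:] + t[:i] for i in range(len(t)))

def bwt(s):
    """The last column of the BWT matrix, i.e. s-hat = BWT(s)."""
    return "".join(row[-1] for row in bwt_matrix(s))

def s_hat_of(s, w):
    """s-hat[w]: the symbols of s-hat lying on the (consecutive) rows prefixed by w."""
    return "".join(row[-1] for row in bwt_matrix(s) if row.startswith(w))

def preceding(w, s):
    """w_s^@: the symbols preceding each occurrence of w in s, left to right."""
    return "".join(s[i - 1] for i in range(1, len(s) - len(w) + 1)
                   if s[i:i + len(w)] == w)

t = "mississippi"
print(bwt(t))            # ipssm$pissii
print(s_hat_of(t, "s"))  # ssii
print(preceding("s", t)) # isis, a permutation of s-hat[s] as claimed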

Page 15: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Let T be the suffix tree of the string s$. We assume that the suffix tree edges are sorted lexicographically. If so, then the i-th leaf (counting from the left) of the suffix tree corresponds to the i-th row of the BWT matrix. We associate the i-th leaf of T with the i-th symbol of the string $\hat{s}$; I'll denote the i-th leaf of T by $\ell_i$ and the symbol associated with it by $\hat{s}_i$. By definition, $\hat{s} = \hat{s}_1 \hat{s}_2 \cdots \hat{s}_{|s|+1}$.

Page 16: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Let w be a substring of s. The locus of w, denoted [w], is the node of T whose associated string is the shortest string prefixed by w.

Page 17: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Example: the suffix tree for the string s = mississippi$.

The locus of the substrings ss and ssi is the node reachable by the path labelled ssi.

Page 18: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Another very important notion I would like to introduce is that of the leaf cover. Given a suffix tree T, we say that a subset L of its nodes is a leaf cover if every leaf of the suffix tree has a unique ancestor in L.

For every node u of T we will denote by $\hat{s}\langle u \rangle$ the substring of $\hat{s}$ obtained by concatenating, from left to right, the symbols associated with the leaves descending from node u. For example, in the suffix tree from the previous slide, $\hat{s}\langle [i] \rangle = pssm$.

Page 19: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Note that these symbols are exactly the single symbols preceding the occurrences of i in mississippi$. That is, for any string w we have $\hat{s}\langle [w] \rangle = \hat{s}[w]$.

Page 20: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

A key observation in this article is the natural relation between leaf covers and prefix covers. Let $\mathcal{P}_k^* = \{w_1, \ldots, w_p\}$ be the optimal prefix cover defining $H_k^{*@}(s)$, and let $\mathcal{L}_k$ be the set of nodes $\{[w_1], \ldots, [w_p]\}$. Since $\mathcal{P}_k^*$ is a prefix cover of $\Sigma^k$, every leaf of T corresponding to a suffix of length greater than k has a unique ancestor in $\mathcal{L}_k$. On the other hand, leaves of T corresponding to suffixes of length smaller than k might not have an ancestor in $\mathcal{L}_k$. We would like to enhance $\mathcal{L}_k$ in a way that will make it a leaf cover of T.

Page 21: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

We will denote by $Q_k$ the set of leaves corresponding to suffixes of s$ of length at most k which are not prefixed by a string in $\mathcal{P}_k^*$, and we set $L_k = \mathcal{L}_k \cup Q_k$.

Note that $|Q_k| \leq k$, because s$ has at most k suffixes of length at most k.

This relation is exploited next.

Page 22: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

The cost of a leaf cover: let C denote the function which associates with every string x over $\Sigma \cup \{\$\}$, with at most one occurrence of $, the positive real value

$C(x) = \lambda |x'| \, H_0^*(x') + \mu$

where λ and μ are constants and x' is the string x with the symbol $ removed, if it was present. We will now define the value of C for a leaf cover L:

$C(L) = \sum_{u \in L} C(\hat{s}\langle u \rangle)$
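A direct Python transcription of the cost function might look as follows; λ and μ are left as placeholder parameters, and the pieces in the usage line are the split of $\hat{s}$ = ipssm$pissii by the first character of the sorted rows, i.e. the partition induced by taking the children of the root as the leaf cover.

import math

def h0(x):
    if not x:
        return 0.0
    n = len(x)
    return sum((x.count(c) / n) * math.log2(n / x.count(c)) for c in set(x))

def h0_star(x):
    if len(x) == 0:
        return 0.0
    if h0(x) == 0.0:
        return (1 + math.floor(math.log2(len(x)))) / len(x)
    return h0(x)

def cost(x, lam=1.0, mu=1.0):
    """C(x) = lam * |x'| * H_0^*(x') + mu, where x' is x with the '$' removed."""
    xp = x.replace("$", "")
    return lam * len(xp) * h0_star(xp) + mu

def cost_of_cover(pieces, lam=1.0, mu=1.0):
    """C(L) = sum of C(s-hat<u>) over the nodes u of the leaf cover L."""
    return sum(cost(x, lam, mu) for x in pieces)

print(cost_of_cover(["i", "pssm", "$", "pi", "ssii"]))   # cost of one particular leaf cover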

Page 23: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

In this section, I only have the following lemma left to prove: for any given $k \geq 0$ there exists a constant $g_k$ such that, for any string s,

$C(L_k) \leq \lambda |s| \, H_k^{*@}(s) + g_k$

The next three slides detail the proof of the lemma.

Page 24: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Let us recall that $L_k = \mathcal{L}_k \cup Q_k$ and that, by definition, $|Q_k| \leq k$. If so, then the following equation obviously holds:

$C(L_k) = \sum_{u \in \mathcal{L}_k} C(\hat{s}\langle u \rangle) + \sum_{u \in Q_k} C(\hat{s}\langle u \rangle)$

Call the first summation (1) and the second summation (2).

Observe that every $u \in Q_k$ is a leaf of T, so $\hat{s}\langle u \rangle$ is a single symbol. By the definition of C we get that, for every $u \in Q_k$:

$C(\hat{s}\langle u \rangle) = \lambda |\hat{s}\langle u \rangle'| \, H_0^*(\hat{s}\langle u \rangle') + \mu \leq \lambda + \mu$

Page 25: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

Also, recall that $|Q_k| \leq k$. Combined, we get that summation (2) is bounded by $k(\lambda + \mu)$.

To evaluate summation (1), recall that every $u \in \mathcal{L}_k$ is the locus $[w]$ of a string $w \in \mathcal{P}_k^*$. By the relation between the suffix tree and the BWT matrix we have that $\hat{s}\langle [w] \rangle = \hat{s}[w]$. Then we get:

$\sum_{u \in \mathcal{L}_k} C(\hat{s}\langle u \rangle) = \sum_{w \in \mathcal{P}_k^*} C(\hat{s}[w]) = \sum_{w \in \mathcal{P}_k^*} \left( \lambda |\hat{s}[w]| \, H_0^*(\hat{s}[w]) + \mu \right)$

Page 26: Boosting Textual Compression in Optimal Linear Time

Definitions (cont)

For the last step, recall that $\hat{s}[w]$ is a permutation of $w_s^{@}$, and therefore $H_0^*(\hat{s}[w]) = H_0^*(w_s^{@})$ and, obviously, $|\hat{s}[w]| = |w_s^{@}|$.

Finally, combining the bounds on summations (1) and (2), we get:

$C(L_k) \leq \lambda \sum_{w \in \mathcal{P}_k^*} |w_s^{@}| \, H_0^*(w_s^{@}) + \mu |\mathcal{P}_k^*| + k(\lambda + \mu) \leq \lambda |s| \, H_k^{*@}(s) + g_k$

since the first sum equals $|s| H_k^{*@}(s)$ by the choice of $\mathcal{P}_k^*$ and $|\mathcal{P}_k^*| \leq |\Sigma|^k$, so we may take $g_k = \mu |\Sigma|^k + k(\lambda + \mu)$, a constant that depends only on k (and on the fixed λ, μ and |Σ|). This proves the lemma.

Page 27: Boosting Textual Compression in Optimal Linear Time

Computing the Optimal Leaf Cover

Now that we're finally done with all of the required definitions, we can get down to business.

Perhaps the most important aspect of this boosting technique is that the optimal leaf cover can be computed in time linear in |s|.

In the following slides I will present an algorithm that computes that optimal leaf cover in linear time, and prove its correctness and time complexity.

Page 28: Boosting Textual Compression in Optimal Linear Time

Computing the Optimal Leaf Cover (cont)

Before I show the actual algorithm, I will prove the following lemma: an optimal leaf cover for the subtree rooted at u consists either of the single node u, or of the union of optimal leaf covers of the subtrees rooted at the children of u in T.

Page 29: Boosting Textual Compression in Optimal Linear Time

Computing the Optimal Leaf Cover (cont)

Proof: let $L_{min}(u)$ denote the optimal leaf cover for the subtree of T rooted at u.

If u is a leaf then the result obviously holds. We assume then that u is an internal node and that $u_1, \ldots, u_c$ are its children.

It is obvious that $\{u\}$ and $\bigcup_{i=1}^{c} L_{min}(u_i)$ are both leaf covers of the subtree rooted at u. I will show that one of them is optimal.

Page 30: Boosting Textual Compression in Optimal Linear Time

Computing the Optimal Leaf Cover (cont)

Let us assume that $L_{min}(u) \neq \{u\}$. We can then write $L_{min}(u) = \bigcup_{i=1}^{c} L(u_i)$, where each $L(u_i)$ is a leaf cover (not necessarily the optimal one) of the subtree rooted at $u_i$. Then the following holds:

$C(L_{min}(u)) = \sum_{i=1}^{c} C(L(u_i)) \geq \sum_{i=1}^{c} C(L_{min}(u_i))$

Page 31: Boosting Textual Compression in Optimal Linear Time

Computing the Optimal Leaf Cover (cont)

Since the cost of the optimal leaf cover is smaller than or equal to that of any other leaf cover, we also have

$C(L_{min}(u)) \leq \sum_{i=1}^{c} C(L_{min}(u_i))$

and therefore the two sides are equal. This means that the union of the optimal leaf covers of the trees rooted at the children of u is indeed an optimal leaf cover for the tree rooted at u.

Page 32: Boosting Textual Compression in Optimal Linear Time

Computing the Optimal Leaf Cover (cont)

The following algorithm computes the optimal leaf cover in linear time:
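The slide presents the pseudocode as a figure; as a stand-in, here is a minimal Python sketch of the post-order greedy computation implied by the lemma, including the symbol-count merging discussed on the next two slides. The Node class, helper names and parameter defaults are illustrative assumptions, not the article's code; the suffix tree is taken as already built.

import math

def h0_star_from_counts(counts):
    """H_0^* of a string given its symbol counts (the '$' already excluded)."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    h0 = sum((c / n) * math.log2(n / c) for c in counts.values() if c > 0)
    return (1 + math.floor(math.log2(n))) / n if h0 == 0.0 else h0

def cost_from_counts(counts, lam=1.0, mu=1.0):
    return lam * sum(counts.values()) * h0_star_from_counts(counts) + mu

class Node:
    def __init__(self, children=None, symbol=None):
        self.children = children or []   # internal node: its children, left to right
        self.symbol = symbol             # leaf: the symbol of s-hat associated with it

def optimal_leaf_cover(u, lam=1.0, mu=1.0):
    """
    Returns (symbol counts of s-hat<u>, cost of the best cover of u's subtree, chosen nodes).
    By the lemma, the best cover is either {u} itself or the union of the best
    covers of u's children, whichever is cheaper.
    """
    if not u.children:                                   # a leaf: s-hat<u> is one symbol
        counts = {} if u.symbol == "$" else {u.symbol: 1}
        return counts, cost_from_counts(counts, lam, mu), [u]
    counts, kids_cost, kids_cover = {}, 0.0, []
    for child in u.children:
        c_counts, c_cost, c_cover = optimal_leaf_cover(child, lam, mu)
        for a, n in c_counts.items():                    # merge counts: O(|Sigma|) per child
            counts[a] = counts.get(a, 0) + n
        kids_cost += c_cost
        kids_cover += c_cover
    own_cost = cost_from_counts(counts, lam, mu)         # cost of taking {u} itself
    if own_cost <= kids_cost:
        return counts, own_cost, [u]
    return counts, kids_cost, kids_cover

# usage: counts, best_cost, cover = optimal_leaf_cover(root_of_suffix_tree)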

The algorithm's correctness follows immediately from the previous lemma. I will show that it runs in O(|s|) time.

Page 33: Boosting Textual Compression in Optimal Linear Time

Computing the Optimal Leaf Cover (cont)

The only nontrivial operation in the algorithm is the calculation of $C(\hat{s}\langle u \rangle)$ at each step. To do that, we have to know the number of occurrences of each symbol of the alphabet in the string $\hat{s}\langle u \rangle$ (because, in order to calculate the cost of a string, we have to calculate $H_0^*(\hat{s}\langle u \rangle)$).

Doing this is possible in constant time for each node, because if u is a leaf then each symbol of the alphabet appears either once or never in $\hat{s}\langle u \rangle$.

Page 34: Boosting Textual Compression in Optimal Linear Time

Computing the Optimal Leaf Cover (cont)

If u is not a leaf, then the number of occurrences of each symbol in $\hat{s}\langle u \rangle$ is the sum of the numbers of its occurrences in $\hat{s}\langle u_1 \rangle, \ldots, \hat{s}\langle u_c \rangle$, where $u_1, \ldots, u_c$ are the children of u (recall that $\hat{s}\langle u \rangle$ is the concatenation of $\hat{s}\langle u_1 \rangle, \ldots, \hat{s}\langle u_c \rangle$).

Now we are finally ready to see the actual algorithm describing the boosting technique.

Page 35: Boosting Textual Compression in Optimal Linear Time

The Boosting Technique

The following algorithm describes the technique:
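The slide shows the algorithm as a figure. The sketch below strings the three steps of slide 6 together in Python; it uses a naive suffix trie in place of the suffix tree (quadratic, for illustration only) and a stand-in lambda where the base compressor A would go, so it is a toy model of the technique rather than the linear-time construction of the article.

import math

def h0_star(x):
    if not x:
        return 0.0
    n = len(x)
    h0 = sum((x.count(c) / n) * math.log2(n / x.count(c)) for c in set(x))
    return (1 + math.floor(math.log2(n))) / n if h0 == 0.0 else h0

def cost(x, lam=1.0, mu=1.0):
    xp = x.replace("$", "")
    return lam * len(xp) * h0_star(xp) + mu

def build_trie(t):
    """Naive trie of all suffixes of t (t ends with '$'); each leaf stores the BWT symbol."""
    root = {}
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = t[i - 1] if i > 0 else t[-1]      # symbol cyclically preceding this suffix
    return root

def greedy_partition(node, lam=1.0, mu=1.0):
    """Post-order greedy: returns (s-hat<node>, cost of the best cover, its list of pieces)."""
    if "leaf" in node:
        sym = node["leaf"]
        return sym, cost(sym, lam, mu), [sym]
    text, kids_cost, pieces = "", 0.0, []
    for ch in sorted(k for k in node if k != "leaf"):    # children in lexicographic order
        t, c, p = greedy_partition(node[ch], lam, mu)
        text, kids_cost, pieces = text + t, kids_cost + c, pieces + p
    own = cost(text, lam, mu)
    return (text, own, [text]) if own <= kids_cost else (text, kids_cost, pieces)

def boost(s, compress_A, lam=1.0, mu=1.0):
    """Step 1: s-hat = BWT(s^R); step 2: greedy partition of s-hat; step 3: run A on each piece."""
    t = s[::-1] + "$"                                    # leaves of the trie, left to right,
    s_hat, _, pieces = greedy_partition(build_trie(t), lam, mu)   # spell out BWT(s^R)
    # A real implementation would also record the partition boundaries for decompression.
    return [compress_A(piece + "#") for piece in pieces]

print(boost("mississippi", compress_A=lambda z: z.encode()))     # stand-in for the real A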

Page 36: Boosting Textual Compression in Optimal Linear Time

The Boosting Technique

First, any compression algorithm we wish to use the boosting technique on has to satisfy the following property: A is a compression algorithm such that, given an input string $x \in \Sigma^*$, A first appends an end-of-string symbol # to x and then compresses x# with the following space and time bounds:

1. A compresses x# in at most $\lambda |x| \, H_0^*(x) + \mu$ bits.
2. The running time of A is T(|x|) and its working space is S(|x|), where T is convex and S is non-decreasing.

Page 37: Boosting Textual Compression in Optimal Linear Time

The Boosting Technique

The boosting algorithm can be used on any algorithm satisfying the previous property to boost its compression up to the k-th order entropy, for any k, without any asymptotic loss in time efficiency and with only a slightly larger working space complexity.

Page 38: Boosting Textual Compression in Optimal Linear Time

The Boosting Technique

Theorem: given a compression algorithm A that satisfies the aforementioned property, our boosting technique yields the following results:

1. If applied to s, it compresses it within $\lambda |s| \, H_k^{*@}(s) + \log |s| + g_k$ bits, for any k.
2. If applied to $s^R$, it compresses it within $\lambda |s| \, H_k^*(s) + \log |s| + g_k$ bits, for any k.