Boosting Textual Compression in Optimal Linear Time
Article by Ferragina, Giancarlo, Manzini and Sciortino.
Presentation by: Maor Itzkovitch.
Disclaimer
The author of this presentation, henceforth referred to as "The Author", should not be held accountable for any mental illness, confusion, disorientation, or general lack of will to live caused, directly or indirectly, by prolonged exposure to this material.
Introduction
A boosting technique, in very informal terms, can be seen as a method that, when applied to a particular class of algorithms, yields improved algorithms in terms of one or more parameters characterizing their performance in the class. General boosting techniques have a deep significance for Computer Science. Using such techniques, one can, informally, take a good algorithm and, by applying the boosting technique to it, get a very high-quality algorithm, again in terms of the parameters characterizing the nature of the problem.
Introduction (cont)
In the past weeks, I am sure we have all been convinced of the importance of textual compression to our field of study. If so, we would like to come up with a boosting technique that improves existing textual compression algorithms while incurring the smallest possible loss in the algorithm's asymptotic time and space complexity. In general, such efficient boosting techniques are very hard to come by. In this class I will present one such boosting technique for improving textual compression algorithms.
Presentation Outline
For a change, this presentation will begin with the results of the boosting technique. Only then will I elaborate further.
As with all previous presentations, I will have to introduce many new definitions, and repeat a few that we have already seen. It’s not going to be easy, so bear with me.
Once the new definitions are all clear, we will see the pseudocode for the boosting technique. Assuming that the definitions are indeed clear, the technique itself is quite straightforward.
To conclude this presentation, I will show some remaining open problems.
Statement of Results
Let s be a string over a finite alphabet Σ and let k ≥ 0. Let H_k(s) denote the k-th order empirical entropy of s, and let H_k*(s) be the k-th order modified empirical entropy of s, both of which will be defined soon enough. Also, let us recall the Burrows-Wheeler Transform that, given a string s, computes a permutation of that string, hereby denoted BWT(s).
Let us consider a compression algorithm A that compresses any string z# in at most λ|z|H_0(z) + η|z| + μ bits, where λ, η and μ are constants independent of z, and # is a special symbol not appearing elsewhere in z. A general outline of the boosting technique will be shown in the next slide.
Statement of Results (cont)
Here are the three major steps of the technique:
1. Compute ŝ = BWT(s^R).
2. Using the suffix tree of s^R, greedily partition ŝ so that a suitably defined objective function is minimized.
3. Compress each substring of the partition, separately, using algorithm A.
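To make the three steps concrete, here is a minimal Python sketch of the pipeline. It is illustrative only and not the article's algorithm: the partition uses fixed length-k contexts (the first k symbols of each sorted row) instead of the suffix-tree-driven optimal partition computed later, and zlib stands in for the zeroth-order compressor A. All function names are mine.

```python
import zlib

def bwt(s):
    """Burrows-Wheeler Transform of s with a sentinel '$' appended.
    Returns the last column and the sorted rotation matrix."""
    t = s + "$"
    rows = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(r[-1] for r in rows), rows

def boost_sketch(s, k=1):
    """Illustrative pipeline: (1) s_hat = BWT of the reversed string;
    (2) partition s_hat into runs of rows sharing the same length-k
    context; (3) compress each piece separately with a stand-in for A."""
    s_hat, rows = bwt(s[::-1])                      # step 1
    pieces, start = [], 0                           # step 2
    for i in range(1, len(rows) + 1):
        if i == len(rows) or rows[i][:k] != rows[start][:k]:
            pieces.append(s_hat[start:i])
            start = i
    return [zlib.compress(p.encode()) for p in pieces]   # step 3
```

Decompressing and concatenating the pieces recovers BWT(s^R), from which s can be rebuilt by the usual inverse BWT.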
Statement of Results (cont)
We will show that, for any k ≥ 0, the length in bits of the string resulting from the boosting is bounded by:

λ|s|H_k(s) + η|s| + log₂|s| + g_k

If we rely on the stronger assumption that A compresses every string z# in at most λ|z|H_0*(z) + μ bits, then the following improved bound can be achieved:

λ|s|H_k*(s) + log₂|s| + g_k
Definitions
Let s be a string over the alphabet Σ = {a_1, ..., a_h} and, for each a_i ∈ Σ, let n_i be the number of occurrences of a_i in s. We will assume that n_i ≥ n_{i+1}. The zeroth order empirical entropy of s is:

H_0(s) = Σ_{i=1}^{h} (n_i/|s|) log₂(|s|/n_i)

For any string w, we denote by w_s the string of single symbols following the occurrences of w in s, taken from left to right. For example, if s = mississippi and w = si then w_s = sp. We define the k-th order entropy as:

H_k(s) = (1/|s|) Σ_{w ∈ Σ^k} |w_s| H_0(w_s)
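These two definitions can be checked directly in a few lines of Python (a sketch; the names are mine, not the article's):

```python
from collections import Counter
from math import log2

def h0(s):
    """Zeroth-order empirical entropy: sum over symbols of
    (n_i/|s|) * log2(|s|/n_i)."""
    if not s:
        return 0.0
    return sum(n / len(s) * log2(len(s) / n) for n in Counter(s).values())

def following(s, w):
    """w_s: the symbols following each occurrence of w in s, left to right."""
    return "".join(s[i + len(w)]
                   for i in range(len(s) - len(w))
                   if s[i:i + len(w)] == w)

def hk(s, k):
    """k-th order empirical entropy: (1/|s|) * sum over length-k contexts w
    of |w_s| * H0(w_s).  Contexts absent from s contribute nothing, so it
    suffices to enumerate the contexts that actually occur."""
    contexts = {s[i:i + k] for i in range(len(s) - k + 1)} if k else {""}
    return sum(len(following(s, w)) * h0(following(s, w))
               for w in contexts) / len(s)
```

On the slide's own example, `following("mississippi", "si")` returns "sp".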
Definitions (cont)
Now, we shall define the zeroth order modified empirical entropy:

H_0*(s) = 0                            if |s| = 0
H_0*(s) = (1 + ⌊log₂|s|⌋)/|s|          if |s| ≠ 0 and H_0(s) = 0
H_0*(s) = H_0(s)                       otherwise

To define the k-th order modified empirical entropy, I will introduce the notion of suffix cover: we say that a set S_k of strings of length at most k is a suffix cover of Σ^k, and write S_k ⊑ Σ^k, if every string in Σ^k has a unique suffix in S_k. For example, if Σ = {a, b} and k = 3 then both {a, b} and {a, ab, abb, bbb} are suffix covers for Σ^3.
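Both the three-case definition of H_0* and the suffix-cover condition are mechanical enough to verify in code. A small sketch (names are mine):

```python
from collections import Counter
from itertools import product
from math import floor, log2

def h0(s):
    """Zeroth-order empirical entropy."""
    if not s:
        return 0.0
    return sum(n / len(s) * log2(len(s) / n) for n in Counter(s).values())

def h0_star(s):
    """Modified zeroth-order empirical entropy (the three-case definition)."""
    if len(s) == 0:
        return 0.0
    if len(set(s)) == 1:          # H0(s) == 0 exactly when s repeats one symbol
        return (1 + floor(log2(len(s)))) / len(s)
    return h0(s)

def is_suffix_cover(S, sigma, k):
    """True if every string in sigma^k has exactly one suffix in S."""
    return all(sum(w.endswith(x) for x in S) == 1
               for w in ("".join(t) for t in product(sigma, repeat=k)))
```

For instance, `is_suffix_cover({"a", "ab", "abb", "bbb"}, "ab", 3)` confirms the second example set above.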
Definitions (cont)
We now define, for every suffix cover S_k ⊑ Σ^k:

H_{S_k}(s) = (1/|s|) Σ_{w ∈ S_k} |w_s| H_0*(w_s)

Now we can finally define the k-th order modified empirical entropy of s:

H_k*(s) = min_{S_k ⊑ Σ^k} H_{S_k}(s) = H_{S_k*}(s)

for some optimal suffix cover S_k*.
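Since H_k* is a minimum over suffix covers, for a toy alphabet it can be computed by brute force. The sketch below is exponential in |Σ|^k and meant purely as an executable restatement of the definition (helper names are mine):

```python
from collections import Counter
from itertools import combinations, product
from math import floor, log2

def h0(s):
    return 0.0 if not s else sum(n / len(s) * log2(len(s) / n)
                                 for n in Counter(s).values())

def h0_star(s):
    if not s:
        return 0.0
    if len(set(s)) == 1:
        return (1 + floor(log2(len(s)))) / len(s)
    return h0(s)

def following(s, w):
    return "".join(s[i + len(w)] for i in range(len(s) - len(w))
                   if s[i:i + len(w)] == w)

def suffix_covers(sigma, k):
    """Enumerate every suffix cover of sigma^k (tiny inputs only)."""
    cands = ["".join(t) for j in range(1, k + 1)
             for t in product(sigma, repeat=j)]
    full = ["".join(t) for t in product(sigma, repeat=k)]
    for r in range(1, len(cands) + 1):
        for S in combinations(cands, r):
            if all(sum(w.endswith(x) for x in S) == 1 for w in full):
                yield S

def hk_star(s, sigma, k):
    """min over suffix covers S of (1/|s|) * sum_w |w_s| * H0*(w_s)."""
    return min(sum(len(following(s, w)) * h0_star(following(s, w))
                   for w in S) / len(s)
               for S in suffix_covers(sigma, k))
```

The point of the suffix-tree machinery introduced later is precisely to avoid this exponential minimization.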
Definitions (cont)
Three more notations also worthy of mentioning, but briefly, are w_s@, H_k@(s) and H_k*@(s), where w_s@ is the string of single characters preceding the occurrences of w in s, taken from left to right, and H_k@ and H_k*@ are defined analogously to H_k and H_k*, using w_s@ (and prefix covers) in place of w_s (and suffix covers). These satisfy H_k@(s) = H_k(s^R) and H_k*@(s) = H_k*(s^R).
I also wish to introduce the notion of prefix cover, which is equivalent to the notion of suffix cover, just with prefixes instead of suffixes. That is, P_k is a prefix cover of Σ^k if every string in Σ^k has a unique prefix in P_k.
Definitions (cont)
Let us recall that BWT(s) constructs a matrix whose rows are the cyclic shifts of s$ sorted in lexicographical order, and returns the last column of that matrix, which we denote ŝ.
Let w be a substring of s. Then, by the matrix's construction, all of the rows prefixed by w are consecutive (because the matrix is sorted in lexicographical order). This means that the single symbols preceding every occurrence of w in s are grouped together in a set of consecutive positions of the string ŝ. We denote this substring ŝ[w]. It is easy to see that ŝ[w] is a permutation of w_s@.
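The claim that ŝ[w] is a permutation of w_s@ is easy to check mechanically. The sketch below builds the BWT matrix naively (quadratic in |s|, for illustration only; function names are mine):

```python
def bwt_rows(s):
    """Sorted cyclic shifts of s + '$'; the last column is BWT(s)."""
    t = s + "$"
    return sorted(t[i:] + t[:i] for i in range(len(t)))

def s_hat_of(s, w):
    """s_hat[w]: last-column symbols of the rows prefixed by w."""
    return "".join(r[-1] for r in bwt_rows(s) if r.startswith(w))

def preceding(s, w):
    """w_s@: symbols preceding each occurrence of w in s, left to right."""
    return "".join(s[i - 1] for i in range(1, len(s) - len(w) + 1)
                   if s[i:i + len(w)] == w)
```

For s = mississippi and w = s this reproduces the example on the next slide.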
Definitions (cont)
Example: BWT matrix for the string t = mississippi. Let w = s. The four occurrences of s in t are in the last four rows of the BWT matrix. Then t̂[s] = ssii, and that is indeed a permutation of w_t@ = isis.
Definitions (cont)
Let T be the suffix tree of the string s$. We assume that the suffix tree edges are sorted lexicographically. If so, then the i-th leaf (counting from the left) of the suffix tree corresponds to the i-th row of the BWT matrix. We associate the i-th leaf of T with the i-th symbol of the string ŝ. I'll denote the i-th leaf of T by ℓ_i and the symbol associated with it by ŝ_i. By definition, ŝ = ŝ_1 ŝ_2 ... ŝ_{|s|+1}.
Definitions (cont)
Let w be a substring of s. The locus of w, denoted [w], is the node of T that has associated the shortest string prefixed by w.
Definitions (cont)
Example: suffix tree for the string s = mississippi$. The locus of the substrings ss and ssi is the node reachable by the path labelled by ssi.
Definitions (cont)
Another very important notion I would like to introduce is that of the leaf cover. Given a suffix tree T, we say that a subset L of its nodes is a leaf cover if every leaf of the suffix tree has a unique ancestor in L.
For every node u of T we will denote by ŝ⟨u⟩ the substring of ŝ obtained by concatenating, from left to right, the symbols associated to the leaves descending from node u. For example, in the suffix tree from the previous slide, ŝ⟨[i]⟩ = pssm.
Definitions (cont)
Note that these symbols are exactly the single symbols preceding the occurrences of i in mississippi$. That is, for any string w we have ŝ⟨[w]⟩ = ŝ[w].
Definitions (cont)
A key observation in this article is the natural relation between leaf covers and prefix covers. Let P_k* = {w_1, ..., w_p} be the optimal prefix cover defining H_k*@(s), and let Π_k be the set of nodes {[w_1], ..., [w_p]}. Since P_k* is a prefix cover of Σ^k, we get that every leaf of T corresponding to a suffix of length greater than k has a unique ancestor in Π_k. On the other hand, leaves of T corresponding to suffixes of length smaller than k might not have an ancestor in Π_k. We would like to enhance Π_k in a way that will make it a leaf cover of T.
Definitions (cont)
We will denote by Q_k the set of leaves corresponding to suffixes of s$ of length at most k which are not prefixed by a string in P_k*. We set L_k = Π_k ∪ Q_k. Note that |Q_k| ≤ k, because s$ has at most k suffixes of length at most k. This relation is exploited next.
Definitions (cont)
The Cost of a Leaf Cover: Let C denote the function which associates to every string x over Σ ∪ {$}, with at most one occurrence of $, the positive real value

C(x) = λ|x′|H_0*(x′) + μ

where λ, μ are constants and x′ is the string x with the symbol $ removed, if it was present. We will now define the value of C for a leaf cover L:

C(L) = Σ_{u ∈ L} C(ŝ⟨u⟩)
Definitions (cont)
In this section, I only have the following lemma left to prove: for any given k ≥ 0 there exists a constant g_k such that, for any string s:

C(L_k) ≤ λ|s|H_k*@(s) + g_k

The next three slides detail the proof of the lemma.
Definitions (cont)
Let us recall that L_k = Π_k ∪ Q_k and that by definition |Q_k| ≤ k. If so, then the following equation obviously holds:

C(L_k) = Σ_{u ∈ Π_k} C(ŝ⟨u⟩)  +  Σ_{u ∈ Q_k} C(ŝ⟨u⟩)
              (1)                     (2)

Observe that every u ∈ Q_k is a leaf of T. By the definition of C we get that, for every u ∈ Q_k:

C(ŝ⟨u⟩) = λ|ŝ⟨u⟩′|H_0*(ŝ⟨u⟩′) + μ ≤ λ + μ

since |ŝ⟨u⟩| = 1.
Definitions (cont)
Also, recall that |Q_k| ≤ k. Combined, we get that summation (2) is bounded by k(λ + μ).
For us to evaluate summation (1), recall that every u ∈ Π_k is the locus of a string w ∈ P_k*. By the relation between the suffix tree and the BWT matrix we have that ŝ⟨[w]⟩ = ŝ[w]. Also, ŝ[w] contains at most one occurrence of $, so C is defined on it. Then we get:

C(ŝ⟨u⟩) = C(ŝ[w]) = λ|ŝ[w]′|H_0*(ŝ[w]′) + μ
Definitions (cont)
For the last step, recall that ŝ[w] is a permutation of w_s@ and therefore H_0*(ŝ[w]) = H_0*(w_s@) and, obviously, |ŝ[w]| = |w_s@|. Finally, we get:

C(L_k) ≤ λ Σ_{w ∈ P_k*} |w_s@| H_0*(w_s@) + μ|P_k*| + k(λ + μ) = λ|s|H_k*@(s) + g_k

where we take g_k = μ|P_k*| + k(λ + μ), a constant for fixed k since |P_k*| ≤ |Σ|^k.
Computing the Optimal Leaf Cover
Now that we're finally done with all of the required definitions, we can get down to business. Perhaps the most important aspect of this boosting technique is that the optimal leaf cover can be computed in time linear in |s|. In the following slides I will present an algorithm that computes that optimal leaf cover in linear time, and prove its correctness and time complexity.
Computing the Optimal Leaf Cover (cont)
Before I show the actual algorithm, I will prove the following lemma: an optimal leaf cover for the subtree rooted at u consists of either the single node u, or of the union of optimal leaf covers of the subtrees rooted at the children of u in T.
Computing the Optimal Leaf Cover (cont)
Proof: Let L_min(u) denote the optimal leaf cover for the subtree of T rooted at u. If u is a leaf then the result obviously holds. We assume then that u is an internal node and that u_1, ..., u_c are its children. It's obvious that {u} and ∪_{i=1}^{c} L_min(u_i) are both leaf covers of the subtree rooted at u. I will show that one of them is optimal.
Computing the Optimal Leaf Cover (cont)
Let's assume that L_min(u) ≠ {u}. We can then say that L_min(u) = ∪_{i=1}^{c} L(u_i), where each L(u_i) is a leaf cover (not necessarily the optimal one) for the subtree rooted at u_i. Then the following holds:

C(L_min(u)) = Σ_{i=1}^{c} C(L(u_i)) ≥ Σ_{i=1}^{c} C(L_min(u_i))
Computing the Optimal Leaf Cover (cont)
Since the cost of the optimal leaf cover is smaller than or equal to that of any other leaf cover, we get that:

C(L_min(u)) = Σ_{i=1}^{c} C(L_min(u_i))

which means that the union of the optimal leaf covers of the trees rooted at the children of u is indeed an optimal leaf cover for the tree rooted at u.
Computing the Optimal Leaf Cover (cont)
The following algorithm computes the optimal leaf cover in linear time. The algorithm's correctness follows immediately from the previous lemma. I will show that it runs in O(|s|) time.
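The algorithm's pseudocode was a figure in the original slides and did not survive extraction, but the lemma pins down its shape: a single post-order pass that, at each node, keeps either {u} or the union of the children's optimal covers. Here is a Python sketch of that recursion on a toy tree (not a real suffix tree), with hypothetical constants LAM and MU for the cost C; the real algorithm evaluates C in constant time per node via symbol counts, while this sketch recomputes it from scratch.

```python
from collections import Counter
from math import floor, log2

LAM, MU = 2.0, 4.0          # hypothetical constants of the cost function C

def h0(s):
    if not s:
        return 0.0
    return sum(n / len(s) * log2(len(s) / n) for n in Counter(s).values())

def h0_star(s):
    if not s:
        return 0.0
    if len(set(s)) == 1:
        return (1 + floor(log2(len(s)))) / len(s)
    return h0(s)

def C(x):
    """C(x) = LAM * |x'| * H0*(x') + MU, with '$' removed from x."""
    xp = x.replace("$", "")
    return LAM * len(xp) * h0_star(xp) + MU

def optimal_leaf_cover(node):
    """Post-order recursion from the lemma.  A node is ('leaf', symbol) or
    ('node', [children]).  Returns (concatenated string s_hat<u>, cover as
    a list of the covered nodes' strings, cost of that cover)."""
    kind, payload = node
    if kind == 'leaf':
        return payload, [payload], C(payload)
    text, cover, children_cost = "", [], 0.0
    for child in payload:
        t, cv, c = optimal_leaf_cover(child)
        text += t
        cover += cv
        children_cost += c
    if C(text) <= children_cost:      # keep {u} ...
        return text, [text], C(text)
    return text, cover, children_cost  # ... or the union of child optima

# Toy example: a 3-node tree whose leaves spell out "pssm".
tree = ('node', [('node', [('leaf', 'p'), ('leaf', 's')]),
                 ('leaf', 's'), ('leaf', 'm')])
text, cover, total = optimal_leaf_cover(tree)
```

With these constants the per-node overhead MU makes grouping cheap, so the root alone is the optimal cover of the toy tree.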
Computing the Optimal Leaf Cover (cont)
The only nontrivial operation in the algorithm is the calculation of C(ŝ⟨u⟩) at each step. To do that, we have to know the number of occurrences of each symbol of the alphabet in the string ŝ⟨u⟩ (because in order to calculate the cost of a string, we have to calculate H_0*(ŝ⟨u⟩)). Doing this is possible in constant time for each node, because if u is a leaf then each symbol in the alphabet appears either once or never in ŝ⟨u⟩.
Computing the Optimal Leaf Cover (cont)
If u is not a leaf, then the number of occurrences of each symbol in ŝ⟨u⟩ is the sum of the numbers of its occurrences in ŝ⟨u_1⟩, ..., ŝ⟨u_c⟩, where u_1, ..., u_c are the children of u (recall that ŝ⟨u⟩ is the concatenation of ŝ⟨u_1⟩, ..., ŝ⟨u_c⟩).
Now we are finally ready to see the actual algorithm describing the boosting technique.
The Boosting Technique
The following algorithm describes the technique:
The Boosting Technique
First, any compression algorithm we wish to use the boosting technique on has to satisfy the following property: A is a compression algorithm such that, given an input string x ∈ Σ*, A first appends an end-of-string symbol # to x and then compresses x# with the following space and time bounds:
1. A compresses x# in at most λ|x|H_0*(x) + μ bits.
2. The running time of A is T(|x|) and its working space is S(|x|), where T is convex and S is non-decreasing.
The Boosting Technique
The boosting algorithm can be used on any algorithm satisfying the previous property to boost its compression up to the k-th order entropy, for any k, without any asymptotic loss in time efficiency and with a slightly larger working space complexity.
The Boosting Technique
Theorem: Given a compression algorithm A that satisfies the aforementioned property, our boosting technique yields the following results:
1. If applied to s, it compresses it within λ|s|H_k*@(s) + log₂|s| + g_k bits, for any k.
2. If applied to s^R, it compresses it within λ|s|H_k*(s) + log₂|s| + g_k bits, for any k.