Volume 7, number 6 INFORMATION PROCESSING LETTERS October 1978
A COMPRESSION METHOD FOR CLUSTERED BIT-VECTORS
Jukka TEUHOLA Department of Computer Science, University of Turku, SF-20500 Turku 50, Finland
Received 30 May 1978
Encoding methods, bit-vector compression
1. Introduction
A bit-vector can be compressed if the frequency of zeroes (or, equally, of ones) differs from 0.5 or if the vector is clustered in some way (i.e. not random). There are several compression methods, some of which are presented in the references [1-3]. The methods can be divided into three types:
(1) Fixed-to-variable encoding: The bit-vector is divided into fixed-length subvectors, which are replaced with variable-length codewords.
(2) Variable-to-fixed encoding: The bit-vector is divided into subvectors, so-called runs, which consist of consecutive 0-bits terminating with a 1-bit (or vice versa). The number of the 0-bits is called the run-length and it is represented with a fixed-length number.
(3) Variable-to-variable encoding: The run-length is encoded to a variable-length codeword.
The efficiency of a compression method can be expressed with compression gain, which simply means the ratio
$$G = \frac{\text{length of the original bit-vector}}{\text{length of the compressed bit-vector}}.$$
For a random bit-vector it is possible to determine the maximum value of the expectation of G for a given 0-bit probability p, see [4]:
$$G_{\max} = \frac{1}{-p \log_2 p - (1 - p) \log_2 (1 - p)}. \qquad (1)$$
In the following, p is assumed to be greater than 0.5 (without loss of generality).
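As a numerical check of Eq. (1), the following minimal Python sketch evaluates the maximum expected gain for a given 0-bit probability p. The function name `max_gain` is an assumption made for illustration; it does not come from the paper.

```python
import math


def max_gain(p: float) -> float:
    """Maximum expected compression gain of Eq. (1) for 0-bit probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Binary entropy in bits per source bit; its reciprocal is the gain bound.
    entropy = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return 1.0 / entropy


if __name__ == "__main__":
    for p in (0.8, 0.9, 0.99):
        print(p, round(max_gain(p), 2))
```

For p = 0.5 the entropy is one bit per bit, so the sketch returns a gain of 1, i.e. no compression is possible.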
Golomb [2] has described an encoding method, the gain of which is quite near to the maximum gain in a random case. It is a run-length encoding method of the third type. The codeword consists of a variable-length prefix and a fixed-length tail. The idea of the method is illustrated in Fig. 1.
The prefix consists of $\lfloor \text{run-length}/m \rfloor$ 1-bits followed by a 0-bit as a separator. The tail consists of a binary number of $\lceil \log_2 m \rceil$ bits and its value is: run-length $- m \cdot \lfloor \text{run-length}/m \rfloor$. The parameter m should be chosen so that $p^m \approx 0.5$. It is preferable that m is of the form $2^N$. It can be shown [1] that the compression gain for a random bit-vector and for optimal m is
$$G_{\mathrm{Gol}} = \frac{1}{(1 - p)\left(\log_2 m + (1 - p^m)^{-1}\right)}, \quad \text{where } m = -\frac{1}{\log_2 p}. \qquad (2)$$
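The Golomb codeword construction and the gain of Eq. (2) can be sketched in Python as follows. This is only an illustration under the assumption that m is a power of two, as the text recommends; the names `golomb_encode` and `golomb_gain` are hypothetical and not taken from the paper.

```python
import math


def golomb_encode(s: int, m: int) -> str:
    """Golomb codeword (as a bit string) for run-length s, assuming m = 2**N."""
    if s < 0:
        raise ValueError("run-length must be non-negative")
    if m < 1 or m & (m - 1):
        raise ValueError("this sketch assumes that m is a power of two")
    tail_bits = m.bit_length() - 1           # log2(m) for a power of two
    quotient, remainder = divmod(s, m)
    prefix = "1" * quotient                   # floor(s / m) 1-bits
    tail = format(remainder, "b").zfill(tail_bits) if tail_bits else ""
    return prefix + "0" + tail                # the 0-bit is the separator


def golomb_gain(p: float, m: int) -> float:
    """Expected compression gain of Eq. (2) for a random bit-vector."""
    return 1.0 / ((1.0 - p) * (math.log2(m) + 1.0 / (1.0 - p ** m)))


if __name__ == "__main__":
    print(golomb_encode(9, 4))                # two 1-bits, separator, tail 01
    print(round(golomb_gain(0.8, 4), 2))
```

With p = 0.8 the rule $m = -1/\log_2 p$ gives about 3.1, and taking the power of two m = 4 yields a gain close to the value listed for the Golomb code in Table 2.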
If the bit-vector is clustered, i.e. not random, or p is not known beforehand, the Golomb code is no longer nearly optimal. For such cases Bahl and Kobayashi [1] have presented a so-called multi-mode Golomb code, which works better. In this paper another (and simpler) modification of the Golomb code is presented. The method is called exponential Golomb encoding (abbr. exp-Golomb). It is slightly worse in a random case but its worst-case behaviour is far better than that of the Golomb code.
2. Exp-Golomb encoding method
Fig. 2 illustrates the method. When a run is encoded, we try to separate successively subvectors of $2^k, 2^{k+1}, 2^{k+2}, \ldots$ binary zeroes. When this does not succeed any more, i.e. the end of the run (a 1-bit) is encountered, the rest of the run-length is encoded to a binary number. If the length of the last subvector was $2^{k+i}$, then k + i + 1 bits are needed to represent the rest of the run. These bits are called the tail and its length is now variable (fixed in the Golomb code). The power of the method is in the fact that it encodes both short and long runs efficiently, because the successive subvectors grow exponentially. The method is not very sensitive to the value of k, but it can be chosen, as for the Golomb code, so that $p^{2^k} \approx 0.5$, if p is known. This is not an exact optimization rule, because the subvectors are not fixed, as in the Golomb code. Table 1 shows the codes of run-lengths 0-10 for k = 0, 1 and 2.
Fig. 1. An example of the Golomb code. (The figure marks the run, the prefix, the separator and the tail, i.e. the rest of the run-length represented as a binary number of $\lceil \log_2 m \rceil$ bits.)
The encoding algorithm. Let s = run-length.
Step 1. Determine n such that
$$\sum_{i=k}^{n} 2^i \le s < \sum_{i=k}^{n+1} 2^i \qquad (n \ge k - 1).$$
Step 2. Form the prefix of n - k + 1 1-bits.
Step 3. Insert the separator (0-bit).
Step 4. Form the tail: express the value of $s - \sum_{i=k}^{n} 2^i$ as a binary number with n + 1 bits.
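Steps 1-4 can be sketched in Python as below. This is a minimal illustration, not the author's implementation; the function name `exp_golomb_encode` and the bit-string representation of codewords are assumptions. Instead of solving for n directly, the loop separates subvectors of $2^k, 2^{k+1}, \ldots$ zeroes one at a time, which yields the same prefix length n - k + 1.

```python
def exp_golomb_encode(s: int, k: int) -> str:
    """Exp-Golomb codeword (as a bit string) for run-length s and parameter k."""
    if s < 0 or k < 0:
        raise ValueError("s and k must be non-negative")
    prefix_len = 0        # number of separated subvectors, i.e. n - k + 1
    block = 1 << k        # size of the next subvector: 2**k, 2**(k+1), ...
    rest = s
    while rest >= block:  # Step 1: separate subvectors while they still fit
        rest -= block
        prefix_len += 1
        block <<= 1
    tail_bits = k + prefix_len            # equals n + 1
    tail = format(rest, "b").zfill(tail_bits) if tail_bits else ""
    # Steps 2-4: prefix of 1-bits, 0-bit separator, binary tail.
    return "1" * prefix_len + "0" + tail


if __name__ == "__main__":
    for k in (0, 1, 2):
        print(k, [exp_golomb_encode(s, k) for s in range(11)])
```

Running the sketch for s = 0, ..., 10 and k = 0, 1, 2 gives the codewords listed in Table 1.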
3. Analysis of the code
For a random bit-vector it is easy to determine the expected length of the codeword (= LCW). Let us assume that p = probability of a 0-bit in the original vector and that the bits are independent:
$$E(\mathrm{LCW}) = (k + 1) \sum_{i=0}^{2^k - 1} p^i (1 - p) + (k + 3) \sum_{i=2^k}^{2^k + 2^{k+1} - 1} p^i (1 - p) + (k + 5) \sum_{i=2^k + 2^{k+1}}^{2^k + 2^{k+1} + 2^{k+2} - 1} p^i (1 - p) + \cdots$$
$$= (k + 1) \sum_{i=0}^{\infty} p^i (1 - p) + 2 \sum_{i=2^k}^{\infty} p^i (1 - p) + 2 \sum_{i=2^k + 2^{k+1}}^{\infty} p^i (1 - p) + \cdots$$
$$= k + 1 + 2 p^{2^k} + 2 p^{2^k + 2^{k+1}} + \cdots$$
$$= k + 1 + 2 \sum_{i=0}^{\infty} p^{2^k (2^{i+1} - 1)}.$$
The third line follows because $\sum_{i=a}^{\infty} p^i (1 - p) = p^a$. The compression gain is
$$G_{\mathrm{expGol}} = \frac{1}{(1 - p)\, E(\mathrm{LCW})}, \qquad (3)$$
because the expected run-length is 1/(1 - p). As we can see, E(LCW) remains tolerable even if p were very near to 1 (the case p = 1 is impossible).

Fig. 2. An example of the exp-Golomb code (here k = 1). (The figure marks the run, the separator and the tail, i.e. the rest of the run-length represented as a binary number of k + 3 bits.)

Table 1
Exp-Golomb codes for some run-lengths

Run-length   k = 0      k = 1     k = 2
 0           0          00        000
 1           100        01        001
 2           101        1000      010
 3           11000      1001      011
 4           11001      1010      10000
 5           11010      1011      10001
 6           11011      110000    10010
 7           1110000    110001    10011
 8           1110001    110010    10100
 9           1110010    110011    10101
10           1110011    110100    10110

4. Comparison of methods
We first consider the random bit-vector. Table 2 presents compression gains in five cases:
- $G_{\max}$, the information-theoretical maximum compression gain.
- $G_{\mathrm{expGol}}$, for k = 1 fixed.
- $G_{\mathrm{expGol}}$, for k = 3 fixed.
- $G_{\mathrm{expGol}}$, for k integer such that $p^{2^k} \approx 0.5$.
- $G_{\mathrm{Gol}}$, when $m \approx -1/\log_2 p$ and $m = 2^N$.
We can see that it is almost impossible to beat the Golomb code, because it is very near to the theoretical optimum. By choosing k properly, the exp-Golomb encoder works almost as well.
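The exp-Golomb entries for fixed k can be recomputed from the series $E(\mathrm{LCW}) = k + 1 + 2 \sum_{i \ge 0} p^{2^k (2^{i+1} - 1)}$ and from $G_{\mathrm{expGol}} = 1/((1 - p) E(\mathrm{LCW}))$. The sketch below is one way to do this in Python; the function names and the truncation tolerance `tol` are assumptions made for illustration.

```python
def expected_codeword_length(p: float, k: int, tol: float = 1e-15) -> float:
    """E(LCW) of the exp-Golomb code, summing the series until terms vanish."""
    if not 0.0 <= p < 1.0 or k < 0:
        raise ValueError("need 0 <= p < 1 and k >= 0")
    total = k + 1.0
    i = 0
    while True:
        # Exponent 2**k * (2**(i+1) - 1) = 2**k + 2**(k+1) + ... + 2**(k+i).
        exponent = (1 << k) * ((1 << (i + 1)) - 1)
        term = p ** exponent
        if term < tol:        # remaining terms shrink even faster
            return total
        total += 2.0 * term
        i += 1


def exp_golomb_gain(p: float, k: int) -> float:
    """Expected compression gain of the exp-Golomb code for a random bit-vector."""
    return 1.0 / ((1.0 - p) * expected_codeword_length(p, k))


if __name__ == "__main__":
    for p in (0.8, 0.9, 0.99):
        print(p, round(exp_golomb_gain(p, 1), 2), round(exp_golomb_gain(p, 3), 2))
```

The series converges very quickly because the exponents roughly double from term to term, so only a handful of terms are needed even for p close to 1.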
Secondly we examine the clustered case. We define a cluster as a vector consisting of n - i consecutive 0-bits and i consecutive 1-bits. In this case we take p = (n - i)/n. Table 3 shows the compression gains for the exp-Golomb and Golomb codes, when n = 100, 1000 and 10 000. The compression gains have been calculated simply by performing the encoding in each case by hand. The value k = 0 was selected for the exp-Golomb code. The parameter m of the Golomb code was determined in the same way as before on the basis of p, without knowledge of the non-random distribution. Even these simple examples show the applicability of exp-Golomb encoding to clustered cases: the usual Golomb encoding works almost as well as in the random case, but the exp-Golomb code is clearly better in all other cases except those where i = 1 (a single 1-bit at the end of a 0-vector). It is, however, difficult to say what the worst-case situation for the exp-Golomb code is.
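The hand calculation for the exp-Golomb code can be mimicked with a short Python sketch. It rests on one assumption about how a cluster is split into runs: the n - i zeroes and the first 1-bit form one run of length n - i, and each of the remaining i - 1 1-bits terminates an empty run of length 0. The function names are hypothetical.

```python
def exp_golomb_length(s: int, k: int) -> int:
    """Number of bits in the exp-Golomb codeword for run-length s."""
    prefix_len = 0
    block = 1 << k
    while s >= block:             # separate subvectors of 2**k, 2**(k+1), ...
        s -= block
        prefix_len += 1
        block <<= 1
    # prefix + separator + tail of (k + prefix_len) bits
    return prefix_len + 1 + k + prefix_len


def cluster_gain(n: int, i: int, k: int = 0) -> float:
    """Gain for a cluster of n - i 0-bits followed by i 1-bits (1 <= i <= n)."""
    if not 1 <= i <= n:
        raise ValueError("need 1 <= i <= n")
    runs = [n - i] + [0] * (i - 1)    # assumed run decomposition, see above
    compressed = sum(exp_golomb_length(s, k) for s in runs)
    return n / compressed


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        i = n // 5                    # p = (n - i) / n = 0.8
        print(n, round(cluster_gain(n, i), 3))
```

Under this reading, a cluster with n = 100 and i = 20 costs 13 bits for the long run and 19 bits for the empty runs, i.e. a gain of 100/32, which agrees with the first exp-Golomb entry of Table 3 up to rounding.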
Table 2
Comparison of compression gains (the random case)

p        G_max    G_expGol                      G_Gol
                  k = 1    k = 3    opt. k
0.8       1.39     1.28     1.15     1.28        1.35
0.9       2.13     1.91     1.99     1.99        2.10
0.95      3.49     2.96     3.32     3.32        3.45
0.98      7.07     5.54     6.44     6.80        7.04
0.99     12.38     9.20    10.71    11.96       12.33
0.995    22.02    15.67    18.10    21.35       21.34
0.999    87.66    57.79    64.99    85.58       87.00
Table 3
Comparison of compression gains in the case of clusters

p        n = 100              n = 1000             n = 10 000
         G_expGol   G_Gol     G_expGol   G_Gol     G_expGol   G_Gol
0.8        3.13      1.30       4.59      1.26       4.94      1.25
0.9        4.55      1.96       8.48      1.95       9.75      1.95
0.95       5.88      3.33      14.71      3.24      19.01      3.23
0.98       7.14      6.67      26.32      6.67      44.25      6.64
0.99       7.69     12.50      35.71     11.77      79.37     11.71
0.995       -         -        43.48     21.28     131.58     20.96
0.999       -         -        52.63    100.00     277.78     84.03
5. Conclusions
The exp-Golomb code is well suited to the compression of bit-vectors that are clustered, i.e. in which the distribution of bits is not random. Also when the probability p of the 0-bit is not known, it is safe to use this encoder, because it works reasonably well for any small k.
The method could e.g. be applied to compression of inverted files [3] and transmission or storage of digital images [1].
References
[1] L.R. Bahl and H. Kobayashi, Image data compression by predictive coding II: Encoding algorithms, IBM J. Res. Develop. 18 (2) (1974).
[2] S.W. Golomb, Run-length encoding, IEEE Trans. Information Theory IT-12 (1966) 399.
[3] M. Jakobsson and O. Nevalainen, On the compression of inverted files, Report Ser. B No. 15, Dept. of Comput. Science, University of Turku, Finland (1977).
[4] C.E. Shannon and W. Weaver, The Mathematical Theory of Communication (Univ. of Illinois Press, 1949).