A compression method for clustered bit-vectors


    Volume 7, number 6    INFORMATION PROCESSING LETTERS    October 1978

    A COMPRESSION METHOD FOR CLUSTERED BIT-VECTORS

    Jukka TEUHOLA, Department of Computer Science, University of Turku, SF-20500 Turku 50, Finland

    Received 30 May 1978

    Encoding methods, bit-vector compression

    1. Introduction

    A bit-vector can be compressed if the frequency of zeroes (or, equally, of ones) differs from 0.5 or if the vector is clustered in some way (i.e. not random). There are several compression methods, some of which are presented in references [1-3]. The methods can be divided into three types:

    (1) Fixed-to-variable encoding: The bit-vector is divided into fixed-length sub-vectors, which are replaced with variable-length codewords.

    (2) Variable-to-fixed encoding: The bit-vector is divided into sub-vectors, so-called runs, which consist of consecutive 0-bits terminating with a 1-bit (or vice versa). The number of the 0-bits is called the run-length and it is represented with a fixed-length number.

    (3) Variable-to-variable encoding: The run-length is encoded to a variable-length codeword.
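As a minimal illustration of the run-length view used by types (2) and (3), the following Python sketch splits a bit-vector into runs of 0-bits, each terminated by a 1-bit. The function name `run_lengths` and the representation of the bit-vector as a string of '0' and '1' characters are assumptions made for this example only.

```python
def run_lengths(bits):
    """Split a '0'/'1' string into runs of 0-bits terminated by a 1-bit.

    Returns (runs, trailing): the list of run-lengths, and the number of
    0-bits left at the end of the vector that no 1-bit terminates.
    """
    runs = []
    zeros = 0
    for b in bits:
        if b == "0":
            zeros += 1
        elif b == "1":
            runs.append(zeros)  # a 1-bit closes the current run
            zeros = 0
        else:
            raise ValueError("bits must contain only '0' and '1'")
    return runs, zeros


print(run_lengths("0001100001"))  # ([3, 0, 4], 0)
```

Note that two adjacent 1-bits produce a run of length 0, which is why a codeword for run-length 0 is needed.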

    The efficiency of a compression method can be expressed with the compression gain, which simply means the ratio

    $$G = \frac{\text{length of the original bit-vector}}{\text{length of the compressed bit-vector}}.$$

    For a random bit-vector it is possible to determine the maximum value of the expectation of G for a given 0-bit probability p, see [4]:

    $$G_{\max} = \frac{1}{-p \log_2 p - (1 - p) \log_2 (1 - p)}. \tag{1}$$
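The bound on the expected gain is the reciprocal of the binary entropy of p, so it can be checked numerically. The sketch below is only an illustration; the function name `g_max` is an assumption, not notation from the paper.

```python
import math


def g_max(p):
    """Maximum expected compression gain for 0-bit probability p (0 < p < 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Binary entropy in bits per source bit.
    entropy = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return 1.0 / entropy


for p in (0.8, 0.9, 0.99):
    print(p, round(g_max(p), 2))
```

For p = 0.5 the entropy is 1 bit per bit, so the gain is 1 and no compression is possible for a random vector.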

    In the following, p is assumed to be greater than 0.5 (without loss of generality).

    Golomb [2] has described an encoding method, the gain of which is quite near to the maximum gain in the random case. It is a run-length encoding method of the third type. The codeword consists of a variable-length prefix and a fixed-length tail. The idea of the method is illustrated in Fig. 1.

    The prefix consists of $\lfloor \text{run-length}/m \rfloor$ 1-bits followed by a 0-bit as a separator. The tail consists of a binary number of $\lceil \log_2 m \rceil$ bits and its value is $\text{run-length} - m \cdot \lfloor \text{run-length}/m \rfloor$. The parameter m should be chosen so that $p^m = 0.5$. It is preferable that m is of the form $2^N$. It can be shown [1] that the compression gain for a random bit-vector and for optimal m is

    $$G_{\text{Gol}} = \frac{1}{(1 - p)\left(\log_2 m + (1 - p^m)^{-1}\right)}, \qquad \text{where } m = -\frac{1}{\log_2 p}. \tag{2}$$
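The Golomb codeword construction and the gain of equation (2) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: m is restricted to powers of two (so the tail has exactly log2 m bits), and the helper `golomb_m`, which rounds -1/log2 p to the nearest power of two on a logarithmic scale, is a hypothetical selection rule, not one given in the paper. All function names are illustrative.

```python
import math


def golomb_encode(run_length, m):
    """Golomb codeword (as a '0'/'1' string) for one run; m must be 2**N."""
    if run_length < 0 or m < 1 or m & (m - 1):
        raise ValueError("need run_length >= 0 and m a power of two")
    quotient, remainder = divmod(run_length, m)
    tail_bits = m.bit_length() - 1            # log2(m) for a power of two
    prefix = "1" * quotient + "0"             # unary part plus separator
    tail = format(remainder, "b").zfill(tail_bits) if tail_bits else ""
    return prefix + tail


def golomb_m(p):
    """Assumed rule: the power of two closest (in log scale) to -1/log2(p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    ideal = -1.0 / math.log2(p)
    return 2 ** max(0, round(math.log2(ideal)))


def golomb_gain(p, m):
    """Expected gain of the Golomb code for a random bit-vector, eq. (2)."""
    return 1.0 / ((1.0 - p) * (math.log2(m) + 1.0 / (1.0 - p ** m)))


print(golomb_encode(9, 4))               # 11001
print(golomb_m(0.9), round(golomb_gain(0.9, golomb_m(0.9)), 2))
```

With p = 0.9 the assumed rule selects m = 8, and the resulting gain agrees with the Golomb column of Table 2 to two decimals.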

    If the bit-vector is clustered, i.e. not random, or p is not known beforehand, the Golomb code is no longer nearly optimal. For such cases Bahl and Kobayashi [1] have presented a so-called multi-mode Golomb code, which works better. In this paper another (and simpler) modification of the Golomb code is presented. The method is called exponential-Golomb encoding (abbr. exp-Golomb). It is slightly worse in the random case but its worst-case behaviour is far better than that of the Golomb code.

    2. Exp-Golomb encoding method

    Fig. 2 illustrates the method. When a run is encoded, we try to separate successively sub-vectors of


    [Figure: a bit-vector with one run of 0-bits marked; labels "Run" and "m".]

    4. Comparison of methods

    Table 1
    Exp-Golomb codes for some run-lengths

    Run-length   k = 0      k = 1      k = 2
    0            0          00         000
    1            100        01         001
    2            101        1000       010
    3            11000      1001       011
    4            11001      1010       10000
    5            11010      1011       10001
    6            11011      110000     10010
    7            1110000    110001     10011
    8            1110001    110010     10100
    9            1110010    110011     10101
    10           1110011    110100     10110
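The codewords of Table 1 follow a regular pattern that can be sketched in Python. The construction below is inferred from the table entries, not quoted from the paper: run-lengths are grouped into consecutive blocks of sizes 2^k, 2^(k+1), 2^(k+2), ...; the prefix is j 1-bits (the number of blocks skipped) followed by a 0-bit, and the tail is the offset within block j written with k + j bits. The function names are assumptions for illustration.

```python
def exp_golomb_encode(run_length, k):
    """Codeword for one run, following the pattern of Table 1."""
    if run_length < 0 or k < 0:
        raise ValueError("run_length and k must be non-negative")
    j, start, size = 0, 0, 1 << k       # block j covers [start, start + size)
    while run_length >= start + size:
        start += size
        size <<= 1                       # each block is twice the previous one
        j += 1
    tail_bits = k + j
    offset = run_length - start
    tail = format(offset, "b").zfill(tail_bits) if tail_bits else ""
    return "1" * j + "0" + tail


def exp_golomb_decode(code, k):
    """Inverse of exp_golomb_encode for a single, complete codeword."""
    j = 0
    while code[j] == "1":                # count the unary prefix
        j += 1
    tail = code[j + 1 : j + 1 + k + j]
    start = (1 << k) * ((1 << j) - 1)    # 2^k + 2^(k+1) + ... + 2^(k+j-1)
    return start + (int(tail, 2) if tail else 0)


for r in range(11):
    print(r, [exp_golomb_encode(r, k) for k in (0, 1, 2)])
```

A codeword in block j has length k + 1 + 2j, so the codeword length grows only logarithmically with the run-length.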

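    $$
    \begin{aligned}
    E(LCW) &= (k + 1) \sum_{i=0}^{2^k - 1} p^i (1 - p) + (k + 3) \sum_{i=2^k}^{2^k + 2^{k+1} - 1} p^i (1 - p) \\
           &\quad + (k + 5) \sum_{i=2^k + 2^{k+1}}^{2^k + 2^{k+1} + 2^{k+2} - 1} p^i (1 - p) + \cdots \\
           &= (k + 1) \sum_{i=0}^{\infty} p^i (1 - p) + 2 \sum_{i=2^k}^{\infty} p^i (1 - p) + 2 \sum_{i=2^k + 2^{k+1}}^{\infty} p^i (1 - p) + \cdots \\
           &= k + 1 + 2 p^{2^k} + 2 p^{2^k + 2^{k+1}} + \cdots \\
           &= k + 1 + 2 \sum_{i=0}^{\infty} p^{2^k (2^{i+1} - 1)}.
    \end{aligned}
    $$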

    The compression gain is

    $$G_{\text{exp-Gol}} = \frac{1}{(1 - p)\, E(LCW)} \tag{3}$$

    because the expected run-length is 1/(1 - p). As we can see, E(LCW) remains tolerable even if p were very near to 1 (the case p = 1 is impossible).
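The series for E(LCW) converges very quickly, because the exponents roughly double from term to term. The following sketch evaluates the closed form and the gain of equation (3) numerically; the function names and the truncation tolerance `tol` are assumptions of this example.

```python
def expected_codeword_length(p, k, tol=1e-15):
    """E(LCW) = k + 1 + 2 * sum_{i>=0} p**(2**k * (2**(i+1) - 1))."""
    if not 0.0 < p < 1.0 or k < 0:
        raise ValueError("need 0 < p < 1 and k >= 0")
    total = k + 1.0
    i = 0
    while True:
        exponent = (1 << k) * ((1 << (i + 1)) - 1)
        term = 2.0 * p ** exponent
        total += term
        if term < tol:                   # remaining terms are negligible
            return total
        i += 1


def exp_golomb_gain(p, k):
    """Expected gain of the exp-Golomb code for a random bit-vector, eq. (3)."""
    return 1.0 / ((1.0 - p) * expected_codeword_length(p, k))


for p in (0.8, 0.9):
    print(p, round(exp_golomb_gain(p, 1), 2), round(exp_golomb_gain(p, 3), 2))
```

For p = 0.8 and p = 0.9 the printed values agree with the k = 1 and k = 3 columns of Table 2.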

    We first consider the random bit-vector. Table 2 presents compression gains in five cases:

    - $G_{\max}$, the information-theoretical maximum compression gain.
    - $G_{\text{exp-Gol}}$, for k = 1 fixed.
    - $G_{\text{exp-Gol}}$, for k = 3 fixed.
    - $G_{\text{exp-Gol}}$, for k integer such that $p^{2^k} \approx 0.5$.
    - $G_{\text{Gol}}$, when $m \approx -1/\log_2 p$ and $m = 2^N$.

    We can see that it is almost impossible to beat the Golomb code, because it is very near to the theoretical optimum. By choosing k properly, the exp-Golomb encoder works almost as well.

    Secondly we examine the clustered case. We define a cluster as a vector consisting of n - i consecutive 0-bits and i consecutive 1-bits. In this case we take p = (n - i)/n. Table 3 shows the compression gains for the exp-Golomb and Golomb codes, when n = 100, 1000 and 10 000. The compression gains have been calculated simply by performing the encoding in each case by hand. The value k = 0 was selected for the exp-Golomb code. The parameter m of the Golomb code was determined in the same way as before on the basis of p, without knowledge of the non-random distribution. Even these simple examples show the applicability of exp-Golomb encoding to clustered cases: the usual Golomb encoding works almost as well as in the random case, but the exp-Golomb code is clearly better in all other cases except those where i = 1 (a single 1-bit at the end of a 0-vector). It is, however, difficult to say what the worst-case situation for the exp-Golomb code is.
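The exp-Golomb figures for the clustered case can be recomputed with a few lines of Python. The sketch assumes the codeword pattern of Table 1 (a codeword in block j has k + 1 + 2j bits) and that a cluster of n - i 0-bits followed by i 1-bits is encoded as one run of length n - i and i - 1 runs of length 0. The function names are illustrative, and only the exp-Golomb column is covered.

```python
def exp_golomb_length(run_length, k):
    """Number of bits in the exp-Golomb codeword for one run."""
    j, start, size = 0, 0, 1 << k
    while run_length >= start + size:    # find the block holding run_length
        start += size
        size <<= 1
        j += 1
    return k + 1 + 2 * j                 # (j + 1) prefix bits + (k + j) tail bits


def cluster_gain(n, i, k=0):
    """Gain for a cluster of n - i 0-bits followed by i 1-bits (1 <= i <= n)."""
    if not 1 <= i <= n:
        raise ValueError("need 1 <= i <= n")
    runs = [n - i] + [0] * (i - 1)       # every 1-bit terminates one run
    compressed = sum(exp_golomb_length(r, k) for r in runs)
    return n / compressed


print(round(cluster_gain(100, 20), 2))       # 3.13  (p = 0.8, n = 100)
print(round(cluster_gain(1000, 5), 2))       # 43.48 (p = 0.995, n = 1000)
print(round(cluster_gain(10000, 2000), 2))   # 4.94  (p = 0.8, n = 10 000)
```

For example, with n = 100 and i = 20 the long run costs 13 bits and each of the 19 empty runs costs 1 bit, giving 100/32.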

    Table 2
    Comparison of compression gains (the random case)

    p        G_max    G_exp-Gol                       G_Gol
                      k = 1     k = 3     opt. k
    0.8      1.39     1.28      1.15      1.28        1.35
    0.9      2.13     1.91      1.99      1.99        2.10
    0.95     3.49     2.96      3.32      3.32        3.45
    0.98     7.07     5.54      6.44      6.80        7.04
    0.99     12.38    9.20      10.71     11.96       12.33
    0.995    22.02    15.67     18.10     21.35       21.34
    0.999    87.66    57.79     64.99     85.58       87.00
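As a rough cross-check of the "opt. k" column, the sketch below searches small values of k for the one that maximises the expected gain. This brute-force search is an assumption of the example and replaces the rule that k be chosen so that p^(2^k) is approximately 0.5; the function names and the search bound `k_max` are likewise hypothetical.

```python
def exp_golomb_gain(p, k, tol=1e-15):
    """1 / ((1 - p) * E(LCW)), with E(LCW) summed until terms drop below tol."""
    expected_length = k + 1.0
    i = 0
    while True:
        term = 2.0 * p ** ((1 << k) * ((1 << (i + 1)) - 1))
        expected_length += term
        if term < tol:
            break
        i += 1
    return 1.0 / ((1.0 - p) * expected_length)


def best_k(p, k_max=16):
    """Return (k, gain) maximising the expected gain over k = 0 .. k_max."""
    gains = [(exp_golomb_gain(p, k), k) for k in range(k_max + 1)]
    gain, k = max(gains)
    return k, gain


for p in (0.8, 0.99):
    k, gain = best_k(p)
    print(p, k, round(gain, 2))
```

For p = 0.8 and p = 0.99 the search selects k = 1 and k = 6, and the gains agree with the "opt. k" column to two decimals.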


    Table 3
    Comparison of compression gains in the case of clusters

             n = 100               n = 1000              n = 10 000
    p        G_exp-Gol  G_Gol      G_exp-Gol  G_Gol      G_exp-Gol  G_Gol
    0.8      3.13       1.30       4.59       1.26       4.94       1.25
    0.9      4.55       1.96       8.48       1.95       9.75       1.95
    0.95     5.88       3.33       14.71      3.24       19.01      3.23
    0.98     7.14       6.67       26.32      6.67       44.25      6.64
    0.99     7.69       12.50      35.71      11.77      79.37      11.71
    0.995    -          -          43.48      21.28      131.58     20.96
    0.999    -          -          52.63      100.00     277.78     84.03

    5. Conclusions

    The exp-Golomb code is very applicable to the compression of bit-vectors that are clustered, i.e. in which the distribution of bits is not random. Also, when the probability p of the 0-bit is not known, it is safe to use this encoder, because it works reasonably well for any small k.

    The method could, e.g., be applied to compression of inverted files [3] and transmission or storage of digital images [1].

    References

    [1] L.R. Bahl and H. Kobayashi, Image data compression by predictive coding II: Encoding algorithms, IBM J. Res. Develop. 18 (2) (1974).

    [2] S.W. Golomb, Run-length encoding, IEEE Trans. Information Theory IT-12 (1966) 399.

    [3] M. Jakobsson and O. Nevalainen, On the compression of inverted files, Report Ser. B No. 15, Dept. of Comput. Science, University of Turku, Finland (1977).

    [4] C.E. Shannon and W. Weaver, The Mathematical Theory of Communication (Univ. of Illinois Press, 1949).
