
# A compression method for clustered bit-vectors

Volume 7, number 6 INFORMATION PROCESSING LETTERS October 1978

A COMPRESSION METHOD FOR CLUSTERED BIT-VECTORS

Jukka TEUHOLA Department of Computer Science, University of Turku, SF-20500 Turku 50, Finland

Received 30 May 1978

Encoding methods, bit-vector compression

1. Introduction

A bit-vector can be compressed if the frequency of zeroes (or, equally, of ones) differs from 0.5 or if the vector is clustered in some way (i.e. not random). There are several compression methods, some of which are presented in the references [1-3]. The methods can be divided into three types:

(1) Fixed-to-variable encoding: The bit-vector is divided into fixed-length sub-vectors, which are replaced with variable-length codewords.

(2) Variable-to-fixed encoding: The bit-vector is divided into sub-vectors, so-called runs, which consist of consecutive 0-bits terminating with a 1-bit (or vice versa). The number of the 0-bits is called the run-length and it is represented with a fixed-length number.

(3) Variable-to-variable encoding: The run-length is encoded to a variable-length codeword.

The efficiency of a compression method can be expressed with compression gain, which simply means the ratio

$$G = \frac{\text{length of the original bit-vector}}{\text{length of the compressed bit-vector}}.$$

For a random bit-vector it is possible to determine the maximum value of the expectation of G for a given 0-bit probability p, see [4]:

$$G_{\max} = \frac{1}{-p\log_2 p - (1-p)\log_2(1-p)}. \qquad (1)$$

In the following, p is assumed to be greater than 0.5 (without loss of generality).
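Equation (1) is simply the reciprocal of the binary entropy of p, so it is easy to evaluate numerically. The following is a minimal sketch; the function name `max_gain` is an illustrative assumption and does not come from the paper.

```python
import math


def max_gain(p: float) -> float:
    """Maximum expected compression gain of equation (1) for 0-bit probability p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Binary entropy of the source, in bits per original bit.
    entropy = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return 1.0 / entropy


if __name__ == "__main__":
    for p in (0.8, 0.9, 0.99):
        print(p, round(max_gain(p), 2))
```

For p = 0.5 the entropy is one bit per bit and the gain is 1, i.e. no compression is possible; the gain grows as p moves away from 0.5.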

Golomb [2] has described an encoding method, the gain of which is quite near to the maximum gain in a random case. It is a run-length encoding method of the third type. The codeword consists of a variable-length prefix and a fixed-length tail. The idea of the method is illustrated in Fig. 1.

The prefix consists of $\lfloor \text{run-length}/m \rfloor$ 1-bits followed by a 0-bit as a separator. The tail consists of a binary number of $\lceil \log_2 m \rceil$ bits and its value is run-length $- m \cdot \lfloor \text{run-length}/m \rfloor$. The parameter m should be chosen so that $p^m = 0.5$. It is preferable that m is of the form $2^N$. It can be shown [1] that the compression gain for a random bit-vector and for optimal m is

$$G_{\text{Gol}} = \frac{1}{(1-p)\left(\log_2 m + (1-p^m)^{-1}\right)}, \quad \text{where } m = -\frac{1}{\log_2 p}. \qquad (2)$$
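The Golomb codeword described above can be sketched in a few lines. This is an illustration only: it assumes m is a power of two (the preferred form $2^N$), so that the tail has exactly $\log_2 m$ bits, and the function name `golomb_encode` is hypothetical.

```python
def golomb_encode(run_length: int, m: int) -> str:
    """Golomb codeword for one run, as a string of '0'/'1' characters.

    Assumes m is a power of two, so the tail has exactly log2(m) bits.
    """
    if run_length < 0:
        raise ValueError("run_length must be non-negative")
    if m < 1 or m & (m - 1):
        raise ValueError("this sketch requires m to be a power of two")
    tail_bits = m.bit_length() - 1            # log2(m)
    quotient, remainder = divmod(run_length, m)
    prefix = "1" * quotient                   # floor(run_length / m) 1-bits
    separator = "0"
    # run_length - m * floor(run_length / m), written with log2(m) bits.
    tail = format(remainder, f"0{tail_bits}b") if tail_bits else ""
    return prefix + separator + tail


if __name__ == "__main__":
    for s in (0, 5, 99):
        print(s, golomb_encode(s, 4))
```

Because the prefix grows linearly with the run-length, a run much longer than m produces a long prefix; this is the behaviour the exp-Golomb code is meant to avoid.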

If the bit-vector is clustered, i.e. not random, or p is not known beforehand, the Golomb code is no longer nearly optimal. For such cases Bahl and Kobayashi [1] have presented a so-called multi-mode Golomb code, which works better. In this paper another (and simpler) modification of the Golomb code is presented. The method is called exponential-Golomb encoding (abbr. exp-Golomb). It is slightly worse in a random case but its worst-case behaviour is far better than that of the Golomb code.

Fig. 1. An example of the Golomb code. The codeword of a run consists of a prefix, a separator, and the rest of the run-length represented as a binary number of $\lceil \log_2 m \rceil$ bits (= tail).

2. Exp-Golomb encoding method

Fig. 2 illustrates the method. When a run is encoded, we try to separate successively sub-vectors of $2^k, 2^{k+1}, 2^{k+2}, \ldots$ binary zeroes. When this does not succeed any more, i.e. the end of the run (a 1-bit) is encountered, the rest of the run-length is encoded to a binary number. If the length of the last sub-vector was $2^{k+i}$, then k + i + 1 bits are needed to represent the rest of the run. These bits are called the tail and its length is now variable (fixed in the Golomb code).

The power of the method is in the fact that it encodes both short and long runs efficiently, because the successive sub-vectors grow exponentially. The method is not very sensitive to the value of k, but it can be chosen, as for the Golomb code, so that $p^{2^k} \approx 0.5$, if p is known. This is not an exact optimization rule, because the sub-vectors are not fixed, as in the Golomb code. Table 1 shows the codes of run-lengths 0-10 for k = 0, 1 and 2.

The encoding algorithm. Let s = run-length.

Step 1. Determine n such that

$$\sum_{i=k}^{n} 2^i \le s < \sum_{i=k}^{n+1} 2^i \qquad (n \ge k-1).$$

Step 2. Form the prefix of n - k + 1 1-bits.

Step 3. Insert the separator (0-bit).

Step 4. Form the tail: express the value of $s - \sum_{i=k}^{n} 2^i$ as a binary number with n + 1 bits.
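Steps 1-4 can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, codewords are represented as strings of '0'/'1' characters for readability, and the decoder is an addition that simply inverts the four steps.

```python
from typing import Tuple


def exp_golomb_encode(s: int, k: int) -> str:
    """Exp-Golomb codeword of run-length s with parameter k (Steps 1-4)."""
    if s < 0 or k < 0:
        raise ValueError("s and k must be non-negative")
    # Step 1: find n with sum_{i=k}^{n} 2^i <= s < sum_{i=k}^{n+1} 2^i.
    n = k - 1
    covered = 0                       # sum_{i=k}^{n} 2^i (empty sum for n = k - 1)
    while covered + (1 << (n + 1)) <= s:
        n += 1
        covered += 1 << n
    prefix = "1" * (n - k + 1)        # Step 2
    separator = "0"                   # Step 3
    tail_bits = n + 1                 # Step 4
    tail = format(s - covered, f"0{tail_bits}b") if tail_bits else ""
    return prefix + separator + tail


def exp_golomb_decode(bits: str, k: int) -> Tuple[int, int]:
    """Decode one codeword from the front of bits.

    Returns (run_length, number_of_bits_consumed).
    """
    ones = 0
    while ones < len(bits) and bits[ones] == "1":
        ones += 1
    if ones == len(bits):
        raise ValueError("missing separator")
    n = k + ones - 1
    covered = (1 << (n + 1)) - (1 << k)   # closed form of sum_{i=k}^{n} 2^i
    tail_bits = n + 1
    start = ones + 1                      # skip the prefix and the separator
    tail = bits[start:start + tail_bits]
    if len(tail) != tail_bits:
        raise ValueError("truncated codeword")
    value = int(tail, 2) if tail_bits else 0
    return covered + value, start + tail_bits


if __name__ == "__main__":
    for s in range(11):
        print(s, [exp_golomb_encode(s, k) for k in (0, 1, 2)])
```

Since the prefix length tells the decoder how many tail bits follow, codewords can be concatenated without any extra delimiters.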

Fig. 2. An example of the exp-Golomb code (here k = 1). The codeword of a run ends with a separator and the rest of the run-length represented as a binary number of k + 3 bits (= tail).

Table 1
Exp-Golomb codes for some run-lengths

| Run-length | k = 0 | k = 1 | k = 2 |
|---|---|---|---|
| 0 | 0 | 00 | 000 |
| 1 | 100 | 01 | 001 |
| 2 | 101 | 1000 | 010 |
| 3 | 11000 | 1001 | 011 |
| 4 | 11001 | 1010 | 10000 |
| 5 | 11010 | 1011 | 10001 |
| 6 | 11011 | 110000 | 10010 |
| 7 | 1110000 | 110001 | 10011 |
| 8 | 1110001 | 110010 | 10100 |
| 9 | 1110010 | 110011 | 10101 |
| 10 | 1110011 | 110100 | 10110 |

3. Analysis of the code

For a random bit-vector it is easy to determine the expected length of the codeword (= LCW). Let us assume that p = probability of a 0-bit in the original vector and that the bits are independent:

$$
\begin{aligned}
E(\mathrm{LCW}) &= (k+1)\sum_{i=0}^{2^k-1} p^i(1-p) + (k+3)\sum_{i=2^k}^{2^k+2^{k+1}-1} p^i(1-p) \\
&\quad + (k+5)\sum_{i=2^k+2^{k+1}}^{2^k+2^{k+1}+2^{k+2}-1} p^i(1-p) + \cdots \\
&= (k+1)\sum_{i=0}^{\infty} p^i(1-p) + 2\sum_{i=2^k}^{\infty} p^i(1-p) + 2\sum_{i=2^k+2^{k+1}}^{\infty} p^i(1-p) + \cdots \\
&= k + 1 + 2p^{2^k} + 2p^{2^k+2^{k+1}} + \cdots \\
&= k + 1 + 2\sum_{i=0}^{\infty} p^{2^k(2^{i+1}-1)}.
\end{aligned}
$$

The compression gain is

$$G_{\text{exp-Gol}} = \frac{1}{(1-p)\,E(\mathrm{LCW})}, \qquad (3)$$

because the expected run-length (counting the terminating 1-bit) is 1/(1 - p). As we can see, E(LCW) remains tolerable even if p were very near to 1 (the case p = 1 is impossible).

4. Comparison of methods

We first consider the random bit-vector. Table 2 presents compression gains in five cases:

- $G_{\max}$, the information-theoretical maximum compression gain.
- $G_{\text{exp-Gol}}$, for k = 1 fixed.
- $G_{\text{exp-Gol}}$, for k = 3 fixed.
- $G_{\text{exp-Gol}}$, for k integer such that $p^{2^k} \approx 0.5$.
- $G_{\text{Gol}}$, when $m \approx -1/\log_2 p$ and $m = 2^N$.

We can see that it is almost impossible to beat the Golomb code, because it is very near to the theoretical optimum. By choosing k properly, the exp-Golomb encoder works almost as well.

Secondly we examine the clustered case. We define a cluster as a vector consisting of n - i consecutive 0-bits and i consecutive 1-bits. In this case we take p = (n - i)/n. Table 3 shows the compression gains for the exp-Golomb and Golomb codes, when n = 100, 1000 and 10 000. The compression gains have been calculated simply by performing the encoding in each case by hand. The value k = 0 was selected for the exp-Golomb code. The parameter m of the Golomb code was determined in the same way as before on the basis of p, without knowledge of the non-random distribution. Even these simple examples show the applicability of exp-Golomb encoding to clustered cases: the usual Golomb encoding works almost as well as in the random case, but the exp-Golomb code is clearly better in all other cases except those where i = 1 (a single 1-bit at the end of a 0-vector). It is, however, difficult to say what the worst-case situation for the exp-Golomb code is.

Table 2
Comparison of compression gains (the random case)

| p | G_max | G_exp-Gol (k = 1) | G_exp-Gol (k = 3) | G_exp-Gol (opt. k) | G_Gol |
|---|---|---|---|---|---|
| 0.8 | 1.39 | 1.28 | 1.15 | 1.28 | 1.35 |
| 0.9 | 2.13 | 1.91 | 1.99 | 1.99 | 2.10 |
| 0.95 | 3.49 | 2.96 | 3.32 | 3.32 | 3.45 |
| 0.98 | 7.07 | 5.54 | 6.44 | 6.80 | 7.04 |
| 0.99 | 12.38 | 9.20 | 10.71 | 11.96 | 12.33 |
| 0.995 | 22.02 | 15.67 | 18.10 | 21.35 | 21.34 |
| 0.999 | 87.66 | 57.79 | 64.99 | 85.58 | 87.00 |
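The random-case gains can be spot-checked numerically from equations (2) and (3). The sketch below is illustrative only: the function names and the truncation tolerance `tol` are assumptions, the infinite sum in E(LCW) is cut off once its terms become negligible, and the value of m passed to `golomb_gain` is picked by hand as a power of two near $-1/\log_2 p$.

```python
import math


def exp_golomb_expected_length(p: float, k: int, tol: float = 1e-15) -> float:
    """E(LCW) = k + 1 + 2 * sum_{i>=0} p^(2^k * (2^(i+1) - 1)), truncated at tol."""
    total = k + 1.0
    i = 0
    while True:
        term = p ** ((1 << k) * ((1 << (i + 1)) - 1))
        if term < tol:                # remaining terms shrink even faster
            break
        total += 2.0 * term
        i += 1
    return total


def exp_golomb_gain(p: float, k: int) -> float:
    """Compression gain of equation (3)."""
    return 1.0 / ((1.0 - p) * exp_golomb_expected_length(p, k))


def golomb_gain(p: float, m: int) -> float:
    """Compression gain of equation (2) for a given power-of-two m."""
    return 1.0 / ((1.0 - p) * (math.log2(m) + 1.0 / (1.0 - p ** m)))


if __name__ == "__main__":
    print(round(exp_golomb_gain(0.8, 1), 2), round(exp_golomb_gain(0.8, 3), 2))
    print(round(golomb_gain(0.8, 4), 2), round(golomb_gain(0.99, 64), 2))
```

The terms of the sum decay doubly exponentially in i, so only a handful of iterations are needed even for p close to 1.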

Table 3
Comparison of compression gains in the case of clusters

| p | G_exp-Gol (n = 100) | G_Gol (n = 100) | G_exp-Gol (n = 1000) | G_Gol (n = 1000) | G_exp-Gol (n = 10 000) | G_Gol (n = 10 000) |
|---|---|---|---|---|---|---|
| 0.8 | 3.13 | 1.30 | 4.59 | 1.26 | 4.94 | 1.25 |
| 0.9 | 4.55 | 1.96 | 8.48 | 1.95 | 9.75 | 1.95 |
| 0.95 | 5.88 | 3.33 | 14.71 | 3.24 | 19.01 | 3.23 |
| 0.98 | 7.14 | 6.67 | 26.32 | 6.67 | 44.25 | 6.64 |
| 0.99 | 7.69 | 12.50 | 35.71 | 11.77 | 79.37 | 11.71 |
| 0.995 | | | 43.48 | 21.28 | 131.58 | 20.96 |
| 0.999 | | | 52.63 | 100.00 | 277.78 | 84.03 |
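The exp-Golomb figures for the clustered case can be recomputed mechanically. The sketch below rests on one stated assumption about how a cluster is split into runs: the n - i zeroes together with the first 1-bit form one run, and each of the remaining i - 1 one-bits is a run of length 0. The function and parameter names are hypothetical.

```python
def exp_golomb_length(s: int, k: int) -> int:
    """Length in bits of the exp-Golomb codeword of run-length s."""
    n = k - 1
    covered = 0                        # sum_{i=k}^{n} 2^i
    while covered + (1 << (n + 1)) <= s:
        n += 1
        covered += 1 << n
    # (n - k + 1) prefix bits + 1 separator bit + (n + 1) tail bits
    return (n - k + 1) + 1 + (n + 1)


def cluster_compressed_length(total_bits: int, ones: int, k: int = 0) -> int:
    """Compressed size of a cluster of (total_bits - ones) 0-bits followed by ones 1-bits."""
    if not 1 <= ones <= total_bits:
        raise ValueError("need at least one 1-bit and no more than total_bits")
    zeros = total_bits - ones
    # One long run ended by the first 1-bit, then (ones - 1) runs of length 0.
    return exp_golomb_length(zeros, k) + (ones - 1) * exp_golomb_length(0, k)


def cluster_gain(total_bits: int, ones: int, k: int = 0) -> float:
    """Compression gain: original length divided by compressed length."""
    return total_bits / cluster_compressed_length(total_bits, ones, k)


if __name__ == "__main__":
    for total_bits, ones in ((100, 20), (1000, 200), (10000, 10)):
        print(total_bits, ones, cluster_gain(total_bits, ones))
```

Under this reading, n = 100 with p = 0.8 compresses to 32 bits, a gain of 100/32 ≈ 3.13, which agrees with the corresponding entry of Table 3.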

5. Conclusions

The exp-Golomb code is very applicable to the compression of such bit-vectors that are clustered, i.e. the distribution of bits is not random. Also when the probability p of the 0-bit is not known, it is safe to use this encoder, because it works reasonably well for any small k.

The method could e.g. be applied to compression of inverted files [3] and transmission or storage of digital images [1].

References

[1] L.R. Bahl and H. Kobayashi, Image data compression by predictive coding II: Encoding algorithms, IBM J. Res. Develop. 18 (2) (1974).

[2] S.W. Golomb, Run-length encoding, IEEE Trans. Information Theory IT-12 (1966) 399.

[3] M. Jakobsson and O. Nevalainen, On the compression of inverted files, Report Ser. B No. 15, Dept. of Comput. Science, University of Turku, Finland (1977).

[4] C.E. Shannon and W. Weaver, The Mathematical Theory of Communication (Univ. of Illinois Press, 1949).