Source Coding
Data Compression, May 7, 2012
A.J. Han Vinck
2
DATA COMPRESSION
NO LOSS of information and exact reproduction (modest compression ratio, around 1:4)
• general problem statement:
“find a means for spending as little time as possible on packing as much of data as possible into as little space as possible, and with no loss of information”
3
GENERAL IDEA:
• represent likely symbols with short binary codewords,
where "likely" is derived from
- prediction of the next symbol in the source output:
  q-ue  q-ua  q-ui  q-uo   (which symbol follows q?)
  q-00  q-01  q-10  q-11
- context between the source symbols: words, sounds, context in pictures
4
Why compress?
1. Lossless compression often reduces file size by 40% to 80%.
2. More economical to transport and store.
3. Most Internet content is compressed for transmission.
4. Compression before encryption can make code-breaking more difficult.
5. Conserves battery power and storage space on mobile devices.
6. Compression and decompression can be hardwired.
5
Some history
• 1948 – Shannon-Fano coding
• 1952 – Huffman coding
  – reduced redundancy in symbol coding
  – demonstrably optimal variable-length symbol coding
• 1977 – Lempel-Ziv coding
  – first major "dictionary method"
  – maps repeated word patterns to code words
6
MODEL KNOWLEDGE
best performance: exact prediction!
exact prediction: no new information!
no new information: no message to transmit
7
Example: no prediction
source: C
message  0    1    2    3    4    5    6    7
code     000  001  010  011  100  101  110  111
representation length: L = 3 bits/message
8
Example with prediction: ENCODE THE DIFFERENCE
The source output C is predicted by P; only the difference C − P is encoded.
difference    -1   0   +1
probability   .25  .5  .25
code          00   1   01
L = .25 · 2 + .5 · 1 + .25 · 2 = 1.5 bit/difference symbol
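The difference-encoding step can be sketched in a few lines of Python (a minimal illustration, not from the original slides; `encode_differences` and the sample sequence are invented for the example, and successive differences are assumed to lie in {−1, 0, +1}):

```python
# Variable-length code for the differences, as on the slide:
# -1 -> "00", 0 -> "1", +1 -> "01".
CODE = {-1: "00", 0: "1", 1: "01"}

def encode_differences(samples):
    """Encode a sequence by coding successive differences (assumed in {-1,0,+1})."""
    bits = []
    prev = samples[0]
    for s in samples[1:]:
        bits.append(CODE[s - prev])
        prev = s
    return "".join(bits)

seq = [3, 3, 4, 4, 3, 3, 3, 4]
# differences: 0, +1, 0, -1, 0, 0, +1  ->  "1011001101"
print(encode_differences(seq))
```

With 7 differences encoded in 10 bits, the average is about 1.43 bits per difference here; over many symbols with the stated probabilities it approaches the 1.5 bits of the slide.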
9
binary tree codes: the relation between source symbols and codewords
A := 11
B := 10
C := 0
(each branch of the tree is labeled 0 or 1; codewords are read from root to leaf)
General properties:
- every node has two successors: leaves and/or nodes
- the path to a leaf gives the corresponding codeword
- source letters are assigned only to leaves,
  i.e. no codeword is the prefix of another codeword
10
tree codes
Tree codes are prefix codes and uniquely decodable,
i.e. a string of codewords can be uniquely decomposed
into the individual codewords.
Non-prefix codes may also be uniquely decodable.
Example:
A := 1
B := 10
C := 100
11
Proper trees
We look at trees that make sense: they occupy all possible leaves!
12
binary tree codes
The average codeword length is $L = \sum_{i=1}^{M} P_i n_i$.
Property: an optimal code has minimum L.
Homework: show that L = sum (node probabilities).
13
Tree encoding (1)
For data / text the compression should be lossless: no errors.
STEP 1: assign messages to nodes (a first, unoptimized assignment):
symbol  P(i)    codeword  n_i P(i)
a       0.5     111       1.5
b       0.25    110       0.75
c       0.125   10        0.25
d       0.0625  01        0.125
e       0.0625  00        0.125
AVERAGE CODEWORD LENGTH: L = 2.75 bit/source symbol
14
Tree encoding (2)
STEP 2: OPTIMIZE THE ASSIGNMENT (minimize the average length):
symbol  P(i)    codeword  n_i P(i)
a       0.5     0         0.5
b       0.25    10        0.5
c       0.125   110       0.375
d       0.0625  1110      0.25
e       0.0625  1111      0.25
AVERAGE CODEWORD LENGTH: L = 1.875 bit/source symbol!
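As a quick numeric check of the optimized assignment, the average length can be recomputed directly (a minimal sketch; the dictionary names are illustrative):

```python
# Probabilities and the optimized prefix code from the slide.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.0625, "e": 0.0625}
code  = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "1111"}

# Average codeword length L = sum_i P(i) * n_i.
L = sum(p * len(code[s]) for s, p in probs.items())
print(L)  # 1.875
```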
15
Kraft inequality
• Prefix codes with M code words satisfy the Kraft inequality:
$\sum_{k=1}^{M} 2^{-n_k} \le 1$,
where $n_k$ is the code word length for message k.
• Proof: let $n_M$ be the longest codeword length. Then, in a code tree of depth $n_M$, the terminal node for codeword k eliminates $2^{n_M - n_k}$ of the $2^{n_M}$ available nodes at depth $n_M$:
$\sum_{k=1}^{M} 2^{n_M - n_k} \le 2^{n_M}.$
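The inequality is easy to test numerically for a proposed set of lengths (a minimal sketch; `satisfies_kraft` is an invented helper name):

```python
def satisfies_kraft(lengths):
    """Check the Kraft inequality: sum_k 2**(-n_k) <= 1."""
    return sum(2.0 ** -n for n in lengths) <= 1.0

# Lengths of the optimized code {0, 10, 110, 1110, 1111}: sum is exactly 1.
print(satisfies_kraft([1, 2, 3, 4, 4]))  # True

# No binary prefix code can have lengths {1, 2, 2, 2}: the sum is 1.25.
print(satisfies_kraft([1, 2, 2, 2]))     # False
```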
16
example
Depth $n_M$ = 4: a terminal node at depth 1 eliminates 8 of the 16 nodes at depth 4, a terminal node at depth 2 eliminates 4, and a terminal node at depth 3 eliminates 2.
Homework: can we replace ≤ by = in the Kraft inequality?
17
Kraft inequality (converse)
• Suppose that the length specification of M code words satisfies the Kraft inequality:
$\sum_{k=1}^{M} 2^{-n_k} = \sum_{i=1}^{n_M} N_i 2^{-i} \le 1,$
where $N_i$ is the number of code words of length i.
Then, we can construct a prefix code with the specified lengths.
• Note that, for every $j \le n_M$, multiplying $\sum_{i=1}^{j} N_i 2^{-i} \le 1$ by $2^j$ gives
$N_j \le 2^j - \sum_{i=1}^{j-1} N_i \, 2^{j-i}.$
18
Kraft inequality
• From this, for the first levels:
$N_1 \le 2$
$N_2 \le 2^2 - 2 N_1$
$N_3 \le 2^3 - 2^2 N_1 - 2 N_2$
• Interpretation: at every level fewer nodes are used than are available! E.g. at level 3, we have 8 nodes minus the nodes cancelled by levels 1 and 2.
19
performance
• Suppose that we select the code word lengths as $n_k = \lceil -\log_2 p_k \rceil$, i.e. $2^{-n_k} \le p_k$.
• Then, a prefix code exists, since
$\sum_{k=1}^{M} 2^{-n_k} \le \sum_{k=1}^{M} p_k = 1,$
with average length
$L = \sum_{k=1}^{M} p_k n_k < \sum_{k=1}^{M} p_k \left( -\log_2 p_k + 1 \right) = H(U) + 1.$
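This length assignment can be sketched in a few lines (illustrative code, assuming the dyadic distribution from the earlier tree-encoding slides; `shannon_lengths` is an invented name):

```python
import math

def shannon_lengths(probs):
    """Shannon length assignment n_k = ceil(-log2 p_k).
    These lengths satisfy the Kraft inequality, so a prefix code exists."""
    return [math.ceil(-math.log2(p)) for p in probs]

p = [0.5, 0.25, 0.125, 0.0625, 0.0625]
n = shannon_lengths(p)
H = -sum(pi * math.log2(pi) for pi in p)   # source entropy H(U)
L = sum(pi * ni for pi, ni in zip(p, n))   # average codeword length

print(n)               # [1, 2, 3, 4, 4]
print(H <= L < H + 1)  # True
```

For this dyadic distribution the assignment is exact: L = H(U) = 1.875 bits.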
20
Lower bound for prefix codes
• We show that $H(U) \le L = \sum_{i=1}^{M} P_i n_i$.
• We write
$H(U) - L = \sum_{i=1}^{M} P_i \log_2 \frac{1}{P_i} - \sum_{i=1}^{M} P_i n_i = \sum_{i=1}^{M} P_i \log_2 \frac{2^{-n_i}}{P_i}$
$\le (\log_2 e) \sum_{i=1}^{M} P_i \left( \frac{2^{-n_i}}{P_i} - 1 \right) = (\log_2 e) \left( \sum_{i=1}^{M} 2^{-n_i} - 1 \right) \le 0,$
using $\ln x \le x - 1$ and the Kraft inequality.
• Equality can be established for $n_i = -\log_2 P_i$ for all i.
21
Summary so far
1. Prefix codes satisfy the Kraft inequality $\sum_{k=1}^{M} 2^{-n_k} \le 1$.
2. If the length assignment of a code satisfies the Kraft inequality, then a prefix code exists.
3. Shannon gives a length assignment $n_k = \lceil -\log_2 p_k \rceil$, i.e. $2^{-n_k} \le p_k$, with efficiency
$H(U) \le L = \sum_{k=1}^{M} p_k n_k < \sum_{k=1}^{M} p_k \left( -\log_2 p_k + 1 \right) = H(U) + 1.$
22
Huffman coding (1)
The average codeword length is $L = \sum_{i=1}^{M} P_i n_i$.
Property: an optimal code has minimum L.
Property: for an optimal code, the two least probable codewords have the same length, are the longest, and (by manipulating the assignment) differ only in the last code digit. Hence
$L = \sum_{i=1}^{M-2} P_i n_i + P_{M-1} n_{M-1} + P_M n_M, \quad n_{M-1} = n_M.$
Homework: proof.
23
Huffman coding: optimality (2)
Given a code C with average length L and M symbols, construct C′:
(for C, the codewords for the two least probable symbols differ only in the last digit)
1. replace the two least probable symbols $C_M$ and $C_{M-1}$ in C by the symbol $C_{M-1}'$ with probability $P_{M-1}' = P_M + P_{M-1}$;
2. to minimize L, we have to minimize L′, since with $n_{M-1} = n_M = n_{M-1}' + 1$:
$L = \sum_{i=1}^{M-2} P_i n_i + (P_M + P_{M-1})(n_{M-1}' + 1)$
$L' = \sum_{i=1}^{M-2} P_i n_i + P_{M-1}' \, n_{M-1}'$
$L = L' + P_M + P_{M-1}.$
24
Huffman Coding: (JPEG, MPEG, MP3)
• 1 take together the two smallest probabilities: P(i) + P(j)
• 2 replace symbols i and j by a new symbol with probability P(i) + P(j)
• 3 go to 1, until only one symbol is left
Example:
• Probabilities: 0.3 0.25 0.25 0.1 0.1
• Step 1: 0.3 0.25 0.25 0.2
• Step 2: 0.45 0.3 0.25
• Step 3: 0.55 0.45
• Step 4: 1.00
Resulting code: 0.3 → 11, 0.25 → 10, 0.25 → 01, 0.1 → 001, 0.1 → 000
25
Properties
ADVANTAGES:
– uniquely decodable code
– smallest average codeword length
DISADVANTAGES:
– LARGE tables give complexity
– variable word length
– sensitive to channel errors
26
Conclusion Huffman
Tree coding (Huffman) is not universal! It is only valid for one particular type of source!
For COMPUTER DATA, data reduction should be:
– lossless: no errors at reproduction
– universal: effective for different types of data
27
Some comments
• The Huffman code is not unique, but efficiency is the same!
• For alphabets larger than 2 small modifications are necessary (where?)
28
Performance Huffman
• Using the probability distribution for the source U, a prefix code exists with average length
L < H(U) + 1
Since Huffman is optimum, this bound is also true for Huffman codes. Problem if H(U) → 0!
• Improvements can be made when we take J symbols together, then
JH(U) ≤ L < J H(U) + 1 and H(U) ≤ L’ = L/J < H(U) + 1/J
example
• Probabilities: 0.35 0.25 0.2 0.1 0.05 0.05
• Step 1 0.35 0.25 0.2 0.1 0.1
• Step 2 0.35 0.25 0.2 0.2
• Step 3 0.35 0.25 0.4
• Step 4 0.6 0.4
29
Assign bits
At every merge, label the two merged entries with 1 and 0:
• Probabilities: 0.35 0.25 0.2 0.1 0.05(1) 0.05(0)
• Step 1: 0.35 0.25 0.2 0.1(1) 0.1(0)
• Step 2: 0.35 0.25 0.2(1) 0.2(0)
• Step 3: 0.35(1) 0.25(0) 0.4
• Step 4: 0.6(1) 0.4(0)
average length = 2.3, entropy = 2.25 bit/symbol
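The merge procedure can be sketched with a binary heap (an illustrative implementation, not the original course code; tie-breaking may differ from the slide, but the resulting average length is the same):

```python
import heapq

def huffman_lengths(probs):
    """Codeword lengths from repeated merging of the two smallest probabilities."""
    # Heap items: (probability, unique id, list of symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:  # every symbol below the merge point gains one bit
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, uid, s1 + s2))
        uid += 1
    return lengths

p = [0.35, 0.25, 0.2, 0.1, 0.05, 0.05]
n = huffman_lengths(p)
print(n)  # [2, 2, 2, 3, 4, 4]
print(sum(pi * ni for pi, ni in zip(p, n)))  # average length, about 2.3 bit/symbol
```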
30
31
Encoding idea Lempel-Ziv-Welch (LZW)
Assume we have just read a segment w from the text; a is the next symbol.
● If wa is not in the dictionary:
  - write the index of w in the output file,
  - add wa to the dictionary, and set w ← a.
● If wa is in the dictionary:
  - process the next symbol with segment wa.
32
Encoding example
• address 0: a, address 1: b, address 2: c
• String: a a b a a c a b c a b c b
read             remark                                   output  update
a a              aa not in dictionary, output 0, add aa   0       aa → 3
a a b            continue with a; store ab in dictionary  0       ab → 4
a a b a          continue with b; store ba in dictionary  1       ba → 5
a a b a a c      aa in dictionary, aac not                3       aac → 6
a a b a a c a                                             2       ca → 7
a a b a a c a b c                                         4       abc → 8
a a b a a c a b c a b                                     7       cab → 9
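The encoding loop can be sketched as follows (illustrative Python, reproducing the slide's example; `lzw_encode` is an invented name):

```python
def lzw_encode(text, alphabet):
    """LZW encoding: output the dictionary index of the longest known
    segment w, then add wa to the dictionary and continue with a."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    w, out = "", []
    for a in text:
        if w + a in dictionary:
            w += a                          # segment wa is known: extend it
        else:
            out.append(dictionary[w])       # emit index of w
            dictionary[w + a] = len(dictionary)  # add wa at the next address
            w = a
    out.append(dictionary[w])               # flush the final segment
    return out

print(lzw_encode("aabaacabcabcb", "abc"))  # [0, 0, 1, 3, 2, 4, 7, 1, 2, 1]
```

The first seven indices match the slide's table; the last three encode the tail "b c b".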
33
UNIVERSAL (LZW) decoder
1. Start with the basic symbol set.
2. Read a code c from the compressed file. The address c in the dictionary determines the segment w; write w in the output file.
3. Add wa to the dictionary, where a is the first letter of the next segment.
34
Decoding example
• address 0: a, address 1: b, address 2: c
String                 input  output  update
a ?                    0      a
a a !                  0      a       ? = a, update aa → 3
a a b .                1      b       ! = b, update ab → 4
a a b a a .            3      aa      update ba → 5
a a b a a c .          2      c       update aac → 6
a a b a a c a b .      4      ab      update ca → 7
a a b a a c a b c a .  7      ca      update abc → 8
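The matching decoder can be sketched as follows (illustrative Python; it also handles the special case of a received index that is not yet in the dictionary, the W2 = W1A situation discussed on the "Problem and solution" slide):

```python
def lzw_decode(codes, alphabet):
    """LZW decoding: rebuild the dictionary one step behind the encoder."""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    w = dictionary[codes[0]]
    out = [w]
    for c in codes[1:]:
        # Special case: index not yet in the dictionary. Then the new entry
        # is w plus its own first letter (W2 = W1A with A = first letter of W1).
        entry = dictionary[c] if c in dictionary else w + w[0]
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]  # add wa; a = first letter
        w = entry
    return "".join(out)

print(lzw_decode([0, 0, 1, 3, 2, 4, 7, 1, 2, 1], "abc"))  # "aabaacabcabcb"
```

Decoding the index stream of the encoding example recovers the original string exactly.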
35
Conclusion (LZW)
IDEA: TRY to copy long parts of the source output
– if the dictionary overflows: throw the least-recently used entry away in encoder and decoder
– universal
– lossless
Homework: encode/decode the sequence 1001010110011...
Try to solve the problem that occurs!
36
Some history
• GIF, TIFF, V.42bis modem compression standard, PostScript Level 2
– 1977 published by Abraham Lempel and Jacob Ziv
– 1984 LZ-Welch algorithm published in IEEE Computer
– 1986 Sperry patent transferred to Unisys
– the GIF file format required use of the LZW algorithm
37
references
J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, May 1977.
Terry Welch, A Technique for High-Performance Data Compression, Computer, June 1984.
38
Summary of operations
• ENCODING   output    update  location
  W1 A       loc(W1)   W1A     N
  W2 F       loc(W2)   W2F     N+1
  W3 X       loc(W3)   W3X     N+2
• DECODING   input     update  location
             loc(W1)   W1?
             loc(W2)   W1A     N      (the first letter of W2 completes W1?)
             loc(W3)   W2F     N+1    (the first letter of W3 completes W2?)
39
Problem and solution
• ENCODING       output    update  location
  W1 A           loc(W1)   W1A     N
  W2 = W1A, F    loc(W2)   W2F     N+1
• DECODING       input           update  location
                 loc(W1)         W1?
                 loc(W2 = W1A)   W2?     N
The decoder receives loc(W2) = N before entry N is complete. Since W2 = W1A, the unknown letter A is the first letter of W2, and hence the first letter of W1: the ? can be solved, and W2 is updated at location N as W1A.
40
Shannon-Fano coding
Suppose that we have a source with M symbols. Every symbol $u_i$ occurs with probability $P(u_i)$.
We try to encode symbol $u_i$ with $n_i = \lceil -\log_2 P(u_i) \rceil < -\log_2 P(u_i) + 1$ bits.
Then the average representation length is
$\sum_{i=1}^{M} P(u_i) n_i < \sum_{i=1}^{M} P(u_i) \left( -\log_2 P(u_i) \right) + \sum_{i=1}^{M} P(u_i) = H(U) + 1.$
41
code realization
Define
$Q(u_0) = 0,$
$Q(u_k) = \sum_{j=0}^{k-1} P(u_j), \quad 1 \le k \le M,$
with the probabilities ordered so that $P(u_i) \ge P(u_j)$ for $j \ge i + 1$, and
$n_i = \lceil -\log_2 P(u_i) \rceil, \quad 0 \le i \le M-1.$
Then: the average length satisfies $\bar{n} < H(U) + 1$.
42
continued
The codeword for $u_i$ is the binary expansion of $Q(u_i)$ of length $n_i$.
Property: the code is a prefix code with the promised length.
Proof: let $i \ge k + 1$. Then
$Q(u_i) = \sum_{j=0}^{i-1} P(u_j), \quad Q(u_k) = \sum_{j=0}^{k-1} P(u_j),$
thus:
$Q(u_i) - Q(u_k) = \sum_{j=k}^{i-1} P(u_j) \ge P(u_k) \ge 2^{-n_k}.$
43
continued
1. The binary radix-2 representations of $Q(u_i)$ and $Q(u_k)$ therefore differ at least once within the first $n_k$ positions.
2. The codeword for $u_k$ has length $n_k = \lceil -\log_2 P(u_k) \rceil$.
3. Hence the truncated representation of $Q(u_k)$ can never be a prefix of the codeword for $u_i$.
44
example
P(u0 u1 u2 u3 u4 u5 u6 u7) = (5/16, 3/16, 1/8, 1/8, 3/32, 1/16, 1/16, 1/32)
 i   P(u_i)  Q(u_i)  n_i = ⌈−log2 P(u_i)⌉  codeword (Q truncated to n_i bits)
 0   5/16    0       2                     00
 1   3/16    5/16    3                     010
 2   1/8     1/2     3                     100
 3   1/8     5/8     3                     101
 4   3/32    3/4     4                     1100
 5   1/16    27/32   4                     1101
 6   1/16    29/32   4                     1110
 7   1/32    31/32   5                     11111
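The construction can be sketched with exact fractions, so the binary expansions of Q(u_i) are computed without rounding (illustrative code, not from the slides; `shannon_fano_codes` is an invented name, and the probabilities are assumed to be sorted largest first):

```python
import math
from fractions import Fraction

def shannon_fano_codes(probs):
    """Shannon-Fano construction: Q(u_k) = sum of the preceding probabilities;
    codeword = first ceil(-log2 P) bits of the binary expansion of Q."""
    codes = []
    Q = Fraction(0)
    for p in probs:                     # assumed sorted in decreasing order
        n = math.ceil(-math.log2(p))    # promised codeword length
        frac, bits = Q, ""
        for _ in range(n):              # binary expansion of Q, truncated to n bits
            frac *= 2
            if frac >= 1:
                bits += "1"
                frac -= 1
            else:
                bits += "0"
        codes.append(bits)
        Q += p
    return codes

p = [Fraction(5, 16), Fraction(3, 16), Fraction(1, 8), Fraction(1, 8),
     Fraction(3, 32), Fraction(1, 16), Fraction(1, 16), Fraction(1, 32)]
print(shannon_fano_codes(p))
# ['00', '010', '100', '101', '1100', '1101', '1110', '11111']
```

The output reproduces the codeword column of the table above.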
45
Enumerative coding
Suppose there are pn ones in a long sequence of length n.
According to Shannon, we need ~ n·h(p) bits to represent every such sequence.
How do we realize the encoding and decoding?
46
Enumerative coding
Solution: lexicographical ordering.
Example: 2 ones in a sequence of length 6
14  1 1 0 0 0 0
•••
 9  0 1 1 0 0 0
 8  0 1 0 1 0 0
 7  0 1 0 0 1 0
 6  0 1 0 0 0 1
 5  0 0 1 1 0 0
 4  0 0 1 0 1 0
 3  0 0 1 0 0 1
 2  0 0 0 1 1 0
 1  0 0 0 1 0 1
 0  0 0 0 0 1 1
Encode: sequence # = number of sequences with lower lexicographical order.
Decode: reconstruct the sequence from its sequence #.
47
Enumerative encoding
Example: index for sequence 0 1 0 1 0 0 = 8
0 1 0 1 0 0
For the first 1: there are 6 sequences with prefix 00, i.e. with 2 ones in the remaining part of length 4.
For the second 1: there are 2 sequences with prefix 0100, i.e. with 1 one in the remaining part of length 2.
Index = 6 + 2 = 8.
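The index computation generalizes to any length and weight via binomial coefficients (a sketch; `enum_index` is an invented name):

```python
from math import comb

def enum_index(bits):
    """Lexicographic index: count the same-weight sequences that come before.
    Each time a 1 appears, all sequences with a 0 in that position
    (and the same prefix) are lexicographically smaller."""
    n, idx = len(bits), 0
    k = sum(bits)                      # ones still to place
    for i, b in enumerate(bits):
        if b == 1:
            idx += comb(n - 1 - i, k)  # a 0 here leaves k ones for the tail
            k -= 1
    return idx

print(enum_index([0, 1, 0, 1, 0, 0]))  # 8, as on the slide
```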
48
Enumerative decoding
Given: a sequence of length 6 with 2 ones. What is the sequence for index 8?
There are 10 sequences with prefix 0 (2 ones in the remaining length 5); since 8 < 10, the sequence starts with 0.
There are 6 sequences with prefix 00 (2 ones in the remaining length 4); since 8 ≥ 6, the sequence starts with 01, and 8 − 6 = 2 remains.
There are 3 sequences with prefix 010 (1 one in the remaining length 3); since 2 < 3, the sequence starts with 010 and not 011.
There are 2 sequences with prefix 0100 (1 one in the remaining length 2); since 2 ≥ 2, the sequence starts with 0101, and 2 − 2 = 0 remains.
Hence, the sequence is 0 1 0 1 0 0, with index 6 + 2 = 8.
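The decoding steps above can be sketched as the inverse walk (illustrative helper, not from the slides):

```python
from math import comb

def enum_decode(n, k, idx):
    """Invert the enumerative index: at each position, placing a 0 keeps
    comb(remaining, k) sequences below; place a 1 when idx reaches past them."""
    bits = []
    for i in range(n):
        below = comb(n - 1 - i, k)  # sequences that have a 0 in this position
        if idx < below:
            bits.append(0)
        else:
            bits.append(1)
            idx -= below
            k -= 1
    return bits

print(enum_decode(6, 2, 8))  # [0, 1, 0, 1, 0, 0]
```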
49
Enumerative encoding: performance
The number of bits per n source outputs for pn ones:
$\frac{1}{n} \log_2 \binom{n}{pn} \to h(p).$
Asymptotically: efficiency h(p) bits per source output.
Note added: for words of length n,
- encode first the number of ones in a block with log2(n+1) bits,
- then do the enumerative encoding with h(p) bits per source output.
The contribution (log2(n+1))/n disappears for large n!
50
David A. HuffmanIn 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman's professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer's memory.
Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. "It was the most singular moment of my life," Huffman says. "There was the absolute lightning of sudden realization."
51
The inventors
Abraham Lempel
Jacob Ziv
LZW (Lempel-Ziv-Welch) is an implementation of a lossless data compression algorithm created by Lempel and Ziv. It was published by Terry Welch in 1984 as an improved version of the LZ78 dictionary coding algorithm developed by Abraham Lempel and Jacob Ziv.
52
Intuitive Lempel Ziv (be careful!)
• A source generates independent symbols 0 and 1: p(1) = 1 − p(0) = p.
• Then:
  – there are roughly $2^{nh(p)}$ typical sequences,
  – every typical sequence has $p(t) \approx 2^{-nh(p)}$.
• We expect that in a binary sequence of length $N = 2^{nh(p)}$, every typical sequence occurs once (with very high probability).
53
Intuitive Lempel Ziv (be careful!)
Idea for the algorithm:
Start with an initial sequence of length N.
a. Generate a string of length n (which is typical with high probability).
b. Transmit its starting position in the string of length N with log2 N bits; if not present, transmit the n bits as they occur.
c. Delete the first n bits of the initial sequence and append the newly generated n bits. Go back to a, unless end of the source sequence.
54
Intuitive Lempel Ziv (be careful!)
EFFICIENCY: the new n bits are typical with probability 1 − ε, where ε → 0
- if non-typical, transmit 0, followed by the n bits;
- if typical, transmit 1, followed by log2 N bits for the position in the block.
Hence, the average #bits/source output:
(1 − ε)(log2 N)/n + ε + 1/n → h(p) bits/source output, for large n and ε → 0!
NOTE:
- if p changes, we can adapt N and n, or choose some worst-case value in advance;
- the typical words can also be stored in a memory. The algorithm then outputs the location of the new word. Every time, a new word is entered into the memory and one word is deleted. Why is this not a good solution?
55
Another approach (1)
- Suppose that a source generates N independent M-ary symbols:
  source x = (x1, x2, •••, xN), xi ∈ {1, 2, •••, M}
- The frequency of symbol i is f_i, and thus f_i N symbols i occur.
- We call F = (f1, f2, •••, fM) the composition of x.
- Then, the number of different vectors x for a given F is
$|x_F| = \frac{N!}{(f_1 N)! \, (f_2 N)! \cdots (f_M N)!}$
and the number of bits needed to represent x is
$\frac{1}{N} \log_2 |x_F| \approx \log_2 N - \sum_{i=1}^{M} f_i \log_2 f_i N = -\sum_{i=1}^{M} f_i \log_2 f_i$ (the entropy!)
56
en- and decoding
To transmit the value of F, we need M log2 N bits (each of the M frequencies takes log2 N bits).
For large N, f_i → p_i, and thus $-\sum_{i=1}^{M} f_i \log_2 f_i$ is equal to the Shannon entropy!
source x → encoder → { F (composition), lexicographical index for x } → decoder → x
Lexicographical en- and decoding is a solved problem in computer science.