
Page 1:

Source Coding

Data Compression

May 7, 2012

A.J. Han Vinck

Page 2:

DATA COMPRESSION

NO LOSS of information and exact reproduction (low compression ratio, typically around 1:4)

• general problem statement:

“find a means for spending as little time as possible on packing as much data as possible into as little space as possible, with no loss of information”

Page 3:

GENERAL IDEA:

• represent likely symbols with short binary words

where “likely” is derived from

- prediction of the next symbol in the source output

  example: in English, “q” is almost always followed by “u”, so only the letter after the “u” needs to be coded:

  q-ue  q-ua  q-ui  q-uo   →   q-00  q-01  q-10  q-11

- context between the source symbols: words, sounds, context in pictures

Page 4:

Why compress?

1. Lossless compression often reduces file size by 40% to 80%
2. More economical to transport and store
3. Most Internet content is compressed for transmission
4. Compression before encryption can make code-breaking difficult
5. Conserve battery power and storage space on mobile devices
6. Compression and decompression can be hardwired

Page 5:

Some history

• 1948 – Shannon-Fano coding

• 1952 – Huffman coding
  – reduced redundancy in symbol coding
  – demonstrably optimal fixed-to-variable length symbol coding

• 1977 – Lempel-Ziv coding
  – first major “dictionary method”
  – maps repeated word patterns to code words

Page 6:

MODEL KNOWLEDGE

• best performance: exact prediction!

• exact prediction: no new information!

• no new information: no message to transmit

Page 7:

Example: no prediction

source with 8 messages:

message   0     1     2     3     4     5     6     7
code      000   001   010   011   100   101   110   111

representation length: $\log_2 8 = 3$ bit/message

Page 8:

Example with prediction: ENCODE the DIFFERENCE

(block diagram: source → subtract prediction P → code C)

difference    -1     0     1
probability   .25    .5    .25
code          00     1     01

L = .25 × 2 + .5 × 1 + .25 × 2 = 1.5 bit/difference symbol
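To make the idea concrete, here is a minimal Python sketch of difference coding with the prefix code above (an added illustration; the sample sequence is a made-up, slowly varying source, and the slides give no code):

```python
# Difference coding with the prefix code from the table above.
CODE = {-1: "00", 0: "1", 1: "01"}           # codewords for the differences
DECODE = {v: k for k, v in CODE.items()}

def encode(samples):
    """Send the first sample as-is, then one codeword per difference."""
    return "".join(CODE[cur - prev] for prev, cur in zip(samples, samples[1:]))

def decode(first, bits):
    """Undo the differences; the prefix property needs no lookahead."""
    samples, buf, cur = [first], "", first
    for b in bits:
        buf += b
        if buf in DECODE:
            cur += DECODE[buf]
            samples.append(cur)
            buf = ""
    return samples

seq = [5, 5, 6, 6, 5, 5]                     # hypothetical source output
bits = encode(seq)                           # "1011001": 7 bits for 5 differences
print(decode(seq[0], bits) == seq)           # True
```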

Page 9:

binary tree codes

A binary tree gives the relation between source symbols and codewords:

A := 11

B := 10

C := 0

(tree figure: branches labeled 1 and 0; a codeword is read off along the path from the root to a leaf)

General properties:

- every node has two successors: leaves and/or nodes

- the path to a leaf gives the corresponding codeword

- source letters are assigned only to leaves,
  i.e. no codeword is a prefix of another codeword
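As a small added illustration (not part of the slides), decoding a bit string with this tree code in Python; a buffer is extended until it matches a codeword, which the prefix property makes unambiguous:

```python
CODE = {"A": "11", "B": "10", "C": "0"}      # the tree code above
INV = {v: k for k, v in CODE.items()}

def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in INV:                        # no codeword is a prefix of another,
            out.append(INV[buf])              # so the first match is the right one
            buf = ""
    return "".join(out)

msg = "ABACCB"
bits = "".join(CODE[s] for s in msg)          # "1110110010"
print(decode(bits) == msg)                    # True
```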

Page 10:

tree codes

Tree codes are prefix codes and uniquely decodable

i.e. a string of codewords can be uniquely decomposed

into the individual codewords

Non-prefix codes may be uniquely decodable

example (decodable by counting the zeros that follow each 1):

A:= 1

B:= 10

C:= 100

Page 11:

Proper trees

We look at trees that make sense: occupy all possible leaves!

Page 12:

binary tree codes

The average codeword length is $L = \sum_{i=1}^{M} P_i n_i$

Property: an optimal code has minimum L

Homework: show that L = sum of the (internal) node probabilities

Page 13:

Tree encoding (1)

For data / text the compression should be lossless: no errors.

STEP 1: assign messages to nodes (one possible, non-optimized assignment):

symbol   P(i)     codeword   n_i P(i)
a        0.5      111        1.5
b        0.25     110        0.75
c        0.125    10         0.25
d        0.0625   01         0.125
e        0.0625   00         0.125

AVERAGE CODEWORD LENGTH: L = 2.75 bit/source symbol

Page 14:

Tree encoding (2)

STEP 2: OPTIMIZE the ASSIGNMENT (MINIMIZE the average length):

symbol   P(i)     codeword   n_i P(i)
e        0.0625   1111       0.25
d        0.0625   1110       0.25
c        0.125    110        0.375
b        0.25     10         0.5
a        0.5      0          0.5

AVERAGE CODEWORD LENGTH: L = 1.875 bit/source symbol!

Page 15:

Kraft inequality

• Prefix codes with M code words satisfy the Kraft inequality:

$\sum_{k=1}^{M} 2^{-n_k} \le 1$

where $n_k$ is the code word length for message k.

• Proof: let $n_M$ be the longest codeword length. Then, in a code tree of depth $n_M$, the terminal node of codeword k eliminates $2^{n_M - n_k}$ of the $2^{n_M}$ available nodes at depth $n_M$:

$\sum_{k=1}^{M} 2^{n_M - n_k} \le 2^{n_M}$

Page 16:

example

(figure: a tree of depth 4; a codeword at depth 1 eliminates 8 terminal nodes, at depth 2 eliminates 4, at depth 3 eliminates 2)

Homework: can we replace ≤ by = in the Kraft inequality?

Page 17:

Kraft inequality

• Suppose that the length specification of M code words satisfies the Kraft inequality $\sum_{k=1}^{M} 2^{-n_k} \le 1$. Then, we can construct a prefix code with the specified lengths.

Note that, with $N_i$ the number of code words of length i:

$\sum_{i=1}^{n_M} N_i \, 2^{-i} = \sum_{k=1}^{M} 2^{-n_k} \le 1$

and hence, for every $j \le n_M$:

$\sum_{i=1}^{j} N_i \, 2^{-i} \le 1$
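A short Python sketch (an addition, not from the slides) that checks the Kraft inequality for a length specification and, if it holds, constructs a prefix code by assigning codewords in order of increasing length:

```python
def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality."""
    return sum(2 ** -n for n in lengths)

def prefix_code_from_lengths(lengths):
    """Build codewords with the given lengths (requires Kraft sum <= 1).

    The next codeword of length n is the n-bit binary expansion of the
    running sum q of 2^-n over the codewords assigned so far.
    """
    assert kraft_sum(lengths) <= 1.0
    q, codes = 0.0, []
    for n in sorted(lengths):
        codes.append(format(round(q * 2 ** n), f"0{n}b"))
        q += 2 ** -n
    return codes

print(kraft_sum([1, 2, 3, 4, 4]))             # 1.0: inequality met with equality
print(prefix_code_from_lengths([1, 2, 3, 4, 4]))
# ['0', '10', '110', '1110', '1111']
```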

Page 18:

Kraft inequality

• From this, level by level:

$N_1 \le 2$

$N_2 \le 2^2 - 2 N_1$

$N_3 \le 2^3 - 2^2 N_1 - 2 N_2$

•••

• Interpretation: at every level, fewer nodes are used than are available! E.g. for level 3, we have $2^3 = 8$ nodes minus the nodes cancelled by levels 1 and 2.

Page 19:

performance

• Suppose that we select the code word lengths as $n_k = \lceil -\log_2 p_k \rceil$, i.e. $2^{-n_k} \le p_k$

• Then, a prefix code exists, since $\sum_{k=1}^{M} 2^{-n_k} \le \sum_{k=1}^{M} p_k = 1$

• with average length

$L = \sum_{k=1}^{M} p_k n_k < \sum_{k=1}^{M} p_k \left( -\log_2 p_k + 1 \right) = H(U) + 1$
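A brief Python illustration (added; the probability vector is a made-up example) of this length choice and the resulting bound $H(U) \le L < H(U) + 1$:

```python
from math import ceil, log2

def shannon_lengths(probs):
    """n_k = ceil(-log2 p_k), so 2^-n_k <= p_k and Kraft holds automatically."""
    return [ceil(-log2(p)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]                    # hypothetical source
lengths = shannon_lengths(probs)                # [2, 2, 3, 4]
H = -sum(p * log2(p) for p in probs)            # entropy, about 1.85 bit
L = sum(p * n for p, n in zip(probs, lengths))  # average length 2.4 bit
print(lengths, H <= L < H + 1)                  # [2, 2, 3, 4] True
```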

Page 20:

Lower bound for prefix codes

• We show that $H(U) \le L = \sum_{i=1}^{M} P_i n_i$

• We write

$H(U) - L = \sum_{i=1}^{M} P_i \log_2 \frac{1}{P_i} - \sum_{i=1}^{M} P_i n_i = \sum_{i=1}^{M} P_i \log_2 \frac{2^{-n_i}}{P_i}$

$\le \log_2 e \sum_{i=1}^{M} P_i \left( \frac{2^{-n_i}}{P_i} - 1 \right) = \log_2 e \left( \sum_{i=1}^{M} 2^{-n_i} - 1 \right) \le 0$

(using $\ln x \le x - 1$ and the Kraft inequality)

• Equality can be established for $n_i = -\log_2 P_i$ for all i

Page 21:

Summary so far

1. Prefix codes satisfy the Kraft inequality: $\sum_{k=1}^{M} 2^{-n_k} \le 1$

2. If the length assignment of a code satisfies the Kraft inequality, then a prefix code exists

3. Shannon gives a length assignment $n_k = \lceil -\log_2 p_k \rceil$ with efficiency

$L = \sum_{k=1}^{M} p_k n_k < \sum_{k=1}^{M} p_k \left( -\log_2 p_k + 1 \right) = H(U) + 1$

Page 22:

Huffman coding (1)

The average codeword length is

$L = \sum_{i=1}^{M} P_i n_i = \sum_{i=1}^{M-2} P_i n_i + P_{M-1} n_{M-1} + P_M n_M$

Property: an optimal code has minimum L

Property: for an optimal code, the two least probable codewords have the same length, are the longest, and (by manipulating the assignment) differ only in the last code digit

Homework: prove these properties

Page 23:

Huffman Coding: optimality (2)

Given a code C with average length L and M symbols, construct C‘:

(for C: the codewords for the two least probable symbols differ only in the last digit)

1. replace the 2 least probable symbols $C_M$ and $C_{M-1}$ in C by the symbol $C_{M-1}'$ with probability $P_{M-1}' = P_M + P_{M-1}$ (its codeword is one digit shorter: $n_{M-1}' = n_M - 1$)

2. to minimize L, we have to minimize L‘, since

$L = \sum_{i=1}^{M-2} P_i n_i + (P_M + P_{M-1}) \, n_M$

$L' = \sum_{i=1}^{M-2} P_i n_i + (P_M + P_{M-1}) (n_M - 1)$

$\Rightarrow \quad L = L' + P_M + P_{M-1}$

Page 24:

Huffman Coding (used in JPEG, MPEG, MP3)

• 1: take the two smallest probabilities together: P(i) + P(j)
• 2: replace symbols i and j by a new symbol with this probability
• 3: go to 1, until one symbol is left

Example: probabilities 0.3, 0.25, 0.25, 0.1, 0.1

merges: 0.1 + 0.1 → 0.2;  0.25 + 0.2 → 0.45;  0.3 + 0.25 → 0.55;  0.55 + 0.45 → 1.00

resulting code:
0.3  → 11
0.25 → 10
0.25 → 01
0.1  → 001
0.1  → 000

Page 25:

Properties

ADVANTAGES:
– uniquely decodable code
– smallest average codeword length

DISADVANTAGES:
– LARGE tables give complexity
– variable word length
– sensitive to channel errors

Page 26:

Conclusion Huffman

Tree coding (Huffman) is not universal! It is only valid for one particular type of source!

For COMPUTER DATA, data reduction should be:

– lossless: no errors at reproduction
– universal: effective for different types of data

Page 27:

Some comments

• The Huffman code is not unique, but efficiency is the same!

• For code alphabets larger than 2 (non-binary codes), small modifications are necessary (where?)

Page 28:

Performance Huffman

• Using the probability distribution for the source U, a prefix code exists with average length

L < H(U) + 1

Since Huffman is optimum, this bound is also true for Huffman codes. Problem if H(U) → 0: the overhead of up to 1 bit/symbol then dominates.

• Improvements can be made when we take J symbols together, then

J·H(U) ≤ L < J·H(U) + 1  and  H(U) ≤ L’ = L/J < H(U) + 1/J

Page 29:

example

• Probabilities: 0.35 0.25 0.2 0.1 0.05 0.05

• Step 1 0.35 0.25 0.2 0.1 0.1

• Step 2 0.35 0.25 0.2 0.2

• Step 3 0.35 0.25 0.4

• Step 4 0.6 0.4


Page 30:

Assign bits

• Probabilities: 0.35 0.25 0.2 0.1 0.05 0.05   (merge the last two: bits 1, 0)

• Step 1: 0.35 0.25 0.2 0.1 0.1   (merge the last two: bits 1, 0)

• Step 2: 0.35 0.25 0.2 0.2   (merge the last two: bits 1, 0)

• Step 3: 0.35 0.25 0.4   (merge 0.35 and 0.25: bits 1, 0)

• Step 4: 0.6 0.4   (bits 1, 0)

average length = 2.3, entropy = 2.25 bit/symbol
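A compact Python sketch of this merging procedure (an added illustration, not the author's code); run on the example above it reproduces the average length 2.3:

```python
import heapq
from itertools import count

def huffman(probs):
    """Repeatedly merge the two least probable entries; grow codewords bitwise."""
    tick = count()                            # tie-breaker keeps tuples comparable
    heap = [(p, next(tick), {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)       # smallest probability: prepend '0'
        p1, _, c1 = heapq.heappop(heap)       # second smallest: prepend '1'
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, next(tick), merged))
    return heap[0][2]                         # symbol -> codeword

probs = [0.35, 0.25, 0.2, 0.1, 0.05, 0.05]
code = huffman(probs)
print(sum(p * len(code[i]) for i, p in enumerate(probs)))   # 2.3
```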

Page 31:

Encoding idea: Lempel-Ziv-Welch (LZW)

Assume we have just read a segment w from the text; a is the next symbol.

● If wa is not in the dictionary:
  – write the index of w in the output file
  – add wa to the dictionary, and set w ← a

● If wa is in the dictionary:
  – process the next symbol with segment wa

Page 32:

Encoding example

• address 0: a, address 1: b, address 2: c
• String: a a b a a c a b c a b c b

read so far              remark                                   output   update
a a                      aa not in dictionary, output 0           0        aa → 3
a a b                    continue with a, store ab                0        ab → 4
a a b a                  continue with b, store ba                1        ba → 5
a a b a a c              aa in dictionary, aac not                3        aac → 6
a a b a a c a            ca not in dictionary                     2        ca → 7
a a b a a c a b c        ab in dictionary, abc not                4        abc → 8
a a b a a c a b c a b    ca in dictionary, cab not                7        cab → 9
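A minimal Python LZW encoder (an added sketch; the three-letter alphabet matches the example). On the string above it produces 0, 0, 1, 3, 2, 4, 7, ... as in the output column:

```python
def lzw_encode(text, alphabet="abc"):
    """Grow segment w while w+a is in the dictionary; else emit w and add w+a."""
    dic = {s: i for i, s in enumerate(alphabet)}
    out, w = [], ""
    for a in text:
        if w + a in dic:
            w += a                            # known segment: keep extending
        else:
            out.append(dic[w])                # write the index of w
            dic[w + a] = len(dic)             # new entry at the next address
            w = a
    out.append(dic[w])                        # flush the final segment
    return out

print(lzw_encode("aabaacabcabcb"))            # [0, 0, 1, 3, 2, 4, 7, 1, 2, 1]
```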

Page 33:

UNIVERSAL (LZW) (decoder)

1. Start with the basic symbol set.

2. Read a code c from the compressed file.
   – The address c in the dictionary determines the segment w.
   – Write w in the output file.

3. Add wa to the dictionary: a is the first letter of the next segment.

Page 34:

Decoding example

• address 0: a, address 1: b, address 2: c

decoded so far           input   remark                                   update
a ?                      0       output a
a a !                    0       output a determines ? = a, update aa     aa → 3
a a b .                  1       output 1 determines ! = b, update ab     ab → 4
a a b a a .              3       output aa, update ba                     ba → 5
a a b a a c .            2       output c, update aac                     aac → 6
a a b a a c a b .        4       output ab, update ca                     ca → 7
a a b a a c a b c a .    7       output ca, update abc                    abc → 8

Page 35:

Conclusion (LZW)

IDEA: TRY to copy long parts of the source output

– if the dictionary overflows: throw the least-recently used entry away in encoder and decoder
– universal
– lossless

Homework: encode/decode the sequence 1001010110011...

Try to solve the problem that occurs!

Page 36:

Some history

• GIF, TIFF, V.42bis modem compression standard, PostScript Level 2

– 1977 published by Abraham Lempel and Jacob Ziv

– 1984 LZ-Welch algorithm published in IEEE Computer

– Sperry patent transferred to Unisys (1986)
– GIF file format required use of the LZW algorithm

Page 37:

References

J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Transactions on Information Theory, May 1977.

T. Welch, “A Technique for High-Performance Data Compression,” IEEE Computer, June 1984.

Page 38:

Summary of operations

• ENCODING:    read      output     update   location
               W1 A      loc(W1)    W1A      N
               W2 F      loc(W2)    W2F      N+1
               W3 X      loc(W3)    W3X      N+2

• DECODING:    input      output   update   location
               loc(W1)    W1       ?
               loc(W2)    W2       W1A      N
               loc(W3)    W3       W2F      N+1

(the decoder can complete entry N = W1A only after receiving the next index: A is the first letter of W2)

Page 39:

Problem and solution

• ENCODING:    read            output     update   location
               W1 A            loc(W1)    W1A      N
               W2 = W1A, F     loc(W2)    W2F      N+1

• DECODING:    input             output   update   location
               loc(W1)           W1       ?
               loc(W2 = W1A)     ?        W1A      N

The decoder receives loc(N) before entry N is complete. Since W2 = W1A, the ? can be resolved: W2 starts with W1 and its last letter A is the first letter of W1, so entry N is updated as W1A.
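A matching Python decoder sketch (added; same assumed alphabet as the encoding example). The `else` branch is exactly the problem case above: the received index is the entry still being built, so the segment must be w plus its own first letter:

```python
def lzw_decode(codes, alphabet="abc"):
    """Rebuild the text; resolve indices that point at the incomplete entry."""
    dic = {i: s for i, s in enumerate(alphabet)}
    w = dic[codes[0]]
    out = [w]
    for c in codes[1:]:
        if c in dic:
            entry = dic[c]
        else:                                 # c = loc(W2) with W2 = W1 A:
            entry = w + w[0]                  # W2 is w plus w's first letter
        out.append(entry)
        dic[len(dic)] = w + entry[0]          # now the pending entry is complete
        w = entry
    return "".join(out)

print(lzw_decode([0, 0, 1, 3, 2, 4, 7, 1, 2, 1]))   # aabaacabcabcb
```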

Page 40:

Shannon-Fano coding

Suppose that we have a source with M symbols. Every symbol $u_i$ occurs with probability $P(u_i)$.

We try to encode symbol $u_i$ with $n_i = \lceil -\log_2 P(u_i) \rceil < -\log_2 P(u_i) + 1$ bits.

Then the average representation length is

$\sum_i P(u_i) \, n_i < \sum_i P(u_i) \left( -\log_2 P(u_i) \right) + \sum_i P(u_i) = H(U) + 1$

Page 41:

code realization

Define:

$Q(u_0) = 0, \qquad Q(u_k) = \sum_{j=0}^{k-1} P(u_j), \quad 1 \le k \le M-1$

$n_i = \lceil -\log_2 P(u_i) \rceil, \quad 0 \le i \le M-1$

with the symbols ordered by non-increasing probability: $P(u_i) \le P(u_j)$ for $j \le i-1$.

Then: $\bar{n} \le H(U) + 1$.

Page 42:

continued

Define: the codeword for $u_i$ is the binary expansion of $Q(u_i)$, of length $n_i$.

Property: the code is a prefix code with the promised length.

Proof: let $i \ge k+1$. Then

$Q(u_i) = \sum_{j=0}^{i-1} P(u_j), \qquad Q(u_k) = \sum_{j=0}^{k-1} P(u_j)$

thus:

$Q(u_i) - Q(u_k) = \sum_{j=k}^{i-1} P(u_j) \ge P(u_k) \ge 2^{-n_k}$

Page 43:

continued

1. The binary (radix-2) representations of $Q(u_i)$ and $Q(u_k)$ therefore differ at least in position $n_k$.

2. The codewords for $Q(u_i)$ and $Q(u_k)$ have length $\ge \lceil -\log_2 P(u_k) \rceil = n_k$.

3. Hence the truncated representation for $Q(u_k)$ can never be a prefix of the codeword for $u_i$.

Page 44:

example

P(u_0, u_1, ..., u_7) = (5/16, 3/16, 1/8, 1/8, 3/32, 1/16, 1/16, 1/32)

i    P(u_i)   n_i = ⌈-log_2 P(u_i)⌉   Q(u_i)   binary expansion   codeword (first n_i bits)
0    5/16     2                        0        0.00000            00
1    3/16     3                        5/16     0.01010            010
2    1/8      3                        1/2      0.10000            100
3    1/8      3                        5/8      0.10100            101
4    3/32     4                        3/4      0.11000            1100
5    1/16     4                        27/32    0.11011            1101
6    1/16     4                        29/32    0.11101            1110
7    1/32     5                        31/32    0.11111            11111

(truncate the binary expansion of Q(u_i) after n_i bits)
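A short Python sketch of this construction (an added illustration); with the probabilities of the example it reproduces the codeword column:

```python
from fractions import Fraction
from math import ceil, log2

def shannon_fano(probs):
    """Codeword for u_i: binary expansion of Q(u_i), truncated to n_i bits.

    probs must be sorted in non-increasing order (as on the slides).
    """
    q, codes = Fraction(0), []
    for p in probs:
        n = ceil(-log2(p))                    # n_i = ceil(-log2 P(u_i))
        codes.append(format(int(q * 2 ** n), f"0{n}b"))  # first n bits of Q
        q += p                                # Q(u_{i+1}) = Q(u_i) + P(u_i)
    return codes

P = [Fraction(5, 16), Fraction(3, 16), Fraction(1, 8), Fraction(1, 8),
     Fraction(3, 32), Fraction(1, 16), Fraction(1, 16), Fraction(1, 32)]
print(shannon_fano(P))
# ['00', '010', '100', '101', '1100', '1101', '1110', '11111']
```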

Page 45:

Enumerative coding

Suppose there are pn ones in a long sequence of length n.

According to Shannon, we need ~ n·h(p) bits to represent every such sequence, where h(p) is the binary entropy function.

How do we realize the encoding and decoding?

Page 46:

Enumerative coding

Solution: do lexicographical ordering

Example: 2 ones in sequence of length 6

14   1 1 0 0 0 0

•••

 9   0 1 1 0 0 0

 8   0 1 0 1 0 0

 7   0 1 0 0 1 0

 6   0 1 0 0 0 1

 5   0 0 1 1 0 0

 4   0 0 1 0 1 0

 3   0 0 1 0 0 1

 2   0 0 0 1 1 0

 1   0 0 0 1 0 1

 0   0 0 0 0 1 1

Encode: sequence # = # of sequences with lower lexicographical order

Decode: reconstruct the sequence from the sequence #

Page 47:

Enumerative encoding

Example: index for sequence 0 1 0 1 0 0 = 8

0 1 0 1 0 0

Replacing the first 1 by a 0: there are 6 sequences with that prefix, i.e. a length-4 tail with 2 ones, C(4,2) = 6, all of lower order.

Replacing the second 1 by a 0: there are 2 sequences with that prefix, i.e. a length-2 tail with 1 one, C(2,1) = 2, all of lower order.

index = 6 + 2 = 8

Page 48:

Enumerative decoding

Given: a sequence of length 6 with 2 ones. What is the sequence for index 8?

There are C(5,2) = 10 sequences with prefix 0 (length-5 tail, 2 ones); 8 < 10, hence the sequence starts with 0.

There are C(4,2) = 6 sequences with prefix 00 (length-4 tail, 2 ones); 8 ≥ 6, hence the sequence starts with 01, and 8 - 6 = 2 remains.

There are C(3,1) = 3 sequences with prefix 010 (length-3 tail, 1 one); 2 < 3, hence the sequence starts with 010 and not 011.

There are C(2,1) = 2 sequences with prefix 0100 (length-2 tail, 1 one); 2 ≥ 2, hence the sequence starts with 0101, and 2 - 2 = 0 remains.

Result: 0 1 0 1 0 0, index 6 + 2 = 8.
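Both directions in a small Python sketch (an added illustration of the scheme, not the author's code):

```python
from math import comb

def enum_encode(bits):
    """Index = number of equal-weight sequences of lower lexicographic order."""
    index, ones = 0, sum(bits)
    for i, b in enumerate(bits):
        if b == 1:
            # count the sequences with a 0 here and all `ones` ones in the tail
            index += comb(len(bits) - i - 1, ones)
            ones -= 1
    return index

def enum_decode(index, n, ones):
    """Reconstruct the length-n sequence with `ones` ones from its index."""
    bits = []
    for i in range(n):
        c = comb(n - i - 1, ones)             # sequences that put a 0 here
        if index < c:
            bits.append(0)
        else:
            bits.append(1)
            index -= c
            ones -= 1
    return bits

print(enum_encode([0, 1, 0, 1, 0, 0]))        # 8
print(enum_decode(8, 6, 2))                   # [0, 1, 0, 1, 0, 0]
```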

Page 49:

Enumerative encoding: performance

The number of bits per n source outputs for pn ones:

$\frac{1}{n} \log_2 \binom{n}{pn} \approx h(p)$

Asymptotically: Efficiency h(p) bits per source output

Note added: for words of length n,

- encode first the number of ones in a block with log2(n+1) bits,

- then do the enumerative encoding with h(p) bits per source output

The contribution $(\log_2(n+1))/n$ disappears for large n!

Page 50:

David A. Huffman

In 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman's professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer's memory.

Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. "It was the most singular moment of my life," Huffman says. "There was the absolute lightning of sudden realization."

Page 51:

The inventors

(photos of Abraham Lempel and Jacob Ziv)

LZW (Lempel-Ziv-Welch) is an implementation of a lossless data compression algorithm created by Lempel and Ziv. It was published by Terry Welch in 1984 as an improved version of the LZ78 dictionary coding algorithm developed by Abraham Lempel and Jacob Ziv.

Page 52:

Intuitive Lempel Ziv (be careful !)

• A source generates independent symbols 0 and 1: p(1) = 1 - p(0) = p

• Then:
  – there are roughly $2^{nh(p)}$ typical sequences of length n,
  – every typical sequence has $p(t) \approx 2^{-nh(p)}$

• We expect that in a binary sequence of length $N = 2^{nh(p)}$, every typical sequence occurs once (with very high probability)

Page 53:

Intuitive Lempel Ziv (be careful !)

Idea for the Algorithm:

Start with an initial sequence of length N

a. Generate a string of length n (which is typical with high probability)

b. Transmit its starting position in the string of length N with $\log_2 N$ bits; if it is not present, transmit the n bits as they occur

c. Delete the first n bits of the initial sequence and append the newly generated n bits. Go back to a, unless the end of the source sequence is reached

Page 54:

Intuitive Lempel Ziv (be careful!)

EFFICIENCY: the new n bits are typical with probability $1 - \epsilon$, where $\epsilon \to 0$

- if not typical: transmit 0, followed by the n bits
- if typical: transmit 1, followed by $\log_2 N$ bits for the position in the block

hence the average # bits per source output is

$(1 - \epsilon)\frac{\log_2 N}{n} + \epsilon + \frac{1}{n} \to h(p)$ bits/source output, for large n and $\epsilon \to 0$!

NOTE:
- if p changes, we can adapt N and n, or choose some worst-case value in advance
- the typical words can also be stored in a memory; the algorithm then outputs the location of the new word. Every time a new word is entered into the memory, one word is deleted. Why is this not a good solution?

Page 55:

Another approach (1)

- Suppose that a source generates N independent M-ary symbols: $x = (x_1, x_2, \cdots, x_N)$, $x_i \in \{1, 2, \cdots, M\}$

- The frequency of symbol i is $f_i$, and thus $f_i N$ symbols i occur

- We call $F = (f_1, f_2, \cdots, f_M)$ the composition of x

- Then, the number of different vectors x for a given F is

$|x|_F = \frac{N!}{(f_1 N)! \, (f_2 N)! \cdots (f_M N)!}$

and the number of bits needed to represent x is

$\frac{1}{N} \log_2 |x|_F \approx \log_2 N - \frac{1}{N} \sum_{i=1}^{M} f_i N \log_2 (f_i N) = -\sum_{i=1}^{M} f_i \log_2 f_i \quad$ (entropy!)
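A quick numerical check in Python (added; the composition is a made-up example) that the index length per symbol approaches the entropy:

```python
from math import factorial, log2

def bits_per_symbol(N, freqs):
    """(1/N) * log2( N! / prod_i (f_i N)! ) for a composition F = freqs."""
    count = factorial(N)
    for f in freqs:
        count //= factorial(round(f * N))     # exact integer division
    return log2(count) / N

freqs = [0.5, 0.25, 0.25]                     # hypothetical composition F
H = -sum(f * log2(f) for f in freqs)          # entropy: 1.5 bit/symbol
for N in (8, 64, 512):
    print(N, round(bits_per_symbol(N, freqs), 3))
# the printed values approach H = 1.5 as N grows
```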

Page 56:

en- and decoding

(block diagram: source x → encoder → F (composition) + lexicographical index for x → decoder → x)

To transmit the value of F, we need $\frac{M \log_2 N}{N}$ output bits per source letter.

For large N, $f_i \to p_i$, and thus $-\sum_{i=1}^{M} f_i \log_2 f_i$ is equal to the Shannon entropy!

Lexicographical en- and decoding is a solved problem in computer science.