
Source Coding

Data Compression
May 7, 2012

A.J. Han Vinck

2

DATA COMPRESSION

NO LOSS of information and exact reproduction (compression ratios are relatively low, typically around 1:4)

• general problem statement:

“find a means for spending as little time as possible on packing as much of data as possible into as little space as possible, and with no loss of information”

3

GENERAL IDEA:

• represent likely symbols with short binary words

where "likely" is derived from

- prediction of the next symbol in the source output

  q-ue   q-ua   q-ui   q-uo      q-?
  q-00   q-01   q-10   q-11

- context between the source symbols: words, sounds, context in pictures

4

Why compress?

1. Lossless compression often reduces file size by 40% to 80%
2. More economical to transport and store
3. Most Internet content is compressed for transmission
4. Compression before encryption can make code-breaking difficult
5. Conserve battery power and storage space on mobile devices
6. Compression and decompression can be hardwired

5

Some history

• 1948 – Shannon-Fano coding

• 1952 – Huffman coding
  – reduced redundancy in symbol coding
  – demonstrably optimal symbol-by-symbol (variable-length) coding

• 1977 – Lempel-Ziv coding
  – first major "dictionary method"
  – maps repeated word patterns to code words

6

MODEL KNOWLEDGE

  best performance: exact prediction!

 exact prediction: no new information!

 no new information: no message to transmit 

7

Example: no prediction

message   0    1    2    3    4    5    6    7
code      000  001  010  011  100  101  110  111

representation length L = 3 bits/symbol

8

Example with prediction: ENCODE the DIFFERENCE

(block diagram: the prediction P is subtracted from the source symbol, and the difference is encoded)

difference    -1    0    +1
probability   .25   .5   .25
code          00    1    01

L = .25 * 2 + .5 * 1 + .25 * 2 = 1.5 bit/difference symbol

9

binary tree codes: the relation between source symbols and codewords

A := 11
B := 10
C := 0

(code tree: each node has a 1-branch and a 0-branch; the leaves carry the source letters)

General Properties:

- every node has two successors: leaves and/or nodes

- the path from the root to a leaf gives the associated codeword

- source letters are assigned only to leaves,

  i.e. no codeword is a prefix of another codeword

10

tree codes

Tree codes are prefix codes and uniquely decodable

i.e. a string of codewords can be uniquely decomposed

into the individual codewords

Non-prefix codes may be uniquely decodable

example:

A:= 1

B:= 10

C:= 100
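The prefix property is what makes a simple left-to-right decoder work: the bit stream can be cut as soon as a codeword is recognized. A minimal illustrative sketch (the helper names are mine, not from the slides) for the tree code A := 11, B := 10, C := 0:

```python
# Minimal sketch: decoding a prefix code by scanning bits left to right.
# Because no codeword is a prefix of another, the first match is unambiguous.
CODE = {"A": "11", "B": "10", "C": "0"}       # tree code from the slide above
DECODE = {v: k for k, v in CODE.items()}      # reverse lookup: codeword -> symbol

def decode(bits: str) -> str:
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in DECODE:                     # a leaf of the code tree is reached
            out.append(DECODE[buf])
            buf = ""
    assert buf == "", "bit string did not end on a codeword boundary"
    return "".join(out)

print(decode("111000"))   # -> "ABCC"
```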

Proper trees

11

We look at trees that make sense: occupy all possible leaves!

12

binary tree codes

The average codeword length is

    L = Σ_{i=1}^{M} P_i n_i

Property: an optimal code has minimum L

Homework: show that L = sum (node probabilities)

13

Tree encoding (1)

for data / text the compression should be: lossless, no errors

– STEP 1: assign messages to nodes (an arbitrary, non-optimized assignment with codeword lengths 3, 3, 2, 2, 2)

  symbol   P(i)     n_i   n_i · P(i)
  a        0.5      3     1.5
  b        0.25     3     0.75
  c        0.125    2     0.25
  d        0.0625   2     0.125
  e        0.0625   2     0.125

• AVERAGE CODEWORD LENGTH: L = 2.75 bit/source symbol

14

Tree encoding (2)

• STEP 2: OPTIMIZE the ASSIGNMENT (MINIMIZE the average length)

  symbol   P(i)     codeword   n_i · P(i)
  e        0.0625   1111       0.25
  d        0.0625   1110       0.25
  c        0.125    110        0.375
  b        0.25     10         0.5
  a        0.5      0          0.5

• AVERAGE CODEWORD LENGTH: L = 1.875 bit/source symbol !
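The two assignments can be checked numerically; a small illustrative sketch (the helper name avg_length is mine) reproduces the values 2.75 and 1.875 and compares them with the source entropy:

```python
import math

probs = [0.5, 0.25, 0.125, 0.0625, 0.0625]       # a, b, c, d, e

def avg_length(lengths):
    # L = sum_i P_i * n_i
    return sum(p * n for p, n in zip(probs, lengths))

H = -sum(p * math.log2(p) for p in probs)        # source entropy

print(avg_length([3, 3, 2, 2, 2]))   # STEP 1 assignment -> 2.75
print(avg_length([1, 2, 3, 4, 4]))   # STEP 2 assignment -> 1.875
print(H)                             # 1.875: for this dyadic distribution the
                                     # optimized code meets the entropy exactly
```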

15

Kraft inequality

• Prefix codes with M code words satisfy the Kraft inequality:

    Σ_{k=1}^{M} 2^{-n_k} ≤ 1

  where n_k is the code word length for message k

• Proof: let n_M be the longest codeword length. Then, in a code tree of depth n_M, the terminal node for codeword k eliminates 2^{n_M - n_k} of the 2^{n_M} available nodes at depth n_M, so

    Σ_{k=1}^{M} 2^{n_M - n_k} ≤ 2^{n_M}

16

example

Code tree of depth 4: a terminal node at depth 1 eliminates 8 of the 2⁴ = 16 deepest nodes, a terminal node at depth 2 eliminates 4, and a terminal node at depth 3 eliminates 2.

Homework: can we replace ≤ by = in the Kraft inequality?
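As a quick numerical illustration (not part of the slides), the Kraft sum can be evaluated for a given set of codeword lengths:

```python
def kraft_sum(lengths):
    # sum_k 2^(-n_k); a prefix code with these lengths exists iff the sum is <= 1
    return sum(2 ** -n for n in lengths)

print(kraft_sum([1, 2, 3, 4, 4]))   # 1.0  -> proper tree, all leaves used
print(kraft_sum([1, 2, 2, 2]))      # 1.25 -> no prefix code with these lengths
```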

17

Kraft inequality

• Suppose that the length specification of M code words satisfies the Kraft inequality

    Σ_{k=1}^{M} 2^{-n_k} = Σ_{i=1}^{n_M} N_i 2^{-i} ≤ 1,

  where N_i is the number of code words of length i.

  Then we can construct a prefix code with the specified lengths.

• Note that, for every j ≤ n_M:

    Σ_{i=1}^{j} N_i 2^{-i} ≤ 1

18

Kraft inequality

• From this (using Σ_{i=1}^{j} N_i 2^{-i} ≤ 1 for every j ≤ n_M):

    N_1 ≤ 2
    N_2 ≤ 2² - 2·N_1
    N_3 ≤ 2³ - 2²·N_1 - 2·N_2
    •••

• Interpretation: at every level, fewer nodes are used than are available!
• E.g. for level 3, we have 2³ = 8 nodes minus the nodes cancelled by levels 1 and 2.

19

performance

• Suppose that we select the code word lengths as

    n_k = ⌈-log₂ p_k⌉,   so that   2^{-n_k} ≤ p_k

• Then a prefix code exists, since

    Σ_{k=1}^{M} 2^{-n_k} ≤ Σ_{k=1}^{M} p_k = 1,

  with average length

    L = Σ_{k=1}^{M} p_k n_k < Σ_{k=1}^{M} p_k (-log₂ p_k + 1) = H(U) + 1

20

Lower bound for prefix codes

• We show that

    H(U) ≤ L = Σ_{i=1}^{M} P_i n_i

• We write

    H(U) - L = Σ_{i=1}^{M} P_i log₂(1/P_i) - Σ_{i=1}^{M} P_i n_i
             = Σ_{i=1}^{M} P_i log₂( 2^{-n_i} / P_i )
             ≤ log₂(e) · Σ_{i=1}^{M} P_i ( 2^{-n_i} / P_i - 1 )
             = log₂(e) · ( Σ_{i=1}^{M} 2^{-n_i} - 1 ) ≤ 0,

  using ln x ≤ x - 1 and the Kraft inequality.

• Equality can be established for n_i = -log₂ P_i for all i

Summary so far

21

1. Prefix codes satisfy the Kraft inequality

    Σ_{k=1}^{M} 2^{-n_k} ≤ 1

2. If the length assignment of a code satisfies the Kraft inequality, then a prefix code with these lengths exists

3. Shannon gives a length assignment n_k = ⌈-log₂ p_k⌉, i.e. 2^{-n_k} ≤ p_k, with efficiency

    L = Σ_{k=1}^{M} p_k n_k < Σ_{k=1}^{M} p_k (-log₂ p_k + 1) = H(U) + 1
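A small illustrative sketch of point 3 (applied, for concreteness, to the six-symbol distribution used in the Huffman example further below): compute the Shannon lengths, then verify the Kraft inequality and the bound on L.

```python
import math

def shannon_lengths(probs):
    # Shannon's assignment: n_k = ceil(-log2 p_k), so that 2^(-n_k) <= p_k
    return [math.ceil(-math.log2(p)) for p in probs]

probs = [0.35, 0.25, 0.2, 0.1, 0.05, 0.05]
lengths = shannon_lengths(probs)
H = -sum(p * math.log2(p) for p in probs)
L = sum(p * n for p, n in zip(probs, lengths))

print(lengths)                              # [2, 2, 3, 4, 5, 5]
print(sum(2 ** -n for n in lengths) <= 1)   # True: the Kraft inequality holds
print(H <= L < H + 1)                       # True: H(U) <= L < H(U) + 1
```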

22

Huffman coding (1)

The average codeword length is

    L = Σ_{i=1}^{M} P_i n_i

Property: an optimal code has minimum L

Property: for an optimal code, the two least probable codewords
– have the same length and are the longest,
– and (by manipulating the assignment) differ only in the last code digit,

so that

    L = Σ_{i=1}^{M-2} P_i n_i + P_{M-1} n_{M-1} + P_M n_M,   with n_{M-1} = n_M

Homework: proof

23

Huffman Coding: optimality (2)

Given a code C with average length L and M symbols, construct C′:

(for C, the codewords of the two least probable symbols have the same length n_{M-1} = n_M and differ only in the last digit)

1. replace the 2 least probable symbols C_M and C_{M-1} in C by one symbol C′_{M-1} with probability P′(M-1) = P(M) + P(M-1)

2. to minimize L, we have to minimize L′:

    L  = Σ_{i=1}^{M-2} P_i n_i + (P_{M-1} + P_M) · n_{M-1}

    L′ = Σ_{i=1}^{M-2} P_i n_i + (P_{M-1} + P_M) · (n_{M-1} - 1)

    L  = L′ + P_{M-1} + P_M

24

Huffman Coding: (JPEG, MPEG, MP3)

• 1: take together the two smallest probabilities: P(i) + P(j)
• 2: replace symbols i and j by a new symbol with probability P(i) + P(j)
• 3: go to 1, until one symbol of probability 1.00 remains (a code sketch follows after the example below)

Example:

probability   step 1   step 2   step 3   step 4   codeword
0.3           0.3      0.3      0.55     1.00     11
0.25          0.25     0.25                       10
0.25          0.25     0.45     0.45              01
0.1           0.2                                 001
0.1                                               000

(merges: 0.1 + 0.1 = 0.2;  0.2 + 0.25 = 0.45;  0.3 + 0.25 = 0.55;  0.55 + 0.45 = 1.00;
reading back the branch bits gives one possible optimal assignment)
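The merging rule of steps 1-3 can be sketched with a priority queue; this is an illustrative implementation, not the code used in the lecture:

```python
import heapq

def huffman(probs):
    """Return a codeword per symbol index by repeatedly merging the two
    least probable entries (steps 1-3 above)."""
    # each heap entry: (probability, tiebreak counter, {symbol: partial codeword})
    heap = [(p, i, {i: ""}) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    counter = len(probs)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)                 # least probable group
        p1, _, c1 = heapq.heappop(heap)                 # second least probable group
        merged = {s: "0" + w for s, w in c0.items()}    # prepend the branch bits
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, counter, merged))
        counter += 1
    return heap[0][2]

probs = [0.3, 0.25, 0.25, 0.1, 0.1]
code = huffman(probs)
print(code)                                             # one optimal assignment
print(sum(probs[s] * len(w) for s, w in code.items()))  # average length ~2.2 bit/symbol
```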

25

Properties

ADVANTAGES:
– uniquely decodable code
– smallest average codeword length

DISADVANTAGES:
– LARGE tables give complexity
– variable word length
– sensitive to channel errors

26

Conclusion Huffman

Tree coding (Huffman) is not universal! It is only valid for one particular type of source!

For COMPUTER DATA, data reduction should be:
– lossless: no errors at reproduction
– universal: effective for different types of data

27

Some comments

• The Huffman code is not unique, but efficiency is the same!

• For alphabets larger than 2 small modifications are necessary (where?)

28

Performance Huffman

• Using the probability distribution for the source U, a prefix code exists with average length

L < H(U) + 1

Since Huffman is optimal, this bound also holds for Huffman codes. Problem if H(U) → 0!

• Improvements can be made when we take J symbols together; then

    J·H(U) ≤ L < J·H(U) + 1   and   H(U) ≤ L′ = L/J < H(U) + 1/J

example

• Probabilities: 0.35 0.25 0.2 0.1 0.05 0.05

• Step 1 0.35 0.25 0.2 0.1 0.1

• Step 2 0.35 0.25 0.2 0.2

• Step 3 0.35 0.25 0.4

• Step 4 0.6 0.4

29

Assign bits

• Probabilities: 0.35 0.25 0.2 0.1 0.05 0.05
• Step 1: 0.35 0.25 0.2 0.1 0.1
• Step 2: 0.35 0.25 0.2 0.2
• Step 3: 0.35 0.25 0.4
• Step 4: 0.6 0.4

(at each step the two entries that are merged receive the bits 1 and 0)

average length = 2.3 bit/symbol,   entropy ≈ 2.26 bit/symbol

30

31

Encoding idea: Lempel-Ziv-Welch (LZW)

Assume we have just read a segment w from the text; a is the next symbol.

● If wa is not in the dictionary:
  – write the index of w in the output file,
  – add wa to the dictionary, and set w ← a.
● If wa is in the dictionary:
  – process the next symbol with segment wa.

32

Encoding example

initial dictionary – address 0: a,  address 1: b,  address 2: c
input string: a a b a a c a b c a b c b

read      output   update
a a       0        aa not in dictionary: output 0 (= a), add aa at address 3
a b       0        continue with a; output 0, store ab at address 4
b a       1        continue with b; output 1, store ba at address 5
a a c     3        aa is in the dictionary, aac is not: output 3, add aac at address 6
c a       2        output 2, add ca at address 7
a b c     4        ab is in the dictionary, abc is not: output 4, add abc at address 8
c a b     7        ca is in the dictionary, cab is not: output 7, add cab at address 9
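A compact sketch of the encoder loop described above (the function name and dictionary handling are illustrative choices, not from the slides):

```python
def lzw_encode(text, alphabet=("a", "b", "c")):
    # initial dictionary: address 0: a, 1: b, 2: c
    dictionary = {s: i for i, s in enumerate(alphabet)}
    w, output = "", []
    for a in text:
        if w + a in dictionary:                   # keep extending the current segment
            w += a
        else:
            output.append(dictionary[w])          # write the index of w
            dictionary[w + a] = len(dictionary)   # add wa at the next free address
            w = a                                 # restart the segment with a
    output.append(dictionary[w])                  # flush the last segment
    return output

print(lzw_encode("aabaacabcabcb"))   # [0, 0, 1, 3, 2, 4, 7, ...] as in the table above
```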

33

UNIVERSAL (LZW) (decoder)

1. Start with the basic symbol set.

2. Read a code c from the compressed file.
   – The address c in the dictionary determines the segment w.
   – Write w to the output file.

3. Add wa to the dictionary: a is the first letter of the next segment.

34

Decoding example

initial dictionary – address 0: a,  address 1: b,  address 2: c
input: 0 0 1 3 2 4 7

input   output   update
0       a        ? (first letter of the next segment still unknown)
0       a        the second a determines ? = a: add aa at address 3
1       b        b determines the next first letter: add ab at address 4
3       aa       add ba at address 5
2       c        add aac at address 6
4       ab       add ca at address 7
7       ca       add abc at address 8

35

Conclusion (LZW)

IDEA: TRY to copy long parts of the source output

– if the dictionary overflows: throw the least-recently used entry away, in both encoder and decoder
– universal
– lossless

Homework: encode/decode the sequence 1001010110011...

Try to solve the problem that occurs!

36

Some history

• GIF, TIFF, V.42bis modem compression standard, PostScript Level 2

– 1977: published by Abraham Lempel and Jacob Ziv
– 1984: LZ-Welch algorithm published in IEEE Computer
– 1986: Sperry patent transferred to Unisys
– GIF file format required use of the LZW algorithm

37

references

J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, May 1977.

Terry Welch, A Technique for High-Performance Data Compression, Computer, June 1984.

38

Summary of operations

• ENCODING:
  segment, next symbol   output     update   location
  W1 A                   loc(W1)    W1A      N
  W2 F                   loc(W2)    W2F      N+1
  W3 X                   loc(W3)    W3X      N+2

• DECODING:
  input       segment   update   location
  loc(W1)     W1        ?
  loc(W2)     W2        W1A      N
  loc(W3)     W3        W2F      N+1

  (the decoder can complete an update only when the first letter of the next segment is known)

39

Problem and solution

• ENCODING:
  segment, next symbol   output     update   location
  W1 A                   loc(W1)    W1A      N
  W2 = W1A, F            loc(W2)    W2F      N+1

• DECODING:
  input              segment   update   location
  loc(W1)            W1        ?
  loc(W2 = W1A)      #         W1A      N

Since W2 = W1 A, the unknown # can be resolved: W2 was entered at location N as W1A, where A is the first letter of W1 (see the decoder sketch below).
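Putting the decoding rule and this special case together, an illustrative decoder sketch (the function name and the use of Python strings are my choices, not from the slides):

```python
def lzw_decode(codes, alphabet=("a", "b", "c")):
    dictionary = {i: s for i, s in enumerate(alphabet)}
    w = dictionary[codes[0]]
    output = [w]
    for c in codes[1:]:
        if c in dictionary:
            entry = dictionary[c]
        else:
            # the problem case: c was created by the encoder in the same step,
            # so W2 = W1 A, i.e. the unknown entry is w plus its own first letter
            entry = w + w[0]
        output.append(entry)
        dictionary[len(dictionary)] = w + entry[0]   # add wa, a = first letter of next segment
        w = entry
    return "".join(output)

print(lzw_decode([0, 0, 1, 3, 2, 4, 7]))
# "aabaacabca", the prefix encoded by the first seven output indices in the example
```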

40

Shannon-Fano coding

Suppose that we have a source with M symbols. Every symbol u_i occurs with probability P(u_i).

We try to encode symbol u_i with

    n_i = ⌈-log₂ P(u_i)⌉ < -log₂ P(u_i) + 1   bits.

Then the average representation length is

    Σ_{i=1}^{M} P(u_i) n_i < -Σ_{i=1}^{M} P(u_i) log₂ P(u_i) + Σ_{i=1}^{M} P(u_i) = H(U) + 1

41

code realization

Define

    Q(u_0) = 0
    Q(u_k) = Σ_{j=0}^{k-1} P(u_j),   1 ≤ k ≤ M-1

    n_i = ⌈-log₂ P(u_i)⌉,   0 ≤ i ≤ M-1

with the symbols ordered so that P(u_i) ≥ P(u_j) for j ≥ i+1.

Then:   n̄ < H(U) + 1.

42

continued

Define: the codeword for u_i is the binary expansion of Q(u_i), truncated to length n_i.

Property: the code is a prefix code with the promised length.

Proof: let i ≥ k+1; then

    Q(u_i) - Q(u_k) = Σ_{j=0}^{i-1} P(u_j) - Σ_{j=0}^{k-1} P(u_j) = Σ_{j=k}^{i-1} P(u_j) ≥ P(u_k) ≥ 2^{-n_k}

43

continued

1. The binary (radix-2) representations of Q(u_i) and Q(u_k) therefore differ at least in position n_k.

2. The codeword for u_k has length n_k ≥ -log₂ P(u_k).

3. Hence the truncated representation of Q(u_k) can never be a prefix of the codeword for u_i.

44

example

P(u_0 … u_7) = (5/16, 3/16, 1/8, 1/8, 3/32, 1/16, 1/16, 1/32)

symbol   P(u_i)   Q(u_i)   n_i = ⌈-log₂ P(u_i)⌉   codeword (binary expansion of Q(u_i), truncated to n_i bits)
u_0      5/16     0        2                      00
u_1      3/16     5/16     3                      010
u_2      1/8      1/2      3                      100
u_3      1/8      5/8      3                      101
u_4      3/32     3/4      4                      1100
u_5      1/16     27/32    4                      1101
u_6      1/16     29/32    4                      1110
u_7      1/32     31/32    5                      11111
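The construction can be sketched directly from the definitions of Q(u_i) and n_i; the following illustrative code (names are mine) reproduces the codewords in the table above:

```python
import math
from fractions import Fraction

def shannon_fano_code(probs):
    """probs must be sorted in decreasing order of probability."""
    codewords, Q = [], Fraction(0)
    for p in probs:
        n = math.ceil(-math.log2(p))          # n_i = ceil(-log2 P(u_i))
        # binary expansion of Q(u_i), truncated to n_i bits
        bits = "".join(str(math.floor(Q * 2**k) % 2) for k in range(1, n + 1))
        codewords.append(bits)
        Q += p                                # Q(u_{i+1}) = Q(u_i) + P(u_i)
    return codewords

P = [Fraction(5, 16), Fraction(3, 16), Fraction(1, 8), Fraction(1, 8),
     Fraction(3, 32), Fraction(1, 16), Fraction(1, 16), Fraction(1, 32)]
print(shannon_fano_code(P))
# ['00', '010', '100', '101', '1100', '1101', '1110', '11111']
```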

45

Enumerative coding

Suppose there are pn ones in a long sequence of length n.

According to Shannon:

we need ~ nh(p) bits to represent every sequence

How do we realize the encoding and decoding?

46

Enumerative coding

Solution: use lexicographical ordering

Example: 2 ones in a sequence of length 6

index   sequence
14      1 1 0 0 0 0
•••
9       0 1 1 0 0 0
8       0 1 0 1 0 0
7       0 1 0 0 1 0
6       0 1 0 0 0 1
5       0 0 1 1 0 0
4       0 0 1 0 1 0
3       0 0 1 0 0 1
2       0 0 0 1 1 0
1       0 0 0 1 0 1
0       0 0 0 0 1 1

Encode:

Sequence # = # of sequences with lower

lexicographical order

Decode: reconstruct sequence with sequence #

47

Enumerative encoding

Example: the index for sequence 0 1 0 1 0 0 is 8

0 1 0 1 0 0

– there are 6 sequences with prefix 0 0 (remaining length 4, containing 2 ones); they all precede the given sequence
– there are 2 sequences with prefix 0 1 0 0 (remaining length 2, containing 1 one); they also precede it

index = 6 + 2 = 8

48

Enumerative decoding

Given: a sequence of length 6 with 2 ones. What is the sequence for index 8?

– There are 10 sequences with prefix 0 (length-5 suffix, 2 ones). Since 8 < 10, the sequence starts with 0.
– There are 6 sequences with prefix 00 (length-4 suffix, 2 ones). Since 8 ≥ 6, the sequence starts with 01; remaining index 8 - 6 = 2.
– There are 3 sequences with prefix 010 (length-3 suffix, 1 one). Since 2 < 3, the sequence starts with 010 and not 011.
– There are 2 sequences with prefix 0100 (length-2 suffix, 1 one). Since 2 ≥ 2, the sequence starts with 0101; remaining index 2 - 2 = 0.
– No ones remain, so the sequence is 0 1 0 1 0 0 (index 6 + 2 = 8).
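Both directions can be written down with binomial coefficients; an illustrative sketch, assuming the number of ones per block is known to the decoder:

```python
from math import comb

def enum_encode(bits):
    """Index = number of sequences with the same length and weight
    that are lexicographically smaller."""
    index, ones = 0, bits.count(1)
    for pos, b in enumerate(bits):
        remaining = len(bits) - pos - 1
        if b == 1:
            # all sequences with a 0 in this position (and the same prefix) come first
            index += comb(remaining, ones)
            ones -= 1
    return index

def enum_decode(index, n, ones):
    bits = []
    for pos in range(n):
        remaining = n - pos - 1
        smaller = comb(remaining, ones)      # sequences that put a 0 in this position
        if index < smaller:
            bits.append(0)
        else:
            bits.append(1)
            index -= smaller
            ones -= 1
    return bits

print(enum_encode([0, 1, 0, 1, 0, 0]))   # 8, as in the example above
print(enum_decode(8, 6, 2))              # [0, 1, 0, 1, 0, 0]
```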

49

Enumerative encoding: performance

The number of bits per n source outputs, for pn ones, is

    (1/n) · log₂ C(n, pn) ≈ h(p)

Asymptotically: efficiency h(p) bits per source output.

Note added: for words of length n,

- first encode the number of ones in a block, with log₂(n+1) bits,

- then do the enumerative encoding with ≈ h(p) bits per source output.

The contribution (log₂(n+1))/n disappears for large n!

50

David A. HuffmanIn 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman's professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer's memory.

Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. "It was the most singular moment of my life," Huffman says. "There was the absolute lightning of sudden realization."

51

The inventors

Abraham Lempel and Jacob Ziv

LZW (Lempel-Ziv-Welch) is an implementation of a lossless data compression algorithm created by Lempel and Ziv. It was published by Terry Welch in 1984 as an improved version of the LZ78 dictionary coding algorithm developed by Abraham Lempel and Jacob Ziv.

52

Intuitive Lempel Ziv (be careful !)

• A source generates independent symbols 0 and 1: p(1) = 1 - p(0) = p

• Then:
  – there are roughly 2^{nh(p)} typical sequences of length n,
  – every typical sequence has probability p(t) ≈ 2^{-nh(p)}

• We expect that in a binary sequence of length N = 2^{nh(p)}, every typical sequence occurs once

(with very high probability)

53

Intuitive Lempel Ziv (be careful !)

Idea for the Algorithm:

Start with an initial sequence of length N.

a. Generate a string of length n (which is typical with high probability).

b. Transmit its starting position in the string of length N with log₂N bits; if it is not present, transmit the n bits as they occur.

c. Delete the first n bits of the initial sequence and append the newly generated n bits. Go back to a, unless the end of the source sequence is reached.

54

Intuitive Lempel Ziv (be careful !)

EFFICIENCY: the new n bits are typical with probability 1 - ε, where ε → 0

- if non-typical, transmit 0, followed by the n bits
- if typical, transmit 1, followed by log₂N bits for the position in the block

hence, the average number of bits per source output is

    (1 - ε)·(log₂N)/n + ε + 1/n  →  h(p) bits/source output   for large n and ε → 0!

NOTE:
- if p changes, we can adapt N and n, or choose some worst-case value in advance
- the typical words can also be stored in a memory; the algorithm then outputs the location of the new word. Every time, a new word is entered into the memory and one word is deleted. Why is this not a good solution?

55

Another approach (1)

- Suppose that a source generates N independent M-ary symbols:

    x = (x_1, x_2, •••, x_N),   x_i Є {1, 2, •••, M}

- The frequency of symbol i is f_i, and thus f_i·N symbols i occur

- We call F = (f_1, f_2, •••, f_M) the composition of x

- Then, the number of different vectors x for a given F is

    |x|_F = N! / ( (f_1 N)! (f_2 N)! ••• (f_M N)! )

  and the number of bits needed to represent x is

    (1/N) · log₂ |x|_F ≈ log₂ N - Σ_{i=1}^{M} f_i log₂ (f_i N) = -Σ_{i=1}^{M} f_i log₂ f_i   (entropy!)
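A small numerical check of this approximation (the composition and helper names below are illustrative choices, not from the slides):

```python
import math

def log2_multinomial(counts):
    """log2( N! / prod (f_i N)! ) for a given composition, via lgamma."""
    N = sum(counts)
    nats = math.lgamma(N + 1) - sum(math.lgamma(c + 1) for c in counts)
    return nats / math.log(2)

counts = [500, 300, 200]                 # composition f = (0.5, 0.3, 0.2), N = 1000
N = sum(counts)
H = -sum(c / N * math.log2(c / N) for c in counts)
print(log2_multinomial(counts) / N)      # ~1.48 bits per source symbol for the index
print(H)                                 # ~1.49, the entropy of the composition
```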

56

en- and decoding

To transmit the value of F, we need about M·log₂N bits, i.e. (M·log₂N)/N bits per output letter.

For large N, f_i → p_i, and thus -Σ_{i=1}^{M} f_i log₂ f_i is equal to the Shannon entropy!

Block diagram:

source → x → encoder → { F (composition), lexicographical index for x } → decoder → x

Lexicographical en- and decoding is a solved problem in computer science.
