Source Coding
Data Compression, May 7, 2012
A.J. Han Vinck
2
DATA COMPRESSION
NO LOSS of information and exact reproduction (modest compression ratio, around 1:4)
• general problem statement:
“find a means for spending as little time as possible on packing as much of data as possible into as little space as possible, and with no loss of information”
3
GENERAL IDEA:
• represent likely symbols with short binary codewords,
where "likely" is derived from
- prediction of the next symbol in the source output:
  q-ue  q-ua  q-ui  q-uo   (which symbol follows q?)
  q-00  q-01  q-10  q-11
- context between the source symbols: words, sounds, context in pictures
4
Why compress?
1. Lossless compression often reduces file size by 40% to 80%.
2. More economical to transport and store.
3. Most Internet content is compressed for transmission.
4. Compression before encryption can make code-breaking more difficult.
5. Conserves battery power and storage space on mobile devices.
6. Compression and decompression can be hardwired.
5
Some history
• 1948 – Shannon-Fano coding
• 1952 – Huffman coding
  – reduced redundancy in symbol coding
  – demonstrably optimal variable-length symbol coding
• 1977 – Lempel-Ziv coding
  – first major "dictionary method"
  – maps repeated word patterns to code words
6
MODEL KNOWLEDGE
best performance: exact prediction!
exact prediction: no new information!
no new information: no message to transmit
7
Example: no prediction
source: C
message  0    1    2    3    4    5    6    7
code     000  001  010  011  100  101  110  111
representation length: L = 3 bits/message
8
Example with prediction: ENCODE THE DIFFERENCE
The source output C is predicted by P; only the difference C − P is encoded.
difference    -1   0   +1
probability   .25  .5  .25
code          00   1   01
L = .25 · 2 + .5 · 1 + .25 · 2 = 1.5 bit/difference symbol
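The difference-encoding step can be sketched in a few lines of Python (a minimal illustration, not from the original slides; `encode_differences` and the sample sequence are invented for the example, and successive differences are assumed to lie in {−1, 0, +1}):

```python
# Variable-length code for the differences, as on the slide:
# -1 -> "00", 0 -> "1", +1 -> "01".
CODE = {-1: "00", 0: "1", 1: "01"}

def encode_differences(samples):
    """Encode a sequence by coding successive differences (assumed in {-1,0,+1})."""
    bits = []
    prev = samples[0]
    for s in samples[1:]:
        bits.append(CODE[s - prev])
        prev = s
    return "".join(bits)

seq = [3, 3, 4, 4, 3, 3, 3, 4]
# differences: 0, +1, 0, -1, 0, 0, +1  ->  "1011001101"
print(encode_differences(seq))
```

With 7 differences encoded in 10 bits, the average is about 1.43 bits per difference here; over many symbols with the stated probabilities it approaches the 1.5 bits of the slide.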
9
binary tree codes: the relation between source symbols and codewords
A := 11
B := 10
C := 0
(each branch of the tree is labeled 0 or 1; codewords are read from root to leaf)
General properties:
- every node has two successors: leaves and/or nodes
- the path to a leaf gives the corresponding codeword
- source letters are assigned only to leaves,
  i.e. no codeword is the prefix of another codeword
10
tree codes
Tree codes are prefix codes and uniquely decodable,
i.e. a string of codewords can be uniquely decomposed
into the individual codewords.
Non-prefix codes may also be uniquely decodable.
Example:
A := 1
B := 10
C := 100
11
Proper trees
We look at trees that make sense: they occupy all possible leaves!
12
binary tree codes
The average codeword length is $L = \sum_{i=1}^{M} P_i n_i$.
Property: an optimal code has minimum L.
Homework: show that L = sum (node probabilities).
13
Tree encoding (1)
For data / text the compression should be lossless: no errors.
STEP 1: assign messages to nodes (a first, unoptimized assignment):
symbol  P(i)    codeword  n_i P(i)
a       0.5     111       1.5
b       0.25    110       0.75
c       0.125   10        0.25
d       0.0625  01        0.125
e       0.0625  00        0.125
AVERAGE CODEWORD LENGTH: L = 2.75 bit/source symbol
14
Tree encoding (2)
STEP 2: OPTIMIZE THE ASSIGNMENT (minimize the average length):
symbol  P(i)    codeword  n_i P(i)
a       0.5     0         0.5
b       0.25    10        0.5
c       0.125   110       0.375
d       0.0625  1110      0.25
e       0.0625  1111      0.25
AVERAGE CODEWORD LENGTH: L = 1.875 bit/source symbol!
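As a quick numeric check of the optimized assignment, the average length can be recomputed directly (a minimal sketch; the dictionary names are illustrative):

```python
# Probabilities and the optimized prefix code from the slide.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.0625, "e": 0.0625}
code  = {"a": "0", "b": "10", "c": "110", "d": "1110", "e": "1111"}

# Average codeword length L = sum_i P(i) * n_i.
L = sum(p * len(code[s]) for s, p in probs.items())
print(L)  # 1.875
```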
15
Kraft inequality
• Prefix codes with M code words satisfy the Kraft inequality:
$\sum_{k=1}^{M} 2^{-n_k} \le 1$,
where $n_k$ is the code word length for message k.
• Proof: let $n_M$ be the longest codeword length. Then, in a code tree of depth $n_M$, the terminal node for codeword k eliminates $2^{n_M - n_k}$ of the $2^{n_M}$ available nodes at depth $n_M$:
$\sum_{k=1}^{M} 2^{n_M - n_k} \le 2^{n_M}.$
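The inequality is easy to test numerically for a proposed set of lengths (a minimal sketch; `satisfies_kraft` is an invented helper name):

```python
def satisfies_kraft(lengths):
    """Check the Kraft inequality: sum_k 2**(-n_k) <= 1."""
    return sum(2.0 ** -n for n in lengths) <= 1.0

# Lengths of the optimized code {0, 10, 110, 1110, 1111}: sum is exactly 1.
print(satisfies_kraft([1, 2, 3, 4, 4]))  # True

# No binary prefix code can have lengths {1, 2, 2, 2}: the sum is 1.25.
print(satisfies_kraft([1, 2, 2, 2]))     # False
```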
16
example
Depth $n_M$ = 4: a terminal node at depth 1 eliminates 8 of the 16 nodes at depth 4, a terminal node at depth 2 eliminates 4, and a terminal node at depth 3 eliminates 2.
Homework: can we replace ≤ by = in the Kraft inequality?
17
Kraft inequality (converse)
• Suppose that the length specification of M code words satisfies the Kraft inequality:
$\sum_{k=1}^{M} 2^{-n_k} = \sum_{i=1}^{n_M} N_i 2^{-i} \le 1,$
where $N_i$ is the number of code words of length i.
Then, we can construct a prefix code with the specified lengths.
• Note that, for every $j \le n_M$, multiplying $\sum_{i=1}^{j} N_i 2^{-i} \le 1$ by $2^j$ gives
$N_j \le 2^j - \sum_{i=1}^{j-1} N_i \, 2^{j-i}.$
18
Kraft inequality
• From this, for the first levels:
$N_1 \le 2$
$N_2 \le 2^2 - 2 N_1$
$N_3 \le 2^3 - 2^2 N_1 - 2 N_2$
• Interpretation: at every level fewer nodes are used than are available! E.g. at level 3, we have 8 nodes minus the nodes cancelled by levels 1 and 2.
19
performance
• Suppose that we select the code word lengths as $n_k = \lceil -\log_2 p_k \rceil$, i.e. $2^{-n_k} \le p_k$.
• Then, a prefix code exists, since
$\sum_{k=1}^{M} 2^{-n_k} \le \sum_{k=1}^{M} p_k = 1,$
with average length
$L = \sum_{k=1}^{M} p_k n_k < \sum_{k=1}^{M} p_k \left( -\log_2 p_k + 1 \right) = H(U) + 1.$
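This length assignment can be sketched in a few lines (illustrative code, assuming the dyadic distribution from the earlier tree-encoding slides; `shannon_lengths` is an invented name):

```python
import math

def shannon_lengths(probs):
    """Shannon length assignment n_k = ceil(-log2 p_k).
    These lengths satisfy the Kraft inequality, so a prefix code exists."""
    return [math.ceil(-math.log2(p)) for p in probs]

p = [0.5, 0.25, 0.125, 0.0625, 0.0625]
n = shannon_lengths(p)
H = -sum(pi * math.log2(pi) for pi in p)   # source entropy H(U)
L = sum(pi * ni for pi, ni in zip(p, n))   # average codeword length

print(n)               # [1, 2, 3, 4, 4]
print(H <= L < H + 1)  # True
```

For this dyadic distribution the assignment is exact: L = H(U) = 1.875 bits.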
20
Lower bound for prefix codes
• We show that $H(U) \le L = \sum_{i=1}^{M} P_i n_i$.
• We write
$H(U) - L = \sum_{i=1}^{M} P_i \log_2 \frac{1}{P_i} - \sum_{i=1}^{M} P_i n_i = \sum_{i=1}^{M} P_i \log_2 \frac{2^{-n_i}}{P_i}$
$\le (\log_2 e) \sum_{i=1}^{M} P_i \left( \frac{2^{-n_i}}{P_i} - 1 \right) = (\log_2 e) \left( \sum_{i=1}^{M} 2^{-n_i} - 1 \right) \le 0,$
using $\ln x \le x - 1$ and the Kraft inequality.
• Equality can be established for $n_i = -\log_2 P_i$ for all i.
21
Summary so far
1. Prefix codes satisfy the Kraft inequality $\sum_{k=1}^{M} 2^{-n_k} \le 1$.
2. If the length assignment of a code satisfies the Kraft inequality, then a prefix code exists.
3. Shannon gives a length assignment $n_k = \lceil -\log_2 p_k \rceil$, i.e. $2^{-n_k} \le p_k$, with efficiency
$H(U) \le L = \sum_{k=1}^{M} p_k n_k < \sum_{k=1}^{M} p_k \left( -\log_2 p_k + 1 \right) = H(U) + 1.$
22
Huffman coding (1)
The average codeword length is $L = \sum_{i=1}^{M} P_i n_i$.
Property: an optimal code has minimum L.
Property: for an optimal code, the two least probable codewords have the same length, are the longest, and (by manipulating the assignment) differ only in the last code digit. Hence
$L = \sum_{i=1}^{M-2} P_i n_i + P_{M-1} n_{M-1} + P_M n_M, \quad n_{M-1} = n_M.$
Homework: proof.
23
Huffman coding: optimality (2)
Given a code C with average length L and M symbols, construct C′:
(for C, the codewords for the two least probable symbols differ only in the last digit)
1. replace the two least probable symbols $C_M$ and $C_{M-1}$ in C by the symbol $C_{M-1}'$ with probability $P_{M-1}' = P_M + P_{M-1}$;
2. to minimize L, we have to minimize L′, since with $n_{M-1} = n_M = n_{M-1}' + 1$:
$L = \sum_{i=1}^{M-2} P_i n_i + (P_M + P_{M-1})(n_{M-1}' + 1)$
$L' = \sum_{i=1}^{M-2} P_i n_i + P_{M-1}' \, n_{M-1}'$
$L = L' + P_M + P_{M-1}.$
24
Huffman Coding: (JPEG, MPEG, MP3)
• 1 take together the two smallest probabilities: P(i) + P(j)
• 2 replace symbols i and j by a new symbol with probability P(i) + P(j)
• 3 go to 1, until only one symbol is left
Example:
• Probabilities: 0.3 0.25 0.25 0.1 0.1
• Step 1: 0.3 0.25 0.25 0.2
• Step 2: 0.45 0.3 0.25
• Step 3: 0.55 0.45
• Step 4: 1.00
Resulting code: 0.3 → 11, 0.25 → 10, 0.25 → 01, 0.1 → 001, 0.1 → 000
25
Properties
ADVANTAGES:
– uniquely decodable code
– smallest average codeword length
DISADVANTAGES:
– LARGE tables give complexity
– variable word length
– sensitive to channel errors
26
Conclusion Huffman
Tree coding (Huffman) is not universal! It is only valid for one particular type of source!
For COMPUTER DATA, data reduction should be:
– lossless: no errors at reproduction
– universal: effective for different types of data
27
Some comments
• The Huffman code is not unique, but efficiency is the same!
• For alphabets larger than 2 small modifications are necessary (where?)
28
Performance Huffman
• Using the probability distribution for the source U, a prefix code exists with average length
L < H(U) + 1
Since Huffman is optimum, this bound is also true for Huffman codes. Problem if H(U) → 0!
• Improvements can be made when we take J symbols together, then
JH(U) ≤ L < J H(U) + 1 and H(U) ≤ L’ = L/J < H(U) + 1/J
example
• Probabilities: 0.35 0.25 0.2 0.1 0.05 0.05
• Step 1 0.35 0.25 0.2 0.1 0.1
• Step 2 0.35 0.25 0.2 0.2
• Step 3 0.35 0.25 0.4
• Step 4 0.6 0.4
29
Assign bits
At every merge, label the two merged entries with 1 and 0:
• Probabilities: 0.35 0.25 0.2 0.1 0.05(1) 0.05(0)
• Step 1: 0.35 0.25 0.2 0.1(1) 0.1(0)
• Step 2: 0.35 0.25 0.2(1) 0.2(0)
• Step 3: 0.35(1) 0.25(0) 0.4
• Step 4: 0.6(1) 0.4(0)
average length = 2.3, entropy = 2.25 bit/symbol
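The merge procedure can be sketched with a binary heap (an illustrative implementation, not the original course code; tie-breaking may differ from the slide, but the resulting average length is the same):

```python
import heapq

def huffman_lengths(probs):
    """Codeword lengths from repeated merging of the two smallest probabilities."""
    # Heap items: (probability, unique id, list of symbol indices in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:  # every symbol below the merge point gains one bit
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, uid, s1 + s2))
        uid += 1
    return lengths

p = [0.35, 0.25, 0.2, 0.1, 0.05, 0.05]
n = huffman_lengths(p)
print(n)  # [2, 2, 2, 3, 4, 4]
print(sum(pi * ni for pi, ni in zip(p, n)))  # average length, about 2.3 bit/symbol
```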
30
31
Encoding idea Lempel-Ziv-Welch (LZW)
Assume we have just read a segment w from the text; a is the next symbol.
● If wa is not in the dictionary:
  - write the index of w in the output file,
  - add wa to the dictionary, and set w ← a.
● If wa is in the dictionary:
  - process the next symbol with segment wa.
32
Encoding example
• address 0: a, address 1: b, address 2: c
• String: a a b a a c a b c a b c b
read             remark                                   output  update
a a              aa not in dictionary, output 0, add aa   0       aa → 3
a a b            continue with a; store ab in dictionary  0       ab → 4
a a b a          continue with b; store ba in dictionary  1       ba → 5
a a b a a c      aa in dictionary, aac not                3       aac → 6
a a b a a c a                                             2       ca → 7
a a b a a c a b c                                         4       abc → 8
a a b a a c a b c a b                                     7       cab → 9
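The encoding loop can be sketched as follows (illustrative Python, reproducing the slide's example; `lzw_encode` is an invented name):

```python
def lzw_encode(text, alphabet):
    """LZW encoding: output the dictionary index of the longest known
    segment w, then add wa to the dictionary and continue with a."""
    dictionary = {ch: i for i, ch in enumerate(alphabet)}
    w, out = "", []
    for a in text:
        if w + a in dictionary:
            w += a                          # segment wa is known: extend it
        else:
            out.append(dictionary[w])       # emit index of w
            dictionary[w + a] = len(dictionary)  # add wa at the next address
            w = a
    out.append(dictionary[w])               # flush the final segment
    return out

print(lzw_encode("aabaacabcabcb", "abc"))  # [0, 0, 1, 3, 2, 4, 7, 1, 2, 1]
```

The first seven indices match the slide's table; the last three encode the tail "b c b".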
33
UNIVERSAL (LZW) decoder
1. Start with the basic symbol set.
2. Read a code c from the compressed file. The address c in the dictionary determines the segment w; write w in the output file.
3. Add wa to the dictionary, where a is the first letter of the next segment.
34
Decoding example
• address 0: a, address 1: b, address 2: c
String                 input  output  update
a ?                    0      a
a a !                  0      a       ? = a, update aa → 3
a a b .                1      b       ! = b, update ab → 4
a a b a a .            3      aa      update ba → 5
a a b a a c .          2      c       update aac → 6
a a b a a c a b .      4      ab      update ca → 7
a a b a a c a b c a .  7      ca      update abc → 8
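The matching decoder can be sketched as follows (illustrative Python; it also handles the special case of a received index that is not yet in the dictionary, the W2 = W1A situation discussed on the "Problem and solution" slide):

```python
def lzw_decode(codes, alphabet):
    """LZW decoding: rebuild the dictionary one step behind the encoder."""
    dictionary = {i: ch for i, ch in enumerate(alphabet)}
    w = dictionary[codes[0]]
    out = [w]
    for c in codes[1:]:
        # Special case: index not yet in the dictionary. Then the new entry
        # is w plus its own first letter (W2 = W1A with A = first letter of W1).
        entry = dictionary[c] if c in dictionary else w + w[0]
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[0]  # add wa; a = first letter
        w = entry
    return "".join(out)

print(lzw_decode([0, 0, 1, 3, 2, 4, 7, 1, 2, 1], "abc"))  # "aabaacabcabcb"
```

Decoding the index stream of the encoding example recovers the original string exactly.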
35
Conclusion (LZW)
IDEA: TRY to copy long parts of the source output
– if the dictionary overflows: throw the least-recently used entry away in encoder and decoder
– universal
– lossless
Homework: encode/decode the sequence 1001010110011...
Try to solve the problem that occurs!
36
Some history
• GIF, TIFF, V.42bis modem compression standard, PostScript Level 2
– 1977 published by Abraham Lempel and Jacob Ziv
– 1984 LZ-Welch algorithm published in IEEE Computer
– 1986 Sperry patent transferred to Unisys
– the GIF file format required use of the LZW algorithm
37
references
J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, May 1977.
Terry Welch, A Technique for High-Performance Data Compression, Computer, June 1984.
38
Summary of operations
• ENCODING   output    update  location
  W1 A       loc(W1)   W1A     N
  W2 F       loc(W2)   W2F     N+1
  W3 X       loc(W3)   W3X     N+2
• DECODING   input     update  location
             loc(W1)   W1?
             loc(W2)   W1A     N      (the first letter of W2 completes W1?)
             loc(W3)   W2F     N+1    (the first letter of W3 completes W2?)
39
Problem and solution
• ENCODING       output    update  location
  W1 A           loc(W1)   W1A     N
  W2 = W1A, F    loc(W2)   W2F     N+1
• DECODING       input           update  location
                 loc(W1)         W1?
                 loc(W2 = W1A)   W2?     N
The decoder receives loc(W2) = N before entry N is complete. Since W2 = W1A, the unknown letter A is the first letter of W2, and hence the first letter of W1: the ? can be solved, and W2 is updated at location N as W1A.
40
Shannon-Fano coding
Suppose that we have a source with M symbols. Every symbol $u_i$ occurs with probability $P(u_i)$.
We try to encode symbol $u_i$ with $n_i = \lceil -\log_2 P(u_i) \rceil < -\log_2 P(u_i) + 1$ bits.
Then the average representation length is
$\sum_{i=1}^{M} P(u_i) n_i < \sum_{i=1}^{M} P(u_i) \left( -\log_2 P(u_i) \right) + \sum_{i=1}^{M} P(u_i) = H(U) + 1.$
41
code realization
Define
$Q(u_0) = 0,$
$Q(u_k) = \sum_{j=0}^{k-1} P(u_j), \quad 1 \le k \le M,$
with the probabilities ordered so that $P(u_i) \ge P(u_j)$ for $j \ge i + 1$, and
$n_i = \lceil -\log_2 P(u_i) \rceil, \quad 0 \le i \le M-1.$
Then: the average length satisfies $\bar{n} < H(U) + 1$.
42
continued
The codeword for $u_i$ is the binary expansion of $Q(u_i)$ of length $n_i$.
Property: the code is a prefix code with the promised length.
Proof: let $i \ge k + 1$. Then
$Q(u_i) = \sum_{j=0}^{i-1} P(u_j), \quad Q(u_k) = \sum_{j=0}^{k-1} P(u_j),$
thus:
$Q(u_i) - Q(u_k) = \sum_{j=k}^{i-1} P(u_j) \ge P(u_k) \ge 2^{-n_k}.$
43
continued
1. The binary radix-2 representations of $Q(u_i)$ and $Q(u_k)$ therefore differ at least once within the first $n_k$ positions.
2. The codeword for $u_k$ has length $n_k = \lceil -\log_2 P(u_k) \rceil$.
3. Hence the truncated representation of $Q(u_k)$ can never be a prefix of the codeword for $u_i$.
44
example
P(u0 u1 u2 u3 u4 u5 u6 u7) = (5/16, 3/16, 1/8, 1/8, 3/32, 1/16, 1/16, 1/32)
 i   P(u_i)  Q(u_i)  n_i = ⌈−log2 P(u_i)⌉  codeword (Q truncated to n_i bits)
 0   5/16    0       2                     00
 1   3/16    5/16    3                     010
 2   1/8     1/2     3                     100
 3   1/8     5/8     3                     101
 4   3/32    3/4     4                     1100
 5   1/16    27/32   4                     1101
 6   1/16    29/32   4                     1110
 7   1/32    31/32   5                     11111
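The construction can be sketched with exact fractions, so the binary expansions of Q(u_i) are computed without rounding (illustrative code, not from the slides; `shannon_fano_codes` is an invented name, and the probabilities are assumed to be sorted largest first):

```python
import math
from fractions import Fraction

def shannon_fano_codes(probs):
    """Shannon-Fano construction: Q(u_k) = sum of the preceding probabilities;
    codeword = first ceil(-log2 P) bits of the binary expansion of Q."""
    codes = []
    Q = Fraction(0)
    for p in probs:                     # assumed sorted in decreasing order
        n = math.ceil(-math.log2(p))    # promised codeword length
        frac, bits = Q, ""
        for _ in range(n):              # binary expansion of Q, truncated to n bits
            frac *= 2
            if frac >= 1:
                bits += "1"
                frac -= 1
            else:
                bits += "0"
        codes.append(bits)
        Q += p
    return codes

p = [Fraction(5, 16), Fraction(3, 16), Fraction(1, 8), Fraction(1, 8),
     Fraction(3, 32), Fraction(1, 16), Fraction(1, 16), Fraction(1, 32)]
print(shannon_fano_codes(p))
# ['00', '010', '100', '101', '1100', '1101', '1110', '11111']
```

The output reproduces the codeword column of the table above.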
45
Enumerative coding
Suppose there are pn ones in a long sequence of length n.
According to Shannon, we need ~ n·h(p) bits to represent every such sequence.
How do we realize the encoding and decoding?
46
Enumerative coding
Solution: lexicographical ordering.
Example: 2 ones in a sequence of length 6
14  1 1 0 0 0 0
•••
 9  0 1 1 0 0 0
 8  0 1 0 1 0 0
 7  0 1 0 0 1 0
 6  0 1 0 0 0 1
 5  0 0 1 1 0 0
 4  0 0 1 0 1 0
 3  0 0 1 0 0 1
 2  0 0 0 1 1 0
 1  0 0 0 1 0 1
 0  0 0 0 0 1 1
Encode: sequence # = number of sequences with lower lexicographical order.
Decode: reconstruct the sequence from its sequence #.
47
Enumerative encoding
Example: index for sequence 0 1 0 1 0 0 = 8
0 1 0 1 0 0
For the first 1: there are 6 sequences with prefix 00, i.e. with 2 ones in the remaining part of length 4.
For the second 1: there are 2 sequences with prefix 0100, i.e. with 1 one in the remaining part of length 2.
Index = 6 + 2 = 8.
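The index computation generalizes to any length and weight via binomial coefficients (a sketch; `enum_index` is an invented name):

```python
from math import comb

def enum_index(bits):
    """Lexicographic index: count the same-weight sequences that come before.
    Each time a 1 appears, all sequences with a 0 in that position
    (and the same prefix) are lexicographically smaller."""
    n, idx = len(bits), 0
    k = sum(bits)                      # ones still to place
    for i, b in enumerate(bits):
        if b == 1:
            idx += comb(n - 1 - i, k)  # a 0 here leaves k ones for the tail
            k -= 1
    return idx

print(enum_index([0, 1, 0, 1, 0, 0]))  # 8, as on the slide
```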
48
Enumerative decoding
Given: a sequence of length 6 with 2 ones. What is the sequence for index 8?
There are 10 sequences with prefix 0 (2 ones in the remaining length 5); since 8 < 10, the sequence starts with 0.
There are 6 sequences with prefix 00 (2 ones in the remaining length 4); since 8 ≥ 6, the sequence starts with 01, and 8 − 6 = 2 remains.
There are 3 sequences with prefix 010 (1 one in the remaining length 3); since 2 < 3, the sequence starts with 010 and not 011.
There are 2 sequences with prefix 0100 (1 one in the remaining length 2); since 2 ≥ 2, the sequence starts with 0101, and 2 − 2 = 0 remains.
Hence, the sequence is 0 1 0 1 0 0, with index 6 + 2 = 8.
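The decoding steps above can be sketched as the inverse walk (illustrative helper, not from the slides):

```python
from math import comb

def enum_decode(n, k, idx):
    """Invert the enumerative index: at each position, placing a 0 keeps
    comb(remaining, k) sequences below; place a 1 when idx reaches past them."""
    bits = []
    for i in range(n):
        below = comb(n - 1 - i, k)  # sequences that have a 0 in this position
        if idx < below:
            bits.append(0)
        else:
            bits.append(1)
            idx -= below
            k -= 1
    return bits

print(enum_decode(6, 2, 8))  # [0, 1, 0, 1, 0, 0]
```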
49
Enumerative encoding: performance
The number of bits per n source outputs for pn ones:
$\frac{1}{n} \log_2 \binom{n}{pn} \to h(p).$
Asymptotically: efficiency h(p) bits per source output.
Note added: for words of length n,
- encode first the number of ones in a block with log2(n+1) bits,
- then do the enumerative encoding with h(p) bits per source output.
The contribution (log2(n+1))/n disappears for large n!
50
David A. HuffmanIn 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman's professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. Besides being a nimble intellectual exercise, finding such a code would enable information to be compressed for transmission over a computer network or for storage in a computer's memory.
Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. "It was the most singular moment of my life," Huffman says. "There was the absolute lightning of sudden realization."
51
The inventors
Abraham Lempel
Jacob Ziv
LZW (Lempel-Ziv-Welch) is an implementation of a lossless data compression algorithm created by Lempel and Ziv. It was published by Terry Welch in 1984 as an improved version of the LZ78 dictionary coding algorithm developed by Abraham Lempel and Jacob Ziv.
52
Intuitive Lempel Ziv (be careful!)
• A source generates independent symbols 0 and 1: p(1) = 1 − p(0) = p.
• Then:
  – there are roughly $2^{nh(p)}$ typical sequences,
  – every typical sequence has $p(t) \approx 2^{-nh(p)}$.
• We expect that in a binary sequence of length $N = 2^{nh(p)}$, every typical sequence occurs once (with very high probability).
53
Intuitive Lempel Ziv (be careful!)
Idea for the algorithm:
Start with an initial sequence of length N.
a. Generate a string of length n (which is typical with high probability).
b. Transmit its starting position in the string of length N with log2 N bits; if not present, transmit the n bits as they occur.
c. Delete the first n bits of the initial sequence and append the newly generated n bits. Go back to a, unless end of the source sequence.
54
Intuitive Lempel Ziv (be careful!)
EFFICIENCY: the new n bits are typical with probability 1 − ε, where ε → 0
- if non-typical, transmit 0, followed by the n bits;
- if typical, transmit 1, followed by log2 N bits for the position in the block.
Hence, the average #bits/source output:
(1 − ε)(log2 N)/n + ε + 1/n → h(p) bits/source output, for large n and ε → 0!
NOTE:
- if p changes, we can adapt N and n, or choose some worst-case value in advance;
- the typical words can also be stored in a memory. The algorithm then outputs the location of the new word. Every time, a new word is entered into the memory and one word is deleted. Why is this not a good solution?
55
Another approach (1)
- Suppose that a source generates N independent M-ary symbols:
  source x = (x1, x2, •••, xN), xi ∈ {1, 2, •••, M}
- The frequency of symbol i is f_i, and thus f_i N symbols i occur.
- We call F = (f1, f2, •••, fM) the composition of x.
- Then, the number of different vectors x for a given F is
$|x_F| = \frac{N!}{(f_1 N)! \, (f_2 N)! \cdots (f_M N)!}$
and the number of bits needed to represent x is
$\frac{1}{N} \log_2 |x_F| \approx \log_2 N - \sum_{i=1}^{M} f_i \log_2 f_i N = -\sum_{i=1}^{M} f_i \log_2 f_i$ (the entropy!)
56
en- and decoding
To transmit the value of F, we need M log2 N bits (each of the M frequencies takes log2 N bits).
For large N, f_i → p_i, and thus $-\sum_{i=1}^{M} f_i \log_2 f_i$ is equal to the Shannon entropy!
source x → encoder → { F (composition), lexicographical index for x } → decoder → x
Lexicographical en- and decoding is a solved problem in computer science.