dustin-kelly
SIMS-201
Compressing Information
Overview
Chapter 7: Compression
Introduction
Entropy
Huffman coding
Universal coding
Introduction
Compression techniques can significantly reduce the bandwidth and memory required for sending, receiving, and storing data.
Most computers are equipped with modems that compress or decompress all information leaving or entering via the line.
With a mutually recognized system (e.g. WinZip), the amount of data can be significantly diminished.
Examples of compression techniques:
Compressing binary data streams: variable length coding (e.g. Huffman coding), universal coding (e.g. WinZip)
Image-specific compression (we will see that images are well suited for compression): GIF and JPEG
Video compression: MPEG
World Wide Web, not World Wide Wait
Why can we compress information?
Compression is possible because information usually contains redundancies, or information that is often repeated.
For example, two still images from a video sequence of images are often similar. This fact can be exploited by transmitting only the changes from one image to the next.
For example, a line of data often contains redundancies:
“Ask not what your country can do for you - ask what you can do for your country.”
File compression programs remove this redundancy.
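To make the redundancy in the quotation concrete, here is a minimal sketch that counts how often each word appears. The variable names are illustrative, and word counting is only a rough stand-in for what a real file compression program measures.

```python
import re
from collections import Counter

quote = ("Ask not what your country can do for you - "
         "ask what you can do for your country.")

# Split into lowercase words, ignoring punctuation.
words = re.findall(r"[a-z]+", quote.lower())
counts = Counter(words)

total_words = len(words)          # every word occurrence
distinct_words = len(counts)      # size of the vocabulary actually used
repeated = sorted(w for w, n in counts.items() if n > 1)

print(total_words, "words,", distinct_words, "distinct")
print("repeated:", repeated)
```

Nearly every word appears twice, so a scheme that can refer back to an earlier occurrence has much less new material to send.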
Some characters occur more frequently than others.
It’s possible to represent frequently occurring characters with a smaller number of bits during transmission.
This may be accomplished by a variable length code, as opposed to a fixed length code like ASCII.
An example of a simple variable length code is Morse Code.
“E” occurs more frequently than “Z” so we represent “E” with a shorter length code:
. = E
- = T
- - . . = Z
- - . - = Q
Information Theory
Variable length coding exploits the fact that some information occurs more frequently than others.
The mathematical theory behind this concept is known as INFORMATION THEORY.
Claude E. Shannon developed modern Information Theory at Bell Labs in 1948.
He saw the relationship between the probability of appearance of a transmitted signal and its information content.
This realization enabled the development of compression techniques.
A Little Probability
Shannon (and others) found that information can be related to probability.
An event has a probability of 1 (or 100%) if we believe this event will occur.
An event has a probability of 0 (or 0%) if we believe this event will not occur.
The probability that an event will occur takes on values anywhere from 0 to 1.
Consider a coin toss: heads or tails each has a probability of .50
In two tosses, the probability of tossing two heads is: 1/2 x 1/2 = 1/4 or .25
In three tosses, the probability of tossing all tails is: 1/2 x 1/2 x 1/2 = 1/8 or .125
We compute probability this way because the result of each toss is independent of the results of other tosses.
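The multiplication rule for independent tosses can be checked with exact fractions. This is a small sketch; the helper name `prob_all_same` is made up for illustration.

```python
from fractions import Fraction

def prob_all_same(p_single, tosses):
    """Probability that every one of `tosses` independent tosses shows the chosen face."""
    return p_single ** tosses

p_face = Fraction(1, 2)                  # heads or tails on a fair coin
two_heads = prob_all_same(p_face, 2)     # 1/2 x 1/2
three_tails = prob_all_same(p_face, 3)   # 1/2 x 1/2 x 1/2

print(two_heads, float(two_heads))
print(three_tails, float(three_tails))
```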
Entropy
As part of information theory, Shannon developed the concept of ENTROPY.
If the probability of a binary event is .5 (like a coin), then, on average, you need one bit to represent the result of this event.
As the probability of a binary event moves above or below .5, the number of bits you need, on average, to represent the result decreases.
[Figure: average number of bits versus probability of an event]
The figure is expressing that unless an event is totally random, you can convey the information of the event in fewer bits, on average, than it might first appear.
Let’s do an example...
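The curve described on this slide is the binary entropy function, H(p) = -p log2(p) - (1 - p) log2(1 - p). The sketch below evaluates it at a few probabilities; the function name and the sample values are chosen for illustration.

```python
import math

def binary_entropy(p):
    """Average number of bits needed per outcome of a binary event with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.1, 0.2, 0.5, 0.8, 0.9):
    print(f"p = {p:.1f}  ->  {binary_entropy(p):.3f} bits")
```

The value peaks at one bit for p = .5 and falls off symmetrically on either side, which is the shape the figure conveys.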
Example from text: A MEN’S SPECIALTY STORE
The probability of male patrons is .8
The probability of female patrons is .2
Assume for this example, groups of two enter the store. Calculate the probabilities of different pairings:
Event A, Male-Male. P(MM) = .8 x .8 = .64
Event B, Male-Female. P(MF) = .8 x .2 = .16
Event C, Female-Male. P(FM) = .2 x .8 = .16
Event D, Female-Female. P(FF) = .2 x .2 = .04
We could assign the longest codes to the most infrequent events while maintaining unique decodability.
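The four pairing probabilities can be reproduced by multiplying the single-patron probabilities, assuming (as the slide does) that the two patrons in a pair are independent. The dictionary names here are illustrative.

```python
from itertools import product

p = {"M": 0.8, "F": 0.2}  # single-patron probabilities from the slide

# Independent patrons: multiply the individual probabilities.
pair_probs = {a + b: p[a] * p[b] for a, b in product("MF", repeat=2)}

for pair, prob in pair_probs.items():
    print(f"P({pair}) = {prob:.2f}")

print("total =", round(sum(pair_probs.values()), 10))
```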
Example (cont..)
Let’s assign a unique string of bits to each event based on the probability of that event occurring.
Event  Name  Code
A  Male-Male  0
B  Male-Female  10
C  Female-Male  110
D  Female-Female  111
Given a received code of: 01010110100, determine the events:
0 = A (MM), 10 = B (MF), 10 = B (MF), 110 = C (FM), 10 = B (MF), 0 = A (MM)
The above example has used a variable length code.
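One way to see what this variable length code buys is to compute the average number of bits per pair and compare it with a fixed length code, which would need 2 bits to distinguish four events. This is a sketch using the probabilities and code words from the example; the variable names are illustrative.

```python
probs = {"A": 0.64, "B": 0.16, "C": 0.16, "D": 0.04}   # pairing probabilities
codes = {"A": "0", "B": "10", "C": "110", "D": "111"}  # code words from the example

# Expected code length: each event's length weighted by how often it occurs.
avg_bits = sum(probs[e] * len(codes[e]) for e in probs)
fixed_bits = 2  # bits per event for a fixed length code over four events

print(f"variable length: {avg_bits:.2f} bits per pair")
print(f"fixed length:    {fixed_bits} bits per pair")
```

Under these probabilities the variable length code averages 1.56 bits per pair, against 2 bits for the fixed length code.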
Variable Length Coding
Unlike fixed length codes like ASCII, variable length codes:
Assign the longest codes to the most infrequent events.
Assign the shortest codes to the most frequent events.
Each code word must be uniquely identifiable regardless of length.
Examples of Variable Length Coding: Morse Code, Huffman Coding
Take advantage of the probabilistic nature of information.
If we have total uncertainty about the information we are conveying, fixed length codes are preferred.
Morse Code
Characters represented by patterns of dots and dashes.
More frequently used letters use short code symbols.
Short pauses are used to separate the letters.
Represent “Hello” using Morse Code:
H = . . . .
E = .
L = . - . .
L = . - . .
O = - - -
Hello = . . . .   .   . - . .   . - . .   - - -
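The "Hello" example can be written as a table lookup. In this sketch a single space stands in for the short pause between letters, and the table holds only the four letters the example needs; both choices are assumptions made for illustration.

```python
# Only the letters used in "Hello"; a full table would cover the whole alphabet.
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---"}
LETTERS = {code: letter for letter, code in MORSE.items()}

def to_morse(word):
    """Encode a word, using one space as the pause between letters."""
    return " ".join(MORSE[ch] for ch in word.upper())

def from_morse(signal):
    """Decode by splitting on the pauses and looking each pattern up."""
    return "".join(LETTERS[code] for code in signal.split(" "))

encoded = to_morse("Hello")
print(encoded)
print(from_morse(encoded))
```

The pauses matter: the pattern for E (a single dot) is also the start of the pattern for H, so without a separator the receiver could not tell where one letter ends.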
Huffman Coding
Creates a Binary Code Tree
Nodes connected by branches with leaves
Top node – root
Two branches from each node
[Figure: binary code tree starting at the root, with branches labeled 0 and 1 and leaves A, B, C, D]
The Huffman coding procedure finds the optimum, uniquely decodable, variable length code associated with a set of events, given their probabilities of occurrence.
Huffman Coding
A 0
B 10
C 110
D 111
Given the adjacent Huffman code tree, decode the following sequence: 11010001110
[Figure: the same binary code tree, with branches labeled 0 and 1 and leaves A, B, C, D]
110 C
10 B
0 A
0 A
111 D
0 A
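Decoding with a code tree amounts to starting at the root, following one branch per bit, and emitting an event each time a leaf is reached. The sketch below represents the tree as nested pairs; that layout is an assumption chosen to match the codes A = 0, B = 10, C = 110, D = 111.

```python
# Leaves are event names; each internal node is a (branch_0, branch_1) pair.
TREE = ("A", ("B", ("C", "D")))

def decode_with_tree(bits, tree=TREE):
    """Walk the tree one bit at a time, emitting an event at every leaf."""
    events = []
    node = tree
    for bit in bits:
        node = node[int(bit)]          # follow branch 0 or branch 1
        if isinstance(node, str):      # reached a leaf
            events.append(node)
            node = tree                # start again at the root
    if node is not tree:
        raise ValueError("bit string ends in the middle of a code word")
    return events

print(decode_with_tree("11010001110"))
```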
Huffman Code Construction
First list all events in descending order of probability.
Event A .3
Event B .3
Event C .13
Event D .12
Event E .1
Event F .05
Pair the two events with lowest probabilities and add their probabilities.
Event E (.1) + Event F (.05) = 0.15
Huffman Code Construction
Repeat for the pair with the next lowest probabilities.
Event C (.13) + Event D (.12) = 0.25
Huffman Code Construction
Repeat for the pair with the next lowest probabilities.
0.15 (Events E, F) + 0.25 (Events C, D) = 0.4
Huffman Code Construction
Repeat for the pair with the next lowest probabilities.
Event A (.3) + Event B (.3) = 0.6
Huffman Code Construction
Repeat for the last pair and add 0s to the left branches and 1s to the right branches.
0.6 + 0.4 = 1.0
Resulting codes:
Event A .3 00
Event B .3 01
Event C .13 100
Event D .12 101
Event E .1 110
Event F .05 111
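The construction steps can be sketched with a priority queue that always hands back the two least likely nodes. This is one possible implementation rather than the text's own. The 0/1 labeling (the more likely node of each pair gets 0) and the tie-breaking rule are assumptions chosen so that the output matches the codes on the slide; other equally valid choices give different bit patterns with the same lengths.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build {event: code} by repeatedly pairing the two least likely nodes."""
    # Tie-breaker: among equal probabilities, the event listed later is taken first.
    tiebreak = count(0, -1)
    heap = [(p, next(tiebreak), {event: ""}) for event, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_low, _, low = heapq.heappop(heap)     # least likely node
        p_next, _, nxt = heapq.heappop(heap)    # second least likely node
        merged = {e: "0" + c for e, c in nxt.items()}        # more likely branch gets 0
        merged.update({e: "1" + c for e, c in low.items()})  # less likely branch gets 1
        heapq.heappush(heap, (p_low + p_next, next(tiebreak), merged))
    return heap[0][2]

probs = {"A": 0.3, "B": 0.3, "C": 0.13, "D": 0.12, "E": 0.1, "F": 0.05}
code = huffman_code(probs)
avg_bits = sum(probs[e] * len(code[e]) for e in probs)

for event in sorted(code):
    print(event, code[event])
print(f"average length: {avg_bits:.2f} bits per event")
```

For these six events the average comes to 2.4 bits per event, compared with 3 bits for a fixed length code.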
Exercise
Given the code we just constructed:
Event A: 00
Event B: 01
Event C: 100
Event D: 101
Event E: 110
Event F: 111
How can you decode the string: 0000111010110001000000111?
Starting from the leftmost bit, find the shortest bit pattern that matches one of the codes in the list. The first bit is 0, but we don’t have an event represented by 0. We do have one represented by 00, which is event A. Continue applying this procedure:
00 A
00 A
111 F
01 B
01 B
100 C
01 B
00 A
00 A
00 A
111 F
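The shortest-match procedure described in the exercise can be sketched as a table lookup that grows a candidate pattern one bit at a time. The function names are illustrative. The procedure is safe here because no code word is the beginning of another, which the second helper checks.

```python
CODES = {"00": "A", "01": "B", "100": "C", "101": "D", "110": "E", "111": "F"}

def decode_shortest_match(bits, codes=CODES):
    """Read left to right, emitting an event as soon as the bits match a code word."""
    events, current = [], ""
    for bit in bits:
        current += bit
        if current in codes:       # shortest pattern that matches a code word
            events.append(codes[current])
            current = ""
    if current:
        raise ValueError("leftover bits do not form a code word: " + current)
    return "".join(events)

def is_prefix_free(words):
    """True if no code word is the start of a different code word."""
    return not any(a != b and b.startswith(a) for a in words for b in words)

print(decode_shortest_match("0000111010110001000000111"))
print(is_prefix_free(CODES))
```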
Universal Coding
Huffman has its limits:
We must know a priori the probability of the characters or symbols we are encoding.
What if a document is “one of a kind?”
Universal Coding schemes do not require a knowledge of the statistics of the events to be coded.
Universal Coding is based on the realization that any stream of data consists of some repetition.
Lempel-Ziv coding is one form of Universal Coding presented in the text.
Compression results from reusing frequently occurring strings.
Works better for long data streams. Inefficient for short strings.
Used by WinZip to compress information.
Lempel-Ziv Coding
The basis for Lempel-Ziv coding is the idea that we can achieve compression of a string by always coding a series of zeroes and ones as some previous string (prefix string) plus one new bit.
Compression results from reusing frequently occurring strings.
We will not go through Lempel-Ziv coding in detail.
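For readers who want a feel for the "previous string plus one new bit" idea, here is a minimal sketch in the style of LZ78 parsing. It is an assumption that this matches the variant presented in the text, and it is not a description of what WinZip does internally. The function names, the example bit string, and the (index, bit) output format are all illustrative.

```python
def lz_parse(bits):
    """Split bits into phrases, each a previously seen phrase plus one new bit.

    Returns (prefix_index, new_bit) pairs; index 0 means the empty string.
    """
    dictionary = {"": 0}
    output, phrase = [], ""
    for bit in bits:
        if phrase + bit in dictionary:
            phrase += bit                      # keep extending a known phrase
        else:
            output.append((dictionary[phrase], bit))
            dictionary[phrase + bit] = len(dictionary)
            phrase = ""
    if phrase:                                 # input ended inside a known phrase
        output.append((dictionary[phrase], ""))
    return output

def lz_unparse(pairs):
    """Rebuild the bit string from (prefix_index, new_bit) pairs."""
    phrases, out = [""], []
    for index, bit in pairs:
        phrase = phrases[index] + bit
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

pairs = lz_parse("1011010100010")
print(pairs)
print(lz_unparse(pairs))
```

On this input the phrases are 1, 0, 11, 01, 010, 00, 10. Longer inputs tend to reuse longer phrases, which is consistent with the earlier remark that the method works better for long data streams.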