Lempel Ziv

The LZ family
- LZ77: LZR, LZSS, LZB, LZH (used by zip and unzip)
- LZ78: LZW (Unix compress), LZC (Unix compress), LZT, LZMW, LZJ, LZFG

Overview of LZ family
To demonstrate: take a simple alphabet containing only two letters, a and b, and create a sample stream of text.

LZ family overview
Rule: separate this stream of characters into pieces of text so that each piece is the shortest string of characters that we have not seen so far.
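The splitting rule can be sketched in Python. This is a minimal illustration rather than the slides' own code: the function name `split_into_phrases` and the sample two-letter stream are assumptions made for the example.

```python
def split_into_phrases(stream):
    """Cut a stream into pieces, each the shortest string not seen so far."""
    seen = set()
    phrases = []
    current = ""
    for ch in stream:
        current += ch
        if current not in seen:
            # First time this string appears: it becomes a new piece.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # A trailing piece may repeat an earlier one if the stream ends early.
        phrases.append(current)
    return phrases


# Assumed sample stream over the alphabet {a, b}.
print(split_into_phrases("aaababbbaaabaaaaaaabaabb"))
```

Each piece is an earlier piece extended by exactly one new character, which is what makes the indexing step on the sender side possible.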

Sender: The Compressor
Before compression, the pieces of text from the breaking-down process are indexed from 1 to n.

LZ
Indices are used to number the pieces of data. The empty string (start of text) has index 0. The piece indexed by 1 is a. Thus a, together with the initial (empty) string, must be numbered 0a. String 2, aa, will be numbered 1a, because it contains a, whose index is 1, and the new character a.
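The numbering scheme can be sketched as follows, assuming each coded piece is written as a pair (index of the previously seen prefix, new character), so that 0a becomes `(0, "a")` and 1a becomes `(1, "a")`. The name `lz78_encode` and the sample stream are illustrative choices, not taken from the slides.

```python
def lz78_encode(text):
    """Encode text as (prefix index, new character) pairs; index 0 is the empty string."""
    dictionary = {"": 0}
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            # Still a piece we have seen before: keep extending it.
            phrase += ch
        else:
            # New piece: known prefix plus one new character.
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # The stream ended inside a known piece; emit it as prefix + last character.
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


for index, ch in lz78_encode("aaababbbaaabaaaaaaabaabb"):
    print(f"{index}{ch}", end=" ")
```

For this assumed stream, the piece aab is coded as 2b: the prefix aa has index 2 and b is the new character.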

LZ
The process of renaming pieces of text starts to pay off: small integers replace what were once long strings of characters. We can now throw away our old stream of text and send the encoded information to the receiver.

Bit Representation of Coded Information
- Now we want to calculate the number of bits needed.
- Each chunk is an integer and a letter.
- The number of bits for the integer depends on the size of table permitted in the dictionary.
- Every character will occupy 8 bits because it will be represented in US ASCII format.

Compression good?
- In a long string of text, the number of bits needed to transmit the coded information is small compared to the actual length of the text.
- Example: 12 bits to transmit the code 2b instead of 24 bits (8 + 8 + 8) needed for the actual text aab.
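The arithmetic behind the example can be sketched as below. The 12-bit figure is consistent with a 4-bit index plus an 8-bit character, which would correspond to a dictionary of at most 16 entries; that table size is an inference, not something the text states. The helper name `chunk_bits` is hypothetical.

```python
def chunk_bits(table_size, char_bits=8):
    """Bits for one coded chunk: a dictionary index plus one character."""
    # Bits needed to address indices 0 .. table_size - 1 (at least 1 bit).
    index_bits = max(1, (table_size - 1).bit_length())
    return index_bits + char_bits


coded = chunk_bits(16)       # assumed 16-entry table: 4-bit index + 8-bit character
plain = 8 * len("aab")       # three 8-bit characters sent uncompressed
print(coded, plain)
```

A larger table needs more index bits per chunk, so the saving only appears once pieces are long enough to outweigh the index cost.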

Receiver: The Decompressor (Implementation)
- The receiver knows exactly where the boundaries are, so there is no problem in reconstructing the stream of text.
- It is preferable to decompress the file in one pass; otherwise, we will encounter a problem with temporary storage.
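A one-pass receiver can be sketched as follows, assuming the coded information arrives as (index, character) pairs with index 0 standing for the empty string. The name `lz78_decode` and the input pairs are illustrative.

```python
def lz78_decode(pairs):
    """Rebuild the text from (prefix index, new character) pairs in one pass."""
    phrases = [""]          # index 0 is the empty string
    out = []
    for index, ch in pairs:
        # Each pair names an earlier piece and extends it by one character.
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = [(0, "a"), (1, "a"), (0, "b"), (1, "b"), (3, "b")]
print(lz78_decode(pairs))
```

The receiver rebuilds the same indexed list of pieces as the sender while it reads, so no dictionary has to be transmitted.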

Lempel-Ziv applet
See http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic23/#JavaApplet

Lempel Ziv Welch (LZW)
- previous methods worked only on characters
- LZW works by encoding strings
- some strings are replaced by a single codeword
- for now assume the codeword length is fixed (12 bits)
- for 8 bit characters, the first 256 (or fewer) entries in the table are reserved for the characters
- the rest of the table (entries 257 to 4096, i.e. codes 256 to 4095) represents strings

LZW compression
- the trick is that the string-to-codeword mapping is created dynamically by the encoder
- it is also recreated dynamically by the decoder
- there is no need to pass the code table between the two
- it is a lossless compression algorithm
- the degree of compression is hard to predict
- it depends on the data, but gets better as the codeword table contains more strings

LZW encoder
    Initialize table with single character strings
    STRING = first input character
    WHILE not end of input stream
        CHARACTER = next input character
        IF STRING + CHARACTER is in the string table
            STRING = STRING + CHARACTER
        ELSE
            Output the code for STRING
            Add STRING + CHARACTER to the string table
            STRING = CHARACTER
    END WHILE
    Output the code for STRING
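The encoder pseudocode translates almost line for line into Python. The sketch below assumes 8-bit characters (codes 0 to 255 preloaded) and, for simplicity, does not enforce the 12-bit limit on the table size; the name `lzw_encode` is an illustrative choice.

```python
def lzw_encode(text):
    """LZW-encode a string of 8-bit characters into a list of integer codes."""
    # Initialize the table with all single-character strings.
    table = {chr(i): i for i in range(256)}
    if not text:
        return []
    string = text[0]
    codes = []
    for character in text[1:]:
        if string + character in table:
            string += character
        else:
            codes.append(table[string])
            # The next free code is simply the current table size.
            table[string + character] = len(table)
            string = character
    codes.append(table[string])
    return codes


print(lzw_encode("BABAABAAA"))  # [66, 65, 256, 257, 65, 260]
```

Nine input characters become six codes; codes 256 and above stand for strings the encoder added to the table while reading.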

Demonstrations
Another animated LZ algorithm: http://www.data-compression.com/lempelziv.html

LZW encoder example
Compress the string BABAABAAA

LZW decoder

Lempel-Ziv compression
- a lossless compression algorithm
- all encodings have the same length
- but may represent more than one character
- uses a dictionary approach: keeps track of characters and character strings already encountered

LZW decoder example decompress the string
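A decoder can be sketched as the mirror image of the encoder. The code below is a minimal illustration assuming 8-bit characters and an unbounded table; the name `lzw_decode` and the input code sequence (the codes an LZW encoder produces for BABAABAAA) are assumptions, not necessarily the sequence used on the slide.

```python
def lzw_decode(codes):
    """Rebuild the original string from a list of LZW codes."""
    table = {i: chr(i) for i in range(256)}
    codes = iter(codes)
    try:
        previous = table[next(codes)]
    except StopIteration:
        return ""
    out = [previous]
    for code in codes:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # Code not yet in the table: it must be previous + its own first character.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        # Recreate the entry the encoder added one step earlier.
        table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)


print(lzw_decode([66, 65, 256, 257, 65, 260]))  # BABAABAAA
```

The last code, 260, arrives before the decoder has created that entry, which is why the special case for `code == len(table)` is needed.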

LZW Issues
- compression gets better as the code table grows
- what happens when all 4096 locations in the string table are used?
- a number of options, but encoder and decoder must agree to do the same thing:
  - do not add any more entries to the table (use it as is)
  - clear the codeword table and start again
  - clear the codeword table and start again with a larger table/longer codewords (GIF format)
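The first option (stop adding entries once the table is full) can be sketched as below. This is one possible reading of that policy, with hypothetical names `lzw_encode_capped`, `lzw_decode_capped` and `max_codes`; the point is only that both sides apply the same rule at the same moment.

```python
def lzw_encode_capped(text, max_codes=4096):
    """LZW encoder that freezes the table once it holds max_codes entries."""
    table = {chr(i): i for i in range(256)}
    codes, string = [], ""
    for character in text:
        if string + character in table:
            string += character
        else:
            codes.append(table[string])
            if len(table) < max_codes:      # table full: use it as is
                table[string + character] = len(table)
            string = character
    if string:
        codes.append(table[string])
    return codes


def lzw_decode_capped(codes, max_codes=4096):
    """Matching decoder: applies the same freeze rule as the encoder."""
    table = {i: chr(i) for i in range(256)}
    out, previous = [], None
    for code in codes:
        if code in table:
            entry = table[code]
        elif previous is not None and code == len(table) < max_codes:
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        out.append(entry)
        if previous is not None and len(table) < max_codes:
            table[len(table)] = previous + entry[0]
        previous = entry
    return "".join(out)


# A tiny table (258 entries) forces the freeze after two new strings.
codes = lzw_encode_capped("BABAABAAA", max_codes=258)
print(codes, lzw_decode_capped(codes, max_codes=258))
```

With the table frozen, later repeats fall back to shorter matches, so the output is longer than with an unbounded table but still decodes correctly.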

LZW advantages/disadvantages
- advantages
  - simple, fast and good compression
  - can do compression in one pass
  - dynamic codeword table built for each file
  - decompression recreates the codeword table so it does not need to be passed
- disadvantages
  - not the optimum compression ratio
  - actual compression hard to predict

Entropy methods
- all previous methods are lossless and entropy based
- lossless methods are essential for computer data (zip, gzip, etc.)
- a combination of run length encoding and Huffman coding is a standard tool
- they are often used as a subroutine by lossy methods (JPEG, MPEG)

