
University of Ljubljana

Institute of Mathematics, Physics and Mechanics
Department of Mathematics

Jadranska 19, 1000 Ljubljana, Slovenia

Preprint series, Vol. 36 (1998), 600

SUB-LINEAR DECODING OF HUFFMAN CODES ALMOST IN-PLACE

Andrej Brodnik   Svante Carlsson

ISSN 1318-4865

Ljubljana, May 7, 1998

Sub-linear Decoding of Huffman Codes Almost In-Place

Andrej Brodnik*   Svante Carlsson**

May 7, 1998

Abstract

We present a succinct data structure storing the Huffman encoding that permits sub-linear decoding in the number of transmitted bits. The size of the extra storage, except for the storage of the symbols in the alphabet, is O(l log N) bits, where l is the longest Huffman code and N is the number of symbols in the alphabet. We present a solution that typically decodes texts, over alphabets ranging from a few up to more than sixty thousand symbols, with only one third to one fifth of the number of memory accesses of regular Huffman implementations. In our solution, the overhead structure, to which we make all but one of the memory accesses, is never more than a few hundred bytes. It will therefore, with very high probability, reside in the cache, which means that the actual decoding time compares even better.

1 Introduction

If you have an alphabet of N symbols that you would like to encode, the typical solution is to use ⌈log N⌉ bits to encode N different numbers, each number corresponding to a symbol. This is done, for example, in ASCII coding. If all symbols in a text that should be coded are equally frequent, this also gives a minimal-size encoding in the number of bits used. The encoding of a symbol is done by a standard dictionary look-up, and the decoding is done by a constant-time table look-up.

If the frequencies of the symbols vary, one can construct a smaller encoding of a given text. Let us arrange the N symbols with normalized frequencies f_i (0 ≤ i < N) at the leaves of a binary tree, where l_i is the depth of the i-th leaf. The binary tree with the minimal weighted external path length

    l̄ = min Σ_{i=0}^{N-1} f_i l_i                                           (1)

is called the Huffman tree (cf. [2, 5]).

A path from the root to a leaf in a binary tree can be encoded as a binary string in which descending to the left (right) is represented by a 0 (1). Using such a technique we can uniquely encode each symbol in the text. Moreover, these encodings are prefix-free. The encoding constructed using a Huffman tree is called the Huffman encoding.

* Institute of Mathematics, Physics and Mechanics, Ljubljana, Slovenia, and Department of Computer Science, Luleå University, Sweden.
** Department of Computer Science, Luleå University, Sweden.

It is easy to see that the Huffman encoding, because of the property from eq. (1), gives the best possible compression of a finite text if one has to use one static code for each symbol of the alphabet. Therefore the encoding is extensively used in various fields of computer science, such as picture compression, HDTV, data transmission, etc. We gain this decrease of the size of the encoded text at the cost of increased decoding time, which is linear in the number of bits of the text.

In this paper we consider the size of the data structure needed to store the Huffman tree and the time necessary to decode a given Huffman code of a symbol. The best solution known so far requires D + O(N) space for the data structure (D is the size of the dictionary storing the symbols) and O(l) time to decode a symbol (cf. [4]). Moreover, the data structure, and hence the accompanying algorithm, is very involved. In this paper we improve the space bound to D + O(l log N) bits, with a small hidden constant (l is the longest Huffman code), and the time bound to O(log l') worst case (l' is the number of different lengths of Huffman codes). The average decoding time for any Huffman tree is bounded by O(log log N). In fact, if the symbols have equal probabilities we have an in-place, constant-time decoding solution.

The paper has the usual structure: first we do the homework by browsing through the literature for similar and related solutions; then we present general ideas on how to change the Huffman tree without changing the cost of encoding described by the optimization function in eq. (1). In the core of the paper we present the basic structure and the decoding algorithm. This solution is further improved to the best possible on average time-wise, and to almost optimal space-wise. Finally, we present some practical results of our algorithms as they are used in the Phone Book of Slovenia, showing a remarkable speed-up and space reduction.

Notation: In this paper we consider symbols with normalized frequencies f_i, 0 ≤ i < N, where Σ_{i=0}^{N-1} f_i = 1. The length of the Huffman code of the i-th symbol is l_i bits, the longest code is l bits long, the number of different code lengths is l', and the weighted length of all codes is l̄ = Σ_{i=0}^{N-1} f_i l_i. To write down all the symbols we need D space.

2 Homework

Construction of the Huffman codes minimizes eq. (1) over all possible trees (cf. [5]). There exists a simple greedy algorithm which, from the list of symbols with frequencies, constructs the Huffman codes in O(N log N) time (cf. [2]). The algorithm first sorts the symbols in descending order of frequencies (Θ(N log N) time) and then constructs the Huffman tree (Θ(N) time). The whole construction takes O(N) registers of space, but can be brought down to N + O(1) registers [9].

There are numerous variations of the basic problem. In the basic version we have as input the list of symbols with their frequencies. However, we do not always have the possibility to construct such a list; in this case we use dynamic Huffman coding (cf. [13, 14]). An even more generalized version of the problem is when we do not have the list of symbols either (cf. [7]). Yet another version of the problem is to construct Huffman codes that must be shorter than some predefined constant (cf. [8]). The solution presented in this paper can be applied to all these different versions of the problem. However, our solution only substantially speeds up their decoding process, while, in general, it does not decrease the size of the data structure.

Once the Huffman encoding is constructed, the problem of its decoding appears. The decoding algorithm has to be very fast and simple. The main reason is that it usually runs on cheap hardware without much memory (e.g. when used for HDTV etc.). Therefore a number of different decoding algorithms have been constructed which try to minimize the size of the data structure and simultaneously decrease the decoding time (cf. [1, 4, 6, 11]). All of the mentioned algorithms traverse the Huffman tree constructed by the common "merging algorithm" (cf. [2]). They apply at least one of the following two techniques: they "wrap" the Huffman tree in such a way that the leaves are replaced by the root of the tree, and/or they layer the tree (graph) into a small number of layers (preferably a constant number) that are inspected quickly. Technically, the first technique gives a directed graph which needs to be traversed, and when the root is reached the code is decoded (cf. finite automaton), while the second technique gives an r-ary trie (graph). Applying only the first technique can give us an O(l) time and D + O(N) space solution, while applying only the second technique can give us an O(1) time and O(D + 2^l) space solution. Obviously, there is a wide spectrum of solutions between both techniques which, however, employ involved pointer-passing data structures.
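To make the second (layering) technique concrete in its extreme form, the sketch below builds the full look-up table with 2^l entries, so that a whole code is decoded with a single table access. The names and the interface are ours, not taken from any of the cited implementations; the only assumption is that the code and the code length of every symbol are already known.

#include <stdlib.h>

typedef struct {
    unsigned symbol;      /* index of the decoded symbol        */
    unsigned codeLength;  /* number of bits actually consumed   */
} tFullEntry;

/* Build a table with 2^l entries (l <= 31 assumed): every l-bit window
 * that starts with the code of symbol s maps to (s, length of its code). */
tFullEntry *buildFullTable(unsigned l, unsigned N,
                           const unsigned *code, const unsigned *len)
{
    tFullEntry *table = malloc(((size_t)1 << l) * sizeof *table);
    if (table == NULL) return NULL;

    for (unsigned s = 0; s < N; s++) {
        unsigned pad = l - len[s];                 /* unused low-order bits */
        for (unsigned fill = 0; fill < (1u << pad); fill++) {
            unsigned window = (code[s] << pad) | fill;
            table[window].symbol     = s;
            table[window].codeLength = len[s];
        }
    }
    return table;   /* decode: read l bits ahead, look the window up,      */
}                   /* emit .symbol and advance the input by .codeLength   */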

3 Mutation of the Huffman Tree and the Basic Data Structure

Vitter [13, 14] states the following important, so-called characterization of Huffman trees (cf. Figure 1):

Lemma 1 (Sibling Property) A binary tree with N leaves is a Huffman tree iff:

1. the N leaves have nonnegative weights f_i (0 ≤ i < N) and the weight of each internal node is the sum of the weights of its children, and

2. the nodes can be numbered by weight so that nodes 2j − 1 and 2j are siblings (1 ≤ j < N) and their common parent node is higher in the numbering.

Figure 1: An example of a Huffman tree with numbered nodes.

The important consequence of the Lemma is:

Corollary 1 The nodes of a Huffman tree are sorted by their weights level by level from top to bottom. This is true for all nodes together, and separately for the internal (black in Figure 1) and the external (gray in Figure 1) nodes.

In the rest of the text we will refer to the numbers in Figure 1 as labels. In particular, the labels at the leaves can, w.l.o.g., be considered the symbols we are encoding.

Because of the Ω(N log N) lower bound on sorting in the comparison-based model, and since we can traverse the nodes of a binary tree level by level in O(N) time, Corollary 1 gives us a trivial lower bound:

Lemma 2 Given a list of N symbols with frequencies f_i (0 ≤ i < N), it takes Ω(N log N) time to construct the Huffman tree.

A straightforward implementation of the Huffman tree uses at least 2(N − 1) pointers to represent the tree. It is easy to see that 2N − o(N) bits are necessary to represent the shape of the tree, since the Huffman tree can take any shape under an appropriate assignment of probabilities to the symbols.

Decoding a bit string into the corresponding symbol takes time proportional to the number of bits in the string (l_i). This is optimal in the pointer machine model.

3.1 Mutation of the Huffman Tree

Since the lower bound on the size of the structure depends on the number of different shapes of Huffman trees, and the decoding lower bound depends on the pointer machine model, we will restrict the number of possible shapes and allow table look-ups to beat these lower bounds.

We will refer to the internal nodes of a Huffman tree as black nodes, to the external nodes (leaves) as gray nodes, and to the missing nodes (nodes that would be descendants of gray nodes) as white nodes.

The crucial step in the construction of the minimal-size Huffman tree is the following rearrangement, producing the mutated Huffman tree (cf. Figure 2):

Definition 1 We rearrange (permute) the nodes on level i (0 ≤ i ≤ l) of the Huffman tree so:

1. that first (on the left) come the b_i black nodes, then the g_i gray nodes, and at the far end the w_i white nodes,¹ and

2. that the relative order of nodes of the same colour does not change.

A special case of the mutated Huffman tree is a left-aligned Huffman tree.

During the process of node rearrangement the depth of a leaf does not change, and thus the length of its code does not change either. Therefore the code generated from the mutated tree remains optimal. Moreover, Corollary 1 is valid for the mutated tree as well. In the rest of the paper we will refer to the Huffman code as the code produced by using the mutated Huffman tree.

3.2 Basic Data Structure

Before giving the representation of the basic data structure and the decoding algorithm we need the following lemmata.

¹ Obviously, b_i + g_i + w_i = 2^i.

Figure 2: The mutated Huffman tree of Figure 1.

Lemma 3 The symbol stored at the g-th gray node (0 ≤ g < g_i) on level l_i of the mutated Huffman tree has the Huffman code corresponding to the binary representation, using l_i bits, of the number b_i + g.

Proof: Cutting the mutated Huffman tree at level l_i gives a complete binary tree with b_i black leaves, g_i gray leaves and w_i white leaves; in particular, the g-th gray leaf is the (b_i + g)-th leaf of the cut tree. Obviously, the binary representation, using l_i bits, of the number b_i + g represents exactly the path from the root to this leaf. QED
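The lemma translates directly into code. The following sketch, with illustrative type and field names, prints the code of every symbol of the mutated tree given only the per-level black and gray counts:

#include <stdio.h>

typedef struct {
    unsigned depth;    /* level l_i                          */
    unsigned blacks;   /* b_i : internal nodes on the level  */
    unsigned grays;    /* g_i : leaves on the level          */
} tLevel;

/* The g-th gray node on level l_i (0 <= g < g_i) gets, as its code,
 * the l_i-bit binary representation of the number b_i + g.          */
void printCodes(const tLevel *lev, unsigned numLevels)
{
    for (unsigned i = 0; i < numLevels; i++)
        for (unsigned g = 0; g < lev[i].grays; g++) {
            unsigned code = lev[i].blacks + g;
            for (int bit = (int)lev[i].depth - 1; bit >= 0; bit--)
                putchar(((code >> bit) & 1u) ? '1' : '0');
            putchar('\n');
        }
}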

Lemma 4 Let c be an l_i bits long bit-string. If

  • c < b_i, then c is a prefix of some Huffman code,

  • b_i ≤ c < b_i + g_i, then c is a Huffman code, and

  • b_i + g_i ≤ c, then c contains some Huffman code as a proper prefix.

Proof: The second case follows trivially from Lemma 3, while the first and the last follow from the definition of the mutated Huffman tree (see Definition 1). QED
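The case analysis of Lemma 4 is the single test both of our decoding algorithms rely on; a minimal, self-contained sketch of it (the enum and function names are ours):

typedef enum { PREFIX_OF_CODE, IS_CODE, CONTAINS_CODE } tClass;

/* Classify an l_i-bit string c, given the black count b_i and the
 * gray count g_i of level l_i of the mutated Huffman tree.         */
tClass classify(unsigned c, unsigned blacks, unsigned grays)
{
    if (c < blacks)         return PREFIX_OF_CODE;  /* black node: read more bits  */
    if (c < blacks + grays) return IS_CODE;         /* gray node : c is a code     */
    return CONTAINS_CODE;                           /* white node: we went too far */
}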

Now we are ready to show how we can get below the previous space lower bound by using the mutated Huffman tree.

Theorem 1 There exists a data structure that uses only O(l log N) bits, besides what is necessary to store the alphabet, and that permits O(l) worst-case and O(l̄) average-case decoding time.

Proof: By Corollary 1 the leaves of the mutated Huffman tree are sorted by non-decreasing weight from the bottom to the top. This brings us to the following two-part data structure:

  • a sorted list of symbols, the table (size D), and

  • an index of l pairs (b_i, indx_i), where indx_i is the index of the first leaf of level l_i in the table (size O(l log N)).

The decoding algorithm reads (shifts) in one bit at a time until the number code represented by the l_i-bit shifted-in bit-string falls into the range b_i ≤ code < b_i + g_i. The worst-case running time of the algorithm is obviously O(l), while its correctness follows directly from Lemma 4. The average-case running time is proportional to the sum of f_i · l_i over all symbols; hence by eq. (1) it is O(l̄). QED

Note that sortedness of the symbols is not a precondition; it is sufficient that the leaves of a given level of the tree are stored in consecutive locations of the table.

An important observation for this, and for all our new algorithms, is that only one reference is made to a symbol stored in the table when it is decoded. The rest of the time is spent in a very small table of indices that will almost always be in the cache.

Most Huffman trees do not have leaves on all levels, but only on a few. An immediate improvement of the algorithm from the previous theorem is not to search linearly through all levels, but only through those l' levels that have leaves. By keeping track of the number of levels to skip (bits to shift in) in the small index table we get the solution presented in Algorithm 1. Hence:

Corollary 2 Algorithm 1 uses a data structure of size D + O(l' log N) bits and decodes a Huffman code in O(l') worst-case time.

The average time for decoding using Algorithm 1 depends very much on the probability distribution, but in any case it is never bigger than l̄.

/* Illustrative type widths; ShiftIn is assumed to be a macro that appends
   the next `bits` bits of the input stream to `code`.                     */
typedef unsigned short tNumNodes;
typedef unsigned short tNumLevels;

typedef struct {
    unsigned short skipBits;   /* bits to shift in to reach the next indexed level */
    tNumNodes      blacks;     /* b_i : internal (black) nodes on the level        */
    tNumNodes      grays;      /* g_i : leaves (gray nodes) on the level           */
} tIndx;

static const tIndx indx[/* l' entries */] = { /* ... */ };

tNumNodes Decode (void)
{
    tNumNodes  indxOfSymbol = 0;   /* leaves passed on the levels above */
    tNumNodes  code         = 0;   /* bits shifted in so far            */
    tNumLevels i            = 0;

    while (indx[i].blacks > code) {        /* code is still a prefix (black node) */
        indxOfSymbol += indx[i].grays;     /* skip the leaves of this level       */
        ShiftIn(indx[i].skipBits, code);   /* read more bits of the code          */
        i++;
    }
    return indxOfSymbol + (code - indx[i].blacks);
} /* Decode */

Algorithm 1: Linear decoding algorithm.

4 Faster Decoding

In Algorithm 1 we search for the level of the symbol linearly from the top; in this section we improve the solution by using binary search on the levels of the mutated Huffman tree (cf. [3]); for a toy example see Figure 3. The search, when probing some level, shifts in or out the appropriate number of bits from the input stream and checks whether the computed code corresponds to a black, gray or white node.

Figure 3: An example of a binary search tree on the levels of the mutated Huffman tree.

If the corresponding node is:

  • black, we are too high in the tree and have to go down,

  • gray, we have found the symbol, and

  • white, we are too low in the tree and have to go up.

Hence:

Theorem 2 There exists a data structure of size D + O(l' log N) bits that permits decoding of a Huffman code in O(log l') time.

Blindly using logarithmic search on the levels can actually give a higher average search cost than the linear search; this will be the case if, for instance, the frequencies decrease exponentially. Instead, we should build an optimal binary search tree on those levels of the Huffman tree that contain leaves, where a level is weighted by the sum of the frequencies of the leaves on it.

We want to show that the average decoding time for any Huffman tree is bounded by O(log log N) if we use the optimal search tree for searching the levels. To do this we first have to prove some properties of the Huffman tree.

Lemma 5 The total weight of a non-leaf subtree rooted on level i of a Huffman tree is at most (2/3)^i. The weight of a leaf on level i is at most (2/3)^(i−1) (i ≥ 1).

Proof: Consider an internal node that is a child of the root of the Huffman tree. If the total weight of that subtree were more than 2/3, then its sibling would have a weight of less than 1/3, while one of its children would have a weight of more than 1/3. This contradicts Corollary 1, since a node on a higher level would have a lower weight than a node on a lower level.

Similarly, a grandchild of a node can never have a weight higher than 2/3 of the weight of that node, since it would then have to be a child of that node.

Using the same argument in an induction we get the stated bounds. QED

Lemma 6 A Huffman tree with N symbols can have only O(log N) levels on which the total weight of the leaves exceeds 1/N^2.

Proof: From Lemma 5 we know that only the topmost 2 log_{3/2} N levels can each have a single leaf and still have a total weight bigger than 1/N^2.

On the following levels we know that level (2 log_{3/2} N) + i must contain at least (3/2)^i leaves to make the total weight of the leaves on it at least 1/N^2. This follows almost directly from the previous lemma.

Since there are only N leaves in total, they can make only log_{3/2} N levels below level 2 log_{3/2} N have a total weight of at least 1/N^2. In total, there can only be a logarithmic number of levels with that high a weight. QED

Using this observation we can show the following theorem:

Theorem 3 There exists a data structure of size D + O(l' log N) bits that permits decoding of a Huffman code in O(log log N) average time for any probability distribution.

Proof: Consider the following binary search over the levels. Construct a perfectly balanced binary search tree of the levels whose weight is at least 1/N^2. The rest of the levels are then inserted one by one into the balanced search tree by a standard insertion method.

From Lemma 6 we know that there can only be a logarithmic number of levels in the top part of the search tree, and each of them can be found in O(log log N) time.

There can be a linear number of levels in the light part, and each of them can take at most linear time to find, but since the probability of each of these levels is less than 1/N^2, they contribute only a constant additive term to the expected search time.

We have now shown that there always exists one binary search tree with a search time of O(log log N). Since the optimal binary search tree has the lowest expected search time, it is at least as good as the one we have described. This concludes the proof. QED

It is worth mentioning that this algorithm improves the worst average decoding time from the O(log N) of standard Huffman decodings. We are able to break this lower bound because we can use table look-ups instead of only following pointers.

Algorithm 2 presents a reasonably practical decoding algorithm. As shown in the listing, the constant in the size of the data structure is small: the index stores just six small fields per level. Also, for practical purposes the algorithm should be coded with goto statements, which eliminates one conditional branch and speeds up the decoding even further.

5 Tree Construction

In all Huffman tree construction algorithms the first step is to sort the symbols. This takes O(N log N) time.

The standard algorithm proceeds as a simple greedy algorithm (cf. [2]). We initially use a similar algorithm, though augmented with an auxiliary structure recording the first node on each level, and connecting all nodes in a linked list instead of building a binary tree. In the second phase we walk through all levels, count the black and gray nodes, and rearrange them according to Definition 1. This takes O(N) time.
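A minimal sketch of this second phase under simplifying assumptions: given only the code lengths of the N symbols, it counts the gray nodes per level, derives the black counts bottom-up (every internal node on a level has exactly two children on the next level), and fills an index of (skipBits, blacks, grays) records of the kind used by Algorithm 1. All names are illustrative, and the degenerate one-symbol alphabet is ignored.

#define MAX_LEVELS 64   /* illustrative upper bound on the longest code */

typedef struct {
    unsigned skipBits;  /* bits to shift in after testing this level (as in Algorithm 1) */
    unsigned blacks;    /* b_i */
    unsigned grays;     /* g_i */
} tIndxEntry;

/* Build the per-level index from the code lengths len[0..N-1] (len[s] < MAX_LEVELS).
 * Returns the number of index entries.                                               */
unsigned buildIndex(const unsigned *len, unsigned N, tIndxEntry idx[MAX_LEVELS])
{
    unsigned grays[MAX_LEVELS] = {0}, blacks[MAX_LEVELS] = {0};
    unsigned levels[MAX_LEVELS];
    unsigned l = 0, entries = 0;

    for (unsigned s = 0; s < N; s++) {            /* g_i = number of leaves on level i */
        grays[len[s]]++;
        if (len[s] > l) l = len[s];
    }
    for (unsigned i = l; i > 0; i--)              /* b_{i-1} = (b_i + g_i) / 2         */
        blacks[i - 1] = (blacks[i] + grays[i]) / 2;

    for (unsigned i = 0; i <= l; i++)             /* keep the root level and every     */
        if (i == 0 || grays[i] > 0)               /* level that carries leaves         */
            levels[entries++] = i;

    for (unsigned e = 0; e < entries; e++) {
        idx[e].blacks   = blacks[levels[e]];
        idx[e].grays    = grays[levels[e]];
        idx[e].skipBits = (e + 1 < entries) ? levels[e + 1] - levels[e] : 0;
    }
    return entries;
}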

If we choose to use Algorithm 2 instead of Algorithm 1, in the previous step we also compute the total weight of the leaves on each level, which does not change the running time. However, we need to add one more step: the construction of the optimal binary search tree. For this we use dynamic programming, which takes O(l'^2 log l') time. In the worst, and very pathological, case l' can be as large as N − 1.
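For completeness, a plain dynamic-programming sketch of the optimal binary search tree over the leaf-carrying levels is given below; w[i] is the total leaf weight of the i-th such level. This is the straightforward O(n^3) formulation and we make no claim that it is the exact variant behind the bound quoted above (faster variants exist).

#include <float.h>

#define MAX_L 64                       /* illustrative bound on l'            */

/* cost[i][j]: minimal expected probe count for an optimal BST over the
 * levels i..j (inclusive, 0-based); root[i][j]: index of its root level.     */
static double cost[MAX_L][MAX_L];
static int    root[MAX_L][MAX_L];

void optimalBST(const double *w, int n)
{
    double sum[MAX_L + 1];                      /* prefix sums of the weights */
    sum[0] = 0.0;
    for (int i = 0; i < n; i++) sum[i + 1] = sum[i] + w[i];

    for (int i = 0; i < n; i++) { cost[i][i] = w[i]; root[i][i] = i; }

    for (int len = 2; len <= n; len++)
        for (int i = 0; i + len - 1 < n; i++) {
            int j = i + len - 1;
            double total = sum[j + 1] - sum[i];  /* every probe pays this once */
            cost[i][j] = DBL_MAX;
            for (int r = i; r <= j; r++) {       /* try every level as root    */
                double left  = (r > i) ? cost[i][r - 1] : 0.0;
                double right = (r < j) ? cost[r + 1][j] : 0.0;
                if (left + right + total < cost[i][j]) {
                    cost[i][j] = left + right + total;
                    root[i][j] = r;
                }
            }
        }
}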

The described procedure is used for constructing and mutating a static Huffman tree. When considering a dynamic Huffman tree we use the structure presented by Vitter in [13, 14] and mutate the tree when a change occurs.

/* Types and the bit-stream conventions are as in Algorithm 1; tCodeLength is an
   unsigned integer type wide enough to hold the longest code, and ShiftOutBits is
   assumed to return the last `bits` bits of `code` to the input stream.
   upper and lower are the entries to probe next when moving up or down in the
   binary search tree on the levels, and shiftBits is the depth difference to the
   parent entry in that search tree.                                               */
typedef struct {
    tCodeLength shiftBits;
    tNumNodes   blacks;        /* b_i                                          */
    tNumNodes   grays;         /* g_i                                          */
    tCodeLength upper;
    tCodeLength lower;
    tNumNodes   firstOnLevel;  /* index in the symbol table of the first leaf  */
} tIndx;

static const tIndx indx[/* l' entries */] = { /* ... */ };
static const tCodeLength startAt = /* root of the search tree on the levels */;

tNumNodes Decode (void)
{
    tCodeLength code = 0;
    bool        goUp = false;
    tCodeLength i    = startAt;
    int         gray;

    for (;;) {
        if (goUp) ShiftOutBits(indx[i].shiftBits, code);
        else      ShiftInBits (indx[i].shiftBits, code);
        gray = (int)code - (int)indx[i].blacks;
        if (gray < 0)                       { i = indx[i].lower; goUp = false; } /* black: go down */
        else if (gray < (int)indx[i].grays) return indx[i].firstOnLevel + gray;  /* gray : found   */
        else                                { i = indx[i].upper; goUp = true;  } /* white: go up   */
    }
} /* Decode */

Algorithm 2: Logarithmic decoding algorithm.
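Both listings leave the bit-stream primitives unspecified. Below is one possible, purely illustrative implementation reading from an in-memory buffer; the names ShiftIn/ShiftOut and the pointer-based interface are our assumptions and differ slightly from the macro-style calls used in the algorithms above.

#include <stddef.h>

/* A simple bit reader over an in-memory buffer, most significant bit first. */
typedef struct {
    const unsigned char *buf;
    size_t bitPos;                        /* absolute position of the next bit */
} tBitStream;

static int NextBit(tBitStream *bs)
{
    int bit = (bs->buf[bs->bitPos >> 3] >> (7 - (bs->bitPos & 7))) & 1;
    bs->bitPos++;
    return bit;
}

/* Append n fresh bits from the stream to the low end of *code. */
static void ShiftIn(tBitStream *bs, unsigned n, unsigned *code)
{
    while (n--) *code = (*code << 1) | (unsigned)NextBit(bs);
}

/* Return the last n bits of *code to the stream (move the read position back). */
static void ShiftOut(tBitStream *bs, unsigned n, unsigned *code)
{
    bs->bitPos -= n;
    *code >>= n;
}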

The changes always affect only neighbouring levels (cf. [13]), and since we do not need to keep the nodes on a level sorted, the changes in the mutated tree are easier to make and occur much more seldom. A similar approach is also used with a Huffman encoding of an infinite symbol alphabet (cf. [7]).

6 Practical Results

We compare three implementations of the Huffman encoding: the classical pointer-based one, the linear one with skips from Algorithm 1, and the logarithmic one from Algorithm 2. In order to put all considered implementations on a common denominator we assume that the Huffman tree is stored in two pieces: the dictionary containing all (variable-length) symbols, and the shape data structure storing the shape of the Huffman tree (cf. Theorem 1). For each implementation we compute the size of the second piece, the shape (the dictionary is the same for all of them), and the number of comparisons performed by the decoding algorithm. Because of the two-piece data structure, all algorithms have to make one extra access to get the symbol from the dictionary.

In the pointer-based implementation we assume the usual representation of a binary tree (cf. [2]): a tree node contains a flag defining whether the node is a leaf or not, and one or two pointers. The decoding is essentially the same as the Tree-Search algorithm of [2], i.e. it performs two comparisons per level.

The data used to compare the implementations are taken from the Phone Book of Slovenia [12]. We took different sets of data (texts), containing different numbers of symbols, from a few to more than sixty thousand, and different frequencies of these symbols. For each of the sets we computed the sizes of the shapes for all compared implementations and the number of comparisons necessary to decode the complete text.

Table 1 contains the results of the tests: the first column gives the number of symbols N, the following four columns the normalized sizes of the data structures, and the last four columns the normalized numbers of comparisons of the decoding algorithms. The logarithmic-implementation columns contain two values each: the normalized value and the actual one, the actual size being measured in bytes. All normalized values are rounded.

N    size: pointers / linear / logarithmic (normalized; actual in bytes)    number of comparisons: pointers / linear / logarithmic (normalized; actual)

Table 1: Comparison of different implementations of the Huffman encoding.

Conclusion

We have presented a way to mutate a Huffman tree so as to obtain a more succinct representation of the tree and also a considerable speed-up of the decoding compared to previous versions of Huffman trees. We typically use only a few hundred bytes of extra storage as an overhead structure for an alphabet of sixty thousand symbols, and we use only one third to one fifth of the number of comparisons of other versions. As we hope to be able to show in the final version of this paper, the actual decoding time compares even more favourably for our algorithms.

The theoretical results of this paper are also worth noticing. The Huffman code has been around for more than 45 years, and now we decrease the decoding time from O(log N) to O(log log N) while using a much smaller representation than before.

We have not found any drawbacks of our algorithms, and we would strongly recommend using them, or variations thereof, in all places where Huffman coding is used.

There remains an open problem whether we can further (theoretically) decrease the size of the data structure. In our current approach we essentially group N keys into l subsets. In general there are C(N, l) such groupings, and thus we need lg C(N, l) = O(l log N) bits for their description. In our case not all groupings are valid and hence this bound is too high, but by how much? Note that the size of our data structure matches the O(l log N) bound.

Addendum

When we finished this research we learned that more or less the same approach had been taken by Moffat and Turpin in [10]. They also give empirical data to support their approach, but do not perform a theoretical analysis of the technique.

In the meantime we have also implemented and tested our data structure. The results of the tests and their analysis from the algorithmic-engineering point of view will be the subject of the full version of this contribution.

References

[1] K.-L. Chung. Efficient Huffman decoding. Information Processing Letters, 61(2):97-99, 1997.

[2] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, Massachusetts, 1990.

[3] P. van Emde Boas, R. Kaas, and E. Zijlstra. Design and implementation of an efficient priority queue. Mathematical Systems Theory, 10:99-127, 1977.

[4] R. Hashemian. Memory efficient and high-speed search Huffman coding. IEEE Transactions on Communications, 43(10), October 1995.

[5] D.A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the Institute of Radio Engineers, 40(9):1098-1101, September 1952.

[6] V. Iyengar and K. Chakrabarty. An efficient finite-state machine implementation of Huffman decoders. Information Processing Letters, 64(6):271-275, 1997.

[7] A. Kato, T.S. Han, and H. Nagaoka. Huffman coding with an infinite alphabet. IEEE Transactions on Information Theory, 42(3):977-984, May 1996.

[8] L.L. Larmore and D.S. Hirschberg. A fast algorithm for optimal length-limited Huffman codes. Journal of the ACM, 37(3):464-473, July 1990.

[9] A. Moffat and J. Katajainen. In-place calculation of minimum-redundancy codes. In Proceedings of the 4th Workshop on Algorithms and Data Structures, volume 955 of Lecture Notes in Computer Science, pages 393-402. Springer-Verlag, 1995.

[10] A. Moffat and A. Turpin. On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45(10):1200-1207, 1997.

[11] H. Tanaka. Data structure of Huffman codes and its application to efficient encoding and decoding. IEEE Transactions on Information Theory, 33(1):154-156, January 1987.

[12] Telekom Slovenije and ADACTA. Telefonski imenik Slovenije. URL: http://tis.telekom.si.

[13] J.S. Vitter. Design and analysis of dynamic Huffman codes. Journal of the ACM, 34(4):825-845, October 1987.

[14] J.S. Vitter. Algorithm 673: Dynamic Huffman coding. ACM Transactions on Mathematical Software, 15(2):158-167, June 1989.