
The Sliding-Window Lempel-Ziv Algorithm is Asymptotically Optimal

    AARON D. WYNER, FELLOW, IEEE, AND JACOB ZIV, FELLOW, IEEE

    Invited Paper

The sliding-window version of the Lempel-Ziv data-compression algorithm (sometimes called LZ 77) has been thrust into prominence recently. A version of this algorithm is used in the highly successful "Stacker" program for personal computers. It is also incorporated into Microsoft's new MS-DOS-6. Although other versions of the Lempel-Ziv algorithm are known to be optimal in the sense that they compress a data source to its entropy, optimality in this sense has never been demonstrated for this version.

In this self-contained paper, we will describe the algorithm, and show that as the window size, a quantity which is related to the memory and complexity of the procedure, goes to infinity, the compression rate approaches the source entropy. The proof is surprisingly general, applying to all finite-alphabet stationary ergodic sources.

I. INTRODUCTION

The sliding-window version of the Lempel-Ziv (LZ) data-compression algorithm (first proposed in [6] in 1977 and sometimes called LZ 77) has been thrust into prominence recently. A version of this algorithm is used in the highly successful "Stacker" program for personal computers [3]. It is also incorporated into Microsoft's new MS-DOS-6. Although other versions of the LZ algorithm are known to be optimal in the sense that they compress a data source to its entropy, such optimality has never been demonstrated for the sliding-window version.

This self-contained paper begins with a description of the sliding-window LZ algorithm. The main result is a theorem which asserts that as the window size (a quantity directly related to the memory and complexity requirements of the procedure) becomes large, the compression rate approaches the source entropy. This theorem is surprisingly general, applying to all stationary, ergodic, finite-alphabet sources.

We will be concerned with a data source which is a random sequence {X_k}, -∞ < k < ∞, that is stationary, ergodic, and takes values in the alphabet A with cardinality |A| = A < ∞. For -∞ ≤ i < j ≤ ∞, let X_i^j denote the vector (X_i, X_{i+1}, ..., X_j). Let

    H_n = -(1/n) Σ_{x ∈ A^n} Pr{X_1^n = x} log Pr{X_1^n = x}

be the nth order (normalized) entropy of {X_k}, and let H = lim_{n→∞} H_n be the source entropy. (All logarithms in this paper are taken to the base 2.) It is well known that this limit always exists, and that the data source can be losslessly encoded using (H + ε) bits per source symbol [1], [2] (for arbitrary ε > 0). The LZ algorithm is a universal procedure (which does not depend on the source statistics) for encoding the source.
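The definitions above are easy to exercise numerically. The sketch below is an illustration of ours, not from the paper: it assumes a hypothetical two-state Markov source (the transition probabilities are chosen only for the example), computes H_n by direct enumeration of all length-n strings, and compares it with the entropy rate H, toward which H_n decreases.

```python
import itertools
import math

# Hypothetical two-state Markov source over A = {0, 1}; the transition
# probabilities are our own illustration, not taken from the paper.
P = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}
pi = {0: 0.75, 1: 0.25}  # stationary distribution (pi P = pi)

def block_entropy(n):
    """H_n = -(1/n) sum_x Pr{X_1^n = x} log2 Pr{X_1^n = x}, by enumeration."""
    h = 0.0
    for x in itertools.product((0, 1), repeat=n):
        p = pi[x[0]]
        for a, b in zip(x, x[1:]):
            p *= P[a][b]
        h -= p * math.log2(p)
    return h / n

# For a Markov chain the entropy rate is H = sum_a pi(a) H(P(.|a)).
H = -sum(pi[a] * sum(P[a][b] * math.log2(P[a][b]) for b in P[a]) for a in pi)

for n in (1, 2, 4, 8):
    print(n, round(block_entropy(n), 4))
print("H =", round(H, 4))
```

For this chain H(X_1^n) = H(X_1) + (n - 1)H, so H_n = (H_1 + (n - 1)H)/n decreases monotonically to H, as the printout shows.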

In Section II we give a precise description of the sliding-window LZ algorithm and state the main result. In Section III we establish the mathematical facts that we need in the proof of this theorem, which we give in Section IV.

Let us remark at this point that the sliding-window LZ algorithm discussed here is not the same as the LZ 78 algorithm, used for example in the UNIX "compress" command and in a CCITT standard for data compression for modems. In LZ 78, the dictionary is allowed to grow until it reaches a certain specified size. In the sliding-window version, the dictionary is of fixed size, and consists of the data symbols immediately preceding the data being compressed.

Finally, we remark that the treatment in [6], where the sliding-window LZ algorithm is first described, is nonprobabilistic. In that reference, the algorithm is shown to be optimal for a certain (deterministic) family of sources. Furthermore, for a given source in this family, the rate of convergence is similar to that of the best nonuniversal compression scheme. In the present paper, we replace the assumption that the source belongs to this family by the more conventional (and realistic) assumption that the source is stationary and ergodic.

Manuscript received November 1, 1993; revised January 15, 1994.
A. D. Wyner is with AT&T Bell Laboratories, Murray Hill, NJ 07974, USA.
J. Ziv was with AT&T Bell Laboratories, Murray Hill, NJ 07974, USA. He is now with the Faculty of Electrical Engineering, Technion-Israel Institute of Technology, Haifa, Israel.

    IEEE Log Number 9401404.

0018-9219/94$04.00 © 1994 IEEE

872 PROCEEDINGS OF THE IEEE, VOL. 82, NO. 6, JUNE 1994


II. DESCRIPTION OF THE ALGORITHM

Let {X_k}, -∞ < k < ∞, be a stationary ergodic random process with entropy H, as discussed in Section I. We will now describe the sliding-window Lempel-Ziv algorithm for encoding the sequence {X_k}, 1 ≤ k ≤ N, where N is a large integer. (Later we shall let N → ∞.)

Let n_w > 0 be an integer parameter, called the "window size." Assume that n_w is a power of two. The first n_w symbols of X_1^N are encoded with no attempt at compression. We call X_1^{n_w} the window. The number of bits required to encode the window X_1^{n_w} is ⌈n_w log A⌉. These bits are to be considered as overhead that will be amortized over a very long time. We next define the first phrase Y_1 = X_{n_w+1}^{n_w+L_1}, where L_1 is the largest integer such that

    X_{n_w-m}^{n_w-m+L_1-1} = X_{n_w+1}^{n_w+L_1}, for some m ∈ [0, n_w - 1].   (1)

Thus L_1 is the largest integer such that a copy of X_{n_w+1}^{n_w+L_1} begins in the window. Let m_1 be the (say) smallest of those m's which satisfy (1). If (1) is not satisfied for any m ∈ [0, n_w - 1], that is, if X_{n_w+1} ≠ X_m for all m ∈ [1, n_w], then we take L_1 = 1. For example, if n_w = 5 and (X_1, X_2, ...) = (a, b, c, d, e; d, e, d, u, ...), then L_1 = 3, since the string (d, e, d) begins at position 4.
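The parsing rule (1) can be realized by exhaustive search over the window. The helper below is a sketch of ours (the function name and the 0-indexed positions are not from the paper): it returns the largest L such that a copy of the next L symbols begins in the window, allowing the copy to run past the window edge into the phrase itself, exactly as in the (a, b, c, d, e; d, e, d, u) example above.

```python
def next_phrase(x, pos, n_w):
    """Greedy sliding-window parse: find the largest L such that a copy of
    x[pos:pos+L] begins in the window x[pos-n_w:pos].  The copy may extend
    past the window edge (self-overlap is allowed).  Returns (L, start);
    if no window symbol matches x[pos], returns (1, None), i.e. the
    unmatched-symbol convention L_1 = 1."""
    best_len, best_start = 0, None
    for s in range(pos - n_w, pos):          # candidate copy positions
        L = 0
        while pos + L < len(x) and x[s + L] == x[pos + L]:
            L += 1
        if L > best_len:
            best_len, best_start = L, s
    if best_len == 0:
        return 1, None
    return best_len, best_start

# The example from the text: window 'abcde' (n_w = 5), upcoming 'dedu'.
print(next_phrase("abcdededu", 5, 5))   # phrase (d, e, d): length 3
```

Note that the match for (d, e, d) starts at the fourth symbol of the window and its third symbol is the first symbol of the phrase itself; this self-overlap is permitted by (1).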


Proof: Write, for fixed ε > 0, δ = ε/2, and T_ℓ(δ) defined by (6),

    Pr{W_ℓ ≥ 2^{ℓ(H+ε)}} ≤ Σ_{z ∈ T_ℓ(δ)} Pr{X_0^{ℓ-1} = z} · (2^{ℓ(H+ε)} Pr{X_0^{ℓ-1} = z})^{-1} + Pr{X_0^{ℓ-1} ∉ T_ℓ(δ)}.   (14)

The inequality in (14) follows from (13) with η = 2^{ℓ(H+ε)}. Now for z ∈ T_ℓ(δ), Pr{X_0^{ℓ-1} = z} ≥ 2^{-ℓ(H+δ)}. Thus from (14), with ε − δ = ε/2,

    Pr{W_ℓ ≥ 2^{ℓ(H+ε)}} ≤ 2^{-ℓ(ε-δ)} Pr{X_0^{ℓ-1} ∈ T_ℓ(δ)} + Pr{X_0^{ℓ-1} ∉ T_ℓ(δ)} → 0,

as ℓ → ∞,   (15)

by Theorem 3.1 (the AEP). This is Theorem 3.3.

We will use the following form of Theorem 3.3:

Corollary 3.4: Let n(ℓ), ℓ = 1, 2, ..., satisfy

    lim_{ℓ→∞} (1/ℓ) log n(ℓ) = L > H.   (16a)

Then

    lim_{ℓ→∞} Pr{W_ℓ > n(ℓ)} = 0.   (16b)

Proof: Let ε = L − H > 0. For ℓ ≥ ℓ_0 (sufficiently large), (1/ℓ) log n(ℓ) ≥ L − (ε/2) = H + (ε/2), and thus

    n(ℓ) ≥ 2^{ℓ[H+(ε/2)]}.

Thus for ℓ ≥ ℓ_0,

    Pr{W_ℓ > n(ℓ)} ≤ Pr{W_ℓ > 2^{ℓ[H+(ε/2)]}} → 0

by Theorem 3.3.
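Theorem 3.3 and its corollary concern the waiting time W_ℓ until the first ℓ-block recurs. As a numerical illustration (ours, not the paper's: an i.i.d. Bernoulli(0.3) source is assumed, and W_ℓ is sampled as the first index k ≥ 1 at which the initial ℓ-block reappears), (1/ℓ) log2 W_ℓ concentrates near H, which is what makes 2^{ℓ(H+ε)} the right threshold in Theorem 3.3.

```python
import math
import random

random.seed(0)

def waiting_time(ell, p=0.3, max_n=1 << 17):
    """Sample W_ell for an i.i.d. Bernoulli(p) source: the smallest k >= 1
    with X_k^{k+ell-1} = X_0^{ell-1} (capped at max_n; the cap binds only
    for rare, atypically improbable initial blocks)."""
    x = "".join("1" if random.random() < p else "0" for _ in range(max_n))
    k = x.find(x[:ell], 1)          # first recurrence of the initial block
    return k if k != -1 else max_n

ell = 12
H = -(0.3 * math.log2(0.3) + 0.7 * math.log2(0.7))  # entropy of Bernoulli(0.3)
samples = [math.log2(waiting_time(ell)) / ell for _ in range(30)]
est = sum(samples) / len(samples)
print(f"H = {H:.3f}, mean (1/ell) log2 W_ell = {est:.3f}")
```

With ℓ finite the estimate carries an O(1/ℓ) bias, but it already sits near H rather than near log A = 1.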

IV. PROOF OF THEOREM 2.1

That R(N) ≥ H for all N follows from the converse to the Coding Theorem for lossless source coding. Since the algorithm defined in Section II yields a prefix-free ("instantaneous") code for X_1, ..., X_N,

    E ν(X_1^N) ≥ H(X_1, ..., X_N),

where ν(X_1^N) is the total codeword length. (See Theorem 5.3.1 in [1].) Since H(X_1, ..., X_N) ≥ N H, R(N) = E ν(X_1^N)/N ≥ H.

We now show that lim R(N) ≤ H. Suppose that we have implemented the algorithm to encode X_1, ..., X_N. Subdivide the interval [n_w + 1, N] into N'/ℓ_0 subintervals of length ℓ_0, where

    N' = N − n_w   (17a)

and

    ℓ_0 = (log n_w)/(H + ε),   (17b)

and ε > 0 is arbitrary. (Assume that ε is adjusted so that ℓ_0 is an integer, and that ℓ_0 divides N'.) Recall that X_1^{n_w} is the (uncoded) initial window. Denote the N'/ℓ_0 subintervals by {I_j}, where

    I_j = [n_w + (j − 1)ℓ_0 + 1, n_w + jℓ_0],  1 ≤ j ≤ N'/ℓ_0.   (18)

We say that a subinterval I_j is bad if a copy of {X_k}, k ∈ I_j, does not begin in the string of (n_w − ℓ_0) symbols preceding it. Now, referring to the discussion in Section III, we have that

    Pr{I_j is bad} = Pr{W(ℓ_0) > n_w − ℓ_0}.   (19)

Since n(ℓ_0) = n_w − ℓ_0 satisfies (16a) with L = H + ε (by (17b), (1/ℓ_0) log(n_w − ℓ_0) → H + ε as n_w → ∞), Corollary 3.4 implies that

    Pr{I_j is bad} → 0   (20)

as n_w (and hence ℓ_0) → ∞.

Now let {Y_i}, 1 ≤ i ≤ c, be the phrases generated by the algorithm when encoding X_1^N. For 1 ≤ i ≤ c, let Y_i' be the phrase Y_i augmented by the next symbol from X_1^N; thus if Y_i = X_t^{t+L-1}, then Y_i' = X_t^{t+L}. We say that the phrase Y_i is an internal phrase if the corresponding augmented phrase Y_i' begins and ends in the same subinterval I_j. We claim that if Y_i is an internal phrase supported on I_j, then I_j is a bad subinterval. We show this with the aid of the schematic diagram in Fig. 1. The figure shows the augmented internal phrase Y_i' supported on the interval I_j. The window corresponding to the phrase Y_i is supported on the interval AD. By definition, a copy of Y_i' does not begin in this window (since Y_i is the longest sequence, beginning at D, a copy of which begins in the window). Since Y_i' is supported on I_j, a copy of {X_k}, k ∈ I_j, cannot begin in the window AC, and therefore cannot begin in BC (since B is to the right of A). Thus I_j is a bad subinterval.

Let S_I = {i : Y_i is internal}, and let β be the fraction of the intervals {I_j} which are bad. It follows that

    WYNER AND ZIV: SLIDING-WINDOW LEMPEL-ZIV ALGORITHM 875

  • 8/10/2019 Wyner and Ziv

    5/6

We are now ready to complete the proof of Theorem 2.1. Rewrite inequality (5) as a sum of two expectations, corresponding to the internal and to the non-internal phrases.   (22)

Now the first expectation is, from (21), bounded by a quantity which, by (20), vanishes as n_w → ∞.   (23)

Finally, observe that if Y_i is not an internal phrase, then Y_i' must occupy the last position of some subinterval I_j. Thus

    c − |S_I| ≤ {number of subintervals} = N'/ℓ_0.   (24)

The second term in (22) (without the expectation) is

    (1/N) Σ_{i ∉ S_I} (log n_w + |e(L_i + 1)|)
      ≤(a) ((c − |S_I|)/N) log n_w + γ ((c − |S_I|)/N) log( Σ_{i ∉ S_I} (L_i + 1)/(c − |S_I|) + 1 )
      ≤(b) ((c − |S_I|)/N) log n_w + γ ((c − |S_I|)/N) log( N/(c − |S_I|) + 1 )
      ≤(c) (1/ℓ_0) log n_w + (γ/ℓ_0) log(ℓ_0 + 1)
      =(d) (H + ε) + (γ/ℓ_0) log(ℓ_0 + 1).   (25)

Step (a) in (25) follows from (A4b) and the concavity of log(·); step (b) from Σ_i L_i ≤ N; step (c) from the fact that (1/x) log(x + 1), x ≥ 0, is decreasing in x, and from (24), which implies

    (c − |S_I|)/N ≤ N'/(N ℓ_0) ≤ 1/ℓ_0;

and step (d) from (17b). Substituting (23) and (25) into (22), we have

    lim_{N→∞} R(N) ≤ K · Pr{I_j is bad} + (H + ε) + (γ/ℓ_0) log(ℓ_0 + 1),

where the constant K does not depend on n_w. Finally, letting n_w → ∞ (so that ℓ_0 → ∞ as well) and using (20), we have

    lim_{n_w→∞} lim_{N→∞} R(N) ≤ H + ε,

which is Theorem 2.1, on letting ε → 0.
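The convergence asserted by the theorem can be observed numerically. The sketch below is ours, not the paper's coder: it assumes an i.i.d. Bernoulli(0.3) source, parses phrases after the window by exhaustive longest-match search, and uses the crude proxy rate log2(n_w) divided by the mean phrase length (the position cost per phrase over the phrase length, ignoring the |e(L)| bits for the length code). For large windows this proxy sits in the vicinity of H.

```python
import math
import random

random.seed(1)

H = -(0.3 * math.log2(0.3) + 0.7 * math.log2(0.7))  # entropy of Bernoulli(0.3)
x = "".join("1" if random.random() < 0.3 else "0" for _ in range(20000))

def mean_phrase_len(x, n_w, n_phrases=200):
    """Parse n_phrases phrases after the window by exhaustive longest-match
    search (self-overlap allowed) and return the average phrase length."""
    pos, total = n_w, 0
    for _ in range(n_phrases):
        best = 0
        for s in range(pos - n_w, pos):
            L = 0
            while pos + L < len(x) and x[s + L] == x[pos + L]:
                L += 1
            best = max(best, L)
        best = max(best, 1)   # unmatched-symbol convention
        total += best
        pos += best
    return total / n_phrases

rates = []
for n_w in (64, 512, 4096):
    rates.append(math.log2(n_w) / mean_phrase_len(x, n_w))
    print(n_w, round(rates[-1], 3))
print("H =", round(H, 3))
```

The mechanism is the one the proof exploits: the typical match length grows like (log n_w)/H, so the per-symbol cost of describing a match position, log n_w bits, is spread over about (log n_w)/H symbols.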

APPENDIX
COMMA-FREE CODING OF AN UNBOUNDED INTEGER

In this Appendix we define a scheme for unambiguously encoding an integer L into a binary string. Let {0,1}* be the set of binary sequences of arbitrary length. We will give an encoding or mapping e : [1, ∞) → {0,1}*, which maps the positive integers into binary sequences, such that for any distinct L_1, L_2, e(L_1) is not a prefix of e(L_2). Thus the code is uniquely decipherable; see, for example, [1, ch. 5].

For k ∈ [1, ∞) let b(k) be the binary expansion of the integer k. Thus b(1) = 1, b(2) = 10, etc. The length of this expansion is

    |b(k)| = ⌊log k⌋ + 1.   (A1)

Also, for k = 1, 2, ..., let u(k) ∈ {0,1}* be the concatenation of (k − 1) 0's followed by a 1. First observe that

    d(k) = u(|b(k)|) * b(k),   (A2)

where * denotes concatenation, is a prefix-free encoding of k ∈ [1, ∞). For example, if d(k_1) = 00011001, the prefix 0001 tells us that the next 4 bits, 1001, are the binary encoding of k_1 = 9.

The mapping which we will use is, for L = 1, 2, ...,

    e(L) = d(|b(L)|) * b(L).   (A3)

The prefix d(|b(L)|) encodes the length of b(L), and the next |b(L)| bits are the binary expansion of L. For example, when L = 7, b(L) = 111 and |b(L)| = 3, so that d(|b(L)|) = 0111. Thus e(L) = e(7) = 0111111.
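The maps b, u, d, and e of (A1)-(A3) can be written down directly; the sketch below (the function and variable names are ours) also includes a decoder, which confirms the prefix-free property by consuming exactly one codeword from the front of a stream.

```python
def b(k):
    """Binary expansion of k >= 1: b(1) = '1', b(2) = '10', ..."""
    return format(k, "b")

def u(k):
    """(k - 1) zeros followed by a single one."""
    return "0" * (k - 1) + "1"

def d(k):
    """(A2): d(k) = u(|b(k)|) * b(k), a prefix-free code for k >= 1."""
    return u(len(b(k))) + b(k)

def e(L):
    """(A3): e(L) = d(|b(L)|) * b(L); e.g. e(7) = '0111' + '111'."""
    return d(len(b(L))) + b(L)

def decode(s):
    """Read one codeword e(L) from the front of s; return (L, bits used)."""
    z = 0
    while s[z] == "0":               # zero run of u(|b(k)|): |b(k)| = z + 1
        z += 1
    k = int(s[z + 1:2 * z + 2], 2)   # b(k), where k = |b(L)|
    start = 2 * z + 2
    return int(s[start:start + k], 2), start + k

print(e(7), d(9))   # '0111111' and '00011001', as in the examples above
```

Since decode reports how many bits it consumed, a concatenation e(L_1)e(L_2)... is parsed unambiguously left to right, which is exactly the unique decipherability claimed for (A3).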

We need a bound on |e(L)|. From (A2), |d(k)| = 2|b(k)|. Thus from (A3),

    |e(L)| = 2|b(|b(L)|)| + |b(L)|.

Using (A1), we see that for large L,

    |e(L)| ≈ log L + 2 log log L.

We can account for small values of L by writing, for all L = 1, 2, ...,

    |e(L)| ≤ log(L + 1) + γ log log(L + 2)   (A4a)

for a suitably large γ. For our purposes the very weak bound

    |e(L)| ≤ γ log(L + 1)   (A4b)

will suffice, where γ is appropriately large, say γ = 10.

REFERENCES

[1] T. Cover and J. Thomas, Elements of Information Theory. New York: Wiley, 1991.
[2] R. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.
[3] D. Whiting, G. George, and G. Ivey, "Data compression apparatus and method," US Patent 5016009, 1991 (assigned to Stac, Inc.).
[4] A. D. Wyner and J. Ziv, "Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression," IEEE Trans. Inform. Theory, vol. 35, pp. 1250-1258, Nov. 1989.



[5] ——, "Fixed data base version of the Lempel-Ziv data compression algorithm," IEEE Trans. Inform. Theory, vol. 37, pp. 878-880, May 1991.

[6] J. Ziv and A. Lempel, "A universal algorithm for sequential data compression," IEEE Trans. Inform. Theory, vol. IT-23, pp. 337-343, May 1977.

Aaron D. Wyner (Fellow, IEEE) received the B.S. degree in 1960 from Queens College, City University of New York, and the B.S.E.E., M.S., and Ph.D. degrees from Columbia University, New York, in 1960, 1961, and 1963, respectively.

Since 1963 he has been a member of the Mathematical Sciences Research Center at AT&T Bell Laboratories, Murray Hill, NJ, doing research in information and communication theory and related mathematical areas. From 1974 to 1993 he was Head of the Communications Analysis Research Department at Bell Labs. He spent the year 1969-1970 visiting the Department of Applied Mathematics at the Weizmann Institute of Science, Rehovot, Israel, and the Faculty of Electrical Engineering at the Technion-Israel Institute of Technology, Haifa, on a Guggenheim Foundation Fellowship. He has also been a full- and part-time faculty member at Columbia University, Princeton University, and the Polytechnic Institute of Brooklyn. In 1977 he received the IEEE Information Theory Society Prize Paper Award and in 1984 the IEEE Centennial Medal.

Dr. Wyner was Chairman of the IEEE Information Theory Society's Metropolitan New York Chapter, Editor-in-Chief of the IEEE TRANSACTIONS ON INFORMATION THEORY, and Cochairman of two international symposia. In 1976 he served as President of the IEEE Information Theory Society. He is a member of URSI, served as Vice-Chair of its Commission C, and is also a member of the National Academy of Engineering.

Jacob Ziv (Fellow, IEEE) was born in Tiberias, Israel, on November 27, 1931. He received the B.Sc., Dipl. Eng., and M.Sc. degrees in electrical engineering from the Technion-Israel Institute of Technology, Haifa, in 1954 and 1957, respectively, and the D.Sc. degree from the Massachusetts Institute of Technology, Cambridge, in 1962.

From 1955 to 1959 he was a Senior Research Engineer in the Scientific Department of the Israel Ministry of Defense, and was assigned to the research and development of communication systems. From 1961 to 1962, while studying for the Ph.D. degree at MIT, he joined the Applied Science Division of Melpar, Inc., Watertown, MA, where he was a Senior Research Engineer doing research in communication theory. In 1962 he returned to the Scientific Department, Israel Ministry of Defense, as Head of the Communications Division, and was also an Adjunct at the Faculty of Electrical Engineering at the Technion. From 1968 to 1970 he was a Member of the Technical Staff at Bell Laboratories, Murray Hill, NJ. He joined the Technion in 1970 and is Herman Gross Professor of Electrical Engineering there. He was Dean of the Faculty of Electrical Engineering from 1974 to 1976 and Vice President for Academic Affairs from 1978 to 1982. From 1977 to 1978, 1983 to 1984, and 1991 to 1992 he was on sabbatical leave at Bell Laboratories. In 1982 he was elected Member of the Israeli Academy of Science, and was appointed as a Technion Distinguished Professor. In 1993 he was awarded the Israel Prize in Exact Sciences (Engineering and Technology). He has twice been the recipient of the IEEE Information Theory Best Paper Award (for 1976 and 1979). He was the Chairman of the Israeli Universities Planning and Grants Committee from 1985 to 1991. His research interests include general topics in data compression, information theory, and statistical communication.

In 1988, Dr. Ziv was elected as a Foreign Associate of the U.S. National Academy of Engineering.
