Differential Compression of Executable Code Giovanni Motta1, James Gustafson2, and Samson Chen3

Abstract

A platform-independent algorithm to compress file differences is presented here. Since most file updates consist of software updates and security patches, particular attention is dedicated to making this algorithm suitable for the efficient compression of differences between executable files. The algorithm is designed so that its low-complexity decoder can be used in mobile and embedded devices. Compression performance is compared with that of several existing methods on a common test suite.

1. Introduction

Software distribution and update, security patching, revision control, and the backup and archival of multiple versions of data are all applications where compressing file differences can achieve compression orders of magnitude greater than what is achievable by ordinary methods. Instead of compressing a file ex novo, differential file compression takes advantage of the fact that an “older” version of the same file is available to the decompressor. In the following, we will refer to the “old” copy of the file, accessible by the decompressor, as the reference file. The version that we aim to compress will be called the target or update file. When the target and reference files are sufficiently similar, a compact representation of the differences can often be obtained.

File differencing finds one of its most interesting applications in the distribution of software updates and security patches to mobile and embedded devices. Mobile devices (wearable computers, personal digital assistants, cellular phones, etc.) have access to a communication link of limited capacity, so achieving the highest possible compression is extremely important. Furthermore, mobile devices are battery-powered and characterized by limited computing power and small memory, so the compression used in this application should be asymmetrical, in the sense that the complexity of the compressor can be much higher than that of the decompressor.

Finding and compressing the differences between two versions of the same file is not an easy task, and this problem has been addressed in the literature by a number of authors. An early approach to file differencing is implemented by the UNIX command diff [1]. This command encodes the differences between two text files using a minimal set of operations that APPEND, DELETE or CHANGE entire text lines. Applied in sequence, these operations transform the first file into the second. Like other differencing algorithms, diff uses a variation of the Longest Common Subsequence (LCS) algorithm [2] to find the sections common to both the reference and the target file.

1 [email protected], Bitfone Corp., Laguna Niguel, CA 92677.
2 [email protected], Work done while at Bitfone Corp., Laguna Niguel, CA 92677.
3 [email protected], Work done while at Bitfone Corp., Laguna Niguel, CA 92677.

[Figure 1 here: diagram of the reference and target files, each with code and data sections; a relative and an absolute reference are shown before and after a block of added code.]

Figure 1: Relative and absolute references before and after the addition of new code.

Specific optimizations limit memory usage and achieve good performance on typical text files. Variants and improvements of the diff algorithm are described in [3].

Other file differencing algorithms, such as VCDIFF [4] and Zdelta [5], are based on similar operations, like COPY and ADD (VCDIFF also uses RUN, a command that encodes sequences of identical bytes).

These algorithms perform well on text files, where changes are typically localized and material is preserved verbatim across revisions. However, executable files are highly structured and characterized by the presence of relative and absolute references to functions, objects and data. A small change in the source code, such as the addition of a new function, can propagate and affect references in sections of the file that have not been explicitly modified. Figure 1 depicts an example where a relative and an absolute reference (a branch and a pointer, for example) are modified by the addition of new code.

When discussing differential compression of executable files it is helpful to distinguish between primary changes, i.e., changes explicitly introduced by the modification of the source code, and secondary changes, i.e., absolute and relative references modified as a consequence of the primary changes. When a basic COPY and ADD differencing algorithm is applied to executable files, the secondary changes fragment the copies and compression performance is compromised.

It is commonly assumed that the compressor has no access to the source programs that were compiled into the executable files, so there is no direct way to infer the secondary changes from the primary ones. It is, of course, technically possible to use a brute-force approach and disassemble both the reference and the target file. Corresponding objects in the two files could then be matched, and the references that changed between the two versions adjusted. Adjusting the value of pointers and branches in the reference file constitutes a preprocessing step, after which a standard COPY and ADD differencing algorithm could be applied. Such a brute-force approach would be very effective in terms of compression, but it has the disadvantage of being complex and platform- (or even compiler- and linker-) dependent.

In the following, we address the problem of augmenting a COPY and ADD algorithm so that executable files are compressed efficiently. The algorithm described is platform-independent and has a low-complexity decompressor suitable for a mobile or embedded platform.

2. Past Work

File differencing has been studied by several authors and under a number of assumptions. In this section we present the most relevant features of some of the compressors used in our comparison: VCDIFF, Zdelta, BSDiff, eDelta and Exediff.

VCDIFF [4] and Zdelta [5] are both variations of LZ77 in which the reference file is part of the dictionary. Zdelta reuses many functions of the zlib library, adapting it to the compression of file differences. VCDIFF and Zdelta represent the target file by combining copies from both the reference file and the already compressed portion of the target file. However, Zdelta differs from VCDIFF in its use of a Huffman encoder to further compress the COPY and ADD commands and their parameters. Another important difference is the use of multiple pointers to specify the location of a copy. Zdelta maintains and independently updates two pointers into the reference file and an implicit pointer into the target file (the implicit pointer always points to the section about to be compressed). While offsets can be specified relative to any of these three pointers, a more compact encoding is achieved by always preferring the pointer that generates the smallest offset. The position of each pointer represents a prediction of the location of the next match, and the smallest of the three offsets is the prediction error that is sent to the decoder (together with the identity of the pointer used). The pointers into the reference buffer are updated after each match with a predetermined strategy known to the decompressor. A detailed description of VCDIFF and Zdelta can be found in the original papers as well as in [6].
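
To make the multi-pointer prediction concrete, the sketch below (Python) shows how an encoder might choose among the three pointers; the function name and the pointer-update rule are illustrative only, since the actual strategy is part of the Zdelta specification.

```python
# Sketch of Zdelta-style multi-pointer offset prediction.
# encode_copy_offset is an illustrative name; the real pointer-update
# strategy is defined by Zdelta and known to the decompressor.

def encode_copy_offset(match_pos, ref_ptrs, tgt_ptr):
    """Pick the pointer closest to match_pos and emit (pointer id, offset)."""
    candidates = ref_ptrs + [tgt_ptr]   # two reference pointers + implicit target pointer
    best_id = min(range(len(candidates)), key=lambda i: abs(match_pos - candidates[i]))
    offset = match_pos - candidates[best_id]  # small prediction error = cheap to encode
    return best_id, offset

# Example: pointers at 100 and 5000 in the reference, implicit pointer at 720.
ptr_id, offset = encode_copy_offset(match_pos=5016, ref_ptrs=[100, 5000], tgt_ptr=720)
print(ptr_id, offset)  # -> 1 16: pointer 1 predicts the match within 16 bytes
```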

BSDiff [7-9] was developed by Colin Percival to distribute binary security updates for FreeBSD. Unlike VCDIFF and Zdelta, BSDiff specifically addresses the differential compression of executable code and uses a platform-independent approach. Its most recent publicly released version, BSDiff 4, is used by FreeBSD and OS X to distribute binary security updates and, with some modifications, by the Mozilla project to accelerate the download of Firefox updates.

BSDiff 4 compresses file differences by first finding a set of exactly matching regions, and then extending these regions forward and backward by allowing mismatches. The extended regions roughly correspond to secondary changes and to unmodified sections of code. Regions in the target file for which no approximate match can be found correspond to primary changes. BSDiff 4 encodes these regions into three sections:

- a control section, containing ADD and INSERT instructions and their parameters;
- a difference section, containing the byte-wise differences between the approximate matches;
- a section containing the bytes inserted into the update file by the INSERT instructions.

The concatenation of these three sections is slightly larger than the update file. However, the control section and the byte-wise differences are highly compressible, and BSDiff 4 entropy codes each section independently with bzip2.
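
Why the difference section compresses so well can be seen in a toy example (illustrative only; the actual BSDiff container format differs):

```python
# Toy illustration of why BSDiff's difference section is highly compressible:
# byte-wise differences of approximate matches are mostly zeros, with small
# repeated values where secondary changes touched the code.
ref = bytes([0x48, 0x00, 0x04, 0xb1, 0x85, 0xa8, 0x07, 0xa8])
tgt = bytes([0x48, 0x00, 0x04, 0xc5, 0x85, 0xa8, 0x07, 0xbc])
diff = bytes((t - r) % 256 for t, r in zip(tgt, ref))
print(diff.hex())  # -> '0000001400000014': long zero runs, ideal for bzip2
```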

BSDiff 6 [9], not yet available to the public, relies on the same basic principles but uses a different and more sophisticated matching algorithm. On the same set of reference files, BSDiff 6 improves on the compression of BSDiff 4 by 10% to 20%.

eDelta, developed by Jacob Gorm Hansen [10], is a linear-time, constant-space compressor, also tailored to the differential compression of executable files. eDelta is based on a variation of the algorithm described by Burns and Long [11], and uses a variation of the LCS algorithm to find the matches between reference and target. In this variation, short, constant-size sequences of unmatched bytes (or holes) are allowed within the matching sequences of bytes. The holes are efficiently patched at a later stage.

Exediff, developed by Baker et al. [12], uses a lossy transform to reduce the effect of secondary changes in the executable code. Exediff iterates two operations, called pre-matching and value recovery, until the size of the patch converges to a minimum and cannot be reduced further. Pre-matching is based on a solution of the Longest Common Subsequence problem. Exediff is not platform-independent, since the lossy transform relies on detailed knowledge of the architecture and involves decoding the binary file to locate the references. Once absolute and relative references are identified, their values are replaced by tags or coarsely quantized values (for example, only the sign of a relative branch is retained). After the matching, values that cannot be recovered are sent explicitly. While Exediff was developed and tested on 64-bit UNIX Alpha executables, the lossy transform can be adapted to other architectures.

An important variation of this problem, relevant to mobile and embedded applications but not explicitly addressed here, is in-place reconstruction. Under this assumption, the target file is reconstructed directly over the reference file, without requiring any additional memory or storage space. In-place reconstruction has been studied by Shapira and Storer [13] and by Burns and Long [14].

3. Improving the Encoding of the Secondary Changes

While secondary changes are generated by a simple and predictable process, they compromise the matching paradigm on which a basic COPY and SET differential compressor is based.

Figure 2, from [10], shows an example of this phenomenon. Two versions of a “hello world” C program are shown side by side; both versions have been compiled for the PowerPC architecture. The target version differs from the reference in that a local variable and a conditional statement have been added to the code. After compilation, the added code results in 20 new bytes. Unfortunately, the addition of these extra 20 bytes also introduces secondary changes and alters the offsets of all the original instructions by 20. A basic COPY and SET differential compressor will not be able to copy long sequences of bytes even though very little code has changed from one version to the other.

In our algorithm, the matching process is modified so that a number of short mismatches is tolerated. These mismatches aim at modeling secondary changes in the target file and effectively increase the length of the copies. Sections with mismatches are copied from the reference to the target file with a regular COPY command and then patched with an efficient predictive coding.

Reference                          Target                             Decimal
Hexadecimal  Mnemonic              Hexadecimal  Mnemonic              Offset
48 00 04 b1  bl 0x1000071c         48 00 04 c5  b 0x10000730           20
85 a8 07 a8  lwzu r13,1960(r8)     85 a8 07 bc  lwzu r13,1980(r8)      20
48 01 06 78  b 0x10010918          48 01 06 8c  b 0x1001092c           20
48 01 06 09  bl 0x100108bc         48 01 06 1d  bl 0x100108d0          20
88 1e 09 30  lbz r0,2352(r30)      88 1e 09 44  lbz r0,2372(r30)       20
81 7f 07 c8  lwz r11,1992(r31)     81 7f 07 dc  lwz r11,2012(r31)      20
90 1f 07 c8  stw r0,1992(r31)      90 1f 07 dc  stw r0,2012(r31)       20
81 7f 07 c8  lwz r11,1992(r31)     81 7f 07 dc  lwz r11,2012(r31)      20
98 1e 09 30  stb r0,2352(r30)      98 1e 09 44  stb r0,2372(r30)       20
80 0b 08 b8  lwz r0,2232(r11)      80 0b 08 cc  lwz r0,2252(r11)       20
38 6b 08 b8  addi r3,r11,2232      38 6b 08 cc  addi r3,r11,2252       20
                                   38 00 00 02  li r0,2
                                   90 1f 00 08  stw r0,8(r31)
                                   80 1f 00 08  lwz r0,8(r31)
                                   2c 00 00 01  cmpwi r0,1
                                   40 81 00 14  ble- 0x10000410
38 69 07 bc  addi r3,r9,1980       38 69 07 d0  addi r3,r9,2000        20
4b ff fe 11  bl 0x10000258         4b ff fd fd  bl 0x10000258         -20
39 29 08 ac  addi r9,r9,2220       39 29 08 c0  addi r9,r9,2240        20
4b ff fb 51  bl 0x100002e4         4b ff fb 3d  bl 0x100002e4         -20

Figure 2: Secondary changes introduced by the addition of new code (From [10]).

SET commands are used to encode bytes in the target file for which no COPY of sufficient length can be found in the reference, or for which such a COPY is found but its encoding is deemed inefficient.

The encoding starts by building a dictionary: the entire reference file is hashed, C bytes at a time. The value C determines the length of the shortest possible copy. Besides a hash table, other data structures are certainly possible, but the size of the dictionary is a critical factor that must be taken into account when choosing the structure. The dictionary grows during the encoding of the target file: after each COPY or SET, the newly encoded section of the target file is hashed and added to the dictionary. In the implementation described here, dictionary entries are never removed.

The target file is encoded sequentially, in a greedy fashion, from beginning to end. The encoder maintains a pointer P_tgt to the section of the file yet to be encoded. The search for a match starts by hashing the C bytes following P_tgt. The dictionary is then searched, and the pointers to all the (reference and already encoded target) sections that have an identical hash are collected into a list. After discarding any collisions, the list L = P_1, P_2, ..., P_m contains pointers to the locations of the m ≥ 0 matching candidates.

Matching candidates are evaluated and ranked in order to select the best one. If the list L is empty, or if at the end of the evaluation no matching candidate is deemed useful, the pointer P_tgt is advanced by one byte and the search continues from the next position. Any unmatched bytes are encoded with a SET command immediately before the encoding of the next match. In a SET command, the command code is followed by the number of bytes to set and then by the sequence of bytes itself.
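
A minimal sketch of this greedy loop follows (Python). The names hash_block, best_candidate and the command tuples are illustrative placeholders, the paper does not specify them; for brevity, best_candidate here selects the longest exact extension, whereas the real encoder tolerates short mismatches (Section 3.1) and ranks candidates with the cost function of Section 3.2.

```python
# Minimal sketch of the greedy COPY/SET encoding loop, under assumptions
# stated in the text above; not the authors' actual implementation.

C = 8  # length of the shortest possible copy (the paper reports C = 8)

def hash_block(buf, pos):
    return hash(bytes(buf[pos:pos + C]))  # any C-byte hash; collisions filtered later

def best_candidate(cands, data, tgt, p_tgt):
    """Simplified ranking: longest exact extension wins."""
    best = None
    for p in cands:
        n = C  # the first C bytes are known to match
        while p + n < len(data) and p_tgt + n < len(tgt) and data[p + n] == tgt[p_tgt + n]:
            n += 1
        if best is None or n > best[1]:
            best = (p, n)
    return best

def encode(ref, tgt):
    data = bytearray(ref)  # reference + already-encoded target bytes
    index = {}             # hash -> candidate positions (entries never removed)

    def add_to_index(start, end):
        for p in range(start, max(start, end - C + 1)):
            index.setdefault(hash_block(data, p), []).append(p)

    add_to_index(0, len(ref))
    commands, p_tgt, pending = [], 0, 0
    while p_tgt + C <= len(tgt):
        cands = [p for p in index.get(hash_block(tgt, p_tgt), [])
                 if data[p:p + C] == tgt[p_tgt:p_tgt + C]]  # discard hash collisions
        match = best_candidate(cands, data, tgt, p_tgt)
        if match is None:
            p_tgt += 1                 # no useful match: slide forward one byte
            continue
        if pending < p_tgt:            # flush unmatched bytes as a SET
            commands.append(('SET', bytes(tgt[pending:p_tgt])))
        src, length = match
        commands.append(('COPY', src, length))
        old = len(data)
        data.extend(tgt[pending:p_tgt + length])   # SET bytes + copied bytes
        add_to_index(old - C + 1 if old >= C else 0, len(data))
        p_tgt += length
        pending = p_tgt
    if pending < len(tgt):
        commands.append(('SET', bytes(tgt[pending:])))
    return commands

# Example: encode(b"the quick brown fox", b"the quick brown cat")
#          -> [('COPY', 0, 16), ('SET', b'cat')]
```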

[Figure 3 here: the tables T_1, T_2, ..., T_MM_len, each entry holding Ref, Tgt, Delta and Time fields.]

Figure 3: Tables used by the encoder to cache the mismatches.

The encoder then determines, for each candidate in the list, the length of the longest match starting from that location. Since collisions have been removed from the list L, for all P_i ∈ L we know that the first C bytes already match the bytes starting at P_tgt, and the comparison proceeds by matching the bytes at P_i + C, P_i + C + 1, P_i + C + 2, ... to the bytes at P_tgt + C, P_tgt + C + 1, P_tgt + C + 2, ... until MM_len + 1 consecutive mismatching bytes are found.

During this process, mismatching sequences of bytes of length not greater than MM_len are ignored and considered part of the match. Long sequences of bytes are thus matched while disregarding the effects of the secondary changes, but in doing so, spurious matches are likely to be introduced. Spurious matches are addressed by ranking the candidates in the list according to a cost function. The cost function estimates the benefit of using each match, and determines the match that is finally selected and encoded.
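
The extension step can be sketched as follows (Python; a simplified reading of the rule above, returning the match length and the positions of the tolerated mismatch runs):

```python
# Sketch of match extension tolerating mismatch runs of up to MM_LEN bytes.
# The match ends at the last matching byte before a run of MM_LEN + 1
# consecutive mismatches; a trailing mismatch run is not part of the match.

MM_LEN = 4  # longest tolerated mismatch (the paper reports MM_len = 4)

def extend_match(data, p_i, tgt, p_tgt, c=8):
    """Return (match_length, mismatches), where mismatches is a list of
    (offset_in_match, run_length) pairs for the tolerated mismatch runs."""
    n = c            # the first c bytes are known to match
    length = c       # match length up to the last matching byte
    mismatches = []
    run = 0          # current run of consecutive mismatching bytes
    while p_i + n < len(data) and p_tgt + n < len(tgt):
        if data[p_i + n] == tgt[p_tgt + n]:
            if run:  # a tolerated run just ended: record its position and length
                mismatches.append((n - run, run))
                run = 0
            length = n + 1
        else:
            run += 1
            if run > MM_LEN:  # MM_LEN + 1 consecutive mismatches: stop
                break
        n += 1
    return length, mismatches
```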

Encoding a match is done with a COPY command, followed by the number of bytes being copied and a pointer to the location of the match. The match can originate from the reference file or from the already encoded section of the target file. Since mismatches can be present in the material being copied, this information is not sufficient to reconstruct the target losslessly: the COPY has to be followed by an encoding of the position of each mismatch and by the information needed to reconstruct the original bytes.

3.1. Encoding the Mismatches

Secondary changes introduced by code relocation, growth and shrinkage complicate the differential compression of executable files; however:

- If a pointer P points to a location A in the reference file and to a location B in the target file, it is likely that other pointers pointing to A in the reference will change to B in the target.

- Code is relocated in blocks. If a pointer changes from A to A + k, it is likely that other pointers referencing addresses close to A will be altered by the same offset k.

- When a code section shrinks or grows by k bytes, it is likely that relative references crossing that section will change from A to A + k.

The previous considerations suggest that mismatches due to secondary changes are somewhat regular and can be predicted well. The algorithm described here takes advantage of a rudimentary caching mechanism to exploit local patterns in the changes.

The encoding of the mismatches is based on a set of tables T_1, T_2, ..., T_MM_len, each having T_size entries. The table T_i stores the T_size most recent mismatches of length i, 1 ≤ i ≤ MM_len. The j-th entry of the i-th table T_i (1 ≤ j ≤ T_size) is T_{i,j} = ⟨ref_{i,j}, tgt_{i,j}, Δ_{i,j}, time_{i,j}⟩, where ref_{i,j} records an i-byte pattern in the reference file, tgt_{i,j} the corresponding mismatched i bytes in the target file, Δ_{i,j} = tgt_{i,j} − ref_{i,j} the numerical difference between these two values, and time_{i,j} keeps track of the most recent access or contains a usage counter, helpful in implementing a cache replacement strategy. Depending on T_size, an implementation using arrays, hashing or another structure may be the most convenient.

After a match is encoded with a COPY command, all mismatches (if any) are encoded by specifying their position, the number of mismatching bytes and their values. However, with the use of the tables described earlier, it is possible to encode the pattern of an i-byte mismatch (ref, tgt) as follows:

1. The table T_i is accessed to verify whether there is an entry j for which ref_{i,j} = ref and tgt_{i,j} = tgt. If such an entry exists, and ref_{i,h} ≠ ref for all h ≠ j, no further information needs to be encoded: the mismatching pattern ref can be used to locate the j-th entry and retrieve the replacement pattern tgt_{i,j} = tgt.

2. Failing that, T_i is searched for an entry j such that Δ_{i,j} = tgt − ref. If such an entry is found, encoding the index j is sufficient for the decoder to reconstruct tgt = ref + Δ_{i,j}. A new entry ⟨ref, tgt, tgt − ref, time⟩ is added to the table and, if necessary, an old entry is deleted according to the cache replacement strategy.

3. Failing that, the replacement pattern tgt has to be encoded explicitly. The new information ⟨ref, tgt, tgt − ref, time⟩ is added to the table and, if necessary, an old entry is deleted according to the cache replacement strategy.

It is important to notice that the operation tgt − ref must be performed byte by byte in order to maintain platform independence.
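
A sketch of this three-case lookup follows (Python). Byte-wise subtraction modulo 256 stands in for the paper's byte-by-byte difference, and plain LRU replacement is used as one plausible cache policy; neither detail is specified by the paper.

```python
# Sketch of the mismatch cache for length-i mismatches (one table per length).
# LRU replacement and mod-256 byte arithmetic are assumptions, not the
# authors' stated choices.

from collections import OrderedDict

T_SIZE = 64  # entries per table (the paper reports T_size = 64)

def byte_diff(tgt, ref):
    return bytes((t - r) % 256 for t, r in zip(tgt, ref))

def byte_add(ref, delta):
    return bytes((r + d) % 256 for r, d in zip(ref, delta))

class MismatchTable:
    def __init__(self):
        self.entries = OrderedDict()  # ref pattern -> (tgt pattern, delta), LRU order

    def encode(self, ref, tgt):
        """Return the cheapest description of the (ref, tgt) mismatch pattern."""
        delta = byte_diff(tgt, ref)
        # Case 1: the exact (ref, tgt) pair is cached (keys make ref unambiguous):
        # nothing to send; the decoder recovers tgt from ref alone.
        if self.entries.get(ref, (None, None))[0] == tgt:
            self.entries.move_to_end(ref)
            return ('CACHED',)
        # Case 2: the same delta is cached under some entry: send its index only.
        for j, (_, (_, d)) in enumerate(self.entries.items()):
            if d == delta:
                self._insert(ref, tgt, delta)
                return ('DELTA_INDEX', j)
        # Case 3: send the replacement pattern explicitly.
        self._insert(ref, tgt, delta)
        return ('EXPLICIT', tgt)

    def _insert(self, ref, tgt, delta):
        self.entries[ref] = (tgt, delta)
        self.entries.move_to_end(ref)
        if len(self.entries) > T_SIZE:
            self.entries.popitem(last=False)  # evict least recently used

# The decoder mirrors the same table updates; e.g. for ('DELTA_INDEX', j)
# it recovers tgt = byte_add(ref, delta_j) from its own copy of the table.
```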

3.2. Cost Function

Matching with mismatches is likely to introduce spurious matches in the list L = P_1, P_2, ..., P_m of matching candidates. Comparing only the lengths of these matches is not sufficient to select the best candidate.

Our algorithm compares the candidates with a cost function aimed at predicting the coding efficiency of each candidate. This cost is simply defined as:

Cost(P_i) = (Bytes to Encode COPY and Mismatches) / length(P_i)

Programs                     Tgt. Size    Bzip2       VCDiff     Zdelta     eDelta     .RTPatch   BSDiff 4   Proposed   Exediff    BSDiff 6
alto: identical binaries     466,944      148,024     n/a        n/a        67         n/a        142        167        155        n/a
alto: gcc -O2 → gcc -O3      466,944      148,024     n/a        n/a        35,663     34,755     33,633     30,486     20,793     n/a
alto: changed reg. alloc.    450,560      148,024     n/a        n/a        32,284     34,571     23,246     31,601     15,845     n/a
alto: extra printf           466,944      148,024     n/a        n/a        6,491      7,524      6,299      7,612      6,237      n/a
agrep: 4.0 → 4.1             262,144      114,388     10,886     7,162      6,469      5,910      6,066      5,611      3,531      4,265
glimpse: 4.0 → 4.1           524,288      222,548     93,935     64,608     36,054     37,951     31,720     31,254     23,200     24,642
glimpseindex: 4.0 → 4.1      442,368      193,883     80,325     51,723     21,233     25,764     21,559     22,669     18,473     16,240
wgconvert: 4.0 → 4.1         368,640      157,536     60,658     38,544     16,476     20,712     15,806     18,043     15,688     12,432
agrep: 3.6 → 4.0             262,144      114,502     79,962     63,282     62,701     58,124     53,490     47,731     41,554     44,327
glimpse: 3.6 → 4.0           524,288      222,178     189,926    147,594    150,091    140,549    130,210    111,298    104,350    109,680
glimpseindex: 3.6 → 4.0      442,368      193,892     144,746    115,980    115,266    105,510    97,782     86,281     79,085     80,447
netscape: 3.01 → 3.04        6,250,496    2,396,661   1,013,581  2,519,221  344,720    351,759    302,431    274,380    284,608    212,032
gimp: 0.99.19 → 1.00.00      1,646,592    642,725     462,588    345,385    353,625    301,879    284,278    237,510    185,962    219,684
iconx: 9.1 → 9.3             548,864      233,056     119,510    80,017     50,199     51,195     44,961     42,893     38,121     31,632
gcc: 2.8.0 → 2.8.1           2,899,968    708,301     422,288    274,652    126,729    140,284    121,371    119,864    76,072     88,022
rcc (lcc): 4.0 → 4.1         811,008      221,826     667        373        183        265        289        305        303        187
apache: 1.3.0 → 1.3.1        679,936      180,708     103,611    69,895     59,998     48,033     38,278     41,037     40,460     25,927
apache: 1.2.4 → 1.3.0        671,744      179,369     242,292    200,511    198,936    216,867    180,981    163,025    227,233    163,249
rcc (lcc): 3.2 → 3.6         434,176      155,090     76,227     52,324     33,658     34,098     33,136     30,139     22,019     22,691
Total Size (*)               16,769,024   5,936,663   3,101,202  4,031,271  1,576,338  1,538,900  1,362,358  1,232,040  1,160,659  1,055,457
Compression Ratio            1.00 : 1     2.97 : 1    5.04 : 1   6.96 : 1   8.54 : 1   8.86 : 1   9.92 : 1   10.98 : 1  12.01 : 1  12.47 : 1
Compression vs. Bzip2                     1.00 : 1    1.70 : 1   2.34 : 1   2.87 : 1   2.98 : 1   3.34 : 1   3.70 : 1   4.04 : 1   4.20 : 1

Table I: Comparison with other differential file compression algorithms. Total size and compression are computed by excluding the first four rows for which not all results were available. Results for VCDIFF, Zdelta, .RTPatch, BSDiff and Exediff are from [8] and [9].

Since commands and their parameters are entropy coded, the numerator cannot be easily determined. An estimate of this term, with fixed costs attributed to each command and parameter, proved sufficient to achieve competitive performance while containing the complexity of the encoder.

In case of a tie, the longest match is preferred; any remaining tie is broken in favor of the match whose position is closest to the position of the previous match.
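
For illustration, a fixed-cost estimate might look like the sketch below; the per-command byte costs are invented for the example, since the paper only states that fixed costs were used.

```python
# Illustrative fixed-cost estimate of a candidate match. All constants are
# assumptions; the paper does not publish its per-command costs.

COPY_COST = 1 + 3 + 3   # command code + length field + position field (assumed)
MISMATCH_COST = {       # assumed cost per mismatch, by encoding case (Sec. 3.1)
    'CACHED': 1,        # case 1: located by its ref pattern
    'DELTA_INDEX': 2,   # case 2: table index
    'EXPLICIT': 2 + 4,  # case 3: position + explicit bytes (length <= MM_len)
}

def cost(match_length, mismatch_cases):
    """Estimated encoded bytes per target byte covered; lower is better."""
    total = COPY_COST + sum(MISMATCH_COST[c] for c in mismatch_cases)
    return total / match_length

# Tie-breaking, per the paper: prefer the longer match, then the match whose
# position is closest to that of the previous match.
```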

3.3. Entropy Coding

As anticipated, COPY and SET commands, their parameters, and the information necessary to recover all mismatches are entropy coded. The results described here were obtained using LZMA (Lempel-Ziv-Markov chain Algorithm), an improved LZ77 compressor developed by Igor Pavlov [15]. LZMA is used in the 7z format of the 7-zip archival software [16]. The use of LZMA is motivated by the small memory footprint of its decompressor, which makes it suitable for mobile applications.
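
As a generic illustration of this final stage (not the authors' toolchain, which used the 7-zip implementation), a serialized command stream could be entropy coded with Python's standard lzma module:

```python
# Generic illustration: entropy coding a serialized command stream with LZMA.
# The toy stream below is hypothetical; the paper's actual command format
# and the 7-zip LZMA encoder are not reproduced here.
import lzma

commands = [b'COPY', (1024).to_bytes(3, 'little'), b'SET', b'\x01\x02\x03']
stream = b''.join(commands)
patch = lzma.compress(stream, preset=9 | lzma.PRESET_EXTREME)
assert lzma.decompress(patch) == stream
```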

Slightly worse compression is obtained with a first-order entropy coder; the small gap suggests that the differential encoder already exploits most of the high-order dependencies.

4. Experimental Results

Table I compares the compression of our method with that of several existing methods. Tests are performed on a set of executable files, first used in [12] to benchmark the differential compression of executable code. These files are DEC UNIX Alpha executables, all in ECOFF format.

Four pairs were artificially created in order to introduce small, predictable changes between versions (identical files, different compiler optimization parameters, different register allocation, and an extra printf instruction in the target version). The remaining 15 pairs are all UNIX applications in common use. Total size and compression do not include the first four pairs, since not all results were available for these test cases. The table also reports the size of the target version before and after bzip2 compression. Results for VCDIFF, Zdelta, .RTPatch, BSDiff and Exediff are from [8] and [9]. Exediff patches have been recompressed with bzip2 instead of the original gzip to ensure a fair comparison.

The order of the columns roughly reflects the performance of the various algorithms, with VCDIFF achieving the lowest and BSDiff 6 the highest average compression.

The first observation is that differential compression has a definite advantage over regular compression: on average, a differential file compressor generates updates 2.5 to 4 times smaller than bzip2. Two notable exceptions are present in the table. The first is the poor compression obtained by Zdelta on the pair “netscape: 3.01 → 3.04”. This behavior may be caused by a file size that exceeds the capacity of the program and forces it to fall back to regular compression. The second is the anomalous size of the updates for the pair “apache: 1.2.4 → 1.3.0”, where only our algorithm and BSDiff 6 achieve compression higher than bzip2. As explained in [9], this is caused by the two versions having less than 50% of their source code in common.

Results for our algorithm were obtained using a minimum copy length C = 8 and MM_len = 4 tables of T_size = 64 entries each to encode the mismatches.

Despite its simplicity, our algorithm performs as well as BSDiff 4 and Exediff. This result is important since, unlike Exediff, our proposal is not platform-dependent. Also, since our algorithm has been developed for use in mobile and embedded applications, the results were obtained with limited memory and computing power. The performance gap with respect to BSDiff 6 is explained by the lower complexity and simpler structure of our algorithm (BSDiff 6 uses the FFT to find correlations between the reference and target files), and suggests that improvements are indeed possible.

5. Acknowledgements

The authors wish to thank the members of the mProve team for their contributions: LaShawn McGhee, Brian O’Neill, Jason Penkethman, and Marko Slyz. The authors also thank Jennifer Jones for the careful review of the manuscript.

The understanding of eDelta and BSDiff was greatly facilitated by the patience of their inventors, Jacob Gorm Hansen and Colin Percival, with whom the first author exchanged numerous emails. Jacob Gorm Hansen also made available the most recent results for eDelta and provided the example shown in Figure 2.

References

[1] J. Hunt and M. Douglas McIlroy, “An Algorithm for Differential File Comparison,” Computing Science Technical Report No. 41, Bell Labs, N.J., June 1976.
[2] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, “Introduction to Algorithms,” Second Edition, MIT Press and McGraw-Hill, 2001.
[3] E. Myers, “An O(ND) Difference Algorithm and its Variations,” Algorithmica, 1(2):251–266, 1986.
[4] D. Korn et al., “The VCDIFF Generic Differencing and Compression Data Format,” RFC 3284, June 2002.
[5] D. Trendafilov, N. Memon, and T. Suel, “Zdelta: An Efficient Delta Compression Tool,” Technical Report TR-CIS-2002-02, Polytechnic University, 2002.
[6] D. Salomon, “Data Compression: the Complete Reference,” 4th edition, Springer, Sep. 2006.
[7] C. Percival, “An Automated Binary Security Update System for FreeBSD,” Proceedings of BSDCon ’03, 29–34, 2003.
[8] C. Percival, “Naive Differences of Executable Code,” Computing Lab, Oxford University, 2003. Url: http://www.daemonology.net/bsdiff/bsdiff.pdf.
[9] C. Percival, “Matching with Mismatches and Assorted Applications,” Ph.D. Thesis, 2006. Url: http://www.daemonology.net/papers/thesis.pdf.
[10] J. Gorm Hansen, “The EDelta Algorithm for Linear Time, Constant Space Executable Differencing,” Private Communications, Nov. 2006.
[11] R. C. Burns and D. D. E. Long, “A Linear Time, Constant Space Differencing Algorithm,” 1997 International Performance, Computing and Communications Conference, Feb. 1997.
[12] B. Baker, U. Manber, and R. Muth, “Compressing Differences of Executable Code,” ACM SIGPLAN Workshop on Compiler Support for System Software, 1999.
[13] D. Shapira and J. Storer, “In Place Differential File Compression,” The Computer Journal, 48:677–691, 2005.
[14] R. C. Burns and D. D. E. Long, “In-Place Reconstruction of Delta Compressed Files,” Symposium on Principles of Distributed Computing, 267–275, 1998.
[15] Wikipedia entry on LZMA. Url: http://en.wikipedia.org/wiki/LZMA.
[16] 7-Zip web page. Url: http://www.7-zip.org.
