Cryptographic Hashing for Virus Localization

Embed Size (px)

Citation preview

  • 7/30/2019 Cryptographic Hashing for Virus Localization

    1/8

    Cryptographic Hashing for Virus Localization

    Giovanni Di CrescenzoTelcordia Technologies

    Piscataway, NJ

    [email protected]

    Faramak VakilTelcordia Technologies

    Piscataway, NJ

    [email protected]

    ABSTRACT

    Virus detection is an important problem in the area of com-puter security. Modern techniques attempting to solve thisproblem fall into the general paradigms of signature detec-tion and integrity checking. In this paper we focus on thelatter principle, which proposes to label an executable orsource file with a tag computed using a cryptographic hashfunction, which later allows to detect if any changes havebeen performed to the file. We suggest to extend this prin-ciple so that not only changes to the file are detected, butalso these changes are localized within the file; this is espe-cially useful in the virus diagnostics which can then focuson the localized area in the file rather than the entire file.This implicitly defines an apparently new problem, which wecall virus localization. We design techniques to solve thevirus localization problem based on repeated efficient appli-cations of cryptographic hashing to carefully chosen subsetsof the set of file blocks, for each of the most important andknown virus infection techniques, such as rewriting tech-niques, appending and prepending techniques, and insertiontechniques.

    Categories and Subject Descriptors

    D.2.4 [Software Engineering]: Software/Program Veri-ficationReliability, Validation; F.2 [Analysis of Algo-rithms and Problem Complexity]: Miscellaneous; G.4[Mathematical Software]: Algorithm Design and Analy-sisEfficiency

    General Terms

    Security, Algorithms, Theory

    Keywords

    Virus Detection, Virus Tolerance, Virus Localization, In-tegrity Verification, Cryptographic Hashing

    Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.WORM06, November 3, 2006, Alexandria, Virginia, USA.Copyright 2006 ACM 1-59593-551-7/06/0011 ...$5.00.

    1. INTRODUCTIONVirus detection is a problem of surprisingly wide impact

    to any PC user, as she or he is routinely asked to take pre-venting measures against viruses, such as buying and run-ning antivirus software. In the light of early impossibilityresults [1,2] for general virus detection, having theoreticalvalue but small practical impact, research in this area hasmoved towards various heuristic approaches targeting, some-times with success, specific classes of viruses. Some of themost successful modern techniques attempting to solve thisproblem fall into the general paradigms of signature detec-tion and integrity checking (see, e.g. [3,4]). The formerrequires discovering pieces of infected code (called signa-tures) for known viruses, storing them and developing soft-ware that scans the computer memory to search for suchsignatures. The latter, which we focus on in this paper, re-quires using cryptographic hash functions that detect unau-thorized changes to a file, possibly detecting the presenceof unknown viruses. An important example of the successof the latter technique is Tripwire [5], a widely availableand recommended integrity checking program for the UNIXenvironment. It might also be instructive to look at b othtechniques with an analogy with the two main intrusion de-

    tection principles of signature and anomaly detection (see,e.g. [6]): the signature virus detection paradigm is similarto the signature detection principle in the intrusion detec-tion area; the integrity checking paradigm, instead, is moresimilar to the anomaly detection principle in the intrusiondetection area.

    Our results. We suggest to extend the integrity checkingprinciple so that not only changes to the file are detected,but also these changes are localized within the file; this isespecially useful in the virus diagnostics which can then fo-cus on the localized area in the file rather than the entirefile (we stress that this phase is usually both very resource-expensive and failure-prone in the absence of informationabout the virus). This implicitly defines an apparently new

    problem in the area of software security, which we call viruslocalization. We start the study of this problem by char-acterizing virus infection techniques, such as rewriting tech-niques, appending and prepending techniques, and insertiontechniques, and by posing the problem of designing tech-niques that achieve virus localization for all of them. Ourtechniques are based on repeated efficient applications ofcryptographic hashing to various subsets of the set of fileblocks, chosen according to codes that are specific to theinfection technique. In all cases, our techniques achieve sat-isfactory localization (an area at most twice as large as the

    41

  • 7/30/2019 Cryptographic Hashing for Virus Localization

    2/8

    virus is identified in the corrupted file), and efficiency (thehashes returned by our localizing hash functions are longerthat those returned by a conventional cryptographic hashfunction by at most a logarithmic factor). We stress thatthis output efficiency is a crucial property of a localizinghash function, as there is a trivial construction (which welater discuss in detail) that returns much longer hashes (i.e.,by a factor linear in the number of atomic blocks of the inputdocument).

    Previous results. To the best of our knowledge, the prob-lem of virus localization has never been rigorously inves-tigated or posed before. Cryptographic hashing is a well-known paradigm for integrity verification, and is fundamen-tal for programs that verify the integrity of file systems, likeTripwire [5]. Cryptographic hashing of all atomic blocks of afile is also a known paradigm, and is important for programsthat remotely update files over high latency, low bandwidthlink, like rsync [7], or address write-once archival data stor-age purposes, like the Venti system [8]. We note that noneof these programs gets any closer than the mentioned trivialconstruction to solve our virus localization problem.

    Organization of the paper. In what follows, we first re-view in Section 2 some well-known examples of PC viruses

    and describe a classification of mechanisms followed by knownviruses (this will be useful before designing our localizationtechniques). Then, in Section 3 we review known techniquesfor virus detection and describe our novel techniques forvirus localization through cryptographic hashing. Finally, inSection 4 we discuss properties and compare performancesof our techniques.

    2. DEFINITIONSWe recall the known notion and formal definition of (cryp-

    tographic) collision-resistant hashing (see, e.g. [9]), and de-scribe a classification of virus behaviors that is useful in therest of the paper.

    2.1 Collision-resistant hashingMathematical tools often used in Cryptography, called

    collision-intractable hash functions (a.k.a. collision-free,collision-resistant, and sometimes just called cryptographic),are very often used for several applications, including thoserequiring integrity verification of files, messages, etc. Thesefunctions are defined as follows: they take as input a bi-nary string of arbitrary length (representing the file to becompressed), and return a fixed-size (e.g., 128- or 160-bit)binary string as output (representing the hash or fingerprintof the original file). Needless to say, the input to the hashfunction being typically much longer than 160 bit impliesthat the number of preimages of any single output of thehash function is very large. Yet, the amazing property ex-

    pected by these functions is that it seems computationallyintractable for any efficient algorithm to find even just twopreimages that map to the same output, in the sense thatany algorithm would have to take infeasible time (e.g., a fewcenturies or more) to succeed in doing that.

    We now recall their formal definition.

    Definition 2.1. Let k be a security parameter and letH = {Hw}kN be a family of functions Hw : {0, 1}

    k {0, 1}a {0, 1}b, where a > b and w is a function indexsatisfying |w| = k. We say that H is a collision-resistant

    hash function family if: (1) there exists a probabilistic poly-nomial time (in k) algorithm computing Hw for each w;(2) for any probabilistic polynomial-time algorithm A, thereexists a negligible function such that, when w is uni-formly chosen, the probability that A(w) = (x1, x2) suchthat Hw(x1) = Hw(x2) is at most (k).

    Several constructions of collision-intractable hash functionshave been given in the literature under the hardness of spe-

    cific number-theoretic problems; in practice, for efficiencypurposes, heuristic constructions, such as SHA1 or more re-cent proposals [10], can be deployed.

    In designing our techniques, we will use the followingproperty of collision-resistant hash functions (that directlyfollows from their definition): except with negligible prob-ability, Hw(x1) = Hw(x2) when x1 = x2, for |x1| k and|x2| k. We will then define constructions of (composed)collision-resistant hash functions that make repeated appli-cations of an (atomic) collision-resistant hash function overtheir input, so that the resulting output of the composedfunction on any two inputs reveals information about theinputs similarity/difference.

    2.2 VirusesWe will consider viruses that perform certain modifica-

    tions to a target file, resulting in a corrupted file, where thetype of modifications may vary according to specific infec-tion techniques.

    We first discuss well-known examples of viruses and thenpresent a characterization of PC viruses behaviors, definingthe infection techniques that we consider in the rest of thepaper.

    Examples of PC Viruses. Let us briefly examine a fewexamples of well-known virus attacks, e.g., Michelangelo[11], Melissa [12], and Nimda [13], on the Internet to under-stand how they infected the PCs and disrupted the Internet,its users, and resources.

    The well-known Michelangelo virus infected the boot sec-tor of PCs running MS-DOS versions 2.xx and up. It movedthe contents of the original master boot record (MBR) fileto another location on the disk and placed itself into theMBR. Upon next startups, the basic input-output systemwould execute the Michelangelos code that would load thevirus into memory, and then would pass the control back tothe copy of the original MBR to continue the boot processunless it was March 6. On that day Michelangelo wouldcompletely wipe out the hard drive of the PC.

    The Melissa virus/worm infected the MS Office applica-tions and spread via Outlook e-mail. For instance, whenMelissa virus infected in a MSWord document, its code waslocated in the DocumentOpen() subroutine that automati-cally executed when a user tried to open a document.

    The Nimda worm would exploit multiple vulnerabilitiesof the Internet to infect the PCs and servers and spreadacross the Internet rapidly. This worm would modify webdocuments (e.g., .htm, .html, and .asp files) and certain exe-cutable files found on the systems it infected, and would cre-ate numerous copies of itself (using file names with .eml and.nws extensions) in all writable directories. It would propa-gate through email arriving as a MIME multipart/alternativemessage consisting of two parts. The first part was de-fined as MIME type text/html, though it contained notext, so the email appears to have no content. The second

    42

  • 7/30/2019 Cryptographic Hashing for Virus Localization

    3/8

    part was a MIME type audio/x-wav containing a base64-encoded attachment that was a binary executable namedreadme.exe.

    Characterization of known PC viruses behaviors.

    As illustrated in the above virus examples and discussed insome computer virus literature (see, e.g., [3,4]), most In-ternet viruses can be characterized to follow one of theseinfection techniques: Overwriting, Prepending, Appendingand Embedding infection techniques, which we now describein detail.

    Overwriting infection technique. The virus infects an exe-cutable file by replacing portion of the hosts code. One waythat this can be achieved is for the virus to simply open thetarget file for writing as it would open a regular data file, andthen save a copy of itself to the file. When the executablefile is later launched by its owner, the operating system willexecute the code instead of the expected program. This in-fection technique typically damages the host file to makingit not usable.

    Prepending infection technique. The virus inserts its code atthe beginning of the executable file that it desires to infect.This technique is generally more elaborated than the over-writing technique and was used, for instance, by the virus

    Nimda [13]. When the executable file is later launched byits owner, the operating system will execute first the viruscode since it is located at the beginning of the executable,and then the intended program. This infection techniquenot only may not damage the host file but may have effectsthat are not easily detectable from the codes owner. In fact,the presence itself of such a virus may not be easy to detect.

    Appending infection technique. The virus inserts its code atthe end of the executable file that it desires to infect, and,typically, a jump to the beginning of the virus code at thebeginning of the infected file. When the executable file islater launched by its owner, the operating system will jumpto execute the virus code first and then return control to theintended program. Similarly as with the prepending tech-

    nique, the presence of a virus using this infection techniquemay be not easy to detect. This technique was used, forinstance, by the virus Michelangelo [11].

    Embedding infection technique. The virus inserts a pieceof code or a command pointing to code in a separate fileanywhere into a shell script, a Perl script, a Java class fileor source code of files that will eventually be compiled intoregular executables. Although less examples have been seenof such techniques, they are very useful in reminding thatevery program containing executable instructions is a po-tential target for a virus.

    Detection and Analysis of PC viruses. Available an-tivirus software typically uses three main techniques for de-tecting viruses: signatures, heuristics, and integrity verifi-

    cation, which we briefly explain. The signature techniqueis similar to the signature detection approach in intrusiondetection systems (or, perhaps, a variant of it): first, knownviruses are studied and signatures of them are stored; thenoccurrences of these signatures are looked for into candidateexecutable files. Although this is the most popular approachfor virus detection, it relies on quick update of the signaturedatabase by vendors and of their signature files by users,and it is easily defeated by polymorphic and metamorphicvirus techniques. The remaining two techniques (heuristicsand integrity verification), are more similar to the anomaly

    detection approach in intrusion detection systems. Heuris-tic techniques may be somewhat sophisticated in that theyattempt to identify viruses based on some behaviors thatthey are likely to exhibit, such as attempts to write on ex-ecutable files, to access boot sectors, to delete hard drivecontents, etc. Integrity verification techniques try to detectunexpected modifications to files and after the infection hashappened, but potentially before the execution of the in-fected file happens, thus still making the infection harmless.

    While both heuristics and integrity verification techniqueshave the potential of catching more intelligent viruses suchas those equipped with polymorphism and metamorphismcapabilities, they are at most able to raise an alert on aparticular file, which later has to be carefully emulated andanalyzed in the virus diagnosis phase under a controlled en-vironment where a conclusion about the location, natureand consequences of the potential virus need to be derived.Due to the difficulty of the realization of an accurate con-trolled environment for emulation, the accuracy of the de-rived consequences may be not entirely or not at all trustedand it is naturally desirable to strengthen the mentionedtechniques. Moreover, in many cases the modification car-ried by the virus to the original file is very minimal (e.g.,

    a subroutine call to a program located somewhere else inmemory), and therefore it would be very helpful to have ad-ditional information about the virus itself. In what follows,we describe integrity verification techniques that addition-ally provide useful virus localization properties.

    3. OUR DETECTION APPROACHAs mentioned, the integrity verification techniques attempt

    to detect unexpected modification to a file, before executingit. In our approach, we would like to extend these techniquesso that, in case an unexpected modification is detected, someadditional information is obtained about the location of themodification, within the file. We will refer to each of thepreviously discussed virus infection techniques (character-ized as: rewriting, prepending, appending, and ebbeddingtechniques) and we will design virus localization techniquefor each of them. To that purpose, it is useful to discussin slightly greater detail the original integrity verificationtechnique.

    3.1 Integrityverification based on cryptographichashing

    Given any collision-resistant hash function family H ={Hw}kN (for simplicity of notation, we will drop the indexw from Hw), and a file f, the 160-bit value hash = H(f)gives evidence that any efficient algorithm (possibly creatingf itself) would not be able to obtain in feasible computationtime or to already have a g different from f meeting the in-tegrity verification test H(g) = H(f). Integrity verification

    of executable or source files is then done as follows. In a pre-liminary stage, when the file f is believed to be not infected,or as soon as the file is downloaded from its original source,a tag hash = H(f) is attached to the file. Later, at anytime a legal modification on the file is performed, resultingin file f, a new value hash = H(f) replaces the previoushash value. Later, at any time the file needs to be executed,or compiled for execution, the integrity verification test isperformed on the current version f of the file; that is, itis checked whether hash = H(f). If the match is sat-isfied, then we have evidence that only legal modifications

    43

  • 7/30/2019 Cryptographic Hashing for Virus Localization

    4/8

    have been performed to the file; otherwise, some unexpectedmodification has b een performed. As this modification canbe caused by an infection, the file is then returned to ananalyzer that tries various diagnosis techniques, includingemulating the execution of the file in a safe environment, toderive useful information about the virus and possibly re-store the file f to its previous, uninfected, state. In whatfollows, we show techniques that help the emulation processto focus on a certain area of the file that more accurately

    localizes the potential virus.We note that the original integrity verification technique

    does not specifically deal with viruses changing the storedhash value. Presumably, this technique may be modified todeal with this case using any of the following ways:

    (a) It assumes the existence of a secure read-only memory,and stores all hash values in this memory (this can actuallybe implemented more efficiently, by associating all the hashvalues to leaves of a Merkle tree whose root is the only valuestored in the read-only memory);

    (b) It assumes the existence of a memory that cannot be ac-cessed by an attacker, and stores a symmetric key k into thismemory; then, a keyed hash function Hk is used analogouslyas H was used in the above description, and a public-key en-

    cryption of key k is sent to the user downloading the file sothat the user can run Hk. Here, a different key k shouldbe used for each file, and, as before, instead of storing allkeys in the secure memory, one can define a Merkle tree overthem and only store the root value.

    (c) It assumes the existence of a memory that cannot beaccessed by an attacker, and stores a signature secret keysk into this memory; then, the computation of a hash valuehash using H as in the above description is followed by asignature sig of hash, and only the value sig is returnedfor verification.

    In the rest of the paper we also do not deal with viruseschanging any stored hash values, and carry on any of theseassumptions to our techniques.

    3.2 Virus localization based on cryptographichashing

    Towards our goal, the basic property of the above integrityverification technique can be rephrased as follows. Givenfiles M1 and M2, and their hashes hash(1), and hash(2),there exists an efficient algorithm DIFF that decides thefunction p(M1, M2), defined as equal to 1 if M1 = M2 and0 otherwise. Algorithm DIFF, on input p, hash(1), andhash(2), simply checks whether hash(1) = hash(2) and re-turns 1 if yes or 0 otherwise. We would like to generalizethis approach to more general functions, in particular, tofunctions detecting information about differences betweenM1 and M2, that are caused by each one among the previ-

    ously discussed infection techniques (rewriting, appending,prepending and embedding infection techniques). In gen-eral, we will talk about p-revealing collision-resistant hashfunctions, that are informally defined as collision-resistanthash functions such that for any efficient adversary return-ing two files M1, M2, there exists an efficient algorithm Detthat, on input the two hashes of M1 and M2, is able withvery high probability to compute the function p(M1, M2),where typically p is a function returning some informationabout differences between the two files. (The formal def-inition requires a somewhat elaborated probability experi-

    ment and is deferred to the full paper.) While an arbitrarycollision-intractable hash function does not seem to help inrevealing useful information about the nature of the differ-ences between M1 and M2 we will consider more involvedcollision-intractable hash functions (starting from conven-tional ones) that do achieve that.

    An immediate (but inefficient) solution to construct a p-revealing collision-resistant hash function is as follows. Firstconsider that in all our techniques we will view each file M as

    composed of n atomic blocks M(1), M(2), . . . , M (n), wherea block size can be appropriately chosen; for instance, itcould be determined as the smallest semantic unit that afile can be decomposed into (e.g., a line in a source file).Then a very simple hash function is obtained by returningan atomic hash of each block of the input file and then anatomic hash of all the hashes previously computed, wherean atomic hash is computed using any collision-intractablehash function. Note that with this technique one can use theabove algorithm DIFF on each block of the file and there-fore derive any interesting information that is required byfunction p, for anyp. The obvious drawback however is thatthe returned hash is too long, as it is equal to n times thesize of a conventional hash, which can be already imprac-

    tical for files of quite small sizes, such as a few kilobytes.(Still, it is worth noting that this technique is used as a cru-cial component of the rsync algorithm [7], which addressesthe task of remotely updating files over a high latency, lowbandwidth link). This motivates our definition of two mainmetrics associated with any construction of a p-revealingollision-resistant hash functions.

    Two metrics of interest. As a consequence of the previ-ous discussion, the main metric we pay special attention to isthe size of the returned hash, which we will refer to as the ex-pansion factorof a construction, defined as follows: let H bea collision-intractable hash function, and cH be a construc-tion of a hash function which makes calls to H; then the ex-pansion factor ofcH is the size of the output returned by cHon an n-bit input divided by the size of the output returnedby H on an n-bit input. Naturally, we aim at designing con-structions with the shortest possible expansion factor (e.g.,logarithmic in n). This implies very small computationaland storage overhead resulting from the added verificationsand generation of hashes, since all our constructions can bedescribed as a number of applications of hash function equalto the expansion factor. Our general approach to achievethis performance, as instantiated on each of the infectiontechniques, is that of computing multiple applications of abasic collision-intractable hash function to carefully chosensubsets of the message bits; here, the choice of the subsets isperformed according to an appropriate encoding schemesthat later allows to decode the desired information aboutthe differences between the two messages from the multiple

    hashes alone, where both the encoding/decoding schemesand the desired information depend on the infection tech-nique.

    Another metric of interest will be that of minimizing thelocalized area in which the virus is detected to belong to.In particular, we aim at returning an interval of blocks con-taining a number of blocks that is at most max( b(V), 1),for some localization factor , where b(V) is the number ofblocks that virus V is made of.

    We note that all our constructions achieve localizationfactor = 2, expansion factor logarithmic in n, and only

    44

  • 7/30/2019 Cryptographic Hashing for Virus Localization

    5/8

    require unfrequent, off-line computations of the hashes (thecomputation overhead is thus not a significant factor to-wards the efficiency of the overall system).

    3.3 Localization of rewriting infectionsRecall that in rewriting infection techniques, a virus is a

    portion of, say, text that rewrites some blocks of the file.We will assume, for simplicity and since this seems to bethe case in most virus instances, that the file keeps the same

    size even after rewriting and that the rewritten blocks areadjacently located. We will design a p-revealing collision-resistant hash function, where p is defined as follows. Oninput n-block files M1 and M2, function p(M1, M2) is equalto

    0 if M1 = M2;

    (a, b) if b a < n/4 and M1(a), M1(a + 1), . . . , M 1(b)and M2(a), M2(a + 1), . . . , M 2(b) are the only blockswhere M1 and M2 differ;

    otherwise.

    In the above definition we restrict the analysis to virusesrewriting at most n/4 blocks. We consider this to be satis-

    factory enough as when a larger number of blocks is rewrit-ten, the localization problem loses relevance, in that a lo-calizing hash function is not significantly more useful thana (conventional) hash function.

    Our construction. Informally, our construction works asfollows. At each step, it uses the atomic collision-resistanthash function H to compute hashes of some blocks of theinput file. Given any two files M1, M2 that differ due toa rewriting virus that modified M1 into M2, the hashes ineach step should help in finding the location of the virus byeliminating half of the candidate blocks in M2. The blockseliminated are one among these four: the first half or thesecond half of the yet not eliminated blocks in the file, or,the first half or the second half of an appropriate cyclical

    shift of yet not eliminated blocks in the file. This is enoughto achieve localization factor 2, and, since at each step 4hashes are computed, output expansion at most 4 log n.

    Formal description. Let H denote a collision-resistant hashfunction. Our p-revealing collision-resistant hash function,which we denote as pH, takes as input an n-block messageM = M(0), . . . , M (n1), and applies 4 log n times the origi-nal function H to subsets of the blocks ofM, as follows (weassume that n is a power of 2 for simplicity and logarithmsare in base 2):

    Step 1: In this first step, the blocks in M are grouped into4 segments, each obtained by concatenating n/2 blocks, de-fined as

    S1,1 = M(0)| |M(n/2 1),

    S1,2 = M(n/2)| |M(n 1),

    S1,3 = M(n/4)| |M(3n/4 1), and

    S1,4 = M(0)| |M(n/4 1)|M(3n/4)| |M(n 1)).

    (Note that the indices of the blocks in S1,3 and S1,4 areobtained by cyclically shifting the indices of the blocks inS1,1 and S1,2, respectively, by n/4.) The 4 hashes in thisstep are computed as h1,i = H(S1,i) for i = 1, 2, 3, 4.

    Step j, for j = 2, . . . , log n 1: Extending step 1, theblocks in M are grouped in 2 2j segments, each obtained

    by concatenating n/2j blocks, defined as Sj,i = M((i 1) n/2j)| |M(i n/2j 1) and Sj,i+2j = M(((i 1) n/2

    j +

    n/2j+1) mod n)| |M((i n/2j + n/2j+1 1) mod n), fori = 1, . . . , 2j, (note that the indices of the blocks in Sj,i+2j

    are obtained by cyclically shifting those in Sj,i by n/2j+1).

    The 4 hashes in this step are computed as follows: wheni = 1, 2,

    hj,i = H(Sj,i|Sj,i+2|Sj,i+4| |Sj,i+2j2);

    and, when i = 3, 4,

    hj,i = H(Sj,i2+2j |Sj,i+2j |Sj,i+2j+2| |Sj,i+2j+2j4).

    Output: h = ( (h1,i, h2,i, . . . , hlogn1,i) )i=1,2,3,4.

    Properties of our construction. The overall number ofatomic hashes of pH is only 4log n 4, which is essentiallyoptimal (up to a small constant) in a model where each hashreveals one bit of information about the location of the virusinterval.

    To get convinced that pH is collision-resistant, observethat an algorithm finding collisions in the output ofpH findstwo different inputs that are mapped by H to the same h(j|i)for at least one of the values j = 0, 1, . . . , log n, and i =

    1, 2, 3, 4, which implies an algorithm for finding a collisionfor H.We now prove that pH is p-revealing. First of all, we note

    that by definition ofp, it is enough to only consider the caseb a < n/4. Then the claim follows by the following lemma:There exists an efficient algorithm that, for any virus interval(a, b) of size at most n/(22j), on input M (a version ofMsubject to rewriting infection with the virus interval (a, b))and the hashes obtained until step j of the constructionpH on input the original file M, returns an interval (a, b)such that a a < b b and, if n/2j (b a + 1),(b a + 1) n/2j . To see that the lemma implies the factthat pH is p-revealing and has localization factor 2, considerany virus interval (a, b) of size b a < n/4, and take jsuch that n/2j+1 (b a + 1) < n/2j ; by applying the

    lemma for this value of j, we obtain that a a

    < b

    b,which implies that pH is p-revealing, and (b a + 1) n/2j 2(b a + 1), which implies that it has localizationfactor 2. We now prove the lemma by induction over j.We show an efficient algorithm Det that first computes theoutput h = ( (h1,i, h

    2,i, . . . , h

    logn1,i) )i=1,2,3,4, of pH on

    input M, and then computes tj,i = 1 ifhj,i = hj,i or tj,i = 0

    if hj,i = hj,i, for all indices (j, i), j {1, . . . , log n 1},

    i {1, 2, 3, 4}, and uses values tj,i to return an interval(a, b).

    Base case (j = 1). We note that any virus interval (a, b) ofsize at most n/4 is strictly contained in exactly one amongS1,i, for i = 1, 2, 3, 4, and therefore exactly one of valuest1,i will be equal to 1. Given values t1,i, for i = 1, 2, 3, 4,

    such that t1,i

    = 1, algorithm Det can return (a, b) equal to(0, n/2 1), (n/2, n 1), (n/4, 3n/4 1) or (3n/4, (5n/4 1)mod n), if i = 1, 2, 3 or 4, respectively.

    Inductive step. We consider any virus interval (a, b) ofsize at most n/(2 2j); since n/(2 2j) n/(2 2j1), bythe induction hypothesis, we obtain that there exists an al-gorithm that by only using the hashes obtained until stepj 1 of the construction pH, returns an interval (a, b)that contains (a, b) and satisfies (b a) < n/2j1. Wewould like to extend the algorithm so that it returns an in-terval (a, b) that contains (a, b) and satisfies (ba) < n/2j .

    45

  • 7/30/2019 Cryptographic Hashing for Virus Localization

    6/8

    We assume (b a) n/2j (or otherwise there is nothingelse to prove), that (a, b) = Sj1,i for some i (note that|Sj1,i| = n/2

    j1), and consider the hashes obtained at stepj of the construction pH. These hashes contain hashes of 4intervals Sj,i1, Sj,i2, Sj,i3, Sj,i4 such that the following holds:

    (a) the interval (a, b) can be partitioned into the 2 equal-size intervals T1 = (a

    , b) Sj1,i1 and T2 = (a, b)

    Sj1,i2 and into the 2 equal-size intervals T3 = (a, b)

    Sj1,i3 and T4 = (a, b) Sj1,i4;

    (b) |T1| = |T2| = |T3| = |T4|; and

    (c) the indices of the blocks in T3 and T4 are obtained bycyclically shifting those in T1 and T2, respectively, by n/2

    j .

    Due to (a), (b), and (c), we can apply the same reasoningdone in the base case to intervals T1, T2, T3, T4, and thusobtain that any virus interval (a, b) of size at most n/(2 2j) is strictly contained in exactly one among Ti, for i =1, 2, 3, 4, and therefore exactly one of the values tj,i obtainedfrom the hashes at step j will be equal to 1. Given valuestj,i, for i = 1, 2, 3, 4, such that tj,i = 1, the algorithm canreturn (a, b) equal to T1, T2, T3 or T4, if i

    = 1, 2, 3 or 4,respectively.

    3.4 Localization of appending infectionsWe will only deal with appending infection techniques asprepending infection techniques can be dealt with in a sym-metric way. Recall that in appending infection techniques, avirus is a portion of, say, text that is appended at the end ofthe target file, and, as a consequence, the file size increasesand after the virus is appended the previous file size maybe lost. We will design a p-revealing collision-resistant hashfunction, where p is defined as follows. On input M1 andM2 (of not necessarily equal length), function p(M1, M2) isequal to

    0 if M1 = M2;

    x if|M2| 2|M1| and M1(i) = M2(i) for i = 1, . . . , x

    1 and M1(i) = M2(i) for i = x. otherwise.

    Similarly as in the case of overwriting infections, here we canrestrict the analysis to viruses appending at most n blocks(that is, we assume that the number of blocks in M2 is atmost 2n, where n is the number of blocks in M1).

    Our construction. Informally, our construction works asfollows. At each step, it uses the atomic collision-resistanthash function H to computes 2 hashes of some blocks of theinput file. Given any two files M1, M2 that differ due to anappending virus that modified M1 into M2, the 2 hashes ineach step should help in finding the location of the virus byeliminating about half of the candidate blocks in M2. One

    main difficulty with appending viruses is that the lengthof file M2 may be different from the length of target fileM1, and, therefore, we cannot use length-dependent tech-niques such as those for the rewriting viruses. Instead, westart eliminating candidate blocks from the beginning of thefile, by carefully computing hashes on subsets independentfrom the length of the target file.Formal description. Let H denote a collision-resistant hashfunction. Our p-revealing collision-resistant hash function,which we denote as pH, takes as input an -block messageM = M(1), . . . , M (), and applies at most 2 log times the

    original function H to subsets of the n blocks of M, as fol-lows:

    1. set j 1 and Mj = M;

    2. compute the largest positive integer kj such that 2kj

    /2

    obtain, from Mj , the following two segments:

    Sj,1 = Mj(1), . . . , M j(2kj ),

    Sj,2 = Mj(2kj + 1), . . . , M j(2kj+1);

    compute hash h1,i = H(Sj,i) for i = 1, 2;

    3. repeat step 2 recursively over Mj+1 = Mj(2kj+1 +

    1)| |Mj() if 2kj+1 + 1 1 (that is, unless the

    new input Mj+1 has 0 or 1 blocks, when the recursionends), and set j = j + 1;

    4. return all computed hashes (h1,i, h2,i, . . . , hlog ,i)i=1,2.

    Properties of our construction. We denote by n thenumber of blocks of the target file M and by n the numberof blocks of the corrupted file M. Then we note that thereare at most log recursion steps and therefore the overallnumber of hashes returned by pH is 2log n 2log n + 1,since we assume that n 2n. This expansion factor isessentially optimal (up to a small constant) in a model whereeach hash reveals one bit of information about the locationof the virus interval.

    To get convinced that pH is collision-resistant, observethat an algorithm finding collisions in the output ofpH findstwo different inputs that are mapped by H to the same h(j|i)for at least one of the values j = 1, . . . , log n, and i = 1, 2,which implies an algorithm for finding a collision for H.

    We now sketch the proof that pH is p-revealing. We startby noting that the computation of pH is divided into recur-sive steps, where the j-th step takes as input Mj(sj)| |M(),and (s1, . . . , slog ) in an increasing sequence of integers in{1, . . . , } such that sj = 2

    kj1+1 + 1, for j 2, and sj = 1

    when j = 1. We also note that by definition of p, it isenough to only consider the case s /2. Then the claimfollows directly by the following lemma: there exists an ef-ficient algorithm that, for any virus interval (s, ) with atleast sj + 1 blocks, on input an -block corrupted fileM, equal to the target file M with interval (s, ) appendedto it, and the hashes obtained in the first j 1 steps ofthe recursion in the construction pH, returns a value x suchthat x s (that is, the algorithm localizes the virus) andx 2s n 1 (that is, the localization factor of pH is atmost 2).

    We now prove the lemma by induction over j. A first ob-servation is that there exists an efficient algorithm Det that,as a first step, runs a slightly modified version of functionpH on input M, the modification being as follows.

    (a) Step 2 computes values kj and segments Sj,i using thenumber of blocks n of M while Step 2 computes valueskj and segments Sj,i using the number of blocks n of M(this implies that, in general, kj , Sj,i may be different fromkj , S

    j,i, respectively);

    (b) in step 2, 3 segments (rather than 2 as in step 2) areobtained from Mj: Sj,1 and Sj,2, as defined in step 2, and

    also Sj,0, defined as = Mj(1), . . . , M j(2kj1); at the end of

    step 2, the 3 hashes h1,i = H(Sj(i)), for i = 0, 1, 2, arecomputed.

    46

  • 7/30/2019 Cryptographic Hashing for Virus Localization

    7/8

    Thus, algorithm Det can compute an augmented hash out-put h = ( (h1,i, h

    2,i, . . . , h

    logn,i) )i=0,1,2.

    Base case (j = 1). Recall that k1 is defined as the largestpositive integer such that 2k1 n/2, and k1 is defined as the

    largest positive integer such that 2k

    1 n/2. Since n = n,it may hold that k1 = k

    1; however, since n

    2n, it holdsthat k1 {k1, k1 + 1}. We assume wlog that the algorithmDet can check whether k1 = k1 or k

    1 = k1 + 1 (this can be

    achieved in many ways; for instance, by augmenting eachof the 2log atomic hashes in the output returned by pHwith the number of blocks in its input). First, considerthe case k1 = k1; then, algorithm Det, after computingh, returns x = 2k1 + 1 if h1,1 = h1,1 and h

    1,2 = h1,2; or

    returns x = 1 if h1,i = h1,i for i = 1, 2. Now, consider thecase k1 = k1 + 1; then, algorithm Det, after computing h

    ,returns x = 2k1 +1 ifh1,0 = h1,1 and h

    1,1 = h1,2; or returns

    x = 1 if h1,i1 = h1,i for i = 1, 2. To see that this outputreturns a valid localization, it is enough to observe that thefollowing holds: either |Sj,1| = |Sj,1|, in correspondence tok1 = k1, or |S

    j,0| = |Sj,1|, in correspondence to k

    1 = k1 + 1.

    Inductive step. The proof of the inductive step is an adap-tation of the reasoning in the base case.

    3.5 Localization of embedding infectionsRecall that in embedding infection techniques, a virus is a

    portion of, say, text that is inserted somewhere in the middleof the target file (and not at the beginning or the end), thusagain increasing the size of the file. We will design a p-revealing collision-resistant hash function, where p is definedas follows.

    On input M1 and M2 (of not necessarily equal length),function p(M1, M2) is equal to

    0 if M1 = M2;

    (a, b) if b a < n /2 and M1(i) = M2(i) for i =1, . . . , a 1, and M1(i1) = M2(i2) for i1 = b + 1, . . . , n,and i2 = i1 + b 1 + 1.

    otherwise.

    Here we restrict the analysis to viruses embedding at mostn blocks (that is, we assume that the number of blocks inM2 is at most 2n, where n is the number of blocks in thetarget file M1).

    Ideas behind our construction. This construction issomewhat less interesting as it is obtained as a direct com-bination of ideas from the previous two ones.

    First, we use a variant of the construction for appendingviruses to obtain a length-independent choice of subsetsfor the localization of the beginning of the embedded virusinterval, and we apply the same technique to the reversed

    input file (i.e., the same file, reordered from the last blockto the first) for the localization of the end of the embeddedvirus interval.

    This is then combined with a variant of the constructionfor rewriting viruses to eliminate, on each step of the con-struction, half of the candidate blocks in the corrupted file;the combination consists of employing this technique usinglength-independent subsets of file blocks.

    The resulting construction has localization factor 2 andexpansion factor 16 log , where is the size of the input file.

    Details are deferred to the full paper.

    Construction Infection Virus length xname technique restriction

    detection any nonetrivial any nonepH1 rewriting x < n/4pH2 appending x npH3 embedding x n

    Figure 1: Applicability of our techniques.

    Construction Localizing Expansion Timename factor factor blowup

    detection n 1 1trivial 1 n 1pH1 2 4 log n 4log npH2 2 2 log n 2log npH3 2 32 log n 32log n

    Figure 2: Performance of our techniques.

    4. PERFORMANCE ANALYSISWe now summarize and comment the applicability and

    performance of our techniques, when compared with twoknown techniques mentioned in the paper.

    The summary is depicted in Figures 1 and 2. (Here, recallthat by detection construction we denote the constructionthat only computes a single hash of its entire input file;by trivial construction we denote the construction thatcomputes a single hash of each block of the input file; andour constructions in Section 3.3, 3.4, and 3.5, are denotedas pH1, pH2, pH3, respectively.)

    We now discuss the entries in the above tables.While the detection and the trivial constructions are ap-

    plicable to any infection technique in the characterizationgiven in Section 2, each of the three hash function construc-tions pH1, pH2, pH3 was tailored for a specific infection tech-nique. Additionally, we note that one should consider therestrictions on the virus length in these constructions to beessentially without loss of generality, as the virus localiza-tion problem loses significance whenever such restrictionsare not satisfied.

    With respect to the localizing factor, we note that thedetection construction practically does no localization, whilethe trivial construction has the b est possible localization(as it localizes all infected blocks), and the constructionspH1, pH2, pH3 localize an area only at most twice as large

    as the infected area.With respect to the expansion factor, we note that thetrivial construction has a too large expansion, while the de-tection construction has practically no expansion and theconstructions pH1, pH2, pH3 only have a logarithmic (in n)expansion of the hash.

    Finally, with respect to the time performance, we notethat the trivial and detection construction have essentiallyno time blowup (with respect to a single hash computation),and constructions pH1, pH2, pH3 only compute a logarith-mic (in n) number of hashes.

    47

  • 7/30/2019 Cryptographic Hashing for Virus Localization

    8/8

    Acknowledgement.

    We thank the WORM 2006 committee for very helpful com-ments.

    5. REFERENCES[1] F. Cohen. Computer Viruses - Theory and

    Experiments. Computers and Security, vol. 6, 1987.

    [2] F. Cohen. On the Implications of Computer Viruses

    and Methods of Defense. Computers and Security, vol.7, 1988.

    [3] E. Skoudis. MALWARE: Fighting Malicious Code.Prentice Hall, 2004.

    [4] P. Szor. The Art of Computer Virus Research andDefense. Addison Wesley, 2005.

    [5] G. H. Kim and E. H. Spafford. The design andimplementation of tripwire: a file system integritychecker. Proc. of 2nd ACM Conference on Computerand Communications Security (ACM CCS), 1994.

    [6] G. Di Crescenzo, A. Ghosh, and R. Talpade. Thedesign and implementation of tripwire: a file systemintegrity checker. Computer Security - ESORICS2005, Proc. of 10th European Symposium on Research

    in Computer Security, vol. 3679 of Lecture Notes inComputer Science, Springer-Verlag, 2005.

    [7] A. Tridgell. Efficient Algorithms for Sorting andSynchronization.http://samba.org/ tridge/phd thesis.pdf.

    [8] S. Quinlan and S. Dorward. Venti: A new approach toarchival storage. Proc. of USENIX Conference on Fileand Storage Technologies (FAST), 2002.

    [9] B. Schneier. Applied Cryptography: Protocols,Algorithms, and Source Code in C. John Wiley &Sons, 1993.

    [10] 1st NIST Cryptographic Hash Functions Workshop.http://www.csrc.nist.gov/pki/HashWorkshop/2005/program.htm, 2005.

    [11] CERT Advisory CA-1992-02 Michelangelo PC VirusWarning,http://www.cert.org/advisories/CA-1992-02.html,1997.

    [12] CERT Advisory CA-1999-04 Melissa Macro Virus,http://www.cert.org/advisories/CA-1999-04.html,1999.

    [13] CERT Advisory CA-2001-26 Nimda Worm,http://www.cert.org/advisories/CA-2001-26.html,2001.

    48