MD5 collisions and the impact on computer forensics

Digital Investigation (2005) 2, 36e40

www.elsevier.com/locate/diin

MD5 collisions and the impact on computerforensics

Eric Thompson

AccessData Corporation, 384 South 400 West, Lindon, UT 84042, United States

Abstract In August 2004 at the annual cryptography conference in Santa Barbara,California a group of cryptographers, Xianyan Wang, Dengguo Feng, Xuejia Lai,Hongbo Yu, made the announcement that they had successfully generated two fileswith different contents that had the same MD5 hash. This paper reviews theannouncement and discusses the impact this discovery may have on the use of MD5hash functions for evidence authentication in the field of computer forensics.� 2005 Elsevier Ltd. All rights reserved.

Hash functions are one of the basic buildingblocks of modern cryptography. In cryptographyhash functions are used for everything from pass-word verification to digital signatures. A hashfunction has three fundamental properties:

� A hash function must be able to easily convertdigital information (i.e. a message) into a fixedlength hash value.

� It must be computationally infeasible to deriveany information about the input message fromjust the hash.

� It must be computationally infeasible to findtwo files that have the same hash. Hash(Mes-sage 1)Z Hash(Message 2).

E-mail address: [email protected]

1742-2876/$ - see front matter � 2005 Elsevier Ltd. All rights resedoi:10.1016/j.diin.2005.01.004

In computer forensics hash functions are impor-tantbecause theyprovideameansof identifyingandclassifying electronic evidence. Because hash func-tionsplayacritical role inevidenceauthentication itis critical a judge or jury can trust the hash valuesthat uniquely identify electronic evidence.

The third property of a hash function states thatit must be computationally infeasible to find twofiles to have the same hash. The research pub-lished by Wang, Feng, Lai and Yu demonstratedthat MD5 fails this third requirement since twodifferent messages have been generated that havethe same hash. This situation is called a collision.

Birthday paradox

One method of demonstrating that a hash functionis insecure is to find ANY two messages that have

rved.

mailto:[email protected]

http://www.elsevier.com/locate/diin

MD5 collisions and the impact on computer forensics 37

the same hash. The easiest method of accomplish-ing this is through what is frequently referred to asa birthday attack or birthday paradox. Whena person enters a room, how many people needto be in the room before there is greater than 50%probability that one of those people in the roomwill share the first person’s birthday (same day notthe same year)? The answer is 183 (365/2). This isbecause you are attempting to find someone whomatches a specific date. However, how manypeople must be in a room before there is a probablygreater than 50% that there exist ANY two pairs inthe room that have the same birthday. The numberis surprisingly low: 23. When 23 people are in theroom there are a total of 253 different pairs ofdates. This is called the birthday paradox. Fora more detailed description of the birthday para-dox see Patterson (1987).

The birthday paradox is a standard statisticalproblem. For each additional person n who entersa room the number of pairs of birthdays increasesby n� 1. As more people enter the room thenumber of birthday pairs increases rapidly untila matching pair is found.

In the first example, the attempt was to finda person in the room that matched one specificbirthday. When matching a specific day, eachperson has only a 1/365 chance of being born onthe specific day in question. In cryptography, thefirst example is analogous to a brute force orexhaustive key space attack. This is the processused in most password recovery/password guessingattacks. In the second example ANY birthday pairwill suffice. This second type of attack is theprocess used by cryptographers to attack hashfunctions.

If a hash function has a key space of 64 bits,then an exhaustive key space attack would requirea computer test up to 264 combinations. If a singlecomputer could process one million hashes persecond and an advisory could use a distributednetwork attack to harness the CPU power of 10,000computers it would still take up to 58 years toexhaust the key space. However, if the goal wassimply to find ANY hash match a single computercould find that match in slightly more than an hour.

Fortunately MD5 and other common hash func-tions have substantially larger key length than 64bits. For MD5 the key length is 128 bits, for SHA-1the key length is 160 bits, SHA-256 the key length is256 bits. However, if a cryptographic weakness isdiscovered in the design of the hash algorithm thisweakness can reduce the effective key length ofthe hash function to be less than the intendeddesign length. In this case the weakness makespossible the potential for a birthday attack to

successfully find a hash collision. This weaknessinduced collision is what occurred with the MD5algorithm.

MD5

The MD5 hash function was developed in 1994 bythe cryptographer Ron Rivest as a stronger alter-native to the MD4 algorithm developed in 1992.The algorithm breaks a file into 512 bit inputblocks. Each block is run through a series offunctions to produce a unique 128 bit hash valuefor the file. Changing just one bit in any of theinput block should have a cascade effect thatcompletely alters the hash results. Furthermore,since the key size of 128 bits has 3.4! 1038

possible combinations the chance of randomlyfinding two files that produce the same hash valueshould be computationally infeasible (Schneier,1996).

Cryptanalysis of MD5

MD5 has been intensely scrutinized by the crypto-graphic community since its original release. Priorto 2004, most of the research attacks against MD5demonstrated only minor weaknesses in the MD5design. However, there are two particularly nota-ble exceptions that discovered more serious designproblems.

The first indication that MD5 might have a designflaw was in a paper published by Den Boer andBooselaer in which it was demonstrated that givencertain different input conditions it was possiblefor there to exist identical internal states for someof the MD5 computations. However, Boer andBooselaer were not able to expand upon theseinternal anomalies to produce duplicate hashes fordifferent input values (Den Boer and Bosselaers,1994).

The second significant research advancementoccurred in 1996 when Dobbertin was able todemonstrate that the MD5 algorithm could produceidentical hashes for two different messages if theinitialization vector could be chosen (Dobbertin,1996). The initialization vector is the value towhich the MD5 internal variables are initially setbefore beginning the hashing process. BecauseMD5, when used in real life, is always set to thesame initialization state (IV0) Dobbertin’s resultdid not present an immediate security concern.However, his work did demonstrate that an even-tual MD5 collision would probably be discovered.

38 E. Thompson

Table 1 MD5 collsion

In the summer of 2004 the cryptographers Wanget al. demonstrated their ability to generate MD5collisions using the standard initialization vectorIV0. This research showed that it is possible tocreate two related 512 bit input blocks and modifyspecific bits within these blocks, creating twoslightly different messages, that have the samehash value. The amount of time to create an MD5message pairwas on average 1 h (Wang et al., 2004).

Example of an MD5 collision

In their paper Wang et al. provided an example oftwo MD5 collisions. One of the collisions is as givenin Table 1.

Response of the cryptographiccommunity to MD5 collisions

The response of the cryptographic community hasbeen what should be expected. While these resultsare mathematically significant they do not presentan immediate cause for alarm. Creating twomessages that have identical MD5 hashes requiresvery specific circumstances that would have anextremely rare chance of actually existing in theregular world. Additionally this research does notprovide a hacker with any new technique to breakthrough a firewall, attack a public key encryptionsystem or fabricate a false digitally signed mes-sage. Nevertheless, this research does point outa design weakness in the MD5 algorithm and as

a result the cryptographic community needs toincrease the diligence in which it searches fora new hash standard.

Bruce Schneier summarized the feelings ofmany in the cryptographic community with hisstatement:

‘‘The magnitude of the results depends on who youare. If you’re a cryptographer, this is a huge deal.Whilenot revolutionary, these results are substantialadvances in the field. The techniques described bythe researchers are likely to haveother applications,and we’ll be better able to design secure systems asa result.. As a user of cryptographic systemse as Iassume most readers are e this news is important,but not particularly worrisome. MD5 and SHA aren’tsuddenly insecure. No one is going to be breakingdigital signatures or reading encrypted messagesanytime soon with these techniques. The electronicworld is no less secure after these announcementsthan it was before.’’ (Schneier, 2004)

The impact of MD5 collision on theuse of MD5 in computer forensics

The recent research on MD5 collision should havelittle impact on the use of MD5 for evidenceauthentication in computer forensics. Three rea-sons for this are:

(1) MD5 is still secure against a brute forceattack e It is computationally infeasible to

MD5 collisions and the impact on computer forensics 39

modify the contents of a message such that thehash of the new message matches some pre-determined hash value. No one in the crypto-graphic research community has yet to be ableto generate a new file or modify an existing fileso that the new file will convey intelligibleinformation and still match a pre-determinedMD5 hash from a different file.

(2) Changing one bit in the evidence will still causea cascade effect that dramatically changes theMD5 hash result e A collision similar to that onedemonstrated by Wang et al. can only beproduced using very specific input blocks. Thereis no reason for these types of input blocks tooccur in the real world. Therefore, there is noreason to believe the internal state of the MD5engine that allowed for the collision wouldnaturally occur. The MD5 engine does a remark-ably good job of generating a cascade effect onall the bits in the hash value even when justa single bit in the input file is changed. MD5 canstill be relied upon by the forensics communityto do an excellent job at identifying even thesmallest change in electronic data.

(3) The chance of a birthday collision from filesthat are part of the NIST data set or hash keeperproject are very remote e The birthdaycollision that was produced by these cryptogra-phers required a very special set of circum-stances to occur within the internal variables ofthe MD5 engine. It is unrealistic to believe thatthis kind of state would occur naturally whenanalyzing files that would normally be found ona computer, PDA or similar electronic device. Inthe real world the number of files required forthere to be a 50% probability for an MD5 collisionto exist is still 264 or 1.8! 1019. The chance ofan MD5 hash collision to exist in a computer casewith 10 million files is still astronomically low.

For those who wish to be overly cautious, it isalways possible to hash electronic evidence usingboth MD5 and another hash function such as SHA-1or SHA-256. Since these hash functions are line-arly independent of each other, the resultinguniqueness of having both these hash valueswould be the sum of the bits from each individualhash. For example, a file that has been hashedwith both MD5 (128 bits) and SHA-1 (160 bits)would have an effective uniqueness of 288 bits or1:1086. Even if a weakness could be found thatreduces the effective key size of one of thesehash functions it is still computationally unrealis-tic that in our life time, there will be twodifferent data streams that would have the sameMD5 and SHA-1 hash.

Conclusion

The struggle to make a perfect cryptosystem haslong eluded cryptographers. New cryptographiccodes are created and broken every day. It isthrough this challenge that the cryptographictechnology advances forward. Cryptographershave new information about how to design hashfunctions that Ron Rivest did not know back in1994 when he published his work on the MD5algorithm. This new announcement does not pres-ent a current security threat nor does it make theuse of MD5 for evidence authentication any lesstrustworthy. Instead, this research gives mathe-maticians information about how to design hashfunctions so the next generation’s codes can bebetter and stronger.

As a result of these developments, in the nextseveral years a new set of hash algorithms willmost certainly emerge. These new algorithms willbe resistant to the weakness discovered by Wanget al. One of these new algorithms will rise to thetop and for a period of time, serve as the worldsnext hash function standard. Several years after-wards, a brilliant mathematician will discovera weakness in this new algorithm, publish theirresults, and the process of finding another hashstandard will start all over again.

The computer forensics community will want toembrace the new hash technology once it hasbeen thoroughly tested by the cryptographic com-munity. Until then, computer forensics examinersshould feel comfortable in their continued, all beit short term use of MD5. When possible, hashingelectronic evidence with both MD5 and a secondhash function such as SHA-1 or SHA-256 is alwaysa good idea, however, the forensics softwareneeds to support multiple hash functions in orderfor this to be possible. Unless new informationemerges showing a further weakness in the MD5hash algorithm, there should not be an immediaterequirement to discontinue the use of MD5.Rather, forensics examiners should work with themanufacturers of forensics software so that newreleases of the forensics software, when possible,will start implementing stronger hash functionssuch as SHA-1 or SHA-256 into the forensicsprocess.

References

Den Boer B, Bosselaers A. Collisions for the compressionfunction of MD5, Advances in Cryptology e EUROCRYPT’93.LNCS 765; 1994. p. 293e304.

Dobbertin Hans. Cryptanalysis of MD5 compress. GermanInformation Security Agency; May 1996.

40 E. Thompson

Patterson Wayne. Mathematical cryptology for computerscientists and mathematicians. Rowman & Littlefield, Pub-lishers; 1987. p. 156e8.

Schneier Bruce. Applied cryptography, second edition proto-cols, algorithms and source code in C. John Wiley & Sons,Inc.; 1996. p. 436e41.

Schneier Bruce. Opinion: cryptanalysis of MD5 and SHA:time for a new standard. Computerworld; April 19,2004.

Wang Xianyan, Feng Dengguo, Lai Xuejia, Yu Hongbo. Collisionsfor hash functions MD4, MD5 Haval-128 and RIPEMD.CRYPTO’04; Revised August 17, 2004.

Documents

MD5 collisions and the impact on computer forensics