


Journal of Digital Forensics, Security and Law
Volume 11, Number 2, Article 4
2016

Bytewise Approximate Matching: The Good, The Bad, and The Unknown

Vikram S. Harichandran, University of New Haven
Frank Breitinger, University of New Haven
Ibrahim Baggili, University of New Haven

Follow this and additional works at: https://commons.erau.edu/jdfsl

Part of the Computer Engineering Commons, Computer Law Commons, Electrical and Computer Engineering Commons, Forensic Science and Technology Commons, and the Information Security Commons

Recommended Citation: Harichandran, Vikram S.; Breitinger, Frank; and Baggili, Ibrahim (2016) "Bytewise Approximate Matching: The Good, The Bad, and The Unknown," Journal of Digital Forensics, Security and Law: Vol. 11: No. 2, Article 4. DOI: https://doi.org/10.15394/jdfsl.2016.1379 Available at: https://commons.erau.edu/jdfsl/vol11/iss2/4

This Article is brought to you for free and open access by the Journals at Scholarly Commons. It has been accepted for inclusion in Journal of Digital Forensics, Security and Law by an authorized administrator of Scholarly Commons. For more information, please contact [email protected].

(c)ADFSL


BYTEWISE APPROXIMATE MATCHING: THE GOOD, THE BAD, AND THE UNKNOWN

Vikram S. Harichandran, Frank Breitinger and Ibrahim Baggili
Cyber Forensics Research and Education Group (UNHcFREG)
Tagliatela College of Engineering
University of New Haven, West Haven, CT 06516, United States
e-Mail: {vhari2, FBreitinger, IBaggili}@newhaven.edu

ABSTRACT

Hash functions are established and well-known in digital forensics, where they are commonly used for proving integrity and file identification (i.e., hash all files on a seized device and compare the fingerprints against a reference database). However, with respect to the latter operation, an active adversary can easily overcome this approach because traditional hashes are designed to be sensitive to altering an input; output will significantly change if a single bit is flipped. Therefore, researchers developed approximate matching, which is a rather new, less prominent area but was conceived as a more robust counterpart to traditional hashing. Since the conception of approximate matching, the community has constructed numerous algorithms, extensions, and additional applications for this technology, and is still working on novel concepts to improve the status quo. In this survey article, we conduct a high-level review of the existing literature from a non-technical perspective and summarize the existing body of knowledge in approximate matching, with special focus on bytewise algorithms. Our contribution allows researchers and practitioners to receive an overview of the state of the art of approximate matching so that they may understand the capabilities and challenges of the field. Simply, we present the terminology, use cases, classification, requirements, testing methods, algorithms, applications, and a list of primary and secondary literature.

Keywords: Approximate matching, Fuzzy hashing, Similarity hashing, Bytewise, Survey, Review, ssdeep, sdhash, mrsh-v2.

1. INTRODUCTION

It is no secret that the number of networked devices in the world continues to increase alongside the complexity of cyber crimes, size of storage devices, and amount of data. We are beyond the point where investigators can manually analyze all cases. These advances have also been complemented with an increase in processing power. Speaking in numbers, while 80-200 GB HDDs, 2-4 GB of RAM, and dual-core processors were the quasi standards for machines in 2011, nowadays they are 512 GB SSDs, 8-16 GB RAM, and multicore architectures; (external) storage devices may have several terabytes of storage. Furthermore, if privately owned computing resources are insufficient for a task,


one may shift it to the cloud. In short, practitioners need tools and techniques that are capable of automatically handling large amounts of data since time in investigations is of the essence.

A common forensic process to support practitioners is known file filtering, which aims at reducing the amount of data an investigator has to manually examine. The process is quite simple: (1) compute the hashes for all files on a target device; (2) compare the hashes to a reference database. Based on the signatures in the database, files are whitelisted (filtered out / known-good files, e.g., files of the operating system) or blacklisted (filtered in / known-bad files, e.g., known illicit content). This straightforward procedure is commonly implemented using cryptographic hash functions like MD5 (Rivest, 1992) or an algorithm from the SHA family (FIPS, 1995; Bertoni, Daemen, Peeters, & Assche, 2008).
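The two-step process can be sketched in a few lines. The following is a minimal illustration, not the code of any particular forensic tool; SHA-256 and in-memory sets for the reference database are our own assumptions:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Hash a file in 64 KiB chunks so large evidence files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def filter_known(paths, whitelist, blacklist):
    """Split files into known-good, known-bad, and unknown by exact hash."""
    good, bad, unknown = [], [], []
    for path in paths:
        digest = sha256_file(path)
        if digest in whitelist:
            good.append(path)      # filtered out, e.g., operating system files
        elif digest in blacklist:
            bad.append(path)       # filtered in, e.g., known illicit content
        else:
            unknown.append(path)   # left for manual examination
    return good, bad, unknown
```

Only the files in `unknown` remain for the investigator, which is exactly the data-reduction effect described above.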

While cryptographic hashes are well-established and tested, they have one downside – they can only identify bitwise identical objects. This means changing a single bit of the input will result in a totally different hash value. Subsequently, the community worked on a counterpart for (cryptographic) hashing algorithms that allows similarity identification – approximate matching. Although this is a practically useful concept, a recent survey by Harichandran, Breitinger, Baggili, and Marrington (2016) with 99 participants showed that only 12% of the forensic experts polled use this technology on a regular basis. Detailed results are provided in Table 1.
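This sensitivity (the avalanche effect) is easy to demonstrate; the toy snippet below flips exactly one bit and compares the resulting MD5 digests:

```python
import hashlib

def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with exactly one bit flipped."""
    b = bytearray(data)
    b[index // 8] ^= 1 << (index % 8)
    return bytes(b)

original = b"evidence file contents"
mutated = flip_bit(original, 0)  # the inputs differ in a single bit

h1 = hashlib.md5(original).hexdigest()
h2 = hashlib.md5(mutated).hexdigest()
# h1 and h2 are completely unrelated digests, so an exact-match lookup
# against a reference database fails for the mutated file.
```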

Contribution. In this paper we aim to address the almost 15% that have never heard of approximate matching by providing them with a comprehensive literature survey, and the 31% (unnecessary for my purposes) by illustrating a multitude of applications for

Table 1. Answers to the survey question: Have you ever used approximate matching/similarity hashing algorithms?

    Answer                                        in %
    Yes, I use them on a regular basis.          12.50
    Yes, a few times.                            34.38
    No, they are too slow for practical use.      7.29
    No, they are unnecessary for my purposes.    31.25
    No, I am unaware of what it is.              14.58

approximate matching. Accordingly, we address the following key points:

• Terminology, use cases, classification, requirements, and testing.

• High-level description of existing algorithms including strengths and weaknesses.

• Secondary literature that enhances / assesses existing approaches.

• New applications that employ approximate matching, e.g., file carving and data leakage prevention.

• Current limitations and challenges, and possible future trends.

Since it is low-level (it is directly concerned with the structure of everything digital), and may be the most impactful / implemented type due to its usage for automation, we focus on bytewise approximate matching.

Differentiation from previous work. When writing this article, there were three articles similar to this survey. The first was SP 800-168 from the National Institute of Standards and Technology (NIST; Breitinger, Guttman, McCarrin, Roussev, and White (2014)). While this article provides an overview of the terminology, use


cases, and testing, it does not include any algorithm concepts, applications, or critical discussion. Moreover, a reader is not provided with a long list of references. Martínez, Alvarez, and Encinas (2014) is a purely technical paper focused on the full details of the algorithms and their implementations. Thirdly, the dissertation from Breitinger (2014) contained almost all of these topics but is extremely lengthy. Ergo, our intention during writing was to make this publication the primary source for researchers / practitioners to grasp a cursory bird's-eye view of bytewise approximate matching.

We summarized the most important elements of these works in a condensed and direct manner to increase the awareness of approximate matching. Extra texts are also shared for each algorithm in Sec. 4.

Structure. The remainder of this paper is organized as follows: Sec. 2 provides the historical background of approximate matching. Sec. 3 outlines concepts, including use cases, types, requirements (this subsection describes the core principles of algorithm design), and testing. Then, after traversing 8 of the most popular algorithms in Sec. 4, we mention newly explored prospects in Sec. 5. Limitations and challenges precede a brief listing of future areas of research in Sec. 7.

2. HISTORY

Many of the approximate matching algorithms designed to solve modern-day problems in digital forensics rely fundamentally on the ability to represent objects as sets of features, thereby reducing the similarity problem to the well-defined domain of set operations (Leskovec, Rajaraman, & Ullman, 2014). This approach has roots in the work of the early 20th century Swiss biologist Paul Jaccard, who suggested expressing the similarity between two finite sets as the ratio of

the size of their intersection over the size of their union (Jaccard, 1901, 1912): if A and B are sets, then the Jaccard index J is defined as J(A,B) = |A ∩ B| / |A ∪ B|. It has been widely adopted as a method for quantifying similarity and is still used mainly within computational linguistics for plagiarism detection.

Nearly a century later, Broder (1997) proposed using the Jaccard index as part of his algorithm for identifying similar documents. Broder suggested a distinction between two commonly used types of similarity: 'roughly the same' (resemblance) and 'roughly contained inside' (containment). While he recommended using the Jaccard index for resemblance, he introduced a variation to approximate containment which "indicates that A is roughly contained within B": c(A,B) = |A ∩ B| / |A|. Additionally, Broder described the MinHash algorithm, an efficient method for estimating these similarities (Broder, Charikar, Frieze, & Mitzenmacher, 1998).

On the other hand, Manber (1994) presented sif, an implementation used to correlate text files. "Files are considered similar if they have a significant number of common pieces, even if they are very different otherwise." Due to the complexity of comparing strings directly, he utilized Rabin fingerprinting to hash and compare substrings (Rabin, 1981).

A first step towards approximate matching as we use it today was dcfldd¹ by Harbour in 2002, which was an extension of the well-known disk dump tool dd. His tool divided the input into chunks of fixed length and hashed each chunk. Therefore, Harbour's approach is also called block-based hashing. While this approach works perfectly for flipped bits, it has a limited capacity to detect similarity in strings where

¹ http://dcfldd.sourceforge.net (last accessed Feb 4th, 2016).


the deletion or insertion of bits creates a shift that changes the hashes of all blocks that follow. Theoretically, the shift of even a single bit at the beginning of a file could cause nearly identical objects to appear to have nothing in common, much like the naive (traditional) file hashing approach.
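The shift problem is easy to reproduce. The toy block-based hasher below is an illustration in the spirit of dcfldd, not its actual code; it shows that an in-place substitution damages only one block, while a single inserted byte desynchronizes every block boundary:

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 8):
    """Fixed-length block-based hashing in the spirit of dcfldd."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

original = b"AAAAAAAABBBBBBBBCCCCCCCC"
substituted = b"AAAAAAAABBBBBBXBCCCCCCCC"  # one byte replaced in place
shifted = b"X" + original                   # one byte inserted at the front

ha = block_hashes(original)
hf = block_hashes(substituted)
hs = block_hashes(shifted)
# A substitution damages only its own block; an insertion shifts every
# block boundary, so no block hash survives.
matches_after_substitution = sum(x == y for x, y in zip(ha, hf))  # 2 of 3
matches_after_shift = sum(x == y for x, y in zip(ha, hs))         # 0 of 3
```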

Although this weakness can present a problem for file-to-file comparison, it may be acceptable in some scenarios. For example, if the goal is to determine which parts of a disk image might have been changed during a cyber attack, Harbour's technique remains useful. Likewise, in the case of an analyst scanning for blacklisted material across a drive or a collection of drives, the loss of a few block matches may be a worthwhile trade-off for gains in speed and simplicity, particularly because a single block is often sufficient evidence to demonstrate the presence of an artifact, or at least to warrant closer inspection.

Several efforts have been made to further leverage this technique for detecting similar material by matching fixed-length file fragments. Collange, Dandass, Daumas, and Defour (2009) coined the term "hash-based carving" to describe this method of scanning for blacklisted material, since it can be used to extract content without aid from the file system, provided the targets are known beforehand.

Key (2013)'s File Block Hash Map Analysis (FBHMA) EnScript and Simson Garfinkel's tool frag_find (S. L. Garfinkel, 2009) provided practical implementations that automated the process for forensic examiners, though searches were limited to a few files at a time. S. Garfinkel, Nelson, White, and Roussev (2010) described the implementation and evaluation of frag_find in detail, noting a particular difficulty in storing and searching billions of hashes at practical speeds. S. Garfinkel and McCarrin (2014) later succeeded in scanning a drive image for matches of 4096-byte blocks across a set of nearly one million blacklisted files stored in the custom-built database, hashdb.

3. CONCEPTS

While approximate matching (a.k.a. fuzzy hashing or similarity hashing) started to gain popularity in the field of digital forensics in 2006, it was not until 2014 that the National Institute of Standards and Technology (NIST) developed standard definitions, publishing Approximate Matching: Definition and Terminology (NIST SP 800-168, Breitinger, Guttman, et al. (2014)). Subsections below briefly summarize this work's principles.

The 'Purpose and Scope' section of the NIST document defines approximate matching as follows: "Approximate matching is a promising technology designed to identify similarities between two digital artifacts. It is used to find objects that resemble each other or to find objects that are contained in another object."

3.1 Use cases

In investigative cases, approximate matching is used to filter known-good or known-bad files while using a reference approximate matching hashed data set, either on static data or data in transit over a network. The primary use cases for approximate matching are presented below:

• Similarity detection correlates related documents, e.g., different versions of a Word document.

• Cross correlation correlates documents that share a common object, e.g., a DOC and a PPT document including the same image.


• Embedded object detection identifies an object inside a document, e.g., an image inside a memory dump.

• Fragment detection identifies the presence of traces/fragments of a known artifact, e.g., identify the presence of a file in a network stream based on individual packets.

A lecture at DFRWS USA 2015 decided to break down uses into 6 categories instead, from the perspective of the bytestreams matching: R is identical to T, R contains T, R & T share identical substrings, R is similar to T, R approximately contains T, and R & T share similar substrings (where R and T are two sequences) (Ren & Cheng, 2015).

Notwithstanding, approximate matching may not be appropriate when used to whitelist artifacts since malicious content can be quite similar to benign content, e.g., an SSH server with a backdoor would look analogous to a regular entry point (Baier, 2015).

3.2 Types

Regardless of the use cases, approximate matching can be implemented at different abstractions. Usually we distinguish between the following three abstraction categories:

Bytewise: matching operates on the byte level and uses only the byte sequences as input (also known as fuzzy hashing and similarity hashing).

Syntactic: matching also works on the byte level but may use internal structure information, e.g., one may ignore the TCP header information of a packet that is parsed.

Semantic: matching works on the content-visual layer and therefore closely resembles human behavior (also called perceptual hashing and robust hashing), e.g., the similarity of the content of a JPG and a PNG image where the image file types / byte streams are different, but the picture is the same.

Furthermore, there are 4 cardinal categories of algorithms (see Sec. 4 for the inner workings) (Martínez et al., 2014):

• Context-Triggered Piecewise Hashing (CTPH).

• Block-Based Hashing (BBH).

• Statistically-Improbable Features (SIF).

• Block-Based Rebuilding (BBR).

3.3 Requirements

There are multiple ways to interpret substring matching. For example, "ababa" and "cdcdc" might be considered similar because they both have five characters ranging over two alternating values, or they might be treated as dissimilar because they have no common characters. Thus, an algorithm must define the lowest common denominator of its interpretation - a feature - including how features are derived from an input. When two features are compared, the outcome is binary: match or no match.

A feature set refers to a set of distinct features found in the entire bytestream of a file or file fragment. An algorithm may choose to only include some features in this set and must outline the method/criteria for inclusion. Feature sets are then used to produce a match/similarity score, representing the amount of similarity between sets of target files rationally (an increasing monotonic function).

Bytewise algorithms have two main functions. A feature extraction function identifies and extracts features from objects to convert them to a compressed version for


comparison. Then, a similarity function performs the comparison between these compressed versions to output a normalized match score. This comparison usually involves string metrics such as Hamming distance and Levenshtein distance; Martínez, Alvarez, Encinas, and Avila (2015) and Li et al. (2015) have proposed new algorithms for specific uses. Normalized scores may be calculated by weighing the number of matching features against the total number of features for both objects (for resemblance), or by ignoring unmatched features in the container object (if concerned with containment).
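The two normalization strategies can be sketched over plain feature sets; the 0-100 scale below is an illustrative convention of ours, not one prescribed by any particular algorithm:

```python
def resemblance_score(features_a: set, features_b: set) -> int:
    """Weigh matching features against all features of both objects (0-100)."""
    union = features_a | features_b
    if not union:
        return 100
    return round(100 * len(features_a & features_b) / len(union))

def containment_score(fragment: set, container: set) -> int:
    """Ignore unmatched container features; score how much of the fragment
    appears inside the container (0-100)."""
    if not fragment:
        return 0
    return round(100 * len(fragment & container) / len(fragment))
```

For a small fragment fully embedded in a large file, containment reports a perfect score while resemblance stays low, which is why the intended use case determines the right normalization.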

In addition to the above, these traits must be satisfied for an approach to be considered a valid approximate matching algorithm, according to NIST:

Compression: actual storage of features is usually implemented as a one-way hash (known as a similarity digest, signature, or fingerprint); its length is shorter than the original feature/input itself.

Similarity preservation: similar inputs should result in similar digests.

Self-evaluation: authors should state the confidence level for the circumstances/parameters used to produce the match score and what the scale is (e.g., 0 = no features matched, 1 = all features matched exactly).

Time complexity/runtime efficiency: speed should be stated via theoretical complexity in O-notation as well as the runtime speed; for bytewise algorithms it is preferable to know the isolated speeds of the feature extraction and similarity functions.

3.4 Testing bytewise approximate matching

Testing algorithms is an important task, so researchers set out to create a test environment for bytewise approximate matching. The first step was taken by Breitinger, Stivaktakis, and Baier (2013), called FRamework to test Algorithms of Similarity Hashing (FRASH).

It tested efficiency, sensitivity and robustness, and precision and recall. This last category can be divided further into synthetic data vs. real world data. While synthetic data provides the perfect ground truth (further described below), it does not coincide with the real world, and vice versa.

Synthetic data test results were published (Breitinger, Stivaktakis, & Roussev, 2013) in addition to real world data (Breitinger & Roussev, 2014). The complete results are too complex to be presented in this article but can be found in chapter 6 of Breitinger (2014). In the following subsections we briefly summarize how approximate matching algorithms can be evaluated and FRASH's results. The main findings were:

• sdhash and mrsh-v2 outperform other algorithms.

• mrsh-v2 is faster and shows better compression than sdhash.

• sdhash obtains slightly better precision and recall rates than mrsh-v2.

Therefore, the final decision for selecting an algorithm depends on the use case.

Efficiency. As with cryptographic hash functions, compression and runtime efficiency are important, but approximate matching algorithms involve additional concerns; several do not output fixed-length digests. Thus, researchers usually report the compression ratio, cr = digest length / input length.

The community distinguishes between the following for runtime:

Generation efficiency: time needed to process an input and output the similarity digest.


Comparison efficiency: summarizes the theoretical complexity (in O-notation) to compare digests against an existing data set / database; again, often stated in units of time for implementations of bytewise approximate matching.

Space efficiency: calculated by dividing digest length by input length.

Sensitivity and robustness. Sensitivity refers to the granularity at which an algorithm can detect similarity, i.e., how minute the feature is. However, at some threshold, making a feature too fine causes almost all objects to appear common, and therefore the algorithm designer must strike a balanced sensitivity to optimize utility and time efficiency.

Robustness is a metric of how effective an algorithm can be in the midst of noise and plain transformations such as fragmentation and insertion/deletion into the target byte sequence.

As outlined by FRASH, these attributes are tested by creating manual mutations of the target fragments/files:

Alignment robustness: inserts blocks of various sizes at the beginning of an input; this should simulate scenarios like growing log files or emails.

Fragment detection: identifies the smallest fragment of a byte sequence that still matches by cutting it; this feature is important for network traffic analysis (see Sec. 5.2) and hash-based file carving (see Sec. 5.1).

Single-common block correlation: analyzes the minimum amount of correlation between two files, e.g., two Word documents that share a common paragraph.

White noise resistance: is a probability-driven test that introduces (uniform) random changes into a byte sequence (via insertion, deletion, & substitution); a viable scenario is source code where a developer renamed a variable.

Precision and recall on synthetic data. Precision can be thought of as a measure of false positives (the possibility of counting objects as similar that in actuality are not) while recall refers to false negatives (omitting objects that should be tallied as similar). These attributes are an indication of an algorithm's reliability.

In order to quantify them, the initial step is to analyze synthetic data. First, random byte sequences (the FRASH paper used Linux /dev/urandom) are generated. Next, mutations are created through methods like those mentioned in the previous subsection. Finally, the comparison is executed and the results are analyzed.
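Given a ground truth of similar pairs, both measures reduce to simple set arithmetic; the pair sets below are invented purely for illustration:

```python
def precision_recall(reported: set, truth: set):
    """Precision penalizes false positives; recall penalizes false negatives."""
    true_positives = len(reported & truth)
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall

# Pairs a tool flagged as similar vs. the known ground truth of a synthetic set:
reported = {("a", "b"), ("a", "c"), ("d", "e")}
truth = {("a", "b"), ("d", "e"), ("f", "g")}
precision, recall = precision_recall(reported, truth)
# ("a", "c") is a false positive; ("f", "g") is a false negative.
```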

Precision and recall on real world data. Testing on real world data is a bit more complex because there is no definition for similarity and no ground truth (publicly available data sets for testing that involve explanations of what the expected similarity is between different files). To define the ground truth, the community developed the approximate longest common substring (aLCS), which estimates the longest common substring of two files. According to this, two inputs are declared as similar if their aLCS is sufficient (e.g., 1% of the total input length, or at least 2 KiB).

3.5 Security

Most approximate matching algorithms currently implement few-to-zero security features to guard against active, real-time attacks. One staple is inherently built into approximate matching algorithms: hash functions are one-way, even for similarity, and


therefore prevent reverse engineering the original input sequence of a fragment / file. The few other tolerances exhibited by the algorithms are stated in their individual subsections under Sec. 4.

However, we posit that for most uses of approximate matching, security features are not essential. As pointed out by Baier (2015), these algorithms are most likely to be used for blacklisting. Why would an active adversary want to create files that match a blacklist of static (not in transit) data? Researchers must find an answer to how easy it is to avoid matching files. Maybe in the future we should classify security for approximate matching algorithms by the minimum amount of changes that are necessary between two files in order to produce a non-match. A question that needs fresh exploration, though, is what practices criminals can use to bypass certain (types of) algorithms, use cases, and applications; a rigorous analysis of this has not been performed, partially due to missing standards / ground truth.

3.6 Extending existing concepts

One of the major challenges that comes with approximate matching is related to the nearest neighbor problem, i.e., how to identify the similarity digests that are similar to a given one. More precisely, let's assume a database containing n entries. Most algorithms require an 'against-all' comparison, which equals a complexity of O(n).

Winter, Schneider, and Yannikos (2013) presented an approach to diminish this complexity for ssdeep named F2S2. Generally speaking, instead of storing the complete Base64 encoded similarity digest in the database, they stored n-grams using hash tables. In order to look up single digests, they first looked for the n-grams, which reduced the overall amount of comparisons. For the final decision, the ssdeep comparison function was used. As a result, they reduced the comparison time of 195,186 files against a database containing 8,334,077 records from 442 h to 13 min (boosted by a factor of about 2000), a 'practical speed'.
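The indexing idea behind F2S2 can be sketched as follows; the n-gram length and the toy digests are illustrative choices of ours, not the parameters used by Winter et al.:

```python
from collections import defaultdict

def ngrams(digest: str, n: int = 7):
    """All overlapping n-grams of a similarity digest string."""
    return {digest[i:i + n] for i in range(len(digest) - n + 1)}

def build_index(database: dict, n: int = 7):
    """Hash table mapping each n-gram to the digests that contain it."""
    index = defaultdict(set)
    for name, digest in database.items():
        for gram in ngrams(digest, n):
            index[gram].add(name)
    return index

def candidates(query_digest: str, index, n: int = 7):
    """Entries sharing at least one n-gram with the query; only these few
    need the full (comparatively expensive) comparison function."""
    found = set()
    for gram in ngrams(query_digest, n):
        found |= index.get(gram, set())
    return found
```

The hash-table lookup prunes the database to a small candidate set, so the expensive pairwise comparison runs only on plausible matches instead of against all entries.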

However, this approach works only for Base64 and hence for none of the other approaches like sdhash or mrsh-v2. Therefore, Breitinger, Baier, and White (2014) presented a concept that could speed up the process via Bloom filter-based approaches. They suggested using one single huge Bloom filter to store all feature hashes, which results in a complexity of ∼O(1). Their approach overcomes the drawback of comparing digests against digests but loses precision. That is, it allows for only yes or no decisions: yes means there is a similar file in the set; no equates to none of the files being similar above the chosen threshold. It does not allow for the returning of the matched file(s).
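A minimal Bloom filter shows why the answer is limited to yes/no: adding an item only sets bits, so membership can be tested in constant time but the matching item can never be recovered. The sizes and hash construction below are illustrative, not those of the cited work:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: constant-time yes/no membership, no retrieval."""

    def __init__(self, num_bits: int = 2 ** 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(i.to_bytes(1, "big") + item).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Because `add` only flips bits to 1, the filter is a fixed-size summary of the whole set; this is exactly the precision trade-off described above.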

Consequently, the authors presented an enhancement which simply uses multiple large Bloom filters to generate a tree structure that results in a complexity of O(log(n)) (Breitinger, Rathgeb, & Baier, 2014). But these are only assumptions – while there is a working prototype for the first approach, the latter concept only exists in theory.

3.7 Distinction from locality-sensitive hashing (LSH)

It is critical to note that people sometimes confuse Locality-Sensitive Hashing (LSH) (e.g., Rajaraman and Ullman (2012)) with approximate matching; therefore, we included this section. LSH is a general mechanism for nearest neighbor search and data clustering where the performance strongly relies on the hashing method used. Two popular algorithms are MinHash (Broder, 1997) and SimHash² (Charikar, 2002).

This does not necessarily coincide with the idea of approximate matching. Specifically, while LSH aims at mapping similar objects into the same bucket, approximate matching outputs a similarity digest that is comparable.

We would like to note here that the following section mainly focuses on bytewise approximate matching.

4. INTRODUCTION TO ALGORITHMS

As previously mentioned, approximate matching started to gain attention in 2006 with the concept of context triggered piecewise hashing and its first implementation, ssdeep (Kornblum, 2006). In the following years, new algorithms were proposed and published.

We will introduce the eight known approximate matching algorithms. While the first three algorithms are still extended and relevant, the last four algorithms are less promising from a digital forensics perspective for various reasons, e.g., precision and recall rates, runtime efficiency, and detection capabilities. The last algorithm (TLSH) is more related to LSH than approximate matching and is included for completeness.

This section is a high-level summary of the current algorithms. Throughout each subsection, references are cited for deeper reading.

4.1 ssdeep

CTPH is the technique used by ssdeep and was presented by Kornblum (2006). Roughly speaking, it is a modified version of the spam detection algorithm from Tridgell (2002–2009), generalized to cope with any digital object.

² Note, SimHash is a common term and is used several times in the literature. Accordingly, it is also used twice in this article. Besides this section, it is also used in Sec. 4.6 where it describes an approach from Sadowski and Levin (2007).

In CTPH the approach is to identify trigger points to divide a given input into chunks/blocks. This breakup is performed using a rolling hash that slides through the input, adds bytes to the current context (think of it as a buffer), creates a pseudo-random value, and removes them from the context after a set number of bytes. The context is then used as a trigger – whenever a specified sequence is created, the current chunk is hashed by the non-cryptographic FNV hash function (Fowler, Noll, & Vo, 1994–2012). To create the similarity digest, the FNV chunk hashes are reduced to 6 bits, converted into a Base64 character, and concatenated; this is done continuously as the trigger outputs FNV hashes.
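The content-defined chunking idea can be sketched as follows. The additive rolling hash, window size, trigger rule, and the use of SHA-256 are simplified stand-ins of ours for ssdeep's actual rolling hash, parameters, and FNV; the point is only that boundaries depend on content, not on absolute offsets:

```python
import hashlib

WINDOW = 7        # bytes kept in the rolling context
TRIGGER_MOD = 64  # on average one chunk boundary every 64 bytes

def rolling_chunks(data: bytes):
    """Cut the input at content-defined trigger points.

    A simple additive rolling hash over the last WINDOW bytes stands in for
    ssdeep's rolling hash; a boundary is declared whenever the context value
    hits a magic remainder.
    """
    chunks, start, ctx = [], 0, 0
    for i, byte in enumerate(data):
        ctx += byte
        if i >= WINDOW:
            ctx -= data[i - WINDOW]  # the byte sliding out of the context
        if ctx % TRIGGER_MOD == TRIGGER_MOD - 1:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last trigger
    return chunks

def ctph_digest(data: bytes) -> str:
    """One Base64 character (6 bits) of each chunk hash, concatenated."""
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
    chars = []
    for chunk in rolling_chunks(data):
        h = hashlib.sha256(chunk).digest()  # illustrative stand-in for FNV
        chars.append(alphabet[h[0] & 0x3F])
    return "".join(chars)
```

Because the boundaries are triggered by content, prepending a byte only perturbs the first chunk; all later boundaries realign, so the rest of the digest is preserved, which is exactly what block-based hashing cannot achieve.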

At the time of this article, ssdeep was still an active project at version 2.13 and is freely available online³. Over the years, several extensions and performance improvements have been published that mostly focus on the efficiency of the implementation (Chen & Wang, 2008; Seo, Lim, Choi, Chang, & Lee, 2009; Breitinger & Baier, 2012b). However, a security analysis conducted by Baier and Breitinger (2011) showed that CTPH cannot withstand an active attack.

4.2 sdhash

Similarity digest hashing was published four years later by Roussev, Richard, and Marziale (2008); Roussev (2010) and is also still active. The SIF algorithm extracts statistically improbable features that are determined by Shannon entropy (not the ones with the highest or lowest entropy, but the ones that seem unique, Roussev (2009)). In sdhash, a feature is a byte sequence of 64 bytes that is then compressed by hashing it with SHA-1. Finally, the author developed a way to insert the hashes into a Bloom filter4 (Bloom, 1970).

3 http://ssdeep.sourceforge.net (last accessed Feb 4th, 2016).

© 2016 ADFSL Page 67

JDFSL V11N2 Bytewise approximate matching: the good, the bad ...

The original version was extended several times, now supporting GPU usage for calculation and a block-based hashing mode (Roussev, 2012). The current version (3.4) is available online5.

A comparison between ssdeep and sdhash showed that the latter algorithm outperforms its predecessor (Roussev, 2011). In addition, a security analysis showed that sdhash is much more robust and difficult to overcome (Breitinger & Baier, 2012c).
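The entropy-driven feature scoring can be illustrated as follows. The 64-byte window size comes from the text above; the function names and the simple exhaustive scan are our simplification of Roussev's actual feature selection, which ranks windows by a precedence score rather than raw entropy.

```python
import math
from collections import Counter

def shannon_entropy(window: bytes) -> float:
    """Shannon entropy of a byte window, in bits per byte (0.0 to 8.0)."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())

def score_windows(data: bytes, size: int = 64):
    """Entropy score for every 64-byte window; sdhash then selects the
    statistically improbable windows, not simply the extremes."""
    return [(i, shannon_entropy(data[i:i + size]))
            for i in range(len(data) - size + 1)]
```

Low-entropy windows (padding, repeated structures) and maximum-entropy windows (compressed or encrypted data) are both poor fingerprints, which is why sdhash targets the statistically improbable middle ground.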

4.3 mrsh-v2

This algorithm was published by Breitinger and Baier (2013) and is a combination of ssdeep and sdhash6. Like the aforementioned implementations, mrsh-v2 is still supported7. The algorithm uses the feature identification procedure from ssdeep, then hashes each feature using the non-cryptographic FNV (Fowler et al., 1994–2012) and proceeds like sdhash, consequently overcoming the weaknesses of ssdeep and becoming faster than sdhash. The precision and recall rates are slightly worse than those of sdhash.
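The FNV family used for feature hashing is compact enough to show in full; below is the standard 64-bit FNV-1a variant (the exact variant and width used by a given mrsh-v2 build may differ).

```python
def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by
    the FNV prime modulo 2**64. Fast and non-cryptographic."""
    h = 0xCBF29CE484222325                            # FNV-1a offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, mod 2**64
    return h
```

Speed is the point: FNV needs one XOR and one multiply per byte, which is why approximate matching schemes prefer it over cryptographic hashes for per-chunk digests.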

4.4 bbHash

Building block hashing is a completely different approach and is based on the concepts of eigenfaces (biometrics) and de-duplication (data compression). Contrary to expectation, its type (see Sec. 3.2) is not eponymous, but rather BBR. The main difference is that this approach utilizes an external reference point: the building blocks.

4 A Bloom filter is a space-efficient data structure to represent a set. Bloom filters will not be discussed in this article, but more details can be found online.

5 http://sdhash.org (last accessed Feb 4th, 2016).

6 It was also inspired by multi-resolution similarity hashing (Roussev, Richard, & Marziale, 2007).

7 http://www.fbreitinger.de (last accessed Feb 4th, 2016).

A set of 16 building blocks (random byte sequences) is used to find an optimized representation of a given file. In order to find this representation, the algorithm calculates the Hamming distance, which is time consuming and too slow for practical usage (e.g., it takes about two minutes to process a 10 MB file) (Breitinger & Baier, 2012a).
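The Hamming distance driving bbHash's search is simply the bit-level difference count between two equal-length sequences:

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have equal length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

Evaluating this against every building block at every candidate position in the input is what makes the published implementation too slow for practical use.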

4.5 mvHash-B

Majority vote hashing, another BBR type, was published by Astebøl (2012); Breitinger, Astebøl, Baier, and Busch (2013). It transforms any byte sequence into long runs of 0x00s and 0xFFs by considering the neighboring bytes of a specific byte. If the neighborhood consists mainly of 1-bits, the byte is set to 0xFF, otherwise to 0x00. Next, these runs are encoded with run-length encoding (RLE). Although this procedure is very fast, it requires a specific configuration for each file type.
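The two stages, majority vote followed by RLE, can be sketched as below. The neighborhood width and the simple majority threshold are illustrative placeholders, since mvHash-B configures these parameters per file type.

```python
def majority_vote(data: bytes, width: int = 2) -> bytes:
    """Map each byte to 0xFF if its neighborhood is mostly 1-bits,
    else 0x00, producing long uniform runs."""
    out = bytearray()
    for i in range(len(data)):
        lo, hi = max(0, i - width), min(len(data), i + width + 1)
        ones = sum(bin(b).count("1") for b in data[lo:hi])
        out.append(0xFF if 2 * ones > 8 * (hi - lo) else 0x00)
    return bytes(out)

def run_length_encode(data: bytes) -> list:
    """Compress the uniform runs into (byte value, run length) pairs."""
    runs = []
    for b in data:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return [tuple(r) for r in runs]
```

The vote smooths small local edits into the same long runs, so the RLE output, and hence the digest, changes little under minor modifications.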

4.6 SimHash

SimHash was presented by Sadowski and Levin (2007) and embodies the notion of counting the occurrences of certain predefined binary strings, called "Tags", within an input. In their BBR implementation, the authors used 16 8-bit Tags, i.e., a possible Tag could have been 00110101. Subsequently, the tool parses an input bit by bit, searching for each Tag. The total number of matches is stored in a sum table. A hash key is computed as a function of the sum table entries that form linear combinations. Lastly, all information (including file name, path, and size) is stored in a database.
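The sum-table construction can be sketched as follows; the 16 concrete Tag values are predefined constants in the original tool, so here they are supplied by the caller.

```python
def simhash_sum_table(data: bytes, tags: list) -> list:
    """Count how often each 8-bit tag occurs at any bit offset.

    The input is unrolled into a bit string so that tags are matched
    bit by bit, not only at byte boundaries.
    """
    bits = "".join(f"{byte:08b}" for byte in data)
    table = []
    for tag in tags:
        pattern = f"{tag:08b}"
        table.append(sum(1 for i in range(len(bits) - 7)
                         if bits[i:i + 8] == pattern))
    return table
```

A hash key would then be derived as linear combinations of these counts and used as a coarse pre-filter before the full sum tables are compared.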

To identify similarities, a second tool named SimFash is used to query the database. The hash keys are used as a first filter to identify all possible matches. Next, the sum tables are compared, and a match is found if the distance is within a specified tolerance.

The authors clearly state that "two files are similar if only a small percentage of their raw bit patterns are different. ... [Thus,] the focus of SimHash has been on resemblance detection" (Sadowski & Levin, 2007).

4.7 saHash

Another SIF type, saHash uses the Levenshtein distance to derive the similarity between two byte sequences. The output is a lower bound for the Levenshtein distance between the two inputs. Akin to SimHash (Sec. 4.6), saHash allows for the detection of only near duplicates (up to several hundred Levenshtein operations).

A unique characteristic of this approach is its definition of similarity. While all other approaches output a number between 0 and 1 (not a percentage value), saHash actually returns a lower bound on the number of Levenshtein operations (Ziroff, 2012; Breitinger, Ziroff, Lange, & Baier, 2014) needed to convert one file into another.
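For reference, the exact Levenshtein distance that saHash bounds from below is defined by the classic dynamic program (saHash never runs this computation itself; it derives the bound from two compact digests):

```python
def levenshtein(a: bytes, b: bytes) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning a into b (classic row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]
```

The exact computation is quadratic in the input lengths, which is precisely why a digest-based lower bound is attractive for file-sized inputs.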

4.8 TLSH

TLSH belongs to the category of locality-sensitive hashes, was published by Oliver, Cheng, and Chen (2013), and is open source8. It processes an input byte sequence using a sliding window to populate an array of bucket counts, and determines the quartile points of the bucket counts. A fixed-length digest is constructed which consists of two parts: (i) a header based on the quartile points, the length of the input, and a checksum; (ii) a body consisting of a sequence of bit pairs, which depends on each bucket's value in relation to the quartile points. The distance between two digest headers is determined by the difference in file lengths and quartile ratios, while the bodies are contrasted via their approximate Hamming distance. Summing these together produces the TLSH similarity score.

According to the authors, the precision and recall rates are robust across a range of file types. Additional experiments (Oliver, Forman, & Cheng, 2014) showed that TLSH can detect strings which have been manipulated with adversarial intentions9. TLSH is also effective in detecting embedded objects, depending on the level of object manipulation. Despite these advantages, it is less powerful than sdhash and mrsh-v2 for cross correlation.

8 https://github.com/trendmicro/tlsh (last accessed Feb 4th, 2016).
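The quartile-based body encoding can be sketched as follows. Real TLSH fixes the number of buckets at 128, packs the bit pairs into a hex digest, and adds the header and checksum; all of that is omitted here.

```python
def tlsh_body(buckets: list) -> list:
    """Encode each bucket count as a 2-bit value relative to the
    quartile points q1 <= q2 <= q3 of all bucket counts."""
    s = sorted(buckets)
    n = len(s)
    q1, q2, q3 = s[n // 4], s[n // 2], s[(3 * n) // 4]

    def code(v):
        return 0 if v <= q1 else 1 if v <= q2 else 2 if v <= q3 else 3

    return [code(v) for v in buckets]

def body_distance(body_a: list, body_b: list) -> int:
    """Hamming-style distance between two equal-length digest bodies."""
    return sum(x != y for x, y in zip(body_a, body_b))
```

Because each bucket is ranked against the file's own quartiles, the body is insensitive to the absolute scale of the counts and therefore to the input's length.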

5. APPLICATIONS

Originally, approximate matching was designed to support the digital investigation process via the use cases stated in Sec. 3.1: search for target file(s)/fragments or reduce the volume of data needing investigation. Recently, tools such as EnCase, X-Ways Forensics, and Forensic Toolkit (FTK) have incorporated similar object detection technologies (Breitinger, 2014). Researchers have now identified additional working areas where these techniques or tools can have practical impact, e.g., file carving (see Sec. 5.1), data leakage prevention (see Sec. 5.2 and 5.4), and iris recognition (see Sec. 5.5).

5.1 Automatic data reduction and hash-based file carving

As sifting through data has become cumbersome, pre-processing schemes have arisen. Extracting data in bulk is arguably the most sought-after application of approximate matching. One perspective that should be fruitful is hash-based file carving.

9 Tolerance of manipulation was one of the design considerations for TLSH.

This alienated area of work was presented by S. Garfinkel and McCarrin (2014). In their paper, the authors combined techniques from file carving and approximate matching to search "media for complete files and file fragments with sector hashing and hashdb." Instead of focusing on the complete file and comparing it against a database, the authors use individual data blocks. They utilized a special database named hashdb (Allen, 2015) to obtain high throughput.

The evaluation proved their strategy works, although they had to solve the problem of non-probative blocks that emerged "from common data structures in office documents and multimedia files." To filter out such artifacts, the authors presented several 'tests' that alleviated the problem.

5.2 Network traffic analysis

Gupta (2013); Breitinger and Baggili (2014) demonstrated preliminary results when using approximate matching on network traffic for data leakage prevention. The question was (since approximate matching can be used for fragment detection) whether network packets could be matched back to their original files.

The design was similar to its traditional counterpart: create a database of known-object signatures (most likely files) and identify these objects, but instead of analyzing a hard drive, the researchers used a network stream (single packets). This work illustrated approximate matching's utility in data leakage prevention, a formerly untouched application.

After modifying the original mrsh-v2 algorithm to handle the small size of 1460 bytes per packet, the authors showed that this method works robustly on random data (true positive rate 99.6 %, true negative rate 100.0 %) with a throughput of 650 Mbit/s on a regular workstation.

Regardless, they faced several unsolved problems for real-world data. One obstacle was that many files share the same structural information (e.g., file header information; this is equivalent to the non-probative blocks problem from the previous subsection), which led to false positive rates of around 10^-5, too high for network traffic analysis.

5.3 Malware

Innately, similarity hashing is ideal for grouping things together, but it was not until 2015 that it was rigorously tested when applied to malware clustering (Li et al., 2015). Faruki, Laxmi, Bharmal, Gaur, and Ganmoor (2015) developed AndroSimilar, a syntactical detection algorithm for Android malware that falls into the SIF category. While Zhou, Zhou, Jiang, and Ning (2012)'s DroidMoSS, a CTPH algorithm, was developed to also detect mobile malware, a comparison between the two could not be performed due to unavailable code.

Polymorphic malware families are on the rise, often hosted on servers that automatically alter inconsequential segments of a file (before sending it across a network) to bypass cryptographic detection tactics (Security, 2013). An intriguing paper by Payer et al. (2014) expands on ways criminals can circumvent the use of similarity-based matching for spotting malicious binary code, which often diversifies itself during recompilation. Thus, it is imperative that malware receive more attention.

5.4 Data leakage prevention

With respect to data leakage prevention, approximate matching may also be utilized for printer data inspection, e.g., MyDLP10 and Symantec (2010) Data Loss Prevention. If a document is protected, the software can discard the print job (implemented by MyDLP). A similar experiment was also run by our research group11, which created a virtual secure printer that analyzed a sent document before forwarding it to an actual printer.

Note, according to Comodo Group Inc. (2013), these software solutions often call their technology partial document matching, unstructured data matching, intelligent content matching, or statistical document matching, all synonyms for approximate matching.

5.5 Biometrics

Biometrics is another independent domain employing approximate matching with promising results (Rathgeb, Breitinger, & Busch, 2013; Rathgeb, Breitinger, Busch, & Baier, 2013). In their work, the authors demonstrated the feasibility of using techniques from approximate matching for biometric template protection, data compression, and efficient identification. According to Breitinger (2014), there are three improvements:

• "Template protection: the successive mapping of parts of a binary biometric template to Bloom filters represents an irreversible transformation achieving alignment-free protected biometric templates."

• "Biometric data compression: the proposed Bloom filter-based transformation can be parameterized to obtain a desired template size, operating a trade-off between compression and biometric performance."

10 https://www.mydlp.com (last accessed Feb 4th, 2016).

11 The project was done by Kyle Anthony, a member of the UNH Cyber Forensics Research & Education Group.

• "Efficient identification: a compact alignment-free representation of iris-codes enables a computationally efficient biometric identification reducing the overall response time of the system."

6. LIMITATIONS AND CHALLENGES

Bytewise approximate matching has some intrinsic limitations. First and foremost, it cannot pick up similarity at a higher level of abstraction, such as semantic similarity. For instance, it cannot meaningfully match two image files that show the same semantic picture but have different file types/formats, due to the different binary encodings. However, placing it in tandem with other approaches will still help; Neuner, Mulazzani, Schrittwieser, and Weippl (2015) include it as a critical component of the digital forensic process. Doubly crucial when it comes to applications like malware detection that employ databases is lookup time. Winter et al. (2013) outlined a faster approach to conduct similarity searching using a database, but this method will not work effectively for all approximate matching algorithms. We will not belabor the point since this was discussed in Sec. 3.6.

Evidently, the first challenge to confront is awareness and adoption of approximate matching, a possible indication that more research needs to be conducted to understand the needs of the community. As we outlined in the introduction, 15 % of professionals are unaware of approximate matching, an unacceptable number. Conversely, 7 % criticize that the algorithms are too slow for practical use. On the other hand, inquiry needs to be made into why 35 % have used approximate matching only a few times. Our hope is that having the NIST SP 800-168 definition, a technical classification of algorithms (Martínez et al., 2014), and this more universal outline will improve awareness and adoption.

Yet another major obstacle is the lack of a standard definition of similarity (Baier, 2015). As addressed by S. Garfinkel and McCarrin (2014), not all kinds of byte-level similarity are equally valuable, as there are some artifacts (e.g., structural information, headers, footers, etc.) that are less important or lead to false positives. Hence, we need a filtering mechanism to prioritize matches. One possibility could be to extract the main elements (like text or images) and compare those, i.e., including a pre-processing step before the comparison.

Aside from initial efforts to test approximate matching algorithms, there are currently no accepted standards and reliable testing frameworks. FRASH is not easy to implement, a deterrent to practitioners. This is one reason this paper avoids giving absolute comparisons between all the (types of) algorithms. More bothersome is the lack of an accepted ground truth for real-world data that would support implementation assessments, like whether algorithms scale effectively. The ground truth should embody the four use cases (see Sec. 3.1) and algorithm types, even if different data sets are required for each one. Once this is done, the algorithms will be directly comparable (e.g., embedded object detection: A better than B better than C; speed: B faster than A faster than C). We posit that it is critical for practitioner efficiency to know which algorithms solve which potential problems. At the moment, the National Software Reference Library (NSRL12) has built the most prominent software database and is piecing together a test corpus (its usefulness was demonstrated in Rowe (2012)).

12 http://www.nsrl.nist.gov (last accessed Feb 4th, 2016).

7. FUTURE FOCUS

Future research should, in accompaniment to the prior comments, pursue areas that branch out from digital forensics, even if completely detached. Below we name some possible applications that could use enhancement, and areas that approximate matching may be able to be enhanced by (this is not an exhaustive list):

• Bioinformatics: This field already uses exact matching methods; using bytewise or semantic approximate matching alongside today's methods could conceivably increase efficiency.

• Text mining: Identifying patterns in structured data to gain high-quality, semantic value.

• Templates and layouts: Semantically identify document layout and separate content from template automatically.

• Deep / machine learning: Automated forms of learning need to process information fast, store it space-efficiently, and differentiate similarities and differences; semantic hashing is a known aspect of deep learning and ergo might be able to strengthen approximate matching.

• Source code governance: Manage shared code better, especially for open source software.

• Spam filtering and anti-plagiarism: These have already been examined but might be behooved by deeper scrutiny.

Ultimately, approximate matching is an alienated domain and its increased adoption will strongly support other domains. Researchers looking to improve on the slowness of indexing and searching may also benefit from looking into other domains such as compression and programming. Once more people are aware of approximate matching, we might identify more fields where the technology would be relevant.

8. ACKNOWLEDGMENTS

We would like to thank Michael McCarrin for reviewing and editing some paragraphs, especially those that are related to his research. Additionally, we would like to thank Jonathan Oliver for providing us with the summary for TLSH.

REFERENCES

Allen, B. (2015). hashdb. https://github.com/simsong/hashdb.

Astebøl, K. P. (2012). mvHash - a new approach for fuzzy hashing (Unpublished master's thesis). Gjøvik University College.

Baier, H. (2015). Towards automated preprocessing of bulk data in digital forensic investigations using hash functions. it - Information Technology, 57(6), 347–356.

Baier, H., & Breitinger, F. (2011, May). Security aspects of piecewise hashing in computer forensics. IT Security Incident Management & IT Forensics (IMF), 21–36. doi: 10.1109/IMF.2011.16

Bertoni, G., Daemen, J., Peeters, M., & Assche, G. V. (2008, October). Keccak specifications (Tech. Rep.). STMicroelectronics and NXP Semiconductors.

Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13, 422–426.

Breitinger, F. (2014). On the utility of bytewise approximate matching in computer science with a special focus on digital forensics investigations (Doctoral dissertation, Technical University Darmstadt). Retrieved from http://tuprints.ulb.tu-darmstadt.de/4055/

Breitinger, F., Astebøl, K. P., Baier, H., & Busch, C. (2013, March). mvHash-B - a new approach for similarity preserving hashing. In IT security incident management and IT forensics (IMF), 2013 seventh international conference on (pp. 33–44). doi: 10.1109/IMF.2013.18

Breitinger, F., & Baggili, I. (2014, September). File detection on network traffic using approximate matching. Journal of Digital Forensics, Security and Law (JDFSL), 9(2), 23–36. Retrieved from http://ojs.jdfsl.org/index.php/jdfsl/article/view/261

Breitinger, F., & Baier, H. (2012a, May). A fuzzy hashing approach based on random sequences and Hamming distance. In 7th annual Conference on Digital Forensics, Security and Law (ADFSL) (pp. 89–100).

Breitinger, F., & Baier, H. (2012b). Performance issues about context-triggered piecewise hashing. In P. Gladyshev & M. Rogers (Eds.), Digital forensics and cyber crime (Vol. 88, pp. 141–155). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/978-3-642-35515-8_12 doi: 10.1007/978-3-642-35515-8_12

Breitinger, F., & Baier, H. (2012c). Properties of a similarity preserving hash function and their realization in sdhash. In Information security for South Africa (ISSA), 2012 (pp. 1–8). doi: 10.1109/ISSA.2012.6320445

Breitinger, F., & Baier, H. (2013). Similarity preserving hashing: Eligible properties and a new algorithm mrsh-v2. In M. Rogers & K. Seigfried-Spellar (Eds.), Digital forensics and cyber crime (Vol. 114, pp. 167–182). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/978-3-642-39891-9_11 doi: 10.1007/978-3-642-39891-9_11

Breitinger, F., Baier, H., & White, D. (2014, May). On the database lookup problem of approximate matching. Digital Investigation, 11, Supplement 1, S1–S9. Retrieved from http://www.sciencedirect.com/science/article/pii/S1742287614000061 (Proceedings of the First Annual DFRWS Europe) doi: 10.1016/j.diin.2014.03.001

Breitinger, F., Guttman, B., McCarrin, M., Roussev, V., & White, D. (2014, May). Approximate matching: Definition and terminology (Special Publication 800-168). National Institute of Standards and Technology. Retrieved from http://dx.doi.org/10.6028/NIST.SP.800-168

Breitinger, F., Rathgeb, C., & Baier, H. (2014, September). An efficient similarity digests database lookup - a logarithmic divide & conquer approach. Journal of Digital Forensics, Security and Law (JDFSL), 9(2), 155–166. Retrieved from http://ojs.jdfsl.org/index.php/jdfsl/article/view/276

Breitinger, F., & Roussev, V. (2014, May). Automated evaluation of approximate matching algorithms on real data. Digital Investigation, 11, Supplement 1, S10–S17. Retrieved from http://www.sciencedirect.com/science/article/pii/S1742287614000073 (Proceedings of the First Annual DFRWS Europe) doi: 10.1016/j.diin.2014.03.002

Breitinger, F., Stivaktakis, G., & Baier, H. (2013, August). FRASH: A framework to test algorithms of similarity hashing. In 13th Digital Forensics Research Conference (DFRWS'13). Monterey.

Breitinger, F., Stivaktakis, G., & Roussev, V. (2013, Sept). Evaluating detection error trade-offs for bytewise approximate matching algorithms. 5th ICST Conference on Digital Forensics & Cyber Crime (ICDF2C).

Breitinger, F., Ziroff, G., Lange, S., & Baier, H. (2014, January). saHash: Similarity hashing based on Levenshtein distance. In Tenth annual IFIP WG 11.9 international conference on digital forensics (IFIP WG 11.9).

Broder, A. Z. (1997). On the resemblance and containment of documents. In Compression and complexity of sequences (SEQUENCES'97) (pp. 21–29). IEEE Computer Society.

Broder, A. Z., Charikar, M., Frieze, A. M., & Mitzenmacher, M. (1998). Min-wise independent permutations. Journal of Computer and System Sciences, 60, 327–336.

Charikar, M. S. (2002). Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on theory of computing (pp. 380–388).

Chen, L., & Wang, G. (2008). An efficient piecewise hashing method for computer forensics. In Knowledge discovery and data mining, 2008. WKDD 2008. First international workshop on (pp. 635–638). doi: 10.1109/WKDD.2008.80

Collange, S., Dandass, Y. S., Daumas, M., & Defour, D. (2009). Using graphics processors for parallelizing hash-based data carving. In System sciences, 2009. HICSS'09. 42nd Hawaii international conference on (pp. 1–10).

Comodo Group Inc. (2013). PDM (Partial Document Matching). https://www.mydlp.com/partial-document-matching-how-it-works/.

Faruki, P., Laxmi, V., Bharmal, A., Gaur, M., & Ganmoor, V. (2015). AndroSimilar: Robust signature for detecting variants of Android malware. Journal of Information Security and Applications, 22, 66–80.

FIPS, P. (1995). 180-1. Secure hash standard. National Institute of Standards and Technology, 17, 45.

Fowler, G., Noll, L. C., & Vo, P. (1994–2012). FNV hash. http://www.isthe.com/chongo/tech/comp/fnv/index.html.

Garfinkel, S., & McCarrin, M. (2014). Hash-based carving: Searching media for complete files and file fragments with sector hashing and hashdb. Digital Investigation, 12.

Garfinkel, S., Nelson, A., White, D., & Roussev, V. (2010). Using purpose-built functions and block hashes to enable small block and sub-file forensics. Digital Investigation, 7, S13–S23.

Garfinkel, S. L. (2009, March). Announcing frag_find: finding file fragments in disk images using sector hashing. Retrieved from http://tech.groups.yahoo.com/group/linux_forensics/message/3063

Gupta, V. (2013). File fragment detection on network traffic using similarity hashing (Unpublished master's thesis). Denmark Technical University.

Harichandran, V. S., Breitinger, F., Baggili, I., & Marrington, A. (2016). A cyber forensics needs analysis survey: Revisiting the domain's needs a decade later. Computers & Security, 57, 1–13.

Jaccard, P. (1901). Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bulletin de la Société Vaudoise des Sciences Naturelles, 37, 241–272.

Jaccard, P. (1912, February). The distribution of the flora in the alpine zone. New Phytologist, 11, 37–50.

Key, S. (2013). File block hash map analysis. Retrieved from https://www.guidancesoftware.com/appcentral/

Kornblum, J. (2006, September). Identifying almost identical files using context triggered piecewise hashing. Digital Investigation, 3, 91–97. Retrieved from http://dx.doi.org/10.1016/j.diin.2006.06.015 doi: 10.1016/j.diin.2006.06.015

Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). Mining of massive datasets. Cambridge University Press.

Li, Y., Sundaramurthy, S. C., Bardas, A. G., Ou, X., Caragea, D., Hu, X., & Jang, J. (2015). Experimental study of fuzzy hashing in malware clustering analysis. In 8th workshop on cyber security experimentation and test (CSET 15).

Manber, U. (1994). Finding similar files in a large file system. In Proceedings of the USENIX winter 1994 technical conference (pp. 1–10).

Martínez, V. G., Alvarez, F. H., & Encinas, L. H. (2014). State of the art in similarity preserving hashing functions. worldcomp-proceedings.com.

Martínez, V. G., Alvarez, F. H., Encinas, L. H., & Avila, C. S. (2015). A new edit distance for fuzzy hashing applications. In Proceedings of the international conference on security and management (SAM) (p. 326).

Neuner, S., Mulazzani, M., Schrittwieser, S., & Weippl, E. (2015). Gradually improving the forensic process. In Availability, reliability and security (ARES), 2015 10th international conference on (pp. 404–410).

Oliver, J., Cheng, C., & Chen, Y. (2013). TLSH - a locality sensitive hash. In 4th cybercrime and trustworthy computing workshop (CTC) (pp. 7–13).

Oliver, J., Forman, S., & Cheng, C. (2014). Using randomization to attack similarity digests. In Applications and techniques in information security (pp. 199–210). Springer.

Payer, M., Crane, S., Larsen, P., Brunthaler, S., Wartell, R., & Franz, M. (2014). Similarity-based matching meets malware diversity. arXiv preprint arXiv:1409.7760.

Rabin, M. O. (1981). Fingerprinting by random polynomials (Tech. Rep. No. TR1581). Cambridge, Massachusetts: Center for Research in Computing Technology, Harvard University.

Rajaraman, A., & Ullman, J. D. (2012). Mining of massive datasets. Cambridge: Cambridge University Press.

Rathgeb, C., Breitinger, F., & Busch, C. (2013, June). Alignment-free cancelable iris biometric templates based on adaptive Bloom filters. In Biometrics (ICB), international conference on (pp. 1–8). doi: 10.1109/ICB.2013.6612976

Rathgeb, C., Breitinger, F., Busch, C., & Baier, H. (2013). On the application of Bloom filters to iris biometrics. IET Biometrics.

Ren, L., & Cheng, R. (2015, August). Bytewise approximate matching, searching and clustering. DFRWS USA.

Rivest, R. (1992). The MD5 Message-Digest Algorithm.

Roussev, V. (2009). Building a better similarity trap with statistically improbable features. In System sciences, 2009. HICSS '09. 42nd Hawaii international conference on (pp. 1–10). doi: 10.1109/HICSS.2009.97

Roussev, V. (2010). Data fingerprinting with similarity digests. In K.-P. Chow & S. Shenoi (Eds.), Advances in digital forensics VI (Vol. 337, pp. 207–226). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/978-3-642-15506-2_15 doi: 10.1007/978-3-642-15506-2_15

Roussev, V. (2011, August). An evaluation of forensic similarity hashes. Digital Investigation, 8, 34–41. Retrieved from http://dx.doi.org/10.1016/j.diin.2011.05.005 doi: 10.1016/j.diin.2011.05.005

Roussev, V. (2012). Managing terabyte-scale investigations with similarity digests. In G. Peterson & S. Shenoi (Eds.), Advances in digital forensics VIII (Vol. 383, pp. 19–34). Springer Berlin Heidelberg. Retrieved from http://dx.doi.org/10.1007/978-3-642-33962-2_2 doi: 10.1007/978-3-642-33962-2_2

Roussev, V., Richard, G. G., III, & Marziale, L. (2007, September). Multi-resolution similarity hashing. Digital Investigation, 4, 105–113. doi: 10.1016/j.diin.2007.06.011

Roussev, V., Richard, G. G., III, & Marziale, L. (2008). Class-aware similarity hashing for data classification. In I. Ray & S. Shenoi (Eds.), Advances in digital forensics IV (Vol. 285, pp. 101–113). Springer US. Retrieved from http://dx.doi.org/10.1007/978-0-387-84927-0_9 doi: 10.1007/978-0-387-84927-0_9

Rowe, N. C. (2012). Testing the National Software Reference Library. Digital Investigation, 9, S131–S138.

Sadowski, C., & Levin, G. (2007, December). SimHash: Hash-based similarity detection. http://simhash.googlecode.com/svn/trunk/paper/SimHashWithBib.pdf.

Security, H. N. (2013). The rise of sophisticated malware. Retrieved from http://www.net-security.org/malware_news.php?id=2543

Seo, K., Lim, K., Choi, J., Chang, K., & Lee, S. (2009). Detecting similar files based on hash and statistical analysis for digital forensic investigation. In Computer science and its applications, 2009. CSA '09. 2nd international conference on (pp. 1–6). doi: 10.1109/CSA.2009.5404198

Symantec. (2010). Machine learning sets new standard for data loss prevention: Describe, fingerprint, learn (Tech. Rep.). Symantec Corporation.

Tridgell, A. (2002–2009). spamsum. http://www.samba.org/ftp/unpacked/junkcode/spamsum/. Retrieved from http://samba.org/ftp/unpacked/junkcode/spamsum/README (last accessed Feb 4th, 2016)

Winter, C., Schneider, M., & Yannikos, Y. (2013). F2S2: Fast forensic similarity search through indexing piecewise hash signatures. Digital Investigation, 10(4), 361–371. doi: 10.1016/j.diin.2013.08.003

Ziroff, G. (2012, February). Approaches to similarity-preserving hashing (Bachelor's thesis). Hochschule Darmstadt.
