Bytewise approximate matching, searching and clustering

Copyright 2011 Trend Micro Inc.

Liwei Ren, Ph.D

Ray Cheng, Ph.D

Trend Micro Inc.

DFRWS USA 2015, August 2015, Philadelphia, PA


Agenda

• Background

• Six Matching Problems and Bytewise Relevance

• Current Work: A Framework of Theory, Algorithms, and Technologies

• Future Work

Classification 8/17/2015 2


Background

• Similarity digesting schemes:

– Problem: Given two binary strings s1 and s2, measure their similarity.

• Compute a hash that preserves the similarity property of the strings.

• Measure similarity by comparing the two hash values.

– Examples: TLSH, ssdeep, sdhash
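The hash-then-compare idea above can be sketched in a few lines. This is a toy scheme for illustration only, not how TLSH, ssdeep or sdhash work: the function names `H` and `SIM`, the window size `n`, and the digest size `k` are all assumptions made for this example.

```python
import hashlib


def H(data: bytes, n: int = 4, k: int = 64) -> frozenset:
    """Toy similarity digest: the k smallest 64-bit hashes of all n-byte windows."""
    last_start = max(len(data) - n, 0)
    hashes = {
        int.from_bytes(
            hashlib.blake2b(data[i:i + n], digest_size=8).digest(), "big"
        )
        for i in range(last_start + 1)
    }
    # Keeping only the k smallest values bounds the digest size.
    return frozenset(sorted(hashes)[:k])


def SIM(h1: frozenset, h2: frozenset) -> float:
    """Crude similarity score in [0, 1]: Jaccard overlap of two digests."""
    union = h1 | h2
    return len(h1 & h2) / len(union) if union else 1.0


a = b"bytewise approximate matching, searching and clustering"
b = a + b"!"                 # near-duplicate of a
c = b"0123456789" * 3        # unrelated content
print(SIM(H(a), H(b)), SIM(H(a), H(c)))
```

Only the digests are compared, never the original strings, which is the property that makes such schemes attractive for large collections.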



Background

• NIST Special Publication 800-168 (NIST.SP.800-168) introduces the concept of bytewise approximate matching:

– The NIST document lists four cases to describe this concept:

• Object similarity detection: identify related artifacts, e.g. different versions of a document.

• Cross Correlation: identify artifacts sharing a common object.

• Embedded Object Detection: identify a given object inside an artifact.

• Fragment Detection: identify the presence of traces/fragments of a known artifact.

• Dr. Liwei Ren’s talk at DFRWS EU 2015:

– A Theoretic Framework for Evaluating Similarity Digesting Tools

– Uses a mathematical model to describe binary similarity.



Six Matching Problems and Bytewise Relevance

• The NIST document does not cover all bytewise approximate matching cases.

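• We generalized the NIST cases to six cases: three exact-match cases (EM1 identicalness, EM2 containment, EM3 cross-sharing) and three approximate-match cases (AM1 similarity, AM2 approximate containment, AM3 approximate cross-sharing).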






Classification of NIST approximate matching cases

• Similarity Detection: identify related artifacts.

– AM1 (approximate match)

• Cross Correlation: identify artifacts sharing a common object.

– EM3 (exact match cross-sharing)

• Embedded Object Detection: identify a given object inside an artifact.

– EM2 (exact match containment)

• Fragment Detection: identify the presence of traces/fragments of a known artifact.

– EM2 (one or more exact-match containments)




• Definition 1: Given two strings R[1,…,n] and T[1,…,m], we say R and T are bytewise relevant if at least one of the six cases holds.

– We denote this as BR(R,T) = 1; otherwise BR(R,T) = 0.



A Framework of Theory, Algorithms and Technologies

• Define three fundamental problems using Bytewise Relevance:

– Matching: Given O1, O2 ∊ S, determine whether BR(O1, O2) = 1.

– Searching: B ⊆ S is a bag of objects. Given o ∊ S, find b ∊ B such that BR(o, b) = 1.

– Clustering: Given a bag B of objects, partition B into groups {G1, G2, …, Gm} based on BR.

• S = an object space,

• O = an object in the object space S,

• BR = the Bytewise Relevance relationship for objects in S.



A Framework of Theory, Algorithms and Technologies

• Our bytewise relevance framework:



Matching

• The six matching problems, EM1 to AM3:

– Identicalness (EM1): the solution is trivial.

– Containment (EM2): solvable with the Rabin-Karp algorithm.

– Cross-sharing (EM3):

• We established a theory for this interesting problem: how to measure cross-sharing.

• We developed an algorithmic solution with theoretical analysis.

– Similarity (AM1):

• TLSH, ssdeep and sdhash

• Dr. Ren delivered a talk at DFRWS EU 2015 describing eight approaches to this problem.

• We designed a novel similarity digesting scheme, TSFP.

– Approximate containment (AM2): two heuristic algorithms.

– Approximate cross-sharing (AM3): one heuristic algorithm.
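The three exact-match cases can be illustrated with a minimal sketch. EM2 uses a textbook Rabin-Karp search, as named above. For EM3, the slides do not give the authors' algorithm, so the version here is a stand-in assumption: two strings are treated as cross-sharing when their longest common substring reaches a hypothetical threshold `min_len`. All function names are illustrative.

```python
def em1_identical(r: bytes, t: bytes) -> bool:
    """EM1: the two strings are byte-for-byte identical."""
    return r == t


def rabin_karp_find(pattern: bytes, text: bytes,
                    base: int = 256, mod: int = (1 << 61) - 1) -> int:
    """Return the first index of pattern in text, or -1 (rolling-hash search)."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the leading byte in a window
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + pattern[i]) % mod
        w_hash = (w_hash * base + text[i]) % mod
    for i in range(n - m + 1):
        # Compare bytes on a hash hit to rule out collisions.
        if p_hash == w_hash and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Slide the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - text[i] * high) * base + text[i + m]) % mod
    return -1


def em2_contains(r: bytes, t: bytes) -> bool:
    """EM2: r occurs as a contiguous substring of t."""
    return rabin_karp_find(r, t) >= 0


def longest_common_substring(r: bytes, t: bytes) -> int:
    """Length of the longest common substring, by O(len(r) * len(t)) DP."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(r) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if r[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def em3_cross_sharing(r: bytes, t: bytes, min_len: int = 8) -> bool:
    """EM3 stand-in (assumption): r and t share a substring of >= min_len bytes."""
    return longest_common_substring(r, t) >= min_len
```

The quadratic dynamic program is only practical for short inputs; it is used here because it is easy to verify, not because it scales.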



Searching

• For the relationship BR, the searching problem is:

– B is a bag of strings. Given a string T, find s ∊ B such that BR(T, s) = 1.



Searching

• How do we solve the searching problem?

– Brute-force approach: for every s ∊ B, evaluate BR(T, s). Can this scale to millions or billions of strings?

– Candidate selection approach, in two steps:

• STEP 1: quickly select a few candidates {s1, s2, …, sm}.

• STEP 2: evaluate BR(T, sk) for each candidate.

– How do we select good candidates?

• String fingerprinting: generate fingerprints from each string in B.

• Indexing process: index the fingerprints along with the string IDs to create an index database, FP-DB.

• Searching process: given T, generate its fingerprints {FP1, FP2, …, FPq} and use them to look up possible candidates in FP-DB.

– NOTE:

• This is similar to a keyword-based search engine in which the fingerprints are the keywords.

• The fingerprinting procedure is effectively a special tokenization method.
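The two-step candidate selection can be sketched as an inverted index from fingerprints to string IDs. The fingerprinting scheme below (a hash of every n-byte window, optionally sampled by `modulus`) is an assumption chosen for brevity, not the authors' method, and `shares_window` is a hypothetical stand-in for BR. The class and parameter names are illustrative.

```python
import hashlib
from collections import Counter, defaultdict


def fingerprints(data: bytes, n: int = 8, modulus: int = 1) -> set:
    """Assumed scheme: 64-bit hash of each n-byte window, kept if divisible by modulus."""
    fps = set()
    for i in range(len(data) - n + 1):
        digest = hashlib.blake2b(data[i:i + n], digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h % modulus == 0:  # modulus > 1 samples a subset of windows
            fps.add(h)
    return fps


def shares_window(a: bytes, b: bytes, n: int = 16) -> bool:
    """Stand-in for BR (assumption): some n-byte window of a occurs in b."""
    return any(a[i:i + n] in b for i in range(len(a) - n + 1))


class FingerprintIndex:
    def __init__(self, n: int = 8, modulus: int = 1):
        self.n, self.modulus = n, modulus
        self.fp_db = defaultdict(set)  # fingerprint -> string IDs (the FP-DB)
        self.strings = {}              # string ID -> content

    def add(self, string_id, data: bytes) -> None:
        """Indexing process: record every fingerprint of data under string_id."""
        self.strings[string_id] = data
        for fp in fingerprints(data, self.n, self.modulus):
            self.fp_db[fp].add(string_id)

    def candidates(self, query: bytes, limit: int = 5) -> list:
        """STEP 1: rank string IDs by the number of fingerprints shared with query."""
        hits = Counter()
        for fp in fingerprints(query, self.n, self.modulus):
            for string_id in self.fp_db.get(fp, ()):
                hits[string_id] += 1
        return [string_id for string_id, _ in hits.most_common(limit)]

    def search(self, query: bytes, br, limit: int = 5) -> list:
        """STEP 2: run the expensive BR check on the few candidates only."""
        return [sid for sid in self.candidates(query, limit)
                if br(query, self.strings[sid])]


index = FingerprintIndex()
index.add("a", b"the quick brown fox jumps over the lazy dog")
index.add("b", b"lorem ipsum dolor sit amet consectetur")
index.add("c", b"completely unrelated bytes 0123456789")
print(index.search(b"a quick brown fox jumps high", shares_window))
```

The lookup cost depends on the number of query fingerprints and matching postings rather than on the size of B, which is the same reason keyword search engines scale.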


Future Work: Clustering Problem

• For the relationship BR, there is a clustering problem:

– B is a bag of strings; partition B into groups of strings based on BR.
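One possible reading of "partition B based on BR" is to treat BR as the edges of a graph and take connected components. This is an assumption made for illustration, since the slides leave the clustering method as future work. The sketch below uses union-find, compares all pairs (quadratic in the size of B), and takes BR as a caller-supplied predicate; `shares_window` is a hypothetical stand-in.

```python
def shares_window(a: bytes, b: bytes, n: int = 8) -> bool:
    """Stand-in for BR (assumption): some n-byte window of a occurs in b."""
    return any(a[i:i + n] in b for i in range(len(a) - n + 1))


def cluster(bag: list, br) -> list:
    """Group items into connected components of the graph induced by br."""
    parent = list(range(len(bag)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(bag)):
        for j in range(i + 1, len(bag)):
            if br(bag[i], bag[j]):
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, item in enumerate(bag):
        groups.setdefault(find(i), []).append(item)
    return list(groups.values())


bag = [b"abcdefgh-1", b"abcdefgh-2", b"zzzzzzzz", b"xx-abcdefgh"]
print(cluster(bag, shares_window))
```

Because BR need not be transitive, connected components can chain loosely related strings into one group; a stricter grouping rule would be a different design choice.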



Future Work: Library and tools

• Analyze algorithms and measure performance.

– Verify they can scale.

• For bytewise approximate matching, searching and clustering:

– Library of functions

– API

– Tools



Application examples of Approximate Matching, Searching, Clustering

• E-Discovery

– Comparing near duplicate documents

– Grouping near duplicate documents

• Digital forensic analysis

– Identifying similar objects or files

• Malware analysis

– Identifying similar malware or mutated malware

• Anti-plagiarism

– Detection of copyright violations

• Source code governance

• Spam filtering

• Data Loss Prevention



Q&A

• Thank you.

• Any questions?

• Email:

– [email protected]

– [email protected]






Application Example

• A search problem in a DLP (Data Loss Prevention) system:

– Problem: S = {d1, d2, …, dn} is a collection of confidential documents. Given any document T and a threshold 0 < δ ≤ 1, find a document d ∊ S such that RLV(d, T) ≥ δ.

• RLV is a function that measures the relevance of two documents.

• Challenges: how to construct RLV and choose δ? How to make the search scalable?
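As one possible answer to "how to construct RLV", the sketch below scores the fraction of a confidential document's n-byte windows that reappear in T, which yields a value in [0, 1] that can be compared against δ. This particular RLV is an assumption for illustration, not the construction used by the authors, and the document contents and names are made up.

```python
def ngrams(data: bytes, n: int = 8) -> set:
    """All distinct n-byte windows of data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def rlv(d: bytes, t: bytes, n: int = 8) -> float:
    """Assumed RLV: share of d's n-byte windows that also occur in t."""
    grams = ngrams(d, n)
    if not grams:
        return 0.0
    return len(grams & ngrams(t, n)) / len(grams)


def find_leaks(docs: dict, t: bytes, delta: float) -> list:
    """Return IDs of confidential documents d with RLV(d, t) >= delta."""
    return [doc_id for doc_id, d in docs.items() if rlv(d, t) >= delta]


docs = {
    "plan": b"Q3 acquisition plan: buy ExampleCorp for $10M",
    "memo": b"Cafeteria menu for Friday: pasta and salad",
}
t = b"FYI, pasting this here -> Q3 acquisition plan: buy ExampleCorp for $10M <- keep quiet"
print(find_leaks(docs, t, delta=0.5))
```

This scan evaluates RLV against every document, so it is the brute-force variant; making it scalable is the separate challenge noted above.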



Application Example

• A clustering problem in e-Discovery:

– Data are identified as potentially relevant by attorneys.

– De-duplication technology.

– Problem: partition S into groups based on textual relevance.



Background

• Similarity digesting schemes:

– A family of similarity-preserving hashing techniques and tools

– Problem: Given two binary strings s1 and s2, measure the similarity by s = SIM(H(s1), H(s2)).

• H is a hash function that preserves string similarity.

• SIM is another function that measures the similarity of two hash values.

– Examples: TLSH, ssdeep, sdhash

– Challenge: how to evaluate their relative pros and cons?




• Definition 2: Let X, Y ∊ {EM1, EM2, EM3, AM1, AM2, AM3}. If problem X is a special case of problem Y, we denote this as X ↪ Y.

• We have the following relationships:

– EM1 ↪ EM2 ↪ EM3

– AM1 ↪ AM2 ↪ AM3

– EM1 ↪ AM1, EM2 ↪ AM2, EM3 ↪ AM3
