Copyright 2011 Trend Micro Inc. 1 Bytewise approximate matching, searching and clustering Liwei Ren, Ph.D Ray Cheng, Ph.D Trend Micro Inc. DFRWS USA 2015, August , 2015, Philadelphia, PA

Bytewise approximate matching, searching and clustering


  • Copyright 2011 Trend Micro Inc.

    Bytewise approximate matching, searching and clustering

    Liwei Ren, Ph.D.

    Ray Cheng, Ph.D.

    Trend Micro Inc.

    DFRWS USA 2015, August 2015, Philadelphia, PA

  • Agenda

    Background

    Six Matching Problems and Bytewise Relevance

    Current Work: A Framework of Theory, Algorithms, and Technologies

    Future Work

    Classification 8/17/2015 2

  • Background

    Similarity digesting schemes:

    Problem: given two binary strings s1 and s2, measure their similarity.

    Compute a hash that preserves the similarity property of the strings.

    Measure similarity by comparing the two hash values.

    Examples: TLSH, ssdeep, sdhash

  • Background

    The NIST specification document NIST.SP.800-168 introduces the concept of bytewise approximate matching.

    The NIST document lists four cases to describe this concept:

    Object similarity detection: identify related artifacts, e.g. different versions of a document.

    Cross Correlation: identify artifacts sharing a common object.

    Embedded Object Detection: identify a given object inside an artifact.

    Fragment Detection: identify the presence of traces/fragments of a known artifact.

    Dr. Liwei Ren's talk at DFRWS EU 2015, "A Theoretic Framework for Evaluating Similarity Digesting Tools":

    Uses a mathematical model to describe binary similarity.

  • Six Matching Problems and Bytewise Relevance

    The NIST document does not cover all bytewise approximate matching cases.

    We generalized the NIST cases to six cases:

  • Six Matching Problems and Bytewise Relevance

    Continued:

  • Classification of NIST approximate matching cases

    Similarity Detection: identify related artifacts. AM1 (approximate match)

    Cross Correlation: identify artifacts sharing a common object. EM3 (exact match cross-sharing)

    Embedded Object Detection: identify a given object inside an artifact. EM2 (exact match containment)

    Fragment Detection: identify the presence of traces/fragments of a known artifact. EM2 (one or more exact match containments)

  • Six Matching Problems and Bytewise Relevance

    Definition 1: Given two strings R[1..n] and T[1..m], if any one of the six cases holds, we say R and T are bytewise relevant and write BR(R,T) = 1; otherwise BR(R,T) = 0.
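Definition 1 can be read as a logical OR over case predicates. The sketch below is illustrative only: the slides' precise definitions of the six cases are not reproduced in this text, so the three exact-match predicates use assumed definitions (in particular the `min_len` threshold for EM3 is hypothetical), and the approximate cases AM1 to AM3 are left out.

```python
from difflib import SequenceMatcher


def em1_identical(r: bytes, t: bytes) -> bool:
    """EM1 (identicalness): the two strings are byte-for-byte equal."""
    return r == t


def em2_contains(r: bytes, t: bytes) -> bool:
    """EM2 (containment): one non-empty string occurs inside the other."""
    return bool(r) and bool(t) and (r in t or t in r)


def em3_cross_shares(r: bytes, t: bytes, min_len: int = 8) -> bool:
    """EM3 (cross-sharing), assumed form: the strings share a common
    contiguous substring of at least min_len bytes."""
    matcher = SequenceMatcher(None, r, t, autojunk=False)
    match = matcher.find_longest_match(0, len(r), 0, len(t))
    return match.size >= min_len


def bytewise_relevant(r: bytes, t: bytes,
                      cases=(em1_identical, em2_contains, em3_cross_shares)) -> int:
    """BR(R, T): 1 if any supplied case predicate holds, otherwise 0."""
    return int(any(case(r, t) for case in cases))
```

Approximate-case predicates, once defined, could be appended to `cases` without changing `bytewise_relevant`.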

  • A Framework of Theory, Algorithms and Technologies

    Define three fundamental problems using Bytewise Relevance:

    Matching: given O1, O2 ∈ S, determine whether BR(O1, O2) = 1.

    Searching: B ⊆ S is a bag of objects. Given o ∈ S, find b ∈ B such that BR(o, b) = 1.

    Clustering: given a bag B of objects, partition B into groups {G1, G2, …, Gm} based on BR.

    where

    S = an object space,

    O = an object in the object space S,

    BR = the Bytewise Relevance relationship for objects in S.

  • A Framework of Theory, Algorithms and Technologies

    Our bytewise relevance framework:

  • Matching

    The six matching problems, EM1 through AM3:

    Identicalness (EM1): the solution is trivial.

    Containment (EM2): the solution is the Rabin-Karp algorithm.

    Cross-sharing (EM3):

    We established a theory for this interesting problem: how to measure cross-sharing.

    We developed an algorithmic solution with theoretical analysis.

    Similarity (AM1):

    TLSH, ssdeep and sdhash.

    Dr. Ren delivered a talk at DFRWS EU 2015: there are eight approaches to solving this problem.

    We designed a novel similarity digesting scheme, TSFP.

    Approximate containment (AM2): two heuristic algorithms.

    Approximate cross-sharing (AM3): one heuristic algorithm.
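For the containment case (EM2), the slide names the Rabin-Karp algorithm. A minimal sketch of it over byte strings follows; the function names and the choice of `base` and `mod` are illustrative assumptions, not taken from the talk.

```python
def rabin_karp_find(pattern: bytes, text: bytes,
                    base: int = 257, mod: int = (1 << 61) - 1) -> int:
    """Return the first index at which pattern occurs in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    if m > n:
        return -1
    high = pow(base, m - 1, mod)  # weight of the byte that leaves the window
    p_hash = w_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + pattern[i]) % mod
        w_hash = (w_hash * base + text[i]) % mod
    for i in range(n - m + 1):
        # Compare the bytes only when the hashes agree, to rule out collisions.
        if w_hash == p_hash and text[i:i + m] == pattern:
            return i
        if i + m < n:
            # Roll the window: drop text[i], append text[i + m].
            w_hash = ((w_hash - text[i] * high) * base + text[i + m]) % mod
    return -1


def contains(r: bytes, t: bytes) -> bool:
    """EM2 check: the shorter string occurs inside the longer one."""
    short, long_ = (r, t) if len(r) <= len(t) else (t, r)
    return rabin_karp_find(short, long_) != -1
```

The rolling hash makes each window update constant time, so the expected cost is linear in the text length when hash collisions are rare.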

  • Searching

    For the relationship BR, the searching problem is: B is a bag of strings. Given a string T, find s ∈ B such that BR(T, s) = 1.

  • Searching

    How do we solve the searching problem?

    Brute force approach: for every s ∈ B, evaluate BR(T, s). Can this scale to millions or billions of strings?

    Candidate selection approach, in two steps:

    STEP 1: quickly select a few candidates {s1, s2, …, sm}.

    STEP 2: evaluate BR(T, sk) for each candidate.

    How do we select good candidates?

    String fingerprinting: generate fingerprints from each string in B.

    Indexing process: index the fingerprints along with the string IDs to create an index database, FP-DB.

    Searching process: given T, generate fingerprints {FP1, FP2, …, FPq} and use them to look up possible candidates in FP-DB.

    NOTE:

    This is similar to a keyword-based search engine in which the keywords are the fingerprints.

    The fingerprinting procedure is in effect a special tokenization method.
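The two-step candidate selection approach can be sketched as an inverted index from fingerprints to string IDs. The slides do not specify the fingerprinting method, so this sketch assumes hashed k-byte shingles with simple sampling; `fingerprints`, `build_fp_db`, `search`, and their parameters are hypothetical names, and `br` stands for whatever BR evaluation is in use.

```python
import hashlib
from collections import Counter, defaultdict


def fingerprints(s: bytes, k: int = 8, keep_one_in: int = 4) -> set:
    """Assumed scheme: hash every k-byte shingle and keep roughly one in
    keep_one_in of the hashes as fingerprints."""
    fps = set()
    for i in range(len(s) - k + 1):
        digest = hashlib.blake2b(s[i:i + k], digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        if h % keep_one_in == 0:
            fps.add(h)
    return fps


def build_fp_db(bag: dict, **fp_args) -> dict:
    """Indexing process: map each fingerprint to the IDs of strings having it."""
    fp_db = defaultdict(set)
    for sid, s in bag.items():
        for fp in fingerprints(s, **fp_args):
            fp_db[fp].add(sid)
    return fp_db


def search(t: bytes, bag: dict, fp_db: dict, br, max_candidates: int = 10,
           **fp_args) -> list:
    """STEP 1: rank string IDs by shared fingerprints with t.
    STEP 2: evaluate BR only on the top candidates."""
    hits = Counter()
    for fp in fingerprints(t, **fp_args):
        for sid in fp_db.get(fp, ()):
            hits[sid] += 1
    candidates = [sid for sid, _ in hits.most_common(max_candidates)]
    return [sid for sid in candidates if br(t, bag[sid]) == 1]
```

The same `fp_args` must be used for indexing and searching, since fingerprints produced with different settings will not line up in FP-DB.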

  • Future Work: Clustering Problem

    For the relationship BR, one has a clustering problem: B is a bag of strings; partition B into groups of strings based on BR.
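The slides leave the clustering method as future work. One possible reading, shown here purely as an assumption, treats BR as the edges of a graph and takes the connected components as the groups, since BR need not be transitive. The `cluster` function and its union-find helper are illustrative names.

```python
def cluster(bag: list, br) -> list:
    """Group strings so that any two linked by a chain of BR = 1 pairs
    end up in the same group (connected components via union-find)."""
    parent = list(range(len(bag)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(bag)):
        for j in range(i + 1, len(bag)):
            if br(bag[i], bag[j]) == 1:
                parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(bag):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())
```

This pairwise version evaluates BR a quadratic number of times, so it is only a baseline; a fingerprint index of the kind used for searching could restrict the comparisons to likely pairs.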

  • Future Work: Library and tools

    Analyze the algorithms and measure their performance. Verify that they can scale.

    For bytewise approximate matching, searching and clustering, provide:

    Library of functions

    API

    Tools

  • Application examples of Approximate Matching, Searching, Clustering

    E-Discovery

    Comparing near duplicate documents

    Grouping near duplicate documents

    Digital forensic analysis

    Identifying similar objects or files

    Malware analysis

    Identifying similar malware or mutated malware

    Anti-plagiarism

    Detection of copyright violations

    Source code governance

    Spam filtering

    Data Loss Prevention


  • Q&A

    Thank you.

    Any questions?

    Email: [email protected]

    [email protected]


  • Application Example

    A search problem in a DLP (Data Loss Prevention) system:

    Problem: S = {d1, d2, …, dn} is a collection of confidential documents.

    Given any document T and 0

  • Application Example

    A clustering problem in e-Discovery:

    Data are identified as potentially relevant by attorneys.

    De-duplication technology.

    Problem: partition S into groups based on textual relevance.

  • Background

    Similarity digesting schemes: a family of similarity-preserving hashing techniques and tools.

    Problem: given two binary strings s1 and s2, measure their similarity as s = SIM(H(s1), H(s2)).

    H is a hash function that preserves string similarity.

    SIM is a second function that measures the similarity of two hash values.

    Examples: TLSH, ssdeep, sdhash

    Challenge: how do we evaluate their relative pros and cons?
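To make the s = SIM(H(s1), H(s2)) pattern concrete, here is a toy similarity digest. It is not TLSH, ssdeep, or sdhash: as an assumption for illustration, H keeps the smallest hashes of the string's k-byte shingles (a bottom-k sketch) and SIM estimates the Jaccard similarity of the shingle sets from two such sketches. The parameters `k` and `size` are arbitrary.

```python
import hashlib


def H(s: bytes, k: int = 5, size: int = 64) -> frozenset:
    """Toy digest: the `size` smallest 64-bit hashes of all k-byte shingles."""
    hashes = {
        int.from_bytes(hashlib.blake2b(s[i:i + k], digest_size=8).digest(), "big")
        for i in range(max(len(s) - k + 1, 1))
    }
    return frozenset(sorted(hashes)[:size])


def SIM(h1: frozenset, h2: frozenset, size: int = 64) -> float:
    """Estimate Jaccard similarity from two digests: among the `size`
    smallest hashes of the union, count those present in both digests."""
    smallest = sorted(h1 | h2)[:size]
    shared = sum(1 for x in smallest if x in h1 and x in h2)
    return shared / len(smallest)


# Usage: a file and a copy with one byte changed should score close to 1.
original = bytes(range(256))
edited = original[:100] + b"\x00" + original[101:]
score = SIM(H(original), H(edited))
```

The digest has a fixed maximum size regardless of input length, which is the property that lets such schemes compare large objects cheaply.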

  • Six Matching Problems and Bytewise Relevance

    Definition 2: Let X, Y ∈ {EM1, EM2, EM3, AM1, AM2, AM3}. If problem X is a special case of problem Y, we denote this as X ⊆ Y.

    We have the following relationships:

    EM1 ⊆ EM2 ⊆ EM3

    AM1 ⊆ AM2 ⊆ AM3