From DNA sequence variation to .NET bits and bobs Andrei Saygo Eoin Ward Mathieu Létourneau Microsoft Product Release & Security Services FireEye - Mandiant
Mathieu Letourneau, Andrei Saygo, Eoin Ward, Microsoft
This talk presents our research project on clustering .NET files based on their basic blocks, and the parallel that can be drawn with DNA sequence variation analysis. We implemented a system that extracts the basic blocks from each file and creates clusters based on them. We also developed an IDA plugin that uses this data to speed up our analysis of .NET files.
Andrei Saygo, Eoin Ward and Mathieu Letourneau all work as Anti-Malware Security Engineers in the AM Scan team of Microsoft’s Product Release & Security Services group in Dublin, Ireland.
1. Talk outline
2. About us
We analyse files on a daily basis to determine whether they are
malicious, including Windows 8 apps and Windows Phone apps. For the
past few years we have also been involved in fields such as
bioinformatics, molecular biology and genetics, which allows us to
borrow ideas and algorithms from biology and apply them to malware
classification and detection.
About us About DNA .NET disassembler Clustering IDA plugin
3. About DNA
- DNA is made of four chemical building blocks called nucleotides:
  adenine (A), thymine (T), cytosine (C) and guanine (G).
- A series of three nucleotides (called a codon) in a DNA sequence
  specifies a single amino acid.
- DNA sequences are translated into amino acids, which form proteins.
- Each DNA sequence that contains instructions to make a protein is
  known as a gene.
moleculesoflife2010.wikispaces.com/Protein+Structure
4. About DNA sequence variation
The human genome comprises about 3 billion base pairs of DNA. Due to
various factors, mutations occur, so the DNA sequence may change.
Single nucleotide polymorphisms, frequently called SNPs (pronounced
"snips"), are the most common type of genetic variation among people.
Each SNP represents a difference in a single DNA building block. SNPs
can act as biological markers, helping scientists locate genes that
are associated with disease.
5. About GWAS
A genome-wide association study (GWAS) is an approach used in
genetics research to associate specific genetic variations with
particular diseases. The method involves scanning the genomes (1
million SNPs) of many different people (healthy and carriers) and
looking for genetic markers that can be used to predict the presence
of a disease. The results of a GWAS are often displayed in a scatter
plot (called a Manhattan plot), in which the peaks indicate regions
of the genome associated with that disease.
Figure: Manhattan plot showing the -log10 P values of 606,164 SNPs in
the GWAS for 1,472 Japanese atopic dermatitis cases and 7,971
controls, plotted against their respective positions on autosomes and
the X chromosome. (Atopic dermatitis, also known as atopic eczema, is
a non-contagious itchy skin disorder.)
www.nature.com/ng/journal/v44/n11/fig_tab/ng.2438_F1.html
6. The DNA code is read three letters at a time (these DNA triplets
are called codons). Most codons correspond to a specific amino acid,
although several of the 64 codons code for the same amino acid. Three
of the codons are used as 'stop' signals (STOP codons) and another
one is the 'start' signal (START codon). This resembles the way a
disassembler works: the binary machine code plays the role of the DNA
sequence and the assembly instructions play the role of the amino
acids.

DNA sequence: CCCTGTGGAGCCACACCCTAG
Codons:       CCC TGT GGA GCC ACA CCC TAG

Codon  Amino acid    CIL (MSIL) bytes  Instruction
CCC    Proline       288B00000A        call
TGT    Cysteine      03                ldarg.1
GGA    Glycine       7D52000004        stfld
GCC    Alanine       02                ldarg.0
ACA    Threonine     04                ldarg.2
CCC    Proline       288B00000A        call
TAG    STOP          2A                ret
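The parallel can be sketched in a few lines of Python. This is only an illustration, not the disassembler described in the talk: the lookup tables `CODONS` and `OPCODES` are hypothetical and contain just the entries shown on this slide, and the operand sizes (4 bytes for `call` and `stfld`, none for the others) follow the byte sequences listed there.

```python
# Illustrative tables: only the entries that appear on the slide.
CODONS = {
    "CCC": "Proline", "TGT": "Cysteine", "GGA": "Glycine",
    "GCC": "Alanine", "ACA": "Threonine", "TAG": "STOP",
}
# opcode byte -> (mnemonic, operand size in bytes)
OPCODES = {
    0x02: ("ldarg.0", 0), 0x03: ("ldarg.1", 0), 0x04: ("ldarg.2", 0),
    0x28: ("call", 4), 0x7D: ("stfld", 4), 0x2A: ("ret", 0),
}


def translate(dna):
    """Read a DNA string three letters at a time until a STOP codon."""
    names = []
    for i in range(0, len(dna) - 2, 3):
        name = CODONS[dna[i:i + 3]]
        names.append(name)
        if name == "STOP":
            break
    return names


def disassemble(code):
    """Decode CIL bytes until ret, skipping each instruction's operand."""
    mnemonics, pos = [], 0
    while pos < len(code):
        mnemonic, operand_size = OPCODES[code[pos]]
        mnemonics.append(mnemonic)
        pos += 1 + operand_size
        if mnemonic == "ret":
            break
    return mnemonics


amino_acids = translate("CCCTGTGGAGCCACACCCTAG")
mnemonics = disassemble(bytes.fromhex("288B00000A037D520000040204288B00000A2A"))
print(amino_acids)
print(mnemonics)
```

Both functions walk a linear sequence, look each unit up in a table, and stop at a terminator (the STOP codon or `ret`), which is the resemblance the slide points at.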
7. The CLR header is reached through its IMAGE_DATA_DIRECTORY entry
in the PE header. From the CLR header we get the offset of the
MetaData header, which holds the number of streams. Immediately after
it come the headers of each stream contained in the file.

typedef struct CLR_HEADER {
    DWORD SizeOfStructure;
    WORD MajorRuntimeVersion;
    WORD MinorRuntimeVersion;
    IMAGE_DATA_DIRECTORY MetaData;
    ..

typedef struct METADATA_HEADER {
    IMAGE_DATA_DIRECTORY NoOfStreams;
    ..

typedef struct STREAM_HEADERSR {
    DWORD Offset;
    DWORD Size;
    unsigned char * Name;
    ..
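As a rough sketch of the last step, the Python below lists the stream headers found in a metadata root. It assumes the ECMA-335 layout of the metadata root (the "BSJB" signature, a length-prefixed version string, then a flags word and a stream count) rather than the abbreviated structs on the slide, and it is fed a hand-built sample blob instead of a real file; `parse_metadata_root` and `sample` are names made up for this example.

```python
import struct


def parse_metadata_root(data):
    """Return (version, [(name, offset, size), ...]) from a metadata root blob."""
    signature, _major, _minor, _reserved, version_len = struct.unpack_from(
        "<IHHII", data, 0)
    if signature != 0x424A5342:  # the bytes "BSJB"
        raise ValueError("not a CLI metadata root")
    pos = 16
    version = data[pos:pos + version_len].split(b"\0", 1)[0].decode("ascii")
    pos += version_len
    _flags, stream_count = struct.unpack_from("<HH", data, pos)
    pos += 4
    streams = []
    for _ in range(stream_count):
        offset, size = struct.unpack_from("<II", data, pos)
        pos += 8
        end = data.index(b"\0", pos)
        name = data[pos:end].decode("ascii")
        # The name is NUL-terminated and padded to a 4-byte boundary.
        pos += (end - pos + 1 + 3) & ~3
        streams.append((name, offset, size))
    return version, streams


# Hand-built sample with two streams (values are arbitrary).
sample = (
    struct.pack("<IHHII", 0x424A5342, 1, 1, 0, 12) + b"v4.0.30319\0\0"
    + struct.pack("<HH", 0, 2)
    + struct.pack("<II", 0x6C, 0x100) + b"#~\0\0"
    + struct.pack("<II", 0x16C, 0x80) + b"#Strings\0\0\0\0"
)
version, streams = parse_metadata_root(sample)
print(version, streams)
```

In a real file the blob would start at the file offset that corresponds to the MetaData RVA stored in the CLR header; mapping that RVA to a file offset through the section table is left out here.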
8. We are interested in #~ (the metadata stream) because it contains
the information about the methods.
- The #~ table header contains a QWORD bitmask that tells us which
  tables are present in this stream (for example TypeRef, TypeDef,
  MethodDef, Field, etc.). Of these, we are interested in the
  MethodDef table because it contains the RVAs of the method bodies.
- Following the #~ header there is a set of DWORDs specifying the
  number of rows of each table that is present.
- After them come the actual metadata tables.
- The RVA within the MethodDef table tells us where the body of the
  method can be found.

typedef struct TABLE_HEADER {
    DWORD Reserved;
    WORD MajorVersion;
    WORD MinorVersion;
    QWORD ValidMask;
    ..

typedef struct TABLE_METHODDEF {
    DWORD RVA;
    WORD ImplFlags;
    WORD Flags;
    WORD NameIndex;
    ..
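The bitmask and row-count steps can be sketched as follows. This is a minimal example that assumes the ECMA-335 layout of the #~ header (one-byte version fields, a HeapSizes byte and a reserved byte, which still places the valid mask at offset 8, followed by a second QWORD for sorted tables). The function name and the sample bytes are invented for illustration.

```python
import struct

METHODDEF = 0x06  # table number of MethodDef in ECMA-335


def parse_table_stream_header(data):
    """Return (heap_sizes, {table_number: row_count}, offset_of_first_table)."""
    (_reserved, _major, _minor, heap_sizes, _reserved2,
     valid, _sorted) = struct.unpack_from("<IBBBBQQ", data, 0)
    pos = 24
    rows = {}
    for table in range(64):
        # One DWORD row count is stored for every bit set in the mask.
        if valid & (1 << table):
            (rows[table],) = struct.unpack_from("<I", data, pos)
            pos += 4
    return heap_sizes, rows, pos


# Sample header: Module (0), TypeRef (1), TypeDef (2) and MethodDef (6) present.
sample = (struct.pack("<IBBBBQQ", 0, 2, 0, 0, 1, 0x47, 0)
          + struct.pack("<4I", 1, 5, 3, 12))
heap_sizes, rows, tables_start = parse_table_stream_header(sample)
print(rows.get(METHODDEF), tables_start)
```

Reaching the MethodDef rows themselves additionally requires the row size of every table stored before it, which depends on `heap_sizes` and the row counts; that part is omitted.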
9. For each method the RVA is the offset to the first instruction.
Common Intermediate Language (CIL, formerly MSIL) instructions use a
variable-length encoding in which 1 or 2 bytes represent the
instruction. We disassemble from the first instruction until we reach
RET (opcode 0x2A in CIL). All the instructions are split into basic
blocks and we pick only the first operand (FOP). A set of rules
filters out garbage instructions. We then compute a CRC over the list
of FOPs and add it to the database.

CIL (MSIL)   FOPs
288B00000A   call
03           ldarg.1
7D52000004   stfld
02           ldarg.0
04           ldarg.2
288B00000A   call
2A           ret
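A minimal sketch of the hashing step, under several assumptions that the slides do not spell out: a basic block is taken to end after a branch, `ret` or `throw` (a complete tool would also split at branch targets), the set `BLOCK_ENDERS` is deliberately partial, the FOPs of a block are joined with spaces, and the CRC is CRC-32 as provided by Python's `zlib`. The garbage-filtering rules are not reproduced.

```python
import zlib

# Assumption: a partial list of mnemonics that terminate a basic block.
BLOCK_ENDERS = {
    "ret", "throw", "br", "br.s",
    "brtrue", "brtrue.s", "brfalse", "brfalse.s",
}


def basic_blocks(mnemonics):
    """Split a linear list of mnemonics into blocks ending at a terminator."""
    blocks, current = [], []
    for mnemonic in mnemonics:
        current.append(mnemonic)
        if mnemonic in BLOCK_ENDERS:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def block_crc(block):
    """CRC-32 over the space-joined mnemonics of one block."""
    return zlib.crc32(" ".join(block).encode("ascii"))


method = ["ldarg.0", "brtrue.s", "ldarg.1", "ret", "call", "ldarg.2", "ret"]
blocks = basic_blocks(method)
crcs = [block_crc(block) for block in blocks]
print(blocks)
print([hex(crc) for crc in crcs])
```

Because operands such as tokens and offsets are dropped before hashing, two blocks that differ only in those operands produce the same CRC, which is what lets the values act as shared features across files.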
10. Clustering
11. Clustering - basics
Feature set:
- CRC IDs representing the hashes of each list of FOPs present in a
  given file
- Double[ ] file1 = [1, 32, 5673, 5674, 5675, 18001, ..., 18607];
Distance measure:
- Jaccard index: size of the intersection divided by the size of the
  union of two sets.
- Derivative we use: size of the smaller of the two sets divided by
  the size of the union.
- This gives a similarity value between 0 and 1; subtracting it from
  1 gives us a distance measure.
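Both measures can be written down in a few lines. The sketch below uses Python sets of CRC IDs; `slide_variant_similarity` is a literal reading of the derivative as worded on the slide and may not match the authors' actual implementation, and treating two empty sets as identical is an added assumption.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def slide_variant_similarity(a, b):
    """Literal reading of the slide: smaller set size divided by union size."""
    union = a | b
    return min(len(a), len(b)) / len(union) if union else 1.0


def distance(similarity):
    """Turn a similarity in [0, 1] into a distance."""
    return 1.0 - similarity


file1 = {1, 32, 5673, 5674}
file2 = {32, 5673, 5674, 18001}
jaccard = jaccard_similarity(file1, file2)        # 3 shared of 5 distinct
variant = slide_variant_similarity(file1, file2)  # 4 of 5 distinct
print(distance(jaccard), distance(variant))
```

For both measures the similarity can never exceed the ratio of the smaller set size to the larger one, so a large size difference between two files already implies a large distance.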
12. Assume 0.01 s on average per distance computation.
A simplistic implementation would have a complexity of O(n^2):
- Computing the distance for every possible pair of files
- For example, imagine having to cluster 1500 files:
  1500^2 * 0.01 s = 22,500 s (6.25 hours)
Clearly this doesn't scale well.
13. Our mitigation techniques to improve speed:
- Load all the files in memory and order them by the number of FOPs
  they contain.
- Only compute the distance when the size ratio is within the
  threshold value; this is possible due to the properties of our
  distance computation function.
- Use prototypes for agglomerative clustering:
  - In each cluster, the smallest file is elected as the prototype
    that represents that cluster.
  - When doing agglomerative clustering, new files are compared to
    the prototype of each cluster until we find a distance within the
    threshold; otherwise the file is put in a new cluster.
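The three techniques combine naturally, as the following sketch suggests. It is not the authors' code: it uses the plain Jaccard distance rather than their derivative, interprets the threshold as a maximum distance, and represents each file as a set of CRC IDs; `cluster` and the sample file names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard index; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster(files, threshold):
    """Group files (name -> set of CRC IDs) around the smallest member."""
    # Smallest first, so the first file of a cluster is its prototype.
    ordered = sorted(files.items(), key=lambda item: len(item[1]))
    clusters = []  # list of (prototype_features, member_names)
    for name, features in ordered:
        for prototype, members in clusters:
            # Size-ratio pruning: similarity cannot exceed |small| / |large|,
            # so skip the set operations when that bound is already too low.
            if features and len(prototype) / len(features) < 1.0 - threshold:
                continue
            if jaccard_distance(prototype, features) <= threshold:
                members.append(name)
                break
        else:
            clusters.append((features, [name]))
    return [members for _, members in clusters]


# Synthetic files whose feature counts match the numbers in the animation.
files = {"f%d" % n: set(range(n)) for n in (90, 35, 88, 87, 40, 92)}
result = cluster(files, threshold=0.30)
print(result)
```

Each new file is compared with one prototype per cluster instead of with every file seen so far, and the size check rejects many candidates before any set operation is performed.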
15-33. Clustering animation (figure sequence, Threshold = 30%). The
frames step through the values 90, 35, 88, 87, 40 and 92 as they are
grouped; one frame is annotated "35 above threshold!".
34. Clustering speed

Threshold of 80%
Number of files to cluster   Time taken to complete (seconds)
312                          6.655
1000                         81.644
1500                         759.799
4604                         945.557
7380                         1941.852

Threshold of 20%
Number of files to cluster   Time taken to complete (seconds)
840                          3.058
1500                         14
7380                         35.475
35. The time taken to cluster the same 1500 files from the previous
example is now drastically improved and depends on the threshold
value:
- With the simplistic approach: 22,500 s
- With mitigation techniques and a threshold of 80%: 760 s
- With mitigation techniques and a threshold of 20%: 14 s
36. Viewing the clustered data
37. We need:
- a file from the database that we know is malicious (we've selected
  Pameseg/ArchSMS)
- a loose cluster that the file is part of (we've selected a cluster
  that had 399 files)
Algorithm:
- For each CRC present in the target file, extract the number of
  files in which that CRC is present.
- Calculate the median and remove everything that's above it, based
  on the assumption that the most prevalent CRCs are clean (they are
  also found in clean files). After this step we had 285 files.
- Use the following formula to get the CRCs that are most probably
  malicious.
  k: total number of CRCs
  Nfi: number of files containing a specific CRC
  p: the default p-value (0.05)
  Di: distance of the specific CRC
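The first two steps of the algorithm (counting files per CRC and cutting at the median) might look like the sketch below. The scoring formula itself is not reproduced. The function name `rare_crcs`, the toy data and the choice to keep CRCs whose count equals the median are assumptions made for this example.

```python
from statistics import median


def rare_crcs(target_crcs, files):
    """Count the files containing each target CRC; keep those at or below the median."""
    prevalence = {
        crc: sum(crc in features for features in files.values())
        for crc in target_crcs
    }
    cutoff = median(prevalence.values())
    kept = {crc: count for crc, count in prevalence.items() if count <= cutoff}
    return kept, cutoff


# Toy cluster: file name -> set of CRC IDs.
files = {"a": {1, 2, 3}, "b": {1, 2}, "c": {1, 4}, "d": {1, 3}}
kept, cutoff = rare_crcs(files["a"], files)
print(kept, cutoff)
```

In this toy cluster CRC 1 appears in all four files and is dropped as probably clean, while CRCs 2 and 3 remain as candidates for scoring.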
38. - Using the set of data from
gettinggeneticsdone.blogspot.com/2011/04/annotated-manhattan-plots-and-qq-plots.html
(200,000 SNPs) and applying the same approach we get:
39. Applying the formula to our example dataset of 285 files (what
was left after we applied the median) gave a result similar to the
one obtained with the GWAS data. We took the first two CRCs and ran a
query for each one to see which files contain them. The result was a
set of 10 files, all of which were found to be malicious and from the
same family (Pameseg/ArchSMS).
40. IDA Python Plugin
46. Following what geneticists do to analyse genetic variants and
identify their link to various diseases, we have implemented a
comparable approach that helps us automatically identify malicious
files.
47. The IDA plugin shows the areas of the code that require more
attention, which reduces the time needed for manual analysis. The
clustering algorithm can be extended to other features such as
instructions, behaviour data, etc. In the future we plan to extend
the approach to other types of files and other platforms.
48. - Will this method be effective with packed files?
- Will this method be effective with obfuscated .NET files?
- Does the plugin improve analysis time?
- Can the CRCs be used as part of generic detections / family
  classification?
- What is the effect of the speed mitigation strategies and of using
  a derivative of the Jaccard index?
- Other questions, thoughts, etc.