From DNA Sequence Variation to .NET Bits and Bobs

1. Talk outline

2. About us
We analyse files on a daily basis to determine whether they are malicious, and that includes Windows 8 apps and Windows Phone apps. For the past few years we have also been involved in fields such as bioinformatics, molecular biology and genetics, which has allowed us to take some of the ideas and algorithms used in biology and apply them to malware classification and detection.
About us About DNA .NET disassembler Clustering IDA plugin

3. About DNA
- DNA is made of four chemical building blocks called nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G).
- A series of three nucleotides (called a codon) in a DNA sequence specifies a single amino acid.
- DNA sequences are translated into amino acids, which make up proteins.
- Each DNA sequence that contains the instructions to make a protein is known as a gene.
moleculesoflife2010.wikispaces.com/Protein+Structure

4. About DNA sequence variation
The human genome comprises about 3 billion base pairs of DNA. Mutations caused by various factors can change the DNA sequence. Single nucleotide polymorphisms, frequently called SNPs (pronounced "snips"), are the most common type of genetic variation among people. Each SNP represents a difference in a single DNA building block. SNPs can act as biological markers, helping scientists locate genes that are associated with disease.

5. About GWAS
A genome-wide association study (GWAS) is an approach used in genetics research to associate specific genetic variations with particular diseases. The method involves scanning the genomes (1 million SNPs) of many different people (healthy and carriers) and looking for genetic markers that can be used to predict the presence of a disease. The results of a GWAS are often displayed in a scatter plot called a Manhattan plot, in which the peaks indicate regions of the genome associated with that disease.
Figure: Manhattan plot showing the -log10 P values of 606,164 SNPs in the GWAS for 1,472 Japanese atopic dermatitis cases and 7,971 controls, plotted against their respective positions on the autosomes and the X chromosome. (Atopic dermatitis, also known as atopic eczema, is a non-contagious itchy skin disorder.)
www.nature.com/ng/journal/v44/n11/fig_tab/ng.2438_F1.html

6. The DNA code is read three letters at a time (these DNA triplets are called codons). Most of the codons correspond to a specific amino acid, but several of the 64 codons code for the same amino acid. Three of the codons are used as 'stop' signals (STOP codons) and another is the 'start' signal (START codon). This resembles the way a disassembler works: the binary machine code is the DNA sequence and the assembly code is the amino acids.

    DNA sequence:  CCCTGTGGAGCCACACCCTAG
    Codons:        CCC TGT GGA GCC ACA CCC TAG

    Amino acids        CIL (MSIL) instructions
    CCC - Proline      288B00000A  call
    TGT - Cysteine     03          ldarg.1
    GGA - Glycine      7D52000004  stfld
    GCC - Alanine      02          ldarg.0
    ACA - Threonine    04          ldarg.2
    CCC - Proline      288B00000A  call
    TAG - STOP         2A          ret
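
The codon analogy can be sketched in a few lines of Python. This is only an illustration: the helper names (`split_codons`, `translate`) are hypothetical, and the lookup table covers just the codons shown on the slide, not the full genetic code.

```python
# Partial codon table: only the codons that appear on the slide.
CODON_TABLE = {
    "CCC": "Proline",
    "TGT": "Cysteine",
    "GGA": "Glycine",
    "GCC": "Alanine",
    "ACA": "Threonine",
    "TAG": "STOP",
}


def split_codons(dna):
    """Split a DNA string into consecutive three-letter codons.

    Trailing letters that do not form a full codon are ignored.
    """
    usable = len(dna) - len(dna) % 3
    return [dna[i:i + 3] for i in range(0, usable, 3)]


def translate(dna):
    """Map each codon to an amino acid name, stopping after a STOP codon."""
    result = []
    for codon in split_codons(dna):
        name = CODON_TABLE.get(codon, "unknown")
        result.append(name)
        if name == "STOP":
            break
    return result


acids = translate("CCCTGTGGAGCCACACCCTAG")
```

Reading fixed-size codons until a STOP codon mirrors decoding instructions until `ret`; the main difference is that CIL instructions have variable length.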

7. The CLR header can be reached through the IMAGE_DATA_DIRECTORY structure. From it we get the offset to the MetaData header, which holds the number of streams. Immediately after that come the headers for each stream contained in the file.

    typedef struct CLR_HEADER {
        DWORD SizeOfStructure;
        WORD MajorRuntimeVersion;
        WORD MinorRuntimeVersion;
        IMAGE_DATA_DIRECTORY MetaData;
        ..

    typedef struct METADATA_HEADER {
        IMAGE_DATA_DIRECTORY NoOfStreams;
        ..

    typedef struct STREAM_HEADERSR {
        DWORD Offset;
        DWORD Size;
        unsigned char * Name;
        ..
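
As a rough sketch of the stream-header walk, the Python function below reads the stream headers from a buffer that starts at the metadata root. The layout used here (the "BSJB" signature, a length-prefixed version string, a 16-bit stream count, then Offset/Size/Name entries with names padded to 4 bytes) follows the ECMA-335 description rather than the abbreviated structs on the slide, and the function name `parse_stream_headers` is hypothetical.

```python
import struct

METADATA_SIGNATURE = 0x424A5342  # "BSJB" read as a little-endian DWORD


def parse_stream_headers(md):
    """Return {stream name: (offset, size)} for a metadata root buffer.

    Assumes `md` begins at the metadata root and is not truncated.
    """
    sig, _major, _minor, _reserved, version_len = struct.unpack_from("<IHHII", md, 0)
    if sig != METADATA_SIGNATURE:
        raise ValueError("not a CLR metadata root")
    pos = 16 + version_len                 # skip the padded version string
    _flags, count = struct.unpack_from("<HH", md, pos)
    pos += 4
    streams = {}
    for _ in range(count):
        offset, size = struct.unpack_from("<II", md, pos)
        pos += 8
        end = md.index(b"\x00", pos)       # name is NUL-terminated ASCII
        name = md[pos:end].decode("ascii")
        pos = (end + 1 + 3) & ~3           # next header starts 4-byte aligned
        streams[name] = (offset, size)
    return streams
```

The offsets returned are relative to the start of the metadata root, so the `#~` stream can be located by adding its offset to the root's file position.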

8. We are interested in #~ (the metadata stream) because it contains the information about the methods.
- The #~ table header contains a QWORD bitmask that tells us which tables are present in this stream (for example the TypeRef, TypeDef, MethodDef and Field tables). Of these, we are interested in the MethodDef table because it contains the RVAs of the method bodies.
- Following the #~ header is a set of DWORDs specifying the number of rows for each table that is present.
- After them come the actual metadata tables.
- The RVA within the MethodDef table tells us where the body of the method can be found.

    typedef struct TABLE_HEADER {
        DWORD Reserved;
        WORD MajorVersion;
        WORD MinorVersion;
        QWORD ValidMask;
        ..

    typedef struct TABLE_METHODDEF {
        DWORD RVA;
        WORD ImplFlags;
        WORD Flags;
        WORD NameIndex;
        ..
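
The bitmask and row-count layout described above can be sketched as follows. The field layout is taken from ECMA-335 (a 24-byte header ending in the Valid and Sorted QWORDs), which is an assumption beyond the truncated struct on the slide; `parse_table_header` and the partial `TABLE_NAMES` map are hypothetical names used only for illustration.

```python
import struct

# Partial table-number map; unlisted tables are reported by hex number.
TABLE_NAMES = {
    0x00: "Module",
    0x01: "TypeRef",
    0x02: "TypeDef",
    0x04: "Field",
    0x06: "MethodDef",
}


def parse_table_header(stream):
    """Return ({table name: row count}, offset of the first table).

    Assumed layout: Reserved DWORD, MajorVersion BYTE, MinorVersion BYTE,
    HeapSizes BYTE, Reserved BYTE, Valid QWORD, Sorted QWORD, then one
    DWORD row count per bit set in Valid, lowest bit first.
    """
    fields = struct.unpack_from("<IBBBBQQ", stream, 0)
    valid = fields[5]
    pos = 24
    rows = {}
    for table in range(64):
        if valid & (1 << table):
            (count,) = struct.unpack_from("<I", stream, pos)
            pos += 4
            rows[TABLE_NAMES.get(table, hex(table))] = count
    return rows, pos
```

Reaching the MethodDef rows additionally requires the row size of every table that precedes it, which depends on the heap-size flags and row counts; that step is left out of this sketch.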

9. For each method the RVA is the offset to the first instruction. Common Intermediate Language (CIL, formerly MSIL) instructions use a variable-length encoding in which the opcode takes 1 or 2 bytes. We disassemble from the first instruction until we reach RET (opcode 0x2A in CIL). All the instructions are split into basic blocks and we pick only the first operand (FOP). A set of rules filters out garbage instructions. We then compute a CRC over the list of FOPs and add it to the database.

    CIL (MSIL)     FOPs
    288B00000A     call
    03             ldarg.1
    7D52000004     stfld
    02             ldarg.0
    04             ldarg.2
    288B00000A     call
    2A             ret
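
A minimal sketch of this step, under several assumptions: the buffer already starts at the first instruction, only the six opcodes from the slide are known, a FOP is taken to be the instruction mnemonic without its operand (as in the table above), and the CRC is CRC-32 over the space-joined mnemonics. Basic-block splitting and the garbage-filtering rules are not modelled, and the function names are hypothetical.

```python
import zlib

# Opcode byte -> (mnemonic, operand size in bytes). Tiny subset only.
OPCODES = {
    0x02: ("ldarg.0", 0),
    0x03: ("ldarg.1", 0),
    0x04: ("ldarg.2", 0),
    0x28: ("call", 4),    # 4-byte metadata token
    0x7D: ("stfld", 4),   # 4-byte metadata token
    0x2A: ("ret", 0),
}


def extract_fops(code):
    """Decode instructions until ret and return their mnemonics.

    Raises KeyError on any opcode outside the subset above.
    """
    fops = []
    pos = 0
    while pos < len(code):
        mnemonic, operand_size = OPCODES[code[pos]]
        fops.append(mnemonic)
        pos += 1 + operand_size
        if mnemonic == "ret":
            break
    return fops


def fops_crc(fops):
    """CRC-32 of the space-joined mnemonic list (an assumed encoding)."""
    return zlib.crc32(" ".join(fops).encode("ascii"))


body = bytes.fromhex("288B00000A" "03" "7D52000004" "02" "04" "288B00000A" "2A")
method_fops = extract_fops(body)
method_crc = fops_crc(method_fops)
```

Because operands such as metadata tokens are dropped before hashing, two methods that differ only in the tokens they reference produce the same CRC in this sketch.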

10. Clustering

11. Clustering - basics
Feature set:
- CRC IDs representing the hashes of each list of FOPs present in a given file
- Double[ ] file1 = [1, 32, 5673, 5674, 5675, 18001, ..., 18607];
Distance measure:
- Jaccard index: the size of the intersection divided by the size of the union of two sets.
- The derivative we use: the size of the smaller of the two sets divided by the size of the union.
- This gives a similarity value between 0 and 1; subtracting it from 1 gives us a distance measure.
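
The two measures can be written down directly. The sketch below implements the derivative exactly as it is worded on the slide (smaller set size over union size); the function names are hypothetical, and treating two empty sets as identical is an assumption.

```python
def jaccard(a, b):
    """Classic Jaccard index: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def slide_similarity(a, b):
    """Derivative as worded on the slide: min(|A|, |B|) / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return min(len(a), len(b)) / len(union) if union else 1.0


def distance(a, b):
    """Distance in [0, 1]: one minus the similarity."""
    return 1.0 - slide_similarity(a, b)


file1 = {1, 2, 3, 4}
file2 = {3, 4, 5, 6, 7, 8}
```

For `file1` and `file2` the union has 8 elements, so the Jaccard index is 2/8 and the slide's similarity is 4/8. Since the union is never smaller than the larger set, the similarity can never exceed the ratio of the smaller size to the larger size.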

12. Assume 0.01 s on average per distance computation.
A simplistic implementation has a complexity of O(n^2):
- It computes the distance for every possible pair of files.
- For example, clustering 1500 files takes 1500^2 * 0.01 = 22500 s (6.25 hours).
Clearly this doesn't scale well.
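
The estimate can be reproduced with a couple of helper functions (hypothetical names, using the 0.01 s figure assumed on the slide). The second function counts each unordered pair once, which halves the work but is still quadratic.

```python
def naive_cost_seconds(n_files, seconds_per_distance=0.01):
    """Cost of comparing every file against every file (n^2 comparisons)."""
    return n_files ** 2 * seconds_per_distance


def unique_pairs(n_files):
    """Number of unordered pairs of distinct files: n * (n - 1) / 2."""
    return n_files * (n_files - 1) // 2


cost = naive_cost_seconds(1500)   # seconds
hours = cost / 3600
```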

13. Our mitigation techniques to improve speed:
- Load all the files in memory and order them by the number of FOPs they contain.
- Only compute the distance when the size ratio of the two files is within the threshold value. This is possible due to the properties of our distance function.
- Use prototypes for agglomerative clustering:
  - In each cluster, the smallest file is elected as the prototype that represents that cluster.
  - When doing agglomerative clustering, new files are compared to the prototype of each cluster until we find a distance within the threshold; otherwise the file is put in a new cluster.
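
The three techniques fit together roughly as in the sketch below. It is an illustration under assumptions, not the authors' implementation: files are non-empty sets of CRC IDs, the threshold is a distance between 0 and 1, the distance is the slide's derivative of the Jaccard index, and the names `cluster`, `prototype` and `members` are hypothetical.

```python
def distance(a, b):
    """1 - min(|A|, |B|) / |A | B|, as worded on the earlier slide."""
    return 1.0 - min(len(a), len(b)) / len(a | b)


def cluster(files, threshold):
    """Prototype-based clustering of {name: set of CRC IDs}.

    Files are processed from smallest to largest, so the first member of
    a cluster is its smallest file and serves as the prototype.
    """
    ordered = sorted(files, key=lambda name: len(files[name]))
    clusters = []
    for name in ordered:
        feats = files[name]
        placed = False
        for c in clusters:
            proto = files[c["prototype"]]
            # Size-ratio pruning: the union is at least as large as the
            # bigger set, so the distance is at least 1 - small/large.
            if 1.0 - len(proto) / len(feats) > threshold:
                continue
            if distance(proto, feats) <= threshold:
                c["members"].append(name)
                placed = True
                break
        if not placed:
            clusters.append({"prototype": name, "members": [name]})
    return clusters


example = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 4, 5},
    "c": set(range(100, 120)),
    "d": {1, 2, 3, 4, 6},
}
result = cluster(example, threshold=0.3)
```

In the example, file "c" is never compared against the prototype "a" because their size ratio alone already puts the distance above the threshold; this pruning is what saves the expensive set operations.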


15. Clustering animation (threshold = 30%)
[Animation frames, slides 15 to 25: the example values 90, 35, 88, 87, 40 and 92 are grouped step by step; one frame flags 35 as "above threshold!".]

34. Clustering speed

    Threshold of 80%
    Number of files to cluster    Time taken to complete (seconds)
    312                           6.655
    1000                          81.644
    1500                          759.799
    4604                          945.557
    7380                          1941.852

    Threshold of 20%
    Number of files to cluster    Time taken to complete (seconds)
    840                           3.058
    1500                          14
    7380                          35.475

35. The time taken to cluster the same 1500 files from the previous example is now drastically improved and tracks the threshold value:
- With the simplistic approach: 22500 s
- With mitigation techniques and a threshold of 80%: 760 s
- With mitigation techniques and a threshold of 20%: 14 s

36. Viewing the clustered data

37. We need:
- a file from the database that we know is malicious (we've selected Pameseg/ArchSMS)
- a loose cluster that the file is part of (we've selected a cluster that had 399 files)
Algorithm:
- For each CRC present in the target file, we extract the number of files in which that CRC is present.
- We calculate the median and remove everything that's above it, based on the assumption that the most prevalent CRCs are clean (they are also found in clean files). After this step we were left with 285 files.
- We use the following formula to get the CRCs that are most probably malicious [formula not preserved in this transcript], where:
  k   = total number of CRCs
  Nfi = number of files containing a specific CRC
  p   = the default p-value (0.05)
  Di  = distance of the specific CRC
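
The prevalence count and median cut can be sketched as below. This covers only the first two steps of the algorithm; the scoring formula itself is not reproduced. The sketch assumes that "everything above" refers to CRCs whose file count exceeds the median, that the target file has at least one CRC, and that the database is a mapping from file name to its set of CRCs; `rare_crcs` is a hypothetical name.

```python
from statistics import median


def rare_crcs(target_crcs, database):
    """Return {crc: file count} for target CRCs at or below the median count.

    `database` maps a file name to the set of CRCs found in that file.
    """
    prevalence = {
        crc: sum(1 for crcs in database.values() if crc in crcs)
        for crc in target_crcs
    }
    cutoff = median(prevalence.values())
    return {crc: n for crc, n in prevalence.items() if n <= cutoff}


database = {
    "f1": {1, 2, 3},
    "f2": {1, 2},
    "f3": {1},
    "f4": {1, 3},
}
kept = rare_crcs({1, 2, 3}, database)
```

In this toy database CRC 1 appears in all four files and is discarded as probably clean, while CRCs 2 and 3 (two files each, equal to the median) are kept as candidates.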

38. Using the data set from gettinggeneticsdone.blogspot.com/2011/04/annotated-manhattan-plots-and-qq-plots.html (200,000 SNPs) and applying the same approach, we get: [plot not preserved in this transcript]

39. Applying the formula to our example data set of 285 files (what was left after we applied the median), we got a result similar to the one for the GWAS data. We took the first two CRCs and ran a query for each one to see which files contain them. The result was a set of 10 files, all of which were found to be malicious and from the same family (Pameseg/ArchSMS).

40. IDA Python Plugin

46. Just as geneticists analyse genetic variants to identify their link to various diseases, we have implemented a similar approach to help us automatically identify malicious files.

47. The IDA plugin shows the areas of the code that require more attention, which will reduce the time needed for manual analysis. We can extend the clustering algorithm to other features such as instructions, behaviour data, etc. In the future we plan to extend the approach to other types of files and other platforms.

48. Questions
- Will this method be effective with packed files?
- Will this method be effective with obfuscated .NET files?
- Does the plugin improve analysis time?
- Can the CRCs be used as part of generic detections or family classification?
- What is the effect of the speed mitigation strategies and of using a derivative of the Jaccard index?
- Other questions, thoughts, etc.
