THE RISE OF DGA MALWARES ENRICO HUGO, S.KOM. , CEH IDNOG 4TH CONFERENCE | 27 JULY 2017 | JAKARTA, INDONESIA

10 - IDNOG04 - Enrico Hugo (Indonesia Honeynet Project) - The Rise of DGA Malwares


AGENDA

• Distributed Denial of Service

• Botnet Architectures

• Domain Generation Algorithm

• DGA Detection Techniques

• Reverse Engineering

• Zipf’s Law

• Maximum Consonant Sequence Length

• Hierarchical Clustering

DISTRIBUTED DENIAL OF SERVICE

• DDoS remains one of the most prominent threats in recent cyber-attack news

• Mirai, for example, used hundreds of thousands of infected network devices to perform DDoS attacks

• These devices form a network of zombies, or bots, called a “botnet”

• A botnet is controlled by a person or a group of people known as the “botmaster”

• The botmaster issues commands to the botnet once the bots have established connections to the Command-and-Control (C&C) server(s)

BOTNET ARCHITECTURES

STAR TOPOLOGY

MULTI SERVER C&C TOPOLOGY

HIERARCHICAL TOPOLOGY

RANDOM OR PEER-TO-PEER TOPOLOGY

BOTNET C&C LOOKUP

• A bot establishes a connection with its C&C server by first looking up the server’s IP address

• Regardless of architecture or topology, most botnets use fluxing to make the C&C server harder to locate and block

• There are two types of fluxing:

• IP Flux

• Domain Flux

IP FLUX

• A single Fully Qualified Domain Name (FQDN) associated with many constantly-changing IP addresses

• There are two types of IP Fluxing techniques:

• Single Flux

• Double Flux

DOMAIN FLUX

• Many FQDNs resolve to a single IP address

• Most of the time this IP address is the IP address of the proxy, not the actual C&C server

• One of the most popular techniques nowadays is the Domain Generation Algorithm (DGA)
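The two flux types leave opposite footprints in DNS answers: one name with many addresses (IP flux) versus many names with one address (domain flux). A minimal sketch of that distinction, assuming observed answers are available as (FQDN, IP) pairs; the sample data uses reserved documentation names and addresses, and the threshold of 3 is an arbitrary illustrative choice:

```python
from collections import defaultdict

def flux_profile(observations):
    """Group (fqdn, ip) DNS answers both ways to expose fluxing."""
    ips_per_fqdn = defaultdict(set)
    fqdns_per_ip = defaultdict(set)
    for fqdn, ip in observations:
        ips_per_fqdn[fqdn.lower()].add(ip)
        fqdns_per_ip[ip].add(fqdn.lower())
    return ips_per_fqdn, fqdns_per_ip

# Hypothetical answers (documentation-only names and address ranges).
observations = [
    ("cnc.example.net", "192.0.2.10"),
    ("cnc.example.net", "192.0.2.11"),
    ("cnc.example.net", "192.0.2.12"),
    ("qwrtzx.example.com", "198.51.100.7"),
    ("plmkjh.example.com", "198.51.100.7"),
    ("vbnmcx.example.com", "198.51.100.7"),
]

THRESHOLD = 3  # assumed cut-off, not taken from the slides

ips_per_fqdn, fqdns_per_ip = flux_profile(observations)
# One FQDN, many IPs: IP flux candidate.
ip_flux_candidates = sorted(f for f, ips in ips_per_fqdn.items() if len(ips) >= THRESHOLD)
# Many FQDNs, one IP: domain flux candidate.
domain_flux_candidates = sorted(ip for ip, names in fqdns_per_ip.items() if len(names) >= THRESHOLD)

print(ip_flux_candidates)      # ['cnc.example.net']
print(domain_flux_candidates)  # ['198.51.100.7']
```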

DOMAIN GENERATION ALGORITHM

DEFINITION

Domain generation algorithms (DGA) are algorithms seen in various families of malware that are used to periodically generate a large number of domain names that can be used as rendezvous points with their command and control servers.
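To make the definition concrete, here is a toy generator. It is not the algorithm of any real malware family; the seed string, the SHA-256 hashing step, and the letter mapping are all illustrative assumptions. It only shows the core idea: the bot and the botmaster share a seed and the date, so both can derive the same list of rendezvous domains.

```python
import hashlib
from datetime import date

def toy_dga(seed, day, count=5, length=10, tld=".com"):
    """Derive `count` pseudo-random domain names from a shared seed and a date."""
    domains = []
    for i in range(count):
        material = f"{seed}-{day.isoformat()}-{i}".encode()
        digest = hashlib.sha256(material).digest()
        # Map the first `length` digest bytes onto the letters a-z.
        label = "".join(chr(ord("a") + b % 26) for b in digest[:length])
        domains.append(label + tld)
    return domains

# Same seed and date always give the same list; a new date gives a new list.
print(toy_dga("demo-seed", date(2017, 7, 27)))
```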

CHARACTERISTICS

• NXDOMAIN responses

• Usually random at the second- or third-level domain (2LD or 3LD)

• A lot of requests from the same IP address

• Range from completely unreadable strings (not compliant with Zipf’s Law) to dictionary words (harder to detect)
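The first and third characteristics can be checked together by counting NXDOMAIN responses per client. A minimal sketch, assuming log records are available as (client IP, query name, response code) tuples; the record layout, the sample records, and the threshold are hypothetical:

```python
from collections import Counter

def nxdomain_counts(records):
    """Count NXDOMAIN responses per client IP."""
    return Counter(client for client, _qname, rcode in records if rcode == "NXDOMAIN")

# Hypothetical log records: (client IP, query name, response code).
records = [
    ("10.0.0.5", "vofwxlbi.cn", "NXDOMAIN"),
    ("10.0.0.5", "kqzxvw.example", "NXDOMAIN"),
    ("10.0.0.5", "trwpsd.example", "NXDOMAIN"),
    ("10.0.0.5", "google.com", "NOERROR"),
    ("10.0.0.9", "detik.com", "NOERROR"),
    ("10.0.0.9", "typo-domain.example", "NXDOMAIN"),
]

THRESHOLD = 3  # assumed cut-off, to be tuned on real traffic

counts = nxdomain_counts(records)
suspects = sorted(client for client, n in counts.items() if n >= THRESHOLD)
print(suspects)  # ['10.0.0.5']
```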

MALWARES USING DGA

• Kraken

• Conficker

• Gameover Zeus

• Pykspa

• Cryptolocker

• Dyre

• Darkshell

• Locky

• Mad Max

• PandaBanker

• Pushdo

• Ramnit

• Srizbi

• Torpig

• Virut

• etc.

DGA DETECTION TECHNIQUES

• Reverse Engineering (Generating Regular Expressions for DGA Detection)

• Zipf’s Law (Detecting the Existence of DGA within Log Files)

• Maximum Consonant Sequence Length (Detecting the DGA within Log Files)

• Hierarchical Clustering (Clustering Log Files)

REVERSE ENGINEERING

DGA DETECTION TECHNIQUES

DGARCHIVE

• Daniel Plohmann, Khaled Yakdan, Michael Klatt, Johannes Bader, and Elmar Gerhards-Padilla published a paper entitled “A Comprehensive Measurement Study of Domain Generating Malware” in which they discussed the many different categories of malware DGAs.

• In addition, they also managed to create DGArchive, a repository of DGA regexes from 69 malware families obtained by reverse engineering malware samples.

• Using the regexes, it is possible to generate a list of algorithmically generated domains (AGDs) for the current day and use it as a blacklist before the DGA attack even starts.

DRAWBACK OF REGEX

• The regexes provided by DGArchive can be too generic

• For example, the DGA regular expression for Darkshell is [\s\S]{6}\.com, and google.com also matches it

• Some other detection measures are necessary
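The false positive is easy to reproduce with Python's re module. The pattern is the Darkshell regex quoted above; the test names are taken from the sample log used later in this deck:

```python
import re

darkshell_pattern = re.compile(r"[\s\S]{6}\.com")

# fullmatch: the whole name must be exactly six characters followed by ".com".
for name in ["google.com", "detik.com", "klikbca.com"]:
    print(name, bool(darkshell_pattern.fullmatch(name)))
# google.com True
# detik.com False
# klikbca.com False

# search: an unanchored match is even looser, since any name with at least
# six characters before ".com" contains a matching substring.
print(bool(darkshell_pattern.search("klikbca.com")))  # True
```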

ZIPF’S LAW

DGA DETECTION TECHNIQUES

ZIPF’S LAW

Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on.
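As a small numeric illustration of the inverse proportionality, the helper below lists the counts an ideal Zipf distribution would give for the first few ranks. The function name and the top count of 600 are made up for the example:

```python
def zipf_expected(top_count, ranks):
    """Ideal Zipf counts: the item at rank r occurs top_count / r times."""
    return [top_count / r for r in range(1, ranks + 1)]

print(zipf_expected(600, 4))  # [600.0, 300.0, 200.0, 150.0]
```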

N-GRAM FREQUENCIES

Let’s take facebook.com as an example:

• Unigrams = [‘f’, ‘a’, ‘c’, ‘e’, ‘b’, ‘o’, ‘o’, ‘k’, ‘c’, ‘o’, ‘m’]

• Bigrams = [‘fa’, ‘ac’, ‘ce’, ‘eb’, ‘bo’, ‘oo’, ‘ok’, ‘co’, ‘om’]

• Trigrams = [‘fac’, ‘ace’, ‘ceb’, ‘ebo’, ‘boo’, ‘ook’, ‘com’]

The bigram frequency:

• fa = 1

• ac = 1

• ce = 1

• eb = 1

• bo = 1

• oo = 1

• ok = 1

• co = 1

• om = 1

The unigram frequency:

• f = 1

• a = 1

• c = 2

• e = 1

• b = 1

• o = 3

• k = 1

• m = 1
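The n-gram lists above can be reproduced with a few lines of Python. This sketch assumes, as the example does, that n-grams are taken within each dot-separated label and never span the dot; the function name is illustrative:

```python
from collections import Counter

def ngrams(domain, n):
    """Return the character n-grams of each dot-separated label, in order."""
    grams = []
    for label in domain.lower().split("."):
        grams.extend(label[i:i + n] for i in range(len(label) - n + 1))
    return grams

print(ngrams("facebook.com", 2))
# ['fa', 'ac', 'ce', 'eb', 'bo', 'oo', 'ok', 'co', 'om']
print(ngrams("facebook.com", 3))
# ['fac', 'ace', 'ceb', 'ebo', 'boo', 'ook', 'com']
print(Counter(ngrams("facebook.com", 1)).most_common(2))
# [('o', 3), ('c', 2)]
```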

BIGRAM FREQUENCY OF LOG FILE

Given a DNS log file containing a list of domain names as follows:

• google.com

• facebook.co.id

• apple.com

• youtube.com

• klikbca.com

• twitter.com

• detik.com

The sorted bigram frequencies would be:

• co = 7

• om = 6

• ik = 2

• le = 2

• oo = 2

• Each of the remaining 33 bigrams occurs once: ac, ca, it, ce, ap, go, et, gl, er, pp, tw, tt, tu, li, ti, te, pl, be, de, yo, bc, bo, wi, fa, eb, kb, ok, og, ut, kl, ou, ub, id

CONVERTING FREQUENCIES TO FREQUENCY RATIOS

• There are 38 distinct bigrams in the given DNS log file

• The frequencies of all 38 bigrams sum to 52

• The most frequent bigram occurs 7 times, so it accounts for 7/52 of all bigram occurrences in the log file

• The least frequent bigrams occur once each, so each accounts for 1/52 of all bigram occurrences in the log file

• Therefore the maximum and minimum bigram frequency ratios are 0.1346 and 0.0192, respectively
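The figures above can be checked against the seven-domain sample log. A minimal sketch, again assuming bigrams are counted within each dot-separated label; the variable and function names are illustrative:

```python
from collections import Counter

LOG = ["google.com", "facebook.co.id", "apple.com", "youtube.com",
       "klikbca.com", "twitter.com", "detik.com"]

def bigram_counts(domains):
    """Count character bigrams per label across all domain names."""
    counts = Counter()
    for domain in domains:
        for label in domain.lower().split("."):
            counts.update(label[i:i + 2] for i in range(len(label) - 1))
    return counts

counts = bigram_counts(LOG)
total = sum(counts.values())
# Frequency ratio: share of all bigram occurrences held by each bigram.
ratios = {gram: c / total for gram, c in counts.items()}

print(len(counts), total)     # 38 52
print(counts.most_common(2))  # [('co', 7), ('om', 6)]
print(round(max(ratios.values()), 4), round(min(ratios.values()), 4))  # 0.1346 0.0192
```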

ALEXA BIGRAM DISTRIBUTION

CONFICKER BIGRAM DISTRIBUTION

PYKSPA BIGRAM DISTRIBUTION

CONFICKER VS PYKSPA BIGRAM DISTRIBUTION

AGD VS HGD BIGRAM DISTRIBUTION

AGD VS HGD

• The graphs show that Algorithmically-Generated Domains (AGD), such as the Conficker and Pykspa worm domains, produce a relatively straight line, while Human-Generated Domains (HGD), such as Alexa’s Top 500 sites, produce an elbow-shaped curve.

• This observation leads to a formula for calculating the probability that a given log file contains DGA domains. The higher this DGA probability, the more likely it is that a DGA attack is ongoing within the monitored log.
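The formula itself is not reproduced in this text, so the sketch below is only an illustrative stand-in, not the author's method. It measures how much of the curve's mass sits in its "head": an elbow-shaped curve concentrates a large share of all bigram occurrences in the few top-ranked bigrams, while a flat curve does not. The 10% head fraction and the function names are assumptions.

```python
from collections import Counter

LOG = ["google.com", "facebook.co.id", "apple.com", "youtube.com",
       "klikbca.com", "twitter.com", "detik.com"]

def sorted_ratios(domains):
    """Bigram frequency ratios of a log, sorted from most to least frequent."""
    counts = Counter()
    for domain in domains:
        for label in domain.lower().split("."):
            counts.update(label[i:i + 2] for i in range(len(label) - 1))
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)

def head_share(ratios, fraction=0.1):
    """Share of all bigram occurrences held by the top `fraction` of distinct bigrams."""
    head = max(1, int(len(ratios) * fraction))
    return sum(ratios[:head])

human_like = sorted_ratios(LOG)   # the sample log from the earlier slide
flat = [1 / 38] * 38              # idealised flat curve with the same 38 bigrams

print(round(head_share(human_like), 3))  # 0.288
print(round(head_share(flat), 3))        # 0.079
```

A lower head share means a flatter curve, which under this stand-in would point toward algorithmically generated names; any real threshold would have to be calibrated on labelled logs.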

MAXIMUM CONSONANT SEQUENCE LENGTH

DGA DETECTION TECHNIQUES

DISCOVERING DGA WITHIN LOG FILES

• Further observation of the polluted log file (identified using Zipf’s Law) reveals one of the most prominent DGA characteristics, which helps distinguish AGDs from HGDs: the Maximum Consonant Sequence (MCS) Length. Generally, AGDs have a larger MCS Length than HGDs.

• Example:

• google.com has a Maximum Consonant Sequence Length of 2, since the longest consonant sequence is “gl”

• vofwxlbi.cn, one of the domains generated by the Conficker worm, has a Maximum Consonant Sequence Length of 5, and the longest sequence is “fwxlb”
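Both examples can be verified with a short function. This sketch assumes that the vowels are a, e, i, o, u (so "y" counts as a consonant) and that dots, digits, and hyphens end a consonant run; neither detail is stated in the slides.

```python
import re

# Runs of ASCII consonants: every letter except a, e, i, o, u.
CONSONANT_RUN = re.compile(r"[b-df-hj-np-tv-z]+")

def max_consonant_sequence(domain):
    """Return the longest run of consecutive consonants in a domain name."""
    runs = CONSONANT_RUN.findall(domain.lower())
    return max(runs, key=len, default="")

for name in ["google.com", "vofwxlbi.cn"]:
    run = max_consonant_sequence(name)
    print(name, len(run), run)
# google.com 2 gl
# vofwxlbi.cn 5 fwxlb
```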

HIERARCHICAL CLUSTERING

DGA DETECTION TECHNIQUES

FEATURES

Level 1

• Query Class

• Query Type

Level 2

• Response Code

Level 3

• Query Length

• Numeric Chars

Level 4

• Query Label

Level 5

• Numeric Chars
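One way to read the level structure is as a successive partitioning of log records, one feature key per level. The sketch below is a plain nested grouping under that reading, not the clustering procedure actually used for the results; the record fields (qname, qclass, qtype, rcode) are hypothetical, and only Levels 1 to 3 are covered because the slides do not specify how the Level 4 and 5 features are computed.

```python
from collections import defaultdict

# One key function per level (assumed interpretation of the feature list).
LEVELS = [
    lambda r: (r["qclass"], r["qtype"]),   # Level 1: query class and type
    lambda r: r["rcode"],                  # Level 2: response code
    lambda r: (len(r["qname"]),            # Level 3: query length and
               sum(ch.isdigit() for ch in r["qname"])),  # number of digits
]

def cluster(records, levels=LEVELS):
    """Recursively partition records by one feature key per level."""
    if not levels:
        return records
    groups = defaultdict(list)
    for record in records:
        groups[levels[0](record)].append(record)
    return {key: cluster(members, levels[1:]) for key, members in groups.items()}

# Hypothetical records in an assumed layout.
records = [
    {"qname": "google.com", "qclass": "IN", "qtype": "A", "rcode": "NOERROR"},
    {"qname": "klikbca.com", "qclass": "IN", "qtype": "A", "rcode": "NOERROR"},
    {"qname": "vofwxlbi.cn", "qclass": "IN", "qtype": "A", "rcode": "NXDOMAIN"},
]

tree = cluster(records)
print(sorted(tree[("IN", "A")]))               # ['NOERROR', 'NXDOMAIN']
print(tree[("IN", "A")]["NXDOMAIN"][(11, 0)])  # the vofwxlbi.cn record
```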

TREEMAP

RESULTING CLUSTERS

ACCURACY OF DETECTION

• Applying the accuracy formula yields 0.913, or about 91% accuracy

COUNTERMEASURES - SINKHOLING

COUNTERMEASURES – DNS RPZ

• Obtain the daily DGA log file from http://data.netlab.360.com/feeds/dga/dga.txt

• Parse it using the dnsanalysis library in Python

• Export the result to a text file and load it into DNS RPZ
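The export step can be sketched without any third-party library. The code below does not use dnsanalysis; it assumes the feed is tab-separated with the domain in the second column and "#" marking comment lines, which should be checked against the actual file. Each domain becomes a pair of RPZ records with "CNAME .", the RPZ policy action that makes the resolver answer NXDOMAIN.

```python
def feed_to_rpz(lines):
    """Turn feed lines into RPZ records that answer NXDOMAIN for each domain."""
    records = []
    seen = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # not in the assumed layout
        domain = fields[1].strip().lower().rstrip(".")
        if domain and domain not in seen:
            seen.add(domain)
            records.append(f"{domain} CNAME .")    # the domain itself
            records.append(f"*.{domain} CNAME .")  # and all its subdomains
    return records

# Hypothetical sample in the assumed layout: family, domain, valid-from, valid-to.
sample = [
    "# hypothetical sample line",
    "conficker\tvofwxlbi.cn\t2017-07-27 00:00:00\t2017-07-27 23:59:59",
]
print("\n".join(feed_to_rpz(sample)))
# vofwxlbi.cn CNAME .
# *.vofwxlbi.cn CNAME .
```

The resulting lines go inside the response policy zone file, below its SOA and NS records.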

REFERENCES

• Botnet Communication Topologies

https://www.damballa.com/downloads/r_pubs/WP_Botnet_Communications_Primer.pdf

• A Comprehensive Measurement Study of Domain Generating Malware

https://www.usenix.org/system/files/conference/usenixsecurity16/sec16_paper_plohmann.pdf

• DGArchive – A deep dive into domain generating malware

https://www.botconf.eu/wp-content/uploads/2015/12/OK-P06-Plohmann-DGArchive.pdf

• Using DNS RPZ to Block Malicious DNS Requests

https://blogs.cisco.com/security/using-dns-rpz-to-block-malicious-dns-requests