Detection of malicious domains through lexical analysis · through lexical analysis Egon Kidmose et al. Motivation Assumption and Method 5 Data Features Results Conclusion Dept. of

Detection of malicious domainsthrough lexical analysis

International Conference On Cyber Security And Protection Of Digital Services

June 11, 2018

Egon Kidmose, Matija Stevanovic and Jens Myrup [email protected], [email protected], [email protected]

Department of Electronic SystemsAalborg University

Denmark

mailto:[email protected]



12

Detection of maliciousdomains

through lexicalanalysis

Egon Kidmose et al.

2 Motivation

Assumption andMethod

Data

Features

Results

Conclusion

Dept. of Electronic SystemsAalborg University

Denmark

Domain Names and DNS

DNS: Domain Names SystemI Used everywhere, by everybody on the InternetI ... also criminals!I Technology/service choke-pointI Only small fraction of Internet traffic

Example domainsI www.example.com

I goggle.com

I yf32d9ac7f0a9f463e8da4736b12d7044a.tk

Source: Antonakakis et al. 2011.

12



Egon Kidmose et al.

3 Motivation


Data

Features

Results

Conclusion


Denmark

Abuse and Malicious Domains

I Command and controlI PhishingI SpammingI Typo-squattingI Blending in, tunnellingI Fast-flux, Double-fluxI Domain-flux (DGA)

Source: Haymarket Media, Inc. 1

Malicious domain: Any domain that is used for criminal,malicious or otherwise nefarious activities.

1: https://www.scmagazineuk.com/cyber-criminals-becoming-increasingly-professional/article/

531709/

https://www.scmagazineuk.com/cyber-criminals-becoming-increasingly-professional/article/531709/



12



Egon Kidmose et al.

Motivation

4 Assumption andMethod

Data

Features

Results

Conclusion


Denmark

Assumption and Method

Assumption:

“Malicious domain names can be detected bylexical features.”

Method:

DomainNames

Lexical Analysis(Feature Extraction)

SupervisedMachine Learning Results

Machine Learning: Supervised, 10-fold cross validation, 10 repeats,Random Forrest, Python, Scikit-learn.

12



Egon Kidmose et al.

Motivation


5 Data

Features

Results

Conclusion


Denmark

Data

Table I: DATA SETS OF MALICIOUS DOMAIN NAMES.Data sets Number of domains

Non-DGA DGA

abuse.chZeuS Tracker 437 -

Palevo Tracker 14 -Ransomware Tracker 1248 -

malware-\domains.com

Just domains 13073 -Zeus Gameover - 190033

Conficker - 106101Pushdo - 10951

GOZ - 7348Microsoft Botnet 22036 -

host-file.net

Ad/tracking servers (ATS) 47960 -Malware distribution (EMD) 137237 -

Exploits sites (EXP) 17282 -Fraud sites (FSA) 134501 -

Spamming sites (GRM) 674 -Spamming sites (HFS) 573 -

Hijack sites (HJK) 74 -Misleading marketing (MMT) 5533 -Pharmacy activities (PHA) 23143 -

Phishing sites (PSH) 133913 -Warez distribution (WRZ) 3231 -

datadriven\security.info

Cryptolocker - 34319Goz - 7347

New Goz - 10999malware\domainlist.com

Malware-related domains 1253 -

malc0de.com Malware-related domains 208 -malwarepatrol.net Malicious URLs 35518 -phistank.com Phishing URLs 14807 -AAU-STAR Domains from malw.testing 27778 -Total 620493 367098

Table II: DATA SETS OF BENIGN DOMAIN NAMES.Data sets Number of domains

Non-DGA DGAalexa.org Most popular domains 971424 -datadriven\security.info

Legitimate DGA domains - 133927

12



Egon Kidmose et al.

Motivation


Data

6 Features

Results

Conclusion


Denmark

Features

1. Basic Domain FeaturesI Count, CategoricalI TLD and FQDN

2. Simple Lexical FeaturesI Length, count, ratioI 2LD, character classes

3. Advanced Lexical FeaturesI Approximating language, recognising wordsI Entropy of letter distributionI N-gram analysis (alexa.org, English)

Top-Level Domain (TLD): www.example.com.Second-Level Domain (2LD): www.example.com.Fully Qualified Domain Name: www.example.com

12



Egon Kidmose et al.

Motivation


Data

Features

7 Results

Conclusion


Denmark

Feature Analysis

Basic Domain Features

5 10 15 20 25 30 35NumLDs

0.0

0.2

0.4

0.6

0.8

1.0Number of tokens (n-LD), CDF

MaliciousNon-malicious

0 50 100 150 200 250LengthDomain

0.0

0.2

0.4

0.6

0.8

1.0Length of FQDN, CDF


com ru net

org

info biz

com.br

co.uk de in fr pl kz ir cn es nl it us ca jp me ro cz gr

co.kr be

co.za

com.ar ch

com.mx hu at xyz cl

xn--p1ai

com.tw

com.tr dk vn su

com.ua no sk

co.id

online ie

co.il

co.nz ae

0

20000

40000

60000MaliciousNon-malicious

com ru net org info02500005000007500001000000

12



Egon Kidmose et al.

Motivation


Data

Features

8 Results

Conclusion


Denmark

Detection: Scenario I - Full Data Set

Basicdomainfeatures

Simplelexicalfeatures

Advancedlexicalfeatures

Allfeatures

0.00.20.40.60.81.0

TPR/RecallFPRPrecisionF1-score

12



Egon Kidmose et al.

Motivation


Data

Features

9 Results

Conclusion


Denmark

Detection: Scenario II - Non-DGA

Basicdomainfeatures



Allfeatures

0.00.20.40.60.81.0


12



Egon Kidmose et al.

Motivation


Data

Features

10 Results

Conclusion


Denmark

Detection: Scenario III - DGA

Basicdomainfeatures



Allfeatures

0.00.20.40.60.81.0


Scenario III includes the benign non-DGA domains.

12



Egon Kidmose et al.

Motivation


Data

Features

Results

11 Conclusion


Denmark

Conclusion

Results

Malicious domains can be detectedby using lexical features only

I Detection of DGA-domains is particular promisingI Precision: 0.984. Recall: 0.984. F1-score: 0.948.

I Detection of Non-DGA-domains requires further workI Precision: 0.937. Recall: 0.795. F1-score: 0.860.

I Using all features outperforms the individual sets

GeneralI Vague distinction between DGA and Non-DGAI Labelled data of high quality is hard to come by

12



Egon Kidmose et al.

Motivation


Data

Features

Results

12 Conclusion


Denmark

Conclusion

Future workI Explore other lexical featuresI Explore effect of combining with non-lexical featuresI Improve performance for non-DGA domainsI More substantial data setsI Evaluate in a practical, online settingI Apply unsupervised machine learning

20



Egon Kidmose et al.

Motivation


Data

Features

Results

13 Conclusion


Denmark

Backup slides

20



Egon Kidmose et al.

Motivation


Data

Features

Results

14 Conclusion


Denmark

Existing Solutions

Existing SolutionsI ReputationI Traffic and activity (DNS, e-mail, ...)I Resilience/anonymity techniques

20



Egon Kidmose et al.

Motivation


Data

Features

Results

15 Conclusion


Denmark

Features

Basic Domain Features1. Number of domain levels (n-LD).2. Top level domain (TLD).3. Length of Fully Qualified Domain Name (FQDN).

Simple Lexical Features

1. Length of 2nd Level Domain (2-LD).2. Ratio of consonants in the 2-LD.3. Number of vowels in 2-LD.4. Number of numeric characters in 2-LD.5. Number of special characters in 2-LD.6. Ratio of special characters in 2-LD.

20



Egon Kidmose et al.

Motivation


Data

Features

Results

16 Conclusion


Denmark

Features

Advanced Lexical Features1. Language indicator (langid.py)2. Number of English words in 2-LD.3. Entropy of 2-LD.4. N-gram analysis of 2-LD (www.alexa.org) 1 .5. N-gram analysis of 2-LD (English dictionary).

1{3,4,5}-grams,

20



Egon Kidmose et al.

Motivation


Data

Features

Results

17 Conclusion


Denmark

Feature Analysis

Basic Domain Features

5 10 15 20 25 30 35NumLDs

0.0

0.2

0.4

0.6

0.8

1.0Number of tokens (n-LD), CDF


0 50 100 150 200 250LengthDomain

0.0

0.2

0.4

0.6

0.8

1.0Length of FQDN, CDF


com ru net

org

info biz

com.br

co.uk de in fr pl kz ir cn es nl it us ca jp me ro cz gr

co.kr be

co.za

com.ar ch

com.mx hu at xyz cl

xn--p1ai

com.tw

com.tr dk vn su

com.ua no sk

co.id

online ie

co.il

co.nz ae

0

20000

40000


com ru net org info02500005000007500001000000

20



Egon Kidmose et al.

Motivation


Data

Features

Results

18 Conclusion


Denmark

Feature Analysis

Simple Lexical Features

0 10 20 30 40 50 60Length2LD

0.0

0.2

0.4

0.6

0.8

1.0Length of 2-LD, CDF


0.0 0.2 0.4 0.6 0.8 1.0RatioConsonants

0.0

0.2

0.4

0.6

0.8

1.0Ratio of consonants, CDF


0 10 20 30 40NumberVowels

0.0

0.2

0.4

0.6

0.8

1.0Number of vowels, CDF


0 10 20 30 40 50 60NumberNumeric

0.0

0.2

0.4

0.6

0.8

1.0Number of numeric characters, CDF


0 10 20 30 40 50 60NumberSpecial

0.0

0.2

0.4

0.6

0.8

1.0Number of special, CDF


0.0 0.2 0.4 0.6 0.8 1.0RatioSpecial

0.0

0.2

0.4

0.6

0.8

1.0Ratio of special, CDF


20



Egon Kidmose et al.

Motivation


Data

Features

Results

19 Conclusion


Denmark

Feature Analysis

Advanced Lexical Featuresen de pl es fr mt it nl sl da fi pt sv hu et cs eu id lt zh ro sq cy sk no hr lv nn xh br az ca tr ms rw eo mg tl ar sw vi gl qu bs af nb ga zu vo ku wa jv oc ht lb is la se an fo ko ja

0

50000

100000

150000


en de pl es fr0500000

10000001500000

0 5 10 15 20 25 30NumberWords

0.0

0.2

0.4

0.6

0.8

1.0Number of English words, CDF


0 1 2 3 4 5Entropy2LD

0.0

0.2

0.4

0.6

0.8

1.0Entropy of 2-LD, CDF


0 100 200 300 400GramsAlexa2LD

0.0

0.2

0.4

0.6

0.8

1.0Ngram analysis of 2-LD (alexa.com), CDF


0 100 200 300 400 500GramsDict2LD

0.0

0.2

0.4

0.6

0.8

1.0Ngram analysis of 2-LD (English dictionary, CDF


20



Egon Kidmose et al.

Motivation


Data

Features

Results

20 Conclusion


Denmark

Detection: Mean performance

Scenario Features TPR/Recall FPR Precision F1-scoreI (Full) 1 9.09e-01 1.36e-01 9.25e-01 9.17e-01

2 7.93e-01 3.44e-01 8.10e-01 8.01e-013 9.06e-01 1.66e-01 9.10e-01 9.08e-01All 9.27e-01 5.35e-02 9.70e-01 9.48e-01

II (Non-DGA) 1 7.59e-01 1.10e-02 9.78e-01 8.55e-012 3.50e-01 6.69e-02 7.69e-01 4.81e-013 8.05e-01 9.16e-02 8.49e-01 8.26e-01All 7.95e-01 3.41e-02 9.37e-01 8.60e-01

III (DGA) 1 9.60e-01 1.21e-01 9.09e-01 9.34e-012 8.61e-01 2.08e-01 8.38e-01 8.50e-013 9.35e-01 8.40e-02 9.33e-01 9.34e-01All 9.84e-01 1.98e-02 9.84e-01 9.84e-01

Documents

Detection of malicious domains through lexical analysis · through lexical analysis Egon Kidmose et al. Motivation Assumption and Method 5 Data Features Results Conclusion Dept. of