Detection of malicious domains through lexical analysis. International Conference On Cyber Security And Protection Of Digital Services, June 11, 2018. Egon Kidmose, Matija Stevanovic and Jens Myrup Pedersen, Department of Electronic Systems, Aalborg University, Denmark.


Detection of malicious domains through lexical analysis

International Conference On Cyber Security And Protection Of Digital Services

June 11, 2018

Egon Kidmose, Matija Stevanovic and Jens Myrup Pedersen

Department of Electronic Systems, Aalborg University, Denmark

Outline: Motivation, Assumption and Method, Data, Features, Results, Conclusion

Domain Names and DNS

DNS: Domain Name System
• Used everywhere, by everybody on the Internet
• ... also criminals!
• Technology/service choke-point
• Only a small fraction of Internet traffic

Example domains
• www.example.com
• goggle.com
• yf32d9ac7f0a9f463e8da4736b12d7044a.tk

Source: Antonakakis et al. 2011.


Abuse and Malicious Domains

• Command and control
• Phishing
• Spamming
• Typo-squatting
• Blending in, tunnelling
• Fast-flux, Double-flux
• Domain-flux (DGA)

Source: Haymarket Media, Inc. 1

Malicious domain: Any domain that is used for criminal, malicious or otherwise nefarious activities.

1: https://www.scmagazineuk.com/cyber-criminals-becoming-increasingly-professional/article/531709/


Assumption and Method

Assumption:

“Malicious domain names can be detected by lexical features.”

Method:

Domain Names → Lexical Analysis (Feature Extraction) → Supervised Machine Learning → Results

Machine Learning: Supervised, 10-fold cross-validation, 10 repeats, Random Forest, Python, scikit-learn.
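
The evaluation protocol named on this slide could be set up in scikit-learn roughly as follows. This is a minimal sketch, not the authors' code: the synthetic two-column feature matrix, the use of stratified folds, the forest size and the random seeds are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

rng = np.random.default_rng(0)

# Synthetic stand-in for a lexical feature matrix (assumption):
# label 1 = malicious, label 0 = benign; two noisy numeric features.
n_samples = 300
y = rng.integers(0, 2, size=n_samples)
X = np.column_stack([
    rng.normal(loc=8 + 6 * y, scale=3.0),    # e.g. something like a length
    rng.normal(loc=2.5 + 1.0 * y, scale=0.5) # e.g. something like an entropy
])

# 10-fold cross-validation repeated 10 times gives 100 train/test splits.
# Stratification is an assumption; the slides only say "10-fold".
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)

scores = cross_validate(clf, X, y, cv=cv, scoring=("precision", "recall", "f1"))

# Mean performance over all folds and repeats.
mean_scores = {
    name: float(np.mean(scores[f"test_{name}"]))
    for name in ("precision", "recall", "f1")
}
print(mean_scores)
```

Averaging over all 100 splits mirrors the "mean performance" that the slides report, although the real feature matrix would come from the lexical analysis step rather than from random numbers.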


Data

Table I: Data sets of malicious domain names (number of domains).

Source                   Data set                        Non-DGA      DGA
abuse.ch                 ZeuS Tracker                        437        -
                         Palevo Tracker                       14        -
                         Ransomware Tracker                 1248        -
malware-domains.com      Just domains                      13073        -
                         Zeus Gameover                         -   190033
                         Conficker                             -   106101
                         Pushdo                                -    10951
                         GOZ                                   -     7348
                         Microsoft Botnet                  22036        -
host-file.net            Ad/tracking servers (ATS)         47960        -
                         Malware distribution (EMD)       137237        -
                         Exploits sites (EXP)              17282        -
                         Fraud sites (FSA)                134501        -
                         Spamming sites (GRM)                674        -
                         Spamming sites (HFS)                573        -
                         Hijack sites (HJK)                   74        -
                         Misleading marketing (MMT)         5533        -
                         Pharmacy activities (PHA)         23143        -
                         Phishing sites (PSH)             133913        -
                         Warez distribution (WRZ)           3231        -
datadrivensecurity.info  Cryptolocker                          -    34319
                         Goz                                   -     7347
                         New Goz                               -    10999
malwaredomainlist.com    Malware-related domains            1253        -
malc0de.com              Malware-related domains             208        -
malwarepatrol.net        Malicious URLs                    35518        -
phistank.com             Phishing URLs                     14807        -
AAU-STAR                 Domains from malw. testing        27778        -
Total                                                     620493   367098

Table II: Data sets of benign domain names (number of domains).

Source                   Data set                        Non-DGA      DGA
alexa.org                Most popular domains             971424        -
datadrivensecurity.info  Legitimate DGA domains                -   133927


Features

1. Basic Domain Features
   • Count, Categorical
   • TLD and FQDN

2. Simple Lexical Features
   • Length, count, ratio
   • 2LD, character classes

3. Advanced Lexical Features
   • Approximating language, recognising words
   • Entropy of letter distribution
   • N-gram analysis (alexa.org, English)

Using www.example.com as the example:
Top-Level Domain (TLD): com
Second-Level Domain (2LD): example
Fully Qualified Domain Name (FQDN): www.example.com
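
The TLD, 2LD and FQDN definitions above can be illustrated with a short sketch that splits a name into its parts and derives the three basic domain features. It is only an illustration under stated assumptions: `MULTI_LABEL_SUFFIXES` is a hypothetical, deliberately tiny stand-in for a proper suffix list, the function and key names are invented here, and counting every label as a domain level is a guess at how n-LD is defined.

```python
# Hypothetical, deliberately tiny set of multi-label suffixes (assumption).
# A real system would more likely consult the Public Suffix List.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.br", "co.kr", "co.za", "com.ar"}


def split_domain(fqdn):
    """Return (tld, second_level_label, number_of_labels) for a domain name."""
    labels = fqdn.lower().rstrip(".").split(".")
    # Prefer a known two-label suffix such as "co.uk"; otherwise use one label.
    two_label = ".".join(labels[-2:])
    suffix_len = 2 if len(labels) > 2 and two_label in MULTI_LABEL_SUFFIXES else 1
    tld = ".".join(labels[-suffix_len:])
    remaining = labels[:-suffix_len]
    second_level = remaining[-1] if remaining else ""
    return tld, second_level, len(labels)


def basic_domain_features(fqdn):
    """Number of domain levels, TLD (categorical) and length of the FQDN."""
    tld, _, num_labels = split_domain(fqdn)
    return {
        "num_lds": num_labels,
        "tld": tld,
        "length_fqdn": len(fqdn.rstrip(".")),
    }


print(split_domain("www.example.com"))
print(basic_domain_features("shop.example.co.uk"))
```

Treating suffixes such as co.uk as a single TLD matches the TLD labels visible in the feature analysis plots, but the exact suffix handling used in the study is not stated on the slides.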


Feature Analysis

Basic Domain Features

[Figure: CDF of the number of tokens (n-LD), x-axis NumLDs, malicious vs. non-malicious.]

[Figure: CDF of the length of the FQDN, x-axis LengthDomain, malicious vs. non-malicious.]

[Figure: bar chart of the number of domains per TLD, malicious vs. non-malicious. TLDs on the x-axis: com, ru, net, org, info, biz, com.br, co.uk, de, in, fr, pl, kz, ir, cn, es, nl, it, us, ca, jp, me, ro, cz, gr, co.kr, be, co.za, com.ar, ch, com.mx, hu, at, xyz, cl, xn--p1ai, com.tw, com.tr, dk, vn, su, com.ua, no, sk, co.id, online, ie, co.il, co.nz, ae. A second panel shows com, ru, net, org and info on a larger scale.]


Detection: Scenario I - Full Data Set

[Figure: bar chart of TPR/Recall, FPR, Precision and F1-score (scale 0.0 to 1.0) for basic domain features, simple lexical features, advanced lexical features and all features.]


Detection: Scenario II - Non-DGA

[Figure: bar chart of TPR/Recall, FPR, Precision and F1-score (scale 0.0 to 1.0) for basic domain features, simple lexical features, advanced lexical features and all features.]


Detection: Scenario III - DGA

[Figure: bar chart of TPR/Recall, FPR, Precision and F1-score (scale 0.0 to 1.0) for basic domain features, simple lexical features, advanced lexical features and all features.]

Scenario III includes the benign non-DGA domains.


Conclusion

Results

Malicious domains can be detected by using lexical features only.

• Detection of DGA domains is particularly promising
  • Precision: 0.984. Recall: 0.984. F1-score: 0.984.
• Detection of non-DGA domains requires further work
  • Precision: 0.937. Recall: 0.795. F1-score: 0.860.
• Using all features outperforms the individual sets

General
• Vague distinction between DGA and non-DGA
• Labelled data of high quality is hard to come by


Conclusion

Future work
• Explore other lexical features
• Explore effect of combining with non-lexical features
• Improve performance for non-DGA domains
• More substantial data sets
• Evaluate in a practical, online setting
• Apply unsupervised machine learning


Backup slides


Existing Solutions

• Reputation
• Traffic and activity (DNS, e-mail, ...)
• Resilience/anonymity techniques


Features

Basic Domain Features
1. Number of domain levels (n-LD).
2. Top level domain (TLD).
3. Length of Fully Qualified Domain Name (FQDN).

Simple Lexical Features
1. Length of 2nd Level Domain (2-LD).
2. Ratio of consonants in the 2-LD.
3. Number of vowels in 2-LD.
4. Number of numeric characters in 2-LD.
5. Number of special characters in 2-LD.
6. Ratio of special characters in 2-LD.
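
The six simple lexical features listed above could be computed from a 2-LD string along these lines. This is a sketch under assumptions: the slides do not define the character classes, so here vowels are a, e, i, o, u, consonants are the remaining ASCII letters, and "special" means anything that is neither an ASCII letter nor a digit (for example a hyphen). The function and key names are illustrative.

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)
VOWELS = set("aeiou")  # assumption: "y" is counted as a consonant


def simple_lexical_features(sld):
    """Compute length, count and ratio features for a second-level domain."""
    sld = sld.lower()
    length = len(sld)
    num_vowels = sum(c in VOWELS for c in sld)
    num_consonants = sum(c in LETTERS and c not in VOWELS for c in sld)
    num_numeric = sum(c in DIGITS for c in sld)
    # Everything that is neither an ASCII letter nor a digit is "special".
    num_special = length - num_vowels - num_consonants - num_numeric
    return {
        "length_2ld": length,
        "ratio_consonants": num_consonants / length if length else 0.0,
        "num_vowels": num_vowels,
        "num_numeric": num_numeric,
        "num_special": num_special,
        "ratio_special": num_special / length if length else 0.0,
    }


print(simple_lexical_features("goggle"))
print(simple_lexical_features("my-shop24"))
```

Guarding the ratios against an empty string keeps the function usable on names that have no 2-LD at all.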


Features

Advanced Lexical Features
1. Language indicator (langid.py).
2. Number of English words in 2-LD.
3. Entropy of 2-LD.
4. N-gram analysis of 2-LD (www.alexa.org) 1.
5. N-gram analysis of 2-LD (English dictionary).

1: {3,4,5}-grams
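
Two of the advanced features, the entropy of the 2-LD and the n-gram analysis, can be sketched as below. The entropy is the standard Shannon entropy of the character distribution. The n-gram score is an assumption: the slides do not say how n-grams are scored, so this sketch simply counts how many 3-, 4- and 5-grams of the 2-LD also occur in a reference word list (standing in for Alexa domains or an English dictionary). All function names are illustrative.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy, in bits, of the character distribution of text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ngrams(text, n):
    """All contiguous substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_reference(words, sizes=(3, 4, 5)):
    """Set of all n-grams seen in the reference words (assumed corpus)."""
    return {g for w in words for n in sizes for g in ngrams(w.lower(), n)}


def ngram_score(sld, reference, sizes=(3, 4, 5)):
    """Assumed scoring: number of the 2-LD's n-grams found in the reference."""
    return sum(g in reference for n in sizes for g in ngrams(sld.lower(), n))


reference = build_reference(["example", "google"])
print(shannon_entropy("goggle"), ngram_score("goggle", reference))
print(shannon_entropy("yf32d9ac7f0a"), ngram_score("yf32d9ac7f0a", reference))
```

Algorithmically generated names tend to spread their characters more evenly and share few n-grams with real words, which is the intuition behind using both quantities as features.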




Feature Analysis

Simple Lexical Features

[Figure: six CDF panels, each comparing malicious and non-malicious domains: Length of 2-LD (Length2LD), Ratio of consonants (RatioConsonants), Number of vowels (NumberVowels), Number of numeric characters (NumberNumeric), Number of special characters (NumberSpecial), Ratio of special characters (RatioSpecial).]


Feature Analysis

Advanced Lexical Features

[Figure: bar chart of the number of domains per detected language, malicious vs. non-malicious. Languages on the x-axis: en, de, pl, es, fr, mt, it, nl, sl, da, fi, pt, sv, hu, et, cs, eu, id, lt, zh, ro, sq, cy, sk, no, hr, lv, nn, xh, br, az, ca, tr, ms, rw, eo, mg, tl, ar, sw, vi, gl, qu, bs, af, nb, ga, zu, vo, ku, wa, jv, oc, ht, lb, is, la, se, an, fo, ko, ja. A second panel shows en, de, pl, es and fr on a larger scale.]

[Figure: four CDF panels, each comparing malicious and non-malicious domains: Number of English words (NumberWords), Entropy of 2-LD (Entropy2LD), N-gram analysis of 2-LD (alexa.com) (GramsAlexa2LD), N-gram analysis of 2-LD (English dictionary) (GramsDict2LD).]


Detection: Mean performance

Scenario      Features  TPR/Recall  FPR       Precision  F1-score
I (Full)      1         9.09e-01    1.36e-01  9.25e-01   9.17e-01
              2         7.93e-01    3.44e-01  8.10e-01   8.01e-01
              3         9.06e-01    1.66e-01  9.10e-01   9.08e-01
              All       9.27e-01    5.35e-02  9.70e-01   9.48e-01
II (Non-DGA)  1         7.59e-01    1.10e-02  9.78e-01   8.55e-01
              2         3.50e-01    6.69e-02  7.69e-01   4.81e-01
              3         8.05e-01    9.16e-02  8.49e-01   8.26e-01
              All       7.95e-01    3.41e-02  9.37e-01   8.60e-01
III (DGA)     1         9.60e-01    1.21e-01  9.09e-01   9.34e-01
              2         8.61e-01    2.08e-01  8.38e-01   8.50e-01
              3         9.35e-01    8.40e-02  9.33e-01   9.34e-01
              All       9.84e-01    1.98e-02  9.84e-01   9.84e-01

Feature sets: 1 = Basic Domain Features, 2 = Simple Lexical Features, 3 = Advanced Lexical Features, All = all features combined.
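
The four columns of this table follow from the usual confusion-matrix definitions, which the sketch below spells out. The function names and the example counts are illustrative assumptions; only the precision and recall values passed to `f1_from` are taken from the table.

```python
def detection_metrics(tp, fp, tn, fn):
    """TPR/recall, FPR, precision and F1-score from confusion-matrix counts."""
    recall = tp / (tp + fn)       # share of malicious domains that are caught
    fpr = fp / (fp + tn)          # share of benign domains that are flagged
    precision = tp / (tp + fp)    # share of flagged domains that are malicious
    f1 = 2 * precision * recall / (precision + recall)
    return {"tpr": recall, "fpr": fpr, "precision": precision, "f1": f1}


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Made-up counts, only to show the call.
print(detection_metrics(tp=90, fp=10, tn=190, fn=10))

# Cross-checking the "All" rows of the table against their reported F1-scores.
print(round(f1_from(0.970, 0.927), 3))  # Scenario I
print(round(f1_from(0.937, 0.795), 3))  # Scenario II
print(round(f1_from(0.984, 0.984), 3))  # Scenario III
```

Recomputing F1 from the reported precision and recall is a quick consistency check on the table rows.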