Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Detection of malicious domainsthrough lexical analysis
International Conference On Cyber Security And Protection Of Digital Services
June 11, 2018
Egon Kidmose, Matija Stevanovic and Jens Myrup [email protected], [email protected], [email protected]
Department of Electronic SystemsAalborg University
Denmark
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
2 Motivation
Assumption andMethod
Data
Features
Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Domain Names and DNS
DNS: Domain Names SystemI Used everywhere, by everybody on the InternetI ... also criminals!I Technology/service choke-pointI Only small fraction of Internet traffic
Example domainsI www.example.com
I goggle.com
I yf32d9ac7f0a9f463e8da4736b12d7044a.tk
Source: Antonakakis et al. 2011.
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
3 Motivation
Assumption andMethod
Data
Features
Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Abuse and Malicious Domains
I Command and controlI PhishingI SpammingI Typo-squattingI Blending in, tunnellingI Fast-flux, Double-fluxI Domain-flux (DGA)
Source: Haymarket Media, Inc. 1
Malicious domain: Any domain that is used for criminal,malicious or otherwise nefarious activities.
1: https://www.scmagazineuk.com/cyber-criminals-becoming-increasingly-professional/article/
531709/
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
4 Assumption andMethod
Data
Features
Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Assumption and Method
Assumption:
“Malicious domain names can be detected bylexical features.”
Method:
DomainNames
Lexical Analysis(Feature Extraction)
SupervisedMachine Learning Results
Machine Learning: Supervised, 10-fold cross validation, 10 repeats,Random Forrest, Python, Scikit-learn.
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
5 Data
Features
Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Data
Table I: DATA SETS OF MALICIOUS DOMAIN NAMES.Data sets Number of domains
Non-DGA DGA
abuse.chZeuS Tracker 437 -
Palevo Tracker 14 -Ransomware Tracker 1248 -
malware-\domains.com
Just domains 13073 -Zeus Gameover - 190033
Conficker - 106101Pushdo - 10951
GOZ - 7348Microsoft Botnet 22036 -
host-file.net
Ad/tracking servers (ATS) 47960 -Malware distribution (EMD) 137237 -
Exploits sites (EXP) 17282 -Fraud sites (FSA) 134501 -
Spamming sites (GRM) 674 -Spamming sites (HFS) 573 -
Hijack sites (HJK) 74 -Misleading marketing (MMT) 5533 -Pharmacy activities (PHA) 23143 -
Phishing sites (PSH) 133913 -Warez distribution (WRZ) 3231 -
datadriven\security.info
Cryptolocker - 34319Goz - 7347
New Goz - 10999malware\domainlist.com
Malware-related domains 1253 -
malc0de.com Malware-related domains 208 -malwarepatrol.net Malicious URLs 35518 -phistank.com Phishing URLs 14807 -AAU-STAR Domains from malw.testing 27778 -Total 620493 367098
Table II: DATA SETS OF BENIGN DOMAIN NAMES.Data sets Number of domains
Non-DGA DGAalexa.org Most popular domains 971424 -datadriven\security.info
Legitimate DGA domains - 133927
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
6 Features
Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Features
1. Basic Domain FeaturesI Count, CategoricalI TLD and FQDN
2. Simple Lexical FeaturesI Length, count, ratioI 2LD, character classes
3. Advanced Lexical FeaturesI Approximating language, recognising wordsI Entropy of letter distributionI N-gram analysis (alexa.org, English)
Top-Level Domain (TLD): www.example.com.Second-Level Domain (2LD): www.example.com.Fully Qualified Domain Name: www.example.com
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
7 Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Feature Analysis
Basic Domain Features
5 10 15 20 25 30 35NumLDs
0.0
0.2
0.4
0.6
0.8
1.0Number of tokens (n-LD), CDF
MaliciousNon-malicious
0 50 100 150 200 250LengthDomain
0.0
0.2
0.4
0.6
0.8
1.0Length of FQDN, CDF
MaliciousNon-malicious
com ru net
org
info biz
com.br
co.uk de in fr pl kz ir cn es nl it us ca jp me ro cz gr
co.kr be
co.za
com.ar ch
com.mx hu at xyz cl
xn--p1ai
com.tw
com.tr dk vn su
com.ua no sk
co.id
online ie
co.il
co.nz ae
0
20000
40000
60000MaliciousNon-malicious
com ru net org info02500005000007500001000000
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
8 Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Detection: Scenario I - Full Data Set
Basicdomainfeatures
Simplelexicalfeatures
Advancedlexicalfeatures
Allfeatures
0.00.20.40.60.81.0
TPR/RecallFPRPrecisionF1-score
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
9 Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Detection: Scenario II - Non-DGA
Basicdomainfeatures
Simplelexicalfeatures
Advancedlexicalfeatures
Allfeatures
0.00.20.40.60.81.0
TPR/RecallFPRPrecisionF1-score
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
10 Results
Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Detection: Scenario III - DGA
Basicdomainfeatures
Simplelexicalfeatures
Advancedlexicalfeatures
Allfeatures
0.00.20.40.60.81.0
TPR/RecallFPRPrecisionF1-score
Scenario III includes the benign non-DGA domains.
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
11 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Conclusion
Results
Malicious domains can be detectedby using lexical features only
I Detection of DGA-domains is particular promisingI Precision: 0.984. Recall: 0.984. F1-score: 0.948.
I Detection of Non-DGA-domains requires further workI Precision: 0.937. Recall: 0.795. F1-score: 0.860.
I Using all features outperforms the individual sets
GeneralI Vague distinction between DGA and Non-DGAI Labelled data of high quality is hard to come by
12
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
12 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Conclusion
Future workI Explore other lexical featuresI Explore effect of combining with non-lexical featuresI Improve performance for non-DGA domainsI More substantial data setsI Evaluate in a practical, online settingI Apply unsupervised machine learning
20
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
13 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Backup slides
20
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
14 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Existing Solutions
Existing SolutionsI ReputationI Traffic and activity (DNS, e-mail, ...)I Resilience/anonymity techniques
20
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
15 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Features
Basic Domain Features1. Number of domain levels (n-LD).2. Top level domain (TLD).3. Length of Fully Qualified Domain Name (FQDN).
Simple Lexical Features
1. Length of 2nd Level Domain (2-LD).2. Ratio of consonants in the 2-LD.3. Number of vowels in 2-LD.4. Number of numeric characters in 2-LD.5. Number of special characters in 2-LD.6. Ratio of special characters in 2-LD.
20
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
16 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Features
Advanced Lexical Features1. Language indicator (langid.py)2. Number of English words in 2-LD.3. Entropy of 2-LD.4. N-gram analysis of 2-LD (www.alexa.org) 1 .5. N-gram analysis of 2-LD (English dictionary).
1{3,4,5}-grams,
20
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
17 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Feature Analysis
Basic Domain Features
5 10 15 20 25 30 35NumLDs
0.0
0.2
0.4
0.6
0.8
1.0Number of tokens (n-LD), CDF
MaliciousNon-malicious
0 50 100 150 200 250LengthDomain
0.0
0.2
0.4
0.6
0.8
1.0Length of FQDN, CDF
MaliciousNon-malicious
com ru net
org
info biz
com.br
co.uk de in fr pl kz ir cn es nl it us ca jp me ro cz gr
co.kr be
co.za
com.ar ch
com.mx hu at xyz cl
xn--p1ai
com.tw
com.tr dk vn su
com.ua no sk
co.id
online ie
co.il
co.nz ae
0
20000
40000
60000MaliciousNon-malicious
com ru net org info02500005000007500001000000
20
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
18 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Feature Analysis
Simple Lexical Features
0 10 20 30 40 50 60Length2LD
0.0
0.2
0.4
0.6
0.8
1.0Length of 2-LD, CDF
MaliciousNon-malicious
0.0 0.2 0.4 0.6 0.8 1.0RatioConsonants
0.0
0.2
0.4
0.6
0.8
1.0Ratio of consonants, CDF
MaliciousNon-malicious
0 10 20 30 40NumberVowels
0.0
0.2
0.4
0.6
0.8
1.0Number of vowels, CDF
MaliciousNon-malicious
0 10 20 30 40 50 60NumberNumeric
0.0
0.2
0.4
0.6
0.8
1.0Number of numeric characters, CDF
MaliciousNon-malicious
0 10 20 30 40 50 60NumberSpecial
0.0
0.2
0.4
0.6
0.8
1.0Number of special, CDF
MaliciousNon-malicious
0.0 0.2 0.4 0.6 0.8 1.0RatioSpecial
0.0
0.2
0.4
0.6
0.8
1.0Ratio of special, CDF
MaliciousNon-malicious
20
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
19 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Feature Analysis
Advanced Lexical Featuresen de pl es fr mt it nl sl da fi pt sv hu et cs eu id lt zh ro sq cy sk no hr lv nn xh br az ca tr ms rw eo mg tl ar sw vi gl qu bs af nb ga zu vo ku wa jv oc ht lb is la se an fo ko ja
0
50000
100000
150000
200000MaliciousNon-malicious
en de pl es fr0500000
10000001500000
0 5 10 15 20 25 30NumberWords
0.0
0.2
0.4
0.6
0.8
1.0Number of English words, CDF
MaliciousNon-malicious
0 1 2 3 4 5Entropy2LD
0.0
0.2
0.4
0.6
0.8
1.0Entropy of 2-LD, CDF
MaliciousNon-malicious
0 100 200 300 400GramsAlexa2LD
0.0
0.2
0.4
0.6
0.8
1.0Ngram analysis of 2-LD (alexa.com), CDF
MaliciousNon-malicious
0 100 200 300 400 500GramsDict2LD
0.0
0.2
0.4
0.6
0.8
1.0Ngram analysis of 2-LD (English dictionary, CDF
MaliciousNon-malicious
20
Detection of maliciousdomains
through lexicalanalysis
Egon Kidmose et al.
Motivation
Assumption andMethod
Data
Features
Results
20 Conclusion
Dept. of Electronic SystemsAalborg University
Denmark
Detection: Mean performance
Scenario Features TPR/Recall FPR Precision F1-scoreI (Full) 1 9.09e-01 1.36e-01 9.25e-01 9.17e-01
2 7.93e-01 3.44e-01 8.10e-01 8.01e-013 9.06e-01 1.66e-01 9.10e-01 9.08e-01All 9.27e-01 5.35e-02 9.70e-01 9.48e-01
II (Non-DGA) 1 7.59e-01 1.10e-02 9.78e-01 8.55e-012 3.50e-01 6.69e-02 7.69e-01 4.81e-013 8.05e-01 9.16e-02 8.49e-01 8.26e-01All 7.95e-01 3.41e-02 9.37e-01 8.60e-01
III (DGA) 1 9.60e-01 1.21e-01 9.09e-01 9.34e-012 8.61e-01 2.08e-01 8.38e-01 8.50e-013 9.35e-01 8.40e-02 9.33e-01 9.34e-01All 9.84e-01 1.98e-02 9.84e-01 9.84e-01