Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
Learning to Detect Malicious URLs
Justin Ma, Lawrence Saul, Stefan Savage, Geoff VoelkerComputer Science & Engineering
UC San Diego
Presentation for GoogleMay 5, 2010
Malicious Web Sites
Spam-advertised goods
Trojan downloads
Phishing: which one is real?
3
Visiting Malicious Web Sites
Predict what is safe without
committing to risky actions
• Safe URL?
• Malicious download?
• Spam-advertised site?
• Phishing site?
URL = Uniform Resource Locator
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
http://fblight.com
http://mail.ru
http://www.cs.ucsd.edu/~jtma/
4
Problem in a Nutshell● URL features to identify malicious Web sites
● No context, no content
● Different classes of URLs● Benign, spam, phishing, exploits, scams...● For now, distinguish benign vs. malicious
facebook.com fblight.com
5
What we want...
6
How to build this service?http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
Blacklist
Hand-picked features (properties of URL)
Machine learning-based classifier
7
State of the Practice
● Current approaches● Blacklists
[SORBS, URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense]
● Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09]
● Limitations● Cannot learn from newest examples quickly● Cannot quickly adapt to newest features
● Arms race: fast feedback cycle is critical
More automated approach?
8
Contributions
● URL classification system● Used by large Web mail providers
● Practical large-scale machine learning in computer security● Large feature sets + online learning● Related work: Whittaker, Ryner, Nazif, “Large-Scale
Automatic Classification of Phishing Pages” (NDSS'10)
9
Today's Talk
● Problem● Moving Beyond Blacklists● Large-Scale Online Learning● Conclusion
10
Today's Talk
● Problem● Moving Beyond Blacklists● Large-Scale Online Learning● Conclusion
11
Moving Beyond Blacklists
● Preliminary study: Do our features work?● Batch algorithms for now● 104 examples and features
● Outline● System overview● Features focus of this segment←● Experimental results
[Ma, Saul, Savage, Voelker (KDD 2009)]
12
URL Classification System
Label Example Hypothesis
13
Data Sets
● Malicious URLs● 5,000 from PhishTank (phishing)● 15,000 from Spamscatter (spam, phishing, etc)
● Benign URLs● 15,000 from Yahoo Web directory● 15,000 from DMOZ directory
● Malicious x Benign → 4 Data Sets● 30,000 – 55,000 features per data set
14
Algorithms
● Logistic regression w/ L1-norm regularization
● Other models● Naive Bayes● Support vector machines (linear, RBF kernels)
● Implicit feature selection● Easier to interpret
15
Features
Example
16
Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...
[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical
17
Features to consider?
1)Blacklists
2)Simple heuristics
3)Domain name registration
4)Host properties
5)Lexical
18
(1) Blacklist Queries● List of known malicious sites● Providers: SORBS, URIBL, SURBL, Spamhaus● Not comprehensive
http://www.bfuduuioo1fp.mobiIn blacklist?
Yes
http://fblight.com
No
In blacklist?
http://www.bfuduuioo1fp.mobi
Blacklist queries as features
........................................
........................................
19
stopgap.cn registered 2 May 2010
(2) Manually-Selected Features
● Considered by previous studies● IP address in hostname?● Number of dots in URL● WHOIS (domain name) registration date
[Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]
http://72.23.5.122/www.bankofamerica.com/
http://www.bankofamerica.com.qytrpbcw.stopgap.cn/
20
(3) WHOIS Features
● Domain name registration● Date of registration, update, expiration● Registrant: Who registered domain?● Registrar: Who manages registration?
http://sleazysalmon.com
http://angryalbacore.com
http://mangymackerel.com
http://yammeringyellowtail.com
Registered on4 May 2010
By SpamMedia
21
(4) Host-Based Features
● Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)
● WHOIS: registrar, registrant, dates
● IP address: Which ASes/IP prefixes?
● DNS: TTL? PTR record exists/resolves?
● Geography-related: Locale? Connection speed?
75.102.60.0/2269.63.176.0/20
facebook.com fblight.com
Bad Parts of the Internet
22
(5) Lexical Features
● Tokens in URL hostname + path● Length of URL● Number of dots
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
[Kolari et al., 2006]
23
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
4,000
# Features
13,000
4
7
17,000
More features Better accuracy→
Error rate (%)
24
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
Full 96—99% accuracy
4,000
# Features
13,000
4
7
17,000
30,000
Error rate (%)
25
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
Full
w/o WHOIS/Blacklist
4,000
# Features
13,000
4
7
17,000
30,000
26,000
Error rate (%)
Blacklists and WHOIS are not comprehensive
26
Beyond Blacklists
Blacklist
Full features
Yah
oo
-Ph
ish
Tan
k
Higher detection rate for given false positive rate
27
Summary
● Detect malicious URLs with high accuracy● Only using URL● Diverse feature set helps: 99% w/ 30,000+ features● Model analysis (more in KDD'09 paper)
● What about scalable and adaptive malicious URL detection system?
28
Today's Talk
● Problem● Moving Beyond Blacklists● Large-Scale Online Learning● Conclusion
29
URL Classification System
Label Example Hypothesis
30
Live URL Classification System
Label Example Hypothesis
31
Large-Scale Online Learning
● How do we scale to live, large-scale data?● Outline
● Live training feed● Challenges: scale and non-stationarity● Need for large, fresh training sets● Online learning
[Ma, Saul, Savage, Voelker (ICML 2009)]
32
Live Training Feed
● Malicious URLs (spamming and phishing)● 6,000—7,500 per day from Web mail provider
● Benign URLs● From Yahoo Web directory
● Total of 20,000 URLs per day● Collected Jan – May, 2009
● 120 days● More than two million examples
33
Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...
[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical60+ features 1.1 million 1.8 million
GROWING
Day 100
34
New Features Encountered All the Time
● Many binary features● Enumerating tokens, ISPs, registrants, etc...● 2.9 million features after 100 days
35
Live URL Classficiation System
Online learning
36
Practical Challenges of ML in Systems
● Industrial concerns● Scale: millions of examples, features● Non-stationarity: examples change over time (arms
race w/ criminals)
● Pivotal decision: batch or online?
37
Batch vs. Online Learning
● Batch/offline learning● SVM, logistic regression,
decision trees, etc● Multiple passes over data ● No incremental updates● Potentially high memory
and processing overhead
● Online learning● Perceptron-style
algorithms● Single pass over data● Incremental updates● Low memory and
processing overhead
Online learning addresses scale and non-stationarity
38
Evaluations
● Online learning for URL reputation● Need for large, fresh training sets● Comparing online algorithms● Continuous retraining● Growing feature vector
39
Need lots of fresh training data?SVM trained once
SVM retrained daily
● Fresh data helps
40
Need lots of fresh training data?
● Fresh data helps● More data helps
SVM trained once on 2 weeks
SVM w/ 2-week sliding window
41
Which online algorithm?
● Perceptron
● Stochastic Gradient Descent for Logistic Regression
● Confidence-Weighted Learning
42
Perceptron
● Convergence result:
[Rosenblatt, 1958]
+−+ +
++
+−
−−
−
−
● Update on each mistake:
radius
margin
Number of mistakes ≤
43
Logistic Regression with SGD
● Log likelihood:
where
[Bottou, 1998]
● For every example:
Proportional
44
Confidence-Weighted Learning
● Maintain Gaussian distribution over weight vector:
[Dredze et al., 2008] [Crammer et al., 2009]
● Constrained problem:
● Closed-form update:Update features at different rates
Diagonal covariance matrix
for evals
45
Which online algorithms?
Perceptron
46
Which online algorithms?
LR w/ SGD
● Proportional update helps
Perceptron
47
Which online algorithms?
● Proportional update helps● Per-feature confidence really helps
Confidence-Weighted
LR w/ SGD
Perceptron
48
Batch...
● Fresh data helps● More data helps
Batch
49
Batch vs. Online
Confidence-Weighted
Batch
● Fresh data helps● More data helps● Online matches batch
50
Why online does well?
SVM w/ 2-week sliding window
Confidence-Weighted
51
Why online does well?
Confidence-Weightedonce-a-day
● More data eventually helps● Continuous retraining helps
SVM w/ 2-week sliding window
Confidence-Weighted
52
Growing feature vector?
Confidence-Weightedfixed features
53
Growing feature vector?
Confidence-Weightedfixed features
Confidence-Weightedgrowing features
● Growing feature vector helps
54
Fixed vs. Variable Features
Perceptronfixed features
● Growing feature vector helps
55
Fixed vs. Variable Features
Perceptronfixed features
● Growing feature vector helps● Growing + CW really helps
Perceptrongrowing features
56
Proof-of-Concept Plugin
57
Examine URL details...
58
It sure looks like phishing...The Real Site
59
Blacklisted later on
60
Summary
● Detecting malicious URLs● Relevant real-world problem● Successful application of online learning
● What helps?● More, fresher data● Continuous retraining● Growing feature vector
● Confidence-Weighted vs. Batch● As accurate● More adaptive● Fewer resources
61
Today's Talk
● Problem● Moving Beyond Blacklists● Large-Scale Online Learning● Conclusion
62
Impact
● Public data set (UCI ML repo)● Industrial impact
● Mail providers have adopted our approach for classifying URLs in email messages
Project Info + Data Sethttp://sysnet.ucsd.edu/projects/url/
63
Final Thoughts: Systems + ML
● Systems: high-impact, large-scale applications● ML: Methodical approaches● Systems: Embrace real-world constraints● ML: More than “plug-and-play” solutions
Systems ↔ Machine Learning