Learning to Detect Malicious URLscseweb.ucsd.edu/~jtma/google_talk/jtma-google10.pdf37 Batch vs. Online Learning Batch/offline learning SVM, logistic regression, decision trees, etc

Learning to Detect Malicious URLs

Justin Ma, Lawrence Saul, Stefan Savage, Geoff VoelkerComputer Science & Engineering

UC San Diego

Presentation for GoogleMay 5, 2010

Malicious Web Sites

Spam-advertised goods

Trojan downloads

Phishing: which one is real?

3

Visiting Malicious Web Sites

Predict what is safe without

committing to risky actions

• Safe URL?

• Malicious download?

• Spam-advertised site?

• Phishing site?

URL = Uniform Resource Locator

http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

http://fblight.com

http://mail.ru

http://www.cs.ucsd.edu/~jtma/


http://fblight.com/

4

Problem in a Nutshell● URL features to identify malicious Web sites

● No context, no content

● Different classes of URLs● Benign, spam, phishing, exploits, scams...● For now, distinguish benign vs. malicious

facebook.com fblight.com

5

What we want...

6

How to build this service?http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

Blacklist

Hand-picked features (properties of URL)

Machine learning-based classifier

7

State of the Practice

● Current approaches● Blacklists

[SORBS, URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense]

● Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09]

● Limitations● Cannot learn from newest examples quickly● Cannot quickly adapt to newest features

● Arms race: fast feedback cycle is critical

More automated approach?

8

Contributions

● URL classification system● Used by large Web mail providers

● Practical large-scale machine learning in computer security● Large feature sets + online learning● Related work: Whittaker, Ryner, Nazif, “Large-Scale

Automatic Classification of Phishing Pages” (NDSS'10)

9

Today's Talk

● Problem● Moving Beyond Blacklists● Large-Scale Online Learning● Conclusion

10

Today's Talk


11

Moving Beyond Blacklists

● Preliminary study: Do our features work?● Batch algorithms for now● 104 examples and features

● Outline● System overview● Features focus of this segment←● Experimental results

[Ma, Saul, Savage, Voelker (KDD 2009)]

12

URL Classification System

Label Example Hypothesis

13

Data Sets

● Malicious URLs● 5,000 from PhishTank (phishing)● 15,000 from Spamscatter (spam, phishing, etc)

● Benign URLs● 15,000 from Yahoo Web directory● 15,000 from DMOZ directory

● Malicious x Benign → 4 Data Sets● 30,000 – 55,000 features per data set

14

Algorithms

● Logistic regression w/ L1-norm regularization

● Other models● Naive Bayes● Support vector machines (linear, RBF kernels)

● Implicit feature selection● Easier to interpret

15

Features

Example

16

Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...

[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical

17

Features to consider?

1)Blacklists

2)Simple heuristics

3)Domain name registration

4)Host properties

5)Lexical

18

(1) Blacklist Queries● List of known malicious sites● Providers: SORBS, URIBL, SURBL, Spamhaus● Not comprehensive

http://www.bfuduuioo1fp.mobiIn blacklist?

Yes

http://fblight.com

No

In blacklist?

http://www.bfuduuioo1fp.mobi

Blacklist queries as features

........................................

........................................

19

stopgap.cn registered 2 May 2010

(2) Manually-Selected Features

● Considered by previous studies● IP address in hostname?● Number of dots in URL● WHOIS (domain name) registration date

[Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]

http://72.23.5.122/www.bankofamerica.com/

http://www.bankofamerica.com.qytrpbcw.stopgap.cn/

20

(3) WHOIS Features

● Domain name registration● Date of registration, update, expiration● Registrant: Who registered domain?● Registrar: Who manages registration?

http://sleazysalmon.com

http://angryalbacore.com

http://mangymackerel.com

http://yammeringyellowtail.com

Registered on4 May 2010

By SpamMedia

21

(4) Host-Based Features

● Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)

● WHOIS: registrar, registrant, dates

● IP address: Which ASes/IP prefixes?

● DNS: TTL? PTR record exists/resolves?

● Geography-related: Locale? Connection speed?

75.102.60.0/2269.63.176.0/20

facebook.com fblight.com

Bad Parts of the Internet

22

(5) Lexical Features

● Tokens in URL hostname + path● Length of URL● Number of dots


[Kolari et al., 2006]

23

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

4,000

# Features

13,000

4

7

17,000

More features Better accuracy→

Error rate (%)

24

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

Full 96—99% accuracy

4,000

# Features

13,000

4

7

17,000

30,000

Error rate (%)

25

Which feature sets?

Blacklist

Manual

WHOIS

Host-based

Lexical

Full

w/o WHOIS/Blacklist

4,000

# Features

13,000

4

7

17,000

30,000

26,000

Error rate (%)

Blacklists and WHOIS are not comprehensive

26

Beyond Blacklists

Blacklist

Full features

Yah

oo

-Ph

ish

Tan

k

Higher detection rate for given false positive rate

27

Summary

● Detect malicious URLs with high accuracy● Only using URL● Diverse feature set helps: 99% w/ 30,000+ features● Model analysis (more in KDD'09 paper)

● What about scalable and adaptive malicious URL detection system?

28

Today's Talk


29

URL Classification System


30

Live URL Classification System


31

Large-Scale Online Learning

● How do we scale to live, large-scale data?● Outline

● Live training feed● Challenges: scale and non-stationarity● Need for large, fresh training sets● Online learning

[Ma, Saul, Savage, Voelker (ICML 2009)]

32

Live Training Feed

● Malicious URLs (spamming and phishing)● 6,000—7,500 per day from Web mail provider

● Benign URLs● From Yahoo Web directory

● Total of 20,000 URLs per day● Collected Jan – May, 2009

● 120 days● More than two million examples

33

Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll

WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...

[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical60+ features 1.1 million 1.8 million

GROWING

Day 100

34

New Features Encountered All the Time

● Many binary features● Enumerating tokens, ISPs, registrants, etc...● 2.9 million features after 100 days

35

Live URL Classficiation System

Online learning

36

Practical Challenges of ML in Systems

● Industrial concerns● Scale: millions of examples, features● Non-stationarity: examples change over time (arms

race w/ criminals)

● Pivotal decision: batch or online?

37

Batch vs. Online Learning

● Batch/offline learning● SVM, logistic regression,

decision trees, etc● Multiple passes over data ● No incremental updates● Potentially high memory

and processing overhead

● Online learning● Perceptron-style

algorithms● Single pass over data● Incremental updates● Low memory and

processing overhead

Online learning addresses scale and non-stationarity

38

Evaluations

● Online learning for URL reputation● Need for large, fresh training sets● Comparing online algorithms● Continuous retraining● Growing feature vector

39

Need lots of fresh training data?SVM trained once

SVM retrained daily

● Fresh data helps

40

Need lots of fresh training data?

● Fresh data helps● More data helps

SVM trained once on 2 weeks

SVM w/ 2-week sliding window

41

Which online algorithm?

● Perceptron

● Stochastic Gradient Descent for Logistic Regression

● Confidence-Weighted Learning

42

Perceptron

● Convergence result:

[Rosenblatt, 1958]

+−+ +

++

+−

−−

−

−

● Update on each mistake:

radius

margin

Number of mistakes ≤

43

Logistic Regression with SGD

● Log likelihood:

where

[Bottou, 1998]

● For every example:

Proportional

44

Confidence-Weighted Learning

● Maintain Gaussian distribution over weight vector:

[Dredze et al., 2008] [Crammer et al., 2009]

● Constrained problem:

● Closed-form update:Update features at different rates

Diagonal covariance matrix

for evals

45

Which online algorithms?

Perceptron

46


LR w/ SGD

● Proportional update helps

Perceptron

47


● Proportional update helps● Per-feature confidence really helps

Confidence-Weighted

LR w/ SGD

Perceptron

48

Batch...

● Fresh data helps● More data helps

Batch

49

Batch vs. Online

Confidence-Weighted

Batch

● Fresh data helps● More data helps● Online matches batch

50

Why online does well?


Confidence-Weighted

51

Why online does well?

Confidence-Weightedonce-a-day

● More data eventually helps● Continuous retraining helps


Confidence-Weighted

52

Growing feature vector?

Confidence-Weightedfixed features

53

Growing feature vector?

Confidence-Weightedfixed features

Confidence-Weightedgrowing features

● Growing feature vector helps

54

Fixed vs. Variable Features

Perceptronfixed features

● Growing feature vector helps

55

Fixed vs. Variable Features

Perceptronfixed features

● Growing feature vector helps● Growing + CW really helps

Perceptrongrowing features

56

Proof-of-Concept Plugin

57

Examine URL details...

58

It sure looks like phishing...The Real Site

59

Blacklisted later on

60

Summary

● Detecting malicious URLs● Relevant real-world problem● Successful application of online learning

● What helps?● More, fresher data● Continuous retraining● Growing feature vector

● Confidence-Weighted vs. Batch● As accurate● More adaptive● Fewer resources

61

Today's Talk


62

Impact

● Public data set (UCI ML repo)● Industrial impact

● Mail providers have adopted our approach for classifying URLs in email messages

Project Info + Data Sethttp://sysnet.ucsd.edu/projects/url/

63

Final Thoughts: Systems + ML

● Systems: high-impact, large-scale applications● ML: Methodical approaches● Systems: Embrace real-world constraints● ML: More than “plug-and-play” solutions

Systems ↔ Machine Learning

Documents

Learning to Detect Malicious URLscseweb.ucsd.edu/~jtma/google_talk/jtma-google10.pdf37 Batch vs. Online Learning Batch/offline learning SVM, logistic regression, decision trees, etc