URLDoc: Learning to Detect Malicious URLs using Online Logistic Regression

Presented by:

Mohammed Nazim Feroz

11/26/2013

Motivation

Web services drive new opportunities for people to interact, but they also create new opportunities for criminals

Google detects about 300,000 malicious websites per month, a clear indication that criminals are exploiting these opportunities

Almost all online threats have something in common: they require the user to click on a hyperlink or type in a website address

Motivation

The user needs to perform sanity checks and assess the risk of visiting a URL

Performing such an evaluation might be impossible for a novice user

As a result, users often click links without paying close attention to the URLs, which leaves them vulnerable to exploitation by malicious websites

Introduction

Openness of the web exposes opportunities for criminals to upload malicious content

Do techniques exist to prevent malicious content from entering the web?

Current Techniques

Security practitioners have developed techniques such as blacklisting in order to protect users from malicious websites

Although this approach has minimal overhead, it does not provide complete protection, as only about 55% of malicious URLs are present in blacklists

Another drawback of this approach is that malicious websites are not a part of the blacklist during the period before their detection

Current Techniques

Security researchers have done extensive research on detecting social network accounts that are used to spread malicious messages

This approach still does not fully protect users in real-time settings such as social networks, because building a profile of malicious activity can take a considerable amount of time

Current Techniques

The researchers behind TokDoc have used a method that decides on a per-token basis whether a token requires automatic healing

Their work uses n-grams and length as features for detecting malicious URLs

This research builds on their idea by supplementing a subset of their features with host-based features, since the latter carry a wealth of usable information

Approach

URLDoc classifies URLs automatically based on the lexical (textual) and host-based features

Scalable machine learning algorithms from Mahout are used to develop and test the classifier

Online learning is preferred over batch learning

The classifier achieves 93-97% accuracy by detecting a large number of malicious hosts, with a modest false positive rate

Approach

If the predictor variables are correctly identified and each URL's metadata is carefully derived, the machine learning algorithms used can sift through tens of thousands of features

Online algorithms are preferred over batch-learning algorithms

Batch learning algorithms look at every example in the training dataset on every step before updating the weights of the classifier, which is a costly operation if the number of training examples is large

Approach

Online algorithms update the weights according to the gradient of the error with respect to a single training example

Online algorithms are able to process datasets far more efficiently than batch algorithms
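
To make the per-example update concrete, here is a minimal sketch of online logistic regression trained with stochastic gradient descent. It is not the Mahout implementation used in this work: the function names, the toy data, the learning rate, and the number of epochs are all assumptions chosen for illustration.

```python
import math
import random


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    # Estimated probability that feature vector x is malicious (label 1).
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def sgd_step(weights, x, y, learning_rate):
    # One online update using the log-loss gradient of a single example:
    # w <- w - learning_rate * (h(x) - y) * x
    error = predict(weights, x) - y
    return [w - learning_rate * error * xi for w, xi in zip(weights, x)]


def train_online(examples, dim, learning_rate=0.5, epochs=50, seed=0):
    # Hypothetical training loop; hyperparameters are illustrative only.
    rng = random.Random(seed)
    weights = [0.0] * dim
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)  # visit the examples in random order
        for x, y in examples:
            weights = sgd_step(weights, x, y, learning_rate)
    return weights


# Made-up toy data: [bias, feature_a, feature_b] -> label (1 = malicious).
toy = [
    ([1.0, 1.0, 0.0], 1),
    ([1.0, 1.0, 1.0], 1),
    ([1.0, 0.0, 1.0], 0),
    ([1.0, 0.0, 0.0], 0),
]
weights = train_online(toy, dim=3)
```

Each call to `sgd_step` touches only one example, so the cost of an update does not grow with the size of the training set. A batch method would sum the same gradient over every example before changing the weights.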

Problem Formulation

URL classification lends itself naturally to binary classification

The target variable y(i) can take one of two possible values: malicious or benign

With k predictor variables across all feature categories, each URL i is described by x1(i), …, xk(i), which form a k-dimensional feature vector characterizing the URL

The goal is to learn a function h(x)=y that maps the space of input values to the space of output values so that h(x) is a good predictor for the corresponding value of y

Problem Formulation

Building a classification system involves two main phases

The first phase creates the model (i.e. the function h(x)) produced by the learning algorithm

The second phase makes use of that model to assign new data from the test dataset to its predicted target class

The selection of the training dataset and its predictor variables, the target classes, and the learning algorithm is vital in the first phase of building the classification system

Predicted labels are compared with known answers to evaluate the classifier
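
As an illustration of the evaluation step, the sketch below compares predicted labels with known labels and reports accuracy and false positive rate. The function name `evaluate` and the sample labels are hypothetical and are not results from this research.

```python
def evaluate(predicted, actual, positive="malicious"):
    """Compare predicted labels with known labels for a binary classifier."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1  # malicious URL correctly flagged
        elif p == positive:
            fp += 1  # benign URL wrongly flagged
        elif a == positive:
            fn += 1  # malicious URL missed
        else:
            tn += 1  # benign URL correctly passed

    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {"accuracy": accuracy, "false_positive_rate": false_positive_rate}


# Made-up labels for five test URLs.
actual = ["malicious", "malicious", "benign", "benign", "benign"]
predicted = ["malicious", "benign", "benign", "malicious", "benign"]
report = evaluate(predicted, actual)
```

Here three of the five labels match, and one of the three benign URLs is wrongly flagged.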

Overview of Features

Lexical features

These features have values of both types: binary and continuous

These features include:

  Length of the URL
  Number of dots in the URL
  Tokens present in the hostname, primary domain, and path parts of a URL

Features in the hostname are further characterized as bigrams

Bigrams can capture patterns in character strings that are permuted randomly but occur in certain combinations

Example: www.depts.ttu.edu

Bigrams: depts ttu, ttu edu
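
The lexical features can be sketched in a few lines. This is an illustrative sketch, not the extraction code used in this research: the function name `lexical_features` and the `drop_www` flag are assumptions, and dropping a leading "www" token is assumed only so that the output matches the example on this slide. Primary-domain tokens are left out because identifying the primary domain needs extra data that the slides do not describe.

```python
from urllib.parse import urlsplit


def lexical_features(url, drop_www=True):
    # urlsplit only recognizes the hostname when a scheme is present,
    # so a default scheme is added for bare hostnames.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    host_tokens = [t for t in host.split(".") if t]
    # Assumption: ignore a leading "www" token, as in the slide's example.
    if drop_www and host_tokens and host_tokens[0] == "www":
        host_tokens = host_tokens[1:]

    # Bigrams are pairs of adjacent hostname tokens.
    host_bigrams = [f"{a} {b}" for a, b in zip(host_tokens, host_tokens[1:])]
    path_tokens = [t for t in parts.path.split("/") if t]

    return {
        "url_length": len(url),
        "num_dots": url.count("."),
        "host_tokens": host_tokens,
        "host_bigrams": host_bigrams,
        "path_tokens": path_tokens,
    }


features = lexical_features("www.depts.ttu.edu")
```

For the example hostname, `features["host_bigrams"]` contains "depts ttu" and "ttu edu".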


Host-based features

  IP address of the URL (A record)
  IP address of the Mail Exchanger (MX record)
  IP address of the Name Server (NS record)
  PTR record
  AS number
  IP prefix


Malicious websites have exhibited a pattern of being hosted in a particular “bad” portion of the Internet

Example: McColo provided hosting for major botnets, which in turn were responsible for sending 41% of the world’s spam just before McColo’s takedown in November 2008. McColo’s AS number was 26780

These portions of the internet can be characterized on a regular basis by retraining on the predictor variables

This makes it possible to keep track of concept drift

Online Logistic Regression with SGD

Logistic regression is a very flexible algorithm, as it allows the predictor variables to be of both types: continuous and binary

Mahout greatly helps in the learning process by choosing an optimum learning rate and thus allowing the classification system to converge to the global minimum

Online Logistic Regression with SGD

Compared to batch learning, online learning is usually much faster, adapts to changes continuously, and is much better suited to large training and test datasets

Support Vector Machines were considered but not chosen, since they take longer to train than Online Logistic Regression

Online Logistic Regression converges more quickly if malicious and benign URLs from the training dataset are presented in a random order

Feature Vector

Feature hashing is used in order to encode the raw feature data into feature vectors

In this approach, a reasonable size (i.e. dimension) is picked for the feature vector and the data is put into feature vectors of the chosen size

After carefully considering the datasets, the feature vectors in this research use a 100,000-dimensional space
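
The idea behind feature hashing can be sketched as follows. This is a simplified illustration and not the Mahout encoders used in this research, which differ in detail. The helper names, the choice of MD5 as the hash function, and the sample feature names are assumptions.

```python
import hashlib

VECTOR_SIZE = 100_000  # dimension of the hashed feature space


def bucket(name, size=VECTOR_SIZE):
    # Map a feature name to a stable index in [0, size).
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def hash_features(features, size=VECTOR_SIZE):
    """Encode {feature_name: value} as a sparse {index: value} vector."""
    vector = {}
    for name, value in features.items():
        idx = bucket(name, size)
        # Features that collide on the same index are simply added together.
        vector[idx] = vector.get(idx, 0.0) + float(value)
    return vector


# Hypothetical raw features for one URL.
raw = {"host_token=ttu": 1, "host_bigram=ttu edu": 1, "url_length": 17}
encoded = hash_features(raw)
```

Because the index comes from a hash of the feature name, no dictionary of all known features has to be built or stored, and previously unseen tokens still land somewhere in the fixed-size vector.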

Feature Vector Example

The data is encoded into the feature vector as continuous, categorical, word-like, and text-like features using the Mahout API

Results

[Figures: results for the 90/10 and 80/20 training/test dataset splits]

Results

[Figure: results for the 50/50 training/test dataset split; label: Benign:Malicious]

Other Approaches Attempted

Term Frequency-Inverse Document Frequency (TF-IDF)

  A bag-of-words approach was used and a term-document matrix was created, with lexical features as terms and URLs as documents
  Online Logistic Regression is not affected by good word weighting

Clustering

  The URLs are viewed as a set of vectors in vector space
  Cosine similarity was used as the similarity measure between URLs
  This research focused on classification over clustering since the target classes of the URLs were known; clustering is known to be useful when the target classes are unknown
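
For reference, cosine similarity between two URLs represented as sparse vectors can be computed as in the sketch below. The vectors and token names are made up for illustration and are not data from this research.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an empty vector as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical token-count vectors for three URLs.
url_a = {"ttu": 1.0, "edu": 1.0}
url_b = {"ttu": 1.0, "edu": 1.0, "depts": 1.0}
url_c = {"example": 1.0}

sim_ab = cosine_similarity(url_a, url_b)
sim_ac = cosine_similarity(url_a, url_c)
```

URLs that share tokens score close to 1, while URLs with no tokens in common score 0.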

Future Work

Study the various features extensively and use only those with the highest contributions

Add new features that would improve classification

Try to use algorithms that can benefit from parallelization

Summary

A reliable framework for the classification of URLs is built

A supervised learning method is used to learn the characteristics of both malicious and benign URLs and classify them in real time

The applicability and usefulness of Mahout for the URL classification task are demonstrated, and the benefits of using an online setting over a batch setting are illustrated: the online setting enabled learning new trends in the characteristics of URLs over time

Questions?
