
  • Spam Email Detection: A Comparative Study

    Jincheng Zhang, Yan Liu

    December 6, 2013

    Supervised by Prof. Laiwan Chan.

  • Contents

    1 Introduction
    2 Data Set
        2.1 Data Set Source
        2.2 Data Set Description
    3 Approaches
        3.1 Naïve Bayes Classification
        3.2 Support Vector Machine
        3.3 Artificial Neural Network
        3.4 Combined Method
    4 Performance Evaluation
        4.1 Training Set and Testing Set
        4.2 Performance Metrics
        4.3 Experimental Results
            4.3.1 Different Kernel Functions of SVM
            4.3.2 Different Spam Detection Algorithms
    5 Conclusion

  • Chapter 1

    Introduction

    Nowadays, with the rapid development of information technology, email has become an important way for people to communicate and share information. According to a recent email statistics report [1], the total number of worldwide email accounts is expected to grow from 3.9 billion in 2013 to over 4.9 billion by the end of 2017, and the number of emails sent and received daily is expected to grow from 182.9 billion in 2013 to over 200 billion by the end of 2017. However, according to a spam trends and statistics report by Kaspersky, over 70% of global email traffic in Q2 2013 came from spam [2].

    Spam emails cause considerable inconvenience, annoyance, and potential danger to users. They waste limited mailbox space and network bandwidth, and users must spend time reading and identifying them. Moreover, spam can cause significant financial loss: in the US alone, spam emails have been reported to result in direct financial losses exceeding 1 billion dollars per year [3]. In addition, an email service provider that cannot detect spam may lose its users. Motivated by these observations, spam detection is important for both users and email service providers.

    In this project, we address the spam detection problem (i.e., classifying an email as spam or non-spam) using three commonly used machine learning techniques: Naïve Bayes classification (NB), Support Vector Machine (SVM), and Artificial Neural Network (ANN). For SVM, we try four types of kernel function: linear, polynomial, radial basis function (RBF), and sigmoid. Furthermore, we combine the output results of NB, SVM, and ANN to see how the combined method performs. Finally, we compare the performance of NB, SVM, ANN, and the combined method, and give some guidelines on how to select machine learning algorithms when conducting spam detection.

    The remainder of this report is structured as follows. Chapter 2 describes the data set we use in this project. Chapter 3 states the principles, settings, and implementation details of the three machine learning techniques. Chapter 4 presents the main results and discusses how to select spam detection algorithms. Chapter 5 concludes the report.


  • Chapter 2

    Data Set

    2.1 Data Set Source

    In this project, we use the UCI Spambase data set [4], a classical data set for the spam detection problem. The spam emails in this collection came from the postmaster and from individuals who had filed spam; the non-spam emails came from filed work and personal emails. General information about this data set is shown in Figure 2.1.

    Figure 2.1: UCI Spambase Data Set

    2.2 Data Set Description

    The UCI Spambase data set has 4601 instances. Among them, 1813 instances are spam and 2788 instances are non-spam. Each instance has 58 attributes.

    The first 48 attributes are continuous real numbers ranging from 0 to 100, with names of the form word_freq_WORD. The value of each attribute is the percentage of words in the email that match WORD, as defined in equation (2.1):

    100 × (number of times WORD appears in the email) / (total number of words in the email)    (2.1)


  • A word in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

    Right after the first 48 attributes, there are 6 attributes, also continuous real numbers ranging from 0 to 100, with names of the form char_freq_CHAR. The value of each attribute is the percentage of characters in the email that match CHAR, as shown in equation (2.2):

    100 × (number of occurrences of CHAR in the email) / (total number of characters in the email)    (2.2)

    Then we have 1 continuous real number attribute named capital_run_length_average. The value of this attribute is the average length of uninterrupted sequences of capital letters. Obviously, this value is always at least 1.

    Similar to the attribute capital_run_length_average, we have 1 continuous integer attribute named capital_run_length_longest. The value of this attribute is the length of the longest uninterrupted sequence of capital letters.

    Following that is a continuous integer attribute named capital_run_length_total. The value of this attribute is the sum of the lengths of all uninterrupted sequences of capital letters, i.e., the total number of capital letters in the email.

    The last attribute is a nominal {0, 1} class attribute: 1 denotes that the email is considered spam, and 0 denotes that it is non-spam.
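    The feature definitions above can be sketched in code. The following Python functions are illustrative only (the UCI data set ships precomputed values, so nothing like this is needed to reproduce our experiments), and the function names are our own:

```python
import re

# Illustrative sketch of the Spambase-style features from equations (2.1)
# and (2.2) and the three capital-run attributes, computed from raw text.
def word_freq(text, word):
    words = re.findall(r"[A-Za-z0-9]+", text)        # alphanumeric runs
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * hits / len(words)                 # equation (2.1)

def char_freq(text, ch):
    return 100.0 * text.count(ch) / len(text) if text else 0.0  # equation (2.2)

def capital_runs(text):
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
    avg = sum(runs) / len(runs) if runs else 0.0
    return avg, max(runs, default=0), sum(runs)      # average, longest, total
```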

    In the training set, the number of spam emails and the number of non-spam emails should not differ too much. For example, if the number of spam emails is much larger than the number of non-spam emails in the training set, the classification model might learn to always classify instances in the testing set as 1, simply because that would be true most of the time for such a skewed training set. Notice that in the UCI Spambase data set, the first 1813 instances are all spam and the remaining instances are all non-spam. Therefore the UCI Spambase data set cannot be used directly as input to the training algorithm; details on how we divide it into a training set and a testing set are given in Chapter 4.


  • Chapter 3

    Approaches

    3.1 Naïve Bayes Classification

    Naïve Bayes classification is one of the most frequently used supervised learning algorithms for text categorization [8], [9]. The approach begins with a statistical analysis of a set of emails that have already been labeled as spam or non-spam (i.e., the training set) to obtain the probability distribution of each attribute. Then, when a new email comes in, we use the information gathered from the training set to compute the probability that the email is spam and the probability that it is non-spam, and classify the email into the class with the higher probability.

    Given the input feature vector x = (x1, x2, ..., xn) of an email, where xi is the value of the ith attribute and n is the number of attributes in the data set, let Y denote the class to be predicted, with Y = yi, where yi is 0 or 1 (0 denotes non-spam and 1 denotes spam). According to Bayes' rule, the probability that the input vector x belongs to class yi is as follows:

    P(Y = yi | x) = P(Y = yi) · P(x | Y = yi) / P(x)    (3.1)

    where P(x) denotes the probability that a randomly picked email has vector x as its representation, P(Y = yi) is the probability that a randomly picked email is from class yi, and P(x | Y = yi) denotes the probability that a randomly picked email from class yi has x as its representation. In Naïve Bayes classification, all features are assumed to be conditionally independent given the class yi. Thus, P(x | Y = yi) can be decomposed into

    P(x | Y = yi) = ∏_{i=1}^{n} P(xi | Y = yi)    (3.2)


  • Therefore, determining the class of an input vector x using the NB classifier can be formulated as follows:

    ŷ = arg max_{yi} P(Y = yi | x)    (3.3)

    The input vector x is classified as spam if P(Y = 1) ∏_{i=1}^{n} P(xi | Y = 1) > P(Y = 0) ∏_{i=1}^{n} P(xi | Y = 0); otherwise, x is classified as non-spam.

    In this project, we implement Naïve Bayes classification by modifying existing code from [13].
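    As an illustration of equations (3.1)-(3.3), the following is a minimal Gaussian Naïve Bayes sketch in Python. It is not the code from [13], and modeling the continuous attributes as per-class Gaussians is an assumption made for this sketch:

```python
import numpy as np

# Minimal Gaussian Naive Bayes sketch: each feature is modeled as an
# independent Gaussian per class, and log-posteriors are compared as in
# equation (3.3).
class GaussianNB:
    def fit(self, X, y):
        X, y = np.asarray(X, float), np.asarray(y)
        self.classes_ = np.unique(y)
        self.prior_, self.mean_, self.var_ = [], [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.prior_.append(len(Xc) / len(X))      # P(Y = c)
            self.mean_.append(Xc.mean(axis=0))
            self.var_.append(Xc.var(axis=0) + 1e-9)   # variance smoothing
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        scores = []
        for p, m, v in zip(self.prior_, self.mean_, self.var_):
            # log P(Y=c) + sum_i log P(xi | Y=c), cf. equations (3.2)-(3.3)
            ll = -0.5 * np.sum(np.log(2 * np.pi * v) + (X - m) ** 2 / v, axis=1)
            scores.append(np.log(p) + ll)
        return self.classes_[np.argmax(scores, axis=0)]
```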

    3.2 Support Vector Machine

    SVM is extensively used in spam detection [5], [6]. The basic idea behind SVM is to find the optimal separating hyperplane that gives the maximum margin between two different classes (e.g., {0, 1}). Based on this idea, spam filtering can be viewed as a simple SVM application, i.e., binary classification of emails as spam or non-spam.

    Given a set of training samples (xi, yi), i = 1, ..., l, where xi is a feature vector and yi ∈ {+1, −1} is its class label, SVM solves the following optimization problem:

    min over w, b, ξ of (1/2) w^T w + C Σ_{i=1}^{l} ξi
    subject to yi (w^T φ(xi) + b) ≥ 1 − ξi, ξi ≥ 0,

    where φ maps xi into a higher-dimensional feature space and C > 0 is the penalty parameter of the error term. More and more researchers pay attention to SVM-based classifiers for spam detection, since their demonstrated robustness and ability to handle large feature spaces make them particularly attractive for this task.

    In general, four types of kernel function (linear, polynomial, RBF, and sigmoid) are frequently used with SVM. They are defined as follows:

    1. Linear: K(xi, xj) = xi^T xj

    2. Polynomial: K(xi, xj) = (γ xi^T xj + r)^d, γ > 0

    3. Radial Basis Function (RBF): K(xi, xj) = exp(−γ ‖xi − xj‖²), γ > 0

    4. Sigmoid: K(xi, xj) = tanh(γ xi^T xj + r)

    In this project, we have tried all four kernels above to evaluate their performance on the spam detection problem. We use the libsvm library [12] to implement the kernel-based SVMs that conduct spam classification.
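    For readers who prefer Python, the four kernels can be tried with scikit-learn's SVC, which wraps libsvm [12]. This is an assumed equivalent of our setup, not the exact configuration we used; the helper name and hyperparameter values are illustrative:

```python
from sklearn.svm import SVC

# Try the four kernels from Section 3.2 and report test-set accuracy.
kernels = [
    dict(kernel="linear"),
    dict(kernel="poly", degree=1, gamma="scale", coef0=1.0),
    dict(kernel="rbf", gamma="scale"),
    dict(kernel="sigmoid", gamma="scale", coef0=1.0),
]

def evaluate_kernels(X_train, y_train, X_test, y_test):
    results = {}
    for params in kernels:
        clf = SVC(C=1.0, **params).fit(X_train, y_train)
        results[params["kernel"]] = clf.score(X_test, y_test)  # accuracy
    return results
```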

    3.3 Artificial Neural Network

    The artificial neural network is also widely used for spam detection [10], [11]. The network we adopt in this project is a non-linear feed-forward neural network with the architecture shown in Figure 3.1.

    Figure 3.1: Architecture of the Feedforward Artificial Neural Network (input layer, hidden layer, output layer)

    There are three layers in the network: an input layer, a hidden layer, and an output layer. The input layer has 57 inputs, the hidden layer has 10 neurons, and the output layer has one neuron. All neurons in the hidden and output layers use the sigmoid activation function

    f(x) = 1 / (1 + e^(−x))    (3.4)

    so the output value ranges from 0 to 1. In the data set, 1 represents spam and 0 represents non-spam; an email is classified as spam if the output value is above 0.5, and as non-spam otherwise.


  • The Levenberg-Marquardt algorithm [7] is adopted as the learning algorithm because it is the fastest backpropagation algorithm in the MATLAB toolbox.

    We use MATLAB as the development platform because of the ample functions provided by the MATLAB Neural Network Toolbox.
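    A sketch of the 57-10-1 forward pass in Python may help clarify the architecture. Our actual network is trained with Levenberg-Marquardt in MATLAB; the weights below are untrained random placeholders:

```python
import numpy as np

# Forward pass for the 57-10-1 network of Section 3.3. The weight values
# are random stand-ins, not trained parameters.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(10, 57)) * 0.1, np.zeros(10)  # input -> hidden
W2, b2 = rng.normal(size=(1, 10)) * 0.1, np.zeros(1)    # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # equation (3.4)

def predict(x):
    h = sigmoid(W1 @ x + b1)          # hidden layer, 10 sigmoid neurons
    out = sigmoid(W2 @ h + b2)[0]     # single sigmoid output in (0, 1)
    return 1 if out > 0.5 else 0      # spam if above 0.5
```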

    3.4 Combined Method

    Beyond the above-mentioned algorithms, we propose a simple method that combines NB, SVM and ANN for spam detection. Let the predicted values of NB, SVM and ANN be denoted by y_NB, y_SVM and y_ANN respectively; obviously, y_NB, y_SVM, y_ANN ∈ {0, 1}. The classification result of the combined method is then:

    y_combined = 1 if (y_NB + y_SVM + y_ANN) / 3 ≥ 0.5, and 0 otherwise.    (3.5)

    The intuition behind this combined method is that if a majority of the three machine learning algorithms (i.e., NB, SVM and ANN) classify an email as spam, then the email is very likely to be spam. In our experiments, we will show that this combined method indeed improves the performance of spam detection.
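    Equation (3.5) amounts to a majority vote over the three base classifiers, which can be written in a few lines (the function name is ours):

```python
# Majority vote of equation (3.5): each base classifier outputs
# 0 (non-spam) or 1 (spam); the combined label follows the majority.
def combined_predict(y_nb, y_svm, y_ann):
    return 1 if (y_nb + y_svm + y_ann) / 3 >= 0.5 else 0
```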


  • Chapter 4

    Performance Evaluation

    4.1 Training Set and Testing Set

    As discussed in Chapter 2, we cannot naively select the first K instances of the original UCI Spambase data set as the training set, because the spam and non-spam instances are distributed extremely heterogeneously. Thus, we re-rank the instances of the original data set to obtain a new data set in which spam and non-spam instances are distributed relatively homogeneously. In the data set, 1813 instances are spam and the remaining 2788 instances are non-spam, so the ratio between the number of spam instances and the total number of instances is

    1813 / (1813 + 2788) ≈ 0.4

    We let this ratio be around 0.4 in both the training set and the testing set, and let the testing set be about half the size of the training set. Based on this division, the training set contains 3000 instances, of which 1154 are spam, and the testing set contains 1601 instances, of which 659 are spam. Spam and non-spam instances are relatively homogeneously distributed in both sets.
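    The division above can be sketched as follows. The helper name, the random seed, and the use of a plain shuffle are illustrative assumptions; the exact re-ranking we used may differ:

```python
import random

# Shuffle the class-sorted UCI data set so spam and non-spam instances are
# mixed, then split into 3000 training and 1601 testing instances.
# `instances` is assumed to be a list of (feature_vector, label) pairs.
def split_spambase(instances, train_size=3000, seed=42):
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # break the spam-first ordering
    return shuffled[:train_size], shuffled[train_size:]
```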

    4.2 Performance Metrics

    To evaluate the performance of the different spam detection algorithms, we adopt four commonly used metrics, defined as follows (TP is short for True Positive, FP for False Positive, TN for True Negative, and FN for False Negative):


  • 1. Accuracy: (TP + TN) / (TP + TN + FP + FN), which measures the fraction of emails that are correctly classified.

    2. Precision: TP / (TP + FP), which gives the ratio between the number of emails correctly classified as spam and the number of emails classified as spam.

    3. Recall: TP / (TP + FN), which measures the ratio between the number of emails correctly classified as spam and the number of spam emails in the testing set.

    4. F1-measure: 2 · Precision · Recall / (Precision + Recall), which is a measure of a test's accuracy. The optimal value of the F1-measure is 1 and the worst value is 0.
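    These four metrics can be computed directly from the confusion counts, for example (the function name is ours):

```python
# Compute the four metrics of Section 4.2 from predicted and true labels
# (1 = spam, 0 = non-spam). Zero denominators fall back to 0.0.
def metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1
```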

    4.3 Experimental Results

    4.3.1 Different Kernel Functions of SVM

    Table 4.1: Performance Comparison of Different Kernels

    Method            Accuracy(%)  Precision(%)  Recall(%)  F1-measure
    Linear            91.75        91.10         88.61      0.8984
    Polynomial(d=1)   91.38        92.77         85.73      0.8911
    Polynomial(d=2)   69.76        90.32         29.74      0.4474
    Polynomial(d=3)   39.78        39.94         91.95      0.5569
    RBF               82.94        79.60         78.75      0.7917
    Sigmoid           65.33        93.33         16.95      0.2875

    In this section, we compare the spam detection performance of SVMs based on different kernel functions. As shown in Table 4.1, the linear kernel achieves performance similar to the polynomial kernel with degree 1. By contrast, the RBF and sigmoid kernels work much worse than the linear and degree-1 polynomial kernels. We also compare the performance of the polynomial kernel with different degree values and find that increasing the degree leads to lower detection performance.

    Given the widely varying performance of different kernels (e.g., SVM with a polynomial kernel of degree 3 achieves 39.78% accuracy versus 91.75% for SVM with a linear kernel), we believe that in future data mining work with SVM tools, we should be careful about kernel selection so as to achieve good performance.


  • 4.3.2 Different Spam Detection Algorithms

    Table 4.2: Performance Comparison of Different Algorithms

    Method        Accuracy(%)  Precision(%)  Recall(%)  F1-measure
    ANN           93.44        93.82         89.98      0.9186
    NB            58.27        49.64         95.59      0.6535
    SVM(Linear)   92.50        91.98         94.08      0.9302
    Combined      93.94        91.32         94.23      0.9275

    In this section, we compare the spam detection performance of NB, SVM and ANN. For this experiment, we use the linear-kernel SVM as the representative of SVM for comparison.

    As shown in Table 4.2, ANN achieves a slightly higher accuracy than SVM, and SVM outperforms NB in terms of accuracy, precision and F1-measure. The intuition behind this is that Naïve Bayes classification assumes all attributes to be independent, whereas in our data set some attributes are correlated to some extent. For example, the attributes capital_run_length_average and capital_run_length_total are closely correlated.

    Besides, our proposed combined method gives a small performance improvement over each of the other three algorithms individually. This is because it aggregates the votes of NB, SVM and ANN: if a majority of them classify an email as spam, the email is more likely to be spam, and vice versa.

    Regarding time efficiency, the kernel-based SVMs and NB require only about 10 seconds to finish the training process, whereas ANN needs at least 2 minutes to complete training. SVM and NB therefore achieve much higher time efficiency than ANN.


  • Chapter 5

    Conclusion

    In this project, we adopt three machine learning algorithms, Naïve Bayes classification, Support Vector Machine and Artificial Neural Network, to tackle the spam detection problem, and conduct extensive experiments on a classical benchmark spam filtering corpus, the UCI Spambase data set, to evaluate the performance of these three classification algorithms. Experimental results show that for kernel-based SVM, the linear kernel achieves performance similar to the polynomial kernel and outperforms the RBF and sigmoid kernels; moreover, a polynomial kernel with higher degree leads to lower performance. ANN achieves a slightly higher accuracy than the linear-kernel SVM. By contrast, Naïve Bayes classification has the worst accuracy. This is because Naïve Bayes assumes all attributes to be independent, whereas in our data set some attributes are closely correlated, so Naïve Bayes results in poor performance. Furthermore, our proposed combined method improves performance in terms of accuracy, recall and F1-measure by aggregating the votes of the individual classifiers. With regard to time efficiency, ANN takes much more training time than SVM and NB. Therefore, SVM is suitable for applications that require low complexity and time efficiency, ANN is suitable for applications that require high accuracy and can tolerate long training times, and NB is not suitable for the spam detection task.

    There are several interesting directions that could be explored. First, we could try deep learning algorithms for spam detection. Deep learning is currently a hot topic due to large performance improvements in many applications in vision, audio, speech and natural language processing, so applying it to spam detection would be an interesting application. Second, in this project we mainly focus on developing a generalized spam filter; as an extension, we could implement personalized spam filters for different users by introducing personalized features.


  • Bibliography

    [1] Sara Radicati, Email Statistics Report 2013-2017, 2013.

    [2] KASPERSKY, Spam Statistics Report for Q2-2013, 2013.

    [3] Wombat Security Technologies, PhishPatrol White Paper, April, 2012.

    [4] UCI Spambase Data Set, http://archive.ics.uci.edu/ml/datasets/Spambase.

    [5] Seongwook Youn and Dennis McLeod, A comparative study for email classification, Advances and Innovations in Systems, Computing Sciences and Software Engineering, Springer Netherlands, 2007, pp. 387-391.

    [6] Harris Drucker, Donghui Wu, and Vladimir N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999.

    [7] http://en.wikipedia.org/wiki/Levenberg%E2%80%93Marquardt_algorithm

    [8] Karl-Michael Schneider, A comparison of event models for Naive Bayes anti-spam e-mail filtering, In Proc. of the Association for Computational Linguistics, 2003.

    [9] Alexander K. Seewald, An evaluation of naive Bayes variants in content-based learning for spam filtering, Intelligent Data Analysis, 2007.

    [10] Chuan Zhan, Xianliang Lu, et al., A LVQ-based neural network anti-spam email approach, ACM SIGOPS Operating Systems Review, 2005.

    [11] James Clark et al., A neural network based approach to automated e-mail classification, In Proc. of the IEEE/WIC International Conference on Web Intelligence, 2003.

    [12] Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology, 2011.

    [13] https://github.com/pranavgupta21/Spam-Detector
