Spam Detection
Jingrui He
10/08/2007
Spam Types
  Email Spam: unsolicited commercial email
  Blog Spam: unwanted comments in blogs
  Splogs: fake blogs to boost PageRank
From Learning Point of View
  Spam Detection: classification problem (ham vs. spam)
  Feature Extraction: A Learning Approach to Spam Detection based on Social Networks. H.Y. Lam and D.Y. Yeung
  Fast Classifier: Relaxed Online SVMs for Spam Filtering. D. Sculley, G.M. Wachman
A Learning Approach to Spam Detection based on Social Networks
H.Y. Lam and D.Y. Yeung
CEAS 2007
Problem Statement
  n email accounts
  Sender set and receiver set
  Labeled sender set: a subset of the senders with known labels
  Goal: assign a score to each remaining (unlabeled) account
System Flow Chart
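Social Network from Logs
  Directed graph: one node per email account
  Directed edge: an email was sent from the source account to the target account
  Edge weight: derived from the number of emails sent from the source account to the target account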
Features from Email Social Networks
  In-count / Out-count: the sum of in-coming / out-going edge weights
  In-degree / Out-degree: the number of email accounts that a node receives emails from / sends emails to
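
The four features above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the log format (a list of sender/receiver pairs), the function names `build_graph` and `node_features`, and the choice of the raw email count as the edge weight are all assumptions made for the example.

```python
from collections import defaultdict


def build_graph(log):
    """Aggregate (sender, receiver) pairs into weighted directed edges."""
    weights = defaultdict(int)
    for sender, receiver in log:
        # Assumption: the edge weight is simply the number of emails sent.
        weights[(sender, receiver)] += 1
    return dict(weights)


def node_features(weights, node):
    """Return in/out-count and in/out-degree for one account."""
    in_edges = [w for (s, r), w in weights.items() if r == node]
    out_edges = [w for (s, r), w in weights.items() if s == node]
    return {
        "in_count": sum(in_edges),     # sum of in-coming edge weights
        "out_count": sum(out_edges),   # sum of out-going edge weights
        "in_degree": len(in_edges),    # accounts the node receives from
        "out_degree": len(out_edges),  # accounts the node sends to
    }


# Hypothetical log: "a" writes to "b" twice and to "c" once; "b", "c", "d" write to "a".
log = [("a", "b"), ("a", "b"), ("a", "c"), ("b", "a"), ("c", "a"), ("d", "a")]
weights = build_graph(log)
features_a = node_features(weights, "a")
```

Counts and degrees differ only when some pair exchanges more than one email, which is why both are kept as separate features.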
Features from Email Social Networks
  Communication Reciprocity (CR): the percentage of interactive neighbors that a node has
    Defined from the set of accounts that received emails from the node and the set of accounts that sent emails to the node
  Communication Interaction Average (CIA): the level of interaction between a sender and each of the corresponding recipients
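
A rough sketch of communication reciprocity follows. The exact formula did not survive in these notes, so the normalization used here (the share of a node's recipients that also wrote back to it) is an assumption, as are the function name and the toy edge set.

```python
def communication_reciprocity(edges, node):
    """Share of the accounts a node wrote to that also wrote back to it.

    Assumption: the denominator is the node's recipient set; the original
    definition may normalize differently.
    """
    recipients = {r for s, r in edges if s == node}  # received emails from node
    senders = {s for s, r in edges if r == node}     # sent emails to node
    if not recipients:
        return 0.0
    return len(recipients & senders) / len(recipients)


# Hypothetical edges: "a" has one reciprocated contact out of three,
# while "bulk" sends to three accounts and never gets a reply.
edges = {
    ("a", "b"), ("a", "c"), ("a", "d"), ("b", "a"),
    ("bulk", "a"), ("bulk", "b"), ("bulk", "c"),
}
cr_a = communication_reciprocity(edges, "a")
cr_bulk = communication_reciprocity(edges, "bulk")
```

Under this definition an account that only broadcasts gets a reciprocity of zero, which is the behavior the feature is meant to capture.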
Features from Email Social Networks
  Clustering Coefficient (CC): friends-of-friends relationship between email accounts
    Computed from the number of neighbors of a node and the number of connections between those neighbors
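
The clustering coefficient can be illustrated with the standard local definition on an undirected view of the graph: the number of links among a node's neighbors divided by the number of possible links. Treating the email graph as undirected is an assumption of this sketch; the slides may use a directed variant.

```python
from itertools import combinations


def clustering_coefficient(edges, node):
    """Standard local clustering coefficient, ignoring edge direction."""
    adjacency = {}
    for s, r in edges:
        if s == r:
            continue  # ignore self-addressed emails
        adjacency.setdefault(s, set()).add(r)
        adjacency.setdefault(r, set()).add(s)
    neighbors = adjacency.get(node, set())
    k = len(neighbors)
    if k < 2:
        return 0.0
    # Count the connections that exist between pairs of neighbors.
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical graphs: "a" has three contacts, two of whom know each other;
# "bulk" writes to three accounts that never write to one another.
cc_a = clustering_coefficient([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")], "a")
cc_bulk = clustering_coefficient([("bulk", "x"), ("bulk", "y"), ("bulk", "z")], "bulk")
```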
Preprocessing
  Sender feature vector
  Weighted features
  Problematic?
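Assigning Spam Score
  Similarity weighted k-NN method
  Gaussian similarity
  Similarity weighted mean k-NN scores
    $$ y_i = \frac{\sum_{j:\, x_j \in N_k(x_i)} w_{ij}\, y_j}{\sum_{j:\, x_j \in N_k(x_i)} w_{ij}} $$
    where $N_k(x_i)$ is the set of k nearest neighbors of $x_i$
  Score scaling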
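
The similarity weighted k-NN score can be sketched as follows. The Gaussian form `exp(-d^2 / (2 * sigma^2))`, the bandwidth `sigma`, the use of Euclidean distance, and the convention that labeled scores are 1.0 for legitimate senders and 0.0 for spammers are assumptions made for this illustration; the score scaling step is omitted.

```python
import math


def gaussian_similarity(x, z, sigma=1.0):
    """Assumed Gaussian similarity between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def knn_score(query, labeled, k=3, sigma=1.0):
    """Similarity weighted mean of the scores of the k nearest labeled senders.

    labeled: list of (feature_vector, score) pairs.
    """
    nearest = sorted(labeled, key=lambda item: math.dist(query, item[0]))[:k]
    weights = [gaussian_similarity(query, x, sigma) for x, _ in nearest]
    total = sum(weights)
    if total == 0.0:
        # All similarities underflowed; fall back to the unweighted mean.
        return sum(y for _, y in nearest) / len(nearest)
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / total


# Hypothetical labeled senders in a two-dimensional feature space.
labeled = [((0.0, 0.0), 1.0), ((0.0, 1.0), 1.0), ((5.0, 5.0), 0.0), ((5.0, 6.0), 0.0)]
score_near_legit = knn_score((0.0, 0.5), labeled, k=2)
score_near_spam = knn_score((5.0, 5.5), labeled, k=2)
```

Because the weights decay with distance, a far-away neighbor that slips into the k nearest set contributes almost nothing to the score.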
Experiments
  Enron dataset: 9150 senders
  To get legitimate Enron senders: email transactions within the Enron email domain
  5000 generated spam accounts
  120 senders from each class
Results Averaged over 100 Times
  Figures: Number of Nearest Neighbors; Feature Weights (CC); Feature Weights (CIA); Feature Weights (CR); Feature Weights (In/Out-Count & In/Out-Degree)
  In/Out-count & In/Out-degree weights: the smaller the better
  Final weights: In/Out-count & In/Out-degree: 1; CR: 1; CIA: 10; CC: 15
Conclusion
  Legitimacy score: no content needed
  Can be combined with content-based filters
  More sophisticated classifiers: SVM, boosting, etc.
  Classifiers using combined features
Relaxed Online SVMs for Spam Filtering
D. Sculley and G.M. Wachman
SIGIR 2007
Anti-Spam Controversy
  Support Vector Machines (SVMs)
  Academic researchers: statistically robust; state-of-the-art performance
  Practitioners: training cost is quadratic in the number of training examples; impractical!
  Solution: Relaxed Online SVMs
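Background: SVMs
  Data set: $\{(x_1, y_1), \ldots, (x_n, y_n)\}$
  Class label $y_i$: 1 for spam; -1 for ham
  Classifier: $f(x) = \langle w, x \rangle + b$
  To find $w$ and $b$, minimize:
    $$ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i $$
  Constraints:
    $$ y_i f(x_i) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n $$
  $\xi_i$: slack variable
  $\frac{1}{2}\|w\|^2$: maximizing the margin; $\sum_i \xi_i$: minimizing the loss function
  $C$: tradeoff parameter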
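
To make the role of the tradeoff parameter concrete, the sketch below evaluates the standard soft-margin objective for a fixed linear classifier, using the fact that the smallest feasible slack for an example is its hinge loss `max(0, 1 - y * f(x))`. The function name and the toy data are hypothetical.

```python
def svm_objective(w, b, examples, C):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + C * (sum of slacks)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    total_slack = 0.0
    for x, y in examples:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Smallest slack satisfying y * f(x) >= 1 - slack and slack >= 0.
        total_slack += max(0.0, 1.0 - y * score)
    return margin_term + C * total_slack


# Hypothetical data: label 1 for spam, -1 for ham.
examples = [((2.0, 0.0), 1), ((0.5, 0.0), 1), ((1.0, 0.0), -1)]
objective_small_c = svm_objective((1.0, 0.0), 0.0, examples, C=1.0)
objective_large_c = svm_objective((1.0, 0.0), 0.0, examples, C=100.0)
```

With a large C the slack term dominates the objective, so the optimizer is pushed toward reducing training error rather than widening the margin.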
Online SVMs
Tuning the Tradeoff Parameter C
  SpamAssassin data set: 6034 examples
  Large C preferred
Email Spam and SVMs
  TREC05P-1: 92189 messages
  TREC06P: 37822 messages
Blog Comment Spam and SVMs
  Leave-one-out cross-validation
  50 blog posts; 1024 comments
Splogs and SVMs
  Leave-one-out cross-validation
  1380 examples
Computational Cost
  Online SVMs: quadratic training time
Relaxed Online SVMs (ROSVM)
  Objective function of SVMs: $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$
  Large C preferred: minimizing training error is more important than maximizing the margin
  ROSVM: full margin maximization is not necessary; relax this requirement
The last value found for when
Three Ways to Relax SVMs (1)
  Only optimize over the recent p examples
  Dual form of SVMs
  Constraints
Three Ways to Relax SVMs (2)
  Only update on actual errors
  Original online SVMs: update whenever the new example violates the margin
  ROSVM: update only when the example's margin falls below a threshold m
  m=0: mistake-driven online SVMs
  No significant degradation in performance
  Significantly reduces cost
Three Ways to Relax SVMs (3)
  Reduce the number of iterations in iterative SVMs
  SMO: repeated passes over the training set to minimize the objective function
  Parameter T: the maximum number of iterations
  T=1: little impact on performance
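
The three relaxations can be combined in one control loop, sketched below for a linear classifier. This is only an illustration of the control flow, not the paper's algorithm: plain subgradient steps on the primal objective stand in for SMO on the dual, and the class name, the learning rate `lr`, and the default parameter values are all hypothetical.

```python
from collections import deque


class RelaxedOnlineLinearSVM:
    """Toy online linear SVM showing the three relaxations (p, m, T)."""

    def __init__(self, dim, C=10.0, p=4, m=0.8, T=1, lr=0.1):
        self.w = [0.0] * dim
        self.b = 0.0
        self.C, self.m, self.T, self.lr = C, m, T, lr
        self.window = deque(maxlen=p)  # relaxation 1: keep only the recent p examples
        self.updates = 0

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b

    def _optimize(self):
        # Relaxation 3: at most T passes over the window.
        # Assumption: subgradient steps replace SMO on the dual problem.
        for _ in range(self.T):
            for x, y in self.window:
                # Shrink w (step on the margin term 0.5 * ||w||^2).
                self.w = [(1.0 - self.lr) * wi for wi in self.w]
                if y * self.decision(x) < 1.0:
                    # Step on the hinge loss of a margin-violating example.
                    step = self.lr * self.C * y
                    self.w = [wi + step * xi for wi, xi in zip(self.w, x)]
                    self.b += step

    def observe(self, x, y):
        """Score a new example, then retrain only if its margin is below m."""
        score = self.decision(x)
        if y * score < self.m:  # relaxation 2: skip well-classified examples
            self.window.append((x, y))
            self._optimize()
            self.updates += 1
        return score


# Hypothetical stream: spam (label 1) and ham (label -1) alternate.
stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1)] * 20
model = RelaxedOnlineLinearSVM(dim=2)
for features, label in stream:
    model.observe(features, label)
```

On an easy stream like this one, most examples clear the threshold m after the first few updates, so the expensive optimization step runs far less often than once per example.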
Testing Reduced Size
Testing Reduced Iterations
Testing Reduced Updates
Online SVMs and ROSVM
  Email Spam
  Blog Comment Spam
  Splog Data Set