How might we analyze email?
Identify different parts
– Reply blocks, signature blocks
Integrate email with workflow tasks
Build a social network
– Who do you know, and what is their contact info?
Reputation analysis
– Useful for anti-spam too
Recognizing Email Structure
Three tasks:
– Does this message contain a signature block?
– If so, which lines are in it?
– Which lines are reply lines?
Three-way classification for each line
Representation
– A sequence of lines
– Each line has features associated with it
– Windows of lines important for line classification
Victor R. Carvalho & William W. Cohen, Learning to Extract Signature and Reply Lines from Email, in CEAS 2004.
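The line-plus-window representation can be sketched roughly as follows. The feature names here are hypothetical illustrations and are not the feature set from the cited paper; the window radius of one line is also an assumption.

```python
import re

def line_features(line):
    """Hypothetical per-line features (NOT the features used in the cited paper)."""
    stripped = line.strip()
    return {
        "blank": stripped == "",
        "starts_with_quote": stripped.startswith(">"),
        "has_email": re.search(r"\S+@\S+", stripped) is not None,
        "sig_delimiter": stripped == "--",
    }

def windowed_features(lines, radius=1):
    """For each line, collect its own features plus those of neighbours within `radius` lines."""
    base = [line_features(l) for l in lines]
    out = []
    for i in range(len(lines)):
        feats = {}
        for offset in range(-radius, radius + 1):
            j = i + offset
            if 0 <= j < len(lines):
                for name, value in base[j].items():
                    # e.g. "has_email@+1" is the has_email feature of the next line
                    feats[f"{name}@{offset:+d}"] = value
        out.append(feats)
    return out
```

Each resulting dictionary could then be fed to any per-line classifier; which classifier to use is left open here.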
The Cost of Spam
Most of the cost of spam is paid for by the recipients:
Typical spam batch is 1,000,000 spams
Spammer averages ~$250 commission per batch
Cost to recipients to delete the load of spam @ 2 seconds/spam, $5.15/hour: $2,861
The Cost of Spam
Theft efficiency ratio of spammer:
$$\frac{\text{profit to thief}}{\text{cost to victims}} \approx 10\%$$
A 10% theft efficiency ratio is typical of many other lines of criminal activity, such as fencing stolen goods (jewelry, hubcaps, car stereos).
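The cost figures can be checked with a few lines of arithmetic. All inputs are the numbers quoted on the slides; the variable names are just for illustration.

```python
batch_size = 1_000_000       # spams per batch
seconds_per_spam = 2         # time a recipient spends deleting one spam
hourly_wage = 5.15           # dollars per hour
spammer_commission = 250.0   # dollars per batch

hours_lost = batch_size * seconds_per_spam / 3600
recipient_cost = hours_lost * hourly_wage
efficiency = spammer_commission / recipient_cost

print(f"hours lost:       {hours_lost:,.1f}")
print(f"recipient cost:   ${recipient_cost:,.0f}")
print(f"theft efficiency: {efficiency:.1%}")
```

The recipient cost comes out at about $2,861, and the ratio at a little under 9%, which the slide rounds to ~10%.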
Adapted from slide by Rohan Malkhare
Anti-spam Approaches
Legislation
Technology
  White listing of email addresses
  Black listing of email addresses/domains
  Challenge-response mechanisms
  Content filtering
    – Learning techniques
    – “Bayesian filtering” for spam has received a lot of press, e.g.
      “How to spot and stop spam”, BBC News, 26/5/2003, http://news.bbc.co.uk/2/hi/technology/3014029.stm
      “Sorting the ham from the spam”, Sydney Morning Herald, 24/6/2003, http://www.smh.com.au/articles/2003/06/23/1056220528960.html
    – The “Bayesian filtering” they are talking about is actually Naïve Bayes classification
Research in Spam Classification
Spam filtering is really a classification problem
– Each email needs to be classified as either spam or not spam (“ham”)
W. Cohen (1996): RIPPER, rule learning system
– Rules in a human-comprehensible format
Pantel & Lin (1998): Naïve Bayes with words as features
Sahami, Dumais, Heckerman, Horvitz (1998): Naïve Bayes with a mutual information measure to select features with strongest resolving power
– Words and domain-specific attributes of spam used as features
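As a rough illustration of "Naïve Bayes with words as features", here is a minimal sketch. It is not the implementation from any of the papers listed; whitespace tokenization, add-one smoothing, and the class and method names are all assumptions of this example.

```python
import math
from collections import Counter

class WordNaiveBayes:
    """Toy two-class Naive Bayes over whitespace-separated words."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text, label):
        # Assumes at least one training message per class (otherwise log(0)).
        counts = self.word_counts[label]
        total = sum(counts.values())
        vocab = len(set(self.word_counts["spam"]) | set(self.word_counts["ham"]))
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            # Add-one (Laplace) smoothing: an assumption of this sketch
            score += math.log((counts[word] + 1) / (total + vocab))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))
```

Usage: call `train(text, "spam")` or `train(text, "ham")` on labeled messages, then `classify(text)` on a new one.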
Research in Spam Classification
Paul Graham (2002): A Plan for Spam
– Very popular algorithm credited with starting the craze for Bayesian filters
– Uses Naïve Bayes with words as features
Bill Yerazunis (2002): CRM114 sparse binary polynomial hashing algorithm
– Very accurate (over 99.7% accuracy)
– Distinctive because of its powerful feature extraction technique
– Uses Bayesian chain rule for combining weights
– Available via SourceForge
Others have used SVMs, etc.
New work: first email and anti-spam conference just held
http://www.ceas.cc/papers-2004/
Adapted from slide by William Yerazunis
Yerazunis’ CRM114 Algorithm
Other naïve-bayes approaches focused on single-word features
CRM114 creates a huge number of n-grams and represents them efficiently
The goal is to create a LOT of features, many of which will be invariant over a large body of spam (or nonspam).
(The name is a reference to a device in Dr. Strangelove)
Sparse Binary Polynomial Hashing and the CRM114 Discriminator, William S. Yerazunis, http://crm114.sourceforge.net/CRM114_paper.html
CRM114
1. Slide a window N words long over the incoming text
2. For each window position, generate a set of order-preserving sub-phrases containing combinations of the windowed words
3. Calculate 32-bit hashes of these order-preserved sub-phrases (for efficiency reasons)
CRM114 Feature Extraction Example
Step 1: slide a window N words long over the incoming text. Example:
You can Click here to buy viagra online NOW!!!
Yields (N = 5; the current window is shown in brackets):
[You can Click here to] buy viagra online NOW!!!
You [can Click here to buy] viagra online NOW!!!
You can [Click here to buy viagra] online NOW!!!
You can Click [here to buy viagra online] NOW!!!
... and so on... (on to step 2)
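Step 1 can be sketched in a few lines. This is an illustration rather than CRM114's own code; splitting on whitespace and the default of N = 5 (taken from the five-word window 'Click here to buy viagra' used in the SBPH example) are assumptions.

```python
def sliding_windows(text, n=5):
    """Return every run of n consecutive whitespace-separated words."""
    words = text.split()
    return [words[i:i + n] for i in range(len(words) - n + 1)]

for window in sliding_windows("You can Click here to buy viagra online NOW!!!"):
    print(" ".join(window))
```

A text shorter than N words yields no windows in this sketch; a real filter would need some policy for that case.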
SBPH Example
Step 2: generate order-preserving sub-phrases from the words in each of the sliding windows
Sliding window text: ‘Click here to buy viagra’
...yields all these feature sub-phrases:
Click
Click here
Click to
Click here to
Click buy
Click here buy
Click to buy
Click here to buy
Click viagra
Click here viagra
Click to viagra
Click here to viagra
Click buy viagra
Click here buy viagra
Click to buy viagra
Click here to buy viagra
Note the binary counting pattern; this is the ‘binary’ in ‘sparse binary polynomial hashing’
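The binary counting pattern in Step 2 can be reproduced with a bitmask over the words that follow the first one. This is a sketch that matches the listing on the slide (16 sub-phrases for a five-word window, counting the first word alone); it is not taken from the CRM114 source, and the function name is made up.

```python
def sub_phrases(window):
    """Keep the first word; include each later word iff its bit is set in the counter."""
    first, rest = window[0], window[1:]
    phrases = []
    for mask in range(2 ** len(rest)):
        kept = [word for bit, word in enumerate(rest) if (mask >> bit) & 1]
        phrases.append(" ".join([first] + kept))
    return phrases

for phrase in sub_phrases("Click here to buy viagra".split()):
    print(phrase)
```

Counting `mask` from 0 upward produces the phrases in the same order as on the slide, and word order within each phrase is preserved.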
SBPH Example
Step 3: make 32-bit hash value “features” from the sub-phrases
Sub-phrases:
Click
Click here
Click to
Click here to
Click buy
Click here buy
Click to buy
Click here to buy
Click viagra
Click here viagra
Click to viagra
Click here to viagra
Click buy viagra
Click here buy viagra
Click to buy viagra
Click here to buy viagra
32-bit hash:
E06BF8AA
12FAD10F
7B37C4F9
113936CF
1821F0E8
46B99AAD
B7EE69BF
19A78B4D
56626838
AE1B0B61
5710DE73
33094DBB
..... and so on
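Step 3 only needs some function that maps a sub-phrase to a 32-bit value. The sketch below uses CRC-32 from Python's standard library purely as a stand-in; it is not the hash function CRM114 uses, so its output will not match the hex values shown on the slide.

```python
import zlib

def hash_feature(phrase):
    """Map a sub-phrase to a 32-bit value, formatted as 8 hex digits.

    CRC-32 is an assumed stand-in, not CRM114's actual hash.
    """
    return f"{zlib.crc32(phrase.encode('utf-8')) & 0xFFFFFFFF:08X}"

for phrase in ["Click", "Click here", "Click to"]:
    print(phrase, "->", hash_feature(phrase))
```

Storing a fixed-size hash rather than the phrase itself keeps the feature tables compact, at the price of occasional collisions.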
How to use the terms
For each phrase you can build, keep track of how many times you see that phrase in both the spam and nonspam categories.
When you need to classify some text, build up the phrases
– Each extra word adds 15 features
Count up how many times all of the phrases appear in each of the two different categories. The category with the most phrase matches wins.
– But really it uses the Bayesian chain rule
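The simplified "most phrase matches wins" rule can be sketched as below. This is the simplification described on this slide, not the Bayesian chain rule that CRM114 really uses; the function names and the tie-breaking behaviour are assumptions.

```python
from collections import Counter

def new_counts():
    """One counter of phrase occurrences per category."""
    return {"spam": Counter(), "nonspam": Counter()}

def learn(counts, phrases, category):
    counts[category].update(phrases)

def match_totals(counts, phrases):
    """Total number of times the given phrases were seen in each category."""
    return {cat: sum(c[p] for p in phrases) for cat, c in counts.items()}

def classify_by_matches(counts, phrases):
    totals = match_totals(counts, phrases)
    # On a tie, max() returns the first category ("spam"); an arbitrary choice here.
    return max(totals, key=totals.get)
```

In practice `phrases` would be the sub-phrases (or their hashes) generated from the message being learned or classified.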
Learning and Classifying
Learning: each feature is bucketed into one of two bucket files (spam or nonspam)
Classifying: the comparable bucket counts of the two files generate rough estimates of each feature's ‘spamminess’
$$P(F \mid C) = 0.5 + \frac{|F_C| - |F_{\sim C}|}{2 \cdot \mathrm{MaxF}}$$
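The per-feature estimate translates directly into code. The slide does not define MaxF, so this sketch simply takes it as a parameter and assumes it is at least as large as either count, which keeps the result between 0 and 1.

```python
def local_probability(count_in_c, count_in_not_c, max_f):
    """P(F|C) = 0.5 + (|Fc| - |F~c|) / (2 * MaxF), as written on the slide.

    max_f is passed in because the slide does not say how MaxF is chosen.
    """
    return 0.5 + (count_in_c - count_in_not_c) / (2 * max_f)

print(local_probability(3, 1, 10))   # feature seen slightly more often in class C
print(local_probability(0, 0, 10))   # unseen feature: no evidence either way
```

A feature seen equally often in both classes scores 0.5, and the score moves toward 1 or 0 as the counts diverge.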
The Bayesian Chain Rule (BCR)
$$P(C \mid F) = \frac{P(F \mid C)\,P(C)}{P(F \mid C)\,P(C) + P(F \mid {\sim}C)\,P({\sim}C)}$$
Start with P(C) = P(~C) = 0.5
For a new message, compute this for both P(spam) and P(not-spam)
Whichever has the higher score wins.
The denominator renormalizes to account for whether most of the email belongs to one class or the other
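One way to read the chain rule is as a repeated update, where the posterior after one feature becomes the prior for the next. The sketch below assumes that reading; the function names and the example likelihoods are made up for illustration.

```python
def bcr_update(prior_c, p_f_given_c, p_f_given_not_c):
    """One application of P(C|F) = P(F|C)P(C) / (P(F|C)P(C) + P(F|~C)P(~C))."""
    numerator = p_f_given_c * prior_c
    denominator = numerator + p_f_given_not_c * (1.0 - prior_c)
    return numerator / denominator  # undefined if both likelihoods are 0

def chain(feature_likelihoods, prior=0.5):
    """Fold the update over (P(F|C), P(F|~C)) pairs, starting from P(C) = 0.5."""
    p_c = prior
    for p_f_c, p_f_not_c in feature_likelihoods:
        p_c = bcr_update(p_c, p_f_c, p_f_not_c)
    return p_c

print(chain([(0.8, 0.2), (0.6, 0.4), (0.3, 0.7)]))
```

Running the chain once with C = spam and once with C = not-spam and comparing the two results gives the decision described above.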
Evaluation
The feature set created by the SBPH feature hash gives better performance than single-word Bayesian systems.
Phrases in colloquial English are much more standardized than words alone; this makes filter evasion much harder
A bigger corpus of example text is better
With 400 Kbytes of selected spams and 300 Kbytes of selected nonspams trained in; no blacklists, whitelists, or other shenanigans
Results
>99.915%: the actual performance of CRM114 Mailfilter from Nov 1 to Dec 1, 2002.
5849 messages (1935 spam, 3914 nonspam)
4 false accepts, ZERO false rejects (and 2 messages I couldn't make head nor tail of).
All messages were incoming mail 'fresh from the wild'. No canned spam.
For comparison, a human* is only about 99.84% accurate in classifying spam v. nonspam in a “rapid classification” environment.
Results Stats
Filtering speed: classification about 20 Kbytes per second, learning about 10 Kbytes per second (on a Transmeta 666 MHz laptop)
Memory required: about 5 megabytes
404K spam features, 322K nonspam features
Downsides?
The bad news: SPAM MUTATES
Even a perfectly trained Bayesian filter will slowly deteriorate.
New spams appear, with new topics, as well as old topics with creative twists to evade antispam filters.
Revenge of the Spammers
How do the spammers game these algorithms?
Break the tokenizer
– Split up words, use HTML tags, etc.
Throw in randomly ordered words
– Throw off the n-gram based statistics
Use few words
– Harder for the classifier to work
On Attacking Statistical Spam Filters. Gregory L. Wittel and S. Felix Wu, CEAS ’04.
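To see why "breaking the tokenizer" works, consider a filter that splits on whitespace and has learned the single word "viagra" as a spam feature. The tokenizer, the word list, and the obfuscated strings below are all made up for illustration and are not taken from the cited paper.

```python
def naive_tokens(text):
    """A deliberately simple tokenizer: lowercase, split on whitespace."""
    return text.lower().split()

known_spam_words = {"viagra"}  # hypothetical learned feature

variants = [
    "buy viagra now",          # plain
    "buy v i a g r a now",     # word split up
    "buy v<b></b>iagra now",   # HTML tags inserted inside the word
]

for variant in variants:
    hits = known_spam_words & set(naive_tokens(variant))
    print(f"{variant!r}: matched {sorted(hits)}")
```

Only the plain variant matches the learned word; the other two read the same to a person viewing rendered mail but produce tokens the filter has never seen.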