Email Classification using Co-Training
Harshit Chopra (02005001), Kartik Desikan (02005103)



Page 1

Email Classification using Co-Training

Harshit Chopra (02005001), Kartik Desikan (02005103)

Page 2

Motivation

The Internet has grown vastly
Widespread usage of email
Several emails received per day (~100)
Need to organize emails according to user convenience

Page 3

Classification

Given training data {(x1, y1), …, (xn, yn)}
Produce a classifier h : X → Y
h maps an object xi in X to its classification label yi in Y

Page 4

Features

Measurement of some aspect of the given data
Generally represented as binary functions
E.g.: presence of hyperlinks is a feature; presence indicated by 1, absence by 0

Page 5

Email Classification

Email can be classified as:
spam vs. non-spam, or
important vs. non-important, etc.

Text represented as a bag of words:
1. words from headers
2. words from bodies
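The two bag-of-words views described above can be sketched in plain Python (a toy helper for illustration, not the paper's exact representation; the sample email is invented):

```python
def bag_of_words(text):
    """Map raw text to a {word: count} bag-of-words representation."""
    bag = {}
    for word in text.lower().split():
        word = word.strip(".,!?:;()")
        if word:
            bag[word] = bag.get(word, 0) + 1
    return bag

# One email yields two feature views: header words and body words.
email = {"subject": "Company meeting at 3 pm",
         "body": "Agenda attached. See you at the meeting."}

header_view = bag_of_words(email["subject"])
body_view = bag_of_words(email["body"])
```

Each view alone is enough to attempt a classification, which is exactly what co-training exploits later in these slides.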

Page 6

Supervised learning

Main tool for email management: text classification
Text classification uses supervised learning
Examples belonging to different classes are given as a training set
E.g.: emails classified as interesting and uninteresting
Learning systems induce a general description of these classes

Page 7

Supervised Classifiers

Naïve Bayes classifier
Support Vector Machines (SVM)

Page 8

Naïve Bayes

According to Bayes' theorem,

P(Class | Doc) = P(Doc | Class) · P(Class) / P(Doc)

Under the naïve conditional independence assumption,

P(Doc | Class) = P(a1, …, an | Class) = ∏_{i=1}^{n} P(ai | Class)

Page 9

Naïve Bayes

Classify a text document according to

Class = argmax_j P(Class_j) · ∏_{i=1}^{n} P(ai | Class_j)

E.g. Class = {Interesting, Junk}

Page 10

Linear SVM

Creates a hyperplane separating the data into two classes with maximum separation (margin)
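The maximum-margin idea can be illustrated with a tiny from-scratch linear classifier trained by sub-gradient descent on the hinge loss. This is only a sketch of the principle on invented 2-D points, not the SVM solver the experiments used:

```python
import random

def train_linear_svm(data, epochs=200, lr=0.01, lam=0.01):
    """Tiny linear SVM trained by sub-gradient descent on the hinge loss.
    data: list of (feature_vector, label) with label in {-1, +1}."""
    dim = len(data[0][0])
    w = [0.0] * dim
    b = 0.0
    rng = random.Random(0)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:   # point inside the margin: hinge-loss gradient
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:            # outside the margin: only regularization shrinkage
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy linearly separable data
points = [([2.0, 2.0], 1), ([3.0, 3.0], 1),
          ([-2.0, -2.0], -1), ([-3.0, -1.0], -1)]
w, b = train_linear_svm(points)
```

The regularization term `lam` is what pushes the solution toward the widest separating margin rather than just any separating hyperplane.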

Page 11

Problems with Supervised Learning

Requires a lot of training data for accurate classification
E.g.: Microsoft Outlook Mobile Manager requires ~600 labeled emails for best performance
Very tedious for the average user

Page 12

One solution

Look at the user's behavior to determine important emails
E.g.: deleting mails without reading them might be an indication of junk
Not reliable
Requires time to gather information

Page 13

Another solution

Use semi-supervised learning
Few labeled examples coupled with a large amount of unlabeled data
Possible algorithms:

Transductive SVM
Expectation Maximization (EM)
Co-Training

Page 14

Co-Training Algorithm

Based on idea that some features are redundant – “redundantly sufficient”

1. Split features into two sets, F1 and F2

2. Train two independent classifiers, C1 and C2, one for each feature set

3. Produce two initial weak classifiers using minimal labeled data

Page 15

Co-Training Algorithm

4. Each classifier Ci examines unlabeled examples and labels them

5. Add the most confident +ve and -ve examples to the set of labeled examples

6. Loop back to Step 2
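The steps above can be sketched as a generic loop. The learner is pluggable (in the experiments it is naïve Bayes or an SVM); all names here are hypothetical, and the memorizing "classifier" at the bottom is only a toy to show the flow:

```python
def co_train(labeled, unlabeled, train_view_classifier, rounds=50, k=2):
    """labeled: list of ((view1, view2), label); unlabeled: list of (view1, view2).
    train_view_classifier(examples) -> function: view -> (label, confidence).
    Mutates and returns the labeled set."""
    for _ in range(rounds):
        if not unlabeled:
            break
        # Steps 2-3: one classifier per feature view, trained on labeled data
        c1 = train_view_classifier([(x[0], y) for x, y in labeled])
        c2 = train_view_classifier([(x[1], y) for x, y in labeled])
        # Step 4: both classifiers label every unlabeled example
        scored = []
        for x in unlabeled:
            for label, conf in (c1(x[0]), c2(x[1])):
                scored.append((conf, x, label))
        # Step 5: promote the k most confident predictions to labeled data
        scored.sort(key=lambda t: t[0], reverse=True)
        for _, x, label in scored[:k]:
            if x in unlabeled:
                labeled.append((x, label))
                unlabeled.remove(x)
        # Step 6: loop back and retrain
    return labeled

def train_view_classifier(examples):
    """Toy learner: memorizes view -> label, confidence 1.0 if seen before."""
    table = dict(examples)
    def clf(view):
        return (table[view], 1.0) if view in table else ("?", 0.0)
    return clf

labeled = [(("meeting", "agenda"), "work")]
unlabeled = [("meeting", "lunch"), ("standup", "lunch")]
result = co_train(labeled, unlabeled, train_view_classifier, rounds=5, k=1)
```

In the toy run, the subject-view classifier labels the first unlabeled email, which teaches the body-view classifier the word "lunch", which in turn lets it label the second email: exactly the mutual-teaching effect the next slide describes.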

Page 16

Intuition behind Co-Training

One classifier confidently predicts the class of an unlabeled example
In turn, it provides one more training example to the other classifier
E.g.: two messages with subjects

1. "Company meeting at 3 pm"
2. "Meeting today at 5 pm?"

Page 17

Given: the 1st message is classified as "meeting"
Then the 2nd message is also classified as "meeting"
The messages are likely to have different body content
Hence the classifier based on words in the body is provided with an extra training example

Page 18

Co-Training on Email Classification

Assumption: presence of redundantly sufficient features describing data
Two sets of features:

1. words from subject lines (header)
2. words from email bodies

Page 19

Experiment Setup

1500 emails and 3 folders
Division into 250, 500 and 750 emails
Division into 3 classification problems where the ratio of +ve : -ve examples is

1:5 (highly imbalanced problem)
1:2 (moderately imbalanced problem)
1:1 (balanced problem)

Expected: the larger the imbalance, the worse the learning results

Page 20

Experiment Setup

For each task, 25% of the examples are left out as the test set
The rest is the training set, with labeled and unlabeled data
Labeled data selected at random
Stop words removed and a stemmer used
These pre-processed words form the feature set
For each feature/word, the frequency of the word is the feature value
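The preprocessing described above might look like the following sketch. The stop-word list and the suffix-stripping "stemmer" are toy stand-ins for the real ones (e.g. a Porter stemmer), not the paper's exact pipeline:

```python
# Hypothetical, heavily abbreviated stop-word list
STOP_WORDS = {"a", "an", "the", "at", "of", "to", "is", "and", "for"}

def crude_stem(word):
    """Toy suffix-stripping stemmer standing in for a real one."""
    for suffix in ("ings", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def features(text):
    """Frequency of each preprocessed word is the feature value."""
    freq = {}
    for word in text.lower().split():
        word = word.strip(".,!?:;()")
        if not word or word in STOP_WORDS:
            continue
        stem = crude_stem(word)
        freq[stem] = freq.get(stem, 0) + 1
    return freq
```

Stemming merges inflected forms ("meeting", "meetings") into one feature, which reduces the sparseness problem discussed in the results below.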

Page 21

Experiment Run

50 co-training iterations
Appropriate training examples supplied in each iteration
Results represent the average of 10 runs of the training
Learning algorithms used:

Naïve Bayes
SVM

Page 22

Results 1: Co-training with Naïve Bayes

Absolute difference between accuracy in the 1st and 50th iteration:

Problem                               Subject-based classifier   Body-based classifier   Combined classifier
(1:5) Highly imbalanced problem              -17.11%                   -19.78%                -19.64%
(1:2) Moderately imbalanced problem           -9.41%                    -1.20%                 -1.23%
(1:1) Balanced problem                         0.78%                    -0.53%                 -0.64%

Page 23

Inference from Result 1

Naïve Bayes performs badly
Possible reasons:

Violation of conditional independence between the feature sets
Subjects and bodies not redundant; they need to be used together
Great sparseness among feature values

Page 24

Result 2: Co-training with Naïve Bayes and feature selection

Absolute difference between accuracy in the 1st and 50th iteration:

Problem                               Subject-based classifier   Body-based classifier   Combined classifier
(1:5) Highly imbalanced problem               -1.09%                    -1.82%                 -1.96%
(1:2) Moderately imbalanced problem            1.62%                     4.31%                  3.89%
(1:1) Balanced problem                         1.54%                     6.78%                  3.81%

Page 25

Inference from Result 2

Naïve Bayes with feature selection works a lot better
Feature sparseness is the likely cause of the poor co-training results with plain Naïve Bayes

Page 26

Result 3: Co-training with SVM

Absolute difference between accuracy in the 1st and 50th iteration:

Problem                               Subject-based classifier   Body-based classifier   Combined classifier
(1:5) Highly imbalanced problem                0.28%                     1.80%                  1.80%
(1:2) Moderately imbalanced problem           14.89%                    22.47%                 17.42%
(1:1) Balanced problem                        12.36%                    15.84%                 18.15%

Page 27

Inference from Result 3

SVM clearly outperforms Naïve Bayes
Works well for very large feature sets

Page 28

Conclusion

Co-training can be applied to email classification
Its success depends on the learning method used
SVM performs quite well as a learning method for email classification

Page 29

References

S. Kiritchenko, S. Matwin: Email Classification with Co-Training. In Proc. of CASCON, 2001.
A. Blum, T. M. Mitchell: Combining Labeled and Unlabeled Data with Co-Training. In Proc. of the 11th Annual Conference on Computational Learning Theory, 1998.
K. Nigam, R. Ghani: Analyzing the Effectiveness and Applicability of Co-training. In Proc. of the 9th International Conference on Information and Knowledge Management, 2000.
http://en.wikipedia.org/wiki