
Page 1:

Naive Bayes Classifier and application

Laxmiprasad Iyer (109275242)

Prof. Anita Wasilewska

Page 2:

References

Research paper: Thumbs up? Sentiment Classification using Machine Learning Techniques; Bo Pang, Lillian Lee, Shivakumar Vaithyanathan; EMNLP 2002.

Textbook: Introduction to Information Retrieval; Christopher Manning et al.
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
http://nlp.stanford.edu/IR-book/html/htmledition/feature-selection-1.html

Blog: http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/

Page 3:

Overview

•  Text Categorization Problem
•  A priori Probabilities
•  Posterior Probabilities and the Conditional Independence Assumption
•  Comparison of the Naive Bayes Classifier to other classifiers and choosing a classifier
•  Conditional Independence Assumption
•  Research Paper (application of the Naive Bayes classifier)

Page 4:

Consider a Text Categorization Problem: Movie Reviews

Positive reviews (love, wonderful, best, great, superb, still, beautiful)
Negative reviews (bad, worst, stupid, waste, boring, ?, !)

Page 5:

Dataset (1000 positive, 1000 negative)

#love  #wonderful  #best  ...  #stupid  #wordn  Class
1      0           1      ...  0        1       Positive
1      1           1      ...  0        1       Positive
1      0           0      ...  1        1       Negative
1      1           1      ...  1        0       Negative

#word indicates whether the word is present (1) or absent (0) in the review.
n is the total number of words in the vocabulary (16,165 in our case).
The vocabulary is the set of all words appearing in all the reviews.
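To make the representation concrete, here is a minimal Python sketch of how such binary presence/absence vectors could be built. The names `positive_reviews` and `negative_reviews` are hypothetical placeholders for the raw review texts, and the whitespace tokenizer is only illustrative, not the preprocessing used in the slides.

```python
# Minimal sketch: build binary presence/absence vectors from raw review texts.
# `positive_reviews` and `negative_reviews` are hypothetical lists of strings.

def build_vocabulary(reviews):
    """Vocabulary = set of all words appearing in all the reviews."""
    vocab = set()
    for text in reviews:
        vocab.update(text.lower().split())
    return sorted(vocab)

def to_binary_vector(text, vocab):
    """1 if the word is present in the review, 0 otherwise."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

# Tiny made-up reviews, just to show the shapes involved.
positive_reviews = ["a wonderful and superb movie", "the best film truly great"]
negative_reviews = ["the worst movie a stupid waste", "boring and bad"]

vocab = build_vocabulary(positive_reviews + negative_reviews)
dataset = ([(to_binary_vector(r, vocab), "Positive") for r in positive_reviews] +
           [(to_binary_vector(r, vocab), "Negative") for r in negative_reviews])
```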

Page 6:

Split the dataset into training and test data

Training data = 800 positive records and 800 negative records

Test data = 200 positive records and 200 negative records
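A class-balanced 80/20 split like the one above could be sketched as follows; `dataset` is the list of (vector, label) records from the previous sketch, and the shuffle seed is arbitrary.

```python
import random

def split_by_class(dataset, train_fraction=0.8, seed=0):
    """Split each class 80/20 so the 800/800 vs 200/200 balance is preserved."""
    random.seed(seed)
    train, test = [], []
    for label in ("Positive", "Negative"):
        records = [r for r in dataset if r[1] == label]
        random.shuffle(records)
        cut = int(train_fraction * len(records))
        train.extend(records[:cut])
        test.extend(records[cut:])
    return train, test

train_data, test_data = split_by_class(dataset)
```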

Page 7:

Training the Naive Bayes classifier

The training process is: learning (calculating) the a priori probabilities.

Page 8:

A priori Probabilities (calculated from the training data)

P(#word1 = 0 | Positive) = (count of records from the positive class with word1 = 0) / (total number of records in the positive class)

P(#word1 = 1 | Positive) = (count of records from the positive class with word1 = 1) / (total number of records in the positive class)

P(#word1 = 0 | Negative) = (count of records from the negative class with word1 = 0) / (total number of records in the negative class)

P(#word1 = 1 | Negative) = (count of records from the negative class with word1 = 1) / (total number of records in the negative class)

... and similarly for word2 through wordn.

Page 9:

A priori Probabilities (calculated from the training data)

Class probabilities:

P(Positive) = (number of records from the positive class) / (total number of records)

P(Negative) = (number of records from the negative class) / (total number of records)
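Putting the two preceding slides together, a minimal training sketch looks like this. It reuses `train_data` from the earlier split sketch and, like the slides, applies no smoothing, so an unseen word/class combination simply gets probability zero.

```python
def train_naive_bayes(train_data):
    """Estimate P(word_i = v | class) for v in {0, 1} and P(class) by counting."""
    classes = ("Positive", "Negative")
    n_words = len(train_data[0][0])
    class_prob = {}   # P(class)
    cond_prob = {}    # cond_prob[class][i][v] = P(word_i = v | class)

    for c in classes:
        records = [vec for vec, label in train_data if label == c]
        class_prob[c] = len(records) / len(train_data)
        cond_prob[c] = []
        for i in range(n_words):
            ones = sum(vec[i] for vec in records)   # records with word_i = 1
            p_one = ones / len(records)
            cond_prob[c].append({1: p_one, 0: 1 - p_one})
    return class_prob, cond_prob

class_prob, cond_prob = train_naive_bayes(train_data)
```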

Page 10:

Total number of model parameters (to be learnt)

= k * n * d, where k is the number of classes, n is the number of attributes, and d is the number of values per attribute. For our example k = 2, n = 16165, d = 2, so 2 * 16165 * 2 = 64,660 conditional probabilities (plus the class probabilities).

Page 11:

Testing

For each (labelled) test record:
1. Calculate the posterior probabilities.
2. Output the label with the highest posterior probability.
3. If the output label is the same as the original label there is no error; otherwise it counts as an error.

Error = fraction of misclassified records

Page 12:

Posterior Probabilities

Given unseen data D = <d1, d2, ..., dn>, compute P(Positive | D) and P(Negative | D).

Page 13:

Bayes Rule

P(Positive | D) ∝ P(D | Positive) * P(Positive)
P(Negative | D) ∝ P(D | Negative) * P(Negative)
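Written out in full, Bayes' rule also divides by the evidence P(D); because P(D) is identical for the two classes, it can be dropped when the two posteriors are only being compared, which is why the proportionality above suffices:

```latex
P(\text{Positive} \mid D) = \frac{P(D \mid \text{Positive})\, P(\text{Positive})}{P(D)},
\qquad
P(\text{Negative} \mid D) = \frac{P(D \mid \text{Negative})\, P(\text{Negative})}{P(D)}
```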

Page 14:

Conditional Independence Assumption

P(d1, d2, d3, ..., dn | Positive) = P(d1 | Positive) * P(d2 | Positive) * ... * P(dn | Positive)

P(d1, d2, d3, ..., dn | Negative) = P(d1 | Negative) * P(d2 | Negative) * ... * P(dn | Negative)

Page 15:

Posterior Probabilities (using Bayes' rule)

Given unseen data D = <1, 0, 1, 1, ..., 0>:

P(Positive | D) ∝ P(word1 = 1 | Positive) * P(word2 = 0 | Positive) * ... * P(wordn = 0 | Positive) * P(Positive)

P(Negative | D) ∝ P(word1 = 1 | Negative) * P(word2 = 0 | Negative) * ... * P(wordn = 0 | Negative) * P(Negative)
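The testing procedure of Page 11 can now be sketched end to end: score each class as the product above, predict the larger one, and report the fraction of misclassified test records. It reuses `class_prob`, `cond_prob`, and `test_data` from the earlier sketches; note that with 16,165 factors the raw product would underflow in a real implementation, where log-probabilities are usually summed instead (a detail not covered on the slides).

```python
def classify(vec, class_prob, cond_prob):
    """Return the class with the larger unnormalized posterior score."""
    scores = {}
    for c in class_prob:
        score = class_prob[c]                   # P(class)
        for i, value in enumerate(vec):
            score *= cond_prob[c][i][value]     # P(word_i = value | class)
        scores[c] = score
    return max(scores, key=scores.get)

def error_rate(test_data, class_prob, cond_prob):
    """Error = fraction of misclassified records."""
    wrong = sum(1 for vec, label in test_data
                if classify(vec, class_prob, cond_prob) != label)
    return wrong / len(test_data)

print("test error:", error_rate(test_data, class_prob, cond_prob))
```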

Page 16:

Final Naive Bayes Formula
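The formula itself does not survive in this transcript; the standard statement, consistent with the preceding slides, is the maximum a posteriori decision rule:

```latex
\hat{c} \;=\; \arg\max_{c \,\in\, \{\text{Positive},\,\text{Negative}\}}
\; P(c) \prod_{i=1}^{n} P(d_i \mid c)
```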

Page 17:

The final classifier is the end product, obtained after training and testing.

Page 18:

Conditional Independence Assumption

Does the conditional independence assumption hold in practice?

It is very limiting! Yet Naive Bayes performs effectively.

Page 19:

Simple interpretation of the formula

Page 20:

Comparison of the Naive Bayes classifier with other classifiers

Across 26 different datasets, Naive Bayes performed on par with SVM and decision tree classifiers except in 3 or 4 cases.

Page 21:

Research Paper

An exciting problem solved using the Naive Bayes classifier:
Thumbs up? Sentiment Classification using Machine Learning Techniques; Bo Pang, Lillian Lee, Shivakumar Vaithyanathan; EMNLP 2002.

Page 22:

3-fold cross-validation (rotation estimation)

Divide the dataset into 3 folds.
Use 2 folds for training and 1 fold for testing.
Find the cross-validation accuracy.
Repeat, choosing a different combination of folds each time.
Find the average cross-validation accuracy.
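A sketch of this procedure, reusing the hypothetical `train_naive_bayes` and `classify` functions from the earlier sketches and a simple round-robin fold assignment (the paper's exact fold construction may differ):

```python
def cross_validate(dataset, k=3):
    """Average accuracy over k train/test rotations (rotation estimation)."""
    folds = [dataset[i::k] for i in range(k)]   # round-robin split into k folds
    accuracies = []
    for i in range(k):
        test_fold = folds[i]
        train_folds = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
        class_prob, cond_prob = train_naive_bayes(train_folds)
        correct = sum(1 for vec, label in test_fold
                      if classify(vec, class_prob, cond_prob) == label)
        accuracies.append(correct / len(test_fold))
    return sum(accuracies) / k

print("average 3-fold CV accuracy:", cross_validate(dataset))
```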

Page 23:

Results (average 3-fold cross-validation accuracies)

Page 24:

Final Classifier: train using the whole training data.

Page 25:

Feature Selection

Two main purposes:
1. It makes training and applying a classifier more efficient.
2. It often increases classification accuracy by eliminating noise features.

Page 26:

Feature Selection Algorithm

Page 27:

Selecting the top 2633 features by frequency

Does ignoring noisy features that do not actually matter improve the performance?
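One way such a frequency cutoff could be implemented: keep only the k most frequent vocabulary words and project every record onto those columns. This is a sketch over the (vector, label) records from the earlier sketches, not the paper's exact procedure.

```python
def select_top_k_by_frequency(dataset, k=2633):
    """Keep the k columns whose words occur in the most records."""
    n_words = len(dataset[0][0])
    freq = [sum(vec[i] for vec, _ in dataset) for i in range(n_words)]   # document frequency
    top = sorted(range(n_words), key=lambda i: freq[i], reverse=True)[:k]
    reduced = [([vec[i] for i in top], label) for vec, label in dataset]
    return reduced, top

reduced_dataset, kept_columns = select_top_k_by_frequency(dataset, k=2633)
```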