2013 IEEE International Conference on Big Data. Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier. Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen

Page 1: Scalable sentiment classification for big data analysis using naive bayes classifier

2013 IEEE International Conference on Big Data

Scalable Sentiment Classification for Big Data Analysis Using Naive Bayes Classifier

Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen

Page 2: Scalable sentiment classification for big data analysis using naive bayes classifier

outline

✤ introduction

✤ Naive Bayes Classification

✤ implementation of Naive Bayes in Hadoop

✤ experimental study

Page 3: Scalable sentiment classification for big data analysis using naive bayes classifier

introduction

A typical method to obtain valuable information is to extract the sentiment or opinion from a message

This paper aims to evaluate the scalability of the Naive Bayes classifier (NBC) on large datasets

Page 4: Scalable sentiment classification for big data analysis using naive bayes classifier

introduction

NBC is able to scale up to analyze the sentiment of millions of movie reviews with increasing throughput

The accuracy of NBC improves and approaches 82%

Page 5: Scalable sentiment classification for big data analysis using naive bayes classifier

Naive Bayes Classification

Naive Bayes classifiers are simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features

A popular method for text categorization (the problem of judging which category a document belongs to)

Page 6: Scalable sentiment classification for big data analysis using naive bayes classifier

Naive Bayes Classification

prior probability: P(A)

posterior probability: P(A|B)

Page 7: Scalable sentiment classification for big data analysis using naive bayes classifier

Naive Bayes Classification

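Bayes' theorem

$$P(\mathrm{POS} \mid d_1) = \frac{P(\mathrm{POS}) \times P(d_1 \mid \mathrm{POS})}{P(d_1)}$$

$$P(\mathrm{POS} \mid \text{excellent}, \text{terrible}) = \frac{P(\mathrm{POS}) \times P(\text{excellent}, \text{terrible} \mid \mathrm{POS})}{P(\text{excellent}, \text{terrible})}$$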

Page 8: Scalable sentiment classification for big data analysis using naive bayes classifier

Naive Bayes Classification

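$$P(\mathrm{POS} \mid \text{excellent}, \text{terrible}) = \frac{P(\mathrm{POS}) \times P(\text{excellent}, \text{terrible} \mid \mathrm{POS})}{P(\text{excellent}, \text{terrible})}$$

If the words are assumed independent given the class:

$$P(\text{excellent}, \text{terrible} \mid \mathrm{POS}) = P(\text{excellent} \mid \mathrm{POS}) \times P(\text{terrible} \mid \mathrm{POS})$$

so that

$$P(\mathrm{POS} \mid \text{excellent}, \text{terrible}) = \frac{P(\mathrm{POS}) \times P(\text{excellent} \mid \mathrm{POS}) \times P(\text{terrible} \mid \mathrm{POS})}{P(\text{excellent}, \text{terrible})}$$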

Page 9: Scalable sentiment classification for big data analysis using naive bayes classifier

Naive Bayes Classification

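doc   class   excellent   terrible
d1    POS     5           1
d2    NEG     2           6

$$P(\mathrm{POS} \mid \text{excellent}, \text{terrible}) = \frac{P(\mathrm{POS}) \times P(\text{excellent} \mid \mathrm{POS}) \times P(\text{terrible} \mid \mathrm{POS})}{P(\text{excellent}, \text{terrible})}$$

From the table: P(POS) = P(NEG) = 1/2, P(excellent|POS) = 5/6, P(terrible|POS) = 1/6, P(excellent|NEG) = 2/8, P(terrible|NEG) = 6/8

d3 : (excellent,8),(terrible,2)

The denominator P(excellent,terrible) is the same for both classes, so only the numerators are compared:

$$P(\mathrm{POS} \mid \text{excellent}, \text{terrible}) \propto \frac{1}{2} \times \left(\frac{5}{6}\right)^{8} \times \left(\frac{1}{6}\right)^{2}$$

$$P(\mathrm{NEG} \mid \text{excellent}, \text{terrible}) \propto \frac{1}{2} \times \left(\frac{2}{8}\right)^{8} \times \left(\frac{6}{8}\right)^{2}$$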

Page 10: Scalable sentiment classification for big data analysis using naive bayes classifier

Naive Bayes Classification

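d3 : (excellent,8),(terrible,2)

$$P(\mathrm{POS} \mid \text{excellent}, \text{terrible}) \propto \frac{1}{2} \times \left(\frac{5}{6}\right)^{8} \times \left(\frac{1}{6}\right)^{2} \approx 0.00323011165$$

$$P(\mathrm{NEG} \mid \text{excellent}, \text{terrible}) \propto \frac{1}{2} \times \left(\frac{2}{8}\right)^{8} \times \left(\frac{6}{8}\right)^{2} \approx 0.00000429153$$

d3 is POS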
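The arithmetic for d3 can be checked with a short Python sketch. The names `counts`, `priors` and `score` are illustrative and do not come from the paper. The scores are unnormalised numerators, since the denominator P(excellent,terrible) is identical for both classes.

```python
# Word counts per class, taken from the d1/d2 table in the slides.
counts = {
    "POS": {"excellent": 5, "terrible": 1},
    "NEG": {"excellent": 2, "terrible": 6},
}
priors = {"POS": 0.5, "NEG": 0.5}  # one training document per class


def score(doc, label):
    """Unnormalised posterior: P(c) * product of P(w|c) ** count(w, doc)."""
    total = sum(counts[label].values())
    result = priors[label]
    for word, n in doc.items():
        result *= (counts[label][word] / total) ** n
    return result


d3 = {"excellent": 8, "terrible": 2}
scores = {label: score(d3, label) for label in counts}
prediction = max(scores, key=scores.get)  # class with the larger score
print(scores, prediction)
```

The POS score comes out roughly 750 times larger than the NEG score, which is why d3 is labelled POS.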

Page 11: Scalable sentiment classification for big data analysis using naive bayes classifier

Naive Bayes Classification

$$\frac{1}{2} \times \left(\frac{5}{6}\right)^{8} \times \left(\frac{1}{6}\right)^{2}$$
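Products of many small probabilities like this one shrink quickly as documents get longer. A common remedy, given here as general background rather than something this slide states, is to compare logarithms instead, which turns the product into a sum:

```latex
\log\left[\frac{1}{2}\left(\frac{5}{6}\right)^{8}\left(\frac{1}{6}\right)^{2}\right]
  = \log\frac{1}{2} + 8\log\frac{5}{6} + 2\log\frac{1}{6}
```

Because the logarithm is monotonic, the class with the larger sum is also the class with the larger product.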

Page 12: Scalable sentiment classification for big data analysis using naive bayes classifier

Naive Bayes Classification

N is the total number of documents, and N_c is the number of documents in class c

N_wi is the frequency of word w_i in class c
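One way to read these definitions is as count-based estimates: P(c) as N_c / N, and P(w_i|c) as N_wi divided by the total word count in class c. The sketch below assumes exactly that reading. The optional add-`alpha` (Laplace) smoothing term is an assumption, not something these slides state, and the function and parameter names are hypothetical.

```python
from collections import Counter


def estimate(docs, alpha=0.0):
    """docs: list of (label, {word: count}). Returns (priors, likelihoods)."""
    n_docs = len(docs)              # N
    docs_per_class = Counter()      # N_c
    word_freq = {}                  # N_wi, kept per class
    vocab = set()
    for label, words in docs:
        docs_per_class[label] += 1
        word_freq.setdefault(label, Counter()).update(words)
        vocab.update(words)
    priors = {c: n / n_docs for c, n in docs_per_class.items()}
    likelihoods = {}
    for c, freq in word_freq.items():
        # alpha = 0 gives plain relative frequencies (assumed reading).
        denom = sum(freq.values()) + alpha * len(vocab)
        likelihoods[c] = {w: (freq[w] + alpha) / denom for w in vocab}
    return priors, likelihoods


train = [("POS", {"excellent": 5, "terrible": 1}),
         ("NEG", {"excellent": 2, "terrible": 6})]
priors, likelihoods = estimate(train)
```

With `alpha=0.0` this reproduces the 5/6, 1/6, 2/8 and 6/8 values of the two-document example; a positive `alpha` would keep unseen words from producing zero probabilities.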

Page 13: Scalable sentiment classification for big data analysis using naive bayes classifier

implementation of Naive Bayes in Hadoop

pre-processing raw dataset

Page 14: Scalable sentiment classification for big data analysis using naive bayes classifier

implementation of Naive Bayes in Hadoop

1000 positive and 1000 negative reviews

Page 15: Scalable sentiment classification for big data analysis using naive bayes classifier

implementation of Naive Bayes in Hadoop

(word,posSum,negSum)

the word's frequency across all positive documents and across all negative documents

(excellent,1000,10)
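As a rough illustration of how (word,posSum,negSum) records could be produced, the following sketch sums per-word counts over labelled reviews in plain Python. It stands in for the Hadoop job rather than reproducing it. The function name `build_model` and the whitespace tokenizer are assumptions.

```python
from collections import defaultdict


def build_model(reviews):
    """reviews: iterable of (label, text), label is 'pos' or 'neg'.
    Returns a sorted list of (word, posSum, negSum) tuples."""
    sums = defaultdict(lambda: [0, 0])  # word -> [posSum, negSum]
    for label, text in reviews:
        idx = 0 if label == "pos" else 1
        for word in text.lower().split():  # "map" side: one hit per token
            sums[word][idx] += 1           # "reduce" side: sum per word/class
    return sorted((w, p, n) for w, (p, n) in sums.items())


model = build_model([
    ("pos", "excellent plot excellent cast"),
    ("neg", "terrible plot"),
])
print(model)
```

In a MapReduce setting the inner loop would be split into a mapper that emits per-word counts and a reducer that adds them up per word.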

Page 16: Scalable sentiment classification for big data analysis using naive bayes classifier

implementation of Naive Bayes in Hadoop

(word,posSum,negSum): (excellent,1000,10)

(word,count,docID): (excellent,20,5)

combined on word into

(docID,count,word,posSum,negSum): (5,20,excellent,1000,10)
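Combining the two record types is effectively a join on the word. A minimal in-memory sketch, assuming the model fits in a dictionary (the distributed job would handle this differently, and these slides do not detail how); `join_records` is a hypothetical name:

```python
def join_records(model, test_counts):
    """model: (word, posSum, negSum) tuples.
    test_counts: (word, count, docID) tuples.
    Returns (docID, count, word, posSum, negSum) for words found in the model."""
    by_word = {word: (pos, neg) for word, pos, neg in model}
    joined = []
    for word, count, doc_id in test_counts:
        if word in by_word:  # assumption: words unseen in training are skipped
            pos, neg = by_word[word]
            joined.append((doc_id, count, word, pos, neg))
    return joined


joined = join_records([("excellent", 1000, 10)], [("excellent", 20, 5)])
print(joined)
```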

Page 17: Scalable sentiment classification for big data analysis using naive bayes classifier

implementation of Naive Bayes in Hadoop

(docID,count,word,posSum,negSum)

(5,10,excellent,20,5)

(5,2,terrible,5,20)

pos score: 10 x log(20) + 2 x log(5)

neg score: 10 x log(5) + 2 x log(20)

(docID,predict,correct)

(5,pos,true)

(6,neg,false)
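The scoring on this slide can be sketched as follows: for each document, sum count x log(posSum) and count x log(negSum) over its words and pick the larger total. This follows the slide's arithmetic literally, with raw sums inside the logarithm and no normalization or smoothing. The natural logarithm, the tie-break toward pos, and the `true_labels` argument are assumptions.

```python
import math
from collections import defaultdict


def classify(joined, true_labels):
    """joined: (docID, count, word, posSum, negSum) records, sums > 0.
    Returns sorted (docID, predict, correct) tuples."""
    scores = defaultdict(lambda: [0.0, 0.0])  # docID -> [pos, neg]
    for doc_id, count, _word, pos_sum, neg_sum in joined:
        scores[doc_id][0] += count * math.log(pos_sum)
        scores[doc_id][1] += count * math.log(neg_sum)
    results = []
    for doc_id, (pos, neg) in sorted(scores.items()):
        predict = "pos" if pos >= neg else "neg"
        results.append((doc_id, predict, predict == true_labels[doc_id]))
    return results


results = classify(
    [(5, 10, "excellent", 20, 5), (5, 2, "terrible", 5, 20)],
    {5: "pos"},
)
print(results)
```

A zero posSum or negSum would make `math.log` fail, so a real pipeline would need some form of smoothing before this step.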

Page 18: Scalable sentiment classification for big data analysis using naive bayes classifier

experimental study

7 nodes: one name node and six data nodes; each VM is allocated two virtual CPUs and 4 GB of memory

Host: a Dell server with 12 Intel Xeon E5-2630 2.3 GHz cores and 32 GB of memory

Xen Cloud Platform (XCP) 1.6 is used as the hypervisor

Page 19: Scalable sentiment classification for big data analysis using naive bayes classifier

experimental study

training data

Page 20: Scalable sentiment classification for big data analysis using naive bayes classifier

experimental study