Text Classification & Sentiment Analysis Muhammad Atif Qureshi Arjumand Younus

Page 1: Text classification & sentiment analysis

Text Classification & Sentiment Analysis

Muhammad Atif Qureshi, Arjumand Younus

Page 2: Text classification & sentiment analysis

Contents

● An Introduction to Text Classification

– Text Classification Examples

– Text Classification Methods

● Naive Bayes

– Formalization

– Learning

● Applications of Sentiment Analysis

● Baseline Algorithm for Sentiment Analysis

● Sentiment Lexicons

● Sentiment Analysis for the Political Domain (Personal Research)

Page 3: Text classification & sentiment analysis

Text Classification Examples

● News filtering and organization

● Document organization and retrieval

● Sentiment analysis/Opinion mining

● Email classification and spam filtering

● Authorship attribution

Page 4: Text classification & sentiment analysis

Spam Classification Example

Slide borrowed from the Coursera lectures on “Natural Language Processing” by Prof. Dan Jurafsky

Page 5: Text classification & sentiment analysis

Text Classification

● Set of training documents D = {d1, ..., dN}, where each document is labeled with a class value c from C = {c1, ..., cJ}

● Features in the training data are related to the labels by means of a classification model

● The classification model helps predict the label for an unknown (test) record

● In text classification, the model uses text-based features

Page 6: Text classification & sentiment analysis

Text Classification Methods

● Hand-coded rules

● Supervised machine learning

– Naive Bayes

– Logistic regression

– Support vector machines

– K-nearest neighbors

Page 7: Text classification & sentiment analysis

Naive Bayes

● Simple (“naive”) classification method based on Bayes rule

● Relies on a simple document representation, namely bag of words

I love this movie. It's sweet but with satirical humor. The dialogue is great and the adventure scenes are great fun...It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times as I love it so much, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

Page 8: Text classification & sentiment analysis

Bag of Words Representation: Subset of Words

I love this movie. It's sweet but with satirical humor. The dialogue is great and the adventure scenes are great fun...It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times as I love it so much, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

Word        Count
great       2
love        2
recommend   1
laugh       1
happy       1
.....       ....
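The counting step behind this representation can be sketched in a few lines of Python. This is a minimal illustration, not the slide authors' code: the function name `bag_of_words`, the regex tokenizer, the abridged review text, and the chosen word subset are all assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary=None):
    """Count word occurrences, ignoring word order.

    If `vocabulary` is given, keep only the words in that subset.
    """
    # Assumed tokenizer: lowercase, then runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    if vocabulary is not None:
        counts = Counter({w: counts[w] for w in vocabulary if counts[w] > 0})
    return counts

# Abridged version of the review shown on the slide.
review = (
    "I love this movie. The dialogue is great and the adventure scenes "
    "are great fun. I would recommend it to just about anyone. "
    "I love it so much, and I'm always happy to see it again."
)
sentiment_words = {"great", "love", "recommend", "happy"}
counts = bag_of_words(review, sentiment_words)
print(counts.most_common())
```

Word positions are discarded entirely; only the per-word counts reach the classifier.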

Page 9: Text classification & sentiment analysis

Bayes' Rule Applied to Documents and Classes

● For a document d and a class c

$$
P(c \mid d) = \frac{P(d \mid c)\,P(c)}{P(d)}
$$

Page 10: Text classification & sentiment analysis

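Naive Bayes Classifier (1/3)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(c \mid d) \\
        &= \operatorname*{argmax}_{c \in C} \frac{P(d \mid c)\,P(c)}{P(d)} \\
        &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c)
\end{aligned}
$$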

Page 11: Text classification & sentiment analysis

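Naive Bayes Classifier (2/3)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
        &= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
\end{aligned}
$$

● Document d is represented as features x1, ..., xn

● Prior P(c): how often does this class occur? We can just count the relative frequencies in a corpus.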

Page 12: Text classification & sentiment analysis

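Naive Bayes Classifier (3/3)

$$
\begin{aligned}
c_{MAP} &= \operatorname*{argmax}_{c \in C} P(d \mid c)\,P(c) \\
        &= \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
\end{aligned}
$$

● The likelihood $P(x_1, x_2, \ldots, x_n \mid c)$ has $O(|X|^n \cdot |C|)$ parameters

● Could only be estimated if a very, very large number of training examples was available.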

Page 13: Text classification & sentiment analysis

Multinomial Naive Bayes Independence Assumptions

● Bag of Words assumption: Assume position doesn't matter

● Conditional Independence: Assume the feature probabilities $P(x_i \mid c_j)$ are independent given the class $c_j$.

$$
P(x_1, x_2, \ldots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdots P(x_n \mid c)
$$

Page 14: Text classification & sentiment analysis

Multinomial Naive Bayes Classifier

positions ← all word positions in test document

$$
c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)
$$
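The decision rule can be sketched as follows. This is an illustrative sketch rather than the authors' implementation. It sums log-probabilities instead of multiplying raw probabilities, which yields the same argmax while avoiding numeric underflow on long documents. The function name `classify` and all probability values are hypothetical, and the sketch assumes every token has an entry in the likelihood table.

```python
import math

def classify(tokens, priors, likelihoods):
    """Return the class maximizing log P(c) + sum over positions of log P(x_i | c)."""
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for tok in tokens:  # one factor per word position in the test document
            score += math.log(likelihoods[c][tok])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical, hand-picked probabilities for illustration only.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {
    "pos": {"great": 0.10, "boring": 0.01, "movie": 0.05},
    "neg": {"great": 0.02, "boring": 0.08, "movie": 0.05},
}
print(classify(["great", "great", "movie"], priors, likelihoods))
```

A word that occurs twice in the document contributes two factors, which is what iterating over positions (rather than over distinct words) achieves.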

Page 15: Text classification & sentiment analysis

Multinomial Naive Bayes Classifier

$$
c_{MAP} = \operatorname*{argmax}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\,P(c)
$$

$$
c_{NB} = \operatorname*{argmax}_{c \in C} P(c) \prod_{x \in X} P(x \mid c)
$$

Page 16: Text classification & sentiment analysis

Learning the Multinomial Naive Bayes Model

● First attempt: maximum likelihood estimates

– simply use frequencies in the data

Page 17: Text classification & sentiment analysis

Parameter Estimation

● Create mega-document for topic j by concatenating all docs in this topic

– Use frequency of w in mega-document
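Written out, the frequency-based estimates usually take the following form. The notation is introduced here for illustration and is an assumption: $N_{c_j}$ is the number of training documents in class $c_j$, $N$ is the total number of training documents, $V$ is the vocabulary, and $\mathrm{count}(w, c_j)$ is the number of occurrences of word $w$ in the mega-document for $c_j$.

$$
\hat{P}(c_j) = \frac{N_{c_j}}{N},
\qquad
\hat{P}(w_i \mid c_j) = \frac{\mathrm{count}(w_i, c_j)}{\sum_{w \in V} \mathrm{count}(w, c_j)}
$$

The second estimate is the fraction of all word tokens in the class mega-document that are $w_i$.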

Page 18: Text classification & sentiment analysis

Problem with Maximum Likelihood

● What if we have seen no training documents that contain the word fantastic and are classified as positive?

● Zero probabilities cannot be conditioned away, no matter the other evidence!
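A small sketch makes the effect concrete. The toy training set and the helper names `mle_likelihood` and `unsmoothed_score` are hypothetical; the point is only that one unseen word drives the whole product to zero.

```python
from collections import Counter

# Hypothetical toy training data: (class label, tokens).
training = [
    ("positive", ["great", "fun", "great"]),
    ("negative", ["boring", "bad"]),
]

def mle_likelihood(word, label):
    """Unsmoothed estimate: relative frequency of `word` among the class's tokens."""
    counts = Counter()
    for cls, tokens in training:
        if cls == label:
            counts.update(tokens)
    return counts[word] / sum(counts.values())

def unsmoothed_score(tokens, label, prior=0.5):
    """P(c) times the product of unsmoothed P(w | c) over all tokens."""
    score = prior
    for tok in tokens:
        score *= mle_likelihood(tok, label)
    return score

# "fantastic" never occurs in a positive training document ...
p_fantastic = mle_likelihood("fantastic", "positive")
# ... so the score is zero despite two strongly positive words.
score = unsmoothed_score(["great", "fun", "fantastic"], "positive")
print(p_fantastic, score)
```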

Page 19: Text classification & sentiment analysis

Laplace (add-1) Smoothing for Naive Bayes
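In its usual form, add-1 smoothing adds one to every word count before normalizing. The notation here is an assumption introduced for illustration: $V$ is the vocabulary and $\mathrm{count}(w, c)$ is the number of occurrences of word $w$ in the training documents of class $c$.

$$
\hat{P}(w_i \mid c)
= \frac{\mathrm{count}(w_i, c) + 1}{\sum_{w \in V} \bigl(\mathrm{count}(w, c) + 1\bigr)}
= \frac{\mathrm{count}(w_i, c) + 1}{\Bigl(\sum_{w \in V} \mathrm{count}(w, c)\Bigr) + |V|}
$$

Every vocabulary word now receives a nonzero probability in every class, so a single unseen word can no longer zero out the product.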

Page 20: Text classification & sentiment analysis

Multinomial Naive Bayes: Learning

● From training corpus, extract Vocabulary
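A learning procedure along these lines can be sketched as below, assuming add-1 smoothing and log-space scoring. The function names, the toy corpus, and the decision to ignore test words outside the vocabulary are assumptions made for this example, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    """Estimate log priors and add-1 smoothed log likelihoods from (label, tokens) pairs."""
    vocabulary = set()
    doc_counts = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class "mega-document" word counts
    for label, tokens in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)

    n_docs = sum(doc_counts.values())
    log_prior = {c: math.log(doc_counts[c] / n_docs) for c in doc_counts}
    log_likelihood = {}
    for c in doc_counts:
        denom = sum(word_counts[c].values()) + len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + 1) / denom) for w in vocabulary
        }
    return vocabulary, log_prior, log_likelihood

def predict(tokens, vocabulary, log_prior, log_likelihood):
    """Pick the class with the highest log score; out-of-vocabulary tokens are skipped."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][t] for t in tokens if t in vocabulary)
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy corpus for illustration.
docs = [
    ("pos", ["great", "fun", "great"]),
    ("pos", ["love", "fun"]),
    ("neg", ["boring", "bad"]),
    ("neg", ["bad", "plot"]),
]
vocab, log_prior, log_likelihood = train_multinomial_nb(docs)
print(predict(["great", "fun"], vocab, log_prior, log_likelihood))
```

With this corpus the vocabulary has six words and the positive class has five tokens, so the smoothed estimate for "great" in the positive class is (2 + 1) / (5 + 6).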

Page 21: Text classification & sentiment analysis

Multinomial Naive Bayes: A Worked Example

Page 22: Text classification & sentiment analysis

Sentiment Analysis Overview

Page 23: Text classification & sentiment analysis

Sentiment Analysis Applications (1/4)

● Movie: is this review positive or negative?

● Products: what do people think about the new iPhone?

● Public sentiment: how is consumer confidence? Is despair increasing?

● Politics: what do people think about this candidate or issue?

● Prediction: predict election outcomes or market trends from sentiment

Page 24: Text classification & sentiment analysis

Sentiment Analysis Applications (2/4)

Page 25: Text classification & sentiment analysis

Sentiment Analysis Applications (3/4)

Page 26: Text classification & sentiment analysis

Sentiment Analysis Applications (4/4)

Page 27: Text classification & sentiment analysis

Formal Definition of Sentiment Analysis

● Sentiment analysis is the detection of attitudes: “enduring, affectively colored beliefs, dispositions towards objects or persons”

1. Holder (source) of attitude

2. Target (aspect) of attitude

3. Type of attitude

➢ From a set of types

• like, love, hate, value, desire, etc.

➢ Or (more commonly) simple weighted polarity:

• positive, negative, neutral, together with strength

4. Text containing the attitude

➢ Sentence or entire document

Page 28: Text classification & sentiment analysis

Sentiment Analysis Tasks

● Simplest:

– Is the attitude of this text positive or negative?

● More complex:

– Rank the attitude of this text from 1 to 5

● Advanced:

– Detect the target, source, or complex attitude types

Page 29: Text classification & sentiment analysis

Sentiment Analysis: A Baseline Algorithm

● Polarity detection in movie reviews:

– Is an IMDB movie review positive or negative?

● Data: Polarity Data 2.0:

– http://www.cs.cornell.edu/people/pabo/movie-review-data/

Page 30: Text classification & sentiment analysis

Baseline Algorithm (adapted from Pang and Lee)

● Tokenization

● Feature Extraction

● Classification using different classifiers

– Naive Bayes

– MaxEnt

– SVM

Page 31: Text classification & sentiment analysis

Sentiment Tokenization Issues

● Deal with HTML and XML markup

● Twitter markup (names, hash tags)

● Capitalization (preserve for words in all caps)

● Phone numbers, dates

● Emoticons
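Several of these issues can be handled with a small regex-based tokenizer. The sketch below is one possible approach under stated assumptions: the patterns `TAG_RE` and `TOKEN_RE`, the function name `tokenize`, and the very small emoticon set are hypothetical choices, and phone numbers and dates are not treated specially.

```python
import re

# Assumed pattern for HTML/XML tags; real markup may need a proper parser.
TAG_RE = re.compile(r"<[^>]+>")

TOKEN_RE = re.compile(
    r"""
    [:;=][-']?[)(DPp]     # a few simple emoticons such as :) ;-) :D
    | @\w+                # Twitter user names
    | \#\w+               # hash tags
    | \w+(?:'\w+)?        # words, including contractions
    """,
    re.VERBOSE,
)

def tokenize(text):
    """Strip markup, keep emoticons and Twitter markup, preserve ALL-CAPS words."""
    text = TAG_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    # Lowercase everything except tokens written entirely in capitals.
    return [t if (t.isupper() and len(t) > 1) else t.lower() for t in tokens]

print(tokenize("<b>GREAT</b> movie :) @user #Fun Loved it"))
```

Keeping "GREAT" distinct from "great" lets the classifier treat shouting as its own feature, while ordinary capitalization at sentence starts is normalized away.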

Page 32: Text classification & sentiment analysis

Extracting Features for Sentiment Classification

● How to handle negation

– I didn't like this movie

vs

– I really like this movie

● Which words to use?

– Only adjectives

– All words

Page 33: Text classification & sentiment analysis

Negation

● Add NOT_ to every word between negation and following punctuation:

Didn't like this movie, but I

Didn't NOT_like NOT_this NOT_movie but I
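The marking rule can be sketched as follows. The list of negation cues, the punctuation set, and the function name `mark_negation` are assumptions for this example; punctuation tokens are dropped from the output to match the slide's example.

```python
import re

# Assumed negation cues: "not", "no", "never", and any word ending in "n't".
NEGATION_RE = re.compile(r"^(?:not|no|never|\w+n't)$", re.IGNORECASE)
PUNCT_RE = re.compile(r"^[.,:;!?]$")

def mark_negation(text):
    """Prefix NOT_ to every word between a negation cue and the next punctuation mark."""
    tokens = re.findall(r"\w+(?:'\w+)?|[.,:;!?]", text)
    out, negating = [], False
    for tok in tokens:
        if PUNCT_RE.match(tok):
            negating = False  # punctuation ends the negation scope
            continue
        if negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if NEGATION_RE.match(tok):
                negating = True
    return out

print(mark_negation("Didn't like this movie, but I"))
```

After this step "like" and "NOT_like" are separate features, so a bag-of-words classifier can learn opposite polarities for them.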

Page 34: Text classification & sentiment analysis

Reminder: Naive Bayes

Page 35: Text classification & sentiment analysis

Sentiment Lexicons

● Dictionary of well-known “sentiment” words

– Abusive terms

– Adjectives like bad, worse, good, better, ugly, pretty

● Available for use in research

– LIWC: Linguistic Inquiry and Word Count

– SentiStrength

– Bing Liu's Opinion Lexicon
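A lexicon can be used directly for a simple polarity score. The sketch below is only illustrative: the two word sets reuse the adjectives listed above and are not taken from LIWC, SentiStrength, or Bing Liu's Opinion Lexicon, and the function name `lexicon_polarity` is hypothetical.

```python
# Tiny hypothetical lexicon built from the adjectives mentioned above.
POSITIVE = {"good", "better", "pretty"}
NEGATIVE = {"bad", "worse", "ugly"}

def lexicon_polarity(tokens):
    """Count positive minus negative lexicon hits and map the sign to a label."""
    score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_polarity(["a", "pretty", "good", "movie"]))
```

A real lexicon would be far larger and may carry per-word strengths; the counting logic stays the same.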

Page 36: Text classification & sentiment analysis

My Research: Election Trolling on Twitter (Pakistan Elections 2013)

Twitterer Tweet

A @B Yeh...#Shame with fake account, this is how PTIians think they will get votes

B @A Stop making a fuss and fuck off.

A @B A dumb leader like IK can produce followers like you.

B @A A corrupt leader like Noora can hire paid trolls like you