
SENTIMENT ANALYSIS
Distant Supervision for Emotion Classification in Twitter posts

1/17

SENTIMENT ANALYSIS: WHAT IS IT USED FOR?

Natural language and text processing to identify and extract subjective information

Classifying the polarity of a given text as positive, negative or neutral

In general: to discover how people feel about a particular topic

2/17

SENTIMENT ANALYSIS: WHO IS IT USED BY?

Customers
• To research products before purchasing

Marketers
• To research public opinion of their company or products
• To analyze customer satisfaction

Organizations
• To gather critical feedback on newly released products

3/17

PROBLEM

Earlier studies relied on predefined datasets, typically keyword-based

Determining the emotion is subjective

Words can be ambiguous

4/17

DISTANT SUPERVISION STRATEGY

An attempt to exploit the widespread use of emoticons and other emotional content

They are treated as noisy labels to obtain very large training sets

Machine learning algorithms (Naïve Bayes, MaxENT and SVM) have accuracy above 80% when trained with emoticon data

5/17

SENTIMENT140

A web application for discovering the sentiment of a brand, product or topic on Twitter

6/17

APPROACH

Machine learning classifiers
• Keyword-based
• Naïve Bayes
• MaxENT
• SVM

Feature extractors
• Unigrams
• Bigrams
• Unigrams and bigrams
• Unigrams with part-of-speech tags
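The unigram and bigram extractors listed above can be sketched in a few lines. This is only an illustration: the function names and the lowercase, whitespace-based tokenization are assumptions, not the tokenizer used in the study. The part-of-speech variant is left out because it needs an external tagger.

```python
from typing import List


def unigrams(tweet: str) -> List[str]:
    """Lowercased whitespace tokens (assumed tokenization, kept deliberately simple)."""
    return tweet.lower().split()


def bigrams(tweet: str) -> List[str]:
    """Adjacent token pairs joined by a space, e.g. 'not good'."""
    tokens = unigrams(tweet)
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]


def unigrams_and_bigrams(tweet: str) -> List[str]:
    """Concatenation of both feature lists."""
    return unigrams(tweet) + bigrams(tweet)
```

For the tweet "Not good at all", the bigram extractor yields the features "not good", "good at" and "at all", so the negation stays attached to the word it modifies.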

7/17

KEYWORD-BASED

As a baseline, a publicly available list of keywords is used

For each tweet, the number of positive and negative keywords is counted

The classifier returns the polarity with the higher count
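A minimal sketch of this counting baseline follows. The two keyword sets are hypothetical stand-ins, since the slides do not name the list that was used, and returning "neutral" on a tie is likewise an assumption.

```python
# Hypothetical keyword lists, for illustration only.
POSITIVE_KEYWORDS = {"good", "great", "love", "happy"}
NEGATIVE_KEYWORDS = {"bad", "awful", "hate", "sad"}


def keyword_polarity(tweet: str) -> str:
    """Count positive and negative keywords and return the polarity with more hits."""
    tokens = tweet.lower().split()
    pos = sum(t in POSITIVE_KEYWORDS for t in tokens)
    neg = sum(t in NEGATIVE_KEYWORDS for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # assumption: the slides do not say how ties are handled
```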

8/17

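NAÏVE BAYES

A multinomial Naïve Bayes model is used

Class c* is assigned to tweet d, where

$$c^* = \arg\max_c P_{NB}(c \mid d), \qquad P_{NB}(c \mid d) = \frac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}$$

In this formula, f_i represents a feature and n_i(d) represents the count of feature f_i found in tweet d. There are a total of m features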
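To make the roles of f_i, n_i(d) and m concrete, here is a small from-scratch sketch of multinomial Naïve Bayes. It works in log space and drops P(d), which is the same for every class. The add-one smoothing and the function names `train_nb` and `classify_nb` are assumptions for illustration, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_nb(data: List[Tuple[List[str], str]]):
    """Estimate log P(c) and per-class feature counts from (features, label) pairs."""
    class_counts = Counter(label for _, label in data)
    feature_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab = set()
    for features, label in data:
        feature_counts[label].update(features)
        vocab.update(features)
    log_prior = {c: math.log(n / len(data)) for c, n in class_counts.items()}
    return log_prior, feature_counts, vocab


def classify_nb(features: List[str], model) -> str:
    """Return the class c maximizing log P(c) + sum_i n_i(d) * log P(f_i | c)."""
    log_prior, feature_counts, vocab = model
    best_class, best_score = None, -math.inf
    for c, prior in log_prior.items():
        total = sum(feature_counts[c].values())
        score = prior
        # Counter(features) gives n_i(d); features never seen in training are skipped.
        for f, n in Counter(features).items():
            if f in vocab:
                # Add-one smoothing (an assumption) avoids zero probabilities.
                p = (feature_counts[c][f] + 1) / (total + len(vocab))
                score += n * math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```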

9/17

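MAXIMUM ENTROPY

Feature-based models

Features like bigrams and phrases can be added

$$P_{ME}(c \mid d, \lambda) = \frac{\exp\left[\sum_i \lambda_i f_i(c, d)\right]}{\sum_{c'} \exp\left[\sum_i \lambda_i f_i(c', d)\right]}$$

In this formula, c is the class, d is the tweet, and lambda is a weight vector. The weight vector decides the significance of a feature in classification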
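The sketch below shows how a weight vector turns into class probabilities: the weights of the features present in the tweet are summed per class and then normalized with a softmax. The weights are assumed to be given, since training them is outside the scope of this sketch. Storing one weight per (class, feature) pair and the name `maxent_probs` are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

Weights = Dict[Tuple[str, str], float]  # (class, feature) -> lambda


def maxent_probs(features: List[str], classes: List[str], weights: Weights) -> Dict[str, float]:
    """P(c | d, lambda) as a softmax over the summed weights of the tweet's features."""
    scores = {c: sum(weights.get((c, f), 0.0) for f in features) for c in classes}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: e / z for c, e in exp_scores.items()}
```

A feature with a large positive weight for a class pulls the probability toward that class, which is the sense in which the weights decide the significance of a feature.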

10/17

SUPPORT VECTOR MACHINES

Input data are two sets of vectors of size m, where each entry in the vector corresponds to the presence of a feature

E.g. unigram feature extractor: a feature is a word found in a tweet
• If the feature is present: value 1
• If not: value 0
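The presence encoding described above can be illustrated as follows. The function name, the fixed vocabulary list and the whitespace tokenization are assumptions made for this sketch.

```python
from typing import List


def presence_vector(tweet: str, vocabulary: List[str]) -> List[int]:
    """One entry per vocabulary word: 1 if the word occurs in the tweet, else 0."""
    tokens = set(tweet.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]
```

Repeated words do not change the vector: only presence is recorded, not the count.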

11/17

EXPERIMENTAL 1/2

The analysis is done using the Twitter API

In the API, a query for „:)“ returns tweets with positive emotion, and a query for „:(“ returns tweets with negative emotion

12/17

EXPERIMENTAL 2/2

The training data is post-processed with filters:

• Emoticons are stripped off for training purposes
  MaxENT and SVM have better accuracies without them

• Tweets with both positive and negative emoticons are removed
  e.g. „I’m turning 30 today :( but I still get birthday presents! :)“

• Retweets are removed
  The same tweet shouldn’t be counted twice

• Tweets with „:P“ are removed
  They usually don’t represent any distinct emotion

• Replicated tweets are removed
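The filters can be sketched as one pass over the collected tweets. Several details are assumptions: only „:)“ and „:(“ are treated as emoticons, a retweet is recognized by a leading "RT", and replicated tweets are detected by exact match after the emoticons are stripped. The slides do not specify any of these.

```python
import re
from typing import Iterable, List, Tuple

POSITIVE = (":)",)
NEGATIVE = (":(",)
# Assumption: a retweet starts with "RT"; the slides do not say how retweets are detected.
RETWEET = re.compile(r"^\s*RT\b")


def filter_training_data(tweets: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (text without emoticons, noisy label) pairs after applying the filters."""
    seen = set()
    result = []
    for tweet in tweets:
        has_pos = any(e in tweet for e in POSITIVE)
        has_neg = any(e in tweet for e in NEGATIVE)
        if has_pos == has_neg:  # no emoticon at all, or both kinds at once
            continue
        if ":P" in tweet or RETWEET.search(tweet):
            continue
        text = tweet
        for e in POSITIVE + NEGATIVE:  # strip the emoticons used as labels
            text = text.replace(e, "")
        text = " ".join(text.split())
        if text in seen:  # replicated tweet
            continue
        seen.add(text)
        result.append((text, "positive" if has_pos else "negative"))
    return result
```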

13/17

RESULTS FOR FEATURE EXTRACTION

Unigram feature extractor
• The simplest way to retrieve features
• Results are similar to Pang and Lee’s work on different classifiers on movie reviews

Bigram feature extractor
• Used for negation phrases like „not good“ or „not bad“
• Downside: bigrams are very sparse, and accuracy can drop for both MaxENT and SVM

Unigrams and bigrams
• Accuracy improved for Naïve Bayes and MaxENT
• Decline in accuracy for SVM

Parts of speech
• The same word may have many different meanings
• „Over“ as a verb may have a negative connotation
• „Over“ can be a noun, without any emotion at all
• POS tags are not of much use

14/17

UPGRADES

Semantics
• „Djokovic beats Federer :)“
• The sentiment is positive for Djokovic, negative for Federer

Domain-specific tweets
• Classifiers could perform better if limited to particular domains (such as movies)

Handling neutral tweets

Internationalization
• There are lots of tweets about the same subject in lots of different languages

Utilizing emoticon data in the set
• Emoticons are stripped out, but classifiers could perform better if they were included

15/17

SEMANTICS

From a tweet that says „Djokovic beats Federer“ alone, one cannot extract the sentiment of the tweet

Semantics could make the classification more precise

if (user.isFrom(Serbia)) then
    sentiment := positive
else if (user.isFrom(Switzerland)) then
    sentiment := negative

Using semantics, we can gather more information than just by reading keywords
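The idea above can be written as a small runnable sketch. The mapping from a user's country to a supported player is a hypothetical stand-in for real semantic knowledge, and the function name `match_sentiment` is an assumption made for illustration.

```python
from typing import Optional

# Hypothetical mapping from a user's country to the player they are assumed to support.
SUPPORTED_PLAYER = {"Serbia": "Djokovic", "Switzerland": "Federer"}


def match_sentiment(user_country: str, winner: str, loser: str) -> Optional[str]:
    """Sentiment of a '<winner> beats <loser>' tweet from the user's point of view."""
    player = SUPPORTED_PLAYER.get(user_country)
    if player == winner:
        return "positive"
    if player == loser:
        return "negative"
    return None  # no basis for a decision from the user's origin alone
```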

16/17

THANK YOU FOR YOUR ATTENTION

NIKOLA JOLIC

17/17