SENTIMENT ANALYSIS
Distant Supervision for Emotion Classification in Twitter Posts
1/17
SENTIMENT ANALYSIS
WHAT IS IT USED FOR?
Natural language and text processing to identify and extract subjective information
Classifying the polarity of a given text as positive, negative or neutral
In general: to discover how people feel about a particular topic
2/17
SENTIMENT ANALYSIS
WHO IS IT USED BY?
Customers
• To research products before purchasing
Marketers
• To research public opinion of their company or products
• To analyze customer satisfaction
Organizations
• To gather critical feedback on newly released products
3/17
PROBLEM
Earlier studies relied on predefined datasets, typically keyword-based
Determining the emotion is subjective
Words can be ambiguous
4/17
DISTANT SUPERVISION STRATEGY
An attempt to exploit the widespread use of emoticons and other emotional content
They are treated as noisy labels to obtain very large training sets
Machine learning algorithms (Naïve Bayes, MaxENT and SVM) have accuracy above 80% when trained with emoticon data
5/17
SENTIMENT140
A web application for discovering the sentiment around a brand, product or topic on Twitter
6/17
APPROACH
Machine learning classifiers
• Keyword-based
• Naïve Bayes
• MaxENT
• SVM
Feature extractors
• Unigrams
• Bigrams
• Unigrams and bigrams
• Unigrams with part-of-speech tags
7/17
KEYWORD-BASED
As a baseline, a publicly available list of keywords is used
For each tweet, the number of positive and negative keywords is counted
The classifier returns the polarity with the higher count
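The counting rule above can be sketched in a few lines of Python. This is only an illustration: the two keyword sets are hypothetical stand-ins for the publicly available list, and reporting a tie as "neutral" is an assumption, since the slide does not say how ties are handled.

```python
import re

# Hypothetical keyword lists; the publicly available list is not reproduced here.
POSITIVE = {"good", "great", "love", "happy"}
NEGATIVE = {"bad", "awful", "hate", "sad"}


def keyword_polarity(tweet):
    """Return the polarity whose keywords occur more often in the tweet."""
    tokens = re.findall(r"[a-z']+", tweet.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # assumption: ties (including 0 vs 0) count as neutral
```

For example, keyword_polarity("I love this great phone") counts two positive and zero negative keywords.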
8/17
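NAÏVE BAYES
A multinomial Naïve Bayes model is used
Class $c^*$ is assigned to tweet $d$, where
$$c^* = \arg\max_c P_{NB}(c \mid d), \qquad P_{NB}(c \mid d) = \frac{P(c) \prod_{i=1}^{m} P(f_i \mid c)^{n_i(d)}}{P(d)}$$
In this formula, $f_i$ represents a feature and $n_i(d)$ represents the count of feature $f_i$ found in tweet $d$. There are a total of $m$ features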
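A minimal sketch of the multinomial model described above, working in log space so that each feature contributes its count times log P(f|c). The function names are hypothetical, and add-one (Laplace) smoothing is an assumption; the slide does not state which smoothing is used. P(d) is dropped because it is the same for every class.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_tweets):
    """labeled_tweets: iterable of (tokens, label) pairs."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_tweets:
        class_counts[label] += 1
        feature_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, feature_counts, vocab


def classify_nb(tokens, class_counts, feature_counts, vocab):
    """Return the class c maximizing log P(c) + sum_i n_i(d) * log P(f_i | c)."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total = sum(feature_counts[label].values())
        for feature, n in Counter(tokens).items():
            if feature not in vocab:
                continue  # features never seen in training are ignored
            # Assumption: add-one smoothing over the vocabulary.
            p = (feature_counts[label][feature] + 1) / (total + len(vocab))
            score += n * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```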
9/17
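MAXIMUM ENTROPY
Feature-based models
Features like bigrams and phrases can be added
$$P_{ME}(c \mid d, \lambda) = \frac{\exp\left[\sum_i \lambda_i f_i(c, d)\right]}{\sum_{c'} \exp\left[\sum_i \lambda_i f_i(c', d)\right]}$$
In this formula, $c$ is the class, $d$ is the tweet, and $\lambda$ is a weight vector. The weights determine the significance of each feature in classification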
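To make the role of the weights concrete, the sketch below evaluates the class probabilities for one tweet given an already trained weight vector. It assumes indicator features that fire when a token appears in the tweet and the class matches, stored in a hypothetical dictionary keyed by (token, class). Training the weights is not shown.

```python
import math


def maxent_probabilities(tokens, weights, classes):
    """Return P(c | d) for each class as a normalized exponential of weighted features."""
    present = set(tokens)
    # Sum of lambda_i * f_i(c, d); missing weights are treated as 0.
    scores = {
        c: sum(weights.get((token, c), 0.0) for token in present)
        for c in classes
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp_scores.values())
    return {c: value / z for c, value in exp_scores.items()}
```

A larger weight for ("good", "positive") pushes more probability mass toward the positive class whenever "good" occurs in the tweet.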
10/17
SUPPORT VECTOR MACHINES
Input data are two sets of vectors of size m, where each entry in the vector corresponds to the presence of a feature
E.g. unigram feature extractor: a feature is a word found in a tweet
• If the feature is present: value 1
• If not: value 0
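The presence encoding can be sketched as follows. The helper names are hypothetical, and only the construction of the input vectors is shown; fitting the SVM itself is left to whatever library is used.

```python
def build_vocabulary(tokenized_tweets):
    """Map every distinct token to a fixed vector position (sorted for determinism)."""
    vocab = sorted({token for tokens in tokenized_tweets for token in tokens})
    return {token: position for position, token in enumerate(vocab)}


def presence_vector(tokens, index):
    """Return a vector of size m with 1 where the feature is present, else 0."""
    vector = [0] * len(index)
    for token in tokens:
        if token in index:
            vector[index[token]] = 1  # presence only, repeated words still give 1
    return vector
```

Note that a word repeated in a tweet still yields 1, since the entry records presence and not frequency.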
11/17
EXPERIMENTAL 1/2
Analysis is done using the Twitter API
In the API, a query for ":)" returns tweets with positive emotion and a query for ":(" returns tweets with negative emotion
12/17
EXPERIMENTAL 2/2
The training data is post-processed with filters:
Emoticons are stripped off for training purposes
• MaxENT and SVM have better accuracies without them
Tweets with both positive and negative emoticons are removed
• E.g. "I'm turning 30 today :( but I still get birthday presents! :)"
Retweets are removed
• The same tweet shouldn't be counted twice
Tweets with ":P" are removed
• They usually don't represent any distinct emotion
Duplicate tweets are removed
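The filters above could be combined into one pass over the raw tweets, as in the sketch below. It is an illustration under stated assumptions: only ":)" and ":(" are treated as emoticons, a retweet is recognized by the token "RT", and duplicates are detected by exact match of the cleaned text. The function name is hypothetical.

```python
import re


def prepare_training_data(tweets):
    """Return (text, label) pairs with emoticons used as noisy labels, then stripped."""
    seen = set()
    examples = []
    for text in tweets:
        if re.search(r"\bRT\b", text):
            continue  # assumption: retweets carry the "RT" marker
        if ":P" in text:
            continue  # ":P" usually carries no distinct emotion
        has_positive = ":)" in text
        has_negative = ":(" in text
        if has_positive == has_negative:
            continue  # both emoticons (mixed) or neither (no label available)
        # Strip the emoticons and normalize whitespace.
        cleaned = " ".join(text.replace(":)", " ").replace(":(", " ").split())
        if cleaned in seen:
            continue  # duplicate tweet
        seen.add(cleaned)
        examples.append((cleaned, "positive" if has_positive else "negative"))
    return examples
```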
13/17
RESULTS FOR FEATURE EXTRACTION
Unigram feature extractor
• The simplest way to retrieve features
• Results are similar to Pang and Lee's work on different classifiers on movie reviews
Bigram feature extractor
• Used for negation phrases like "not good" or "not bad"
• Downside: bigrams are very sparse and accuracy can drop for both MaxENT and SVM
Unigrams and bigrams
• Accuracy improved for Naïve Bayes and MaxENT
• Decline in accuracy for SVM
Parts of speech
• The same word may have many different meanings
• "Over" as a verb may have a negative connotation
• "Over" can be a noun, without any emotion at all
• POS tags aren't of much use
14/17
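The first three feature extractors compared on this slide can be sketched as below; the function names are hypothetical. The part-of-speech variant is omitted because it requires an external tagger.

```python
def unigrams(tokens):
    """Each word is a feature."""
    return list(tokens)


def bigrams(tokens):
    """Each pair of adjacent words is a feature, e.g. 'not good'."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]


def unigrams_and_bigrams(tokens):
    """Combine both feature sets."""
    return unigrams(tokens) + bigrams(tokens)
```

For the tokens of "this is not good", the bigram extractor keeps "not good" together as one feature, which is how negation phrases are captured. It also shows why bigrams are sparse: each specific pair occurs far less often than its individual words.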
UPGRADES
Semantics
• "Djokovic beats Federer :)"
• The sentiment is positive for Djokovic, negative for Federer
Domain-specific tweets
• Classifiers could perform better if limited to particular domains (such as movies)
Handling neutral tweets
Internationalization
• There are lots of tweets about the same subject in lots of different languages
Utilizing emoticon data in the set
• Emoticons are stripped out, and classifiers could perform better if they were included
15/17
SEMANTICS
From a tweet that says "Djokovic beats Federer" alone, one cannot extract the sentiment of the tweet
Semantics could be a solution:
if user.isFrom(Serbia) then sentiment := positive
else if user.isFrom(Switzerland) then sentiment := negative
Using semantics, we can gather more information than just by reading keywords
16/17
THANK YOU FOR YOUR ATTENTION
NIKOLA JOLIC
17/17