A Probabilistic Approach to Tweets' Sentiment Classification - ACII 2013 Conference - Colace De Santo Greco
A Probabilistic Approach to Tweets’ Sentiment Classification
Francesco Colace, Massimo De Santo, Luca Greco
DIEM – Università degli Studi di Salerno
{fcolace, desanto, lgreco}@unisa.it
ACII 2013 – Geneva, 2-5 September 2013
Motivation
Web 2.0 (or Web X.Y) rules!
Social networks, blogs, microblogs, and review-collector sites: a huge quantity of heterogeneous and opinionated data
Motivation
Open issues:
o How to manage this information?
o How to extract the sentiment inside the data?
o How to understand something about the users?
o How to evaluate people's opinions about topics or products?
These questions motivate Sentiment Analysis
Outline
Brief introduction to Sentiment Analysis
o Related Works
Towards a Sentiment Analysis Framework
o The Proposed Approach
• The LDA Approach
• The Mixed Graph of Terms
• A sentiment mining algorithm
Experimental results
Conclusions and Future Works
Sentiment Analysis
Sentiment:
o a thought, view, or attitude, especially one based mainly on emotion instead of reason
Sentiment Analysis (also known as Opinion Mining):
o the use of Natural Language Processing (NLP) and computational techniques to automate the extraction and classification of sentiment from unstructured texts
Sentiment Analysis: Why?
Consumer information
o Product reviews (Amazon, eBay, …)
Marketing
o Consumer attitudes
o Trends
Politics
o Politicians want to know voters' points of view
o Voters want to know politicians' stances and who else supports them
Social
o Find like-minded individuals or communities
Sentiment Analysis: Open Issues
Which features should be adopted?
o Words
o Sentences
How to interpret features for sentiment detection?
o As a bag of words
o By the use of annotated lexicons
o According to syntactic patterns
o Analyzing the paragraph structure
Sentiment Analysis: Approaches
Naïve Bayes
Maximum Entropy Classifier
SVM
Markov Blanket Classifier
… … …
Latent Dirichlet Allocation (LDA)
The Proposed Approach: from the Bag-of-Words …
Using the Bag of Words approach, a document can be represented as an unordered collection (multiset) of its words
Problems:
o Which words best express the sentiment in a text?
o How to compare various «bags of words» derived from texts with the same sentiment?
o Can a bag of words represent the documents' domain of interest?
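A minimal sketch of the bag-of-words representation, assuming a simple lower-casing tokenizer (the function name `bag_of_words` and the tokenization rule are illustrative, not taken from the slides). Only word counts are kept, so two texts with the same words in a different order yield the same bag:

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text, extract alphabetic tokens, and count each one."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


first = bag_of_words("Great movie, great cast")
second = bag_of_words("great cast, great movie")

# Word order is discarded: both texts produce the same counts.
print(first == second)
print(first["great"])
```

The representation says nothing about which of these words carry sentiment, which is the first problem listed above.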
… to mixed Graph of Terms (mGT)
The mixed Graph of Terms is a «graph based» representation of documents
In the proposed approach, a mixed Graph of Terms is obtained by an automatic extraction of words based on probabilistic clustering techniques such as Latent Dirichlet Allocation (LDA)
In a mixed Graph of Terms the words are linked according to their mutual occurrence probability, and an «aggregating_word» and its «aggregated_words» can be recognized
Our proposal: a mixed Graph of Terms can be used as a «sentiment filter»
mGT: a different point of view
In the proposed approach, in a mixed Graph of Terms two different layers can be recognized:
The Aggregator Layer: the words with a higher degree of interconnection with the other words in the documents
The “Aggregated Words” Layer: the words that have a higher degree of interconnection with one or more Aggregator Words
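The two layers can be illustrated with a toy sketch. The slides derive the link probabilities from LDA; this sketch swaps in plain document-level co-occurrence frequencies, and the nested-dictionary layout, the function names, and the parameters `n_aggregators` and `n_aggregated` are all assumptions made for illustration:

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_probabilities(docs):
    """Estimate P(u, v) as the fraction of documents containing both words."""
    counts = defaultdict(int)
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            counts[(u, v)] += 1
    return {pair: c / len(docs) for pair, c in counts.items()}


def build_mgt(docs, n_aggregators=1, n_aggregated=2):
    """Return {aggregator_word: {aggregated_word: link_weight}}."""
    probs = cooccurrence_probabilities(docs)

    # Total link weight of each word over all its co-occurrence links.
    strength = defaultdict(float)
    for (u, v), p in probs.items():
        strength[u] += p
        strength[v] += p

    # Aggregator layer: the most strongly interconnected words.
    aggregators = sorted(strength, key=lambda w: (-strength[w], w))[:n_aggregators]

    graph = {}
    for agg in aggregators:
        # Collect every word linked to this aggregator.
        links = {}
        for (u, v), p in probs.items():
            if agg in (u, v):
                links[v if u == agg else u] = p
        # Aggregated layer: the words most strongly tied to the aggregator.
        best = sorted(links, key=lambda w: (-links[w], w))[:n_aggregated]
        graph[agg] = {w: links[w] for w in best}
    return graph


docs = [
    ["film", "great", "cast"],
    ["film", "great", "plot"],
    ["film", "boring"],
]
mgt = build_mgt(docs)
print(mgt)
```

On this toy corpus "film" becomes the aggregator because it co-occurs with every other word, and "great" is its strongest aggregated word.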
Latent Dirichlet Allocation
In natural language processing, Latent Dirichlet Allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar
For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics
The basic idea is that the documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words
By the use of the Latent Dirichlet Allocation technique a set of documents can be represented as a mixed Graph of Terms
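The generative story behind LDA can be sketched in a few lines of standard-library Python. This only simulates how a document would be generated from known topics; it does not perform inference, and the two toy topics, the prior `alpha`, and the function names are assumptions made for illustration:

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document: a topic mixture, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)  # per-document mixture over topics
    words = []
    for _ in range(length):
        # Choose the topic responsible for this word position.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose a word from that topic's distribution over words.
        dist = topics[z]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return theta, words


# Each topic is a distribution over words (toy values).
topics = [
    {"great": 0.6, "love": 0.4},
    {"boring": 0.7, "hate": 0.3},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(words)
```

Fitting LDA to real documents runs this story in reverse, estimating the topic distributions and mixtures from the observed words.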
Extraction of a Mixed Graph of Terms
mGT: an example
Sentiment Classification by the use of mGT
Step_1: Learn a mixed Graph of Terms from labelled documents (i.e. Positive or Negative), obtaining:
o mGT positive
o mGT negative
Step_2: Use the mixed Graphs of Terms as filters in order to classify the sentiment of texts
o Comparing concepts that are both in the mGTs and in the text
o Comparing words that are both in the mGTs and in the text
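Step_2 can be sketched as follows, assuming each mGT is stored as a hypothetical nested dictionary `{concept: {word: weight}}`. The slides do not give the scoring formula, so the rule used here (one point per matched concept plus the link weight of each matched word) and the toy weights are assumptions:

```python
def mgt_score(tokens, mgt):
    """Score how strongly a tokenized text matches one mGT."""
    present = set(tokens)
    score = 0.0
    for concept, words in mgt.items():
        if concept in present:
            score += 1.0  # a concept of the mGT also appears in the text
        for word, weight in words.items():
            if word in present:
                score += weight  # a word of the mGT also appears in the text
    return score


def classify(tokens, mgt_positive, mgt_negative):
    """Label a text by comparing its match against both mGTs."""
    pos = mgt_score(tokens, mgt_positive)
    neg = mgt_score(tokens, mgt_negative)
    if pos == neg:
        return "undecided"
    return "positive" if pos > neg else "negative"


# Toy mGTs with illustrative weights, standing in for the output of Step_1.
mgt_positive = {"film": {"great": 0.6, "enjoyable": 0.4}}
mgt_negative = {"film": {"boring": 0.7, "waste": 0.5}}

label = classify("a great and enjoyable film".split(), mgt_positive, mgt_negative)
print(label)
```

Because both graphs are built once in Step_1, classifying a new text only requires set lookups, which is consistent with the fast classification claimed in the conclusions.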
Experimental Results
Dataset: Movie Reviews
Approach                  Accuracy (%)
Support Vector Machine*   82.90
Naive Bayes*              81.50
Maximum Entropy*          81.00
mGT-LDA                   88.50
*[Bo Pang, 2002]
Experimental Results
Dataset: Real Tweets related to Politics
Training Set: 3980 Tweets
Test Set: 32185 Tweets
Approach      Accuracy (%)
mGT-LDA       87.10
SVM           79.20
Naive Bayes   76.60
Experimental Results
http://193.205.190.209/elezioni2013/
Experimental Results
[Figure: accuracy plotted against days]
Experimental Results
Masterchef - http://193.205.190.209/tvshow/masterchef/
Conclusions
Pros:
o Language independent
o Fast classification
o Continuous upgrade
o Small training set
Cons:
o In general, the mGT building process takes a long time
o An annotated lexicon is needed
Future Works
To improve the classification by the continuous update of the training set
To introduce SentiWordNet as the annotated lexicon
To adopt an ontological formalism for a better representation of the mGT
To build a bigger dataset of tweets
Any Questions?
Don’t forget to tweet your sentiment!!!