A Probabilistic Approach to Tweets’ Sentiment Classification

Francesco Colace, Massimo De Santo, Luca Greco
DIEM – Università degli Studi di Salerno
{fcolace, desanto, lgreco}@unisa.it

ACII 2013 – Geneva, 2-5 September 2013


Motivation

Web 2.0 (or Web X.Y) rules!

Social networks, blogs, microblogs, and review aggregator sites: a huge quantity of heterogeneous and opinionated data


Motivation

Open issues:

o How to manage this information?
o How to extract the sentiment inside the data?
o How to understand something about the users?
o How to evaluate people's opinions about topics or products?

Sentiment Analysis


Outline

Brief introduction to Sentiment Analysis
o Related Works

Towards a Sentiment Analysis Framework
o The Proposed Approach
  • The LDA Approach
  • The Mixed Graph of Terms
  • A sentiment mining algorithm

Experimental results

Conclusions and Future Works


Sentiment Analysis

Sentiment:
o a thought, view, or attitude, especially one based mainly on emotion instead of reason

Sentiment Analysis (also known as Opinion Mining):
o use of Natural Language Processing (NLP) and computational techniques to automate the extraction and classification of sentiment from unstructured texts


Sentiment Analysis: Why?

Consumer information
o Product reviews (Amazon, eBay, …)

Marketing
o Consumer attitudes
o Trends

Politics
o Politicians want to know voters’ points of view
o Voters want to know politicians’ stances and who else supports them

Social
o Find like-minded individuals or communities


Sentiment Analysis: Open Issues

Which features to adopt?
o Words
o Sentences

How to interpret features for sentiment detection?
o As a bag of words
o By using annotated lexicons
o According to syntactic patterns
o By analyzing the paragraph structure


Sentiment Analysis: Approaches

Naïve Bayes

Maximum Entropy Classifier

SVM

Markov Blanket Classifier

… … …

Latent Dirichlet Allocation (LDA)


The Proposed Approach: from the Bag-of-Words …

With the bag-of-words approach, a document is represented as an unordered collection (multiset) of its words

Problems:

o Which words best express the sentiment of a text?
o How to compare different «bags of words» derived from texts with the same sentiment?
o Can a bag of words represent the documents’ domain of interest?
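
As a minimal illustration of the bag-of-words representation discussed on this slide, the sketch below reduces two texts to word counts. The tokenizer (a lowercase regular expression) and the example sentences are assumptions made for illustration; they do not come from the slides.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, ignoring word order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


# Hypothetical example sentences, not from the original dataset.
a = bag_of_words("The movie was not good, it was bad")
b = bag_of_words("The movie was not bad, it was good")

# Both texts yield the same bag although their sentiment differs,
# which is one reason word counts alone can be a weak sentiment signal.
same_bag = (a == b)
```

Here `same_bag` is true because the two sentences contain exactly the same words with the same frequencies, which is the kind of ambiguity the problems listed above point at.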


… to mixed Graph of Terms (mGT)

The mixed Graph of Terms is a «graph based» representation of documents

In the proposed approach, a mixed Graph of Terms is obtained by automatically extracting words with probabilistic clustering techniques such as Latent Dirichlet Allocation (LDA)

In a mixed Graph of Terms the words are linked according to their co-occurrence probability, and «aggregating_word» and «aggregated_words» can be distinguished

Our proposal: a mixed Graph of Terms can be used as a «sentiment filter»


mGT: a different point of view

In the proposed approach, a mixed Graph of Terms has two distinct layers:

o The Aggregator layer: the words with a high degree of interconnection with the other words in the documents
o The “Aggregated Words” layer: the words with a high degree of interconnection with one or more aggregator words
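
The two layers can be pictured with a small data structure. The sketch below is only an approximation under stated assumptions: link weights are plain document co-occurrence frequencies (a stand-in for the LDA-derived probabilities used in the proposed approach), and the function name, the parameters `n_aggregators` and `n_aggregated`, and the toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph_of_terms(docs, n_aggregators=2, n_aggregated=2):
    """Build a two-layer term graph from tokenized documents.

    Assumption: pair weights are document co-occurrence frequencies,
    used here as a stand-in for LDA-derived probabilities.
    """
    pair_weight = Counter()
    for doc in docs:
        for u, v in combinations(sorted(set(doc)), 2):
            pair_weight[(u, v)] += 1 / len(docs)

    # Total link weight of each word towards all other words.
    strength = defaultdict(float)
    for (u, v), w in pair_weight.items():
        strength[u] += w
        strength[v] += w

    # Aggregator layer: the most strongly interconnected words.
    ranked = sorted(strength, key=lambda t: (-strength[t], t))
    aggregators = ranked[:n_aggregators]

    # Aggregated layer: for each aggregator, its strongest neighbours.
    graph = {}
    for agg in aggregators:
        neighbours = {}
        for (u, v), w in pair_weight.items():
            if agg in (u, v):
                other = v if u == agg else u
                if other not in aggregators:
                    neighbours[other] = w
        top = sorted(neighbours, key=lambda t: (-neighbours[t], t))
        graph[agg] = {t: neighbours[t] for t in top[:n_aggregated]}
    return graph


# Hypothetical toy corpus of tokenized positive tweets.
docs = [
    ["great", "movie", "love"],
    ["great", "acting", "love"],
    ["great", "movie", "fun"],
]
mgt = build_graph_of_terms(docs)
```

The result maps each aggregator word to its aggregated words and link weights. In this toy corpus "great" appears in every document, so it becomes an aggregator with "movie" among its aggregated words.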


Latent Dirichlet Allocation

In natural language processing, Latent Dirichlet Allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar

For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics

The basic idea is that the documents are represented as random mixtures over latent topics, where a topic is characterized by a distribution over words

Using Latent Dirichlet Allocation, a set of documents can be represented as a mixed Graph of Terms
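
The generative story described on this slide (each document is a mixture of topics, and each word comes from one of the document's topics) can be sketched as follows. This illustrates only how LDA assumes documents are generated, not the inference step that learns topics from data. The topics, the `alpha` value, and all function names are hypothetical.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, n_words, rng):
    """Generate a document following the LDA generative story.

    topics: list of dicts mapping word -> probability (one per topic).
    alpha: Dirichlet concentration for the per-document topic mixture.
    """
    # 1. Draw the document's mixture over topics.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # 2. Pick a topic for this word according to the mixture.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Pick the word from that topic's distribution over words.
        vocab = list(topics[z])
        probs = [topics[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics, for illustration only.
topics = [
    {"vote": 0.5, "party": 0.3, "leader": 0.2},
    {"film": 0.6, "actor": 0.3, "plot": 0.1},
]
rng = random.Random(0)
theta, doc = generate_document(topics, alpha=0.5, n_words=8, rng=rng)
```

A small `alpha` tends to concentrate each document on few topics, which matches the "mixture of a small number of topics" idea above.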


Extraction of a Mixed Graph of Terms


mGT: an example


Sentiment Classification by the use of mGT

Step 1: Learn a mixed Graph of Terms from labelled documents (i.e. positive or negative), obtaining:
o a positive mGT
o a negative mGT

Step 2: Use the mixed Graphs of Terms as a filter to classify the sentiment of texts:
o comparing concepts that are both in the mGTs and in the text
o comparing words that are both in the mGTs and in the text
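
The two steps can be sketched roughly as follows, assuming the positive and negative graphs have already been learned. The slides do not give the scoring rule, so the one used here (a fixed bonus for each matching concept plus the link weights of matching words) is an assumption, as are the function names, the hand-written graphs, and the "neutral" label returned on ties.

```python
def graph_score(tokens, graph, concept_weight=1.0):
    """Score how strongly a text matches a graph of terms.

    graph maps each aggregator word ("concept") to a dict of its
    aggregated words and link weights. The scoring rule is an assumption.
    """
    present = set(tokens)
    score = 0.0
    for concept, linked in graph.items():
        # Concepts shared by the graph and the text add a fixed bonus.
        if concept in present:
            score += concept_weight
        # Words shared by the graph and the text add their link weight.
        score += sum(w for word, w in linked.items() if word in present)
    return score


def classify(tokens, positive_graph, negative_graph):
    """Pick the better-matching graph; ties give 'neutral' (an assumption)."""
    pos = graph_score(tokens, positive_graph)
    neg = graph_score(tokens, negative_graph)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


# Hypothetical hand-written graphs standing in for learned mGTs.
positive_graph = {"great": {"movie": 0.7, "love": 0.6}}
negative_graph = {"awful": {"movie": 0.5, "boring": 0.8}}

label = classify("what a great movie".split(), positive_graph, negative_graph)
```

Because classification only looks up the words of a text in two precomputed graphs, it is cheap at prediction time, while the cost sits in building the graphs.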


Sentiment Classification by the use of mGT


Experimental Results

Dataset: Movie Reviews

Approach                   Accuracy (%)
Support Vector Machine*    82.90
Naive Bayes*               81.50
Maximum Entropy*           81.00
mGT-LDA                    88.50

*[Bo Pang, 2002]


Experimental Results

Dataset: real tweets related to politics
Training set: 3980 tweets
Test set: 32185 tweets

Approach       Accuracy (%)
mGT-LDA        87.10
SVM            79.20
Naive Bayes    76.60
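
The slides do not define the accuracy metric. The sketch below assumes the standard definition, the percentage of test items whose predicted label matches the gold label. The labels are hypothetical toy data, not taken from the tweet dataset.

```python
def accuracy(predicted, gold):
    """Percentage of items whose predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("label lists must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Hypothetical labels for five test tweets.
gold = ["pos", "neg", "pos", "neg", "pos"]
predicted = ["pos", "neg", "neg", "neg", "pos"]
score = accuracy(predicted, gold)  # 4 of 5 labels match
```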


Experimental Results


http://193.205.190.209/elezioni2013/


Experimental Results

[Figure: plot of accuracy over days]


Experimental Results


Masterchef - http://193.205.190.209/tvshow/masterchef/


Conclusions

Pros:
o Language independent
o Fast classification
o Continuous upgrade
o Small training set

Cons:
o In general, the mGT building process takes a long time
o An annotated lexicon is needed


Future Works

To improve the classification by continuously updating the training set

To introduce SentiWordNet as the annotated lexicon

To adopt an ontological formalism for a better representation of the mGT

To build a larger dataset of tweets


Any Questions?


Don’t forget to tweet your sentiment!!!