Authors
Tarek Elghazaly
Amal Mahmoud
Hesham A. Hefny
Political Sentiment Analysis Using
Twitter Data
Computer Science Department, Institute of Statistical Studies and
Research (ISSR), Cairo University, Egypt
ICC '16, March 22-23, 2016, Cambridge, United Kingdom © 2016 ACM. ISBN 978-1-4503-4063-2/16/$15.00
DOI: http://dx.doi.org/10.1145/2896387.2896396
OVERVIEW
1. INTRODUCTION
2. RELATED WORK
3. METHODOLOGY
4. RESULTS AND EVALUATION
5. CONCLUSION AND FUTURE WORK
The use of social networks such as Facebook and Twitter has grown
remarkably. Users from different cultures and backgrounds post large
volumes of textual comments that reflect their opinions on different
aspects of life and make them available to everyone. In particular, we
study the case of Twitter and focus on the 2012 presidential elections
in Egypt.
Sentiment analysis is the automated mining of attitudes,
opinions, and emotions from text, speech, and database
sources. It involves classifying opinions in text into categories
such as “positive”, “negative”, or “neutral”.
INTRODUCTION
Sentiment analysis of Arabic social media is a challenging task for
many reasons, such as:
1. Arabic is not a case-sensitive language.
2. Arabic has variants in spelling and typographic forms, and new
expressions are coined whose usage indicates high subjectivity. For
example, "روش" (funky) is used as a positive reference, and
"الاستبن" (spare tire) was used as a negative reference to former
president Mohammed Morsi during the 2012 presidential elections.
3. Arabic texts have different sorts of ambiguity (words with several
meanings). For example, "رمضان" (Ramadan) may be a person's name or
the name of a month.
4. Sarcasm: a form of speech act in which a person says something
positive while really meaning something negative, or vice versa.
INTRODUCTION
Mining Arabic Business Reviews (Elhawary, M., Elfeky, M. 2010)
presents a system that extracts Arabic business reviews and analyzes
the collected reviews to identify their polarity (positive, negative,
or neutral), exhibiting the general opinion of the Arab public about
different products and services.
Automatic Arabic Document Categorization Based on the Naïve Bayes
Algorithm (El-Kourdi, M., Bensaid, A., and Rachidi, T. 2004) uses the
Naïve Bayes algorithm to categorize Arabic text documents into one of
five pre-defined categories. Cross-validation experiments are used to
evaluate the Naïve Bayes categorizer.
Opinion Corpus for Arabic (Rushdi-Saleh, M., Martín-Valdivia, M.,
Ureña-López, L., and Perea-Ortega, J. 2011) applies machine learning
classifiers to both Arabic and English corpora. Two classifiers are
employed: Support Vector Machines (SVMs) and Naïve Bayes (NB). The
results show that SVMs outperform the NB classifier and that there is
no big difference between term frequency (TF) and term
frequency-inverse document frequency (TF-IDF) as weighting methods.
RELATED WORK
The methodology used for building the machine learning classifiers consists of three steps:
1. Corpus Collection and Preparation
2. Pre-processing
3. Text Classification
The data file is then converted to ARFF, the file format used by the Weka program. WEKA
provides a large collection of machine learning algorithms for data pre-processing,
classification, clustering, association rules, and visualization, which can be
invoked through a common Graphical User Interface.
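As an illustration of the ARFF conversion step, here is a minimal Python sketch that writes labelled tweets as ARFF text. The function name `to_arff`, the relation name, and the attribute names `text` and `class` are assumptions made for this example; the slides do not show the actual file layout the authors used.

```python
def to_arff(relation, rows, labels):
    """Build ARFF text from (tweet_text, label) pairs.

    Hypothetical helper: one string attribute for the tweet and one
    nominal attribute for the class label.
    """
    lines = [
        f"@relation {relation}",
        "",
        "@attribute text string",
        "@attribute class {" + ",".join(labels) + "}",
        "",
        "@data",
    ]
    for text, label in rows:
        if label not in labels:
            raise ValueError(f"unknown label: {label}")
        # Collapse whitespace (including newlines), then escape
        # backslashes and single quotes inside the quoted string.
        flat = " ".join(text.split())
        escaped = flat.replace("\\", "\\\\").replace("'", "\\'")
        lines.append(f"'{escaped}',{label}")
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "tweets",
    [("good day", "positive"), ("bad day", "negative")],
    ["positive", "negative"],
)
```

When saving the result to disk, opening the file with `encoding="utf-8"` keeps the Arabic text intact.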
METHODOLOGIES
Corpus Collection and Preparation
The total corpus size is 18278 tweets. We annotated all
18278 tweets: 11910 positive and 6368 negative, expressing
opinion in Arabic across different domains:
"خالد على" (Khaled Ali), "عمرو موسى" (Amr Mousa),
"احمد شفيق" (Ahmed Shafik), "محمد مرسى" (Mohammed Morsi),
"حمدين صباحى" (Hamden Sabahy), and "ابو الفتوح" (Abu Alftouh).
We define a sentiment as a positive or negative opinion;
each data instance (tweet) is annotated as positive or negative.
Pre-processing
1. Tokenization
2. Normalization
3. Stop words removal
4. Stemming
5. Term weighting
6. N-Grams
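The steps above can be sketched in Python as follows. This is an illustrative pipeline only: the normalization rules, the tiny stop-word list, and the prefix and suffix lists of the light stemmer are assumptions chosen for the example, not the resources used in the study. Term weighting (step 5) is left out of this sketch.

```python
import re

# Assumed normalization rules (common conventions, not taken from the slides).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")  # tashkeel marks and tatweel


def tokenize(text):
    """Split text into runs of Arabic letters (diacritics kept for now)."""
    return re.findall("[\u0621-\u0652]+", text)


def normalize(token):
    """Strip diacritics/tatweel and unify common letter variants."""
    token = DIACRITICS.sub("", token)
    token = re.sub("[أإآ]", "ا", token)
    return token.replace("ى", "ي").replace("ة", "ه")


# Tiny illustrative stop-word list, stored in normalized form.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "عن", "الى", "هذا")}

# Hypothetical affix lists for a simplified light stemmer.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ه", "ي")


def light_stem(token):
    """Remove at most one prefix and one suffix, keeping >= 3 letters."""
    for p in PREFIXES:
        if token.startswith(p) and len(token) - len(p) >= 3:
            token = token[len(p):]
            break
    for s in SUFFIXES:
        if token.endswith(s) and len(token) - len(s) >= 3:
            token = token[:-len(s)]
            break
    return token


def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) over the tokens."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def preprocess(text, n=1):
    tokens = [normalize(t) for t in tokenize(text)]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    stems = [light_stem(t) for t in tokens]
    return ngrams(stems, n)
```

Calling `preprocess(tweet)` yields unigram features, while `preprocess(tweet, n=2)` yields bigrams over the stemmed tokens.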
Text Classification
This section covers two existing approaches to text classification: SVM and NB.
Support Vector Machine (SVM): this classifier is defined by a separating hyperplane.
Put simply, after receiving labelled training data, the algorithm outputs the optimal
hyperplane, which is then used to assign new examples to categories.
Naïve Bayes (NB): an effective classification algorithm that is widely used for
sentiment analysis and document classification. As a probabilistic model, the
Naïve Bayes classifier uses the joint probabilities of terms and categories to
estimate the probabilities of the categories for a given test document.
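To make the Naïve Bayes description concrete, here is a minimal multinomial Naïve Bayes sketch in Python. It is not Weka's implementation: the class name `NaiveBayes`, the add-one (Laplace) smoothing, and the toy training data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial Naive Bayes over tokenized documents."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # term counts per class
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            # Add-one smoothing: every vocabulary term gets one extra count.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore terms never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(
    [["great", "win"], ["great", "leader"], ["bad", "loss"], ["bad", "corrupt"]],
    ["pos", "pos", "neg", "neg"],
)
```

Scores are summed in log space so that long documents do not underflow to zero when many small probabilities are multiplied.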
METHODOLOGIES
Precision, recall, and F-measure for each category, and the weighted
average over all categories, for the SVM classifier:
Class                          Precision  Recall  F-measure
Ahmed Shafik (أحمد شفيق)       0.932      0.953   0.942
Mohammed Morsi (محمد مرسى)     0.783      0.885   0.831
Amr Mousa (عمرو موسى)          0.000      0.000   0.000
Khaled Ali (خالد على)          0.970      0.935   0.952
Hamden Sabahy (حمدين صباحى)    0.950      0.743   0.834
Abu Alftouh (ابو الفتوح)       0.914      0.917   0.916
Weighted average               0.862      0.884   0.871
Precision, recall, and F-measure for each category, and the weighted
average over all categories, for the NB classifier:
Class                          Precision  Recall  F-measure
Ahmed Shafik (أحمد شفيق)       0.976      0.920   0.947
Mohammed Morsi (محمد مرسى)     0.865      0.951   0.906
Amr Mousa (عمرو موسى)          0.844      0.889   0.866
Khaled Ali (خالد على)          0.912      0.941   0.926
Hamden Sabahy (حمدين صباحى)    0.848      0.871   0.859
Abu Alftouh (ابو الفتوح)       0.953      0.886   0.918
Weighted average               0.925      0.921   0.922
The evaluation is based on two popular machine learning algorithms (NB and SVM),
using unigrams as features and 10-fold cross-validation for testing.
We used precision, recall, and F-measure to evaluate these
approaches.
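As a rough illustration of 10-fold cross-validation, the sketch below splits item indices into k contiguous folds and yields each fold once as the test set. The function name `k_fold_indices` is hypothetical, and this sketch neither shuffles nor stratifies the data, which the evaluation procedure in Weka may do.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# With the corpus size reported in the slides (18278 tweets):
folds = list(k_fold_indices(18278, k=10))
```

Each tweet appears in exactly one test fold, so every instance is used for testing once and for training nine times.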
RESULTS AND EVALUATION
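\[ \text{Precision} = \frac{TP}{TP + FP} \]
\[ \text{Recall} = \frac{TP}{TP + FN} \]
\[ \text{F-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \]
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]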
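These measures can be computed directly from confusion counts, where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives. The following Python sketch is illustrative; returning 0.0 when a denominator is zero is an assumed convention, not something stated in the slides.

```python
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    return 2 * r * p / (r + p) if (r + p) else 0.0


def accuracy(tp, tn, fp, fn):
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


# Per-class check against the SVM figures reported for Ahmed Shafik
# (precision 0.932, recall 0.953).
shafik_f = round(f_measure(0.932, 0.953), 3)
```

Here `shafik_f` comes out as 0.942, matching the F-measure reported for that class under SVM.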
Evaluation Results
Classification Type  Precision  Recall  F-Measure
SVM                  0.862      0.884   0.871
NB                   0.925      0.921   0.922
Time taken to build the models, in minutes
Classification Type  Time
SVM                  161.03
NB                   30.59
This study compares two classification techniques, SVM and NB, on Arabic
tweets categorized into six domains: "خالد على" (Khaled Ali),
"عمرو موسى" (Amr Mousa), "احمد شفيق" (Ahmed Shafik),
"محمد مرسى" (Mohammed Morsi), "حمدين صباحى" (Hamden Sabahy), and
"ابو الفتوح" (Abu Alftouh). The basis of our comparison is the most popular
text evaluation measures (F-measure, recall, and precision). TF-IDF is used as
the weighting scheme, and the cosine measure is used to calculate the similarity
of each document to be classified with the training documents. The comparison
covers two main aspects of the selected classifiers: accuracy and time. In
terms of accuracy, NB achieved a higher weighted-average F-measure than SVM
(0.922 versus 0.871); in terms of time, the NB model took 30.59 minutes to
build versus 161.03 minutes for SVM. The results show that Naïve Bayes is a
suitable technique for this application because it is very fast and quite accurate.
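As a sketch of the TF-IDF weighting and cosine measure mentioned above, the Python below builds sparse TF-IDF vectors and compares them. The formula used (raw term frequency times log(N / document frequency)) is one common variant and is an assumption here; the exact variant applied in the study is not given in the slides.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Map each tokenized document to a {term: tf * idf} dictionary."""
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


vecs = tf_idf_vectors([["a", "b"], ["a", "c"], ["d"]])
```

Documents that share no terms get a similarity of 0.0, and a document compared with itself gets a similarity of 1.0 (up to floating-point rounding).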
Future work:
We aim to compare the results obtained from these classifiers with those of other
classifiers. This study used a light stemmer; we aim to use the Khoja stemmer and
compare the results. It also used unigrams only; we aim to compare results across
unigrams, bigrams, and trigrams.
CONCLUSION AND FUTURE WORK