Sentiment Analysis on Arabic Tweets Using RapidMiner
Student Name: Salha al osaimi
Supervised by: Dr. Khan Muhammad Badruddin
Agenda
• Introduction
• Motivation
• Challenges Related to Arabic Language
• Experiment Steps
• Experiment Results
• Conclusion
Introduction
• Social media users post a large number of informal messages every day,
most of which describe the sender’s feelings and emotions.
• Millions of tweets with modern Arabic content provide a challenging
opportunity to understand the emotions of their authors.
• Sentiment analysis helps in understanding the emotions expressed in
this informal communication.
• One of the main objectives of sentiment analysis is to extract the
sentiment of a given text by classifying it as positive, negative, or
neutral.
Motivation
Our focus: how to address the challenges of informal Arabic
sentiment analysis. For this purpose, we used RapidMiner to
process Arabic text.
Why:
1. The work has applications in education, business, technology,
security, and almost every other field.
2. Working in this area means doing cutting-edge research.
3. The task is challenging in nature.
Arabic sentiment analysis challenges
Arabic sentiment analysis faces many challenges, such as the following:
• The complexity of the Arabic language in terms of both structure and morphology.
• Arabic grammar is highly complex.
• The variety of Arabic dialects.
• Arabic contains many word forms and diacritics.
• Arabic is a derivational language.
• Semantic dictionaries and lexicons for Arabic sentiment mining are limited.
• Most Arabic text on the internet is written in informal language,
which is unstructured in nature.
Experiment steps
This research investigates how to address the challenges of
informal Arabic sentiment analysis. For this purpose, we used
RapidMiner to process Arabic text. We performed the
experiments, evaluated the results of the different text processing
steps, and then explored the problems and tried to fix them.
Figure 1 shows the sentiment classification process for Arabic
tweets using RapidMiner.
Experiment steps (cont.)
Figure 1: Sentiment classification process for Arabic tweets using RapidMiner
Experiment steps (cont.)
The experiment steps can be described as follows:
• Data
After collecting tweets through Twitter’s API library, we randomly
picked 3,000 tweets to create the text corpus. Each collected tweet
contained at least one emotion icon. We then manually labelled the
sentiment of each tweet as positive, negative, or neutral. At the end
of this step we had 3,000 labelled tweets (1,000 positive, 1,000
negative, and 1,000 neutral).
Experiment steps (Cont.)
• Text processing
Text processing is an important part of text mining because it prepares
the data for the classification step. Figure 2 illustrates the steps of
text preprocessing.
Figure 2: Text processing steps in RapidMiner.
Experiment steps (Cont.)
We performed the text processing in four steps, described below:
1- Tokenization
Each tweet was tokenized, that is, divided into multiple tokens based on
whitespace characters.
2- Filtering
In this step, we used the “filter the token by length” facility to remove
tokens shorter than 3 characters.
3- Light stemming
The light stemming facility in RapidMiner was used to reduce the feature
space. In Arabic, the base or stem is different from the root.
Experiment steps (Cont.)
4- Word vector model
In this step, we converted the text data into a matrix that shows the
frequency of occurrence of each term for each sentiment polarity.
Figure 3 shows the word vector model.
Figure 3: Word vector model
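The tokenization, filtering, and word vector steps can be approximated outside RapidMiner in a few lines of plain Python. This is only a minimal sketch under stated assumptions, not RapidMiner's implementation: the function names and the two sample tweets are hypothetical, the matrix holds one row of term counts per tweet (an assumption about the layout), and light stemming is left out.

```python
from collections import Counter

def tokenize(tweet):
    # Step 1: split on runs of whitespace.
    return tweet.split()

def filter_by_length(tokens, min_chars=3):
    # Step 2: keep only tokens with at least min_chars characters.
    # len() counts Unicode code points, so a diacritic counts as a character.
    return [t for t in tokens if len(t) >= min_chars]

def word_vectors(tweets):
    # Step 4: one row of term frequencies per tweet over a shared vocabulary.
    counts = [Counter(filter_by_length(tokenize(t))) for t in tweets]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Hypothetical sample tweets (not taken from the study's corpus).
tweets = ["الجو جميل جدا في الصباح", "الجو سيء جدا"]
vocab, matrix = word_vectors(tweets)
for row in matrix:
    print(row)
```

The two-letter token "في" is dropped by the length filter, so it never reaches the vocabulary.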
Experiment steps (Cont.)
• Building and validation of the model
We built a model to classify unlabelled tweets with the correct
sentiment. The input to the process was the training data, in which
each tweet is assigned a sentiment. We used the Naïve Bayes (NB) and
k-Nearest Neighbor (k-NN) algorithms to build the classification model,
and validated it using cross-validation, which is easy to set up in
RapidMiner.
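To make the validation idea concrete, the sketch below trains a small multinomial Naïve Bayes classifier and scores it with k-fold cross-validation using only the Python standard library. It is a stand-in for illustration, not the RapidMiner operators used in the study: the function names, the toy data, the Laplace smoothing, and the number of folds are all assumptions (the slides do not state how many folds were used), and k-NN is not shown.

```python
import math
import random
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # docs: list of token lists; labels: parallel list of class names.
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for c in term_counts.values() for t in c}
    return class_counts, term_counts, vocab

def predict_nb(model, tokens):
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        # Log prior plus Laplace-smoothed log likelihood of each known term.
        score = math.log(n_docs / total_docs)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                score += math.log(
                    (term_counts[label][t] + 1) / (total_terms + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def cross_validate(docs, labels, k=10, seed=0):
    # Shuffle the indices once; fold i holds out every k-th index from i.
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for fold in range(k):
        test = set(idx[fold::k])
        train = [i for i in idx if i not in test]
        model = train_nb([docs[i] for i in train], [labels[i] for i in train])
        correct += sum(predict_nb(model, docs[i]) == labels[i] for i in test)
    return correct / len(docs)

# Toy data for illustration only (not the study's tweets).
docs = [["سعيد", "جميل"], ["حزين", "سيء"]] * 5
labels = ["positive", "negative"] * 5
accuracy = cross_validate(docs, labels, k=5)
print(accuracy)
```

Every tweet is used for testing exactly once, so the returned value is the share of tweets classified correctly while held out of training.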
Experiment Results
When we analyzed the experiment results, we discovered several problems
in the text processing steps. The first problem was related to emotion
icons. When we cleaned the text and removed English words and special
characters using filters, all emotion icons were removed as well. To
preserve them, we gave each emotion icon a special meaningful name.
Table 1 shows examples of the emotion icon conversion step.
Emotion icons
Tweet | Tweet after converting the icons
مع السلامه :'( ♥ | مع السلامه رمزحزين رمزقلب
اي جمال XD | اي جمال رمزضحك
Table 1: Examples of converting emotion icons to meaningful text
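The conversion in Table 1 amounts to a lookup-and-replace pass that runs before any filter removes special characters. The sketch below is one possible way to do it in Python. The mapping is hypothetical and covers only the three icons that appear in Table 1, and the function name is an assumption.

```python
# Hypothetical mapping: the three names mirror the examples in Table 1.
EMOTICON_NAMES = {
    ":'(": "رمزحزين",
    "♥": "رمزقلب",
    "XD": "رمزضحك",
}

def convert_emoticons(tweet):
    # Replace longer icons first so a short icon never matches inside a longer one.
    for icon in sorted(EMOTICON_NAMES, key=len, reverse=True):
        # Pad with spaces so the name becomes its own whitespace-separated token.
        tweet = tweet.replace(icon, " " + EMOTICON_NAMES[icon] + " ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(tweet.split())

print(convert_emoticons("مع السلامه :'( ♥"))
```

Plain substring replacement is deliberately simple. A letter-based icon such as "XD" would also match inside a longer Latin word, so a real pipeline would need stricter matching.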
Experiment Results (Cont.)
The second problem was the variation in word forms and diacritics, which surfaced during tokenization: the token filter treats diacritics as whitespace, so a word is split wherever a diacritic occurs. Table 2 shows the tokenization problems.
Tweet before Tokenization Tweet after Tokenization
موقوتَاِكتابا َالُمْؤِمنينَََعٓلىَكانتالَصالةََإنكنينممؤالىلعانتكالةالصإن
موقوتتابا
ساعدونيانَتْحـَرراح ساعدونيرحـتانراح
جميعاــالمعليكمالســ جميعاــالمعليكمالســ
Table 2: Tokenization problems
Experiment Results (Cont.)
The diacritic problem and some word-form problems, such as tatweel, were
solved by normalization: processing Arabic text into a consistent form
by converting the various forms of a word to a common form and removing
the diacritics. Table 3 shows the normalization cases.
Rule | Example
Tashkeel | المُؤْمِنين becomes المؤمنين
Tatweel | اللــــه becomes الله
Hamza | ؤ or ئ or ء becomes ء
Alef | آ or أ or إ becomes ا
Lam-alef | لآ or لأ or لإ or لا becomes لا
Yeh | ى or ي becomes ي
Heh | ة or ه becomes ه
Table 3: Normalization cases
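The rules in Table 3 map naturally onto Unicode character replacements. The following Python sketch shows one way to express them. It is an illustration under assumptions rather than the procedure actually used in the study: the function name and the exact code point ranges are choices made here, and the lam-alef rule is covered by the alef rule because lam and alef are stored as two separate characters in ordinary Unicode text.

```python
import re

TASHKEEL = re.compile("[\u064B-\u0652]")  # fathatan through sukun
TATWEEL = "\u0640"                        # elongation character

def normalize(text):
    text = TASHKEEL.sub("", text)                          # Tashkeel: drop diacritics
    text = text.replace(TATWEEL, "")                       # Tatweel: drop elongation
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # Alef: آ أ إ to ا
    text = re.sub("[\u0624\u0626]", "\u0621", text)        # Hamza: ؤ ئ to ء
    text = text.replace("\u0649", "\u064A")                # Yeh: ى to ي
    text = text.replace("\u0629", "\u0647")                # Heh: ة to ه
    return text

print(normalize("اللــــه"))
```

Because all rules run in one pass, a word that matches several rules is changed by each of them. Running the tashkeel example from Table 3 through this function would also apply the hamza rule to ؤ.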
Experiment Results (Cont.)
The last problem was stemming. In Arabic, there are different words
that have different meanings but share the same root, which makes
detecting the polarity of these words very difficult. Moreover,
other problems occurred during the stemming process: the stemmer
sometimes deleted basic letters of a word. Table 4 shows the
light stemmer problems. We therefore removed the stemming step from
the text processing.
Tweet before stemming | Tweet after stemming
تعدل ثلث القران ! لي ولك لمن بعدك ❤ | تعدل ثلث قر ولك لمن بعدك
انزين خالص الامتحان صعب و عيدكم مبارك =(( | انز خالص امتح صعب عيدكم مبارك
Table 4: Stemmer problems
Experiment Results (Cont.)
• Initially, the accuracy of the NB classifier was 58.61%, while
that of the k-NN classifier was 52.47%.
• After we solved all of these problems, the results improved:
the accuracy of NB increased to 63.99% (a gain of 5.38 percentage
points) and that of k-NN reached 59.04% (a gain of 6.57 percentage
points). We plan to perform more experiments in different settings
to get better results.
Experiment Results (Cont.)
• RapidMiner is a great tool for Arabic text mining, but some problems
specific to the Arabic language can compel a researcher to work
partially outside the RapidMiner environment and return to it only
after solving the problems described above.
• This migration may also be due to the researcher's lack of knowledge
about RapidMiner's functionality.
• There is a lot of room to build new extensions so that RapidMiner
becomes a one-window solution for every kind of Arabic text
preprocessing.
Conclusion
Research in sentiment analysis for the Arabic language has been very limited
compared to other languages such as English. This paper:
1. described the issues related to sentiment analysis of the Arabic language;
2. showed how the RapidMiner tool helped to generate a classification
model that discovers the sentiment (positive, negative, or neutral) of
each tweet.
We found that even though RapidMiner's facilities are very helpful for
manipulating Arabic text and performing text mining, there is a lot of room
for developing new RapidMiner plug-ins that handle Arabic-specific
issues.