
RM World 2014: Sentiment analysis on arabic tweets using RapidMiner


Sentiment Analysis on Arabic Tweets Using RapidMiner

Student Name: Salha al osaimi

Supervised by: Dr. Khan Muhammad Badruddin


Agenda

• Introduction

• Motivation

• Challenges Related to Arabic Language

• Experiment Steps

• Experiment Results

• Conclusion


Introduction

• Many informal messages are posted on social media every day. Most of these messages describe the sender’s feelings and emotions.

• Millions of tweets with modern Arabic content provide a challenging opportunity to understand the emotions of their authors.

• Sentiment analysis is needed to help understand the emotions in this informal communication.

• One of the main objectives of sentiment analysis is to extract the sentiment of a given text by classifying it as positive, negative, or neutral.


Motivation

Our focus: how to address the challenges of informal Arabic sentiment analysis. For this purpose, we used RapidMiner to manipulate Arabic text.

Why:

1. The work has applications in education, business, technology, security, and almost every other field.

2. Working in this area means doing cutting-edge research.

3. The task is challenging in nature.


Arabic sentiment analysis challenges

Arabic sentiment analysis has many challenges, such as the following:

• The complexity of the Arabic language in terms of both structure and morphology.

• Arabic grammar is highly complex.

• The variety of Arabic dialects.

• Arabic contains many word forms and diacritics.

• Arabic is a derivational language.

• Semantic dictionaries or lexicons for Arabic sentiment mining are limited.

• Most Arabic text on the internet is written in informal language, which is unstructured in nature.


Experiment steps

This research aims to investigate how to address the challenges of informal Arabic sentiment analysis. For this purpose, we used RapidMiner to manipulate Arabic text. We performed the experiments, evaluated the results of the different text processing steps, and then explored the problems and tried to fix them.

Figure.1 shows the sentiment classification process for Arabic tweets using RapidMiner.


Experiment steps (cont.)


Figure.1 Sentiment classification process for Arabic tweets using RapidMiner


Experiment steps (cont.)

The experiment steps can be described as follows:

• Data

After collecting tweets through Twitter’s API library, 3000 tweets were randomly picked to create the text corpus. We then manually labelled the sentiment of each tweet as positive, negative, or neutral. Each collected tweet contained at least one emotion icon. At the end of this step we had 3000 labelled tweets (1000 positive, 1000 negative, 1000 neutral).


Experiment steps (Cont.)

• Text processing

Text processing is very important in text mining because it prepares the data for the classification step. Figure.2 illustrates the text preprocessing steps.


Figure.2 Text processing steps in RapidMiner.


Experiment steps (Cont.)

The text processing was performed in four steps, described below:

1- Tokenization

Each tweet was divided into multiple tokens based on whitespace characters.

2- Filtering

In this step, we used the “filter the token by length” facility and removed tokens shorter than 3 characters.

3- Light stemming

The light stemming facility of RapidMiner was used to reduce the feature space. In Arabic, the base or stem is different from the root.


Experiment steps (Cont.)

4- Word vector model

In this step, we converted the text data into a matrix that shows the frequency of occurrence of each term for each sentiment polarity. Figure.3 shows the word vector model.


Figure.3 Word vector model
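
The tokenization, length filtering, and word vector steps can be sketched outside RapidMiner as follows. This is a minimal Python illustration under stated assumptions, not the RapidMiner process used in the experiments: the function names and the toy corpus are hypothetical, light stemming is left out, and only the whitespace split and the minimum token length of 3 characters come from the slides.

```python
from collections import Counter, defaultdict

MIN_TOKEN_LENGTH = 3  # from the slides: tokens shorter than 3 characters are dropped


def tokenize(tweet):
    """Split a tweet into tokens on runs of whitespace."""
    return tweet.split()


def filter_by_length(tokens, min_length=MIN_TOKEN_LENGTH):
    """Keep only tokens with at least min_length characters."""
    return [t for t in tokens if len(t) >= min_length]


def term_frequencies_by_polarity(labelled_tweets):
    """Count how often each term occurs under each sentiment label."""
    counts = defaultdict(Counter)
    for text, label in labelled_tweets:
        counts[label].update(filter_by_length(tokenize(text)))
    return counts


# Hypothetical toy corpus; the corpus in the talk had 3000 labelled tweets.
corpus = [
    ("عيدكم مبارك يا جماعة", "positive"),
    ("الامتحان صعب جدا", "negative"),
    ("عيدكم مبارك", "positive"),
]
freq = term_frequencies_by_polarity(corpus)
```

Each entry of `freq` corresponds to one column of the term-by-polarity matrix; a per-tweet word vector could be built the same way by keying the counter on a tweet identifier instead of the label.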


Experiment steps (Cont.)

• Building and Validation of Model

We built a model to classify unlabeled tweets with the correct sentiment. The training data, that is, the tweets with their assigned sentiments, was the input of the process. We used the Naïve Bayes (NB) and k-Nearest Neighbor (k-NN) algorithms to build the classification model. We validated the model using cross-validation, which can be easily implemented in RapidMiner.
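
To make the validation idea concrete, here is a small pure-Python sketch of a multinomial Naive Bayes classifier evaluated with k-fold cross-validation. It is an illustration only and does not reproduce RapidMiner's operators: the slides do not state the number of folds or the NB variant, so the add-one smoothing, the fold assignment, `k=3`, and the English toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model with add-one smoothing."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_docs, term_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest log-posterior for one document."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for t in tokens:
            if t in vocab:  # ignore terms never seen in training
                score += math.log((term_counts[label][t] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cross_validate(docs, labels, k=3):
    """Average accuracy over k folds; fold i holds out every k-th item."""
    pairs = list(zip(docs, labels))
    accuracies = []
    for i in range(k):
        train = [p for j, p in enumerate(pairs) if j % k != i]
        test = [p for j, p in enumerate(pairs) if j % k == i]
        model = train_nb([d for d, _ in train], [l for _, l in train])
        hits = sum(predict_nb(model, d) == l for d, l in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k


# Hypothetical toy data: already tokenized documents with two labels.
docs = [["good", "happy"], ["bad", "sad"], ["good", "nice"],
        ["bad", "awful"], ["happy", "nice"], ["sad", "awful"]]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]
accuracy = cross_validate(docs, labels, k=3)
```

Because every tweet is held out exactly once, the averaged accuracy is measured only on tweets the model did not see during training, which is what makes the reported figures comparable between NB and k-NN.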


Experiment Results

When we analyzed the experiment results, we discovered several problems in the text processing steps. The first problem was related to emotion icons. When we cleaned the text and removed English words and special characters using filters, all emotion icons were removed as well. To preserve the emotion icons, we replaced each one with a meaningful name. Table 1 shows examples of the emotion icon conversion step.


Emotion icons | Tweet | Tweet after converting the icons

:'( ♥ | مع السلامه :'( ♥ | مع السلامه رمزحزين رمزقلب

XD | اي جمال XD | اي جمال رمزضحك

Table 1: Examples of converting emotion icons to meaningful text
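
The conversion in Table 1 amounts to a lookup-and-replace pass that has to run before the filters that strip English words and special characters. The sketch below is a hypothetical Python equivalent, not the authors' implementation: the three mappings are the ones visible in Table 1, while the function name and the replacement order are assumptions.

```python
# Mapping taken from the examples in Table 1; a real list would be longer.
# The Arabic names mean "sad symbol", "heart symbol", and "laugh symbol".
EMOTICON_NAMES = {
    ":'(": "رمزحزين",
    "♥": "رمزقلب",
    "XD": "رمزضحك",
}


def convert_emoticons(tweet):
    """Replace each known emotion icon with its Arabic placeholder word."""
    # Replace longer icons first so that overlapping patterns stay intact.
    for icon in sorted(EMOTICON_NAMES, key=len, reverse=True):
        tweet = tweet.replace(icon, f" {EMOTICON_NAMES[icon]} ")
    # Collapse the extra spaces introduced by the padding above.
    return " ".join(tweet.split())


converted = convert_emoticons("مع السلامه :'( ♥")
```

Since the placeholders are ordinary Arabic words, they survive the later cleaning filters and end up as regular terms in the word vector. A plain substring replace would also rewrite "XD" inside a longer Latin word, which is acceptable here only because Latin text is removed afterwards.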


Experiment Results (Cont.)

The second problem was caused by variations in word forms and by diacritics, and it appeared during tokenization. Diacritics were treated as whitespace, so a word carrying diacritics was split into several tokens. Table 2 shows the tokenization problems.


Tweet before Tokenization Tweet after Tokenization

موقوتَاِكتابا َالُمْؤِمنينَََعٓلىَكانتالَصالةََإنكنينممؤالىلعانتكالةالصإن

موقوتتابا

ساعدونيانَتْحـَرراح ساعدونيرحـتانراح

جميعاــالمعليكمالســ جميعاــالمعليكمالســ

Table 2: Tokenization problems


Experiment Results (Cont.)

The diacritic problem and some word-form problems such as tatweel were solved by a normalization step. Normalization manipulates Arabic text to produce a consistent form by converting the various forms of a word to a common form and removing the diacritics. Table 3 shows the normalization cases.


Rule | Example

Tashkeel | المُؤْمِنين becomes المؤمنين

Tatweel | اللــــه becomes الله

Hamza | ؤ or ئ or ء becomes ء

Alef | آ or أ or إ becomes ا

Lam alef | لا or لآ or لأ or لإ becomes لا

Yeh | ي or ى becomes ي

Heh | ه or ة becomes ه

Table 3: Normalization cases
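
The rules in Table 3 map naturally onto character-level substitutions. The following Python sketch shows one way to express them; it is an assumption-laden illustration rather than the normalizer used in the experiments. In particular, the diacritic range U+064B to U+0652 is an assumed definition of tashkeel, the Heh rule is read as mapping teh marbuta to heh, and the lam alef rule is covered implicitly because a lam alef is stored as lam followed by an alef variant.

```python
import re

# Assumed tashkeel range: fathatan (U+064B) through sukun (U+0652).
TASHKEEL = re.compile("[\u064B-\u0652]")
TATWEEL = "\u0640"

CHAR_MAP = str.maketrans({
    "\u0624": "\u0621",  # waw with hamza above -> hamza
    "\u0626": "\u0621",  # yeh with hamza above -> hamza
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0649": "\u064A",  # alef maksura -> yeh
    "\u0629": "\u0647",  # teh marbuta -> heh (assumed direction)
})


def normalize(text):
    """Strip diacritics and tatweel, then unify letter variants."""
    text = TASHKEEL.sub("", text)
    text = text.replace(TATWEEL, "")
    return text.translate(CHAR_MAP)


# Tatweel example from Table 3: the elongated form collapses to the plain word.
plain = normalize("\u0627\u0644\u0644\u0640\u0640\u0640\u0640\u0647")
```

Running normalization before tokenization removes the diacritics that would otherwise be treated as token boundaries. Note that this sketch applies all rules at once, so a word such as the Tashkeel example would also have its hamza carrier replaced.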


Experiment Results (Cont.)

The last problem was stemming. In Arabic, there are different words that have different meanings but share the same root. This makes detecting the polarity of these words a very difficult task. Moreover, other problems occurred during the stemming process: the stemmer sometimes deleted basic letters of a word. Table 4 shows the light stemmer problems. We therefore removed the stemming step from the text processing.


Tweet before stemming | Tweet after stemming

❤ لي ولك لمن بعدك ! تعدل ثلث القران | تعدل ثلث قر ولك لمن بعدك

انزين خلاص الامتحان صعب و عيدكم مبارك =(( | انز خلاص امتح صعب عيدكم مبارك

Table 4: Stemmer problems


Experiment Results (Cont.)

• Initially, the accuracy of the NB classifier was 58.61%, while that of the k-NN classifier was 52.47%.

• After we solved all of these problems, we got comparatively better results. The accuracy of NB increased to 63.99% and that of k-NN reached 59.04%. We plan to perform more experiments in different settings to get better results.


Experiment Results (Cont.)

• Although RapidMiner is a great tool for Arabic text mining, some problems in the Arabic language can compel a researcher to work partially outside the RapidMiner environment and come back to it after solving the above-mentioned problems.

• It is possible that this migration happens because the researcher lacks knowledge of RapidMiner's functionality.

• There is a lot of room to build new extensions so that RapidMiner becomes a one-window solution for every kind of Arabic text preprocessing.


Conclusion

Research in sentiment analysis for the Arabic language has been very limited compared to other languages such as English. This paper

1. described the issues related to sentiment analysis of the Arabic language, and

2. showed how the RapidMiner tool helped to generate a classification model that discovers the sentiment (positive, negative, or neutral) of each tweet.

We found that even though RapidMiner's facilities are very helpful for manipulating Arabic text and performing text mining, there is a lot of room for developing new RapidMiner plug-ins that can handle Arabic-specific issues.
