
Sentiment Analysis

Term Paper Report

Submitted in partial fulfilment of the requirement

For

Master of Technology

In

Computer Science and Engineering

Under the guidance

of

Ritika Vern

Research Scholar

USICT, GGSIPU

New Delhi

Submitted By

Varsha Mittal

(00716414816)

M. Tech - CSE (2nd Semester)

University School of Information & Communication Technology

Guru Gobind Singh Indraprastha University

Sector 16-C, Dwarka

New Delhi - 110078


Contents

1. Certificate

2. Acknowledgement

3. Abstract

4. Introduction to the Topic

5. Related Work

6. Proposed Research Work

7. References


CERTIFICATE

This is to certify that this research work, a study of Sentiment Analysis and its techniques submitted by Varsha Mittal, Enrollment no. 00716414816, in partial fulfilment of the requirement for the award of the degree of Master of Technology in Computer Science, is a valid work carried out by her under my supervision and guidance. The research work is original and has not been submitted anywhere else for any other degree.

Name of Student: Varsha Mittal

Name of the Guide: Ritika Vern

Signature


ACKNOWLEDGEMENT

I express my sincere thanks and deep sense of gratitude to my research mentor, Ritika Vern, for her valuable motivation and guidance, without which this report would not have been possible. I consider myself fortunate to have had the opportunity to learn and work under her able supervision and guidance over the period of our association. I have a deep sense of admiration for her innate goodness.

Varsha Mittal

(00716414816)

M. Tech (CSE)


Abstract

Sentiment Analysis is the process of identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic or product is positive, negative, or neutral. With the increasing use of micro-blogging websites such as Twitter, Facebook and other social media, a large number of reviews are made available online every day. These reviews may concern a product or a movie, or may be independent statements describing a situation. Sentiment Analysis is used to classify these statements as positive or negative. It has various benefits: it makes users aware of the positive and negative features of a product, and so helps them make effective decisions. Furthermore, SA helps companies gather feedback from these reviews and improve their products and services wherever necessary. For example, when a person plans to buy a mobile phone, he tends to scrutinize multiple review sites to read what other consumers have written, and in this manner gets an idea of the features he may consider important. Analyzing the reviews available on thousands of sites is a tedious task, and this is where Sentiment Analysis comes into play. It eases the consumer's task by categorizing the text as positive or negative, which in turn helps in effective decision making.


Introduction to the Topic

Sentiment Analysis is the process of determining whether the emotion expressed in a piece of writing is positive, negative or neutral, and thereby the writer's attitude. The trend today is to consider the opinions of a variety of individuals around the globe, drawn from micro-blogging data, before purchasing an item. Customers tend to go over a lot of reviews about a particular item before buying it, and Sentiment Analysis makes this task easier for them. It can do so with existing approaches and algorithms: a number of frameworks have been designed to predict user sentiment and the topic of a particular discussion.

Sentiment Analysis is done at three different levels: Document Level, Sentence Level and Feature Level. Document Level sentiment analysis takes the whole document and classifies it as positive or negative based on the sentiment expressed by the user.

Document Level analysis reduces the whole document to a single score. The analysis is based on four pairs of opposing emotions: joy and sadness, acceptance and disgust, anticipation and surprise, and fear and anger. The problem with this analysis is that a single score hides the most useful insights and prevents clients from drilling down to extract them.

Sentence Level Sentiment Analysis takes a sentence and determines whether it expresses a positive, negative or neutral opinion; neutral usually means no opinion. It is divided into subjectivity classification and sentiment classification. A sentence can carry two kinds of information, objective and subjective. Subjectivity classification determines which kind a sentence carries, separating sentences that express factual information from sentences that express subjective views and opinions. Sentiment classification then classifies the subjective sentences as positive or negative.


Feature Level Sentiment Analysis looks at the opinion itself rather than at language constructs (documents, paragraphs, sentences, clauses or phrases). It is based on the idea that an opinion consists of a sentiment (positive or negative) and a target of that opinion. It consists of three main tasks: the first is extracting features from the web content, the next is determining the polarity of the opinion, and the last is grouping feature synonyms. This type of classification is also known as word/phrase classification. Document Level and Sentence Level analysis do not capture every detail of the opinions and facts, which is why Feature Level analysis is widely used.

A specific framework is followed throughout the process of Sentiment Analysis.

SENTIMENT ANALYSIS FRAMEWORK

This framework consists of three main steps [1]. The first step is data collection, followed by pre-processing of the collected data. The last step is classification, which categorizes the processed data as either positive or negative. Fig. 1 gives a basic overview of the sentiment analysis framework.

Fig. 1: Sentiment Analysis Framework [1]


A. Data Collection Sentiment Analysis can be done on any data. The data can either be taken from an existing data set or be extracted from a website. Data sets are available online with thousands of reviews, each labelled as positive or negative. Extracting data from the web, on the other hand, is a lengthy task, but it lets one perform sentiment analysis on data of one's own choice.

B. Pre-Processing Data extracted from the web contains several syntactic features that may not be useful, so data cleaning and filtering need to be done before any further processing can be carried out. The pre-processing steps involved are given below:

1) Removing URLs URLs are of no use while performing sentiment analysis and can sometimes lead to false analysis. For example, take "I have logged in to www.happy.com as I am bored...". This sentence is negative, but because there is one positive word in the URL it becomes neutral, leading to a wrong prediction. To avoid such false predictions, URLs must be removed.

2) Filtering Repeated letters in words like "thankuuuuu" are often used to show the depth of expression. However, these words are absent from the dictionary, so the extra letters need to be eliminated. This is done on the basis of a rule that a letter cannot repeat itself more than three times; any letters beyond that are eliminated.

3) Questions Words like "what", "which" and "how" do not contribute to polarity, and thus such words must be removed in order to reduce complexity.

4) Removing special characters In order to remove discrepancies during the Sentiment Analysis process, special characters like '[] {} () /' should be removed. For example, in "it's good:" the colon is attached to the word. If these characters are not eliminated before performing sentiment analysis, they get combined with the words and those words will not be recognized. To avoid this situation, removal of such characters is important.

5) Removing stop words and emoticons Stop words are words that should be excluded before proceeding with the SA process. They carry little meaning, such as determiners and prepositions (in, to, from, etc.), and thus need to be filtered out. Most of the time, while writing a review, people use emoticons to express their feelings better. Although emoticons help a reader understand the emotions, they can mislead the Sentiment Analysis process into a wrong prediction.

6) Lemmatization or stemming Lemmatization and stemming both aim to reduce inflectional and related forms of a word to a common base form. Stemming chops off the ends of words and achieves this goal correctly most of the time, whereas lemmatization does the same properly with the use of a vocabulary and morphological analysis of words.

7) Tokenization Tokenization refers to splitting a sentence into its constituent parts (tokens). It is an important step in all NLP tasks.

8) Feature selection Feature selection finds a reduced set of attributes that provides a suitable representation of the database for the analysis to be performed. This is necessary because the excessive use of slang, irony and language mixtures makes the classification task difficult.
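Several of the pre-processing steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the framework from [1]: the word lists and the function name `preprocess` are hypothetical, long letter runs are squeezed down to a single letter (one possible reading of the repeated-letter rule), and stemming, lemmatization, emoticon handling and feature selection are left out.

```python
import re

# Hypothetical, deliberately tiny word lists; a real system would use fuller ones.
STOP_WORDS = {"a", "an", "the", "in", "to", "from", "of", "is", "am", "as", "i", "have"}
QUESTION_WORDS = {"what", "which", "how", "why", "when", "where", "who"}

URL_PATTERN = re.compile(r"(https?://\S+|www\.\S+)")
REPEAT_PATTERN = re.compile(r"(.)\1{2,}")      # a character repeated three or more times
NON_WORD_PATTERN = re.compile(r"[^a-z\s']")    # anything but letters, spaces, apostrophes


def preprocess(text):
    """Return a list of cleaned tokens for one review or tweet."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)           # step 1: remove URLs
    text = REPEAT_PATTERN.sub(r"\1", text)      # step 2: squeeze long runs to one character
    text = NON_WORD_PATTERN.sub(" ", text)      # step 4: remove special characters
    tokens = text.split()                       # step 7: whitespace tokenization
    # steps 3 and 5: drop question words and stop words
    return [t for t in tokens if t not in STOP_WORDS and t not in QUESTION_WORDS]


print(preprocess("I have logged in to www.happy.com as I am bored..."))
print(preprocess("Thankuuuuu!!! it's good:"))
```

With the URL removed first, the word "happy" never reaches the classifier, which is exactly the false-positive case described in step 1.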

C. Classification Classification is a technique that sorts data into categories. In Sentiment Analysis it is used to classify data into three classes, namely positive, negative and neutral, and on that basis the sentiment analysis process is completed. The classification task requires a pre-classified sample of the database, called the training set, which is used to train and generate a classifier. New unlabelled data is then compared against what the classifier has learned from this sample in order to be classified. Classifier accuracy is highly dependent on the training data. Several classifiers are available and are discussed below, but the Naive Bayes classifier is the one most commonly used for classification of data in Sentiment Analysis.

1) Naive Bayes classifier Naive Bayes is a supervised machine learning approach. It is based on Bayes' theorem, which is named after Thomas Bayes. According to this theorem, if there are two events p1 and p2, the conditional probability of event p1 given that p2 has already occurred is:

P(p1 | p2) = P(p2 | p1) P(p1) / P(p2)

The algorithm uses this to calculate the probability of the data being positive or negative:

P(A | B) = P(B | A) P(A) / P(B)

where A = sentiment and B = sentence.

The conditional probability of a word is given by:

P(word | A) = (C + 1) / (D + E)

where

C = number of occurrences of the word in the class

D = number of words belonging to the class

E = total number of words
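The formulas above can be turned into a small working classifier. The following is a sketch under stated assumptions rather than a reference implementation: E is taken to be the number of distinct words seen in training (the usual choice in add-one smoothing), P(B) is dropped because it is the same for every class, log-probabilities are summed to avoid underflow, and the function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (tokens, label) pairs. Returns the counts needed for prediction."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary


def predict(tokens, model):
    """Return the label A that maximises log P(A) + sum of log P(word | A)."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)   # log P(A)
        d = sum(word_counts[label].values())       # D: words belonging to the class
        e = len(vocabulary)                        # E: distinct words overall (assumption)
        for token in tokens:
            c = word_counts[label][token]          # C: occurrences of the word in the class
            score += math.log((c + 1) / (d + e))   # log P(word | A)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented for illustration only.
training = [
    (["good", "great", "phone"], "positive"),
    (["love", "good", "camera"], "positive"),
    (["bad", "poor", "battery"], "negative"),
    (["hate", "bad", "screen"], "negative"),
]
model = train_naive_bayes(training)
print(predict(["good", "camera"], model))
print(predict(["bad", "battery"], model))
```

The "+ 1" in the numerator keeps a word that never occurred in a class from driving the whole product to zero.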


2) Maximum entropy classifier This is another probabilistic classifier, belonging to the class of exponential models. It is similar to the Naive Bayes classifier; however, Naive Bayes assumes that the features are conditionally independent of each other, whereas this algorithm does not make that assumption. The classifier works by the Principle of Maximum Entropy: from all the models that fit the training data, it selects the one with the largest entropy. Apart from Sentiment Analysis, the Max Entropy classifier can be applied to many text classification problems, such as language detection and topic classification.
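For reference, the exponential form that such a model usually takes can be written as below. The symbols are assumptions introduced here and do not come from the cited works: d is a document, c a class, f_i are feature functions and \lambda_i their learned weights.

```latex
P(c \mid d) = \frac{\exp\left(\sum_{i} \lambda_i f_i(d, c)\right)}
                   {\sum_{c'} \exp\left(\sum_{i} \lambda_i f_i(d, c')\right)}
```

The denominator sums over all classes c', so the probabilities of the classes add up to one for every document.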

3) Support vector machine classifier Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. An SVM model represents the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into the same space and assigned a category according to the side of the gap on which they fall.
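The idea of a separating gap can be illustrated with a very small linear SVM. The sketch below is one simple way to train such a model (sub-gradient descent on the regularised hinge loss); it is not the method of any of the cited papers, and the function names, hyperparameters and the two-feature toy data (count of positive words, count of negative words) are all hypothetical.

```python
def train_linear_svm(samples, labels, epochs=200, learning_rate=0.1, reg=0.01):
    """Sub-gradient descent on the regularised hinge loss. Labels must be +1 or -1."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # Regularisation shrinks the weights a little on every step,
            # which is what pushes the solution towards a wide gap.
            weights = [w - learning_rate * reg * w for w in weights]
            if margin < 1:  # the point lies inside the gap or on the wrong side
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def classify(x, weights, bias):
    """Assign a category according to the side of the separating line x falls on."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else -1


# Each sample is [positive word count, negative word count]; +1 = positive review.
samples = [[2, 0], [3, 1], [0, 2], [1, 3]]
labels = [1, 1, -1, -1]
weights, bias = train_linear_svm(samples, labels)
print(classify([4, 0], weights, bias), classify([0, 4], weights, bias))
```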


Related Work

Sentiment Analysis aims to help customers make effective decisions. Manually analyzing reviews is a difficult task, and Sentiment Analysis helps customers do it. Many researchers have made significant contributions to the field. This section reviews existing and related work on Sentiment Analysis.

Keke Cai et al. presented research on topic detection techniques that detect the topics most highly correlated with positive and negative opinions. These techniques help business analysts understand the overall sentiment scope as well as the drivers behind the sentiment. They first performed basic sentiment classification, categorizing the text as positive, negative or neutral, but found that this type of classification lacked insight into what drives the sentiments. To solve the problem, they came up with a new sentiment analysis technique that determines not only the sentiment on a given topic but also its root cause.

Prashant Raina came up with an opinion mining engine that uses common-sense knowledge extracted from ConceptNet and SenticNet to perform sentiment analysis on news articles. He used a large corpus of sentences from news articles to test the engine. The classification accuracy was 71%, with 91% precision for neutral sentences.

Federico Neri et al. described a sentiment study carried out on more than 1000 Facebook posts about newscasts, comparing the sentiment for Rai, the Italian public broadcasting service [2].

Ana C. E. S. Lima et al. proposed an automatic sentiment classifier for tweets containing emoticons or sentiment-bearing words. They used the Naive Bayes algorithm to classify the tweets. However, the problem with this approach is that it classified tweets as either positive or negative and did not add a neutral class [3].

Min Wang et al. emphasized an approach that realizes polarity analysis of new words and, in addition, implements quantitative computation of sentiment words and automatic expansion of the polarity lexicon [4]. Their experimental results showed the feasibility and effectiveness of the approach.

ZHU Nanli et al. presented a study of recent developments in the field of sentiment analysis. They conducted a survey of three major research fields: framework, feature extraction and sentiment analysis. The problem they identified was that there had been no research on the commercial value of online reviews [5].

Seyed-Ali Bahrainian et al. came up with a novel solution for target sentiment summarization and SA of short informal texts, with emphasis on tweets [6]. They compared different algorithms and methods for SA polarity detection (PD) and sentiment summarization. However, detection of sarcasm is yet to be taken into account.

Andreas Dengel et al. compared state-of-the-art Sentiment Analysis methods against a novel hybrid method. Their approach trains a linear Support Vector Machine (SVM) classifier, for which they create a brand new set of features using a sentiment lexicon. The problem they faced was that the classification did not take sarcasm into account [7].

Sunil Kumar Khatri et al. presented research in which they performed classification on e-data collected from multiple sites and then analyzed the classified data with an ANN, minimizing the prediction error. Their aim was not just to predict the direction of the market for a particular day but to reach a level where they could predict the closing value for the day [8].

Rui Xia et al. came up with a model called dual sentiment analysis (DSA), and their paper highlighted the issues with sentiment classification [9]. As a novel data expansion technique, they created a sentiment-reversed review for each training and test review. They developed a dual training algorithm that uses both kinds of review together to learn a sentiment classifier, and used it to classify the test reviews. They then extended the approach from 2-class classification to 3-class classification by taking neutral reviews into account. Finally, they created a pseudo-antonym dictionary using a corpus-based method. They conducted a wide range of experiments, and the results demonstrate how effective DSA is.

Vee W. Lo et al. discussed existing work on opinion mining and sentiment classification of customer feedback and online reviews, and evaluated the various approaches used [10].

It can be seen from the existing literature that many algorithms exist for Sentiment Analysis, but each has drawbacks and there is still room for improvement.

Harsha et al. have done a comparative study of the basic techniques used for sentiment analysis [1]. Table 1 shows their comparison.

TABLE 1: COMPARATIVE CHART OF ALGORITHMS [1]


Proposed Research Work

With the rapid growth in the use of social networking sites over the past decade, they have become a notable medium for people to express their views and opinions. This has promoted sentiment analysis as a dynamic and promising area of research in which new techniques and models need to be explored to keep improving the accuracy of results. Many techniques have been used to analyze sentiment in datasets of various categories and sizes, including the Naïve Bayes algorithm, Support Vector Machines and Neural Networks. Experiments conducted with these techniques have shown their effectiveness, but many areas of research remain open. Many researchers have used a hybrid approach, combining various algorithms to achieve better results. Akshi et al. have used a Neural Network to perform sentiment analysis on tweets. Neural Networks offer several advantages over other techniques, with prominent features such as adaptive learning, fault tolerance, parallelism and generalization.

ANNs are capable of learning, and they need to be trained. There are several learning strategies:

Supervised Learning − It involves a teacher that is more knowledgeable than the ANN itself. The teacher feeds the network some example data for which the teacher already knows the answers. Take pattern recognition as an example: the ANN comes up with guesses while recognizing, and the teacher then provides the answers. The network compares its guesses with the teacher’s “correct” answers and makes adjustments according to its errors.

Unsupervised Learning − It is required when there is no example data set with known answers, for example when searching for a hidden pattern. In this case clustering, i.e. dividing a set of elements into groups according to some unknown pattern, is carried out on the existing data sets.

Reinforcement Learning − This strategy is built on observation. The ANN makes a decision by observing its environment. If the observation is negative, the network adjusts its weights so that it can make a different, more suitable decision the next time.
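The supervised strategy described above, in which the network guesses, is told the correct answer and adjusts according to its error, can be sketched with a single artificial neuron. This is the classic perceptron rule, shown only as an illustration of error-driven adjustment; the function names and the toy data (count of positive words, count of negative words) are hypothetical.

```python
def train_perceptron(examples, targets, epochs=20, learning_rate=0.1):
    """Train one neuron with a step activation; targets are 0 or 1."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(examples, targets):
            guess = 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
            error = target - guess   # the teacher's answer minus the network's guess
            # Adjust each weight in proportion to the error and to its input.
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


def guess_label(x, weights, bias):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


examples = [[2, 0], [3, 1], [0, 2], [1, 3]]   # [positive words, negative words]
targets = [1, 1, 0, 0]                        # 1 = positive feedback, 0 = negative
weights, bias = train_perceptron(examples, targets)
print([guess_label(x, weights, bias) for x in examples])
```

When a guess is already correct the error is zero and nothing changes, so the weights settle once every example is answered correctly.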

In my proposed research work I will work with the supervised learning model of ANN. For my dataset I will take a set of feedback given by users. Each item of feedback can be labelled as Excellent, Good, Average or Poor.

The ANN needs a learning algorithm to calculate effective weights for the neurons to fire. For my research I plan to use an optimization algorithm to adjust the weights in the neural network, namely the Firefly Algorithm. The Firefly Algorithm is a metaheuristic proposed by Xin-She Yang and inspired by the flashing behavior of fireflies [11]. The pseudo code of the algorithm is given below:
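As a rough guide to how the algorithm works, the following Python sketch follows the commonly described form of the Firefly Algorithm: every firefly moves towards each brighter one, with an attractiveness that fades with distance, plus a small random step. It is an illustrative sketch and not the pseudo code from the source. The parameter names `alpha`, `beta0` and `gamma` are conventional, the defaults are arbitrary, and the `sphere` objective is a stand-in; in the proposed work the position vector would hold the network weights and the objective would be the training error.

```python
import math
import random


def firefly_minimise(objective, dim, bounds, n_fireflies=15, iterations=50,
                     alpha=0.2, beta0=1.0, gamma=1.0, seed=0):
    """Return (best_position, best_cost) found while minimising objective."""
    rng = random.Random(seed)
    lower, upper = bounds
    swarm = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_fireflies)]
    cost = [objective(p) for p in swarm]        # lower cost means a brighter firefly
    best_index = min(range(n_fireflies), key=cost.__getitem__)
    best_position, best_cost = list(swarm[best_index]), cost[best_index]

    for _ in range(iterations):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:           # j is brighter, so i moves towards j
                    dist_sq = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * dist_sq)   # fades with distance
                    swarm[i] = [
                        min(upper, max(lower, a + beta * (b - a) + alpha * (rng.random() - 0.5)))
                        for a, b in zip(swarm[i], swarm[j])
                    ]
                    cost[i] = objective(swarm[i])
                    if cost[i] < best_cost:     # remember the best position seen so far
                        best_position, best_cost = list(swarm[i]), cost[i]
    return best_position, best_cost


def sphere(position):
    """Stand-in objective with its minimum of 0 at the origin."""
    return sum(v * v for v in position)


best_position, best_cost = firefly_minimise(sphere, dim=2, bounds=(-5.0, 5.0))
print(best_position, best_cost)
```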


My proposed algorithm will have the following steps:

1) Collection of dataset

2) Pre-processing and cleaning of data

3) Calculate the relative occurrence of words in the dataset

4) Creation of neural network structure with each word being assigned a node

5) Re-adjusting the weights of the neural network by using the firefly algorithm

6) Calculation of the accuracy of the generated result

Fig. 2: Proposed Sentiment Analysis Framework
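Steps 3 and 4 of the list above can be sketched as follows. This is a minimal sketch under assumptions: "relative occurrence" is read as each word's share of all tokens in the dataset, the input layer gets one node per vocabulary word, and the function names and the toy feedback are hypothetical.

```python
from collections import Counter


def relative_frequencies(token_lists):
    """Step 3: share of all tokens in the dataset taken up by each word."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def to_input_vector(tokens, vocabulary):
    """Step 4: one input node per vocabulary word, holding that word's count."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


feedback = [["good", "service"], ["poor", "service"]]   # already pre-processed
frequencies = relative_frequencies(feedback)
vocabulary = sorted(frequencies)                         # fixed node order
print(frequencies)
print(to_input_vector(["good", "good", "service"], vocabulary))
```

Fixing the vocabulary order once keeps each word tied to the same input node for every item of feedback, so the weights adjusted in step 5 always refer to the same word.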


References

[1] H. Sinha and A. Kaur, "A Detailed Survey and Comparative Study of Sentiment Analysis Algorithms", IEEE, 2016.

[2] Federico Neri, Carlo Aliprandi, Federico Capeci, Montserrat Cuadros, Tomas, "Sentiment Analysis on Social Media", IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2012, pp. 919-926.

[3] Ana C. E. S. Lima, Leandro N. de Castro, "Automatic Sentiment Analysis of Twitter Messages", IEEE, 2012, pp. 52-57.

[4] Min Wang, Hanxio Shi, "Research on Sentiment Analysis Technology and Polarity Computation of Sentiment Words", IEEE, 2010, pp. 331-334.

[5] ZHU Nanli, ZOU Ping, LI Weign, CHENG Meng, "Sentiment Analysis: A Literature Review", in Proceedings of the 2012 IEEE ISMOT, pp. 572-576.

[6] Seyed-Ali Bahrainian, Andreas Dengel, "Sentiment Analysis and Summarization of Twitter Data", IEEE 16th International Conference on Computational Science and Engineering, 2013, pp. 227-234.

[7] Seyed-Ali Bahrainian, Andreas Dengel, "Sentiment Analysis using Sentiment Features", IEEE/WIC/ACM International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT), 2013, pp. 26-29.

[8] Sunil Kumar Khatri, Himanshu Singhal, Prashant Johri, "Sentiment Analysis to Predict Bombay Stock Exchange Using Artificial Neural Network", IEEE, 2014.

[9] Rui Xia, et al., "Dual Sentiment Analysis: Considering Two Sides of One Review", IEEE Transactions on Knowledge and Data Engineering, Vol. 27, No. 8, 2015, pp. 2121-2133.

[10] Vee W. Lo, Vidyasagar Potdar, "A Review of Opinion Mining and Sentiment Classification Framework in Social Networks", 3rd IEEE International Conference on Digital Ecosystems and Technologies, 2009, pp. 396-401.

[11] A. Kumar and R. Khorwal, "Firefly Algorithm for Feature Selection in Sentiment Analysis", SpringerLink, 2017.