REAL TIME SENTIMENT ANALYSIS OF TWITTER DATA


ANARGHA GANGADHARAN


ANJU ANIL


MARY LIS JOSEPH


PARVATHY D


B.Tech Scholars

Department of Computer Science

College of Engineering Cherthala

Abstract: Microblogging has become a very popular communication tool. Millions of people share their views and opinions on various topics on microblogging sites, which have therefore become a rich source of opinion data. Among the many microblogging sites, Twitter is one of the most popular. Reading the news online is now a daily practice for many people, so in this paper we examine sentiment analysis of Twitter data with a focus on news channels and other news sites that post about current events. The tweets about the news posted each day are analysed to obtain the overall sentiment of each news item. We present a system that gives a score indicating whether a news item is positive or negative. Each news item is tokenized, and its sentiment is calculated using a naive Bayes classifier, which classifies the data as positive, negative or neutral. The main feature of the system is that the sentiment calculation is performed on real-time data.

Key words: sentiment analysis, machine learning, naive Bayes classifier.

I. INTRODUCTION

Microblogging sites have become part of our day-to-day life as a source of various kinds of information. People increasingly rely on websites rather than other media because they can post real-time messages and opinions on various topics. Among these sites we chose Twitter as the platform for sentiment analysis because of the facilities and features it provides; for example, it lets organisations communicate directly with their potential customers. The Twitter audience ranges from regular users to celebrities, company representatives, politicians, students and even high-ranking government officials, including presidents. It is therefore possible to collect text posts from users of various categories.

Most work on sentiment analysis has been done on subjective text types such as blogs, result prediction and product reviews. Authors of such texts typically express their individual opinions freely, and the sentiment may be restricted to a single group of people or even a single person. The situation is different for news articles. News can be good or bad, but it is seldom neutral. Analysing the news and calculating the sentiments expressed by the Twitter audience can give a meaningful sense of how the latest news affects important entities. Another difference is that reviews are usually about a relatively concrete object, the target subject, whereas news articles cover a larger subject domain, with more complex event descriptions and a whole range of targets.

Our paper concentrates on an experimental evaluation of a set of real-time news items posted on Twitter by various news channels and newspapers, and thereby evaluates the overall impact of the news on people. We look at a news article and obtain the tweets based on that news; a tweet may be a link, an opinion or even a query. We classify the news as positive, negative or neutral and consider only positive and negative news for the sentiment calculation.

The paper is structured as follows. The first module covers collecting the data. The second module is text pre-processing. The third module deals with term frequencies. The fourth module discusses term co-occurrences (using a rugby data set), and the fifth module deals with data visualisation basics.

II. LITERATURE SURVEY

Social media accounts for an important share of the web. Users have become co-creators of web content and now contribute a major part of social media, ranging from articles and news to reviews. This has led to a large body of unstructured text on the web. Among all social media, Twitter plays an important role in connecting people all around the world. Analysing the sentiment of such data has become a pertinent research topic in recent times.

Namrata Godbole, Manjunath Srinivasaiah and Steven Skiena performed sentiment analysis on news articles and blogs. Kiran Shriniwas Dodd, Dr. Mrs. Y. V. Haribhakta and Dr. Parag Kulkarni also carried out sentiment analysis on online news media. However, few studies in opinion mining consider blogs, and even fewer address microblogging. Turney (2002) and Pang and Lee (2004) carried out sentiment analysis as document-level classification, whereas Hu and Liu (2004) and Kim and Hovy (2004) analysed data at the sentence level. Bermingham and Smeaton (2010) analysed such data but did not break it into tokens and handled only unigrams. Go et al. (2009) succeeded in breaking data into tokens but did not handle n-grams. In "Sentiment Analysis of Twitter Data", Apoorv Agarwal et al. performed sentiment analysis on Twitter data; they introduced POS-specific prior polarity features and demonstrated two kinds of models, tree kernel models and feature-based models. Alexander Pak and Patrick Paroubek used TreeTagger for POS tagging and presented a method for automatically collecting data to train a sentiment classifier; they used syntactic structures to describe emotions or state facts. The work by James Spencer and Gulden Uchyigit on sentiment analysis of Twitter data dealt only with common NLP processes for finding the sentiment or meaning of a given phrase or text, and it gave an accuracy of only 50%.

In a paper on sentiment analysis of news, Alexandra Balahur analysed news and calculated the sentiment of particular news items, but did not include any method for evaluating the impact of negation and valence shift. In all the papers we considered as references for our work, sentiment analysis on Twitter was done only on structured data such as product reviews, election prediction and blogs; no past work has addressed the news that is posted daily on Twitter. None of the past work dealt with real-time data for sentiment calculation, nor did it follow a specific algorithm for calculating sentiment.

III. DATA DESCRIPTION

Twitter is one of the most popular social networking sites, on which users post real-time messages called tweets. Tweets are short, limited to 140 characters. As a result of this peculiarity, users resort to wordplay, spelling mistakes and emoticons to express their ideas. The following jargon is associated with tweets.

Hashtags: A special word or phrase marked with a hash symbol (#) to identify a specific topic.

Emoticons: A representation of a facial expression that conveys the user's feelings towards a particular topic.

Targets: A target is marked with the @ symbol to identify a particular user.

We collected real-time messages from Twitter. There were no restrictions on the collection of data; the collection consists of all the tweets received. After gathering the tweets we arranged them into two types, positive and negative.

IV. COLLECTING DATA

The first step in collecting data is registering our application. For this we log in to our Twitter account and register a name and description for the application. After entering these details we obtain a consumer key and a consumer secret, which should be kept private. From the configuration page we also obtain an access token and an access token secret; the access granted to the application in this way is read-only. Twitter provides an API for interacting with its services. We use tweepy, a Python library, to stream data from Twitter. Tweepy provides a convenient Cursor interface for iterating through different types of objects.
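The registration and authentication steps above could be sketched as follows. This is a minimal sketch, not the original collection script: it assumes tweepy's OAuth 1a interface (OAuthHandler, API and Cursor) and an account whose API access still permits reading the home timeline. The environment variable names and the helper functions load_credentials, build_api and print_timeline are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical environment variable names holding the four values obtained
# when the application is registered.
REQUIRED_KEYS = ("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")


def load_credentials(env=os.environ):
    """Read the keys from the environment so they never appear in source code."""
    missing = [key for key in REQUIRED_KEYS if key not in env]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {key: env[key] for key in REQUIRED_KEYS}


def build_api(creds):
    """Create an authenticated tweepy API object from the four credentials."""
    import tweepy  # imported here so load_credentials works without tweepy

    auth = tweepy.OAuthHandler(creds["CONSUMER_KEY"], creds["CONSUMER_SECRET"])
    auth.set_access_token(creds["ACCESS_TOKEN"], creds["ACCESS_SECRET"])
    return tweepy.API(auth)


def print_timeline(api, limit=10):
    """Iterate over recent statuses with the Cursor interface and print them."""
    import tweepy

    for status in tweepy.Cursor(api.home_timeline).items(limit):
        print(status.text)
```

With valid credentials exported in the environment, print_timeline(build_api(load_credentials())) would print the ten most recent tweets of the home timeline. Keeping the keys outside the source file follows the requirement that the consumer secret and access token secret stay private.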

V. TEXT ANALYSIS

Text analysis is used to extract meaningful patterns from unstructured text. Here we use components and concepts from text analysis to analyse the sentiment in tweets. The process consists of multiple steps. The first step, known as tokenization, breaks the text into words. Its purpose is to split a tweet, which is streamed in real time, into several smaller units called tokens. Tokens can be either words or phrases, and they are the primary building blocks of our sentiment analysis. Tokenization is crucial for Twitter data in particular, because the nature of the language used poses many challenges. In the second phase, called term frequency, we extract meaningful terms and their counts from our tweets. This phase has three parts: counting terms, stop-word removal and term filtering. In counting terms we observe which terms are most commonly used in the data set. Every language has some words that are particularly common and do not convey any special meaning; these are called stop-words. After stop-word removal, counting and sorting, we obtain the most frequently used words. Sometimes terms that occur together make more sense than single terms; we apply this concept in term co-occurrence. The visualisation phase presents a graph of the frequently used words. Finally, we calculate the sentiment of real-time tweets using the naive Bayes algorithm.

A. Tokenization

The tokenization is based on regular expressions. Some specific types of tokens will not be captured; this can be addressed by improving the regular expressions or by employing more advanced techniques such as Named Entity Recognition. The central component of the tokenizer is the regex_str variable, which is a list of possible patterns. In particular, we need patterns for emoticons, HTML tags, Twitter @usernames (@-mentions), Twitter #hashtags, URLs, numbers, and words with and without dashes and apostrophes. Punctuation and whitespace may or may not be included in the resulting list of tokens. All contiguous strings of alphabetic characters are part of one token, and likewise for numbers. Tokens are separated by whitespace characters, such as a space or line break, or by punctuation characters.

After tokenization, @-mentions, emoticons, URLs and #hashtags are preserved as individual tokens using NLTK libraries. Table 1 (tokenization of tweets) shows what the tokenized tweets look like: each token separated by whitespace is preserved as an individual token.
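A regular-expression tokenizer of the kind described in this section can be sketched as follows. The pattern list plays the role of the regex_str variable mentioned above, but the concrete patterns and the names TOKEN_PATTERNS, TOKEN_RE and tokenize are assumptions written for illustration, not the exact expressions used in our system.

```python
import re

# Emoticons: eyes, optional nose, mouth, e.g. :) ;-( :D
EMOTICON = r"(?:[:=;][oO\-]?[D\)\]\(\[/\\OpP])"

# Order matters: earlier patterns win, so special tokens are tried before words.
TOKEN_PATTERNS = [
    EMOTICON,
    r"<[^>]+>",                            # HTML tags
    r"(?:@\w+)",                           # @-mentions
    r"(?:#\w+(?:[\w'\-]*\w)?)",            # #hashtags
    r'https?://[^\s<>"]+',                 # URLs
    r"(?:\d+(?:,\d+)*(?:\.\d+)?)",         # numbers
    r"(?:[a-zA-Z][a-zA-Z'\-_]+[a-zA-Z])",  # words with dashes and apostrophes
    r"(?:\w+)",                            # other words
    r"(?:\S)",                             # anything else, e.g. punctuation
]

TOKEN_RE = re.compile("|".join(TOKEN_PATTERNS))


def tokenize(text):
    """Split a tweet into tokens, keeping mentions, hashtags and URLs whole."""
    return TOKEN_RE.findall(text)


tweet = "A # python coding dojo to end the day @ Downham Market Academy #rocks"
print(tokenize(tweet))
# ['A', '#', 'python', 'coding', 'dojo', 'to', 'end', 'the', 'day', '@',
#  'Downham', 'Market', 'Academy', '#rocks']
```

Because a lone "#" or "@" followed by a space does not match the hashtag or mention pattern, it falls through to the final catch-all pattern and becomes a token of its own, which is the behaviour shown for the third tweet in Table 1.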

B. Term Frequencies

In the term frequency phase we extract the frequently used, meaningful tokens and their counts. This phase can be partitioned into three parts:

• Counting terms

• Stopword removal

• Term filter

By performing a simple word count we can find the most commonly used terms in the data set. To keep track of the frequencies while we process the tweets, we can use collections.Counter(), which internally is a dictionary with some useful methods such as most_common().

Table 1: Tokenization of tweets

Tweet: "How I feel when dealing with Unicode strings in #python \n #programming https:\/\/t.co\/xqFmmmyJiJ"
Tokenized tweet: 'How', 'I', 'feel', 'when', 'dealing', 'with', 'Unicode', 'strings', 'in', '#python', '\n', '#programming', 'https:\/\/t.co\/xqFmmmyJiJ'

Tweet: A $5 microcontroller with wi-fi that runs python #python
Tokenized tweet: 'A', 'microcontroller', 'with', 'wi-fi', 'that', 'runs', '#', 'python'

Tweet: A # python coding dojo to end the day @ Downham Market Academy #rocks
Tokenized tweet: 'A', '#', 'python', 'coding', 'dojo', 'to', 'end', 'the', 'day', '@', 'Downham', 'Market', 'Academy', '#rocks'

https://en.wikipedia.org/wiki/Whitespace_character

Terms Count

The 42

It 25

Has 06

On 14

And 23

After processing the tokens, we obtain the word frequencies shown in the table above. Sometimes the most frequent words are not very meaningful. This is due to the presence of articles, conjunctions, adverbs and similar words, which are commonly called stop-words. Stop-word removal is an important step that should be considered during the pre-processing stages. One can build a custom list of stop-words or use an available list; NLTK provides a simple list of English stop-words. We also add punctuation marks to the stop-word list, along with terms such as RT (used for re-tweets) and via, which are not in the default stop-word list. After counting and sorting, we obtain the most commonly used terms.

Term filtering alone, however, does not give us a deep explanation of what the text is about.
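The three parts of the term frequency phase can be sketched with collections.Counter as follows. This is an illustrative sketch under stated assumptions: the short stop-word set stands in for NLTK's English list so that the example runs without downloading a corpus, the sample tweets are invented, and the names term_frequencies and is_plain_term, as well as the reading of the term filter as "drop hashtags and @-mentions", are our assumptions for the example.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption); NLTK's English list
# would normally take its place.
BASIC_STOP_WORDS = {"a", "an", "and", "has", "in", "is", "it", "of", "on", "the", "to"}
# Punctuation plus Twitter-specific terms that a default list does not contain.
STOP_WORDS = BASIC_STOP_WORDS | set(string.punctuation) | {"rt", "via"}


def term_frequencies(tokenized_tweets, stop_words=STOP_WORDS, keep=lambda term: True):
    """Count lower-cased terms, skipping stop-words and terms rejected by keep."""
    counts = Counter()
    for tokens in tokenized_tweets:
        terms = (token.lower() for token in tokens)
        counts.update(t for t in terms if t not in stop_words and keep(t))
    return counts


def is_plain_term(term):
    """Term filter: keep only terms that are neither hashtags nor @-mentions."""
    return not term.startswith(("#", "@"))


tweets = [
    ["RT", "@news", ":", "The", "match", "is", "on", "#rugby"],
    ["The", "match", "has", "ended", "#rugby"],
    ["Great", "match", "via", "@news"],
]
counts = term_frequencies(tweets)
print(counts.most_common(3))
print(term_frequencies(tweets, keep=is_plain_term).most_common(1))
# [('match', 3)]
```

Passing a different keep function, for example one that accepts only terms starting with "#", would count hashtags alone without changing the counting code.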

C. Term co-occurrence

To place things in context, let us consider sequences of two terms, because terms that occur together give more insight into the meaning of the text. Two terms that occur together are called a bigram; examples from the bigrams table include "to be" and "miss you". The bigrams() function from NLTK takes a list of tokens and produces a list of tuples from adjacent tokens. If we decide to analyse longer n-grams, that is, sequences of n tokens, it could make sense to keep the stop-words, in case we want to capture phrases such as "not to be" from the table.

Terms that occur together give us better information about the meaning of a term, supporting applications such as word disambiguation or semantic similarity. We build a co-occurrence matrix that contains the number of times the term x has been seen in the same tweet as the term y. For each term, we then extract the most frequent co-occurring terms, collecting them in a list of tuples.
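The bigram and co-occurrence steps can be sketched in plain Python as follows. The bigrams helper below is a stand-in that produces the same adjacent pairs as NLTK's bigrams() function; the names cooccurrence and most_common_pairs, the choice to count each pair once per tweet, and the sample data are assumptions made for this illustration.

```python
from collections import defaultdict


def bigrams(tokens):
    """Adjacent token pairs, as produced by NLTK's bigrams() function."""
    return list(zip(tokens, tokens[1:]))


def cooccurrence(tokenized_tweets):
    """Build a sparse matrix com[x][y]: number of tweets containing both x and y.

    Each unordered pair is stored once, under the alphabetically smaller term.
    """
    com = defaultdict(lambda: defaultdict(int))
    for tokens in tokenized_tweets:
        terms = sorted(set(tokens))
        for i, first in enumerate(terms):
            for second in terms[i + 1:]:
                com[first][second] += 1
    return com


def most_common_pairs(com, n=5):
    """Return the n most frequent co-occurring pairs as ((x, y), count) tuples."""
    pairs = [((x, y), count) for x, row in com.items() for y, count in row.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:n]


print(bigrams(["not", "to", "be"]))
# [('not', 'to'), ('to', 'be')]

com = cooccurrence([["miss", "you"], ["miss", "you", "too"]])
print(most_common_pairs(com, 1))
# [(('miss', 'you'), 2)]
```

Storing only one triangle of the matrix halves the memory needed and avoids counting the pair (x, y) separately from (y, x).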

D. Visualisation

A good pictorial representation of our data can help us make sense of it and highlight interesting insights. While there are several options for creating plots in Python, using libraries such as matplotlib or ggplot, Vincent bridges the gap between a Python back-end and a front-end that supports D3.js visualisation, allowing us to benefit from both sides. Using the list of most frequent terms (without hashtags) from our rugby data set, we plot their frequencies; many different types of charts can be plotted with Vincent.

E. Naive Bayes Classifier Algorithm

The final step is to calculate the sentiment of the real-time tweets using the naive Bayes algorithm. We used naive Bayes (NB) classification because it is a simple and natural method that combines efficiency with reasonable accuracy. The extracted text can be tokenised easily, and the resulting tokens serve as features. It is evident that these features cannot truly be considered independent, since words in a text depend on one another; nevertheless, NB is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. A naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, naive Bayes is known to perform well, at times outperforming even highly sophisticated classification methods.

We use two types of data set, training data and test data. Naive Bayes uses supervised learning, the machine learning task of inferring a function from labelled training data. The training data consists of a set of training examples; in supervised learning, each example is a pair consisting of an input object and a desired output value. The training data is historical data.

Bigrams
To be
Not to be
Miss you
I know
Look better

https://github.com/wrobstory/vincent

https://marcobonzanini.com/2015/03/17/mining-twitter-data-with-python-part-3-term-frequencies/

https://marcobonzanini.com/2015/03/23/mining-twitter-data-with-python-part-4-rugby-and-term-co-occurrences/

Two different naive Bayes classifiers have been built according to two different strategies; here we use the second classifier. It was trained on a simplified training corpus and makes use of a polarity lexicon. The corpus was simplified in that only positive and negative tweets were considered; neutral tweets were not taken into account. As a result, a basic binary (Boolean) classifier that distinguishes only positive and negative tweets was trained. To detect tweets without polarity (neutral), the following basic rule is used: if the tweet contains at least one word that is also found in the polarity lexicon, then the tweet has some degree of polarity; otherwise, the tweet has no polarity at all and is classified as neutral. The binary classifier is well suited to determining the basic polarity between positive and negative, reaching a precision of more than 80% on a corpus with just these two categories.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):

\[ P(c|x) = \frac{P(x|c)\,P(c)}{P(x)} \]

Above,

• P(c|x) is the posterior probability of class (c,

target) given predictor (x, attributes).

• P(c) is the prior probability of class.

• P(x|c) is the likelihood which is the probability of

predictor given class.

• P(x) is the prior probability of predictor.
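As an illustration of Bayes' theorem, consider a hypothetical case (the numbers are invented for the example and are not taken from the experiments) in which positive and negative tweets are equally likely a priori, and the word x = "good" appears in 20% of positive tweets and 5% of negative tweets. The posterior probability that a tweet containing "good" is positive then follows directly:

```latex
\begin{align*}
P(x) &= P(x|\text{pos})\,P(\text{pos}) + P(x|\text{neg})\,P(\text{neg})
      = 0.2 \times 0.5 + 0.05 \times 0.5 = 0.125 \\
P(\text{pos}|x) &= \frac{P(x|\text{pos})\,P(\text{pos})}{P(x)}
      = \frac{0.2 \times 0.5}{0.125} = 0.8
\end{align*}
```

Since P(x) is the same for every class, a classifier only needs to compare the numerators P(x|c) P(c) across classes.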

We were able to reach almost 73% accuracy. This is somewhat near human accuracy, as people apparently agree on sentiment only around 80% of the time.
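The classification scheme described in this section, a binary naive Bayes classifier combined with a lexicon rule for neutral tweets, can be sketched as follows. This is a minimal sketch and not the original implementation: it assumes a multinomial model with add-one (Laplace) smoothing, and the class NaiveBayes, the function classify, the toy training tweets and the small polarity lexicon are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over tokens with add-one smoothing (assumption)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # number of tweets per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, tokens):
        """log P(c) + sum of log P(x_i|c); P(x) is dropped as it is constant."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total)
            for token in tokens:
                if token in self.vocab:  # unseen tokens carry no evidence
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


def classify(tokens, model, lexicon):
    """Neutral if no token is in the polarity lexicon, else the binary decision."""
    if not any(token in lexicon for token in tokens):
        return "neutral"
    return model.predict(tokens)


# Toy training corpus and lexicon (hypothetical data).
train_docs = [
    ["good", "news", "great", "win"],
    ["happy", "great", "day"],
    ["bad", "news", "terrible", "loss"],
    ["sad", "terrible", "day"],
]
train_labels = ["positive", "positive", "negative", "negative"]
lexicon = {"good", "great", "happy", "bad", "terrible", "sad"}

model = NaiveBayes().fit(train_docs, train_labels)
print(classify(["great", "win", "today"], model, lexicon))     # positive
print(classify(["terrible", "loss"], model, lexicon))          # negative
print(classify(["match", "starts", "today"], model, lexicon))  # neutral
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow on longer tweets, and the independence assumption discussed above is what allows the per-token terms to be added.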

VI. CONCLUSION

We presented results for sentiment analysis on Twitter based on daily news. We used SVM and naive Bayes classifiers to find the sentiment of people towards current news, and dealt with the two possible kinds of sentiment, positive and negative. We handled unigrams, bigrams and longer n-grams, and also considered hyphenated words. We also dealt with tweets that come in the form of a query or a link. As future work, we look forward to developing an application that carries out our textual analysis on voice data, and to extending the analysis to specify the overall impact of news on people as either positive or negative, along with its root cause.

VII. REFERENCES

[1] Namrata Godbole, Manjunath Srinivasaiah and Steven Skiena. "Large Scale Sentiment Analysis for News and Blogs."

[2] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow and Rebecca Passonneau. "Sentiment Analysis of Twitter Data."

[3] Apoorv Agarwal, Fadi Biadsy and Kathleen McKeown. 2009. "Contextual phrase-level polarity analysis using lexical affect scoring and syntactic n-grams." Proceedings of the 12th Conference of the European Chapter of the ACL.

[4] James Spencer and Gulden Uchyigit. "Sentimentor: Sentiment Analysis of Twitter Data."

[5] Bo Pang and Lillian Lee. 2008. "Opinion mining and sentiment analysis." Foundations and Trends in Information Retrieval, Volume 2, Issue 1-2, pp. 1–94.

[6] Alexander Pak and Patrick Paroubek. 2010. "Twitter as a corpus for sentiment analysis and opinion mining."
