Real Time Sentiment Analysis using Twitter Feed

Project Report (MBA 653A) 2015

Indian Institute of Technology, Kanpur

REAL TIME SENTIMENT ANALYSIS USING TWITTER FEED

Project Report

by

Swapnil Shwetank Jha (11753) Shibendu Saha (11679)

Anshu Kumar Gupta (11125) Shivendu Bhushan (11689)

[Group 4]

Project Supervisor: Dr. Shankar Prawesh

IIT Kanpur



Acknowledgement

We have put considerable effort into this project. However, it would not have been possible without the kind support and help of many individuals, and we extend our sincere thanks to all of them.

We are highly indebted to Dr. Shankar Prawesh for his guidance and constant supervision, for providing the necessary information regarding the project, and for his support in completing it. Our thanks and appreciation also go to our colleagues who helped develop the project and to everyone who willingly helped us with their abilities.



Table of Contents

Objective
Dataset
Introduction
Algorithm
Results
Future Prospects
Appendix (Python Codes)
Bibliography



Objective

The aim of our project is to collect real-time tweets about any trending topic and classify each of them as ‘positive’ or ‘negative’ using a model that we have trained with a Gaussian Naïve Bayes classifier.

This classification is useful for gauging the general public response to a particular object, news item, trend, etc.

Dataset

For training our judgement model, we have used the dataset from the Kaggle website: https://inclass.kaggle.com/c/si650winter11/data

The data is in the following format:

Value   Statement
1       The Da Vinci Code book is just awesome.
1       I liked the Da Vinci Code a lot.
0       I hate Harry Potter.
0       Harry Potter and Titanic suck.

The data required for actual operation is obtained in real time using the Twitter API and processed by a Python script (code in the Appendix).




Introduction

Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall contextual polarity of a document. The basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level, that is, deciding whether the opinion expressed in a document, a sentence or an entity feature/aspect is positive or negative.

The accuracy of a sentiment analysis system is, in principle, how well it agrees with human judgments. This is usually measured by precision and recall. However, according to research, human raters typically agree with each other only 79% of the time.

Thus, a 70% accurate program is doing nearly as well as humans, even though such accuracy may not sound impressive. Even if a program were "right" 100% of the time, humans would still disagree with it about 20% of the time, since they disagree that much about any answer.

http://en.wikipedia.org/wiki/Precision_and_recall
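Since precision and recall are the measures used to judge the classifier, the following is a minimal sketch of how they can be computed per class from true and predicted labels. It is written for Python 3; the function names and the label lists are our own illustration and are not taken from the project's data.

```python
def precision_recall(y_true, y_pred, target):
    """Return (precision, recall) for the class `target`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    # Precision: of everything predicted as `target`, how much was correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that truly is `target`, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up labels for illustration only ("1" = positive, "0" = negative).
y_true = ["1", "1", "1", "0", "0", "0", "0", "0"]
y_pred = ["1", "1", "1", "1", "0", "0", "0", "0"]

pos_precision, pos_recall = precision_recall(y_true, y_pred, "1")
neg_precision, neg_recall = precision_recall(y_true, y_pred, "0")
overall_accuracy = accuracy(y_true, y_pred)
```

In this toy case one negative statement is wrongly predicted as positive, which lowers the precision of the positive class and the recall of the negative class while leaving the other two values at 1.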



Algorithm

Training Stage

Get several statements from a database, each labelled with its actual positive or negative response.

Split the statements into two classes: positive and negative.

For each class, compute the tf-idf values and their means and variances to prepare a Gaussian probability distribution.

Map the probabilities using a Naïve Bayes classifier.

Use only 80% of the dataset for training and test the model on the remaining 20%.
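The per-class mean and variance step can be sketched without any library. The code below is a simplified illustration in Python 3, not the scikit-learn implementation used in the Appendix. The function names, the two-feature toy vectors (standing in for tf-idf values) and the constant var_smoothing added to every variance are assumptions made for this example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_smoothing=1e-9):
    """Estimate the log prior and per-feature mean and variance of each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_smoothing
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class whose log prior plus Gaussian log likelihood is highest."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, means, variances) in model.items():
        score = log_prior
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy feature vectors: the first feature is high for positive statements,
# the second for negative ones (values invented for illustration).
X = [[0.9, 0.0], [0.8, 0.1], [0.7, 0.0], [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]]
y = ["1", "1", "1", "0", "0", "0"]

model = fit_gaussian_nb(X, y)
label_a = predict_gaussian_nb(model, [0.85, 0.05])
label_b = predict_gaussian_nb(model, [0.05, 0.85])
```

Working with log probabilities avoids numerical underflow when the number of features (here, vocabulary terms) is large.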



Actual Program

Authenticate with Twitter using the access token.

Collect real-time tweets about a trending topic.

Transform the tweets into vectors and pass them to our judgement model, which has previously been trained.

Classify the tweets into the two classes based on the computed probabilities.

Display the results about the type of response that the keyword is generating.
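The step that transforms tweets into vectors reuses the vocabulary and weights learned from the training statements, so words never seen in training are simply ignored. The sketch below shows this in plain Python 3. Its tokenisation, smoothed idf formula and L2 normalisation are modelled on our understanding of the scikit-learn defaults and should be read as an approximation; the function names and the sample tweet are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def fit_tfidf(documents):
    """Learn the vocabulary and smoothed idf weights from the training documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    vocabulary = {term: i for i, term in enumerate(sorted(doc_freq))}
    idf = [0.0] * len(vocabulary)
    for term, index in vocabulary.items():
        idf[index] = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
    return vocabulary, idf


def transform_tfidf(text, vocabulary, idf):
    """Map one text to an L2-normalised tf-idf vector; unseen words are ignored."""
    vector = [0.0] * len(vocabulary)
    for term, count in Counter(tokenize(text)).items():
        if term in vocabulary:
            index = vocabulary[term]
            vector[index] = count * idf[index]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


# The four sample statements from the Dataset section serve as training text.
training = [
    "The Da Vinci Code book is just awesome.",
    "I liked the Da Vinci Code a lot.",
    "I hate Harry Potter.",
    "Harry Potter and Titanic suck.",
]
vocabulary, idf = fit_tfidf(training)

# A hypothetical incoming tweet, and one made only of unseen words.
tweet_vector = transform_tfidf("I hate the Da Vinci Code", vocabulary, idf)
unknown_vector = transform_tfidf("zzz qqq", vocabulary, idf)
```

Because the vector length is fixed by the training vocabulary, every tweet can be passed to the trained classifier regardless of which words it contains.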



Results (Training)

Overall Accuracy: 88.7%

where the per-class recall and precision are:

            Recall   Precision
Positive    1.00     0.79
Negative    0.80     1.00



Results (Real time)

Review of Apple Watch

[Figure: No. of tweets v/s date (in April ’15), for dates 15, 16 and 17; series: Positive and Negative; vertical axis from 0 to 12.]

Future Prospects

Our application can be used as a service for businesses to do market analysis of the response that their products receive and to track the changes in that response with time.



Appendix (Codes)

Training

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import scipy.sparse as sp
import numpy as np
import re

def read():
    # Each line of training.txt is "<label>\t<statement>", with label "1" or "0".
    with open("training.txt") as f:
        contents = f.readlines()
    ytrain = []
    lines = []
    for content in contents:
        fracs = re.split('\t', content.strip())
        ytrain.append(fracs[0])
        lines.append(fracs[1])
    return (lines, np.array(ytrain))

def split_test_train(x, y):
    # Split each class separately: the first 80% of the positive and of the
    # negative statements go to training, the remaining 20% to testing.
    posIndex = y == "1"
    negIndex = y == "0"
    posX = x[posIndex]
    negX = x[negIndex]
    posY = y[posIndex]
    negY = y[negIndex]
    nPos = int(posX.shape[0] * 0.8)
    nNeg = int(negX.shape[0] * 0.8)
    xtrain = sp.vstack((posX[:nPos], negX[:nNeg]), format='csr')
    ytrain = np.concatenate((posY[:nPos], negY[:nNeg]))
    xtest = sp.vstack((posX[nPos:], negX[nNeg:]), format='csr')
    ytest = np.concatenate((posY[nPos:], negY[nNeg:]))
    return (xtrain, ytrain, xtest, ytest)

def train_test():
    # Train on 80% of the data and report accuracy on the held-out 20%.
    (lines, y) = read()
    vect = TfidfVectorizer()
    vect.fit(lines)
    x = vect.transform(lines)
    (xtrain, ytrain, xtest, ytest) = split_test_train(x, y)
    clf = GaussianNB()
    # GaussianNB needs dense input, hence toarray().
    clf.fit(xtrain.toarray(), ytrain)
    ypred = clf.predict(xtest.toarray())
    # classification_report lists classes in sorted label order ("0", then "1"),
    # so the names must be given as negative first, positive second.
    t = ["negative", "positive"]
    print("accuracy:")
    print(accuracy_score(ytest, ypred))
    print(classification_report(ytest, ypred, target_names=t))

def train():
    # Train on the full dataset and return the fitted vectorizer and classifier.
    (lines, y) = read()
    vect = TfidfVectorizer()
    vect.fit(lines)
    x = vect.transform(lines)
    clf = GaussianNB()
    clf.fit(x.toarray(), y)
    return (vect, clf)

Sentiment Analysis

from TwitterAPI import TwitterAPI
import train  # module containing the training code above
import warnings

# Twitter API credentials
access_token_key = "1402969771-8GILKV7XwynFy9X0vrEH5GnfHYJZ4Vu3lHr7Sve"
access_token_secret = "7YxDx7SjyGgKZAiumr7zVIhGI7IwBNmfpQr2g8CoVA"
consumer_key = "g1ZsoKVUlDl0buNGJT9dSw"
consumer_secret = "GLjDx3p1MfQzELWXiwOqRQCMjbwItgLDOubCvPVzA"

def get_score(query):
    # Train the model, then fetch and classify tweets matching the query.
    (vect, clf) = train.train()
    api = TwitterAPI(consumer_key, consumer_secret, access_token_key,
                     access_token_secret)
    r = api.request('search/tweets', {'q': query})
    tweets = []
    for item in r:
        # Keep only items that carry tweet text; anything else cannot be vectorized.
        if 'text' in item:
            tweets.append(item['text'])
    x = vect.transform(tweets)
    ypred = clf.predict(x.toarray())
    print("total no. of tweets : " + str(len(ypred)))
    print("no. of positive tweets : " + str(sum(ypred == "1")))
    print("no. of negative tweets : " + str(sum(ypred == "0")))



Bibliography

[1] Dataset: https://inclass.kaggle.com/c/si650winter11/data

[2] python libraries: http://scikit-learn.org/stable/

[3] tf-idf: Kranti Ghag and Ketan Shah, “SentiTFIDF - Sentiment Classification using Relative Term Frequency Inverse Document Frequency”, International Journal of Advanced Computer Science and Applications

[4] Naïve-Bayes: http://en.wikipedia.org/wiki/Naive_Bayes_classifier

[5] Real time tweets: https://dev.twitter.com/overview/api
