
Real Time Sentiment Analysis using Twitter Feed


Project Report (MBA 653A) 2015

Indian Institute of Technology, Kanpur

REAL TIME SENTIMENT ANALYSIS

USING TWITTER FEED

Project Report

by

Swapnil Shwetank Jha (11753) Shibendu Saha (11679)

Anshu Kumar Gupta (11125) Shivendu Bhushan (11689)

[Group 4]

Project Supervisor: Dr. Shankar Prawesh

IIT Kanpur


Acknowledgement

We have put considerable effort into this project. However, it would not have been possible without the kind support and help of many individuals, and we extend our sincere thanks to all of them.

We are highly indebted to Dr. Shankar Prawesh for his guidance and constant supervision, for providing the necessary information regarding the project, and for his support in completing it. Our thanks and appreciation also go to our colleagues who helped develop the project and to everyone who willingly helped us with their abilities.


Table of Contents

Objective
Dataset
Introduction
Algorithm
Results
Future Prospects
Appendix (Python Codes)
Bibliography


Objective

The aim of our project is to collect real-time tweets about any trending topic and classify each of them as ‘positive’ or ‘negative’ using a model that we have trained with a Gaussian Naïve Bayes classifier.

The classified tweets indicate the general public response to a particular object, news item, trend, etc.

Dataset

To train our judgement model, we used a dataset from the ‘Kaggle’ website: https://inclass.kaggle.com/c/si650winter11/data

The data is in the following format (1 = positive, 0 = negative):

Value   Statement
1       The Da Vinci Code book is just awesome.
1       I liked the Da Vinci Code a lot.
0       I hate Harry Potter.
0       Harry Potter and Titanic suck.

The data required for actual operation is obtained in real time using the Twitter API and processed by a Python script (code in the Appendix).
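As a rough illustration of the format above, each line can be split at the first tab into a label and a statement. This is only a sketch: it assumes tab-separated lines (as the appendix code does), uses an in-memory sample instead of the real file, and `parse_labelled_lines` is a hypothetical helper name.

```python
def parse_labelled_lines(raw_lines):
    """Split 'label<TAB>statement' lines into two parallel lists."""
    labels, statements = [], []
    for raw in raw_lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        label, statement = raw.split("\t", 1)
        labels.append(int(label))
        statements.append(statement)
    return labels, statements


# In-memory sample copied from the table above (assumed tab-separated).
sample = [
    "1\tThe Da Vinci Code book is just awesome.",
    "1\tI liked the Da Vinci Code a lot.",
    "0\tI hate Harry Potter.",
    "0\tHarry Potter and Titanic suck.",
]
labels, statements = parse_labelled_lines(sample)
print(labels)
print(statements[2])
```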


Introduction

Sentiment analysis aims to determine the attitude of a speaker or writer with respect to some topic, or the overall contextual polarity of a document. The basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the opinion expressed in a document, a sentence or an entity feature/aspect is positive or negative.

The accuracy of a sentiment analysis system is, in principle, how well it agrees with human judgments. This is usually measured by precision and recall. However, according to research, human raters typically agree only 79% of the time.

Thus, a 70% accurate program is doing nearly as well as humans, even though such accuracy may not sound impressive. If a program were "right" 100% of the time, humans would still disagree with it about 20% of the time, since they disagree that much about any answer.
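To make the measures mentioned above concrete, the sketch below computes precision, recall and accuracy for the positive class from two small lists of labels. The labels are invented for illustration and are not taken from our dataset; `precision_recall` is a hypothetical helper, and Python 3 division is assumed.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall of the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

precision, recall = precision_recall(y_true, y_pred)
accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(precision, recall, accuracy)
```

Precision asks how many of the tweets predicted positive really are positive, while recall asks how many of the truly positive tweets were found.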


Algorithm

Training Stage

1. Get several statements from the dataset together with their actual positive or negative label.
2. Split the statements into two classes: positive and negative.
3. For each class, compute the tf-idf values and their means and variances to prepare a Gaussian probability distribution.
4. Map the probabilities using a Naïve Bayes classifier.
5. Use only 80% of the dataset for training and test the model on the remaining 20%.
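The Gaussian step above can be sketched from scratch as follows. This is not the scikit-learn implementation used in the appendix, only a minimal illustration under stated assumptions: the feature vectors are hand-made stand-ins for the tf-idf weights of two terms, `eps` is an assumed small constant that keeps variances non-zero, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, eps=1e-9):
    """Per class: prior, mean of each feature, variance of each feature."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + eps
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log prior + Gaussian log likelihood."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hand-made vectors standing in for the tf-idf weights of ("awesome", "hate").
X = [[0.9, 0.0], [0.7, 0.1], [0.8, 0.0],   # positive statements
     [0.0, 0.8], [0.1, 0.9], [0.0, 0.7]]   # negative statements
y = [1, 1, 1, 0, 0, 0]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [0.6, 0.1]))
print(predict_gaussian_nb(model, [0.05, 0.75]))
```

The "naïve" part is the inner loop: each feature contributes its own independent Gaussian term to the class score.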


Actual Program

1. Authenticate with Twitter using the access tokens.
2. Collect real-time tweets about a trending topic.
3. Transform the tweets into vectors and pass them to our judgement model, which has previously been trained.
4. Classify the tweets into the two classes based on the computed probabilities.
5. Display the results about the type of response that the keyword is generating.
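The final display step can be sketched as a simple tally of the predicted labels. The sketch assumes the labels are the strings "1" (positive) and "0" (negative), as in the appendix code; `summarise` is a hypothetical helper and the predictions are invented.

```python
from collections import Counter


def summarise(predictions):
    """Count positive ('1') and negative ('0') predictions."""
    counts = Counter(predictions)
    total = len(predictions)
    positive = counts.get("1", 0)
    negative = counts.get("0", 0)
    share = positive / total if total else 0.0
    return {"total": total, "positive": positive,
            "negative": negative, "positive_share": share}


# Invented predictions for five tweets.
summary = summarise(["1", "0", "1", "1", "0"])
print(summary)
```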


Results (Training)

Overall Accuracy: 88.7%

where

            Recall   Precision
Positive    1.00     0.79
Negative    0.80     1.00
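Precision and recall are often combined into a single F1 score, their harmonic mean. The short sketch below applies that formula to the values reported above; `f1_from` is a hypothetical helper, and the F1 values are derived here rather than measured in the original experiment.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs taken from the table above.
reported = {"positive": (0.79, 1.00), "negative": (1.00, 0.80)}
f1 = {name: f1_from(p, r) for name, (p, r) in reported.items()}
for name, value in f1.items():
    print(name, round(value, 3))
```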


Results (Real time)

[Figure: "Review of Apple Watch". Number of tweets (y-axis, 0 to 12) versus date in April 2015 (x-axis: 15, 16, 17), with separate series for Positive and Negative.]

Future Prospects

Our application can be used as a service for businesses to analyse the market response that their products receive and to track how that response changes over time.
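Tracking the response over time, as proposed under Future Prospects, amounts to grouping the classified tweets by date. The sketch below shows one way to do this; the records are invented purely for illustration and are not the Apple Watch results, and `daily_counts` is a hypothetical helper.

```python
from collections import defaultdict


def daily_counts(records):
    """records: (date_string, label) pairs, label '1' = positive, '0' = negative."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for day, label in records:
        key = "positive" if label == "1" else "negative"
        counts[day][key] += 1
    # Return a plain dict ordered by date string.
    return {day: counts[day] for day in sorted(counts)}


# Invented sample records (not data from this report).
sample_records = [
    ("2015-04-16", "0"),
    ("2015-04-15", "1"),
    ("2015-04-15", "1"),
    ("2015-04-15", "0"),
    ("2015-04-16", "1"),
]
per_day = daily_counts(sample_records)
for day, tally in per_day.items():
    print(day, tally["positive"], tally["negative"])
```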


Appendix (Codes)

Training

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import scipy.sparse as sp
import numpy as np
import re

def read():
    # Each line of training.txt has the form "<label>\t<statement>".
    with open("training.txt") as f:
        contents = f.readlines()
    ytrain = []
    lines = []
    for content in contents:
        fracs = re.split('\t', content.strip())
        ytrain.append(fracs[0])
        lines.append(fracs[1])
    return (lines, np.array(ytrain))

def split_test_train(x, y):
    # Boolean masks for the positive ("1") and negative ("0") statements.
    posIndex = y == "1"
    negIndex = y == "0"
    posX = x[posIndex]
    negX = x[negIndex]
    posY = y[posIndex]
    negY = y[negIndex]
    # The first 80% of each class is used for training, the rest for testing.
    posCut = int(posX.shape[0]*0.8)
    negCut = int(negX.shape[0]*0.8)
    xtrain = sp.vstack((posX[:posCut], negX[:negCut]), format='csr')
    ytrain = np.concatenate((posY[:posCut], negY[:negCut]))
    xtest = sp.vstack((posX[posCut:], negX[negCut:]), format='csr')
    ytest = np.concatenate((posY[posCut:], negY[negCut:]))
    return (xtrain, ytrain, xtest, ytest)


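def train_test():
    # Train on 80% of the data and report accuracy on the remaining 20%.
    (lines, y) = read()
    vect = TfidfVectorizer()
    vect.fit(lines)
    x = vect.transform(lines)
    (xtrain, ytrain, xtest, ytest) = split_test_train(x, y)
    clf = GaussianNB()
    # GaussianNB needs dense arrays, hence toarray().
    clf.fit(xtrain.toarray(), ytrain)
    ypred = clf.predict(xtest.toarray())
    # target_names follow the sorted label order: "0" (negative), then "1" (positive).
    t = ["negative", "positive"]
    print("accuracy:")
    print(accuracy_score(ytest, ypred))
    print(classification_report(ytest, ypred, target_names=t))

def train():
    # Train on the full dataset and return the fitted vectorizer and classifier.
    (lines, y) = read()
    vect = TfidfVectorizer()
    vect.fit(lines)
    x = vect.transform(lines)
    clf = GaussianNB()
    clf.fit(x.toarray(), y)
    return (vect, clf)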
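The per-class 80/20 split performed by `split_test_train` can be illustrated on plain Python lists. This is only a sketch of the idea, not a replacement for the code above: `stratified_split` is a hypothetical helper, the items are invented, and, like the original, it takes the first 80% of each class without shuffling.

```python
def stratified_split(items, labels, train_fraction=0.8):
    """Put the first `train_fraction` of each class in train, the rest in test."""
    train, test = [], []
    for cls in sorted(set(labels)):
        members = [(item, label) for item, label in zip(items, labels)
                   if label == cls]
        cut = int(len(members) * train_fraction)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


# Invented data: ten positive ("1") and five negative ("0") items.
items = ["p%d" % i for i in range(10)] + ["n%d" % i for i in range(5)]
labels = ["1"] * 10 + ["0"] * 5

train_set, test_set = stratified_split(items, labels)
print(len(train_set), len(test_set))
```

Splitting each class separately keeps the ratio of positive to negative statements roughly the same in the training and test sets.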

Sentiment Analysis

from TwitterAPI import TwitterAPI
import train
import warnings

# Twitter API credentials: replace the placeholders with your own keys.
access_token_key = "YOUR_ACCESS_TOKEN_KEY"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"

def get_score(query):
    # Train the model, fetch tweets matching the query and classify them.
    (vect, clf) = train.train()
    api = TwitterAPI(consumer_key, consumer_secret, access_token_key,
                     access_token_secret)
    r = api.request('search/tweets', {'q': query})
    tweets = []
    for item in r:
        # Keep only items that carry tweet text; the vectorizer expects strings.
        if 'text' in item:
            tweets.append(item['text'])
    x = vect.transform(tweets)
    ypred = clf.predict(x.toarray())
    print("total no. of tweets : " + str(len(ypred)))
    print("no. of positive tweets : " + str(sum(ypred == "1")))
    print("no. of negative tweets : " + str(sum(ypred == "0")))
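Rather than keeping API keys in the script itself, the credentials could be read from environment variables. The sketch below shows one possible approach; the variable names and the `load_credentials` helper are assumptions made for illustration and are not part of the TwitterAPI library.

```python
import os

# Hypothetical environment variable names, in the order expected by
# TwitterAPI(consumer_key, consumer_secret, access_token_key, access_token_secret).
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN_KEY",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Return the four credentials as a tuple, or fail if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED)


# Demonstration with a fake environment (placeholder values only).
fake_env = {name: "placeholder-" + str(i) for i, name in enumerate(REQUIRED)}
credentials = load_credentials(fake_env)
print(len(credentials))
```

The resulting tuple could then be unpacked into the `TwitterAPI(...)` call used in `get_score`.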


Bibliography

[1] Dataset: https://inclass.kaggle.com/c/si650winter11/data

[2] Python libraries: http://scikit-learn.org/stable/

[3] tf-idf: Kranti Ghag and Ketan Shah, “SentiTFIDF-Sentiment classification using Relative Term Frequency Inverse Document Frequency”, International Journal of Advanced Computer Science and Applications

[4] Naïve Bayes: http://en.wikipedia.org/wiki/Naive_Bayes_classifier

[5] Real-time tweets: https://dev.twitter.com/overview/api