Upload
anshu-gupta
View
298
Download
1
Embed Size (px)
Citation preview
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 1
REAL TIME SENTIMENT ANALYSIS
USING TWITTER FEED
Project Report
by
Swapnil Shwetank Jha (11753) Shibendu Saha (11679)
Anshu Kumar Gupta (11125) Shivendu Bhushan (11689)
[Group 4]
Project Supervisor: Dr. Shankar Prawesh
IIT Kanpur
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 2
Acknowledgement
We have taken efforts in this project. However, it would not have been possible without the kind support and help of many individuals. We would like to extend our sincere thanks to all of them.
We are highly indebted to Dr Shankar Prawesh for his guidance and constant supervision as well as for providing necessary information regarding the project & also for their support in completing the project. Our thanks and appreciations also go to our colleagues in developing the project and people who have willingly helped us out with their abilities.
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 3
Table of Contents
Topic Page No.
Objective 4
Dataset 4
Introduction 5
Algorithm 6-7
Results 8-9
Future Prospects 9
Appendix (Python Codes) 10-11
Bibliography 12
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 4
Objective
The aim of our project is to collect real time tweets about any trending topic
which we can then classify as ‘positive’ or ‘negative’ using a model that we
have prepared through training using Gaussian Naïve Bayes Classifier.
This information will be useful in gathering information about the general
public response related to the particular object, news, trend etc.
Dataset
For training our judgement model, we have used the dataset from ‘Kaggle’
website: https://inclass.kaggle.com/c/si650winter11/data
The data is in the following format:
Value Statement
1 The Da Vinci Code book is just awesome.
1 I liked the Da Vinci Code a lot.
0 I hate Harry Potter.
0 Harry Potter and Titanic suck.
The data required for actual operation, is obtained in real time using the
Twitter API and processed by a python script. (Code in Appendix)
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 5
Introduction
Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive or negative.
The accuracy of a sentiment analysis system is, in principle, how well it agrees with human judgments. This is usually measured by precision and recall. However, according to research human raters typically agree 79% of the time.
Thus, a 70% accurate program is doing nearly as well as humans, even though such accuracy may not sound impressive. If a program were "right" 100% of the time, humans would still disagree with it about 20% of the time, since they disagree that much about any answer.
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 6
Algorithm
Training Stage
Get several statements from a database with their actual
positive or negative response.
Split the statements into two classes: positive and negative.
For each class, compute the tf-idf values and their mean
and variances to prepare a Gaussian probability distribution
Map the probabilities using a Naïve Bayes Classifier
Use only 80% of the dataset for the training and
test the model on remaining 20%.
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 7
Actual Program
Authenticate with Twitter using token
Collect real time tweets about a trending topic.
Classify the tweets into the two different classes
based on computed probabilities.
Transform the tweets into vectors and pass it to our
judgement model that has previously been trained.
Display the results about the type of response
That the keyword is generating.
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 8
Results (Training)
Overall Accuracy: 88.7%
Where,
Recall Precision
Positive: 1.00 0.79
Negative: 0.80 1.00
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 9
Results (Real time)
Review of Apple Watch
No of tweets v/s date (in April’15)
Future Prospects Our application can be used as a service for businesses to do market analysis of the response that their products receive and track the changes in response with time.
0
2
4
6
8
10
12
15 16 17
Positive
Negative
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 10
Appendix (Codes) Training
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import scipy.sparse as sp
import numpy as np
import re
def read():
with open("training.txt") as f:
contents = f.readlines()
ytrain = []
lines = []
for content in contents:
fracs = re.split('\t',content.strip())
ytrain.append(fracs[0])
lines.append(fracs[1])
return (lines,np.array(ytrain))
def split_test_train(x,y):
posIndex = y == "1"
negIndex = y == "0"
posX = x[posIndex]
negX = x[negIndex]
posY = y[posIndex]
negY = y[negIndex]
xtrain =
sp.vstack((posX[:int(posX.shape[0]*0.8)],negX[:int(negX.shape[0]*0.8)]
),format='csr')
ytrain =
np.concatenate((posY[:int(posX.shape[0]*0.8)],negY[:int(negX.shape[0]*
0.8)]))
xtest =
sp.vstack((posX[int(posX.shape[0]*0.8):],negX[int(negX.shape[0]*0.8):]
),format='csr')
ytest =
np.concatenate((posY[int(posX.shape[0]*0.8):],negY[int(negX.shape[0]*0
.8):]))
return (xtrain,ytrain,xtest,ytest)
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 11
def train_test():
(lines,y) = read()
vect = TfidfVectorizer()
vect.fit(lines)
x = vect.transform(lines)
(xtrain,ytrain,xtest,ytest) = split_test_train(x,y)
clf = GaussianNB()
clf.fit(xtrain.toarray(),ytrain)
ypred = clf.predict(xtest.toarray())
t = ["positive","negative"]
print "accuracy:"
print(accuracy_score(ytest, ypred))
print(classification_report(ytest, ypred, target_names=t))
def train():
(lines,y) = read()
vect = TfidfVectorizer()
vect.fit(lines)
x = vect.transform(lines)
clf = GaussianNB()
clf.fit(x.toarray(),y)
return (vect,clf)
Sentiment Analysis
from TwitterAPI import TwitterAPI
import train
import warnings
access_token_key = "1402969771-
8GILKV7XwynFy9X0vrEH5GnfHYJZ4Vu3lHr7Sve"
access_token_secret = "7YxDx7SjyGgKZAiumr7zVIhGI7IwBNmfpQr2g8CoVA"
consumer_key = "g1ZsoKVUlDl0buNGJT9dSw"
consumer_secret = "GLjDx3p1MfQzELWXiwOqRQCMjbwItgLDOubCvPVzA"
def get_score(query):
(vect,clf) = train.train()
api = TwitterAPI(consumer_key, consumer_secret, access_token_key,
access_token_secret)
r = api.request('search/tweets', {'q':query})
tweets = []
for item in r:
tweets.append(item['text'] if 'text' in item else item)
x = vect.transform(tweets)
ypred = clf.predict(x.toarray())
print "total no. of tweets : " + str(len(ypred))
print "no. of positive tweets : "+str(sum(ypred=="1"))
print "no. of negative tweets : "+str(sum(ypred=="0"))
Project Report (MBA 653A) 2015
Indian Institute of Technology, Kanpur 12
Bibliography
[1] Dataset: https://inclass.kaggle.com/c/si650winter11/data
[2] python libraries: http://scikit-learn.org/stable/
[3] tf-idf: Kranti Ghag and Ketan Shah, “SentiTFIDF-Sentiment classification
using Relative Term Frequency Inverse Document Frequency”, International Journal of Advanced Computer Science and Applications
[4] Naïve-Bayes: http://en.wikipedia.org/wiki/Naive_Bayes_classifier
[5] Real time tweets: https://dev.twitter.com/overview/api