g19p1

CS771: Machine Learning Tools and Techniques

CS771: GROUP-19
Project Report
Sentiment Analysis in Movie Reviews

Submitted in partial fulfillment of the requirements for the Machine Learning course

Submitted by

Jaimita Bansal, 14111017, Dept. of CSE
Zahira Nasrin, 14111047, Dept. of CSE
Rajat Kumar, 14111028, Dept. of CSE
Sharin KG, 14111033, Dept. of CSE

Under the guidance of
Dr. Harish Karnick

Department of Computer Science and Engineering
INDIAN INSTITUTE OF TECHNOLOGY KANPUR
Kanpur, Uttar Pradesh, India 208016

Acknowledgments

We would like to express our sincere appreciation to our supervisor Prof. Harish Karnick. This project would not have been possible without his guidance. We owe our knowledge in Machine Learning to his course CS771: Machine Learning Tools and Techniques.

Abstract

In this project, we tackle the problem of sentiment analysis, which has recently been an active area of research. We experiment with different machine learning approaches, ranging from a simple Bag of Words model to more complex models such as GloVe, with the aim of predicting the sentiment of unseen reviews. We also discuss the performance of the classifiers used and compare their results to find the best classifier configuration for the given dataset. To do this, we use the IMDB dataset, which provides 25,000 highly polar movie reviews for training, 25,000 reviews for testing and an additional 50,000 unlabeled reviews.

Table of Contents

1. Introduction
2. Dataset
3. Pre-processing
4. Approaches
4.1 Bag Of Words
4.1.1 Results
4.2 Word2vec
4.2.1 Results
4.3 Doc2vec
4.3.1 Results
4.4 GloVe
4.4.1 Results
4.5 Perceptron
4.5.1 Results
5. References

1. Introduction

Sentiment analysis and classification is an area of text classification that has recently been receiving a lot of attention from researchers. It is formally defined as the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral. Sentiment analysis involves analyzing textual datasets that may contain opinions (e.g., social media posts, movie reviews) with the aim of classifying the opinions as positive, negative, or neutral. Classifying text according to sentiment is considered more difficult than classifying it according to content, because people express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which can be very misleading for both humans and computers.

Figure 1. Learning Task Diagram

2. Dataset

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of each review is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not include any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
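The labeling rule can be written as a small function. This is only an illustrative sketch: the name `rating_to_sentiment` is hypothetical, and returning `None` for ratings of 5 or 6 is an assumption, since the text defines labels only for ratings below 5 and for ratings of 7 or more.

```python
from typing import Optional


def rating_to_sentiment(rating: int) -> Optional[int]:
    """Map an IMDB star rating to a binary sentiment label."""
    if rating < 5:
        return 0  # negative review
    if rating >= 7:
        return 1  # positive review
    # Assumption: ratings of 5 or 6 belong to neither class.
    return None


# Made-up ratings, used only to exercise the three branches.
labels = [rating_to_sentiment(r) for r in (2, 6, 9)]
```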

The data fields are:

id - Unique ID of each review
sentiment - Sentiment of the review; 1 - positive and 0 - negative
review - Text of the review
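As a rough sketch, records with these three fields could be read as follows. The tab-separated layout, the helper name `load_reviews`, and the two sample rows are assumptions made for illustration; they are not taken from the dataset.

```python
import csv
import io

# Two made-up rows in an assumed tab-separated layout (not real dataset entries).
sample = (
    "id\tsentiment\treview\n"
    "r1\t1\tA wonderful film.<br />Loved it!\n"
    "r2\t0\tDull and far too long.\n"
)


def load_reviews(handle):
    """Read rows with the fields id, sentiment and review into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        row["sentiment"] = int(row["sentiment"])  # 1 = positive, 0 = negative
        rows.append(row)
    return rows


reviews = load_reviews(io.StringIO(sample))
```

Passing an open file handle instead of the `io.StringIO` object would read the same fields from a file on disk.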

3. Pre-processing

As the reviews in the dataset contained many HTML tags, stop words and punctuation symbols, we had to clean the data and pre-process the reviews to make them suitable for analysis. We used the BeautifulSoup library to remove HTML tags and markup. Punctuation and other non-letter characters were removed with regular expressions from the re package. Finally, we took the list of English-language stop words from the Python Natural Language Toolkit, filtered all of them out of our data corpus, and used a stemming package to treat words with the same stem as one.

Code Snippet 1. Functions for Preprocessing

def review_to_wordlist( review, remove_stopwords=False ):
    # 1. Remove HTML
    review_text = BeautifulSoup(review).get_text()
    # 2. Remove non-letters
    review_text = re.sub("[^a-zA-Z]"," ", review_text)
    # 2.1 Remove single letters
    review_text = re.sub('/(?
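The cleaning steps described in this section can be sketched using only the standard library. This is not the report's implementation. A regular expression stands in for BeautifulSoup, a tiny hand-written stop-word set stands in for the NLTK list, lower-casing and the length-based single-letter filter are assumptions, and stemming is left out. The names `clean_review` and `STOPWORDS` are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop-word set standing in for NLTK's English list.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "was", "of", "to", "in"}


def clean_review(review, remove_stopwords=False):
    """Return a list of lower-case word tokens from one raw review."""
    # 1. Strip HTML tags (regex stand-in for BeautifulSoup's get_text()).
    text = re.sub(r"<[^>]+>", " ", review)
    # 2. Replace every non-letter character with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    # 3. Lower-case and split on whitespace (assumed step).
    words = text.lower().split()
    # 4. Drop single-letter tokens (assumed way of removing single letters).
    words = [w for w in words if len(w) > 1]
    # 5. Optionally drop stop words.
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return words


# Made-up review text, used only for illustration.
tokens = clean_review("This movie was <b>great</b>!<br />I loved it.", remove_stopwords=True)
```

With `remove_stopwords=True`, the example keeps only the content words of the made-up review. A real pipeline would swap in the NLTK stop-word list and add a stemmer after step 5.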
