Feature Dependent Method for Sentiment Analysis Text Mining 2014

cPGCON 2014, Third Post Graduate and Research Scholar Symposium, University of Pune

Ms. Neha S. Joshi, Department of Computer Engineering

Progressive Education Society's Modern College of Engg., Shivajinagar,

Pune, [email protected]

Abstract: These days we give considerable weight to the opinions of friends and domain experts when making everyday decisions: for example, which brand is best for a certain product, whether a current movie is good, whether a product performs well, or how a travel site is rated. Opinion mining, also known as sentiment analysis, plays an important role in this process. It is the study of emotions, i.e. sentiments and expressions, stated in natural language. Natural language processing techniques are applied to extract emotions from unstructured data. This paper considers feature-level analysis, also known as fine-grained analysis, which takes each entity in a review together with its corresponding polarity. The proposed work presents an Artificial Neural Network (ANN) approach combined with the Jaccard similarity measure. The Jaccard similarity measure performs well at measuring the similarity of words when they are compared letter by letter. This approach is similar to the existing Support Vector Machine (SVM) approach, but SVM has certain disadvantages, such as limited parameter selection, and its performance degrades when a larger number of features is selected. Hence a new approach with a similarity measure is proposed.

Index Terms: Classification, Machine learning, Natural Language Processing (NLP), Opinion mining, Parts of speech tags, Sentiment analysis, Semi-Supervised learning, Sentiment Classification, Support Vector Machine (SVM), Term weighting, Polarity.

I. INTRODUCTION

Today people not only comment on existing information, bookmark pages, and provide ratings, but they also share their ideas, news and knowledge with the community at large. The main aim of information gathering is to analyze what other people think. The World Wide Web contains a massive amount of unstructured data. With the increasing popularity of opinion-based websites and other resources, new challenges have arisen in opinion mining. It is now becoming evident that the views expressed on the web can influence readers in forming their opinions on a topic. Similarly, the opinions expressed by users are an important factor taken into consideration by product vendors and policy makers. There are a number of differences in meaning between emotions, sentiments and opinions. The most notable one is that an opinion is a transitional concept, which

Prof. Mrs. Suhasini A. Itkar, Department of Computer Engineering

Progressive Education Society's Modern College of Engg., Shivajinagar,

Pune, [email protected]

reflects our attitude towards something. On the other hand, sentiments differ from opinions in that they reflect our feeling or emotion, which is not always directed towards something. Further still, our emotions may reflect our attitudes. Sentiment analysis, also known as opinion mining, plays an important role in determining the direction of sentiments, also known as polarity. It is currently a significant trend in natural language processing. As opinions are expressed in natural language, it involves machine learning, i.e. giving artificial intelligence to computers. Opinion mining extracts emotions, sentiments and opinions from the document corpus and analyzes them. Different machine learning approaches are used to analyze opinions, viz. supervised learning and unsupervised learning. For large-scale sentiment analysis, unsupervised learning is used. Supervised machine learning techniques train on a sample data set and are later tested on a subset of it. Among supervised learning approaches, SVM gives maximum accuracy when used with unigrams, but it also has a few disadvantages. This paper presents the design of the proposed approach and its implementation details.

II. RELATED WORK

There are three main levels of sentiment analysis: document-level analysis, sentence-level analysis and feature-level analysis. In document-level [1] and sentence-level analysis, one cannot identify a reviewer's likes or dislikes regarding a specific feature of the object. Document-level and sentence-level classification have been found insufficient to identify every detail of the sentiments expressed in a document, because sentiments may be expressed with respect to different features. In the feature-level method, an algorithm with parts of speech tags is used to improve accuracy on the benchmark dataset. It is a fine-grained analysis process which takes every feature of the object into consideration [2].

Abd. Samad Hasan Basari, Burairah Hussin, et al. [3] proposed a new approach which considers both SVM and SVM with particle swarm optimization (PSO). Experiments were carried out to compare the performance and accuracy of both approaches. It was found that SVM-PSO gives a better solution in terms of accuracy

A Feature Dependent Method for Sentiment Analysis to Understand User Context in Web


and precision than SVM on data without cleansing, but SVM gives better recall than SVM-PSO.

Rudy Prabowo and Mike Thelwall [4] presented a hybrid classification which combines several classification approaches, viz. rule-based classification, a Statistics Based Classifier (SBC) and SVM, to give better performance. It applies the classifiers in sequence, and any configuration can be used, for example SBC-SVM or Induction Rule Based Classifier-SBC-SVM. The SVM classifier is generally placed last because all documents it handles are classified as either positive or negative; hence, once applied, it does not give another classifier the chance to carry out a classification. The disadvantage of this approach is that if the number of samples is not sufficient, the SVM becomes weak when testing subsequent sample subsets.

Michelle Annett and Grzegorz Kondrak [5] proposed a novel approach based on SVM. They worked with feature vectors of different sizes, representations and types, and compared the new approach with existing approaches. They concluded that the type of feature vector chosen has a greater impact on the accuracy of the classifier.

The authors of [6, 7] pose the problem of grouping feature synonyms, which is a currently researched topic and an analytically daunting one. The idea is to acquire useful and possibly more explanatory synonyms for features, which makes analysis more robust and easier given a training data set. However, we are usually interested in summarizing all reviews of a particular movie, meaning these features and feature synonyms are often not given. For example, reviews on sites such as [18] and [19] give users an overall score for a movie that essentially summarizes how good reviewers on average think the movie is. This does not summarize certain features or attributes of the movie, but simply summarizes the movie as a whole to tell a user whether or not it comes recommended.
Hogenboom et al. [8] proposed a method which considers the negation scope and strength of a word while classifying whether the word has a positive or negative effect on the sentence. For example, consider the two sentences "I am happy with your performance" and "I am not that happy with your performance". The first sentence expresses a positive emotion. If we consider only the negative keyword "not", the second sentence would be treated as equivalent to "I am not happy with your performance", which is not correct. Considering the scope and strength of negative keywords when deciding their effect gives better results. The approach uses two algorithms: the first calculates a score for each word, and the second calculates the sentence score using the word sense and word score with respect to each negative keyword. If the calculated sentence score is less than zero, the sentence is assigned to the negative class.

Kechaou et al. [9] proposed an approach to evaluate a user's opinion on e-learning systems. Three feature selection methods, MI (Mutual Information), IG (Information Gain) and CHI statistics (CHI), were examined and advanced along with their proper HMM and SVM-based hybrid learning

method. Their results showed that IG (Information Gain) performed the best. Applying data mining techniques to e-learning reviews and studying e-learning blogs are some of the challenges in further improving the accuracy of the proposed system.

III. PROPOSED WORK

From previous work and the existing approaches, it is clear that the SVM approach has many disadvantages, though it gives better performance than other approaches, viz. Naive Bayes, Maximum Entropy, etc. The proposed framework combines the advantages of a similarity measure and an Artificial Neural Network (ANN) to give better accuracy and efficiency than the SVM approach. For the similarity measure there are two options, namely Jaccard and Dice similarity, and cosine similarity; any of these can be used. The following diagram describes the flow of the proposed method.

The document corpus is a data collection from Twitter, blogs, etc. It is the input to the data pre-processing step. Since the data contain several syntactic features that may not be useful for machine learning, it needs to be cleaned of items such as @ (at) for a link to a username, URLs or website links (http, url, www), # (hashtag) and RT (for retweet). A module that offers a choice of different cleaning operations is designed. In case normalization, most English texts (and texts in many other languages) are published in mixed case; that is, published text contains both uppercase and lowercase characters.
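The cleaning step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `clean_text`, its option flags and the regular expressions are hypothetical and are not taken from the paper, and keeping the hashtag word while dropping the # sign is one possible choice among several.

```python
import re

# Hypothetical patterns for the Twitter-specific items listed in the text.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"\bRT\b")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_text(text, strip_urls=True, strip_mentions=True,
               strip_retweet=True, strip_hashtag_sign=True):
    """Remove Twitter-specific markup; each operation can be switched off."""
    if strip_urls:
        text = URL_RE.sub(" ", text)
    if strip_mentions:
        text = MENTION_RE.sub(" ", text)
    if strip_retweet:
        text = RETWEET_RE.sub(" ", text)
    if strip_hashtag_sign:
        # Assumption: keep the hashtag word and drop only the '#' sign.
        text = HASHTAG_RE.sub(r"\1", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    print(clean_text("RT @alice loved the movie! http://t.co/xyz #awesome"))
```

Each flag corresponds to one cleaning operation, which mirrors the idea of a module that lets the user choose which operations to apply.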

Figure 1. Block Diagram of Proposed System

The process is to turn the entire document or its sentences into lowercase. Tokenization is the splitting of streams of text into individual terms or tokens. This procedure can take many forms, depending on the language being analyzed. For English, a simple and effective tokenization technique is to use white space and punctuation as token delimiters. Stemming is the procedure of reducing related tokens to a single form. Typically the stemming procedure includes the identification and removal of prefixes, suffixes, and inappropriate pluralizations. In n-gram generation, character n-grams are n adjacent characters from a given input sequence. For example, the 3-grams of the word TERM can be {_TE, TER, ERM, etc.}, where _ marks padding at the word boundary. An n-gram of size one is known as a unigram, of size two as a bigram, and so on. Term frequency is found by counting the number of times a given term occurs in a given document, and inverse document frequency is found by dividing the total number of documents by the number of documents in which the given term appears. When these values are multiplied together we get a score that is highest for terms that appear frequently in a few documents, and low for terms that appear frequently in every document, enabling us to find terms that are important in a document. Finally, a transformed data set is generated, which is used for training.
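The case normalization, tokenization and n-gram steps described above might look like the following minimal sketch. The function names are hypothetical, padding each word with a single `_` on both sides is an assumption about how boundary n-grams are formed, and stemming is left out because a realistic stemmer is beyond a short illustration.

```python
import re

# Tokens are runs of letters, digits and apostrophes; white space and
# punctuation act as delimiters.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def normalize_case(text):
    """Turn the whole text into lowercase."""
    return text.lower()


def tokenize(text):
    """Split text into tokens on white space and punctuation."""
    return TOKEN_RE.findall(text)


def char_ngrams(word, n, pad="_"):
    """Character n-grams of a word, padded once on each side when n > 1."""
    padded = pad + word + pad if n > 1 else word
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_ngrams(tokens, n):
    """Word-level n-grams: n=1 gives unigrams, n=2 gives bigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    tokens = tokenize(normalize_case("The movie wasn't bad, really!"))
    print(tokens)
    print(char_ngrams("TERM", 3))
    print(word_ngrams(tokens, 2))
```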

Figure 2. Training and Testing Flow

Algorithmic Approach

The process starts with finding important keywords in documents and removing irrelevant words. A TF-IDF approach is used initially. The formal procedure for implementing TF-IDF differs slightly across applications, but the overall approach works as follows. Given a document collection $D$, a word $w$, and an individual document $d \in D$, we calculate

$$w_d = f_{w,d} \cdot \log\left(\frac{|D|}{f_{w,D}}\right)$$

where $f_{w,d}$ equals the number of times $w$ appears in $d$, $|D|$ is the size of the corpus, and $f_{w,D}$ equals the number of documents in $D$ in which $w$ appears. The efficiency is $O(n)$. Once the important terms in the documents are obtained, a similarity measure is applied. Here the Jaccard similarity measure is used, which is a binary measure that distinguishes two or more objects. The Jaccard similarity is defined as follows:

$$JS(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

which indicates how similar $A$ and $B$ are. Finally, we obtain a feature set grouped by similarity, i.e. positive and negative. An Artificial Neural Network (ANN) algorithm is then used for training. One specific benefit that these models have over SVMs is that their size is fixed: they are parametric models, while SVMs are non-parametric. That is, an ANN has a set of hidden layers with sizes $h_1$ through $h_n$, depending on the number of features, plus bias parameters, and those make up the trained model.
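To illustrate the fixed-size property, the sketch below counts the weights and biases of a fully connected network from its layer sizes alone. The function name and the example layer sizes are hypothetical, and a plain fully connected architecture is assumed since the paper does not specify one.

```python
def mlp_parameter_count(n_features, hidden_sizes, n_outputs=1):
    """Number of weights and biases in a fully connected network.

    The count depends only on the layer sizes, not on how many
    training samples are used.
    """
    sizes = [n_features, *hidden_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes, sizes[1:]):
        total += fan_in * fan_out  # weights between two adjacent layers
        total += fan_out           # one bias per unit in the next layer
    return total


if __name__ == "__main__":
    # Hypothetical sizes: 1000 input features, hidden layers of 50 and 10 units.
    print(mlp_parameter_count(1000, [50, 10]))
```

Under these assumptions the model has the same number of parameters whether it is trained on a hundred reviews or a million, whereas the number of support vectors in an SVM can grow with the training set.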

Implementation Details

The proposed work is implemented by designing the following modules.

1) Collecting dataset.
2) Pre-processing and storing domain specific keywords.
3) Calculating TF-IDF.
4) Similarity measure.
5) Feature Extraction.
6) Training.
7) Classification and Analysis.
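Modules 3 and 4 could be sketched as below, using the TF-IDF weight (term count times the log of corpus size over document frequency) and the Jaccard similarity (intersection size over union size) given in the algorithmic approach. The function names are hypothetical, the natural logarithm is assumed because the paper does not state a base, and returning 0.0 for two empty sets is a convention chosen here.

```python
import math


def tf_idf(docs):
    """TF-IDF weights for a corpus given as a list of token lists.

    Returns one {word: weight} dictionary per document, where
    weight = f_{w,d} * log(|D| / f_{w,D}).
    """
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    weights = []
    for tokens in docs:
        counts = {}
        for word in tokens:
            counts[word] = counts.get(word, 0) + 1
        weights.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two collections."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    corpus = [["good", "movie"], ["bad", "movie"]]
    print(tf_idf(corpus))
    # Passing strings compares the words by their sets of letters.
    print(jaccard("good", "gold"))
```

A word that occurs in every document, such as "movie" in the toy corpus, gets a weight of zero, which is how the irrelevant words are filtered out before the similarity measure is applied.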

Datasets

Experiments are carried out on movie review datasets taken from amazon.com. Each dataset consists of 100 reviews that were classified in terms of overall orientation as either positive or negative (50 positive and 50 negative reviews). The ground truth was obtained from the customer 5-star rating. Reviews with more than 3 stars were defined as positive and reviews with fewer than 3 stars were labeled as negative [19].
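The labeling rule above can be written as a small helper. The function names are hypothetical, and dropping reviews with exactly 3 stars is an assumption, since the text only defines labels for ratings above and below 3.

```python
def label_from_stars(stars):
    """Map a 5-star rating to a polarity label."""
    if stars > 3:
        return "positive"
    if stars < 3:
        return "negative"
    return None  # assumption: 3-star reviews are left unlabeled


def label_reviews(reviews):
    """Keep only reviews that receive a label; input is (text, stars) pairs."""
    labeled = []
    for text, stars in reviews:
        label = label_from_stars(stars)
        if label is not None:
            labeled.append((text, label))
    return labeled


if __name__ == "__main__":
    sample = [("great film", 5), ("average", 3), ("waste of time", 1)]
    print(label_reviews(sample))
```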

Performance Measurement

Classification performance can be evaluated in terms of three measures, accuracy, recall and precision, as defined below. A confusion matrix is used for this.

                  Machine says yes    Machine says no
Human says yes    True positive       False negative
Human says no     False positive      True negative

Table 1. Confusion Matrix Table

$$\text{Accuracy} = \frac{\text{True positive samples} + \text{True negative samples}}{\text{Total number of samples}}$$

$$\text{Recall} = \frac{\text{True positive samples}}{\text{True positive samples} + \text{False negative samples}}$$

$$\text{Precision} = \frac{\text{True positive samples}}{\text{True positive samples} + \text{False positive samples}}$$
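The three measures can be computed directly from the confusion matrix counts, as in the sketch below. The function names and the label strings are illustrative, and returning 0.0 when a denominator is zero is a convention chosen here.

```python
def confusion_counts(y_true, y_pred, positive="positive"):
    """Return (tp, fp, fn, tn), treating `positive` as the 'yes' class."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == positive and predicted == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0


if __name__ == "__main__":
    human = ["positive", "positive", "positive", "negative", "negative", "negative"]
    machine = ["positive", "positive", "negative", "positive", "negative", "negative"]
    tp, fp, fn, tn = confusion_counts(human, machine)
    print(accuracy(tp, fp, fn, tn), recall(tp, fn), precision(tp, fp))
```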

Expected Result

The proposed methodology should lead to better accuracy, and it should do so with lower computational complexity, since high computational complexity is a major disadvantage of SVM. It should also hold up for the largest possible number of sample sets.


IV. CONCLUSION

Support Vector Machine (SVM) has been widely and successfully used in sentiment analysis, whereas Artificial Neural Networks (ANNs) have attracted little attention as an approach for sentiment learning. The literature has reported the disadvantage of the SVM approach that it does not cope well with a large number of features and samples. To the best of my knowledge, ANN gives accuracy at least as good as SVM, but for an SVM, in the worst case, the number of support vectors is exactly the number of training samples (though that mainly occurs with small training sets or in degenerate cases), and in general its model size scales linearly with the number of training samples. In natural language processing, SVM classifiers with tens of thousands of support vectors, each having hundreds of thousands of features, are not unheard of. Thus, ANN is chosen.

V. ACKNOWLEDGMENT

Foremost, I would like to express my sincere gratitude to my guide, Prof. Mrs. S. A. Itkar, for her continuous support, patience, motivation, enthusiasm, and immense knowledge. Her guidance helped me throughout the research and the writing of this paper. Besides my guide, I would like to thank our M.E. Coordinator, Prof. Ms. D. V. Gore, for her encouragement, insightful comments, and valuable guidance.

VI. REFERENCES

[1] Sowmya Kamath S, Anusha Bagalkotkar, Ashesh Khandelwal, Shivam Pandey, Kumari Poornima, "Sentiment Analysis Based Approaches for Understanding User Context in Web Content", 978-0-7695-4958-3/13, 2013 IEEE.
[2] Bing Liu, "Sentiment Analysis and Opinion Mining", Morgan & Claypool Publishers, May 2012.
[3] Abd. Samad Hasan Basari, Burairah Hussin, I. Gede Pramudya Ananta, Junta Zeniarja, "Opinion Mining of Movie Review using Hybrid Method of Support Vector Machine and Particle Swarm Optimization", Procedia Engineering 53 (2013) 453-462.
[4] Rudy Prabowo, Mike Thelwall, "Sentiment Analysis: A Combined Approach".
[5] Michelle Annett and Grzegorz Kondrak, "A Comparison of Sentiment Analysis Techniques: Polarizing Movie Blogs".
[6] B. Liu, "Web Data Mining: Exploring hyperlinks, contents, and usage data," Opinion Mining. Springer, 2007.
[7] B. Pang & L. Lee, "Opinion Mining and Sentiment Analysis." Foundations and Trends in Information Retrieval. Vol. 2, Nos. 1-2, pp. 1-135, 2008.
[8] Hogenboom, A.; van Iterson, P.; Heerschop, B.; Frasincar, F.; Kaymak, U., "Determining negation scope and strength in sentiment analysis," Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on, pp. 2589-2594, 9-12
[9] Kechaou, Z.; Ben Ammar, M.; Alimi, A.M., "Improving e-learning with sentiment analysis of users' opinions," Global Engineering Education Conference (EDUCON), 2011 IEEE, pp. 1032-1038, 4-6 April 2011.
[10] Wenying ZHENG, Qiang YE, "Sentiment Classification of Chinese Traveler Reviews by Support Vector Machine Algorithm". Third International Symposium on Intelligent Information Technology Application, 2009.
[11] Yuanbin Wu, Qi Zhang, Xuanjing Huang, Lide Wu, "Phrase Dependency Parsing for Opinion Mining". Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1533-1541, Singapore, 6-7 August 2009. © 2009 ACL and AFNLP.
[12] Rudy Prabowo, Mike Thelwall, "Sentiment Analysis: A Combined Approach". White paper.
[13] Bo Pang, Lillian Lee, Shivakumar Vaithyanathan, "Thumbs up? Sentiment Classification using Machine Learning Techniques". In Proceedings of EMNLP 2002, pp. 50-57.
[14] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval", Information Processing & Management, vol. 24, issue 5, pp. 513-523, 1988.
[15] Zhang, J.; Kawai, Y.; Nakajima, S.; Matsumoto, Y.; Tanaka, K., "Sentiment Bias Detection in Support of News Credibility Judgment," System Sciences (HICSS), 2011 44th Hawaii International Conference on, pp. 1-10, 4-7 Jan. 2011.
[16] Kechaou, Z.; Ben Ammar, M.; Alimi, A.M., "Improving e-learning with sentiment analysis of users' opinions," Global Engineering Education Conference (EDUCON), 2011 IEEE, pp. 1032-1038, 4-6 April 2011.
[17] B. J. Jansen, M. Zhang, K. Sobel, and A. Chowdury, "Twitter power: Tweets as electronic word of mouth", Journal of the American Society for Information Science and Technology, vol. 60, no. 11, pp. 2169-2188, 2009.
[18] "ROTTEN TOMATOES: Movies - New Movie Reviews and Previews." http://www.rottentomatoes.com.
[19] "Metacritic - Movie Reviews, TV Reviews, Game reviews, and Music Reviews." http://www.metacritic.com.
[20] Alexandra BALAHUR, Andrés MONTOYO, "A Feature Dependent Method for Opinion Mining and Classification". 978-1-4244-2780-2/08/ 2008 IEEE.
[21] Sowmya Kamath S, Anusha Bagalkotkar, Ashesh Khandelwal, "Sentiment Analysis Based Approaches for Understanding User Context in Web Content". International Conference on Communication Systems and Network Technologies, 2013.
[22] Liu Gongshen, Lai Huoyao, Luo Jun, Lin Jiuchuan, "Predicting the Semantic Orientation of Movie Reviews". Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2010).
[23] Gang Li, Fei Liu, "A Clustering-based Approach on Sentiment Analysis". 978-1-4244-6793-8/10/2010 IEEE.
[24] Mikalai Tsytsarau, Themis Palpanas, "Survey on mining subjective data on the web", Data Min Knowl Disc (2012) 24:478-514, DOI 10.1007/s10618-011-0238-6, Springer.
