
Sentiment Analysis for IET ATC 2016


A SENTIMENT ANALYSIS AND CLASSIFICATION ALGORITHM UTILIZING AN INDEPENDENT TERM MATCHING SCHEME SENSITIVE TO WORD COUNT PATTERNS

Authors:

Asoka Korale, Ph.D., C.Eng., MIET
Chanuka Perera, Dip., ABE(UK)
Eranda Adikari, B.Sc., C.Eng., MIESL
Nadeesha Ekanayake, B.Sc.,


Business Drivers of “Sentiment Analysis” & Classification

Devise a Customer focused Corporate Strategy

Help Determine Areas of Future Investments

Analysis of Customer Feedback for Decision making

Insights on Corporate Image, Service Level and Performance

Business Process Improvement …


Objective of the Modeling

Prioritize Comments by Sentiment (Severity of Feedback)

Classify Comments into Predefined Categories

Rate Sentiment contained in Feedback

Analyze Feedback Comments, Prioritize and Classify for Timely Action

Direct each Class to Appropriate Authority in Priority Order for Timely action


“Sentiment” a Definition

Concise “Comments” give insight to “Emotional” content of message

Emotional Dimensions of Words: Valence (Happiness), Activation (Arousal), Dominance

An Opinion, View held or Expressed

Only “Select” words convey “Emotion”

Dictionaries of rated Words across each Emotional Dimension

Account separately for “Negations”

Words rated for “Sentiment” by Human agents via large Surveys

Introduce Local Language Support


Feedback Comment Classification Process

Supervised Methods employ “Training Sequences”

Technique uses word Combinations, Patterns, Frequencies

Grouping comments on a “Theme” or Criteria into “Classes”

Requires Pre-classified Comments

Suitable for classifying large texts


Sentiment Analysis via Independent Term Matching

Assumptions -

Twitter, FB & Customer comments

Each term in a comment independent of others

Valence, Activation and Dominance components of each word drawn from a Normal Distribution with specified Mean and Standard Deviation

Combined overall sentiment rating of the matched words occurs at the maximum of the sum of the individual Normal Densities

Overall Sentiment in a comment represented by the combined effect of the sentiment of individual words in the comment

Suitable for small text data

Ref: http://www.csc.ncsu.edu/faculty/healey/tweet_viz/


Algorithm – Sentiment Score for each Comment

I. Comments in Series: Each Analyzed Separately

II. Select a Comment, Convert words to Lower case and Remove Punctuation

III. Find match in Dictionary for each word in selected comment and get corresponding mean and standard deviation

IV. Extract Mean and Standard Deviation of “Valence” and “Activation” attributes of each matched word from Dictionary

V. Compute a Normal Density Function with Mean and Standard Deviation corresponding to each Attribute of each matched word by scaling a Standard Normal Random Variable

VI. Compute the sum of the Density functions corresponding to each attribute of all matched words in the comment

VII. Determine Maximum point “max-GMM” of the sum of the Density functions to arrive at an average score for the effect of that attribute across all words in the comment

$\mu = [\mu_1, \mu_2, \dots, \mu_n]$

$\sigma = [\sigma_1, \sigma_2, \dots, \sigma_n]$

Comment Words      Valence Rating      Activation Rating
Dictionary Value   Mean     Std Dev    Mean     Std Dev
'service'          6.83     1.54       2.95     2.09
'good'             7.89     1.24       3.66     2.72
'late'             3.32     1.17       5.57     2.56
Simple Average     6.01     1.32       4.06     2.46
max-GMM            7.5                 3.7
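Steps II to IV above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-word `DICTIONARY` is a hypothetical stand-in holding the values from the example table, and the function names `match_words` and `simple_average` are not from the original work (which was built in Matlab).

```python
import string

# Hypothetical three-word dictionary: (mean, std dev) per attribute,
# filled with the values from the example table.
DICTIONARY = {
    "service": {"valence": (6.83, 1.54), "activation": (2.95, 2.09)},
    "good":    {"valence": (7.89, 1.24), "activation": (3.66, 2.72)},
    "late":    {"valence": (3.32, 1.17), "activation": (5.57, 2.56)},
}

def match_words(comment, dictionary=DICTIONARY):
    """Lower-case the comment, strip punctuation, keep words found in the dictionary."""
    strip = str.maketrans("", "", string.punctuation)
    cleaned = comment.lower().translate(strip)
    return [(word, dictionary[word]) for word in cleaned.split() if word in dictionary]

def simple_average(matches, attribute):
    """Average of the dictionary means of one attribute over the matched words."""
    means = [ratings[attribute][0] for _, ratings in matches]
    return sum(means) / len(means)

matches = match_words("The service was good, but was late.")
print([word for word, _ in matches])
print(round(simple_average(matches, "valence"), 2))
print(round(simple_average(matches, "activation"), 2))
```

Words that are not in the dictionary are simply skipped, which is why spelling variants matter for this scheme.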


Gaussian Mixtures in Rating “Total Sentiment”

Consider the cumulative effect of all matched sentiment bearing words via the sum of the individual probability densities.

$$f(x;\theta) = \sum_{k=1}^{N} p_k \, g(x; m_k, \sigma_k)$$

where $p_k = \frac{1}{N}$ and

$$g(x; m_k, \sigma_k) = \frac{1}{\sqrt{2\pi}\,\sigma_k} \, e^{-\frac{1}{2}\left(\frac{x - m_k}{\sigma_k}\right)^2}$$

Here $x$ represents the sentiment score, $N$ the number of matched words in a comment, $m_k, \sigma_k$ the mean and standard deviation of the Normal Distribution of the ratings of each matched word, and $\theta$ the set of all $(m_k, \sigma_k)$.

The overall sentiment $x_{comment}$ of a comment in a particular dimension is then determined as

$$x_{comment} = \arg\max_{x} f(x;\theta)$$

which is the point at which the probability of the mixture of distributions is a maximum, and so is the most likely value for the overall sentiment of a comment composed of several words.
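The maximum of the mixture has no closed form in general, so one simple way to locate it is a grid search. The sketch below assumes equal weights of 1/N, a rating scale running from 1 to 9, and a grid step of 0.01; the scale limits, the step and the function names are illustrative assumptions, not details taken from the original implementation.

```python
import math

def normal_density(x, mean, std):
    """Normal probability density with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def max_gmm(params, lo=1.0, hi=9.0, step=0.01):
    """Grid-search the score x that maximises the equally weighted sum of densities.

    params is a list of (mean, std) pairs, one per matched word.
    lo, hi and step are assumed values for the rating scale and resolution.
    """
    n = len(params)
    best_x, best_f = lo, -1.0
    for i in range(int(round((hi - lo) / step)) + 1):
        x = lo + i * step
        f = sum(normal_density(x, m, s) for m, s in params) / n
        if f > best_f:
            best_x, best_f = x, f
    return best_x

# (mean, std) of 'service', 'good', 'late' from the example table
valence = [(6.83, 1.54), (7.89, 1.24), (3.32, 1.17)]
activation = [(2.95, 2.09), (3.66, 2.72), (5.57, 2.56)]
print(round(max_gmm(valence), 1), round(max_gmm(activation), 1))
```

With the example values the peaks should land close to the max-GMM ratings of 7.5 and 3.7 quoted for this comment, noticeably above the simple average in the valence dimension because the single negative word 'late' forms a separate, lower mode.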


Overall Valence (Happiness) and Activation (Arousal) of a comment

Comment Words      Valence Rating      Activation Rating
Dictionary Value   Mean     Std Dev    Mean     Std Dev
'service'          6.83     1.54       2.95     2.09
'good'             7.89     1.24       3.66     2.72
'late'             3.32     1.17       5.57     2.56
Simple Average     6.01     1.32       4.06     2.46
max-GMM            7.5                 3.7

Figure 1: Gaussian Mixtures of matched words in the Valence Dimension

Figure 2: Gaussian Mixtures of matched words in the Activation Dimension


IMPACT OF “NEGATIONS” ON TOTAL RATING

“the service was not good and late”

Comment Words      Valence Rating      Activation Rating
Dictionary Value   Mean     Std Dev    Mean     Std Dev
'service'          6.83     1.54       2.95     2.09
Not 'good'         6.65     1.24       6.38     2.72
'late'             3.32     1.17       5.57     2.56
Simple Average     5.6      1.32       4.97     2.46
max-GMM            6.7                 4.5

“the service was good but was late”

Comment Words      Valence Rating      Activation Rating
Dictionary Value   Mean     Std Dev    Mean     Std Dev
'service'          6.83     1.54       2.95     2.09
'good'             7.89     1.24       3.66     2.72
'late'             3.32     1.17       5.57     2.56
Simple Average     6.01     1.32       4.06     2.46
max-GMM            7.5                 3.7

Account for Negations by adjusting the sentiment score of the word immediately following the negation in a direction opposite in polarity to its matched dictionary sentiment value.

The magnitude of the adjustment made corresponds to the standard deviation of the particular rating value being adjusted.

The magnitude of the adjustment can also be user definable
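A rough Python sketch of this negation rule follows. It assumes that "polarity" is judged against the midpoint (5) of a 1 to 9 rating scale and that the shift is one standard deviation by default; the negation word list, the midpoint and the `factor` parameter are hypothetical choices made for illustration. Under these assumptions the sketch reproduces the adjusted 'good' values in the table (7.89 to 6.65 for Valence, 3.66 to 6.38 for Activation).

```python
NEGATIONS = {"not", "no", "never"}  # hypothetical negation list
SCALE_MIDPOINT = 5.0                # assumed midpoint of a 1-9 rating scale

def negate(mean, std, factor=1.0):
    """Shift a rating by factor * std towards the opposite polarity (assumed rule)."""
    shift = factor * std
    return mean - shift if mean >= SCALE_MIDPOINT else mean + shift

def apply_negations(words, dictionary, factor=1.0):
    """Return (word, mean, std) for matched words, adjusting only the word
    that immediately follows a negation."""
    adjusted = []
    negate_next = False
    for word in words:
        if word in NEGATIONS:
            negate_next = True
            continue
        if word in dictionary:
            mean, std = dictionary[word]
            if negate_next:
                mean = negate(mean, std, factor)
            adjusted.append((word, round(mean, 2), std))
        negate_next = False  # the flag applies to the next word only
    return adjusted

valence = {"service": (6.83, 1.54), "good": (7.89, 1.24), "late": (3.32, 1.17)}
print(apply_negations("the service was not good and late".split(), valence))
```

The `factor` argument corresponds to the user-definable magnitude of the adjustment.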


Variance in Max GMM and Simple Average Measure

It is seen that 90% of the time the samples are within +/- 0.5 in the case of the Valence Attribute.

The CDF of the difference in the Activation attribute is tightly centered on the origin indicating hardly any variance.

This is also an indication that most comments convey sentiments of a single polarity and only a few comments (less than 10%) have words with conflicting emotional content.

Figure 1: Variance between GMM and Simple Average measures for estimating overall comment sentiment

A measure of the degree of disparate emotions in the comments


Sample Comments for Rating and Classification

1. HOTLINE ISSUES - DELAY IN ANSWERING - CX SERVICE ASSISTANCE Today morning CX has called to the 444 H\L for Movie Ticket & he has waited for more than 10 mins in the line, regarding this now CX was very disappointed on our service. So pls be kind enough to chk on ths & give the call back to the CX ASAP. * Note: - Regarding this issue CX need the call back from one of our manager & CX has requested not to charge a single rupee from his no for this issue.
2. Yes,man magea prshnaya kiyapu gaman eyaa magea prshnea wisaduwaa he's a good
3. Yes kad pin nambar signal
4. Wenath ayathana wala mema pahasukam nomati nisa
5. very good service
6. uparimaya
7. Uparima
8. think so
9. thanks
10. Super
11. Solved
12. She resolved my problem.
13. Service nallam
14. Sambanda weemata boho welawak giya nisa
15. recharge
16. Prashnayata pilithura hodin pahadili kara dima
17. Payak athulatha gataluwa nirakaranaya karanwa kiuwa. Thawamath gataluwa nirakaranaya kara natha.
18. oba ayathanaya sewawan sadaha ihala mudalak ayakarana nisa
19. no mms setting laba dunnada save kala nohaka
20. nam apahu e tika ewanna
21. Mata awashshaya u pilithurau pahadili lesa laba ganemata hakiuna.
22. mage parshnata pilithuru dunna.
23. lotari SMS stop
24. Its professional
25. ing tone sewawa ain kirima
26. I submitted Xtv reg form on 27th oct at yr crescat arcade. They told to call me on 28th wed to give the AC No
27. Hot line eka answer karapu girlge voice eka and care eka good
28. Hi kohomada? Mama mea dawas wala plan karagena yanawa mage next music video eka karanna. Song eka "Mata Rawana" :-)
29. harima pehediliwa mage getaluwa nirakaranaya kala thanks
30. Good service but shortcomings due to some arrogant customer care officers
31. good men
32. Good
33. getaluwa hadunagenimata noheki wiya..
34. First of all its great to be treated as a privilege customer. Reason is simple. I'm using X mobile connection and XTV, because dialog has the better
35. durakathanayata pilithuru denda epai eke hoda naraka kiyanna.
36. Cx need to add the CHU CHU TV which is a kids channel to the channel list.Since this channel is available on another TV connection.Cx need this channel to activate for XTV aswell.Please check on this and do the needfull. Thank you
37. Customer service personal have to be trained better cause they can't think out of the box.
38. bashawa wenaskaranna


Sentiment Aggregates on Sample Comments

Fig 1: Heat Map of Sentiment rated sample comments Fig 2: Sentiment Dimensions of sample comments


A Novel Association Rule Mining Algorithm

• Initialize (at level L1) by determining the set of all Items {I} that meet the minimum support criteria
• Determine support for all pairs of items {Ii, Ij} (i ≠ j) in {I}
• Determine rules for all pairs of items of the form Ii -> Ij
• At each subsequent level (Lp), p > 1:
  • Determine item combinations that meet the minimum support criteria
  • Items at subsequent stages are selected from the rules of the previous stage that met the minimum support criteria
  • The antecedent at the subsequent level (Lp+1) is formed by merging the antecedent and consequent terms of the rules that meet the minimum support criteria at level Lp
• Stop when the combined terms no longer meet the minimum support criteria

Deriving likely word combinations (Keyword Selection)

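• Selection Measures

$$\mathrm{Support}(A \rightarrow B) = \frac{N(A \,\&\, B)}{N}$$

$$\mathrm{Confidence}(A \rightarrow B) = \frac{\mathrm{Support}(A \rightarrow B)}{\mathrm{Support}(A)} = \frac{P(E_A \,\&\, E_B)}{P(E_A)} = P(E_B \mid E_A)$$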
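The two selection measures and the first level of rule generation can be illustrated with a short Python sketch. It assumes each comment is reduced to a set of words, that support is the fraction of comments containing all the items, and it covers only level L1 (single items and pairwise rules); the sample comments and function names are hypothetical.

```python
def support(items, comments):
    """Fraction of comments that contain every item in `items`."""
    items = set(items)
    return sum(1 for c in comments if items <= c) / len(comments)

def confidence(antecedent, consequent, comments):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, comments) / support(antecedent, comments)

def pair_rules(comments, min_support):
    """Level L1: keep frequent single items, then emit rules Ii -> Ij
    for every ordered pair whose joint support meets the threshold."""
    vocab = set().union(*comments)
    frequent = sorted(w for w in vocab if support({w}, comments) >= min_support)
    rules = []
    for a in frequent:
        for b in frequent:
            if a != b and support({a, b}, comments) >= min_support:
                rules.append((a, b, confidence({a}, {b}, comments)))
    return rules

comments = [  # hypothetical tokenised comments
    {"hotline", "delay", "answer"},
    {"hotline", "delay"},
    {"good", "service"},
    {"hotline", "good"},
]
for a, b, conf in pair_rules(comments, min_support=0.5):
    print(f"{a} -> {b}: confidence {conf:.2f}")
```

Higher levels would merge the antecedent and consequent of each surviving rule into a new antecedent and repeat the same support test until no combination qualifies.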


Simplifying Assumptions of the Naïve Bayes Technique


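Probability of a sequence of words {Xi} in a comment given class Cj:

$$\begin{aligned}
P(X_1, X_2, \dots, X_N \mid C_j) &= \frac{P(X_1, X_2, \dots, X_N, C_j)}{P(C_j)} \\
&= P(X_1 \mid X_2, \dots, X_N, C_j)\,\frac{P(X_2, X_3, \dots, X_N, C_j)}{P(C_j)} \\
&= P(X_1 \mid X_2, \dots, X_N, C_j) \cdots P(X_N \mid C_j)\,\frac{P(C_j)}{P(C_j)}
\end{aligned}$$

Under the assumption of conditional independence of word Xi given class Cj:

$$P(X_i \mid X_{i+1}, \dots, X_N, C_j) = P(X_i \mid C_j)$$

$$P(X_1, X_2, \dots, X_N \mid C_j) = P(X_1 \mid C_j)\,P(X_2 \mid C_j) \cdots P(X_N \mid C_j)$$

Probability of class C given a set of words X = {X1, X2, …, XN}:

$$\begin{aligned}
\arg\max_{C_j} P(C_j \mid X) &= \arg\max_{C_j} \{ P(X \mid C_j)\,P(C_j) \} \\
&= \arg\max_{C_j} \{ P(X_1 \mid C_j)\,P(X_2 \mid C_j) \cdots P(X_N \mid C_j)\,P(C_j) \}
\end{aligned}$$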


Classification via Naïve Bayes

Assumptions -

The order of the words {Xi} in a comment does not matter given the class {Cj}

A class is determined solely on the specific words in a comment and their frequency of occurrence in that comment

Conditional Independence of the words in a comment given the class of the comment

a “bag of words model”
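A compact bag-of-words Naïve Bayes classifier in the spirit of these assumptions might look like the Python sketch below. It works in log space and uses add-one (Laplace) smoothing so that unseen words do not zero out a class; the smoothing, the tiny training set and the class labels are assumptions made for illustration and are not taken from the original implementation.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (list_of_words, class_label) pairs."""
    class_counts = Counter(label for _, label in samples)
    word_counts = defaultdict(Counter)
    for words, label in samples:
        word_counts[label].update(words)
    vocab = {w for words, _ in samples for w in words}
    return class_counts, word_counts, vocab

def classify(words, model):
    """Return the class Cj maximising log P(Cj) + sum_i log P(Xi | Cj)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        # Add-one smoothing (an assumption, not stated in the source).
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [  # hypothetical pre-classified comments
    ("hotline delay in answering".split(), "hotline"),
    ("waited long on the hotline".split(), "hotline"),
    ("very good service".split(), "praise"),
    ("good service thanks".split(), "praise"),
]
model = train(training)
print(classify("good service".split(), model))
print(classify("hotline delay".split(), model))
```

Because only word counts enter the score, shuffling the words of a comment leaves its predicted class unchanged.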


Performance of the Classification Algorithm

Accuracy greater than 75% on predicted classes

Accuracy greater than 90% on training samples

Performance will further increase with preprocessing and filtering

single word comments don’t convey meaningful category information

Use misclassified comments to “Retrain” algorithm

Key Words for classification via Association Rules


Algorithm Implementation & Results

• Algorithm designed and built from first principles using the Matlab programming language

• Local Language Support by updating Dictionary with Sinhala and Tamil words conveying emotion

• 59,000 comments analyzed and Rated for Sentiment and Classified / Binned into six categories

• Improved Classification by word relationships (key words) derived from Association Rule Mining

• 3000 Training comments used with six classes for Training Model

• Fast implementation processing all comments in a few hours

• A Word vs. Frequency Analysis used to determine which new words to add to the Dictionary

• The Sentiment rating is a means to “prioritize” the handling of the sorted and binned comments

• Performance improvement by “re-classifying” misclassified comments and reusing them in Training
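The word versus frequency analysis mentioned above can be approximated by counting the words that fail to match the dictionary, so that the most frequent ones become candidates for addition. The Python sketch below assumes whitespace tokenisation after lower-casing and punctuation removal; the `known` word set and the short comment list (drawn from the sample comments, with one repeated for illustration) are hypothetical.

```python
import string
from collections import Counter

def unmatched_word_frequencies(comments, dictionary_words):
    """Count how often each word that is missing from the dictionary occurs."""
    strip = str.maketrans("", "", string.punctuation)
    counts = Counter()
    for comment in comments:
        for word in comment.lower().translate(strip).split():
            if word not in dictionary_words:
                counts[word] += 1
    return counts

comments = ["uparimaya", "Uparima", "very good service", "Service nallam", "Uparima"]
known = {"very", "good", "service"}  # hypothetical dictionary entries
print(unmatched_word_frequencies(comments, known).most_common(3))
```

Sorting by frequency puts local-language words and common misspellings at the top of the list, which is where a dictionary update has the largest effect on the match rate.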


Conclusion

• Pre-processing – improved performance by retaining only the words and word combinations relevant to the classification and to the business purpose of the analysis

• Spelling mistakes will cause problems as words will not match those in the dictionary
• Update Dictionary with new words and misspelled words
• Introduce limits on the minimum number of words that should be matched for a comment to be analyzed – for increased reliability

• Independent Term Matching – doesn’t necessarily capture the “meaning” of a comment
• Short comments can be analyzed to assess overall sentiment

• Rate the emotional content in a comment

• Algorithm can provide other segmentations by matching words specific to the purpose of routing

• Naïve Bayes gave good classification accuracy
• The severity of sentiment in the classified comment used to prioritize comment handling

• Simple averaging of the attribute values to arrive at the combined effect of all matched words in a comment can also be considered and may give results that are not that far off from the assumption of Normality


THANK YOU