
On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter



Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining high classification performance while reducing data sparsity and substantially shrinking the feature space.


Page 1: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter

Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani

Knowledge Media Institute, The Open University,

Milton Keynes, United Kingdom

The 9th edition of the Language Resources and Evaluation Conference, Reykjavik, Iceland

Page 2: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

• Sentiment Analysis

• Twitter

• Stopwords Removal Methods

• Comparative Study

• Conclusion

Outline

Page 3: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

“Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”


The main dish was delicious → Opinion

The main dish was salty and horrible → Opinion

It is a Syrian dish → Fact

Sentiment Analysis

Page 4: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter
Page 5: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter
Page 6: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Stopwords Removal

Page 7: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Stopwords Removal in Twitter Sentiment Analysis

- Kouloumpis et al. 2011

- Pak & Paroubek, 2010

- Asiaee et al., 2012

- Bollen et al., 2011

- Bifet and Frank, 2010

- Speriosu et al., 2011

- Zhang & Yuan, 2013

- Gokulakrishnan et al 2012

- Saif et al., 2012

- Hu et al., 2013

- Camara et al., 2013

Removing stopwords is USEFUL? The cited works are split between YES and NO.

Page 8: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

• Precompiled

• Very popular

• Outdated

• Domain-Independent

Classic Stopword Lists

Page 9: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

• Unsupervised Methods

– Term Frequency

– Term-based Random Sampling

• Supervised

– Term Entropy Measures

– Maximum Likelihood Estimation

Automatic Stopwords Generation Methods
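As a rough illustration of the unsupervised, term-frequency flavour of these methods, the sketch below counts term occurrences over a tokenised tweet collection and proposes the most frequent terms as stopword candidates. The function name, input format and cut-off are illustrative assumptions, not the exact procedure of the paper.

```python
from collections import Counter

def frequency_stoplist(tweets_tokens, top_k=20):
    """Propose the top_k most frequent corpus terms as stopword candidates.

    tweets_tokens: list of token lists, one per tweet (assumed input format).
    """
    counts = Counter(token for tokens in tweets_tokens for token in tokens)
    return [term for term, _ in counts.most_common(top_k)]

# Toy usage with made-up tweets
tweets = [["the", "movie", "was", "great"],
          ["the", "service", "was", "awful"],
          ["i", "loved", "the", "movie"]]
print(frequency_stoplist(tweets, top_k=3))   # most frequent terms first, e.g. ['the', 'movie', 'was']
```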

Page 10: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Stopwords Removal for Twitter Sentiment Analysis

Page 11: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Stopword Analysis Set-Up (1)

[Figure: Proportion of negative vs. positive tweets in each dataset]

Dataset    OMD    HCR    STS    SemEval    WAB    GASP
Negative   688    957    1402   1590       2580   5235
Positive   393    397    632    3781       2915   1050

Datasets

Page 12: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Stopword Analysis Set-Up (2)

Stopwords Removal Methods

1. The Baseline Method

– (no removal of stopwords)

2. The Classic Method

– This method removes stopwords obtained from pre-compiled lists

– Here, the Van stoplist is used (a filtering sketch follows)
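A minimal sketch of the classic method, assuming tweets are already tokenised: tokens are simply filtered against a pre-compiled stoplist. The Van stoplist itself is not reproduced here; the short hard-coded list below is only a stand-in for illustration.

```python
# Classic (pre-compiled) stopword removal: filter tokens against a fixed list.
# The tiny list below is an illustrative stand-in, not the actual Van stoplist.
PRECOMPILED_STOPLIST = {"the", "a", "an", "is", "was", "it", "and", "or", "to", "of"}

def remove_stopwords(tokens, stoplist=PRECOMPILED_STOPLIST):
    """Return the tokens of a tweet with stoplist entries removed."""
    return [t for t in tokens if t.lower() not in stoplist]

print(remove_stopwords(["The", "main", "dish", "was", "delicious"]))
# -> ['main', 'dish', 'delicious']
```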

Page 13: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Stopword Analysis Set-Up (3)

Stopwords Removal Methods

3. Methods based on Zipf’s Law

- TF-High Method

Removing the most frequent words in the corpus

- TF1 Method

Removing singleton words (i.e., words that occur only once in the corpus)

- IDF Method

Removing words with a low inverse document frequency (IDF); a combined sketch of the three methods follows
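The sketch below illustrates how the three Zipf's-law-based stoplists could be derived from raw term statistics: the most frequent terms (TF-High), singleton terms (TF1), and low-IDF terms. The input format and the cut-off parameters are illustrative assumptions, not the settings used in the paper.

```python
import math
from collections import Counter

def zipf_stoplists(tweets_tokens, high_k=10, idf_threshold=1.0):
    """Derive TF-High, TF1 and low-IDF stoplists from term statistics.

    tweets_tokens: list of token lists, one per tweet (assumed input format).
    high_k and idf_threshold are illustrative parameters only.
    """
    tf = Counter(t for tokens in tweets_tokens for t in tokens)       # term frequency
    df = Counter(t for tokens in tweets_tokens for t in set(tokens))  # document frequency
    n_tweets = len(tweets_tokens)

    tf_high = {t for t, _ in tf.most_common(high_k)}                  # most frequent terms
    tf1 = {t for t, c in tf.items() if c == 1}                        # singleton terms
    low_idf = {t for t in df if math.log(n_tweets / df[t]) < idf_threshold}
    return tf_high, tf1, low_idf
```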

Page 14: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Stopword Analysis Set-Up (4)

Stopwords Removal Methods

4. Term-based Random Sampling (TBRS)

5. The Mutual Information Method (MI)
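For the MI method, one plausible supervised formulation is sketched below: each term is scored by the mutual information between its presence in a tweet and the tweet's sentiment label, and the least informative terms are proposed as stopword candidates. This is an assumed reading of an MI-based stoplist, not necessarily the paper's exact formulation; TBRS is not sketched here.

```python
import math
from collections import Counter

def mi_stoplist(tweets_tokens, labels, bottom_k=20):
    """Score each term by the mutual information between its presence in a
    tweet and the tweet's sentiment label; return the bottom_k least
    informative terms as stopword candidates (an assumed formulation).
    """
    n = len(tweets_tokens)
    label_counts = Counter(labels)
    term_counts = Counter()        # number of tweets containing the term
    joint_counts = Counter()       # number of tweets with (term, label)
    for tokens, label in zip(tweets_tokens, labels):
        for term in set(tokens):
            term_counts[term] += 1
            joint_counts[(term, label)] += 1

    scores = {}
    for term, tc in term_counts.items():
        p_present = tc / n
        mi = 0.0
        for label, lc in label_counts.items():
            p_label = lc / n
            joint = joint_counts[(term, label)]
            # contributions of the (present, label) and (absent, label) cells
            for count, p_x in ((joint, p_present), (lc - joint, 1.0 - p_present)):
                p_joint = count / n
                if p_joint > 0 and p_x > 0:
                    mi += p_joint * math.log(p_joint / (p_x * p_label))
        scores[term] = mi
    return sorted(scores, key=scores.get)[:bottom_k]
```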

Page 15: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Stopword Analysis Set-Up (5)

Twitter Sentiment Classifiers

– Two Supervised Classifiers:

• Maximum Entropy (MaxEnt)

• Naïve Bayes (NB)

– Performance measured using accuracy and F1

– 10-fold cross-validation (a minimal sketch follows)
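As a concrete (assumed) setup, the sketch below uses scikit-learn: LogisticRegression stands in for the MaxEnt classifier and MultinomialNB for NB, both evaluated with 10-fold cross-validation over simple unigram counts. The feature representation and parameters are illustrative, not the paper's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

def evaluate(tweets, labels):
    """10-fold cross-validated accuracy and F1 for NB and a MaxEnt stand-in.

    tweets: list of (already filtered) tweet strings; labels: 0/1 sentiment labels.
    """
    X = CountVectorizer().fit_transform(tweets)   # unigram counts (assumed features)
    for name, clf in [("NB", MultinomialNB()),
                      ("MaxEnt", LogisticRegression(max_iter=1000))]:
        scores = cross_validate(clf, X, labels, cv=10, scoring=("accuracy", "f1"))
        print(name,
              "accuracy=%.3f" % scores["test_accuracy"].mean(),
              "F1=%.3f" % scores["test_f1"].mean())
```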

Page 16: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Experimental Results

Assess the impact of removing stopwords by observing fluctuations in the following (see the sketch after this list):

- Classification Performance

- Feature space

- Data Sparsity
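For concreteness, here is a small sketch of how the latter two quantities might be measured from a tweet-term count matrix: the feature-space size (vocabulary size), its reduction rate relative to a baseline vocabulary, and the sparsity degree, taken here as the fraction of zero entries in the matrix. This definition of sparsity degree is an assumption and may differ from the one used in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer

def filtering_stats(tweets, baseline_vocab_size=None):
    """Feature-space size, optional reduction rate, and sparsity degree.

    The sparsity degree is taken here as the fraction of zero entries in the
    tweet-term count matrix (an assumed definition).
    """
    X = CountVectorizer().fit_transform(tweets)     # sparse tweet-term matrix
    n_tweets, vocab_size = X.shape
    sparsity = 1.0 - X.nnz / float(n_tweets * vocab_size)
    reduction = None
    if baseline_vocab_size:
        reduction = 100.0 * (1.0 - vocab_size / float(baseline_vocab_size))
    return vocab_size, reduction, sparsity
```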

Page 17: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Experimental Results (1)

1. Classification Performance

[Figure: Baseline classification performance of the MaxEnt and NB classifiers across all datasets (OMD, HCR, STS-Gold, SemEval, WAB, GASP); left panel: Accuracy (%), right panel: F1 (%)]

Page 18: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Experimental Results (2)

1. Classification Performance

[Figure: Average Accuracy (%) and F1 (%) of the MaxEnt and NB classifiers using the different stoplists (Baseline, Classic, TF1, TF-High, IDF, TBRS, MI)]

Page 19: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Experimental Results (3)

2. Feature Space

Reduction rate (%) on the feature space of the various stoplists:

Stoplist    Baseline   Classic   TF1     TF-High   IDF     TBRS   MI
Reduction   0.00       5.50      65.24   0.82      11.22   6.06   19.34

[Figure: Proportion of singleton words (TF=1) to non-singleton words (TF>1) in each dataset (OMD, HCR, STS-Gold, SemEval, WAB, GASP)]

Page 20: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Experimental Results (4)

3. Data Sparsity

[Figure: Stoplist impact on the sparsity degree of all datasets (OMD, HCR, STS-Gold, SemEval, WAB, GASP); sparsity degree ranges from roughly 0.988 to 1.000 across the Baseline, Classic, TF1, TF-High, IDF, TBRS and MI stoplists]

Page 21: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

The Ideal Stoplist (1)

• The ideal stopword removal method is the one which:

– Helps maintain a high classification performance,

– Shrinks the classifier’s feature space,

– Reduces the data sparsity,

– Has low runtime and storage complexity

– Has minimal human supervision

Page 22: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

The Ideal Stoplist (2)

Average accuracy, F1, reduction rate on the feature space, and data sparsity of the six stoplist methods. Positive sparsity values refer to an increase in the sparsity degree, while negative values refer to a decrease in the sparsity degree.

Overall Analysis Results

Page 23: On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of  Twitter

Conclusion

• We studied how six different stopword removal methods affect sentiment polarity classification on Twitter.

• The use of pre-compiled (classic) stoplists has a negative impact on the classification performance.

• The TF1 stopword removal method obtains the best trade-off:

– Reducing the feature space by nearly 65%,

– Decreasing the data sparsity degree by up to 0.37%, and

– Maintaining a high classification performance.