On Stopwords, Filtering and Data Sparsity for Sentiment Analysis of Twitter


Abstract

Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure for reducing the noise of textual data is to remove stopwords, either by using pre-compiled stopword lists or by applying more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in recent years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, dynamically generating stopword lists by removing infrequent terms that appear only once in the corpus appears to be the optimal method: it maintains high classification performance while reducing data sparsity and substantially shrinking the feature space.


Hassan Saif, Miriam Fernandez, Yulan He and Harith Alani

Knowledge Media Institute, The Open University,

Milton Keynes, United Kingdom

LREC 2014: The 9th Language Resources and Evaluation Conference, Reykjavik, Iceland

Outline

• Sentiment Analysis

• Twitter

• Stopwords Removal Methods

• Comparative Study

• Conclusion

Sentiment Analysis

“Sentiment analysis is the task of identifying positive and negative opinions, emotions and evaluations in text”

• “The main dish was delicious” → Opinion

• “It is a Syrian dish” → Fact

• “The main dish was salty and horrible” → Opinion

Stopwords Removal

Stopwords Removal in Twitter Sentiment Analysis

Is removing stopwords useful? Prior work is split between YES and NO:

- Kouloumpis et al., 2011

- Pak & Paroubek, 2010

- Asiaee et al., 2012

- Bollen et al., 2011

- Bifet and Frank, 2010

- Speriosu et al., 2011

- Zhang & Yuan, 2013

- Gokulakrishnan et al., 2012

- Saif et al., 2012

- Hu et al., 2013

- Camara et al., 2013

Classic Stopword Lists

• Pre-compiled

• Very popular

• Outdated

• Domain-independent

Automatic Stopword Generation Methods

• Unsupervised Methods

– Term Frequency

– Term-based Random Sampling

• Supervised Methods

– Term Entropy Measures

– Maximum Likelihood Estimation

Stopwords Removal for Twitter Sentiment Analysis

Stopword Analysis Set-Up (1)

Datasets

Dataset    Positive   Negative
OMD             393        688
HCR             397        957
STS             632       1402
SemEval        3781       1590
WAB            2915       2580
GASP           1050       5235

Stopword Analysis Set-Up (2)

Stopwords Removal Methods

1. The Baseline Method

– No removal of stopwords

2. The Classic Method

– Removes stopwords obtained from pre-compiled lists

– Van stoplist
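The classic method amounts to filtering tokens against a fixed list. A minimal sketch — the short list below is an illustrative stand-in for a pre-compiled stoplist such as the Van stoplist, not the actual list:

```python
# Illustrative subset of a pre-compiled stoplist (NOT the real Van stoplist).
PRECOMPILED_STOPLIST = {"the", "a", "an", "is", "was", "and", "it", "of"}

def remove_stopwords(tokens, stoplist=PRECOMPILED_STOPLIST):
    """Return only the tokens that are not in the stoplist."""
    return [t for t in tokens if t.lower() not in stoplist]

tweet = "The main dish was salty and horrible".split()
print(remove_stopwords(tweet))  # ['main', 'dish', 'salty', 'horrible']
```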

Stopword Analysis Set-Up (3)

Stopwords Removal Methods

3. Methods based on Zipf’s Law

– TF-High Method: removing the most frequent words

– TF1 Method: removing singleton words (i.e., words that occur only once in the corpus)

– IDF Method: removing words with low inverse document frequency (IDF)
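All three Zipf's-law-based stoplists can be derived directly from corpus statistics. A minimal sketch, assuming tweets are already tokenised; `high_k` and `idf_threshold` are illustrative cut-offs, not the paper's actual parameters:

```python
import math
from collections import Counter

def zipf_stoplists(tweets, high_k=10, idf_threshold=1.0):
    """Build TF1, TF-High and low-IDF stoplists from a tokenised corpus.

    Cut-off parameters are illustrative assumptions, not the paper's values.
    """
    tf = Counter(t for tweet in tweets for t in tweet)          # term frequency
    df = Counter(t for tweet in tweets for t in set(tweet))     # document frequency
    n_docs = len(tweets)

    tf1 = {t for t, c in tf.items() if c == 1}                  # singletons
    tf_high = {t for t, _ in tf.most_common(high_k)}            # most frequent words
    idf_low = {t for t, d in df.items()
               if math.log(n_docs / d) < idf_threshold}         # low-IDF words
    return tf1, tf_high, idf_low
```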

Stopword Analysis Set-Up (4)

Stopwords Removal Methods

4. Term-based Random Sampling (TBRS)

5. The Mutual Information Method (MI)
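The supervised MI method scores each term by the mutual information between the term's presence and the sentiment label; scores near zero mark candidate stopwords. A simplified, unsmoothed sketch — the paper's exact estimator is not reproduced here:

```python
import math
from collections import Counter

def term_mi(docs, labels):
    """I(term presence; class) per term, from raw counts (no smoothing)."""
    n = len(docs)
    classes = set(labels)
    df_tc = Counter((t, c) for d, c in zip(docs, labels) for t in set(d))
    df_t = Counter(t for d in docs for t in set(d))
    n_c = Counter(labels)

    mi = {}
    for t in df_t:
        score = 0.0
        for c in classes:
            for present in (True, False):
                n_tc = df_tc[(t, c)] if present else n_c[c] - df_tc[(t, c)]
                if n_tc == 0:
                    continue  # 0 * log(...) contributes 0
                p_tc = n_tc / n
                p_t = (df_t[t] if present else n - df_t[t]) / n
                p_c = n_c[c] / n
                score += p_tc * math.log(p_tc / (p_t * p_c))
        mi[t] = score
    return mi
```

A term appearing uniformly in every tweet (a stopword-like "the") gets MI ≈ 0, while a class-discriminative term gets a strictly positive score.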

Stopword Analysis Set-Up (5)

Twitter Sentiment Classifiers

– Two Supervised Classifiers:

• Maximum Entropy (MaxEnt)

• Naïve Bayes (NB)

– Performance measured in Accuracy and F1

– 10-fold cross-validation
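A bag-of-words classifier like the NB baseline can be sketched from scratch. This minimal version uses Laplace smoothing and simply skips unseen tokens at prediction time; it is an illustration, not the paper's implementation:

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal multinomial Naive Bayes over tokenised tweets (sketch)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        vocab = {t for d in docs for t in d}
        n = len(labels)
        self.prior = {c: math.log(labels.count(c) / n) for c in self.classes}
        counts = {c: Counter() for c in self.classes}
        for d, c in zip(docs, labels):
            counts[c].update(d)
        self.loglik = {}
        for c in self.classes:
            total = sum(counts[c].values()) + len(vocab)  # Laplace smoothing
            self.loglik[c] = {t: math.log((counts[c][t] + 1) / total)
                              for t in vocab}
        return self

    def predict(self, doc):
        # Unseen tokens are skipped rather than smoothed, for brevity.
        return max(self.classes,
                   key=lambda c: self.prior[c]
                   + sum(self.loglik[c].get(t, 0.0) for t in doc))
```

Filtering a stoplist out of `docs` before `fit` is exactly where the removal methods above plug into the pipeline.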

Experimental Results

Assess the impact of removing stopwords by observing fluctuations in:

- Classification Performance

- Feature space

- Data Sparsity
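Data sparsity can be quantified as the fraction of zero entries in the tweet-term matrix. The sketch below uses that common definition, which may differ in detail from the paper's formula:

```python
def sparsity_degree(docs):
    """Fraction of zero cells in the tweet-term (document-term) matrix."""
    vocab = {t for d in docs for t in d}
    if not docs or not vocab:
        return 0.0
    nonzero = sum(len(set(d)) for d in docs)   # distinct terms per tweet
    return 1.0 - nonzero / (len(docs) * len(vocab))
```

Removing terms that occur in very few tweets (e.g. singletons) shrinks the vocabulary faster than it removes non-zero cells, which is why a method like TF1 can lower the sparsity degree.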

Experimental Results (1)

1. Classification Performance

[Figure: Baseline classification performance in Accuracy (left) and F-measure (right) of the MaxEnt and NB classifiers across all datasets (OMD, HCR, STS-Gold, SemEval, WAB, GASP)]

Experimental Results (2)

1. Classification Performance

[Figure: Average Accuracy (left) and F-measure (right) of the MaxEnt and NB classifiers using the different stoplists (Baseline, Classic, TF1, TF-High, IDF, TBRS, MI)]

Experimental Results (3)

2. Feature Space

Reduction rate (%) on the feature space of the various stoplists:

Baseline   Classic   TF1     TF-High   IDF     TBRS   MI
0.00       5.50      65.24   0.82      11.22   6.06   19.34

[Figure: The proportion of singleton words (TF=1) to non-singleton words (TF>1) in all datasets (OMD, HCR, STS-Gold, SemEval, WAB, GASP)]
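The reduction rate is simply the relative shrinkage of the vocabulary after filtering. A one-line sketch with a hypothetical helper name:

```python
def reduction_rate(baseline_vocab, filtered_vocab):
    """Percentage reduction of the feature space after stopword removal."""
    return 100.0 * (1 - len(filtered_vocab) / len(baseline_vocab))
```

Since singletons dominate the vocabulary of these datasets, removing them (TF1) yields the largest reduction, the 65.24% shown above.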

Experimental Results (4)

3. Data Sparsity

[Figure: Stoplist impact on the sparsity degree of all datasets (OMD, HCR, STS-Gold, SemEval, WAB, GASP); sparsity degrees range from roughly 0.988 to 1.000 across the stoplists (Baseline, Classic, TF1, TF-High, IDF, TBRS, MI)]

The Ideal Stoplist (1)

• The ideal stopword removal method is the one which:

– Helps maintain a high classification performance,

– Leads to shrinking the classifier’s feature space

– Reduces the data sparseness

– Has low runtime and storage complexity

– Has minimal human supervision

The Ideal Stoplist (2)

Average accuracy, F1, reduction rate on the feature space and data sparsity of the six stoplist methods. Positive sparsity values refer to an increase in the sparsity degree, while negative values refer to a decrease.

Overall Analysis Results

Conclusion

• We studied how six different stopword removal methods affect the sentiment polarity classification on Twitter.

• Using a pre-compiled (classic) stoplist has a negative impact on classification performance.

• The TF1 stopword removal method obtains the best trade-off:

– Reducing the feature space by nearly 65%,

– Decreasing the data sparsity degree by up to 0.37%, and

– Maintaining a high classification performance.
