
Towards Scalable Emotion Classification in Microblog Based on Noisy Training Data

Minglei Li1, Qin Lu1, Lin Gui2, and Yunfei Long1

1 Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong
{csmli,csluqin,csylong}@comp.polyu.edu.hk
2 Laboratory of Network Oriented Intelligent Computation, Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, China
[email protected]

Abstract. The availability of labeled corpora is of great importance for emotion classification tasks. Because manual labeling is too time-consuming, hashtags have been used as naturally annotated labels to obtain a large amount of labeled training data from microblogs. However, the inconsistency and noise in such annotation can adversely affect data quality and thus the performance of a classifier trained on it. In this paper, we propose a classification framework which allows naturally annotated data to be used as additional training data and employs a k-NN graph based data cleaning method to remove noise once it has accumulated. Evaluation on the NLP&CC2013 Chinese Weibo emotion classification dataset shows that our approach achieves 15.8% better performance than directly using the noisy data without noise filtering. After adding the filtered hashtag-labeled data to an existing high-quality training set, the performance increases by 3.7% compared to using the high-quality training data alone.

Keywords: emotion classification, data cleaning, hashtag, k-NN

1 Introduction

Emotion classification from social media (such as Twitter and Sina Weibo) is becoming more and more important. Many supervised learning methods have been developed to solve this problem. However, supervised methods require a large amount of labeled training data. Obtaining labeled data manually can be quite time consuming and noise prone, especially for multi-class annotations such as subject related emotion annotation. Many research studies take advantage of the large amount of text available in social media to investigate automatic methods to obtain labeled data [1, 7, 10]. In these works, naturally annotated text features such as hashtags, emoticons and emoji characters inserted in tweets are automatically extracted from the data and then directly used as labels after some simple rule based filtering. However, these automatically obtained labels can be quite noisy. Take the following text as an example: “在你闲的时候,玩玩转发微博,未必不是一种乐趣!!!#无聊# (When you are not busy, playing with microblog retweets may be fun!!! #boring#)”. From the text we can infer that the emotion is “happy”, but the author uses the negative hashtag “boring”. As far as we know, there is not much work on handling the hashtag noise problem for emotion classification. Figure 1 shows that directly adding data that uses hashtags as emotion labels (crawled from Sina Weibo) to high quality training data (from NLP&CC2013) does not improve the system; the performance degrades continuously as more naturally annotated data are added. This indicates that without an appropriate data cleaning method, naturally annotated data may do more harm than good.
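For illustration, the kind of naturally annotated label we refer to can be read off a post with a simple rule. The sketch below assumes each hashtag word has already been mapped to one of the emotion categories; the three example mappings follow the seed words listed later in Table 1, and the helper name is hypothetical.

```python
import re

# Illustrative only: reading a #...# hashtag off a Weibo post and treating it
# as a "naturally annotated" emotion label. The mapping below is a small
# excerpt following the seed words in Table 1; the full list is much larger.
HASHTAG = re.compile(r"#([^#]+)#")
HASHTAG_TO_EMOTION = {"无聊": "disgust", "难过": "sadness", "给力": "like"}

def hashtag_label(text):
    match = HASHTAG.search(text)
    return HASHTAG_TO_EMOTION.get(match.group(1)) if match else None

post = "在你闲的时候,玩玩转发微博,未必不是一种乐趣!!!#无聊#"
print(hashtag_label(post))   # -> 'disgust', although the text itself reads as happy
```

Applied to the example above, the rule yields “disgust” even though the text reads as happy, which is exactly the kind of label noise this paper addresses.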


Fig. 1: Performance of random adding (macro F-score vs. number of added instances).

Semi-supervised learning (SSL) can make use of a small amount of labeled seed data and a large amount of unlabeled data to achieve much better performance, such as S3VMs [14]. Data cleaning, as one kind of SSL, has been used to cope with noisy training data, such as co-training [2] and CoTRADE [5]. However, these methods are mainly used in binary classification. In principle, data using automatically obtained hashtags is not unlabeled data. Rather, it is labeled training data with noise.

In this study, we focus on making use of automatically obtained labeled data for emotion classification. The main objective is to obtain more high quality labeled data to improve the performance of emotion classification. The main issue in this work is the design of data cleaning strategies that obtain high quality data from natural annotation and use it to improve classification performance. The basic idea is to train a classifier initially on high quality, manually annotated data and use this classifier to predict the noisy data. Only instances with high confidence, assessed by comparing the predicted label to the original label, are used as additional training data. As noise can accumulate after several iterations, we also make use of a graph based method to estimate the accumulated noise and remove noisy data to ensure the overall quality of the added training data. Through this study, we want to answer two questions: (1) Can the automatically extracted data containing hashtags be directly used as training data? (2) If not, can we effectively remove noise from the naturally annotated data to improve emotion classification performance?

The rest of the paper is organized as follows. Section 2 presents related work in emotion classification and data cleaning. Section 3 introduces our algorithms and strategies. Section 4 reports the evaluation results. Section 5 gives the conclusion and future work.

2 Related Work

Methods for emotion classification can be categorized into rule based methods and machine learning based methods. The former defines a set of rules to infer the emotion contained in text. The latter employs a set of features (such as BoW, N-grams, emotion lexicons) to train a classifier on annotated training data to predict the emotion of a new piece of text. One issue for machine learning based methods is how to obtain sufficient high quality training data. Currently released corpora include weblogs [12] and news headlines [13], which are all manually labeled and thus quite limited in quantity. Recently, more and more researchers have explored distant supervision methods, which build training corpora automatically from naturally annotated labels, such as microblogs that use naturally occurring emoticons, hashtags and emoji characters as emotion labels. Mohammad takes emotion linked hashtags in tweets for emotion classification and shows that hashtags are consistent enough to be used for emotion detection in tweets [6]. A similar method is used by Wang et al. to obtain a much larger dataset, and experiments show that for some minority emotions (e.g., surprise) the prediction performance is not good [10]. Furthermore, Mohammad makes use of hashtags to construct an emotion corpus, from which an emotion lexicon is extracted using PMI and then used in another domain for emotion classification [7]. Bandhakavi et al. also make use of microblogs containing hashtags to generate an emotion lexicon based on EM with a class and neutral model [1]. However, none of them addresses the issue of hashtag noise. Experiments in [10] also show that distant supervision is suitable for some emotions (e.g., happiness, sadness and anger) but less able to distinguish minority emotion labels.

Semi-supervised learning has been widely studied, for example co-training [2] and CoTRADE [5], the latter of which includes a noise detection method. Wan proposes co-cleaning and tri-cleaning data cleaning algorithms for sentiment analysis [11]. Gui et al. employ noise detection for cross-lingual opinion analysis based on label inconsistency with neighbors [3]. However, all the above works are for binary classification. No attempt has been made for multi-class emotion analysis.


3 Our Noise Handling Approach

3.1 Problem Definition

Let $(x_i, y_i)$ denote a pair of labeled data where $x_i$ is the data and $y_i$ is the label. Let $H = \{(x_1, y_1), \dots, (x_m, y_m)\}$ denote a high quality labeled dataset, $L = \{(x_{m+1}, y_{m+1}), \dots, (x_{m+n}, y_{m+n})\}$ denote a noisy labeled dataset, and $T = \{(x_{t_1}, y_{t_1}), \dots, (x_{t_h}, y_{t_h})\}$ denote the testing dataset. Here, $m$, $n$ and $h$ are the corresponding dataset sizes, and we assume $n \gg m$. Let $C = \{c_1, c_2, \dots, c_{C_N}\}$ denote the set of class labels, where $c_i \in C$ is a class label and $C_N$ is the number of classes. Our objective is to develop an algorithm $M$ to extend $H$ with $L$ and improve the performance on $T$.
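Concretely, the three datasets can be pictured as plain lists of (text, label) pairs over the emotion classes used later in the experiments. This is only a sketch of the notation, not the authors' code:

```python
from typing import List, Tuple

Instance = Tuple[str, str]            # (microblog text, emotion label)

EMOTIONS = ["like", "sadness", "disgust", "anger", "surprise", "fear", "happiness"]

H: List[Instance] = []                # high quality, manually labeled data (size m)
L: List[Instance] = []                # noisy, hashtag-labeled data (size n, with n >> m)
T: List[Instance] = []                # held-out test data (size h)
```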

3.2 Our Proposed Strategy

The basic idea of our proposed method is that a classifier initially trained on H is used iteratively to predict L, and the instances with high confidence are added as training data to retrain the classifier. Because the added training data contains noise and noise can accumulate, we also devise a method to remove added training instances whose noise exceeds a certain level. To achieve this goal, two problems need to be solved: (1) how to select instances in L to be added as training data; and (2) how to detect noisy instances that are already included in the training data.

Let $y_i$ denote the original label (hashtag) for an instance in L. To select instances from L, we train a classifier on H, predict on L, and obtain a predicted label $y'_i$ ($i \in [m+1, m+n]$) with a confidence, which can be the classifier's prediction probability. We choose those instances with $y'_i = y_i$ and the top $n_c$ confidence for each class $c$ to construct $L'$, and add $L'$ to H. In the experiments, we will see the usefulness of the original label constraint $y'_i = y_i$. To take into account the naturally imbalanced data in emotion corpora, we also control the number of added instances for each class. For the class $c_i$ with the least number of instances in H, we set the number of instances added to H in each iteration to $n_a$. Then, for any other class $c_j$, we set the added number using the following formula:

$$ n_{c_j} = n_a \, \frac{p_{c_i}}{p_{c_j}} \qquad (1) $$

Here $p_{c_i}$ and $p_{c_j}$ are the class proportions in H. This means the more instances a class has in H, the less additional training data is added for that class, which makes the data more balanced. Because of the imperfection of the classifier, noise in the added labels can accumulate. To detect noisy labels, we use a k-NN graph based method. Given an added instance $(x_i, y_i)$ from $L'$, a k-NN graph $G = \{V, E, W\}$ is constructed from $H + L'$, where the nodes are instances in $H + L'$ and the edge weights are the similarities between instances in the feature space. Based on the manifold assumption that instances with high similarity in the feature space have similar labels [15], we can measure the inconsistency between two instances from their similarity in the feature space and their difference in the label space. For each pair of $(x_i, y_i)$ and its neighbor $(x_j, y_j)$ in the graph, we compute the edge weight $\omega_{ij}$ by:

$$ \omega_{ij} = \mathrm{sim}(x_i, x_j) $$

The similarity function $\mathrm{sim}(x_i, x_j)$ can be constructed in many different ways, including distance based similarity and cosine similarity between the feature vector representations of two samples. In this work, we simply use cosine similarity as in [3]. For each pair, we define the inconsistency between two instances to be proportional to their similarity in the feature space and inversely related to their similarity in the label space, which can be calculated as:

$$ Z_{ij} = \omega_{ij} D_{c_i c_j} $$

where $D_{c_i c_j}$ is the distance between label classes $c_i$ and $c_j$. This means that, for the same non-zero label distance, the more similar two instances are in the feature space, the more inconsistent they are. When the label distance is zero, the inconsistency is zero. $D_{c_i c_j}$ is defined by the class distance matrix $M_d$, with each entry $d(p, q)$ being the distance between class $p$ and class $q$, because the probability of mislabeling differs between pairs of classes. For example, the emotion class “anger” is more likely to be labeled as “sadness” than as “happiness”, so $d(\mathrm{anger}, \mathrm{sadness}) < d(\mathrm{anger}, \mathrm{happiness})$. Each emotion can be represented as a point in the valence-arousal coordinate system [4], and from these points we obtain the corresponding class distances. Then we use the following formula to compute the label inconsistency for each vertex $i$:

$$ J_i = \sum_{j=1}^{k} Z_{ij} = \sum_{j=1}^{k} \omega_{ij} D_{c_i c_j} \qquad (2) $$

where $i$ refers to the center vertex, $j$ refers to a neighbor of $i$, and $k$ is the k-NN parameter, i.e., the number of most similar neighbors selected for node $i$. The more similar a vertex and its neighbors are in the feature space and the more distant they are in the label space, the larger the error of the label. When $J_i$ exceeds a threshold $J_{\mathrm{thresh}}$, we consider it a noisy label and remove it from the training set $L'$. Here we assume $J_i$ follows a Gaussian distribution and use the high quality dataset H to estimate its mean and variance. For each sample $(x_i, y_i)$ in H we compute $J_i$ using (2), and we obtain the mean and variance of $J$ as $\mu_J$ and $\sigma_J$. Then $J_{\mathrm{thresh}}$ can be estimated by:

$$ J_{\mathrm{thresh}} = \mu_J + a \sigma_J \qquad (3) $$


where $a$ is a parameter that controls the extent of removal. Here we set $a = 2$ because, under the Gaussian assumption, the probability of $J_i > J_{\mathrm{thresh}}$ is then 0.023, a small probability event, so we have sufficient confidence to consider such an instance a noisy label. Since different classes have different $\mu_J$ and $\sigma_J$ because of data imbalance, we compute $J_{\mathrm{thresh}}$ separately for each class. The whole algorithm is shown in Figure 2. The iteration counter in line 3 controls when noise removal is executed.


Inputs:
  H: high quality labeled data
  L: naturally labeled data (noisy training data)
  Learner: trains a multiclass classifier on a training dataset
  k: the number of nearest neighbors
  n_a: the number of instances added per iteration for the least frequent class
Outputs:
  L': filtered labeled data from L
  f: the refined multiclass classifier trained on H + L'
Procedure:
 1. L' = ∅
 2. f' = Learner(H)
 3. iteration = 20
 4. Loop = True
 5. Find the least frequent class and compute n_c for each class using (1)
 6. While (Loop):
 7.     y' = f'(L)                        % predict on the noisy data
 8.     iteration = iteration - 1
 9.     For each class c_i:
10.         Add the top n_{c_i} most confident instances in L with y'_i = y_i to L'
11.     L = L - L'
12.     If iteration = 0:                 % perform noisy data removal
13.         iteration = 20
14.         For L_k in L':
15.             Construct a k-NN graph from H + L' - L_k for L_k
16.             Compute J_i using equation (2)
17.             Compute J_thresh using (3) based on the label of L_k
18.             If J_i > J_thresh:
19.                 Delete L_k from L' and put it back into L
20.     f' = Learner(H + L')
21.     If the size of L' remains unchanged:
22.         Loop = False
23. End of while
24. Return f', L'


Fig. 2: Proposed Algorithm
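To make the procedure concrete, the following is a minimal Python/NumPy sketch of the three building blocks used in Figure 2: the per-class quota of equation (1), the confidence-based selection with the $y'_i = y_i$ constraint, and the k-NN inconsistency score and threshold of equations (2) and (3). It is not the authors' implementation: the valence-arousal coordinates are illustrative placeholders (the paper cites [4] but does not list values), and the classifier is assumed to expose prediction probabilities (e.g. scikit-learn's LogisticRegression with the liblinear solver).

```python
import numpy as np
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical valence-arousal coordinates per class, used only to derive
# the class distance D; these are placeholders, not the paper's values.
VA = {
    "happiness": (0.8, 0.5), "like": (0.6, 0.3), "surprise": (0.4, 0.7),
    "fear": (-0.6, 0.6), "anger": (-0.7, 0.7), "disgust": (-0.6, 0.3),
    "sadness": (-0.7, -0.4),
}

def class_distance(ci, cj):
    """D_{c_i c_j}: distance between two emotion classes in the valence-arousal plane."""
    return float(np.linalg.norm(np.subtract(VA[ci], VA[cj])))

def per_class_quota(H_labels, n_a):
    """Equation (1): n_{c_j} = n_a * p_{c_i} / p_{c_j}, where c_i is the least
    frequent class in H, so rarer classes receive larger quotas."""
    counts = Counter(H_labels)
    p = {c: counts[c] / len(H_labels) for c in counts}
    p_min = min(p.values())
    return {c: max(1, round(n_a * p_min / p[c])) for c in p}

def select_confident(clf, X_L, y_L, quota):
    """Selection step: keep instances whose predicted label equals the hashtag
    label (y' = y), taking the top-n_c most confident ones for each class."""
    proba = clf.predict_proba(X_L)              # prediction confidence
    pred = clf.classes_[proba.argmax(axis=1)]
    conf = proba.max(axis=1)
    selected = []
    for c, n_c in quota.items():
        idx = [i for i in range(len(y_L)) if pred[i] == c and y_L[i] == c]
        idx.sort(key=lambda i: conf[i], reverse=True)
        selected.extend(idx[:n_c])
    return selected

def inconsistency_scores(X, labels, k=9):
    """Equation (2): J_i is the sum over the k nearest neighbours of w_ij * D_{c_i c_j},
    with w_ij the cosine similarity between feature vectors."""
    S = cosine_similarity(X)
    np.fill_diagonal(S, -np.inf)                # a node is not its own neighbour
    J = np.zeros(len(labels))
    for i in range(len(labels)):
        neighbours = np.argsort(S[i])[-k:]      # k most similar instances
        J[i] = sum(S[i, j] * class_distance(labels[i], labels[j]) for j in neighbours)
    return J

def per_class_thresholds(J_H, H_labels, a=2.0):
    """Equation (3): J_thresh = mu_J + a * sigma_J, estimated on H separately per class."""
    thresholds = {}
    for c in set(H_labels):
        J_c = np.array([J_H[i] for i, y in enumerate(H_labels) if y == c])
        thresholds[c] = J_c.mean() + a * J_c.std()
    return thresholds
```

In the loop of Figure 2, select_confident would be called each iteration to grow L', while inconsistency_scores and per_class_thresholds would be called every 20 iterations to send instances with $J_i > J_{\mathrm{thresh}}$ back to L.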


4 Experiment

4.1 Experiment Setup

To evaluate our proposed approach, we take the high quality training data from the benchmark of the NLP&CC 2013 Chinese Microblog Emotion classification task³. The task is to predict the emotion label of a given microblog. The dataset has eight labels: like, disgust, happiness, sadness, anger, surprise, fear and none. There are 4,000 training instances and 10,000 testing instances in the dataset, extracted from Sina Weibo, a popular Chinese microblog social network similar to Twitter. All these data are manually labeled, so we treat them as high quality data and assume that the labels are ground truth. Since ‘none’ class data cannot be obtained through hashtags, we merge the two subsets and remove the ‘none’ class data from the training and testing datasets. Finally we obtain 7,304 high quality labeled instances, which are divided into training and testing data at a 1:1 ratio while keeping the same proportion of each class.
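A 1:1 split with the same per-class proportions can be obtained, for example, with scikit-learn's stratified splitter. This is a sketch under the assumption that texts and labels hold the 7,304 merged instances (toy placeholders are used here so the snippet runs; it is not the authors' exact split):

```python
from sklearn.model_selection import train_test_split

EMOTIONS = ["like", "sadness", "disgust", "anger", "surprise", "fear", "happiness"]
texts = [f"placeholder microblog {i}" for i in range(14)]   # stand-in for the real data
labels = EMOTIONS * 2                                       # two toy instances per class

# Stratified 1:1 split: each emotion keeps the same proportion in both halves.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0
)
```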

Emotion label (number of seed words): example seed words

Like (9): 给力(helpful), 可爱(lovely), 奋斗(strive), 喜欢(like), 赞(appraise), 爱你(love you), 相信(believe), 鼓掌(applaud), 祝愿(hope)
Sadness (27): 伤不起(can't bear the hurt), 郁闷(sadness), 哭(cry), 失望(disappointed), 心塞(heart hurt), 难过(sadness), etc.
Disgust (9): 无聊(boring), 烦躁(agitated), 嫉妒(jealous), 尴尬(embarrassment), 讨厌(dislike), 恶心(disgusting), 怀疑(suspect), 烦闷(bored), 厌恶(disgust)
Anger (27): 妈的(fuck), 无语(speechless), 气愤(angry), 恼火(anger), tmd, 你妹的(your sister), etc.
Surprise (16): 神奇(miracle), 惊呆了(shocked), 不可思议(inconceivable), 天哪(my god), 大吃一惊(shocked), etc.
Fear (11): 害怕(fearful), 紧张(nervous), 心慌(nervous), 害羞(shy), etc.
Happiness (20): 快乐(happy), 幸福(happy), 哈哈(ha-ha), 爽(so high), 感动(moved), 开心(joy), 嘻嘻(happy), 高兴(happy), 亲亲(kiss), etc.

Table 1: The Huati seed words used for crawling the microblogs; the number in parentheses is the count of seed words for each emotion label.

Our noisy data comes from Sina Weibo through the Sina Weibo Huati API⁴, using a list of seed words as hashtags (called “Huati 话题 (topic)” in Sina), such as “难过 (sad)” and “给力 (helpful)”. The seed word list is shown in Table 1.

³ http://tcci.ccf.org.cn/conference/2013/pages/page04_tdata.html
⁴ http://open.weibo.com/

After manually assigning these seed words to the seven emotion categories, we mined 173,951 microblogs and cleaned the crawled posts using the following rules: (1) remove posts whose hashtag is not at the beginning or end of the microblog; (2) remove texts containing more than one hashtag; (3) remove duplicated texts; (4) convert traditional Chinese into simplified Chinese; (5) remove texts not containing Chinese words; and (6) remove texts whose length after removing the hashtag is less than five. After the above cleaning, 48,305 microblogs remain as our additional noisy training data. The statistics of the noisy data (L) and the high quality labeled data (H) are shown in Table 2. Note that the data imbalance in both datasets is quite obvious. Taking H as an example, the ratio of the largest label set to the smallest label set is over 14.29 (like vs. fear).
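The rule-based cleaning above can be sketched as a single predicate over a crawled post. This is a hypothetical helper, not the authors' pipeline; rule (4), the traditional-to-simplified conversion, is a normalisation step rather than a filter and is omitted here.

```python
import re

HASHTAG = re.compile(r"#[^#]+#")          # Sina Weibo hashtags are wrapped in #...#
HAS_CHINESE = re.compile(r"[\u4e00-\u9fff]")

def keep_post(text, seen):
    """Return True if a crawled post survives the cleaning rules above."""
    tags = HASHTAG.findall(text)
    if len(tags) != 1:                                    # rule (2): exactly one hashtag
        return False
    if not (text.startswith(tags[0]) or text.endswith(tags[0])):
        return False                                      # rule (1): hashtag at beginning or end
    body = HASHTAG.sub("", text).strip()
    if body in seen:                                      # rule (3): drop duplicates
        return False
    seen.add(body)
    if not HAS_CHINESE.search(body):                      # rule (5): must contain Chinese
        return False
    return len(body) >= 5                                 # rule (6): at least 5 characters left

seen = set()
posts = ["#无聊# 在你闲的时候,玩玩转发微博,未必不是一种乐趣!!!", "hello #开心#"]
print([p for p in posts if keep_post(p, seen)])           # only the first post survives
```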

Dataset   Like   Sadness  Disgust  Anger  Surprise  Fear  Happiness  Total
L         5011   16877    6245     6037   1494      855   11786      48305
L(%)      10.4   34.9     12.9     12.5   3.1       1.8   24.4       100
H         2153   1129     1360     671    350       151   1486       7480
H(%)      28.8   15.1     18.2     9.0    7.1       2.0   19.9       100

Table 2: Dataset statistics for each emotion label. L: the noisy training data. H: the high quality training data. L(%): the percentage of each emotion label in L. H(%): the percentage of each emotion label in H.

The classifier used in this work is LibLinear as implemented in the Scikit-learn toolkit⁵ with default parameters. The features are bag of words and lexicon features from the Chinese emotion lexicon DUTIR⁶. For evaluation, we use the macro F-score. The k-NN parameter $k$ is set to 9 empirically and $n_a$ is set to 5.
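For illustration, the classification setup can be sketched as follows. This is an assumption-laden stand-in: the paper does not state whether its LibLinear classifier is a linear SVM or logistic regression (logistic regression is used here so that prediction probabilities are available for the confidence-based selection), and the DUTIR lexicon features and Chinese word segmentation (e.g. with jieba) are omitted; toy pre-segmented data is used so the sketch runs.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Toy, whitespace-segmented stand-ins for the real training/testing data.
train_texts = ["开心 极了", "好 难过", "真 恶心", "气死 我 了"]
train_labels = ["happiness", "sadness", "disgust", "anger"]
test_texts = ["今天 很 开心"]
test_labels = ["happiness"]

clf = make_pipeline(
    CountVectorizer(analyzer=str.split),        # bag of words over pre-segmented text
    LogisticRegression(solver="liblinear"),     # scikit-learn's liblinear-backed classifier
)
clf.fit(train_texts, train_labels)
print(f1_score(test_labels, clf.predict(test_texts), average="macro"))
```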

4.2 Result and Analysis

The first experiment tests the performance of our proposed algorithm. We label this algorithm A+O+R+, which means Adding instances, with the Original noisy label information (the $y'_i = y_i$ constraint) and with noisy instance Removal, following the algorithm in Figure 2. For comparison, we use the following methods with different addition strategies:

1. A+O−R−: It adds instances by classifier confidence but without the original noisy label information and without noisy instance removal. It is similar to the self-learning method.

2. A+O+R−: It adds instances by classifier confidence with the original noisy label constraint, but without noisy instance removal.

3. A+O−R+: It adds instances by classifier confidence with noisy instance removal, but without the original noisy label constraint.

⁵ http://scikit-learn.org/stable/
⁶ http://ir.dlut.edu.cn


4. A−O+R−: As mentioned in the introduction, it directly adds the noisy hashtag data without any filtering.

5. A−O−R−: This is the reference performance using only the original high quality data without adding any data. It serves as the baseline for measuring performance improvement.

The performance evaluation is shown in Figure 3. The horizontal line indicates the performance of the original classifier without any additional data; it is flat because it does not change with the value on the x-axis, and it serves as the yardstick for judging whether additional training data helps. Note that the three strategies A+O−R−, A+O−R+ and A−O+R− are all below A−O−R−, which means they do not help. Examining them in more detail, they all use the noisy data either without the natural label constraint or without noise removal. This clearly indicates that without noise removal, the additional data should not be used at all. Among the three, A+O−R− achieves the worst result, indicating that using noisy data directly has a severe adverse effect. It also indicates that emotion classification from microblogs is more complex because of their informality. A+O−R+ also does not use the original label information, but its performance is a little better than A+O−R− because of noisy instance removal. A−O+R− uses only the original hashtag labels of the noisy data without any inspection. It still performs better than A+O−R− and A+O−R+, which means that the hashtag labels are more reliable than the labels predicted by the classifier. On the other hand, both A+O+R− and A+O+R+ achieve similar increasing performance, much better than the baseline. The performance of A+O+R− indicates that hashtags are useful even if noise removal is not conducted. However, A+O+R− degrades as the iteration number increases because noise accumulates. A+O+R+ is the best performer and is more stable. The number of added instances under different iteration numbers is shown in Figure 4, which shows that the number of instances added by A+O+R− increases continuously while A+O+R+ controls the growth by removing some noisy instances. This further indicates that A+O+R+ achieves comparable performance with less additional data than A+O+R−.

The second experiment tests the performance after adding the cleaned data L′ obtained by different strategies to the original training data H, i.e., a classifier trained on H + L′, as given in Table 3. Note that we focus on testing the effectiveness of the proposed noise filtering algorithm, so the number of added instances differs across strategies, as given in Figure 4. A+O+R+ achieves a 3.7% improvement over the original training dataset (H), similar to A+O+R− but with less data. The other three strategies all perform worse than the baseline because of the added noise.

The third experiment tests the quality of the filtered noisy training data (L′) obtained by different strategies, i.e., the performance of a classifier trained only on L′ compared with a classifier trained on the original high quality data H, given in Table 4. Compared to H, A+O+R− achieves the best performance among all the noisy data selection methods, followed closely by A+O+R+.



Fig. 3: Performance of different addition strategies (macro F-score vs. iteration number).


Fig. 4: Added instance number vs iteration number.

They achieve 15.8% better performance than directly adding the hashtag data (A−O+R−). A−O+R− is better than A+O−R− and A+O−R+, which again indicates that the hashtag labels are more reliable than the labels predicted by the classifier.


Methods        H      A+O−R−  A+O−R+  A−O+R−  A+O+R−  A+O+R+
Macro F-score  0.428  0.388   0.388   0.403   0.440   0.444

Table 3: Performance of the classifier trained on the filtered data combined with the high quality data.

Methods        H      A+O−R−  A+O−R+  A−O+R−  A+O+R−  A+O+R+
Macro F-score  0.428  0.309   0.295   0.342   0.403   0.396

Table 4: Performance of the classifier trained only on the filtered training data.

Through the three sets of experiments, we can now answer the two questions. First, hashtags cannot be directly used as labels because of noise. Second, based on our experiments on emotion data, noisy data can be effectively cleaned by our proposed approach to improve the performance of multi-class classification.

5 Conclusion

In this paper, we have presented a framework for automatic noise removal from naturally annotated data that uses hashtags as labels for emotion classification. Experiments show that hashtags are useful, but naturally annotated data contains noise and cannot be used directly without cleaning. Our proposed algorithm combines the classifier and the hashtag labels to effectively filter out noise and obtain more high quality data, and the k-NN graph based noise removal further stabilizes this process. Evaluation on the NLP&CC2013 Chinese Weibo emotion classification dataset shows that our approach achieves 15.8% better performance than directly using the noisy data without noise filtering. After adding the filtered hashtag-labeled data to an existing high-quality training set, the performance increases by 3.7% compared to using the high-quality training data alone. Our method paves the way towards more scalable training data for emotion classification from microblogs. Our future work will focus on how to use the filtered data to further improve the performance of emotion classification.

6 Acknowledgement

We thank the anonymous reviewers for their helpful comments. The work presented in this paper is supported by The Hong Kong Polytechnic University (PolyU RTVU and CERG PolyU 15211/14E) and the National Natural Science Foundation of China (project number: 6127229).


References

1. Bandhakavi, A., Nirmalie, W., Deepak, P., Stewart, M. 2014. Generating a Word-Emotion Lexicon from #Emotional Tweets. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics, pp. 12–21.
2. Blum, A., Tom, M. 1998. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pp. 92–100. ACM.
3. Lin, G., Ruifeng, X., Qin, L. 2014. Cross-Lingual Opinion Analysis via Negative Transfer Detection. ACL.
4. Mehrabian, A. 1996. Pleasure-Arousal-Dominance: A General Framework for Describing and Measuring Individual Differences in Temperament. Current Psychology 14(4): 261–292.
5. Min-Ling, Z., Zhi-Hua, Z. 2011. CoTrade: Confident Co-Training with Data Editing. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 41(6): 1612–1626.
6. Mohammad, Saif M. 2012. #Emotional Tweets. pp. 246–255. Association for Computational Linguistics.
7. Mohammad, Saif M., Svetlana, K. 2014. Using Hashtags to Capture Fine Emotion Categories from Tweets. Computational Intelligence 31: 301–326.
8. Tao, C., Ruifeng, X., Qin, L. 2014. A Sentence Vector Based Over-Sampling Method for Imbalanced Emotion Classification. In Computational Linguistics and Intelligent Text Processing, pp. 62–72. Springer.
9. Purver, M., Stuart, B. 2012. Experimenting with Distant Supervision for Emotion Classification. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 482–491.
10. Wenbo, W., Chen, L., Krishnaprasad, T., Amit, P. S. 2012. Harnessing Twitter “Big Data” for Automatic Emotion Identification. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing, pp. 587–592.
11. Xiaojun, W. 2011. Collaborative Data Cleaning for Sentiment Classification with Noisy Training Corpus. In Advances in Knowledge Discovery and Data Mining, pp. 326–337.
12. Quan, C., Ren, F. 2009. Construction of a Blog Emotion Corpus for Chinese Emotional Expression Analysis. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 1446–1454. Association for Computational Linguistics.
13. Strapparava, C., Mihalcea, R. 2007. SemEval-2007 Task 14: Affective Text. In Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 70–74. Association for Computational Linguistics.
14. Bennett, K., Demiriz, A. 1999. Semi-Supervised Support Vector Machines. In Advances in Neural Information Processing Systems, pp. 368–374.
15. Goldberg, A. B., Zhu, X., Singh, A., Xu, Z., Nowak, R. D. 2009. Multi-Manifold Semi-Supervised Learning. In AISTATS, pp. 169–176.