Automatic term extraction of dynamically updated text collections for sentiment
classification into three classes
Yuliya Rubtsova
The A.P. Ershov Institute of Informatics Systems (IIS)
Applied problems which can be solved with sentiment classification
analysis of consumer reviews of commercial products for businesses;
recommender systems;
human-machine interfaces that adapt a computer system's behavior to the person's current emotional state;
psychological and medical diagnosis;
safety control by analyzing the behavior of mass gatherings;
assistance in carrying out investigative measures.
Most common sentiment analysis approaches
Supervised machine learning
Dictionaries and rules
Combined method
Existing corpora
Corpora of reviews that contain user ratings
Each belongs to a single subject domain (movie reviews, book reviews, gadget reviews)
News corpora (few emotional texts)
Filtration
Texts containing both positive and negative emotions;
Uninformative tweets (fewer than 40 characters long);
Copied texts and retweets.
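The three filters above can be sketched roughly as follows. This is only an illustration under stated assumptions: the emoticon marker tuples, the "RT " retweet prefix and the exact-match duplicate check are hypothetical choices made for the example, not details taken from the slides.

```python
# Hypothetical sentiment markers; the slides do not say how emotions are detected.
POSITIVE_MARKERS = (":)", ":-)", ":D")
NEGATIVE_MARKERS = (":(", ":-(")
MIN_LENGTH = 40  # tweets shorter than this are treated as uninformative


def keep_tweet(text, seen):
    """Return True if the tweet passes all three filters."""
    stripped = text.strip()
    if len(stripped) < MIN_LENGTH:
        return False  # uninformative: fewer than 40 characters
    if stripped.startswith("RT "):
        return False  # retweet (assumed to be marked with an "RT " prefix)
    has_pos = any(m in stripped for m in POSITIVE_MARKERS)
    has_neg = any(m in stripped for m in NEGATIVE_MARKERS)
    if has_pos and has_neg:
        return False  # mixed positive and negative emotion
    key = stripped.lower()
    if key in seen:
        return False  # copied text (exact duplicate, ignoring case)
    seen.add(key)
    return True


def filter_tweets(tweets):
    """Apply the filters in order, keeping the first copy of each text."""
    seen = set()
    return [t for t in tweets if keep_tweet(t, seen)]
```

Keeping the duplicate check last means a tweet rejected for another reason never enters the `seen` set.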
Corpus of short texts consists of
114 991 – positive texts
111 923 – negative texts
107 990 – neutral texts
Corpus of short texts
Collection type      Number of words    Number of unique words
Positive messages    1 559 176          150 720
Negative messages    1 445 517          191 677
Neutral messages     1 852 995          105 239
Distribution of unique terms depending on the number of tweets
Uniformity of the collections used
Word frequency distribution
Most common approaches to N-gram extraction
Manual extraction, using a thesaurus.
Term extraction based on the significance of a term for the collection
Data sets characteristics
The entire data set is known
The entire data set is available
The entire data set is static (can’t change during calculation)
When a new document is added, the document frequency of many terms must be updated and all previously generated term weights need recalibration. For N documents in a data stream, the computational complexity is O(N^2).
Human speech is constantly changing => there is a need to update emotional dictionaries
Change in vocabulary and topics discussed
[Figure: Percentage of references to the Olympic theme across all posts. February: 12.00%; August: 0.50%]
[Figure: Percentage of references to the vacation theme across all posts. February: 0.06%; August: 0.12%]
[Figure: Percentage of posts using the term “Sebyashka” (Russian for “selfie”). February: 0.00%; August: 0.02%]
Filtration
Punctuation: commas, colons, quotation marks (exclamation marks, question marks and ellipses were retained);
References to significant personalities and events;
Proper names;
Numerals;
All links were replaced with the word "Link" and treated as a single token;
Runs of several dots were replaced with an ellipsis.
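A minimal sketch of the mechanical part of this filtration, using only regular expressions. The patterns and their order are assumptions made for illustration. Proper names and references to personalities or events are not handled here, since they would need a name list or a named-entity recognizer that the slides do not describe.

```python
import re

ELLIPSIS = "\u2026"  # the single-character ellipsis

URL_RE = re.compile(r"https?://\S+")   # assumed link pattern
DOTS_RE = re.compile(r"\.{2,}")        # two or more consecutive dots
NUMERAL_RE = re.compile(r"\d+")        # digit sequences only (assumption)
PUNCT_RE = re.compile(r"[,:\"«»]")     # commas, colons, quotation marks


def preprocess(text):
    """Normalize one message; '!', '?' and ellipses are kept."""
    text = URL_RE.sub("Link", text)    # links first, so their ':' and digits survive as one token
    text = DOTS_RE.sub(ELLIPSIS, text)
    text = NUMERAL_RE.sub("", text)
    text = PUNCT_RE.sub("", text)
    return " ".join(text.split())      # collapse leftover whitespace
```

Replacing links before stripping punctuation and numerals keeps each URL from being broken into fragments.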
TF-ICF
C – number of categories,
cf – the number of categories in which the weighted term is found
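The formula itself did not survive on this slide. The sketch below assumes the plain form weight = tf × log(C / cf), by analogy with TF-IDF; the variant actually used may differ, for example in smoothing. The function name `tf_icf` and the input layout are hypothetical.

```python
import math
from collections import Counter


def tf_icf(category_tokens):
    """category_tokens maps a category name to the list of tokens in that category.

    Returns, per category, a dict of term weights under the ASSUMED form
    tf * log(C / cf).
    """
    C = len(category_tokens)  # number of categories
    tf = {cat: Counter(tokens) for cat, tokens in category_tokens.items()}
    cf = Counter()            # number of categories containing each term
    for counts in tf.values():
        cf.update(counts.keys())
    return {
        cat: {term: n * math.log(C / cf[term]) for term, n in counts.items()}
        for cat, counts in tf.items()
    }
```

Under this form a term that occurs in every category gets weight 0, because only the number of categories is needed and not the number of documents.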
TF-IDF
tf – the frequency of the term in the collection (positive or negative tweets),
T – the total number of messages in the collections,
– the number of messages in the positive and negative collections that contain the term
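As with TF-ICF, the formula is missing from the slide. The sketch below assumes the common form weight = tf × log(T / df), where `df` is a hypothetical name for the number of messages containing the term. The function name and arguments are also illustrative.

```python
import math
from collections import Counter


def tf_idf(collection, all_messages):
    """collection: tokenized messages of one class.
    all_messages: every tokenized message; it must include `collection`.

    Returns term weights under the ASSUMED form tf * log(T / df).
    """
    T = len(all_messages)  # total number of messages
    df = Counter()         # number of messages containing each term
    for message in all_messages:
        df.update(set(message))
    tf = Counter(token for message in collection for token in message)
    return {term: n * math.log(T / df[term]) for term, n in tf.items()}
```

Here `T` and every `df` value change whenever a message is added, which is why previously computed weights have to be recalculated.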
Experiments
Corpus of News texts consists of
46 339 – positive news
46 337 – negative news
46 340 – neutral news
ROMIP mixed collection consists of
543 – positive blog texts
236 – negative blog texts
103 – neutral blog texts
Reviews of books, movies, or digital cameras from blogs
Short text collection
                TF-IDF         TF-ICF
Accuracy (%)    95.5981        95.0664
Precision       0.958092631    0.953112184
Recall          0.955204837    0.94984672
F-Measure       0.956646554    0.95147665
News collection
                TF-IDF         TF-ICF
Accuracy (%)    69.8619        58.1397
Precision       0.709246342    0.61278022
Recall          0.698624505    0.581402868
F-Measure       0.703895355    0.596679322
ROMIP collection
                TF-IDF         TF-ICF
Accuracy (%)    53.9773        57.9545
Precision       0.561341047    0.558902611
Recall          0.5311636      0.535790598
F-Measure       0.545835539    0.547102625
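The reported F-measure values are consistent with the balanced F-measure, the harmonic mean of precision and recall. A short check, using the precision 0.958092631 and recall 0.955204837 reported above (the function name is illustrative):

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the TF-IDF column above.
value = f_measure(0.958092631, 0.955204837)
print(round(value, 4))
```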
Results
[Figure: Experimental results in terms of F-measure (%). Short texts: TF-IDF 95.66, TF-ICF 95.15; News: TF-IDF 70.39, TF-ICF 59.68; ROMIP: TF-IDF 54.58, TF-ICF 54.71]
The program module makes it possible to
dynamically update the unigram dictionary and recalculate term weights depending on the collection a term belongs to;
take into account lexical changes in speech over time;
investigate new terms entering the active vocabulary.
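One way such a dynamically updated dictionary could be organized is sketched below. It is not the author's module: the class name, its methods and the tf × log(C / cf) weighting are assumptions made for illustration. Only the per-category counts are stored, and weights are computed on demand, so adding a message never forces a pass over earlier weights.

```python
import math
from collections import Counter


class DynamicTermDictionary:
    """Hypothetical unigram dictionary with on-demand TF-ICF-style weights."""

    def __init__(self, categories):
        # One term-frequency counter per category (e.g. positive/negative/neutral).
        self.tf = {c: Counter() for c in categories}

    def add_message(self, category, tokens):
        """Update counts for a new message; no weights are recomputed here."""
        self.tf[category].update(tokens)

    def weight(self, term, category):
        """Weight under the ASSUMED form tf * log(C / cf), computed lazily."""
        C = len(self.tf)
        cf = sum(1 for counts in self.tf.values() if counts[term] > 0)
        if cf == 0:
            return 0.0  # term not seen in any category yet
        return self.tf[category][term] * math.log(C / cf)
```

A new term such as "selfie" simply appears in the counters the first time it is seen, and its weight changes as it spreads to other categories.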