+ TEXT MINING-BASED FORMATION OF DICTIONARIES EXPRESSING OPINIONS IN NATURAL LANGUAGES
František Dařena, Jan Žižka
Department of Informatics, Faculty of Business and Economics
Mendel University in Brno, Czech Republic
+ Introduction
Many companies collect opinions expressed
by their customers.
These opinions can hide valuable knowledge.
Discovering that knowledge manually can be a very demanding task because:
the opinion database can be very large,
the customers can use many different languages,
people tend to assess the opinions subjectively,
additional resources (like lists of positive and negative words) are sometimes needed.
+ Objective
To automatically extract words
significant for positive and negative
customers' opinions and to form
dictionaries of positive and negative
words, including the strength of their
positivity and negativity.
+ Data description
The processed data included reviews written by hotel clients and collected from publicly available sources.
The reviews were labeled as positive or negative.
Review characteristics:
more than 5,000,000 reviews,
written in more than 25 natural languages,
written only by real customers, based on real experience,
written relatively carefully but still containing errors that are typical for natural languages.
+ Review examples
Positive:
The breakfast and the very clean rooms stood out as the best features of this hotel.
Clean and moden, the great loation near station. Friendly reception!
The rooms are new. The breakfast is also great. We had a really nice stay.
Good location - very quiet and good breakfast.
Negative:
High price charged for internet access which actual cost now is extreamly low.
water in the shower did not flow away
The room was noisy and the room temperature was higher than normal.
The air conditioning wasn't working
+ Data preparation
Data collection, cleaning (removing tags and non-letter characters), converting to upper case.
Transforming into the Bag-of-Words representation, with term frequencies (TF) used as attribute values.
Removing words whose global frequency was lower than the threshold MinTF = 2.
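A rough sketch of these preparation steps follows, assuming the reviews are available as plain-text strings; the variable names (reviews, labels) and the use of scikit-learn are illustrative and not taken from the original work.

```python
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative toy data; in the original work there were millions of reviews.
reviews = [
    "Good location - very quiet and good breakfast.",
    "The room was noisy and the room temperature was higher than normal.",
]
labels = ["positive", "negative"]

def clean(text):
    """Remove tags and non-letter characters, convert to upper case."""
    text = re.sub(r"<[^>]+>", " ", text)      # strip markup tags
    text = re.sub(r"[\W\d_]+", " ", text)     # keep (Unicode) letters only
    return text.upper()

# Bag-of-Words with raw term frequencies (TF) as attribute values;
# the custom preprocessor replaces the default lowercasing.
vectorizer = CountVectorizer(preprocessor=clean)
X = vectorizer.fit_transform(reviews)
words = vectorizer.get_feature_names_out()

# Drop words whose global frequency is below the MinTF = 2 threshold.
global_tf = np.asarray(X.sum(axis=0)).ravel()
keep = np.where(global_tf >= 2)[0]
X, words = X[:, keep], words[keep]
```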
+ Data characteristics
Number of unique words for different languages (MinTF = 1)
+ Data characteristics
Number of unique words for different languages – totals for the negative and positive classes and for words appearing in both classes (MinTF = 2)
+ Finding the significant words
Significant words were discovered as the relevant
attributes used by a classification algorithm – a
decision tree generated by the C5 algorithm (by
R. Quinlan), which is based on entropy minimization.
The goal was not to achieve the best classification
accuracy (it was around 90%) but to find relevant
attributes that contribute to assigning a text to a
given class.
The significant words appeared in the nodes of the
decision tree.
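The slides use Quinlan's C5 tree generator; as a hedged sketch, an entropy-based decision tree from scikit-learn can stand in for it, continuing from the preprocessing sketch above. The words tested in the internal tree nodes are taken as the significant attributes.

```python
# Sketch only: the original work used Quinlan's C5; here an entropy-based
# scikit-learn tree stands in for it. X, words and labels come from the
# preprocessing sketch above.
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, labels)

# Significant words are those tested in the internal (non-leaf) nodes;
# scikit-learn marks leaf nodes with a negative feature index.
significant = {words[f] for f in clf.tree_.feature if f >= 0}
print(sorted(significant))
```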
+ Representing the decision tree using rules
The branches of a decision tree can be converted into
rules.
Examples:
f(word1) > 0 AND f(word2) = 0 AND f(word3) = 0 : NEG[N1; I1]
f(word4) = 0 AND f(word5) > 0 AND f(word6) > 0 : NEG[N2; I2]
f(word1) = 0 AND f(word6) > 0 : NEG[N3; I3]
Nx – the number of times the rule was used
Ix – the number of times the rule was used incorrectly
When a word appears in a rule as f(word) > 0, it contributes to the classification into the given class and is thus relevant for that class.
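Assuming the scikit-learn tree from the previous sketch, its branches can be walked from the root to each leaf to print rules in the format shown above, together with N (how many training reviews the rule covered) and I (how many of them it classified incorrectly); the helper emit_rules is illustrative, not from the original work.

```python
# Illustrative sketch: turn each root-to-leaf path of the fitted tree into
# a rule of the form used above and attach the counts N and I.
import numpy as np

tree = clf.tree_
leaf_of = clf.apply(X)                       # leaf reached by each review
correct = clf.predict(X) == np.asarray(labels)

def emit_rules(node=0, conds=()):
    f = tree.feature[node]
    if f < 0:                                # leaf node -> print the rule
        mask = leaf_of == node
        n = int(mask.sum())                  # N: times the rule was used
        i = int((mask & ~correct).sum())     # I: times it was used incorrectly
        klass = clf.classes_[int(np.argmax(tree.value[node]))]
        print(" AND ".join(conds) or "(root)", f": {klass}[{n}; {i}]")
        return
    # Assuming the split separates zero from non-zero TF (threshold 0.5),
    # the left branch means f(word) = 0 and the right branch f(word) > 0.
    word = words[f]
    emit_rules(tree.children_left[node],  conds + (f"f({word}) = 0",))
    emit_rules(tree.children_right[node], conds + (f"f({word}) > 0",))

emit_rules()
```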
+ One word in multiple paths/rules
The same word (e.g. “friendly”) can appear in multiple paths in the decision tree and contribute to classification into both classes.
+ Strength of word sentiment
The more often a word appears as relevant in rules that correctly
assign the negative (positive) class to a text, the more
negative (positive) the word is. However, it is necessary to
consider not only the absolute frequency but also the relative
accuracy.
For example, a word W1 is used 10 times for a correct and 0
times for an incorrect classification to the negative class, and
word W2 is used 30 times for a correct and 20 times for an
incorrect classification to the negative class (50 times in
total). Now, the question is which of these two words is 'more
negative'. The word W1 was used fewer times but with 100%
correctness, while the word W2 was used five times more often
but with only 60% correctness.
+ Sentiment strength weight
ww = (NC / NN) × ln(NC² + NN²) / ln(Nmax)
The weight balances the
frequency when a word was
used for classification and the
correctness of the classification.
The calculated weight then
determines the importance of a
word in relation to a given
category (positive or negative
class) – higher values mean greater relevance.
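A small numeric check of the weight on the W1/W2 example from the previous slide; reading NC as the number of correct uses, NN as the total number of uses, and Nmax as a normalizing constant (here the larger of the two usage counts) is an assumption, since the slide does not spell the symbols out.

```python
# Hedged numeric check of the weight formula, under the assumption that
# NC = correct uses, NN = total uses, and Nmax = a normalizing constant
# (taken here as the larger of the two NN values).
from math import log

def weight(nc, nn, nmax):
    """ww = (NC / NN) * ln(NC^2 + NN^2) / ln(Nmax)"""
    return (nc / nn) * log(nc**2 + nn**2) / log(nmax)

nmax = 50
print("W1:", weight(10, 10, nmax))   # 10 correct out of 10 uses
print("W2:", weight(30, 50, nmax))   # 30 correct out of 50 uses
# Under these assumptions W1 scores slightly higher than W2, i.e. its
# perfect accuracy narrowly outweighs W2's higher usage frequency.
```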
+ Results
+ Conclusions
A procedure for applying computers, machine learning, and natural language processing to automatically find significant words was presented.
Of the total number of words (80,000–200,000), only about 200–300 were identified as significant.
The procedure worked well for many languages.
Future research will focus on generating typical short phrases instead of only individual words.
The procedure might be used in marketing research or marketing intelligence, for filtering reviews, generating keyword lists, etc.