Feature extraction for sentiment analysis on Twitter data in Spanish
Victor Muniz
Research Center in Mathematics.
Monterrey, Mexico.
Victor Muniz (CIMAT Mty) Sentiment Analysis Junio 2015 1 / 33
Introduction
Sentiment Analysis focuses on automatically identifying whether a text expresses a positive, negative or neutral opinion about some topic.
Among all virtual opinion platforms, Twitter has become the most popular for sentiment analysis, for several reasons:
Availability of information
Large amount of data
Constant update
Worldwide available
Many applications:
Opinion-based marketing
Online ranking
Government and politics
Official statistics
Among many others...
One of the most popular techniques for text classification is the Bag of Words (Joachims, 1998), which constructs a Term Document Matrix based on term frequencies.
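As a minimal sketch of the idea, the following builds a term-document matrix of raw term frequencies. The whitespace tokenizer and the two toy documents are assumptions made for illustration only; they are not part of the original work.

```python
from collections import Counter

def term_document_matrix(documents):
    """Build a term-document matrix of raw term frequencies."""
    # Naive whitespace tokenization (an assumption for this sketch).
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # Rows are terms, columns are documents.
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix

# Toy documents, invented for the example.
docs = ["el cafe es bueno", "el servicio es malo malo"]
vocab, tdm = term_document_matrix(docs)
for term, row in zip(vocab, tdm):
    print(term, row)
```

Each row holds the frequency of one term across the documents, so spelling variants of the same word end up in different rows.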
However, on Twitter data, the application of this (or any) technique is not straightforward:
Andas bien loco @Telcel con la zona horaria
d tu RED, a cada rato m mueves la Hr.??
#chidotucotorreo @ServicioTelcel
http://t.co/QoOX3OCYxt
@Profeco @Tiendas_OXXO no cumple con
algunos requerimientos como tipos de
bebida falsos asi como la falta del
precio :(
No nos deja pasar el cadenero del oxxo
gooey! k pedo 100pre me pasaaaa!!!
#Queoso
Short text
Misspellings
Abbreviations and non-standard contractions
Emoticons, hashtags
Unbalanced classes
Standard preprocessing techniques are not enough on Twitter data, because we generally find variations of words with the same meaning:
pseudo-estudiantes = pseudoestudiantes = seudoestudiantes = seudestudiantes
separados = separa2
siempre = sienpre = 100pre
This problem leads to a sparse Term Document Matrix.
Bag of words is not enough. We need to incorporate contextual (a priori) information.
The challenge is to extract the main features of the tweet, which give us insight into the sentiment (polarity) of the text.
There is a lot of work on both feature extraction and classification for tweets; however, the vast majority focuses on English text.
Some previous work on lexical normalization of Spanish text has been done (Mosqueda & Moreda, 2012); however, there are important differences between countries and regions, even within the same language. This must be taken into account.
The objective of our work is to implement a normalization method for Spanish text using kernel-based methods, in order to obtain important features that can be used as input for a classification method.
Preprocessing and normalization
Normalization
Data: We obtained tweets from the Twitter API (https://dev.twitter.com/) and manually classified them according to specific topics (e.g., convenience stores, cellphone services, etc.).
Standard text preprocessing:
Convert to lowercase
Remove Spanish stopwords according to the list provided by Martin Porter's Snowball stemming project (http://snowball.tartarus.org/). We add some topic-related words.
Remove special characters: URLs, @, RT, -, :, among others
Remove repeated characters and excess whitespace
Emoticon substitution according to the list at en.wikipedia.org/wiki/List_of_emoticons. For instance:
:-)    emoticon-positivo        >:[    emoticon-negativo
:)     emoticon-positivo        =(     emoticon-negativo
:o)    emoticon-positivo        :-[    emoticon-negativo
:c)    emoticon-positivo        :-||   emoticon-muy-negativo
:-D    emoticon-muy-positivo    >:(    emoticon-muy-negativo
X-D    emoticon-muy-positivo    :|     emoticon-neutral
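The preprocessing steps listed above can be sketched roughly as follows. This is only an illustration under stated assumptions: the emoticon dictionary is a small excerpt of the table, the stopword set is a tiny hypothetical subset (with "oxxo" standing in for a topic-related word), and the regular expressions are one possible choice, not necessarily the ones used in this work.

```python
import re

# Excerpt of the emoticon table; the replacement tokens are kept as in the slides.
EMOTICONS = {
    ":-)": "emoticon-positivo",
    ":)": "emoticon-positivo",
    ":-D": "emoticon-muy-positivo",
    "=(": "emoticon-negativo",
    ">:(": "emoticon-muy-negativo",
    ":|": "emoticon-neutral",
}
# Hypothetical stopword subset; "oxxo" plays the role of a topic-related word.
STOPWORDS = {"el", "la", "de", "del", "que", "lo", "en", "a", "oxxo"}

def preprocess(tweet):
    text = re.sub(r"https?://\S+", " ", tweet)        # drop URLs
    text = re.sub(r"@\w+", " ", text)                 # drop user mentions
    text = re.sub(r"\bRT\b", " ", text)               # drop the retweet marker
    # Substitute emoticons before punctuation is stripped, longest first.
    for emo in sorted(EMOTICONS, key=len, reverse=True):
        text = text.replace(emo, " " + EMOTICONS[emo] + " ")
    text = text.lower()
    text = re.sub(r"[^\w\s-]", " ", text)             # drop other special characters
    text = re.sub(r"(?<!\w)-|-(?!\w)", " ", text)     # drop hyphens not inside a token
    text = re.sub(r"(\w)\1{2,}", r"\1", text)         # collapse runs of 3+ equal characters
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)                           # also squeezes extra whitespace

print(preprocess("@paolastonexxx eso es lo bonito... Puedo pagar en OXXO. :)"))
```

Emoticons are replaced before the punctuation filter runs; otherwise characters such as ":" and ")" would be stripped and the polarity cue lost.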
The normalization process consists of:
1 Detection of non-conventional words
2 Substitution with similar words (hopefully the correct ones in terms of linguistic meaning)
Detection of non-conventional words
We used Aspell (http://aspell.net/) with a Spanish dictionary, and we added extra terms, such as cities and localities from Mexico and others related to the topic.
For each word in the preprocessed tweet, we search for it with the Aspell API; if it does not appear, we consider the suggestions given by Aspell.
Very often, the top-ranked suggestion from Aspell is not the best choice.
Consider pseudoestudiantes
[1] "pseudo" "estudiantes" "pseudo-estudiantes"
[4] "predestinares" "predestines" "predestinases"
[7] "predestinareis" "predestinase" "predestinar"
[10] "predestinas" "predestinasteis" "predestinaste"
[13] "predestinéis" "sudestada" "predestinaras"
[16] "predestinarás" "predestinaseis" "sudestadas"
[19] "predestináis" "predestinadas" "predestinados"
[22] "predestinabas" "predestinamos"
We need to choose the appropriate word from the suggestions
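The detect-then-suggest loop can be illustrated with a stand-in. The sketch below does not call Aspell: it assumes a toy in-memory dictionary and uses Python's `difflib.get_close_matches` as a substitute for Aspell's suggestion list, so both the dictionary and the ranking are hypothetical.

```python
import difflib

# Toy dictionary standing in for the Aspell Spanish dictionary (assumption).
DICTIONARY = {"siempre", "separados", "cafe", "malo", "pseudo", "estudiantes"}

def detect_and_suggest(tokens, dictionary=DICTIONARY, n=3):
    """Return {unknown_word: [suggestions]} for tokens missing from the dictionary."""
    suggestions = {}
    for token in tokens:
        if token not in dictionary:
            # difflib ranks candidates by its own similarity ratio, not Aspell's.
            suggestions[token] = difflib.get_close_matches(
                token, sorted(dictionary), n=n, cutoff=0.6
            )
    return suggestions

result = detect_and_suggest(["siempre", "sienpre", "cafe"])
print(result)
```

Whatever produces the candidate list, the next step is the same: pick the candidate that is most similar to the unknown word under a suitable similarity measure.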
Kernel methods and "string kernels".
Let $x, z \in X$ (input space). Consider the kernel function
$$k(x, z) = \langle \phi(x), \phi(z) \rangle,$$
where $\phi$ is a map
$$\phi : x \in X \mapsto \phi(x) \in H \quad \text{(feature space)}.$$
Kernel trick (Schölkopf and Smola, 2002):
Data $X$ → Kernel $k(x, x')$ → Gram matrix $K$ → Algorithm $A$ → Decision function $f(x) = \sum_i \alpha_i k(x_i, x)$
String kernels (Lodhi 2002; Shawe-Taylor and Cristianini 2004; Watkins 2000; Herbrich 2002) provide a similarity measure between two documents $x$ and $y$.
Let $s$ be a substring. The mapping to the feature space is given by
$$\phi_s(x) = \sum_{s \in x} \lambda^{L(s_x)},$$
where $\lambda \in (0, 1)$ is a weight and $L(s_x)$ is the length spanned by the substring $s$ in the document $x$ (gaps included); the sum runs over the occurrences of $s$ in $x$.
Example: consider $s$ = car.
If $x$ = "cara", then $L(s_x) = 3$ (cara) and $\phi_s(x) = \lambda^3$.
If $x$ = "cuarto", then $L(s_x) = 4$ (cuar, skipping the u) and $\phi_s(x) = \lambda^4$.
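The feature $\phi_s(x)$ can be checked numerically with a brute-force sketch. The function name `phi` and the enumeration over index tuples are illustrative choices, assuming a non-empty $s$; this is not an efficient implementation.

```python
from itertools import combinations

def phi(s, x, lam):
    """Sum lam**span over every occurrence of s as a (possibly gapped) subsequence of x."""
    total = 0.0
    for idx in combinations(range(len(x)), len(s)):
        # idx is an increasing tuple of positions in x; check that it spells s.
        if all(x[i] == ch for i, ch in zip(idx, s)):
            span = idx[-1] - idx[0] + 1   # length covered in x, gaps included
            total += lam ** span
    return total

lam = 0.5
print(phi("car", "cara", lam))    # one occurrence with span 3
print(phi("car", "cuarto", lam))  # one occurrence with span 4
```

With $\lambda < 1$, occurrences with gaps are penalized: the same three letters contribute less in "cuarto" than in "cara".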
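The kernel (dot product) between documents $x$ and $y$ is given by
$$k_n(x, y) = \sum_{s \in \Sigma^n} \sum_{s \subset x} \sum_{s \subset y} \lambda^{L(s_x) + L(s_y)},$$
where $\Sigma^n$ is the set of all strings of length $n$ over a finite alphabet $\Sigma$.
Example: consider the words cat, car, bat and bar with $|s| = 2$:
            c-a          c-t          a-t          b-a          b-t          c-r          a-r          b-r
φ(cat)   $\lambda^2$  $\lambda^3$  $\lambda^2$  0            0            0            0            0
φ(car)   $\lambda^2$  0            0            0            0            $\lambda^3$  $\lambda^2$  0
φ(bat)   0            0            $\lambda^2$  $\lambda^2$  $\lambda^3$  0            0            0
φ(bar)   0            0            0            $\lambda^2$  0            0            $\lambda^2$  $\lambda^3$
$$k(\text{car}, \text{cat}) = \lambda^4, \qquad k(\text{car}, \text{car}) = k(\text{cat}, \text{cat}) = 2\lambda^4 + \lambda^6.$$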
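The cat/car example can be reproduced with a short brute-force sketch. The helper names `feature_map` and `subsequence_kernel` are hypothetical, and enumerating all index tuples is only practical for short words; for longer documents a dynamic-programming formulation (as in Lodhi et al.) would normally be used instead.

```python
from collections import defaultdict
from itertools import combinations

def feature_map(x, n, lam):
    """Map x to {subsequence of length n: sum of lam**span over its occurrences}."""
    features = defaultdict(float)
    for idx in combinations(range(len(x)), n):
        s = "".join(x[i] for i in idx)
        features[s] += lam ** (idx[-1] - idx[0] + 1)
    return features

def subsequence_kernel(x, y, n, lam):
    """Dot product of the two feature maps."""
    fx, fy = feature_map(x, n, lam), feature_map(y, n, lam)
    return sum(v * fy[s] for s, v in fx.items() if s in fy)

lam = 0.5
print(subsequence_kernel("car", "cat", 2, lam))  # shares only c-a: lam**4
print(subsequence_kernel("car", "car", 2, lam))  # 2*lam**4 + lam**6
```

Only the shared subsequence c-a contributes to k(car, cat), which is why the off-diagonal value is a single term.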
There are different types of string kernels (spectrum, constant, sequence, exponential, boundrange), depending on the weight λ and the substring size s.
If λ(s) is nonzero only for substrings that start and end with white space (i.e., whole words), we obtain the "bag of words" kernel.
String kernels can be computationally expensive for large documents.
Substitution of non-standard words
Kernel PCA based on Aspell suggestions for pseudoestudiantes. We used a sequence string kernel with s = 3 and λ = 0.5.
[Figure: Aspell suggestions for pseudoestudiantes (pseudo-estudiantes, predestinares, predestines, sudestada, sudestadas, among others) plotted on the 1st principal component (x-axis) and the 2nd principal component (y-axis).]
The projection of pseudoestudiantes onto the first and second principal components:
[Figure: kernel PCA plot of the Aspell suggestions (1st principal component on the x-axis, 2nd on the y-axis) with the projection of pseudoestudiantes marked.]
By using the minimum distance criterion with 3 PCs, the most similar word is pseudo-estudiantes, which is correct.
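The minimum distance criterion can be illustrated with a simplified sketch. Note the assumption: instead of projecting onto the first 3 principal components as in the slides, the code below measures distance directly in the full (normalized) feature space of a gap-weighted subsequence kernel, so it is a simplification and not the kernel PCA procedure itself. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def feature_map(x, n, lam):
    features = defaultdict(float)
    for idx in combinations(range(len(x)), n):
        features["".join(x[i] for i in idx)] += lam ** (idx[-1] - idx[0] + 1)
    return features

def kernel(x, y, n, lam):
    fx, fy = feature_map(x, n, lam), feature_map(y, n, lam)
    return sum(v * fy[s] for s, v in fx.items() if s in fy)

def normalized_kernel(x, y, n, lam):
    denom = sqrt(kernel(x, x, n, lam) * kernel(y, y, n, lam))
    return kernel(x, y, n, lam) / denom if denom > 0 else 0.0

def nearest_suggestion(word, suggestions, n=3, lam=0.5):
    """Pick the suggestion with the smallest feature-space distance to word."""
    def squared_distance(z):
        # For unit-norm feature vectors: ||phi(x) - phi(z)||^2 = 2 - 2 k(x, z).
        return 2.0 - 2.0 * normalized_kernel(word, z, n, lam)
    return min(suggestions, key=squared_distance)

candidates = ["pseudo-estudiantes", "predestinares", "sudestada"]
print(nearest_suggestion("pseudoestudiantes", candidates))
```

Normalizing the kernel keeps long candidates from dominating simply because they contain more subsequences.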
Simulation test: randomly change 1 letter in a sample of 200 words from a Spanish dictionary.
[Figure: error (y-axis, 0.0 to 0.6) versus λ (0.5, 1.1, 1.5, 2) in three panels for substring lengths 2, 3 and 4 (len2, len3, len4); methods compared: aspell, bound, constant, exp, sequence, spectrum.]
Simulation test: randomly change 2 letters in a sample of 200 words from a Spanish dictionary.
[Figure: error (y-axis, 0.00 to 0.75) versus λ (0.5, 1.1, 1.5, 2) in three panels for substring lengths 2, 3 and 4 (len2, len3, len4); methods compared: aspell, bound, constant, exp, sequence, spectrum.]
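The perturbation step of these simulation tests could look like the sketch below. It assumes that "changing a letter" means replacing it with a different letter drawn uniformly from the plain ASCII lowercase alphabet (so ñ and accented vowels are ignored); the slides do not specify these details, and the function name `perturb` is hypothetical.

```python
import random
import string

def perturb(word, n_changes, rng):
    """Replace n_changes distinct positions of word with a different random letter."""
    positions = rng.sample(range(len(word)), n_changes)
    chars = list(word)
    for pos in positions:
        # Exclude the current letter so the position really changes.
        alternatives = [c for c in string.ascii_lowercase if c != chars[pos]]
        chars[pos] = rng.choice(alternatives)
    return "".join(chars)

rng = random.Random(0)          # fixed seed so the experiment is repeatable
noisy = perturb("siempre", 2, rng)
print(noisy)
```

Each perturbed word is then normalized, and the error is the fraction of words for which the original is not recovered.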
Some results:
Original: Ese cafe del oxxo si que levanta!!! Puuues a chambearts!!!
Normalized: ese cafe si levanta pues chamba
Original: No se cual cafe sea mas malo si el del @7ElevenMexico o el de @Tiendas_OXXO pero ambos son malisiiiiimoooooosss!!! Pesima mezcla
Normalized: no se cual cafe sea mas malo si pero ambos son musimos pesima mezcla
Original: @paolastonexxx eso es lo bonito... Puedo pagar en OXXO. :)
Normalized: eso es bonito puedo pagar emoticon-positivo
Classification
We implement a classification algorithm similar to that used by Melville et al. (2013).
We apply bag of words to the preprocessed and normalized tweets (positive, negative and neutral) to obtain the relevant words for each category, using a mutual information measure (Yang and Pedersen, 1997).
We add a topic-related list of words for each category as a priori information.
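As an illustration of ranking words by mutual information, the sketch below uses the document-count estimate $I(t, c) \approx \log \frac{A \cdot N}{(A + C)(A + B)}$ described by Yang and Pedersen, where $A$ counts documents of class $c$ containing term $t$, $B$ documents of other classes containing $t$, $C$ documents of class $c$ without $t$, and $N$ all documents. Whether this exact variant was used here is an assumption, and the toy documents and class labels are invented.

```python
from math import log

def mutual_information(docs, labels, term, category):
    """Pointwise mutual information between a term and a category from document counts."""
    n = len(docs)
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == category)
    b = sum(1 for d, y in zip(docs, labels) if term in d and y != category)
    c = sum(1 for d, y in zip(docs, labels) if term not in d and y == category)
    if a == 0:
        return float("-inf")   # the term never co-occurs with the category
    return log(a * n / ((a + c) * (a + b)))

# Toy data: each document is a set of tokens; labels are invented.
docs = [{"cafe", "malo"}, {"cafe", "bueno"}, {"servicio", "malo"}, {"bonito"}]
labels = ["N", "P", "N", "P"]
print(mutual_information(docs, labels, "malo", "N"))
print(mutual_information(docs, labels, "cafe", "N"))
```

Words with the highest score for a category are kept as its relevant words; a score near zero means the word is uninformative for that category.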
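We implement a multinomial naive Bayes classifier, where the class $c$ of a tweet $d$ is given by
$$c^* = \arg\max_c \, p(c \mid d),$$
where
$$p(c \mid d) = \alpha_1 p_1(c \mid d) + \alpha_2 p_2(c \mid d),$$
$p_1(c \mid d)$, $\alpha_1$: class probabilities and weight using bag of words
$p_2(c \mid d)$, $\alpha_2$: class probabilities and weight using the topic-related words.
Each component probability $p_j$, $j = 1, 2$, has the multinomial naive Bayes form
$$p_j(c \mid d) = \frac{p(c) \prod_i p(w_i \mid c)^{n_i(d)}}{p(d)},$$
where $n_i(d)$ is the number of times the word $w_i$ appears in $d$.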
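A compact sketch of this weighted combination is given below. It makes several assumptions that the slides do not state: add-one (Laplace) smoothing, equal weights $\alpha_1 = \alpha_2 = 0.5$, and a second model that is simply a naive Bayes classifier restricted to a topic-word vocabulary. The training data and all names are invented for the example.

```python
from collections import Counter, defaultdict
from math import exp, log

def train_nb(docs, labels, vocabulary):
    """Estimate log p(c) and log p(w|c) with add-one smoothing (an assumption)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, y in zip(docs, labels):
        word_counts[y].update(t for t in tokens if t in vocabulary)
    log_prior = {c: log(k / len(docs)) for c, k in class_counts.items()}
    log_lik = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocabulary)
        log_lik[c] = {w: log((word_counts[c][w] + 1) / total) for w in vocabulary}
    return log_prior, log_lik

def posterior(tokens, model):
    """p(c|d) from p(c) * prod_i p(w_i|c)^n_i(d), normalized over the classes."""
    log_prior, log_lik = model
    # Summing over repeated tokens implements the exponent n_i(d).
    scores = {c: log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
              for c in log_prior}
    top = max(scores.values())
    unnorm = {c: exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def classify(tokens, model_bow, model_topic, alpha1=0.5, alpha2=0.5):
    p1, p2 = posterior(tokens, model_bow), posterior(tokens, model_topic)
    combined = {c: alpha1 * p1[c] + alpha2 * p2[c] for c in p1}
    return max(combined, key=combined.get), combined

docs = [["cafe", "malo"], ["servicio", "pesimo"], ["cafe", "bueno"], ["bonito", "bueno"]]
labels = ["N", "N", "P", "P"]
bow_vocab = {w for d in docs for w in d}
topic_vocab = {"malo", "pesimo", "bueno", "bonito"}   # hypothetical topic-related list
model_bow = train_nb(docs, labels, bow_vocab)
model_topic = train_nb(docs, labels, topic_vocab)
label, combined = classify(["cafe", "pesimo"], model_bow, model_topic)
print(label, combined)
```

The weights let the topic-related word list pull the decision even when the bag-of-words evidence is weak, which is the role of the a priori information.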
We use 800 previously classified tweets.
We use 80% for training and 20% for testing, following a cross-validation criterion.
The Mean Average Error (MAE) for the training data was 0.192 ± 0.015. The MAE for the test data was 0.23 ± 0.021.
But...
Confusion matrix (rows: true class; columns: estimated class). Error: 0.195
          N      O      P
P         0     15      5
O         9    479     63
N        14     28      6
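The reported error can be recomputed from the confusion matrix above. The sketch assumes the reading in which rows are the true classes in the order P, O, N and columns are the estimated classes in the order N, O, P; this ordering is inferred from the figure and is the one that reproduces the reported 0.195.

```python
rows_true = ["P", "O", "N"]   # true class of each row (assumed order)
cols_pred = ["N", "O", "P"]   # estimated class of each column (assumed order)
confusion = [
    [0, 15, 5],
    [9, 479, 63],
    [14, 28, 6],
]

def error_rate(matrix, rows, cols):
    """Fraction of tweets whose estimated class differs from the true class."""
    total = sum(sum(r) for r in matrix)
    correct = sum(matrix[i][cols.index(lab)] for i, lab in enumerate(rows))
    return 1.0 - correct / total

def recall_per_class(matrix, rows, cols):
    """Fraction of each true class that is classified correctly."""
    return {lab: matrix[i][cols.index(lab)] / sum(matrix[i]) for i, lab in enumerate(rows)}

err = error_rate(confusion, rows_true, cols_pred)
recall = recall_per_class(confusion, rows_true, cols_pred)
print(round(err, 3))
print(recall)
```

Under this reading, the overall error hides a large gap between classes: class O is recovered far more often than P or N.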
[Figure: distribution of categories (bar chart of the proportions of classes P, N and O; y-axis from 0.0 to 0.8).]
It is necessary to use classification algorithms that account for unbalanced categories.
Work in progress
We are improving tweet normalization:
Implementing the Metaphone algorithm
Testing different types of string kernels
Improving the a priori information (word lists for normalization and classification)
Hashtag information
Classification methods (SVM, Boosting)
Cost-sensitive classification
Spatial and temporal analysis of tweets
Thank you for your attention!