Feature extraction for sentiment analysis on twitter data with … · 2015. 6. 13. · Twitter has become the most popular for sentiment analysis due to several reasons: Lot of applications

Feature extraction for sentiment analysis on twitter datawith spanish language

Victor Muniz

Research Center in Mathematics.

Monterrey, Mexico.

Victor Muniz (CIMAT Mty) Sentiment Analysis Junio 2015 1 / 33

Introduction

Sentiment Analysis focuses on automatically identifying whether a textexpresses a positive, negative or neutral opinion about some topic.


Introduction

Among all virtual opinion plataforms,Twitter has become the mostpopular for sentiment analysis due toseveral reasons:

Availability of information

Large amount of data

Constant update

Worldwide available


Introduction

Among all virtual opinion plataforms,Twitter has become the mostpopular for sentiment analysis due toseveral reasons:

Lot of applications

Opinion based marketingOnline rankingGovernment and politicsOfficial statisticsAmong many others...


Introduction

One of the most popular techniques for text classification is the Bag ofWords (Joachims, 1998), which constructs a Term Document Matrixbased on term frequencies.


Introduction

However, on twitter data, the application of this (or any) technique is notstraightforward:

Andas bien loco @Telcel con la zona horaria

d tu RED, a cada rato m mueves la Hr.??

#chidotucotorreo @ServicioTelcel

http://t.co/QoOX3OCYxt

@Profeco @Tiendas_OXXO no cumple con

algunos requerimientos como tipos de

bebida falsos asi como la falta del

precio :(

No nos deja pasar el cadenero del oxxo

gooey! k pedo 100pre me pasaaaa!!!

#Queoso

Short text

Misspellings

Abbreviations andnon-standardcontractions

Emoticons, hashtags

Unbalanced classes


Introduction

Standard preprocessing techniques on twitter data are not enough,because generally we have variations of words with the same meaning:

pseudo-estudiantes = pseudoestudiantes = seudoestudiantes

= seudestudiantes

separados = separa2

siempre = sienpre = 100pre

This problem causes sparse Term Document Matrix

Bag of words it’s not enough. We need to incorporate contextual(apriori) information

The challenge is to extract the main features of the tweet,which give us insights of the sentiment (polarity) of the text


Introduction

There is a lot of work on both feature extraction and classification fortweets, however, the vast majority are focused on english text

Some previous work on lexical normalization of spanish text has beendone (Mosqueda & Moreda, 2012), however, there are importantdifferences between countries and regions, even in the same language.This must be taken into account

The objective of our work, is to implement a normalizationmethod for spanish text by using kernel-based methods, inorder to obtain important features which can be used as inputfor a classification method


Preprocessing and normalization


Normalization

Data: We obtained and manually classify tweets from the API(https://dev.twitter.com/) according to some specific topics (i.e,convenience stores, cellphone services, etc).

Standard text preprocessing:

Convert to lowercase

Remove stopwords in spanish according to the list given by MartinPorter’s Snowball stemming projecthttp://snowball.tartarus.org/. We add some words relative tothe topic.

Remove special characters: URL’s, @, RT, , -, :, among others


Normalization

Remove repeated characters and excess of white spaces

Emoticon sustitution according to the list:en.wikipedia.org/wiki/List_of_emoticons.For instance:

:-) emoticon-positivo >:[ emoticon-negativo:) emoticon-positivo =( emoticon-negativo:o) emoticon-positivo :-[ emoticon-negativo:c) emoticon-positivo :-|| emoticon-muy-negativo:-D emoticon-muy-positivo >:( emoticon-muy-negativoX-D emoticon-muy-positivo :| emoticon-neutral


Normalization

The normalisation process consists on

1 Detection of non-conventional words

2 Substitution with similar words, (hopefully the correct ones in termsof the linguistic meaning)


Normalization

Detection of non-conventinal words

We used Aspell (http://aspell.net/) with a spanish dictionary,and we added extra terms, such as cities and localities from Mexicoand other ones relative to the topic.

For each word in the preprocessed tweet, we did a search with theAspell API, and if it does not appear, we consider the options givenby Aspell.

Very often, the top ranked suggestion by Aspell is not the best choice


Normalization

Detection of non-conventinal words

Consider pseudoestudiantes

[1] "pseudo" "estudiantes" "pseudo-estudiantes"

[4] "predestinares" "predestines" "predestinases"

[7] "predestinareis" "predestinase" "predestinar"

[10] "predestinas" "predestinasteis" "predestinaste"

[13] "predestinis" "sudestada" "predestinaras"

[16] "predestinars" "predestinaseis" "sudestadas"

[19] "predestinis" "predestinadas" "predestinados"

[22] "predestinabas" "predestinamos"

We need to choose the appropriate word from the suggestions


Normalization

Kernel methods and “string kernels”.Let x, z ∈ X (input space). Consider the kernel function:

k(x, z) = 〈φ(x),φ(z)〉

where φ is a map:

φ : x ∈ X 7→ φ(x) ∈ H (feature space)

Kernel trick (Scholkopf and Smola, 2002)

Datos Kernel Matriz de Gram Algoritmo Funcion decision

f(x) =∑

αik(xi, x)X k(x, x′) AK


Normalization

String Kernels (Lodhi 2002, Shawe-Taylor y Cristianini 2004, Watkins2000, Herbrich 2002) provides a similarity measure between twodocuments x y y .Let s to be a substring. The mapping to the feature space is given by

φs(x) =∑

s∈xλL(sx ),

where λ ∈ (0, 1) es a weight and L(sx) is the length of the substring s intothe document x .Example: Consider s = car:if x =“cara”, then L(sx) = 3 (cara).

φs(x) = λ3,

if x =“cuarto”, then L(sx) = 4 (cuarto)

φs(x) = λ4.


Normalization


φs(x) =∑

s∈xλL(sx ),


φs(x) = λ3,


φs(x) = λ4.


Normalization


φs(x) =∑

s∈xλL(sx ),


φs(x) = λ3,


φs(x) = λ4.


Normalization

The kernel (dot product) between documents x and y is given by

kn(x , y) =∑

s∈Σn

∑

s⊂x

∑

s⊂yλL(sx )+L(sy ),

where Σn is the set of all substrings of size n from a finite alphabet Σ.Example: Consider the words cat, car, bat and bar con |s| = 2:

c-a c-t a-t b-a b-t c-r a-r b-r

φ(cat) λ2 λ3 λ2 0 0 0 0 0φ(car) λ2 0 0 0 0 λ3 λ2 0φ(bat) 0 0 λ2 λ2 λ3 0 0 0φ(bar) 0 0 0 λ2 0 0 λ2 λ3

k(car , cat) = λ4, k(car , car) = k(cat, cat) = 2λ4 + λ6.


Normalization


kn(x , y) =∑

s∈Σn

∑

s⊂x

∑







Normalization


kn(x , y) =∑

s∈Σn

∑

s⊂x

∑







Normalization

There are different types of string kernels (spectrum, constant,sequence, exponential, boundrage), depending on the weight λ andthe substring size s.

If λ(s) = 0 for substrings starting and ending with white space, weobtain the “bag of words” kernel.

It can be computationally expensive for large documents


Normalization

Substitution of non-standard words

Kernel PCA based on Aspell suggestions for pseudoestudiantes.We used a sequence string kernel with s = 3 and λ = 0.5.

●

●

●

●

●

●● ●●●

● ●

●

●

●●●

●

●

●

●●●

−4 −3 −2 −1 0 1

−2

−1

01

23

4

1st Principal Component

2nd

Prin

cipa

l Com

pone

nt

pseudoestudiantes

pseudo−estudiantes

predestinarespredestines

predestinasespredestinareispredestinasepredestinarpredestinaspredestinasteispredestinastepredestinéis

sudestada

predestinaraspredestinaráspredestinaseis

sudestadas

predestináis

predestinadas

predestinadospredestinabaspredestinamos


Normalization

Substitution of non-standard words

The projection of pseudoestudiantes in the first and secondPrincipal Components is

●

●

●

●

●

●● ●●●

● ●

●

●

●●●

●

●

●

●●●

−4 −3 −2 −1 0 1

−2

−1

01

23

4

1st Principal Component

2nd

Prin

cipa

l Com

pone

nt

pseudoestudiantes

pseudo−estudiantes

predestinarespredestines

predestinasespredestinareispredestinasepredestinarpredestinaspredestinasteispredestinastepredestinéis

sudestada

predestinaraspredestinaráspredestinaseis

sudestadas

predestináis

predestinadas

predestinadospredestinabaspredestinamos

pseudoestudiantes

By using the minimum distance criterion with 3 PC, the most similarword is pseudo-estudiantes, which is correct.


Normalization

Simulation test: randomly change 1 letter in a sample of 200 words fromspanish dictionary

len2 len3 len4

0.0

0.2

0.4

0.6

lam 0.5 lam 1.1 lam 1.5 lam 2 lam 0.5 lam 1.1 lam 1.5 lam 2 lam 0.5 lam 1.1 lam 1.5 lam 2val.lam

erro

r

factor(metodos)

aspell

bound

constant

exp

sequence

spectrum


Normalization

Simulation test: randomly change 2 letters in a sample of 200 words fromspanish dictionary

len2 len3 len4

0.00

0.25

0.50

0.75

lam 0.5 lam 1.1 lam 1.5 lam 2 lam 0.5 lam 1.1 lam 1.5 lam 2 lam 0.5 lam 1.1 lam 1.5 lam 2val.lam

erro

r

factor(metodos)

aspell

bound

constant

exp

sequence

spectrum


Normalization

Some results:

Original

Ese cafe del oxxo si que levanta!!!

Puuues a chambearts!!!

No se cual cafe sea mas malo

si el del @7ElevenMexico o el

de @Tiendas_OXXO pero ambos

son malisiiiiimoooooosss!!!

Pesima mezcla

@paolastonexxx eso es

lo bonito... Puedo pagar

en OXXO. :)

Normalized

ese cafe si levanta pues

chamba

no se cual cafe

sea mas malo si

pero ambos son

musimos pesima

mezcla

eso es bonito puedo

pagar

emoticon-positivo


Classification


Classification

We implement a classification algorithm similar to that used by Melville etal. (2013).

We used bag of words in the preprocessed and normalized tweets(positive, negative and neutral), to obtain relevant words for eachcategory by using a mutual information measure (Yang and Pedersen,1997).

We add a topic-related list of words for each category as an aprioriinformation.


Classification

We implement a multinomial naive Bayesian classifier, where the classc of a tweet is given by

c∗ = argmaxcp(c|d),

wherep(c |d) = α1p1(c |d) + α2p2(c |d),

p1(c |d), α1: class probabilities and weight using bag of wordsp2(c |d), α2: class probabilities and weight using the topic-relatedwords. And

p(c |d) =p(c)

∑p(w |c)ni (d)

p(d)


Classification

We use 800 tweets previously classified.

We use 80% for training and 20% for testing by using a CrossValidation criteria

The Mean Average Error (MAE) for the training data was0.192± .015. The MAE for test data was 0.23± .021

But...


Classification




But...


Classification




But...


Classification




But...


Classification

●

Error: 0.195

clase estimada

clas

e re

al

0 15 5

9 479 63

14 28 6

N O P

P

O

N


Classification

P N O

Distribucion de categorias

0.0

0.2

0.4

0.6

0.8

It is necessary to use classification algorithms sensitive to unbalancedcategories


Classification

P N O

Distribucion de categorias

0.0

0.2

0.4

0.6

0.8

It is necessary to use classification algorithms sensitive to unbalancedcategories


Work in progress


Work in progress

We are improving tweets normalization:

Implementing the methaphone algorithmTesting different types of string kernelsImproving the apriori information (word list for normalization andclassification)

Hashtags information

Classification methods (SVM, Boosting)

Cost sensitive classification

Spatial and temporal analysis of tweets


Thank you for your attention!


Documents

Feature extraction for sentiment analysis on twitter data with … · 2015. 6. 13. · Twitter has become the most popular for sentiment analysis due to several reasons: Lot of applications