
OpenEdition Lab projects in Text Mining


Presentation of the OpenEdition Lab projects in Text Mining, Information Retrieval and Natural Language Processing


Page 1: OpenEdition Lab projects in Text Mining

OPENEDITION LAB TEXT MINING PROJECTS

Patrice Bellot, Aix-Marseille Université, CNRS (LSIS UMR 7296; OpenEdition), patrice.bellot@univ-amu.fr

LSIS, DIMAG team: http://www.lsis.org/spip.php?id_rubrique=291 | OpenEdition Lab: http://lab.hypotheses.org

Page 2: OpenEdition Lab projects in Text Mining

Hypotheses: 600+ blogs

Revues.org: 300+ journals

Calenda: 20,000+ events

OpenEdition Books: 1,000+ books

A European Web platform for the humanities and social sciences

A digital infrastructure for open access

A lab for experimenting with new Text Mining and IR systems

Page 3: OpenEdition Lab projects in Text Mining

OpenEdition, a Facility of Excellence


2012-2020: €7 million

Objectives: 15,000+ books, 2,000+ blogs, Freemium model, multilingual

Page 4: OpenEdition Lab projects in Text Mining


OpenEdition Lab — Our Team

Directors: Patrice Bellot (Professor in Comp. Sc. / NLP / IR), Marin Dacos (Head of OpenEdition)

Engineers: Elodie Faath, Arnaud Cordier

PhD Students: Hussam Hamdan, Chahinez Benkoussas, Anaïs Ollagnier

Post-docs: Young-Min Kim (2012-13), Shereen Albitar (2014)


http://lab.hypotheses.org

Page 5: OpenEdition Lab projects in Text Mining

• 220 learned societies and centers (France)
• 30 university presses (France, UK, Belgium, Switzerland, Canada, Mexico, Hungary/USA)

• CCSD, France, Lyon (HAL / DataCenter)
• CHNM, USA, Washington
• OAPEN, NL, The Hague
• UNED, Spain (Universidad Nacional de Educación a Distancia)
• Fundação Calouste Gulbenkian, Portugal
• Max Weber Stiftung, Germany
• Google, USA (Google Grants for DH)
• DARIAH, Europe

Our partners

And  you?

Page 6: OpenEdition Lab projects in Text Mining


OpenEdition Lab (Text Mining Projects for DL)

Aims to:

— Link papers, books and blogs automatically (reference analysis, named entities…)

— Detect hot topics, hot books, hot papers: content-oriented analysis (not only from logs), sentiment analysis, finding and analyzing book reviews

— Book searching with complex and long queries

— Reading recommendation


Page 7: OpenEdition Lab projects in Text Mining

Project 1: BILBO

EN SVM CRF

Natural  Language  Processing  /  Text  Mining  /  Information  Retrieval  /  Machine  Learning

Page 8: OpenEdition Lab projects in Text Mining


Project 2: ECHO

— Automatic detection of book reviews (LREC 2014)

— Linking reviews to books (BILBO)

— Web search and sentiment analysis (NAACL / SemEval 2013)

— Measuring the echo: logs, metrics…

Page 9: OpenEdition Lab projects in Text Mining


Project 3: COOKER

— BILBO and ECHO feed a content graph (then a hypergraph)

— Automatic classification and metadata (topics, languages, authors…)

— Recommendation: COOKER

Page 10: OpenEdition Lab projects in Text Mining


Semantic Annotation of Bib. References


Page 11: OpenEdition Lab projects in Text Mining


A: references in a specific section

Page 12: OpenEdition Lab projects in Text Mining

B: references in notes

C: references in the body

Page 13: OpenEdition Lab projects in Text Mining


BILBO: A Software for Annotating Bibliographical References in Digital Humanities

Google Digital Humanities Research Awards (2011, 2012)

State of the art:
— The CiteSeer system (Giles et al., 1998) in computer science: 80% precision for authors and 40% for pages. Conditional Random Fields (CRFs) (Peng et al., 2006; Lafferty et al., 2001) for scientific articles: 95% average precision (99% for authors, 95% for titles, 85% for editors).

— These tools run on the cover page (title) and/or on the reference section at the end of papers: not on footnotes, not on the text body.

— Not very robust (in the real world, stylesheets are missing or poorly respected).


Page 14: OpenEdition Lab projects in Text Mining


References at three levels. Architecture: source XHTML (TEI guidelines), TXT, learning and automatic annotation, estimated XML files (OpenEdition).

• Three platforms in the humanities and social sciences
• Revues.org online journals: 340 journals, various reference formats, 20 different languages (90% in French)

BILBO Automatic Annotation of Bibliographical References

BILBO

• Automatic reference annotation software
• CLEO's OpenEdition platform
• Unstructured and scattered reference data
• Prototype development, Web service; source code will be distributed (GPL)
• Google Digital Humanities Research Awards ('10, '11)

Part of Equipex future investment award: DILOH (’12)

Young-Min Kim, Jade Tavernier LIA, University of Avignon

84911 Avignon, France {young-min.kim, jade.tavernier}@univ-avignon.fr

Elodie Faath, Marin Dacos

CLEO, Centre for Open Electronic Publishing 13331 Marseille, France

{elodie.faath, marin.dacos}@revues.org

Patrice Bellot LSIS, Aix-Marseille University

13397 Marseille, France [email protected]

http://lia.univ-avignon.fr http://www.lsis.org http://cleo.cnrs.fr

Web service: plain-text input, automatic annotation result, DOI extraction (via crossref.org)

Future platform

auto-annotation in xml files

learning data

a learned CRF

article in Revues.org

other articles, blog postings

recognized field

Machine learning techniques on title and target context

Level 1

Level 2 Learning data

Tokenizer, Extractor

New data

Tokenizer, Extractor

Manual annotation

External machine learning modules

Level 1 model

Level 2 model

Level 3 model

Machine learning modules

Mallet, SVMlight Conditional Random Fields

Automatic annotator Call a model

Level 1 Level 2

Bibliography Notes Implicit References

Level 3

Comparison with other online tools. New data: reference data from the library of the University of Michigan, 150 <bibl> (level 1), among them 70 <note> (level 2). Test without new learning (level 1 model). True-token, annotation-based evaluation.

BILBO vs. Grobid, Freecite, Biblio. Research blog (all information about BILBO): http://bilbo.hypotheses.org/

Tool (overall accuracy / title recall / author recall):
BILBO: 68% / 73% / 87%
Grobid: 33% / 55% / 7%
Freecite: 40% / 52% / 26%
Biblio: 26% / 50% / 5%

Kim  &  Bellot,  2012

Page 15: OpenEdition Lab projects in Text Mining


Conditional Random Fields for IE

— A discriminative model specified over a graph that encodes the conditional dependencies (relationships between observations)

— Can be employed for sequence labeling (linear-chain CRF)

— Takes context into account

— The probability of a label sequence y given an observation sequence x is p(y \mid x) = \frac{1}{Z(x)} \exp\big(\sum_j \lambda_j F_j(y, x)\big), with F the (rich) feature functions (transition and state functions). Parameters must be estimated using an iterative technique such as iterative scaling or gradient-based methods (a toy tagging sketch with an off-the-shelf CRF package follows the citation below).


Lafferty et al., "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", Proceedings of the 18th International Conference on Machine Learning (ICML 2001).
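To make the linear-chain CRF labeling concrete, here is a minimal, hedged sketch of tagging the tokens of a bibliographic reference with the sklearn-crfsuite package. This is an illustration only, not BILBO's actual pipeline (the poster above cites Mallet and SVMlight as the learning modules); the feature names, labels and toy reference are invented for the example.

# Minimal linear-chain CRF sketch for labeling tokens of a bibliographic
# reference (illustrative only; BILBO itself relies on Mallet/SVMlight).
import sklearn_crfsuite

def token_features(tokens, i):
    """Simple per-token features; a real system uses the richer set of Table V."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_digits": tok.isdigit(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

# One hand-labeled training reference (hypothetical toy data).
tokens = ["Durkheim", ",", "E.", ",", "1912", ",", "Les", "formes", "élémentaires"]
labels = ["surname", "c", "forename", "c", "date", "c", "title", "title", "title"]

X_train = [[token_features(tokens, i) for i in range(len(tokens))]]
y_train = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)

new_tokens = ["Mauss", ",", "M.", ",", "1925", ",", "Essai", "sur", "le", "don"]
X_new = [[token_features(new_tokens, i) for i in range(len(new_tokens))]]
print(crf.predict(X_new)[0])   # one predicted field label per token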

[Figure 2: Graphical structures of simple HMMs (left), MEMMs (center), and the chain-structured case of CRFs (right) for sequences. An open circle indicates that the variable is not generated by the model.]

…In addition, the features do not need to completely specify a state or observation, so one might expect that the model can be estimated from less training data. Another attractive property is the convexity of the loss function; indeed, CRFs share all of the convexity properties of general maximum entropy models.

For the remainder of the paper we assume that the dependencies of Y, conditioned on X, form a chain. To simplify some expressions, we add special start and stop states Y_0 = start and Y_{n+1} = stop. Thus, we will be using the graphical structure shown in Figure 2. For a chain structure, the conditional probability of a label sequence can be expressed concisely in matrix form, which will be useful in describing the parameter estimation and inference algorithms in Section 4. Suppose that p_\theta(Y \mid X) is a CRF given by (1). For each position i in the observation sequence x, we define the |Y| \times |Y| matrix random variable M_i(x) = [M_i(y', y \mid x)] by

M_i(y', y \mid x) = \exp\big(\Lambda_i(y', y \mid x)\big),
\Lambda_i(y', y \mid x) = \sum_k \lambda_k f_k(e_i, Y|_{e_i} = (y', y), x) + \sum_k \mu_k g_k(v_i, Y|_{v_i} = y, x),

where e_i is the edge with labels (Y_{i-1}, Y_i) and v_i is the vertex with label Y_i. In contrast to generative models, conditional models like CRFs do not need to enumerate over all possible observation sequences x, and therefore these matrices can be computed directly as needed from a given training or test observation sequence x and the parameter vector \theta. Then the normalization (partition function) Z_\theta(x) is the (start, stop) entry of the product of these matrices:

Z_\theta(x) = \big(M_1(x)\, M_2(x) \cdots M_{n+1}(x)\big)_{start,\,stop}.

Using this notation, the conditional probability of a label sequence y is written as

p_\theta(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{\big(\prod_{i=1}^{n+1} M_i(x)\big)_{start,\,stop}},

where y_0 = start and y_{n+1} = stop.
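A quick numeric sketch of the matrix form above: for a toy label set we build the matrices M_i(x), take the (start, stop) entry of their product as Z_theta(x), and score one label sequence. The entries are random stand-ins for real weighted feature sums, so this illustrates only the bookkeeping, not a trained model.

# Toy illustration of the matrix form of a linear-chain CRF:
# Z(x) = (M_1(x) M_2(x) ... M_{n+1}(x))_{start,stop} and
# p(y|x) = prod_i M_i(y_{i-1}, y_i | x) / Z(x).
import numpy as np

states = ["start", "A", "B", "stop"]
idx = {s: k for k, s in enumerate(states)}
n = 3  # length of the toy observation sequence

rng = np.random.default_rng(0)
# Lambda_i(y', y | x) would normally come from weighted feature functions;
# here we just draw arbitrary scores for illustration.
Lambdas = [rng.normal(size=(len(states), len(states))) for _ in range(n + 1)]
Ms = [np.exp(L) for L in Lambdas]   # M_i(y', y | x) = exp(Lambda_i(y', y | x))

# Partition function: the (start, stop) entry of the matrix product.
prod = np.linalg.multi_dot(Ms)
Z = prod[idx["start"], idx["stop"]]

# Unnormalized score and probability of one label sequence y = (A, B, A).
y = ["start", "A", "B", "A", "stop"]
score = np.prod([Ms[i][idx[y[i]], idx[y[i + 1]]] for i in range(n + 1)])
print("p(y|x) =", score / Z)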

4. Parameter Estimation for CRFs

We now describe two iterative scaling algorithms to find the parameter vector \theta that maximizes the log-likelihood of the training data. Both algorithms are based on the improved iterative scaling (IIS) algorithm of Della Pietra et al. (1997); the proof technique based on auxiliary functions can be extended to show convergence of the algorithms for CRFs.

Iterative scaling algorithms update the weights as \lambda_k \leftarrow \lambda_k + \delta\lambda_k and \mu_k \leftarrow \mu_k + \delta\mu_k for appropriately chosen \delta\lambda_k and \delta\mu_k. In particular, the IIS update \delta\lambda_k for an edge feature f_k is the solution of

\tilde{E}[f_k] \;\overset{\text{def}}{=}\; \sum_{x,y} \tilde{p}(x, y) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)
\;=\; \sum_{x,y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(e_i, y|_{e_i}, x)\, e^{\delta\lambda_k T(x, y)},

where T(x, y) is the total feature count

T(x, y) \overset{\text{def}}{=} \sum_{i,k} f_k(e_i, y|_{e_i}, x) + \sum_{i,k} g_k(v_i, y|_{v_i}, x).

The equations for vertex feature updates \delta\mu_k have similar form.

However, efficiently computing the exponential sums on the right-hand sides of these equations is problematic, because T(x, y) is a global property of (x, y), and dynamic programming will sum over sequences with potentially varying T. To deal with this, the first algorithm, Algorithm S, uses a "slack feature." The second, Algorithm T, keeps track of partial T totals.

For Algorithm S, we define the slack feature by

s(x, y) \overset{\text{def}}{=} S - \sum_i \sum_k f_k(e_i, y|_{e_i}, x) - \sum_i \sum_k g_k(v_i, y|_{v_i}, x),

where S is a constant chosen so that s(x^{(i)}, y) \geq 0 for all y and all observation vectors x^{(i)} in the training set, thus making T(x, y) = S. Feature s is "global," that is, it does not correspond to any particular edge or vertex.

For each index i = 0, \dots, n+1 we now define the forward vectors \alpha_i(x) with base case

\alpha_0(y \mid x) = 1 if y = start, and 0 otherwise.

3 Conditional Random Fields

Lafferty et al. [8] define the probability of a particular label sequence y given observation sequence x to be a normalized product of potential functions, each of the form

\exp\Big( \sum_j \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_k \mu_k s_k(y_i, x, i) \Big), (2)

where t_j(y_{i-1}, y_i, x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i-1 in the label sequence; s_k(y_i, x, i) is a state feature function of the label at position i and the observation sequence; and \lambda_j and \mu_k are parameters to be estimated from training data.

When defining feature functions, we construct a set of real-valued features b(x, i) of the observation to express some characteristic of the empirical distribution of the training data that should also hold of the model distribution. An example of such a feature is

b(x, i) = 1 if the observation at position i is the word "September", and 0 otherwise.

Each feature function takes on the value of one of these real-valued observation features b(x, i) if the current state (in the case of a state function) or the previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

t_j(y_{i-1}, y_i, x, i) = b(x, i) if y_{i-1} = IN and y_i = NNP, and 0 otherwise.

In the remainder of this report, notation is simplified by writing

s(y_i, x, i) = s(y_{i-1}, y_i, x, i)

and

F_j(y, x) = \sum_{i=1}^{n} f_j(y_{i-1}, y_i, x, i),

where each f_j(y_{i-1}, y_i, x, i) is either a state function s(y_{i-1}, y_i, x, i) or a transition function t(y_{i-1}, y_i, x, i). This allows the probability of a label sequence y given an observation sequence x to be written as

p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_j \lambda_j F_j(y, x) \Big). (3)

Z(x) is a normalization factor.
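To make equations (2) and (3) concrete, the small sketch below defines an observation feature b(x, i), a transition feature t_j built on it, and the summed global feature F_j(y, x); the example sentence and part-of-speech tags are illustrative only.

# Feature functions in the sense of equations (2)-(3): an observation
# feature b(x, i), a transition feature t_j, and the global feature
# F_j(y, x) obtained by summing over positions.

def b(x, i):
    # 1 if the observation at position i is the word "September", else 0
    return 1.0 if x[i] == "September" else 0.0

def t_j(y_prev, y_cur, x, i):
    # fires only for the tag transition IN -> NNP, weighted by b(x, i)
    return b(x, i) if (y_prev == "IN" and y_cur == "NNP") else 0.0

def F_j(y, x):
    # F_j(y, x) = sum_i f_j(y_{i-1}, y_i, x, i); here f_j is the transition t_j
    return sum(t_j(y[i - 1], y[i], x, i) for i in range(1, len(x)))

x = ["They", "met", "in", "September", "in", "Paris"]
y = ["PRP", "VBD", "IN", "NNP", "IN", "NNP"]
print(F_j(y, x))  # 1.0: the IN -> NNP transition at "September" fires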


1.2 Graphical Models

[Figure 1.2: Diagram of the relationship between naive Bayes, logistic regression, HMMs, linear-chain CRFs, generative directed models, and general CRFs, along sequence / general-graph and generative / conditional axes.]

Furthermore, even when naive Bayes has good classification accuracy, its probability estimates tend to be poor. To understand why, imagine training naive Bayes on a data set in which all the features are repeated, that is, x = (x_1, x_1, x_2, x_2, \dots, x_K, x_K). This will increase the confidence of the naive Bayes probability estimates, even though no new information has been added to the data. Assumptions like naive Bayes can be especially problematic when we generalize to sequence models, because inference essentially combines evidence from different parts of the model. If probability estimates at a local level are overconfident, it might be difficult to combine them sensibly. Actually, the difference in performance between naive Bayes and logistic regression is due only to the fact that the first is generative and the second discriminative; the two classifiers are, for discrete input, identical in all other respects. Naive Bayes and logistic regression consider the same hypothesis space, in the sense that any logistic regression classifier can be converted into a naive Bayes classifier with the same decision boundary, and vice versa. Another way of saying this is that the naive Bayes model (1.5) defines the same family of distributions as the logistic regression model (1.7), if we interpret it generatively as

p(y, x) = \frac{\exp\{\sum_k \lambda_k f_k(y, x)\}}{\sum_{\tilde{y}, \tilde{x}} \exp\{\sum_k \lambda_k f_k(\tilde{y}, \tilde{x})\}}. (1.9)

This means that if the naive Bayes model (1.5) is trained to maximize the conditional likelihood, we recover the same classifier as from logistic regression. Conversely, if the logistic regression model is interpreted generatively, as in (1.9), and is trained to maximize the joint likelihood p(y, x), then we recover the same classifier as from naive Bayes. In the terminology of Ng and Jordan [2002], naive Bayes and logistic regression form a generative-discriminative pair. The principal advantage of discriminative modeling is that it is better suited to…

1.3 Linear-Chain Conditional Random Fields

[Figure 1.3: Graphical model of an HMM-like linear-chain CRF. Figure 1.4: Graphical model of a linear-chain CRF in which the transition score depends on the current observation.]

In the previous section, we have seen advantages both to discriminative modeling and sequence modeling. So it makes sense to combine the two. This yields a linear-chain CRF, which we describe in this section. First, in Section 1.3.1, we define linear-chain CRFs, motivating them from HMMs. Then, we discuss parameter estimation (Section 1.3.2) and inference (Section 1.3.3) in linear-chain CRFs.

1.3.1 From HMMs to CRFs

To motivate our introduction of linear-chain conditional random fields, we begin by considering the conditional distribution p(y|x) that follows from the joint distribution p(y, x) of an HMM. The key point is that this conditional distribution is in fact a conditional random field with a particular choice of feature functions. First, we rewrite the HMM joint (1.8) in a form that is more amenable to generalization. This is

p(y, x) = \frac{1}{Z} \exp\Big\{ \sum_t \sum_{i,j \in S} \lambda_{ij}\, 1\{y_t = i\}\, 1\{y_{t-1} = j\} + \sum_t \sum_{i \in S} \sum_{o \in O} \mu_{oi}\, 1\{y_t = i\}\, 1\{x_t = o\} \Big\}, (1.13)

where \theta = \{\lambda_{ij}, \mu_{oi}\} are the parameters of the distribution, and can be any real numbers. Every HMM can be written in this form, as can be seen simply by setting \lambda_{ij} = \log p(y' = i \mid y = j) and so on. Because we do not require the parameters to be log probabilities, we are no longer guaranteed that the distribution sums to 1, unless we explicitly enforce this by using a normalization constant Z. Despite this added flexibility, it can be shown that (1.13) describes exactly the class of HMMs in (1.8); we have added flexibility to the parameterization, but we have not added any distributions to the family.

Page 16: OpenEdition Lab projects in Text Mining


Table V: Verified input, local and global features. The features selected in BILBO are written in black, and the non-selected ones in gray.

Input features:
- Raw input token (I1): the tokenized word itself in the input string, and the lowercased word
- Preceding or following tokens (I2): the three preceding and three following tokens of the current token
- N-gram (I3): attachment of preceding or following N-gram tokens
- Prefix/suffix at character level (I4): 8 different prefixes/suffixes, as in [Councill et al. 2008]

Local features:
- Number (F1): ALLNUMBERS (all characters are numbers, e.g. 1984); NUMBERS (one or more characters are numbers, e.g. in-4); DASH (one or more dashes are included in numbers, e.g. 665-680); (F1digit) 1DIGIT, 2DIGIT, ... (if a number, the number of digits in it, e.g. 5, 78, ...)
- Capitalization (F2): ALLCAPS (all characters are capital letters, e.g. RAYMOND); FIRSTCAP (the first character is a capital letter, e.g. Paris); ALLSMALL (all characters are lower-cased, e.g. pouvoirs); NONIMPCAP (capital letters are mixed, e.g. dell'Ateneo)
- Regular form (F3): INITIAL (initialized expression, e.g. Ch.-R.); WEBLINK (regular expression for web pages, e.g. apcss.org)
- Emphasis (F4): ITALIC (italic characters, e.g. Regional)
- Location (F5): BIBL START (position is in the first one-third of the reference); BIBL IN (position is between one-third and two-thirds); BIBL END (position is between two-thirds and the end)
- Lexicon (F6): POSSEDITOR (possible abbreviation of editor, e.g. ed.); POSSPAGE (possible abbreviation of page, e.g. pp.); POSSMONTH (possible month, e.g. September); POSSVOLUME (possible abbreviation of volume, e.g. vol.)
- External list (F7): SURNAMELIST (found in an external surname list, e.g. RAYMOND); FORENAMELIST (external forename list, e.g. Simone); PLACELIST (external place list, e.g. New York); JOURNALLIST (external journal list, e.g. African Affaire)
- Punctuation (F8): COMMA, POINT, LINK, PUNC, LEADINGQUOTES, ENDINGQUOTES, PAIREDBRACES (the punctuation mark itself (comma, point) or the punctuation type; these features are defined especially for the case of non-separated punctuation, e.g. 46-55, 1993. S.; [en "The design". (1))

Global features:
- Local feature existence (G1): [local feature name], i.e. the corresponding local feature is found in the input string (the F3, F4 and F6 features are finally selected)
- Feature distribution (G2): NOPUNC, 1PUNC, 2PUNC, MOREPUNC (there are no, 1, 2, or more PUNC features in the input string); NONUMBER (there is no number in the input string); STARTINITIAL (the input string starts with an initial expression); ENDQUOTECOMMA (an ending quote is followed by a comma); FIRSTCAPCOMMA (a token having the FIRSTCAP feature is followed by a comma)

…has bibliographic parts, the role of the first token may be more important than the others. If we can encode this kind of information before we put all the tokens together in an unordered vector space to apply an algorithm, we would improve the note classification…


Kim  &  Bellot,  2013
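A rough sketch of how a few of the Table V local features (F1, F2, F3, F6) could be computed for a single token before being handed to the CRF. The regular expressions and lexicon entries are assumptions made for this illustration, not BILBO's actual implementation.

import re

MONTHS = {"january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"}

def local_features(token):
    """A few of the Table V local features (F1, F2, F3, F6), illustrative only."""
    feats = {}
    # F1 - numbers
    if token.isdigit():
        feats["ALLNUMBERS"] = True
        feats[f"{len(token)}DIGIT"] = True
    elif any(c.isdigit() for c in token):
        feats["NUMBERS"] = True
    # F2 - capitalization
    if token.isupper():
        feats["ALLCAPS"] = True
    elif token[:1].isupper():
        feats["FIRSTCAP"] = True
    elif token.islower():
        feats["ALLSMALL"] = True
    # F3 - regular forms (rough approximations of the table's examples)
    if re.fullmatch(r"([A-Z][a-z]?\.-?)+", token):
        feats["INITIAL"] = True
    if re.search(r"\w+\.(org|com|fr|net)\b", token):
        feats["WEBLINK"] = True
    # F6 - lexicon of likely abbreviations
    if token.lower() in {"ed.", "eds."}:
        feats["POSSEDITOR"] = True
    if token.lower() in {"p.", "pp."}:
        feats["POSSPAGE"] = True
    if token.lower() in MONTHS:
        feats["POSSMONTH"] = True
    if token.lower() in {"vol.", "vols."}:
        feats["POSSVOLUME"] = True
    return feats

print(local_features("RAYMOND"))   # {'ALLCAPS': True}
print(local_features("1984"))      # {'ALLNUMBERS': True, '4DIGIT': True}
print(local_features("Ch.-R."))    # includes 'INITIAL'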

Page 17: OpenEdition Lab projects in Text Mining


[Fig. 3: Basic tokenization effect on (a) corpus level 1 and (b) the Cora dataset: micro-averaged F-measure against the size of the training set (50% to 90%), comparing separated vs. attached punctuation. Each point is the averaged value of 10 different cross-validated experiments.]

[Fig. 4: Cumulative local feature effect from F1 to F8 with C1 and Cora: micro-averaged F-measure against the size of the training set (50% to 90%), for the cumulative feature sets F0 (base), F1 (Num.), F2 (Cap.), F3 (Reg.), F4 (Emp.), F5 (Loc.), F6 (Lex.), F7 (Ext.), F8 (Pun.).]

…We repeat cross-validations by cumulatively adding the features of each category from F1 to F8. Too-detailed features, such as those of category F1-sub, are excluded here because, by testing the detailed ones at the end, we want to eliminate them if they are not effective for reference parsing. Each line corresponds to the mean curve of 10 micro-averaged F-measures for the corresponding local feature set. The lowest line, F0, is the result when using only input features without any local features. The line F1…


Kim  &  Bellot,  2013

Page 18: OpenEdition Lab projects in Text Mining


[Fig. 5: Cumulative external list feature effect on (a) corpus level 1 and (b) the Cora dataset: micro-averaged F-measure against the size of the training set (50% to 90%), for the cumulative feature sets F6, F7a, F7b, F7c and F7.]

Detailed analysis of the effect of external lists and lexicon. One of the interesting discoveries from the above analysis is that lexical features are not always effective for reference parsing. Lexicon features defined with strict, non-overlapping rules actually have no significant impact, whereas external lists such as surname, forename, place, and journal lists obviously enhance the prediction accuracy. We now investigate these effects in more detail to see what useful information we can infer, such as the existence of an exceptional feature that contributes the greater part of the improvement, or one that cancels the improvement obtained by other features.

Fig. 5 shows the effect of external list features on C1 and Cora respectively. The external list features are cumulatively added, in order of surname, forename, place, and journal (F7a, F7b, F7c, and F7 respectively), to the baseline F6. Let us first look at the figure on the left, the result for C1. The surname feature alone provides a somewhat uneven effect, but adding the forename feature yields significantly better F-measure values. This is an expected phenomenon because these two features are detected simultaneously from the lists: only when surname and forename features are observed in a pair or a triple of consecutive tokens do we attach these features to the tokens. Because of this property, the features work more stably when they are used at the same time. The place feature greatly increases the accuracy except on the 90/10 split sets. We suspect that this sudden decline is caused by the internal structure of the split dataset; for example, the proportion of ambiguous tokens having multiple list features may have increased in the test set. But the instability disappears when adding the journal list feature, which raises the F-measure from 0.842 to 0.851 on the 90/10 split sets. The result empirically justifies that the external list features complement each other. By using all of them, unbiased list information can be supplied, which leads to good performance. The same analysis applies to the result on Cora, in spite of relatively minor changes.

Fig. 6 shows the effect of added lexicon features on C1 (left) and Cora (right). The features are cumulatively added in order of posseditor, posspage, possmonth, and possvolume (F6a, F6b, F6c, and F6). We can hardly find a pattern in either of the figures…


Kim  &  Bellot,  2013

Page 19: OpenEdition Lab projects in Text Mining


Table VII: Micro-averaged precision and recall per field for C1 and Cora with the finally chosen strategy.

(a) C1, detailed labels (field: #true / #annot. / #exist. / precision / recall):
surname: 1080 / 1164 / 1203 / 92.78% / 89.78%
forename: 1128 / 1220 / 1244 / 92.46% / 90.68%
title(m): 3277 / 4132 / 3690 / 79.31% / 88.81%
title(a): 2782 / 3253 / 3069 / 85.52% / 90.65%
title(j): 440 / 564 / 681 / 78.01% / 64.61%
title(u): 511 / 660 / 652 / 77.42% / 78.37%
title(s): 18 / 24 / 118 / 75.00% / 15.25%
publisher: 1021 / 1367 / 1171 / 74.69% / 87.19%
date: 793 / 838 / 855 / 94.63% / 92.75%
biblscope(pp): 210 / 223 / 219 / 94.17% / 95.89%
biblscope(i): 152 / 191 / 189 / 79.58% / 80.42%
biblscope(v): 75 / 87 / 102 / 86.21% / 73.53%
extent: 66 / 69 / 70 / 95.65% / 94.29%
place: 433 / 524 / 539 / 82.63% / 80.33%
abbr: 417 / 468 / 502 / 89.10% / 83.07%
nolabel: 231 / 306 / 488 / 75.49% / 47.34%
edition: 46 / 178 / 211 / 25.84% / 21.80%
orgname: 74 / 87 / 118 / 85.06% / 62.71%
bookindicator: 47 / 49 / 65 / 95.92% / 72.31%
OTHERS: 95 / 177 / 395 / 53.67% / 24.05%
Average: 12896 / 15581 / 15581 / 82.77% / 82.77%

(b) Cora dataset (field: #true / #annot. / #exist. / precision / recall):
author: 2797 / 2855 / 2830 / 97.97% / 98.83%
title: 3508 / 3613 / 3560 / 97.10% / 98.54%
booktitle: 1750 / 1882 / 1865 / 92.99% / 93.83%
journal: 546 / 615 / 617 / 88.78% / 88.49%
date: 636 / 641 / 642 / 99.22% / 99.07%
institution: 268 / 299 / 306 / 89.63% / 87.58%
publisher: 165 / 188 / 203 / 87.77% / 81.28%
location: 247 / 279 / 289 / 88.53% / 85.47%
editor: 232 / 261 / 295 / 88.89% / 78.64%
pages: 422 / 429 / 438 / 98.37% / 96.35%
volume: 306 / 327 / 320 / 93.58% / 95.63%
tech: 130 / 155 / 178 / 83.87% / 73.03%
note: 75 / 122 / 123 / 61.48% / 60.98%
Average: 11082 / 11666 / 11666 / 95.00% / 95.00%

…punctuation is attached, but the latter is significantly negative when the reference fields are more detailed. Hypothesis 2 is confirmed by these observations.

For our system BILBO, the input and local features written in black in Table V are finally selected. BILBO provides two different labeling levels: a simple model using only basic labels, and a detailed model also including sub-labels of title and biblscope. The latter is intended for the articles of the OpenEdition platform, whereas the former is for general references from external data. We choose to separate punctuation marks for the moment: first, to immediately use the parsing result without re-separating punctuation from the predicted reference fields, and second, to easily adapt to external references regardless of too-detailed features.

6.2 Hypothesis verification for notes (corpus level 2). There are two main objectives in these sub-experiments. First, confirm Hypothesis 4 by showing that note filtering, which excludes non-bibliographic notes via an SVM with global features, increases note parsing performance. Second, find the most effective strategy of global features and training data setting for the classification. Instead of detailing the global feature selection procedure, which is similar to the local feature selection procedure, we use the finally selected global features given in Table V. To avoid confusion, we refer to bibliographic information extraction in notes as note parsing. Since there is no other benchmark dataset, we only test on our OpenEdition corpus level 2. The same evaluation measures and cross-validation splits are used as before. We conduct experiments with two different tokenization strategies, separation and non-separation of punctuation, taking the best local feature combinations found in the above experiments. When separating punctuation, strategy F7, i.e. cumulatively added features from F1 to F7, is selected…


Learning on FR data, testing on US data

(715 references)

Page 20: OpenEdition Lab projects in Text Mining


Test  :  http://bilbo.openeditionlab.org  Sources  :  http://github.com/OpenEdition/bilbo

Page 21: OpenEdition Lab projects in Text Mining


EQUIPEX OpenEdition: BILBO


Page 22: OpenEdition Lab projects in Text Mining


IR and Digital Libraries / Sentiment Analysis


Page 23: OpenEdition Lab projects in Text Mining


Searching for book reviews
• Applying and testing classical supervised approaches for filtering reviews: a new kind of genre classification.

• Developing a corpus of book reviews from the OpenEdition.org platforms and from the Web.

• Collecting two kinds of reviews:
— Long reviews of scientific books written by expert reviewers in scientific journals
— Short reviews such as reader comments on social web sites

• Linking reviews to their corresponding books using BILBO


Review  ≠  Abstract

Page 24: OpenEdition Lab projects in Text Mining


Searching for book reviews
• A supervised classification approach

• Feature selection: decision trees, Z-score

• Features: localisation of named entities, …


…the corpus. In order to give more importance to the difference in how many times a term appears in both classes, we used the normalized z-score described in Equation (4) with the measure \delta introduced in Equation (3):

\delta = \frac{tf_{C0} - tf_{C1}}{tf_{C0} + tf_{C1}} (3)

The normalization measure \delta is taken into account to calculate the normalized z-score as follows:

Z_\delta(w_i \mid C_j) = Z(w_i \mid C_j) \cdot (1 + |\delta(w_i \mid C_j)|) if Z > 0 and \delta > 0, or if Z \leq 0 and \delta \leq 0;
Z_\delta(w_i \mid C_j) = Z(w_i \mid C_j) \cdot (1 - |\delta(w_i \mid C_j)|) if Z > 0 and \delta \leq 0, or if Z \leq 0 and \delta > 0. (4)

In Table 3 we can observe the 30 highest normalized Z-scores for the Review and non-Review classes for the corpus after a unigram indexing scheme was performed. We can see that many of these features relate to the class in which they predominate.

Table 3: Distribution of the 30 highest normalized Z-scores across the corpus (rank, feature, Z-score):
1 abandonne 30.14; 2 seront 30.00; 3 biographie 21.84; 4 entranent 21.20; 5 prise 21.20; 6 sacre 21.20; 7 toute 20.70; 8 quitte 19.55; 9 dimension 15.65; 10 les 14.43; 11 commandement 11.01; 12 lie 10.61; 13 construisent 10.16; 14 lieux 10.14; 15 garde 9.75; 16 winter 9.23; 17 cleo 8.88; 18 visible 8.75; 19 fondamentale 8.67; 20 david 8.54; 21 pratiques 8.52; 22 signification 8.47; 23 01 8.38; 24 institutionnels 8.38; 25 1930 8.16; 26 attaques 8.14; 27 courrier 8.08; 28 moyennes 7.99; 29 petite 7.85; 30 adapted 7.84.

In our training corpus, we have 106,911 words obtained from the bag-of-words approach. We selected all tokens (features) that appear more than 5 times in each class. The goal is therefore to design a method capable of selecting terms that clearly belong to one genre of documents. We obtained a vector space that contains 5,957 words (features). After calculating the normalized z-score of all features, we selected the first 1,000 features according to this score.
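A small sketch of the \delta measure of Equation (3) and the normalized z-score of Equation (4), assuming the per-class z-score Z(w|C) has already been computed (e.g. following Zubaryeva and Savoy, cited later in this excerpt); the helper names and toy frequencies are ours.

def delta(tf_c0, tf_c1):
    """Equation (3): relative difference of term frequencies between the two classes."""
    return (tf_c0 - tf_c1) / (tf_c0 + tf_c1)

def normalized_z(z, tf_c0, tf_c1):
    """Equation (4): boost Z when its sign agrees with delta, shrink it otherwise."""
    d = delta(tf_c0, tf_c1)
    same_sign = (z > 0 and d > 0) or (z <= 0 and d <= 0)
    return z * (1 + abs(d)) if same_sign else z * (1 - abs(d))

# Toy term appearing 42 times in class C0 and 3 times in class C1,
# with a made-up raw z-score of 4.2 for C0.
print(normalized_z(4.2, 42, 3))   # 4.2 * (1 + 39/45) ~ 7.84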

5.3 Using Named Entity (NE) distribution as features

Most research involves removing irrelevant descriptors. In this section, we describe a new approach to better represent the documents in the context of this study. The purpose is to find elements that characterize the Review class.

After a linguistic and statistical corpus analysis, we identified some common characteristics (illustrated in Figures 3, 4 and 5). We observed that the bibliographical reference of the reviewed book, or some of its elements (title, author(s) and date), often appears in the title of the review, as in the following example:

[...]
<title level="a" type="main">Dean R. Hoge, Jacqueline E. Wenger, <hi rend="italic">Evolving Visions of the Priesthood. Changes from Vatican II to the Turn of the New Century</hi></title>
<title type="sub">Collegeville (MIN), Liturgical Press, 2003, 226 p.</title>
[...]

In the non-Review class, we found scientific articles. In those documents, a bibliography section is generally present at the end of the text. As we know, this section contains authors' names, locations, dates, etc. However, in the Review class this section is quite often absent. Based on this analysis, we tagged all documents of each class using the Named Entity Recognition tool TagEN (Poibeau, 2003). We aim to explore the distribution of 3 named entities ("authors' names", "locations" and "dates") in the text after removing all XML/HTML tags. After that, we divided the texts into 10 parts (the size of each part = total number of words / 10). The distribution ratio of each named entity in each part is used as a feature to build the new document representation, and we obtained a set of 30 features.

[Figure 3: "Person" named entity distribution]
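The named-entity distribution representation described above (3 entity types across 10 text segments = 30 features) can be sketched as follows. The NER step itself is left abstract since the authors use TagEN, which is not reproduced here; the function and tag names are illustrative.

def ne_distribution_features(tokens, ne_tags, n_parts=10):
    """30 features: for 'person', 'location' and 'date' entities, the ratio of
    occurrences falling in each of the 10 equal-size parts of the text.
    `ne_tags` holds one tag per token (output of a NER tool such as TagEN)."""
    part_size = max(1, len(tokens) // n_parts)
    feats = {}
    for ne_type in ("person", "location", "date"):
        positions = [i for i, t in enumerate(ne_tags) if t == ne_type]
        total = len(positions) or 1          # avoid division by zero
        for p in range(n_parts):
            lo = p * part_size
            hi = (p + 1) * part_size if p < n_parts - 1 else len(tokens)
            in_part = sum(1 for i in positions if lo <= i < hi)
            feats[f"{ne_type}_part{p + 1}"] = in_part / total
    return feats

# Toy usage with made-up tokens and tags.
tokens = ["Dean", "Hoge", "reviewed", "in", "Collegeville", "in", "2003", "."] * 5
tags   = ["person", "person", "o", "o", "location", "o", "date", "o"] * 5
print(len(ne_distribution_features(tokens, tags)))   # 30 features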

6 Experiments

[Figure 4: "Location" named entity distribution. Figure 5: "Date" named entity distribution.]

In this section we describe results from experiments using a collection of documents from Revues.org and the Web. We use supervised learning methods to build our classifiers, and evaluate the resulting models on new test cases. The focus of our work has been on comparing the effectiveness of different inductive learning algorithms (Naive Bayes, Support Vector Machines with RBF and linear kernels) in terms of classification accuracy. We also explored alternative document representations (bag-of-words, feature selection using the z-score, named entity repartition in the text).

6.1 Naive Bayes (NB)

In order to evaluate different classification models, we adopted the Naive Bayes approach as a baseline (Zubaryeva and Savoy, 2010). The classification system has to choose, between two possible hypotheses h0 = "it is a review" and h1 = "it is not a review", the class that has the maximum value according to Equation (5), where |w| indicates the number of words included in the current document and w_j is a word appearing in the document:

\arg\max_{h_i} P(h_i) \prod_{j=1}^{|w|} P(w_j \mid h_i), where P(w_j \mid h_i) = \frac{tf_{j,h_i}}{n_{h_i}}. (5)

We estimate the probabilities in Equation (5) from the relation between the lexical frequency of the word w_j in the whole collection T_{h_i} (denoted tf_{j,h_i}) and the size of the corresponding corpus.
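A minimal sketch of the Naive Bayes decision rule of Equation (5), with the term-frequency estimate of P(w_j|h_i); smoothing and all other practical details are deliberately omitted, and the toy data is invented.

import math
from collections import Counter

def train_nb(docs_by_class):
    """docs_by_class: {class_label: list of token lists}. Returns tf counts, class sizes, priors."""
    tf = {c: Counter(tok for doc in docs for tok in doc) for c, docs in docs_by_class.items()}
    n = {c: sum(counts.values()) for c, counts in tf.items()}
    n_docs = sum(len(d) for d in docs_by_class.values())
    prior = {c: len(d) / n_docs for c, d in docs_by_class.items()}
    return tf, n, prior

def classify_nb(doc, tf, n, prior):
    """argmax_h P(h) * prod_j P(w_j|h), with P(w_j|h) = tf_{j,h} / n_h (computed in log space)."""
    scores = {}
    for c in prior:
        s = math.log(prior[c])
        for w in doc:
            p = tf[c][w] / n[c]
            s += math.log(p) if p > 0 else math.log(1e-9)   # crude floor, no smoothing
        scores[c] = s
    return max(scores, key=scores.get)

data = {"review": [["excellent", "book", "review"]], "other": [["method", "results"]]}
tf, n, prior = train_nb(data)
print(classify_nb(["excellent", "book"], tf, n, prior))   # 'review'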

6.2 Support Vector Machines (SVM)

SVM designates a learning approach introduced by Vapnik in 1995 for solving two-class pattern recognition problems (Vapnik, 1995). The SVM method is based on the Structural Risk Minimization principle (Vapnik, 1995) from computational learning theory. In their basic form, SVMs learn a linear threshold function. Nevertheless, by a simple plug-in of an appropriate kernel function, they can be used to learn linear classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets (Joachims, 1998). The key in such classifiers is to determine the optimal boundaries between the different classes and use them for the purposes of classification (Aggarwal and Zhai, 2012). Given the vectors from the different representations presented below, we used the Weka toolkit to learn the models. With a linear kernel or a Radial Basis Function (RBF) kernel, this sometimes allows a good level of performance to be reached, at the cost of fast growth of the processing time during the learning stage (Kummer, 2012).
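For the SVM runs, a comparable setup can be sketched with scikit-learn (an assumption for illustration; the authors report using the Weka toolkit). The RBF hyperparameters echo the values reported for the first indexing scheme in Table 4; the training texts are placeholders.

# Review vs. non-review classification with linear and RBF SVMs
# (scikit-learn used here for illustration; the paper used Weka).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC, LinearSVC

train_texts = ["full text of a book review here", "full text of a research article here"]
train_labels = ["review", "not_review"]

linear_clf = make_pipeline(CountVectorizer(), LinearSVC())
rbf_clf = make_pipeline(CountVectorizer(), SVC(kernel="rbf", C=5.0, gamma=0.00185))

for clf in (linear_clf, rbf_clf):
    clf.fit(train_texts, train_labels)
    print(clf.predict(["a new document to classify"]))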

6.3 Results

We used different strategies to represent each textual unit. First, the unigram model (bag-of-words), where all words are considered as features. We also used feature selection based on the normalized z-score, keeping the first 1,000 words according to this score (after removing all words that appear fewer than 5 times). As the third approach, we suggested that the common features of the Review collection can be located in the named entity distribution in the text.

Table 4: Performance of the classification models using different indexing schemes on the test set (recall / precision / F-measure for the Review class, then for the non-Review class). In the original table, the best values for the Review class are in bold and those for the non-Review class are underlined.

Indexing scheme 1:
NB: Review 65.5% / 81.5% / 72.6%; non-Review 81.6% / 65.7% / 72.8%
SVM (linear): Review 99.6% / 98.3% / 98.9%; non-Review 97.9% / 99.5% / 98.7%
SVM (RBF, C = 5.0, gamma = 0.00185): Review 89.8% / 97.2% / 93.4%; non-Review 96.8% / 88.5% / 92.5%

Indexing scheme 2:
NB: Review 90.6% / 64.2% / 75.1%; non-Review 37.4% / 76.3% / 50.2%
SVM (linear): Review 87.2% / 81.3% / 84.2%; non-Review 75.3% / 82.7% / 78.8%
SVM (RBF, C = 32.0, gamma = 0.00781): Review 87.2% / 86.5% / 86.8%; non-Review 83.1% / 84.0% / 83.6%

Indexing scheme 3:
NB: Review 80.0% / 68.4% / 73.7%; non-Review 54.2% / 68.7% / 60.6%
SVM (linear): Review 77.0% / 81.9% / 79.4%; non-Review 78.9% / 73.5% / 76.1%
SVM (RBF, C = 8.0, gamma = 0.03125): Review 81.2% / 48.6% / 79.9%; non-Review 72.6% / 75.8% / 74.1%


Benkoussas  &  Bellot,  LREC  2014

Page 25: OpenEdition Lab projects in Text Mining


Page 26: OpenEdition Lab projects in Text Mining


Sentiment Analysis in Twitter


Example sentences:
1. Authorities are only too aware that Kashgar is 4,000 kilometres (2,500 miles) from Beijing but only a tenth of the distance from the Pakistani border, and are desperate to ensure instability or militancy does not leak over the frontiers.
2. Taiwan-made products stood a good chance of becoming even more competitive thanks to wider access to overseas markets and lower costs for material imports, he said.
3. "March appears to be a more reasonable estimate while earlier admission cannot be entirely ruled out," according to Chen, also Taiwan's chief WTO negotiator.
4. friday evening plans were great, but saturday's plans didnt go as expected – i went dancing & it was an ok club, but terribly crowded :-(
5. WHY THE HELL DO YOU GUYS ALL HAVE MRS. KENNEDY! SHES A FUCKING DOUCHE
6. AT&T was okay but whenever they do something nice in the name of customer service it seems like a favor, while T-Mobile makes that a normal everyday thin
7. obama should be impeached on TREASON charges. Our Nuclear arsenal was TOP Secret. Till HE told our enemies what we had. #Coward #Traitor
8. My graduation speech: "I'd like to thanks Google, Wikipedia and my computer! :D #iThingteens

Table 5: List of example sentences with annotations that were provided to the annotators. All subjective phrases are italicized. Positive phrases are in green, negative phrases are in red, and neutral phrases are in blue.

Worker 1: I would love to watch Vampire Diaries :) and some Heroes! Great combination (9/13)
Worker 2: I would love to watch Vampire Diaries :) and some Heroes! Great combination (11/13)
Worker 3: I would love to watch Vampire Diaries :) and some Heroes! Great combination (10/13)
Worker 4: I would love to watch Vampire Diaries :) and some Heroes! Great combination (13/13)
Worker 5: I would love to watch Vampire Diaries :) and some Heroes! Great combination (11/13)
Intersection: I would love to watch Vampire Diaries :) and some Heroes! Great combination

Table 6: Example of a sentence annotated for subjectivity on Mechanical Turk. Words and phrases that were marked as subjective are italicized and highlighted in bold. The first five rows are annotations provided by Turkers, and the final row shows their intersection. The final column shows the accuracy for each annotation compared to the intersection.

Note that ignoring F_neutral does not reduce the task to predicting positive vs. negative labels only (even though some participants have chosen to do so), since the gold standard still contains neutral labels which are to be predicted: F_pos and F_neg would suffer if these examples are labeled as positive and/or negative instead of neutral.

We provided participants with a scorer. In addition to outputting the overall F-score, it produced a confusion matrix for the three prediction classes (positive, negative, and objective), and it also validated the data submission format.

5 Participants and Results

The results for subtask A are shown in Tables 7 and 8 for Twitter and for SMS messages, respectively; those for subtask B are shown in Table 9 for Twitter and in Table 10 for SMS messages. Systems are ranked by their scores for the constrained runs; the ranking based on scores for unconstrained runs is shown as a subindex.

For both subtasks, there were teams that only submitted results for the Twitter test set. Some teams submitted both a constrained and an unconstrained version (e.g., AVAYA and teragram). As one would expect, the results on the Twitter test set tended to be better than those on the SMS test set, since the SMS data was out-of-domain with respect to the training (Twitter) data.

Moreover, the results for subtask A were significantly better than those for subtask B, which shows that it is a much easier task, probably because there is less ambiguity at the phrase level.

5.1 Subtask A: Contextual Polarity

Table 7 shows that subtask A, Twitter, attracted 23 teams, who submitted 21 constrained and 7 unconstrained systems. Five teams submitted both a constrained and an unconstrained system, and two other teams submitted constrained systems that are on the boundary between being constrained and unconstrained.

Twitter: RT @tash jade: That's really sad, Charlie RT "Until tonight I never realised how fucked up I was" -Charlie Sheen #sheenroast
SMS: Glad to hear you are coping fine in uni... So, wat interview did you go to? How did it go?

Table 1: Examples of sentences from each corpus that contain subjective phrases.

While some Twitter sentiment datasets have already been created, they were either small and proprietary, such as the i-sieve corpus (Kouloumpis et al., 2011), or they were created only for Spanish like the TASS corpus (Villena-Roman et al., 2013; http://www.daedalus.es/TASS/corpus.php), or they relied on noisy labels obtained from emoticons and hashtags. They further focused on message-level sentiment, and no Twitter or SMS corpus with expression-level sentiment annotations has been made available so far.

Thus, the primary goal of our SemEval-2013 task 2 has been to promote research that will lead to a better understanding of how sentiment is conveyed in Tweets and SMS messages. Toward that goal, we created the SemEval Tweet corpus, which contains Tweets (for both training and testing) and SMS messages (for testing only) with sentiment expressions annotated with contextual phrase-level polarity as well as an overall message-level polarity. We used this corpus as a testbed for the system evaluation at SemEval-2013 Task 2.

In the remainder of this paper, we first describe the task, the dataset creation process, and the evaluation methodology. We then summarize the characteristics of the approaches taken by the participating systems and we discuss their scores.

2 Task Description

We had two subtasks: an expression-level subtask and a message-level subtask. Participants could choose to participate in either or both subtasks. Below we provide short descriptions of the objectives of these two subtasks.

Subtask A: Contextual Polarity Disambiguation. Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. The boundaries for the marked instance were provided: this was a classification task, not an entity recognition task.

Subtask B: Message Polarity Classification. Given a message, decide whether it is of positive, negative, or neutral sentiment. For messages conveying both a positive and a negative sentiment, whichever is the stronger one was to be chosen.

Each participating team was allowed to submit results for two different systems per subtask: one constrained, and one unconstrained. A constrained system could only use the provided data for training, but it could also use other resources such as lexicons obtained elsewhere. An unconstrained system could use any additional data as part of the training process; this could be done in a supervised, semi-supervised, or unsupervised fashion.

Note that constrained/unconstrained refers to the data used to train a classifier. For example, if other data (excluding the test data) was used to develop a sentiment lexicon, and the lexicon was used to generate features, the system would still be constrained. However, if other data (excluding the test data) was used to develop a sentiment lexicon, and this lexicon was used to automatically label additional Tweet/SMS messages and then used with the original data to train the classifier, then such a system would be unconstrained.

3 Dataset Creation

In the following sections we describe the collection and annotation of the Twitter and SMS datasets.

3.1 Data Collection

Twitter is the most common micro-blogging site on the Web, and we used it to gather tweets that express sentiment about popular topics. We first extracted named entities using a Twitter-tuned NER system (Ritter et al., 2011) from millions of tweets, which we collected over a one-year period spanning from January 2012 to January 2013; we used the public streaming Twitter API to download tweets.


Instructions: Subjective words are ones which convey an opinion. Given a sentence, identify whether it is objective, positive, negative, or neutral. Then, identify each subjective word or phrase in the context of the sentence and mark the position of its start and end in the text boxes below. The number above each word indicates its position. The word/phrase will be generated in the adjacent textbox so that you can confirm that you chose the correct range. Choose the polarity of the word or phrase by selecting one of the radio buttons: positive, negative, or neutral. If a sentence is not subjective please select the checkbox indicating that "There are no subjective words/phrases". Please read the examples and invalid responses before beginning if this is your first time answering this hit.

Figure 1: Instructions provided to workers on Mechanical Turk, followed by a screenshot.

Table 2: Statistics for Subtask A (corpus: average number of words, average number of characters, total phrase counts positive / negative / neutral, vocabulary size).
Twitter - Training: 25.4 words, 120.0 characters; 5,895 / 3,131 / 471; vocabulary 20,012
Twitter - Dev: 25.5 words, 120.0 characters; 648 / 430 / 57; vocabulary 4,426
Twitter - Test: 25.4 words, 121.2 characters; 2,734 / 1,541 / 160; vocabulary 11,736
SMS - Test: 24.5 words, 95.6 characters; 1,071 / 1,104 / 159; vocabulary 3,562

We then identified popular topics as those namedentities that are frequently mentioned in associationwith a specific date (Ritter et al., 2012). Given thisset of automatically identified topics, we gatheredtweets from the same time period which mentionedthe named entities. The testing messages had differ-ent topics from training and spanned later periods.

To identify messages that express sentiment towards these topics, we filtered the tweets using SentiWordNet (Baccianella et al., 2010). We removed messages that contained no sentiment-bearing words, keeping only those with at least one word whose positive or negative sentiment score is greater than 0.3 in SentiWordNet for at least one sense of the word. Without filtering, we found class imbalance to be too high.3
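As a rough sketch of this filtering step (assuming NLTK's SentiWordNet interface and simple whitespace tokenisation, which the organisers do not specify):

# Sketch of the SentiWordNet-based filtering described above (assumption:
# NLTK's SentiWordNet wrapper; the task description does not say which toolkit was used).
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("sentiwordnet", quiet=True)
nltk.download("wordnet", quiet=True)

def has_sentiment_word(tweet, threshold=0.3):
    """Keep a tweet only if at least one word has a positive or negative
    SentiWordNet score greater than `threshold` for at least one sense."""
    for token in tweet.lower().split():
        for synset in swn.senti_synsets(token):
            if synset.pos_score() > threshold or synset.neg_score() > threshold:
                return True
    return False

tweets = ["I love this movie", "the train leaves at 5"]
kept = [t for t in tweets if has_sentiment_word(t)]
print(kept)  # the first tweet is kept, the second is likely filtered out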

Twitter messages are rich in social media features, including out-of-vocabulary (OOV) words, emoticons, and acronyms; see Table 1. A large portion of the OOV words are hashtags (e.g., #sheenroast) and mentions (e.g., @tash_jade).

3 Filtering based on an existing lexicon does bias the dataset to some degree; however, note that the text still contains sentiment expressions outside those in the lexicon.

Corpus              Positive   Negative   Objective/Neutral
Twitter - Training  3,662      1,466      4,600
Twitter - Dev       575        340        739
Twitter - Test      1,573      601        1,640
SMS - Test          492        394        1,208

Table 3: Statistics for Subtask B.

We annotated the same Twitter messages with annotations for subtask A and subtask B. However, the final training and testing datasets overlap only partially between the two subtasks since we had to throw away messages with low inter-annotator agreement, and this differed between the subtasks. For testing, we also annotated SMS messages, taken from the NUS SMS corpus4 (Chen and Kan, 2012). Tables 2 and 3 show statistics about the corpora we created for subtasks A and B.

4 http://wing.comp.nus.edu.sg/SMSCorpus/


Page 27: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Aspect Based Sentiment Analysis

— Subtask 1: Aspect term extraction

Given a set of sentences with pre-identified entities (e.g., restaurants), identify the aspect terms present in the sentence and return a list containing all the distinct aspect terms. An aspect term names a particular aspect of the target entity.

For example: "I liked the service and the staff, but not the food", "The food was nothing much, but I loved the staff". Multi-word aspect terms (e.g., "hard disk") should be treated as single terms (e.g., in "The hard disk is very noisy" the only aspect term is "hard disk").

— Subtask 2: Aspect term polarity

For a given set of aspect terms within a sentence, determine whether the polarity of each aspect term is positive, negative, neutral or conflict (i.e., both positive and negative).

For example:

"I loved their fajitas" → {fajitas: positive}
"I hated their fajitas, but their salads were great" → {fajitas: negative, salads: positive}
"The fajitas are their first plate" → {fajitas: neutral}
"The fajitas were great to taste, but not to see" → {fajitas: conflict}

27

http://alt.qcri.org/semeval2014/task4/

Page 28: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Aspect Based Sentiment Analysis

— Subtask 3: Aspect category detection

Given a predefined set of aspect categories (e.g., price, food), identify the aspect categories discussed in a given sentence. Aspect categories are typically coarser than the aspect terms of Subtask 1, and they do not necessarily occur as terms in the given sentence.

For example, given the set of aspect categories {food, service, price, ambience, anecdotes/miscellaneous}:

"The restaurant was too expensive" → {price}
"The restaurant was expensive, but the menu was great" → {price, food}

— Subtask 4: Aspect category polarity

Given a set of pre-identified aspect categories (e.g., {food, price}), determine the polarity (positive, negative, neutral or conflict) of each aspect category.

For example:

"The restaurant was too expensive" → {price: negative}
"The restaurant was expensive, but the menu was great" → {price: negative, food: positive}

28

http://alt.qcri.org/semeval2014/task4/

Page 29: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 29

Hussam Hamdan, Frédéric Béchet, Patrice Bellot
hussam.hamdan@lsis.org, frederic.bechet@lif.univ-mrs.fr, patrice.bellot@lsis.org

Aix-Marseille University, Marseille, France

Twitter is a real-time, highly social microblogging service that allows users to post short messages. Sentiment Analysis of Twitter is useful for many domains (Marketing, Finance, Social, etc.). Many approaches have been proposed for this task; we applied several machine learning approaches in order to classify the tweets using the dataset of SemEval 2013. Many resources were used for feature extraction: WordNet (similar adjectives and verb groups), DBpedia (the hidden concepts), SentiWordNet (the polarity and subjectivity), and other Twitter-specific features (number of hashtags, mentions, etc.).

Highlights

Results

Naive Bayes model: the average F-measure of the negative and positive classes was improved by 4% w.r.t. the unigram model.

SVM model: the average F-measure of the negative and positive classes was improved by 1.5% w.r.t. the unigram model.

System Architecture

[Poster figure: a Preprocessing → Feature Extraction → Classification pipeline. Feature extraction draws on DBpedia, WordNet, SentiWordNet senti-features (polarity, subjectivity) and Twitter-specific features; a Twitter dictionary expands the example tweet "Gas by my house hit $3.39!!!! I'm going to Chapel Hill on Sat. :)" so that the emoticon becomes "very happy". The classifier outputs positive, negative or objective.]

Conclusion

- Using the similar adjectives from WordNet has a significant effect with Naive Bayes but little effect with SVM.
- Using the hidden concepts is not so significant on this data set; it is more significant for the objective class with SVM.
- Using senti-features, Twitter-specific features and verb groups was useful with SVM.

Experiments with DBpedia, WordNet and SentiWordNet as resources for Sentiment Analysis in micro-blogging

(Linear kernel)

P: Precision, R: Recall, F: F-measure


SemEval  2013

Page 30: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Sentiment Analysis on Twitter : Using Z-Score

• Z-Score helps to discriminate words for Document Classification, Authorship Attribution (J. Savoy, ACM TOIS 2013)

30

The Z_score of each term ti in a class Cj (tij) is computed from its term relative frequency tfrij in the class Cj, the mean (meani), which is the term probability over the whole corpus multiplied by nj, the number of terms in the class Cj, and the standard deviation (sdi) of term ti according to the underlying corpus (see Eq. (1, 2)).

Z\_score(t_{ij}) = \frac{tfr_{ij} - mean_i}{sd_i}    (Eq. 1)

Z\_score(t_{ij}) = \frac{tfr_{ij} - n_j \cdot P(t_i)}{\sqrt{n_j \cdot P(t_i)\,(1 - P(t_i))}}    (Eq. 2)

A term that has a salient frequency in a class in comparison to the other classes will have a salient Z_score. The Z_score was exploited for SA by (Zubaryeva and Savoy 2010): they chose a threshold (>2) for selecting the terms having a Z_score above that threshold, and then used a logistic regression for combining these scores. We use Z_scores as added features for classification because tweets are too short, so many tweets do not have any words with a salient Z_score. Figures 1, 2 and 3 show the distribution of Z_scores over each class; we note that the majority of terms have a Z_score between -1.5 and 2.5 in each class, and the rest are either very frequent (>2.5) or very rare (<-1.5). Note that a negative value means that the term is not frequent in this class in comparison with its frequencies in the other classes. Table 1 shows the first ten terms with the highest Z_scores in each class. We tested different values for the threshold; the best results were obtained with a threshold of 3.

Positive   Z_score    Negative   Z_score    Neutral    Z_score
Love       14.31      Not        13.99      Httpbit    6.44
Good       14.01      Fuck       12.97      Httpfb     4.56
Happy      12.30      Don't      10.97      Httpbnd    3.78
Great      11.10      Shit        8.99      Intern     3.58
Excite     10.35      Bad         8.40      Nov        3.45
Best        9.24      Hate        8.29      Httpdlvr   3.40
Thank       9.21      Sad         8.28      Open       3.30
Hope        8.24      Sorry       8.11      Live       3.28
Cant        8.10      Cancel      7.53      Cloud      3.28
Wait        8.05      stupid      6.83      begin      3.17

Table 1. The first ten terms having the highest Z_score in each class.
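A minimal sketch of the Z_score of Eq. (1, 2), assuming whitespace-tokenised, class-labelled tweets (the data layout and helper names are ours):

# Sketch of the per-class Z_score of Eq. (1, 2): for term t_i in class C_j,
# mean_i = n_j * P(t_i) and sd_i = sqrt(n_j * P(t_i) * (1 - P(t_i))),
# where P(t_i) is the term probability over the whole corpus and n_j the
# number of tokens in class C_j.
import math
from collections import Counter, defaultdict

def z_scores(labelled_tweets):
    class_tf = defaultdict(Counter)   # term frequency per class
    corpus_tf = Counter()             # term frequency over the whole corpus
    for text, label in labelled_tweets:
        tokens = text.lower().split()
        class_tf[label].update(tokens)
        corpus_tf.update(tokens)
    total = sum(corpus_tf.values())
    scores = defaultdict(dict)
    for label, tf in class_tf.items():
        n_j = sum(tf.values())
        for term, tfr_ij in tf.items():
            p = corpus_tf[term] / total
            mean_i = n_j * p
            sd_i = math.sqrt(n_j * p * (1 - p))
            scores[label][term] = (tfr_ij - mean_i) / sd_i if sd_i > 0 else 0.0
    return scores

example = [("i love this", "positive"), ("i hate this", "negative")]
print(z_scores(example)["positive"]["love"])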

- Sentiment Lexicon Features (POL): We used two sentiment lexicons, the MPQA Subjectivity Lexicon (Wilson, Wiebe et al. 2005) and Bing Liu's Opinion Lexicon, which was created by (Hu and Liu 2004) and augmented in many later works. We extract the number of positive, negative and neutral words in tweets according to these lexicons. Bing Liu's lexicon only contains negative and positive annotations, whereas the Subjectivity lexicon contains negative, positive and neutral ones.

- Part Of Speech (POS): We annotate each word in the tweet with its POS tag, and then we compute the number of adjectives, verbs, nouns, adverbs and connectors in each tweet.
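A minimal sketch of these POL and POS count features, assuming a lexicon loaded as a word → polarity mapping and NLTK's default POS tagger (both tooling choices are our assumptions):

# Sketch of the lexicon (POL) and POS count features described above.
# `lexicon` is assumed to map a word to "positive"/"negative"/"neutral"
# (e.g. loaded from the MPQA or Bing Liu files); the POS tagger is NLTK's
# default one, which is an assumption, not necessarily the authors' choice.
from collections import Counter
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)

def pol_features(tokens, lexicon):
    counts = Counter(lexicon.get(t) for t in tokens if t in lexicon)
    return {"pos_words": counts["positive"],
            "neg_words": counts["negative"],
            "neu_words": counts["neutral"]}

def pos_features(tokens):
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {"adjectives": sum(t.startswith("JJ") for t in tags),
            "verbs": sum(t.startswith("VB") for t in tags),
            "nouns": sum(t.startswith("NN") for t in tags),
            "adverbs": sum(t.startswith("RB") for t in tags)}

toy_lexicon = {"love": "positive", "hate": "negative"}
tokens = "i love this phone".split()
print(pol_features(tokens, toy_lexicon), pos_features(tokens))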

4 Evaluation

4.1 Data collection

We used the data sets provided in SemEval 2013 and 2014 for subtask B of sentiment analysis in Twitter (Rosenthal, Ritter et al. 2014; Wilson, Kozareva et al. 2013). The participants were provided with training tweets annotated as positive, negative or neutral. We downloaded these tweets using a given script. Among 9,646 tweets, we could only download 8,498 of them because of protected profiles and deleted tweets. Then, we used the development set containing 1,654 tweets for evaluating our methods. We combined the development set with the training set and built a new model which predicted the labels of the 2013 and 2014 test sets.

4.2 Experiments

Official Results: The results of our system submitted for the SemEval evaluation were 46.38% and 52.02% on the 2013 and 2014 test sets respectively. It should be mentioned that these results are not correct because of a software bug discovered after the submission deadline; the correct results are therefore reported as non-official results. In fact, the previous results are the output of our classifier trained with all the features in Section 3, but because of an index-shifting error the test set was represented by all the features except the terms.

Non-official Results: We have done various experiments using the features presented in Section 3 with a Multinomial Naïve Bayes model. We first constructed a feature vector of tweet terms, which gave 49% and 46% on the 2013 and 2014 test sets respectively. Then, we augmented this original vector with the Z_score


features, which improved the performance by 6.5% and 10.9%, and then with pre-polarity features, which also improved the f-measure by 4% and 6%; extending with POS tags, however, decreased the f-measure. We also tested all combinations of these features; Table 2 shows the results of each combination. We note that POS tags are not useful in any of the experiments, and the best result is obtained by combining Z_score and pre-polarity features. We find that Z_score features improve the f-measure significantly and are better than pre-polarity features.

Figure 1: Z_score distribution in the positive class.

Figure 2: Z_score distribution in the neutral class.

Figure 3: Z_score distribution in the negative class.

Features           F-measure 2013   F-measure 2014
Terms              49.42            46.31
Terms+Z            55.90            57.28
Terms+POS          43.45            41.14
Terms+POL          53.53            52.73
Terms+Z+POS        52.59            54.43
Terms+Z+POL        58.34            59.38
Terms+POS+POL      48.42            50.03
Terms+Z+POS+POL    55.35            58.58

Table 2. Average f-measures for positive and negative classes of the SemEval 2013 and 2014 test sets.

We repeated all previous experiments after using a Twitter dictionary, where we expand each tweet with the expressions corresponding to its emoticons and abbreviations. The results in Table 3 show that using this dictionary improves the f-measure over all the experiments; the best results are again obtained by combining Z_score and pre-polarity features.

Features           F-measure 2013   F-measure 2014
Terms              50.15            48.56
Terms+Z            57.17            58.37
Terms+POS          44.07            42.64
Terms+POL          54.72            54.53
Terms+Z+POS        53.20            56.47
Terms+Z+POL        59.66            61.07
Terms+POS+POL      48.97            51.90
Terms+Z+POS+POL    55.83            60.22

Table 3. Average f-measures for positive and negative classes of the SemEval 2013 and 2014 test sets after using a Twitter dictionary.
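A minimal sketch of this dictionary-based expansion; the toy emoticon/abbreviation mapping below is illustrative only, not the dictionary actually used:

# Sketch of the Twitter-dictionary expansion described above: each emoticon
# or abbreviation in a tweet is expanded with the expression it stands for.
TWITTER_DICT = {
    ":)": "very happy",
    ":(": "very sad",
    "gr8": "great",
    "omg": "oh my god",
}

def expand_tweet(tweet, dictionary=TWITTER_DICT):
    tokens = []
    for token in tweet.split():
        tokens.append(token)
        if token.lower() in dictionary:
            tokens.append(dictionary[token.lower()])
    return " ".join(tokens)

print(expand_tweet("omg this phone is gr8 :)"))
# -> "omg oh my god this phone is gr8 great :) very happy"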

5 Conclusion

In this paper we tested the impact of using a Twitter dictionary, sentiment lexicons, Z_score features and POS tags for the sentiment classification of tweets. We extended the feature vector of tweets with all these features; we proposed a new type of feature, the Z_score, and demonstrated that it can improve performance. We think that the Z_score can be used in different ways to improve Sentiment Analysis; we are going to test it on other types of corpora and with other methods of combining these features.



[Hamdan,  Béchet  &  Bellot,  SemEval  2014]

Run                Constrained   Unconstrained   Use Neut.?   Supervised?
NRC-Canada         69.02                         yes          yes
GU-MLT-LT          65.27                         yes          yes
teragram           64.86         64.86 (1)       yes          yes
BOUNCE             63.53                         yes          yes
KLUE               63.06                         yes          yes
AMI&ERIC           62.55         61.17 (3)       yes          yes/semi
FBM                61.17                         yes          yes
AVAYA              60.84         64.06 (2)       yes          yes/semi
SAIL               60.14         61.03 (4)       yes          yes
UT-DB              59.87                         yes          yes
FBK-irst           59.76                         yes          yes
nlp.cs.aueb.gr     58.91                         yes          yes
*UNITOR            58.27         59.50 (5)       yes          semi
LVIC-LIMSI         57.14                         yes          yes
Umigon             56.96                         yes          yes
NILC USP           56.31                         yes          yes
DataMining         55.52                         yes          semi
*ECNUCS            55.05         58.42 (6)       yes          yes
nlp.cs.aueb.gr     54.73                         yes          yes
ASVUniOfLeipzig    54.56                         yes          yes
SZTE-NLP           54.33         53.10 (9)       yes          yes
CodeX              53.89                         yes          yes
Oasis              53.84                         yes          yes
NTNU               53.23         50.71 (10)      yes          yes
UoM                51.81         45.07 (15)      yes          yes
SSA-UO             50.17                         yes          no
SenselyticTeam     50.10                         yes          yes
UMCC DLSI (SA)     49.27         48.99 (12)      yes          yes
bwbaugh            48.83         54.37 (8)       yes          yes/semi
senti.ue-en        47.24         47.85 (13)      yes          yes
SU-sentilab                      45.75 (14)      yes          yes
OPTWIMA            45.40         54.51 (7)       yes          yes
REACTION           45.01                         yes          yes
uottawa            42.51                         yes          yes
IITB               39.80                         yes          yes
IIRG               34.44                         yes          yes
sinai              16.28         49.26 (11)      yes          yes
Majority Baseline  29.19                         N/A          N/A

Table 9: Results for subtask B on the Twitter dataset. The * indicates a system submitted as constrained but which used additional Tweets or additional sentiment-annotated text to collect statistics that were then used as a feature.

These averages are much lower than those for subtask A, which indicates that subtask B is harder, probably because a message can contain parts expressing both positive and negative sentiment.

Run                Constrained   Unconstrained   Use Neut.?   Supervised?
NRC-Canada         68.46                         yes          yes
GU-MLT-LT          62.15                         yes          yes
KLUE               62.03                         yes          yes
AVAYA              60.00         59.47 (1)       yes          yes/semi
teragram                         59.10 (2)       yes          yes
NTNU               57.97         54.55 (6)       yes          yes
CodeX              56.70                         yes          yes
FBK-irst           54.87                         yes          yes
AMI&ERIC           53.63         52.62 (7)       yes          yes/semi
*ECNUCS            53.21         54.77 (5)       yes          yes
UT-DB              52.46                         yes          yes
SAIL               51.84         51.98 (8)       yes          yes
*UNITOR            51.22         48.88 (10)      yes          semi
SZTE-NLP           51.08         55.46 (3)       yes          yes
SenselyticTeam     51.07                         yes          yes
NILC USP           50.12                         yes          yes
REACTION           50.11                         yes          yes
SU-sentilab                      49.57 (9)       no           yes
nlp.cs.aueb.gr     49.41         55.28 (4)       yes          yes
LVIC-LIMSI         49.17                         yes          yes
FBM                47.40                         yes          yes
ASVUniOfLeipzig    46.50                         yes          yes
senti.ue-en        44.65         46.72 (12)      yes          yes
SSA UO             44.39                         yes          no
UMCC DLSI (SA)     43.39         40.67 (14)      yes          yes
UoM                42.22         35.22 (15)      yes          yes
OPTWIMA            40.98         47.15 (11)      yes          yes
uottawa            40.51                         yes          yes
bwbaugh            39.73         43.43 (13)      yes          yes/semi
IIRG               22.16                         yes          yes
Majority Baseline  19.03                         N/A          N/A

Table 10: Results for subtask B on the SMS dataset. The * indicates a system submitted as constrained but which used additional Tweets or additional sentiment-annotated text to collect statistics that were then used as a feature.

Once again, NRC-Canada had the best constrained system with an F1-measure of 69%, followed by teragram, which had the best unconstrained system with an F1-measure of 64.9%.

As Table 10 shows, the average F1-measure on the SMS test set was 50.2% for constrained and 50.3% for unconstrained systems. NRC-Canada had the best constrained system with an F1 of 68.5%, and AVAYA had the best unconstrained one with an F1-measure of 59.5%.


Best  official  2013  results

[Hamdan,  Bellot  &  Béchet,  SemEval  2014]

Page 31: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Subjectivity lexicon : MPQA

- The MPQA (Multi-Perspective Question Answering) Subjectivity Lexicon

31

http://mpqa.cs.pitt.edu
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. Proc. of HLT-EMNLP-2005.

Page 32: OpenEdition Lab projects in Text Mining

P. Bellot (AMU-CNRS, LSIS-OpenEdition) 32
http://wordnetweb.princeton.edu/perl/webwn
http://www.cs.rochester.edu/research/cisd/wordnet

Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet1

provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].

Language Definitions

We define the vocabulary of a language as a set W of pairs (f, s), where a form f is a string over a finite alphabet, and a sense s is an element from a given set of meanings. Forms can be utterances composed of a string of phonemes or inscriptions composed of a string of characters. Each form with a sense in a language is called a word in that language. A dictionary is an alphabetical list of words. A word that has more than one sense is polysemous; two words that share at least one sense in common are said to be synonymous.

A word's usage is the set C of linguistic contexts in which the word can be used. The syntax of the language partitions C into syntactic categories. Words that occur in the subset N are nouns, words that occur in the subset V are verbs, and so on. Within each category of syntactic contexts are further categories of semantic contexts: the set of contexts in which a particular f can be used to express a particular s.

The morphology of the language is defined in terms of a set M of relations between word forms. For example, the morphology of English is partitioned into inflectional, derivational, and compound morphological relations. Finally, the lexical semantics of the language is defined in terms of a set S of relations between word senses. The semantic relations into which a word enters determine the definition of that word.


Communications of the ACM, November 1995, Vol. 38, No. 11

WordNet: A Lexical Database for English
George A. Miller

This database links English nouns, verbs, adjectives, and adverbs to sets of synonyms that are in turn linked through semantic relations that determine word definitions.

1 WordNet is a registered trademark of Princeton University, available by anonymous ftp from clarity.princeton.edu

Page 33: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 33

http://sentiwordnet.isti.cnr.it

Page 34: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

Aspect Based Sentiment Analysis

— Dataset: 3K English sentences from restaurant reviews + 3K English sentences extracted from customer reviews of laptops, tagged by experienced human annotators

— We proposed : 1. Aspect term extraction: CRF model

2. Aspect Term Polarity Detection: Multinomial Naive-Bayes classifier with some features such as Z-score, POS and prior polarity extracted from Subjectivity Lexicon (Wilson, Wiebe et al. 2005) and Bing Liu's Opinion Lexicon

3. Category Detection & Category Polarity Detection : Z-score model


34

The Z_score of each term ti in a class Cj (tij) is computed from its term relative frequency tfrij in the class Cj, the mean (meani), which is the term probability over the whole corpus multiplied by nj, the number of terms in the class Cj, and the standard deviation (sdi) of term ti according to the underlying corpus (see Eq. (1, 2)).

Z\_score(t_{ij}) = \frac{tfr_{ij} - mean_i}{sd_i}    (Eq. 1)

Z\_score(t_{ij}) = \frac{tfr_{ij} - n_j \cdot P(t_i)}{\sqrt{n_j \cdot P(t_i)\,(1 - P(t_i))}}    (Eq. 2)

The Z_score was exploited for SA by (Zubaryeva and Savoy 2010): they chose a threshold (Z > 2) for selecting the terms having a Z_score above the threshold, and then used a logistic regression for combining these scores. We use the Z_score as added features for the multinomial Naive Bayes classifier.

3.4 Subtask 4: Category Polarity Detection

We have used Multinomial Naive Bayes as in subtask 2, step (2), with the same features, the difference being that we also add the name of the category as a feature. Thus, for each sentence having n categories we add n examples to the training set; the only difference between them is the category feature.
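A minimal sketch of this training-set construction, assuming sentences are stored with their gold category/polarity pairs (feature names and data layout are ours):

# Sketch of subtask 4 training-set construction: a sentence annotated with
# n categories yields n training examples, identical except for the added
# category feature.
def build_category_polarity_examples(sentences):
    """sentences: list of (tokens, {category: polarity}) pairs."""
    examples = []
    for tokens, category_polarities in sentences:
        base_features = {f"w={t}": 1 for t in tokens}
        for category, polarity in category_polarities.items():
            features = dict(base_features)
            features[f"category={category}"] = 1   # the only differing feature
            examples.append((features, polarity))
    return examples

sent = ("the food was great but way too expensive".split(),
        {"food": "positive", "price": "negative"})
for features, label in build_category_polarity_examples([sent]):
    print(label, sorted(k for k in features if k.startswith("category=")))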

4 Experiments and Evaluations

We tested our system using the training and testing data provided by the SemEval 2014 ABSA task. Two data sets were provided: the first contains 3K sentences of restaurant reviews annotated with the aspect terms, their polarities, their categories, and the polarities of each category; the second contains 3K sentences of laptop reviews annotated only with the aspect terms and their polarities. The evaluation process was done in two steps. The first step concerns subtasks 1 and 3, which involve aspect term extraction and category detection: we were provided with restaurant review and laptop review sentences and we had to extract the aspect terms for both data sets and the categories for the restaurant one. Baseline methods were provided; Table 1 shows the results of these subtasks in terms of precision P, recall R and f-measure F for our system and the baseline2.

2 http://alt.qcri.org/semeval2014/task4/data/uploads/baselinesystemdescription.pdf

We remark that our system is 24% and 21% above the baseline for aspect term extraction in restaurant and laptop reviews respectively, and 3% above for category detection in restaurant reviews.

Data   Subtask            P      R      F
Res    1    Baseline      0.52   0.42   0.47
            System        0.81   0.63   0.71
       3    Baseline      0.73   0.59   0.65
            System        0.77   0.60   0.68
Lap    1    Baseline      0.44   0.29   0.35
            System        0.76   0.45   0.56

Table 1. Results of subtasks 1 and 3 for restaurant reviews, subtask 1 for laptop reviews.

The second step involves the evaluation of subtasks 2 and 4. We were provided with (1) restaurant review sentences annotated with their aspect terms and categories, for which we had to determine the polarity of each aspect term and category, and (2) laptop review sentences annotated with aspect terms, for which we had to determine the aspect term polarity. Table 2 shows the results of our system and the baseline (A: accuracy, R: number of true retrieved examples, All: number of all true examples).

Data   Subtask            R      All    A
Res    2    Baseline      673    1134   0.64
            System        818    1134   0.72
       4    Baseline      673    1025   0.65
            System        739    1025   0.72
Lap    2    Baseline      336    654    0.51
            System        424    654    0.64

Table 2. Results of subtasks 2 and 4 for restaurant reviews, subtask 2 for laptop reviews.

We remark that our system is 8% and 13% above the baseline for aspect term polarity detection in restaurant and laptop reviews respectively, and 7% above for category polarity detection in restaurant reviews.

5 Conclusion

We have built a system for Aspect-Based Sentiment Analysis and proposed different supervised methods for the four subtasks. Our results are always above the baselines proposed by the organisers of SemEval. We proposed to use CRF for aspect term extraction, a Z-score model for category detection, and Multinomial Naive Bayes with some new features for polarity detection. We find that the use of the Z-score is useful for category and polarity detection; we are going to


Page 35: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

IR and Digital Libraries
Social Book Search

35

Page 36: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 36

INEX topics

Page 37: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition)

INEX 2014 Social Book Search Track

— In 2014, the Social Book Search Track consists of two tasks:

• Suggestion task: a system-oriented batch retrieval/recommendation task

• Interactive task: a user-oriented interactive task where we want to gather user data on searching for different search tasks and different search interfaces.

— 2.8 million book descriptions with metadata from Amazon and LibraryThing

— 14 million reviews (1.5 million books have no review)

— Amazon: formal metadata like booktitle, author, publisher, publication year, library classification codes, Amazon categories and similar product information, as well as user-generated content in the form of user ratings and reviews

— LibraryThing: user tags and user-provided metadata on awards, book characters and locations, and blurbs

37

https://inex.mmci.uni-saarland.de/tracks/books/

Page 38: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 38

Page 39: OpenEdition Lab projects in Text Mining

P.  Bellot  (AMU-­‐CNRS,  LSIS-­‐OpenEdition) 39

approach to retrieval. We started by using Wikipedia as an external source of information, since many books have their dedicated Wikipedia article [3]. We associate a Wikipedia article to each topic and we select the most informative words from the articles in order to expand the query. For our recommendation runs, we used the reviews and the ratings attributed to books by Amazon users. We computed a "social relevance" probability for each book, considering the amount of reviews and the ratings. This probability was then interpolated with scores obtained by Maximum Likelihood Estimates computed on whole Amazon pages, or only on reviews and titles, depending on the run.

The rest of the paper is organized as follows. The following section gives an insight into the document collection, whereas Section 3 describes our retrieval framework. Finally, we describe our runs in Section 4 and discuss some results in Sections 5 and 6.

2 The Amazon collection

The document collection used for this year's Book Track is composed of Amazon pages of existing books. These pages consist of editorial information such as ISBN number, title, number of pages, etc. However, in this collection the most important content resides in social data. Indeed, Amazon is social-oriented, and users can comment on and rate products they purchased or they own. Reviews are identified by the <review> fields and are unique for a single user: Amazon does not allow a forum-like discussion. Users can also assign tags of their own creation to a product. These tags are useful for refining the searches of other users in that they are not fixed: they reflect the trends for a specific product. In the XML documents, they can be found in the <tag> fields. Apart from this user classification, Amazon provides its own category labels that are contained in the <browseNode> fields.

Table 1. Some facts about the Amazon collection.

Number of pages (i.e. books)                      2,781,400
Number of reviews                                15,785,133
Number of pages that contain at least a review    1,915,336

3 Retrieval model

3.1 Sequential Dependence Model

Like the previous year, we used a language modeling approach to retrieval [4]. We use Metzler and Croft's Markov Random Field (MRF) model [5] to integrate multiword phrases in the query. Specifically, we use the Sequential Dependence

Run                                              nDCG@10   P@10     MRR      MAP
p4-inex2011SB.xml social.fb.10.50                0.3101    0.2071   0.4811   0.2283
p54-run4.all-topic-fields.reviews-split.combSUM  0.2991    0.1991   0.4731   0.1945
p4-inex2011SB.xml social                         0.2913    0.1910   0.4661   0.2115
p4-inex2011SB.xml full.fb.10.50                  0.2853    0.1858   0.4453   0.2051
p54-run2.all-topic-fields.all-doc-fields         0.2843    0.1910   0.4567   0.2035
p62.recommendation                               0.2710    0.1900   0.4250   0.1770
p54-run3.title.reviews-split.combSUM             0.2643    0.1858   0.4195   0.1661
p62.sdm-reviews-combine                          0.2618    0.1749   0.4361   0.1755
p62.baseline-sdm                                 0.2536    0.1697   0.3962   0.1815
p62.baseline-tags-browsenode                     0.2534    0.1687   0.3877   0.1884
p4-inex2011SB.xml full                           0.2523    0.1649   0.4062   0.1825
wiki-web-nyt-gw                                  0.2502    0.1673   0.4001   0.1857
p4-inex2011SB.xml amazon                         0.2411    0.1536   0.3939   0.1722
p62.sdm-wiki                                     0.1953    0.1332   0.3017   0.1404
p62.sdm-wiki-anchors                             0.1724    0.1199   0.2720   0.1253
p4-inex2011SB.xml lt                             0.1592    0.1052   0.2695   0.1199
p18.UPF QE group BTT02                           0.1531    0.0995   0.2478   0.1223
p18.UPF QE genregroup BTT02                      0.1327    0.0934   0.2283   0.1001
p18.UPF QEGr BTT02 RM                            0.1291    0.0872   0.2183   0.0973
p18.UPF base BTT02                               0.1281    0.0863   0.2135   0.1018
p18.UPF QE genre BTT02                           0.1214    0.0844   0.2089   0.0910
p18.UPF base BT02                                0.1202    0.0796   0.2039   0.1048
p54-run1.title.all-doc-fields                    0.1129    0.0801   0.1982   0.0868

Table 2. Official results of the Best Books for Social Search task of the INEX 2011 Book track, using judgements derived from the LibraryThing discussion groups. Our runs are identified by the p62 prefix and are in boldface.

We observe that our recommendation approach performs the best amongst our runs, while our two query expansion approaches with Wikipedia both fail. Our baseline-sdm run does not use any additional information except the user query (which is in fact the title of the corresponding LibraryThing thread), hence it is a good point of comparison for other runs that use, for example, social information. Although using an external encyclopedic resource like Wikipedia does not improve the initial query formulation, we see that a traditional pseudo-relevance feedback (PRF) approach achieved the best results overall this year. Indeed, the approach of the University of Amsterdam (p4) was to expand the query with 50 terms extracted from the top 10 results, either over a full index or over an index that only includes social content (such as reviews, tags and ratings). The latter performed the best with their PRF approach, which is coherent with the results of our recommendation run. Indeed, in this run we only consider the content of the user reviews, which corresponds to a limited version of the social index mentioned above. It also suggests that the baseline model is quite effective and selects relevant feedback documents, which is confirmed by the results computed with the Amazon Mechanical Turk judgements shown in Table 3.

Run                                        nDCG@10   P@10     MRR      MAP
p62.baseline-sdm                           0.6092    0.5875   0.7794   0.3896
p4-inex2011SB.xml amazon                   0.6055    0.5792   0.7940   0.3500
p62.baseline-tags-browsenode               0.6012    0.5708   0.7779   0.3996
p4-inex2011SB.xml full                     0.6011    0.5708   0.7798   0.3818
p4-inex2011SB.xml full.fb.10.50            0.5929    0.5500   0.8075   0.3898
p62.sdm-reviews-combine                    0.5654    0.5208   0.7584   0.2781
p4-inex2011SB.xml social                   0.5464    0.5167   0.7031   0.3486
p4-inex2011SB.xml social.fb.10.50          0.5425    0.5042   0.7210   0.3261
p54-run2.all-topic-fields.all-doc-fields   0.5415    0.4625   0.8535   0.3223

Table 3. Top runs of the Best Books for Social Search task of the INEX 2011 Book track, using judgements obtained by crowdsourcing (Amazon Mechanical Turk). Our runs are identified by the p62 prefix and are in boldface.

In this table we see that the baselines perform very well compared to the others, which confirms that a language-modeling-based system performs very well on this test collection. It is very good at retrieving relevant documents in the first ranks, which is an essential quality for a system that performs PRF. Hence a query expansion approach can be very effective on this dataset, but feedback documents must come from the target collection and not from an external resource. It is however important to note that these judgements come from people who often are not experts or who do not have the experience of good readers. Their assessments may then come from the suggestions of well-known search engines or directly from Amazon. This behavior could possibly explain the high performances of the baselines for the AMT judgements set.

To confirm this assessment, we tried to combine the four heterogeneous resources mentioned in Section 3.2 and we reported the results in Table 2 under the unofficial run identified by wiki-web-nyt-gw. Although the combination of multiple external resources does much better than using Wikipedia alone, it still does not beat our baseline. Hence we can safely affirm that reformulating the query using a wide range of external sources of knowledge does not work when the target collection is mainly composed of recommendation or opinion-oriented text.

The other part of our contribution lies in the social opinion that we took into account in our ranking function. Indeed, we are the only group that submitted runs that model the popularity and the likability of books based on user reviews and ratings. The Royal School of Library and Information Science's group (p54) tried in their early experiments to define a helpfulness score for each review, aiming to give more weight to a review found truthful, and also tried to weight book reviews according to their associated ratings. However, these experiments showed that this did not perform well compared to an approach where they sum the relevance scores of all the reviews for a given book. The two runs we submitted that make use of social information (recommendation and sdm-reviews-combine) can both be viewed as a re-ranking of the baseline, and both of them improve its

Model (SDM), which is a special case of the MRF. In this model three features are considered: single term features (standard unigram language model features, fT), exact phrase features (words appearing in sequence, fO) and unordered window features (which require words to be close together, but not necessarily in an exact sequence order, fU).

Documents are thus ranked according to the following scoring function:

score_{SDM}(Q, D) = \lambda_T \sum_{q \in Q} f_T(q, D) + \lambda_O \sum_{i=1}^{|Q|-1} f_O(q_i, q_{i+1}, D) + \lambda_U \sum_{i=1}^{|Q|-1} f_U(q_i, q_{i+1}, D)

where the feature weights are set according to the authors' recommendation (\lambda_T = 0.85, \lambda_O = 0.1, \lambda_U = 0.05). fT, fO and fU are the log maximum likelihood estimates of query terms in document D, computed over the target collection with Dirichlet smoothing.
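A minimal sketch of the SDM scoring function above; the toy f_T, f_O and f_U below are placeholders and not the Dirichlet-smoothed estimates an Indri index would provide:

# Sketch of the Sequential Dependence Model score: unigram, ordered-phrase
# and unordered-window features combined with lambda_T, lambda_O, lambda_U.
import math

def score_sdm(query_terms, doc, f_T, f_O, f_U,
              lambda_T=0.85, lambda_O=0.1, lambda_U=0.05):
    unigrams = sum(f_T(q, doc) for q in query_terms)
    ordered = sum(f_O(q1, q2, doc)
                  for q1, q2 in zip(query_terms, query_terms[1:]))
    unordered = sum(f_U(q1, q2, doc)
                    for q1, q2 in zip(query_terms, query_terms[1:]))
    return lambda_T * unigrams + lambda_O * ordered + lambda_U * unordered

# Toy feature functions over a document given as a list of tokens (add-one
# smoothed counts standing in for Dirichlet-smoothed estimates).
def f_T(q, doc):
    return math.log((doc.count(q) + 1) / (len(doc) + 1))

def f_O(q1, q2, doc):
    bigrams = list(zip(doc, doc[1:]))
    return math.log((bigrams.count((q1, q2)) + 1) / (len(doc) + 1))

def f_U(q1, q2, doc, window=8):
    hits = sum(1 for i, t in enumerate(doc)
               if t == q1 and q2 in doc[i + 1:i + window])
    return math.log((hits + 1) / (len(doc) + 1))

doc = "a short biography of robert schumann".split()
print(score_sdm(["schumann", "biography"], doc, f_T, f_O, f_U))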

3.2 External resources combination

As previously done last year, we exploited external resources in a Pseudo-Relevance Feedback (PRF) fashion to expand the query with informative terms. Given a resource R, we form a subset RQ of informative documents considering the initial query Q using pseudo-relevance feedback. To this end we first rank the documents of R using the SDM ranking function. An entropy measure H_{R_Q}(t) is then computed for each term t over RQ in order to weigh the terms according to their relative informativeness:

H_{R_Q}(t) = - \sum_{w \in t} p(w \mid R_Q) \cdot \log p(w \mid R_Q)

These external weighted terms are finally used to expand the original query. The ranking function of documents over the target collection C is then defined as follows:

score(Q, D) = score_{SDM}(Q, D) + \frac{1}{|S|} \sum_{R_Q \in S} \sum_{t \in R_Q} H_{R_Q}(t) \cdot f_T(t, D)

where S is the set of external resources. For our official experiments with the Book Track we only considered Wikipedia as an external resource, but we also conducted unofficial experiments on the Tweet Contextualization track after the workshop. In order to extract a comprehensive context from a tweet, we used a larger set S of resources. It is composed of four general resources: Wikipedia as an encyclopedic source, the New
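A minimal sketch of the entropy weighting H_{R_Q}(t) above, assuming the feedback set R_Q is given as a list of tokenised documents (how p(w|R_Q) is estimated here is our assumption):

# Sketch of the entropy weight H_{R_Q}(t) = -sum_{w in t} p(w|R_Q) log p(w|R_Q)
# computed over a pseudo-relevance feedback set R_Q; terms may be single
# words or short phrases.
import math
from collections import Counter

def entropy_weights(feedback_docs, terms):
    counts = Counter(w for doc in feedback_docs for w in doc)
    total = sum(counts.values())
    weights = {}
    for term in terms:
        h = 0.0
        for w in term.split():
            p = counts[w] / total
            if p > 0:
                h -= p * math.log(p)
        weights[term] = h
    return weights

feedback_docs = [d.split() for d in
                 ["robert schumann was a german composer",
                  "schumann wrote piano music and symphonies"]]
print(entropy_weights(feedback_docs, ["schumann", "piano music"]))

These weights would then multiply f_T(t, D) in the expanded ranking function above.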

Sequential Dependence Model (SDM) - Markov Random Field (Metzler & Croft, 2004)

We use our SDM baseline defined in Section 3.1 and incorporate the above recommendation estimate:

score_{recomm}(Q, D) = \lambda_D \cdot score_{SDM}(Q, D) + (1 - \lambda_D) \cdot t_D

where the \lambda_D parameter was set based on observations over the test topics made available to participants for training purposes. Indeed, we observed on these topics that t_D had no influence on the ranking of documents after the hundredth result (average estimate). Hence we fix the smoothing parameter to:

\lambda_D = \frac{\max_D \mathrm{score}_{SDM}(Q,D) - \mathrm{score}_{SDM}(Q,D)_{100}}{N_{Results}}

where score_SDM(Q,D)_100 denotes the SDM score of the document at rank 100 and N_Results the number of returned results.

In practice, this approach is a re-ranking of the results of the SDM retrieval model based on the popularity and the likability of the different books.
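A minimal sketch of this re-ranking, assuming the SDM scores and the t_D estimates have already been computed for each candidate book:

from typing import Dict, List, Tuple

def rerank_with_recommendation(sdm_scores: Dict[str, float],
                               t_scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Re-rank the SDM result list with the recommendation estimate t_D.

    sdm_scores maps a book identifier to score_SDM(Q, D); t_scores maps it to
    the Welch's t statistic t_D of Section 3.4. lambda_D is derived, as above,
    from the gap between the best SDM score and the score at rank 100, divided
    by the number of results.
    """
    if not sdm_scores:
        return []
    ranked = sorted(sdm_scores.items(), key=lambda kv: kv[1], reverse=True)
    n_results = len(ranked)
    score_at_100 = ranked[min(99, n_results - 1)][1]
    lambda_d = (ranked[0][1] - score_at_100) / n_results
    combined = {doc: lambda_d * s + (1.0 - lambda_d) * t_scores.get(doc, 0.0)
                for doc, s in sdm_scores.items()}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)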

4 Runs

This year we submitted 6 runs for the Social Search for Best Books task only. We used Indri³ for indexing and searching. We did not remove any stopwords and used the standard Krovetz stemmer.

baseline-sdm This run is the implementation of the SDM model described in Section 3.1. We use it as a strong baseline.

baseline-tags-browsenode This is an attempt to produce an improved baseline that uses the Amazon classification as well as user tags. We search all single query terms in the specific XML fields (<tag> and <browseNode>). This part is then combined with the SDM model, which is weighted four times more than the "tag searching" part. We set these weights empirically after observations on the test topics. The Indri syntax for the query schumann biography would typically be:

#weight (
  0.2 #combine ( #1(schumann).tag #1(biography).tag
                 #1(schumann).browseNode #1(biography).browseNode )
  0.8 #weight ( 0.85 #combine( Schumann Biography )
                0.1 #combine( #1(schumann biography) )
                0.05 #combine( #uw8(schumann biography) ) )
)

³ http://www.lemurproject.org
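For illustration, here is a small, purely illustrative Python helper (not used in the official runs) that assembles this weighted Indri query from a list of query terms:

def tags_browsenode_query(terms):
    """Build the weighted Indri query of the baseline-tags-browsenode run:
    a 'tag searching' part restricted to the <tag> and <browseNode> fields
    (weight 0.2), combined with an SDM part weighted four times more (0.8)."""
    field_part = " ".join(f"#1({t}).{field}"
                          for field in ("tag", "browseNode")
                          for t in terms)
    phrase = " ".join(terms)
    sdm_part = (f"0.85 #combine( {phrase} ) "
                f"0.1 #combine( #1({phrase}) ) "
                f"0.05 #combine( #uw8({phrase}) )")
    return f"#weight ( 0.2 #combine ( {field_part} ) 0.8 #weight ( {sdm_part} ) )"

# tags_browsenode_query(["schumann", "biography"]) yields the query shown
# above, modulo whitespace and letter case.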

We can iterate and construct a directed graph of Wikipedia articles linked together. Children node pages (or sub-articles) are weighted half that of their parents in order to minimize a potential topic drift. We avoid loops in the graph (i.e. a child node cannot be linked to one of its ancestors) because they bring no additional information and could also change the weights between linked articles. Informative words are then extracted from the sub-articles and incorporated into our retrieval model as another external resource.
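A possible sketch of this construction in Python, assuming a hypothetical linked_articles mapping from an article title to the titles it links to, and using networkx for the directed graph; the maximum depth is an assumption, since the text only says the process can be iterated:

import networkx as nx  # assumed available for the directed graph structure
from typing import Dict, List

def build_subarticle_graph(root: str,
                           linked_articles: Dict[str, List[str]],
                           max_depth: int = 2) -> nx.DiGraph:
    """Build a directed graph of Wikipedia (sub-)articles where each child
    carries half the weight of its parent and no article is visited twice,
    which in particular prevents linking back to an ancestor."""
    graph = nx.DiGraph()
    graph.add_node(root)

    def expand(article: str, weight: float, depth: int) -> None:
        if depth >= max_depth:
            return
        for child in linked_articles.get(article, []):
            if child in graph:      # already seen: skip to keep the graph acyclic
                continue
            graph.add_edge(article, child, weight=weight / 2)
            expand(child, weight / 2, depth + 1)

    expand(root, 1.0, 0)
    return graph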

3.4 Social opinion for book search

The test collection used this year for the Book Track contains Amazon pages of books. These pages are composed, amongst others, of editorial information, such as the number of pages or the blurb, user ratings and user reviews. However, contrary to the previous years, the actual content of the books is not available. Hence, the task is to rank books according to the sparse informative content and the opinion of readers expressed in the reviews, considering that the user ratings are integers between 1 and 5.

Here, we wanted to model two social popularity assumptions: a product that has a lot of reviews must be relevant (or at least popular), and a highly rated product must be relevant. Then, a product having a large number of good reviews really must be relevant. However, in the collection there is often a small number of ratings for a given book. The challenge was to determine whether each user rating is significant or not. To do so, we first define X^D_R, a random set of "bad" ratings (1, 2 or 3 out of 5 points) for book D. Then, we evaluate the statistically significant differences between X^D_R and X^D_R ∪ X^D_U using Welch's t-test, where X^D_U is the actual set of user ratings for book D. The statistical test is computed by:

t_D = \frac{\overline{X^D_R \cup X^D_U} - \overline{X^D_U}}{s_{\overline{X^D_R \cup X^D_U} - \overline{X^D_U}}}

where

s_{\overline{X^D_R \cup X^D_U} - \overline{X^D_U}} = \sqrt{\frac{s^2_{RU}}{n_{RU}} + \frac{s^2_U}{n_U}}

where s^2 is the unbiased estimator of the variance of the two sets and n_X is the number of ratings for set X.

The underlying assumption is that significant differences occur in two different situations. First, when there is a small number of user ratings (X^D_U) but they all are very good; for example, this is the case of good but little-known books. Second, when there is a very large number of user ratings but they are average. Hence this statistical test gives us a single estimate of both likability and popularity.
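A minimal Python sketch of this estimate, using scipy's Welch's t-test (ttest_ind with equal_var=False); the size of the random set of "bad" ratings is an assumption not specified above, and the sign convention follows the formula.

import random
from typing import Sequence
from scipy import stats  # ttest_ind with equal_var=False is Welch's t-test

def recommendation_estimate(user_ratings: Sequence[int],
                            n_random: int = 30,
                            seed: int = 0) -> float:
    """Compute t_D: the Welch's t statistic between X_R ∪ X_U (random 'bad'
    ratings plus the observed ratings) and X_U (the observed ratings alone)."""
    rng = random.Random(seed)
    bad_ratings = [rng.choice([1, 2, 3]) for _ in range(n_random)]   # X_R
    combined = bad_ratings + list(user_ratings)                      # X_R ∪ X_U
    t_stat, _ = stats.ttest_ind(combined, user_ratings, equal_var=False)
    return float(t_stat)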

Statistical test between the observed ratings and random ratings

Is a given rating significant?

ANR CAAS project

Page 40: OpenEdition Lab projects in Text Mining


Query Expansion: with Concepts from DBpedia


Page 41: OpenEdition Lab projects in Text Mining


Terms only vs. Extended Features

— We modeled book likeliness based on the following idea: the more reviews a book has, the more interesting it is (it may not be a good or popular book, but it is a book that has a high impact)

— The InL2 information retrieval model alone (a Divergence From Randomness (DFR) model) seems to perform better than SDM (language modeling) with extended features


Benkoussas,  Hamdan,  Albitar,  Ollagnier  &  Bellot,  2014