
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 627–637, Austin, Texas, November 1-5, 2016. © 2016 Association for Computational Linguistics

Extracting Aspect Specific Opinion Expressions

Abhishek Laddha
IIT Delhi∗
New Delhi, India, 110016
[email protected]

Arjun Mukherjee
Department of Computer Science, University of Houston, TX, USA
[email protected]

∗ Research performed during the author's internship at the University of Houston.

Abstract

Opinionated expression extraction is a central problem in fine-grained sentiment analysis. Most existing works focus on either generic subjective expression or aspect expression extraction. However, in opinion mining, it is often desirable to mine the aspect specific opinion expressions (or aspect-sentiment phrases) containing both the aspect and the opinion. This paper proposes a hybrid generative-discriminative framework for extracting such expressions. The hybrid model consists of (i) an unsupervised generative component for modeling the semantic coherence of terms (words/phrases) based on their collocations across different documents, and (ii) a supervised discriminative sequence modeling component for opinion phrase extraction. Experimental results using Amazon.com reviews demonstrate the effectiveness of the approach, which significantly outperforms several state-of-the-art baselines.

1 Introduction

Aspect based sentiment analysis is one of the main frameworks in opinion mining (Liu and Zhang, 2012). Most websites display only the aggregated ratings of products, but people are more interested in fine-grained opinions that capture aspect specific properties in reviews. Therefore, it is desirable to have a holistic approach that mines aspect specific opinion expressions containing both aspect and opinion terms within the sentence context as a composite aspect-sentiment phrase (e.g., "had to flash firmware everyday", "clear directions in voice", etc.) and further groups them under coherent aspect categories. Apart from revealing the key issues in products, which are often expressed via aspect-sentiment phrases, such expressions are also useful in applications such as comparing similar products and summarizing their important features, where it is more convenient to have the aspect-sentiment phrases rather than generic aspect/sentiment words lacking the natural aspect-opinion correspondence in the right context. They can also be applied to various tasks such as sentiment classification, comparative aspect evaluations, aspect rating prediction, etc.

The thread of research in (Brody and Elhadad, 2010; Titov and McDonald, 2008; Zhao et al., 2010; Mei et al., 2007; Jo and Oh, 2011) focuses on extracting and grouping aspect and opinion words via generative models but lacks the natural aspect-opinion correspondence (e.g., in the manner they appear in sentences). (Wang et al., 2016; Fei et al., 2016) can discover aspect specific opinion unigrams but do not focus on phrases. The thread on fine-grained opinion expressions (Wiebe et al., 2005; Choi et al., 2006; Breck et al., 2007) focuses on subjective expression extraction, which is generic instead of aspect specific. Formally, the task can be stated as follows:

Given a set of reviews, for each sentence $s = (w_1, \ldots, w_n)$ with head aspect (HA) $w_{HA=i}$, $i \in [1, n]$, discover a sub-sequence $(w_p, \ldots, w_q)$, where $p \le i \le q$, that best describes the aspect-sentiment phrase containing the head aspect. The head aspect refers to the word describing


the fine-grained property of the product. Further, we group these phrases under relevant aspect categories. The examples below show labeled aspect-sentiment phrases within [[ ]] with the head aspect (HA) italicized:

• I've been very happy with it so far, done a [[firmware update without a hitch]].
• After less than two years, the [[signal became spotty]].

In this paper, we propose a novel hybrid model to solve the problem. We call this the Phrase Sentiment Model (PSM). PSM is capable of extracting a myriad of expression types covering: verb phrases ("screen has poor viewability"); noun, adjective, or adverbial phrases ("recurrent black screen of death", "quite stable and fast connection"); implied positive ("voice activated directions"); implied negative ("requires reboot every few hours"); etc. The hybrid framework facilitates holistic modeling that caters for varied expression types (leveraging its discriminative sequence model) and also groups them under relevant aspect categories with context (exploiting its generative framework). Our approach is also context and polarity independent, facilitating generic aspect-sentiment phrase extraction in any domain.

Further, we propose a novel sampling scheme based on Generalized Polya urn models that optimizes phrasal collocations to improve coherence. To the best of our knowledge, a hybrid framework has not been attempted before for opinion phrase extraction. Additionally, this work produced a labeled dataset of aspect specific opinion phrases across 4 domains containing more than 5200 sentences coded with phrase boundaries across both positive and negative polarities, which will be released to serve as a language resource. Experimental evaluation shows that our approach outperformed the baselines by a large margin.

2 Related Work

Subjective expression extraction (Choi et al., 2005) has traditionally used sequence models (e.g., CRFs). Various parsing, syntactic, lexical, and dictionary based features (Kim and Hovy, 2006; Jakob and Gurevych, 2010; Kobayashi et al., 2007) have been used for subjective expression extraction. In (Yang and Cardie, 2013; Johansson and Moschitti, 2011) dependency relations were also used for opinion expression extraction. Sauper et al. (2011) employ an HMM over words and model the latent topics as states in the HMM to discover product properties (often aspects) and their associated attribute (pos/neg) polarities separately. In Yang and Cardie (2012) a semi-CRF based approach is used, which allows sequence labeling at the segment level, and Yang and Cardie (2014) employed semi-CRFs for opinion expression intensity and polarity classification. However, all the above works focus on generic subjective expressions as opposed to aspect specific opinion-sentiment phrases.

In (Choi et al., 2006; Yang and Cardie, 2013) joint models were proposed for identifying opinion holders and expressions, and the relations among them, in news articles. In (Johansson and Moschitti, 2011) a re-ranking approach was used on the output of a sequence model to improve opinion expression extraction. In (Li et al., 2015; Mukherjee, 2016) subjective expressions implying a negative opinion were discovered using sequence models and Markov networks, while in (Berend, 2011) supervised keyphrase extraction was used for phrase extraction. These works mostly relied on word level features under the first-order Markov assumption. The above works are tailored for expression extraction only and do not group coherent phrases under relevant aspect categories.

Another thread of research involves topical phrase mining. Wang et al. (2007) propose a Topical n-gram model (TNG) that mines phrases based on statistical collocation. Lindsey et al. (2012) employ a hierarchical Pitman-Yor process to model phrases. In Fei et al. (2014), a Generalized Polya urn model (LDA-P-GPU) was used to group the candidate noun phrases. In (El-Kishky et al., 2014; Liu et al., 2015), frequency based information was used for mining phrases, which works well for generic phrases but cannot model relevant yet longer phrases due to their infrequency. Thus, they are unable to capture long phrases containing both aspect and opinion. The models TNG and LDA-P-GPU are closest to our task as they can discover relevant aspect expressions that can contain opinions, and they are considered as baselines.

Next, there are works that generate phrasal datasets.


Figure 1: Plate Notation of PSM

In SemEval 2015, the Aspect Based Sentiment Analysis Task (Pontiki et al., 2015) produced a dataset that had annotations for aspect phrases. The focus was on aspect phrases as opposed to aspect-sentiment phrases. The MPQA 2.0 corpus (Wiebe et al., 2005) has some labeled opinion expressions, but they are generic subjective expressions as opposed to the aspect-sentiment phrases (see Section 1) we find in reviews.

Zhao et al. (2011) extract topical phrases in tweets using relevance and interestingness. Wu et al. (2009) proposed a phrase dependency parsing approach to extract product features (aspect expressions) and opinion expressions and the relations between them. They considered all noun and verb phrases (NPs, VPs) as product features and their surrounding dictionary opinion words as opinions. Features were constructed using the phrase dependency tree to extract relations among all product features and opinions, which were later used in aspect and opinion expression extraction. Although Wu et al. (2009) do not discover aspect specific opinion phrases, their use of NPs in extracting candidate opinion phrases is similar to Fei et al. (2014), which is considered as a baseline.

3 Phrase Sentiment Model (PSM)

PSM is a hybrid between generative and discriminative modeling that combines the best of both worlds. Its generative modeling lays the foundation for the emission of aspects and aspect specific opinion phrases in documents, while its discriminative sequence modeling component (via an embedded CRF) facilitates aspect specific opinion phrase extraction.

As noted in Titov and McDonald (2008), modeling entire reviews as documents tends to capture the global properties of a product (e.g., brand, name, etc.), resulting in rather overlapping aspects. To avoid this, we perform sentence level modeling, which helps improve aspect sharpness. A review sentence $s_d$ of $N$ words is denoted as $w_{d,s,j} = \{w_{d,s,1}, w_{d,s,2}, \ldots, w_{d,s,N}\}$, where each $w_{d,s,j}$ is one of the $V$ words in the vocabulary. There is an exponential number of possible phrase sequences for each sentence $s_d$, i.e., $\langle w_{d,s,j} \rangle_{j=st}^{j=st+len-1}$ of arbitrary length $len$ ($len = 1$ for words; $len > 1$ for phrases) starting at an index $st \in [0, |s_d|]$. We observed that most opinion expressions are centered around the head aspect, thereby causing the space of potential opinion expressions to be quite sparse. We take advantage of this observation and train a sequence model (e.g., a CRF) for phrase sequence tagging, as described in the next subsection. We generate the $M = 5$ best sequence labelings of each sentence $s_d$ ($s_d^{m \in 1 \ldots M}$) via forward Viterbi and backward A* search¹. Hence, our vocabulary is the union of the unigrams and the n-grams discovered by the CRF over the $M$ best sequence labelings, i.e., the model's vocabulary is $V = \{w_{d,s,j}\} \cup \{\langle w_{d,s_d^m,j} \rangle_{j=st}^{j=st+len-1}\}\ \forall d, s, j, len, st, m$.

In PSM, for each aspect $a$, we model its aspect specific term (word/phrase) distribution and its aspect background word distribution using multinomials $\varphi^A_a$ and $\varphi^B_a$, drawn from $\mathrm{Dir}(\beta)$ over the vocabulary $v_{1 \ldots V}$. For each domain $d$, we first draw a domain specific aspect distribution $\theta_d \sim \mathrm{Dir}(\alpha)$. Next, for each review sentence (document) $s_d$ of a domain $d$, we draw an aspect $z_{d,s} \sim \mathrm{Mult}(\theta_d)$. We assume that each sentence evaluates one aspect, which mostly holds true in the review domain. We associate a latent switch variable with each word, $r_{d,s,j}$, and with each phrase, $\langle r_{d,s_d^m,j} \rangle_{j=st}^{j=st+len-1}$, of the vocabulary, where each switch variable $r \in \{0, 1\}$. To generate each term $\langle w_{d,s_d^m,j} \rangle_{j=st}^{j=st+len-1}$ of the labeled sequence $s_d^{m \in \{1 \ldots M\}}$, we first set the switch variables for the sentence $s_d$ via the discriminative CRF model, i.e., $\langle r_{d,s_d^m,j} \rangle_{j=1}^{j=|s_d|} \leftarrow \frac{1}{Z} \exp\big(\sum_j \sum_k \lambda_k f_k(r_{j-1}, r_j, w_j)\big)$, by fitting a previously trained CRF model. The switch variables $r \in \{1, 0\}$ for a particular tagging $s_d^m$ of sentence $s_d$ span all its words and take the value $r = 1$ for words that are part of an aspect specific opinion phrase or $r = 0$ for aspect background words, upon observing all words in $s_d$. Finally, depending upon the aspect $z_{d,s}$ and the switch variable $r_{d,s,j}$, we emit (unigram) terms in the sentence as follows:

$$w_{d,s,j} \sim \begin{cases} \mathrm{Mult}(\varphi^A_{z_{d,s}}) & \text{if } r_{d,s,j} = 1 \\ \mathrm{Mult}(\varphi^B_{z_{d,s}}) & \text{if } r_{d,s,j} = 0 \end{cases} \quad (1)$$

and for phrasal terms (i.e., when $\langle r_{d,s_d^m,j} \rangle_{j=st}^{j=st+len-1} = 1$ for all valid $st$ and $len$), we emit $\langle w_{d,s_d^m,j} \rangle_{j=st}^{j=st+len-1} \sim \mathrm{Mult}(\varphi^A_{z_{d,s}})$.

¹ The value of M was tuned via pilot experiments using the CRF++ toolkit (Kudo, 2009).
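The generative story can be summarized by a short forward-simulation sketch; the sizes, priors, and the stubbed switch variables are toy assumptions, not the fitted model:

```python
import numpy as np

# Illustrative forward simulation of PSM's generative story (unigram case of Eq. 1).
# All sizes and distributions are toy assumptions, not the fitted model.
rng = np.random.default_rng(0)
A, V, alpha, beta = 10, 500, 5.0, 0.1         # aspects, vocabulary size, priors

theta_d = rng.dirichlet([alpha] * A)          # domain-specific aspect distribution
phi_A = rng.dirichlet([beta] * V, size=A)     # aspect-specific opinion term dists
phi_B = rng.dirichlet([beta] * V, size=A)     # aspect background word dists

z = rng.choice(A, p=theta_d)                  # one aspect per review sentence
r = np.array([1, 1, 0, 1, 0])                 # switch variables from the CRF (stub)
words = [rng.choice(V, p=phi_A[z] if rj == 1 else phi_B[z]) for rj in r]
print(z, words)
```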

3.1 Inference

We employ MCMC Gibbs sampling for posterior inference. As the latent variables $z$ and $r$ belong to different levels, we hierarchically sample $z$ and then $r$ for each sweep of a Gibbs iteration as follows:

$$p(z_{d,s} = a \mid Z^{\neg d,s}, R^{\neg d,s}, W^{\neg d,s}) \propto \frac{(n^s_{d,a})^{\neg d,s} + \alpha}{(n^s_{d,(\cdot)})^{\neg d,s} + A\alpha} \times \left[ \left( \prod_{v=1}^{V} \frac{\Gamma(n^A_{a,v} + \beta)}{\Gamma((n^A_{a,v})^{\neg d,s} + \beta)} \right) \Big/ \left( \frac{\Gamma(n^A_{a,(\cdot)} + V\beta)}{\Gamma((n^A_{a,(\cdot)})^{\neg d,s} + V\beta)} \right) \right] \times \left[ \left( \prod_{v=1}^{V} \frac{\Gamma(n^B_{a,v} + \beta)}{\Gamma((n^B_{a,v})^{\neg d,s} + \beta)} \right) \Big/ \left( \frac{\Gamma(n^B_{a,(\cdot)} + V\beta)}{\Gamma((n^B_{a,(\cdot)})^{\neg d,s} + V\beta)} \right) \right] \quad (2)$$

Samplers for $r$ consist of three cases: (i) individual aspect-specific opinion words, (ii) individual background words, and (iii) phrasal opinions:

$$p(r_{d,s,j} = 1 \mid z_{d,s} = a, w_{d,s,j} = v, \ldots) \propto \frac{(n^A_{a,v})^{\neg d,s,j} + \beta}{(n^A_{a,(\cdot)})^{\neg d,s,j} + V\beta} \times p_{CRF}(r_{d,s,j-1}, r_{d,s,j} = 1 \mid v) \quad (3)$$

$$p(r_{d,s,j} = 0 \mid z_{d,s} = a, w_{d,s,j} = v, \ldots) \propto \frac{(n^B_{a,v})^{\neg d,s,j} + \beta}{(n^B_{a,(\cdot)})^{\neg d,s,j} + V\beta} \times p_{CRF}(r_{d,s,j-1}, r_{d,s,j} = 0 \mid v) \quad (4)$$

$$p(\langle r_{d,s,j} = 1 \rangle_{j=st}^{j=st+len-1} \mid z_{d,s} = a, \langle w_{d,s,j} = v' \rangle_{j=st}^{j=st+len-1}, \ldots) \propto \frac{(n^A_{a,v'})^{\neg d,s,j} + \beta}{(n^A_{a,(\cdot)})^{\neg d,s,j} + V\beta} \times p_{CRF}(r_{d,s,st-1}, r_{d,s,st} = 1, r_{d,s,st+1} = 1, \ldots, r_{d,s,st+len-1} = 1 \mid v') \quad (5)$$

where $n^s_{d,a}$ denotes the number of sentences in domain $d$ assigned to aspect $a$, and $n^A_{a,v}$, $n^B_{a,v}$ denote the number of times term $v$ was assigned to aspect $a$ in the aspect specific opinion and aspect specific background language models respectively. A count variable with subscript $(\cdot)$ signifies the marginalized sum over the latter index, and $\neg$ denotes the discounted counts. The sampler in (5) computes the likelihood of a sequence of contiguous terms, $\langle r_{d,s,j} = 1 \rangle_{j=st}^{j=st+len-1}$, forming an aspect ($z_{d,s} = a$) specific opinion phrase $v'$ starting at index $j = st$ and of length $len$, where $v' = \langle w_{d,s,j} \rangle_{j=st}^{j=st+len-1}$. The sequence probabilities $p_{CRF}$ in equations (3, 4, 5) can be obtained as follows.

Let $w = (w_{t=1}, \ldots, w_{t=T})$ denote the sequence of observed words in a sentence, and let each observation $w_t$ have a label $y_t \in Y$ indicating whether $w_t$ is part of an aspect specific opinion phrase, where $Y = \{1, 0\}$. We consider a first order Markov linear-chain CRF in our hybrid model. We define the Markovian transition and forward-backward variables of our embedded CRF as follows:

$$\psi_t(j, i, w) = p(y_t = j \mid y_{t-1} = i)\, p(w_t = w \mid y_t = j) \quad (6)$$

$$\alpha_t(j) = \textstyle\sum_{i \in Y} \psi_t(j, i, w_t)\, \alpha_{t-1}(i) \quad (7)$$

$$\beta_t(i) = \textstyle\sum_{j \in Y} \psi_{t+1}(j, i, w_{t+1})\, \beta_{t+1}(j) \quad (8)$$

where $\alpha_1(j) = \psi_1(j, y_0, w_1)$ and $\beta_T(i) = 1$. This lays the foundation for expressing the sequence probabilities $p_{CRF}$ in closed form as follows:

$$p_{CRF}(y_{t-1}, y_t \mid w) \propto \alpha_{t-1}(y_{t-1})\, \psi_t(y_t, y_{t-1}, w_t)\, \beta_t(y_t) \quad (9)$$

$$p_{CRF}(y_{t-2}, y_{t-1}, y_t \mid w) \propto \alpha_{t-2}(y_{t-2})\, \psi_{t-1}(y_{t-1}, y_{t-2}, w_{t-1})\, \psi_t(y_t, y_{t-1}, w_t)\, \beta_t(y_t) \quad (10)$$

Eq. (9) is used for computing the sequence probabilities for individual opinion/background words in the samplers of Eqs. (3, 4), while Eq. (10) and its extensions are used for computing the sequence probabilities in the phrase sampler of Eq. (5). The values of $\psi_t$, $\alpha_t$, $\beta_t$ are obtained from a previously trained CRF model upon fitting it to the current sentence $s_d$ for which sampling is being performed.

3.2 Embedded CRF Training

We employ linear-chain CRFs (Lafferty et al., 2001) for modeling phrases. While word (W) and Part-Of-Speech (POS) tag features are effective in various sequence modeling tasks (Yang and Cardie, 2014; Yang and Cardie, 2012), in our problem context, (W+POS) features are insufficient as they do not consider the head aspect (HA) and its relevant positional/contextual features, i.e., how do different POS tags, syntactic units (chunks), and polar sentiments appear in proximity to the head aspect? Hence, centering on the HA, we propose a set of pivot features to model context.
Pivot Features: We consider five feature families:
POS Tags (T): DT, IN, JJ, MD, NN, RB, VB, etc.
Phrase Chunks (C): ADJP, ADVP, NP, PP, VP, etc.
Prefixes (P): anti, in, mis, non, pre, sub, un, etc.


Category | Feature Template | Example of feature appearing in a sentence
1st order features: $W_{i+j}$; $-4 \le j \le 4$; $W \in \{T, C, P, S, SP\}$ | $SP_{i+j}$ | $SP_{i-1} = NEG$; the previous term of the HA is of NEG polarity: ". . . have this terrible voice on the . . ."
 | $S_{i+j}$ | $S_{i-2} = ing$; the suffix of the 2nd previous term of the head aspect is "ing": ". . . kept dropping the signal . . ."
 | . . . | . . .
2nd order features: $W_{i+j}, Y_{i+j'}$; $-4 \le j \le 4$; $W, Y \in \{T, C, P, S, SP\}$ | $T_{i+j}, T_{i+j'}$ | $T_{i-2} = JJ$, $T_{i-1} = VBZ$: ". . . frequently drops connection . . ."
 | $T_{i+j}, C_{i+j'}$ | $T_{i+2} = RB$, $C_{i+3} = ADJP$: ". . . screen clarity is good . . ."
 | . . . | . . .
3rd order features: $W_{i+j}, Y_{i+j'}, Z_{i+j''}$; $-4 \le j \le 4$; $W, Y, Z \in \{T, C, P, S, SP\}$ | $T_{i+j}, S_{i+j'}, T_{i+j''}$ | $T_{i+2} = JJ$, $P_{i+4} = un$, $T_{i+4} = JJ$: ". . . screen is blank and unresponsive . . ."
 | . . . | . . .

Table 1: Pivot templates. Subscript $i$ denotes the index of the head aspect, HA (italicized). Subscript $j$ denotes the index relative to $i$.

Suffixes (S): able, est, ful, ic, ing, ive, ness, etc.
Word Sentiment Polarity (SP): POS, NEG, NEU

Pivoting on the head aspect, we look forward and backward to generate a family of binary features defined by a specific template (see Table 1). Each template generates several features that capture various positional contexts around the HA. Additionally, we consider up to 3rd order pivot features, allowing us to model various dependencies as features. For polarity, we used the opinion lexicon² derived from (Hu and Liu, 2004).

Feature Templates: Table 1 details the templates for features pivoting on the head aspect. Various features from these templates, coupled with the value of the current sequence tag $y_t$ or a combination of the current and previous labels $y_t, y_{t-1}$, serve as our linear chain features (LCF), $f(y_{i-1}, y_i, w)$. Further, the index $i$ for an LCF can refer to any word in the sentence and not necessarily the head aspect, yielding a very rich feature space.

Learning the CRF λs: Given a set of training examples $\{w^{(i)}, y^{(i)}\}$ where $y^{(i)}$ are the correct sequence tags, we estimate the CRF parameters $\Lambda = \{\lambda_k\}$ by minimizing the negative log-likelihood (NLL):

$$\Lambda = \arg\min_\Lambda \Big( -C \sum_i \log p(y^{(i)} \mid w^{(i)}, \Lambda) + \sum_k \lambda_k^2 \Big) \quad (11)$$

where $p(y \mid w, \Lambda) \propto \exp\big(\sum_k \sum_t \lambda_k f_k(y_{t-1}, y_t, w)\big)$, $C$ is the soft-margin parameter, and the term $\sum_k \lambda_k^2$ indicates L2 regularization on the feature weights $\lambda_k$.

² http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar

The training set for learning the embedded CRF model's λs is detailed in Table 3 (col 1, 2).
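A simplified sketch of how first-order pivot features could be instantiated from the templates of Table 1; the tags and polarities below are hypothetical inputs (real runs use a POS tagger, a chunker, and the opinion lexicon):

```python
# Sketch of first-order pivot feature generation centered on the head aspect (HA).
# Tags/polarities below are hypothetical inputs; real runs use a POS tagger,
# a chunker, and the opinion lexicon of Hu and Liu (2004).

def pivot_features(tags, polarities, ha_index, window=4):
    """Emit binary features T_{i+j} and SP_{i+j} for -window <= j <= window."""
    feats = []
    for j in range(-window, window + 1):
        k = ha_index + j
        if 0 <= k < len(tags):
            feats.append(f"T[{j}]={tags[k]}")
            feats.append(f"SP[{j}]={polarities[k]}")
    return feats

tags = ["PRP", "VBD", "VBG", "DT", "NN"]          # e.g., "... kept dropping the signal"
polar = ["NEU", "NEU", "NEG", "NEU", "NEU"]
print(pivot_features(tags, polar, ha_index=4))     # HA = "signal"
```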

4 Optimizing Phrasal Collocations

Topic models can be described in terms of a simple Polya urn (SPU) sampling scheme in the sense that when a particular term (word or phrase) is drawn from a topic, the count of that term is incremented in that topic. This biases the topic distribution towards these terms over time as their frequency increases. Therefore, the posterior of generative topic models often favors terms with high frequency, e.g., unigrams, while phrasal terms are ranked lower due to their lower frequencies. This is undesirable for phrase extraction.

In contrast, the Generalized Polya urn (GPU) model differs from the SPU in its sampling process. When a certain term is drawn, the count of that term increases, and in addition the counts of terms that are similar to the drawn word/phrase are also increased via pseudocounts for promotion. Thus, GPU caters for the promotion of other terms in a principled manner. It has previously been used for unigram topic modeling in (Mimno et al., 2011). In this work we leverage it for phrases.

4.1 Proposed PSM-GPU model

We optimize the collocations of relevant aspect words and phrases in the GPU framework in two ways:


Word to phrase: Intuitively, if an aspect word is assigned to a topic, then that topic should represent that aspect, and all other phrases in that aspect's phrase set (i.e., phrases containing that aspect word) should belong to the same topic. Thus, when an aspect word is assigned to a topic, each phrase in its aspect set is promoted with a small count in that topic.

Phrase to word: When a phrase $\langle w_{d,s,j} \rangle_{j=st}^{j=st+len-1}$ is assigned to a topic, each component word $w_{d,s,j}$ where $j \in [st, st+len-1]$ in it is also promoted with a certain small count, i.e., each word of that phrase is also assigned to that topic by a certain amount.

We now define the term promotion matrix $A$ for the GPU framework. Every element $A_{w,w'}$ of $A$ refers to the promotion pseudocount, i.e., whenever $w$ is seen in an urn, we increment the count of $w'$ by $A_{w,w'}$; here $w, w'$ can be words or phrases.

$$A_{w,w'} = \begin{cases} 1 & \text{if } w = w' \\ \sigma & \text{if } w \text{ is an aspect word and } w' \text{ is a phrase in the phrase set of } w \\ \delta & \text{if } w' \text{ is a word in the phrase } w \\ 0 & \text{otherwise} \end{cases} \quad (12)$$

To improve the ranking of phrases, the value of $\sigma$ is kept greater than $\delta$. The empirical values are given in Section 5.1.
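A sparse sketch of the promotion matrix of Eq. (12); the vocabulary and phrase sets are toy examples, with σ and δ set to the values reported in Section 5.1:

```python
# Sketch of the GPU promotion matrix A of Eq. (12) as a sparse dict.
# The vocabulary and phrase sets below are toy examples.
sigma, delta = 0.05, 0.01

aspect_phrase_sets = {                      # aspect word -> phrases containing it
    "connection": ["dropping connection", "connection excellent"],
}

def build_A(vocab, aspect_phrase_sets):
    A = {(w, w): 1.0 for w in vocab}                    # self promotion
    for word, phrases in aspect_phrase_sets.items():
        for ph in phrases:
            A[(word, ph)] = sigma                       # word -> phrase promotion
            for comp in ph.split():
                A[(ph, comp)] = delta                   # phrase -> component word
    return A                                            # all other entries are 0

vocab = ["connection", "dropping connection", "connection excellent", "dropping", "excellent"]
A = build_A(vocab, aspect_phrase_sets)
print(A[("connection", "dropping connection")], A[("dropping connection", "dropping")])
```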

4.2 PSM-GPU Inference

Accounting for the GPU process above, the approximate Gibbs samplers for $z$ and $r$ take the following form:

$$p(z_{d,s} = a \mid Z^{\neg d,s}, R^{\neg d,s}, W^{\neg d,s}) \propto \frac{(n^s_{d,a})^{\neg d,s} + \alpha}{(n^s_{d,(\cdot)})^{\neg d,s} + A\alpha} \times \left[ \left( \prod_{v=1}^{V} \frac{\Gamma\big(\sum_{w'=1}^{V} A_{v,w'}\, n^A_{a,w'} + \beta\big)}{\Gamma\big(\big(\sum_{w'=1}^{V} A_{v,w'}\, n^A_{a,w'}\big)^{\neg d,s} + \beta\big)} \right) \Big/ \left( \frac{\Gamma\big(\sum_{v=1}^{V}\sum_{w'=1}^{V} A_{v,w'}\, n^A_{a,w'} + V\beta\big)}{\Gamma\big(\big(\sum_{v=1}^{V}\sum_{w'=1}^{V} A_{v,w'}\, n^A_{a,w'}\big)^{\neg d,s} + V\beta\big)} \right) \right] \times \left[ \left( \prod_{v=1}^{V} \frac{\Gamma(n^B_{a,v} + \beta)}{\Gamma((n^B_{a,v})^{\neg d,s} + \beta)} \right) \Big/ \left( \frac{\Gamma(n^B_{a,(\cdot)} + V\beta)}{\Gamma((n^B_{a,(\cdot)})^{\neg d,s} + V\beta)} \right) \right] \quad (13)$$

$$p(r_{d,s,j} = 1 \mid z_{d,s} = a, w_{d,s,j} = v, \ldots) \propto \frac{\big(\sum_{w'=1}^{V} A_{v,w'}\, n^A_{a,w'}\big)^{\neg d,s,j} + \beta}{\sum_{v=1}^{V}\big(\sum_{w'=1}^{V} A_{v,w'}\, n^A_{a,w'}\big)^{\neg d,s,j} + V\beta} \times p_{CRF}(r_{d,s,j-1}, r_{d,s,j} = 1 \mid v) \quad (14)$$

The sampler for the phrasal opinion switch variables can be derived similarly. The sampler for individual background words remains unchanged.
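The effect of promotion on the counts used in Eqs. (13) and (14) can be illustrated with a small sketch; the matrix entries and counts below are hypothetical:

```python
# Sketch: promoted counts sum_{w'} A[v, w'] * n_A[a][w'] used in Eqs. (13)-(14).
# A and the counts n_A are tiny hypothetical examples (sigma=0.05, delta=0.01).
A = {("connection", "connection"): 1.0,
     ("connection", "dropping connection"): 0.05,     # word -> phrase promotion
     ("dropping connection", "dropping connection"): 1.0,
     ("dropping connection", "connection"): 0.01,     # phrase -> word promotion
     ("dropping connection", "dropping"): 0.01}
n_A = {"connection": 40, "dropping connection": 3, "dropping": 15}

def promoted_count(v, counts, A):
    """Effective count of term v after GPU promotion (0 for unrelated pairs)."""
    return sum(A.get((v, w2), 0.0) * c for w2, c in counts.items())

print(promoted_count("connection", n_A, A))           # 40 + 0.05*3 = 40.15
print(promoted_count("dropping connection", n_A, A))  # 3 + 0.01*40 + 0.01*15 = 3.55
```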

5 Experimental Evaluation

In this section, we evaluate our proposed models. We first detail our dataset, followed by the baselines and results.

5.1 Dataset and Parameter Settings

Domain   | Pos. Labeled Phrases | Neg. Labeled Phrases | Positive | Negative | Total
Router   | 414  | 1256 | 1937 | 5291 | 7228
GPS      | 948  | 672  | 2473 | 2231 | 4704
Mouse    | 376  | 477  | 1421 | 2591 | 4012
Keyboard | 398  | 660  | 912  | 1539 | 2451

Table 3: Statistics of the dataset for the four domains.

Dataset Statistics: For CRF training, we created a phrase labeled dataset of aspect opinion phrases using product reviews from Amazon across 4 domains, each spanning 4 head aspects. In this work, head aspects for a domain are known a priori, either directly using unsupervised topic induction or guided by domain knowledge (e.g., using aspect models such as (Zhao et al., 2010; Chen et al., 2013; Mukherjee and Liu, 2012)). Our focus is on phrase extraction and grouping. We labeled the positive and negative opinion phrase spans (Table 3; col 2, 3) in the reviews following the annotation schemes in (Wilson et al., 2005) for embedded CRF training in PSM. Table 3 details our labeled data for CRF training. This phrase boundary labeled dataset (Table 3; col 2, 3) is "orthogonal" or disjoint from the data on which the PSM model was fit and evaluated (Table 3; col 4, 5). This avoids overfitting and makes a fair case for all the experiments with PSM.
Preprocessing and Parameter Setting: We removed stopwords, punctuation, special characters, and words appearing less than 5 times in each domain. For all models, posterior estimates of latent variables were taken with a sampling lag of 50 iterations post burn-in phase (of 200 iterations), with 2,000 iterations in total. Dirichlet priors were set to α = 50/K, where K is the number of topics (empirically set to 10 via pilot experiments), and β = 0.1. The CRF parameter C = 1 and the GPU parameters σ = 0.05 and δ = 0.01 were estimated using cross validation.
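For reference, the settings stated above can be collected in one place; this is merely a summary of the reported values, not code released by the authors:

```python
# Summary of the reported experimental settings (Section 5.1).
K = 10                      # number of topics/aspects (set via pilot experiments)
settings = {
    "min_word_freq": 5,     # words appearing < 5 times are removed
    "iterations": 2000,     # total Gibbs iterations
    "burn_in": 200,         # burn-in phase
    "sampling_lag": 50,     # posterior estimates taken every 50 iterations
    "alpha": 50 / K,        # Dirichlet prior on aspect distributions
    "beta": 0.1,            # Dirichlet prior on term distributions
    "C": 1,                 # CRF soft-margin parameter
    "sigma": 0.05,          # GPU word -> phrase promotion
    "delta": 0.01,          # GPU phrase -> word promotion
}
```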

5.2 Baselines

We consider the following relevant phrase extraction models as our baselines:


Router→Connection:
- PSM-GPU: "updating firmware secure connection", "dropping connection", "connection excellent", "instability wireless connection", "crashes entire connection", "updating", "kills current connection including downloads", "affected plugged connection", "cable radio frequency connection", "halfway"
- sMC-GPU: "connection big time", "signal weak time connection", "connection time stable", "lose connection", "internet connection speed dropped", "stop working lot connection", "started dropping internet connection", "started dropping connection", "drop wireless connection drops wired connection", "connection dropping problems"
- LDA-P-GPU: "drops", "times day internet connection", "internet connection multiple times", "internet connection times", "internet connection minutes", "dropping internet connection", "broadband internet connection", "extremely slow internet connection", "lost internet connection", "internet connection couple"

GPS→Screen:
- PSM-GPU: "poor screen contrast daylight", "direction screen missing poor screen contrast", "slow screen size makes useless", "screen turned", "turn", "nothing screen", "night screen bit bright", "screen unpredictable directions", "lacks faster screen refresh rate", "smoother screen refresh"
- sMC-GPU: "bright", "excellent nice touch screen", "touch screen n't", "nice big touch screen", "touch screen big size", "smaller", "touch screen but nice", "screen accurate spoken direction", "touch screen nice", "nice slim touch screen"
- LDA-P-GPU: "smaller screen but dont", "wide screen", "nothing but screen", "ok but screen", "screen but normal", "screen size", "large screen", "better traffic features cons touch screen", "screen real estate than", "than years screen"

Table 2: Example aspect specific opinion phrases (comma delimited in order) discovered by PSM-GPU, sMC-GPU, and LDA-P-GPU. Errors are italicized and marked in red.

LDA with phrases (LDA-P): As aspect-sentiment phrases are often noun phrases, a basic approach is to include the noun phrases (extracted using a parser) as separate terms in the corpus.

Topical N-gram (TNG): The TNG model of (Wang et al., 2007) extends LDA to model n-grams of arbitrary length. As aspects often appear close to their opinions in the sentence, topical n-grams for each aspect form a natural baseline. We used the authors' original implementation in the MALLET toolbox.

LDA-P with GPU (LDA-P-GPU): This model is due to (Fei et al., 2014) and is tailored for phrase extraction in opinion mining. It employs LDA with noun phrases in the GPU framework to rank the aspect phrases higher in their topics. Our implementations of LDA-P and LDA-P-GPU use the noun phrases discovered by the Stanford Parser.

semi-Markov CRF with GPU (sMC-GPU): This model builds on the model of (Yang and Cardie, 2012), which used dependency tree features and a semi-CRF to model arbitrarily long expressions. We used these expression spans as multiwords in the vocabulary. Then we employ the GPU based sampling with LDA proposed by (Fei et al., 2014) to collocate opinion expressions.

5.3 Qualitative Analysis

To assess the quality of the extracted expressions, we labeled the topics following the instructions in (Mimno et al., 2011). First, each topic was labeled as coherent or incoherent, and an aspect name was given if the topic was coherent. Each topic was presented as a list of the top 45 terms in descending order of their probabilities under that topic. A topic was considered coherent if the terms in the topic were semantically related to each other.

Next, for coherent topics, their terms were labeled as correct (if the term's semantics was relevant to the topic) or incorrect (otherwise). Two human judges were used in the annotation. Agreements being high (κ > 0.78), disagreements were resolved upon consensus among the judges.

Table 2 reports the top 10 terms (words/phrases) for the aspect 'connection' (Router domain) and the aspect 'screen' (GPS domain) across PSM-GPU, sMC-GPU, and LDA-P-GPU (the two closest competitors). We note that PSM-GPU's phrases are more expressive compared to sMC-GPU's because sMC-GPU is prone to producing longer phrases due to its segment features, whereas PSM's switch variable captured more relevant aspect specific opinion expressions. sMC-GPU has better


Figure 2: Charts from left to right: topical word Precision@15, Precision@30, and Precision@45 of coherent topics for each model, and the number of coherent topics of each model.

Domain   | PSM-GPU             | PSM                 | sMC-GPU             | LDA-P-GPU           | TNG                 | LDA-P
Router   | 71.8/67.8/69.7/68.8 | 69.0/67.3/68.1/67.7 | 69.6/67.7/68.6/68.2 | 65.2/67.9/66.5/67.1 | 65.2/66.8/66.0/65.9 | 65.5/68.0/66.7/67.4
GPS      | 69.3/69.2/69.2/69.3 | 66.7/67.7/67.2/67.4 | 65.3/68.0/65.7/65.5 | 63.6/68.4/65.9/66.8 | 64.1/68.3/66.1/58.8 | 66.0/69.7/67.8/68.6
Mouse    | 87.4/81.4/84.0/87.3 | 82.9/80.4/81.3/82.4 | 83.4/81.5/82.6/82.3 | 81.3/78.3/79.4/80.9 | 82.9/77.9/80.0/82.5 | 77.9/77.3/77.1/77.7
Keyboard | 92.5/72.9/81.2/84.1 | 90.2/70.1/78.4/81.6 | 88.4/70.3/77.6/81.0 | 86.7/69.3/76.4/79.9 | 88.8/66.6/75.8/79.1 | 83.6/63.4/71.4/75.0
Avg.     | 80.3/72.8/76.0/77.4 | 77.2/71.4/73.8/74.8 | 76.7/71.9/73.6/74.3 | 74.2/71.0/72.1/73.7 | 75.3/69.9/72.0/71.6 | 73.3/69.6/70.8/72.2

Table 4: Sentiment classification: Precision (P), Recall (R), F1, and Accuracy (Ac.), in that order, for each domain and each model.

phrases compared to LDA-P-GPU because the latter only considers noun phrases, which may not always be semantically coherent under an aspect. The qualitative results of the other baselines, TNG and LDA-P, were worse than those of LDA-P-GPU and are hence omitted due to space constraints. However, the subsequent experiments compare all models quantitatively.

5.4 Quantitative Analysis

We consider the following metrics and tasks:

Average Precision: Figure 2 shows the average Precision@n (p@n) for n = 15, 30, 45 over all coherent topics for each model in each domain. We note that PSM-GPU achieves the highest precision in all domains, significantly (p < 0.01) outperforming its closest competitor, sMC-GPU. sMC-GPU tends to discover longer phrases due to the segment features in the semi-CRF and, combined with GPU, gains the maximum strength among the other baselines. Next in order are LDA-P-GPU, PSM, and TNG. We have not shown the results of LDA-P as its top terms did not contain enough phrases and its precision scores were considerably lower than those of the other models. It is worthwhile to note, however, that PSM outperforms LDA-P-GPU (the 2nd best competitor) at lower ranks, which is more important (e.g., in the majority of domains for p@15) and shows its effectiveness. It is a bit unfair to compare PSM with sMC-GPU because PSM lacks phrase rank optimization whereas sMC-GPU enforces it, and the p@n metric uses rank position as its goodness criterion. However, we will see that in an actual application task, both PSM and PSM-GPU do better than sMC-GPU. Also, we observed that p@45 is higher compared to p@15 or p@30. The reason is that even though we promote phrases using GPU, it cannot remove some aspect opinion words from the top 15 terms due to their high occurrence in phrases. For example, Table 2 has opinion words like "updating" and "turn", which are considered incorrect because they are non-phrasal terms.

# of coherent topics: Figure 2 (rightmost chart) shows the number of coherent topics produced by each model. A model that can discover more coherent topics is better. We find that PSM-GPU discovers more coherent topics with phrases than its baselines across all domains. The trends of the other models are similar to p@n and can be analogously explained.

We note that the Topic Coherence (TC) metric of (Mimno et al., 2011), which is often used to approximate coherence in unigram topic models as it correlates with human notions of coherence, uses the co-document frequency of individual words in topics.


However, in our problem, as phrases are sparse, their co-document frequency is far lower than that of words. Hence, the TC metric is not directly applicable. Our measure of coherence is based on human judgment (achieving high agreement, κ > 0.78; see Section 5.3), and from Table 2 we can see the discovered phrases do reflect coherence. Hence, to evaluate the phrases quantitatively, we employ an actual sentiment classification task that uses the posterior of our models (top phrases) as features. This is reasonable because if the estimated topics (when used as features) improve sentiment classification, it shows that they are meaningful and capable of capturing the latent sentiment that governs polarities.

Sentiment Classification: For this task, instead of using all the words as features, we used the posterior on $\varphi^A$ (top 50 terms of $\varphi^A$) as features. For all models, all possible n-grams of the top 50 terms are also considered as features. We trained SVMs³ (using the SVMLight toolkit) with the features described above. Evaluation for this task employed 5-fold cross validation on the data in Table 3 (col 4, 5). For each test fold, the features were induced upon fitting the aspect extraction models on the training data of that fold. From Table 4 we note that both PSM-GPU and PSM outperform all competitors on average F1 across all domains. More specifically, we note that PSM alone, which uses no rank optimization, performs better than sMC-GPU, which employs phrasal rank optimization under the GPU scheme. We believe this is due to PSM's switching component, which can discover correct aspect/sentiment terms (sufficient for polarity classification) and rank them higher based on frequency even though the expressive aspect specific phrases remain ranked lower. sMC-GPU tends to have longer phrases, so it does well; however, under GPU, longer phrases may not be promoted well as they lack anchor aspect terms under a relevant topic. LDA-P-GPU uses standard noun phrases (NPs) for phrases with rank optimization and hence is next in the performance order, as NPs may not capture opinion well. TNG does not perform as well, as it relies on multiword collocation as opposed to NPs/VPs for phrase extraction. LDA-P's performance is lowest as it cannot rank the relevant NPs high. PSM-GPU has the right balance of phrase boundary span and phrasal rank optimization via GPU, which makes it significantly outperform (p < 0.01) all competitors.

³ Using an RBF kernel (C = 10, g = 0.01), which performed best upon tuning various SVM parameters via cross validation.
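A sketch of the feature construction for this task; the term lists and labels are toy stand-ins, and scikit-learn's SVC substitutes here for the SVMLight toolkit actually used:

```python
from sklearn.svm import SVC
import numpy as np

# Sketch: binary features from the top terms of phi^A (per aspect), as used for
# sentiment classification. The term lists and labels are toy stand-ins; the
# paper used SVMLight with an RBF kernel (C=10, g=0.01).
top_terms = ["dropping connection", "signal became spotty", "clear directions"]

def featurize(sentence: str) -> np.ndarray:
    """1 if a top posterior term occurs in the sentence, else 0."""
    return np.array([1.0 if t in sentence else 0.0 for t in top_terms])

X = np.vstack([featurize("router kept dropping connection"),
               featurize("clear directions in voice"),
               featurize("works fine so far")])
y = np.array([0, 1, 1])                       # toy polarity labels (0=neg, 1=pos)
clf = SVC(kernel="rbf", C=10, gamma=0.01).fit(X, y)
print(clf.predict(featurize("the signal became spotty again").reshape(1, -1)))
```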

Domain   | Precision | Recall | F1   | Acc.
Router   | 69.68     | 67.3   | 68.4 | 67.9
GPS      | 66.47     | 68.6   | 67.5 | 68.00
Mouse    | 87.24     | 80.5   | 83.4 | 86.9
Keyboard | 85.45     | 70.9   | 77.1 | 81.1

Table 5: Domain ablation results on polarity classification.

Sequence Model Sensitivity: To assess the robustness of the hybrid framework, we evaluate the sensitivity of the embedded CRF model via domain ablation. We choose the best performer, PSM-GPU, and ablate each domain during its CRF training. We then repeat the previous experiment on sentiment classification using the ablated model. From the results in Table 5, we note that the reduction in precision is relatively larger than that of recall. However, the F1 score does not drop significantly (compared to Table 4) for any domain, showing the robustness of the hybrid framework. We note that even with some skewness in the labeled data (Table 3), the CRF is not overfitting here, and the proposed pivot features (Table 1) are powerful enough to learn the phrasal structure across domains.

6 Conclusion

This paper presented a novel hybrid framework for aspect specific opinion expression extraction. Two models, PSM and PSM-GPU, were proposed that employ CRF discriminative sequence modeling for phrase boundary extraction and generative modeling for grouping relevant terms under a topic. PSM-GPU further optimized the aspect coherence using the generalized Polya urn sampling scheme. Experimental results showed that the proposed hybrid models can extract more coherent aspect specific opinion expressions, significantly outperforming all competitors across all domains, and are robust in cross-domain knowledge transfer.

Acknowledgment

This work was supported in part by NSF 1527364.


References

Gabor Berend. 2011. Opinion expression mining by exploiting keyphrase extraction. In IJCNLP, pages 1162–1170. Citeseer.

Eric Breck, Yejin Choi, and Claire Cardie. 2007. Identifying expressions of opinion in context. In IJCAI, volume 7, pages 2683–2688.

Samuel Brody and Noemie Elhadad. 2010. An unsupervised aspect-sentiment model for online reviews. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 804–812. Association for Computational Linguistics.

Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Exploiting domain knowledge in aspect extraction. In EMNLP, pages 1655–1667.

Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. 2005. Identifying sources of opinions with conditional random fields and extraction patterns. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 355–362. Association for Computational Linguistics.

Yejin Choi, Eric Breck, and Claire Cardie. 2006. Joint extraction of entities and relations for opinion recognition. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 431–439. Association for Computational Linguistics.

Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R Voss, and Jiawei Han. 2014. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3):305–316.

Geli Fei, Zhiyuan Chen, and Bing Liu. 2014. Review topic discovery with phrases using the polya urn model. In COLING, pages 667–676.

Geli Fei, Zhiyuan Brett Chen, Arjun Mukherjee, and Bing Liu. 2016. Discovering correspondence of sentiment words and aspects. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177. ACM.

Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single- and cross-domain setting with conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1035–1045. Association for Computational Linguistics.

Yohan Jo and Alice H Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 815–824. ACM.

Richard Johansson and Alessandro Moschitti. 2011. Extracting opinion expressions and their polarities: exploration of pipelines and joint models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers - Volume 2, pages 101–106. Association for Computational Linguistics.

Soo-Min Kim and Eduard Hovy. 2006. Extracting opinions, opinion holders, and topics expressed in online news media text. In Proceedings of the Workshop on Sentiment and Subjectivity in Text, pages 1–8. Association for Computational Linguistics.

Nozomi Kobayashi, Kentaro Inui, and Yuji Matsumoto. 2007. Extracting aspect-evaluation and aspect-of relations in opinion mining. In EMNLP-CoNLL, volume 7, pages 1065–1074. Citeseer.

Taku Kudo. 2009. CRF++: Yet another CRF toolkit.

John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML).

Huayi Li, Arjun Mukherjee, Jianfeng Si, and Bing Liu. 2015. Extracting verb expressions implying negative opinions. In AAAI, pages 2411–2417.

Robert V Lindsey, William P Headden III, and Michael J Stipicevic. 2012. A phrase-discovering topic model using hierarchical pitman-yor processes. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 214–222. Association for Computational Linguistics.

Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. In Mining text data, pages 415–463. Springer.

Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, and Jiawei Han. 2015. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1729–1744. ACM.

Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, and ChengXiang Zhai. 2007. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proceedings of the 16th international conference on World Wide Web, pages 171–180. ACM.

David Mimno, Hanna M Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262–272. Association for Computational Linguistics.

Arjun Mukherjee and Bing Liu. 2012. Aspect extraction through semi-supervised modeling. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 339–348. Association for Computational Linguistics.

Arjun Mukherjee. 2016. Extracting aspect specific sentiment expressions implying negative opinions. In Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics.

Maria Pontiki, Dimitrios Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. SemEval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 486–495. Association for Computational Linguistics, Denver, Colorado.

Christina Sauper, Aria Haghighi, and Regina Barzilay. 2011. Content models with attitude. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 350–358. Association for Computational Linguistics.

Ivan Titov and Ryan McDonald. 2008. Modeling online reviews with multi-grain topic models. In Proceedings of the 17th international conference on World Wide Web, pages 111–120. ACM.

Xuerui Wang, Andrew McCallum, and Xing Wei. 2007. Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Seventh IEEE International Conference on Data Mining (ICDM 2007), pages 697–702. IEEE.

Shuai Wang, Zhiyuan Chen, and Bing Liu. 2016. Mining aspect-specific opinion using a holistic lifelong topic model. In Proceedings of the 25th International Conference on World Wide Web, pages 167–176. International World Wide Web Conferences Steering Committee.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3):165–210.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 347–354. Association for Computational Linguistics.

Yuanbin Wu, Qi Zhang, Xuanjing Huang, and Lide Wu. 2009. Phrase dependency parsing for opinion mining. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3, pages 1533–1541. Association for Computational Linguistics.

Bishan Yang and Claire Cardie. 2012. Extracting opinion expressions with semi-markov conditional random fields. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1335–1345. Association for Computational Linguistics.

Bishan Yang and Claire Cardie. 2013. Joint inference for fine-grained opinion extraction. In ACL (1), pages 1640–1649.

Bishan Yang and Claire Cardie. 2014. Joint modeling of opinion expression extraction and attribute classification. Transactions of the Association for Computational Linguistics, 2:505–516.

Wayne Xin Zhao, Jing Jiang, Hongfei Yan, and Xiaoming Li. 2010. Jointly modeling aspects and opinions with a maxent-lda hybrid. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 56–65. Association for Computational Linguistics.

Wayne Xin Zhao, Jing Jiang, Jing He, Yang Song, Palakorn Achananuparp, Ee-Peng Lim, and Xiaoming Li. 2011. Topical keyphrase extraction from twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 379–388. Association for Computational Linguistics.
