
Latent Variable Sentiment Grammar∗

Liwen Zhang†, Kewei Tu†, Yue Zhang‡§
†School of Information Science and Technology, ShanghaiTech University, Shanghai, China

‡Institute of Advanced Technology, Westlake Institute for Advanced Study, China
§School of Engineering, Westlake University, Hangzhou, China

{zhanglw1,tukw}@shanghaitech.edu.cn  [email protected]

Abstract

Neural models have been investigated for sentiment classification over constituent trees. They learn phrase composition automatically by encoding tree structures but do not explicitly model sentiment composition, which requires encoding sentiment class labels. To this end, we investigate two formalisms with deep sentiment representations that capture sentiment subtype expressions by latent variables and Gaussian mixture vectors, respectively. Experiments on the Stanford Sentiment Treebank (SST) show the effectiveness of sentiment grammar over vanilla neural encoders. Using ELMo embeddings, our method gives the best results on this benchmark.

1 Introduction

Determining the sentiment polarity at or below the sentence level is an important task in natural language processing. Sequence structured models (Li et al., 2015; McCann et al., 2017) have been exploited for modeling each phrase independently. Recently, tree structured models (Zhu et al., 2015; Tai et al., 2015; Teng and Zhang, 2017) were leveraged for learning phrase compositions in sentence representation given the syntactic structure. Such models classify the sentiment over each constituent node according to its hidden vector through tree structure encoding.

Though effective, existing neural methods do not consider explicit sentiment compositionality (Montague, 1974). Take the sentence "The movie is not very good, but I still like it" in Figure 1 as an example (Dong et al., 2015): over the constituent tree, sentiment signals can be propagated from leaf nodes to the root, going through negation, intensification and contrast according to the context. Modeling such signal channels can intuitively lead to more interpretable and reliable results.

∗ Work was done when the first author was visiting Westlake University. The third author is the corresponding author.

Figure 1: Example of sentiment composition. The constituent tree of "The movie is not very good, but I still like it", with each node labeled by a sentiment signal in {-1, 0, +1}.

To model sentiment composition, direct encoding of sentiment signals (e.g., +1/-1 or more fine-grained forms) is necessary.

To this end, we consider a neural network grammar with latent variables. In particular, we employ a grammar as the backbone of our approach, in which nonterminals represent sentiment signals and grammar rules specify sentiment compositions. In the simplest version of our approach, nonterminals are sentiment labels from SST directly, resulting in a weighted grammar. To model more fine-grained emotions (Ortony and Turner, 1990), we consider a latent variable grammar (LVG; Matsuzaki et al. (2005), Petrov et al. (2006)), which splits each nonterminal into subtypes to represent subtle sentiment signals and uses a discrete latent variable to denote the sentiment subtype of a phrase. Finally, inspired by the fact that sentiment can be modeled with a low dimensional continuous space (Mehrabian, 1980), we introduce a Gaussian mixture latent vector grammar (GM-LVeG; Zhao et al. (2018)), which associates each sentiment signal with a continuous vector instead of a discrete variable.

Experiments on SST show that explicit modeling of sentiment composition leads to significantly improved performance over standard tree encoding, and models that learn subtle emotions as hidden variables give better results than coarse-grained models. Using a bi-attentive classification network (Peters et al., 2018) as the encoder, our final model gives the best results on SST. To our knowledge, we are the first to consider neural network grammars with latent variables for sentiment composition. Our code will be released at https://github.com/Ehaschia/bi-tree-lstm-crf.

2 Related Work

Phrase-level sentiment analysis   Li et al. (2015) and McCann et al. (2017) proposed sequence structured models that predict the sentiment polarities of the individual phrases in a sentence independently. Zhu et al. (2015), Le and Zuidema (2015), Tai et al. (2015) and Gupta and Zhang (2018) proposed Tree-LSTM models to capture bottom-up dependencies between constituents for sentiment analysis. In order to support bidirectional information flow over trees, Teng and Zhang (2017) introduced a Bi-directional Tree-LSTM model that adds a top-down component after Tree-LSTM encoding. These models handle sentiment composition implicitly and predict sentiment polarities only based on embeddings of current nodes. In contrast, we model sentiment explicitly.

Sentiment composition   Moilanen and Pulman (2007) introduced a seminal model for sentiment composition (Montague, 1974), which composes positive, negative and neutral (+1/-1/0) signals hierarchically. Taboada et al. (2011) proposed a lexicon-based method for addressing sentence-level contextual valence shifting phenomena such as negation and intensification. Choi and Cardie (2008) used a structured linear model to learn semantic compositionality relying on a set of manual features. Dong et al. (2015) developed a statistical parser to learn the sentiment structure of a sentence. Our method is similar in that grammars are used to model semantic compositionality, but we consider neural methods instead of statistical methods for sentiment composition. Teng et al. (2016) proposed a simple weighted-sum model that introduces sentiment lexicon features into an LSTM for sentiment analysis. They used -2 to 2 to represent sentiment polarities.

In contrast, we model sentiment subtypes with latent variables and combine the strengths of neural encoders and hierarchical sentiment composition.

Latent Variable Grammar   There has been a line of work using discrete latent variables to enrich coarse-grained constituent labels in phrase-structure parsing (Johnson, 1998; Matsuzaki et al., 2005; Petrov et al., 2006; Petrov and Klein, 2007). Our work is similar in that discrete latent variables are used to model sentiment polarities. To our knowledge, we are the first to consider modeling fine-grained sentiment signals by investigating different types of latent variables. Recently, there has been work using continuous latent vectors for modeling syntactic categories (Zhao et al., 2018). We also consider their grammar for modeling sentiment polarities.

3 Baseline

We take the constituent Tree-LSTM as our baseline, which extends the sequential LSTM to tree-structured network topologies. Formally, our model computes a parent representation from its two children in a Tree-LSTM:

\begin{bmatrix} i \\ f_l \\ f_r \\ o \\ g \end{bmatrix} = \begin{bmatrix} \sigma \\ \sigma \\ \sigma \\ \sigma \\ \tanh \end{bmatrix} \left( W_t \begin{bmatrix} x \\ h_l \\ h_r \end{bmatrix} + b_t \right)   (1)

c_p = i \otimes g + f_l \otimes c_l + f_r \otimes c_r   (2)

h_p = o \otimes \tanh(c_p)   (3)

where W_t ∈ R^{5D_h×3D_h} and b_t ∈ R^{5D_h} are trainable parameters, ⊗ is the Hadamard product, and x represents the input of a leaf node. Our formulation is a special case of the N-ary Tree-LSTM (Tai et al., 2015) with N = 2.
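To make the composition function concrete, the following is a minimal sketch of a binary Tree-LSTM cell implementing Equations 1-3, assuming PyTorch; the class and parameter names are illustrative and not taken from the released code.

```python
import torch
import torch.nn as nn

class BinaryTreeLSTMCell(nn.Module):
    """Minimal binary Tree-LSTM cell implementing Equations 1-3.

    W_t (with bias b_t) maps the concatenated [x; h_l; h_r] (3*D_h) to the
    five gates i, f_l, f_r, o, g (5*D_h in total).
    """
    def __init__(self, hidden_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.W_t = nn.Linear(3 * hidden_size, 5 * hidden_size)  # W_t and b_t

    def forward(self, x, left, right):
        # For leaf nodes, pass zero (h, c) pairs for both children.
        h_l, c_l = left
        h_r, c_r = right
        gates = self.W_t(torch.cat([x, h_l, h_r], dim=-1))       # Equation 1
        i, f_l, f_r, o, g = gates.chunk(5, dim=-1)
        i, f_l, f_r, o = map(torch.sigmoid, (i, f_l, f_r, o))
        g = torch.tanh(g)
        c_p = i * g + f_l * c_l + f_r * c_r                      # Equation 2
        h_p = o * torch.tanh(c_p)                                # Equation 3
        return h_p, c_p
```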

Existing work (Tai et al. (2015), Zhu et al. (2015)) performs softmax classification on each node according to the state vector h of that node for sentiment analysis. We follow this method in our baseline model.

4 Sentiment Grammars

We investigate sentiment grammars as a layer of structured representation on top of a tree-LSTM, which models the correlation between different sentiment labels over a tree. Depending on how a sentiment label is represented, we develop three increasingly complex models. In particular, the first model, which is introduced in Section 4.1, uses a weighted grammar to model the first-order correlation between sentiment labels in a tree. It can be regarded as a PCFG model. The second model, which is introduced in Section 4.2, introduces a discrete latent variable for a refined representation of sentiment classes. Finally, the third model, which is introduced in Section 4.3, considers a continuous latent representation of sentiment classes.

4.1 Weighted Grammars

Formally, a sentiment grammar is defined as G = (N, S, Σ, R_t, R_e, W_t, W_e), where N = {A, B, C, ...} is a finite set of sentiment polarities, S ∈ N is the start symbol, and Σ is a finite set of terminal symbols representing words such that N ∩ Σ = ∅. R_t is the transition rule set, containing production rules of the form X → α where X ∈ N and α ∈ N+; R_e is the emission rule set, containing production rules of the form X → w where X ∈ N and w ∈ Σ+. W_t and W_e are sets of weights indexed by production rules in R_t and R_e, respectively. Different from standard formal grammars, for each sentiment polarity in a parse tree our sentiment grammar invokes one emission rule to generate a string of terminals and invokes zero or one transition rule to produce its child sentiment polarities. This is similar to the behavior of hidden Markov models. Therefore, in a parse tree each non-leaf node is a sentiment polarity and is connected to exactly one leaf node, which is a string of terminals. The terminals that are connected to the parent node can be obtained by concatenating the leaf nodes of its child nodes. Figure 2 shows an example of our sentiment grammar. In this paper, we only consider R_t in Chomsky normal form (CNF) for clarity of presentation. However, it is straightforward to extend our formulation to the general case.

The score of a sentiment tree T conditioned on a sentence w is defined as follows:

S(T | w, K) = \prod_{r_t \in T} W_t(r_t) \times \prod_{r_e \in T} W_e(r_e)   (4)

where r_t and r_e represent a transition rule and an emission rule in the sentiment parse tree T, respectively. We specify the transition weights W_t with a non-negative rank-3 tensor.


Figure 2: Sentiment grammar example. Yellow nodes are leaf nodes of green constituent nodes; the blue line B → w_1 and the black line A → BC represent an emission rule and a transition rule, respectively.

We compute the non-negative weight of each emission rule W_e(X → w_{i:j}) by applying a single-layer perceptron f_X and an exponential function to the neural encoder state vector h_{i:j} representing the constituent w_{i:j}.

Sentiment grammars provide a principled way to explicitly model sentiment composition, and by parameterizing the emission rules with neural encoders, they can take advantage of deep learning. In particular, by adding a weighted grammar on top of a tree-LSTM, our model is reminiscent of LSTM-CRF in the sequence structure.
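As an illustration of how Equation 4 and the emission parameterization fit together, the sketch below scores one labeled sentiment tree with a log-space transition tensor and an exp-perceptron emission. The class, method, and argument names are our own illustrative choices, not the paper's released API.

```python
import torch
import torch.nn as nn

class WeightedGrammarScorer(nn.Module):
    """Score a labeled sentiment tree as in Equation 4 (log space)."""

    def __init__(self, hidden_size, n_polarity=5):
        super().__init__()
        # log W_t(A -> B C): a rank-3 tensor over sentiment polarities
        self.log_transition = nn.Parameter(torch.zeros(n_polarity, n_polarity, n_polarity))
        # f_X for all polarities at once; W_e(X -> w_{i:j}) = exp(f_X(h_{i:j}))
        self.emission = nn.Linear(hidden_size, n_polarity)

    def tree_log_score(self, transitions, labels, span_states):
        """transitions: list of (parent, left, right) node indices;
        labels: polarity index of every node;
        span_states: encoder vectors h_{i:j}, one row per node."""
        labels = torch.as_tensor(labels)
        log_emission = self.emission(span_states)            # log W_e for every node
        score = log_emission[torch.arange(len(labels)), labels].sum()
        for p, l, r in transitions:                           # one transition rule per non-leaf node
            score = score + self.log_transition[labels[p], labels[l], labels[r]]
        return score                                          # = log S(T | w, K)
```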

4.2 Latent Variable Grammars

Inspired by categorical models (Ortony and Turner, 1990), which regard emotions as an overlay over a series of basic emotions, we extend our sentiment grammars with Latent Variable Grammars (LVGs; Petrov et al. (2006)), which refine each constituent tree node with a discrete latent variable, splitting each observed sentiment polarity into a finite number of unobserved sentiment subtypes. We refer to trees over unsplit sentiment polarities as unrefined trees and trees over sentiment subtypes as refined trees.

Suppose that the sentiment polarities A, B and C of a transition rule A → BC are split into n_A, n_B and n_C subtypes, respectively. The weights of the refined transition rule can be represented by a non-negative rank-3 tensor W_{A→BC} ∈ R^{n_A×n_B×n_C}. Similarly, given an emission rule A → w_{i:j}, the weights of its refined rules obtained by splitting A into n_A subtypes form a non-negative vector W_{A→w_{i:j}} ∈ R^{n_A}, calculated by an exponential function and a single-layer perceptron f_A:

W_{A \to w_{i:j}} = \exp(f_A(h_{i:j}))   (5)

where h_{i:j} is the vector representation of the constituent w_{i:j}.

The score of a refined parse tree is defined as the product of the weights of all transition rules and emission rules that make up the refined parse tree, similar to Equation 4. The score of an unrefined parse tree is then defined as the sum of the scores of all refined trees that are consistent with it.

Note that the Weighted Grammar (WG) can be viewed as a special case of LVGs where each sentiment polarity has one subtype.
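A minimal sketch of the refined rule weights, assuming PyTorch: the class name, field layout, and default subtype count are illustrative assumptions, not the released implementation. Setting n_sub = 1 recovers WG.

```python
import torch
import torch.nn as nn

class LVGRules(nn.Module):
    """Refined transition and emission weights for an LVG (illustrative sketch)."""

    def __init__(self, hidden_size, n_polarity=5, n_sub=4):
        super().__init__()
        self.n_polarity, self.n_sub = n_polarity, n_sub
        # log of the non-negative refined transition tensor W_{A->BC} for every (A, B, C)
        self.log_transition = nn.Parameter(
            torch.zeros(n_polarity, n_sub, n_polarity, n_sub, n_polarity, n_sub))
        # one single-layer perceptron f_A per polarity, stacked into one linear map
        self.f = nn.Linear(hidden_size, n_polarity * n_sub)

    def transition_weight(self, a, b, c):
        # W_{A->BC} in R^{n_A x n_B x n_C}, kept non-negative via exp
        return torch.exp(self.log_transition[a, :, b, :, c, :])

    def emission_weight(self, a, h_span):
        # Equation 5: W_{A -> w_{i:j}} = exp(f_A(h_{i:j})), a vector of length n_sub
        return torch.exp(self.f(h_span)).view(self.n_polarity, self.n_sub)[a]
```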

4.3 Gaussian Mixture Latent Vector Grammars

Inspired by continuous models (Mehrabian, 1980), which model emotions in a continuous low dimensional space, we employ Latent Vector Grammars (LVeGs) (Zhao et al., 2018), which associate each sentiment polarity with a latent vector space representing the set of sentiment subtypes. We follow the idea of Gaussian Mixture LVeGs (GM-LVeGs) (Zhao et al., 2018), which use Gaussian mixtures to model weight functions. Because Gaussian mixtures have the nice property of being closed under product, summation, and marginalization, learning and parsing can be done efficiently using dynamic programming.

In GM-LVeG, the weight function of a transition or emission rule r is defined as a Gaussian mixture with K_r mixture components:

W_r(\mathbf{r}) = \sum_{k=1}^{K_r} \rho_{r,k} \, \mathcal{N}(\mathbf{r} \,|\, \boldsymbol{\mu}_{r,k}, \boldsymbol{\Sigma}_{r,k})   (6)

where \mathbf{r} is the concatenation of the latent vectors representing subtypes for the sentiment polarities in rule r, ρ_{r,k} > 0 is the k-th mixing weight (the K_r mixture weights do not necessarily sum up to 1), and \mathcal{N}(\mathbf{r} | \boldsymbol{\mu}_{r,k}, \boldsymbol{\Sigma}_{r,k}) denotes the k-th Gaussian distribution parameterized by mean \boldsymbol{\mu}_{r,k} and covariance matrix \boldsymbol{\Sigma}_{r,k}. For an emission rule A → w_{i:j}, all the Gaussian mixture parameters are calculated by single-layer perceptrons from the vector representation h_{i:j} of the constituent w_{i:j}:

\rho_{r,k} = \exp(f^{\rho,k}_A(h_{i:j})), \quad \boldsymbol{\mu}_{r,k} = f^{\mu,k}_A(h_{i:j}), \quad \boldsymbol{\Sigma}_{r,k} = \exp(f^{\Sigma,k}_A(h_{i:j}))   (7)

For the sake of computational efficiency, we use Gaussian distributions with diagonal covariance matrices.
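The parameterization in Equation 7 can be sketched as follows, assuming PyTorch; the module and field names are hypothetical, and K_r = 1 with d = 2 mirrors the settings chosen later in Section 5.3.

```python
import torch
import torch.nn as nn

class GaussianEmission(nn.Module):
    """Predict the Gaussian mixture parameters of Equation 7 for one polarity A."""

    def __init__(self, hidden_size, d=2, k_r=1):
        super().__init__()
        self.d, self.k_r = d, k_r
        self.f_rho = nn.Linear(hidden_size, k_r)        # f^{rho,k}_A
        self.f_mu = nn.Linear(hidden_size, k_r * d)     # f^{mu,k}_A
        self.f_sigma = nn.Linear(hidden_size, k_r * d)  # f^{Sigma,k}_A (diagonal)

    def forward(self, h_span):
        rho = torch.exp(self.f_rho(h_span))                                  # mixing weights > 0
        mu = self.f_mu(h_span).view(-1, self.k_r, self.d)                    # component means
        sigma = torch.exp(self.f_sigma(h_span)).view(-1, self.k_r, self.d)   # diagonal covariances
        return rho, mu, sigma
```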

4.4 Parsing

The goal of our task is to find the most probable sentiment parse tree T*, given a sentence w and its constituency parse tree skeleton K. The polarity of the root node represents the polarity of the whole sentence, and the polarity of a constituent node is considered the polarity of the phrase spanned by the node. Formally, T* is defined as:

T^* = \arg\max_{T \in G(w,K)} P(T \,|\, w, K)   (8)

where G(w, K) denotes the set of unrefined sentiment parse trees for w with skeleton K. P(T | w, K) is defined based on the parse tree score of Equation 4:

P(T \,|\, w, K) = \frac{S(T \,|\, w, K)}{\sum_{\hat{T} \in G(w,K)} S(\hat{T} \,|\, w, K)}   (9)

Note that unlike syntactic parsing, on SST we do not need to consider structural ambiguity, and thus only resolve rule ambiguity. T* can be found using dynamic programming such as the CYK algorithm for WG. However, parsing becomes intractable with LVGs and LVeGs, since we have to compute the score of an unrefined parse tree by summing over all of its refined versions. We use the best-performing max-rule-product decoding algorithm (Petrov et al., 2006; Petrov and Klein, 2007) for approximate parsing, which searches for the parse tree that maximizes the product of the posteriors (or expected counts) of the unrefined rules in the parse tree. The detailed procedure, which is based on the classic inside-outside algorithm, is described below.

For LVGs, we first use dynamic programming to recursively calculate the inside score function s_I^A(a, i, j) and the outside score function s_O^A(a, i, j) for each sentiment polarity over each span w_{i:j} consistent with the skeleton K, using Equation 10 and Equation 11 in Table 1, respectively. Similarly, for LVeGs the inside score function s_I^A(\mathbf{a}, i, j) and the outside score function s_O^A(\mathbf{a}, i, j) are calculated by Equation 13 and Equation 14 in Table 1, in which the sums over discrete variables in Equations 10-11 are replaced with integrals over continuous vectors. Next, using Equation 12 and Equation 15 in Table 1, we calculate the score s(A → BC, i, k, j) for LVG and LVeG, respectively, where 〈A → BC, i, k, j〉 represents an anchored transition rule A → BC with A, B and C spanning the phrases w_{i:j}, w_{i:k−1} and w_{k+1:j} (all consistent with the skeleton K), respectively.


s_I^A(a, i, j) = \sum_{A \to BC \in R_t} \sum_{b} \sum_{c} W_{A \to w_{i:j}}(a) \, W_{A \to BC}(a, b, c) \times s_I^B(b, i, k) \, s_I^C(c, k+1, j)   (10)

s_O^A(a, i, j) = \sum_{B \to CA \in R_t} \sum_{b} \sum_{c} W_{A \to w_{i:j}}(a) \, W_{B \to CA}(b, c, a) \times s_O^B(b, k, j) \, s_I^C(c, k, i-1)
              + \sum_{B \to AC \in R_t} \sum_{b} \sum_{c} W_{A \to w_{i:j}}(a) \, W_{B \to AC}(b, a, c) \times s_O^B(b, i, k) \, s_I^C(c, j+1, k)   (11)

s(A \to BC, i, k, j) = \sum_{a} \sum_{b} \sum_{c} W_{A \to BC}(a, b, c) \times s_O^A(a, i, j) \times s_I^B(b, i, k) \times s_I^C(c, k+1, j)   (12)

s_I^A(\mathbf{a}, i, j) = \sum_{A \to BC \in R_t} \iint W_{A \to w_{i:j}}(\mathbf{a}) \, W_{A \to BC}(\mathbf{a}, \mathbf{b}, \mathbf{c}) \times s_I^B(\mathbf{b}, i, k) \, s_I^C(\mathbf{c}, k+1, j) \, d\mathbf{b} \, d\mathbf{c}   (13)

s_O^A(\mathbf{a}, i, j) = \sum_{B \to CA \in R_t} \iint W_{A \to w_{i:j}}(\mathbf{a}) \, W_{B \to CA}(\mathbf{b}, \mathbf{c}, \mathbf{a}) \times s_O^B(\mathbf{b}, k, j) \, s_I^C(\mathbf{c}, k, i-1) \, d\mathbf{b} \, d\mathbf{c}
              + \sum_{B \to AC \in R_t} \iint W_{A \to w_{i:j}}(\mathbf{a}) \, W_{B \to AC}(\mathbf{b}, \mathbf{a}, \mathbf{c}) \times s_O^B(\mathbf{b}, i, k) \, s_I^C(\mathbf{c}, j+1, k) \, d\mathbf{b} \, d\mathbf{c}   (14)

s(A \to BC, i, k, j) = \iiint W_{A \to BC}(\mathbf{a}, \mathbf{b}, \mathbf{c}) \times s_O^A(\mathbf{a}, i, j) \times s_I^B(\mathbf{b}, i, k) \times s_I^C(\mathbf{c}, k+1, j) \, d\mathbf{a} \, d\mathbf{b} \, d\mathbf{c}   (15)

Table 1: Equations 10-12 calculate the inside score, outside score and production rule score for LVG, respectively; Equations 13-15 are the corresponding calculations for LVeG. Equation 10 and Equation 13 are the inside score functions of a sentiment polarity A over its span w_{i:j} in the sentence w_{1:n}. Equation 11 and Equation 14 are the outside score functions of a sentiment polarity A over a span w_{i:j} in the sentence w_{1:n}. Equation 12 and Equation 15 give the production rule score of a rule A → BC with sentiment polarities A, B, and C spanning the words w_{i:j}, w_{i:k−1}, and w_{k+1:j}, respectively. Lower case letters a, b, c, ... represent discrete subtypes of the sentiment polarities A, B, C, ... in LVG, and bold lower case letters \mathbf{a}, \mathbf{b}, \mathbf{c}, ... represent continuous subtypes of sentiment polarities in LVeG. Note that spans such as w_{i:j} mentioned above are all given by the skeleton K of the sentence w_{1:n}.

The posterior (or expected count) of 〈A → BC, i, k, j〉 can be calculated as follows:

q(A \to BC, i, k, j) = \frac{s(A \to BC, i, k, j)}{s_I(S, 1, n)}   (16)

where s_I(S, 1, n) is the inside score for the start symbol S over the whole sentence w_{1:n}. Then we can run the CYK algorithm to identify the parse tree that maximizes the product of rule posteriors. Its objective function is given by:

T^*_q = \arg\max_{T \in G(w,K)} \prod_{e \in T} q(e)   (17)

where e ranges over all the transition rules in the sentiment parse tree T.
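With the skeleton fixed, the decoding of Equation 17 reduces to a Viterbi-style pass over the tree of rule posteriors. Below is a minimal sketch, assuming NumPy and a hypothetical nested-dict tree encoding with precomputed posteriors q from Equation 16; leaves contribute no transition-rule factor.

```python
import numpy as np

def best_label_scores(node, q, n):
    """Max-rule-product decoding (Equation 17) over a fixed skeleton.

    q[node_id] is an (n, n, n) array of posteriors q(A -> B C, i, k, j) for the
    anchored rule at that internal node.  Returns, for each candidate label of
    `node`, the best achievable product over its subtree.  Backpointers
    (omitted here) would recover the labels of all descendants.
    """
    if "left" not in node:                  # leaf-attached phrase: no transition rule fires
        return np.ones(n)
    best_left = best_label_scores(node["left"], q, n)
    best_right = best_label_scores(node["right"], q, n)
    # combined[a, b, c] = q(a -> b c) * best score below the left child * below the right child
    combined = q[node["id"]] * best_left[None, :, None] * best_right[None, None, :]
    return combined.reshape(n, -1).max(axis=1)

# The predicted sentence-level polarity is then the argmax over the root's scores.
```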

Note that the equations in Table 1 are tailored for our sentiment grammars and differ from their standard versions in two aspects. First, we take into account the additional emission rules in the inside and outside computation; second, the parse tree skeleton is assumed given and hence the split point k is fixed in advance in all the equations.
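The skeleton-constrained inside computation of Equation 10 is simple to state in code once the refined labels (polarity, subtype) are flattened into a single index of size K. The NumPy sketch below is illustrative only (the tree encoding and array shapes are our assumptions); it also reflects the two tailorings just noted: the emission weight is multiplied in at every node and the split point comes from the skeleton.

```python
import numpy as np

def inside(node, emission, W):
    """Inside pass of Equation 10 over a fixed skeleton (illustrative sketch).

    Refined labels are flattened into one index of size K, so W has shape
    (K, K, K) and emission[span_id] has shape (K,).  `node` is a nested dict
    {"span": id, "left": ..., "right": ...}; leaf-attached phrases omit
    "left"/"right".  Returns s_I over the K refined labels of `node`.
    """
    e = emission[node["span"]]                       # W_{A -> w_{i:j}}(a)
    if "left" not in node:                           # no transition rule fires here
        return e
    s_b = inside(node["left"], emission, W)          # s_I^B(b, i, k)
    s_c = inside(node["right"], emission, W)         # s_I^C(c, k+1, j), split k given by K
    return e * np.einsum("abc,b,c->a", W, s_b, s_c)  # Equation 10
```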

4.5 Learning

Given a training dataset D = {(T_i, w_i, K_i) | i = 1 ... m} containing m samples, where T_i is the gold sentiment parse tree for the sentence w_i with its corresponding gold tree skeleton K_i, the discriminative learning objective is to minimize the negative log conditional likelihood:

L(\Theta) = -\log \prod_{i=1}^{m} P_\Theta(T_i \,|\, w_i, K_i)   (18)

where Θ represents the set of trainable parameters of our models. We optimize the objective with gradient-based methods. In particular, gradients are first calculated over the sentiment grammar layer before being back-propagated to the tree-LSTM layer.

The gradient computation for the three models involves computing expected counts of rules, which has been described in Section 4.4. For WG and LVG, the derivative with respect to W_r, the parameter of an unrefined production rule r, is:

\frac{\partial L(\Theta)}{\partial W_r} = \sum_{i=1}^{m} \left( E_\Theta[f_r(t) \,|\, T_i] - E_\Theta[f_r(t) \,|\, w_i] \right)   (19)

where E_\Theta[f_r(t) | T_i] denotes the expected count of the unrefined production rule r with respect to P_\Theta over the set of refined trees t that are consistent with the observed parse tree T_i. Similarly, E_\Theta[f_r(t) | w_i] is the expectation over all derivations of the sentence w_i.

For LVeG, the derivative with respect to Θ_r, the parameters of the weight function W_r(\mathbf{r}) of an unrefined production rule r, is:

\frac{\partial L(\Theta)}{\partial \Theta_r} = \sum_{i=1}^{m} \int \frac{\partial W_r(\mathbf{r})}{\partial \Theta_r} \times \frac{E_\Theta[f_r(t) \,|\, w_i] - E_\Theta[f_r(t) \,|\, T_i]}{W_r(\mathbf{r})} \, d\mathbf{r}   (20)

The two expectations in Equations 19 and 20 can be efficiently computed using the inside-outside algorithm in Table 1. The derivatives of the parameters of the neural encoder can be derived from the derivatives of the parameters of the emission rules.
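In practice, with automatic differentiation the loss in Equation 18 can be obtained directly as the difference between the (log) total score of refined trees consistent with the gold tree and the (log) inside score of all trees over the skeleton; the gradients of Equations 19-20 then follow from backpropagation. A minimal sketch, reusing the hypothetical `inside` routine above and assuming a variant `inside_gold` that clamps each node to its gold polarity:

```python
import numpy as np

def neg_log_likelihood(tree, emission, W, inside, inside_gold):
    """Equation 18 for a single training sample, in log form (illustrative).

    inside_gold(tree, emission, W) is assumed to run the same recursion as
    inside(...) but with each node's polarity clamped to its gold label, so its
    total mass covers exactly the refined trees consistent with the gold tree.
    """
    score_gold = inside_gold(tree, emission, W).sum()  # numerator of Equation 9 for the gold tree
    score_all = inside(tree, emission, W).sum()        # total inside mass over the skeleton,
                                                       # standing in for s_I(S, 1, n)
    return -(np.log(score_gold) - np.log(score_all))
```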

5 Experiments

To investigate the effectiveness of modeling sentiment composition explicitly and of using discrete variables or continuous vectors to model sentiment subtypes, we compare the standard constituent Tree-LSTM (ConTree) with our models ConTree+WG, ConTree+LVG and ConTree+LVeG, respectively. To show the universality of our approach, we also experiment with the combination of a state-of-the-art sequence structured model, the bi-attentive classification network (BCN, Peters et al. (2018)), with our models: BCN+WG, BCN+LVG and BCN+LVeG.

5.1 Data

We use the Stanford Sentiment Treebank (SST, Socher et al. (2013)) for our experiments. Each constituent node in a phrase-structured tree is manually assigned an integer sentiment polarity from 0 to 4, corresponding to five sentiment classes: very negative, negative, neutral, positive and very positive. The root label represents the sentiment label of the whole sentence. The constituent node label represents the sentiment label of the phrase it spans. We perform both binary classification (-1, 1) and fine-grained classification (0-4), called SST-2 and SST-5, respectively. Following previous work, we use the labels of all phrases and gold-standard tree structures for training and testing. For binary classification, we merge all positive labels and all negative labels.

5.2 Experimental Settings

Hyper-parameters   For ConTree, word vectors are initialized using GloVe (Pennington et al., 2014) 300-dimensional embeddings and are updated together with the other parameters. The size of the hidden units is set to 300. Adam (Kingma and Ba, 2014) is used to optimize the parameters with a learning rate of 0.001. We apply dropout after the embedding layer with a probability of 0.5. The sentence-level mini-batch size is 32. For the BCN experiments, we follow the model settings in McCann et al. (2017), except that the sentence-level mini-batch size is set to 8.

5.3 Development Experiments

We use the SST development dataset to investigate different configurations of our latent variables and Gaussian mixtures. The best-performing parameters on the development set are used in all following experiments.

LVG subtype numbers   To explore a suitable number of latent variables for modeling the subtypes of a sentiment polarity, we evaluate our ConTree+LVG model with the number of latent variables ranging from 1 to 8. Figure 6(a) shows an upward trend as the number of latent variables n increases from 1 to 4. After reaching a peak at n = 4, the accuracy decreases as the number of latent variables continues to increase. We thus choose n = 4 for the remaining experiments.

LVeG Gaussian dimensions   We investigate the influence of the latent vector dimension on the accuracy of ConTree+LVeG, with the number of Gaussian mixture components fixed to 1. Figure 6(b) shows that as the dimension increases from 1 to 8, accuracy rises from dimension 1 to 2 and then decreases from 2 to 8. We therefore set the Gaussian dimension to 2 for the remaining experiments.



Figure 6: Sentence-level accuracies on the development dataset. Figure (a) shows the performance of ConTree+LVG with different numbers of latent variables. Figures (b) and (c) show the performance of ConTree+LVeG with different Gaussian dimensions and numbers of Gaussian mixture components, respectively.

Model                              SST-5 Root   SST-5 Phrase   SST-2 Root   SST-2 Phrase
ConTree (Le and Zuidema, 2015)     49.9         -              88.0         -
ConTree (Tai et al., 2015)         51.0         -              88.0         -
ConTree (Zhu et al., 2015)         50.1         -              -            -
ConTree (Li et al., 2015)          50.4         83.4           86.7         -
ConTree (Our implementation)       51.5         82.8           89.4         86.9
ConTree + WG                       51.7         83.0           89.7         88.9
ConTree + LVG4                     52.2         83.2           89.8         89.1
ConTree + LVeG                     52.9         83.4           89.8         89.5

Table 2: Experimental results with constituent Tree-LSTMs.

Model        SST-5 Root   SST-5 Phrase   SST-2 Root   SST-2 Phrase
BCN(P)       54.7         -              -            -
BCN(O)       54.6         83.3           91.4         88.8
BCN+WG       55.1         83.5           91.5         90.5
BCN+LVG4     55.5         83.5           91.7         91.3
BCN+LVeG     56.0         83.5           92.1         91.6

Table 3: Experimental results with ELMo. BCN(P) is the BCN implemented by Peters et al. (2018); BCN(O) is the BCN implemented by ourselves.

LVeG Gaussian mixture component numbers   Figure 6(c) shows the performance for different numbers of components with the Gaussian dimension fixed to 2. As the number of Gaussian components increases, the fine-grained sentence-level accuracy declines slowly. The best performance is obtained when the component number K_r = 1, which we choose for the remaining experiments.

5.4 Main Results

We re-implement the constituent Tree-LSTM (ConTree) of Tai et al. (2015) and obtain better results than their original implementation. We then integrate ConTree with Weighted Grammars (ConTree+WG), Latent Variable Grammars with a subtype number of 4 (ConTree+LVG4), and Latent Vector Grammars (ConTree+LVeG), respectively. Table 2 shows the experimental results for sentiment classification on both SST-5 and SST-2 at the sentence level (Root) and over all nodes (Phrase).

The performance improvement of ConTree+WG over ConTree reflects the benefit of handling sentiment composition explicitly. In particular, on the phrase-level binary classification task, ConTree+WG improves the accuracy by 2 points.

Compared with ConTree+WG, ConTree+LVG4 improves the fine-grained sentence-level accuracy by 0.5 points, which demonstrates the effectiveness of modeling sentiment subtypes with discrete variables. Similarly, the performance improvements from incorporating the Latent Vector Grammar into the constituent Tree-LSTM, especially on sentence-level SST-5, demonstrate the effectiveness of modeling sentiment subtypes with continuous vectors. The performance improvements of ConTree+LVeG over ConTree+LVG4 show the advantage of infinite subtypes over finite subtypes.

There has also been work using large-scale external datasets to improve the performance of sentiment classification. Peters et al. (2018) combined the bi-attentive classification network (BCN, McCann et al. (2017)) with a language model with character convolutions pretrained on a large-scale corpus (ELMo) and reported an accuracy of 54.7 on sentence-level SST-5.



Figure 7: Changes in phrase-level 5-class accuracies of our methods over ConTree.

For fair comparison, we also augment our model with ELMo. Table 3 shows that our methods beat the baseline on every task. BCN+WG improves accuracies on all tasks slightly by modeling sentiment composition explicitly. The clear improvements of BCN+LVG4 and BCN+LVeG show that explicitly modeling sentiment composition with fine-grained sentiment subtypes is useful. In particular, BCN+LVeG improves the sentence-level classification accuracies by 1.4 points (fine-grained) and 0.7 points (binary) compared to BCN (our implementation), respectively. To our knowledge, we achieve the best results on the SST dataset.

5.5 Analysis

We make further analysis of our methods based on the constituent Tree-LSTM model. In the following, WG, LVG and LVeG denote our three methods, respectively.

Impact on words and phrases   Figure 7 shows the accuracy improvements over ConTree on phrases of different heights. Here the height h of a phrase in the parse tree is defined as the distance between its corresponding constituent node and the deepest leaf node in its subtree. The improvement of our methods on word nodes, whose height is 0, is small, because neural networks and word embeddings can already capture the emotion of words. In fact, the accuracy of ConTree on word nodes reaches 98.1%. As the height increases, the improvements of our methods increase, except for the accuracies of WG when h ≥ 10, since the coarse-grained sentiment representation has difficulty handling the many sentiment compositions over deep tree structures. The performance improvements of LVG4 and LVeG when h ≥ 10 show that modeling fine-grained sentiment signals better represents the sentiment of higher phrases.

Figure 8: Top left is the normalized confusion matrix on the 5-class phrase-level test dataset for ConTree. The others show the performance changes of our models over ConTree. The value in each cell is given in units of 10^{-2}.

Impact on sentiment polarities   Figure 8 shows the performance changes of our models over ConTree on different sentiment polarities. WG improves the accuracy of every sentiment polarity over ConTree slightly. Compared with ConTree, the accuracies of LVG4 and LVeG on extreme sentiments (the strong negative and strong positive sentiments) receive significant improvements. In addition, the proportion of extreme emotions misclassified as weak emotions (the negative and positive sentiments) drops dramatically. This indicates that LVG4 and LVeG can capture the subtle difference between extreme sentiments and weak sentiments by modeling sentiment subtypes explicitly.

Visualization of sentiment subtypes   To investigate whether our LVeG can accurately model different emotional subtypes, we visualize all the strong negative sentiment phrases with length below 6 that are classified correctly in a 2D space. Since LVeG uses 2-dimensional, 1-component Gaussian mixtures to model the distribution over subtypes of a specific sentiment, we directly represent phrases by their Gaussian means µ. From Figure 9, we see that boring emotions such as "Extremely boring" and "boring" (green dots) are located at the bottom left, stupid emotions such as "stupider" and "Ridiculous" (red dots) are mainly located at the top right, and negative emotions with no special emotional tendency such as "hate" and "bad" (blue dots) are evenly distributed throughout the space. This demonstrates that LVeG can capture sentiment subtypes.



Figure 9: Visualization of phrases no longer than 5 words in the strong negative sentiment space.

6 Conclusion

We presented a range of sentiment grammars for using neural networks to model sentiment composition explicitly, and empirically showed that explicit modeling of sentiment composition with fine-grained sentiment subtypes gives better performance compared to state-of-the-art neural network models in sentiment analysis. By using ELMo embeddings, our final model improves fine-grained accuracies by 1.3 points compared to the current best result.

Acknowledgments

This work was supported by the Major Program of Science and Technology Commission Shanghai Municipal (17JC1404102) and NSFC (No. 61572245). We would like to thank the anonymous reviewers for their careful reading and useful comments.

References

Yejin Choi and Claire Cardie. 2008. Learning with compositional semantics as structural inference for subsentential sentiment analysis. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 793–801. Association for Computational Linguistics.

Li Dong, Furu Wei, Shujie Liu, Ming Zhou, and Ke Xu. 2015. A statistical parsing framework for sentiment classification. Computational Linguistics, pages 265–308.

Amulya Gupta and Zhu Zhang. 2018. To attend or not to attend: A case study on syntactic structures for semantic relatedness. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 2116–2125. Association for Computational Linguistics.

Mark Johnson. 1998. PCFG models of linguistic tree representations. Computational Linguistics, pages 613–632.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. International Conference on Learning Representations.

Phong Le and Willem Zuidema. 2015. Compositional distributional semantics with long short term memory. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 10–19. Association for Computational Linguistics.

Jiwei Li, Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2304–2314. Association for Computational Linguistics.

Takuya Matsuzaki, Yusuke Miyao, and Jun'ichi Tsujii. 2005. Probabilistic CFG with latent annotations. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 75–82, Ann Arbor, Michigan. Association for Computational Linguistics.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294–6305.

Albert Mehrabian. 1980. Basic dimensions for a general psychological theory: Implications for personality, social, environmental, and developmental studies. Oelgeschlager, Gunn & Hain, Cambridge, MA.

Karo Moilanen and Stephen Pulman. 2007. Sentiment composition. In Proceedings of Recent Advances in Natural Language Processing, pages 378–382.

Richard Montague. 1974. Formal Philosophy; Selected Papers of Richard Montague. New Haven: Yale University Press.

Andrew Ortony and Terence J. Turner. 1990. What's basic about basic emotions? Psychological Review, 97(3):315–331.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2227–2237. Association for Computational Linguistics.

Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, pages 433–440. Association for Computational Linguistics.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 404–411. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642. Association for Computational Linguistics.

Maite Taboada, Julian Brooke, Milan Tofiloski, Kimberly Voll, and Manfred Stede. 2011. Lexicon-based methods for sentiment analysis. Computational Linguistics, pages 267–307.

Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1556–1566. Association for Computational Linguistics.

Zhiyang Teng, Duy Tin Vo, and Yue Zhang. 2016. Context-sensitive lexicon features for neural sentiment analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1629–1638. Association for Computational Linguistics.

Zhiyang Teng and Yue Zhang. 2017. Head-lexicalized bidirectional tree LSTMs. Transactions of the Association for Computational Linguistics, 5:163–177.

Yanpeng Zhao, Liwen Zhang, and Kewei Tu. 2018. Gaussian mixture latent vector grammars. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 1181–1189. Association for Computational Linguistics.

Xiaodan Zhu, Parinaz Sobihani, and Hongyu Guo. 2015. Long short-term memory over recursive structures. In International Conference on Machine Learning, pages 1604–1612.