Unsupervised Approaches

Aditya M Joshi
Center for Indian Language Technologies (CFILT), IIT Bombay
20th June, 2016
[email protected] | [email protected]


Page 1

Unsupervised Approaches

Aditya M Joshi
Center for Indian Language Technologies (CFILT), IIT Bombay
20th June, 2016

[email protected] [email protected]

Page 2

Images from Wikimedia Commons

Page 3

Page 4

Unsupervised Approaches

• A technique that infers a function describing hidden structure in unlabelled data

• Uses unlabelled data for prediction tasks

Page 5

Popular Approaches

• Clustering

• Latent Dirichlet Allocation (LDA) Model

Page 6

clustering

Page 7

Clustering

• Find clusters in a set of data points

Page 8

Clustering

• Find clusters in a set of data points

Page 9

k-means Clustering

• Dataset {x1, x2, …, xn}

• Goal: partition the n observations into k clusters

• Membership of point xn in cluster k is indicated by rnk ∈ {0, 1}

• Goal, redefined: minimize the distortion J = Σn Σk rnk ‖xn − μk‖², where μk is the mean of cluster k

Page 10

Algorithm

• Initialisation: pick k of the data points to be the initial means.

• Assignment: go over each data point and assign it to the closest mean, e.g. if data point xn is closest to the second mean, assign it to that mean.

• Update: recompute each of the means as the average of all the points assigned to it.

• Repeat the assignment and update steps until the assignments stop changing.
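The steps above can be sketched in a few lines of NumPy (an illustrative sketch, not code from the lecture; the two-blob toy data is made up):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Minimal k-means on an (n, d) array; returns (assignments, means)."""
    rng = np.random.default_rng(seed)
    # Initialisation: pick k of the data points to be the initial means.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assignment: each point goes to the closest mean (this is r_nk).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update: recompute each mean as the average of its assigned points.
        new_means = np.array([X[assign == j].mean(axis=0)
                              if (assign == j).any() else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):   # converged: means are stable
            break
        means = new_means
    return assign, means

# Toy data: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
assign, means = kmeans(X, k=2)
```

Each iteration can only decrease the distortion J, which is why the loop terminates once the means stop moving.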

Page 11

Illustration

Page 12

latent dirichlet allocation models

Page 13

Outline

• Motivation and Introduction (Blei (2011))

• Building blocks of LDA: Dirichlet and Multinomials (Kulis (2012))

• Estimation using LDA (Heinrich (2004))

• Evaluation of LDA (Wallach (2009))

• Plugging in sentiment (Jo & Oh (2011), Lin & He (2009))

• Experimentation

Page 14

Outline

• Motivation and Introduction (Blei (2011))

• Building blocks of LDA: Dirichlet and Multinomials (Kulis (2012))

• Estimation using LDA (Heinrich (2004))

• Evaluation of LDA (Wallach (2009))

• Plugging in sentiment (Jo & Oh (2011), Lin & He (2009))

• Experimentation

Page 15

Revisiting classifiers

What did Prof. Pushpak Bhattacharyya talk about in the "Topics in NLP" lecture today?

Lecture transcript → Classifier → NLP, Databases, Compilers (or finer labels: SA, MT, Wordnet)

Topic models can do much more than this, using only an unlabeled corpus.

Page 16

"Topic-document distribution"

Lectures from 2008 to 2013 → Topic Modeler

[Figure: words grouped into three clusters labelled NLP, Academic, and Cultural, e.g. "strong AI", "parser", "alignment", "ACL", "thwarting", "co-reference resolution"; "demo", "RPC", "MTP"; "Krishna", "Raag", "Mahabharat", "Swar-sandhya"]

*Hypothetical example

NLP = 0.7, Academic = 0.2, Cultural = 0.1

Proportion of each topic in a document: "multiple membership"

* And in context of sentiment analysis?

Page 17

"Word-topic distribution"

"Aaditya, you are not making sense." / "Let's study word sense disambiguation."

Lectures from 2008 to 2013 → Topic Modeler

[Figure: the word "sense" appears in different topic clusters, alongside words such as "logic", "explanation", "confused", "wordnet", "polysemy", "iterative", "word"]

*Hypothetical example

"Relevance of each word to a topic": words across "topics" actually indicate different senses in which a word occurs.

Page 18

Definition

• Topic models are a suite of algorithms that discover thematic structures in a data collection (Blei (2011)).

• What is a thematic structure? A topic: a collection of words.

• Used for a wide variety of tasks such as author recognition, aspect extraction, and sentiment modelling.

Page 19

Black box

Unlabeled corpus → Topic Modeler → document-topic distribution + overall word-topic distribution

FAQs:
• Can you predict a test document directly? Not directly.
• Is there only one way to construct a topic model? No. By intelligently structuring the model, you can derive useful information.

Page 20

LDA Model

• The Latent Dirichlet Allocation (LDA) model is a basic probabilistic topic model.

• This presentation focuses on LDA and its adaptations, with sentiment as the goal.

Page 21

Plate Notation (1/2)

[Diagram: a word node w inside a plate of size Nd, nested in a plate of size D, represents an unlabeled corpus; adding an outer label plate L gives a labeled corpus; attaching a node z to each word introduces the topic.]

w: word; z: topic (latent)

Page 22

Plate Notation (2/2)

[Diagram: three variants of attaching the latent topic z to the word plate — per word, per document, and per sentence (an extra plate Ns).]

Word-level topics; document-level topics; sentence-level topics.

Page 23

Growing LDA further

[Diagram: plate model with w inside plate Nd inside plate D, topic z, and two parameters: θ per document and ϕ inside a plate Z.]

θ(z): NLP = 0.7, culture = 0.2, motivation = 0.1

ϕ(z, word): (NLP, sense) = 0.7, (culture, sense) = 0.1, (motivation, sense) = 0.2

Let us now focus on these two multinomial distributions.

Page 24

Outline

• Motivation and Introduction (Blei (2011))

• Building blocks of LDA: Dirichlet and Multinomials (Kulis (2012))

• Estimation using LDA (Heinrich (2004))

• Evaluation of LDA (Wallach (2009))

• Plugging in sentiment (Jo & Oh (2011), Lin & He (2009))

• Experimentation

Page 25

Multinomial distribution

• Training an LDA model implies learning the parameters of its multinomial distributions (θ and ϕ).

• We now focus on a multinomial distribution and the way it is modelled in the case of LDA.

Page 26

Parameter estimation (Heinrich (2004))

Posterior: P(θ|x) = P(x|θ) P(θ) / P(x)
(likelihood × prior, divided by the marginal likelihood)

Marginal likelihood: P(x) = ∫ P(x|θ) P(θ) dθ

Since P(x) does not depend on θ: P(θ|x) ∝ P(x|θ) P(θ)

Why estimate the posterior P(θ|x)? Goal: to estimate θd and ϕwz as accurately as possible, given the data (documents). The two are categorical distributions.

Page 27

Binomial distribution & MLE

• Toss of a biased coin: P(X=1) = q, P(X=0) = 1 − q

• Data: X = {x1, x2, ..., xn}, with P(xi|q) = q^xi (1 − q)^(1 − xi)

MLE = argmax_q P(X|q)
    = argmax_q P(x1|q) · P(x2|q) · ... · P(xn|q)
    = argmax_q q^(x1 + x2 + ... + xn) (1 − q)^(n − (x1 + x2 + ... + xn))
    = argmax_q q^m (1 − q)^(n − m),  where m = Σ xi

Taking logs: argmax_q (m log q + (n − m) log(1 − q))

Equating the derivative to zero: m/q = (n − m)/(1 − q), hence q = m/n

m and n are the "sufficient statistics" of a binomial distribution.
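The closed form q = m/n is easy to check numerically (an illustrative sketch; the simulated coin and the grid search are not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
q_true = 0.3
x = rng.random(100_000) < q_true        # n tosses of a biased coin
m, n = int(x.sum()), len(x)             # sufficient statistics m, n

q_mle = m / n                           # the closed-form MLE derived above

# Sanity check: q = m/n really maximises the log-likelihood
# m*log(q) + (n-m)*log(1-q) over a grid of candidate q values.
grid = np.linspace(0.01, 0.99, 99)
loglik = m * np.log(grid) + (n - m) * np.log(1 - grid)
q_grid = grid[loglik.argmax()]
```

With 100,000 tosses, both the closed form and the grid maximiser land very close to the true q = 0.3.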

Page 28

MAP of Binomial distribution (1/2)

MAP = argmax_q P(q|X) = argmax_q P(X|q) P(q) = argmax_q q^m (1 − q)^(n − m) P(q)

Problem! Strictly speaking, P(q) can be any distribution — computationally difficult!

Assume P(q) is a beta distribution: P(q) = q^(α − 1) (1 − q)^(β − 1) / B(α, β)

Page 29

MAP of Binomial distribution (2/2)

MAP = argmax_q q^m (1 − q)^(n − m) P(q)
    ∝ argmax_q q^m (1 − q)^(n − m) · q^(α − 1) (1 − q)^(β − 1)
    ∝ argmax_q q^(m + α − 1) (1 − q)^(n − m + β − 1)

Taking logs: argmax_q ((m + α − 1) log q + (n − m + β − 1) log(1 − q))

Equating the derivative to zero:
(m + α − 1)/q = (n − m + β − 1)/(1 − q)
(m + α − 1) − q(m + α − 1) = q(n − m + β − 1)
(m + α − 1) = q(m + α − 1 + n − m + β − 1)
q = (m + α − 1)/(n + α + β − 2)

The beta distribution is the conjugate prior of the binomial distribution.

Page 30

Conjugate prior

• A prior distribution is conjugate to a likelihood if the resulting posterior has the same form as the prior.

• "Algebraic convenience"

The beta distribution is a conjugate prior of the binomial distribution. What is it for the categorical distribution?
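A quick numeric check of beta-binomial conjugacy (an illustrative sketch; the prior α = β = 2 and the toss counts are made-up toy values):

```python
# Prior Beta(alpha, beta); data: m heads out of n tosses (toy values).
alpha, beta, m, n = 2.0, 2.0, 7, 10

# Conjugacy: the posterior is again a Beta, with updated parameters.
post_a, post_b = alpha + m, beta + (n - m)

# MAP from the closed form derived on the previous slide ...
q_map = (m + alpha - 1) / (n + alpha + beta - 2)
# ... equals the mode of Beta(post_a, post_b), which is (a-1)/(a+b-2).
q_mode = (post_a - 1) / (post_a + post_b - 2)
```

Both routes give 8/12 ≈ 0.667: maximising likelihood × prior and taking the mode of the conjugate posterior are the same computation.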

Page 31

Categorical distribution

• Roll of a (biased) die:
P(X=1) = q1, P(X=2) = q2, ..., P(X=6) = q6; in general, P(xi = k|q) = qk

• Data: X = {x1, ..., xN} ~ Cat(q)

P(X|q) = P(x1|q) · P(x2|q) · ... · P(xN|q) = Π_j qj^cj,  where cj is the count of outcome j

MAP = argmax_q P(X|q) P(q) = argmax_q Π_j qj^cj P(q)

Assume a Dirichlet prior: P(q) ∝ Π_j qj^(αj − 1)

MAP ∝ argmax_q Π_j qj^cj qj^(αj − 1) ∝ argmax_q Π_j qj^(αj + cj − 1)

The Dirichlet distribution is the conjugate prior of the categorical distribution.
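The MAP estimate for the categorical case normalises the posterior exponents αj + cj − 1, mirroring the binomial result. A tiny check with made-up die counts (illustrative only):

```python
# Toy die rolls: c[j] is the count of face j+1; Dirichlet prior alpha_j = 2.
c = [10, 4, 6, 8, 2, 10]
alpha = [2.0] * 6

# MAP of a categorical with a Dirichlet prior: normalise (c_j + alpha_j - 1).
num = [cj + aj - 1 for cj, aj in zip(c, alpha)]
q_map = [x / sum(num) for x in num]
```

The prior acts like pseudo-counts of αj − 1 extra observations per face, smoothing the raw frequencies.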

Page 32

Binomial & Categorical distribution

Binomial:    q ~ Beta(α, β),  x ~ Binomial(q)
Categorical: θ ~ Dir(α),      z ~ Categorical(θ),  with P(z|θ) = θz

In each case: hyper-parameters → distribution → random variable assignments.

Does the name Latent Dirichlet Allocation seem justifiable now?

Page 33

Our first LDA model

[Plate diagram: hyper-parameter α → θ → z → w, with w inside plate Nd inside plate D; hyper-parameter β → ϕ, with ϕ inside plate Z, feeding w.]

Page 34

Outline

• Motivation and Introduction (Blei (2011))

• Building blocks of LDA: Dirichlet and Multinomials (Kulis (2012))

• Estimation using LDA (Heinrich (2004))

• Evaluation of LDA (Wallach (2009))

• Plugging in sentiment (Jo & Oh (2011), Lin & He (2009))

• Experimentation

Page 35

Estimation of LDA model

P(θ, ϕ|w) = P(w|θ, ϕ) P(θ, ϕ) / P(w)

The denominator is computationally intractable. Hence, Gibbs sampling is used.

We now describe the generative story.

Every LDA paper has:
• a plate notation
• a generative story
• Gibbs sampling formulas

Page 36

Generative story

[Plate diagram as on the previous slide: α → θ → z → w inside plates Nd and D; β → ϕ inside plate Z.]

For each topic,
    sample ϕ ~ Dir(β)
For each document,
    generate θ ~ Dir(α)
    For each word,
        sample z ~ Multinomial(θ)
        sample w ~ ϕ(z)
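The generative story can be run directly as simulation code (an illustrative sketch; all sizes and hyper-parameters are made-up toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
V, Z, D, Nd = 50, 3, 10, 20        # vocabulary, topics, documents, words/doc
alpha, beta = 0.5, 0.1             # Dirichlet hyper-parameters

# For each topic, sample a word distribution phi_z ~ Dir(beta).
phi = rng.dirichlet(np.full(V, beta), size=Z)

docs = []
for _ in range(D):
    theta = rng.dirichlet(np.full(Z, alpha))        # theta ~ Dir(alpha)
    words = []
    for _ in range(Nd):
        z = rng.choice(Z, p=theta)                  # z ~ Multinomial(theta)
        words.append(int(rng.choice(V, p=phi[z])))  # w ~ phi(z)
    docs.append(words)
```

Running the story forward like this produces a synthetic corpus; inference (next slides) is the reverse problem of recovering θ and ϕ from such a corpus.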

Page 37

Implementing topic models

[Plate diagram as before: α → θ → z → w inside plates Nd and D; β → ϕ inside plate Z.]

For each topic,
    sample ϕ ~ Dir(β)
For each document,
    generate θ ~ Dir(α)
    For each word,
        sample z ~ Multinomial(θ)
        sample w ~ ϕ(z)

Page 38

Sampling from multinomial

Input: θ with P(z=0) = 0.1, P(z=1) = 0.3, P(z=2) = 0.6

Goal: sample a z given this distribution.

Draw u ~ Uniform(0, 1) and pick the interval of the cumulative distribution it falls in:
z=0: [0, 0.1), z=1: [0.1, 0.4), z=2: [0.4, 1]
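Picking the interval that a uniform draw falls into is inverse-CDF sampling; a minimal sketch using the distribution from this slide:

```python
import random

def sample_categorical(theta, u=None):
    """Inverse-CDF sampling: draw u ~ Uniform(0, 1) and return the
    index of the cumulative-distribution interval that u falls in."""
    u = random.random() if u is None else u
    cum = 0.0
    for z, p in enumerate(theta):
        cum += p                # boundaries for this slide: 0, 0.1, 0.4, 1
        if u < cum:
            return z
    return len(theta) - 1       # guard against floating-point round-off

theta = [0.1, 0.3, 0.6]         # P(z=0), P(z=1), P(z=2) from the slide
```

For example, u = 0.2 lands in [0.1, 0.4) and returns z = 1; over many draws each z appears with frequency close to its probability.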

Page 39

Implementing topic models

[Plate diagram as before: α → θ → z → w inside plates Nd and D; β → ϕ inside plate Z.]

For each topic,
    sample ϕ ~ Dir(β)
For each document,
    generate θ ~ Dir(α)
    For each word,
        sample z ~ Multinomial(θ)
        sample w ~ ϕ(z)

Page 40

Gibbs sampling

Initialize all word positions to random z's; compute θ and ϕ accordingly.
For each iteration,
    For each document,
        For each word,
            generate a z based on θ
            generate a w based on ϕ(w|z)
    Compute θ and ϕ
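One common concrete realisation of this loop is collapsed Gibbs sampling, in which θ and ϕ are integrated out and each word's topic z is resampled from count statistics; the sketch below follows that variant (an assumption on my part — the slide's pseudocode is more schematic) with made-up toy documents:

```python
import numpy as np

def lda_gibbs(docs, V, Z, alpha=0.5, beta=0.1, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    ndz = np.zeros((D, Z))            # per-document topic counts
    nzw = np.zeros((Z, V))            # per-topic word counts
    nz = np.zeros(Z)                  # words assigned to each topic
    # Initialise all word positions to random z's and fill the counts.
    z_assign = [rng.integers(Z, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, z in zip(doc, z_assign[d]):
            ndz[d, z] += 1; nzw[z, w] += 1; nz[z] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = z_assign[d][i]
                # Remove this position's assignment from the counts ...
                ndz[d, z] -= 1; nzw[z, w] -= 1; nz[z] -= 1
                # ... and resample z from its full conditional.
                p = (ndz[d] + alpha) * (nzw[:, w] + beta) / (nz + V * beta)
                z = rng.choice(Z, p=p / p.sum())
                z_assign[d][i] = z
                ndz[d, z] += 1; nzw[z, w] += 1; nz[z] += 1
    # Point estimates of the two multinomials from the final counts.
    theta = (ndz + alpha) / (ndz + alpha).sum(axis=1, keepdims=True)
    phi = (nzw + beta) / (nzw + beta).sum(axis=1, keepdims=True)
    return theta, phi

docs = [[0, 1, 0, 2], [3, 4, 3, 4], [0, 2, 1, 0]]
theta, phi = lda_gibbs(docs, V=5, Z=2)
```

After enough iterations, θ and ϕ read off from the counts are the document-topic and word-topic distributions the black-box slide promised.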

Page 41

Outline

• Motivation and Introduction (Blei (2011))

• Building blocks of LDA: Dirichlet and Multinomials (Kulis (2012))

• Estimation using LDA (Heinrich (2004))

• Evaluation of LDA (Wallach (2009))

• Plugging in sentiment (Jo & Oh (2011), Lin & He (2009))

• Experimentation

Page 42

Evaluation

• Qualitative evaluation (understanding topic cohesion) (Mukherjee et al. (2012))

• Classification accuracy based on the topics uncovered

• Held-out likelihood (likelihood of data given parameters) (Wallach et al. (2009))

A naïve addition:
• Measuring sentiment cohesion: count of positive and negative words in each topic
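The naïve sentiment-cohesion idea could be implemented as a simple word count (a hypothetical sketch; the word lists and the cohesion formula are my own illustration, not from the slides):

```python
# Hypothetical sentiment word lists (illustrative, not from the slides).
positive = {"amazing", "hilarious", "great", "fans"}
negative = {"scary", "disappoint", "problem", "gore"}

def sentiment_cohesion(topic_words):
    """Naive cohesion: of the sentiment-bearing words in a topic,
    what fraction agrees with the majority polarity?"""
    pos = sum(w in positive for w in topic_words)
    neg = sum(w in negative for w in topic_words)
    total = pos + neg
    return max(pos, neg) / total if total else 0.0

topic_38 = ["horror", "killer", "scary", "house", "gore"]
```

A cohesion of 1.0 means every sentiment word in the topic shares one polarity; 0.5 means the topic mixes polarities evenly.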

Page 43

Outline

• Motivation and Introduction (Blei (2011))

• Building blocks of LDA: Dirichlet and Multinomials (Kulis (2012))

• Estimation using LDA (Heinrich (2004))

• Evaluation of LDA (Wallach (2009))

• Plugging in sentiment (Jo & Oh (2011), Lin & He (2009))

• Experimentation

Page 44

Experiments with LDA

• Goal: understand topic models and obtain sentiment-coherent topics from an LDA model

• Implementation:
  – Topic model implementation using Gibbs sampling
  – Hyper-parameter estimation as given in Heinrich (2009)
  – "Left to right" likelihood algorithm by Wallach (2009)

Page 45

Data set

• Movie review data set from Amazon by McAuley & Leskovec (2013):
  – Training data set: 11000 movie reviews
  – Test data set: 2000 movie reviews

• Average length of a review: ~140 words

Page 46

Effect of hyper-parameter estimation

Page 47

Discovering sentiment-coherent topics

Modify basic LDA in one of the following ways:

1) Bootstrapping sentiment priors with word lists

2) Modifying the structure of the topic model

Page 48

Existing topic models

• Lin & He (2009) present a Joint Sentiment-Topic Model with sentiment as a latent variable.

• Jo & Oh (2011) extract senti-aspects: (sentiment, feature) pairs.

• Titov & McDonald (2008) use a sliding window model to incorporate discourse nature of reviews.

• Mukherjee & Liu (2012b) identify words belonging to six types of review comment expressions from an unlabeled corpus.

Page 49

Discovering sentiment-coherent topics

Modify basic LDA in one of the following ways:

1) Bootstrap sentiment priors with word lists

2) Modifying the structure of the topic model

Page 50

Discovering sentiment: Use of priors

• Induce positive and negative words to belong to certain topics (based on Lin & He (2009)):
  – For negative words, set beta(word, z = 0 to Z/2) = 2·beta and beta(word, z = Z/2 to Z) = 0.
  – Set the corresponding beta values for positive words.
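The prior scheme above can be written out concretely (a hypothetical sketch; the vocabulary, word lists, and matrix layout are illustrative assumptions):

```python
import numpy as np

V, Z = 6, 4                      # toy vocabulary size and topic count
vocab = ["good", "great", "bad", "awful", "movie", "plot"]
positive_words = {"good", "great"}
negative_words = {"bad", "awful"}
beta = 0.1

# Start from a symmetric prior over (topic, word) ...
B = np.full((Z, V), beta)
# ... then skew it: negative words get extra mass in the first Z/2 topics
# and zero in the rest; positive words get the mirror image.
for w, word in enumerate(vocab):
    if word in negative_words:
        B[: Z // 2, w] = 2 * beta
        B[Z // 2 :, w] = 0.0
    elif word in positive_words:
        B[: Z // 2, w] = 0.0
        B[Z // 2 :, w] = 2 * beta
```

Neutral words keep the symmetric prior, so only the seeded sentiment words pull topics toward one polarity.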

Page 51

Use of Priors: Results (1/2)

• "Basic": imposing priors on only 12 sentiment words

• Leads to more sentiment words being identified in the correct topics

Page 52

Use of Priors: Results(2/2)

Qualitative evaluation: some topics are positive while others are negative, depending on the priors.

Topic 38: 7.330 horror, 2.392 killer, 2.248 scary, 2.147 house, 2.072 gore

Topic 13: 6.931 michael, 3.929 fans, 3.423 live, 2.379 amazing, 2.354 concert

Page 53

Discovering sentiment-coherent topics

Modify basic LDA in one of the following ways:

1) Bootstrap sentiment priors with word lists

2) Modifying the structure of the topic model

Page 54

Discovering sentiment: Modifying structure

• Sentiment is explicitly modelled as a latent variable (based on the joint sentiment/topic model by Lin & He (2009)).

[Diagrams: SLDA and SLDA-Split]

Page 55

Sentiment as a Variable: Results (1/2)

Parameters: Z = 70; S = 2

Page 56

Sentiment as a Variable: Results (2/2)

• SLDA

• SLDA-Split

For S = 3:

Topic 13, s = 0: 9.551 show, 9.254 humor, 7.166 comedy, 4.846 watch, 4.680 hilarious
Topic 13, s = 2: 6.964 rock, 5.547 children, 5.38 school, 4.636 remember, 3.432 learn
No equivalence between topic 13 for s = 0 and s = 2.

Topic 31, s = 0: 8.006 product, 6.277 received, 5.244 amazon, 4.119 condition, 4.043 seller
Topic 31, s = 1: 5.206 return, 4.661 problem, 4.412 disappoint, 3.654 case, 3.616 copy
Topic 31, s = 2: 10.358 amazon, 9.213 play, 7.068 player, 3.651 dvds, 3.594 purchased
A topic essentially implies "different polarities" in the same "context".

Page 57

Conclusion

• Unsupervised approaches rely on unlabelled data

• We looked at k-means clustering

• We also looked at unsupervised/semi-supervised approaches like LDA

Page 58

References (1/2)

• Balamurali, A., Joshi, A., & Bhattacharyya, P. (2011). Harnessing wordnet senses for supervised sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1081–1091). Association for Computational Linguistics.
• Balamurali, A., Joshi, A., & Bhattacharyya, P. (2012). Cross-lingual sentiment analysis for Indian languages using linked wordnets. In COLING (Posters) (pp. 73–82).
• Balamurali, A., Khapra, M. M., & Bhattacharyya, P. (2013). Lost in translation: viability of machine translation for cross language sentiment analysis. In Computational Linguistics and Intelligent Text Processing (pp. 38–49). Springer.
• Banea, C., Mihalcea, R., Wiebe, J., & Hassan, S. (2008). Multilingual subjectivity analysis using machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 127–135). Association for Computational Linguistics.
• Blei, D. M. (2011). Introduction to probabilistic topic models.
• Blei, D. M., Ng, A. Y., Jordan, M. I., & Lafferty, J. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3, 2003.
• Boyd-Graber, J., Chang, J., Gerrish, S., Wang, C., & Blei, D. (2009). Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems (NIPS).
• Brody, S. & Elhadad, N. (2010). An unsupervised aspect-sentiment model for online reviews. In HLT-NAACL (pp. 804–812). The Association for Computational Linguistics.
• Carl, M. (2012). Translog-II: a program for recording user activity data for empirical reading and writing research. In LREC (pp. 4108–4112).
• Dragsted, B. (2010). Coordination of reading and writing processes in translation. Translation and Cognition, American Translators Association Scholarly Monograph Series. Amsterdam/Philadelphia: Benjamins, 41–62.
• Duh, K., Fujino, A., & Nagata, M. (2011). Is machine translation ripe for cross-lingual sentiment classification? In ACL (Short Papers) (pp. 429–433).
• Fellbaum, C. (2010). Wordnet: An electronic lexical database. 1998. WordNet is available from http://www.cogsci.princeton.edu/wn.
• Jo, Y. & Oh, A. (2011). Aspect and sentiment unification model for online review analysis. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 815–824). ACM.

Page 59

References (2/2)

• Joshi, S., Kanojia, D., & Bhattacharyya, P. (2013). More than meets the eye: Study of human cognition in sense annotation. In Proceedings of NAACL-HLT (pp. 733–738).
• Kulis, B. (2012). Conjugate priors.
• Lin, C. & He, Y. (2009). Joint sentiment/topic model for sentiment analysis. In Cheung, D. W.-L., Song, I.-Y., Chu, W. W., Hu, X., & Lin, J. J. (Eds.), CIKM (pp. 375–384). ACM.
• Lu, B., Tan, C., Cardie, C., & Tsou, B. K. Joint bilingual sentiment classification with unlabeled parallel corpora.
• McAuley, J. J. & Leskovec, J. (2013). From amateurs to connoisseurs: modeling the evolution of user expertise through online reviews. In Proceedings of the 22nd international conference on World Wide Web (pp. 897–908). International World Wide Web Conferences Steering Committee.
• McCallum, A. (2002). MALLET: A machine learning for language toolkit.
• Meng, X., Wei, F., Liu, X., Zhou, M., Xu, G., & Wang, H. (2012). Cross-lingual mixture model for sentiment classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1 (pp. 572–581). Association for Computational Linguistics.
• Mukherjee, A. & Liu, B. (2012a). Aspect extraction through semi-supervised modeling. In ACL (1) (pp. 339–348). The Association for Computer Linguistics.
• Mukherjee, A. & Liu, B. (2012b). Modeling review comments. In ACL (1) (pp. 320–329). The Association for Computer Linguistics.
• Mukherjee, A. & Liu, B. (2013). Discovering user interactions in ideological discussions. In ACL (1) (pp. 671–681). The Association for Computer Linguistics.
• Mukherjee, S. & Bhattacharyya, P. (2012). WikiSent: Weakly supervised sentiment analysis through extractive summarization with Wikipedia. In Machine Learning and Knowledge Discovery in Databases (pp. 774–793). Springer.
• Nallapati, R., Ahmed, A., Xing, E. P., & Cohen, W. W. (2008). Joint latent topic models for text and citations. In Li, Y., Liu, B., & Sarawagi, S. (Eds.), KDD (pp. 542–550). ACM.
• Pang, B. & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (p. 271). Association for Computational Linguistics.
• Pang, B. & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1–135.
• Prettenhofer, P. & Stein, B. (2010). Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 1118–1127). Association for Computational Linguistics.
• Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. In 20th Conference on Uncertainty in Artificial Intelligence, volume 21, Banff Park Lodge, Banff, Canada.
• Scott, G. G., O'Donnell, P. J., & Sereno, S. C. (2012). Emotion words affect eye fixations during reading. Journal of Experimental Psychology: Learning, Memory, and Cognition, 38(3), 783.
• Searle, J. R. (1992). The rediscovery of the mind. The MIT Press.
• Titov, I. & McDonald, R. T. (2008a). A joint model of text and aspect ratings for sentiment summarization. In McKeown, K., Moore, J. D., Teufel, S., Allan, J., & Furui, S. (Eds.), ACL (pp. 308–316). The Association for Computer Linguistics.
• Titov, I. & McDonald, R. T. (2008b). Modeling online reviews with multi-grain topic models. CoRR, abs/0801.1063.
• Wallach, H. M., Mimno, D. M., & McCallum, A. (2009). Rethinking LDA: Why priors matter. In NIPS, volume 22 (pp. 1973–1981).
• Wallach, H. M., Murray, I., Salakhutdinov, R., & Mimno, D. (2009). Evaluation methods for topic models. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1105–1112). ACM.
• Wang, X., McCallum, A., & Wei, X. (2007). Topical n-grams: Phrase and topic discovery, with an application to information retrieval. In Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), Nebraska, USA.
• Yin, Y., Zhou, C., & Zhu, J. (2010). A pipe route design methodology by imitating human imaginal thinking. CIRP Annals-Manufacturing Technology, 59(1), 167–170.