Machine learning in NLP
Learning word meaning from corpora
Richard Johansson
October 23, 2014
this lecture
▶ methods for learning word "meaning" (in some restricted sense) automatically from unlabeled corpora
▶ specifically: represent word "meaning" as a vector or a class
  ▶ rat is more similar to mouse than to oxygen
  ▶ rat and mouse both belong to class 12345
what's the point?
▶ searching for similar words: query expansion in IR, lexicography, ...
▶ corpus-based methods give a higher coverage than lexicons
▶ plugging them into NLP systems (PoS taggers, parsers, NE recognition, ...) helps them to generalize [Turian et al., 2010]
▶ recent approaches start with generic vectors and then tailor them to the task at hand, e.g. sentiment analysis [Socher et al., 2011]
example: some animals (in 2D)
[Figure: 2D plot of meaning vectors for animal words: korp (raven), kråka (crow), kaja (jackdaw), skata (magpie), duva (pigeon), falk (falcon), örn (eagle), hund (dog), katt (cat), björn (bear), råtta (rat), mus (mouse), varg (wolf), älg (moose), hjort (deer), räv (fox)]
similarity of meaning vectors
▶ given two word meaning vectors, we can apply a similarity or distance function
▶ Euclidean distance: multidimensional equivalent of our intuitive notion of distance
  ▶ it is 0 if the vectors are identical
▶ cosine similarity: multiply the coordinates, divide by the lengths
  ▶ it is 1 if the vectors are identical, 0 if completely different
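To make the two measures concrete, here is a minimal NumPy sketch (my own illustration; the toy vectors for rat and mouse are invented for the example):

    import numpy as np

    # toy "meaning" vectors, invented for the example
    rat = np.array([0.2, 0.9, 0.1])
    mouse = np.array([0.3, 0.8, 0.0])

    # Euclidean distance: 0 if the vectors are identical
    euclidean = np.linalg.norm(rat - mouse)

    # cosine similarity: dot product divided by the product of the lengths;
    # 1 if the vectors are identical, 0 if they are orthogonal
    cosine = rat @ mouse / (np.linalg.norm(rat) * np.linalg.norm(mouse))

    print(euclidean, cosine)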
how "meaning" vectors help NLP systems generalize
▶ instead of a 100,000-dimensional vector like this

      Göteborg → [0, 0, ..., 0, 1, 0, ...]
      Tashkent → [0, 0, ..., 0, 0, 1, ...]

▶ ... we can have a 100-dimensional vector like this

      Göteborg → [0.010, −0.20, ..., 0.15, 0.07, −0.23, ...]
      Tashkent → [0.015, −0.05, ..., 0.11, 0.14, −0.12, ...]
learning word meaning from corpora: intuition
▶ "you shall know a word by the company it keeps" [Firth, 1957]
▶ the meaning of a word is reflected in the set of contexts where it appears:
  ▶ the set of documents where it appears [Landauer and Dumais, 1997]
  ▶ other words with which it cooccurs [Schütze, 1992]
  ▶ and other linguistic phenomena [Padó and Lapata, 2007]
▶ simplest idea for making vectors: simply count the contexts
example: tårta ('cake') and pizza in Swedish corpora
example: the context is the document
D1: The stock market crashed in Tokyo.
D2: Cook the lentils gently in butter.
             D1   D2
    stock     1    0
    market    1    0
    crashed   1    0
    Tokyo     1    0
    cook      0    1
    lentils   0    1
    gently    0    1
    butter    0    1
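A minimal Python sketch (my own illustration, not from the lecture) of how such a word-by-document count table can be built from the two example documents:

    from collections import Counter

    docs = {
        "D1": "the stock market crashed in tokyo",
        "D2": "cook the lentils gently in butter",
    }

    # counts[word][doc_id] = number of times the word occurs in that document
    counts = {}
    for doc_id, text in docs.items():
        for word in text.split():
            counts.setdefault(word, Counter())[doc_id] += 1

    print(counts["stock"])    # Counter({'D1': 1})
    print(counts["butter"])   # Counter({'D2': 1})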
example: the contexts are the nearby words
D1: The stock market crashed in Tokyo.
D2: Cook the lentils gently in butter.
context: the word before and after
              the  stock  market  crashed  lentils  in
    the        0     1      0       0        1      0
    stock      1     0      1       0        0      0
    market     0     1      0       1        0      0
    crashed    0     0      1       0        0      1
    lentils    1     0      0       0        0      1
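And a corresponding sketch (again my own) of counting nearby-word contexts with a window of one word before and one word after:

    from collections import Counter, defaultdict

    docs = [
        "the stock market crashed in tokyo",
        "cook the lentils gently in butter",
    ]

    cooc = defaultdict(Counter)
    for doc in docs:
        words = doc.split()
        for i, w in enumerate(words):
            for j in (i - 1, i + 1):        # the word before and the word after
                if 0 <= j < len(words):
                    cooc[w][words[j]] += 1

    print(cooc["stock"])   # Counter({'the': 1, 'market': 1})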
example: the effect of context
Two lists of the words most similar to Reinfeldt, obtained with two different context definitions:

    Reinfeldt     1            Reinfeldt        1
    Skavlan       0.897105     Bildt            0.973508
    Ludl          0.878873     Sahlin           0.960694
    Lindson       0.874008     rödgröna         0.960072
    Gertten       0.871961     Reinfeldts       0.958742
    Stillman      0.871375     Juholt           0.958644
    Adolph        0.86191      uttalade         0.956048
    Ritterberg    0.852531     rådet            0.954971
    Böök          0.848459     statsministern   0.952898
    Kessiakoff    0.834909     politiker        0.952712
    Strage        0.82995      Odell            0.952376
    Rinaldo       0.825585     Schyman          0.952065
word-by-context matrices are huge
▶ even if we're using sparse vectors, memory becomes an issue
▶ dimensionality reduction: a mathematical operation that transforms a high-dimensional matrix into a lower-dimensional one with similar properties
▶ popular choices:
  ▶ singular value decomposition (SVD), used in LSA/LSI [Landauer and Dumais, 1997]
  ▶ random projection [Kanerva et al., 2000]: see for instance the very readable PhD thesis by Sahlgren [2006]
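A minimal sketch (my own illustration, using scikit-learn rather than the implementations cited above) of reducing a sparse word-by-context count matrix with truncated SVD:

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD

    # toy sparse count matrix: 1,000 words x 100,000 contexts
    counts = sp.random(1_000, 100_000, density=0.0001, format="csr")

    # project down to 100 dimensions, as in the example vectors above
    svd = TruncatedSVD(n_components=100)
    word_vectors = svd.fit_transform(counts)

    print(word_vectors.shape)   # (1000, 100): one dense 100-dim vector per word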
another idea: learning to predict a word in a context
"after a few years abroad, he moved back to ___"

"the furniture was imported from ___"

"he visited the libraries in London, ___, Florence and Venice"

"during the German siege of ___ in 1870, he was found dead"

▶ could we train a "classifier" to predict the missing word?
learning to predict a word in a context
▶ the skip-gram with negative sampling (SGNS) model [Mikolov et al., 2013a]:
  ▶ we have one set of vectors for the target words, and another for the contexts
    ▶ e.g. one target word vector for "Paris", and a context vector for "appearing after the bigram live in"
  ▶ for each word-context pair, generate some random negative examples of contexts
    ▶ e.g. "Paris" + "appearing after the bigram eat a"
  ▶ then let's use this probability model:

        P(real example | w, c) = 1 / (1 + e^(−w·c))

  ▶ finally, train the model (i.e. tune the word and context vectors) so that the probability of real examples is high and that of random negative examples is low
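A minimal sketch (my own, not Mikolov's implementation) of the probability model and a single stochastic gradient step on one word-context pair; the vector dimensionality and learning rate are arbitrary choices for the example:

    import numpy as np

    def p_real(w, c):
        # P(real example | w, c) = 1 / (1 + e^(-w·c))
        return 1.0 / (1.0 + np.exp(-(w @ c)))

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=100)   # target word vector, e.g. for "Paris"
    c = rng.normal(scale=0.1, size=100)   # context vector, e.g. "after 'live in'"

    label = 1.0    # 1 for a real pair, 0 for a random negative example
    lr = 0.025     # learning rate

    # gradient of the log-likelihood: push P(real example) towards the label
    error = label - p_real(w, c)
    w, c = w + lr * error * c, c + lr * error * w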
some software you can use
▶ word2vec (http://code.google.com/p/word2vec) is Mikolov's own implementation of SGNS and other models
▶ gensim (http://radimrehurek.com/gensim) is a Python library implementing LSA/LSI and SGNS (it can also directly read word2vec output)
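A minimal usage sketch of gensim's SGNS implementation; the parameter names follow recent gensim versions (older versions use size= instead of vector_size=), and the toy corpus is far too small to give useful vectors:

    from gensim.models import Word2Vec

    sentences = [
        ["the", "stock", "market", "crashed", "in", "tokyo"],
        ["cook", "the", "lentils", "gently", "in", "butter"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=100,   # dimensionality of the word vectors
        window=2,          # context window size
        sg=1,              # skip-gram (sg=0 gives the CBOW model instead)
        negative=5,        # number of random negative samples per pair
        min_count=1,       # keep even words that occur only once
    )

    print(model.wv["stock"])               # the learned vector
    print(model.wv.most_similar("stock"))  # nearest neighbours by cosine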
there is structure in vector spaces
▶ it was recently discovered that vector spaces can be used to answer analogy questions [Mikolov et al., 2013b]
  ▶ "X is to Germany as Paris is to France"
▶ this can be done using simple linear algebra operations (see the sketch after this list):

      V(Paris) − V(France) + V(Germany)

▶ this was originally done with prediction-based vector spaces
▶ ... but context-counting spaces can also be used if we use a slightly different search technique [Levy and Goldberg, 2014]
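A minimal sketch of the analogy computation above, assuming a gensim model trained on a large corpus as in the earlier software example; most_similar adds the positive vectors, subtracts the negative ones, and returns the nearest neighbours by cosine similarity:

    # assumes `model` was trained on a large corpus containing these words
    answer = model.wv.most_similar(positive=["Paris", "Germany"],
                                   negative=["France"], topn=1)
    print(answer)   # with good vectors, something like [('Berlin', 0.8...)]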
example: countries and cities
countries and cities
[Figure: 2D plot of word vectors for capitals and their countries: Berlin, Tyskland, Stockholm, Sverige, Paris, Frankrike, Moskva, Ryssland, Köpenhamn, Danmark, Oslo, Norge, Tallinn, Estland, Rom, Italien]
demo (Swedish)
▶ http://demo.spraakdata.gu.se/cgi-bin/richard/sim?query=pizza&model=sg_64w
▶ http://demo.spraakdata.gu.se/cgi-bin/richard/analogy?q1=Paris&q2=Frankrike&q3=Berlin&model=sg_512w
computing word clusters
▶ word vectors can be used to compute clusters of words
  ▶ k-means or other algorithms (see the sketch after this list)

[Figure: example clusters illustrated with the words Gothenburg, Paris, hamburger, pizza]

▶ another popular word clustering algorithm: Brown clustering [Brown et al., 1992]
  ▶ start by putting each word into its own cluster
  ▶ merge clusters to increase the HMM probability of our corpus
▶ http://cs.stanford.edu/~pliang/software/ (under "Word clustering")
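As mentioned in the list above, here is a minimal sketch (my own illustration) of k-means clustering of word vectors with scikit-learn, again assuming a trained gensim model whose vocabulary contains the example words:

    import numpy as np
    from sklearn.cluster import KMeans

    # assumes `model` was trained on a large corpus containing these words
    words = ["Gothenburg", "Paris", "hamburger", "pizza"]
    vectors = np.array([model.wv[w] for w in words])

    kmeans = KMeans(n_clusters=2, n_init=10).fit(vectors)
    for word, cluster_id in zip(words, kmeans.labels_):
        print(word, cluster_id)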
examples of Brown clusters
1100111011 Harrisburg
1100111011 Meaux
1100111011 Chengde
1100111011 Gothenburg
1100111011 Abadan
1100111011 Ashgabat
1100111011 Versailles
1100111010 Carinthia
1100111010 Kemerovo
1100111010 Andalusia
1100111010 Krasnodar
110011100 Sydney
110011100 Paris
110011100 Frankfurt
10101101 barbecue
10101101 pizza
10101101 soda
10101101 hamburger
10101101 pesticide
10101101 math
10101101 methamphetamine
10101101 soup
10101101 platinum
10101100 amusement
10101100 petrochemical
10101100 apparel
10101100 aluminium
10101100 catfish
open research questions (a selection)
▶ which relations are encoded in vector spaces, and how?
  ▶ for instance, what about hyponymy?
▶ can knowledge resources (e.g. lexicons such as WordNet) be used to improve vector spaces, and vice versa?
▶ how should vector spaces (or clusters) be designed in order to be useful in applications?
▶ how to use multilingual corpora?
example: the word rock
[Figure: 2D plot of the words near rock (which in Swedish means both 'coat' and 'rock music'): clothing words kappa, oljerock, jacka, långrock, morgonrock and music words musik, hårdrock, punkrock, jazz, funk]
example: the senses of rock
[Figure: the same plot with rock split into two sense vectors, rock-1 and rock-2, one placed near the clothing words and the other near the music words]
references I
▶ Brown, P., deSouza, P., Mercer, R., Della Pietra, V., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.
▶ Firth, J. (1957). Papers in Linguistics 1934–1951. OUP.
▶ Kanerva, P., Kristoferson, J., and Holst, A. (2000). Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society.
▶ Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104:211–240.
▶ Levy, O. and Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL.
▶ Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In NIPS.
references II
▶ Mikolov, T., Yih, W.-t., and Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In Proc. NAACL.
▶ Padó, S. and Lapata, M. (2007). Dependency-based construction of semantic space models. Computational Linguistics, 33(1).
▶ Sahlgren, M. (2006). The Word-Space Model. PhD thesis, Stockholm University.
▶ Schütze, H. (1992). Dimensions of meaning. In Proc. ACM/IEEE conf. on Supercomputing.
▶ Socher, R., Pennington, J., Huang, E., Ng, A. Y., and Manning, C. D. (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proc. EMNLP.
▶ Turian, J., Ratinov, L.-A., and Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proc. ACL.