Machine learning in NLP
Learning word meaning from corpora
Richard Johansson
October 23, 2014
this lecture
▶ methods for learning word "meaning" (in some restricted sense) automatically from unlabeled corpora
▶ specifically: represent word "meaning" as a vector or a class
  ▶ rat is more similar to mouse than to oxygen
  ▶ rat and mouse both belong to class 12345
what's the point?
▶ searching for similar words: query expansion in IR, lexicography, ...
▶ corpus-based methods give a higher coverage than lexicons
▶ plugging them into NLP systems (PoS taggers, parsers, NE recognition, ...) helps them to generalize [Turian et al., 2010]
▶ recent approaches start with generic vectors and then tailor them to the task at hand, e.g. sentiment analysis [Socher et al., 2011]
example: some animals (in 2D)
[Figure: 2D plot of meaning vectors for animal words: korp (raven), kråka (crow), kaja (jackdaw), skata (magpie), duva (pigeon), falk (falcon), örn (eagle), hund (dog), katt (cat), björn (bear), råtta (rat), mus (mouse), varg (wolf), älg (moose), hjort (deer), räv (fox)]
similarity of meaning vectors
▶ given two word meaning vectors, we can apply a similarity or distance function
▶ Euclidean distance: multidimensional equivalent of our intuitive notion of distance
  ▶ it is 0 if the vectors are identical
▶ cosine similarity: multiply the coordinates, divide by the lengths
  ▶ it is 1 if the vectors are identical, 0 if completely different
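To make the two measures concrete, here is a minimal NumPy sketch (my own illustration; the toy vectors for rat and mouse are invented for the example):

    import numpy as np

    # toy "meaning" vectors, invented for the example
    rat = np.array([0.2, 0.9, 0.1])
    mouse = np.array([0.3, 0.8, 0.0])

    # Euclidean distance: 0 if the vectors are identical
    euclidean = np.linalg.norm(rat - mouse)

    # cosine similarity: dot product divided by the product of the lengths;
    # 1 if the vectors are identical, 0 if they are orthogonal
    cosine = rat @ mouse / (np.linalg.norm(rat) * np.linalg.norm(mouse))

    print(euclidean, cosine)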
how "meaning" vectors help NLP systems generalize
▶ instead of a 100,000-dimensional vector like this

      Göteborg → [0, 0, ..., 0, 1, 0, ...]
      Tashkent → [0, 0, ..., 0, 0, 1, ...]

▶ ... we can have a 100-dimensional vector like this

      Göteborg → [0.010, −0.20, ..., 0.15, 0.07, −0.23, ...]
      Tashkent → [0.015, −0.05, ..., 0.11, 0.14, −0.12, ...]
learning word meaning from corpora: intuition
▶ "you shall know a word by the company it keeps" [Firth, 1957]
▶ the meaning of a word is reflected in the set of contexts where it appears:
  ▶ the set of documents where it appears [Landauer and Dumais, 1997]
  ▶ other words with which it cooccurs [Schütze, 1992]
  ▶ and other linguistic phenomena [Padó and Lapata, 2007]
▶ simplest idea for making vectors: simply count the contexts
example: tårta ('cake') and pizza in Swedish corpora
example: the context is the document
D1: The stock market crashed in Tokyo.
D2: Cook the lentils gently in butter.
             D1   D2
    stock     1    0
    market    1    0
    crashed   1    0
    Tokyo     1    0
    cook      0    1
    lentils   0    1
    gently    0    1
    butter    0    1
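A minimal Python sketch (my own illustration, not from the lecture) of how such a word-by-document count table can be built from the two example documents:

    from collections import Counter

    docs = {
        "D1": "the stock market crashed in tokyo",
        "D2": "cook the lentils gently in butter",
    }

    # counts[word][doc_id] = number of times the word occurs in that document
    counts = {}
    for doc_id, text in docs.items():
        for word in text.split():
            counts.setdefault(word, Counter())[doc_id] += 1

    print(counts["stock"])    # Counter({'D1': 1})
    print(counts["butter"])   # Counter({'D2': 1})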
example: the contexts are the nearby words
D1: The stock market crashed in Tokyo.
D2: Cook the lentils gently in butter.
context: the word before and after
              the  stock  market  crashed  lentils  in
    the        0     1      0       0        1      0
    stock      1     0      1       0        0      0
    market     0     1      0       1        0      0
    crashed    0     0      1       0        0      1
    lentils    1     0      0       0        0      1
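And a corresponding sketch (again my own) of counting nearby-word contexts with a window of one word before and one word after:

    from collections import Counter, defaultdict

    docs = [
        "the stock market crashed in tokyo",
        "cook the lentils gently in butter",
    ]

    cooc = defaultdict(Counter)
    for doc in docs:
        words = doc.split()
        for i, w in enumerate(words):
            for j in (i - 1, i + 1):        # the word before and the word after
                if 0 <= j < len(words):
                    cooc[w][words[j]] += 1

    print(cooc["stock"])   # Counter({'the': 1, 'market': 1})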
example: the effect of context
Two lists of the words most similar to Reinfeldt, obtained with two different context definitions:

    Reinfeldt     1            Reinfeldt        1
    Skavlan       0.897105     Bildt            0.973508
    Ludl          0.878873     Sahlin           0.960694
    Lindson       0.874008     rödgröna         0.960072
    Gertten       0.871961     Reinfeldts       0.958742
    Stillman      0.871375     Juholt           0.958644
    Adolph        0.86191      uttalade         0.956048
    Ritterberg    0.852531     rådet            0.954971
    Böök          0.848459     statsministern   0.952898
    Kessiakoff    0.834909     politiker        0.952712
    Strage        0.82995      Odell            0.952376
    Rinaldo       0.825585     Schyman          0.952065
word-by-context matrices are huge
▶ even if we're using sparse vectors, memory becomes an issue
▶ dimensionality reduction: a mathematical operation that transforms a high-dimensional matrix into a lower-dimensional one with similar properties
▶ popular choices:
  ▶ singular value decomposition (SVD), used in LSA/LSI [Landauer and Dumais, 1997]
  ▶ random projection [Kanerva et al., 2000]: see for instance the very readable PhD thesis by Sahlgren [2006]
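A minimal sketch (my own illustration, using scikit-learn rather than the implementations cited above) of reducing a sparse word-by-context count matrix with truncated SVD:

    import scipy.sparse as sp
    from sklearn.decomposition import TruncatedSVD

    # toy sparse count matrix: 1,000 words x 100,000 contexts
    counts = sp.random(1_000, 100_000, density=0.0001, format="csr")

    # project down to 100 dimensions, as in the example vectors above
    svd = TruncatedSVD(n_components=100)
    word_vectors = svd.fit_transform(counts)

    print(word_vectors.shape)   # (1000, 100): one dense 100-dim vector per word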
another idea: learning to predict a word in a context
"after a few years abroad, he moved back to ___"

"the furniture was imported from ___"

"he visited the libraries in London, ___, Florence and Venice"

"during the German siege of ___ in 1870, he was found dead"

▶ could we train a "classifier" to predict the missing word?
learning to predict a word in a context
▶ the skip-gram with negative sampling (SGNS) model [Mikolov et al., 2013a]:
  ▶ we have one set of vectors for the target words, and another for the contexts
    ▶ e.g. one target word vector for "Paris", and a context vector for "appearing after the bigram live in"
  ▶ for each word-context pair, generate some random negative examples of contexts
    ▶ e.g. "Paris" + "appearing after the bigram eat a"
  ▶ then let's use this probability model:

        P(real example | w, c) = 1 / (1 + e^(−w·c))

  ▶ finally, train the model (i.e. tune the word and context vectors) so that the probability of real examples is high and that of random negative examples is low
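A minimal sketch (my own, not Mikolov's implementation) of the probability model and a single stochastic gradient step on one word-context pair; the vector dimensionality and learning rate are arbitrary choices for the example:

    import numpy as np

    def p_real(w, c):
        # P(real example | w, c) = 1 / (1 + e^(-w·c))
        return 1.0 / (1.0 + np.exp(-(w @ c)))

    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=100)   # target word vector, e.g. for "Paris"
    c = rng.normal(scale=0.1, size=100)   # context vector, e.g. "after 'live in'"

    label = 1.0    # 1 for a real pair, 0 for a random negative example
    lr = 0.025     # learning rate

    # gradient of the log-likelihood: push P(real example) towards the label
    error = label - p_real(w, c)
    w, c = w + lr * error * c, c + lr * error * w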
some software you can use
▶ word2vec (http://code.google.com/p/word2vec) is Mikolov's own implementation of SGNS and other models
▶ gensim (http://radimrehurek.com/gensim) is a Python library implementing LSA/LSI and SGNS (it can also directly read word2vec output)
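A minimal usage sketch of gensim's SGNS implementation; the parameter names follow recent gensim versions (older versions use size= instead of vector_size=), and the toy corpus is far too small to give useful vectors:

    from gensim.models import Word2Vec

    sentences = [
        ["the", "stock", "market", "crashed", "in", "tokyo"],
        ["cook", "the", "lentils", "gently", "in", "butter"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=100,   # dimensionality of the word vectors
        window=2,          # context window size
        sg=1,              # skip-gram (sg=0 gives the CBOW model instead)
        negative=5,        # number of random negative samples per pair
        min_count=1,       # keep even words that occur only once
    )

    print(model.wv["stock"])               # the learned vector
    print(model.wv.most_similar("stock"))  # nearest neighbours by cosine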
there is structure in vector spaces
▶ it was recently discovered that vector spaces can be used to answer analogy questions [Mikolov et al., 2013b]
  ▶ "X is to Germany as Paris is to France"
▶ this can be done using simple linear algebra operations (see the sketch after this list):

      V(Paris) − V(France) + V(Germany)

▶ this was originally done with prediction-based vector spaces
▶ ... but context-counting spaces can also be used if we use a slightly different search technique [Levy and Goldberg, 2014]
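A minimal sketch of the analogy computation above, assuming a gensim model trained on a large corpus as in the earlier software example; most_similar adds the positive vectors, subtracts the negative ones, and returns the nearest neighbours by cosine similarity:

    # assumes `model` was trained on a large corpus containing these words
    answer = model.wv.most_similar(positive=["Paris", "Germany"],
                                   negative=["France"], topn=1)
    print(answer)   # with good vectors, something like [('Berlin', 0.8...)]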
example: countries and cities
countries and cities
[Figure: 2D plot of word vectors for capitals and their countries: Berlin, Tyskland, Stockholm, Sverige, Paris, Frankrike, Moskva, Ryssland, Köpenhamn, Danmark, Oslo, Norge, Tallinn, Estland, Rom, Italien]
demo (Swedish)
▶ http://demo.spraakdata.gu.se/cgi-bin/richard/sim?query=pizza&model=sg_64w
▶ http://demo.spraakdata.gu.se/cgi-bin/richard/analogy?q1=Paris&q2=Frankrike&q3=Berlin&model=sg_512w
computing word clusters
▶ word vectors can be used to compute clusters of words
  ▶ k-means or other algorithms (see the sketch after this list)

[Figure: example clusters illustrated with the words Gothenburg, Paris, hamburger, pizza]

▶ another popular word clustering algorithm: Brown clustering [Brown et al., 1992]
  ▶ start by putting each word into its own cluster
  ▶ merge clusters to increase the HMM probability of our corpus
▶ http://cs.stanford.edu/~pliang/software/ (under "Word clustering")
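As mentioned in the list above, here is a minimal sketch (my own illustration) of k-means clustering of word vectors with scikit-learn, again assuming a trained gensim model whose vocabulary contains the example words:

    import numpy as np
    from sklearn.cluster import KMeans

    # assumes `model` was trained on a large corpus containing these words
    words = ["Gothenburg", "Paris", "hamburger", "pizza"]
    vectors = np.array([model.wv[w] for w in words])

    kmeans = KMeans(n_clusters=2, n_init=10).fit(vectors)
    for word, cluster_id in zip(words, kmeans.labels_):
        print(word, cluster_id)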
examples of Brown clusters
1100111011 Harrisburg
1100111011 Meaux
1100111011 Chengde
1100111011 Gothenburg
1100111011 Abadan
1100111011 Ashgabat
1100111011 Versailles
1100111010 Carinthia
1100111010 Kemerovo
1100111010 Andalusia
1100111010 Krasnodar
110011100 Sydney
110011100 Paris
110011100 Frankfurt
10101101 barbecue
10101101 pizza
10101101 soda
10101101 hamburger
10101101 pesticide
10101101 math
10101101 methamphetamine
10101101 soup
10101101 platinum
10101100 amusement
10101100 petrochemical
10101100 apparel
10101100 aluminium
10101100 catfish
open research questions (a selection)
▶ which relations are encoded in vector spaces, and how?
  ▶ for instance, what about hyponymy?
▶ can knowledge resources (e.g. lexicons such as WordNet) be used to improve vector spaces, and vice versa?
▶ how should vector spaces (or clusters) be designed in order to be useful in applications?
▶ how to use multilingual corpora?
example: the word rock
[Figure: 2D plot of the words near rock (which in Swedish means both 'coat' and 'rock music'): clothing words kappa, oljerock, jacka, långrock, morgonrock and music words musik, hårdrock, punkrock, jazz, funk]
example: the senses of rock
[Figure: the same plot with rock split into two sense vectors, rock-1 and rock-2, one placed near the clothing words and the other near the music words]
references I
▶ Brown, P., deSouza, P., Mercer, R., Della Pietra, V., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479.
▶ Firth, J. (1957). Papers in Linguistics 1934–1951. OUP.
▶ Kanerva, P., Kristoferson, J., and Holst, A. (2000). Random indexing of text samples for latent semantic analysis. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society.
▶ Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104:211–240.
▶ Levy, O. and Goldberg, Y. (2014). Linguistic regularities in sparse and explicit word representations. In Proceedings of CoNLL.
▶ Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In NIPS.
references II
▶ Mikolov, T., Yih, W.-t., and Zweig, G. (2013b). Linguistic regularities in continuous space word representations. In Proc. NAACL.
▶ Padó, S. and Lapata, M. (2007). Dependency-based construction of semantic space models. Computational Linguistics, 33(1).
▶ Sahlgren, M. (2006). The Word-Space Model. PhD thesis, Stockholm University.
▶ Schütze, H. (1992). Dimensions of meaning. In Proc. ACM/IEEE conf. on Supercomputing.
▶ Socher, R., Pennington, J., Huang, E., Ng, A. Y., and Manning, C. D. (2011). Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proc. EMNLP.
▶ Turian, J., Ratinov, L.-A., and Bengio, Y. (2010). Word representations: A simple and general method for semi-supervised learning. In Proc. ACL.