Improving Distributional Similarity with Lessons Learned from Word Embeddings

Authors: Omer Levy, Yoav Goldberg, Ido Dagan

Presentation: Collin Gress

Motivation

- We want to do NLP tasks. How do we represent words?

- We generally want vectors (think neural networks).

- What are some ways to get vector representations of words?

- Distributional hypothesis: "words that are used and occur in the same contexts tend to purport similar meanings" (Wikipedia).

Vector representations of words and their surrounding contexts

- Word2vec [1]

- GloVe [2]

- PMI (pointwise mutual information)

- SVD of PMI (singular value decomposition of the PMI matrix)

Very briefly: Word2vec

- This paper focuses on skip-gram with negative sampling (SGNS), which predicts context words from a target word.

- Optimization problem solvable by gradient descent. We want to maximize $\vec{w} \cdot \vec{c}$ for word-context pairs that occur in the dataset, and minimize it for "hallucinated" (negative) word-context pairs [0].

- For every real word-context pair in the dataset, hallucinate $k$ word-context pairs. That is, given a target word, draw $k$ contexts from the unigram distribution $P_D(c) = \frac{\mathrm{count}(c)}{\sum_{c'} \mathrm{count}(c')}$ (a sketch of this sampling step follows this slide).

- We end up with a vector $\vec{w} \in \mathbb{R}^d$ for every word in the dataset and, similarly, a vector $\vec{c} \in \mathbb{R}^d$ for every context in the dataset.

See the Mikolov et al. paper [1] for details.
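As a rough illustration of the sampling step above, here is a minimal sketch (not the paper's code) of drawing $k$ negative contexts from the unigram distribution; function and variable names are illustrative, and `alpha = 1.0` matches the distribution on this slide.

```python
import numpy as np

# A minimal sketch of SGNS-style negative sampling, assuming context counts
# have already been collected. alpha = 1.0 matches the distribution on this
# slide; word2vec itself smooths with alpha = 0.75 (context distribution
# smoothing, discussed later in the deck).
def draw_negatives(counts, k, alpha=1.0, rng=None):
    """Draw k negative context ids from P(c) = count(c)^alpha / sum_c' count(c')^alpha."""
    rng = rng or np.random.default_rng()
    probs = counts.astype(float) ** alpha
    probs /= probs.sum()
    return rng.choice(len(counts), size=k, p=probs)

# Toy usage: five context types with made-up counts, draw k = 5 negatives.
counts = np.array([100, 50, 10, 5, 1])
print(draw_negatives(counts, k=5))
```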

Very briefly: GloVe

- Learn $d$-dimensional vectors $\vec{w}$ and $\vec{c}$, as well as word- and context-specific scalars $b_w$ and $b_c$, such that $\vec{w} \cdot \vec{c} + b_w + b_c = \log(\mathrm{count}(w, c))$ for all word-context pairs in the dataset [0].

- The objective is "solved" by factorizing the log-count matrix $M^{\log(\mathrm{count}(w,c))}$ as $W \cdot C^{\top} + \vec{b^{w}} + \vec{b^{c}}$ (a toy sketch of this objective follows).
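To make the objective concrete, here is a toy sketch of evaluating the simplified GloVe-style reconstruction error as stated on this slide. Note that the actual GloVe objective also multiplies each squared error by a weighting function of the count, which is omitted here; all names and data are illustrative.

```python
import numpy as np

# Toy evaluation of the simplified objective on this slide:
# W.C^T + b_w + b_c should approximate log(count(w, c)) for observed pairs.
# (Real GloVe also weights each term by a function f(count), omitted here.)
rng = np.random.default_rng(0)
n_words, n_contexts, d = 4, 4, 2
counts = rng.integers(1, 20, size=(n_words, n_contexts)).astype(float)

W = rng.normal(size=(n_words, d))
C = rng.normal(size=(n_contexts, d))
b_w = np.zeros(n_words)
b_c = np.zeros(n_contexts)

pred = W @ C.T + b_w[:, None] + b_c[None, :]
squared_error = ((pred - np.log(counts)) ** 2).sum()
print(squared_error)   # this is what gradient descent on the objective would drive down
```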

Very briefly: Pointwise mutual information (PMI)

- $\mathrm{PMI}(w, c) = \log\left(\dfrac{P(w, c)}{P(w)\,P(c)}\right)$

PMI: example

Source: https://en.wikipedia.org/wiki/Pointwise_mutual_information

PMI matrices for word-context pairs in practice

- Very sparse (see the sketch below).
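A minimal sketch of building a positive PMI (PPMI) matrix from a toy co-occurrence table, which also shows where the sparsity comes from: cells with zero counts (or negative PMI) end up as zeros. The counts and names here are made up for illustration.

```python
import numpy as np

def ppmi(counts):
    """Positive PMI from a word-by-context co-occurrence count matrix."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    p_c = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))        # log P(w,c) / (P(w) P(c))
    return np.maximum(np.nan_to_num(pmi, neginf=0.0), 0.0)

# Toy co-occurrence counts (rows: words, columns: contexts).
counts = np.array([[8.0, 0.0, 0.0, 1.0],
                   [0.0, 5.0, 0.0, 0.0],
                   [1.0, 0.0, 2.0, 0.0]])
M = ppmi(counts)
print(M)
print("fraction of zero cells:", (M == 0).mean())
```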

Interesting relationships between PMI and SGNS; PMI and GloVe

- SGNS is implicitly factorizing the PMI matrix shifted by a constant [0]. Specifically, SGNS finds optimal vectors $\vec{w}$ and $\vec{c}$ such that $\vec{w} \cdot \vec{c} = \mathrm{PMI}(w, c) - \log(k)$; in matrix form, $W \cdot C^{\top} = M^{\mathrm{PMI}} - \log(k)$.

- Recall that in GloVe we learn $d$-dimensional vectors $\vec{w}$ and $\vec{c}$, as well as word- and context-specific scalars $b_w$ and $b_c$, such that $\vec{w} \cdot \vec{c} + b_w + b_c = \log(\mathrm{count}(w, c))$ for all word-context pairs in the dataset.

- If we fix $b_w = \log(\mathrm{count}(w))$ and $b_c = \log(\mathrm{count}(c))$, we get a problem nearly equivalent to factorizing the PMI matrix shifted by $\log(|D|)$, i.e. $W \cdot C^{\top} \approx M^{\mathrm{PMI}} - \log(|D|)$.

- Or, in simple terms, SGNS (word2vec) and GloVe aren't too different from PMI.

Very briefly: SVD of PMI matrix

- Singular value decomposition of the PMI matrix gives us dense vectors.

- Factorize the PMI matrix $M$ into a product of three matrices, i.e. $U \cdot \Sigma \cdot V^{\top}$.

- Why does that help? Keeping only the top $d$ singular values/vectors yields dense $d$-dimensional representations $W = U_d \cdot \Sigma_d$ and $C = V_d$ (sketched below).
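A minimal sketch of the factorization step, assuming a dense PPMI-style matrix `M` built elsewhere (here just random non-negative data); taking the top $d$ singular values/vectors gives the dense representations named above.

```python
import numpy as np

# SVD over a stand-in for a PPMI matrix; keep the top-d components.
rng = np.random.default_rng(0)
M = np.maximum(rng.normal(size=(6, 8)), 0.0)   # placeholder for a real PPMI matrix
d = 3

U, S, Vt = np.linalg.svd(M, full_matrices=False)
W = U[:, :d] * S[:d]        # word vectors:    W = U_d . Sigma_d
C = Vt[:d, :].T             # context vectors: C = V_d

# W @ C.T is the best rank-d approximation of M in the least-squares sense.
print("reconstruction error:", np.linalg.norm(M - W @ C.T))
```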

Thesis

- The performance gains of word embeddings are mainly attributable to hyperparameter choices made by the algorithm designers, rather than to the algorithms themselves.

- The PMI and SVD baselines used for comparison in embedding papers were the most "vanilla" versions, hence the apparent superiority of the embedding algorithms.

- The hyperparameters of GloVe and word2vec can be applied to PMI and SVD, drastically improving their performance.

Pre-processing Hyperparameters

- Dynamic context window: context word counts are weighted by their distance to the target word. Word2vec does this by setting each context word's weight to $\frac{L - \mathrm{distance} + 1}{L}$, where $L$ is the window size; GloVe uses $\frac{1}{\mathrm{distance}}$.

- Subsampling: remove very frequent words from the corpus. Word2vec does this by removing words that are more frequent than some threshold $t$ with probability $1 - \sqrt{t/f}$, where $f$ is the word's corpus-wide frequency (sketched after this list).

- Subsampling can be "dirty" or "clean". In dirty subsampling, words are removed before word-context pairs are formed; in clean subsampling, it is done after.

- Deleting rare words: exactly what you would expect. Negligible effect on performance.
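A minimal sketch of the "dirty" subsampling variant, assuming tokenized text and precomputed relative frequencies; `t = 1e-5` mirrors a common word2vec setting, and all names and data are illustrative.

```python
import math
import random

def subsample(tokens, freqs, t=1e-5, rng=random):
    """Drop each token with probability 1 - sqrt(t / f) when its frequency f exceeds t."""
    kept = []
    for tok in tokens:
        f = freqs[tok]
        p_drop = 1.0 - math.sqrt(t / f) if f > t else 0.0
        if rng.random() >= p_drop:
            kept.append(tok)
    return kept

# Toy usage: "the" is very frequent, so it is usually dropped.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
freqs = {"the": 0.05, "cat": 1e-4, "sat": 1e-4, "on": 0.01, "mat": 1e-4}
print(subsample(tokens, freqs))
```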

Association Metric Hyperparameters

- Shifted PMI: as previously discussed, SGNS implicitly factorizes the PMI matrix shifted by $\log(k)$. When working with PMI matrices directly, we can simply apply this transformation by picking some constant $k$, so that each cell of the PMI matrix becomes $\mathrm{PMI}(w, c) - \log(k)$.

- Context distribution smoothing: used in word2vec to smooth the context distribution for negative sampling, $P_\alpha(c) = \frac{\mathrm{count}(c)^{\alpha}}{\sum_{c'} \mathrm{count}(c')^{\alpha}}$, where $\alpha$ is some constant. It can be used with PMI in the same sort of way: $\mathrm{PMI}_\alpha(w, c) = \log\frac{P(w, c)}{P(w)\,P_\alpha(c)}$ (see the sketch after this list).

- Context distribution smoothing helps to correct PMI's bias towards word-context pairs where the context is rare.
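Extending the earlier PPMI sketch, here is a minimal illustration of combining both hyperparameters above: context distribution smoothing with $\alpha = 0.75$ (word2vec's value) and a $\log(k)$ shift before clipping to the positive part. Names and data are illustrative.

```python
import numpy as np

def shifted_cds_ppmi(counts, alpha=0.75, k=5):
    """Shifted positive PMI with a smoothed context distribution P_alpha(c)."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    c_alpha = counts.sum(axis=0) ** alpha
    p_c_alpha = c_alpha / c_alpha.sum()                    # P_alpha(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c_alpha[None, :]))    # PMI_alpha(w, c)
    return np.maximum(pmi - np.log(k), 0.0)                # shift by log(k), keep the positive part

counts = np.array([[10.0, 0.0, 3.0],
                   [ 2.0, 8.0, 0.0]])
print(shifted_cds_ppmi(counts))
```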

Post-processing Hyperparameters

- Adding context vectors: the GloVe paper [2] suggests that it is useful to return $\vec{w} + \vec{c}$ as the word representation rather than just $\vec{w}$.

- The new $\vec{w}$ holds more information. Previously, two vectors would have high cosine similarity if the two words are replaceable with one another; in the new form, weight is also awarded if one word tends to appear in the context of the other [0].

- Eigenvalue weighting: in SVD, weight the singular value matrix by raising it to some power $p$, then define $W = U_d \cdot \Sigma_d^{p}$ (sketched below).

- The authors observe that SGNS yields "symmetric" word and context matrices, i.e. neither is orthonormal and neither is given a bias in the training objective [0]. Symmetry is obtainable in SVD by letting $p = \frac{1}{2}$.

- Vector normalization: the general assumption is to normalize word vectors to unit length (L2 normalization); other normalization schemes may be worthwhile to experiment with.
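Continuing the SVD sketch, a minimal illustration of the post-processing ideas above: eigenvalue weighting with exponent $p$ (shown with the symmetric $p = 0.5$ variant applied to both factors), the $\vec{w} + \vec{c}$ combination (which assumes the word and context vocabularies line up), and L2 normalization. All names and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M = np.maximum(rng.normal(size=(6, 6)), 0.0)    # placeholder for a square PPMI matrix
d, p = 3, 0.5                                   # p = 0.5 is the "symmetric" weighting

U, S, Vt = np.linalg.svd(M, full_matrices=False)
W = U[:, :d] * (S[:d] ** p)                     # eigenvalue weighting: W = U_d . Sigma_d^p
C = Vt[:d, :].T * (S[:d] ** p)                  # with p = 0.5, C gets the same weighting

vectors = W + C                                 # "adding context vectors"
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)   # L2-normalize each row
print(vectors.shape)
```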

Experiment Setup: Hyperparameter Space

Experiment Setup: Training

- Train on an English Wikipedia dump: 77.5 million sentences, 1.5 billion tokens.

- Use $d = 500$ for SVD, SGNS, and GloVe.

- GloVe trained for 50 iterations.

Experiment Setup: Testing

- Similarity task: models are evaluated on word similarity using six datasets, each human-labeled with word-pair similarity scores.

- Model similarity scores are calculated with cosine similarity.

- Analogy task: two analogy datasets are used, with analogies of the form "$a$ is to $a^*$ as $b$ is to $b^*$", where $b^*$ is not given. Example: "Paris is to France as Tokyo is to _".

- Analogies are solved using 3CosAdd and 3CosMul (sketched after this list).

- 3CosAdd: $\arg\max_{b^* \in V_W \setminus \{a, a^*, b\}} \left( \cos(b^*, a^*) - \cos(b^*, a) + \cos(b^*, b) \right)$

- 3CosMul: $\arg\max_{b^* \in V_W \setminus \{a, a^*, b\}} \dfrac{\cos(b^*, a^*) \cdot \cos(b^*, b)}{\cos(b^*, a) + \varepsilon}$, where $\varepsilon = 0.001$
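A minimal sketch of both analogy solvers, assuming a small dictionary of L2-normalized word vectors built elsewhere; in the 3CosMul branch, cosines are rescaled from $[-1, 1]$ to $[0, 1]$ so the product and ratio stay well behaved (as in the original 3CosMul formulation). The vocabulary and vectors below are toy placeholders.

```python
import numpy as np

def solve_analogy(vectors, a, a_star, b, method="3CosMul", eps=1e-3):
    """Return the b* maximizing 3CosAdd or 3CosMul for 'a is to a* as b is to b*'."""
    best_word, best_score = None, -np.inf
    for w, v in vectors.items():
        if w in {a, a_star, b}:
            continue
        cos = lambda x: float(v @ vectors[x])          # vectors are unit length
        if method == "3CosAdd":
            score = cos(a_star) - cos(a) + cos(b)
        else:                                          # 3CosMul, cosines rescaled to [0, 1]
            s = lambda x: (cos(x) + 1) / 2
            score = s(a_star) * s(b) / (s(a) + eps)
        if score > best_score:
            best_word, best_score = w, score
    return best_word

# Toy usage with random unit vectors (real usage would load trained embeddings).
rng = np.random.default_rng(0)
vectors = {}
for w in ["paris", "france", "tokyo", "japan", "berlin"]:
    v = rng.normal(size=50)
    vectors[w] = v / np.linalg.norm(v)
print(solve_analogy(vectors, "paris", "france", "tokyo"))
```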

Experiment Results

Experiment Results Cont.

Key Takeaways

- The average score of SGNS (word2vec) is lower than SVD's for window sizes 2 and 5; SGNS never outperforms SVD by more than 1.7%.

- Previous results to the contrary were due to comparing tuned word2vec against vanilla SVD.

- GloVe is superior to SGNS only on analogy tasks using 3CosAdd (generally considered inferior to 3CosMul).

- CBOW seems to perform well only on the MSR analogy dataset.

- SVD does not benefit from shifting the PMI matrix.

- Using SVD with an eigenvalue weighting of 1 results in poor performance compared to 0.5 or 0.

Recommendations

- Tune hyperparameters based on the task at hand (duh).

- If you're using PMI, always use context distribution smoothing.

- If you're using SVD, always use eigenvalue weighting.

- SGNS always performs well and is computationally efficient to train and use.

- With SGNS, use many negative examples, i.e. prefer larger $k$.

- Experiment with the $\vec{w} = \vec{w} + \vec{c}$ variation in SGNS and GloVe.

References

- [0] Levy, Omer, Yoav Goldberg, and Ido Dagan. "Improving distributional similarity with lessons learned from word embeddings." Transactions of the Association for Computational Linguistics 3 (2015): 211-225.

- [1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.

- [2] Pennington, Jeffrey, Richard Socher, and Christopher Manning. "GloVe: Global vectors for word representation." Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
