Introduction to word embeddings
Pavel Kalaidin@facultyofwonder
Moscow Data Fest, September, 12th, 2015
distributional hypothesis
лойс
"годно, лойс" (nice, лойс)
"лойс за песню" (лойс for the song)
"из принципа не поставлю лойс" (I won't give a лойс on principle)
"взаимные лойсы" (mutual лойсы)
"лойс, если согласен" (лойс if you agree)
What is the meaning of лойс?
кек
"кек, что ли?" (кек, really?)
"кек)))))))"
"ну ты кек" (well, you are such a кек)
What is the meaning of кек?
vector representations of words
a simple and flexible platform for understanding text (and probably not messing up)
one-hot encoding?
[1 0 0 0 0 0 ... 0 0] — a vector of vocabulary size with a single 1 and zeros everywhere else
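A minimal sketch of one-hot encoding (the toy vocabulary and the numpy choice are mine, not from the talk):

import numpy as np

# toy vocabulary, purely for illustration
vocab = ["кек", "лойс", "песня", "годно"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # a |V|-dimensional vector with a single 1 at the word's index
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("лойс"))  # [0. 1. 0. 0.]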
co-occurrence matrix
recall: word-document co-occurrence matrix for LSA
credits: [x]
from the entire document to a context window (length 5-10)
still seems suboptimal -> big, sparse, etc.
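A rough sketch of collecting window-based co-occurrence counts (the helper name and toy sentences are assumptions for illustration):

from collections import defaultdict

def cooccurrence_counts(sentences, window=5):
    # count how often each pair of words appears within `window` tokens of each other
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1.0
    return counts

sentences = [["годно", "лойс"], ["лойс", "за", "песню"]]
print(cooccurrence_counts(sentences, window=2))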
lower dimensions, we want dense vectors
(say, 25-1000)
How?
matrix factorization?
SVD of co-occurrence matrix
lots of memory?
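One way to get dense vectors from the counts is a truncated SVD; a sketch (scipy's svds and the random stand-in matrix are my assumptions):

import numpy as np
from scipy.sparse.linalg import svds

# stand-in for a |V| x |V| co-occurrence matrix
X = np.random.rand(1000, 1000)

# keep only the top-k singular directions as dense word vectors
U, S, Vt = svds(X, k=50)
word_vectors = U * S  # each row is a 50-dimensional word vector
print(word_vectors.shape)  # (1000, 50)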
idea: directly learn low-dimensional vectors
here comes word2vec
Distributed Representations of Words and Phrases and their Compositionality, Mikolov et al: [paper]
idea: instead of capturing co-occurrence counts
predict surrounding words
Two models:
CBOW: predicting the word given its context
skip-gram: predicting the context given a word
Explained in great detail here, so we’ll skip it for now. Also see: word2vec Parameter Learning Explained, Rong [paper]
CBOW: several times faster than skip-gram, slightly better accuracy for frequent words
Skip-gram: works well with a small amount of data, represents rare words and phrases well
Examples?
W_woman − W_man = W_queen − W_king
classic example
<censored example>
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method, Goldberg and Levy, 2014 [arxiv]
all done with gensim: github.com/piskvorky/gensim/
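A minimal gensim sketch (API as of the 2015-era versions; `size` became `vector_size` and `model.most_similar` moved to `model.wv.most_similar` in later releases; the toy sentences are made up):

from gensim.models import Word2Vec

# toy corpus of tokenized sentences; a real corpus would be much larger
sentences = [["годно", "лойс"], ["лойс", "за", "песню"], ["кек", "что", "ли"]]

# sg=0 -> CBOW, sg=1 -> skip-gram
model = Word2Vec(sentences, size=100, window=5, min_count=1, sg=1)

# nearest neighbours in the embedding space
print(model.most_similar("лойс", topn=3))

# the classic analogy needs a real corpus to actually work:
# model.most_similar(positive=["king", "woman"], negative=["man"], topn=1)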
...failing to take advantage of the vast amount of repetition
in the data
so back to co-occurrences
GloVe for Global Vectors, Pennington et al, 2014: nlp.stanford.edu/pubs/glove.pdf
Ratios seem to cancel noise
The gist: model ratios with vectors
The model
Preserving linearity
Preventing mixing dimensions
Restoring symmetry, part 1
recall:
Restoring symmetry, part 2
Least squares problem it is now
SGD->AdaGrad
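The equations on the preceding slides did not survive extraction; as a reconstruction following the GloVe paper (with P_ik = X_ik / X_i the co-occurrence probability), the derivation proceeds along these lines:

F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}                        % model ratios with vectors
F(w_i - w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}                       % preserving linearity
F\big((w_i - w_j)^\top \tilde{w}_k\big) = \frac{P_{ik}}{P_{jk}}         % preventing mixing dimensions
w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}                  % F = exp, biases restore symmetry
J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2   % weighted least squares, trained with AdaGrad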
ok, Python code
glove-python: github.com/maciejkula/glove-python
two sets of vectors: input and context, plus biases
average/sum/drop
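A sketch of the same pipeline with glove-python, following its README as I recall it (treat the exact call names as assumptions; the sentences are a toy corpus):

from glove import Corpus, Glove

sentences = [["годно", "лойс"], ["лойс", "за", "песню"]]  # toy corpus

# build the co-occurrence matrix with a context window
corpus = Corpus()
corpus.fit(sentences, window=5)

# fit GloVe vectors on that matrix
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=10, no_threads=2)
glove.add_dictionary(corpus.dictionary)

print(glove.most_similar("лойс"))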
complexity: |V|² in the worst case
complexity: |C|^0.8 in practice
Evaluation: it works
[plot: hashtag embeddings for #spb #gatchina #msk #kyiv #minsk #helsinki]
Compared to word2vec
[plot: the same hashtags, word2vec vectors]
t-SNE:github.com/oreillymedia/t-SNE-tutorial
seaborn:stanford.edu/~mwaskom/software/seaborn/
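A rough sketch of the visualization step (scikit-learn's TSNE plus matplotlib, rather than the tutorial code linked above; the random vectors stand in for real hashtag embeddings):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

labels = ["#spb", "#gatchina", "#msk", "#kyiv", "#minsk", "#helsinki"]
word_vectors = np.random.rand(len(labels), 100)  # stand-in for real embeddings

# project to 2D; perplexity must be smaller than the number of points
coords = TSNE(n_components=2, perplexity=3).fit_transform(word_vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), label in zip(coords, labels):
    plt.annotate(label, (x, y))
plt.show()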
Abusing models
music playlists: github.com/mattdennewitz/playlist-to-vec
user interests
Paragraph vectors: cs.stanford.edu/~quocle/paragraph_vector.pdf
predicting hashtags
interesting read: #TAGSPACE: Semantic Embeddings from Hashtags [link]
RusVectōrēs: distributional semantic models for Russian: ling.go.mail.ru/dsm/en/
corpus matters
building block for bigger models ╰(*´︶`*)╯
</slides>