From Word Embeddings To Document Distances
Matt J. Kusner¹, Yu Sun¹٬², Nicholas I. Kolkin¹, Kilian Q. Weinberger¹٬²
¹Department of Computer Science & Engineering, Washington University in St. Louis, USA
²Department of Computer Science, Cornell University, USA
[Figure: the two documents as nBOW histograms — d over 'Obama', 'speaks', 'Illinois' (weights 1/4 and 1/2) and d′ over 'President', 'greets', 'press', 'Chicago' (weights 1/4 each) — with every word embedded as a point in the word embedding space R^d ('Obama', 'Illinois', 'press', 'President', 'greets', 'speaks', 'Chicago').]
Word Embedding & Document Vectors
$$\min_{T \geq 0} \sum_{i,j=1}^{n} T_{ij} \, \lVert x_i - x_j \rVert_2$$

$$\text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j$$
WMD Optimization
The Word Mover's Distance
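Not from the original slides: since the objective and constraints above form a standard linear program, a minimal sketch with scipy.optimize.linprog can serve as a reference implementation. The variable names (X1, X2, d, d_prime) are my own, and in practice specialized EMD solvers [Rubner et al., 1998] are used instead.

```python
import numpy as np
from scipy.optimize import linprog

def wmd(X1, X2, d, d_prime):
    """Word Mover's Distance between two documents.
    X1: (n1, dim) embeddings of document 1's words, d: (n1,) nBOW weights.
    X2: (n2, dim) embeddings of document 2's words, d_prime: (n2,) weights."""
    n1, n2 = len(d), len(d_prime)
    # Transport cost c(i, j) = Euclidean distance between word embeddings.
    C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    A_eq = []
    for i in range(n1):                  # sum_j T[i, j] = d[i]  (for all i)
        row = np.zeros(n1 * n2)
        row[i * n2:(i + 1) * n2] = 1.0
        A_eq.append(row)
    for j in range(n2):                  # sum_i T[i, j] = d'[j] (for all j)
        col = np.zeros(n1 * n2)
        col[j::n2] = 1.0
        A_eq.append(col)
    b_eq = np.concatenate([d, d_prime])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun
```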
word2vec [Mikolov et al., 2013]

[Figure: words embedded as points in R^d.]

Unlike [Collobert & Weston, 2008] and [Mnih & Hinton, 2009], word2vec is not deep!

The word2vec model:
- embedded the Google News corpus: 3 million distinct words, trained on 100 billion words
- embeds billions of words per hour on a desktop computer
- learns complex word relationships:
  vec(Japan) − vec(sushi) + vec(Germany) ≈ vec(bratwurst)
  vec(Einstein) − vec(scientist) + vec(Picasso) ≈ vec(painter)

How can we leverage this high-quality word embedding to compute document distances?
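As an aside (not in the original slides), the analogy queries above can be reproduced with gensim and the pretrained Google News vectors; the local file name below is an assumption about where the vectors are stored.

```python
from gensim.models import KeyedVectors

# Assumes the pretrained 300-d Google News vectors have been downloaded locally.
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Mirrors the slide's queries: vec(Japan) - vec(sushi) + vec(Germany), etc.
print(w2v.most_similar(positive=["Japan", "Germany"], negative=["sushi"], topn=3))
print(w2v.most_similar(positive=["Einstein", "Picasso"], negative=["scientist"], topn=3))
```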
- T_ij: 'transport' between word i in d and word j in d'
- x_i: word embedding for word i
- d_i: normalized bag-of-words (nBOW) weight for word i in the first document d
- d'_j: nBOW weight for word j in the second document d'
- n: number of words in the dictionary

In computer vision this is the Earth Mover's Distance (EMD) [Rubner et al., 1998].
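To make the d_i / d'_j notation concrete, here is a small sketch (my own, not the authors' code) that turns a tokenized document into its nBOW histogram over a fixed dictionary.

```python
from collections import Counter

def nbow(tokens, vocab):
    # Count in-vocabulary words, then normalize so the weights sum to one.
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) or 1  # guard against empty documents
    return [counts[w] / total for w in vocab]

vocab = ["obama", "speaks", "illinois", "president", "greets", "press", "chicago"]
print(nbow("obama speaks in illinois".split(), vocab))  # [1/3, 1/3, 1/3, 0, 0, 0, 0]
```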
k-Nearest Neighbor Results
[Figure: prior document representations for the running example — a bag-of-words count vector (counts 3, 1, 2, 1 on 'campaign', 'Obama', 'speech', 'Washington' among mostly-zero dictionary entries) and a TF-IDF vector (0.01, 0.2, 0.04, 0.13) over the same dictionary; and LSI/LDA topic distributions over topics such as sports, politics, music, and Civil War, where a politics topic puts high weight on 'Obama', 'speech', 'Washington' and low weight on 'football', 'Madonna', 'Vicksburg', 'guitar', 'soccer'.]

[Salton & Buckley, 1988] [Deerwester et al., 1990] [Blei et al., 2003]
Prior Art
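For reference (not part of the slides), the two classical representations can be produced with scikit-learn; the example sentences echo the running Obama/President example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["Obama speaks to the media in Illinois",
        "The President greets the press in Chicago"]

bow = CountVectorizer()
print(bow.fit_transform(docs).toarray())  # raw bag-of-words counts
print(bow.get_feature_names_out())        # the dictionary behind the columns

tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray())  # TF-IDF reweighted counts
```

Note that the two vectors share almost no non-stop-word dimensions despite describing the same event — exactly the near-orthogonality problem that motivates WMD.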
[Bar chart: average kNN test error relative to BOW for BOW, TF-IDF, Okapi BM25, LSI, LDA, mSDA, CCG, and WMD; relative errors range from 1.29 down to 0.42, with WMD achieving the lowest error (0.42).]
Approximations
[Bar chart: average kNN error with different word embeddings — HLBL [Mnih & Hinton, 2009], CW [Collobert & Weston, 2008], and W2V (word2vec) [Mikolov et al., 2013].]
[Figure: the words of both documents ('Obama', 'Illinois', 'press', 'President', 'greets', 'speaks', 'Chicago') plotted in the word embedding space R^d, with the weighted centroids Xd and Xd′ marked.]
Solving the exact WMD costs $O(n^3 \log n)$. The Word Centroid Distance,

$$\mathrm{WCD}(d, d') := \lVert X d - X d' \rVert_2,$$

costs only $O(nd)$.
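In code this is just the distance between the weighted centroids; a two-line sketch under the notation above (variable names are mine):

```python
import numpy as np

def wcd(X, d, d_prime):
    # X: (dim, n) matrix of all word embeddings; d, d_prime: (n,) nBOW vectors.
    # WCD(d, d') = || X d - X d' ||_2, the distance between document centroids.
    return np.linalg.norm(X @ d - X @ d_prime)
```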
1. The Word Centroid Distance

2. The Relaxed Word Mover's Distance
$$\mathrm{RWMD}(d, d') := \max\left\{ \sum_{i,j=1}^{n} T^{*}_{1,ij} \lVert x_i - x_j \rVert_2,\;\; \sum_{i,j=1}^{n} T^{*}_{2,ij} \lVert x_i - x_j \rVert_2 \right\}$$

computable in $O(n^2 d)$
The two lower bounds come from relaxing the WMD linear program. $T^*_1$ solves the problem with only the first constraint set:

$$\min_{T \geq 0} \sum_{i,j=1}^{n} T_{ij} \, \lVert x_i - x_j \rVert_2 \quad \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i$$

and $T^*_2$ solves the problem with only the second constraint set:

$$\min_{T \geq 0} \sum_{i,j=1}^{n} T_{ij} \, \lVert x_i - x_j \rVert_2 \quad \text{s.t.} \quad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j$$

[Figure: the constraint polytope of the full WMD problem with its optimum T*, alongside the larger polytopes of the two relaxations; the transport costs are ‖x_i − x_j‖_2 for all (i, j).]
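Each relaxed problem is solved in closed form by sending each word's full weight to its nearest word in the other document, which is what makes RWMD cheap. A sketch (my own code, not the authors'; X1 and X2 hold the two documents' word embeddings row-wise):

```python
import numpy as np

def rwmd(X1, X2, d, d_prime):
    """Relaxed WMD lower bound.
    X1: (n1, dim) embeddings of document 1's words, d: (n1,) nBOW weights.
    X2: (n2, dim) embeddings of document 2's words, d_prime: (n2,) weights."""
    # Pairwise Euclidean distances between the two documents' words.
    C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)
    # Closed-form optima: move each word's weight to its nearest counterpart.
    lb1 = d @ C.min(axis=1)        # value under T*_1 (row constraints only)
    lb2 = d_prime @ C.min(axis=0)  # value under T*_2 (column constraints only)
    return max(lb1, lb2)
```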
- d: dimensionality of the word embeddings
- X: matrix of all word embeddings, of size (d, n)
[Bar charts: (left) average kNN error w.r.t. BOW — WCD 0.93, the two one-constraint relaxations of RWMD (c1, c2) 0.56 and 0.54, RWMD 0.45, WMD 0.42; (right) average distance to the true WMD for a random test input — WCD 0.30, the one-constraint relaxations (optima T*_1, T*_2) 0.92 each, RWMD 0.96, WMD 1.0 by definition.]

Approximation Results
References
1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
2. Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008.
3. Mnih, A. and Hinton, G. E. A scalable hierarchical distributed language model. In NIPS, pp. 1081–1088, 2009.
4. Rubner, Y., Tomasi, C., and Guibas, L. J. A metric for distributions with applications to image databases. In ICCV, pp. 59–66. IEEE, 1998.
5. Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
6. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
7. Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
8. Gardner, J., Kusner, M. J., Xu, E., Weinberger, K. Q., and Cunningham, J. Bayesian optimization with inequality constraints. In ICML, pp. 937–945, 2014.
We tuned all hyperparameters for all methods with Bayesian optimization [Gardner et al., 2014].