



From Word Embeddings To Document Distances
Matt J. Kusner (1), Yu Sun (1,2), Nicholas I. Kolkin (1), Kilian Q. Weinberger (1,2)

(1) Department of Computer Science & Engineering, Washington University in St. Louis, USA
(2) Department of Computer Science, Cornell University, USA

Word Embedding & Document Vectors

[Figure: a word embedding space R^d containing the words Obama, Illinois, press, President, greets, speaks, and Chicago. Two documents, d (Obama, speaks, Illinois) and d' (President, greets, press, Chicago), are shown as point clouds over their word vectors, annotated with their normalized bag-of-words weights (1/4, 1/2, ...).]
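The normalized bag-of-words (nBOW) document weights shown in the figure can be computed as below. This is a minimal sketch, not taken from the poster; it assumes whitespace-tokenized input with stop words already removed.

    from collections import Counter

    def nbow(tokens):
        """Normalized bag-of-words: each word's weight is its count divided by
        the total number of (non-stop) words, so the weights sum to 1."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    # Toy documents; with real text, stop words would be removed first.
    print(nbow(["President", "greets", "press", "Chicago"]))  # 1/4 each
    print(nbow(["Obama", "speaks", "Obama"]))                 # Obama 2/3, speaks 1/3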

The Word Mover's Distance

WMD Optimization:

    \min_{T \geq 0} \; \sum_{i,j=1}^{n} T_{ij} \, \| x_i - x_j \|_2
    \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j


word2vec [Mikolov et al., 2013]

Unlike the embeddings of [Collobert & Weston, 2008] and [Mnih & Hinton, 2009], word2vec is not deep. The pretrained word2vec model:
- embeds 3 million distinct words, trained on the 100-billion-word Google News corpus
- embeds billions of words per hour on a desktop computer
- learns complex word relationships, e.g.
      vec(Japan) - vec(sushi) + vec(Germany) ≈ vec(bratwurst)
      vec(Einstein) - vec(scientist) + vec(Picasso) ≈ vec(painter)

How can we leverage this high-quality word embedding to compute document distances?
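As a concrete illustration of the analogy queries above, the sketch below resolves vec(a) - vec(b) + vec(c) by nearest cosine neighbour. It is a hypothetical example, not the poster's code: the embedding dictionary uses random placeholder vectors rather than the real pretrained word2vec vectors, so only the mechanics, not the printed outputs, are meaningful.

    import numpy as np

    # Hypothetical stand-in for the pretrained embedding: {word: 300-d vector}.
    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=300) for w in
           ["Japan", "sushi", "Germany", "bratwurst",
            "Einstein", "scientist", "Picasso", "painter"]}

    def analogy(a, b, c, emb):
        """Word whose vector is closest (by cosine similarity) to vec(a) - vec(b) + vec(c)."""
        query = emb[a] - emb[b] + emb[c]
        query /= np.linalg.norm(query)
        best_word, best_sim = None, -np.inf
        for word, vec in emb.items():
            if word in (a, b, c):          # exclude the query words themselves
                continue
            sim = float(query @ (vec / np.linalg.norm(vec)))
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # With the real word2vec vectors these are expected to return 'bratwurst' and 'painter'.
    print(analogy("Japan", "sushi", "Germany", emb))
    print(analogy("Einstein", "scientist", "Picasso", emb))

With the actual 3-million-word model, the same query would normally go through an embedding library's nearest-neighbour search rather than this brute-force loop.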

Notation for the WMD optimization above:
  T_ij   'transport' between word i in d and word j in d'
  x_i    word embedding of word i
  d_i    normalized bag-of-words (nBOW) weight of word i in the first document d
  d'_j   nBOW weight of word j in the second document d'
  n      number of words in the dictionary

In computer vision this optimization is known as the Earth Mover's Distance (EMD) [Rubner et al., 1998].
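A minimal sketch of the optimization above, written as a generic transportation LP. This is an illustration only: the names X1, d, X2, d_prime and the use of scipy are assumptions, the documents are restricted to the words that actually occur in them (an equivalent reformulation), and the specialized EMD solvers used in the paper are much faster.

    import numpy as np
    from scipy.optimize import linprog

    def wmd(X1, d, X2, d_prime):
        """WMD between two documents: X1 (n1 x dim) and X2 (n2 x dim) hold their
        word embeddings, d (n1,) and d_prime (n2,) their nBOW weights."""
        n1, n2 = len(d), len(d_prime)
        # Transport costs c_ij = ||x_i - x_j||_2 for every pair of word vectors.
        C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)

        # Equality constraints: row sums of T equal d, column sums equal d_prime.
        A_eq = np.zeros((n1 + n2, n1 * n2))
        for i in range(n1):
            A_eq[i, i * n2:(i + 1) * n2] = 1.0   # sum_j T_ij = d_i
        for j in range(n2):
            A_eq[n1 + j, j::n2] = 1.0            # sum_i T_ij = d'_j
        b_eq = np.concatenate([d, d_prime])

        res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        return res.fun

    # Toy usage: random 50-d "embeddings" and uniform nBOW weights.
    rng = np.random.default_rng(0)
    X1, X2 = rng.normal(size=(3, 50)), rng.normal(size=(4, 50))
    print(wmd(X1, np.full(3, 1 / 3), X2, np.full(4, 1 / 4)))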

Prior Art

[Figure: competing document representations: raw bag-of-words counts and TF-IDF weights over words such as 'campaign', 'Obama', 'speech', 'Washington'; LSI; and LDA topic distributions over topics such as sports, politics, music, and Civil War, where a 'politics' topic places mass on 'Obama', 'speech', 'Washington' and other topics on 'football', 'Madonna', 'Vicksburg', 'guitar', 'soccer'.]

[Salton & Buckley, 1988]  [Deerwester et al., 1990]  [Blei et al., 2003]

k-Nearest Neighbor Results

[Bar chart: average kNN test error relative to BOW, comparing BOW, TF-IDF, Okapi BM25, LSI, LDA, mSDA, CCG, and WMD across 8 document datasets; bar heights range from 1.29 down to 0.42, with WMD obtaining the lowest average error (0.42).]
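For context, the evaluation above amounts to plugging the document distance into a standard kNN classifier. A minimal sketch, reusing the hypothetical wmd() function from the earlier sketch; variable names are illustrative.

    import numpy as np

    def knn_predict(test_doc, train_docs, train_labels, k=5):
        """Majority label among the k WMD-nearest training documents.
        Each document is a (word-embedding matrix, nBOW weight vector) pair."""
        X_test, d_test = test_doc
        dists = np.array([wmd(X_test, d_test, X_i, d_i) for X_i, d_i in train_docs])
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
        return labels[np.argmax(counts)]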

[Bar chart: average kNN error with different word embeddings: HLBL [Mnih & Hinton, 2009], CW [Collobert & Weston, 2008], and W2V [Mikolov et al., 2013].]

Approximations

Solving the exact WMD optimization scales as O(n^3 log n). Two much cheaper lower bounds:

1. The Word Centroid Distance, O(nd):

    \mathrm{WCD}(d, d') := \| Xd - Xd' \|_2

2. The Relaxed Word Mover's Distance, O(n^2 d):

    \mathrm{RWMD}(d, d') := \max\Big( \sum_{i,j=1}^{n} T^{*}_{1,ij} \, \| x_i - x_j \|_2, \;\; \sum_{i,j=1}^{n} T^{*}_{2,ij} \, \| x_i - x_j \|_2 \Big)

Here T*_1 and T*_2 are optimal solutions of the WMD transportation problem with one of its two constraint sets removed: one relaxation keeps only the outgoing constraints (sum_j T_ij = d_i for all i), the other keeps only the incoming constraints (sum_i T_ij = d'_j for all j). Each relaxed problem is solved in closed form by sending all of a word's mass to its nearest word (in embedding distance) in the other document, which is why RWMD costs only O(n^2 d).

[Figure: the WMD constraint polytope and the two relaxed polytopes obtained by dropping one constraint set, with their optima T*, T*_1, T*_2 under the costs ||x_i - x_j||_2 for all (i, j).]

Here d denotes the dimensionality of the word embeddings and X the (d, n) matrix of all word embeddings, so Xd is the nBOW-weighted centroid of a document's word vectors.
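A minimal sketch of the two bounds, under the same (assumed) X1, d, X2, d_prime document representation as the wmd() sketch above, to make the O(nd) and O(n^2 d) costs concrete.

    import numpy as np

    def wcd(X1, d, X2, d_prime):
        """Word Centroid Distance: distance between nBOW-weighted centroids, O(nd)."""
        return np.linalg.norm(d @ X1 - d_prime @ X2)

    def rwmd(X1, d, X2, d_prime):
        """Relaxed WMD: max of the two one-sided relaxations, each solved in closed
        form by sending every word's mass to its nearest word in the other document."""
        C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)  # pairwise costs
        lb1 = d @ C.min(axis=1)          # incoming constraints relaxed
        lb2 = d_prime @ C.min(axis=0)    # outgoing constraints relaxed
        return max(lb1, lb2)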

Approximation Results

[Bar chart: average kNN test error w.r.t. BOW for WCD, the two one-sided RWMD relaxations (c1, c2), RWMD, and WMD; bar heights 0.93, 0.56, 0.54, 0.45, 0.42, with WMD lowest at 0.42 and the full RWMD close behind at 0.45.]

[Bar chart: average distance relative to WMD for a random test input; WCD is roughly 0.30, the one-sided relaxations roughly 0.92 each, RWMD roughly 0.96, and WMD is 1.0 by definition, so RWMD is a far tighter lower bound than WCD.]

References
1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
2. Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008.
3. Mnih, A. and Hinton, G. E. A scalable hierarchical distributed language model. In NIPS, pp. 1081–1088, 2009.
4. Rubner, Y., Tomasi, C., and Guibas, L. J. A metric for distributions with applications to image databases. In ICCV, pp. 59–66. IEEE, 1998.
5. Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
6. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
7. Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
8. Gardner, J., Kusner, M. J., Xu, E., Weinberger, K. Q., and Cunningham, J. Bayesian optimization with inequality constraints. In ICML, pp. 937–945, 2014.

We tuned all hyperparameters for all methods with Bayesian optimization (BO) [Gardner et al., 2014].
