



From Word Embeddings To Document Distances
Matt J. Kusner (1), Yu Sun (1,2), Nicholas I. Kolkin (1), Kilian Q. Weinberger (1,2)

(1) Department of Computer Science & Engineering, Washington University in St. Louis, USA
(2) Department of Computer Science, Cornell University, USA

Word Embedding & Document Vectors

[Figure: a word embedding space R^d containing the words Obama, Illinois, press, President, greets, speaks, and Chicago. Two documents, d (Obama, speaks, Illinois) and d' (President, greets, press, Chicago), are shown as point clouds over their word vectors, annotated with their normalized bag-of-words weights (1/4, 1/2, ...).]
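The normalized bag-of-words (nBOW) document weights shown in the figure can be computed as below. This is a minimal sketch, not taken from the poster; it assumes whitespace-tokenized input with stop words already removed.

    from collections import Counter

    def nbow(tokens):
        """Normalized bag-of-words: each word's weight is its count divided by
        the total number of (non-stop) words, so the weights sum to 1."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {word: count / total for word, count in counts.items()}

    # Toy documents; with real text, stop words would be removed first.
    print(nbow(["President", "greets", "press", "Chicago"]))  # 1/4 each
    print(nbow(["Obama", "speaks", "Obama"]))                 # Obama 2/3, speaks 1/3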

The Word Mover's Distance

WMD Optimization:

    \min_{T \geq 0} \; \sum_{i,j=1}^{n} T_{ij} \, \| x_i - x_j \|_2
    \text{s.t.} \quad \sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i, \qquad \sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j


word2vec [Mikolov et al., 2013]

Unlike the embeddings of [Collobert & Weston, 2008] and [Mnih & Hinton, 2009], word2vec is not deep. The pretrained word2vec model:
- embeds 3 million distinct words, trained on the 100-billion-word Google News corpus
- embeds billions of words per hour on a desktop computer
- learns complex word relationships, e.g.
      vec(Japan) - vec(sushi) + vec(Germany) ≈ vec(bratwurst)
      vec(Einstein) - vec(scientist) + vec(Picasso) ≈ vec(painter)

How can we leverage this high-quality word embedding to compute document distances?
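As a concrete illustration of the analogy queries above, the sketch below resolves vec(a) - vec(b) + vec(c) by nearest cosine neighbour. It is a hypothetical example, not the poster's code: the embedding dictionary uses random placeholder vectors rather than the real pretrained word2vec vectors, so only the mechanics, not the printed outputs, are meaningful.

    import numpy as np

    # Hypothetical stand-in for the pretrained embedding: {word: 300-d vector}.
    rng = np.random.default_rng(0)
    emb = {w: rng.normal(size=300) for w in
           ["Japan", "sushi", "Germany", "bratwurst",
            "Einstein", "scientist", "Picasso", "painter"]}

    def analogy(a, b, c, emb):
        """Word whose vector is closest (by cosine similarity) to vec(a) - vec(b) + vec(c)."""
        query = emb[a] - emb[b] + emb[c]
        query /= np.linalg.norm(query)
        best_word, best_sim = None, -np.inf
        for word, vec in emb.items():
            if word in (a, b, c):          # exclude the query words themselves
                continue
            sim = float(query @ (vec / np.linalg.norm(vec)))
            if sim > best_sim:
                best_word, best_sim = word, sim
        return best_word

    # With the real word2vec vectors these are expected to return 'bratwurst' and 'painter'.
    print(analogy("Japan", "sushi", "Germany", emb))
    print(analogy("Einstein", "scientist", "Picasso", emb))

With the actual 3-million-word model, the same query would normally go through an embedding library's nearest-neighbour search rather than this brute-force loop.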

Notation for the WMD optimization above:
  T_ij   'transport' between word i in d and word j in d'
  x_i    word embedding of word i
  d_i    normalized bag-of-words (nBOW) weight of word i in the first document d
  d'_j   nBOW weight of word j in the second document d'
  n      number of words in the dictionary

In computer vision this optimization is known as the Earth Mover's Distance (EMD) [Rubner et al., 1998].
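A minimal sketch of the optimization above, written as a generic transportation LP. This is an illustration only: the names X1, d, X2, d_prime and the use of scipy are assumptions, the documents are restricted to the words that actually occur in them (an equivalent reformulation), and the specialized EMD solvers used in the paper are much faster.

    import numpy as np
    from scipy.optimize import linprog

    def wmd(X1, d, X2, d_prime):
        """WMD between two documents: X1 (n1 x dim) and X2 (n2 x dim) hold their
        word embeddings, d (n1,) and d_prime (n2,) their nBOW weights."""
        n1, n2 = len(d), len(d_prime)
        # Transport costs c_ij = ||x_i - x_j||_2 for every pair of word vectors.
        C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)

        # Equality constraints: row sums of T equal d, column sums equal d_prime.
        A_eq = np.zeros((n1 + n2, n1 * n2))
        for i in range(n1):
            A_eq[i, i * n2:(i + 1) * n2] = 1.0   # sum_j T_ij = d_i
        for j in range(n2):
            A_eq[n1 + j, j::n2] = 1.0            # sum_i T_ij = d'_j
        b_eq = np.concatenate([d, d_prime])

        res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
        return res.fun

    # Toy usage: random 50-d "embeddings" and uniform nBOW weights.
    rng = np.random.default_rng(0)
    X1, X2 = rng.normal(size=(3, 50)), rng.normal(size=(4, 50))
    print(wmd(X1, np.full(3, 1 / 3), X2, np.full(4, 1 / 4)))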

Prior Art

[Figure: competing document representations: raw bag-of-words counts and TF-IDF weights over words such as 'campaign', 'Obama', 'speech', 'Washington'; LSI; and LDA topic distributions over topics such as sports, politics, music, and Civil War, where a 'politics' topic places mass on 'Obama', 'speech', 'Washington' and other topics on 'football', 'Madonna', 'Vicksburg', 'guitar', 'soccer'.]

[Salton & Buckley, 1988]  [Deerwester et al., 1990]  [Blei et al., 2003]

k-Nearest Neighbor Results

[Bar chart: average kNN test error relative to BOW, comparing BOW, TF-IDF, Okapi BM25, LSI, LDA, mSDA, CCG, and WMD across 8 document datasets; bar heights range from 1.29 down to 0.42, with WMD obtaining the lowest average error (0.42).]
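For context, the evaluation above amounts to plugging the document distance into a standard kNN classifier. A minimal sketch, reusing the hypothetical wmd() function from the earlier sketch; variable names are illustrative.

    import numpy as np

    def knn_predict(test_doc, train_docs, train_labels, k=5):
        """Majority label among the k WMD-nearest training documents.
        Each document is a (word-embedding matrix, nBOW weight vector) pair."""
        X_test, d_test = test_doc
        dists = np.array([wmd(X_test, d_test, X_i, d_i) for X_i, d_i in train_docs])
        nearest = np.argsort(dists)[:k]
        labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
        return labels[np.argmax(counts)]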

[Bar chart: average kNN error with different word embeddings: HLBL [Mnih & Hinton, 2009], CW [Collobert & Weston, 2008], and W2V [Mikolov et al., 2013].]

Approximations

Solving the exact WMD optimization scales as O(n^3 log n). Two much cheaper lower bounds:

1. The Word Centroid Distance, O(nd):

    \mathrm{WCD}(d, d') := \| Xd - Xd' \|_2

2. The Relaxed Word Mover's Distance, O(n^2 d):

    \mathrm{RWMD}(d, d') := \max\Big( \sum_{i,j=1}^{n} T^{*}_{1,ij} \, \| x_i - x_j \|_2, \;\; \sum_{i,j=1}^{n} T^{*}_{2,ij} \, \| x_i - x_j \|_2 \Big)

Here T*_1 and T*_2 are optimal solutions of the WMD transportation problem with one of its two constraint sets removed: one relaxation keeps only the outgoing constraints (sum_j T_ij = d_i for all i), the other keeps only the incoming constraints (sum_i T_ij = d'_j for all j). Each relaxed problem is solved in closed form by sending all of a word's mass to its nearest word (in embedding distance) in the other document, which is why RWMD costs only O(n^2 d).

[Figure: the WMD constraint polytope and the two relaxed polytopes obtained by dropping one constraint set, with their optima T*, T*_1, T*_2 under the costs ||x_i - x_j||_2 for all (i, j).]

Here d denotes the dimensionality of the word embeddings and X the (d, n) matrix of all word embeddings, so Xd is the nBOW-weighted centroid of a document's word vectors.
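A minimal sketch of the two bounds, under the same (assumed) X1, d, X2, d_prime document representation as the wmd() sketch above, to make the O(nd) and O(n^2 d) costs concrete.

    import numpy as np

    def wcd(X1, d, X2, d_prime):
        """Word Centroid Distance: distance between nBOW-weighted centroids, O(nd)."""
        return np.linalg.norm(d @ X1 - d_prime @ X2)

    def rwmd(X1, d, X2, d_prime):
        """Relaxed WMD: max of the two one-sided relaxations, each solved in closed
        form by sending every word's mass to its nearest word in the other document."""
        C = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2)  # pairwise costs
        lb1 = d @ C.min(axis=1)          # incoming constraints relaxed
        lb2 = d_prime @ C.min(axis=0)    # outgoing constraints relaxed
        return max(lb1, lb2)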

Approximation Results

[Bar chart: average kNN test error w.r.t. BOW for WCD, the two one-sided RWMD relaxations (c1, c2), RWMD, and WMD; bar heights 0.93, 0.56, 0.54, 0.45, 0.42, with WMD lowest at 0.42 and the full RWMD close behind at 0.45.]

[Bar chart: average distance relative to WMD for a random test input; WCD is roughly 0.30, the one-sided relaxations roughly 0.92 each, RWMD roughly 0.96, and WMD is 1.0 by definition, so RWMD is a far tighter lower bound than WCD.]

References
1. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111–3119, 2013.
2. Collobert, R. and Weston, J. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pp. 160–167. ACM, 2008.
3. Mnih, A. and Hinton, G. E. A scalable hierarchical distributed language model. In NIPS, pp. 1081–1088, 2009.
4. Rubner, Y., Tomasi, C., and Guibas, L. J. A metric for distributions with applications to image databases. In ICCV, pp. 59–66. IEEE, 1998.
5. Salton, G. and Buckley, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523, 1988.
6. Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
7. Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
8. Gardner, J., Kusner, M. J., Xu, E., Weinberger, K. Q., and Cunningham, J. Bayesian optimization with inequality constraints. In ICML, pp. 937–945, 2014.

We tuned all hyperparameters for all methods with Bayesian optimization (BO) [Gardner et al., 2014].
