Two Topics in High Dimensional Space: Hubness and Low Dimensional Geometry
Jiaji Huang
July 18th
Agenda
1 Hubness explained
2 Bilingual Lexicon Induction (BLI) and Hubness Reduction [1]
3 Low-dimensionality: Miscellaneous Results [2, 3]
Hubness^1

- Hubs: "popular" nearest neighbors
- N_k (k-occurrence against a query set): "the number of times a point is the k-NN of query items"

^1 M. Radovanović et al., JMLR 2010
Hubness
- Example: multivariate Gaussian, k = 5
- For each data point, retrieve its 5-NN among all the generated data points

Figure: Distribution of 5-occurrence

- Conclusion: when d is large, some data points are retrieved far too many times!
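A small simulation in the spirit of the Gaussian example above (the sample size n = 1000 and the helper names `k_occurrence` and `skewness` are my own choices, not from the slides): it counts, for each point, how often it appears in the 5-NN lists of the others, and compares low versus high dimension.

```python
import numpy as np

def k_occurrence(X, k=5):
    """N_k: how many times each point appears in the k-NN lists of the others."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                     # exclude self-matches
    knn = np.argsort(d2, axis=1)[:, :k]              # indices of the k nearest neighbors
    return np.bincount(knn.ravel(), minlength=len(X))

def skewness(x):
    x = x - x.mean()
    return (x ** 3).mean() / (x ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
n, k = 1000, 5
for d in (3, 100):
    N5 = k_occurrence(rng.standard_normal((n, d)), k)
    print(f"d={d:3d}  max N_5={N5.max()}  skewness of N_5={skewness(N5):.2f}")
```

The N_5 distribution becomes markedly right-skewed as d grows: a few points (hubs) collect very large counts.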
Hubness Degrades NN Search
Evidence seen in:

- Bilingual Lexicon Induction^2
- Document classification^3
- Audio retrieval^4
- Zero-shot image labeling^5

^2 Dinu et al., ICLR 2014
^3 Suzuki et al., EMNLP 2013
^4 Aucouturier et al., Pattern Recognition 2008
^5 Shigeto et al., KDD 2015
Bilingual Lexicon Induction
- How to translate words without parallel corpora?
- Assume an isometry between the word embedding spaces of two languages

(Figure: English and Spanish embedding spaces)
Bilingual Lexicon Induction
- Introduce some anchors using a seed dictionary
- Align the anchors via a rotation
- Induce more translation pairs by Nearest Neighbor (NN) search
Inverted Softmax Mitigates Hubness^6

- Distance matrix D_{i,j}, where i = 1, ..., m indexes the source and j = 1, ..., n indexes the target
- Kernel matrix exp(−D_{i,j}/ε)

(Diagram: the kernel matrix, normalized over columns vs. over rows)

^6 Smith et al., ICLR 2017
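A minimal toy sketch of why column normalization helps, using a hand-built 4×4 distance matrix with a planted hub (the matrix, ε, and the helper names are mine, not from the slides):

```python
import numpy as np

def nn_retrieve(D):
    """Plain NN: row-wise argmin over target distances."""
    return D.argmin(axis=1)

def isf_retrieve(D, eps=0.1):
    """Inverted softmax: normalize the kernel exp(-D/eps) over the source
    axis (columns) before the row-wise argmax, penalizing targets that
    are close to every query."""
    K = np.exp(-D / eps)
    K = K / K.sum(axis=0, keepdims=True)
    return K.argmax(axis=1)

# Planted hub: target 0 is slightly closer to *every* source than its true match.
D = np.full((4, 4), 2.0)
np.fill_diagonal(D, 0.5)   # true match of source i is target i
D[:, 0] = 0.3              # target 0 is a hub
print(nn_retrieve(D))      # every query collapses onto the hub: [0 0 0 0]
print(isf_retrieve(D))     # column normalization recovers the diagonal: [0 1 2 3]
```

Because the hub column attracts every query, its normalized entries are diluted to 1/4, while each true-match column keeps essentially all of its mass on the correct row.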
Doubts ...
- ISF works, but it is not clear why
- What if we run the normalization multiple times?
Let’s back off a bit ...
Viewing NN as an Optimization Problem
- NN is equivalent to argmax_j exp(−D_{i,j}/ε).
- Obviously,

  argmax_j exp(−D_{i,j}/ε) ≡ argmax_j exp(−D_{i,j}/ε) / Σ_{j'} exp(−D_{i,j'}/ε)

- The RHS is the solution of the following:

  min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
  s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1    (1)
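The equivalence can be spot-checked numerically: the row-wise softmax of −D/ε is the closed-form minimizer of this entropic objective over row-stochastic matrices, and its row-wise argmax agrees with plain NN (a sketch with my own random D and ε):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eps = 5, 7, 0.1
D = rng.random((m, n))

# Row-wise softmax of -D/eps: the closed-form minimizer of
# <D, P> + eps * sum_ij P_ij log P_ij subject to each row summing to 1.
K = np.exp(-D / eps)
P = K / K.sum(axis=1, keepdims=True)

# (i) argmax_j P[i, j] coincides with plain NN (argmin_j D[i, j]).
assert (P.argmax(axis=1) == D.argmin(axis=1)).all()

def objective(Q):
    return (D * Q).sum() + eps * (Q * np.log(Q)).sum()

# (ii) No random row-stochastic matrix beats the softmax solution.
for _ in range(200):
    Q = rng.random((m, n))
    Q /= Q.sum(axis=1, keepdims=True)
    assert objective(P) < objective(Q)
print("softmax solves the row-constrained entropic problem")
```

Strict convexity of the entropy term makes the softmax the unique minimizer, which is why the spot check in (ii) always holds.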
Viewing NN as an Optimization Problem
Proposition 1 (NN as an optimization problem)
The NN criterion is equivalent to argmax_j P_{i,j}, where P is the solution of the following optimization problem:

  min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
  s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1.    (P0)
Equal Preference Assumption
- The column mean of P encodes how popular each target item is
- Let's enforce them to be equally "popular"

Definition 1 (Equal Preference Assumption)

  pf_j ≜ (1/m) Σ_i P_{i,j} = 1/n, ∀j

- Approximately holds when m and n are huge
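To see why the assumption must be enforced rather than relied upon, one can measure pf_j under the plain-NN solution P from (P0) on random Gaussian embeddings (the parameters m = n = 500, d = 100, ε = 1 are my choices for illustration):

```python
import numpy as np

# How non-uniform is pf_j = (1/m) sum_i P_ij under plain NN?
rng = np.random.default_rng(0)
m = n = 500
X = rng.standard_normal((m, 100))  # source vectors
Y = rng.standard_normal((n, 100))  # target vectors
D = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T

eps = 1.0
K = np.exp(-(D - D.min(axis=1, keepdims=True)) / eps)  # stabilized kernel
P = K / K.sum(axis=1, keepdims=True)                   # row-stochastic softmax
pf = P.mean(axis=0)                                    # column means

print(f"uniform value 1/n = {1 / n:.4f}")
print(f"max pf_j = {pf.max():.4f}, var[pf] = {pf.var():.2e}")
```

A few hub targets soak up several times their fair share of mass, so pf is far from the uniform 1/n in high dimension.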
Hubless Nearest Neighbor (HNN)
Applying the assumption:
Definition 2 (HNN)
HNN is the criterion that retrieves index argmax_j P_{i,j}, where P is the solution of the problem

  min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
  s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1, (1/m) Σ_i P_{i,j} = 1/n.    (P1)
13 / 24
Solving (P1)

Algorithm 1 Sinkhorn Iteration

  Input: D
  Output: P
  P ← exp(−D/ε), elementwise
  while stopping criteria not met do
    // normalize columns
    P ← P diag{(m/n) ./ (Pᵀ1)}
    // normalize rows
    P ← diag{1 ./ (P1)} P
  end while

- ISF is a single step of the Sinkhorn iteration!
- Less efficient if m and n are huge
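A minimal NumPy sketch of the Sinkhorn iteration for (P1): columns of exp(−D/ε) are scaled to sum to m/n (so column means are 1/n), then rows are scaled to sum to 1. The test matrix, ε, and iteration count are my own choices.

```python
import numpy as np

def sinkhorn_hnn(D, eps=0.1, n_iter=200):
    """Sinkhorn iteration: alternately scale columns of exp(-D/eps)
    to sum to m/n and rows to sum to 1."""
    m, n = D.shape
    P = np.exp(-D / eps)
    for _ in range(n_iter):
        P = P * ((m / n) / P.sum(axis=0, keepdims=True))  # normalize columns
        P = P / P.sum(axis=1, keepdims=True)              # normalize rows
    return P

rng = np.random.default_rng(0)
D = rng.random((6, 4))
P = sinkhorn_hnn(D, eps=0.2)
print(P.sum(axis=1))   # rows sum to 1
print(P.mean(axis=0))  # column means converge to 1/n = 0.25
```

Stopping after the very first column normalization (plus the harmless row normalization, which does not change the row-wise argmax) recovers ISF, which is the sense in which ISF is a single Sinkhorn step.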
Dual of (P1)^7

Proposition 2 (Dual of (P1))
The solution of problem (P1) can be expressed as

  P_{i,j} = exp((β_j − D_{i,j})/ε) / Σ_{j'} exp((β_{j'} − D_{i,j'})/ε),    (2)

where β is the solution of

  min_β Σ_i ℓ_i,  ℓ_i ≜ ε log Σ_j exp((β_j − D_{i,j})/ε) − (1/n) Σ_j β_j    (D)

- exp(−β_j/ε) is a column normalizer
- ISF simply sets the normalizer to Σ_i exp(−D_{i,j}/ε)

^7 Genevay et al., 2016
Dual Solver
Algorithm 2 Dual Solver for Problem (P1)

  Input: D
  Output: P
  1: β ← 0
  2: while stopping criteria not met do
  3:   for i = 1, ..., m do
  4:     // parallelizable
  5:     Compute gradient ∇ℓ_i
  6:   end for
  7:   β ← β − η · (1/m) Σ_i ∇ℓ_i
  8: end while
  9: Compute P by Eq. (2) and return
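A sketch of Algorithm 2 with a fixed step size and a vectorized (rather than per-i parallel) gradient; the averaged gradient of the ℓ_i works out to P.mean(axis=0) − 1/n, where P is given by Eq. (2). The step size, ε, and iteration count are my choices for this small example.

```python
import numpy as np

def row_softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # numerical stabilization
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def hnn_dual(D, eps=0.5, lr=1.0, n_iter=3000):
    """Gradient descent on the dual variables beta. By Eq. (2),
    P = row_softmax((beta - D)/eps), and the averaged gradient
    (1/m) sum_i grad l_i equals P.mean(axis=0) - 1/n."""
    m, n = D.shape
    beta = np.zeros(n)
    for _ in range(n_iter):
        P = row_softmax((beta - D) / eps)
        grad = P.mean(axis=0) - 1.0 / n
        beta = beta - lr * grad
    return row_softmax((beta - D) / eps)

rng = np.random.default_rng(0)
D = rng.random((6, 4))
P = hnn_dual(D)
print(P.mean(axis=0))  # column means ≈ 1/n after convergence
```

At a stationary point the gradient vanishes, i.e. the column means hit 1/n exactly, matching the primal constraint in (P1); rows sum to 1 by construction of the softmax.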
Illustrative Experiments
- Dimension d = 300, number of classes C = 10K, m = n = 10K
- p(x|c) = N(µ_c, 0.01), c = 1, ..., C, uniform prior p(c)
(Figure: (a) distribution of 1-occurrence p(N_1) for NN, ISF, HNN; (b) P@1 over Algorithm 1 iterations for ISF, HNN; (c) column normalizers of ISF vs. HNN)
Illustrative Experiments (contd.)
Top-1, top-5, top-10 retrieval accuracies
                     P@1     P@5     P@10
  NN                 51.71   71.88   78.51
  ISF                73.75   88.57   92.36
  HNN (Algorithm 1)  78.85   91.43   94.60
  HNN (Algorithm 2)  78.84   91.41   94.59
BLI Experiments^8

  source  method    en      es      fr      it      pt      de
  en      NN        -       54.98   55.66   46.30   37.02   53.03
          ISF       -       70.35   72.31   63.75   53.02   65.75
          CSLS      -       71.21   72.62   64.11   53.54   66.50
          HNN       -       71.34   73.65   64.91   54.03   64.61
  es      NN        59.87   -       60.76   61.95   66.94   48.73
          ISF       73.11   -       76.82   76.33   78.67   61.70
          CSLS      73.02   -       76.44   76.44   80.29   62.29
          HNN       74.38   -       78.24   77.86   81.09   60.78
  fr      NN        61.60   61.73   -       59.43   46.31   57.10
          ISF       74.46   75.72   -       73.78   60.89   69.07
          CSLS      74.88   76.68   -       74.34   62.06   70.34
          HNN       75.97   77.23   -       75.12   63.10   67.94
  it      NN        51.38   64.63   61.45   -       51.91   50.68
          ISF       65.57   77.76   76.64   -       67.32   63.58
          CSLS      65.32   78.46   76.74   -       68.85   64.57
          HNN       67.57   79.75   78.56   -       70.33   62.96
  pt      NN        42.21   68.93   47.48   50.98   -       37.95
          ISF       55.76   81.67   64.37   68.37   -       51.07
          CSLS      54.75   81.98   63.68   67.92   -       51.77
          HNN       57.42   83.96   66.19   70.44   -       49.93
  de      NN        56.06   44.33   52.78   45.44   33.20   -
          ISF       69.74   60.77   71.59   65.99   52.74   -
          CSLS      68.65   59.21   69.88   63.69   50.72   -
          HNN       69.20   60.22   70.71   65.09   52.08   -

^8 Following the setups in https://github.com/facebookresearch/MUSE
Analysis
- Why is HNN less impressive on German?

(Figure: heatmap of var[pf_j] for all source-target pairs)

- Hubness reduced?

(Figure: distribution of 5-occurrence p(N_5), log-log scale, for NN, ISF, CSLS, HNN)
Analysis
Table: Some representative “hubs”
  word         N_5 by NN   N_5 by HNN   frequency rank
  conspersus   1,776       0            484,387
  oryzopsis    1,235       5            472,161
  these        1,042       25           122
  s+bd         912         16           440,835
  were         798         24           40
  you          474         20           50
  would        467         40           73

- Typos (extremely low-frequency words) are likely to be hubs^9
- Some function words (very frequent) can also be hubs

^9 Dinu et al., ICLR 2014
Principal Angles and Subspace Classification [2]
Robustness and Local Geometry Preservation [3]
- Application of (K, ε)-robustness^10
- Deep network + Σ_{(i,j)∈NB} |d(f_i, f_j) − d(x_i, x_j)|, where NB is the local neighborhood

^10 Xu et al., Machine Learning 2012
References

[1] J. Huang et al. Hubless Nearest Neighbor Search for Bilingual Lexicon Induction. ACL 2019.
[2] J. Huang et al. The Role of Principal Angles in Subspace Classification. IEEE Transactions on Signal Processing, 2015.
[3] J. Huang et al. Discriminative Robust Feature Transformation. NIPS 2015.
- Hubness explained
- Bilingual Lexicon Induction (BLI) and Hubness Reduction (ACL)
- Low-dimensionality: Miscellaneous Results (TSP, NIPS)