Two Topics in High Dimensional Space: Hubness and Low Dimensional Geometry
Jiaji Huang
July 18th
Agenda
1 Hubness explained
2 Bilingual Lexicon Induction (BLI) and Hubness Reduction [1]
3 Low-dimensionality: Miscellaneous Results [2, 3]
Hubness^1

- Hubs: "popular" nearest neighbors
- N_k (k-occurrence against a query set): "the number of times a point is the k-NN of query items"

^1 M. Radovanović et al., JMLR 2010
Hubness
- Example: multivariate Gaussian, k = 5
- For each data point, retrieve its 5-NN among all the generated data points

Figure: Distribution of 5-occurrence

- Conclusion: when d is large, some data points are retrieved far too many times!
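A small simulation in the spirit of the Gaussian example above (the sample size n = 1000 and the helper names `k_occurrence` and `skewness` are my own choices, not from the slides): it counts, for each point, how often it appears in the 5-NN lists of the others, and compares low versus high dimension.

```python
import numpy as np

def k_occurrence(X, k=5):
    """N_k: how many times each point appears in the k-NN lists of the others."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                     # exclude self-matches
    knn = np.argsort(d2, axis=1)[:, :k]              # indices of the k nearest neighbors
    return np.bincount(knn.ravel(), minlength=len(X))

def skewness(x):
    x = x - x.mean()
    return (x ** 3).mean() / (x ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
n, k = 1000, 5
for d in (3, 100):
    N5 = k_occurrence(rng.standard_normal((n, d)), k)
    print(f"d={d:3d}  max N_5={N5.max()}  skewness of N_5={skewness(N5):.2f}")
```

The N_5 distribution becomes markedly right-skewed as d grows: a few points (hubs) collect very large counts.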
Hubness Degrades NN Search
Evidence seen in:

- Bilingual Lexicon Induction^2
- Document classification^3
- Audio retrieval^4
- Zero-shot image labeling^5

^2 Dinu et al., ICLR 2014
^3 Suzuki et al., EMNLP 2013
^4 Aucouturier et al., Pattern Recognition 2008
^5 Shigeto et al., KDD 2015
Bilingual Lexicon Induction
- How to translate words without parallel corpora?
- Assume an isometry between the word embedding spaces of two languages

(Figure: English and Spanish embedding spaces)
Bilingual Lexicon Induction
- Introduce some anchors using a seed dictionary
- Align the anchors via a rotation
- Induce more translation pairs by Nearest Neighbor (NN) search
Inverted Softmax Mitigates Hubness^6

- Distance matrix D_{i,j}, where i = 1, ..., m indexes the source and j = 1, ..., n indexes the target
- Kernel matrix exp(−D_{i,j}/ε)

(Diagram: the kernel matrix, normalized over columns vs. over rows)

^6 Smith et al., ICLR 2017
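A minimal toy sketch of why column normalization helps, using a hand-built 4×4 distance matrix with a planted hub (the matrix, ε, and the helper names are mine, not from the slides):

```python
import numpy as np

def nn_retrieve(D):
    """Plain NN: row-wise argmin over target distances."""
    return D.argmin(axis=1)

def isf_retrieve(D, eps=0.1):
    """Inverted softmax: normalize the kernel exp(-D/eps) over the source
    axis (columns) before the row-wise argmax, penalizing targets that
    are close to every query."""
    K = np.exp(-D / eps)
    K = K / K.sum(axis=0, keepdims=True)
    return K.argmax(axis=1)

# Planted hub: target 0 is slightly closer to *every* source than its true match.
D = np.full((4, 4), 2.0)
np.fill_diagonal(D, 0.5)   # true match of source i is target i
D[:, 0] = 0.3              # target 0 is a hub
print(nn_retrieve(D))      # every query collapses onto the hub: [0 0 0 0]
print(isf_retrieve(D))     # column normalization recovers the diagonal: [0 1 2 3]
```

Because the hub column attracts every query, its normalized entries are diluted to 1/4, while each true-match column keeps essentially all of its mass on the correct row.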
Doubts ...
- ISF works, but it is not clear why
- What if we run the normalization multiple times?
Let’s back off a bit ...
Viewing NN as an Optimization Problem
- NN is equivalent to argmax_j exp(−D_{i,j}/ε).
- Obviously,

  argmax_j exp(−D_{i,j}/ε) ≡ argmax_j exp(−D_{i,j}/ε) / Σ_{j'} exp(−D_{i,j'}/ε)

- The RHS is the solution of the following:

  min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
  s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1    (1)
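The equivalence can be spot-checked numerically: the row-wise softmax of −D/ε is the closed-form minimizer of this entropic objective over row-stochastic matrices, and its row-wise argmax agrees with plain NN (a sketch with my own random D and ε):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eps = 5, 7, 0.1
D = rng.random((m, n))

# Row-wise softmax of -D/eps: the closed-form minimizer of
# <D, P> + eps * sum_ij P_ij log P_ij subject to each row summing to 1.
K = np.exp(-D / eps)
P = K / K.sum(axis=1, keepdims=True)

# (i) argmax_j P[i, j] coincides with plain NN (argmin_j D[i, j]).
assert (P.argmax(axis=1) == D.argmin(axis=1)).all()

def objective(Q):
    return (D * Q).sum() + eps * (Q * np.log(Q)).sum()

# (ii) No random row-stochastic matrix beats the softmax solution.
for _ in range(200):
    Q = rng.random((m, n))
    Q /= Q.sum(axis=1, keepdims=True)
    assert objective(P) < objective(Q)
print("softmax solves the row-constrained entropic problem")
```

Strict convexity of the entropy term makes the softmax the unique minimizer, which is why the spot check in (ii) always holds.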
Viewing NN as an Optimization Problem
Proposition 1 (NN as an optimization problem)
The NN criterion is equivalent to argmax_j P_{i,j}, where P is the solution of the following optimization problem:

  min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
  s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1.    (P0)
Equal Preference Assumption
- The column mean of P encodes how popular each target item is
- Let's enforce them to be equally "popular"

Definition 1 (Equal Preference Assumption)

  pf_j ≜ (1/m) Σ_i P_{i,j} = 1/n, ∀j

- Approximately holds when m and n are huge
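To see why the assumption must be enforced rather than relied upon, one can measure pf_j under the plain-NN solution P from (P0) on random Gaussian embeddings (the parameters m = n = 500, d = 100, ε = 1 are my choices for illustration):

```python
import numpy as np

# How non-uniform is pf_j = (1/m) sum_i P_ij under plain NN?
rng = np.random.default_rng(0)
m = n = 500
X = rng.standard_normal((m, 100))  # source vectors
Y = rng.standard_normal((n, 100))  # target vectors
D = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T

eps = 1.0
K = np.exp(-(D - D.min(axis=1, keepdims=True)) / eps)  # stabilized kernel
P = K / K.sum(axis=1, keepdims=True)                   # row-stochastic softmax
pf = P.mean(axis=0)                                    # column means

print(f"uniform value 1/n = {1 / n:.4f}")
print(f"max pf_j = {pf.max():.4f}, var[pf] = {pf.var():.2e}")
```

A few hub targets soak up several times their fair share of mass, so pf is far from the uniform 1/n in high dimension.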
Hubless Nearest Neighbor (HNN)
Applying the assumption:
Definition 2 (HNN)
HNN is the criterion that retrieves index argmax_j P_{i,j}, where P is the solution of the problem

  min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
  s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1, (1/m) Σ_i P_{i,j} = 1/n.    (P1)
13 / 24
Solving (P1)

Algorithm 1 Sinkhorn Iteration

  Input: D
  Output: P
  P ← exp(−D/ε), elementwise
  while stopping criteria not met do
    // normalize columns
    P ← P diag{(m/n) ./ (Pᵀ1)}
    // normalize rows
    P ← diag{1 ./ (P1)} P
  end while

- ISF is a single step of the Sinkhorn iteration!
- Less efficient if m and n are huge
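A minimal NumPy sketch of the Sinkhorn iteration for (P1): columns of exp(−D/ε) are scaled to sum to m/n (so column means are 1/n), then rows are scaled to sum to 1. The test matrix, ε, and iteration count are my own choices.

```python
import numpy as np

def sinkhorn_hnn(D, eps=0.1, n_iter=200):
    """Sinkhorn iteration: alternately scale columns of exp(-D/eps)
    to sum to m/n and rows to sum to 1."""
    m, n = D.shape
    P = np.exp(-D / eps)
    for _ in range(n_iter):
        P = P * ((m / n) / P.sum(axis=0, keepdims=True))  # normalize columns
        P = P / P.sum(axis=1, keepdims=True)              # normalize rows
    return P

rng = np.random.default_rng(0)
D = rng.random((6, 4))
P = sinkhorn_hnn(D, eps=0.2)
print(P.sum(axis=1))   # rows sum to 1
print(P.mean(axis=0))  # column means converge to 1/n = 0.25
```

Stopping after the very first column normalization (plus the harmless row normalization, which does not change the row-wise argmax) recovers ISF, which is the sense in which ISF is a single Sinkhorn step.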
Dual of (P1)^7

Proposition 2 (Dual of (P1))
The solution of problem (P1) can be expressed as

  P_{i,j} = exp((β_j − D_{i,j})/ε) / Σ_{j'} exp((β_{j'} − D_{i,j'})/ε),    (2)

where β is the solution of

  min_β Σ_i ℓ_i,  ℓ_i ≜ ε log Σ_j exp((β_j − D_{i,j})/ε) − (1/n) Σ_j β_j    (D)

- exp(−β_j/ε) is a column normalizer
- ISF simply sets the normalizer to Σ_i exp(−D_{i,j}/ε)

^7 Genevay et al., 2016
Dual Solver
Algorithm 2 Dual Solver for Problem (P1)

  Input: D
  Output: P
  1: β ← 0
  2: while stopping criteria not met do
  3:   for i = 1, ..., m do
  4:     // parallelizable
  5:     Compute gradient ∇ℓ_i
  6:   end for
  7:   β ← β − η · (1/m) Σ_i ∇ℓ_i
  8: end while
  9: Compute P by Eq. (2) and return
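A sketch of Algorithm 2 with a fixed step size and a vectorized (rather than per-i parallel) gradient; the averaged gradient of the ℓ_i works out to P.mean(axis=0) − 1/n, where P is given by Eq. (2). The step size, ε, and iteration count are my choices for this small example.

```python
import numpy as np

def row_softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # numerical stabilization
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def hnn_dual(D, eps=0.5, lr=1.0, n_iter=3000):
    """Gradient descent on the dual variables beta. By Eq. (2),
    P = row_softmax((beta - D)/eps), and the averaged gradient
    (1/m) sum_i grad l_i equals P.mean(axis=0) - 1/n."""
    m, n = D.shape
    beta = np.zeros(n)
    for _ in range(n_iter):
        P = row_softmax((beta - D) / eps)
        grad = P.mean(axis=0) - 1.0 / n
        beta = beta - lr * grad
    return row_softmax((beta - D) / eps)

rng = np.random.default_rng(0)
D = rng.random((6, 4))
P = hnn_dual(D)
print(P.mean(axis=0))  # column means ≈ 1/n after convergence
```

At a stationary point the gradient vanishes, i.e. the column means hit 1/n exactly, matching the primal constraint in (P1); rows sum to 1 by construction of the softmax.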
Illustrative Experiments
- Dimension d = 300, number of classes C = 10K, m = n = 10K
- p(x|c) = N(µ_c, 0.01), c = 1, ..., C, uniform prior p(c)
(Figure: (a) distribution of 1-occurrence p(N_1) for NN, ISF, HNN; (b) P@1 over Algorithm 1 iterations for ISF, HNN; (c) column normalizers of ISF vs. HNN)
Illustrative Experiments (contd.)
Top-1, top-5, top-10 retrieval accuracies
                     P@1     P@5     P@10
  NN                 51.71   71.88   78.51
  ISF                73.75   88.57   92.36
  HNN (Algorithm 1)  78.85   91.43   94.60
  HNN (Algorithm 2)  78.84   91.41   94.59
BLI Experiments^8

  source  method    en      es      fr      it      pt      de
  en      NN        -       54.98   55.66   46.30   37.02   53.03
          ISF       -       70.35   72.31   63.75   53.02   65.75
          CSLS      -       71.21   72.62   64.11   53.54   66.50
          HNN       -       71.34   73.65   64.91   54.03   64.61
  es      NN        59.87   -       60.76   61.95   66.94   48.73
          ISF       73.11   -       76.82   76.33   78.67   61.70
          CSLS      73.02   -       76.44   76.44   80.29   62.29
          HNN       74.38   -       78.24   77.86   81.09   60.78
  fr      NN        61.60   61.73   -       59.43   46.31   57.10
          ISF       74.46   75.72   -       73.78   60.89   69.07
          CSLS      74.88   76.68   -       74.34   62.06   70.34
          HNN       75.97   77.23   -       75.12   63.10   67.94
  it      NN        51.38   64.63   61.45   -       51.91   50.68
          ISF       65.57   77.76   76.64   -       67.32   63.58
          CSLS      65.32   78.46   76.74   -       68.85   64.57
          HNN       67.57   79.75   78.56   -       70.33   62.96
  pt      NN        42.21   68.93   47.48   50.98   -       37.95
          ISF       55.76   81.67   64.37   68.37   -       51.07
          CSLS      54.75   81.98   63.68   67.92   -       51.77
          HNN       57.42   83.96   66.19   70.44   -       49.93
  de      NN        56.06   44.33   52.78   45.44   33.20   -
          ISF       69.74   60.77   71.59   65.99   52.74   -
          CSLS      68.65   59.21   69.88   63.69   50.72   -
          HNN       69.20   60.22   70.71   65.09   52.08   -

^8 Following the setups in https://github.com/facebookresearch/MUSE
Analysis
- Why is HNN less impressive on German?

(Figure: heatmap of var[pf_j] for all source-target pairs)

- Hubness reduced?

(Figure: distribution of 5-occurrence p(N_5), log-log scale, for NN, ISF, CSLS, HNN)
Analysis
Table: Some representative “hubs”
  word         N_5 by NN   N_5 by HNN   frequency rank
  conspersus   1,776       0            484,387
  oryzopsis    1,235       5            472,161
  these        1,042       25           122
  s+bd         912         16           440,835
  were         798         24           40
  you          474         20           50
  would        467         40           73

- Typos (extremely low-frequency words) are likely to be hubs^9
- Some function words (very frequent) can also be hubs

^9 Dinu et al., ICLR 2014
Principal Angles and Subspace Classification [2]
Robustness and Local Geometry Preservation [3]
- Application of (K, ε)-robustness^10
- Deep network + Σ_{(i,j)∈NB} |d(f_i, f_j) − d(x_i, x_j)|, where NB is the local neighborhood

^10 Xu et al., Machine Learning 2012
References

[1] J. Huang et al. Hubless Nearest Neighbor Search for Bilingual Lexicon Induction. ACL 2019.
[2] J. Huang et al. The Role of Principal Angles in Subspace Classification. IEEE Transactions on Signal Processing, 2015.
[3] J. Huang et al. Discriminative Robust Feature Transformation. NIPS 2015.
- Hubness explained
- Bilingual Lexicon Induction (BLI) and Hubness Reduction (ACL)
- Low-dimensionality: Miscellaneous Results (TSP, NIPS)