
  • Two Topics in High Dimensional Space: Hubness and Low Dimensional Geometry

    Jiaji Huang

    July 18th

  • Agenda

    1 Hubness explained

    2 Bilingual Lexicon Induction (BLI) and Hubness Reduction [1]

    3 Low-dimensionality: Miscellaneous Results [2, 3]


  • Hubness¹

    - Hubs: "popular" nearest neighbors

    - Nk (k-occurrence against a query set): the number of times a point is the k-NN of query items

    ¹ M. Radovanović et al., JMLR 2010

  • Hubness

    - Example: multivariate Gaussian data, k = 5

    - For each data point, retrieve its 5-NN among all the generated data points

    Figure: Distribution of 5-occurrence

    - Conclusion: when d is large, some data points are retrieved far too many times!
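A minimal sketch of the experiment above (my own illustration, not the slides' code, with arbitrary sample sizes): draw i.i.d. Gaussian points, count how often each point appears among the 5-NN of the others, and compare a low and a high dimension.

```python
import numpy as np

def k_occurrence(X, k=5):
    """Nk for each point: how often it is among the k-NN of the other points."""
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(D, np.inf)                     # exclude self-matches
    knn = np.argsort(D, axis=1)[:, :k]              # indices of each point's k-NN
    return np.bincount(knn.ravel(), minlength=len(X))

rng = np.random.default_rng(0)
for d in (3, 300):
    Nk = k_occurrence(rng.standard_normal((2000, d)), k=5)
    # In high dimension a few points ("hubs") are retrieved very many times.
    print(f"d={d:4d}  max N5={Nk.max():4d}  share never retrieved: {(Nk == 0).mean():.2f}")
```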

  • Hubness Degrades NN Search

    Evidence seen in

    - Bilingual Lexicon Induction²

    - Document classification³

    - Audio retrieval⁴

    - Zero-shot image labeling⁵

    ² Dinu et al., ICLR 2014
    ³ Suzuki et al., EMNLP 2013
    ⁴ Aucouturier et al., Pattern Recognition 2008
    ⁵ Shigeto et al., KDD 2015

  • Agenda

    1 Hubness explained

    2 Bilingual Lexicon Induction (BLI) and Hubness Reduction [1]

    3 Low-dimensionality: Miscellaneous Results [2, 3]

  • Bilingual Lexicon Induction

    - How to translate words without parallel corpora?

    - Key idea: isometry between the word embedding spaces of the two languages

    Figure: English and Spanish word embedding spaces

  • Bilingual Lexicon Induction

    - Introduce some anchors by using a seeding dictionary

    - Align the anchors via a rotation (see the Procrustes sketch below)

    - More translation pairs can be induced by Nearest Neighbor (NN) search
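A minimal sketch of the rotation step (my own illustration; the slides do not name a specific solver): orthogonal Procrustes on the anchor pairs from the seed dictionary. The variable names and sizes are placeholders.

```python
import numpy as np

def fit_rotation(X_src, Y_tgt):
    """X_src, Y_tgt: (num_anchors, d) embeddings of seed translation pairs."""
    # SVD of the cross-covariance gives the orthogonal W minimizing ||X_src W - Y_tgt||_F.
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt

# Usage: map every source embedding into the target space, then run NN search there.
rng = np.random.default_rng(0)
X_anchor, Y_anchor = rng.standard_normal((500, 300)), rng.standard_normal((500, 300))
W = fit_rotation(X_anchor, Y_anchor)
X_mapped = X_anchor @ W
```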

  • Inverted Softmax Mitigates Hubness⁶

    - Distance matrix D_{i,j}, where i = 1, ..., m indexes the source and j = 1, ..., n indexes the target

    - Kernel matrix exp(−D_{i,j}/ε)

    Diagram: Kernel → normalize columns → normalize rows (a sketch follows below)

    ⁶ Smith et al., ICLR 2017
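A minimal sketch of ISF retrieval as the slide describes it (my own illustration; eps and the distance matrix D are placeholders): exponentiate the negative distances, normalize columns, normalize rows, then take the row-wise argmax.

```python
import numpy as np

def isf_retrieve(D, eps=0.1):
    """D: (m, n) source-by-target distance matrix; returns best target per source."""
    K = np.exp(-D / eps)                  # kernel matrix
    K /= K.sum(axis=0, keepdims=True)     # normalize columns (the "inverted" step)
    K /= K.sum(axis=1, keepdims=True)     # normalize rows
    return K.argmax(axis=1)
```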

  • Doubts ...

    - ISF works, but it is not clear why

    - What if we run the normalization multiple times?

    Let's back off a bit ...

  • Viewing NN as an Optimization Problem

    - NN is equivalent to argmax_j exp(−D_{i,j}/ε).

    - Obviously,

      argmax_j exp(−D_{i,j}/ε) ≡ argmax_j exp(−D_{i,j}/ε) / Σ_{j'} exp(−D_{i,j'}/ε)

    - The RHS is the solution of the following problem:

      min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
      s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1    (1)

  • Viewing NN as an Optimization Problem

    Proposition 1 (NN as an optimization problem)

    The NN criterion is equivalent to argmax_j P_{i,j}, where P is the solution of the following optimization problem:

      min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
      s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1.    (P0)
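The slides jump from the softmax identity to Proposition 1; the following short derivation (added here for completeness, not taken from the slides) shows why the row-wise softmax solves (P0), which makes the equivalence immediate.

```latex
% Per-row Lagrangian of (P0), with multiplier \lambda_i for the row-sum constraint:
\begin{align*}
  \mathcal{L}_i &= \sum_j D_{i,j} P_{i,j}
                 + \epsilon \sum_j P_{i,j}\log P_{i,j}
                 + \lambda_i\Big(\sum_j P_{i,j}-1\Big),\\
  \frac{\partial \mathcal{L}_i}{\partial P_{i,j}}
                &= D_{i,j} + \epsilon\,(\log P_{i,j}+1) + \lambda_i = 0
                 \;\Rightarrow\; P_{i,j} \propto \exp(-D_{i,j}/\epsilon),\\
  \text{so}\quad P_{i,j} &= \frac{\exp(-D_{i,j}/\epsilon)}{\sum_{j'}\exp(-D_{i,j'}/\epsilon)}
  \quad\text{and}\quad
  \operatorname*{argmax}_j P_{i,j} = \operatorname*{argmin}_j D_{i,j}\ \text{(the NN rule).}
\end{align*}
```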

  • Equal Preference Assumption

    - The column mean of P encodes how popular each target item is

    - Let's enforce them to be equally "popular"

    Definition 1 (Equal Preference Assumption)

      pf_j ≜ (1/m) Σ_i P_{i,j} = 1/n,  ∀j

    - Approximately holds when m, n are huge

  • Hubless Nearest Neighbor (HNN)

    Applying the assumption:

    Definition 2 (HNN)

    HNN is the criterion that retrieves the index argmax_j P_{i,j}, where P is the solution of the problem

      min_P ⟨D, P⟩ + ε Σ_{i,j} P_{i,j} log P_{i,j}
      s.t. P_{i,j} ≥ 0, Σ_j P_{i,j} = 1, (1/m) Σ_i P_{i,j} = 1/n.    (P1)

  • Solving (P1)

    Algorithm 1: Sinkhorn Iteration

      Input: D
      Output: P
      P ← exp(−D/ε)  (element-wise)
      while stopping criteria not met do
        // normalize columns
        P ← P diag{(m/n) ./ (Pᵀ1)}
        // normalize rows
        P ← diag{1 ./ (P1)} P
      end while

    - ISF is a single step of the Sinkhorn iteration!

    - Less efficient if m, n are huge
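A minimal NumPy sketch of Algorithm 1 (my own illustration, not the paper's reference implementation): alternately rescale columns to sum to m/n and rows to sum to 1, then retrieve by row-wise argmax. With a single iteration this reduces to the ISF sketch shown earlier.

```python
import numpy as np

def hnn_sinkhorn_retrieve(D, eps=0.1, n_iter=30):
    """D: (m, n) distance matrix; returns the HNN-retrieved target per source row."""
    m, n = D.shape
    P = np.exp(-D / eps)                              # element-wise kernel
    for _ in range(n_iter):
        P *= (m / n) / P.sum(axis=0, keepdims=True)   # normalize columns to m/n
        P /= P.sum(axis=1, keepdims=True)             # normalize rows to 1
    return P.argmax(axis=1)
```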

  • Dual of (P1)⁷

    Proposition 2 (Dual of (P1))

    The solution of problem (P1) can be expressed as

      P_{i,j} = exp((β_j − D_{i,j})/ε) / Σ_{j'} exp((β_{j'} − D_{i,j'})/ε),    (2)

    where β is the solution of

      min_β Σ_i ℓ_i,  with ℓ_i ≜ ε log Σ_j exp((β_j − D_{i,j})/ε) − (1/n) Σ_j β_j.    (D)

    - exp(−β_j/ε) is a column normalizer

    - ISF simply sets the normalizer as Σ_i exp(−D_{i,j}/ε)

    ⁷ Genevay et al., 2016

  • Dual Solver

    Algorithm 2: Dual Solver for Problem (P1)

      Input: D
      Output: P
      1: β ← 0
      2: while stopping criteria not met do
      3:   for i = 1, ..., m do
      4:     // parallelizable
      5:     compute gradient ∇ℓ_i
      6:   end for
      7:   β ← β − η · (1/m) Σ_i ∇ℓ_i
      8: end while
      9: compute P by Eq. (2) and return
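A minimal sketch of Algorithm 2 (my own illustration; the learning rate and iteration count are placeholders): the gradient of ℓ_i with respect to β is the i-th row of P from Eq. (2) minus 1/n, so plain gradient descent on the averaged gradients recovers P.

```python
import numpy as np

def hnn_dual_retrieve(D, eps=0.1, lr=1.0, n_iter=200):
    """Solve the dual (D) by gradient descent on beta, then retrieve via Eq. (2)."""
    m, n = D.shape
    beta = np.zeros(n)
    for _ in range(n_iter):
        logits = (beta[None, :] - D) / eps
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)             # row-wise softmax = Eq. (2)
        grad = P.mean(axis=0) - 1.0 / n               # (1/m) sum_i of grad l_i
        beta -= lr * grad
    return P.argmax(axis=1)
```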

  • Illustrative Experiments

    - Dimension d = 300, number of classes C = 10K, m = n = 10K

    - p(x|c) = N(µ_c, 0.01), c = 1, ..., C, uniform prior p(c)

    Figure: (a) distribution of 1-occurrence p(N1) for NN, ISF, HNN; (b) P@1 over Algorithm 1 iterations for ISF and HNN; (c) column normalizers of ISF and HNN

  • Illustrative Experiments (contd.)

    Top-1, top-5, top-10 retrieval accuracies

                          P@1     P@5     P@10
    NN                    51.71   71.88   78.51
    ISF                   73.75   88.57   92.36
    HNN (Algorithm 1)     78.85   91.43   94.60
    HNN (Algorithm 2)     78.84   91.41   94.59

  • BLI Experiments⁸

    Accuracy for each source (rows) → target (columns) pair

    Source  Method   en      es      fr      it      pt      de
    en      NN       –       54.98   55.66   46.30   37.02   53.03
    en      ISF      –       70.35   72.31   63.75   53.02   65.75
    en      CSLS     –       71.21   72.62   64.11   53.54   66.50
    en      HNN      –       71.34   73.65   64.91   54.03   64.61
    es      NN       59.87   –       60.76   61.95   66.94   48.73
    es      ISF      73.11   –       76.82   76.33   78.67   61.70
    es      CSLS     73.02   –       76.44   76.44   80.29   62.29
    es      HNN      74.38   –       78.24   77.86   81.09   60.78
    fr      NN       61.60   61.73   –       59.43   46.31   57.10
    fr      ISF      74.46   75.72   –       73.78   60.89   69.07
    fr      CSLS     74.88   76.68   –       74.34   62.06   70.34
    fr      HNN      75.97   77.23   –       75.12   63.10   67.94
    it      NN       51.38   64.63   61.45   –       51.91   50.68
    it      ISF      65.57   77.76   76.64   –       67.32   63.58
    it      CSLS     65.32   78.46   76.74   –       68.85   64.57
    it      HNN      67.57   79.75   78.56   –       70.33   62.96
    pt      NN       42.21   68.93   47.48   50.98   –       37.95
    pt      ISF      55.76   81.67   64.37   68.37   –       51.07
    pt      CSLS     54.75   81.98   63.68   67.92   –       51.77
    pt      HNN      57.42   83.96   66.19   70.44   –       49.93
    de      NN       56.06   44.33   52.78   45.44   33.20   –
    de      ISF      69.74   60.77   71.59   65.99   52.74   –
    de      CSLS     68.65   59.21   69.88   63.69   50.72   –
    de      HNN      69.20   60.22   70.71   65.09   52.08   –

    ⁸ Following the setups in https://github.com/facebookresearch/MUSE

  • Analysis

    - Why is HNN less impressive on German?

    Figure: var[pf_j] for all the (source, target) pairs (heatmap; values on the order of 1e−11)

    - Hubness reduced?

    Figure: distribution of 5-occurrence p(N5) (log-log scale) for NN, ISF, CSLS, HNN

  • Analysis

    Table: Some representative "hubs"

    Word          N5 by NN   N5 by HNN   Frequency rank
    conspersus    1,776      0           484,387
    oryzopsis     1,235      5           472,161
    these         1,042      25          122
    s+bd          912        16          440,835
    were          798        24          40
    you           474        20          50
    would         467        40          73

    - Typos (extremely low-frequency) are likely to be hubs⁹

    - Some function words (very frequent) can also be hubs

    ⁹ Dinu et al., ICLR 2014

  • Agenda

    1 Hubness explained

    2 Bilingual Lexicon Induction (BLI) and Hubness Reduction [1]

    3 Low-dimensionality: Miscellaneous Results [2, 3]

  • Principal Angles and Subspace Classification [2]


  • Robustness and Local Geometry Preservation [3]

    - Application of (K, ε)-robustness¹⁰

    - Deep network + Σ_{(i,j)∈NB} |d(f_i, f_j) − d(x_i, x_j)|, where NB is a local neighborhood (see the sketch below)

    ¹⁰ Xu et al., Machine Learning 2012
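A minimal sketch of the local-geometry term (my own illustration; the distance d, the neighborhood NB, and the feature map are placeholders): penalize, over neighboring pairs, the absolute gap between feature-space and input-space distances, and add the result to the network's loss.

```python
import numpy as np

def geometry_penalty(X, F, neighbor_pairs):
    """X: (N, d_in) inputs, F: (N, d_feat) features, neighbor_pairs: iterable of (i, j)."""
    total = 0.0
    for i, j in neighbor_pairs:
        d_x = np.linalg.norm(X[i] - X[j])   # distance in input space
        d_f = np.linalg.norm(F[i] - F[j])   # distance in feature space
        total += abs(d_f - d_x)
    return total
```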

  • References

    [1] J. Huang et al. Hubless Nearest Neighbor Search for Bilingual Lexicon Induction. ACL 2019.

    [2] J. Huang et al. The Role of Principal Angles in Subspace Classification. IEEE Transactions on Signal Processing, 2015.

    [3] J. Huang et al. Discriminative Robust Feature Transformation. NIPS 2015.
