Similarity-based Learning via Data Driven Embeddings∗

Purushottam Kar¹    Prateek Jain²

¹ Indian Institute of Technology Kanpur
² Microsoft Research India, Bengaluru

November 3, 2011

∗ To appear in the proceedings of NIPS 2011
Outline

1 An Introduction to Learning
2 A Brief History of Learning with Similarities
3 Learning with Suitable Similarities
    Learning with a Suitable Similarity Function
    Learning with a Suitable Distance Function
4 Data-sensitive Notions of Suitability
    Learning with Data-sensitive Notions of Suitability
    Learning the Best Notion of Suitability
    Results
5 References
Learning
Digit Classification†

†MNIST database: http://yann.lecun.com/exdb/mnist/
Spam mail detection

Dear Junta,
The Hall-8 mess will be closed for the occasion of Diwali at lunch & dinner time. The breakfast will be served along with Lunch packets tomorrow (26th October, 2011).
Please collect your Lunch Packet. The mess would resume its normal working from 27th October.
A legitimate mail

Hello, I am resending my previous mail to you, I hope you do get it this time around and understand its content fully. I am contacting you briefly based on the Investment of Forty Five Million Dollars (US$ 45,000,000:00) in your country, as I presently have a client who is interested in investing in your country. Sincerely Yours, J. Costa
Most likely a spam mail

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMINAR SERIES
Departmental Colloquium
Title: Similarity-based Learning via Data Driven Embeddings
Speaker: Purushottam Kar
Affiliation: Ph.D. Scholar, CSE Dept., IIT Kanpur
To each his own ...
More formally ...

We are working over a domain X and wish to learn a target classifier over the domain, ℓ : X → {−1,+1}.
We are given training points S = {x_1, x_2, . . . , x_n} sampled from some distribution D over X, along with their true labels {ℓ(x_1), . . . , ℓ(x_n)}.
Our goal is to output a classifier $\hat{\ell} : X \to \{-1,+1\}$ that mostly gives out the true labels, i.e.

$$\Pr_{x \sim \mathcal{D}}\left[ \hat{\ell}(x) \neq \ell(x) \right] < \epsilon$$
Representing the data

Most learning algorithms (Perceptron, MRF, DBN, SVM, ...) like working with numeric data, i.e. X ⊂ R^d.
How do we make heterogeneous data (images, sound, web data) numeric?

SOLUTION 1: Force a numeric representation by embedding all data in some Euclidean space R^d,
    Φ : X → R^d
  - Easy to do for images: (n × n) pixels ↦ R^{3n²} for RGB images
  - Easier said than done for text, emails, web data (e.g. BoW for text)
SOLUTION 2: Work with some distance/similarity function over the data X
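To make SOLUTION 1 concrete, here is a minimal sketch (my own illustration, not from the paper; the function names and the toy vocabulary are assumptions) of two such embeddings Φ : X → R^d: flattening an RGB image into R^{3n²}, and a bag-of-words vector for text.

```python
import numpy as np

def embed_image(img):
    """Phi for an (n x n x 3) RGB image: flatten the pixel grid into a vector in R^{3n^2}."""
    return np.asarray(img, dtype=float).reshape(-1)

def embed_text(doc, vocabulary):
    """Phi for text: a simple bag-of-words count vector over a fixed vocabulary."""
    index = {w: i for i, w in enumerate(vocabulary)}
    counts = np.zeros(len(vocabulary))
    for word in doc.lower().split():
        if word in index:
            counts[index[word]] += 1
    return counts

# Both objects now live in a Euclidean space where standard learners apply.
x_img = embed_image(np.random.randint(0, 256, size=(28, 28, 3)))   # a point in R^{3*28^2}
x_txt = embed_text("Please collect your Lunch Packet", ["lunch", "dollars", "investment"])
```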
Learning with Similarities
Classical algorithms that learn with similarities

Let K be a similarity measure (or, w.l.o.g., a distance measure).

Nearest neighbor classification:
$$\hat{\ell}(x) = \ell(\mathrm{NN}(x)), \qquad \mathrm{NN}(x) = \arg\max_{x' \in S} K(x, x')$$

Perceptron algorithm (X ⊂ R^d):
$$\hat{\ell}(x) = \mathrm{sgn}\left(\langle w, x\rangle\right) \text{ for some } w \in \mathbb{R}^d, \qquad \hat{\ell}(x) = \mathrm{sgn}\left(\sum_{x' \in S} \alpha(x')\, K(x, x')\, \ell(x')\right)$$
where $K(x, x') = \langle x, x'\rangle$ and $w = \sum_{x' \in S} \alpha(x')\, \ell(x')\, x'$.

SVMs allow the use of arbitrary positive semi-definite (PSD) kernels:
$$\hat{\ell}(x) = \mathrm{sgn}\left(\sum_{x' \in S} \alpha_{\mathrm{SVM}}(x')\, K(x, x')\, \ell(x')\right)$$
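As a quick illustration (a minimal sketch, not from the paper; K may be any user-supplied similarity function), here is how these two classical rules look in Python:

```python
import numpy as np

def nn_classify(x, S, labels, K):
    """1-nearest-neighbour rule under an arbitrary similarity K: ell_hat(x) = ell(NN(x))."""
    sims = [K(x, xp) for xp in S]
    return labels[int(np.argmax(sims))]

def weighted_sum_classify(x, S, labels, alpha, K):
    """sgn(sum_i alpha_i K(x, x_i) ell(x_i)) -- the common form behind the dual Perceptron and the SVM."""
    score = sum(a * K(x, xp) * y for a, xp, y in zip(alpha, S, labels))
    return 1 if score >= 0 else -1

# Example with the linear similarity K(x, x') = <x, x'>
K_lin = lambda u, v: float(np.dot(u, v))
S = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
labels = [+1, -1]
print(nn_classify(np.array([0.9, 0.1]), S, labels, K_lin))                        # -> 1
print(weighted_sum_classify(np.array([0.9, 0.1]), S, labels, [1.0, 1.0], K_lin))  # -> 1
```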
Learning with Similarities

A lot of work has gone into incorporating various similarity and distance measures into such frameworks [Pekalska and Duin, 2001, Weinberger and Saul, 2009].
A fair amount went into algorithms that, unlike SVMs, do not require PSD kernels [Goldfarb, 1984].
There is also some very nice work involving isometric embeddings into (pseudo-)Hilbert / Banach spaces [Gottlieb et al., 2010, von Luxburg and Bousquet, 2004, Haasdonk, 2005].
However, none of these addresses the issue of suitability of the similarity/distance measure to the learning task.
Suitable Similarities

Intuitively, a suitable similarity should give better classifier performance.
It is well known that the choice of kernel has a significant impact on SVM classifier performance.
In general, several domains have preferred notions of similarity (e.g. earth mover's distance for images).
Can formal notions of suitability lead to guaranteed performance?
  - For SVMs, suitability is formalized in terms of the margin offered by the PSD kernel in its RKHS.
  - Having a large margin does lead to generalization bounds [Shawe-Taylor et al., 1998, Balcan et al., 2006].
Can we do the same for non-PSD similarities?
Learning with Suitable Similarities
Learning with a Suitable Similarity Function

What is a good similarity function?

Intuitively, a good similarity function should at least respect the labeling of the domain.
It should not assign small similarity to points with the same label and large similarity to distinctly labeled points.

Definition ([Balcan and Blum, 2006])
A similarity K : X × X → R is said to be (ε, γ)-good for a classification problem if, for some weighing function w : X → [−1, 1], at least a (1 − ε) probability mass of examples x ∼ D satisfies
$$\mathbb{E}_{\substack{x' \sim \mathcal{D},\; \ell(x') = \ell(x) \\ x'' \sim \mathcal{D},\; \ell(x'') \neq \ell(x)}}\left[ w(x')\, K(x, x') - w(x'')\, K(x, x'') \right] \geq \gamma$$

In other words, according to the similarity function, most points are, on average, more similar to points of the same label.
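As a small illustration of this definition (my own sketch, using a finite labeled sample as a stand-in for D and defaulting to the weighing function w ≡ 1), one can estimate empirically how often the margin condition holds:

```python
import numpy as np

def empirical_goodness(X, y, K, gamma, w=None):
    """Fraction of sample points whose average within-class similarity beats the average
    cross-class similarity by at least gamma -- an empirical proxy for (eps, gamma)-goodness."""
    if w is None:
        w = lambda z: 1.0                      # the simplest admissible weighing function
    good = 0
    for i in range(len(X)):
        same = [w(X[j]) * K(X[i], X[j]) for j in range(len(X)) if j != i and y[j] == y[i]]
        diff = [w(X[j]) * K(X[i], X[j]) for j in range(len(X)) if y[j] != y[i]]
        if same and diff and np.mean(same) - np.mean(diff) >= gamma:
            good += 1
    return good / len(X)                       # roughly (1 - eps) at margin gamma
```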
Learning with a good similarity function

Theorem ([Balcan and Blum, 2006])
Given an (ε, γ)-good similarity function, for any δ > 0, given $n = \frac{16}{\gamma^2}\lg\frac{2}{\delta}$ labeled points $(x_i)_{i=1}^n$, the classifier $\hat{\ell}$ defined below has error at margin γ/2 no more than ε + δ with probability greater than 1 − δ:
$$\hat{\ell}(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} w(x_i)\, \ell(x_i)\, K(x, x_i) \right)$$

Notice that the classifier is very similar in form to the SVM and Perceptron classifiers.
Consequently, one can use those algorithms to learn this classifier as well.
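A minimal sketch of the classifier in this theorem (illustrative only; in practice the weighing function w is unknown, so the closing comment notes how the coefficients would actually be learned):

```python
def bb_classifier(landmarks, labels, weights, K):
    """ell_hat(x) = sgn(sum_i w(x_i) ell(x_i) K(x, x_i)) -- the classifier in the theorem above."""
    def predict(x):
        score = sum(w * y * K(x, xi) for w, y, xi in zip(weights, labels, landmarks))
        return 1 if score >= 0 else -1
    return predict

# Since w is unknown in practice, one typically learns the coefficients with a linear
# method (e.g. a linear SVM) over the "similarity map" features (K(x, x_1), ..., K(x, x_n)).
```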
Learning with a Suitable Distance Function

What is a good distance function?

Definition ([Wang et al., 2007])
A distance function d : X × X → R is said to be (ε, γ, B)-good for a classification problem if there exist two class conditional probability distributions D₊ and D₋ with D₊(x)/D(x) < √B and D₋(x)/D(x) < √B for all x ∈ X, such that at least a (1 − ε) probability mass of examples x ∼ D satisfies
$$\Pr_{\substack{x' \sim \mathcal{D}_+ \\ x'' \sim \mathcal{D}_-}}\left[ \ell(x)\left( \ell(x')\, d(x, x') - \ell(x'')\, d(x, x'') \right) < 0 \right] \geq \frac{1}{2} + \gamma$$

The definition expects the distance function to place dissimilarly labeled points farther away than similarly labeled points.
Yet again, this yields a classifier with guaranteed generalization properties.
Learning with a good distance function

Theorem ([Wang et al., 2007])
Given an (ε, γ, B)-good distance function, for any δ > 0, given $n = \frac{4B^2}{\gamma^2}\lg\frac{1}{\delta}$ pairs of positively and negatively labeled points $(x_i^+, x_i^-)_{i=1}^n$, the classifier $\hat{\ell}$ defined below has error at margin γ/B no more than ε + δ with probability greater than 1 − δ:
$$\hat{\ell}(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \beta_i\, \mathrm{sgn}\left( d(x, x_i^+) - d(x, x_i^-) \right) \right), \qquad \sum_{i=1}^{n} \beta_i = 1, \; \beta_i \geq 0$$

This naturally lends itself to a boosting-like implementation.
Each of the pairs yields a stump $\mathrm{sgn}\left( d(x, x_i^+) - d(x, x_i^-) \right)$.
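A minimal sketch of the stump-based classifier this theorem describes (my illustration; a real implementation would choose the βᵢ with a boosting procedure, whereas the uniform weights here are just a placeholder):

```python
import numpy as np

def make_stump(x_pos, x_neg, d):
    """Weak classifier from one (positive, negative) landmark pair: sgn(d(x, x+) - d(x, x-))."""
    def stump(x):
        s = np.sign(d(x, x_pos) - d(x, x_neg))
        return s if s != 0 else 1.0            # break ties towards +1
    return stump

def distance_ensemble(pairs, d, betas=None):
    """ell_hat(x) = sgn(sum_i beta_i * stump_i(x)) with beta_i >= 0 summing to one."""
    stumps = [make_stump(xp, xn, d) for xp, xn in pairs]
    if betas is None:
        betas = np.ones(len(stumps)) / len(stumps)  # uniform placeholder; boosting would tune these
    return lambda x: 1 if sum(b * h(x) for b, h in zip(betas, stumps)) >= 0 else -1
```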
Data-sensitive Notions of Suitability
A unified notion of what is a good similarity/distance

Disparate as the last two models may seem, they are in fact quite related to each other.
Motivated by this observation, we propose a notion of goodness that is data-sensitive.
This notion allows us to tune the goodness criterion itself, allowing for better classifiers.
The resulting model subsumes the previous two models.
Consequently, the model does not require separate treatment of similarity and distance functions either.
What is a good similarity/distance function?

Definition (K. and Jain, 2011)
A similarity function K : X × X → R is said to be (ε, γ, B)-good for a classification problem if, for some antisymmetric transfer function f : R → [−C_f, C_f] and some weighing function w : X × X → [−B, B], at least a (1 − ε) probability mass of examples x ∼ D satisfies
$$\mathbb{E}_{\substack{x' \sim \mathcal{D},\; \ell(x') = \ell(x) \\ x'' \sim \mathcal{D},\; \ell(x'') \neq \ell(x)}}\left[ w(x', x'')\, f\left( K(x, x') - K(x, x'') \right) \right] \geq 2\, C_f\, \gamma$$

With appropriate settings of the weighing function and the transfer function, the previous two models can be recovered.
Learning with Data-sensitive Notions of Suitability

The learning algorithm is not as simple as before, since the guarantees we give hold only if a good transfer function is chosen.
Let us first see how, given a (good) transfer function, we can learn a (good) classifier.
We will later plug in the routines to learn the transfer function as well.
Algorithm 1 LEARN-DISSIM
Require: A similarity function K, landmark pairs L = (x_i^+, x_i^-)_{i=1}^n, a good transfer function f.
Ensure: A classifier ℓ̂ : X → {−1,+1}
 1: Define Φ_L : X → R^n as Φ_L : x ↦ ( f(K(x, x_i^+) − K(x, x_i^-)) )_{i=1}^n
 2: Get a labeled training set T = {t_j}_{j=1}^{n′} ⊂ X sampled from D.
 3: Let T′ ← {Φ_L(t_j)}_{j=1}^{n′} ⊂ R^n be the data set embedded in R^n.
 4: Learn a linear hyperplane over R^n using T′: ℓ_lin ← LEARN-LINEAR(T′).
 5: Let ℓ̂ : X → {−1,+1} be defined as ℓ̂ : x ↦ ℓ_lin(Φ_L(x)).
 6: return ℓ̂

LEARN-LINEAR may be taken to be any linear hyperplane learning algorithm, such as the Perceptron or an SVM.
The above procedure essentially creates a data-driven, problem-specific embedding of the domain X into a Euclidean space.
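Below is a minimal Python sketch of LEARN-DISSIM (my illustration, not the authors' code; it assumes scikit-learn's LinearSVC as the LEARN-LINEAR subroutine, though any linear learner such as the Perceptron would do):

```python
import numpy as np
from sklearn.svm import LinearSVC   # plays the role of LEARN-LINEAR

def learn_dissim(K, landmark_pairs, f, T, T_labels):
    """LEARN-DISSIM: embed points through the landmark pairs and transfer function,
    then learn a linear classifier over the embedded training set."""
    def phi(x):
        # Phi_L(x) = ( f(K(x, x_i^+) - K(x, x_i^-)) )_{i=1..n}
        return np.array([f(K(x, xp) - K(x, xn)) for xp, xn in landmark_pairs])

    T_embedded = np.array([phi(t) for t in T])        # the training set embedded in R^n
    linear = LinearSVC().fit(T_embedded, T_labels)    # the LEARN-LINEAR step
    return lambda x: int(linear.predict(phi(x).reshape(1, -1))[0])   # ell_hat = ell_lin(Phi_L(x))
```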
The results given earlier guarantee small classification error at large margin.
However, this is not amenable to efficient algorithms, as hyperplane classification error is NP-hard to minimize [Garey and Johnson, 1979, Arora et al., 1997].
We therefore provide our guarantees in terms of smooth Lipschitz losses, such as the hinge loss and the log loss, that can be efficiently minimized over large datasets.
Working with surrogate loss functions

Definition (K. and Jain, 2011)
A similarity function is said to be (ε, B)-good with respect to a loss function L : R → R⁺ if, for some transfer function f : R → R and some weighing function w : X × X → [−B, B], we have $\mathbb{E}_{x \sim \mathcal{D}}[L(G(x))] \leq \epsilon$, where
$$G(x) = \mathbb{E}_{\substack{x' \sim \mathcal{D},\; \ell(x') = \ell(x) \\ x'' \sim \mathcal{D},\; \ell(x'') \neq \ell(x)}}\left[ w(x', x'')\, f\left( K(x, x') - K(x, x'') \right) \right]$$

Theorem (K. and Jain, 2011)
If K is an (ε, B)-good similarity function with respect to a C_L-Lipschitz loss function L, then for any ε₁ > 0, with probability at least 1 − δ over the choice of $d = \frac{16 B^2 C_L^2}{\epsilon_1^2} \ln\frac{4B}{\delta \epsilon_1}$ landmark pairs, the expected loss of the classifier $\hat{\ell}(x)$ returned by LEARN-DISSIM with respect to L satisfies $\mathbb{E}_{x}\left[ L(\hat{\ell}(x)) \right] \leq \epsilon + \epsilon_1$.
Learning the Best Notion of Suitability

Learning the transfer function

We give uniform convergence guarantees that enable standard ERM-based routines to recover the best transfer function from any compact class of antisymmetric functions.
This yields a nested learning problem, with the ERM-based transfer function learning algorithm calling the classifier learning algorithm as a subroutine.
For any transfer function f and arbitrary set of landmarks L, let L(f) = E_{x∼D}[L(G(x))], and let L(f, L) denote the generalization loss of the best classifier that uses the embedding Φ_L defined by the landmarks L.
The earlier result shows that for a fixed f, for a large enough random L, L(f, L) ≤ L(f) + ε₁.
Theorem (K. and Jain, 2011)
Let F be a compact class of transfer functions with respect to the infinity norm and ε₁, δ > 0. Let N(F, r) be the size of the smallest ε-net over F with respect to the infinity norm at scale $r = \frac{\epsilon_1}{4 C_L B}$. Taking
$$n = \frac{64 B^2 C_L^2}{\epsilon_1^2} \ln\left( \frac{16 B \cdot \mathcal{N}(\mathcal{F}, r)}{\delta \epsilon_1} \right)$$
random landmark pairs, we have, with probability greater than (1 − δ),
$$\sup_{f \in \mathcal{F}} \left[ \left| \mathcal{L}(f, L) - \mathcal{L}(f) \right| \right] \leq \epsilon_1$$

Algorithm 2 FTUNE
Require: A family of transfer functions F, a similarity function K, a loss function L.
Ensure: An optimal transfer function f∗ ∈ F.
 1: Select d landmark pairs L.
 2: for all f ∈ F do
 3:     w_f ← LEARN-DISSIM(K, L, f),  L_f ← L(f, L)
 4: end for
 5: f∗ ← arg min_{f∈F} L_f
 6: return f∗
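A minimal sketch of FTUNE over a finite family of transfer functions (illustrative; it reuses the learn_dissim sketch given earlier and assumes a held-out validation set as a stand-in for estimating L(f, L)):

```python
import numpy as np

def ftune(transfer_family, K, landmark_pairs, T, T_labels, V, V_labels):
    """FTUNE-style selection: pick the transfer function whose LEARN-DISSIM classifier
    attains the smallest loss on a held-out validation set."""
    best_f, best_loss, best_clf = None, np.inf, None
    for f in transfer_family:
        clf = learn_dissim(K, landmark_pairs, f, T, T_labels)          # subroutine from the earlier sketch
        loss = np.mean([clf(x) != y for x, y in zip(V, V_labels)])     # 0-1 validation loss as a stand-in for L(f, L)
        if loss < best_loss:
            best_f, best_loss, best_clf = f, loss, clf
    return best_f, best_clf

# A small family of antisymmetric transfer functions one might search over:
transfer_family = [np.tanh, np.sign, lambda t: t, lambda t: t ** 3]
```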
Intelligent choice of landmark points

If landmarks are clumped together, then all points will get a similar embedding and linear separation would be impossible.
Thus, as a heuristic on small datasets, we promote diversity among the landmarks.
On large datasets, FTUNE itself is able to recover the best transfer function as it does not over-fit.

Algorithm 3 DSELECT
Require: A training set T.
Ensure: A set of n landmark pairs.
 1: S ← RANDOM-ELEMENT(T), L ← ∅
 2: for j = 2 to n do
 3:     z ← arg min_{x∈T} Σ_{x′∈S} K(x, x′)
 4:     S ← S ∪ {z}, T ← T \ {z}
 5: end for
 6: for j = 1 to n do
 7:     Sample z₁, z₂ from S with replacement s.t. ℓ(z₁) = 1, ℓ(z₂) = −1
 8:     L ← L ∪ {(z₁, z₂)}
 9: end for
10: return L
Data-sensitive Notions of Suitability Learning the Best Notion of Suitability
Intelligent choice of landmark points
If landmarks are clumpedtogether, then all points willget a similar embeddingand linear separationwould be impossibleThus we promote diversityamong the landmarks as aheuristic on small datasets
On large datasets FTUNEitself is able to recover thebest transfer function as itdoes not over-fit
Algorithm 3 DSELECTRequire: A training set T .Ensure: A set of n landmark pairs.
1: S ← RANDOM-ELEMENT(T ),L ← ∅2: for j = 2 to n do3: z ← arg min
x∈T
∑x′∈S
K (x , x ′).
4: S ← S ∪ {z}, T ← T\{z}5: end for6: for j = 1 to n do7: Sample z1, z2 from S with replacement s.t.
`(z1) = 1, `(z2) = −18: L ← L ∪ {(z1, z2)}9: end for
10: return L
P. Kar and P. Jain (IITK/MSRI) Similarity-based Learning November 3, 2011 22 / 29
Data-sensitive Notions of Suitability Learning the Best Notion of Suitability
Intelligent choice of landmark points
If landmarks are clumpedtogether, then all points willget a similar embeddingand linear separationwould be impossibleThus we promote diversityamong the landmarks as aheuristic on small datasetsOn large datasets FTUNEitself is able to recover thebest transfer function as itdoes not over-fit
Algorithm 3 DSELECTRequire: A training set T .Ensure: A set of n landmark pairs.
1: S ← RANDOM-ELEMENT(T ),L ← ∅2: for j = 2 to n do3: z ← arg min
x∈T
∑x′∈S
K (x , x ′).
4: S ← S ∪ {z}, T ← T\{z}5: end for6: for j = 1 to n do7: Sample z1, z2 from S with replacement s.t.
`(z1) = 1, `(z2) = −18: L ← L ∪ {(z1, z2)}9: end for
10: return L
P. Kar and P. Jain (IITK/MSRI) Similarity-based Learning November 3, 2011 22 / 29
Data-sensitive Notions of Suitability Learning the Best Notion of Suitability
Intelligent choice of landmark points
If landmarks are clumpedtogether, then all points willget a similar embeddingand linear separationwould be impossibleThus we promote diversityamong the landmarks as aheuristic on small datasetsOn large datasets FTUNEitself is able to recover thebest transfer function as itdoes not over-fit
Algorithm 3 DSELECTRequire: A training set T .Ensure: A set of n landmark pairs.
1: S ← RANDOM-ELEMENT(T ),L ← ∅2: for j = 2 to n do3: z ← arg min
x∈T
∑x′∈S
K (x , x ′).
4: S ← S ∪ {z}, T ← T\{z}5: end for6: for j = 1 to n do7: Sample z1, z2 from S with replacement s.t.
`(z1) = 1, `(z2) = −18: L ← L ∪ {(z1, z2)}9: end for
10: return L
P. Kar and P. Jain (IITK/MSRI) Similarity-based Learning November 3, 2011 22 / 29
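The following is a minimal Python sketch of DSELECT, assuming examples are referred to by integer indices, labels[i] ∈ {+1, −1}, and K(i, j) returns the similarity between examples i and j; the interface and helper names are illustrative, not the authors' code.

import random

def dselect(indices, labels, K, n):
    """Greedily pick n mutually dissimilar landmarks, then sample n opposite-label pairs."""
    pool = list(indices)
    S = [pool.pop(random.randrange(len(pool)))]            # step 1: S <- RANDOM-ELEMENT(T)
    for _ in range(2, n + 1):                               # steps 2-5: add the point least similar to S
        z = min(pool, key=lambda x: sum(K(x, xp) for xp in S))
        S.append(z)
        pool.remove(z)
    pos = [z for z in S if labels[z] == +1]                  # split landmarks by label
    neg = [z for z in S if labels[z] == -1]
    assert pos and neg, "both classes must appear among the selected landmarks"
    return [(random.choice(pos), random.choice(neg))         # steps 6-10: sample pairs with replacement
            for _ in range(n)]

The greedy argmin in steps 2-5 pushes each new landmark away from those already chosen, which is what makes the resulting embedding coordinates less redundant.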
Data-sensitive Notions of Suitability Results
Results
[Four accuracy-vs-landmarks plots: AmazonBinary, Amazon47, Mirex07, and FaceRec. x-axis: Number of Landmarks; y-axis: Accuracy. Methods compared: FTUNE+D, FTUNE, BBS+D, BBS, DBOOST.]
Data-sensitive Notions of Suitability Results
Results
[Four accuracy-vs-landmarks plots: Isolet, Letters, Pen-digits, and Opt-digits. x-axis: Number of Landmarks; y-axis: Accuracy. Methods compared: FTUNE (Single), FTUNE (Multiple), BBS, DBOOST.]
Data-sensitive Notions of Suitability Results

Discussion

BBS performs reasonably well for small landmarking sizes, while DBOOST performs well for large landmarking sizes.
In contrast, our method consistently outperforms the existing methods in both scenarios.
Since FTUNE selects its output by way of validation, it is susceptible to over-fitting on small datasets.
In these cases, DSELECT (intuitively) removes redundancies among the landmark points, allowing FTUNE to recover the best transfer function.
Data-sensitive Notions of Suitability Results
Thanks
Preprint available at http://www.cse.iitk.ac.in/users/purushot/
References
References I
Arora, S., Babai, L., Stern, J., and Sweedyk, Z. (1997). The Hardness of Approximate Optima in Lattices, Codes, and Systems of Linear Equations. Journal of Computer and System Sciences, 54(2):317–331.

Balcan, M.-F. and Blum, A. (2006). On a Theory of Learning with Similarity Functions. In International Conference on Machine Learning, pages 73–80.

Balcan, M.-F., Blum, A., and Vempala, S. (2006). Kernels as Features: On Kernels, Margins, and Low-dimensional Mappings. Machine Learning, 65(1):79–94.

Garey, M. R. and Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.

Goldfarb, L. (1984). A Unified Approach to Pattern Recognition. Pattern Recognition, 17(5):575–582.
References
References II
Gottlieb, L.-A., Kontorovich, A. L., and Krauthgamer, R. (2010). Efficient Classification for Metric Data. In Annual Conference on Computational Learning Theory.

Haasdonk, B. (2005). Feature Space Interpretation of SVMs with Indefinite Kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492.

Kar, P. and Jain, P. (2011). Similarity-based Learning via Data Driven Embeddings. In 25th Annual Conference on Neural Information Processing Systems. (to appear).

Pekalska, E. and Duin, R. P. W. (2001). On Combining Dissimilarity Representations. In Multiple Classifier Systems, pages 359–368.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M. (1998). Structural Risk Minimization Over Data-Dependent Hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940.
References
References III
von Luxburg, U. and Bousquet, O. (2004). Distance-Based Classification with Lipschitz Functions. Journal of Machine Learning Research, 5:669–695.

Wang, L., Yang, C., and Feng, J. (2007). On Learning with Dissimilarity Functions. In International Conference on Machine Learning, pages 991–998.

Weinberger, K. Q. and Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244.