Similarity-based Learning via Data Driven Embeddings∗

Purushottam Kar¹    Prateek Jain²

¹ Indian Institute of Technology Kanpur
² Microsoft Research India, Bengaluru

November 3, 2011

∗ To appear in the proceedings of NIPS 2011
Outline

1 An Introduction to Learning
2 A Brief History of Learning with Similarities
3 Learning with Suitable Similarities
    Learning with a Suitable Similarity Function
    Learning with a Suitable Distance Function
4 Data-sensitive Notions of Suitability
    Learning with Data-sensitive Notions of Suitability
    Learning the Best Notion of Suitability
    Results
5 References
Learning
Digit Classification†

†MNIST database: http://yann.lecun.com/exdb/mnist/
Spam mail detection

Dear Junta,
The Hall-8 mess will be closed for the occasion of Diwali at lunch & dinner time. The breakfast will be served along with Lunch packets tomorrow (26th October, 2011).
Please collect your Lunch Packet. The mess would resume its normal working from 27th October.
A legitimate mail

Hello, I am resending my previous mail to you, I hope you do get it this time around and understand its content fully. I am contacting you briefly based on the Investment of Forty Five Million Dollars (US$ 45,000,000:00) in your country, as I presently have a client who is interested in investing in your country. Sincerely Yours, J. Costa
Most likely a spam mail

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING SEMINAR SERIES
Departmental Colloquium
Title: Similarity-based Learning via Data Driven Embeddings
Speaker: Purushottam Kar
Affiliation: Ph.D. Scholar, CSE Dept., IIT Kanpur
To each his own ...
More formally ...

We are working over a domain X and wish to learn a target classifier over the domain, ℓ : X → {−1,+1}.
We are given training points S = {x_1, x_2, . . . , x_n} sampled from some distribution D over X, along with their true labels {ℓ(x_1), . . . , ℓ(x_n)}.
Our goal is to output a classifier $\hat{\ell} : X \to \{-1,+1\}$ that mostly gives out the true labels, i.e.

$$\Pr_{x \sim \mathcal{D}}\left[ \hat{\ell}(x) \neq \ell(x) \right] < \epsilon$$
Representing the data

Most learning algorithms (Perceptron, MRF, DBN, SVM, ...) like working with numeric data, i.e. X ⊂ R^d.
How do we make heterogeneous data (images, sound, web data) numeric?

SOLUTION 1: Force a numeric representation by embedding all data in some Euclidean space R^d,
    Φ : X → R^d
  - Easy to do for images: (n × n) pixels ↦ R^{3n²} for RGB images
  - Easier said than done for text, emails, web data (e.g. BoW for text)
SOLUTION 2: Work with some distance/similarity function over the data X
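To make SOLUTION 1 concrete, here is a minimal sketch (my own illustration, not from the paper; the function names and the toy vocabulary are assumptions) of two such embeddings Φ : X → R^d: flattening an RGB image into R^{3n²}, and a bag-of-words vector for text.

```python
import numpy as np

def embed_image(img):
    """Phi for an (n x n x 3) RGB image: flatten the pixel grid into a vector in R^{3n^2}."""
    return np.asarray(img, dtype=float).reshape(-1)

def embed_text(doc, vocabulary):
    """Phi for text: a simple bag-of-words count vector over a fixed vocabulary."""
    index = {w: i for i, w in enumerate(vocabulary)}
    counts = np.zeros(len(vocabulary))
    for word in doc.lower().split():
        if word in index:
            counts[index[word]] += 1
    return counts

# Both objects now live in a Euclidean space where standard learners apply.
x_img = embed_image(np.random.randint(0, 256, size=(28, 28, 3)))   # a point in R^{3*28^2}
x_txt = embed_text("Please collect your Lunch Packet", ["lunch", "dollars", "investment"])
```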
Learning with Similarities
Classical algorithms that learn with similarities

Let K be a similarity measure (or, w.l.o.g., a distance measure).

Nearest neighbor classification:
$$\hat{\ell}(x) = \ell(\mathrm{NN}(x)), \qquad \mathrm{NN}(x) = \arg\max_{x' \in S} K(x, x')$$

Perceptron algorithm (X ⊂ R^d):
$$\hat{\ell}(x) = \mathrm{sgn}\left(\langle w, x\rangle\right) \text{ for some } w \in \mathbb{R}^d, \qquad \hat{\ell}(x) = \mathrm{sgn}\left(\sum_{x' \in S} \alpha(x')\, K(x, x')\, \ell(x')\right)$$
where $K(x, x') = \langle x, x'\rangle$ and $w = \sum_{x' \in S} \alpha(x')\, \ell(x')\, x'$.

SVMs allow the use of arbitrary positive semi-definite (PSD) kernels:
$$\hat{\ell}(x) = \mathrm{sgn}\left(\sum_{x' \in S} \alpha_{\mathrm{SVM}}(x')\, K(x, x')\, \ell(x')\right)$$
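As a quick illustration (a minimal sketch, not from the paper; K may be any user-supplied similarity function), here is how these two classical rules look in Python:

```python
import numpy as np

def nn_classify(x, S, labels, K):
    """1-nearest-neighbour rule under an arbitrary similarity K: ell_hat(x) = ell(NN(x))."""
    sims = [K(x, xp) for xp in S]
    return labels[int(np.argmax(sims))]

def weighted_sum_classify(x, S, labels, alpha, K):
    """sgn(sum_i alpha_i K(x, x_i) ell(x_i)) -- the common form behind the dual Perceptron and the SVM."""
    score = sum(a * K(x, xp) * y for a, xp, y in zip(alpha, S, labels))
    return 1 if score >= 0 else -1

# Example with the linear similarity K(x, x') = <x, x'>
K_lin = lambda u, v: float(np.dot(u, v))
S = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
labels = [+1, -1]
print(nn_classify(np.array([0.9, 0.1]), S, labels, K_lin))                        # -> 1
print(weighted_sum_classify(np.array([0.9, 0.1]), S, labels, [1.0, 1.0], K_lin))  # -> 1
```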
Learning with Similarities

A lot of work has gone into incorporating various similarity and distance measures into such frameworks [Pekalska and Duin, 2001, Weinberger and Saul, 2009].
A fair amount went into algorithms that, unlike SVMs, do not require PSD kernels [Goldfarb, 1984].
There is also some very nice work involving isometric embeddings into (pseudo-)Hilbert / Banach spaces [Gottlieb et al., 2010, von Luxburg and Bousquet, 2004, Haasdonk, 2005].
However, none of these addresses the issue of suitability of the similarity/distance measure to the learning task.
Suitable Similarities

Intuitively, a suitable similarity should give better classifier performance.
It is well known that the choice of kernel has a significant impact on SVM classifier performance.
In general, several domains have preferred notions of similarity (e.g. earth mover's distance for images).
Can formal notions of suitability lead to guaranteed performance?
  - For SVMs, suitability is formalized in terms of the margin offered by the PSD kernel in its RKHS.
  - Having a large margin does lead to generalization bounds [Shawe-Taylor et al., 1998, Balcan et al., 2006].
Can we do the same for non-PSD similarities?
Learning with Suitable Similarities
Learning with a Suitable Similarity Function

What is a good similarity function?

Intuitively, a good similarity function should at least respect the labeling of the domain.
It should not assign small similarity to points with the same label and large similarity to distinctly labeled points.

Definition ([Balcan and Blum, 2006])
A similarity K : X × X → R is said to be (ε, γ)-good for a classification problem if, for some weighing function w : X → [−1, 1], at least a (1 − ε) probability mass of examples x ∼ D satisfies
$$\mathbb{E}_{\substack{x' \sim \mathcal{D},\; \ell(x') = \ell(x) \\ x'' \sim \mathcal{D},\; \ell(x'') \neq \ell(x)}}\left[ w(x')\, K(x, x') - w(x'')\, K(x, x'') \right] \geq \gamma$$

In other words, according to the similarity function, most points are, on average, more similar to points of the same label.
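As a small illustration of this definition (my own sketch, using a finite labeled sample as a stand-in for D and defaulting to the weighing function w ≡ 1), one can estimate empirically how often the margin condition holds:

```python
import numpy as np

def empirical_goodness(X, y, K, gamma, w=None):
    """Fraction of sample points whose average within-class similarity beats the average
    cross-class similarity by at least gamma -- an empirical proxy for (eps, gamma)-goodness."""
    if w is None:
        w = lambda z: 1.0                      # the simplest admissible weighing function
    good = 0
    for i in range(len(X)):
        same = [w(X[j]) * K(X[i], X[j]) for j in range(len(X)) if j != i and y[j] == y[i]]
        diff = [w(X[j]) * K(X[i], X[j]) for j in range(len(X)) if y[j] != y[i]]
        if same and diff and np.mean(same) - np.mean(diff) >= gamma:
            good += 1
    return good / len(X)                       # roughly (1 - eps) at margin gamma
```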
Learning with a good similarity function

Theorem ([Balcan and Blum, 2006])
Given an (ε, γ)-good similarity function, for any δ > 0, given $n = \frac{16}{\gamma^2}\lg\frac{2}{\delta}$ labeled points $(x_i)_{i=1}^n$, the classifier $\hat{\ell}$ defined below has error at margin γ/2 no more than ε + δ with probability greater than 1 − δ:
$$\hat{\ell}(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} w(x_i)\, \ell(x_i)\, K(x, x_i) \right)$$

Notice that the classifier is very similar in form to the SVM and Perceptron classifiers.
Consequently, one can use those algorithms to learn this classifier as well.
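A minimal sketch of the classifier in this theorem (illustrative only; in practice the weighing function w is unknown, so the closing comment notes how the coefficients would actually be learned):

```python
def bb_classifier(landmarks, labels, weights, K):
    """ell_hat(x) = sgn(sum_i w(x_i) ell(x_i) K(x, x_i)) -- the classifier in the theorem above."""
    def predict(x):
        score = sum(w * y * K(x, xi) for w, y, xi in zip(weights, labels, landmarks))
        return 1 if score >= 0 else -1
    return predict

# Since w is unknown in practice, one typically learns the coefficients with a linear
# method (e.g. a linear SVM) over the "similarity map" features (K(x, x_1), ..., K(x, x_n)).
```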
Learning with a Suitable Distance Function

What is a good distance function?

Definition ([Wang et al., 2007])
A distance function d : X × X → R is said to be (ε, γ, B)-good for a classification problem if there exist two class conditional probability distributions D₊ and D₋ with D₊(x)/D(x) < √B and D₋(x)/D(x) < √B for all x ∈ X, such that at least a (1 − ε) probability mass of examples x ∼ D satisfies
$$\Pr_{\substack{x' \sim \mathcal{D}_+ \\ x'' \sim \mathcal{D}_-}}\left[ \ell(x)\left( \ell(x')\, d(x, x') - \ell(x'')\, d(x, x'') \right) < 0 \right] \geq \frac{1}{2} + \gamma$$

The definition expects the distance function to place dissimilarly labeled points farther away than similarly labeled points.
Yet again, this yields a classifier with guaranteed generalization properties.
Learning with a good distance function

Theorem ([Wang et al., 2007])
Given an (ε, γ, B)-good distance function, for any δ > 0, given $n = \frac{4B^2}{\gamma^2}\lg\frac{1}{\delta}$ pairs of positively and negatively labeled points $(x_i^+, x_i^-)_{i=1}^n$, the classifier $\hat{\ell}$ defined below has error at margin γ/B no more than ε + δ with probability greater than 1 − δ:
$$\hat{\ell}(x) = \mathrm{sgn}\left( \sum_{i=1}^{n} \beta_i\, \mathrm{sgn}\left( d(x, x_i^+) - d(x, x_i^-) \right) \right), \qquad \sum_{i=1}^{n} \beta_i = 1, \; \beta_i \geq 0$$

This naturally lends itself to a boosting-like implementation.
Each of the pairs yields a stump $\mathrm{sgn}\left( d(x, x_i^+) - d(x, x_i^-) \right)$.
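A minimal sketch of the stump-based classifier this theorem describes (my illustration; a real implementation would choose the βᵢ with a boosting procedure, whereas the uniform weights here are just a placeholder):

```python
import numpy as np

def make_stump(x_pos, x_neg, d):
    """Weak classifier from one (positive, negative) landmark pair: sgn(d(x, x+) - d(x, x-))."""
    def stump(x):
        s = np.sign(d(x, x_pos) - d(x, x_neg))
        return s if s != 0 else 1.0            # break ties towards +1
    return stump

def distance_ensemble(pairs, d, betas=None):
    """ell_hat(x) = sgn(sum_i beta_i * stump_i(x)) with beta_i >= 0 summing to one."""
    stumps = [make_stump(xp, xn, d) for xp, xn in pairs]
    if betas is None:
        betas = np.ones(len(stumps)) / len(stumps)  # uniform placeholder; boosting would tune these
    return lambda x: 1 if sum(b * h(x) for b, h in zip(betas, stumps)) >= 0 else -1
```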
Data-sensitive Notions of Suitability
A unified notion of what is a good similarity/distance

Disparate as the last two models may seem, they are in fact quite related to each other.
Motivated by this observation, we propose a notion of goodness that is data-sensitive.
This notion allows us to tune the goodness criterion itself, allowing for better classifiers.
The resulting model subsumes the previous two models.
Consequently, the model does not require separate treatment of similarity and distance functions either.
What is a good similarity/distance function?

Definition (K. and Jain, 2011)
A similarity function K : X × X → R is said to be (ε, γ, B)-good for a classification problem if, for some antisymmetric transfer function f : R → [−C_f, C_f] and some weighing function w : X × X → [−B, B], at least a (1 − ε) probability mass of examples x ∼ D satisfies
$$\mathbb{E}_{\substack{x' \sim \mathcal{D},\; \ell(x') = \ell(x) \\ x'' \sim \mathcal{D},\; \ell(x'') \neq \ell(x)}}\left[ w(x', x'')\, f\left( K(x, x') - K(x, x'') \right) \right] \geq 2\, C_f\, \gamma$$

With appropriate settings of the weighing function and the transfer function, the previous two models can be recovered.
Learning with Data-sensitive Notions of Suitability

The learning algorithm is not as simple as before, since the guarantees we give hold only if a good transfer function is chosen.
Let us first see how, given a (good) transfer function, we can learn a (good) classifier.
We will later plug in the routines to learn the transfer function as well.
Algorithm 1 LEARN-DISSIM
Require: A similarity function K, landmark pairs L = (x_i^+, x_i^-)_{i=1}^n, a good transfer function f.
Ensure: A classifier ℓ̂ : X → {−1,+1}
 1: Define Φ_L : X → R^n as Φ_L : x ↦ ( f(K(x, x_i^+) − K(x, x_i^-)) )_{i=1}^n
 2: Get a labeled training set T = {t_j}_{j=1}^{n′} ⊂ X sampled from D.
 3: Let T′ ← {Φ_L(t_j)}_{j=1}^{n′} ⊂ R^n be the data set embedded in R^n.
 4: Learn a linear hyperplane over R^n using T′: ℓ_lin ← LEARN-LINEAR(T′).
 5: Let ℓ̂ : X → {−1,+1} be defined as ℓ̂ : x ↦ ℓ_lin(Φ_L(x)).
 6: return ℓ̂

LEARN-LINEAR may be taken to be any linear hyperplane learning algorithm, such as the Perceptron or an SVM.
The above procedure essentially creates a data-driven, problem-specific embedding of the domain X into a Euclidean space.
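Below is a minimal Python sketch of LEARN-DISSIM (my illustration, not the authors' code; it assumes scikit-learn's LinearSVC as the LEARN-LINEAR subroutine, though any linear learner such as the Perceptron would do):

```python
import numpy as np
from sklearn.svm import LinearSVC   # plays the role of LEARN-LINEAR

def learn_dissim(K, landmark_pairs, f, T, T_labels):
    """LEARN-DISSIM: embed points through the landmark pairs and transfer function,
    then learn a linear classifier over the embedded training set."""
    def phi(x):
        # Phi_L(x) = ( f(K(x, x_i^+) - K(x, x_i^-)) )_{i=1..n}
        return np.array([f(K(x, xp) - K(x, xn)) for xp, xn in landmark_pairs])

    T_embedded = np.array([phi(t) for t in T])        # the training set embedded in R^n
    linear = LinearSVC().fit(T_embedded, T_labels)    # the LEARN-LINEAR step
    return lambda x: int(linear.predict(phi(x).reshape(1, -1))[0])   # ell_hat = ell_lin(Phi_L(x))
```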
The results given earlier guarantee small classification error at large margin.
However, this is not amenable to efficient algorithms, as hyperplane classification error is NP-hard to minimize [Garey and Johnson, 1979, Arora et al., 1997].
We therefore provide our guarantees in terms of smooth Lipschitz losses, such as the hinge loss and the log loss, that can be efficiently minimized over large datasets.
Working with surrogate loss functions

Definition (K. and Jain, 2011)
A similarity function is said to be (ε, B)-good with respect to a loss function L : R → R⁺ if, for some transfer function f : R → R and some weighing function w : X × X → [−B, B], we have $\mathbb{E}_{x \sim \mathcal{D}}[L(G(x))] \leq \epsilon$, where
$$G(x) = \mathbb{E}_{\substack{x' \sim \mathcal{D},\; \ell(x') = \ell(x) \\ x'' \sim \mathcal{D},\; \ell(x'') \neq \ell(x)}}\left[ w(x', x'')\, f\left( K(x, x') - K(x, x'') \right) \right]$$

Theorem (K. and Jain, 2011)
If K is an (ε, B)-good similarity function with respect to a C_L-Lipschitz loss function L, then for any ε₁ > 0, with probability at least 1 − δ over the choice of $d = \frac{16 B^2 C_L^2}{\epsilon_1^2} \ln\frac{4B}{\delta \epsilon_1}$ landmark pairs, the expected loss of the classifier $\hat{\ell}(x)$ returned by LEARN-DISSIM with respect to L satisfies $\mathbb{E}_{x}\left[ L(\hat{\ell}(x)) \right] \leq \epsilon + \epsilon_1$.
Learning the Best Notion of Suitability

Learning the transfer function

We give uniform convergence guarantees that enable standard ERM-based routines to recover the best transfer function from any compact class of antisymmetric functions.
This yields a nested learning problem, with the ERM-based transfer function learning algorithm calling the classifier learning algorithm as a subroutine.
For any transfer function f and arbitrary set of landmarks L, let L(f) = E_{x∼D}[L(G(x))], and let L(f, L) denote the generalization loss of the best classifier that uses the embedding Φ_L defined by the landmarks L.
The earlier result shows that for a fixed f, for a large enough random L, L(f, L) ≤ L(f) + ε₁.
Theorem (K. and Jain, 2011)
Let F be a compact class of transfer functions with respect to the infinity norm and ε₁, δ > 0. Let N(F, r) be the size of the smallest ε-net over F with respect to the infinity norm at scale $r = \frac{\epsilon_1}{4 C_L B}$. Taking
$$n = \frac{64 B^2 C_L^2}{\epsilon_1^2} \ln\left( \frac{16 B \cdot \mathcal{N}(\mathcal{F}, r)}{\delta \epsilon_1} \right)$$
random landmark pairs, we have, with probability greater than (1 − δ),
$$\sup_{f \in \mathcal{F}} \left[ \left| \mathcal{L}(f, L) - \mathcal{L}(f) \right| \right] \leq \epsilon_1$$

Algorithm 2 FTUNE
Require: A family of transfer functions F, a similarity function K, a loss function L.
Ensure: An optimal transfer function f∗ ∈ F.
 1: Select d landmark pairs L.
 2: for all f ∈ F do
 3:     w_f ← LEARN-DISSIM(K, L, f),  L_f ← L(f, L)
 4: end for
 5: f∗ ← arg min_{f∈F} L_f
 6: return f∗
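A minimal sketch of FTUNE over a finite family of transfer functions (illustrative; it reuses the learn_dissim sketch given earlier and assumes a held-out validation set as a stand-in for estimating L(f, L)):

```python
import numpy as np

def ftune(transfer_family, K, landmark_pairs, T, T_labels, V, V_labels):
    """FTUNE-style selection: pick the transfer function whose LEARN-DISSIM classifier
    attains the smallest loss on a held-out validation set."""
    best_f, best_loss, best_clf = None, np.inf, None
    for f in transfer_family:
        clf = learn_dissim(K, landmark_pairs, f, T, T_labels)          # subroutine from the earlier sketch
        loss = np.mean([clf(x) != y for x, y in zip(V, V_labels)])     # 0-1 validation loss as a stand-in for L(f, L)
        if loss < best_loss:
            best_f, best_loss, best_clf = f, loss, clf
    return best_f, best_clf

# A small family of antisymmetric transfer functions one might search over:
transfer_family = [np.tanh, np.sign, lambda t: t, lambda t: t ** 3]
```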
Intelligent choice of landmark points

If landmarks are clumped together, then all points will get a similar embedding and linear separation would be impossible.
Thus, as a heuristic on small datasets, we promote diversity among the landmarks.
On large datasets, FTUNE itself is able to recover the best transfer function as it does not over-fit.

Algorithm 3 DSELECT
Require: A training set T.
Ensure: A set of n landmark pairs.
 1: S ← RANDOM-ELEMENT(T), L ← ∅
 2: for j = 2 to n do
 3:     z ← arg min_{x∈T} Σ_{x′∈S} K(x, x′)
 4:     S ← S ∪ {z}, T ← T \ {z}
 5: end for
 6: for j = 1 to n do
 7:     Sample z₁, z₂ from S with replacement s.t. ℓ(z₁) = 1, ℓ(z₂) = −1
 8:     L ← L ∪ {(z₁, z₂)}
 9: end for
10: return L
Data-sensitive Notions of Suitability Learning the Best Notion of Suitability
Intelligent choice of landmark points
If landmarks are clumpedtogether, then all points willget a similar embeddingand linear separationwould be impossibleThus we promote diversityamong the landmarks as aheuristic on small datasets
On large datasets FTUNEitself is able to recover thebest transfer function as itdoes not over-fit
Algorithm 3 DSELECTRequire: A training set T .Ensure: A set of n landmark pairs.
1: S ← RANDOM-ELEMENT(T ),L ← ∅2: for j = 2 to n do3: z ← arg min
x∈T
∑x′∈S
K (x , x ′).
4: S ← S ∪ {z}, T ← T\{z}5: end for6: for j = 1 to n do7: Sample z1, z2 from S with replacement s.t.
`(z1) = 1, `(z2) = −18: L ← L ∪ {(z1, z2)}9: end for
10: return L
P. Kar and P. Jain (IITK/MSRI) Similarity-based Learning November 3, 2011 22 / 29
Data-sensitive Notions of Suitability Learning the Best Notion of Suitability
Intelligent choice of landmark points
If landmarks are clumpedtogether, then all points willget a similar embeddingand linear separationwould be impossibleThus we promote diversityamong the landmarks as aheuristic on small datasetsOn large datasets FTUNEitself is able to recover thebest transfer function as itdoes not over-fit
Algorithm 3 DSELECTRequire: A training set T .Ensure: A set of n landmark pairs.
1: S ← RANDOM-ELEMENT(T ),L ← ∅2: for j = 2 to n do3: z ← arg min
x∈T
∑x′∈S
K (x , x ′).
4: S ← S ∪ {z}, T ← T\{z}5: end for6: for j = 1 to n do7: Sample z1, z2 from S with replacement s.t.
`(z1) = 1, `(z2) = −18: L ← L ∪ {(z1, z2)}9: end for
10: return L
P. Kar and P. Jain (IITK/MSRI) Similarity-based Learning November 3, 2011 22 / 29
Data-sensitive Notions of Suitability Learning the Best Notion of Suitability
Intelligent choice of landmark points
If landmarks are clumpedtogether, then all points willget a similar embeddingand linear separationwould be impossibleThus we promote diversityamong the landmarks as aheuristic on small datasetsOn large datasets FTUNEitself is able to recover thebest transfer function as itdoes not over-fit
Algorithm 3 DSELECTRequire: A training set T .Ensure: A set of n landmark pairs.
1: S ← RANDOM-ELEMENT(T ),L ← ∅2: for j = 2 to n do3: z ← arg min
x∈T
∑x′∈S
K (x , x ′).
4: S ← S ∪ {z}, T ← T\{z}5: end for6: for j = 1 to n do7: Sample z1, z2 from S with replacement s.t.
`(z1) = 1, `(z2) = −18: L ← L ∪ {(z1, z2)}9: end for
10: return L
P. Kar and P. Jain (IITK/MSRI) Similarity-based Learning November 3, 2011 22 / 29
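The following is a minimal Python sketch of DSELECT, assuming examples are referred to by integer indices, labels[i] ∈ {+1, −1}, and K(i, j) returns the similarity between examples i and j; the interface and helper names are illustrative, not the authors' code.

import random

def dselect(indices, labels, K, n):
    """Greedily pick n mutually dissimilar landmarks, then sample n opposite-label pairs."""
    pool = list(indices)
    S = [pool.pop(random.randrange(len(pool)))]            # step 1: S <- RANDOM-ELEMENT(T)
    for _ in range(2, n + 1):                               # steps 2-5: add the point least similar to S
        z = min(pool, key=lambda x: sum(K(x, xp) for xp in S))
        S.append(z)
        pool.remove(z)
    pos = [z for z in S if labels[z] == +1]                  # split landmarks by label
    neg = [z for z in S if labels[z] == -1]
    assert pos and neg, "both classes must appear among the selected landmarks"
    return [(random.choice(pos), random.choice(neg))         # steps 6-10: sample pairs with replacement
            for _ in range(n)]

The greedy argmin in steps 2-5 pushes each new landmark away from those already chosen, which is what makes the resulting embedding coordinates less redundant.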
Data-sensitive Notions of Suitability Results
Results
[Four accuracy-vs-landmarks plots: AmazonBinary, Amazon47, Mirex07, and FaceRec. x-axis: Number of Landmarks; y-axis: Accuracy. Methods compared: FTUNE+D, FTUNE, BBS+D, BBS, DBOOST.]
Data-sensitive Notions of Suitability Results
Results
[Four accuracy-vs-landmarks plots: Isolet, Letters, Pen-digits, and Opt-digits. x-axis: Number of Landmarks; y-axis: Accuracy. Methods compared: FTUNE (Single), FTUNE (Multiple), BBS, DBOOST.]
Data-sensitive Notions of Suitability Results

Discussion

BBS performs reasonably well for small landmarking sizes, while DBOOST performs well for large landmarking sizes.
In contrast, our method consistently outperforms the existing methods in both scenarios.
Since FTUNE selects its output by way of validation, it is susceptible to over-fitting on small datasets.
In these cases, DSELECT (intuitively) removes redundancies among the landmark points, allowing FTUNE to recover the best transfer function.
Data-sensitive Notions of Suitability Results
Thanks
Preprint available at http://www.cse.iitk.ac.in/users/purushot/
References
References I
Arora, S., Babai, L., Stern, J., and Sweedyk, Z. (1997). The Hardness of Approximate Optima in Lattices, Codes, and Systems of Linear Equations. Journal of Computer and System Sciences, 54(2):317–331.

Balcan, M.-F. and Blum, A. (2006). On a Theory of Learning with Similarity Functions. In International Conference on Machine Learning, pages 73–80.

Balcan, M.-F., Blum, A., and Vempala, S. (2006). Kernels as Features: On Kernels, Margins, and Low-dimensional Mappings. Machine Learning, 65(1):79–94.

Garey, M. R. and Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco.

Goldfarb, L. (1984). A Unified Approach to Pattern Recognition. Pattern Recognition, 17(5):575–582.
References
References II
Gottlieb, L.-A., Kontorovich, A. L., and Krauthgamer, R. (2010). Efficient Classification for Metric Data. In Annual Conference on Computational Learning Theory.

Haasdonk, B. (2005). Feature Space Interpretation of SVMs with Indefinite Kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492.

Kar, P. and Jain, P. (2011). Similarity-based Learning via Data Driven Embeddings. In 25th Annual Conference on Neural Information Processing Systems. (to appear).

Pekalska, E. and Duin, R. P. W. (2001). On Combining Dissimilarity Representations. In Multiple Classifier Systems, pages 359–368.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., and Anthony, M. (1998). Structural Risk Minimization Over Data-Dependent Hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940.
References
References III
von Luxburg, U. and Bousquet, O. (2004). Distance-Based Classification with Lipschitz Functions. Journal of Machine Learning Research, 5:669–695.

Wang, L., Yang, C., and Feng, J. (2007). On Learning with Dissimilarity Functions. In International Conference on Machine Learning, pages 991–998.

Weinberger, K. Q. and Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. Journal of Machine Learning Research, 10:207–244.