
Transfer Learning and Optimal Transport

Ievgen Redko
ievgen.redko@univ-st-etienne.fr
UMR CNRS 5516, Saint-Étienne


Credits and acknowledgments

Documents used for this talk:

D. Xu, K. Saenko, I. Tsang. Tutorial on Domain Transfer Learning for Vision Applications, CVPR'12.

S. Pan, Q. Yang and W. Fan. Tutorial: Transfer Learning with Applications, IJCAI'13.

S. Ben-David. Towards Theoretical Understanding of Domain Adaptation Learning, workshop LNIID at ECML'09.

F. Sha and B. Kingsbury. Domain Adaptation in Machine Learning and Speech Recognition, Tutorial - Interspeech 2012.

K. Grauman. Adaptation for Objects and Attributes, workshop VisDA at ICCV'13.

J. Blitzer and H. Daumé III. Domain Adaptation, Tutorial - ICML 2010.

Acknowledgments: A. Habrard, Rémi Flamary, Nicolas Courty, Devis Tuia, Tien Nam Li, Marc Sebban


Outline

Introduction

Optimal transport for domain adaptation
  Problem formulation
  Regularization framework for domain adaptation
  Numerical experiments

Mapping estimation for discrete optimal transport
  Problem formulation
  Application to domain adaptation
  Application to seamless copy in images

Optimal transport for target shift
  Motivation
  Proposed model
  Experimental results

Optimal transport for joint distribution adaptation

Other contributions
  Differentially private OT

Conclusions


Introduction


Artificial Intelligence

Ultimate goal: Build systems that can learn by exploring the world.


- Unfortunately, this is not easy (and, for now, almost impossible).


Goals in AI

Intermediate goal: Build systems that can classify and recognize well.

Solution: Use machine learning (ML) methods, which reach near-human performance.


Issues of traditional ML

Issues:

- Near-human performance is achieved using lots of labeled data

- Some tasks do not have that much labeled data (biology, physics, etc.)

- Some data/tasks evolve with time

- There exist too many tasks!


Solution: Transfer learning

+ Use systems built for different but related applications


Transfer Learning

Definition [Pan, TL-IJCAI'13 tutorial]: the ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks.


Example

We have labeled images ("Person" vs. "no Person") from a Web image corpus. Is there a Person in unlabeled images from a Video corpus?


Settings

Supervised learning vs. transfer learning: in supervised learning, training and test data come from the same domain; in transfer learning, they come from different domains.

Domains are modeled as probability distributions over an instance space.

Tasks are associated to a domain (classification, regression, clustering, ...).

Goal

Improve a target predictive function in the target domain using knowledge from the source domain.


A Taxonomy of Transfer Learning

“A survey on Transfer Learning” [Pan and Yang, TKDE 2010]


In this tutorial

We focus on domain adaptation for classification:

How can we learn, using labeled data from a source distribution, a low-error classifier for another related target distribution?


Why?

"Hot topic": tutorials at ICML 2010, CVPR 2012, Interspeech 2012; workshops at ICCV 2013, NIPS 2013, ECML 2014.

Many real-world motivating examples.


A toy problem: inter-twinning moons

(Figure: source and target moons for rotations of 10, 20, 30, 40, 50 and 70 degrees between domains.)


Intuition and motivation: computer vision

"Can we train classifiers with Flickr photos, as they have already been collected and annotated, and hope the classifiers still work well on mobile camera images?" [Gong et al., CVPR'12]

"Object classifiers optimized on a benchmark dataset often exhibit significant degradation in recognition accuracy when evaluated on another one" [Gong et al., ICML'13, Torralba et al., CVPR'11, Perronnin et al., CVPR'10]

"Hot topic": visual domain adaptation [Tutorial CVPR'12, ICCV'13]


Problems with data representations

[Xu,Saenko,Tsang, Domain Transfer Tutorial - CVPR’12]


Hard to predict what will change in the new domain

[Xu,Saenko,Tsang, Domain Transfer Tutorial - CVPR’12]


Natural Language Processing

Part-of-speech tagging: adapt a tagger learned from medical papers to a journal.

Texts are represented by "words" (bag of words).


Spam detection

Adapt a classifier from a mailbox of an office worker to that of a hippie musician


Sentiment analysis

Adapt a classifier predicting the preferences for books to those of DVDs


Electronics to video games [Pan-IJCAI'13 tutorial]

Electronics:
(1) Compact; easy to operate; very good picture quality; looks sharp!
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(5) It is also quite blurry in very dark settings. I will never_buy HP again.

Video games:
(2) A very good game! It is action-packed and full of excitement. I am very much hooked on this game.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(6) It is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.

Source specific: compact, sharp, blurry.
Target specific: hooked, realistic, boring.
Domain independent: good, excited, nice, never_buy, unhappy.


Other applications

Speech recognition [Tutorial at Interspeech'12]

Medicine

Biology

Time series

Wifi localization


Why transfer learning?


Why transfer learning? Smart people talk


A bit of vocabulary

Unsupervised Transfer Learning

= No labels in source and target domains

Unsupervised DA

= Presence of source labels, no target labels

Semi-supervised DA

= Presence of source labels, few target labels and a lot of unlabeled data

Semi-supervised learning

= No distribution shift, few labeled data and a lot of unlabeled data from the same domain


Several key questions

1. How to estimate the distribution shift?

2. What are the generalization guarantees?

   R_{P_T}(h) ≤ R_{P_S}(h) + ?

3. When is adaptation possible?


Several key questions

4. How to design new algorithms?


3 main classes of algorithms

1. Instance-based methods

= Correct a sample bias by reweighting source labeled data: source instances close to target instances are more important

2. Feature-based methods

= Find a common space where source and target are close

3. Adjustment/iterative methods

= Modify the model by incorporating pseudo-labeled information


Optimal transport for domain adaptation


The following slides are courtesy of R. Flamary (https://remi.flamary.com/biblio/presvannes2016.pdf).


Problem setup

(Figure: source domain (Amazon product images, with labels) and target domain (DSLR images, no labels) after feature extraction; the decision function learned on the source does not work on the target.)

Problems

Labels are only in the source domain, and classification is in the target domain.

A classifier trained on the source data performs badly in the target domain.


Optimal transport for domain adaptation

Assumptions

There exists a transport T between the source and target domains.

The transport preserves the conditional distributions:

P_s(y|x^s) = P_t(y|T(x^s)).


3-step strategy

1. Estimate the optimal transport between distributions.

2. Transport the training samples onto the target distribution.

3. Learn a classifier on the transported training samples.

(Figure: two-class dataset and classifier on the source; optimal transport of the samples; classification on the transported samples.)
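A minimal sketch of this 3-step strategy in Python, assuming the POT library (pip install pot) and scikit-learn are available; the function name, the entropic solver for step 1 and the 1-NN classifier for step 3 are illustrative choices, not the authors' exact setup:

```python
# Sketch of the 3-step OT-DA strategy (illustrative; assumes POT + scikit-learn).
import numpy as np
import ot
from sklearn.neighbors import KNeighborsClassifier

def otda_predict(Xs, ys, Xt, reg=0.1):
    ns, nt = len(Xs), len(Xt)
    a, b = np.full(ns, 1.0 / ns), np.full(nt, 1.0 / nt)  # uniform weights
    C = ot.dist(Xs, Xt)                     # squared Euclidean cost matrix
    # 1. Estimate the (entropy-regularized) optimal transport plan.
    gamma = ot.sinkhorn(a, b, C, reg)
    # 2. Transport the labeled source samples onto the target distribution
    #    (barycentric mapping, detailed on a later slide).
    Xs_mapped = ns * gamma @ Xt
    # 3. Learn a classifier on the transported samples and predict on Xt.
    clf = KNeighborsClassifier(n_neighbors=1).fit(Xs_mapped, ys)
    return clf.predict(Xt)
```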


Objective function

Optimization problem

min_{γ∈P} ⟨γ, C⟩_F + λ Ω_s(γ) + η Ω(γ),

where

Ω_s(γ) is the entropic regularization [Cuturi, 2013],

η ≥ 0 and Ω(·) is a DA regularization term.

Regularization helps to avoid overfitting in high dimension and to encode additional information.


Entropic regularization

(Figure: optimal matrix γ with entropic regularization.)

Ω_s(γ) = Σ_{i,j} γ(i,j) log γ(i,j)

Extremely efficient optimization scheme (Sinkhorn-Knopp).

The solution is not sparse anymore.
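A minimal numpy sketch of the Sinkhorn-Knopp scaling for this entropy-regularized problem; the fixed iteration count is an assumption standing in for a proper convergence test:

```python
# Sinkhorn-Knopp scaling for entropic OT (illustrative sketch).
import numpy as np

def sinkhorn(a, b, C, reg, n_iter=1000):
    """Approximately solve min_gamma <gamma, C> + reg * sum(gamma * log gamma)
    subject to gamma @ 1 = a and gamma.T @ 1 = b."""
    K = np.exp(-C / reg)                 # element-wise Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                # scale columns to match b
        u = a / (K @ v)                  # scale rows to match a
    return u[:, None] * K * v[None, :]   # gamma = diag(u) @ K @ diag(v)
```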


Class-based regularization [Courty et al., 2016]

(Figure: optimal matrix γ with class-based regularization.)

Ω_c(γ) = Σ_j Σ_c ||γ(I_c, j)||_q^p,

where I_c contains the indices of the source samples of class c.

Group the components of γ using the source labels.

Target samples receive masses only from "same class" source samples.


Laplacian regularization for sample displacement

(Figure: similarity graph with S^s_{i,j} > 0; regularized displacements for small and large λ.)

Ω_c(γ) = (1/N_s²) Σ_{i,j} S^s_{i,j} ||(x̂_i^s − x_i^s) − (x̂_j^s − x_j^s)||²,

where x̂_i^s denotes the transported position of the source sample x_i^s.

Proposed in [Ferradans et al., 2013] for color transfer in images.

Similar samples, as defined by S^s, have similar displacements.

The similarity graph S^s can be built using the source labels.


Optimization problem

min_{γ∈P} ⟨γ, C⟩_F + λ Ω_s(γ) + η Ω(γ)

Special cases

η = 0: Sinkhorn-Knopp [Cuturi, 2013].

λ = 0 and Laplacian regularization: large quadratic program solved with conditional gradient [Ferradans et al., 2013].

Non-convex group lasso ℓp − ℓ1: Majorization-Minimization with Sinkhorn-Knopp [Courty et al., 2014].

General framework with convex regularization Ω(γ)

Can we use the efficient Sinkhorn-Knopp scaling to solve the global problem?

Yes, using the generalized conditional gradient [Bredies et al., 2009]:

linearize the second regularization term but not the entropic regularization.


Barycentric mapping

How to transport the samples using the obtained coupling matrix?

Use the barycentric mapping:

x̂_i^s = argmin_x Σ_j γ*(i,j) c(x, x_j^t)

If c(x, x') is the squared Euclidean distance, then

X̂_S ≃ n_s γ* X_T   (for uniform source weights)
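In code, the barycentric mapping is one line of linear algebra (sketch; dividing by the row sums of γ handles non-uniform source weights and reduces to n_s γ X_T in the uniform case):

```python
# Barycentric mapping of the source samples (squared-Euclidean cost).
import numpy as np

def barycentric_map(gamma, Xt):
    # Each source sample moves to the gamma-weighted average of the
    # target samples it sends mass to.
    return (gamma @ Xt) / gamma.sum(axis=1, keepdims=True)
```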


Simulated problem with controllable complexity

(Figure: two moons datasets for rotations of 10, 30, 50 and 70 degrees.)

Two moons problem [Germain et al., 2013]

Two entangled moons with a rotation between domains.

The rotation angle allows control of the adaptation difficulty.


Results on the two moons dataset

Method            10     20     30     40     50     70     90
SVM (no adapt.)   0      0.104  0.24   0.312  0.4    0.764  0.828
DASVM             0      0      0.259  0.284  0.334  0.747  0.820
PBDA              0      0.094  0.103  0.225  0.412  0.626  0.687
OT-exact          0      0.028  0.065  0.109  0.206  0.394  0.507
OT-IT             0      0.007  0.054  0.102  0.221  0.398  0.508
OT-GL             0      0      0      0.013  0.196  0.378  0.508
OT-Lap            0      0      0.004  0.062  0.201  0.402  0.524

Average prediction error for rotation angles from 10 to 90 degrees.

Clear advantage of the optimal transport techniques.

Regularization helps (a lot) up to 40 degrees.


Visual adaptation datasets

Digit recognition: MNIST vs. USPS (10 classes, d=256, 2 domains).

Face recognition: PIE dataset (68 classes, d=1024, 4 domains).

Object recognition: Caltech-Office dataset (10 classes, d=800/4096, 4 domains).


Comparison on vision datasets

            Digits           Faces            Objects
Method      ACC    Nb best   ACC    Nb best   ACC    Nb best
1NN         48.66  0         26.22  0         28.47  0
PCA         42.94  0         34.55  0         37.98  0
GFK         52.56  0         26.15  0         39.21  0
TSL         47.22  0         36.10  0         42.97  1
JDA         57.30  0         56.69  7         44.34  1
OT-exact    49.96  0         50.47  0         36.69  0
OT-IT       59.20  0         54.89  0         42.30  0
OT-Lap      61.07  0         56.10  3         43.20  0
OT-LpLq     64.11  1         55.45  0         46.42  1
OT-GL       63.90  1         55.88  2         47.70  9

OT works very well on digit and object recognition.

Good but not best on face recognition (−0.5% w.r.t. JDA).


Next step

Limits

Scales at least quadratically with the dataset size.

What about domains with different class proportions? [Tuia et al., 2015]

Out-of-sample extension?


Mapping estimation for discrete optimal transport


Mapping estimation for discrete optimal transport

Why estimate the mapping?

Out-of-sample problem.

Avoids solving the optimization problem every time the dataset changes.

Transporting a very large number of samples.

Interpretability (depending on the mapping model).


How to estimate the mapping?

Go back to the Monge formulation? No!

Fit the barycentric mapping, but also introduce smoothness.


Mapping estimation

Problem formulation [Perrot et al., 2016]

argmin_{T∈H, γ∈P} f(γ, T) = λ_γ ⟨γ, C⟩_F + ||T(X_s) − n_s γ X_t||²_F + λ_T R(T)
                            (OT loss)     (mapping data fitting)      (mapping reg.)

where

X_s and X_t are the source and target datasets,

T(·) is applied to each row of the above matrices,

n_s γ X_t is the barycentric mapping of the source samples with uniform weights,

H is the space of transformations (more details later),

R(·) is a regularization term controlling the complexity of T.
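A simplified sketch of the alternating estimation for a linear map T(x) = xL with R(T) = ||L − I||²_F. Two labeled assumptions: the γ-step below adds the pointwise surrogate Σ_{ij} γ_{ij} ||T(x_i^s) − x_j^t||² to the cost instead of the exact data-fitting term of Perrot et al., and the `sinkhorn` helper defined earlier is the OT solver:

```python
# Alternating (gamma, T) estimation for a linear map (simplified sketch).
import numpy as np

def fit_linear_map(Xs, Xt, lam_gamma=1.0, lam_T=0.1, reg=0.1, n_outer=10):
    ns, d = Xs.shape
    nt = Xt.shape[0]
    a, b = np.full(ns, 1.0 / ns), np.full(nt, 1.0 / nt)
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)  # OT cost on raw data
    L = np.eye(d)
    for _ in range(n_outer):
        # gamma-step: OT cost plus a pointwise mapping-fit surrogate.
        fit = (((Xs @ L)[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
        gamma = sinkhorn(a, b, lam_gamma * C + fit, reg)
        # T-step: ridge regression toward the barycentric mapping;
        # closed form of min_L ||Xs L - Y||^2 + lam_T ||L - I||^2.
        Y = ns * gamma @ Xt
        A = Xs.T @ Xs + lam_T * np.eye(d)
        L = np.linalg.solve(A, Xs.T @ Y + lam_T * np.eye(d))
    return L, gamma
```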


Mapping family H

Linear transformations

H = { T : ∀x ∈ Ω, T(x) = xᵀL }

L is a d × d real matrix.

R(T) = ||L − I||²_F, where I is the identity matrix.

The update is a classical linear least-squares regression.


Nonlinear transformations

H = { T : ∀x ∈ Ω, T(x) = k_{X_s}(xᵀ) L }

k_{X_s}(xᵀ) = (k(x, x_1^s)  k(x, x_2^s)  ···  k(x, x_{n_s}^s)).

k(·,·) is a positive definite kernel.

L is an n_s × d real matrix.

The update is a classical kernel least-squares regression.


Illustrative example

(Figure: linear vs. nonlinear mapping estimated on a 2D dataset; barycentric displacement, T displacement and out-of-sample T for source and target samples.)

The underlying transformation is clearly nonlinear.

The mapping model controls the barycentric mapping.


Domain adaptation: Caltech-Office dataset

                                            OTLin       OTLinB      OTKer       OTKerB
Task  1NN   GFK   SA    OT    L1L2  OTE     T     γ     T     γ     T     γ     T     γ
D→W   89.5  93.3  95.6  77.0  95.7  95.7    97.3  97.3  97.3  97.3  98.4  98.5  98.5  98.5
D→A   62.5  77.2  88.5  70.8  74.9  74.8    85.7  85.7  85.8  85.8  89.9  89.9  89.5  89.5
D→C   51.8  69.7  79.0  68.1  67.8  68.0    77.2  77.2  77.4  77.4  69.1  69.2  69.3  69.3
W→D   99.2  99.8  99.6  74.1  94.4  94.4    99.4  99.4  99.8  99.8  97.2  97.2  96.9  96.9
W→A   62.5  72.4  79.2  67.6  71.3  71.3    81.5  81.5  81.4  81.4  78.5  78.3  78.5  78.8
W→C   59.5  63.7  55.0  63.1  67.8  67.8    75.9  75.9  75.4  75.4  72.7  72.7  65.1  63.3
A→D   65.2  75.9  83.8  64.6  70.1  70.5    80.6  80.6  80.4  80.5  65.6  65.5  71.9  71.5
A→W   56.8  68.0  74.6  66.8  67.2  67.3    74.6  74.6  74.4  74.4  66.4  64.8  70.0  68.9
A→C   70.1  75.7  79.2  70.4  74.1  74.3    81.8  81.8  81.6  81.6  84.4  84.4  84.5  84.5
C→D   75.9  79.5  85.0  66.0  69.8  70.2    87.1  87.1  87.2  87.2  70.1  70.0  78.6  78.6
C→W   65.2  70.7  74.4  59.2  63.8  63.8    78.3  78.3  78.5  78.5  80.0  80.4  73.5  73.4
C→A   85.8  87.1  89.3  75.2  76.6  76.7    89.9  89.9  89.7  89.7  82.4  82.2  83.6  83.5
Mean  70.3  77.8  81.9  68.6  74.5  74.6    84.1  84.1  84.1  84.1  79.6  79.4  80.0  79.7

Clear advantage of the mapping estimation methods.


Seamless copy with gradient adaptation

Poisson image editing with gradient adaptation

Adapt the gradients from the source to the target domain:

Δf = div T_{s→t}(v) over Ω, with f|_{∂Ω} = f_t|_{∂Ω}.   (1)

T_{s→t} : R⁶ → R⁶ is the mapping between the gradients of the source and target images in the domain.


Optimal transport for target shift


Different reasons to adapt

Covariate shift

Adapt when only the marginal distributions of the inputs change.

+ most popular scenario
– conditional distributions may differ in practice

(Figure: source and target input densities under covariate shift.)

Covariate shift and optimal transport (Courty et al., 2014)


Different reasons to adapt

Target shift

Adapt when only the marginal distributions of the outputs change.

+ occurs in many real-world applications (imbalanced data)
+ few contributions in the literature
– assumes the same distribution of inputs

(Figure: source and target label proportions under target shift.)

Different reasons to adapt

Sample-selection bias, source component shift, domain shift, etc.

Rarely studied (and very difficult) cases.

(Figure: example source and target densities for these shifts.)

Wish list

General multi-source setting (more than 2 source domains)

Solving covariate shift between each source-target pair

Tackling target shift


First insight

How to tackle target shift? Take 1

Assume that the target and source distributions are defined as

P_T = Σ_{i=1}^C π_i^T P_i,    P_S^π = Σ_i π_i P_i,

with P_i being the distribution of class i ∈ {1, ..., C}. We want to solve Problem 1:

π* = argmin_{π∈Δ_C} W(P_S^π, P_T)
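An illustrative, deliberately naive rendering of Problem 1: scan class-proportion vectors π on a simplex grid and keep the one minimizing the entropic OT cost between the reweighted source and the target. Assumptions: the brute-force grid stands in for the efficient optimization used later, and the `sinkhorn` helper defined earlier is the OT solver:

```python
# Naive grid search for the proportions pi minimizing W(P_S^pi, P_T).
import numpy as np
from itertools import product

def estimate_proportions(Xs, ys, Xt, reg=0.1, step=0.1):
    C = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    b = np.full(len(Xt), 1.0 / len(Xt))
    classes = np.unique(ys)
    grid = [p for p in product(np.arange(step, 1.0, step), repeat=len(classes))
            if abs(sum(p) - 1.0) < 1e-6]          # grid points of the simplex
    best_cost, best_pi = np.inf, None
    for pi in grid:
        a = np.zeros(len(Xs))
        for c, p in zip(classes, pi):
            a[ys == c] = p / np.sum(ys == c)      # give class c total mass p
        gamma = sinkhorn(a, b, C, reg)
        cost = np.sum(gamma * C)                  # transport cost under pi
        if cost < best_cost:
            best_cost, best_pi = cost, pi
    return np.array(best_pi)
```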


First insight

How to tackle target shift? Take 1

We can prove the following:

Assume that for every i there is no α ∈ Δ_C with α_i = 0 such that P_i = Σ_j α_j P_j (no class distribution is a mixture of the others). Then, for any distribution P_T, the unique solution π* to Problem 1 is given by π^T.

The Wasserstein distance is a good candidate for proportion estimation!


Second insight

How to tackle target shift? Take 2

Consider a multi-source setting with a weighted source distribution.

We can show that, for binary classification,

(error in target domain) ≤ (overall source error) + (distance between classes) × (distance between source and target class proportions)

More formally, for class distributions P_0 and P_1,

(distance between classes) × (distance between source and target class proportions) = dist(P_0, P_1) · |π_T − Σ_{j=1}^N α_j π_S^j|


Reweight the source samples to match the target proportions!


Some notations

Data matrices X^(k) ∼ μ^(k) from K source domains.

Data matrix X ∼ μ from the target domain.

Source class proportions h_c^(k) = ∫ μ_c^(k), with μ^(k) = Σ_{c=1}^C μ_c^(k).

Linear operators D_1^(k) and D_2^(k) that transform the sample weights of domain k into the class proportions h^(k) and back.


Objective function

Putting it all together:

h* = argmin_{h∈Δ_C, Γ} Σ_{k=1}^K λ_k KL(γ^(k) | ζ^(k))   s.t.   ∀k, D_1^(k) γ^(k) 1_n = h.

A constrained Wasserstein barycenter problem.

h* reweights the source instances to match the source and target distributions.

Efficient optimization with Bregman projections.


Classification in the target domain

How do we obtain target labels?

Before: barycentric mapping

+ Accurate alignment of samples
– Computationally costly: aligning + learning a classifier

Alternative: label propagation

L = Σ_{k=1}^K λ_k D_1^(k) γ^(k)

+ Directly obtains target labels
+ Majority vote by source domains = can be seen as boosting
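A sketch of this label-propagation step; one stated assumption is that, up to normalization and transposition, pushing one-hot source labels through each coupling matches L = Σ_k λ_k D_1^(k) γ^(k):

```python
# OT label propagation: push one-hot source labels through the couplings
# and take a lambda-weighted vote over the K source domains.
import numpy as np

def propagate_labels(gammas, ys_list, lambdas, n_classes):
    nt = gammas[0].shape[1]
    scores = np.zeros((nt, n_classes))
    for gamma, ys, lam in zip(gammas, ys_list, lambdas):
        Y = np.eye(n_classes)[ys]       # one-hot source labels, shape (ns, C)
        scores += lam * (gamma.T @ Y)   # class mass received per target point
    return scores.argmax(axis=1)        # majority vote over source domains
```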


Illustration

Covariate shift DA mixes instances from different classes!


Illustration

The proposed method handles target shift efficiently!


Real-world data

'Zurich Summer' dataset composed of 20 satellite images.

4 classes: Roads, Buildings, Trees and Grass.

17 source domains and 1 target domain.

Average class proportions: [0.25 ± 0.07, 0.40 ± 0.13, 0.22 ± 0.11, 0.13 ± 0.11].


Examples of images

(Figure: input satellite images; the same images labeled with the 4 classes.)

Classification results

# source  Average class          # source   No     OTDA  OTDA  MDA     JCPOT  Target
domains   proportions            instances  adapt. PT    LP    Causal  LP     only
2         [0.17 0.40 0.16 0.27]  2,936      0.61   0.52  0.57  0.65    0.66   0.65
5         [0.22 0.39 0.18 0.21]  6,716      0.62   0.55  0.60  0.66    0.68   0.64
8         [0.25 0.46 0.17 0.12]  16,448     0.63   0.54  0.59  0.67    0.71   0.65
11        [0.26 0.48 0.16 0.10]  21,223     0.63   0.54  0.58  0.67    0.72   0.673
14        [0.26 0.45 0.19 0.10]  27,875     0.63   0.52  0.58  0.67    0.72   0.65
17        [0.25 0.42 0.20 0.13]  32,660     0.63   0.50  0.59  0.67    0.73   0.61


Optimal transport for joint distribution adaptation

The following slides are courtesy of R. Flamary (OTML workshop, NIPS'17):
http://otml17.marcocuturi.net/wp-content/uploads/2018/01/OTML_NIPS_2017.pdf


Joint distribution and classifier estimation

Objectives of JDOT

• Model the transformation of the labels (allow a change of proportion/value).

• Learn an optimal target predictor with no labels on the target samples.

• The approach is theoretically justified.

Joint distributions and dataset

• We work with the joint feature/label distributions.

• Let Ω ⊂ R^d be a compact measurable input space of dimension d and C the set of labels.

• Let P_s(X, Y) ∈ P(Ω × C) and P_t(X, Y) ∈ P(Ω × C) be the source and target joint distributions.

• We have access to an empirical sampling P̂_s = (1/N_s) Σ_{i=1}^{N_s} δ_{x_i^s, y_i^s} of the source distribution, defined by the samples X_s = {x_i^s}_{i=1}^{N_s} and the label information Y_s = {y_i^s}_{i=1}^{N_s}.

• The target domain is defined only by an empirical distribution in the feature space, with samples X_t = {x_i^t}_{i=1}^{N_t}.


Joint distribution OT (JDOT)

Proxy joint distribution

• Let f be an Ω → C function from a given class of hypotheses H.

• We define the following joint distribution that uses f as a proxy of y:

P_t^f = (x, f(x))_{x∼μ_t}    (5)

and its empirical counterpart P̂_t^f = (1/N_t) Σ_{i=1}^{N_t} δ_{x_i^t, f(x_i^t)}.

Learning with JDOT

We propose to learn the predictor f that minimizes

min_f W_1(P̂_s, P̂_t^f) = inf_{γ∈Δ} Σ_{ij} D(x_i^s, y_i^s; x_j^t, f(x_j^t)) γ_{ij}    (6)

• Δ is the transport polytope.

• D(x_i^s, y_i^s; x_j^t, f(x_j^t)) = α ||x_i^s − x_j^t||² + L(y_i^s, f(x_j^t)) with α > 0.

• We search for the predictor f that best aligns the joint distributions.


Optimization problem

min_{f∈H, γ∈Δ} Σ_{i,j} γ_{i,j} ( α d(x_i^s, x_j^t) + L(y_i^s, f(x_j^t)) ) + λ Ω(f)    (7)

Optimization procedure

• Ω(f) is a regularization term for the predictor f.

• We propose to use block coordinate descent (BCD)/Gauss-Seidel.

• Provably converges to a stationary point of the problem.

γ update for a fixed f

• Classical OT problem.

• Solved by the network simplex.

• Regularized OT can be used (add a term to problem (7)).

f update for a fixed γ

min_{f∈H} Σ_{i,j} γ_{i,j} L(y_i^s, f(x_j^t)) + λ Ω(f)    (8)

• Weighted loss from all source labels.

• γ performs label propagation.


Regression with JDOT

(Figure: toy regression distributions and models, the joint OT matrices (JDOT vs. plain OT matrix links), and the model estimated with JDOT.)

Least-squares regression with quadratic regularization

For a fixed γ the optimization problem is equivalent to

min_{f∈H} Σ_j (1/n_t) ||ŷ_j − f(x_j^t)||² + λ ||f||²    (9)

• ŷ_j = n_t Σ_i γ_{i,j} y_i^s is a weighted average of the source labels.

• Note that the number of terms is now linear (n_t) instead of quadratic (n_s × n_t) in the sample sizes.

• Can use any solver (linear, kernel ridge, neural network).
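A sketch of the full BCD loop for this least-squares case; two stated assumptions: the `sinkhorn` helper defined earlier serves as the OT solver, and scikit-learn's Ridge stands in for the regularized predictor f:

```python
# JDOT block coordinate descent for least-squares regression (sketch).
import numpy as np
from sklearn.linear_model import Ridge

def jdot_regression(Xs, ys, Xt, alpha=1.0, lam=1.0, reg=0.1, n_bcd=10):
    ns, nt = len(Xs), len(Xt)
    a, b = np.full(ns, 1.0 / ns), np.full(nt, 1.0 / nt)
    d_feat = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    f = Ridge(alpha=lam).fit(Xs, ys)              # initialize f on the source
    for _ in range(n_bcd):
        # gamma-step: joint cost D = alpha * feature distance + label loss.
        C = alpha * d_feat + (ys[:, None] - f.predict(Xt)[None, :]) ** 2
        gamma = sinkhorn(a, b, C, reg)
        # f-step, eq. (9): fit f to the propagated labels y_hat.
        y_hat = nt * (gamma.T @ ys)               # weighted source labels
        f = Ridge(alpha=lam).fit(Xt, y_hat)
    return f
```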


Classification with JDOT

(Figure: accuracy along BCD iterations for α ∈ {0.1, 0.5, 1, 10, 50, 100}.)

Multiclass classification with the Hinge loss

For a fixed γ the optimization problem is equivalent to

min_{f_k∈H} Σ_{j,k} [ P̂_{j,k} L(1, f_k(x_j^t)) + (1 − P̂_{j,k}) L(−1, f_k(x_j^t)) ] + λ Σ_k ||f_k||²    (10)

• P̂ is the class proportion matrix P̂ = N_t γᵀ P^s.

• P^s and Y^s are defined from the source data with a one-vs-all strategy as

Y^s_{i,k} = 1 if y_i^s = k, −1 otherwise;    P^s_{i,k} = 1 if y_i^s = k, 0 otherwise,

with k ∈ {1, ..., K} and K the number of classes.


Caltech-Office classification dataset

Domains           Base   SurK   SA     OT-IT  OT-MM  JDOT
caltech→amazon    92.07  91.65  90.50  89.98  92.59  91.54
caltech→webcam    76.27  77.97  81.02  80.34  78.98  88.81
caltech→dslr      84.08  82.80  85.99  78.34  76.43  89.81
amazon→caltech    84.77  84.95  85.13  85.93  87.36  85.22
amazon→webcam     79.32  81.36  85.42  74.24  85.08  84.75
amazon→dslr       86.62  87.26  89.17  77.71  79.62  87.90
webcam→caltech    71.77  71.86  75.78  84.06  82.99  82.64
webcam→amazon     79.44  78.18  81.42  89.56  90.50  90.71
webcam→dslr       96.18  95.54  94.90  99.36  99.36  98.09
dslr→caltech      77.03  76.94  81.75  85.57  83.35  84.33
dslr→amazon       83.19  82.15  83.19  90.50  90.50  88.10
dslr→webcam       96.27  92.88  88.47  96.61  96.61  96.61
Mean              83.92  83.63  85.23  86.02  86.95  89.04
Avg. rank         4.50   4.75   3.58   3.00   2.42   2.25

• Classical dataset [Saenko et al., 2010] dedicated to visual adaptation.

• Feature extraction by a convolutional neural network [Donahue et al., 2014].

• Comparison with Surrogate Kernel [Zhang et al., 2013], Subspace Alignment [Fernando et al., 2013] and OT Domain Adaptation [Courty et al., 2016b].

• Parameters selected via reverse cross-validation [Zhong et al., 2010].

• SVM (Hinge loss) classifiers with a linear kernel.

• Best-ranking method and a 2% accuracy gain on average.


Amazon Review Classification dataset

Domains NN DANN JDOT (mse) JDOT (Hinge)

books→dvd 0.805 0.806 0.794 0.795

books→kitchen 0.768 0.767 0.791 0.794

books→electronics 0.746 0.747 0.778 0.781

dvd→books 0.725 0.747 0.761 0.763

dvd→kitchen 0.760 0.765 0.811 0.821

dvd→electronics 0.732 0.738 0.778 0.788

kitchen→books 0.704 0.718 0.732 0.728

kitchen→dvd 0.723 0.730 0.764 0.765

kitchen→electronics 0.847 0.846 0.844 0.845

electronics→books 0.713 0.718 0.740 0.749

electronics→dvd 0.726 0.726 0.738 0.737

electronics→kitchen 0.855 0.850 0.868 0.872

Mean 0.759 0.763 0.783 0.787

• Dataset aim at predicting reviews across domains [Blitzer et al., 2006].

• Comparison with Domain adversarial neural network [Ganin et al., 2016a].

• Classifier f is a neural network with same architecture as DANN.

• JDOT has better accuracy, classification loss is better than mean square error.


Wifi localization regression dataset

Domains    KRR          SurK         DIP          DIP-CC       GeTarS       CTC          CTC-TIP      JDOT
t1→t2      80.84±1.14   90.36±1.22   87.98±2.33   91.30±3.24   86.76±1.91   89.36±1.78   89.22±1.66   93.03±1.24
t1→t3      76.44±2.66   94.97±1.29   84.20±4.29   84.32±4.57   90.62±2.25   94.80±0.87   92.60±4.50   90.06±2.01
t2→t3      67.12±1.28   85.83±1.31   80.58±2.10   81.22±4.31   82.68±3.71   87.92±1.87   89.52±1.14   86.76±1.72
hallway1   60.02±2.60   76.36±2.44   77.48±2.68   76.24±5.14   84.38±1.98   86.98±2.02   86.78±2.31   98.83±0.58
hallway2   49.38±2.30   64.69±0.77   78.54±1.66   77.80±2.70   77.38±2.09   87.74±1.89   87.94±2.07   98.45±0.67
hallway3   48.42±1.32   65.73±1.57   75.10±3.39   73.40±4.06   80.64±1.76   82.02±2.34   81.72±2.25   99.27±0.41

• The objective is to predict the position of a device on a discretized grid [Zhang et al., 2013].

• Same experimental protocol as [Zhang et al., 2013, Gong et al., 2016].

• Comparison with domain-invariant projection and its cluster-regularized version ([Baktashmotlagh et al., 2013], DIP and DIP-CC), generalized target shift ([Zhang et al., 2015], GeTarS), and conditional transferable components with its target-information-preservation regularization ([Gong et al., 2016], CTC and CTC-TIP).

• JDOT solves the adaptation problem for transfer across devices (10% accuracy gain on the hallway tasks).


Large scale JDOT Strategy

Large scale JDOT

• JDOT does not scale well to large datasets or deep learning.

• Use minibatches to compute the transport in the primal [Genevay et al., 2017].

• Evaluate batch-local couplings on (sufficiently large) pairs of random batches, sampled without replacement, from the source and target domains.

• Update f from these couplings (a minimal sketch follows the algorithm below).

Algorithm: Deep JDOT

input: source data Xs, ys; target data Xt
for BCD iterations do
  for each source/target minibatch do
    Solve OT with the JDOT loss
    Perform label propagation on the minibatch
  end for
  Update the model f for one epoch
end for
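
Below is a hedged numpy/POT sketch of the batch-local OT step, assuming g is a feature extractor and f a classifier returning class probabilities (both callable on numpy arrays), ys_onehot are one-hot source labels, and a squared-error label cost; DeepJDOT itself uses network embeddings and a classification loss inside a deep-learning framework, so this is only a simplified stand-in.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def minibatch_jdot_coupling(g, f, xs, ys_onehot, xt, alpha=1.0):
    """Compute one batch-local coupling of the Deep JDOT loop above."""
    gs, gt = g(xs), g(xt)                             # embeddings of the two batches
    C_feat = ot.dist(gs, gt, metric="sqeuclidean")    # feature-alignment cost
    C_lab = ot.dist(ys_onehot, f(xt), metric="sqeuclidean")  # label cost
    C = alpha * C_feat + C_lab                        # JDOT ground cost on the batch
    a = np.full(len(xs), 1.0 / len(xs))               # uniform batch marginals
    b = np.full(len(xt), 1.0 / len(xt))
    return ot.emd(a, b, C)                            # batch-local coupling
```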


Large scale datasets

Description       MNIST→USPS   USPS→MNIST   SVHN→MNIST   MNIST→MNIST-M
Source samples    60000        9298         73257        60000
Target samples    9298         60000        60000        60000
Image size        16×16        16×16        32×32×3      28×28×3

• Four cross-domain digit datasets: MNIST, USPS, SVHN, MNIST-M.

• We consider a deep convolutional architecture.

• Dropout is used on the dense layers during training.

• The transport distance is computed in the raw image space.


Experimental Results for large scale JDOT

Methods                            MNIST→USPS     USPS→MNIST     SVHN→MNIST      MNIST→MNIST-M
Source only (SO)                   86.18          58.73          53.15           59.52
DeepCoral [Sun and Saenko, 2016]   88.43 (22.0)   85.02 (64.6)   69.61 (35.6)    62.18 (0.07)
MMD [Long and Wang, 2015]          89.89 (36.3)   79.19 (50.3)   53.27 (0.01)    52.53 (-19.1)
DANN [Ganin et al., 2016b]         89.06 (28.2)   87.03 (70.0)   73.85∗ (44.7)   76.63 (46.6)
ADDA [Tzeng et al., 2017]          91.22 (49.3)   79.98 (52.2)   76.0∗ (49.4)    79.16 (53.5)
DeepJDOT                           91.50 (52.01)  91.21 (79.82)  83.62 (65.85)   67.84 (22.67)
Train on Target (TO)               96.41          99.42          99.42           96.21

• Accuracy (in %) of the DA methods.

• The values in parentheses give the fraction of the gap between SO (source only) and TO (the golden performance when the model is learned on labelled target data) that is covered: (DA − SO)/(TO − SO). For example, DeepJDOT on MNIST→USPS covers (91.50 − 86.18)/(96.41 − 86.18) ≈ 52% of the gap.

• DeepJDOT is the best method on 3 out of 4 DA problems.

• Plots (not reproduced here) show the test performance along the BCD iterations.


Other contributions: theoretical guarantees for DA with OT


Why does domain adaptation work?

Intuition: if two domains are similar, then adaptation should be easy.

Answer: yes, due to the following result [Ben-David et al., 2007]:

error in the target domain (what we want to learn)
  ≤ error in the source domain (what we can learn well using ML)
  + how different the two domains are (the divergence between the two domains)
  + is it possible to adapt? (a non-estimable term)

Can the Wasserstein distance be introduced into these bounds?


Several learning bounds

With the Wasserstein distance as the divergence term:

between the marginal distributions µS and µT [Redko et al., 2017, Shen et al., 2018]:

RT(h) ≤ RS(h) + W(µS, µT) + O(1/√n) + non-estimable term

between the joint distributions PS and PT [Courty et al., 2017]:

RT(h) ≤ RS(h) + W(PS, PT) + O(1/√n) + non-estimable term + probabilistic transfer Lipschitzness term


Other contributions: differentially private OT



Privacy in Optimal Transport and Domain Adaptation

Challenge: performing optimal transport for domain adaptation.

One needs access to both the source and the target data.

At least one party (source or target) has to reveal its data.

This raises privacy concerns.

How can we transfer knowledge with OT while protecting the privacy of users?



Differential Privacy

ε-differential privacy [Dwork et al., 2006]: a randomized mechanism M : Xⁿ → R^d is ε-differentially private (later generalized to (ε, δ)-differential privacy) if, for any two datasets X, X′ ∈ Xⁿ differing in a single element and for any output t of M,

P(M(X) = t) ≤ e^ε P(M(X′) = t).

Two close datasets correspond to close output distributions.

M(X) does not leak much information about any individual point of X.

Example: a Johnson-Lindenstrauss transform with some added noise.

Idea: use the Johnson-Lindenstrauss transform to obtain a new DP cost matrix!
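
To make the definition concrete, here is the textbook Laplace mechanism for releasing a bounded mean with ε-DP; this classic example is not the mechanism used in this talk (which builds on the JL transform above), and the helper name private_mean is ours.

```python
import numpy as np

def private_mean(X, eps, lo=0.0, hi=1.0, seed=None):
    """Release the mean of X with eps-DP via the Laplace mechanism."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Changing one point moves the clipped mean by at most (hi - lo) / n.
    sensitivity = (hi - lo) / n
    return float(np.mean(np.clip(X, lo, hi)) + rng.laplace(0.0, sensitivity / eps))
```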


DPOT Algorithm

Input: Xs, Xt, and σ, ℓ > 0 (σ and ℓ are known by both parties).

Step 1. The Source generates a k×ℓ matrix M with i.i.d. N(0, 1/ℓ) entries (known by both parties) and a k×ℓ noise matrix ∆ with i.i.d. N(0, σ) entries (known only by the Source).

Step 2. The Source sends M and X̃s + ∆, where X̃s = XsM.

Step 3. The Target computes C = c(X̃s + ∆, X̃t) − ℓσ², where X̃t = XtM (ℓσ² is subtracted from each entry to cancel the bias caused by ∆).

Step 4. Solve OT with cost matrix C and return the coupling P and W(X̃s, X̃t).

[Diagram: DPOT — the Source projects Xs with M and adds the noise ∆; the Target projects Xt with M; OT on the debiased cost yields the coupling matrix.]
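
A minimal single-machine sketch of this protocol, assuming the squared Euclidean cost, the POT library, and that σ denotes the noise standard deviation (so that ℓσ² cancels the expected bias); in a real deployment Steps 1–2 run at the Source and Steps 3–4 at the Target, and the function name dpot_sketch is illustrative.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def dpot_sketch(Xs, Xt, ell=50, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    d = Xs.shape[1]  # feature dimension plays the role of k in the slides
    # Step 1: shared JL projection M (variance-1/ell entries) and
    # source-private Gaussian noise Delta with standard deviation sigma.
    M = rng.normal(0.0, 1.0 / np.sqrt(ell), size=(d, ell))
    Delta = rng.normal(0.0, sigma, size=(Xs.shape[0], ell))
    # Step 2: the Source releases only the noisy projection.
    Xs_released = Xs @ M + Delta
    # Step 3: the Target projects its data and debiases the cost, since
    # E||x_s M + delta - x_t M||^2 = ||x_s M - x_t M||^2 + ell * sigma^2.
    Xt_proj = Xt @ M
    C = ot.dist(Xs_released, Xt_proj, metric="sqeuclidean") - ell * sigma**2
    C = np.maximum(C, 0.0)  # debiasing can leave small negative entries
    # Step 4: solve exact OT with uniform marginals.
    a = np.full(Xs.shape[0], 1.0 / Xs.shape[0])
    b = np.full(Xt.shape[0], 1.0 / Xt.shape[0])
    P = ot.emd(a, b, C)
    return P, float(np.sum(P * C))
```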



Conclusions

OT is a very powerful tool for domain adaptation.

A large variety of possible applications and scenarios (and more remain uncovered).

Quite computationally efficient (though the regularization needed to get there has its own cost).

Try it using the POT library! A minimal example follows.
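
A minimal, self-contained POT example (pip install pot) of OT-based domain adaptation with entropic regularization; the toy data and the choice of SinkhornTransport are illustrative.

```python
import numpy as np
import ot

rng = np.random.default_rng(0)
Xs = rng.normal(size=(100, 2))              # source samples
ys = (Xs[:, 0] > 0).astype(int)             # source labels
Xt = rng.normal(loc=2.0, size=(120, 2))     # shifted target samples

mapping = ot.da.SinkhornTransport(reg_e=1.0)   # entropic-regularized OT
mapping.fit(Xs=Xs, ys=ys, Xt=Xt)
Xs_mapped = mapping.transform(Xs=Xs)        # source mapped to the target domain
# A classifier trained on (Xs_mapped, ys) can now be applied to Xt.
```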


References I

Bredies, K., Lorenz, D. A., and Maass, P. (2009). A generalized conditional gradient method and its connection to an iterative shrinkage method. Computational Optimization and Applications, 42(2):173–193.

Courty, N., Flamary, R., and Tuia, D. (2014). Domain adaptation with regularized optimal transport. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD).

Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2016). Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Cuturi, M. (2013). Sinkhorn distances: Lightspeed computation of optimal transportation. In Neural Information Processing Systems (NIPS), pages 2292–2300.

Ferradans, S., Papadakis, N., Rabin, J., Peyré, G., and Aujol, J.-F. (2013). Regularized discrete optimal transport. In Scale Space and Variational Methods in Computer Vision (SSVM), pages 428–439.


References II

Germain, P., Habrard, A., Laviolette, F., and Morvant, E. (2013). A PAC-Bayesian approach for domain adaptation with specialization to linear classifiers. In ICML, pages 738–746, Atlanta, USA.

Perrot, M., Courty, N., Flamary, R., and Habrard, A. (2016). Mapping estimation for discrete optimal transport. In Neural Information Processing Systems (NIPS).

Tuia, D., Flamary, R., Rakotomamonjy, A., and Courty, N. (2015). Multitemporal classification without new labels: a solution with optimal transport. In 8th International Workshop on the Analysis of Multitemporal Remote Sensing Images.
