Transfer Learning
and
Optimal Transport
Ievgen Redko
UMR • CNRS • 5516 • SAINT-ETIENNE
Ievgen Redko SciDoLySE 1 / 79
Credits and acknowledgments
Documents used for this talk :
D. Xu, K. Saenko, I. Tsang. Tutorial on Domain Transfer Learning for Vision Applications, CVPR’12.
S. Pan, Q. Yang and W. Fan. Tutorial : Transfer Learning with Applications, IJCAI’13.
S. Ben-David. Towards Theoretical Understanding of Domain Adaptation Learning, workshop LNIID at ECML’09.
F. Sha and B. Kingsbury. Domain Adaptation in Machine Learning and Speech Recognition, Tutorial - Interspeech 2012.
K. Grauman. Adaptation for Objects and Attributes, workshop VisDA at ICCV’13.
J. Blitzer and H. Daumé III. Domain Adaptation, Tutorial - ICML 2010.
Acknowledgments : A. Habrard, Rémi Flamary, Nicolas Courty, Devis Tuia, Tien Nam Li, Marc Sebban
Ievgen Redko SciDoLySE 2 / 79
Outline
Introduction
Optimal transport for domain adaptation : problem formulation, regularization framework for domain adaptation, numerical experiments
Mapping estimation for discrete optimal transport : problem formulation, application to domain adaptation, application to seamless copy in images
Optimal transport for target shift : motivation, proposed model, experimental results
Optimal transport for joint distribution adaptation
Other contributions : differentially private OT
Conclusions
Ievgen Redko SciDoLySE 3 / 79
Introduction
Ievgen Redko SciDoLySE 4 / 79
Artificial Intelligence
Ultimate goal : Build systems that can learn by exploring the world.
- Unfortunately, this is not easy (and, for now, almost impossible)
Ievgen Redko SciDoLySE 5 / 79
Goals in AI
Intermediate goal : Build systems that can classify and recognize well
Solution : Use Machine learning (ML) methods = near-human performance
Ievgen Redko SciDoLySE 6 / 79
Issues of Traditional ML
Issues :
- Near-human performance is achieved using lots of labeled data
- Some tasks do not have that much labeled data (biology, physics etc)
- Some data/tasks evolve with time
- There exist too many tasks !
Solution : Transfer learning
+ Use systems built for different but related applications
Ievgen Redko SciDoLySE 7 / 79
Transfer Learning
Definition [Pan, TL-IJCAI’13 tutorial]
Ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel domains/tasks
Example
We have labeled images from a Web image corpus. Is there a Person in unlabeled images from a Video corpus ?
Person no Person
?→
Is there a Person ?
Ievgen Redko SciDoLySE 8 / 79
Settings
Supervised learning vs Transfer learning
Supervised learning : training and test samples come from the same domain
Transfer learning : training and test samples come from different domains
Domains are modeled as probability distributions over an instance space
Tasks associated to a domain (classification, regression, clustering, ...)
Goal
Improve a target predictive function in the target domain using knowledge from the source domain
Ievgen Redko SciDoLySE 9 / 79
A Taxonomy of Transfer Learning
“A survey on Transfer Learning” [Pan and Yang, TKDE 2010]
Ievgen Redko SciDoLySE 10 / 79
In this tutorial
We focus on domain adaptation for classification
How can we learn, using labeled data from a source distribution, a low-error classifier for another related target distribution ?
Why ?
“Hot topic” - tutorials at ICML 2010, CVPR 2012, Interspeech 2012,workshops at ICCV 2013, NIPS 2013, ECML 2014
Many real-world motivating examples
Ievgen Redko SciDoLySE 11 / 79
A toy problem : Inter-twinning moons
[Figure : inter-twinning moons for rotation angles of 10°, 20°, 30°, 40°, 50° and 70°]
Ievgen Redko SciDoLySE 12 / 79
Intuition and motivation : computer vision
“Can we train classifiers with Flickr photos, as they have already been collected and annotated, and hope the classifiers still work well on mobile camera images ?” [Gong et al., CVPR’12]
“Object classifiers optimized on a benchmark dataset often exhibit significant degradation in recognition accuracy when evaluated on another one” [Gong et al., ICML’13, Torralba et al., CVPR’11, Perronnin et al., CVPR’10]
“Hot topic” - Visual domain adaptation [Tutorial CVPR’12, ICCV’13]
Ievgen Redko SciDoLySE 13 / 79
Problems with data representations
[Xu,Saenko,Tsang, Domain Transfer Tutorial - CVPR’12]
Ievgen Redko SciDoLySE 14 / 79
Hard to predict what will change in the new domain
[Xu,Saenko,Tsang, Domain Transfer Tutorial - CVPR’12]
Ievgen Redko SciDoLySE 15 / 79
Natural Language Processing
Part-of-Speech Tagging : adapt a tagger learned from medical papers to a journal corpus
Texts are represented by “words” (bag of words)
Ievgen Redko SciDoLySE 16 / 79
Spam detection
Adapt a classifier from a mailbox of an office worker to that of a hippie musician
Ievgen Redko SciDoLySE 17 / 79
Sentiment analysis
Adapt a classifier predicting the preferences for books to those of DVDs
Ievgen Redko SciDoLySE 18 / 79
Electronics to video games [Pan-IJCAI’13 tutorial]
Electronics Video games
(1) Compact ; easy to operate ; very good picture quality ; looks sharp !
(2) A very good game ! It is action packed and full of excitement. I am very much hooked on this game.
(3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp.
(4) Very realistic shooting action and good plots. We played this and were hooked.
(5) It is also quite blurry in very dark settings. I will never_buy HP again.
(6) It is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.
Source specific : compact, sharp, blurry.
Target specific : hooked, realistic, boring.
Domain independent : good, excited, nice, never_buy, unhappy.
Ievgen Redko SciDoLySE 19 / 79
Other applications
Speech recognition [Tutorial at Interspeech’12]
Medicine
Biology
Time series
Wifi localization
Ievgen Redko SciDoLySE 20 / 79
Why transfer learning ?
Ievgen Redko SciDoLySE 21 / 79
Why transfer learning ? Smart people talk
Ievgen Redko SciDoLySE 22 / 79
Why transfer learning ? Smart people talk
Ievgen Redko SciDoLySE 23 / 79
A bit of vocabulary
Unsupervised Transfer Learning
= No labels in source and target domains
Unsupervised DA
= Presence of source labels, no target labels
Semi-supervised DA
= Presence of source labels, few target labels and a lot of unlabeled data
Semi-supervised learning
= No distribution shift, few labeled data and a lot of unlabeled data from thesame domain
Ievgen Redko SciDoLySE 24 / 79
Several key questions
1. How to estimate the distribution shift ?
2. What are the generalization guarantees ?
R_{P_T}(h) ≤ R_{P_S}(h) + ?
3. When is adaptation possible ?
Ievgen Redko SciDoLySE 25 / 79
Several key questions
4. How to design new algorithms ?
Ievgen Redko SciDoLySE 26 / 79
3 main classes of algorithms
1. Instance-based methods
= Correct a sample bias by reweighting source labeled data : source instances close to target instances are more important
2. Feature-based methods
= Find a common space where source and target are close
3. Adjustment/Iterative methods
= Modify the model by incorporating pseudo-labeled information
Ievgen Redko SciDoLySE 27 / 79
Optimal transport for domain adaptation
Ievgen Redko SciDoLySE 28 / 79
The following slides are courtesy of R. Flamary(https://remi.flamary.com/biblio/presvannes2016.pdf)
Ievgen Redko SciDoLySE 29 / 79
Problem setup
[Figure : Amazon (source domain, with labels) vs DSLR (target domain, no labels) ; after feature extraction, the decision function learned on the source is not working on the target]
Problems
Labels only in the source domain, and classification is in the target domain.
Classifier trained on the source data performs badly in the target domain
Ievgen Redko SciDoLySE 30 / 79
Optimal transport for domain adaptation
Assumptions : there exists a transport T between the source and target domains.
The transport preserves the conditional distributions :
Ps(y|xs) = Pt(y|T(xs)).
3-step strategy
1. Estimate the optimal transport between distributions.
2. Transport the training samples onto the target distribution.
3. Learn a classifier on the transported training samples.
[Figure : two-class dataset ; classifier learned on the source samples ; optimal transport of the source samples ; classifier learned on the transported samples]
Ievgen Redko SciDoLySE 31 / 79
Objective function
Optimization problem
min_{γ∈P} ⟨γ, C⟩_F + λ Ω_s(γ) + η Ω_c(γ),
where
Ω_s(γ) is the entropic regularization [Cuturi, 2013],
η ≥ 0 and Ω_c(·) is a DA regularization term.
Regularization avoids overfitting in high dimension and encodes additional information.
Ievgen Redko SciDoLySE 32 / 79
Entropic regularization
[Figure : optimal coupling matrix γ with entropic regularization]
Ω_s(γ) = Σ_{i,j} γ(i,j) log γ(i,j)
Extremely efficient optimization scheme (Sinkhorn Knopp).
Solution is not sparse anymore.
Ievgen Redko SciDoLySE 33 / 79
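As a concrete illustration, here is a minimal sketch of the exact vs entropy-regularized couplings with the POT library (assuming `ot.emd`, `ot.sinkhorn` and `ot.dist` with their usual signatures ; the toy data and variable names are made up for the example) :

```python
# Minimal sketch: exact OT vs entropy-regularized OT with POT (pip install pot).
import numpy as np
import ot

np.random.seed(0)
Xs = np.random.randn(20, 2)                 # source samples
Xt = np.random.randn(30, 2) + 1.0           # target samples (shifted)

a = np.ones(len(Xs)) / len(Xs)              # uniform source weights
b = np.ones(len(Xt)) / len(Xt)              # uniform target weights
C = ot.dist(Xs, Xt)                         # squared Euclidean cost matrix

gamma_exact = ot.emd(a, b, C)               # exact OT: sparse coupling
gamma_sink = ot.sinkhorn(a, b, C, reg=1e-1) # entropic OT: dense coupling, Sinkhorn-Knopp

print(gamma_exact.shape, gamma_sink.shape)  # both (20, 30)
```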
Class-based regularization [Courty et al., 2016]
[Figure : optimal coupling matrix γ with class-based regularization]
Ω_c(γ) = Σ_j Σ_c ‖γ(I_c, j)‖_p^q ,
Group components of γ using source labels
Target samples receive masses only from “same class” source samples.
Ievgen Redko SciDoLySE 34 / 79
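POT ships this class-based regularization ; a hedged sketch, assuming the `ot.da.SinkhornLpl1Transport` class and its `fit`/`transform` interface (synthetic data, illustrative parameter values) :

```python
# Sketch of group-lasso (class-based) regularized OT for DA with POT.
import numpy as np
import ot

np.random.seed(0)
Xs = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 3])
ys = np.array([0] * 20 + [1] * 20)          # source labels define the groups I_c
Xt = np.vstack([np.random.randn(25, 2) + 0.5, np.random.randn(25, 2) + 3.5])

mapper = ot.da.SinkhornLpl1Transport(reg_e=1e-1, reg_cl=1e0)  # entropic + class reg.
mapper.fit(Xs=Xs, ys=ys, Xt=Xt)             # estimate the regularized coupling
Xs_mapped = mapper.transform(Xs=Xs)         # barycentric mapping of source samples
```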
Laplacian regularization for sample displacement
[Figure : similarity graph with S^s_{i,j} > 0 ; transported samples for small λ and for large λ]
Ω_c(γ) = (1/N_s²) Σ_{i,j} S^s_{i,j} ‖(x̂^s_i − x^s_i) − (x̂^s_j − x^s_j)‖² ,
where x̂^s_i denotes the transported source sample.
Proposed in [Ferradans et al., 2013] for color transfer in images.
Similar samples, as defined by S^s, have similar displacements.
The similarity graph S^s is built using the source labels.
Ievgen Redko SciDoLySE 35 / 79
Optimization problem
min_{γ∈P} ⟨γ, C⟩_F + λ Ω_s(γ) + η Ω_c(γ),
Special cases
η = 0 : Sinkhorn-Knopp [Cuturi, 2013].
λ = 0 and Laplacian regularization : large quadratic program solved with conditional gradient [Ferradans et al., 2013].
Non-convex group lasso ℓp − ℓ1 : Majorization-Minimization with Sinkhorn-Knopp [Courty et al., 2014].
General framework with convex regularization Ω_c(γ)
Can we use the efficient Sinkhorn-Knopp scaling to solve the global problem ?
Yes, using the generalized conditional gradient [Bredies et al., 2009].
Linearization of the second regularization term but not of the entropic regularization.
Ievgen Redko SciDoLySE 36 / 79
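A rough sketch of the generalized conditional gradient loop mentioned above : only the second (non-entropic) regularizer is linearized, and each subproblem is a plain Sinkhorn solve. The quadratic Ω_c used in the usage comment is only a stand-in for the DA regularizers of the previous slides, and the fixed step size replaces a proper line search :

```python
# Generalized conditional gradient (GCG) sketch for
#   min_g <g, C> + lam * Omega_s(g) + eta * Omega_c(g)
# where Omega_s is the entropic term and Omega_c is convex and differentiable.
import numpy as np
import ot

def gcg_regularized_ot(a, b, C, lam, eta, grad_omega_c, n_iter=20):
    g = np.outer(a, b)                        # feasible start: independent coupling
    for it in range(n_iter):
        # linearize Omega_c around the current coupling, keep the entropic term exact
        G = C + eta * grad_omega_c(g)
        g_new = ot.sinkhorn(a, b, G, lam)     # Sinkhorn solves the partially linearized problem
        step = 2.0 / (it + 2.0)               # simple step size (a line search can be used)
        g = (1 - step) * g + step * g_new     # convex combination stays a valid coupling
    return g

# Usage sketch with a stand-in quadratic regularizer Omega_c(g) = ||g||_F^2 (gradient 2*g):
# g = gcg_regularized_ot(a, b, C, lam=1e-1, eta=1e-2, grad_omega_c=lambda g: 2 * g)
```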
Barycentric mapping
How to transport the samples using the obtained coupling matrix ?
Use barycentric mapping
x̂^s_i = argmin_x Σ_j γ*(i,j) c(x, x^t_j)
If c(x, x′) is the squared Euclidean distance then
X̂_S ≃ n_s γ* X_T
Ievgen Redko SciDoLySE 37 / 79
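In code, the barycentric mapping is just a re-normalized matrix product with the coupling ; a small sketch under the uniform-weights assumption above :

```python
# Barycentric mapping of the source samples through a coupling gamma.
import numpy as np
import ot

def barycentric_mapping(Xs, Xt, gamma):
    # each source point goes to the gamma-weighted average of the target points
    weights = gamma.sum(axis=1, keepdims=True)   # row marginals (1/ns for uniform weights)
    return (gamma @ Xt) / weights                # equals ns * gamma @ Xt for uniform weights

# usage sketch:
# a, b = np.ones(len(Xs)) / len(Xs), np.ones(len(Xt)) / len(Xt)
# gamma = ot.emd(a, b, ot.dist(Xs, Xt))
# Xs_hat = barycentric_mapping(Xs, Xt, gamma)
```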
Simulated problem with controllable complexity
[Figure : two moons for rotation angles of 10°, 30°, 50° and 70°]
Two moons problem [Germain et al., 2013]
Two entangled moons with a rotation between domains.
The rotation angle allows control of the adaptation difficulty.
Ievgen Redko SciDoLySE 38 / 79
Results on the two moons dataset
Angle             10     20     30     40     50     70     90
SVM (no adapt.)   0      0.104  0.24   0.312  0.4    0.764  0.828
DASVM             0      0      0.259  0.284  0.334  0.747  0.820
PBDA              0      0.094  0.103  0.225  0.412  0.626  0.687
OT-exact          0      0.028  0.065  0.109  0.206  0.394  0.507
OT-IT             0      0.007  0.054  0.102  0.221  0.398  0.508
OT-GL             0      0      0      0.013  0.196  0.378  0.508
OT-Lap            0      0      0.004  0.062  0.201  0.402  0.524
Average prediction error for adaptation from 10° to 90°.
Clear advantage of the optimal transport techniques.
Regularization helps (a lot) up to 40°.
Ievgen Redko SciDoLySE 39 / 79
Visual adaptation datasets
Digit recognition, MNIST VS USPS (10 classes, d=256, 2 dom.).
Face recognition, PIE Dataset (68 classes, d=1024, 4 dom.).
Object recognition, Caltech-Office dataset (10 classes, d=800/4096, 4 dom.).
Ievgen Redko SciDoLySE 40 / 79
Comparison on vision datasets
                Digits            Faces             Objects
Methods        ACC    Nb best    ACC    Nb best    ACC    Nb best
1NN            48.66  0          26.22  0          28.47  0
PCA            42.94  0          34.55  0          37.98  0
GFK            52.56  0          26.15  0          39.21  0
TSL            47.22  0          36.10  0          42.97  1
JDA            57.30  0          56.69  7          44.34  1
OT-exact       49.96  0          50.47  0          36.69  0
OT-IT          59.20  0          54.89  0          42.30  0
OT-Lap         61.07  0          56.10  3          43.20  0
OT-LpLq        64.11  1          55.45  0          46.42  1
OT-GL          63.90  1          55.88  2          47.70  9
OT works very well on digit and object recognition.
Good but not best on face recognition (-.5% w.r.t. JDA).
Ievgen Redko SciDoLySE 41 / 79
Next step
Limits
Scales at least quadratically with the dataset size.
What about domains with different class proportions ? [Tuia et al., 2015]
Out of sample extension ?
Ievgen Redko SciDoLySE 42 / 79
Mapping estimation for discrete optimal transport
Ievgen Redko SciDoLySE 43 / 79
Mapping estimation for discrete optimal transport
Why estimate the mapping ?
Out of sample problem.
Solving the optimization problem again every time the dataset changes.
Transporting a very large number of samples.
Interpretability (depending on the mapping model).
How to estimate the mapping ?
Go back to Monge formulation ? No !
Fit the barycentric mapping but also introduce smoothness.
Ievgen Redko SciDoLySE 44 / 79
Mapping estimation
Problem formulation [Perrot et al., 2016]
argmin_{T∈H, γ∈P} f(γ, T) = λ_γ ⟨γ, C⟩_F (OT loss) + ‖T(X_s) − n_s γ X_t‖²_F (mapping data fitting) + λ_T R(T) (mapping reg.)
where
X_s and X_t are the source and target datasets,
T(·) is applied row-wise to the matrices above,
n_s γ X_t is the barycentric mapping of the source samples with uniform weights,
H is the space of transformations (more details later),
R(·) is a regularization term controlling the complexity of T.
Ievgen Redko SciDoLySE 45 / 79
Mapping family H
Linear transformations
H = { T : ∀x ∈ Ω, T(x) = xᵀL } .
L is a d × d real matrix.
R(T) = ‖L − I‖²_F , where I is the identity matrix.
Update is a classical linear least squares regression.
Nonlinear transformations
H = { T : ∀x ∈ Ω, T(x) = k_{X_s}(xᵀ) L } , with
k_{X_s}(xᵀ) = ( k(x, x^s_1)  k(x, x^s_2)  · · ·  k(x, x^s_{n_s}) ).
k(·, ·) is a positive definite kernel.
L is an n_s × d real matrix.
Update is a classical kernel least squares regression.
Ievgen Redko SciDoLySE 46 / 79
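POT implements this joint coupling/mapping estimation ; a hedged sketch assuming the `ot.da.MappingTransport` class with `kernel`, `mu` and `eta` parameters (synthetic data, illustrative values) :

```python
# Sketch of joint OT + mapping estimation [Perrot et al., 2016] with POT.
import numpy as np
import ot

np.random.seed(0)
Xs = np.random.randn(40, 2)
Xt = np.random.randn(50, 2) @ np.array([[1.0, 0.3], [0.0, 1.2]]) + 1.0

# linear mapping T(x) = xL, regularized towards the identity
lin = ot.da.MappingTransport(kernel="linear", mu=1.0, eta=1e-2, bias=False)
lin.fit(Xs=Xs, Xt=Xt)
Xs_lin = lin.transform(Xs=Xs)                        # in-sample mapping
Xnew_lin = lin.transform(Xs=np.random.randn(5, 2))   # out-of-sample mapping

# nonlinear (kernelized) mapping
ker = ot.da.MappingTransport(kernel="gaussian", sigma=1.0, mu=1.0, eta=1e-2)
ker.fit(Xs=Xs, Xt=Xt)
Xs_ker = ker.transform(Xs=Xs)
```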
Illustrative example
[Figure : 2D dataset ; linear vs nonlinear mapping ; source and target samples ; barycentric displacement, T displacement, and out-of-sample T]
Clearly a non-linear mapping.
The mapping model controls the barycentric mapping.
Ievgen Redko SciDoLySE 47 / 79
Domain adaptation : Caltech-Office dataset
Task    1NN   GFK   SA    OT    L1L2  OTE   OTLin        OTLinB       OTKer        OTKerB
                                             T     γ      T     γ      T     γ      T     γ
D → W   89.5  93.3  95.6  77.0  95.7  95.7  97.3  97.3   97.3  97.3   98.4  98.5   98.5  98.5
D → A   62.5  77.2  88.5  70.8  74.9  74.8  85.7  85.7   85.8  85.8   89.9  89.9   89.5  89.5
D → C   51.8  69.7  79.0  68.1  67.8  68.0  77.2  77.2   77.4  77.4   69.1  69.2   69.3  69.3
W → D   99.2  99.8  99.6  74.1  94.4  94.4  99.4  99.4   99.8  99.8   97.2  97.2   96.9  96.9
W → A   62.5  72.4  79.2  67.6  71.3  71.3  81.5  81.5   81.4  81.4   78.5  78.3   78.5  78.8
W → C   59.5  63.7  55.0  63.1  67.8  67.8  75.9  75.9   75.4  75.4   72.7  72.7   65.1  63.3
A → D   65.2  75.9  83.8  64.6  70.1  70.5  80.6  80.6   80.4  80.5   65.6  65.5   71.9  71.5
A → W   56.8  68.0  74.6  66.8  67.2  67.3  74.6  74.6   74.4  74.4   66.4  64.8   70.0  68.9
A → C   70.1  75.7  79.2  70.4  74.1  74.3  81.8  81.8   81.6  81.6   84.4  84.4   84.5  84.5
C → D   75.9  79.5  85.0  66.0  69.8  70.2  87.1  87.1   87.2  87.2   70.1  70.0   78.6  78.6
C → W   65.2  70.7  74.4  59.2  63.8  63.8  78.3  78.3   78.5  78.5   80.0  80.4   73.5  73.4
C → A   85.8  87.1  89.3  75.2  76.6  76.7  89.9  89.9   89.7  89.7   82.4  82.2   83.6  83.5
Mean    70.3  77.8  81.9  68.6  74.5  74.6  84.1  84.1   84.1  84.1   79.6  79.4   80.0  79.7
Clear advantage of the mapping estimation methods.
Ievgen Redko SciDoLySE 48 / 79
Seamless copy with gradient adaptation
Poisson image editing with gradient adaptation
Adapt the gradients from the source to the target domain :
∆f = div T_{s→t}(v) over Ω, with f|_∂Ω = f_t|_∂Ω.    (1)
T_{s→t} : R⁶ → R⁶ is the mapping between the gradients of the source and target images in the domain Ω.
Ievgen Redko SciDoLySE 49 / 79
Optimal transport for target shift
Ievgen Redko SciDoLySE 50 / 79
Different reasons to adapt
Covariate shift
adapt when only marginal distributions of inputs change
+ most popular scenario
– conditional distributions may differ in practice
[Figure : source and target input densities under covariate shift]
Ievgen Redko SciDoLySE 51 / 79
Covariate shift and optimal transport (Courty et al., 2014)
Ievgen Redko SciDoLySE 52 / 79
Different reasons to adapt
Target shift
adapt when only marginal distributions of outputs change
+ occurs in many real-world applications (imbalanced data)
+ few contributions in the literature
– assumes the same distribution of inputs
[Figure : source and target domains with shifted class proportions]
Ievgen Redko SciDoLySE 53 / 79
Different reasons to adapt
Sample-selection bias, source component shift, domain shift etc.
rarely studied (and very difficult) cases
[Figure : source and target domains under sample-selection bias]
Ievgen Redko SciDoLySE 54 / 79
Wish list
General multi-source setting (more than 2 source domains)
Handling covariate shift between each source-target pair
Tackling target shift
Ievgen Redko SciDoLySE 55 / 79
First insight
How to tackle target shift ? Take 1
Assume that the target and source distributions are defined as
P_T = Σ_{i=1}^{C} π^T_i P_i ,    P^π_S = Σ_i π_i P_i ,
with P_i being the distribution of class i ∈ {1, . . . , C}. We want to solve Problem 1 :
π⋆ = argmin_{π ∈ Δ_C} W(P^π_S, P_T)
Ievgen Redko SciDoLySE 56 / 79
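A naive sketch of Problem 1 : estimate the target class proportions by minimizing the Wasserstein distance between the reweighted source samples and the target samples. It assumes POT's `ot.emd2` and SciPy's SLSQP solver ; this is a plain numerical solver for illustration, not the Bregman-projection algorithm used later :

```python
# Estimate target class proportions pi by minimizing W(P_S^pi, P_T).
import numpy as np
import ot
from scipy.optimize import minimize

def estimate_proportions(Xs, ys, Xt):
    classes = np.unique(ys)
    masks = [ys == c for c in classes]
    C = ot.dist(Xs, Xt)                         # ground cost (squared Euclidean)
    b = np.ones(len(Xt)) / len(Xt)              # uniform target weights

    def wasserstein(pi):
        a = np.zeros(len(Xs))
        for p, m in zip(pi, masks):
            a[m] = p / m.sum()                  # give class c a total mass pi_c
        a = a / a.sum()                         # keep it a probability vector
        return ot.emd2(a, b, C)                 # exact OT cost

    pi0 = np.ones(len(classes)) / len(classes)
    res = minimize(wasserstein, pi0, method="SLSQP",
                   bounds=[(0, 1)] * len(classes),
                   constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1}])
    return res.x                                # estimated proportions on the simplex
```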
First insight
How to tackle target shift ? Take 1
We can prove the following
Assume that for all i there is no α ∈ Δ_C with α_i = 0 such that P_i = Σ_j α_j P_j.
Then, for any target distribution P_T, the unique solution π⋆ to Problem 1 is given by π^T.
Wasserstein distance is a good candidate for proportion estimation !
Ievgen Redko SciDoLySE 57 / 79
Second insight
How to tackle target shift ? Take 2
Consider a multi-source setting with a weighted source distribution
We can show that in case of binary classification
error in target domain ≤ overall source error + (distance between classes) × (distance between source and target class proportions)
More formally, for class distributions P_0 and P_1 :
(distance between classes) × (distance between source and target class proportions) = dist(P_0, P_1) × |π^T − Σ_{j=1}^{N} α_j π^j_S|
Reweight source samples to match target proportions !
Ievgen Redko SciDoLySE 58 / 79
Some notations
Data matrices X(k) ∼ µ(k) from K source domains
Data matrix X ∼ µ from target domain
Source class proportions h^(k)_c = ∫ µ^(k)_c , with µ^(k) = Σ_{c=1}^{C} µ^(k)_c
Linear operators D^(k)_1 and D^(k)_2 that transform the weight vector m^(k) into the class proportions h^(k)_c and back
Ievgen Redko SciDoLySE 59 / 79
Objective function
Putting it all together
h⋆ = argmin_{h ∈ Δ_C, Γ} Σ_{k=1}^{K} λ_k KL(γ^(k) | ζ^(k))
s.t. ∀k, D^(k)_1 γ^(k) 1_n = h.
A constrained Wasserstein barycenter problem
h⋆ reweights source instances to match source and target distributions
Efficient optimization with Bregman projections
Ievgen Redko SciDoLySE 60 / 79
Classification in the target domain
How would we obtain target labels ?
Before : Barycentric mapping
+ Accurate alignment of samples
– Computationally costly : aligning + learning a classifier
Alternative : Label propagation
L = Σ_{k=1}^{K} λ_k D^(k)_1 γ^(k)
+ Directly obtains target labels
+ Majority vote by source domains = can be seen as boosting
Ievgen Redko SciDoLySE 61 / 79
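The label-propagation step above amounts to pushing one-hot source labels through the couplings ; a minimal sketch (the variable names and the uniform source weighting are illustrative) :

```python
# Multi-source label propagation through OT couplings.
import numpy as np

def propagate_labels(gammas, source_labels, n_classes, lambdas=None):
    """gammas[k]        : coupling of source k with the target (n_k x n_t)
       source_labels[k] : labels of source k (length n_k)
       returns hard target labels by weighted majority vote over sources."""
    n_t = gammas[0].shape[1]
    if lambdas is None:
        lambdas = np.ones(len(gammas)) / len(gammas)
    scores = np.zeros((n_t, n_classes))
    for lam, gamma, ys in zip(lambdas, gammas, source_labels):
        onehot = np.eye(n_classes)[ys]          # (n_k x n_classes) indicator matrix
        scores += lam * gamma.T @ onehot        # mass received by each target point, per class
    return scores.argmax(axis=1)
```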
Illustration
Covariate shift DA mixes instances from different classes !
Ievgen Redko SciDoLySE 62 / 79
Illustration
Proposed method handles target shift efficiently !
Ievgen Redko SciDoLySE 63 / 79
Real-world data
‘Zurich Summer’ dataset composed of 20 satellite images
4 classes : Roads, Buildings, Trees and Grass
17 source and 1 target domain
Average class proportions [0.25± 0.07, 0.4± 0.13, 0.22± 0.11, 0.13± 0.11]
Ievgen Redko SciDoLySE 64 / 79
Examples of images
Input satellite images
Ievgen Redko SciDoLySE 65 / 79
Examples of images
Input satellite images
Satellite images with 4 classes
Ievgen Redko SciDoLySE 65 / 79
Classification results
# of source  Average class          # of source  No          OTDA  OTDA  MDA     JCPOT  Target
domains      proportions            instances    adaptation  PT    LP    Causal  LP     only
2            [0.17 0.40 0.16 0.27]  2 936        0.61        0.52  0.57  0.65    0.66   0.65
5            [0.22 0.39 0.18 0.21]  6 716        0.62        0.55  0.60  0.66    0.68   0.64
8            [0.25 0.46 0.17 0.12]  16 448       0.63        0.54  0.59  0.67    0.71   0.65
11           [0.26 0.48 0.16 0.10]  21 223       0.63        0.54  0.58  0.67    0.72   0.673
14           [0.26 0.45 0.19 0.10]  27 875       0.63        0.52  0.58  0.67    0.72   0.65
17           [0.25 0.42 0.20 0.13]  32 660       0.63        0.50  0.59  0.67    0.73   0.61
Ievgen Redko SciDoLySE 66 / 79
Optimal transport for joint distribution adaptation
The following slides are courtesy of R. Flamary (OTML workshop, NIPS’17) : http://otml17.marcocuturi.net/wp-content/uploads/2018/01/OTML_NIPS_2017.pdf
Ievgen Redko SciDoLySE 67 / 79
Joint distribution and classifier estimation
Objectives of JDOT
• Model the transformation of labels (allow change of proportion/value).
• Learn an optimal target predictor with no labels on target samples.
• Approach theoretically justified.
Joint distributions and dataset
• We work with the joint feature/label distributions.
• Let Ω ⊂ R^d be a compact input measurable space of dimension d and C the set of labels.
• Let P_s(X, Y) ∈ P(Ω × C) and P_t(X, Y) ∈ P(Ω × C) be the source and target joint distributions.
• We have access to an empirical sampling P̂_s = (1/N_s) Σ_{i=1}^{N_s} δ_{x^s_i, y^s_i} of the source distribution, defined by the samples X_s = {x^s_i}_{i=1}^{N_s} and label information Y_s = {y^s_i}_{i=1}^{N_s}.
• But the target domain is defined only by an empirical distribution in the feature space, with samples X_t = {x^t_i}_{i=1}^{N_t}.
14 / 29
Joint distribution OT (JDOT)
Proxy joint distribution
• Let f be an Ω → C function from a given class of hypotheses H.
• We define the following joint distribution that uses f as a proxy of y
  P^f_t = (x, f(x))_{x∼µ_t}    (5)
and its empirical counterpart P̂^f_t = (1/N_t) Σ_{i=1}^{N_t} δ_{x^t_i, f(x^t_i)} .
Learning with JDOT
We propose to learn the predictor f that minimizes :
  min_f W_1(P̂_s, P̂^f_t) = inf_{γ∈Δ} Σ_{ij} D(x^s_i, y^s_i ; x^t_j, f(x^t_j)) γ_{ij}    (6)
• Δ is the transport polytope.
• D(x^s_i, y^s_i ; x^t_j, f(x^t_j)) = α‖x^s_i − x^t_j‖² + L(y^s_i, f(x^t_j)) with α > 0.
• We search for the predictor f that best aligns the joint distributions.
15 / 29
Optimization problem
min_{f∈H, γ∈Δ} Σ_{i,j} γ_{i,j} ( α d(x^s_i, x^t_j) + L(y^s_i, f(x^t_j)) ) + λ Ω(f)    (7)
Optimization procedure
• Ω(f) is a regularization term for the predictor f.
• We propose to use block coordinate descent (BCD) / Gauss-Seidel.
• Provably converges to a stationary point of the problem.
γ update for a fixed f
• Classical OT problem.
• Solved by network simplex.
• Regularized OT can be used (add a term to problem (7)).
f update for a fixed γ
  min_{f∈H} Σ_{i,j} γ_{i,j} L(y^s_i, f(x^t_j)) + λ Ω(f)    (8)
• Weighted loss from all source labels.
• γ performs label propagation.
19 / 29
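A compact sketch of the BCD loop for JDOT in the regression case (squared loss, ridge predictor) : the γ-step is a plain OT solve on the joint cost, and the f-step is the weighted least-squares problem of eq. (8)/(9). Function names and hyper-parameters are illustrative, not from the paper :

```python
# JDOT block coordinate descent sketch (regression, squared loss).
import numpy as np
import ot
from sklearn.linear_model import Ridge

def jdot_regression(Xs, ys, Xt, alpha=1.0, lam=1.0, n_iter=10):
    ns, nt = len(Xs), len(Xt)
    a, b = np.ones(ns) / ns, np.ones(nt) / nt
    Cx = ot.dist(Xs, Xt)                              # feature part of the joint cost
    f = Ridge(alpha=lam).fit(Xs, ys)                  # warm start on the source
    for _ in range(n_iter):
        # gamma update for fixed f: OT on alpha * d(xs, xt) + L(ys, f(xt))
        Cy = (ys[:, None] - f.predict(Xt)[None, :]) ** 2
        gamma = ot.emd(a, b, alpha * Cx + Cy)
        # f update for fixed gamma: weighted least squares of eq. (9)
        y_hat = nt * gamma.T @ ys                     # propagated targets for each x_t
        f = Ridge(alpha=lam).fit(Xt, y_hat)
    return f, gamma
```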
Regression with JDOT
[Figure : toy regression distributions, source/target models, joint OT matrices, and the model estimated with JDOT]
Least squares regression with quadratic regularization
For a fixed γ the optimization problem is equivalent to
  min_{f∈H} Σ_j (1/n_t) ‖ŷ_j − f(x^t_j)‖² + λ‖f‖²    (9)
• ŷ_j = n_t Σ_i γ_{i,j} y^s_i is a weighted average of the source output values.
• Note that the number of terms is now linear (n_t) instead of quadratic (n_s n_t).
• Can use any solver (linear, kernel ridge, neural network).
20 / 29
Classification with JDOT
[Figure : accuracy along BCD iterations for α ∈ {0.1, 0.5, 1, 10, 50, 100}]
Multiclass classification with hinge loss
For a fixed γ the optimization problem is equivalent to
  min_{f_k∈H} Σ_{j,k} P̂_{j,k} L(1, f_k(x^t_j)) + (1 − P̂_{j,k}) L(−1, f_k(x^t_j)) + λ Σ_k ‖f_k‖²    (10)
• P̂ is the class proportion matrix P̂ = N_t γ^⊤ P^s.
• P^s and Y^s are defined from the source data with a one-vs-all strategy as
  Y^s_{i,k} = 1 if y^s_i = k, −1 otherwise ;    P^s_{i,k} = 1 if y^s_i = k, 0 otherwise,
with k ∈ {1, · · · , K} and K being the number of classes.
21 / 29
Caltech-Office classification dataset
Domains Base SurK SA OT-IT OT-MM JDOT
caltech→amazon 92.07 91.65 90.50 89.98 92.59 91.54
caltech→webcam 76.27 77.97 81.02 80.34 78.98 88.81
caltech→dslr 84.08 82.80 85.99 78.34 76.43 89.81
amazon→caltech 84.77 84.95 85.13 85.93 87.36 85.22
amazon→webcam 79.32 81.36 85.42 74.24 85.08 84.75
amazon→dslr 86.62 87.26 89.17 77.71 79.62 87.90
webcam→caltech 71.77 71.86 75.78 84.06 82.99 82.64
webcam→amazon 79.44 78.18 81.42 89.56 90.50 90.71
webcam→dslr 96.18 95.54 94.90 99.36 99.36 98.09
dslr→caltech 77.03 76.94 81.75 85.57 83.35 84.33
dslr→amazon 83.19 82.15 83.19 90.50 90.50 88.10
dslr→webcam 96.27 92.88 88.47 96.61 96.61 96.61
Mean 83.92 83.63 85.23 86.02 86.95 89.04
Avg. rank 4.50 4.75 3.58 3.00 2.42 2.25
• Classical dataset [Saenko et al., 2010] dedicated to visual adaptation.
• Feature extraction by convolutional neural network [Donahue et al., 2014].
• Comparison with Surrogate Kernel [Zhang et al., 2013], Subspace Alignment
[Fernando et al., 2013] and OT Domain Adaptation [Courty et al., 2016b].
• Parameter selected via reverse cross-validation [Zhong et al., 2010].
• SVM (Hinge loss) classifiers with linear kernel.
• Best ranking method and 2% accuracy gain on average.
22 / 29
Amazon Review Classification dataset
Domains NN DANN JDOT (mse) JDOT (Hinge)
books→dvd 0.805 0.806 0.794 0.795
books→kitchen 0.768 0.767 0.791 0.794
books→electronics 0.746 0.747 0.778 0.781
dvd→books 0.725 0.747 0.761 0.763
dvd→kitchen 0.760 0.765 0.811 0.821
dvd→electronics 0.732 0.738 0.778 0.788
kitchen→books 0.704 0.718 0.732 0.728
kitchen→dvd 0.723 0.730 0.764 0.765
kitchen→electronics 0.847 0.846 0.844 0.845
electronics→books 0.713 0.718 0.740 0.749
electronics→dvd 0.726 0.726 0.738 0.737
electronics→kitchen 0.855 0.850 0.868 0.872
Mean 0.759 0.763 0.783 0.787
• The dataset aims at predicting reviews across domains [Blitzer et al., 2006].
• Comparison with Domain adversarial neural network [Ganin et al., 2016a].
• Classifier f is a neural network with same architecture as DANN.
• JDOT has better accuracy ; the classification (hinge) loss works better than the mean squared error.
23 / 29
Wifi localization regression dataset
Domains KRR SurK DIP DIP-CC GeTarS CTC CTC-TIP JDOT
t1 → t2 80.84±1.14 90.36±1.22 87.98±2.33 91.30±3.24 86.76 ± 1.91 89.36±1.78 89.22±1.66 93.03 ± 1.24
t1 → t3 76.44±2.66 94.97±1.29 84.20±4.29 84.32±4.57 90.62±2.25 94.80±0.87 92.60 ± 4.50 90.06 ± 2.01
t2 → t3 67.12±1.28 85.83 ± 1.31 80.58 ± 2.10 81.22 ± 4.31 82.68 ± 3.71 87.92 ± 1.87 89.52 ± 1.14 86.76 ± 1.72
hallway1 60.02 ±2.60 76.36 ± 2.44 77.48 ± 2.68 76.24± 5.14 84.38 ± 1.98 86.98 ± 2.02 86.78 ± 2.31 98.83±0.58
hallway2 49.38 ± 2.30 64.69 ±0.77 78.54 ± 1.66 77.8± 2.70 77.38 ± 2.09 87.74 ± 1.89 87.94 ± 2.07 98.45±0.67
hallway3 48.42 ±1.32 65.73 ± 1.57 75.10± 3.39 73.40± 4.06 80.64 ± 1.76 82.02± 2.34 81.72 ± 2.25 99.27±0.41
• Objective is to predict position of a device on a discretized grid
[Zhang et al., 2013].
• Same experimental protocol as [Zhang et al., 2013, Gong et al., 2016].
• Comparison with domain-invariant projection and its cluster regularized version
([Baktashmotlagh et al., 2013], DIP and DIP-CC), generalized target shift
([Zhang et al., 2015], GeTarS), and conditional transferable components, with its
target information preservation regularization ([Gong et al., 2016], CTC and
CTC-TIP).
• JDOT solves the adaptation problem for transfer across devices (10% accuracy gain on Hallway).
24 / 29
Large scale JDOT Strategy
Large scale JDOT
• JDOT does not scale well to large datasets / deep learning.
• Use minibatches for computing the transport in the primal [Genevay et al., 2017].
• Evaluate batch-local couplings on (sufficiently large) pairs of random batches (drawn without replacement) in the source and target domains.
• Update f from these couplings.
Algorithm : Deep JDOT
Input : source data Xs, ys ; target data Xt
for BCD iterations do
  for each source/target minibatch do
    Solve OT with the JDOT loss
    Perform label propagation on the minibatch
  end for
  Update the model f on one epoch
end for
25 / 29
Large scale datasets
Description MNIST→ USPS USPS→MNIST SVHN→MNIST MNIST→ MNIST-M
Source samples 60000 9298 73257 60000
Target samples 9298 60000 60000 60000
height/width 16×16 16×16 32×32×3 28×28×3
• Four cross-domain digit datasets : MNIST, USPS, SVHN, MNIST-M.
• We consider a deep convolutional architecture.
• Dropout is used on the dense layers during training.
• Transport distance computed in the raw image space.
26 / 29
Experimental Results for large scale JDOT
Methods MNIST→ USPS USPS→MNIST SVHN→MNIST MNIST→ MNIST-M
Source only (SO) 86.18 58.73 53.15 59.52
DeepCoral [Sun and Saenko, 2016] 88.43 (22.0) 85.02 (64.6) 69.61 (35.6) 62.18 (0.07)
MMD [Long and Wang, 2015] 89.89 (36.3) 79.19 (50.3) 53.27 (0.01) 52.53 (-19.1)
DANN [Ganin et al., 2016b] 89.06 (28.2) 87.03 (70.0) 73.85∗ (44.7) 76.63 (46.6)
ADDA [Tzeng et al., 2017] 91.22 (49.3) 79.98 (52.2) 76.0∗ (49.4) 79.16 (53.5)
DeepJDOT 91.50 (52.01) 91.21 (79.82) 83.62 (65.85) 67.84 (22.67)
Train on Target (TO) 96.41 99.42 99.42 96.21
• Accuracy in % of the DA methods.
• The values in () represent the coverage gap between SO (source only) and TO (golden performance if the model is learnt on target labelled data), computed as (DA − SO) / (TO − SO).
• DeepJDOT is better in 3 out of 4 DA problems.
• Plots represent test performances along the BCD iterations.
27 / 29
Other contributions :
theoretical guarantees for DA with OT
Ievgen Redko SciDoLySE 68 / 79
Why does domain adaptation work ?
Intuition : If two domains are similar then the adaptation should be easy ?
Answer : Yes, due to the following result [Ben-David et al., 2007] :
error in target domain (= what we want to learn)
  ≤ error in source domain (= can be learned well using ML)
  + how different the two domains are (= divergence between the two domains)
  + is it possible to adapt ? (= a non-estimable term)
Can the Wasserstein distance be introduced into these bounds ?
Ievgen Redko SciDoLySE 69 / 79
Several learning bounds
With the Wasserstein distance as a divergence term :
between marginal distributions µ_S and µ_T [Redko et al., 2017, Shen et al., 2018] :
  R_T(h) ≤ R_S(h) + W(µ_S, µ_T) + O(1/√n) + non-estimable term
between joint distributions P_S and P_T [Courty et al., 2017] :
  R_T(h) ≤ R_S(h) + W(P_S, P_T) + O(1/√n) + non-estimable term + Prob. Transfer Lipschitzness term
Ievgen Redko SciDoLySE 70 / 79
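In practice the divergence term of these bounds can be estimated directly from samples ; a small sketch computing the empirical Wasserstein-1 distance between the marginals with POT, as a rough proxy of how hard the adaptation is :

```python
# Empirical Wasserstein-1 distance between source and target marginals.
import numpy as np
import ot

def wasserstein_divergence(Xs, Xt, reg=None):
    a = np.ones(len(Xs)) / len(Xs)
    b = np.ones(len(Xt)) / len(Xt)
    M = ot.dist(Xs, Xt, metric="euclidean")   # W1 uses the ground metric itself
    if reg is None:
        return ot.emd2(a, b, M)               # exact transport cost
    return ot.sinkhorn2(a, b, M, reg)         # entropic approximation
```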
Other contributions :
differentially private OT
Ievgen Redko SciDoLySE 71 / 79
Privacy in Optimal Transport and Domain Adaptation
Challenge : to perform optimal transport for domain adaptation
One needs to have access to both source and target data.
At least one party (source or target) has to reveal data.
It raises privacy concerns.
How to transfer knowledge with OT while protecting the privacy of users ?
Ievgen Redko SciDoLySE 72 / 79
Privacy in Optimal Transport and Domain Adaptation
Ievgen Redko SciDoLySE 73 / 79
Differential Privacy
ε-differential privacy [Dwork et al., 2006]
A randomized mechanism M : X^n → R^d is ε-differentially private (generalized later on to (ε, δ)-differential privacy) if, for any two datasets X, X′ ∈ X^n differing in a single element and for any output t of M :
P(M(X) = t) ≤ e^ε P(M(X′) = t).
Two close datasets correspond to close distributions.
M(X) does not leak much information about any individual point from X.
Example : Johnson-Lindenstrauss transform with some noise.
Idea : Use Johnson-Lindenstrauss transform to obtain a new DP cost matrix !
Ievgen Redko SciDoLySE 74 / 79
DPOT Algorithm
Input : Xs, Xt, and σ, ℓ > 0 (σ and ℓ known by both parties)
Step 1. Source generates an N(0, 1/ℓ)^{k×ℓ} matrix M (known by both parties) and an N(0, σ)^{k×ℓ} noise matrix Δ (only known by the Source).
Step 2. Source sends M and X̃s + Δ, where X̃s = Xs M.
Step 3. Target computes C = c(X̃s + Δ, X̃t) − ℓσ², where X̃t = Xt M (ℓσ² is subtracted from each entry to cancel the bias caused by Δ).
Step 4. Solve OT with cost matrix C and return the coupling P and W(Xs, Xt).
[Figure : DPOT pipeline — both parties project their data with M, the Source adds the noise Δ to its projected data, and the Target computes the coupling matrix]
Ievgen Redko SciDoLySE 75 / 79
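A rough sketch of the DPOT steps above (random projection, additive Gaussian noise, bias-corrected squared-Euclidean cost). The matrix shapes, the clipping and the noise scale are my own reading of the slide, not the exact protocol :

```python
# DPOT sketch: private cost matrix via a noisy Johnson-Lindenstrauss projection.
import numpy as np
import ot
from scipy.spatial.distance import cdist

def dpot(Xs, Xt, sigma=0.1, ell=50, seed=0):
    rng = np.random.default_rng(seed)
    d = Xs.shape[1]
    M = rng.normal(0.0, np.sqrt(1.0 / ell), size=(d, ell))          # shared JL projection
    Xs_proj = Xs @ M + rng.normal(0.0, sigma, size=(len(Xs), ell))  # Source adds noise
    Xt_proj = Xt @ M                                                # Target projects its own data
    C = cdist(Xs_proj, Xt_proj, "sqeuclidean") - ell * sigma**2     # remove the noise bias
    C = np.maximum(C, 0.0)                                          # keep the cost non-negative
    a = np.ones(len(Xs)) / len(Xs)
    b = np.ones(len(Xt)) / len(Xt)
    P = ot.emd(a, b, C)                                             # private coupling
    return P, np.sum(P * C)                                         # coupling and transport cost
```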
Conclusions
Ievgen Redko SciDoLySE 76 / 79
Conclusions
OT is a very powerful tool for domain adaptation
A large variety of possible applications/scenarios (more remain uncovered)
Quite computationally efficient (but costly regularization is needed)
Try it using the POT library !
Ievgen Redko SciDoLySE 77 / 79
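For instance, a complete OT-based domain adaptation pipeline takes a few lines with POT plus scikit-learn ; a hedged sketch assuming `ot.da.SinkhornTransport` and its `fit`/`transform` interface :

```python
# End-to-end OTDA sketch: transport the source samples, then train a classifier on them.
import numpy as np
import ot
from sklearn.neighbors import KNeighborsClassifier

def otda_predict(Xs, ys, Xt, reg=1e-1):
    mapper = ot.da.SinkhornTransport(reg_e=reg)
    mapper.fit(Xs=Xs, ys=ys, Xt=Xt)              # estimate the entropic coupling
    Xs_mapped = mapper.transform(Xs=Xs)          # barycentric mapping onto the target
    clf = KNeighborsClassifier(n_neighbors=1).fit(Xs_mapped, ys)
    return clf.predict(Xt)                       # labels for the (unlabeled) target samples
```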
References I
Bredies, K., Lorenz, D. A., and Maass, P. (2009).
A generalized conditional gradient method and its connection to an iterativeshrinkage method.
Computational Optimization and Applications, 42(2) :173–193.
Courty, N., Flamary, R., and Tuia, D. (2014).
Domain adaptation with regularized optimal transport.
In European Conference on Machine Learning and Principles and Practice ofKnowledge Discovery in Databases (ECML PKDD).
Courty, N., Flamary, R., Tuia, D., and Rakotomamonjy, A. (2016).
Optimal transport for domain adaptation.
Pattern Analysis and Machine Intelligence, IEEE Transactions on.
Cuturi, M. (2013).
Sinkhorn distances : Lightspeed computation of optimal transportation.
In Neural Information Processing Systems (NIPS), pages 2292–2300.
Ferradans, S., Papadakis, N., Rabin, J., Peyré, G., and Aujol, J.-F. (2013).
Regularized discrete optimal transport.
In Scale Space and Variational Methods in Computer Vision, SSVM, pages 428–439.
Ievgen Redko SciDoLySE 78 / 79
References II
Germain, P., Habrard, A., Laviolette, F., and Morvant, E. (2013).
A PAC-Bayesian Approach for Domain Adaptation with Specialization to LinearClassifiers.
In ICML, pages 738–746, Atlanta, USA.
Perrot, M., Courty, N., Flamary, R., and Habrard, A. (2016).
Mapping estimation for discrete optimal transport.
In Neural Information Processing Systems (NIPS).
Tuia, D., Flamary, R., Rakotomamonjy, A., and Courty, N. (2015).
Multitemporal classification without new labels : a solution with optimal transport.
In 8th International Workshop on the Analysis of Multitemporal Remote SensingImages.
Ievgen Redko SciDoLySE 79 / 79