Learning New Tricks From Old Dogs: Multi-Source Transfer Learning From Pre-Trained Networks

Joshua Lee¹, Prasanna Sattigeri², and Gregory Wornell¹
¹ Dept. EECS, MIT    ² MIT-IBM Watson AI Lab, IBM Research

Problem Setup

Consider an ensemble of features $f_1(x), \ldots, f_k(x)$ extracted from networks pre-trained on unknown objectives, and a target classification task with data $X \in \mathcal{X}$ and labels $Y \in \mathcal{Y}$. We wish to train a classifier on very few samples using the pre-trained features without altering the existing networks (black-box feature access).

Examples:
• Distributed transfer learning (e.g. learning from multiple mobile devices, each with their own network)
• Rapid adaptation to new environments with multiple candidate source models to transfer from
• Learning from old networks for which the original training data is lost

Maximal Correlation Functions

• Hirschfeld-Gebelein-Rényi (HGR) Maximal Correlation:

$$
\sigma = \sup_{\substack{f:\mathcal{X}\to\mathbb{R},\ g:\mathcal{Y}\to\mathbb{R} \\ \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0 \\ \mathbb{E}[f^2(X)] = \mathbb{E}[g^2(Y)] = 1}} \mathbb{E}\!\left[f(X)\, g(Y)\right]
$$

The optimal $f$ and $g$ are maximal correlation functions, and have been shown to be universally optimal in an information-preserving sense [2].

• For a fixed $f$, the optimal $g$ is given by [1]:

$$
g(y) \propto \mathbb{E}_{P_{X|Y}}\!\left[f(x) \mid y\right]
$$
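As a concrete illustration of the definition above: for finite alphabets, the maximal correlation has a closed form as the second-largest singular value of the matrix $B(x,y) = P_{X,Y}(x,y)/\sqrt{P_X(x)\,P_Y(y)}$. The following NumPy sketch is our own illustration of that standard fact, not part of the poster:

```python
import numpy as np

def hgr_maximal_correlation(P_xy):
    """HGR maximal correlation of a finite joint pmf P_xy (|X| x |Y| array).

    Uses the SVD characterization: sigma is the second-largest singular value
    of B(x, y) = P(x, y) / sqrt(P(x) P(y)); the largest (equal to 1)
    corresponds to the constant functions excluded by the zero-mean constraint.
    """
    P_x = P_xy.sum(axis=1)
    P_y = P_xy.sum(axis=0)
    B = P_xy / np.sqrt(np.outer(P_x, P_y))
    singular_values = np.linalg.svd(B, compute_uv=False)
    return singular_values[1]

# Example: a correlated pair of binary variables.
P = np.array([[0.4, 0.1],
              [0.1, 0.4]])
print(hgr_maximal_correlation(P))  # -> 0.6
```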

The Max. Corr. Weighting (MCW) Method

• Given multiple pre-trained feature functions $f_1, \ldots, f_k$, the Maximal Correlation Objective is given by:

$$
\mathcal{L} = \mathbb{E}_{\hat{P}^{t}_{X,Y}}\!\left[F^{T}(X)\, G(Y)\right] \tag{1}
$$

• This objective separates out as:

$$
\mathcal{L} = \sum_{i,n} \mathbb{E}_{\hat{P}^{t}_{X,Y}}\!\left[f_i^{s_n}(X)\, g_i^{s_n}(Y)\right] \tag{2}
$$

• We can solve each term separately to find the associated $g_1, \ldots, g_k$ and $\sigma_1, \ldots, \sigma_k$ by taking conditional expectations over the empirical distribution of target samples:

$$
g_i(y) \propto \mathbb{E}_{\hat{P}^{t}_{X|Y}}\!\left[f_i(x) \mid y\right], \qquad
\sigma_i = \mathbb{E}_{\hat{P}^{t}_{X,Y}}\!\left[f_i(x)\, g_i(y)\right]
$$

• We then construct the approximate distribution:

$$
\tilde{P}^{t}_{Y|X}(y \mid x) = \hat{P}^{t}_{Y}(y)\left(1 + \sum_{i=1}^{k} \sigma_i\, f_i(x)\, g_i(y)\right)
$$

and apply an ML estimator to predict $y$ given $x$.
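A minimal NumPy sketch of the per-source estimation step above, assuming one black-box source whose (possibly vector-valued) features have been evaluated on the few-shot target samples. The variable names (`feats`, `labels`, `n_classes`) and the exact normalization are our own choices, not taken from the poster:

```python
import numpy as np

def fit_mcw_source(feats, labels, n_classes):
    """Estimate (g_i, sigma_i) for one source from few-shot target samples.

    feats  : (n_samples, d) array of features f_i(x) from a frozen source network
    labels : (n_samples,) integer target labels y
    Returns g of shape (n_classes, d), with g_i(y) as rows, and the scalar sigma_i.
    """
    # Standardize so that, empirically, E[f] ~ 0 and E[f^2] ~ 1 per dimension.
    f = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

    # g_i(y) proportional to E[f_i(X) | Y = y]: class-conditional feature means.
    g = np.stack([f[labels == y].mean(axis=0) for y in range(n_classes)])

    # Center and normalize g under the empirical label distribution.
    p_y = np.bincount(labels, minlength=n_classes) / len(labels)
    g -= p_y @ g
    g /= np.sqrt((p_y[:, None] * g**2).sum(axis=0)) + 1e-8

    # sigma_i = E[f_i(X) g_i(Y)] over the empirical target distribution
    # (summed over feature dimensions).
    sigma = float(np.mean(np.sum(f * g[labels], axis=1)))
    return g, sigma
```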
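A companion sketch of the prediction rule: combine the per-source terms into the approximate posterior $\tilde{P}^{t}_{Y|X}$ and return the most likely label. It consumes the outputs of `fit_mcw_source` above; again this is an illustrative reconstruction, not the authors' code.

```python
import numpy as np

def mcw_predict(feats_list, g_list, sigma_list, p_y):
    """MCW prediction over n_samples test points.

    feats_list : per-source (n_samples, d_i) feature arrays, normalized as in fit_mcw_source
    g_list     : per-source (n_classes, d_i) arrays of g_i(y)
    sigma_list : per-source scalars sigma_i
    p_y        : (n_classes,) empirical target label distribution
    """
    n_samples = feats_list[0].shape[0]
    score = np.ones((n_samples, len(p_y)))
    for f, g, sigma in zip(feats_list, g_list, sigma_list):
        score += sigma * (f @ g.T)      # sigma_i * <f_i(x), g_i(y)>
    posterior = p_y[None, :] * score    # P_Y(y) * (1 + sum_i sigma_i f_i(x) g_i(y))
    return posterior.argmax(axis=1)
```

Everything here is closed form: the method only needs forward passes through the frozen source networks plus a few averages, which is what makes it usable with very few target samples.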

Experimental Setup

• We ran experiments on three image datasets: (a) CIFAR-100, (b) Stanford Dogs, and (c) Tiny ImageNet.
• For each dataset, we divide the images into a set of smaller, mutually exclusive classification tasks.
• We select one task as the target and the remainder as the sources.
• For each source task, we train a neural net with the LeNet architecture [3] for classification.
• We use the penultimate layer of these nets as feature functions (sketched below), and compute MCW parameters with respect to the target task.
• We compare the classification accuracy with that of an SVM trained on the same features.
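For concreteness, a hypothetical PyTorch sketch of how a LeNet-style source network could expose its penultimate-layer activations as a black-box feature function (layer sizes assume 32×32 RGB inputs as in CIFAR-100; none of these names come from the poster):

```python
import torch
import torch.nn as nn

class LeNetSource(nn.Module):
    """A small LeNet-style source classifier; only `features` is reused downstream."""
    def __init__(self, n_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),   # penultimate layer -> 84-dim features
        )
        self.classifier = nn.Linear(84, n_classes)

    def forward(self, x):
        return self.classifier(self.features(x))

def source_features(model, images):
    """Black-box feature access: forward pass only, source weights stay frozen."""
    model.eval()
    with torch.no_grad():
        return model.features(images)
```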

References

[1] L. Breiman and J. H. Friedman. "Estimating optimal transformations for multiple regression and correlation". In: Journal of the American Statistical Association 80.391 (1985), pp. 580–598.

[2] S.-L. Huang et al. On Universal Features for High-Dimensional Learning and Inference. Preprint. http://allegro.mit.edu/~gww/unifeatures. Oct. 2019.

[3] Y. LeCun et al. "Gradient-based learning applied to document recognition". In: Proceedings of the IEEE. 1998, pp. 2278–2324.

Experimental Results

Experimental results on target task for the CIFAR-100 dataset (10 source tasks, 2-way classification).

Method                    1-Shot Acc.   5-Shot Acc.   10-Shot Acc.   20-Shot Acc.
Best Single Source SVM    56.9 ± 2.5    67.0 ± 3.0    70.4 ± 1.9     70.9 ± 1.2
Best Single Source MCW    59.2 ± 2.1    69.0 ± 3.0    67.0 ± 2.4     70.4 ± 1.5
Multi-Source SVM          64.7 ± 3.0    72.8 ± 2.7    76.2 ± 1.8     81.5 ± 0.6
Multi-Source MCW          69.0 ± 3.0    78.1 ± 0.8    80.1 ± 0.8     81.7 ± 0.6

Experimental results on target task for the Stanford Dogs dataset (10 source tasks, 5-way classification).

Method                    5-Shot Accuracy
Best Single Source SVM    35.8 ± 0.8
Best Single Source MCW    38.2 ± 0.6
Multi-Source SVM          38.9 ± 0.3
Multi-Source MCW          41.6 ± 0.5

Experimental results on target task for the Tiny ImageNet dataset (10 source tasks, 5-way classification).

Method                    5-Shot Accuracy
Best Single Source SVM    31.4 ± 0.9
Best Single Source MCW    33.9 ± 1.0
Multi-Source SVM          42.5 ± 1.4
Multi-Source MCW          47.4 ± 1.1

Experimental Results - Source Selection

Average values of $\sum_i \sigma_i$ for each source task for the 5-shot transfer learning task on the CIFAR-100 dataset, with the target task of "apple vs. fish".
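Since the estimated correlations act as per-source relevance weights, they can also be used to rank or prune sources; a tiny illustrative helper (ours, not the poster's):

```python
import numpy as np

def rank_sources(sigma_list, source_names):
    """Order source tasks by their estimated correlation with the target task."""
    order = np.argsort(sigma_list)[::-1]
    return [(source_names[i], float(sigma_list[i])) for i in order]
```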

Experimental results on target task for the CIFAR-100 dataset using the MCW method with different subsets of source networks.

Source Tasks                                                          5-Shot Accuracy
10 Tasks (all source tasks used)                                      78.1 ± 0.8
9 Tasks (lowest correlation task "camel" vs. "can" removed)           76.8 ± 1.0
9 Tasks (highest correlation task "dolphin" vs. "elephant" removed)   73.0 ± 1.3