


Transferability and Hardness of Supervised Classification Tasks

Anh T. Tran∗ (VinAI Research, [email protected])
Cuong V. Nguyen (Amazon Web Services, [email protected])
Tal Hassner∗ (Facebook AI, [email protected])

Abstract

We propose a novel approach for estimating the difficulty and transferability of supervised classification tasks. Unlike previous work, our approach is solution agnostic and does not require or assume trained models. Instead, we estimate these values using an information theoretic approach: treating training labels as random variables and exploring their statistics. When transferring from a source to a target task, we consider the conditional entropy between two such variables (i.e., label assignments of the two tasks). We show analytically and empirically that this value is related to the loss of the transferred model. We further show how to use this value to estimate task hardness. We test our claims extensively on three large scale data sets: CelebA (40 tasks), Animals with Attributes 2 (85 tasks), and Caltech-UCSD Birds 200 (312 tasks), together representing 437 classification tasks. We provide results showing that our hardness and transferability estimates are strongly correlated with empirical hardness and transferability. As a case study, we transfer a learned face recognition model to CelebA attribute classification tasks, showing state of the art accuracy for tasks estimated to be highly transferable.

1. Introduction

How easy is it to transfer a representation learned for one task to another? How can we tell which of several tasks is hardest to solve? Answers to these questions are vital in planning model transfer and reuse, and can help reveal fundamental properties of tasks and their relationships in the process of developing universal perception engines [3]. The importance of these questions is therefore driving research efforts, with several answers proposed in recent years.

Some of the answers to these questions established task relationship indices, as in the Taskonomy [71] and Task2Vec [1, 2] projects. Others analyzed task relationships in the context of multi-task learning [31, 37, 61, 68, 73]. Importantly, however, these and other efforts are computational in nature, and so build on specific machine learning solutions as proxy task representations.

∗Work at Amazon Web Services, prior to joining current affiliation.

By relying on such proxy task representations, these approaches are naturally limited in their application: Rather than insights on the tasks themselves, they may reflect relationships between the specific solutions chosen to represent them, as noted by previous work [71]. Some, moreover, establish task relationships by maintaining model zoos, with existing trained models already available. They may therefore also be computationally expensive [1, 71]. Finally, in some scenarios, establishing task relationships requires multi-task learning of the models, to measure the influence different tasks have on each other [31, 37, 61, 68, 73].

We propose a radically different, solution agnostic approach: We seek underlying relationships, irrespective of the particular models trained to solve these tasks or whether these models even exist. We begin by noting that supervised learning problems are defined not by the models trained to solve them, but rather by the data sets of labeled examples and a choice of loss functions. We therefore go to the source and explore tasks directly, by examining their data sets rather than the models they were used to train.

To this end, we consider supervised classification tasks defined over the same input domain. As a loss, we assume the cross entropy function, thereby including most commonly used loss functions. We offer the following surprising result: By assuming an optimal loss on two tasks, the conditional entropy (CE) between the label sequences of their training sets provides a bound on the transferability of the two tasks, that is, the log-likelihood on a target task for a trained representation transferred from a source task. We then use this result to obtain a-priori estimates of task transferability and hardness.

Importantly, we obtain effective transferability and hardness estimates by evaluating only training labels; we do not consider the solutions trained for each task or the input domain. This result is surprising considering that it greatly simplifies estimating task hardness and task relationships, yet, as far as we know, was overlooked by previous work.

We verify our claims with rigorous tests on a total of 437 tasks from the CelebA [35], Animals with Attributes 2 (AwA2) [67], and Caltech-UCSD Birds 200 (CUB) [66] sets. We show that our approach reliably predicts task transferability and hardness. As a case study, we evaluate transferability from face recognition to facial attribute classification. On attributes estimated to be highly transferable from recognition, our results outperform the state of the art despite using a simple approach, involving training a linear support vector machine per attribute.

2. Related work

Our work is related to many fields in machine learning and computer vision, including transfer learning [69], meta learning [56], domain shifting [54], and multi-task learning [29]. Below we provide only a cursory overview of several methods directly related to us. For more principled surveys on transfer learning, we refer to others [4, 48, 65, 70].

Transfer learning. This paper is related to transfer learning [48, 65, 69] and our work can be used to select good source tasks and data sets when transferring learned models. Previous theoretical analysis of transfer learning is extensive [2, 5, 6, 7, 8, 39]. These papers allowed generalization bounds to be proven, but those bounds are abstract and hard to compute in practice. Our transferability measure, on the other hand, is easily computed from the training sets and can potentially also be useful for continual learning [44, 45, 52].

Task spaces. Tasks in machine learning are often represented by a labeled data set and a loss function. For some applications, qualitative exploration of the training data can reveal relationships between two tasks and, in particular, the biases between them [59].

Efforts to obtain more complex task relationships involved trained models. Data sets were compared using fixed dimensional lists of statistics, produced using an autoencoder trained for this purpose [19]. The successful Taskonomy project [71], like us, assumes multiple task labels for the same input images (same input domain). They train one model per task and then evaluate transfers between tasks, thereby creating a task hypergraph: their taxonomy.

Finally, Task2Vec constructs vector representations for tasks, obtained by mapping partially trained probe networks down to low dimensional task embeddings [1, 2]. Unlike these methods, we consider only the labels provided in the training data for each task, without using trained models.

Multi-task learning. Training a single model to solve multiple tasks can be mutually beneficial to the individual tasks [24, 51, 64]. When two tasks are only weakly related, however, attempting to train a model for them both can produce a model which under-performs compared to models trained for each task separately. Early multi-branch networks and their variants encoded human knowledge on the relationships of tasks in their design, joining related tasks or separating unrelated tasks [28, 50, 53].

Others adjusted for related vs. unrelated tasks during training of a deep multi-task network. Deep cross residual learning does this by introducing cross-residuals for regularization [28], cross-stitch combines activations from multiple task-specific networks [43], and UberNet proposed a task-specific branching scheme [30].

Some sought to discover what should be shared across tasks during training, and how, by automatic discovery of network designs that would group similar tasks together [37] or by solving tensor factorization problems [68]. Alternatively, parts of the input rather than the network were masked according to the task at hand [61]. Finally, modulation modules were proposed to address destructive interference between unrelated tasks [73].

3. Transferability via conditional entropy

We seek information on the transferability and hardness of supervised classification tasks. Previous work obtained this information by examining machine learning models developed for these tasks [1, 2, 73]. Such models are produced by training on labeled data sets that represent the tasks. These models can therefore be considered views on their training data. In this work we instead use information theory to produce estimates from the source: the data itself.

Like others [71], we assume our tasks share the same input instances and are different only in the labels they assign to each input. Such settings describe many practical scenarios. A set of face images, for instance, can have multiple labels for each image, representing tasks such as recognition [40, 41] and classification of various attributes [35].

We estimate transferability using the CE between the label sequences of the target and source tasks. Task hardness is similarly estimated: by computing transferability from a trivial task. We next formalize our assumptions and claims.

3.1. Task transferability

We assume a single input sequence of training samples, X = (x_1, x_2, . . . , x_n) ∈ 𝒳^n, along with two label sequences Y = (y_1, y_2, . . . , y_n) ∈ 𝒴^n and Z = (z_1, z_2, . . . , z_n) ∈ 𝒵^n, where y_i and z_i are labels assigned to x_i under two separate tasks: source task T_Z = (X, Z) and target task T_Y = (X, Y). Here, 𝒳 is the domain of the values of X, while 𝒴 = range(Y) and 𝒵 = range(Z) are the sets of different values in Y and Z respectively. Thus, if Z contains binary labels, then 𝒵 = {0, 1}.

We consider a classification model M = (w, h) on the source task, T_Z. The first part, w : 𝒳 → R^D, is some transformation function, possibly learned, that outputs a D-dimensional representation r = w(x) ∈ R^D for an input x ∈ 𝒳. The second part, h : R^D → P(𝒵), is a classifier that takes a representation r and produces a probability distribution h(r) ∈ P(𝒵), where P(𝒵) is the space of all probability distributions over 𝒵.

This description emphasizes the two canonical stages of a machine learning system [17]: representation followed by classification. As an example, a deep neural network with Softmax output is represented by a learned w, which maps the input into some feature space, producing a deep embedding, r, followed by classification layers (one or more), h, which map the embedding into prediction probabilities.

Now, assume we train a model (w_Z, h_Z) to solve T_Z by minimizing the cross entropy loss on Z:

w_Z, h_Z = \arg\min_{w, h \in (W, H)} L_Z(w, h),    (1)

where W and H are our chosen spaces of possible values for w and h, and L_Z(w, h) is the cross entropy loss (equivalently, the negative log-likelihood) of the parameters (w, h):

L_Z(w, h) = -l_Z(w, h) = -\frac{1}{n} \sum_{i=1}^{n} \log P(z_i | x_i; w, h),    (2)

where l_Z(w, h) is the log-likelihood of (w, h).

To transfer this model to target task T_Y, we fix the function w_Z and retrain only the classifier on the labels of T_Y. Denote the new classifier k_Y, selected from our chosen space K of target classifiers. Note that k_Y does not necessarily share the same architecture as h_Z. We train k_Y by minimizing the cross entropy loss on Y with the fixed w_Z:

k_Y = \arg\min_{k \in K} L_Y(w_Z, k),    (3)

where L_Y(w_Z, k) is defined similarly to Eq. (2) but for the label set Y. Under this setup, we define the transferability of task T_Z to task T_Y as follows.

Definition 1 The transferability of task T_Z to task T_Y is measured by the expected accuracy of the model (w_Z, k_Y) on a random test example (x, y) of task T_Y:

\mathrm{Trf}(T_Z \to T_Y) = \mathbb{E}\left[ \mathrm{acc}(y, x; w_Z, k_Y) \right],    (4)

which indicates how well a representation w_Z trained on task T_Z performs on task T_Y.

In practice, if the trained model does not overfit, the log-likelihood on the training set, l_Y(w_Z, k_Y), provides a good indicator of Eq. (4), that is, how well the representation w_Z and the classifier k_Y perform on task T_Y. This non-overfitting assumption holds even for large networks that are properly trained and tested on datasets sampled from the same distribution [72]. Thus, in the subsequent sections, we instead consider the following log-likelihood as an alternative measure of transferability:

\mathrm{Trf}(T_Z \to T_Y) = l_Y(w_Z, k_Y).    (5)

3.2. The conditional entropy of label sequences

From the label sequences Y and Z, we can compute the empirical joint distribution P(y, z) for all (y, z) ∈ 𝒴 × 𝒵 by counting, as follows:

P(y, z) = \frac{1}{n} \left| \{ i : y_i = y \text{ and } z_i = z \} \right|.    (6)

We now adopt the definition of CE between two random variables [14] to define the CE between our label sequences Y and Z.

Definition 2 The CE of a label sequence Y given a label sequence Z, H(Y|Z), is the CE of a random variable (or random label) y given a random variable (or random label) z, where (y, z) are drawn from the empirical joint distribution P(y, z) of Eq. (6):

H(Y|Z) = -\sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} P(y, z) \log \frac{P(y, z)}{P(z)},    (7)

where P(z) is the empirical marginal distribution on 𝒵:

P(z) = \sum_{y \in \mathcal{Y}} P(y, z) = \frac{1}{n} \left| \{ i : z_i = z \} \right|.    (8)

CE represents a measure of the amount of information provided by the value of one random variable on the value of another. By treating the labels assigned to both tasks as random variables and measuring the CE between them, we are measuring the information required to estimate a label in one task given a (known) label in another task.
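Eq. (7) is straightforward to compute directly from the two training label sequences. The sketch below is our own illustration rather than code released with the paper; the function name is ours:

```python
import math
from collections import Counter

def conditional_entropy(Y, Z):
    """Empirical conditional entropy H(Y|Z) of label sequence Y given Z (Eq. 7)."""
    assert len(Y) == len(Z)
    n = len(Y)
    joint = Counter(zip(Y, Z))   # |{i : y_i = y and z_i = z}|  ->  P(y, z) = count / n  (Eq. 6)
    marg_z = Counter(Z)          # |{i : z_i = z}|              ->  P(z)    = count / n  (Eq. 8)
    h = 0.0
    for (y, z), count in joint.items():
        p_yz = count / n
        p_z = marg_z[z] / n
        h -= p_yz * math.log(p_yz / p_z)
    return h
```

As noted in Discussion 4 below, when the source task is fixed, this quantity alone suffices to compare transferability to different target tasks.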

We now prove a relationship between the CE of Eq. (7) and the transferability of Eq. (5). In particular, we show that the log-likelihood on the target task T_Y is lower bounded by the log-likelihood on the source task T_Z minus H(Y|Z), if the optimal input representation w_Z trained on T_Z is transferred to T_Y.

To prove our theorem, we assume the space K of target classifiers contains a classifier k whose log-likelihood lower bounds that of k_Y. We construct k as follows. For each input x, we compute the Softmax output p_Z = h_Z(w_Z(x)), which is a probability distribution on 𝒵. We then convert p_Z into a Softmax on 𝒴 by taking the expectation of the empirical conditional probability P(y|z) = P(y, z)/P(z) with respect to p_Z. That is, for all y ∈ 𝒴, we define:

p_Y(y) = \mathbb{E}_{z \sim p_Z}\left[ P(y|z) \right] = \sum_{z \in \mathcal{Z}} P(y|z)\, p_Z(z),    (9)

where p_Z(z) is the probability of the label z returned by p_Z. For any input w_Z(x), we let the output of k be p_Y. That is, k(w_Z(x)) = p_Y. We can now prove the following theorem.


Figure 1. Visualizing toy examples. The transferability between two tasks, represented as sequences (X, Y) and (X, Z). The horizontal axis represents instances and the values for Z (in red) and Y (cyan). In which of these examples would it be easiest to transfer a model trained for task T_Z to task T_Y? See discussion and details in Sec. 3.3.

Theorem 1 Under the training procedure described in Sec. 3.1, we have:

\mathrm{Trf}(T_Z \to T_Y) \ge l_Z(w_Z, h_Z) - H(Y|Z).    (10)

Proof sketch.¹ From the definition of k_Y and the assumption that k ∈ K, we have Trf(T_Z → T_Y) = l_Y(w_Z, k_Y) ≥ l_Y(w_Z, k). From the construction of k, we have:

l_Y(w_Z, k) = \frac{1}{n} \sum_{i=1}^{n} \log \left( \sum_{z \in \mathcal{Z}} P(y_i | z)\, P(z | x_i; w_Z, h_Z) \right)
            \ge \frac{1}{n} \sum_{i=1}^{n} \log \left( P(y_i | z_i)\, P(z_i | x_i; w_Z, h_Z) \right)    (11)
            = \frac{1}{n} \sum_{i=1}^{n} \log P(y_i | z_i) + \frac{1}{n} \sum_{i=1}^{n} \log P(z_i | x_i; w_Z, h_Z).    (12)

We can easily show that the first term in Eq. (12) equals −H(Y|Z), while the second term is l_Z(w_Z, h_Z).

¹Full derivations provided in the appendix.
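For concreteness, the classifier k constructed above amounts to pushing the source model's Softmax output through the empirical conditional P(y|z), i.e., a matrix-vector product. The following sketch is our own illustration (the helper names and list-based label indexing are assumptions, not the authors' code):

```python
import numpy as np

def empirical_p_y_given_z(Y, Z, y_values, z_values):
    """len(y_values) x len(z_values) matrix whose (a, b) entry estimates
    P(y_values[a] | z_values[b]) by counting over the training labels."""
    joint = np.zeros((len(y_values), len(z_values)))
    for y, z in zip(Y, Z):
        joint[y_values.index(y), z_values.index(z)] += 1.0
    return joint / joint.sum(axis=0, keepdims=True)  # normalize each column over y

def k_output(p_z, p_y_given_z):
    """Eq. (9): p_Y(y) = sum_z P(y|z) p_Z(z), applied to one Softmax output p_Z."""
    return p_y_given_z @ p_z
```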

Discussion 1: Generality of our settings. Our settings for the spaces W, H, K are general and include a variety of practical use cases. For example, for neural networks, W will include all possible (vector) values of the network weights up to the penultimate layer, while H and K would include all possible (vector) values of the last layer's weights. Alternatively, we can use support vector machines (SVM) for K. In this case, K would include all possible values of the SVM parameters [57]. Our result even holds when the features are fixed, as when using tailored representations such as SIFT [36]. In these cases, the space W would contain only one transformation function from raw input to the features.

Discussion 2: Assumptions. We can easily satisfy the assumption k ∈ K by first choosing a space K′ (e.g., the SVMs) which will play the role of K \ {k}. We solve the optimization problem of Eq. (3) on K′ instead of K to obtain the optimal classifier k′. To get the optimal classifier k_Y on K = K′ ∪ {k}, we simply compare the losses of k′ and k and select the better one as k_Y.

The optimization problems of Eq. (1) and (3) are global optimization problems. In practice, for complex deep networks trained with stochastic gradient descent, we often only obtain local optima of the loss. In this case, we can easily adapt Theorem 1 by including the differences in the losses between the local optimum and the global optimum on the right-hand side of Eq. (10). In many practical applications, the difference between local and global optima is not significant [13, 46].

Discussion 3: Extending our result to test log-likelihood. In Theorem 1, we consider the empirical log-likelihood, which is generally unbounded. If we make the (strong) assumption of bounded differences between empirical log-likelihoods, we can apply McDiarmid's inequality [42] to get an upper bound on the left hand side of Eq. (10) by the expected log-likelihood, with some probability.

Discussion 4: Implications. Theorem 1 shows that the transferability from task T_Z to task T_Y depends on both the CE H(Y|Z) and the log-likelihood l_Z(w_Z, h_Z). Note that the log-likelihood l_Z(w_Z, h_Z) is optimal for task T_Z and so it represents the hardness (or easiness) of task T_Z. Thus, from the theorem, if l_Z(w_Z, h_Z) is small (i.e., the source task is hard), transferability is reduced. Conversely, if the CE H(Y|Z) is small, transferability increases.

Finally, we note that when the source task T_Z is fixed, the log-likelihood l_Z(w_Z, h_Z) is a constant. In this case, the transferability depends only on the CE H(Y|Z). Thus, we can estimate the transferability from one source task to multiple target tasks by considering only the CE.

3.3. Intuition and toy examples

To gain intuition on CE and transferability, consider the toy examples illustrated in Fig. 1. The (joint) input set is represented by the X axis. Each input x_i ∈ 𝒳 is assigned two labels, y_i ∈ 𝒴 and z_i ∈ 𝒵, for the two tasks. In Fig. 1(a,b), task T_Z is the trivial task with a constant label value (red line) and in Fig. 1(c–e) T_Z is a binary classification task, whereas T_Y is binary in Fig. 1(a–d) and multi-label in Fig. 1(e). In which of these examples would transferring a representation trained for T_Z to T_Y be hardest?

Of the five examples, (c) is the easiest transfer as it provides a 1-1 mapping from Z to Y. Appropriately, in this case, H(Y|Z) = 0. Next up are (d) and (e) with H(Y|Z) = log 2: In both cases each class in T_Z is mapped to two classes in T_Y. Note that T_Y being non-binary is naturally handled by the CE. Finally, transfers (a) and (b) have H(Y|Z) = 4 log 2, the highest CE. Because T_Z is trivial, the transfer must account for the greatest difference in the information between the tasks and so the transfer is hardest.
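These extremes are easy to reproduce numerically. The label sequences below are our own toy construction (they do not reproduce the exact sequences of Fig. 1), using the same CE computation as the sketch in Sec. 3.2:

```python
import math
from collections import Counter

def ce(Y, Z):
    """Empirical conditional entropy H(Y|Z), as in Eq. (7)."""
    n = len(Y)
    joint, marg = Counter(zip(Y, Z)), Counter(Z)
    return sum((c / n) * math.log((marg[z] / n) / (c / n)) for (_, z), c in joint.items())

Z       = [0, 0, 0, 0, 1, 1, 1, 1]   # source task labels
Y_easy  = [1, 1, 1, 1, 0, 0, 0, 0]   # a 1-1 mapping from Z, as in case (c)
Y_split = [0, 0, 1, 1, 2, 2, 3, 3]   # each source class splits into two target classes, as in (d) and (e)

print(ce(Y_easy, Z), ce(Y_split, Z))  # 0.0 and log 2 ~= 0.693
```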

4. Task hardness

A potential application of transferability is task hardness. In Sec. 3.2, we mentioned that the hardness of a task can be measured from the optimal log-likelihood on that task. Formally, we can measure the hardness of a task T_Z by:

\mathrm{Hard}(T_Z) = \min_{w, h \in (W, H)} L_Z(w, h) = -l_Z(w_Z, h_Z).    (13)

This definition of hardness depends on our choice of (W, H), which may determine various factors such as representation size or network architecture. The intuition behind the definition is that if the task T_Z is hard for all models in (W, H), we should expect higher loss even after training.

Using Theorem 1, we can bound l_Z(w_Z, h_Z) in Eq. (13) by transferring from a trivial task T_C to T_Z. We define a trivial task as the task for which all input values are assigned the same, constant label. Let C be the (constant) label sequence of the trivial task T_C. From Theorem 1 and Eq. (13), we can easily show that:

\mathrm{Hard}(T_Z) = -l_Z(w_Z, h_Z) \le H(Z|C).    (14)

Thus, we can approximate the hardness of task T_Z by looking at the CE H(Z|C). We note that the CE H(Z|C) is also used to estimate the transferability Trf(T_C → T_Z). So, Hard(T_Z) is closely related to Trf(T_C → T_Z). In particular, if task T_Z is hard, we expect it to be more difficult to transfer from a trivial task to T_Z.
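Because the trivial task assigns every input the same constant label, P(z, c) = P(z) and P(c) = 1 in Eq. (7), so H(Z|C) reduces to the entropy of the empirical marginal distribution of Z. A minimal sketch (the function name is ours, not the authors'):

```python
import math
from collections import Counter

def hardness_estimate(Z):
    """Upper bound on Hard(T_Z) from Eq. (14): H(Z|C) = H(Z) for a constant sequence C."""
    n = len(Z)
    return -sum((c / n) * math.log(c / n) for c in Counter(Z).values())

# A heavily imbalanced binary attribute is estimated to be easier than a balanced one:
print(hardness_estimate([0] * 95 + [1] * 5))    # ~= 0.199
print(hardness_estimate([0] * 50 + [1] * 50))   # = log 2 ~= 0.693
```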

This relationship between hardness and transferability from a trivial task is similar to the one proposed by Task2Vec [1]. They too indexed task hardness as the distance from a trivial task. To compute task hardness, however, they required training deep models, whereas we obtain this measure by simply computing H(Z|C) using Eq. (7).

Of course, estimating the hardness by H(Z|C) ignores the input and is hence only an approximation. In particular, one could possibly design scenarios where this measure would not accurately reflect the hardness of a given task. Our results in Sec. 5.3 show, however, that these label statistics provide a strong cue for task hardness.

5. Experiments

We rigorously evaluate our claims using three large scale, widely used data sets representing 437 classification tasks. Although Sec. 3.2 provides a bound on the training loss, test accuracy is generally more important. We thus report results on test images not included in the training data.

Benchmarks. The Celeb Faces Attributes (CelebA) set [35] was extensively used to evaluate transfer learning [18, 33, 34, 58]. CelebA contains over 202k face images of 10,177 subjects. Each image is labeled with subject identity as well as 40 binary attributes. We used the standard train / test splits (182,626 / 19,961 images, respectively). To our knowledge, of the three sets, it is the only one that provides baseline results for attribute classification.

Animals with Attributes 2 (AwA2) [67] includes over 37k images labeled as belonging to one of 50 animal classes. Images are labeled based on their class association with 85 different attributes. Models were trained on 33,568 training images and tested on 3,754 separate test images.

Finally, Caltech-UCSD Birds 200 (CUB) [66] offers 11,788 images of 200 bird species, labeled with 312 attributes as well as Turker Confidence attributes. Labels were averaged across multiple Turkers using confidences. We then kept only reliable labels, using a threshold of 0.5 on the average confidence value. We used 5,994 images for training and 5,794 images for testing.

We note that the Task Bank set with its 26 tasks was also used for evaluating task relationships [71]. We did not use it here as it mostly contains regression tasks rather than the classification problems we are concerned with.

5.1. Evaluating task transferability

We compare our transferability estimates from the CE to the actual transferability of Eq. (4). To this end, for each attribute T_Z in a data set, we measure the actual transferability, Trf(T_Z → T_Y), to all other attributes T_Y in that set using the test split. We then compare these transferability scores to the corresponding CE estimates of Eq. (7) using an existing correlation analysis [44].

We note again that when the source task is fixed, as in this case, the transferability estimates can be obtained by considering only the CE. Furthermore, since Trf(T_Z → T_Y) and the CE H(Y|Z) are negatively correlated, we instead correlate the test error rate, 1 − Trf(T_Z → T_Y), with the CE H(Y|Z).

Transferring representations. We keep the learned representation, w_Z, and produce a new classifier k_Y by training on the target task (Sec. 3.1). We used ResNet18 [25], trained with standard cross entropy loss, on each source task T_Z (source attribute). These networks were selected as they were deep enough to obtain good accuracy on our benchmarks, but not so deep as to overfit [72]. The penultimate layer of these networks produces embeddings r ∈ R^2048, which the networks classify into binary attribute values using h_Z, their last, fully connected (FC) layer.

We transferred from source to target task by freezing the networks, only replacing their FC layers with a linear SVM (lSVM). These lSVMs were trained to predict the binary labels of target tasks given, as their input, the embeddings produced for the source tasks by w_Z. The test errors of the lSVMs, which measure 1 − Trf(T_Z → T_Y), were then compared with the CE, H(Y|Z).
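The transfer protocol above can be summarized with a short sketch. This is our own illustration (assuming scikit-learn and SciPy are available), not the authors' released code; it takes as input embeddings already extracted with the frozen source representation w_Z:

```python
from scipy.stats import pearsonr
from sklearn.svm import LinearSVC

def transfer_error(train_emb, train_y, test_emb, test_y):
    """Freeze the source representation w_Z: train only a linear SVM (playing the
    role of k_Y) on the fixed embeddings and return its test error, an empirical
    measure of 1 - Trf(T_Z -> T_Y)."""
    clf = LinearSVC()
    clf.fit(train_emb, train_y)
    return 1.0 - clf.score(test_emb, test_y)

def correlate_with_ce(ce_estimates, test_errors):
    """Pearson correlation between CE estimates H(Y|Z) and transfer test errors
    across target tasks (reported as 'corr' and 'p' in Fig. 2)."""
    return pearsonr(ce_estimates, test_errors)
```

One call of transfer_error per target attribute, with the CE computed from the corresponding label sequences, reproduces the kind of comparison plotted in Fig. 2.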


Figure 2. Attribute prediction; CE vs. test errors on target tasks. Examples from CelebA (a-d), AwA2 (e-h), and CUB (i-l): (a) Heavy makeup, (b) Male, (c) Pale skin, (d) Wearing lipstick, (e) Swim, (f) Pads, (g) Paws, (h) Strong, (i) Curved Bill, (j) Iridescent Wings, (k) Brown Upper Parts, (l) Olive Under Parts. Each panel plots conditional entropy (horizontal axis) against error on the target task (vertical axis). Plot titles name the source tasks T_Z; points represent different target tasks T_Y. Corr is the Pearson correlation coefficient between the two variables and p is the statistical significance of the correlation. In all cases, the correlation is statistically significant. See Sec. 5.1 for details.

Attribute:                            Male   Bald   Gray Hair  Mustache  Double Chin  . . .  Attractive  Wavy Hair  High Cheeks  Smiling  Mouth Open  Average (all)
1   LNets+ANet 2015 [35]              0.980  0.980  0.970      0.950     0.920        . . .  0.810       0.800      0.870        0.920    0.920       0.873
2   Walk and Learn 2016 [63]          0.960  0.920  0.950      0.900     0.930        . . .  0.840       0.850      0.950        0.980    0.970       0.887
3   MOON 2016 [55]                    0.981  0.988  0.981      0.968     0.963        . . .  0.817       0.825      0.870        0.926    0.935       0.909
4   LMLE 2016 [26]                    0.990  0.900  0.910      0.730     0.740        . . .  0.880       0.830      0.920        0.990    0.960       0.838
5   CR-I 2017 [16]                    0.960  0.970  0.950      0.940     0.890        . . .  0.830       0.790      0.890        0.930    0.950       0.866
6   MCNN-AUX 2017 [23]                0.982  0.989  0.982      0.969     0.963        . . .  0.831       0.839      0.876        0.927    0.937       0.913
7   DMTL 2018 [22]                    0.980  0.990  0.960      0.970     0.990        . . .  0.850       0.870      0.880        0.940    0.940       0.926
8   Face-SSD 2019 [27]                0.973  0.986  0.976      0.960     0.960        . . .  0.813       0.851      0.868        0.918    0.919       0.903
9   CE↑ (decreasing transferability)  0.017  0.026  0.052      0.062     0.083        . . .  0.361       0.381      0.476        0.521    0.551       -
10  Dedicated Res18                   0.985  0.990  0.980      0.968     0.959        . . .  0.823       0.842      0.878        0.933    0.943       0.911
11  Transfer                          0.992  0.991  0.981      0.968     0.963        . . .  0.820       0.800      0.859        0.909    0.901       0.902

Table 1. Transferability from face recognition to facial attributes. Results for CelebA attributes, sorted in ascending order of row 9 (decreasing transferability). Results are shown for the five attributes most and least transferable from recognition. Subject specific attributes, e.g., male and bald, are more transferable than expression related attributes such as smiling and mouth open. Unsurprisingly, transfer results (row 11) are better on the former than the latter. Rows 1-8 provide published state of the art results. Despite training only an lSVM per attribute, row 11 results are comparable with more elaborate attribute classification systems. For details, see Sec. 5.2.


We use lSVM as it allows us to focus on the information passed from T_Z to T_Y. A more complex classifier could potentially mask this information by being powerful enough to offset any loss of information due to the transfer. In practical use cases, when transferring a deep network from one task to another, it may be preferable to fine-tune the last layers of the network or its entirety, provided that the training data on the target task is large enough.

Transferability results. Fig. 2 reports selected quantitative transferability results on the three sets.² Each point in these graphs represents the CE, H(Y|Z), vs. the target test error, 1 − Trf(T_Z → T_Y). The graphs also provide the linear regression model fit with 95% confidence interval, the Pearson correlation coefficients between the two values, and the statistical significance of the correlation, p.

²For full results see the appendix.

In all cases, the CE and target test error are highly positively correlated with statistical significance. These results testify that the CE of Eq. (7) is indeed a good predictor for the actual transferability of Eq. (4). This is remarkable especially since the relationship between tasks is evaluated without considering the input domain or the machine learning models trained to solve these tasks.


Figure 3. Identity to attribute; CE vs. test errors on target tasks. Predicting 40 CelebA attributes using a face recognition network (transfer from ID: corr=0.78, p < 0.001; conditional entropy on the horizontal axis, error on target on the vertical axis). Corr is the Pearson correlation coefficient between the two variables, and p is the statistical significance of the correlation.

Figure 4. Identity to attribute; transferred − dedicated accuracy. Differences between CelebA accuracy of a transferred recognition model and models trained for each attribute. Results are sorted by decreasing transferability (same as Table 1).


5.2. Case study: Identity to facial attributes

A key challenge when training effective attribute classifiers is the difficulty of obtaining labeled attribute training data. Whereas face images are often uploaded to the Internet along with subject names [9, 21], it is far less common to find images labeled with attributes such as high cheekbones, bald, or even male [32]. It is consequently harder to assemble training sets for attribute classification at the same scale and diversity as those used to train other tasks.

To reduce the burden of collecting attribute data, we therefore explore transferring a representation learned for face recognition. In this setting, we can also compute estimated transferability scores (via the CE) between the subject labels provided by CelebA and the labels of each attribute. We note that unlike the previous examples, the source labels are not binary and include over 10k values.

Face recognition network. We compare our estimated transferability vs. actual transferability using a deep face recognition network. To this end, we use a ResNet101 architecture trained for face recognition on the union of the MS-Celeb-1M [21] and VGGFace2 [9] training sets (following removal of subjects included in CelebA), with a cosine margin loss (m = 0.4) [62]. This network achieves accuracy comparable to the state of the art reported by others, with different systems, on standard benchmarks [15].

Figure 5. Classification accuracy for varying training set sizes. (a) male; (b) double chin. Dedicated classification networks trained from scratch (blue) vs. face recognition network transferred to the attributes with an lSVM (red). Because recognition transfers well to these attributes, we obtain accurate classification with a fraction of the training data and effort.

Transferability results: recognition to attributes. Table 1 reports results for the five attributes most transferable from recognition (smallest CE; Eq. (7)) and the five least transferable (largest CE). Columns are sorted by increasing CE values (decreasing transferability), listed in row 9. Row 11 reports accuracy of the transferred network with the lSVM trained on the target task. Estimated vs. actual transferability is further visualized in Fig. 3. Evidently, correlation between the two is statistically significant, testifying that Eq. (7) is a good predictor of actual transferability, here demonstrated on a source task with multiple labels.

For reference, Table 1 provides in Row 10 the accuracy of the dedicated ResNet18 networks trained for each attribute. Finally, rows 1 through 8 provide results for published state of the art on the same tasks.

Analysis of results. Subject specific attributes such as male and bald are evidently more transferable from recognition (left columns of Table 1) than attributes related to expressions (e.g., smiling and mouth open, right columns). Although this relationship has been noted by others, previous work used domain knowledge to determine which attributes are more transferable from identity [35], as others have done in other domains [20, 38]. By comparison, our work shows how these relationships emerge from our estimation of transferability.

Figure 6. Estimated task hardness vs. empirical errors on the three benchmarks: (a) CelebA (corr=0.58), (b) AwA2 (corr=0.82), (c) CUB (corr=0.96). Each panel plots the hardness estimate (horizontal axis) against test error (vertical axis). Estimated hardness is well correlated with empirical hardness with significance p < 0.001.

Also, notice that for the transferable attributes, our results are comparable to dedicated networks trained for each attribute, although they gradually drop off for the less transferable attributes in the last columns. This effect is visualized in Fig. 4, which shows the growing differences in attribute classification accuracy for a transferred face recognition model and models trained for each attribute. Results are sorted by decreasing transferability (same as in Table 1).

Results in Fig. 4 show a few notable exceptions where transfer performs substantially better than dedicated models (e.g., the two positive peaks representing attributes young and big nose). These and other occasional discrepancies in our results can be explained by the difference between the true transferability of Eq. (4), which we measure on the test sets, and Eq. (5), defined on the training sets and shown in Sec. 3.2 to be bounded by the CE.

Finally, we note that our goal is not to develop a state of the art facial attribute classification scheme. Nevertheless, results obtained by training an lSVM on embeddings transferred from a face recognition network are only 2.4% lower than the best scores reported by DMTL 2018 [22] (last column of Table 1). The effort involved in developing a state of the art face recognition network can be substantial. By transferring this network to attributes, these efforts are amortized across the training of multiple facial attribute classifiers.

To emphasize this last point, consider Fig. 5, which reports classification accuracy on male and double chin for growing training set sizes. These attributes were selected as they are highly transferable from recognition (see Table 1). The figure compares the accuracy obtained by training a dedicated network (in blue) to a network transferred from recognition (red). Evidently, on these attributes, transferred accuracy is much higher with far less training data.

5.3. Evaluating task hardness

We evaluate our hardness estimates for all attribute classification tasks in the three data sets, using the CE H(Z|C) in Eq. (14). Fig. 6 compares the hardness estimates for each task vs. the errors of our dedicated networks, trained from scratch to classify each attribute. Results are provided for CelebA, AwA2, and CUB.

The correlation between estimated hardness and classification errors is statistically significant with p < 0.001, suggesting that the CE H(Z|C) in Eq. (14) indeed captures the hardness of these tasks. That is, in the three data sets, test error rates strongly correlate with our estimated hardness: the harder a task is estimated to be, the higher the errors produced by the model trained for the task. Of course, this result does not imply that the input domain has no impact on task hardness; only that the distribution of training labels already provides a strong predictor for task hardness.

6. Conclusions

We present a practical method for estimating the hardness and transferability of supervised classification tasks. We show that, in both cases, we produce reliable estimates by exploring training label statistics, particularly the conditional entropy between the sequences of labels assigned to the training data of each task. This approach is simpler than existing work, which obtains similar estimates by assuming the existence of trained models or by careful inspection of the training process. In our approach, computing conditional entropy is cheaper than training deep models, required by others for the same purpose.

We assume that different tasks share the same input domain (the same input images). It would be useful to extend our work to settings where the two tasks are defined over different domains (e.g., face vs. animal images). Our work further assumes discrete labels. Conditional entropy was originally defined over distributions. It is therefore reasonable that CE could be extended to non-discrete labeled tasks, such as, for faces, 3D reconstruction [60], pose estimation [10, 11], or segmentation [47].

Acknowledgements. We thank Alessandro Achille, Pietro Perona, and the reviewers for their helpful discussions.


References

[1] Alessandro Achille, Michael Lam, Rahul Tewari, Avinash Ravichandran, Subhransu Maji, Charless Fowlkes, Stefano Soatto, and Pietro Perona. Task2Vec: Task embedding for meta-learning. arXiv preprint arXiv:1902.03545, 2019.
[2] Alessandro Achille, Giovanni Paolini, Glen Mbeng, and Stefano Soatto. The information complexity of learning tasks, their structure and their distance. arXiv preprint arXiv:1904.03292, 2019.
[3] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma. Provable bounds for learning some deep representations. In Int. Conf. Mach. Learning, pages 584–592, 2014.
[4] Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. Factors of transferability for a generic convnet representation. Trans. Pattern Anal. Mach. Intell., 38(9):1790–1802, 2015.
[5] Kamyar Azizzadenesheli, Anqi Liu, Fanny Yang, and Animashree Anandkumar. Regularized learning for domain adaptation under label shifts. In Int. Conf. on Learning Representations, 2019.
[6] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Mach. Learn., 79(1-2):151–175, 2010.
[7] Shai Ben-David and Reba Schuller. Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines, pages 567–580. Springer, 2003.
[8] John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman. Learning bounds for domain adaptation. In Neural Inform. Process. Syst., pages 129–136, 2008.
[9] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. VGGFace2: A dataset for recognising faces across pose and age. In Automatic Face and Gesture Recognition, 2018.
[10] Feng-Ju Chang, Anh Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gerard Medioni. Faceposenet: Making a case for landmark-free face alignment. In Proc. Int. Conf. Comput. Vision Workshops, 2017.
[11] Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gerard Medioni. Deep, landmark-free fame: Face alignment, modeling, and expression estimation. Int. J. Comput. Vision, 127(6-7):930–956, 2019.
[12] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[13] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015.
[14] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
[15] Prithviraj Dhar, Carlos Castillo, and Rama Chellappa. On measuring the iconicity of a face. In Winter Conf. on App. of Comput. Vision, pages 2137–2145. IEEE, 2019.
[16] Qi Dong, Shaogang Gong, and Xiatian Zhu. Class rectification hard mining for imbalanced deep learning. In Proc. Conf. Comput. Vision Pattern Recognition, pages 1851–1860, 2017.
[17] Richard O Duda, Peter E Hart, and David G Stork. Pattern classification. John Wiley & Sons, 2012.
[18] Emilien Dupont. Learning disentangled joint continuous and discrete representations. In Neural Inform. Process. Syst., pages 708–718, 2018.
[19] Harrison Edwards and Amos Storkey. Towards a neural statistician. arXiv preprint arXiv:1606.02185, 2016.
[20] Deepti Ghadiyaram, Du Tran, and Dhruv Mahajan. Large-scale weakly-supervised pre-training for video action recognition. In Proc. Conf. Comput. Vision Pattern Recognition, pages 12046–12055, 2019.
[21] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In European Conf. Comput. Vision. Springer, 2016.
[22] Hu Han, Anil K Jain, Fang Wang, Shiguang Shan, and Xilin Chen. Heterogeneous face attribute estimation: A deep multi-task learning approach. Trans. Pattern Anal. Mach. Intell., 40(11):2597–2609, 2018.
[23] Emily M Hand and Rama Chellappa. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In AAAI Conf. on Artificial Intelligence, 2017.
[24] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proc. Int. Conf. Comput. Vision, pages 2961–2969, 2017.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. Conf. Comput. Vision Pattern Recognition, June 2016.
[26] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In Proc. Conf. Comput. Vision Pattern Recognition, pages 5375–5384, 2016.
[27] Youngkyoon Jang, Hatice Gunes, and Ioannis Patras. Registration-free face-ssd: Single shot analysis of smiles, facial attributes, and affect in the wild. Comput. Vision Image Understanding, 2019.
[28] Brendan Jou and Shih-Fu Chang. Deep cross residual learning for multitask visual recognition. In Int. Conf. Multimedia, pages 998–1007. ACM, 2016.
[29] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proc. Conf. Comput. Vision Pattern Recognition, pages 7482–7491, 2018.
[30] Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In Proc. Conf. Comput. Vision Pattern Recognition, pages 6129–6138, 2017.
[31] Giwoong Lee, Eunho Yang, and Sung Hwang. Asymmetric multi-task learning based on task relatedness and loss. In Int. Conf. Mach. Learning, pages 230–238, 2016.
[32] Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks. In Proc. Conf. Comput. Vision Pattern Recognition Workshops, June 2015.
[33] Yu Liu, Fangyin Wei, Jing Shao, Lu Sheng, Junjie Yan, and Xiaogang Wang. Exploring disentangled feature representation beyond face identification. In Proc. Conf. Comput. Vision Pattern Recognition, pages 2080–2089, 2018.
[34] Yen-Cheng Liu, Yu-Ying Yeh, Tzu-Chien Fu, Sheng-De Wang, Wei-Chen Chiu, and Yu-Chiang Frank Wang. Detach and adapt: Learning cross-domain disentangled deep representation. In Proc. Conf. Comput. Vision Pattern Recognition, pages 8867–8876, 2018.
[35] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proc. Int. Conf. Comput. Vision, 2015.
[36] David G Lowe. Object recognition from local scale-invariant features. In Proc. Int. Conf. Comput. Vision, page 1150, 1999.
[37] Yongxi Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, Tara Javidi, and Rogerio Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In Proc. Conf. Comput. Vision Pattern Recognition, pages 5334–5343, 2017.
[38] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In European Conf. Comput. Vision, pages 181–196, 2018.
[39] Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory, 2009.
[40] I. Masi, F. J. Chang, J. Choi, S. Harel, J. Kim, K. Kim, J. Leksut, S. Rawls, Y. Wu, T. Hassner, W. AbdAlmageed, G. Medioni, L. P. Morency, P. Natarajan, and R. Nevatia. Learning pose-aware models for pose-invariant face recognition in the wild. Trans. Pattern Anal. Mach. Intell., 2018.
[41] Iacopo Masi, Anh Tuan Tran, Tal Hassner, Gozde Sahin, and Gerard Medioni. Face-specific data augmentation for unconstrained face recognition. Int. J. Comput. Vision, 127(6-7):642–667, 2019.
[42] Colin McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
[43] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In Proc. Conf. Comput. Vision Pattern Recognition, pages 3994–4003, 2016.
[44] Cuong V Nguyen, Alessandro Achille, Michael Lam, Tal Hassner, Vijay Mahadevan, and Stefano Soatto. Toward understanding catastrophic forgetting in continual learning. arXiv preprint arXiv:1908.01091, 2019.
[45] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. In Int. Conf. on Learning Representations, 2018.
[46] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. In Int. Conf. Mach. Learning, pages 2603–2612, 2017.
[47] Yuval Nirkin, Iacopo Masi, Anh Tran Tuan, Tal Hassner, and Gerard Medioni. On face segmentation, face swapping, and face perception. In Int. Conf. on Automatic Face and Gesture Recognition, pages 98–105. IEEE, 2018.
[48] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. Trans. Knowledge and Data Eng., 22(10):1345–1359, 2010.
[49] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learning Research, 12:2825–2830, 2011.
[50] Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. Trans. Pattern Anal. Mach. Intell., 41(1):121–135, 2019.
[51] Rajeev Ranjan, Swami Sankaranarayanan, Carlos D Castillo, and Rama Chellappa. An all-in-one convolutional neural network for face analysis. In Int. Conf. on Automatic Face and Gesture Recognition, pages 17–24. IEEE, 2017.
[52] Mark B Ring. CHILD: A first step towards continual learning. Mach. Learn., 28(1):77–104, 1997.
[53] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proc. Int. Conf. Comput. Vision Workshops, pages 10–15, 2015.
[54] Artem Rozantsev, Mathieu Salzmann, and Pascal Fua. Beyond sharing weights for deep domain adaptation. Trans. Pattern Anal. Mach. Intell., 41(4):801–814, 2019.
[55] Ethan M Rudd, Manuel Gunther, and Terrance E Boult. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conf. Comput. Vision, pages 19–35. Springer, 2016.
[56] Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. In Int. Conf. on Learning Representations, 2019.
[57] Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
[58] Sukrit Shankar, Duncan Robertson, Yani Ioannou, Antonio Criminisi, and Roberto Cipolla. Refining architectures of deep convolutional neural networks. In Proc. Conf. Comput. Vision Pattern Recognition, pages 2212–2220, 2016.
[59] A Torralba and AA Efros. Unbiased look at dataset bias. In Proc. Conf. Comput. Vision Pattern Recognition, pages 1521–1528. IEEE Computer Society, 2011.
[60] Anh Tuan Tran, Tal Hassner, Iacopo Masi, Eran Paz, Yuval Nirkin, and Gerard Medioni. Extreme 3D face reconstruction: Looking past occlusions. In Proc. Conf. Comput. Vision Pattern Recognition, 2018.
[61] Andreas Veit, Serge Belongie, and Theofanis Karaletsos. Conditional similarity networks. In Proc. Conf. Comput. Vision Pattern Recognition, pages 830–838, 2017.
[62] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proc. Conf. Comput. Vision Pattern Recognition, pages 5265–5274, 2018.
[63] Jing Wang, Yu Cheng, and Rogerio Schmidt Feris. Walk and learn: Facial attribute representation learning from egocentric video and contextual data. In Proc. Conf. Comput. Vision Pattern Recognition, pages 2295–2304, 2016.
[64] Peng Wang, Xiaohui Shen, Zhe Lin, Scott Cohen, Brian Price, and Alan L Yuille. Towards unified depth and semantic prediction from a single image. In Proc. Conf. Comput. Vision Pattern Recognition, pages 2800–2809, 2015.
[65] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.
[66] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
[67] Yongqin Xian, Christoph H Lampert, Bernt Schiele, and Zeynep Akata. Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. Trans. Pattern Anal. Mach. Intell., 2018.
[68] Yongxin Yang and Timothy Hospedales. Deep multi-task representation learning: A tensor factorisation approach. In Int. Conf. on Learning Representations, 2017.
[69] Wei Ying, Yu Zhang, Junzhou Huang, and Qiang Yang. Transfer learning via learning to transfer. In Int. Conf. Mach. Learning, pages 5072–5081, 2018.
[70] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Neural Inform. Process. Syst., pages 3320–3328, 2014.
[71] Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In Proc. Conf. Comput. Vision Pattern Recognition, pages 3712–3722, 2018.
[72] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In Int. Conf. on Learning Representations, 2017.
[73] Xiangyun Zhao, Haoxiang Li, Xiaohui Shen, Xiaodan Liang, and Ying Wu. A modulation module for multi-task learning with applications in image retrieval. In European Conf. Comput. Vision, pages 401–416, 2018.

A. Proof of theorem 1

From the definition of Trf(TZ → TY), we have:

$$
\begin{aligned}
\mathrm{Trf}(T_Z \to T_Y)
&= l_Y(w_Z, k_Y) && \text{(definition of Trf)}\\
&\ge l_Y(w_Z, k) && \text{(definition of $k_Y$ and $k \in \mathcal{K}$)}\\
&= \frac{1}{n}\sum_{i=1}^{n} \log\!\Big(\sum_{z \in \mathcal{Z}} P(y_i \mid z)\, P(z \mid x_i; w_Z, h_Z)\Big) && \text{(construction of $k$)}\\
&\ge \frac{1}{n}\sum_{i=1}^{n} \log\!\big(P(y_i \mid z_i)\, P(z_i \mid x_i; w_Z, h_Z)\big) && \text{(replacing the sum by one of its elements)}\\
&= \frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid z_i) + \frac{1}{n}\sum_{i=1}^{n} \log P(z_i \mid x_i; w_Z, h_Z). && (15)
\end{aligned}
$$

Note that the second term in Eq. (15) is:

$$
\frac{1}{n}\sum_{i=1}^{n} \log P(z_i \mid x_i; w_Z, h_Z) = l_Z(w_Z, h_Z). \qquad (16)
$$

Furthermore, the first term in Eq. (15) is:

$$
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid z_i)
&= \frac{1}{n}\sum_{y \in \mathcal{Y}}\sum_{z \in \mathcal{Z}}\ \sum_{i:\ y_i = y \text{ and } z_i = z} \log P(y \mid z) && \text{(group the summands by values of $y_i$ and $z_i$)}\\
&= \frac{1}{n}\sum_{y \in \mathcal{Y}}\sum_{z \in \mathcal{Z}} \big|\{i : y_i = y \text{ and } z_i = z\}\big|\, \log P(y \mid z) && \text{(by counting)}\\
&= \sum_{y \in \mathcal{Y}}\sum_{z \in \mathcal{Z}} \frac{\big|\{i : y_i = y \text{ and } z_i = z\}\big|}{n}\, \log P(y \mid z)\\
&= \sum_{y \in \mathcal{Y}}\sum_{z \in \mathcal{Z}} P(y, z)\, \log \frac{P(y, z)}{P(z)} && \text{(definitions of $P(y, z)$ and $P(y \mid z)$)}\\
&= -H(Y \mid Z). && (17)
\end{aligned}
$$

From Eq. (15), (16), and (17), we have Trf(TZ → TY) ≥ lZ(wZ, hZ) − H(Y|Z). Hence, the theorem holds.
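As a quick numerical sanity check of the grouping identity behind Eq. (17) (ours, not part of the original proof), the short Python sketch below verifies that the per-sample average −(1/n) Σ_i log P(y_i|z_i) equals Σ_{y,z} P(y, z) log(P(z)/P(y, z)) computed from the empirical joint distribution:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=1000)   # target-task labels
z = rng.integers(0, 4, size=1000)   # source-task labels
n = len(y)

joint = Counter(zip(y, z))          # counts n(y, z)
z_cnt = Counter(z)                  # counts n(z)

# Per-sample form: -(1/n) * sum_i log P(y_i | z_i), with P(y|z) = n(y,z)/n(z).
per_sample = -np.mean([np.log(joint[(yi, zi)] / z_cnt[zi]) for yi, zi in zip(y, z)])

# Distribution form: sum_{y,z} P(y,z) * log(P(z) / P(y,z)).
dist_form = sum((c / n) * np.log((z_cnt[zz] / n) / (c / n))
                for (yy, zz), c in joint.items())

print(per_sample, dist_form)        # the two values agree up to float error
```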

B. More details on task hardness

On the definition of task hardness. In our paper, we assume non-overfitting of trained models. When train and test sets are sampled from the same distribution, this assumption typically holds for appropriately trained models [72]. This property also shows that our definition of hardness, Eq. (13), does not conflict with the results of Zhang et al. [72]: in such cases, the training loss of Eq. (13) correlates with the test error, and thus this definition indeed reflects task hardness, explaining the relationships between train and test errors observed in our hardness results.

On the representation for trivial tasks. Any representation for a trivial source task can fit the constant label perfectly (zero training loss). In theory, if we choose the optimal wZ in Eq. (1) as our representation, we can show Eq. (14). In practice, of course, we cannot infer the optimal wZ from the trivial source task, but Eq. (14) shows that we can still connect it to H(Z|C).

C. Technical implementation details

Computing the CE. Computing the CE is straightforward and involves the following steps:

1. Loop through the training labels of both tasks TZ and TY and compute the empirical joint distribution P(y, z) by counting (Eq. (6) in the paper).

2. Loop through the training labels again and compute the CE using Eq. (17) above. That is,

$$
H(Y \mid Z) = -\frac{1}{n}\sum_{i=1}^{n} \log P(y_i \mid z_i).
$$

Thus, computing the CE only requires running two loops through the training labels. This process is computationally efficient. In the most extreme case, computing the transferability of face recognition (|Z| > 10k) to a facial attribute, with |Y| = 2, required less than a second on a standard CPU.

This run time should be compared with the hours (or days) required to train deep models in order to empirically measure transferability following the process described by previous work. In particular, Taskonomy [71] reported over 47 thousand hours of GPU runtime in order to establish relationships between their 26 tasks.
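For illustration, the two passes above can be written as the following minimal Python sketch; the function and variable names are ours and are not part of any released code:

```python
import numpy as np

def conditional_entropy(labels_y, labels_z):
    """Empirical H(Y|Z) of target labels given source labels (Eq. (17))."""
    labels_y = np.asarray(labels_y)
    labels_z = np.asarray(labels_z)
    n = len(labels_y)

    # Pass 1: empirical joint counts n(y, z) and marginal counts n(z).
    joint, z_count = {}, {}
    for y, z in zip(labels_y, labels_z):
        joint[(y, z)] = joint.get((y, z), 0) + 1
        z_count[z] = z_count.get(z, 0) + 1

    # Pass 2: H(Y|Z) = -(1/n) * sum_i log P(y_i | z_i), with P(y|z) = n(y,z)/n(z).
    total = 0.0
    for y, z in zip(labels_y, labels_z):
        total += np.log(joint[(y, z)] / z_count[z])
    return -total / n

# Toy usage: a perfectly predictive source task gives H(Y|Z) = 0.
print(conditional_entropy([0, 1, 0, 1], [5, 7, 5, 7]))  # -> 0.0
```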

Dedicated attribute training. Given a source task TZ, we train a dedicated CNN for this task with a standard ResNet-18 V2, implemented in the MXNet deep learning library [12].3 We set the initial learning rate to 0.01. The learning rate was then divided by 10 after each 12 epochs. Training converged in less than 40 epochs in all 437 tasks.

3 Model available from: https://mxnet.apache.org/api/python/gluon/model_zoo.html.
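A minimal MXNet Gluon sketch of this training setup is given below; the iterations-per-epoch value and the weight initialization are placeholders rather than values reported in the paper, and the data loading and training loop are omitted:

```python
import mxnet as mx
from mxnet import gluon

# Placeholder; the actual iterations per epoch depend on data set and batch size.
steps_per_epoch = 1000

# Standard ResNet-18 V2 from the Gluon model zoo, with a binary output
# layer for a single attribute task.
net = gluon.model_zoo.vision.resnet18_v2(classes=2)
net.initialize(mx.init.Xavier())

# Initial learning rate 0.01, divided by 10 after each 12 epochs.
schedule = mx.lr_scheduler.FactorScheduler(step=12 * steps_per_epoch, factor=0.1)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.01, 'lr_scheduler': schedule})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()
```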

Task transfer with linear SVM. After training a deep representation for a source task TZ, we transfer it to a target task TY using a linear support vector machine (lSVM).

First, we use the trained CNN, denoted in the paper as wZ, to extract deep embeddings for the entire training data (one embedding per input image). Each embedding is a vector r ∈ R2048, which we obtain from the penultimate layer of the network. We then use these embeddings, along with the corresponding labels for the target task, TY, to train a standard lSVM classifier, implemented by SK-Learn [49]. The lSVM parameters were kept unchanged from their default values.

Given unseen testing data, we first extract their embeddings with wZ. We then apply the trained lSVM classifier on these features to predict labels for the target task, TY.
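A minimal scikit-learn sketch of this transfer step follows; the random arrays merely stand in for the 2048-dimensional embeddings extracted with wZ and for the target-task labels, and are for illustration only:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Stand-ins for the r in R^2048 embeddings extracted with the source-task
# network w_Z; in practice these come from the penultimate layer.
rng = np.random.default_rng(0)
emb_train = rng.normal(size=(200, 2048))
y_train = rng.integers(0, 2, size=200)   # target-task (T_Y) training labels
emb_test = rng.normal(size=(50, 2048))
y_test = rng.integers(0, 2, size=50)

# Linear SVM with default parameters, as in the paper.
clf = LinearSVC()
clf.fit(emb_train, y_train)

# Test time: embeddings of unseen images are classified by the trained lSVM.
accuracy = clf.score(emb_test, y_test)
print(f"Target-task accuracy: {accuracy:.3f}")
```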

D. Additional results: Generalization to multi-class

Transferability generalizes well to multi-class, as evident in our face recognition (10k labels)-to-attribute tests in Sec. 5.2. Table 2 below reports hardness tests with multi-class CelebA attribute aggregates. Generally speaking, the harder the task, the lower the accuracy obtained.

Multi-class          Straight/Wavy/Other   Black/Blonde/Other   Arched/Bushy/Other
Hardness ↓           1.040                 0.925                0.867
Dedicated Res18      0.713                 0.859                0.797

Multi-class          Bangs/Receding/Other  Gray/Blonde/Other    Goatee/Beard/None
Hardness ↓           0.690                 0.575                0.557
Dedicated Res18      0.900                 0.943                0.937

Table 2. Multi-class hardness examples on CelebA data.

E. Full transferability results

• Attribute prediction on CelebA [35]: see Fig. 7.

• CelebA: Transferability from identity to attributes: see Table 3.

• Attribute prediction on AwA2 [67]: see Fig. 8, 9, and 10.

F. Full hardness results

• CelebA [35] attribute prediction hardness: see Table 4.

• AwA2 [67] attribute prediction hardness: see Table 5.

• CUB [66] attribute prediction hardness: see Table 6.


Attribute: Male | Bald | Gray Hair | Mustache | Double Chin | Chubby | Sideburns | Goatee | Young | Wear Hat
1 LNets+ANet 2015 [35]: 0.980 0.980 0.970 0.950 0.920 0.910 0.960 0.950 0.870 0.990
2 Walk and Learn 2016 [63]: 0.960 0.920 0.950 0.900 0.930 0.890 0.920 0.920 0.860 0.960
3 MOON 2016 [55]: 0.981 0.988 0.981 0.968 0.963 0.954 0.976 0.970 0.881 0.990
4 LMLE 2016 [26]: 0.990 0.900 0.910 0.730 0.740 0.790 0.880 0.950 0.870 0.990
5 CR-I 2017 [16]: 0.960 0.970 0.950 0.940 0.890 0.870 0.920 0.960 0.840 0.980
6 MCNN-AUX 2017 [23]: 0.982 0.989 0.982 0.969 0.963 0.957 0.978 0.972 0.885 0.990
7 DMTL 2018 [22]: 0.980 0.990 0.960 0.970 0.990 0.970 0.980 0.980 0.900 0.990
8 Face-SSD 2019 [27]: 0.973 0.986 0.976 0.960 0.960 0.951 0.966 0.963 0.876 0.985
9 Conditional Entropy↑: 0.017 0.026 0.052 0.062 0.083 0.087 0.088 0.089 0.095 0.107
10 Dedicated Res18: 0.985 0.990 0.980 0.968 0.959 0.951 0.976 0.974 0.879 0.991
11 FromID SVM: 0.992 0.991 0.981 0.968 0.963 0.957 0.976 0.973 0.899 0.988

Attribute: Eyeglasses | Pale Skin | Wear Necktie | Blurry | No Beard | Receding Hairline | 5 o'Clock Shadow | Rosy Cheeks | Blond Hair | Big Lips
1 LNets+ANet 2015 [35]: 0.990 0.910 0.930 0.840 0.950 0.890 0.910 0.900 0.950 0.680
2 Walk and Learn 2016 [63]: 0.970 0.850 0.840 0.910 0.900 0.840 0.840 0.960 0.920 0.780
3 MOON 2016 [55]: 0.995 0.970 0.966 0.957 0.956 0.936 0.940 0.948 0.959 0.715
4 LMLE 2016 [26]: 0.980 0.800 0.900 0.590 0.960 0.760 0.820 0.780 0.990 0.600
5 CR-I 2017 [16]: 0.960 0.920 0.880 0.850 0.940 0.870 0.900 0.880 0.950 0.680
6 MCNN-AUX 2017 [23]: 0.996 0.970 0.965 0.962 0.960 0.938 0.945 0.952 0.960 0.715
7 DMTL 2018 [22]: 0.990 0.970 0.970 0.960 0.970 0.940 0.950 0.960 0.910 0.880
8 Face-SSD 2019 [27]: 0.992 0.957 0.956 0.950 0.949 0.931 0.929 0.943 0.936 0.778
9 Conditional Entropy↑: 0.109 0.122 0.131 0.139 0.141 0.141 0.145 0.152 0.16 0.161
10 Dedicated Res18: 0.997 0.970 0.963 0.963 0.961 0.936 0.942 0.950 0.961 0.715
11 FromID SVM: 0.996 0.958 0.941 0.956 0.958 0.933 0.937 0.939 0.949 0.710

Attribute: Bushy Eyebrows | Wear Lipstick | Big Nose | Bangs | Narrow Eyes | Wear Necklace | Heavy Makeup | Black Hair | Wear Earrings | Arched Eyebrows
1 LNets+ANet 2015 [35]: 0.900 0.930 0.780 0.950 0.810 0.710 0.900 0.880 0.820 0.790
2 Walk and Learn 2016 [63]: 0.930 0.920 0.910 0.960 0.790 0.770 0.960 0.840 0.910 0.870
3 MOON 2016 [55]: 0.926 0.939 0.840 0.958 0.865 0.870 0.910 0.894 0.896 0.823
4 LMLE 2016 [26]: 0.820 0.990 0.800 0.980 0.590 0.590 0.980 0.920 0.830 0.790
5 CR-I 2017 [16]: 0.840 0.940 0.800 0.950 0.720 0.740 0.840 0.900 0.830 0.800
6 MCNN-AUX 2017 [23]: 0.928 0.941 0.845 0.960 0.872 0.866 0.915 0.898 0.904 0.834
7 DMTL 2018 [22]: 0.850 0.930 0.920 0.960 0.900 0.890 0.920 0.850 0.910 0.860
8 Face-SSD 2019 [27]: 0.896 0.926 0.823 0.952 0.890 0.878 0.907 0.879 0.869 0.820
9 Conditional Entropy↑: 0.192 0.202 0.232 0.236 0.252 0.252 0.27 0.286 0.291 0.306
10 Dedicated Res18: 0.927 0.935 0.828 0.961 0.875 0.859 0.916 0.901 0.896 0.834
11 FromID SVM: 0.919 0.940 0.845 0.950 0.863 0.865 0.897 0.869 0.853 0.822

Attribute: Brown Hair | Bags U Eyes | Oval Face | Straight Hair | Pointy Nose | Attractive | Wavy Hair | High Cheeks | Smiling | Mouth Open | Average (all)
1 LNets+ANet 2015 [35]: 0.800 0.790 0.660 0.730 0.720 0.810 0.800 0.870 0.920 0.920 0.873
2 Walk and Learn 2016 [63]: 0.810 0.870 0.790 0.750 0.770 0.840 0.850 0.950 0.980 0.970 0.887
3 MOON 2016 [55]: 0.894 0.849 0.757 0.823 0.765 0.817 0.825 0.870 0.926 0.935 0.909
4 LMLE 2016 [26]: 0.870 0.730 0.680 0.730 0.720 0.880 0.830 0.920 0.990 0.960 0.838
5 CR-I 2017 [16]: 0.860 0.800 0.660 0.730 0.730 0.830 0.790 0.890 0.930 0.950 0.866
6 MCNN-AUX 2017 [23]: 0.892 0.849 0.758 0.836 0.775 0.831 0.839 0.876 0.927 0.937 0.913
7 DMTL 2018 [22]: 0.960 0.990 0.780 0.850 0.780 0.850 0.870 0.880 0.940 0.940 0.926
8 Face-SSD 2019 [27]: 0.835 0.825 0.748 0.834 0.749 0.813 0.851 0.868 0.918 0.919 0.903
9 Conditional Entropy↑: 0.315 0.324 0.339 0.339 0.341 0.361 0.381 0.476 0.521 0.551
10 Dedicated Res18: 0.886 0.834 0.752 0.836 0.769 0.823 0.842 0.878 0.933 0.943 0.911
11 FromID SVM: 0.854 0.838 0.733 0.812 0.769 0.820 0.800 0.859 0.909 0.901 0.902

Table 3. Transferability from face recognition to facial attributes. (Extended from Table 1 in the paper.) Results for CelebA attributes, sorted in ascending order of row 9 (decreasing transferability). Classification accuracies are shown for all 40 attributes. Subject-specific attributes, e.g., male and bald, are more transferable than expression-related attributes such as smiling and mouth open. These identity-specific attributes correspond to the automatic grouping presented in the original CelebA paper [35]. Unlike them, however, we obtain this grouping without necessitating the training of a deep attribute classification model. Unsurprisingly, transfer results (row 11) are best on these subject-specific attributes and worst for less related attributes. Rows 1-8 provide published state-of-the-art results. Despite training only an lSVM per attribute, row 11 results are comparable with more elaborate attribute classification systems. For details, see Sec. 5.2.

Attribute (Conditional Entropy↑), sorted ascending:
Bald 0.107, Mustache 0.173, Gray Hair 0.174, Pale Skin 0.177, Double Chin 0.189, Wearing Hat 0.194, Blurry 0.201, Sideburns 0.217, Chubby 0.22, Goatee 0.235,
Eyeglasses 0.241, Rosy Cheeks 0.242, Wearing Necktie 0.261, Receding Hairline 0.278, 5 o'Clock Shadow 0.349, Narrow Eyes 0.357, Wearing Necklace 0.373, Bushy Eyebrows 0.409, Blond Hair 0.419, Bangs 0.425,
No Beard 0.448, Wearing Earrings 0.485, Bags Under Eyes 0.507, Brown Hair 0.508, Straight Hair 0.512, Young 0.535, Big Nose 0.545, Black Hair 0.55, Big Lips 0.552, Arched Eyebrows 0.58,
Pointy Nose 0.591, Oval Face 0.597, Wavy Hair 0.627, Heavy Makeup 0.667, Male 0.679, High Cheekbones 0.689, Wearing Lipstick 0.692, Smiling 0.693, Mouth Slightly Open 0.693, Attractive 0.693

Table 4. CelebA task hardness. CelebA facial attributes sorted in ascending order of hardness along with their respective hardness scores. Hardness scores listed above are compared with empirical test errors for each task and shown to be strongly correlated (Fig. 6(a) in the paper). Note that the male classification task, appearing here as relatively hard, is the easiest task to transfer to from face recognition (Table 1).

Attribute (Conditional Entropy↑), sorted ascending:
Flys 0.057, Red 0.089, Skimmer 0.112, Desert 0.125, Plankton 0.140, Insects 0.170, Tunnels 0.181, Hands 0.200, Tusks 0.205, Strainteeth 0.229, Cave 0.243, Blue 0.251, Stripes 0.263, Scavenger 0.270, Hops 0.284,
Oldworld 0.294, Orange 0.303, Yellow 0.315, Quadrapedal 0.324, Flippers 0.345, Ground 0.381, Ocean 0.391, Coastal 0.392, Arctic 0.408, Walks 0.429, Swims 0.433, Water 0.433, Weak 0.439, Longneck 0.444, Bipedal 0.444,
Tree 0.450, Chewteeth 0.454, Hibernate 0.477, Nocturnal 0.487, Fast 0.501, Furry 0.507, Stalker 0.511, Newworld 0.514, Tail 0.515, Horns 0.518, Hairless 0.524, Jungle 0.526, Buckteeth 0.527, Spots 0.562, Active 0.570,
Mountains 0.580, Strong 0.582, Bush 0.585, Pads 0.593, Fish 0.599, Big 0.601, Timid 0.617, Hunter 0.624, Small 0.630, Brown 0.643, Longleg 0.644, Hooves 0.645, Agility 0.646, Nestspot 0.648, Smart 0.648,
Group 0.649, Meat 0.651, Patches 0.656, Fierce 0.661, Forest 0.667, Claws 0.667, Black 0.669, Muscle 0.672, Meatteeth 0.674, Slow 0.675, Fields 0.678, Vegetation 0.679, Domestic 0.682, Grazer 0.686, Gray 0.690,
Paws 0.691, Plains 0.692, Solitary 0.692, Toughskin 0.692, Bulbous 0.693, White 0.693, Forager 0.693, Inactive 0.693, Smelly 0.693, Lean 0.693

Table 5. AWA2 task hardness. AWA2 attributes sorted in ascending order of hardness along with their respective hardness scores. Hardness scores listed above are compared with empirical test errors for each task and shown to be strongly correlated (Fig. 6(b) in the paper).


[Figure 7 panels: one scatter plot per CelebA source attribute, (0) 5 o'clock shadow through (39) Young, plotting conditional entropy (x-axis) against error on the target task (y-axis); per-panel Pearson correlations range from 0.73 to 0.94, all with p < 0.001.]

Figure 7. Attribute prediction; CE vs. test errors on CelebA (Extended from Fig. 2(a-d) in the paper). The source attribute, TZ, in each plot is named in the plot title. Points represent different target tasks TY. Corr is the Pearson correlation coefficient between the two variables and p is the statistical significance of the correlation. In all cases, the correlation is statistically significant.


[Figure 8 panels: one scatter plot per AwA2 source attribute, (0) Black through (34) Flys, plotting conditional entropy (x-axis) against error on the target task (y-axis); per-panel Pearson correlations range from 0.91 to 0.97, all with p < 0.001.]

Figure 8. Attribute prediction; CE vs. test errors on AwA2 (Extended from Fig. 2(e-h) in the paper; part 1). The source attribute, TZ, in each plot is named in the plot title. Points represent different target tasks TY. Corr is the Pearson correlation coefficient between the two variables and p is the statistical significance of the correlation. In all cases, the correlation is statistically significant.


[Figure 9 panels: one scatter plot per AwA2 source attribute, (35) Hops through (69) Forest, plotting conditional entropy (x-axis) against error on the target task (y-axis); per-panel Pearson correlations range from 0.92 to 0.98, all with p < 0.001.]

Figure 9. Attribute prediction; CE vs. test errors on AwA2 (Extended from Fig. 2(e-h) in the paper; part 2). The source attribute, TZ, in each plot is named in the plot title. Points represent different target tasks TY. Corr is the Pearson correlation coefficient between the two variables and p is the statistical significance of the correlation. In all cases, the correlation is statistically significant.


[Figure 10 panels: one scatter plot per AwA2 source attribute, (70) Fields through (84) Domestic, plotting conditional entropy (x-axis) against error on the target task (y-axis); per-panel Pearson correlations range from 0.93 to 0.98, all with p < 0.001.]

Figure 10. Attribute prediction; CE vs. test errors on AwA2 (Extended from Fig. 2(e-h) in the paper; part 3). The source attribute, TZ, in each plot is named in the plot title. Points represent different target tasks TY. Corr is the Pearson correlation coefficient between the two variables and p is the statistical significance of the correlation. In all cases, the correlation is statistically significant.


[Table 6 data: CUB attributes (abbreviated as in Table 7) with their respective hardness scores, sorted in ascending order from 0.011 (ec::purple) to 0.693 (si::small).]

Table 6. CUB task hardness. CUB attributes sorted in ascending order of hardness along with their respective hardness scores. Hardness scores listed above are compared with empirical test errors for each task and shown to be strongly correlated (Fig. 6(c) in the paper). Attribute names are abbreviated due to space concerns. Full names are provided in Table 7.


bls = has bill shape; uptc = has upper tail color; untc = has under tail color; tp = has tail pattern
wc = has wing color; hp = has head pattern; nc = has nape color; bep = has belly pattern
upc = has upperparts color; brc = has breast color; bec = has belly color; pc = has primary color
unc = has underparts color; tc = has throat color; ws = has wing shape; lc = has leg color
brp = has breast pattern; ec = has eye color; si = has size; blc = has bill color
bkc = has back color; bll = has bill length; sh = has shape; cc = has crown color
ts = has tail shape; fc = has forehead color; bkp = has back pattern; wp = has wing pattern

Table 7. CUB attribute name abbreviations. Abbreviations used in Table 6 for the attributes in the CUB dataset [66].
