12
Masksembles for Uncertainty Estimation Nikita Durasov 1 * Timur Bagautdinov 2 Pierre Baque 2 Pascal Fua 1 1 Computer Vision Laboratory, EPFL, {name.surname}@epfl.ch 2 Neural Concept SA, {name.surname}@neuralconcept.com https://nikitadurasov.github.io/projects/masksembles/ Abstract Deep neural networks have amply demonstrated their prowess but estimating the reliability of their predictions re- mains challenging. Deep Ensembles are widely considered as being one of the best methods for generating uncertainty estimates but are very expensive to train and evaluate. MC-Dropout is another popular alternative, which is less expensive, but also less reliable. Our central intuition is that there is a continuous spectrum of ensemble-like mod- els of which MC-Dropout and Deep Ensembles are extreme examples. The first one uses an effectively infinite number of highly correlated models, while the second relies on a finite number of independent models. To combine the benefits of both, we introduce Masksem- bles. Instead of randomly dropping parts of the network as in MC-dropout, Masksemble relies on a fixed number of bi- nary masks, which are parameterized in a way that allows changing correlations between individual models. Namely, by controlling the overlap between the masks and their size one can choose the optimal configuration for the task at hand. This leads to a simple and easy to implement method with performance on par with Ensembles at a fraction of the cost. We experimentally validate Masksembles on two widely used datasets, CIFAR10 and ImageNet. 1. Introduction The ability of deep neural networks to produce useful predictions is now abundantly clear [50, 4, 44, 47, 8, 9, 25, 31] but assessing the reliability of these predictions remains a challenge. Among all the methods that can be used to this end, MC-Dropout [12] and Deep Ensembles [32] have emerged as two of the most popular ones. Both of those methods exploit the concept of ensembles to produce uncer- tainty estimates. MC-Dropout does so implicitly - by train- * This work was supported in part by the Swiss National Science Foun- dation ing a single stochastic network [14, 51, 29], where random- ness is achieved by dropping different subsets of weights for each observed sample. At test time, one can obtain multi- ple predictions by running this network multiple times, each time with a different weight configuration. Deep Ensem- bles, on the other hand, build an explicit ensemble of mod- els, where each model is randomly initialized and trained in- dependently using stochastic gradient descent [46, 28]. As importantly, both methods rely on the fact that the outputs of individual models are diverse, which is achieved by in- troducing stochasticity into the training or testing process. However, simply adding randomness does not always lead to diverse predictions. In practice, MC-Dropout often performs significantly worse than Deep Ensembles on un- certainty estimation tasks [32, 40, 17]. We argue that this can be attributed to the fact that each weight is dropped randomly and independently from the others. For many configurations of hyperparameters, in particular the dropout rate, this results in similar weight configurations and, con- sequently, less diverse predictions [10]. Deep Ensembles seem to not share this weakness for reasons that are not yet fully understood and usually produce significantly more re- liable uncertainty estimates. Unfortunately, this comes at a price, both at training and inference time: Building an ensemble requires training multiple models, and using it means loading all of them simultaneously in memory, typi- cally a very scarce resource. In this work, we introduce Masksembles, an approach to uncertainty estimation that tackles these challenges and produces reliable uncertainty estimates on par with Deep Ensembles at a significantly lower computational cost. The main idea behind the method is simple - to introduce a more structured way to drop model parameters than that of MC- Dropout. Masksembles produces a fixed number of binary masks which specify the network parameters to be dropped. The properties of the masks effectively define the properties of the final ensemble in terms of capacity and correlations. arXiv:2012.08334v2 [cs.LG] 25 Jun 2021

@epfl.ch

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: @epfl.ch

Masksembles for Uncertainty Estimation

Nikita Durasov1* Timur Bagautdinov2 Pierre Baque2 Pascal Fua1

1Computer Vision Laboratory, EPFL, {name.surname}@epfl.ch2Neural Concept SA, {name.surname}@neuralconcept.com

https://nikitadurasov.github.io/projects/masksembles/

Abstract

Deep neural networks have amply demonstrated theirprowess but estimating the reliability of their predictions re-mains challenging. Deep Ensembles are widely consideredas being one of the best methods for generating uncertaintyestimates but are very expensive to train and evaluate.

MC-Dropout is another popular alternative, which isless expensive, but also less reliable. Our central intuitionis that there is a continuous spectrum of ensemble-like mod-els of which MC-Dropout and Deep Ensembles are extremeexamples. The first one uses an effectively infinite number ofhighly correlated models, while the second relies on a finitenumber of independent models.

To combine the benefits of both, we introduce Masksem-bles. Instead of randomly dropping parts of the network asin MC-dropout, Masksemble relies on a fixed number of bi-nary masks, which are parameterized in a way that allowschanging correlations between individual models. Namely,by controlling the overlap between the masks and their sizeone can choose the optimal configuration for the task athand. This leads to a simple and easy to implement methodwith performance on par with Ensembles at a fraction ofthe cost. We experimentally validate Masksembles on twowidely used datasets, CIFAR10 and ImageNet.

1. IntroductionThe ability of deep neural networks to produce useful

predictions is now abundantly clear [50, 4, 44, 47, 8, 9, 25,31] but assessing the reliability of these predictions remainsa challenge. Among all the methods that can be used tothis end, MC-Dropout [12] and Deep Ensembles [32] haveemerged as two of the most popular ones. Both of thosemethods exploit the concept of ensembles to produce uncer-tainty estimates. MC-Dropout does so implicitly - by train-

*This work was supported in part by the Swiss National Science Foun-dation

ing a single stochastic network [14, 51, 29], where random-ness is achieved by dropping different subsets of weights foreach observed sample. At test time, one can obtain multi-ple predictions by running this network multiple times, eachtime with a different weight configuration. Deep Ensem-bles, on the other hand, build an explicit ensemble of mod-els, where each model is randomly initialized and trained in-dependently using stochastic gradient descent [46, 28]. Asimportantly, both methods rely on the fact that the outputsof individual models are diverse, which is achieved by in-troducing stochasticity into the training or testing process.

However, simply adding randomness does not alwayslead to diverse predictions. In practice, MC-Dropout oftenperforms significantly worse than Deep Ensembles on un-certainty estimation tasks [32, 40, 17]. We argue that thiscan be attributed to the fact that each weight is droppedrandomly and independently from the others. For manyconfigurations of hyperparameters, in particular the dropoutrate, this results in similar weight configurations and, con-sequently, less diverse predictions [10]. Deep Ensemblesseem to not share this weakness for reasons that are not yetfully understood and usually produce significantly more re-liable uncertainty estimates. Unfortunately, this comes ata price, both at training and inference time: Building anensemble requires training multiple models, and using itmeans loading all of them simultaneously in memory, typi-cally a very scarce resource.

In this work, we introduce Masksembles, an approachto uncertainty estimation that tackles these challenges andproduces reliable uncertainty estimates on par with DeepEnsembles at a significantly lower computational cost. Themain idea behind the method is simple - to introduce a morestructured way to drop model parameters than that of MC-Dropout.

Masksembles produces a fixed number of binary maskswhich specify the network parameters to be dropped. Theproperties of the masks effectively define the properties ofthe final ensemble in terms of capacity and correlations.

arX

iv:2

012.

0833

4v2

[cs

.LG

] 2

5 Ju

n 20

21

Page 2: @epfl.ch

During training, for each sample, we randomly choose oneof the masks and drop the corresponding parts of the modeljust like a standard dropout. During inference, we run themodel multiple times, once per mask, to obtain a set ofpredictions and, ultimately, an uncertainty estimate. Ourmethod has three key hyperparameters: the total number ofmasks, the average overlap between masks and the numberof ones and zeros in each mask. Intuitively, using a largenumber of masks approximates MC-Dropout, while usinga set of non-overlapping, completely disjoint masks yieldsan Ensemble-like behavior. In other words, as illustrated inFig. 5 Masksembles defines a spectrum of model configu-rations of which MC-Dropout and Deep Ensembles can beseen as extreme cases. We evaluate our method on severalsynthetic and real datasets and demonstrate that it outper-forms MC-Dropout in terms of accuracy, calibration, andout-of-distribution (OOD) detection performance. Whencompared to Deep Ensembles, our method has similar per-formance but is more favorable in terms of training time andmemory requirements. Furthermore, Masksembles is sim-ple to implement and can serve as a drop-in replacementfor MC-Dropout in virtually any model. To summarize, themain contributions of this work are as follows:

• We propose an easy to implement, drop-in replacementmethod that performs better than MC-Dropout, at asimilar computational cost, and that matches Ensem-bles at a fraction of the cost.

• We provide theoretical insight into two popular uncer-tainty estimation methods - MC-Dropout and Ensem-bles, by creating a continuum between the two.

• We provide a comprehensive evaluation of our methodon several public datasets. This validates our claimsand guides the choice of Masksemble parameters inpractical applications.

2. Related WorkMC-Dropout [12] and Deep Ensembles [32] have

emerged as two of the most prominent and practical uncer-tainty estimation methods for deep neural networks [1, 40,17]. In what follows, we provide a brief background on un-certainty estimation, review the best-known methods, anddiscuss potential alternatives [53, 20, 2, 37, 36].

2.1. Background

The goal of Uncertainty Estimation (UE) is to producea measure of confidence for model predictions. Two majortypes of uncertainty can be modeled within the Bayesianframework [7]. Aleatoric uncertainty, also known as Datauncertainty [11], captures noise that arises from data andtherefore is irreducible, because the natural complexity ofthe data directly causes it. Sensor noise, labels noise, and

classes overlapping are among the most common sourcesof aleatoric uncertainty. Epistemic uncertainty or model un-certainty [11] accounts for uncertainty in the parameters ofthe model that we learned, that is, it represents our igno-rance about parameters of the data generation process. Thistype of uncertainty is reducible and therefore, given enoughdata, could be eliminated entirely. In addition to the afore-mentioned types of uncertainty, there is also Distributionaluncertainty. Also known as dataset shift [43], this type ofuncertainty is useful when there is a mismatch between thetraining and testing data distributions, and a model is con-fronted with unfamiliar samples.

For all the aforementioned types of uncertainty, ideallywe want our model to output an additional signal indicat-ing how reliable its prediction is for a particular sample.More formally, given an input, we want our model to out-put the actual prediction target as well as the correspond-ing uncertainty measure. For regression tasks, the uncer-tainty measure can be the variance or confidence intervalsaround the prediction made by the network [41]. For classi-fication tasks, popular uncertainty measures are the entropyand max-probability [37, 33]. Ideally, the measure of un-certainty should be high when the model is incapable ofproducing accurate prediction for the given input and lowotherwise. Generally speaking, uncertainty can be eitherlearned from the data directly [33, 27], in particular for bigdata regimes and aleatoric uncertainty, or could be acquiredfrom a diverse set of predictions [12, 33], which are ob-tained from stochastic regularization techniques or ensem-bles. In this work, we focus on the latter family of methods.

2.2. Ensembles

Deep Ensembles [32] involve training an ensemble ofdeep neural networks on the same data by initializing eachnetwork randomly. This approach is originally inspired by aclassical ensembling method, bagging [19]. An uncertaintyestimate is obtained by running all the elements in the en-semble and aggregating their predictions. The major draw-back of Deep Ensembles is the computational overhead: itrequires training multiple independent networks, and duringinference it is desirable to keep all these networks in mem-ory. More generally, ensembling has been a popular way toboost the performance of individual neural network mod-els [18, 30, 33, 42, 54, 56], with the quality improvementsoften being associated with diverse predictions. In [10] au-thors provide a comprehensive comparison of several pop-ular ensemble learning methods as well as several approxi-mate Bayesian inference techniques. Extensive experimentson vision tasks suggest that simple ensembles have the low-est correlation between individual models. This is an im-portant advantage of these methods, as ultimately this leadsto highly diversified predictions on the samples which arevery distinct from those observed in training.

Page 3: @epfl.ch

Snapshot-based methods take a slightly different ap-proach towards creating an ensemble, and try to collect adiverse set of weight configurations over the course of a sin-gle training. Snapshot ensembles [24] build upon the idea ofusing cyclical learning rates to ensure that the optimizationprocess can efficiently explore multiple local optima. Sim-ilarly, Fast Geometric Ensembling [13, 26], uses a methodto identify low-error paths between local optima to obtaina collection of diverse representations. Although these ap-proaches partially tackle training time issues, they typicallyperform worse than the standard ensembles [1] and con-sume the same amount of memory. Furthermore, because ofthe specifics of their training procedure, in particular, sharedinitialization with a pre-trained network, individual snap-shots are not guaranteed to be decorrelated, which mighthurt the performance [1].

2.3. Dropout

Dropout [48] is an extremely popular and simple ap-proach to stochastically regularizing deep neural networks.During training, neurons are randomly dropped with fixedprobability, which effectively produces an approximationof several slightly different model architectures and allowsa single model to mimic ensemble behavior. OriginallyDropout was used during training as a stochastic regular-ization technique, but [12, 11] proposed using it during in-ference, providing sound mathematical justification for en-sembling and Bayesian model averaging analogies. MC-Dropout is extremely easy to use and has zero memory over-head compared to a single model but shows consistentlyworse performance compared to Deep Ensembles [40, 17].In particular, different predictions made by several forwardpasses with randomly generated masks seem to be overlycorrelated and strongly underestimate the variance. Inthis paper, we postulate that the training procedure, whichmakes any combination of the weights possible, will tend tocreate uniformity between predictions. As opposed to En-semble techniques, MC-Dropout techniques do not let mul-tiple versions of the network actually be created, and theweights are forced to converge jointly.

2.4. Other Methods

Although Bayesian inference provides a natural way toassess uncertainty of model predictions, it is prohibitivelyexpensive to apply these techniques to deep neural networksdirectly. A number of approximate inference techniqueswere developed for Bayesian Neural Networks (BNN) [34,38, 35] that try to alleviate these problems, to name afew: Bayes by Backprop [3] uses Variational Inference tolearn a parametric distribution over model weights that ap-proximates the true posterior, Laplace Approximation [45]tries to find a Gaussian approximation to the posteriormodes, MC-Dropout [12] could be seen as an approximate

Bayesian procedure as well. Although these techniques aretheoretically grounded, Deep Ensembles often show signif-icantly better performance in practice [40, 1], in terms ofboth accuracy and quality of the resulting uncertainty esti-mates.

3. MethodIn this section, we introduce Masksembles, an approach

to uncertainty estimation that creates a continuum betweensingle networks, deep ensembles, and MC-Dropout. Itbuilds on the intuition that dropout-based methods can beseen as stochastic variants of ensembles [33], albeit en-sembles with infinite cardinality. We will argue that themain reason for MC-Dropout’s poor performance is thehigh correlation between the ensemble elements that makesthe overall predictions insufficiently diverse. Masksemblesis designed to give control over the correlation between theelements of an ensemble and, hence, achieve a satisfac-tory compromise between reliable uncertainty estimationand acceptable computational cost.

3.1. Masksembles as Structured Dropout

Both Ensembles and MC-Dropout are ensemble meth-ods, that is, they produce multiple predictions for a giveninput and then use an aggregated measure such as varianceor entropy as an uncertainty estimator. Formally, consider adataset {xi, yi}i, where xi ∈ RF are input features, and yiare scalars or labels. For classification problems, we modelconditional distribution p(y|x;θ) as a mixture

p(y|x;θ) = 1

N

N∑k=1

p(y|x;θk) ,

where θk are the weights of element k of the ensemble, andN is the size of the ensemble.

For Deep Ensembles, individual models are independentand do not share any weights, and hence each θk is a sep-arate set of weights, trained entirely independently. Bycontrast, for MC-Dropout there is a single shared θ, andindividual models are obtained as θk = btk · θ, wherebtk ∈ {0, 1}|θ| are random binary masks which are re-sampled for each iteration t. The elements of btk followBernoulli distribution parameterized by dropout probabilityp, and the parameters corresponding to the same networkactivations are dropped simultaneously. At inference time, asmall number of predictions is produced by randomly sam-pling the masks. In practice, for any reasonable choice ofp, the masks btk will overlap significantly, which will tendto make predictions p(y|x;θk) highly correlated, and thuscould lead to underestimated uncertainty. Furthermore, be-cause masks are re-sampled at each training iteration, eachnetwork unit needs to be able to form a coherent response

Page 4: @epfl.ch

with any other unit, which creates a mixing effect and leadsto uniformity between the predictions from different masks.

To tackle these issues, we propose using a finite set ofpre-defined binary masks, which are generated so that theiroverlap can be controlled. They are then used to drop cor-responding network activations so that the resulting modelsare sufficiently decorrelated and do not suffer from the mix-ing effect described above. In what follows, we describethe algorithm for mask generation, explain the way we ap-ply generated masks to models, and discuss the trade-offs itincurs.

3.2. Generating Masks

Generating masks with a controllable amount of overlapis central to Masksembles. Let us first introduce the keyparameters of our mask generation process:

• N - number of masks.

• M - number of ones in each mask.

• S - scale that controls the amount of overlap given Nand M.

The masks generation algorithm is summarized below. Itstarts by generating N vectors of all zeros of size M × S.Then, in each of those vectors, it randomly sets M of theseelements to be 1 in each vector. As depicted by Fig. 1, itthen looks for positions that are all zero in the resulting vec-tors. These correspond to features that will be dropped.

For S = 1.0, the algorithm then generates N masks ofsize M that will contain only ones and therefore fully over-lapping. Conversely, setting S to large positive values willgenerate N masks with mostly zero values and thereforevery few overlapping one values.

3.3. Inserting Masksembles Layer into Networks

We start with the very simple case depicted by Fig. 2.It shows an MLP with one hidden layer of size 3 and 2-dimensional inputs and outputs. We insert Masksembleslayer right after the hidden layer and increase its size toM × S −D where D is the number of dropped features sothat the sizes of the masks and the size of the layer match.Then at run-time, we apply the mask values and ignore allcomponents of the hidden layer for which they are zero.

Fig. 2 features 3 different configurations obtained by fix-ing both M and N — here, N is sufficiently large so thatD = 0 — and changing the value of S to 1.0, 1.7, and 2.3,therefore increasing masks sizes and hidden size in range[3, 5, 7].

Applying such modification to the model creates a largerone that allows for multiple relatively uncorrelated predic-tions. This process extends naturally to MLPs of any dimen-sion. It also extends to convolutional architectures with-out any significant modification. As in MC-dropout, we

0

1

1

0 0 0

0

1 0

1

1

1

0 1 1 0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

1

1

0

1 0

1

1

1

0 1 1 0

Initial zerosmasks

Masks filledwith ones

Final trimmedmasks

(1) (2)

N masks

S Mfeatures

0

1

1

0 0 0

0

1 0

1

0 1 0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

Initial zerosmasks

Masks filledwith ones

Final trimmedmasks

(1) (2)

N masks

S Mfeatures

0 0 0 0

0 0 0 0

0

1

1

0 0 0

0

1 0

1

1

1

0 1 1 0

0

0

10

1

0

0

0 1

0

0

0

1

1

0 0 0

0

1 0

1

0 1 0

0

0

10

1

0

0

0 1

0

0

1

1

0

1 0

1

0 1 0

0

0

0

0

10

1

0

0 1

0

Trimmingunused features

Trimmingunused features

S

S

2

3

Figure 1. Masks generation algorithm. Here we use N = 4,M = 2, S = 2.0 and 3.0. In the first case M · S = 4 and in thesecond M·S = 6. First, we generate 4 masks with M·S zeros each.Second, we randomly choose 2 positions for every of 4 masks andfill them with ones. Finally, we drop features that are not used inany mask. In the first case, the IoU is 0.44 and in the second 0.16.

1 1 1 0 1 01 1 1 0 00 11 0

Pointwisemultiplication

Trainableweight 1 Active

feature 0 Droppedfeature

Figure 2. Masksembles in a simple case. Example of Masksem-bles MLP behavior for different values of S ∈ [1.0, 1.7, 2.3] (Nand M are fixed): xk - inputs, hk - activations, yk - outputs.

can drop entire channels [49] instead of individual activa-tions. In other words, treating the whole channel as a singlefeature is equivalent to the fully connected case discussedabove.

In general, this means that we may have to adapt the sizeof network layers each time we change M, N, or S, whichcan be cumbersome. Fortunately, keeping S ×M equal tothe original channels number, 3 in the MLP example, makesit possible to apply Masksembles without any such changeof the network configuration.

3.4. Parameters and Model Transitions

The chosen parameters N, M and S define several keymodel properties:

• Total model size: The total number of model param-eters. In the above MLP example, this is equal to4 · (M× S−D) where D is the number of featurestrimmed during the mask generation procedure. Thisfunctional relationship is illustrated in Fig. 3.

Page 5: @epfl.ch

• Model capacity: The number of activations that areused during one forward pass of the model, which isequal to M.

• Masks IoU average Intersection over Union (IoU)between generated masks. We derive approximationfor this quantity that appears to depend only on S :IoU(S) = 1/(2S− 1) and therefore N,M do not influ-ence it, as show in Fig. 3.

100 101 102 103

scale S

10

20

30

mas

ks N

4x8x

16x

32x

5 10scale S

0.25

0.50

0.75

1.00av

erag

e Io

U 12S 1

Figure 3. Influence of the parameters. (Left) Relative Totalmodel size as a function of S and N, with M being fixed. Werefer to size of single model (S = 1,N = 1) as 1x. (Right) IoUof the generated masks as a function of S. Note that it does notdepend on neither N nor M.

In fact, Deep Ensembles and MC-Dropout can be bothseen as extreme cases and Masksembles can transitionsmoothly from one to the other (Fig. 5).

Ensembles: When S goes to infinity, Masks IoU goes tozero and Masksembles starts behaving like Ensembles thatuses non-overlapping masks.

Dropout: As N increases, each individual mask config-uration becomes less and less likely to be picked duringtraining. In the limit where N goes to infinity, we reproducethe situation where each mask is seen only once, which isequivalent to MC-Dropout. In this case, the dropout ratewould be 1− 1

S .

Transition: In Fig. 14, we use a simple classificationproblem to illustrate the span of behaviors Masksemblescan cover. Specifically, we show that reducing maskoverlap, that is, decreasing the correlation between binarymasks, yields progressively more diverse predictions onout-of-training examples and generates more ensemble-likebehavior.

3.5. Implementation details

The only extra computation performed in Masksemblesover a single network is cheap tensor-to-mask multiplica-tion. Therefore, compared to convolutional and fully con-nected layers, this product induces negligible computationaloverhead. Furthermore, we rely on similar optimizations asin [52] to make our implementation more efficient.

15 10 5 0 5 10 15

6

4

2

0

2

4

6

15 10 5 0 5 10 15

6

4

2

0

2

4

6

15 10 5 0 5 10 15

6

4

2

0

2

4

6

(a) (b) (c)

15 10 5 0 5 10 15

6

4

2

0

2

4

6

15 10 5 0 5 10 15

6

4

2

0

2

4

6

15 10 5 0 5 10 15

6

4

2

0

2

4

6

(c) (e) (f)Figure 4. Single Model to Ensembles transition via Masksem-bles. The task is to classify the red vs blue data points drawn in therange x ∈ [−5, 5] from two sinusoidal functions with added Gaus-sian noise. The background color depicts the entropy assigned bydifferent models to individual points on the grid, low in blue andhigh in yellow. (a) Single model, (b) - (e) Masksembles modelswith N = 4, M = 100, S = [1.1, 2.0, 3.0, 10.0], (f) - Ensemblesmodel. For a fair comparison with ensembles, we used a fixedvalue for M = 100 when training Masksembles with different S,so that each Masksembles-model has the same 100 hidden-unitscapacity and the correlation between models decreases from (b)to (e). For high mask-overlap values, Masksembles behaves al-most like a Single Model but starts behaving more and more likeEnsembles as the overlap decreases.

Optimalconfiguration

EnsemblesMC-Dropout

SingleModel

(Modelsize/Capacity)

Figure 5. Ensembles to MC-Dropout transition. Green / Redpoints denote extreme cases that correspond to these models,whereas the Yellow point represents an optimal parameter choiceone could choose for a specific computational cost vs performancetrade-off. Here M is fixed while N and S vary.

4. Experiments

In this section, we first introduce the evaluation met-rics which we use to quantitatively compare Masksemblesagainst MC-Dropout and Ensembles. We then compare ourmethod with the baselines on two broadly used benchmarkdatasets CIFAR10 [30] and ImageNet [6].

Note that we did not use any complex data augmenta-tion schemes, such as AugMix [23], Mixup [57] or adver-

Page 6: @epfl.ch

sarial training [15], in order to avoid entangling their ef-fects with those of Masksembles. Ultimately, these tech-niques are complementary to ours and could easily be usedtogether.

4.1. Metrics

Given a classifier with parameters θ, let p(y|x;θ) be thepredicted probability for a sample represented by featuresx to have a label y, and let pgt(y|x) denote the true condi-tional distribution over the labels given the features. We useseveral different metrics to analyze how close p(y|x;θ) isto pgt(y|x) and to compare our method to others, in termsof both accuracy and quality of uncertainty estimation.

Accuracy. For classification tasks, we use the standardclassification accuracy, i.e. the percentage of correctly clas-sified samples on the test set.

Entropy (ENT). It is one of the most popular measuresused to quantify uncertainty [33, 37]. For classificationtasks it is defined as

−Nc∑c=1

p(y = c|x;θ) log p(y = c|x;θ)

where Nc is the number of classes.

Expected Calibration Error (ECE). ECE quantifies thequality of uncertainty estimates specifically for in-domainuncertainty. A model is considered well-calibrated if itspredicted distribution over labels p(y|x;θ) is close to thetrue distribution pgt(y|x). As in [16], we define ECE to bethe average discrepancy between the predicted and real datadistribution:

Ex,y∼pgt|p(y|x;θ)− pgt(y|x)| ,

which is a common metric for calibration evaluation.

Out-of-Distribution Detection (OOD ROC / PR). Do-main shift occurs when the features in the training data donot follow the same probability distribution as those in thetest distribution. Detecting out-of-distribution samples is animportant task, and uncertainty estimation can be used forthis purpose under the assumption that our model returnshigher uncertainty estimates for such samples. To this end,we use the uncertainty measure produced by our models asa classification score that determines whether a sample isin-domain or not. We then apply standard detection met-rics, ROC and PR AUCs, to quantify the performance ofour model.

Model Size This corresponds to the Total model size de-fined in Section 3.4. A major motivation for Masksemblesis to match the performance of Deep Ensembles while usinga smaller model that requires significantly less memory. Weuse a total number of weights that parameterize our modelsas a proxy for that. In addition to the model size, we alsoreport the total GPU time used to train any particular model.

1 2 3 4 5severity

0.0

0.2

0.4

0.6

ECE

MethodSingleMasksemblesEnsemblesMC-Dropout

1 2 3 4 5severity

0.2

0.4

0.6

0.8

Accu

racy

MethodSingleMasksemblesEnsemblesMC-Dropout

Figure 6. CIFAR-10 results on corrupted images. Masksemblesand MC-Dropout models with tuned (on validation set) dropoutrate parameter values, Ensemble of independently trained modelsand Single model. All approaches, except Single, use 4 modelswhose predictions are averaged.

4.2. CIFAR10

CIFAR10 is among the most popular benchmarks for un-certainty estimation. The dataset is relatively small-scale,and very large models tend to overfit the data easily. We,therefore, picked the Wide-ResNet-16-4 [55] for our exper-iments and used training procedures similar to the originalpaper. Following [12], we place dropout layers before allthe convolutional and fully connected layers in the MC-Dropout model. Masksembles layers are added exactly thesame way.

Accuracy and ECE. One of the standard ways to as-sess the robustness of a model is to evaluate the depen-dency of its performance with respect to inputs perturba-tions [22, 40]. In our experiments, we investigate the be-haviour of all the considered models under the influence ofdomain shifts induced by typical sources of noise.

Following [52, 40], we evaluate model accuracy andECE on a corrupted version of CIFAR10 [22]. Namely,we consider 19 different ways to artificially corrupted theimages and 5 different levels of severity for each of thosecorruptions.

We train all the models on the original uncorrupted CI-FAR10 training set and test them both on the original im-ages and their corrupted versions. In Table 1, we reportour results on the original uncorrupted images and in Fig. 6the results on the noisy ones. Each box represents the firstand third quantiles of accuracy and ECE while the errorbars denote the minimum and maximum values. We testedfour different approaches, that is, a single network, MC-Dropout, Ensembles, and our Masksembles. Unsurpris-ingly, as the severity of the perturbations increases, using

Page 7: @epfl.ch

0.89 0.90 0.91 0.92Accuracy

0.00

0.02

0.04

0.06

0.08

ECE

Optimal configuration

EnsemblesMC-DropoutMasksembles

1.01.52.02.53.03.5

mod

el si

ze

Figure 7. Spanning the space of behaviors. Models in the bottomright corner are better. The color represents the relative size ofthe model compared to a single model. The size of Masksemblesmarkers denotes mask overlap between 1.0 and 0.2.

a single network is the least robust approach and degradesfastest. Masksembles performs on par with Ensembles andconsistently outperforms MC-dropout, even though we ex-tensively tuned MC-Dropout for the best possible perfor-mance by doing a grid search on the dropout rate (whichturned out to be p ≈ 0.1).

In these experiments, Masksembles was trained with afixed number of N = 4 masks. We varied both S in therange [1.0, 5.0] and M in the range [0.3C, 2.0C], where Cis the number of channels in the original model. Fig. 7 de-picts the resulting range of behaviors. The 2D coordinatesof the markers depict their accuracy and ECE, their colors -the corresponding model size, and the size of the star mark-ers denotes the value of the mask-overlap. For comparisonpurposes, we also display MC-dropout and Ensembles re-sults in a similar manner, simply replacing the star with asquare and a circle, respectively. As can be seen, the opti-mal Masksemble configuration depicted by the green star inthe lower right corner delivers performances similar to thoseof Ensembles for a model size that is almost half the size.This is the one we also used in the experiments depicted byFig. 6.

In Fig. 8, we chose the Masksembles parameters so thatthe model size always remains the same: We fix the numberof masks N = 4 and vary S and M so that the total modelsize remains constant, which means that a lower mask over-lap results in a lower capacity for each individual model.This procedure is important in practice because it enables usto take a large network, such as ResNet, and add Masksem-bles layers as we would add Dropout layer without havingto change the rest of the architecture.

For this set of experiments we obtain a consistent im-provement in calibration quality by reducing mask overlapand achieve ensembles-like performance with S = 5.

OOD ROC and PR. For OOD task, we followed theevaluation protocol of [5]. Namely, we trained our mod-els on CIFAR10, and then used CIFAR10 test images as

None 1 2 3 4 5severity

0.000.050.100.150.200.250.300.35

ECE

Single ModelMasksemblesEnsembles

1

2

3

4

5

scal

e S

Figure 8. CIFAR10 ECE. ECE as a function of severity of imagecorruptions. Each Masksembles curve corresponds to a differentS. In this experiment we force M×S to remain constant and equalto the original Wide-ResNet channels number and we vary S inthe range [1., 1.1, 1.4, 2.0, 3.0]. Larger values of S correspond tomore ensembles-like behavior.

1.0 1.1 1.3 2.0 4.0Scale S

0.01

50.

030.

045

0.06

ECE

1.0 1.1 1.3 2.0 4.0Scale S

0.89

0.9

0.91

0.92

Accuracy

MasksemblesEnsemblesSingle Model

1.0 1.1 1.3 2.0 4.0Scale S

0.91

0.92

0.93

0.94

OOD ROC

Figure 9. Fixed capacity for CIFAR10. ECE, Accuracy and OODas functions of the scale parameter S with fixed M and N. Thecolors represent different masks overlap values.

Single MC-D MaskE NaiveEAccuracy 0.89 0.90 0.92 0.92

ECE 0.06 0.01 0.01 0.01Size 1x 1x 2.3x 4xTime 10m 12m 16m 40m

OOD ROC 0.91 0.92 0.94 0.94OOD PR 0.94 0.95 0.96 0.96

Table 1. CIFAR10 results. Each model uses 4 samples duringinference and we used the uncorrupted versions of the images.

our in-distribution samples and images from the SVHNdataset [39] (which belong to classes that are not in CI-FAR10), as our out-of-distribution samples. We report ourOOD results in the two bottom rows of Table 1. Again, per-formance of our method is similar to that of Ensembles ata fraction of the computational cost, and significantly betterthan that of a single network and MC-dropout.

Fixed Capacity Model. In order to provide clear ev-idence for Single Model ←→ Ensembles transition ofMasksembles we perform fixed capacity experiments andreport the results in Fig. 9. Similarly to our previous exper-iments we set N = 4 and vary the scale parameter in rangeS ∈ [1.0, 4.0], while keeping parameter M fixed. Theseresults confirm that Masksembles is able to span the en-tire spectrum of behaviours between different models: the

Page 8: @epfl.ch

None 1 2 3 4 5severity

0.020.040.060.080.100.120.14

ECE

Single ModelMasksemblesEnsembles

1.0

1.5

2.0

2.5

3.0

scal

e S

Figure 10. ImageNet ECE. ECE as a function of severity of imagecorruptions. Each Masksembles curve corresponds to a differentS. In this experiment we force M × S to be constant and equal tooriginal ResNet-50 channels number.

key metrics gradually improve as S is increased, and theMasksembles’s behavior (solid line) clearly transitions fromSingle Model to Ensembles behavior (dashed lines).

4.3. ImageNet

The ImageNet [6] is another dataset that is widely usedto assess the ability of neural networks to produce uncer-tainty estimates. For this dataset, we rely on the well-knownResNet-50 [21] architecture as a base model, using the setupsimilar to that in [1]. Similarly to our CIFAR10 experi-ments, for the MC-Dropout and Masksembles models, weintroduce dropout (masking) layers before every convolu-tional layer starting from stage3 layers.

We followed exactly the same evaluation protocol as inSection 4.2 and report our results on the original images inTable 2 and on the corrupted ones in Fig. 11. In Table 2and Fig. 10 we use the same procedure as for CIFAR10 toselect Masksembles parameters so that the total model sizeremains fixed.

For this dataset, Masksembles demonstrates its best per-formance for a choice of parameters corresponding to aMasks IoU of 0.5. Similarly to our results on CIFAR10, theperformance of Masksembles both in terms of accuracy andECE is very close to that of Ensembles and significantly bet-ter than that of MC-Dropout. Note that our method achievesthese results with a training time and memory consumptionthat is 4 (four) times smaller than that of Ensembles, andnearly the same as that of a single network. In terms of cali-bration quality on corrupted images, we achieve Ensemble-like performance starting from S = 2 and up.

5. ConclusionIn this work, we introduced Masksembles, a new ap-

proach to uncertainty estimation in deep neural networks.Instead of using a fixed number of independently trainedmodels as in Ensembles or drawing a new set of ran-dom binary masks at each training step as in MC-Dropout,Masksembles predefines a pool of binary masks with a con-

Single Masksembles NaiveE MC-DPAcc. 0.71 0.72 0.71 0.70 0.74 0.69ECE 0.07 0.03 0.02 0.02 0.02 0.03IoU 1.0 0.7 0.3 0.2 - -Time 50h 55h 60h 70h 200h 80h

Table 2. ImageNet results. Accuracy and ECE results for Single,Masksembles models using masks overlapping values (0.7, 0.3,0.2), Ensembles, and MC-Dropout. All the models have the samesize as the Single model, except for Ensembles which is 4x timeslarger.

1 2 3 4 5severity

0.00

0.05

0.10

0.15

0.20

ECE

MethodSingleMasksemblesEnsemblesMC-Dropout

1 2 3 4 5severity

0.0

0.2

0.4

0.6Ac

cura

cy

MethodSingleMasksemblesEnsemblesMC-Dropout

Figure 11. ImageNet results on corrupted images. Masksem-bles and MC-Dropout with tuned dropout rate parameter values,Ensemble of independently trained models and Single model. Allapproaches, except Single, include 4 models, their predictions areaveraged.

trollable overlap and stochastically iterates through them.By changing the parameters that control the mask gener-ation process, we can span a range of behaviors betweenthose of MC-Dropout and Ensembles. This allows us toidentify model configurations that provide a useful trade-off between the high-quality uncertainty estimates of DeepEnsembles at a high computational cost and the lower per-formance of MC-Dropout at a lower computational cost.In fact, our experiments demonstrate that we can achievethe performance on par with that of Deep Ensembles at afraction of the cost. Because Masksembles is easy to im-plement, we believe it can serve as a drop-in replacementfor MC-Dropout. In future work, we will explore ways toaccelerate Masksembles by exploiting the structure of themasks, and apply our method to tasks that require a scalableuncertainty estimation method, in particular reinforcementlearning and Bayesian optimization.

Page 9: @epfl.ch

References[1] Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov,

and Dmitry Vetrov. Pitfalls of in-domain uncertainty es-timation and ensembling in deep learning. arXiv preprintarXiv:2002.06470, 2020.

[2] Andrei Atanov, Arsenii Ashukha, Dmitry Molchanov, KirillNeklyudov, and Dmitry Vetrov. Uncertainty estimation viastochastic batch normalization. In International Symposiumon Neural Networks, pages 261–269. Springer, 2019.

[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu,and Daan Wierstra. Weight uncertainty in neural networks.arXiv preprint arXiv:1505.05424, 2015.

[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,Kevin Murphy, and Alan L Yuille. Deeplab: Semantic imagesegmentation with deep convolutional nets, atrous convolu-tion, and fully connected crfs. IEEE transactions on patternanalysis and machine intelligence, 40(4):834–848, 2017.

[5] Kamil Ciosek, Vincent Fortuin, Ryota Tomioka, Katja Hof-mann, and Richard Turner. Conservative uncertainty estima-tion by fitting prior networks. In International Conference onLearning Representations, 2019.

[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,and Li Fei-Fei. Imagenet: A large-scale hierarchical imagedatabase. In 2009 IEEE conference on computer vision andpattern recognition, pages 248–255. Ieee, 2009.

[7] Armen Der Kiureghian and Ove Ditlevsen. Aleatory or epis-temic? does it matter? Structural safety, 31(2):105–112,2009.

[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-vain Gelly, et al. An image is worth 16x16 words: Trans-formers for image recognition at scale. arXiv preprintarXiv:2010.11929, 2020.

[9] Nikita Durasov, Mikhail Romanov, Valeriya Bubnova, PavelBogomolov, and Anton Konushin. Double refinement net-work for efficient monocular depth estimation. In IROS,pages 5889–5894, 2019.

[10] Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan.Deep ensembles: A loss landscape perspective. arXivpreprint arXiv:1912.02757, 2019.

[11] Yarin Gal. Uncertainty in deep learning. University ofCambridge, 1(3), 2016.

[12] Y. Gal and Z. Ghahramani. Dropout as a Bayesian Approx-imation: Representing Model Uncertainty in Deep Learn-ing. In International Conference on Machine Learning, pages1050–1059, 2016.

[13] T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, and A.Wilson. Loss Surfaces, Mode Connectivity and Fast En-sembling of DNNs. In Advances in Neural InformationProcessing Systems, 2018.

[14] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Dropblock:A regularization method for convolutional networks. arXivpreprint arXiv:1810.12890, 2018.

[15] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy.Explaining and harnessing adversarial examples. arXivpreprint arXiv:1412.6572, 2014.

[16] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger.On calibration of modern neural networks. arXiv preprintarXiv:1706.04599, 2017.

[17] Fredrik K Gustafsson, Martin Danelljan, and Thomas BSchon. Evaluating scalable bayesian deep learning methodsfor robust computer vision. In Proceedings of the IEEE/CVFConference on Computer Vision and Pattern RecognitionWorkshops, pages 318–319, 2020.

[18] L. K. Hansen and P. Salamon. Neural network ensem-bles. IEEE Transactions on Pattern Analysis and MachineIntelligence, 12(10):993–1001, 1990.

[19] T. Hastie, R. Tibshirani, and J. Friedman. The Elements ofStatistical Learning: Data Mining, Inference and Prediction.Springer, 2009.

[20] Marton Havasi, Rodolphe Jenatton, Stanislav Fort,Jeremiah Zhe Liu, Jasper Snoek, Balaji Lakshminarayanan,Andrew M Dai, and Dustin Tran. Training indepen-dent subnetworks for robust prediction. arXiv preprintarXiv:2010.06610, 2020.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition. In Proceedingsof the IEEE conference on computer vision and patternrecognition, pages 770–778, 2016.

[22] Dan Hendrycks and Thomas Dietterich. Benchmarking neu-ral network robustness to common corruptions and perturba-tions. arXiv preprint arXiv:1903.12261, 2019.

[23] Dan Hendrycks, Norman Mu, Ekin D Cubuk, Barret Zoph,Justin Gilmer, and Balaji Lakshminarayanan. Augmix: Asimple data processing method to improve robustness anduncertainty. arXiv preprint arXiv:1912.02781, 2019.

[24] Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John EHopcroft, and Kilian Q Weinberger. Snapshot ensembles:Train 1, get m for free. arXiv preprint arXiv:1704.00109,2017.

[25] Sergey Ivanov, Nikita Durasov, and Evgeny Burnaev. Learn-ing node embeddings for influence set completion. In 2018IEEE International Conference on Data Mining Workshops(ICDMW), pages 1034–1037. IEEE, 2018.

[26] Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, DmitryVetrov, and Andrew Gordon Wilson. Averaging weightsleads to wider optima and better generalization. arXivpreprint arXiv:1803.05407, 2018.

[27] Alex Kendall and Yarin Gal. What uncertainties do we needin bayesian deep learning for computer vision? In Advancesin neural information processing systems, pages 5574–5584,2017.

[28] Diederik P Kingma and Jimmy Ba. Adam: A method forstochastic optimization. arXiv preprint arXiv:1412.6980,2014.

[29] Diederik P Kingma, Tim Salimans, and Max Welling. Vari-ational dropout and the local reparameterization trick. arXivpreprint arXiv:1506.02557, 2015.

[30] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiplelayers of features from tiny images. 2009.

[31] Ivan Kruzhilov, Mikhail Romanov, Dmitriy Babichev, andAnton Konushin. Double refinement network for room lay-out estimation. In Asian Conference on Pattern Recognition,pages 557–568. Springer, 2019.

Page 10: @epfl.ch

[32] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simpleand Scalable Predictive Uncertainty Estimation Using DeepEnsembles. In Advances in Neural Information ProcessingSystems, 2017.

[33] Balaji Lakshminarayanan, Alexander Pritzel, and CharlesBlundell. Simple and scalable predictive uncertainty es-timation using deep ensembles. In Advances in neuralinformation processing systems, pages 6402–6413, 2017.

[34] David JC MacKay. Bayesian methods for adaptive models.PhD thesis, California Institute of Technology, 1992.

[35] David JC MacKay. A practical bayesian framework for back-propagation networks. Neural computation, 4(3):448–472,1992.

[36] Andrey Malinin, Sergey Chervontsev, Ivan Provilkov, andMark Gales. Regression prior networks. arXiv preprintarXiv:2006.11590, 2020.

[37] Andrey Malinin and Mark Gales. Predictive uncertaintyestimation via prior networks. In Advances in NeuralInformation Processing Systems, pages 7047–7058, 2018.

[38] Radford M Neal. Bayesian learning for neural networks, vol-ume 118. Springer Science & Business Media, 2012.

[39] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bis-sacco, Bo Wu, and Andrew Y Ng. Reading digits in naturalimages with unsupervised feature learning. 2011.

[40] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, DavidSculley, Sebastian Nowozin, Joshua Dillon, Balaji Lak-shminarayanan, and Jasper Snoek. Can you trust yourmodel’s uncertainty? evaluating predictive uncertainty underdataset shift. In Advances in Neural Information ProcessingSystems, pages 13991–14002, 2019.

[41] Tim Pearce, Alexandra Brintrup, Mohamed Zaki, and AndyNeely. High-quality prediction intervals for deep learning:A distribution-free, ensembled approach. In InternationalConference on Machine Learning, pages 4075–4084. PMLR,2018.

[42] Michael P Perrone and Leon N Cooper. When networks dis-agree: Ensemble methods for hybrid neural networks. Tech-nical report, BROWN UNIV PROVIDENCE RI INST FORBRAIN AND NEURAL SYSTEMS, 1992.

[43] Joaquin Quionero-Candela, Masashi Sugiyama, AntonSchwaighofer, and Neil D Lawrence. Dataset shift inmachine learning. The MIT Press, 2009.

[44] Joseph Redmon, Santosh Divvala, Ross Girshick, and AliFarhadi. You only look once: Unified, real-time object de-tection. In Proceedings of the IEEE conference on computervision and pattern recognition, pages 779–788, 2016.

[45] Hippolyt Ritter, Aleksandar Botev, and David Barber. Ascalable laplace approximation for neural networks. In6th International Conference on Learning Representations,ICLR 2018-Conference Track Proceedings, volume 6. Inter-national Conference on Representation Learning, 2018.

[46] Sebastian Ruder. An overview of gradient descent optimiza-tion algorithms. arXiv preprint arXiv:1609.04747, 2016.

[47] David Silver, Aja Huang, Chris J Maddison, Arthur Guez,Laurent Sifre, George Van Den Driessche, Julian Schrit-twieser, Ioannis Antonoglou, Veda Panneershelvam, MarcLanctot, et al. Mastering the game of go with deep neuralnetworks and tree search. nature, 529(7587):484–489, 2016.

[48] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R.Salakhutdinov. Dropout: A Simple Way to Prevent NeuralNetworks from Overfitting. Journal of Machine LearningResearch, 15:1929–1958, 2014.

[49] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bre-gler. Efficient Object Localization Using Convolutional Net-works. In Conference on Computer Vision and PatternRecognition, pages 648–656, 2015.

[50] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-reit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Il-lia Polosukhin. Attention is all you need. arXiv preprintarXiv:1706.03762, 2017.

[51] Li Wan, Matthew Zeiler, Sixin Zhang, Yann Le Cun, andRob Fergus. Regularization of neural networks using drop-connect. In International conference on machine learning,pages 1058–1066. PMLR, 2013.

[52] Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble:an alternative approach to efficient ensemble and lifelonglearning. arXiv preprint arXiv:2002.06715, 2020.

[53] Mitchell Wortsman, Vivek Ramanujan, Rosanne Liu,Aniruddha Kembhavi, Mohammad Rastegari, Jason Yosin-ski, and Ali Farhadi. Supermasks in superposition. Advancesin Neural Information Processing Systems, 33, 2020.

[54] Teresa Yeo, Oguzhan Fatih Kar, and Amir Zamir. Ro-bustness via cross-domain ensembles. arXiv preprintarXiv:2103.10919, 2021.

[55] Sergey Zagoruyko and Nikos Komodakis. Wide residual net-works. arXiv preprint arXiv:1605.07146, 2016.

[56] Amir R Zamir, Alexander Sax, Nikhil Cheerla, Rohan Suri,Zhangjie Cao, Jitendra Malik, and Leonidas J Guibas. Ro-bust learning through cross-task consistency. In Proceedingsof the IEEE/CVF Conference on Computer Vision andPattern Recognition, pages 11197–11206, 2020.

[57] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, andDavid Lopez-Paz. mixup: Beyond empirical risk minimiza-tion. arXiv preprint arXiv:1710.09412, 2017.

Page 11: @epfl.ch

A. Supplementary MaterialIn this appendix, we study in more detail the relation-

ship between the choice for N, M, and S parameters of Sec-tion 3.2 that control the mask generation process and theresulting model size, mask correlation, and models’ diver-sity.

A.1. Expected models size

We first provide a formal derivation for the expected sizeof a generated model given the N,M,S parameters. Let usconsider N vectors of size M× S filled with zeros, and thenin each of those vectors, we randomly set M of these ele-ments to be 1. Let ζij be a random variable that representswhether or not j-th position in i-th vector is equal to one.

P(ζij = 1) =M

M× S=

1

S. (1)

This arises from the fact that there are(M×S

M

)ways to chose

M positions from M×S places and only(M×S−1

M−1)

such con-figurations where j-th position is fixed to be one, therefore

P(ζij = 1) =

(M× S− 1

M− 1

)·(

M× SM

)−1=

MM× S

=1

S

(2)

Let εj = [∑N

i=1 ζij > 0] be the random variable repre-senting whether at least one 1 appears in the j-th positionamong all generated vectors. Given that P(ζij = 0) = 1− 1

Sand P(X) = 1−P(X), then probability of εj can be writtenas

P(εj = 1) = 1− (1− 1/S)N . (3)

To compute the expected model size, we compute the ex-pectation of εj sum, since effectively it represents expectednumber of features that one will acquire after a trimmingprocedure:

Size(N,M,S) = EM×S∑i=1

εj =

M×S∑i=1

Eεj

= M× S[1− (1− 1/S)N] (4)

A.2. Average IoU

We now justify our 12S−1 approximation for the average

mask IoU. Let us consider two vectors produced by ourmask generation algorithm. As above, let ζ1j and ζ2j berandom variables that represent the value in the j-th posi-tion in the first and second masks, respectively. Startingfrom the standard definition of the IoU, we estimate the av-erage intersection I of two such masks as the number of

common ones:

EI = EM×S∑j=1

ζ1j · ζ2j =M×S∑j=1

Eζ1j · Eζ2j

= M× S · 1S2 =

MS. (5)

Given two masks with M ones each and intersection I , theirIoU is I

2M−I . Therefore, a simple approximation for ex-pected IoU is

E [IoU] = E[

I

2M− I

]≈ EI

2M− EI

=M/S

2M−M/S=

1

2S− 1

(6)

In Fig. 12, we plot this value as a function of S along withempirical values and the agreement is excellent.

2 4 6 8 10Scale S

0.25

0.50

0.75

1.00Io

UEmpirical IoU1/(2S 1)

Figure 12. Empirical and Analytical IoU. The plot representshow close real IoU of generated masks are and analytical approx-imation for it. In wide range of S values we demonstrate a closematch of considered quantities.

A.3. Diversity analysis

It is known that less correlated ensembles of modelsdeliver better performance, produce more accurate predic-tions [33, 42, 18], and demonstrate lower calibration er-ror [40]. Hence, a strength of Masksembles is that it pro-vides the ability to control how correlated models withinMasksembles are by adjusting the N, M, and S parameters.

We now perform a diversity analysis of Masksemblesmodels using the metric of [10]. It involves comparing twodifferent models trained on the same data in terms of howdifferent their predictions are. Measuring fraction of thetest data points on which predictions of models disagree,the diversity, and normalizing it by the models error rate,we write

Diversity =fraction of disagreed labels

1.0− accuracy(7)

Plotting this diversity against accuracy allows us to lookat our models from a bias-variance trade-off perspective.

Page 12: @epfl.ch

0.875 0.880 0.885 0.890 0.895 0.900 0.905Validation Accuracy

0.0

0.5

1.0

1.5

2.0Pr

edict

ion

dive

rsity Single Model

MasksemblesEnsembles

Figure 13. Diversity vs Accuracy trade-off for Cifar10. Everypoint is associated with a pair of compared models. Larger diver-sity values correspond to less correlated models. Green and Reddashed lines represent the worst and the best theoretically possiblediversity for a fixed accuracy.

Since we want our models to be less correlated—larger di-versity— and at the same time to be accurate, the upper-right corner of Fig. 13 is where the best models should be.

For this experiment we used a single trained model; sev-eral Masksembles models with N = 4, fixed M to have thesame capacity as the single one and varying S in [2, 3, 4, 5];ensembles model with 4 members. Fig. 13 depicts the re-sults on CIFAR10 dataset. Ensembles have the largest di-versity but Masksembles gives us the ability to achieve verysimilar results by controlling its parameters. The singlemodel has 0 diversity by the definition.

A.4. Calibration Plots

Since Expected Calibration Error (ECE) provides onlylimited and aggregated information about model’s calibra-tion therefore in this section we present full calibrationplots for experiments performed on CIFAR and ImageNetdatasets in experiments sections.

0.00 0.25 0.50 0.75 1.00probability

0.00

0.25

0.50

0.75

1.00

Accu

racy

MasksemblesEnsembles

0.00 0.25 0.50 0.75 1.00probability

0.00

0.25

0.50

0.75

1.00

Accu

racy

MasksemblesEnsembles

Figure 14. Calibration plots. Calibration results for CIFAR10(left) and ImageNet (right) test sets. Perfect calibration corre-sponds to y = x curve. Masksembles exhibits a more ensembles-like behavior for lower values of masks overlap.

For both datasets, the calibration plots support our claimthat lower model correlation yields better calibration. Fur-thermore, reducing masks overlap for Masksembles enablesus to match Ensembles behavior very closely.