LID 2020: The Learning from Imperfect Data Challenge Results

Yunchao Wei, Shuai Zheng, Ming-Ming Cheng, Hang Zhao, Liwei Wang, Errui Ding, Yi Yang, Antonio Torralba, Ting Liu, Guolei Sun, Wenguan Wang, Luc Van Gool, Wonho Bae, Junhyug Noh, Jinhwan Seo, Gunhee Kim, Hao Zhao, Ming Lu, Anbang Yao, Yiwen Guo, Yurong Chen, Li Zhang, Chuangchuang Tan, Tao Ruan, Guanghua Gu, Shikui Wei, Yao Zhao, Mariia Dobko, Ostap Viniavskyi, Oles Dobosevych, Zhendong Wang, Zhenyuan Chen, Chen Gong, Huanqing Yan, Jun He

Abstract

Learning from imperfect data has become a pressing issue in many industrial applications, now that the research community has made profound progress in supervised learning from perfectly annotated datasets. The purpose of the Learning from Imperfect Data (LID) workshop is to inspire and facilitate research on novel approaches that harness imperfect data and improve data efficiency during training. A massive amount of user-generated data is nowadays available on multiple internet services, and how to leverage such data to improve machine learning models is a high-impact problem. We organize the challenges in conjunction with the workshop. The goal of these challenges is to find state-of-the-art approaches in the weakly supervised learning setting for object detection, semantic segmentation, and scene parsing. There are three tracks in the challenge, i.e., weakly supervised semantic segmentation (Track 1), weakly supervised scene parsing (Track 2), and weakly supervised object localization (Track 3). In Track 1, based on ILSVRC DET [12], we provide pixel-level annotations of 15K images from 200 categories for evaluation. In Track 2, we provide point-based annotations for the training set of ADE20K [73]. In Track 3, based on ILSVRC CLS-LOC [12], we provide pixel-level annotations of 44,271 images for evaluation [71]. Besides, we further introduce a new evaluation metric proposed by [71], i.e., the IoU curve, to measure the quality of the generated object localization maps. This technical report summarizes the highlights of the challenge. The challenge submission server and the leaderboard will remain open for interested researchers. More details regarding the challenge and the benchmarks are available at https://lidchallenge.github.io.

1. Introduction

Weakly supervised learning refers to studies that attempt to address challenging image recognition tasks by learning from weak or imperfect supervision. Supervised learning methods, including Deep Convolutional Neural Networks (DCNNs), have significantly improved the performance on many problems in the field of computer vision, thanks to the rise of large-scale annotated datasets and advances in computing hardware. However, these supervised learning approaches are notoriously "data-hungry", which makes them impractical in many real-world industrial applications. We often face the problem of being unable to acquire a sufficient amount of perfect annotations (e.g., object bounding boxes and pixel-wise masks) to train models reliably. To address this problem, many efforts have been devoted to so-called weakly supervised learning approaches, which deviate from the traditional path of fully supervised learning and train DCNNs with imperfect data. For instance, various approaches have proposed new loss functions or novel training schemes.
Weakly supervised learning is a popular research direction in the Computer Vision and Machine Learning communities. Many research works have been devoted to related topics, leading to the rapid growth of related publications in top-tier conferences and journals such as CVPR, ICCV, ECCV, NeurIPS, TIP, IJCV, and TPAMI. This year, we provide additional annotations for existing benchmarks to enable weakly-supervised training or evaluation and introduce three challenge tracks to advance the research of weakly-supervised semantic segmentation [1, 2, 6, 7, 13-17, 20, 22-25, 27, 28, 30, 32, 35-38, 40-42, 44-46, 48-57, 59-61, 63, 66, 67, 74], weakly supervised scene parsing using point supervision [39], and weakly supervised object localization [3-5, 9-11, 18, 19, 21, 26, 29, 31, 33, 34, 43, 47, 58, 62, 64, 65, 68-72, 75], respectively. More details are given in the next section.

We organize this workshop to investigate principled ways of building industry-level AI systems that rely on learning from imperfect data. We hope this workshop will attract attention and discussions from both industry and academia and facilitate future research on weakly supervised learning for computer vision.


2. The LID Challenge

2.1. Datasets

ILSVRC-LID-200 is used in Track 1, aiming to perform object semantic segmentation using image-level annotations as supervision. This dataset is built upon the object detection track of the ImageNet Large Scale Visual Recognition Competition (ILSVRC) [12], which includes 456,567 training images from 200 categories in total. By parsing the XML files provided by [12], 349,310 images containing objects from the 200 categories remain. To facilitate pixel-level evaluation, we provide pixel-level annotations for 15K images, including 5,000 for validation and 10,000 for testing. Following previous practices, the mean Intersection-over-Union (IoU) score over the 200 categories is employed as the key evaluation metric.

ADE20K-LID is used in Track 2, aiming to learn to perform scene parsing using point-based annotations as supervision. This dataset is built upon the ADE20K dataset [73]. There are 20,210 images in the training set, 2,000 images in the validation set, and 3,000 images in the testing set. We provide additional point-based annotations on the training set. In particular, this point-based weakly-supervised setting was first introduced by [39]. Following [39], we consider 150 meaningful categories and generate the pixel annotation for each independent instance (or region) in each training image. Performance is evaluated by pixel-wise accuracy and mean IoU, which is consistent with the evaluation metrics of the standard ADE20K [73].
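Both Track 1 and Track 2 report mean IoU and pixel accuracy. For concreteness, a minimal sketch of how these two metrics are commonly computed from a confusion matrix is given below; the official evaluation scripts may differ in details such as ignored labels, and the function and variable names are illustrative.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Mean IoU and pixel accuracy computed from a confusion matrix.

    pred, gt: integer label maps of the same shape.
    """
    valid = (gt >= 0) & (gt < num_classes)                    # drop ignored pixels
    hist = np.bincount(
        num_classes * gt[valid].astype(int) + pred[valid].astype(int),
        minlength=num_classes ** 2).reshape(num_classes, num_classes)

    tp = np.diag(hist)                                        # per-class true positives
    iou = tp / (hist.sum(axis=0) + hist.sum(axis=1) - tp + 1e-10)
    mean_iou = iou.mean()
    pixel_acc = tp.sum() / (hist.sum() + 1e-10)
    return mean_iou, pixel_acc
```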

ILSVRC is used in Track 3, aiming to equip classification networks with the ability of object localization. This dataset is built upon the image classification/localization track of the ImageNet Large Scale Visual Recognition Competition (ILSVRC), which includes 1.2 million training images from 1,000 categories. Different from previous works that evaluate the performance in an indirect way, i.e., via bounding boxes, we annotate pixel-level masks for 44,271 images (validation/testing: 23,151/21,120) so that the evaluation can be performed in a direct way. These annotations are provided by [71], where a new evaluation metric, i.e., the IoU-Threshold curve, is also introduced. Particularly, the IoU-Threshold curve is obtained by calculating IoU scores with the masks binarized at a wide range of thresholds from 0 to 255. The best IoU score, i.e., the Peak-IoU, is used as the key evaluation metric for comparison. Please refer to [71] for more details of the annotation masks and the new evaluation metric.
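The following is a minimal sketch of the IoU-Threshold curve and Peak-IoU computation as described above (binarizing a localization map at every threshold from 0 to 255 and taking the best IoU). The official implementation from [71] may differ, for example in how scores are aggregated over images, and all names here are illustrative.

```python
import numpy as np

def iou_threshold_curve(loc_map, gt_mask):
    """IoU-Threshold curve, Peak-IoU and Peak-Threshold for one image.

    loc_map: (H, W) localization map with values in [0, 255].
    gt_mask: (H, W) binary ground-truth foreground mask.
    """
    gt = gt_mask.astype(bool)
    thresholds = np.arange(256)
    ious = np.empty(len(thresholds))
    for i, t in enumerate(thresholds):
        fg = loc_map > t                                  # binarize at threshold t
        inter = np.logical_and(fg, gt).sum()
        union = np.logical_or(fg, gt).sum()
        ious[i] = inter / (union + 1e-10)
    peak = int(ious.argmax())
    return ious, ious[peak], thresholds[peak]             # curve, Peak-IoU, Peak-Threshold
```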

    2.2. Rules and Descriptions

Rules. This year, we issue two strict rules for all the teams:

• For training, only the images provided in the training set are permitted. Competitors can use the classification models pre-trained on the training set of ILSVRC CLS-LOC to initialize the parameters. However, they CANNOT leverage any datasets with pixel-level annotations. In particular, for Track 1 and Track 3, only the image-level annotations of the training images can be leveraged for supervision, and the bounding-box annotations are NOT permitted.

• We encourage competitors to design elegant and effective models competing for all the tracks rather than ensembling multiple models. Therefore, we restrict the parameter size of the inference model(s) to be LESS than 150M (slightly more than two DeepLabV3+ [8] models using ResNet-101 as the backbone). The competitors ranked in the Top 3 are required to submit their inference code for verification.
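As a quick self-check against the 150M parameter limit, competitors could count inference-time parameters as sketched below (assuming PyTorch-style modules; the actual verification is performed by the organizers on the submitted inference code, and the model names in the example are hypothetical).

```python
def count_parameters_in_millions(models):
    """Total parameter count (in millions) over all inference model(s).

    `models` is any iterable of modules exposing .parameters(), e.g. torch.nn.Module.
    """
    total = sum(p.numel() for m in models for p in m.parameters())
    return total / 1e6

# e.g. assert count_parameters_in_millions([backbone, seg_head]) < 150.0  # hypothetical modules
```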

Timeline. The challenge started on Mar 22, 2020, and ended on June 8, 2020. Each participant was allowed a maximum of 5 submissions for the testing split of each track.

Participating Teams. We received submissions from 15 teams in total. In particular, the teams from the Computer Vision Lab at ETH Zurich, Tsinghua University, and the Vision & Learning Lab at Seoul National University achieved the 1st place in Track 1, Track 2, and Track 3, respectively.

    2.3. Results and Methods

Tables 1, 2, and 3 show the leaderboard results of Track 1, Track 2, and Track 3, respectively. Particularly, the team from ETH Zurich outperforms the others by a large margin in Track 1. In Track 3, the top-3 teams achieve similar Peak-IoU scores, ranging from 61.48 to 63.08. Furthermore, we compare the IoU-Threshold curves of the 5 teams for Track 3 in Figure 1.

Table 1. Results and rankings of methods submitted to Track 1.
Team                             Username               Mean IoU   Mean Accuracy   Pixel Accuracy
ETH Zurich                       cvl                    45.18      59.62           80.46
Seoul National University        VL-task1               37.73      60.15           82.98
Ukrainian Catholic University    UCU & SoftServe        37.34      54.87           83.64
-                                IOnlyHaveSevenDays     36.24      68.27           84.10
-                                play-njupt             31.90      46.07           82.63
-                                xingxiao               29.48      48.66           80.82
-                                hagenbreaker           22.50      39.92           77.38
-                                go-g0                  19.80      38.30           76.21
-                                lasthours-try          12.56      24.65           64.35
-                                WH-ljs                  7.79      16.59           62.52

Table 2. Results and rankings of methods submitted to Track 2.
Team                             Username               Mean IoU   Mean Accuracy   Pixel Accuracy
Tsinghua&Intel                   fromandto              25.33      36.88           64.95

Table 3. Results and rankings of methods submitted to Track 3.
Team                             Username               Peak-IoU   Peak-Threshold
Seoul National University        VL-task3               63.08      24
Mepro-MIC of Beijing Jiaotong U  BJTU-Mepro-MIC         61.98      35
NUST                             LEAP Group of PCA Lab  61.48      7
-                                chohk (wsol-aug)       52.89      36
Beijing Normal University        TEN                    48.17      42

Figure 1. The comparison of IoU-Threshold curves of different teams for Track 3 (VL-task3, BJTU-Mepro-MIC, LEAP Group@PCA Lab, chohk (wsol_aug), and TEN).

Figure 2. An overview of the proposed approach by the team of CVL at ETH Zurich: (a) in addition to mining object semantics from single-image labels, semantic similarities and differences between paired training images are leveraged for supervising object pattern learning; (b) co-attentive and contrastive co-attentive features complementarily capture the shared and unshared objects; (c) the co-attention classifier learns object patterns more comprehensively.

2.3.1 CVL at ETH Zurich Team (Won the 1st place of Track 1)

The proposed approach adopts cross-image semantic relations for comprehensive object pattern mining. Two neural co-attentions are incorporated into the classifier to complementarily capture cross-image semantic similarities and differences. In particular, the classifier is equipped with a differentiable co-attention mechanism that addresses semantic homogeneity and difference understanding across training image pairs. More specifically, two kinds of co-attentions are learned in the classifier. The former aims to capture cross-image common semantics, which enables the classifier to better ground the common semantic labels over the co-attentive regions. The latter, called contrastive co-attention, focuses on the remaining, unshared semantics, which helps the classifier better separate the semantic patterns of different objects. These two co-attentions work in a cooperative and complementary manner, together making the classifier understand object patterns more comprehensively. Another advantage is that the co-attention based classifier learning paradigm brings an efficient data augmentation strategy due to the use of training image pairs. An overview of the proposed approach is shown in Figure 2.
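As a rough illustration of the co-attention operation summarized above (an affinity matrix between the paired feature maps, column-wise softmax normalization, and attention summaries of the common semantics), a minimal PyTorch-style sketch is given below. The contrastive co-attention branch, the classification losses, and the training loop of the team's full method are omitted, and all tensor and function names are illustrative.

```python
import torch

def co_attention(feat_m, feat_n, W_p):
    """Co-attention between the feature maps of an image pair.

    feat_m, feat_n: (C, H, W) feature maps of images I_m and I_n.
    W_p:            (C, C) learnable affinity matrix.
    Returns co-attentive summaries of each feature map, i.e. the features of
    one image attended by every position of the other (their common semantics).
    """
    C, H, W = feat_m.shape
    Fm = feat_m.reshape(C, H * W)                 # C x HW
    Fn = feat_n.reshape(C, H * W)                 # C x HW

    # Affinity between all position pairs: P = Fm^T W_p Fn, shape HW x HW.
    P = Fm.t() @ W_p @ Fn

    # Column-wise softmax of P (resp. P^T) gives attention maps over the
    # positions of Fm (resp. Fn).
    A_m = torch.softmax(P, dim=0)
    A_n = torch.softmax(P.t(), dim=0)

    # Attention summaries: co-attentive features carrying the shared semantics.
    Fm_common = (Fn @ A_n).reshape(C, H, W)       # derived from Fn, aligned to Fm
    Fn_common = (Fm @ A_m).reshape(C, H, W)       # derived from Fm, aligned to Fn
    return Fm_common, Fn_common
```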

Figure 3. The localization maps produced at each consecutive step by the team of the Machine Learning Lab at Ukrainian Catholic University: (a) the input image, (b) the map after CAM extraction, (c) the map improved by IRNet trained on the outcomes of step 1, and (d) the prediction of DeepLabV3+ trained on the step-2 results, compared with (e) the ground-truth mask.

2.3.2 The Machine Learning Lab at Ukrainian Catholic University Team (Won the 3rd place of Track 1)

The approach proposed by this team consists of three consecutive steps. The first two steps extract high-quality pseudo masks from image-level annotated data, which are then used to train a segmentation model in the third step. The presented approach also addresses two problems in the data: class imbalance and missing labels. All three steps together make the proposed approach capable of segmenting various classes and complex objects using only image-level annotations as supervision. The results produced at the different steps are shown in Figure 3.
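As a rough sketch of the kind of pseudo-mask generation that bridges the first and last steps of such a pipeline, the snippet below converts per-class CAMs into a pseudo segmentation mask with a fixed background threshold. This is a generic illustration, not the team's exact procedure (which additionally uses IRNet refinement and handles class imbalance and missing labels), and the threshold value and names are arbitrary.

```python
import numpy as np

def cams_to_pseudo_mask(cams, image_labels, fg_threshold=0.3):
    """Turn per-class CAMs into a pseudo segmentation mask.

    cams:         (K, H, W) class activation maps, one per category.
    image_labels: (K,) binary vector of image-level labels.
    fg_threshold: activations below this value (after per-class max
                  normalization) are treated as background.
    Returns an (H, W) mask with 0 = background and c + 1 = category c.
    """
    K, H, W = cams.shape
    cams = cams * image_labels[:, None, None]                 # keep present classes only
    cams = cams / (cams.max(axis=(1, 2), keepdims=True) + 1e-5)

    # Prepend a constant "background score" and take the per-pixel argmax.
    scores = np.concatenate([np.full((1, H, W), fg_threshold), cams], axis=0)
    return scores.argmax(axis=0)
```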

2.3.3 Tsinghua&Intel Team (Won the 1st place of Track 2)

The team reveals two critical issues in the current state-of-the-art method [39]: 1) it relies upon softmax outputs, i.e., logits, and it is known that logits can be over-confident upon wrong predictions; 2) harvesting pseudo labels using logits introduces thresholds, and it is very time-consuming to tune thresholds for modern deep networks. Some observations are shown in Figure 4.

To tackle these issues, the proposed approach builds upon uncertainty measures instead of logits and is free of threshold tuning, which is motivated by a large-scale analysis of the distribution of uncertainty measures using strong models and challenging databases. This analysis leads to the discovery of a statistical phenomenon called uncertainty mixture. Inspired by this discovery, the team proposes to decompose the distribution of uncertainty measures with a Gamma mixture model, leading to a principled method to harvest reliable pseudo labels. Beyond that, the team assumes that the uncertainty measures for labeled points are always drawn from a particular component, which amounts to a regularized Gamma mixture model. They provide a thorough theoretical analysis of this model, showing that it can be solved with an EM-style algorithm with a convergence guarantee.
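A minimal sketch of the core idea, fitting a two-component Gamma mixture to per-category uncertainty values and keeping only the points assigned to the low-uncertainty component, is shown below. It uses a plain EM loop with moment-matched M-steps for simplicity; the team's regularized mixture, its exact M-step, and the convergence analysis are not reproduced here, and all names are illustrative.

```python
import numpy as np
from scipy.stats import gamma

def fit_two_gamma_mixture(u, n_iter=50):
    """Fit a two-component Gamma mixture to 1-D uncertainty values `u` with a
    plain EM loop; the M-step uses weighted moment matching (shape = m^2 / v,
    scale = v / m), an approximation of the exact Gamma M-step."""
    u = np.asarray(u, dtype=float)
    pi = np.array([0.5, 0.5])
    # Initialize the two components from the lower/upper halves of the data.
    split = np.median(u)
    params = np.empty((2, 2))
    for k, part in enumerate((u[u <= split], u[u > split])):
        m, v = part.mean(), part.var() + 1e-8
        params[k] = [m * m / v, v / m]                       # shape, scale

    for _ in range(n_iter):
        # E-step: responsibilities of each component for every point.
        pdf = np.stack([gamma.pdf(u, a=a, scale=s) for a, s in params])
        resp = pi[:, None] * pdf
        resp /= resp.sum(axis=0, keepdims=True) + 1e-12

        # M-step: mixture weights and moment-matched Gamma parameters.
        for k in range(2):
            w = resp[k]
            m = (w * u).sum() / (w.sum() + 1e-12)
            v = (w * (u - m) ** 2).sum() / (w.sum() + 1e-12) + 1e-8
            params[k] = [m * m / v, v / m]
            pi[k] = w.mean()
    return pi, params, resp

# Usage sketch: fit on the uncertainty values of one category, call the
# component with the smaller mean (shape * scale) the low-uncertainty one,
# and keep as pseudo labels only the points whose responsibility for that
# component exceeds 0.5.
```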


Figure 4. Some observations discovered by the Tsinghua&Intel team. Pa: a wrong pseudo label (blanket) with high uncertainty. Pb: a correct pseudo label (bed) with low uncertainty.


Figure 5. An overview of the proposed approach by the team of the Vision & Learning Lab at Seoul National University, illustrating the CAM pipeline (global average pooling, weighted averaging, and thresholding) and the observed phenomena in the feature maps that limit localization to small discriminative regions.


2.3.4 Vision & Learning Lab at Seoul National University Team (Won the 1st and 2nd places for Track 3 and Track 1)

This team demonstrates that the popular class activation maps [72] suffer from three fundamental issues: (i) the bias of GAP to assign a higher weight to a channel with a small activation area, (ii) negatively weighted activations inside the object regions, and (iii) instability from the use of the maximum value of a class activation map as a thresholding reference. These issues collectively cause the localization prediction to be limited to a small region of an object. The proposed approach incorporates three simple but robust techniques that alleviate these problems: thresholded average pooling, negative weight clamping, and the use of a percentile as the thresholding standard. More details can be found in Figure 5.
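A minimal sketch of the three modifications described above, applied to the feature maps of a single image, is shown below. The threshold and percentile values are placeholders rather than the team's tuned settings, and the exact formulation in their report may differ.

```python
import numpy as np

def tap_cam(features, fc_weights, class_idx, tap_tau=0.1, percentile=90.0):
    """Sketch of thresholded average pooling (TAP), negative weight clamping,
    and percentile-based localization thresholding.

    features:   (C, H, W) feature maps of the last conv layer.
    fc_weights: (C, K) weights of the classification FC layer.
    class_idx:  target class k.
    """
    # (1) TAP: average only the activations above a threshold, instead of
    #     dividing the channel sum by H*W as GAP does.
    above = features > tap_tau
    pooled = (features * above).sum(axis=(1, 2)) / (above.sum(axis=(1, 2)) + 1e-5)

    # (2) Negative weight clamping: ignore negatively weighted channels when
    #     composing the class activation map.
    w = np.clip(fc_weights[:, class_idx], 0.0, None)
    cam = np.tensordot(w, features, axes=([0], [0]))          # (H, W)

    # (3) Percentile thresholding: binarize with a percentile of the map
    #     rather than a fraction of its (unstable) maximum value.
    binary = cam >= np.percentile(cam, percentile)
    return pooled, cam, binary
```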


Figure 6. An overview of the Dual-Gradients Localization (DGL) framework proposed by the team of Mepro at Beijing Jiaotong University.

2.3.5 Mepro at Beijing Jiaotong University Team (Won the 2nd place of Track 3)

The proposed method achieves localization on any convolutional layer of a classification model by exploiting two kinds of gradients, and is called the Dual-Gradients Localization (DGL) framework. The DGL framework is developed based on two branches: 1) Pixel-level Class Selection, which leverages gradients of the target class to identify the correlation ratio of pixels to the target class within any convolutional feature maps, and 2) Class-aware Enhanced Map, which utilizes gradients of the classification loss function to mine entire target object regions without damaging classification performance. The proposed architecture is shown in Figure 6.

To acquire the integral object regions, gradients of the classification loss function are used to enhance the information of the specific class on any convolutional layer. The DGL framework demonstrates that any convolutional layer of the classification model has localization ability, and that some layers perform better than the last convolutional layer in an offline manner.
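To illustrate the general idea of exploiting class gradients at an intermediate convolutional layer for localization, a generic gradient-weighted activation map sketch (in the spirit of Grad-CAM [43]) is given below. This is not the exact DGL formulation, which combines the two dedicated branches described above, and the module names are placeholders.

```python
import torch

def gradient_localization_map(feature_extractor, classifier_head, image, class_idx):
    """Gradient-weighted localization map at an intermediate conv layer
    (a generic Grad-CAM-style sketch, not the exact DGL formulation).

    feature_extractor: module mapping an image batch to (1, C, H, W) feature maps.
    classifier_head:   module mapping those features to (1, K) class scores.
    """
    feats = feature_extractor(image)                         # (1, C, H, W)
    feats.retain_grad()                                      # keep gradients of a non-leaf tensor
    scores = classifier_head(feats)                          # (1, K)
    scores[0, class_idx].backward()

    weights = feats.grad.mean(dim=(2, 3), keepdim=True)      # per-channel importance
    cam = torch.relu((weights * feats).sum(dim=1))           # (1, H, W)
    cam = cam / (cam.max() + 1e-5)
    return cam.detach()
```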

2.3.6 PCA Lab at Nanjing University of Science and Technology Team (Won the 3rd place of Track 3)

The proposed model is composed of two auto-encoders, as shown in Figure 7. In the first auto-encoder, a classifier is trained with global average pooling using image-level annotations as supervision. The learned classifier is then applied to obtain Class Activation Maps (CAMs) according to [72]. The team uses the binary images generated by the CAMs as pseudo pixel-level annotations to conduct binary classification. After the first decoder, a bilinear upsampling operation is further applied to obtain a binary image of the same size as the raw image.

In the second auto-encoder, the team aims to recover the raw image from the binary image in order to obtain a refined binary output image. The output binary image of the first decoder is used as the input of the second encoder.


Figure 7. An overview of the attention-based autoencoder framework proposed by the team of PCA Lab at Nanjing University of Science and Technology.

The binary image is compounded with the raw image and is then divided into foreground and background parts. The foreground and background parts are processed in different channels of a layer independently and are then fused into the final prediction image. To be specific, the team transfers the foreground and background images in all layers into binary images with a threshold, adds the foreground and background binary images together, and then fuses the binary images generated by all layers. As a result, refined binary output images are obtained, which are used as pseudo annotations for the next iteration. The above process iterates until convergence.
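As a loose illustration of the compositing-and-fusion step described above, the sketch below splits an image into foreground and background with the current binary mask, binarizes per-layer response maps, and fuses them into a refined pseudo mask. The fusion rule (a simple majority vote) and all names are placeholders, since the team's exact operations are not fully specified in this summary.

```python
import numpy as np

def refine_pseudo_mask(image, binary_mask, layer_maps, threshold=0.5):
    """Illustrative refinement step (placeholder fusion rule).

    image:       (H, W, 3) input image in [0, 1].
    binary_mask: (H, W) current binary pseudo mask.
    layer_maps:  list of (H, W) per-layer response maps in [0, 1],
                 assumed already resized to the image resolution.
    """
    foreground = image * binary_mask[..., None]               # processed separately
    background = image * (1.0 - binary_mask[..., None])       # from the foreground

    # Binarize every layer's map and fuse them by majority vote.
    binarized = np.stack([m > threshold for m in layer_maps]).astype(float)
    refined = (binarized.mean(axis=0) > 0.5).astype(np.uint8)
    return foreground, background, refined
```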

2.3.7 Beijing Normal University (Won the 5th place of Track 3)

This team simply applies the approach proposed in [10] to compete in Track 3.

Acknowledgments. We thank our sponsor, the PaddlePaddle Team from Baidu (https://github.com/PaddlePaddle/Paddle).

References

[1] Jiwoon Ahn and Suha Kwak. Learning pixel-level semantic affinity with image-level supervision for weakly supervised semantic segmentation. In CVPR, pages 4981-4990, 2018.
[2] Aditya Arun, CV Jawahar, and M Pawan Kumar. Weakly supervised instance segmentation by learning annotation consistent instances. In ECCV, 2020.
[3] Wonho Bae, Junhyug Noh, and Gunhee Kim. Rethinking class activation mapping for weakly supervised object localization. In ECCV, 2020.
[4] Archith John Bency, Heesung Kwon, Hyungtae Lee, S Karthikeyan, and BS Manjunath. Weakly supervised localization using deep feature maps. In ECCV, pages 714-731, 2016.
[5] Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N Balasubramanian. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, pages 839-847, 2018.
[6] Arslan Chaudhry, Puneet K Dokania, and Philip HS Torr. Discovering class-specific pixels for weakly-supervised semantic segmentation. arXiv preprint arXiv:1707.05821, 2017.
[7] Liyi Chen, Weiwei Wu, Chenchen Fu, Xiao Han, and Yuntao Zhang. Weakly supervised semantic segmentation with boundary exploration. In ECCV, 2020.
[8] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801-818, 2018.
[9] Junsuk Choe, Seong Joon Oh, Seungho Lee, Sanghyuk Chun, Zeynep Akata, and Hyunjung Shim. Evaluating weakly supervised object localization methods right. In CVPR, 2020.
[10] Junsuk Choe and Hyunjung Shim. Attention-based dropout layer for weakly supervised object localization. In CVPR, 2019.
[11] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised object localization with multi-fold multiple instance learning. TPAMI, 39(1):189-203, 2016.
[12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248-255, 2009.
[13] Junsong Fan, Zhaoxiang Zhang, Chunfeng Song, and Tieniu Tan. Learning integral objects with intra-class discriminator for weakly-supervised semantic segmentation. In CVPR, 2020.
[14] Junsong Fan, Zhaoxiang Zhang, and Tieniu Tan. Employing multi-estimations for weakly-supervised semantic segmentation. In ECCV, 2020.
[15] Qibin Hou, PengTao Jiang, Yunchao Wei, and Ming-Ming Cheng. Self-erasing network for integral object attention. In NIPS, pages 549-559, 2018.
[16] Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu, and Jingdong Wang. Weakly-supervised semantic segmentation network with deep seeded region growing. In CVPR, pages 7014-7023, 2018.
[17] Peng-Tao Jiang, Qibin Hou, Yang Cao, Ming-Ming Cheng, Yunchao Wei, and Hong-Kai Xiong. Integral object mining via online attention accumulation. In ICCV, pages 2070-2079, 2019.
[18] Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, and Wei Liu. Deep self-taught learning for weakly supervised object localization. In CVPR, 2017.
[19] Vadim Kantorov, Maxime Oquab, Minsu Cho, and Ivan Laptev. ContextLocNet: Context-aware deep network models for weakly supervised localization. In ECCV, pages 350-365, 2016.
[20] Anna Khoreva, Rodrigo Benenson, Jan Hendrik Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In CVPR, volume 1, page 3, 2017.
[21] Dahun Kim, Donghyeon Cho, Donggeun Yoo, and In So Kweon. Two-phase learning for weakly supervised object localization. In ICCV, pages 3534-3543, 2017.
[22] Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, pages 695-711, 2016.
[23] Viveka Kulharia, Siddhartha Chandra, Amit Agrawal, Philip Torr, and Ambrish Tyagi. Box2Seg: Attention weighted loss and discriminative feature learning for weakly supervised segmentation. In ECCV, 2020.
[24] Suha Kwak, Seunghoon Hong, Bohyung Han, et al. Weakly supervised semantic segmentation using superpixel pooling network. In AAAI, pages 4111-4117, 2017.
[25] Jungbeom Lee, Eunji Kim, Sungmin Lee, Jangho Lee, and Sungroh Yoon. FickleNet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In CVPR, pages 5267-5276, 2019.
[26] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. Weakly supervised object localization with progressive domain adaptation. In CVPR, pages 3512-3520, 2016.
[27] Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and Yun Fu. Tell me where to look: Guided attention inference network. In CVPR, pages 9215-9223, 2018.
[28] Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, and Yun Fu. Attention bridging network for knowledge transfer. In CVPR, pages 5198-5207, 2019.
[29] Xiaodan Liang, Si Liu, Yunchao Wei, Luoqi Liu, Liang Lin, and Shuicheng Yan. Towards computational baby learning: A weakly-supervised approach for object detection. In ICCV, pages 999-1007, 2015.
[30] Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, and Jian Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.
[31] Weizeng Lu, Xi Jia, Weicheng Xie, Linlin Shen, Yicong Zhou, and Jinming Duan. Geometry constrained weakly supervised object localization. In ECCV, 2020.
[32] Wenfeng Luo and Meng Yang. Semi-supervised semantic segmentation via strong-weak dual-branch network. In ECCV, 2020.
[33] Jinjie Mai, Meng Yang, and Wenfeng Luo. Erasing integrated learning: A simple yet effective approach for weakly supervised object localization. In CVPR, 2020.
[34] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In CVPR, pages 685-694, 2015.
[35] George Papandreou, Liang-Chieh Chen, Kevin Murphy, and Alan L Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
[36] Deepak Pathak, Philipp Krahenbuhl, and Trevor Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, pages 1796-1804, 2015.
[37] Pedro O Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
[38] Xiaojuan Qi, Zhengzhe Liu, Jianping Shi, Hengshuang Zhao, and Jiaya Jia. Augmented feedback in semantic segmentation under image level supervision. In ECCV, pages 90-105, 2016.
[39] Rui Qian, Yunchao Wei, Honghui Shi, Jiachen Li, Jiaying Liu, and Thomas Huang. Weakly supervised scene parsing with point-based distance metric learning. In AAAI, volume 33, pages 8843-8850, 2019.
[40] Amir Rahimi, Amirreza Shaban, Thalaiyasingam Ajanthan, Richard Hartley, and Byron Boots. Pairwise similarity knowledge transfer for weakly supervised object localization. In ECCV, 2020.
[41] O Russakovsky, AL Bearman, V Ferrari, and L Fei-Fei. What's the point: Semantic segmentation with point supervision. In ECCV, pages 549-565, 2016.
[42] Fatemehsadat Saleh, Mohammad Sadegh Ali Akbarian, Mathieu Salzmann, Lars Petersson, Stephen Gould, and Jose M Alvarez. Built-in foreground/background prior for weakly-supervised semantic segmentation. In ECCV, pages 413-432, 2016.
[43] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618-626, 2017.
[44] Yunhang Shen, Rongrong Ji, Yan Wang, Yongjian Wu, and Liujuan Cao. Cyclic guidance for weakly supervised joint detection and segmentation. In CVPR, 2019.
[45] Wataru Shimoda and Keiji Yanai. Distinct class-specific saliency maps for weakly supervised semantic segmentation. In ECCV, pages 218-234, 2016.
[46] Wataru Shimoda and Keiji Yanai. Self-supervised difference detection for weakly-supervised semantic segmentation. In ICCV, 2019.
[47] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In ICCV, pages 3544-3553, 2017.
[48] Chunfeng Song, Yan Huang, Wanli Ouyang, and Liang Wang. Box-driven class-wise region masking and filling rate guided loss for weakly supervised semantic segmentation. In CVPR, 2019.
[49] Guolei Sun, Wenguan Wang, Jifeng Dai, and Luc Van Gool. Mining cross-image semantics for weakly supervised semantic segmentation. In ECCV, 2020.
[50] Meng Tang, Abdelaziz Djelouah, Federico Perazzi, Yuri Boykov, and Christopher Schroers. Normalized cut loss for weakly-supervised CNN segmentation. In CVPR, 2018.
[51] Meng Tang, Federico Perazzi, Abdelaziz Djelouah, Ismail Ben Ayed, Christopher Schroers, and Yuri Boykov. On regularized losses for weakly-supervised CNN segmentation. In ECCV, pages 507-522, 2018.
[52] Olga Veksler. Regularized loss for weakly supervised single class semantic segmentation. In ECCV, 2020.
[53] Bin Wang, Guojun Qi, Sheng Tang, Tianzhu Zhang, Yunchao Wei, Linghui Li, and Yongdong Zhang. Boundary perception guidance: A scribble-supervised semantic segmentation approach. In IJCAI, pages 3663-3669, 2019.
[54] Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In CVPR, 2020.
[55] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In CVPR, 2017.
[56] Yunchao Wei, Xiaodan Liang, Yunpeng Chen, Zequn Jie, Yanhui Xiao, Yao Zhao, and Shuicheng Yan. Learning to segment with image-level annotations. Pattern Recognition, 2016.
[57] Yunchao Wei, Xiaodan Liang, Yunpeng Chen, Xiaohui Shen, Ming-Ming Cheng, Jiashi Feng, Yao Zhao, and Shuicheng Yan. STC: A simple to complex framework for weakly-supervised semantic segmentation. TPAMI, 2016.
[58] Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. HCP: A flexible CNN framework for multi-label image classification. TPAMI, (9):1901-1907, 2015.
[59] Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, and Thomas S Huang. Revisiting dilated convolution: A simple approach for weakly- and semi-supervised semantic segmentation. In CVPR, pages 7268-7277, 2018.
[60] Huaxin Xiao, Yunchao Wei, Yu Liu, Maojun Zhang, and Jiashi Feng. Transferable semi-supervised semantic segmentation. In AAAI, 2018.
[61] Jia Xu, Alexander G Schwing, and Raquel Urtasun. Learning to segment under various forms of weak supervision. In CVPR, 2015.
[62] Haolan Xue, Chang Liu, Fang Wan, Jianbin Jiao, Xiangyang Ji, and Qixiang Ye. DANet: Divergent activation for weakly supervised object localization. In ICCV, 2019.
[63] Yu Zeng, Yunzhi Zhuge, Huchuan Lu, and Lihe Zhang. Joint learning of saliency detection and weakly supervised semantic segmentation. In ICCV, 2019.
[64] Chen-Lin Zhang, Yun-Hao Cao, and Jianxin Wu. Rethinking the route towards weakly supervised object localization. In CVPR, 2020.
[65] Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. In ECCV, pages 543-559, 2016.
[66] Tianyi Zhang, Guosheng Lin, Jianfei Cai, Tong Shen, Chunhua Shen, and Alex C Kot. Decoupled spatial neural attention for weakly supervised semantic segmentation. 21(11):2930-2941, 2019.
[67] Tianyi Zhang, Guosheng Lin, Weide Liu, Jianfei Cai, and Alex Kot. Splitting vs. merging: Mining object regions with discrepancy and intersection loss for weakly supervised semantic segmentation. In ECCV, 2020.
[68] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas Huang. Adversarial complementary learning for weakly supervised object localization. In CVPR, 2018.
[69] Xiaolin Zhang, Yunchao Wei, Guoliang Kang, Yi Yang, and Thomas Huang. Self-produced guidance for weakly-supervised object localization. In ECCV, 2018.
[70] Xiaolin Zhang, Yunchao Wei, and Yi Yang. Inter-image communication for weakly supervised localization. In ECCV, 2020.
[71] Xiaolin Zhang, Yunchao Wei, Yi Yang, and Fei Wu. Rethinking localization map: Towards accurate object perception with self-enhancement maps. arXiv preprint arXiv:2006.05220, 2020.
[72] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, pages 2921-2929, 2016.
[73] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In CVPR, pages 633-641, 2017.
[74] Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Weakly supervised instance segmentation using class peak response. In CVPR, pages 3791-3800, 2018.
[75] Yi Zhu, Yanzhao Zhou, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Soft proposal networks for weakly supervised object localization. arXiv preprint arXiv:1709.01829, 2017.