
Weakly Supervised Few-shot Object Segmentation using Co-Attention with Visual and Semantic Inputs

Mennatullah Siam 1,4∗, Naren Doraiswamy 2∗, Boris N. Oreshkin 3∗, Hengshuai Yao 4 and Martin Jagersand 1

1 University of Alberta, 2 Indian Institute of Science, 3 Element AI, 4 HiSilicon, Huawei Research

Abstract

Significant progress has been made recently in developing few-shot object segmentation methods. Learning has been shown to be successful in few-shot segmentation settings that use pixel-level, scribble and bounding-box supervision. This paper takes another approach, i.e., only requiring image-level classification data for few-shot object segmentation. We propose a novel multi-modal interaction module for few-shot object segmentation that utilizes a co-attention mechanism using both visual and word embeddings. Our model using image-level labels achieves a 4.8% improvement over the previously proposed image-level few-shot object segmentation method. It also outperforms state-of-the-art methods that use weak bounding box supervision on PASCAL-5i. Our results show that few-shot segmentation benefits from utilizing word embeddings, and that we are able to perform few-shot segmentation using stacked joint visual semantic processing with weak image-level labels. We further propose a novel setup, Temporal Object Segmentation for Few-shot Learning (TOSFL), for videos. TOSFL requires only image-level labels for the first frame in order to segment objects in the following frames. TOSFL provides a novel benchmark for video segmentation, which can be used on a variety of public video data such as Youtube-VOS, as demonstrated in our experiments.

1 Introduction

Existing literature in few-shot object segmentation has mainly relied on manually labelled segmentation masks. A few recent works [Rakelly et al., 2018; Zhang et al., 2019b; Wang et al., 2019] started to conduct experiments using weak annotations such as scribbles or bounding boxes. However, these weak forms of supervision involve more manual work than image-level labels, which can be collected from text and images publicly available on the web. Limited research has been conducted on using image-level supervision for few-shot segmentation [Raza et al., 2019].

∗ Equal contribution

Figure 1: Overview of stacked co-attention to relate the support set and query image using image-level labels. Nx: co-attention stacked N times. “K-shot” refers to using K support images.

Most current weakly supervised few-shot segmentation methods lag significantly behind their strongly supervised counterparts.

On the other hand, deep semantic segmentation networks are very successful when trained and tested on relatively large-scale manually labelled datasets such as PASCAL-VOC [Everingham et al., 2015] and MS-COCO [Lin et al., 2014]. However, the number of object categories they cover is still limited despite the significant sizes of the data used. The limited number of objects annotated with pixel-wise labels in existing datasets restricts the applicability of deep learning in inherently open-set domains such as robotics [Dehghan et al., 2019; Pirk et al., 2019]. The human visual system has the ability to generalize to new categories from a few labelled samples. It has been shown that adults and even children demonstrate a phenomenon known as “stimulus equivalence” when novel concepts are taught through a combination of visual, textual and verbal stimuli [Sidman, 2009]. The relations learned from one modality to another (e.g. written words/pictures related to spoken words) can be transferred to accelerate the learning of new concepts and new relations (e.g. pictures of objects in relation to written words) via the stimulus equivalence principle.


Inspired by this, we propose a multi-modal interaction module to bootstrap the efficiency of weakly supervised few-shot object segmentation by combining the visual input with neural word embeddings. Our method iteratively guides a bi-directional co-attention between the support and the query sets using both visual and neural word embedding inputs, using only image-level supervision, as shown in Fig. 1. It outperforms [Raza et al., 2019] by 4.8% and improves over methods that use bounding box supervision [Zhang et al., 2019b; Wang et al., 2019].

Most work in few-shot segmentation considers the static setting, where query and support images do not have temporal relations. However, in real-world applications such as robotics, segmentation methods can benefit from temporal continuity and multiple viewpoints. For real-time segmentation, it may be of tremendous benefit to utilize the temporal knowledge present in video sequences. The observation that pixels moving together mostly belong to the same object holds very commonly in videos, and it can be exploited to improve segmentation accuracy. We propose a novel setup, temporal object segmentation with few-shot learning (TOSFL), where support and query images are temporally related. The TOSFL setup for video object segmentation generalizes to novel object classes, as can be seen in our experiments on the Youtube-VOS dataset [Xu et al., 2018]. TOSFL only requires image-level labels for the first frames (support images) to segment the objects that appear in the frames that follow. The TOSFL setup is interesting because it is closer to how humans learn about objects than the strongly supervised static segmentation setup.

Youtube-VOS [Xu et al., 2018] provides a way to evaluate on unseen categories. However, it does not utilize the category labels in the segmentation model. Our setup relies on the image-level label of the support image to segment different parts of the query image, conditioned on the word embedding of this image-level label. In order to ensure that the evaluation of the few-shot method is not biased towards a certain category, it is best to split the classes into multiple folds and evaluate on each in turn, similar to [Shaban et al., 2017].

1.1 Contributions

• We propose a novel few-shot object segmentation algorithm based on a multi-modal interaction module trained using image-level supervision. It relies on a multi-stage attention mechanism and uses both visual and semantic representations to relate relevant spatial locations in the support and query images.

• We propose a novel weakly supervised few-shot video object segmentation setup. It complements the existing few-shot object segmentation benchmarks by considering a practically important use case not covered by previous datasets. Video sequences are provided instead of static images, which can simplify the few-shot learning problem.

• We conduct a comparative study of the different architectures proposed in this paper to solve few-shot object segmentation with image-level supervision. Our method compares favourably against state-of-the-art methods relying on pixel-level supervision and outperforms the most recent methods using weak annotations [Raza et al., 2019; Wang et al., 2019; Zhang et al., 2019b].

2 Related Work

2.1 Few-shot Object Segmentation
Shaban et al. [2017] proposed the first few-shot segmentation method, using a second branch to predict the parameters of the final segmentation layer. Rakelly et al. [2018] proposed a guidance network for few-shot segmentation where the guidance branch receives the support set image-label pairs. Dong and Xing [2018] utilized the second branch to learn prototypes. Zhang et al. [2019b] proposed a few-shot segmentation method based on a dense comparison module with a siamese-like architecture that uses masked average pooling to extract features from the support set, and an iterative optimization module to refine the predictions. Siam et al. [2019] proposed a method that performs few-shot segmentation using adaptive masked proxies to directly predict the parameters for the novel classes. In more recent work, Zhang et al. [2019a] proposed a pyramid graph network which learns attention weights between the support and query sets for further label propagation. Wang et al. [2019] proposed prototype alignment, performing both support-to-query and query-to-support few-shot segmentation using prototypes.

The previous literature focused mainly on using strongly labelled pixel-level segmentation masks for the few examples in the support set. It is labour intensive and impractical to provide such annotations for every single novel class, especially in certain robotics applications that require online learning. A few recent works experimented with weaker annotations based on scribbles and/or bounding boxes [Rakelly et al., 2018; Zhang et al., 2019b; Wang et al., 2019]. In our opinion, the most promising approach to reducing the intense supervision requirements of the few-shot segmentation task is to use publicly available web data with image-level labels. Raza et al. [2019] made a first step in this direction by proposing a weakly supervised method that uses image-level labels. However, the method lags significantly behind other approaches that use strongly labelled data.

2.2 Attention Mechanisms
Attention was initially proposed for neural machine translation models [Bahdanau et al., 2014]. Several approaches were proposed for utilizing attention. Yang et al. [2016] proposed a stacked attention network which learns attention maps sequentially on different levels. Lu et al. [2016] proposed co-attention to solve a visual question answering task by alternately shifting attention between visual and question representations. Lu et al. [2019] used co-attention in video object segmentation between frames sampled from a video sequence. Hsieh et al. [2019] rely on an attention mechanism to perform one-shot object detection. However, they mainly use it to attend to the query image, since the given bounding box provides them with the region of interest in the support set image.


Figure 2: Architecture of the few-shot object segmentation model with co-attention. The ⊕ operator denotes concatenation, ◦ denotes element-wise multiplication. Only the decoder and multi-modal interaction module parameters are learned, while the encoder is pretrained on ImageNet.

To the best of our knowledge, this work is the first to explore bidirectional attention between the support and query sets as a mechanism for solving the few-shot image segmentation task with image-level supervision.

3 Proposed Method

The human perception system is inherently multi-modal. Inspired by this, and to leverage the learning of new concepts, we propose a multi-modal interaction module that embeds semantic conditioning in the visual processing scheme, as shown in Fig. 2. The overall model consists of: (1) an encoder; (2) a multi-modal interaction module; (3) a segmentation decoder. The multi-modal interaction module is described in detail in this section, while the encoder and decoder modules are explained in Section 5.1. We follow a 1-way k-shot setting similar to [Shaban et al., 2017].

3.1 Multi-Modal Interaction Module
One of the main challenges in dealing with image-level annotation in few-shot segmentation is that quite often both support and query images may contain a few salient common objects from different classes. Inferring a good prototype for the object of interest from multi-object support images without relying on pixel-level cues or even bounding boxes becomes particularly challenging. Yet it is exactly in this situation that we can expect semantic word embeddings to help disambiguate the object relationships across support and query images. Below we discuss the technical details behind the implementation of this idea, depicted in Fig. 2.

Initially, in a k-shot setting, a base network is used to extract features from the $i$-th support set image $I_s^i$ and from the query image $I_q$, which we denote as $V_s \in \mathbb{R}^{W \times H \times C}$ and $V_q \in \mathbb{R}^{W \times H \times C}$. Here $H$ and $W$ denote the height and width of the feature maps, respectively, while $C$ denotes the number of feature channels. Furthermore, a projection layer is applied to the semantic word embedding to construct $z \in \mathbb{R}^d$ ($d = 256$), which is then spatially tiled and concatenated with the visual features, resulting in flattened matrix representations $V_q \in \mathbb{R}^{C \times WH}$ and $V_s \in \mathbb{R}^{C \times WH}$. An affinity matrix $S$ is computed to capture the similarity between them via a fully connected layer $W_{co} \in \mathbb{R}^{C \times C}$ learning the correlation between feature channels:

$$S = V_s^\top W_{co} V_q.$$

The affinity matrix $S \in \mathbb{R}^{WH \times WH}$ relates each pixel in $V_q$ to each pixel in $V_s$. A softmax operation is performed on $S$ row-wise or column-wise, depending on the desired direction of the relation:

$$S^c = \mathrm{softmax}(S), \qquad S^r = \mathrm{softmax}(S^\top).$$

For example, column $S^c_{*,j}$ contains the relevance of the $j$-th spatial location in $V_q$ with respect to all spatial locations of $V_s$, where $j = 1, \ldots, WH$. The normalized affinity matrices are used to compute attention summaries $U_q$ and $U_s$:

$$U_q = V_s S^c, \qquad U_s = V_q S^r.$$

The attention summaries are further reshaped such that $U_q, U_s \in \mathbb{R}^{W \times H \times C}$ and gated using a gating function $f_g$ with learnable weights $W_g$ and bias $b_g$:

$$f_g(U_q) = \sigma(W_g * U_q + b_g),$$
$$U_q = f_g(U_q) \circ U_q.$$

Here the $\circ$ operator denotes element-wise multiplication. The gating function restrains the output to the interval $[0, 1]$ using a sigmoid activation function $\sigma$ in order to mask the attention summaries. The gated attention summaries $U_q$ are concatenated with the original visual features $V_q$ to construct the final output from the attention module to the decoder.
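To make the tensor shapes and operations above concrete, the following is a minimal PyTorch sketch of the multi-modal interaction module. It is an illustration rather than the authors' implementation: the word-embedding dimension (300), the encoder channel count (2048), the use of a single shared 1×1 convolutional gate for both streams, and the exact placement of the concatenations are assumptions made for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalInteraction(nn.Module):
    """Co-attention between query and support features, conditioned on a word embedding (sketch)."""

    def __init__(self, visual_channels=2048, word_dim=300, d=256):
        super().__init__()
        self.project = nn.Linear(word_dim, d)        # projection layer: word embedding -> z in R^d
        c = visual_channels + d                      # channels after concatenating the tiled z
        self.W_co = nn.Linear(c, c, bias=False)      # affinity weights W_co
        self.gate = nn.Conv2d(c, c, kernel_size=1)   # gating weights W_g and bias b_g (assumed 1x1 conv)

    def forward(self, V_q, V_s, word_embedding):
        # V_q, V_s: (B, C, H, W) query / support features; word_embedding: (B, word_dim).
        B, _, H, W = V_q.shape
        z = self.project(word_embedding)[:, :, None, None].expand(-1, -1, H, W)  # spatial tiling
        V_q = torch.cat([V_q, z], dim=1)
        V_s = torch.cat([V_s, z], dim=1)

        vq, vs = V_q.flatten(2), V_s.flatten(2)                 # (B, C, WH)
        S = torch.bmm(self.W_co(vs.transpose(1, 2)), vq)        # S = V_s^T W_co V_q, shape (B, WH, WH)

        S_c = F.softmax(S, dim=1)                               # column j: relevance of query location j
        S_r = F.softmax(S.transpose(1, 2), dim=1)
        U_q = torch.bmm(vs, S_c).view(B, -1, H, W)              # U_q = V_s S^c
        U_s = torch.bmm(vq, S_r).view(B, -1, H, W)              # U_s = V_q S^r

        U_q = torch.sigmoid(self.gate(U_q)) * U_q               # gated attention summaries
        U_s = torch.sigmoid(self.gate(U_s)) * U_s
        return torch.cat([V_q, U_q], dim=1), torch.cat([V_s, U_s], dim=1)
```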

3.2 Stacked Gated Co-Attention
We propose to stack the multi-modal interaction module described in Section 3.1 to learn an improved representation.


Figure 3: Different variants of image-level labelled few-shot object segmentation. V+S: stacked co-attention with visual and semantic representations. V: co-attention with visual features only. S: conditioning on the semantic representation from word embeddings only.

Stacking allows for multiple iterations between the support and the query images. The co-attention module has two streams, $f_q$ and $f_s$, that are responsible for processing the query image and the support set images, respectively. The inputs to the co-attention module, $V_q^i$ and $V_s^i$, represent the visual features at iteration $i$ for the query image and the support image, respectively. In the first iteration, $V_q^0$ and $V_s^0$ are the output visual features from the encoder. Each multi-modal interaction then follows the recursion, $\forall i = 0, \ldots, N-1$:

$$V_q^{i+1} = \phi\left(V_q^i + f_q(V_q^i, V_s^i, z)\right).$$

The nonlinear projection $\phi$, composed of a 1×1 convolutional layer followed by a ReLU activation function, is applied to the output of each iteration. We use residual connections in order to improve the gradient flow and prevent vanishing gradients. The support set features $V_s^i$, $\forall i = 0, \ldots, N-1$, are computed similarly.
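The following is a hedged sketch of the stacking recursion, reusing the `MultiModalInteraction` sketch above as $f_q$/$f_s$. The 1×1 projection that brings the concatenated module output back to the input channel count, and the sharing of $\phi$ across the query and support streams and across iterations, are assumptions introduced only to make the residual addition well defined.

```python
import torch.nn as nn

class StackedCoAttention(nn.Module):
    """Applies the multi-modal interaction module N times with residual updates (sketch)."""

    def __init__(self, interaction, in_channels, out_channels, num_stacks=2):
        super().__init__()
        self.interaction = interaction                         # f_q / f_s (shared two-stream module)
        self.reduce = nn.Conv2d(out_channels, in_channels, 1)  # assumed: match channels for the residual sum
        self.phi = nn.Sequential(nn.Conv2d(in_channels, in_channels, 1),
                                 nn.ReLU(inplace=True))        # phi: 1x1 conv followed by ReLU
        self.num_stacks = num_stacks                           # N in the "Nx" of Fig. 1

    def forward(self, V_q, V_s, z):
        for _ in range(self.num_stacks):
            out_q, out_s = self.interaction(V_q, V_s, z)       # bidirectional co-attention
            V_q = self.phi(V_q + self.reduce(out_q))           # V_q^{i+1} = phi(V_q^i + f_q(V_q^i, V_s^i, z))
            V_s = self.phi(V_s + self.reduce(out_s))           # support stream computed similarly
        return V_q, V_s
```

With the previous sketch, `out_channels` would be `2 * (in_channels + d)`, since that module returns the concatenation of the augmented features and the gated summaries.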

4 Temporal Object Segmentation with Few-shot Learning Setup

We propose a novel few-shot video object segmentation (VOS) task. In this task, the image-level label of the first frame is provided in order to learn object segmentation in frames sampled from the ensuing sequence. This is a more challenging task than the one relying on pixel-level supervision in semi-supervised VOS. The task is designed as a binary segmentation problem and the categories are split into multiple folds, consistent with the existing few-shot segmentation tasks defined on PASCAL-5i and MS-COCO. This design ensures that the proposed task assesses the ability of few-shot video segmentation algorithms to generalize to unseen classes. We utilize the Youtube-VOS training data, which has 65 classes, and split them into 5 folds. Each fold has 13 classes that are used as novel classes, while the rest are used in the meta-training phase. A randomly sampled class $Y_s$ and sequence $V = \{I_1, I_2, \ldots, I_N\}$ are used to construct the support set $S_p = \{(I_1, Y_s)\}$ and the query images $I_i$. For each query image, a binary segmentation mask $M_{Y_s}$ is constructed by labelling all the instances belonging to $Y_s$ as foreground. Accordingly, the same image can have multiple binary segmentation masks depending on the sampled $Y_s$.
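The episode construction described above can be summarized in a few lines. The sketch below is purely illustrative: the loader structure (`sequences`, `frames`, `instance_masks`) is hypothetical, and only the episode logic follows the text (first frame plus image-level label as support, subsequent frames as queries, binary masks of all instances of the sampled class as targets).

```python
import random

def sample_tosfl_episode(sequences, novel_classes):
    """sequences: list of dicts with 'frames' (list of images) and 'instance_masks'
    (per-frame dict mapping class label -> binary mask of all its instances)."""
    y_s = random.choice(sorted(novel_classes))                   # sampled class Y_s
    seq = random.choice([s for s in sequences
                         if y_s in s['instance_masks'][0]])      # a sequence whose first frame shows Y_s
    support = (seq['frames'][0], y_s)                            # S_p = {(I_1, Y_s)}: image-level label only
    queries, targets = [], []
    for frame, masks in zip(seq['frames'][1:], seq['instance_masks'][1:]):
        queries.append(frame)
        targets.append(masks.get(y_s))                           # binary mask M_{Y_s} (None if Y_s left the view)
    return support, queries, targets
```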

5 Experiments

In this section we demonstrate the results of experiments conducted on the PASCAL-5i dataset [Shaban et al., 2017], compared against state-of-the-art methods, in Section 5.2. Not only do we set strong baselines for image-level labelled few-shot segmentation and outperform previously proposed work [Raza et al., 2019], but we also perform close to the state-of-the-art conventional few-shot segmentation methods that use detailed pixel-wise segmentation masks. We then demonstrate the results for the different variants of our approach depicted in Fig. 3 and experiment with the proposed TOSFL setup in Section 5.3.

5.1 Experimental Setup

Network Details: We utilize a ResNet-50 [He et al., 2016] encoder pre-trained on ImageNet [Deng et al., 2009] to extract visual features. The segmentation decoder is comprised of an iterative optimization module (IOM) [Zhang et al., 2019b] and an atrous spatial pyramid pooling (ASPP) module [Chen et al., 2017a,b]. The IOM takes the output feature maps from the multi-modal interaction module and the previously predicted probability map in a residual form.

Meta-Learning Setup: We sample 12,000 tasks during the meta-training stage. In order to evaluate test performance, we average accuracy over 5,000 tasks with support and query sets sampled from the meta-test dataset $D_{test}$ belonging to classes $L_{test}$. We perform 5 training runs with different random generator seeds and report the average over the 5 runs along with the 95% confidence interval.
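For concreteness, the reporting step could look like the sketch below; the exact confidence-interval formula is not specified in the paper, so a normal approximation (1.96 times the standard error over the 5 runs) is assumed here purely for illustration.

```python
import statistics

def mean_with_ci95(run_scores):
    mean = statistics.mean(run_scores)
    sem = statistics.stdev(run_scores) / len(run_scores) ** 0.5  # standard error over the runs
    return mean, 1.96 * sem                                      # assumed normal-approximation 95% CI

# e.g. hypothetical mIoU values from 5 seeds:
print(mean_with_ci95([50.1, 50.9, 49.8, 51.2, 50.5]))
```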

Evaluation Protocol: PASCAL-5i splits the 20 PASCAL-VOC classes into 4 folds, each having 5 classes. The mean IoU (mIoU) and binary IoU (bIoU) are the two metrics used for evaluation. The mIoU computes the intersection over union for all 5 classes within the fold and averages them, neglecting the background, whereas the bIoU metric proposed by Rakelly et al. [2018] computes the mean of the foreground and background IoU in a class-agnostic manner. We have noticed some deviation in the validation schemes used in previous works. Zhang et al. [2019b] follow a procedure where validation is performed on the test classes to save the best model, whereas Wang et al. [2019] do not perform validation and rather train for a fixed number of iterations. We choose the more challenging approach of [Wang et al., 2019].
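The two metrics can be contrasted with a small sketch operating on boolean masks; this illustrates the definitions only and is not the evaluation code used for the reported numbers.

```python
import numpy as np

def iou(pred, gt):
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                                   # convention for empty masks
    return np.logical_and(pred, gt).sum() / union

def mean_iou(per_class_preds, per_class_gts):
    # mIoU: foreground IoU computed per class of the fold, then averaged
    # (the background is not counted as a class).
    return float(np.mean([iou(p, g) for p, g in zip(per_class_preds, per_class_gts)]))

def binary_iou(pred, gt):
    # bIoU [Rakelly et al., 2018]: class-agnostic mean of foreground and background IoU.
    return 0.5 * (iou(pred, gt) + iou(~pred, ~gt))
```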


Table 1: Quantitative results for 1-way, 1-shot segmentation on the PASCAL-5i dataset showing mean-IoU and binary-IoU. P: stands for using pixel-wise segmentation masks for supervision. IL: stands for using weak supervision from image-level labels. BB: stands for using bounding boxes for weak supervision.

                                      1-shot                                5-shot
Method                        Type   1     2     3     4     mIoU  bIoU    1     2     3     4     mIoU
FG-BG                         P      -     -     -     -     -     55.1    -     -     -     -     -
OSLSM [Shaban et al., 2017]   P      33.6  55.3  40.9  33.5  40.8  -       35.9  58.1  42.7  39.1  43.9
CoFCN [Rakelly et al., 2018]  P      36.7  50.6  44.9  32.4  41.1  60.1    37.5  50.0  44.1  33.9  41.4
PLSeg [Dong and Xing, 2018]   P      -     -     -     -     -     61.2    -     -     -     -     -
AMP [Siam et al., 2019]       P      41.9  50.2  46.7  34.7  43.4  62.2    41.8  55.5  50.3  39.9  46.9
PANet [Wang et al., 2019]     P      42.3  58.0  51.1  41.2  48.1  66.5    51.8  64.6  59.8  46.5  55.7
CANet [Zhang et al., 2019b]   P      52.5  65.9  51.3  51.9  55.4  66.2    55.5  67.8  51.9  53.2  57.1
PGNet [Zhang et al., 2019a]   P      56.0  66.9  50.6  50.4  56.0  69.9    57.7  68.7  52.9  54.6  58.5
CANet [Zhang et al., 2019b]   BB     -     -     -     -     52.0  -       -     -     -     -     -
PANet [Wang et al., 2019]     BB     -     -     -     -     45.1  -       -     -     -     -     52.8
[Raza et al., 2019]           IL     -     -     -     -     -     58.7    -     -     -     -     -
Ours(V+S)-1                   IL     49.5  65.5  50.0  49.2  53.5  65.6    -     -     -     -     -
Ours(V+S)-2                   IL     42.5  64.8  48.1  46.5  50.5  64.1    45.9  65.7  48.6  46.6  51.7
                                                             ±0.7  ±0.4                            ±0.07


Training Details: During meta-training, we freeze the ResNet-50 encoder weights while learning both the multi-modal interaction module and the decoder. We train all models using SGD with momentum 0.9 and a learning rate of 0.01, reduced by a factor of 0.1 at epochs 35, 40 and 45. L2 regularization with a factor of $5 \times 10^{-4}$ is used to avoid overfitting. A batch size of 4 and an input resolution of 321×321 are used during training, with random horizontal flipping and random centered cropping for the support set. An input resolution of 500×500 is used for the meta-testing phase, similar to [Shaban et al., 2017]. In each fold the model is meta-trained for a maximum of 50 epochs on the classes outside the test fold.
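A minimal sketch of this optimization schedule using standard PyTorch utilities is shown below; `model` is a placeholder for the trainable modules (the multi-modal interaction module and the decoder), since the encoder is frozen.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(256, 2, kernel_size=1)   # placeholder for the trainable parts (interaction module + decoder)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[35, 40, 45], gamma=0.1)

for epoch in range(50):                    # meta-train for at most 50 epochs per fold
    # ... iterate over sampled 1-way k-shot episodes (batch size 4, 321x321 inputs,
    #     random horizontal flipping and random centered cropping of the support set) ...
    scheduler.step()
```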

5.2 Comparison to the state-of-the-art

We compare the results of our best variant (see Fig. 3), i.e., Stacked Co-Attention (V+S), against the other state-of-the-art methods for 1-way 1-shot and 5-shot segmentation on PASCAL-5i in Table 1. We report results for different validation schemes: Ours(V+S)-1 follows [Zhang et al., 2019b] and Ours(V+S)-2 follows [Wang et al., 2019]. Without utilizing segmentation masks or even sparse annotations, our method with the least supervision of image-level labels performs (53.5%) close to the current state-of-the-art strongly supervised methods (56.0%) in the 1-shot case and outperforms the ones that use bounding box annotations. It improves over the previously proposed image-level supervised method by a significant margin (4.8%). For the k-shot extension of our method, we average the attention summaries over the k support set samples during meta-training. Table 2 demonstrates results on MS-COCO [Lin et al., 2014] compared to the state-of-the-art method that uses pixel-wise segmentation masks for the support set.

Table 2: Quantitative results on MS-COCO few-shot 1-way segmentation.

Method                       Type   1-shot   5-shot
PANet [Wang et al., 2019]    P      20.9     29.7
Ours(V+S)                    IL     15.0     15.6

5.3 Ablation Study

We perform an ablation study to evaluate the different variants of our method depicted in Fig. 3. Table 3 shows the results for the three variants on PASCAL-5i. It clearly shows that using visual features only (the V-method) lags 5% behind utilizing word embeddings in the 1-shot case. This is mainly due to having multiple common objects between the support set and the query image; the semantic representation helps resolve the ambiguity and improves the result significantly, as shown in Figure 5. Going from 1 to 5 shots, the V-method improves, because multiple shots are likely to repeatedly contain the object of interest and the associated ambiguity decreases, but it still lags behind both variants supported by semantic input. Interestingly, our results show that the baseline of conditioning on the semantic representation alone is a very competitive variant: in the 1-shot case it even outperforms the (V+S) variant. However, the bottleneck of the simple scheme for integrating the semantic representation depicted in Fig. 3c is that it is not able to benefit from multiple shots in the support set. The (V+S)-method in the 5-shot case improves over the 1-shot case by 1.2% on average over the 5 runs, which confirms its ability to effectively utilize the more abundant visual features in the 5-shot case. One reason could explain the strong performance of the (S) variant: in the case of a single shot, a word embedding pretrained on a massive text database may provide a more reliable guidance signal than a single image containing multiple objects whose visual features are not necessarily close to the object in the query image.

Table 4 shows the results on our proposed novel video segmentation task, comparing variants of the proposed approach.


Figure 4: Qualitative evaluation on PASCAL-5i 1-way 1-shot. The support set and the prediction on the query image are shown in pairs: (a) ’bicycle’, (b) ’bottle’, (c) ’bird’, (d) ’bicycle’, (e) ’bird’, (f) ’boat’.

Table 3: Ablation study on the 4 folds of PASCAL-5i for few-shot segmentation, showing mean-IoU for the different variants. V: visual, S: semantic, V+S: both features.

Method   1-shot       5-shot
V        44.4 ± 0.3   49.1 ± 0.3
S        51.2 ± 0.6   51.4 ± 0.3
V+S      50.5 ± 0.7   51.7 ± 0.07

Table 4: Quantitative results on the Youtube-VOS one-shot weakly supervised setup, showing IoU per fold and mean-IoU over all folds, similar to PASCAL-5i. V: visual, S: semantic, V+S: both features.

Method   1     2     3     4     5     Mean-IoU
V        40.8  34.0  44.4  35.0  35.5  38.0 ± 0.7
S        42.7  40.8  48.7  38.8  37.6  41.7 ± 0.7
V+S      46.1  42.0  50.7  41.2  39.2  43.8 ± 0.5

As previously, the baseline V-method, based on a co-attention module with no word embeddings, similar to [Lu et al., 2019], lags behind both the S- and (V+S)-methods. It is worth noting that, unlike conventional video object segmentation setups, the proposed video object segmentation task poses the problem as a binary segmentation task conditioned on the image-level label. Both support and query frames can have multiple salient objects appearing in them, but the algorithm has to segment only the one corresponding to the image-level label provided with the support frame. According to our observations, this multi-object situation occurs much more frequently in this task than, e.g., in the case of PASCAL-5i. Additionally, not only the target but all the nuisance objects present in the video sequence will appear under different viewpoints or deformations. We demonstrate in Table 4 that the (V+S)-method's joint visual and semantic processing clearly provides a significant gain in such a scenario.

6 Conclusion

In this paper we proposed a multi-modal interaction module that relates the support set image and the query image using both visual and word embeddings.

Figure 5: Visual comparison between the predictions from two variants of our method: (a) label ’Bike’, (b) prediction (V), (c) prediction (V+S).

We proposed to meta-learn a stacked co-attention module that guides the segmentation of the query based on the support set features and vice versa. The two main takeaways from the experiments are that (i) few-shot segmentation significantly benefits from utilizing word embeddings and (ii) it is viable to perform high-quality few-shot segmentation using stacked joint visual semantic processing with weak image-level labels.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.

Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

Masood Dehghan, Zichen Zhang, Mennatullah Siam, Jun Jin, Laura Petrich, and Martin Jagersand. Online object and task learning via human robot interaction. In 2019 International Conference on Robotics and Automation (ICRA), pages 2132–2138. IEEE, 2019.

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

Nanqing Dong and Eric P. Xing. Few-shot semantic segmentation with prototype learning. In BMVC, volume 3, page 4, 2018.

Mark Everingham, S.M. Ali Eslami, Luc Van Gool, Christopher K.I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Ting-I Hsieh, Yi-Chen Lo, Hwann-Tzong Chen, and Tyng-Luh Liu. One-shot object detection with co-attention and co-excitation. In Advances in Neural Information Processing Systems, pages 2721–2730, 2019.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances in Neural Information Processing Systems, pages 289–297, 2016.

Xiankai Lu, Wenguan Wang, Chao Ma, Jianbing Shen, Ling Shao, and Fatih Porikli. See more, know more: Unsupervised video object segmentation with co-attention siamese networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3623–3632, 2019.

Soren Pirk, Mohi Khansari, Yunfei Bai, Corey Lynch, and Pierre Sermanet. Online object representations with contrastive learning. arXiv preprint arXiv:1906.04312, 2019.

Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alyosha Efros, and Sergey Levine. Conditional networks for few-shot semantic segmentation. 2018.

Hasnain Raza, Mahdyar Ravanbakhsh, Tassilo Klein, and Moin Nabi. Weakly supervised one shot segmentation. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.

Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. arXiv preprint arXiv:1709.03410, 2017.

Mennatullah Siam, Boris N. Oreshkin, and Martin Jagersand. AMP: Adaptive masked proxies for few-shot segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 5249–5258, 2019.

Murray Sidman. Equivalence relations and behavior: An introductory tutorial. The Analysis of Verbal Behavior, 25(1):5–17, 2009.

Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. PANet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 9197–9206, 2019.

Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.

Chi Zhang, Guosheng Lin, Fayao Liu, Jiushuang Guo, Qingyao Wu, and Rui Yao. Pyramid graph networks with connection attentions for region-based one-shot semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9587–9595, 2019.

Chi Zhang, Guosheng Lin, Fayao Liu, Rui Yao, and Chunhua Shen. CANet: Class-agnostic segmentation networks with iterative refinement and attentive few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5217–5226, 2019.