1
Improving Weakly-Supervised Object Localization By Micro-Annotation Alexander Kolesnikov [email protected] Christoph H. Lampert [email protected] IST Austria Am Campus 1 3400 Klosterneuburg Austria Localization score map Similarity matrix CNN Mid-level Patterns For every image in the training set Cluster I Cluster II (i) (ii) (iii) Localization score map Localization score map with suppressed distractors CNN Object Distractor (iv) Object localization is a crucial step needed for building automatic systems for visual scene understanding. This task can be successfully tackled using fully-supervised learning methods, but these require annotations in a form of bounding boxes or per-pixel segmentation masks that are time-consuming and expensive to ac- quire. Therefore, it is important to develop weakly-supervised object localization learning techniques, which require much cheaper forms of annotation, e.g. image-level class labels. Analyzing the current methods for weakly- supervised object localization we arrive at the conclusion that they tend to fail for object classes that consistently co-occur with the same back- ground elements (distractors), e.g. trains on tracks. We overcome these failures by develo- ping a new procedure that determines seman- tic parts that constitute the object detection and then discards distractor parts. The main steps of our approach are (see Figure above) (i) represent all predicted foreground regions of all images by mid-level features learned by a deep neural network, (ii) cluster these features using spectral clustering (the number of clusters is determined automatically), (iii) visualize the clusters and let a human annotator select which ones actually corresponds to the object class of interest. The information about clusters and their annotation can then be used to better localize objects: (iv) for any (new) image, predict a foreground map using only the image regions that match clusters labeled as ’object’. Note, that the proposed method requires vir- tually negligible amount of additional supervi- sion: an annotator has to answer a few binary questions (typically 2 or 3) per semantic class. Huge datasets, such as ILSVRC, can be anno- tated by one annotator in just a few hours. The proposed approach can be readily used in combination with many existing localization methods. In this work we combine it with the current state-of-the art methods for weakly- supervised bounding box prediction [2] and for weakly-supervised semantic segmentation [1], showing improved results on the challenging ILSVRC 2014 and PASCAL VOC 2012 datasets. [1] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. ECCV, 2016. [2] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.

Improving Weakly-Supervised Object Localization By Micro ... · Improving Weakly-Supervised Object Localization By Micro-Annotation Alexander Kolesnikov [email protected] Christoph

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Improving Weakly-Supervised Object Localization By Micro ... · Improving Weakly-Supervised Object Localization By Micro-Annotation Alexander Kolesnikov akolesnikov@ist.ac.at Christoph

Improving Weakly-Supervised Object Localization By Micro-Annotation

Alexander [email protected]

Christoph H. [email protected]

IST AustriaAm Campus 13400 KlosterneuburgAustria

Localizationscore map

Similarity matrix

CNNM

id-lev

elPat

tern

s

For every image in

the training set

Cluster I

Cluster II

(i)

(ii) (iii)

Localizationscore map

Localizationscore map with

suppressed distractors

CNN

Object

Distractor

(iv)

Object localization is a crucial step neededfor building automatic systems for visual sceneunderstanding. This task can be successfullytackled using fully-supervised learning methods,but these require annotations in a form ofbounding boxes or per-pixel segmentation masksthat are time-consuming and expensive to ac-quire. Therefore, it is important to developweakly-supervised object localization learningtechniques, which require much cheaper formsof annotation, e.g. image-level class labels.

Analyzing the current methods for weakly-supervised object localization we arrive at theconclusion that they tend to fail for object classesthat consistently co-occur with the same back-ground elements (distractors), e.g. trains ontracks. We overcome these failures by develo-ping a new procedure that determines seman-tic parts that constitute the object detection andthen discards distractor parts. The main steps ofour approach are (see Figure above) (i) representall predicted foreground regions of all imagesby mid-level features learned by a deep neuralnetwork, (ii) cluster these features using spectralclustering (the number of clusters is determinedautomatically), (iii) visualize the clusters and leta human annotator select which ones actually

corresponds to the object class of interest. Theinformation about clusters and their annotationcan then be used to better localize objects: (iv)for any (new) image, predict a foreground mapusing only the image regions that match clusterslabeled as ’object’.

Note, that the proposed method requires vir-tually negligible amount of additional supervi-sion: an annotator has to answer a few binaryquestions (typically 2 or 3) per semantic class.Huge datasets, such as ILSVRC, can be anno-tated by one annotator in just a few hours.

The proposed approach can be readily usedin combination with many existing localizationmethods. In this work we combine it withthe current state-of-the art methods for weakly-supervised bounding box prediction [2] and forweakly-supervised semantic segmentation [1],showing improved results on the challengingILSVRC 2014 and PASCAL VOC 2012 datasets.

[1] A. Kolesnikov and C. H. Lampert. Seed,expand and constrain: Three principlesfor weakly-supervised image segmentation.ECCV, 2016.

[2] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva,and A. Torralba. Learning deep features fordiscriminative localization. In CVPR, 2016.