A Novel Representation and Pipeline for Object Detection

Vishakh Hegde
Stanford University
[email protected]

Manik Dhar
Stanford University
[email protected]

Abstract

Object detection is an important problem in Computer Vision research. Neural network based models have not reached performance on object detection as high as they have on object classification, an intimately related task. These methods usually treat the background as just another class for an object classifier, which does not exploit the different nature of background as compared to objects. We propose a novel training criterion which tackles background separately. At the same time, we examine how Learning without Forgetting and finetuning perform in transferring from the classification task to the detection task. We train on the canonical PASCAL VOC dataset. We provide results for a small network trained from scratch, and for a larger network pre-trained on ImageNet and finetuned, with and without Learning without Forgetting, for object detection.

1. Introduction

Image perception, the ability to understand the contents of an image, is the holy grail of computer vision and artificial intelligence. The task of image classification was very hard for computers until very recently. With the advent of deep convolutional neural networks, computers are now able to beat humans on image classification (at least on large scale datasets like ImageNet).

A related task, object detection, aims to localize and classify objects in an image. Having good models for object detection is very important in a variety of tasks including medical imaging, surveillance and object tracking, among others. Current state-of-the-art object detection models such as Faster-RCNN [7] and SPP-Net [5] refactor existing classification models like AlexNet and ZF-Net to suit the requirements of object detection.

The parts of an image which do not contain an object of interest are generally called background. Making a distinction between background and objects is crucial, since the bulk of most natural images is background. Most state-of-the-art object recognition algorithms treat background as just another category, alongside the object categories. Classification is usually performed on a region of the image large enough to hold an object completely, but small enough to exclude most of the background.

Treating background as another category does not make intuitive sense, since it is present universally in every natural image, while a given object is usually not present in all images. Another distinction is that background crops can potentially have large intra-class variation, while most object categories have less intra-class variation. Therefore, we should not use the same representation for object detection that we use for classification.

In this work, we place special emphasis on background and design loss functions that force a neural network to activate only for objects and not at all for non-objects (background). This translates to learning a better representation specific to object detection.

2. Previous Work

Object detection is a much harder problem than object classification: the object needs to be localized within a region as well as identified. R-CNN was one of the earlier attempts; it finds region proposals using a method like selective search and then uses a convolutional neural network to extract features from them for object detection and classification [4]. At around the same time, an approach framing localization as a regression problem was proposed [9], but it did not perform as well as R-CNN. R-CNN is slow to train and test and consumes a lot of disk space. For that reason, variants were proposed to reduce training and testing time; SPP-Net [5] and Fast-RCNN [3] are the two most notable. Faster-RCNN [8] uses a region proposal network to produce object proposals, leading to an end-to-end trainable system for object detection.

The methods mentioned above train a new layer on top of the fc7 layer of AlexNet. This new layer accounts for the new classes plus an extra background class which ideally captures everything apart from the objects of interest. Our approach is independent of this previous work and can be easily adapted to state-of-the-art object detection architectures like Faster-RCNN. Our approach is vastly different in that we force our representation to output a non-zero vector only if the input image is an object. This way, we force our neural network to encode information in all the neurons of the feature layer, which is not necessarily the case in networks like R-CNN (a subset of neurons might actually be sufficient). We propose this with the intuition that part of the power of a deep neural network comes from its ability to learn a distributed representation that it can combine in multiple ways.

Learning without Forgetting (LwF) was introduced in [10] as a transfer learning strategy that replaces finetuning, with the property that the model also continues to perform well on the original task. Apart from the obvious advantage of performing well on both the old and the new task, LwF also acts as a regularizer while training for the new task and therefore prevents overfitting on it.

However, in all their experiments, the authors train their models for image classification on a large-scale dataset like ImageNet [1] or the Places2 dataset and transfer the knowledge to smaller datasets like PASCAL VOC [2], again for classification. Their 'old' and 'new' skills are the same (namely classification). They show some evidence that performing LwF on a very different task (like classifying a different kind of image) within the same skill domain (classification) significantly degrades performance on the old task [10]. In particular, they show that training a model for classification on Places2 and performing LwF on the CUB dataset results in a significant degradation of performance on Places2, since these tasks are very dissimilar. An interesting related question is whether LwF works well when applied across dissimilar skills (like classification to localization/bounding box regression). To this end, we use AlexNet pre-trained on ImageNet and train it via Learning without Forgetting.

LSDA: Large Scale Detection through Adaptation [6] is a method to train an object detection network where the training set contains classification images for all classes but bounding box labels only for a subset of these classes. The changes we discuss for R-CNN can also be adapted for LSDA. We discuss this further in a later section.

3. Main Contributions

3.1. A New Representation for Object Detection

This is obtained using a novel loss function that forces the neural network to activate only for objects and not at all for non-objects. This translates to pushing feature vectors corresponding to non-objects to the origin of the feature space, and feature vectors corresponding to objects to the surface of a unit hyper-sphere.

Figure 1: Example from PASCAL VOC detection dataset

3.2. Compare Transfer Learning Strategies

We compare the Learning without Forgetting [10] and finetuning transfer learning strategies for learning a new skill, object detection, using weights learned for image classification. The idea is that Learning without Forgetting acts as a regularizer and is therefore a better transfer learning strategy on small datasets.

4. Dataset Used

While there are multiple datasets available for training object detection algorithms, we use PASCAL VOC 2012 (for detection) for training, validation and testing. PASCAL VOC (for detection) consists of images containing objects belonging to 20 different categories: a variety of transportation vehicles, animals (including people) and everyday objects. The metadata for each image consists of a list of all objects in the image and their corresponding ground truth bounding boxes. An example from PASCAL VOC can be seen in figure 1.

4.1. Region Proposals

Ground truth regions by themselves are not sufficient to train a neural network, since they do not explicitly contain background regions. [4] provide bounding boxes for the train and test sets they use, obtained by running the images through a selective search algorithm. However, these regions do not come pre-assigned with a label. It is up to us to use the ground truth bounding box information to infer what each bounding box from selective search contains. We wrote a program to assign labels to these region proposals.


Figure 2: Crops from selective search produced by [4]. The top row consists of objects while the bottom row has background crops

4.1.1 Label Assignment For Proposed Regions

We use the Intersection over Union (IoU) metric to assign labels. For each proposal, we compute the IoU with every ground truth bounding box and apply a threshold of 0.7, i.e. we are only interested in ground truth bounding boxes that have an IoU greater than 0.7 with the proposal. If multiple ground truth boxes cross the threshold, we assign the proposal the label of the maximum-IoU ground truth box. If no IoU value crosses the 0.7 threshold, we treat the proposal as background.

For each image, we have about 2500 bounding boxes obtained from selective search. With the threshold we use, we find that roughly 10% of proposals correspond to some object, while the remaining 90% are background. We provide some examples of crops thus generated in figure 2.
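As an illustration, a minimal sketch of this labeling rule in Python follows. The (x1, y1, x2, y2) box format and the BACKGROUND sentinel value are our own illustrative choices, not details fixed by the paper.

import numpy as np

BACKGROUND = -1  # hypothetical sentinel label for background crops

def iou(box_a, box_b):
    # Intersection over Union of two boxes given as (x1, y1, x2, y2).
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_label(proposal, gt_boxes, gt_labels, threshold=0.7):
    # Label of the max-IoU ground truth box above the threshold,
    # or BACKGROUND if no ground truth box clears it.
    if len(gt_boxes) == 0:
        return BACKGROUND
    ious = np.array([iou(proposal, gt) for gt in gt_boxes])
    best = int(np.argmax(ious))
    return gt_labels[best] if ious[best] > threshold else BACKGROUND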

4.1.2 Engineering Limitations

Due to hardware limitations (disk space), we were forced to use a subset of the full dataset. We obtain about 38,000 crops corresponding to objects and more than 1M background crops. However, we randomly discard most of the background crops and keep only 100,000 randomly chosen ones for training.

5. Technical Details

Our goal is to get a good representation for object detection. As mentioned before, we want to force the neural network to produce non-zero activations only when it is fed an object; for background crops, it should ideally produce no activation. Concretely, this means that the L2 norm of the final feature layer should be zero for non-objects and close to 1 for objects. We achieve this by designing loss functions that force the norm of the final features to be zero for non-objects.

5.1. Loss Function for Object Classification

Given that a proposed region has an object in it, we train a softmax classifier with the standard cross-entropy loss on top of the feature layer to classify the object. For m classes and n image crops, the cross-entropy loss is:

$$ -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathbf{1}\{y^{(i)}_j = 1\} \, \log\big(\mathrm{softmax}_\theta(\phi(x^{(i)}), j)\big) $$

$$ \mathrm{softmax}_\theta(\phi(x), j) = \frac{e^{\theta_j^T \phi(x)}}{\sum_{k=1}^{m} e^{\theta_k^T \phi(x)}} $$

where $\theta$ is the classifier weight matrix, $x$ is an input image crop, $y$ is the one-hot vector of class labels and $\phi$ is the function computed by the neural network. The R-CNN model [4] also has a similar loss function for object classification. The main distinction is that our classifier does not have a background class, whereas the R-CNN classifier treats background as another class.
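A minimal NumPy sketch of this loss, assuming the features φ(x) are available as an array and background crops are marked with a sentinel label (our own convention, matching the labeling sketch above):

import numpy as np

BACKGROUND = -1  # hypothetical sentinel label for background crops

def object_cross_entropy(features, labels, theta):
    # features: (n, d) outputs of phi; theta: (d, m); labels: (n,) ints.
    # Background crops are masked out: unlike R-CNN, they get no class.
    mask = labels != BACKGROUND
    feats, labs = features[mask], labels[mask]
    logits = feats @ theta                        # (n_obj, m)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labs)), labs].mean()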

5.2. Loss Function to Control the L2 Norm

The loss function should penalize high L2 norm values for non-objects, and low L2 norm values for objects. For this we design the following two loss functions:

• Spherical Hinge Loss

• Spherical Softmax Loss

5.2.1 Spherical Hinge Loss

We define the L2 norm hinge loss as follows:

$$ \frac{1}{n} \sum_{i=1}^{n} \Big\{ \big(\|\phi(x^{(i)})\|_2^2 - 1\big) \, (-1)^{\mathbf{1}\{\|y^{(i)}\|_1 = 1\}} \Big\}_+ $$

where $\{x\}_+ = x$ if $x > 0$, and $0$ otherwise.

Here, $\mathbf{1}\{\|y^{(i)}\|_1 = 1\}$ indicates whether an image crop contains an object or not; if there is no object in the image crop, the class vector is all zeros.
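A direct transcription of this loss into NumPy might look as follows; the boolean is_object array plays the role of the indicator 1{‖y‖₁ = 1}:

import numpy as np

def spherical_hinge_loss(features, is_object):
    # features: (n, d) final-layer activations phi(x);
    # is_object: (n,) boolean, True for object crops.
    # Penalizes squared norms below 1 for objects and above 1 for
    # background, separating the two classes at the unit sphere.
    sq_norm = (features ** 2).sum(axis=1)
    sign = np.where(is_object, -1.0, 1.0)   # (-1)^{indicator}
    return np.maximum(sign * (sq_norm - 1.0), 0.0).mean()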

5.2.2 Spherical Softmax Loss

In this approach we train a 2-class softmax classifier on the square of the norm of the last feature layer to detect background images. We provide the equation below (which is simplified because there are only 2 classes and the feature is just a scalar):

$$ \frac{1}{n} \sum_{i=1}^{n} \Big( \mathbf{1}\{\|y^{(i)}\|_1 = 0\} \, \log\big(1 + e^{k\|\phi(x^{(i)})\|_2^2 + b}\big) + \mathbf{1}\{\|y^{(i)}\|_1 = 1\} \, \log\big(1 + e^{-k\|\phi(x^{(i)})\|_2^2 - b}\big) \Big) $$

Here $k$ and $b$ are two scalar parameters which we train over.
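The same loss written in NumPy (a sketch; log(1 + e^s) is computed via logaddexp for numerical stability):

import numpy as np

def spherical_softmax_loss(features, is_object, k, b):
    # Two-class logistic loss on the squared feature norm.
    # k, b: trainable scalars; is_object: (n,) boolean.
    s = k * (features ** 2).sum(axis=1) + b
    # log(1 + e^{-s}) for objects, log(1 + e^{s}) for background
    loss = np.where(is_object, np.logaddexp(0.0, -s), np.logaddexp(0.0, s))
    return loss.mean()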


Figure 3: Schematic of the three layer convolutional neural network

5.3. Neural Network Architectures Used

5.3.1 Three Layer CNN

In order to quickly validate our hypothesis of imposing loss functions on the L2 norm, we use a three layer neural network, since it is fast and easy to train. In a bid to reduce the number of parameters, we resize all crops to 80 × 80 × 3. We provide a schematic of this network in figure 3.

5.3.2 AlexNet

Once we validated our hypothesis of using the L2 norm for background classification, we started using AlexNet pretrained on ImageNet. We use AlexNet to compare the finetuning and LwF transfer learning strategies. The reason for this choice is that pretrained TensorFlow weights are available online and AlexNet is one of the simplest deep networks to analyze.

5.4. Learning without Forgetting (LwF)

The network is initially trained on classifying the ImageNet dataset. To ensure that previously learned capabilities are not forgotten, we use the Learning without Forgetting (LwF) transfer learning strategy. LwF also provides good regularization while training the weights of the network.

Let $\phi$ represent the original network, $\theta_{img}$ the original weights of the ImageNet classifier, and $m_{img}$ the number of classes in ImageNet. For each training image we record

$$ z^{(i)} = \mathrm{softmax}_{\theta_{img}}(\phi(x^{(i)})) $$

where $\mathrm{softmax}_{\theta_{img}}$ is the output of the softmax layer of the ImageNet classifier, an $m_{img}$-dimensional vector. We use the knowledge distillation loss to minimize the change in the output of the old task. Writing $\hat{z}^{(i)}$ for the corresponding output of the network being trained, the loss function for Learning without Forgetting is

$$ -\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m_{img}} z'^{(i)}_j \log\big(\hat{z}'^{(i)}_j\big) $$

where both sets of outputs are rescaled with a temperature $T$:

$$ z'^{(i)}_j = \frac{\big(z^{(i)}_j\big)^{1/T}}{\sum_{k=1}^{m_{img}} \big(z^{(i)}_k\big)^{1/T}}, \qquad \hat{z}'^{(i)}_j = \frac{\big(\hat{z}^{(i)}_j\big)^{1/T}}{\sum_{k=1}^{m_{img}} \big(\hat{z}^{(i)}_k\big)^{1/T}} $$

This loss function ensures that information about the previous task is maintained, and it acts as a regularization term for our object detection task.
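A sketch of this distillation term in NumPy, assuming the anchor network's softmax outputs have been recorded beforehand (the default T = 2 follows the temperature suggested in the LwF paper [10]):

import numpy as np

def lwf_distillation_loss(z_anchor, z_current, T=2.0):
    # z_anchor: (n, m_img) softmax outputs recorded from the frozen
    # anchor network; z_current: (n, m_img) softmax outputs of the
    # network being trained. T is the distillation temperature.
    def sharpen(z):
        p = z ** (1.0 / T)
        return p / p.sum(axis=1, keepdims=True)
    zp, zhat = sharpen(z_anchor), sharpen(z_current)
    return -(zp * np.log(zhat + 1e-12)).sum(axis=1).mean()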

6. Experiments

6.1. Experiment 1: Comparing Detection Pipelines

There are three different ways to take in an image crop and perform classification during inference:

• RCNN-like classification: here the background is treated as just another class and the network is trained to classify crops into one of 21 categories. The schematic is shown in figure 4.

• Network trained using the Spherical Hinge Loss: the norm of the final features is computed first. If the norm is less than 1, the crop is declared background. Otherwise, it is passed through a softmax classifier which classifies it into one of 20 object categories. During training, a linear combination of the two loss values is taken, with the weights being hyper-parameters. For our experiments, we use a weight of 1 for each of the loss values.

• Network trained using the Spherical Softmax Loss: in this pipeline, the norm is used to directly infer (via the binary softmax loss) whether or not the crop is an object. If it is an object, it is passed through a softmax classifier which classifies it into one of 20 object categories. During training, a linear combination of the two loss values is taken, with the weights being hyper-parameters. For our experiments, we use a weight of 1 for each of the loss values.

The schematic for the latter two networks is depicted in figure 5, and a sketch of the corresponding inference rule is given below. In order to compare these three approaches, we use a three layer convolutional neural network as the base network, as mentioned previously. In the first experiment, we compare the classification accuracy on a fixed validation set for all three pipelines.
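A minimal sketch of the SHL inference rule on a single crop; the threshold of 1 on the norm follows the description above, and theta is a hypothetical 20-way classifier weight matrix:

import numpy as np

BACKGROUND = -1  # hypothetical sentinel for the background decision

def classify_crop_shl(features, theta):
    # features: (d,) final-layer activations phi(x); theta: (d, 20).
    if np.linalg.norm(features) < 1.0:
        return BACKGROUND                     # norm < 1 => background
    return int(np.argmax(features @ theta))   # one of 20 object classes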

6.2. Experiment 2: Comparing Transfer Learning Strategies

RCNN uses the finetuning transfer learning strategy on AlexNet weights learned on ImageNet. We want to see if the newly introduced Learning without Forgetting (LwF) strategy works better than finetuning. We found from the first experiment that the network trained with the Spherical Hinge Loss (we refer to it as SHL) performs better than both RCNN-like object classification (we refer to it as RCNN(ours)) and the network trained with the Spherical Softmax Loss (we refer to it as SSL). Therefore, we narrow this experiment down to comparing transfer learning strategies on RCNN(ours) and SHL. The base network used is AlexNet pre-trained on ImageNet.

Figure 4: Schematic of the network used for RCNN-like classification. Base network is a three layer CNN

Figure 5: Schematic of the network used for classification using the Spherical Hinge Loss and Spherical Softmax Loss on the L2 norm. Base network is a three layer CNN

6.2.1 Finetuning

We finetune RCNN(ours) and SHL from the conv4 layer onward. The reason for this choice is that the initial layers of a neural network are found to be simple edge detectors and Gabor-like filters, which generalize well across multiple datasets. However, we expect finetuning not to work very well on SHL, since SHL tries to drastically alter the distribution of the data in the representation space.

6.2.2 Learning without Forgetting (LwF)

We use AlexNet trained on ImageNet as an anchor network whose weights never get updated. We use a second copy of AlexNet pre-trained on ImageNet, but update its weights according to a loss function which takes the LwF loss into account. Similar to finetuning, we only update the weights of the base network from the conv4 layer onward. The schematic of such a network for SHL is given in figure 6 and that for RCNN(ours) is given in figure 7. The final loss function is a linear combination of the loss functions given in the respective figures, where the weights in the linear combination are hyper-parameters.

Figure 6: Schematic of SHL with the LwF loss function added

Figure 7: Schematic of RCNN(ours) with the LwF loss function added

Figure 8: Histogram (log-log scale) of the squared-norm values of the prefinal layer for objects (red) and non-objects (green) with the spherical hinge loss

7. Results

7.1. Experiment 1

From experiment 6.1, we find that SHL performs better than RCNN(ours) and the network trained with SSL. We train each model for 100 epochs and obtain validation accuracies in steps of 10 epochs. We report the best accuracies among them in the table below.

Figure 9: Histogram (log-log scale) of the squared-norm values of the prefinal layer for objects (red) and non-objects (green) with the spherical softmax loss

We use the following abbreviations,

• OCA = Object classification accuracy. This is the classification accuracy among the 20 object categories.

• CA = Overall classification accuracy across all classes, including background.

• BC = Background classification accuracy. This measures the accuracy of classifying between objects and non-objects.

Model                        OCA     CA      BC
RCNN-like classifier         0.133   0.764   0.793
Spherical Hinge Loss Net     0.348   0.755   0.801
Spherical Softmax Loss Net   0.241   0.653   0.716

7.1.1 Discussion

We also plot histograms of the squared-norm values of the pre-final layer for SHL (figure 8) and SSL (figure 9). We observe that SHL leads to a clean separation between the two classes, while there is much more overlap between the object and non-object classes when we use the Spherical Softmax Loss. This is validated again when we compare the classification accuracies.

We use t-SNE to visualize how object and background images are distributed in the embedding space in figures 10, 11 and 12. We see that for the vanilla network, the background images are distributed haphazardly, whereas for SHL and SSL they are more concentrated. They are more concentrated for SHL than for SSL.

Figure 10: t-SNE diagram for objects (red) and background (blue) for RCNN(ours)

Figure 11: t-SNE diagram for objects (red) and background (blue) for SHL

7.2. Experiment 2

7.2.1 Comparison between SHL and RCNN(ours) for the LwF strategy

From figures 13, 14 and 15, we find that while SHL performs better than RCNN(ours) on OCA, it performs very badly on the CA and BC metrics.

7.2.2 Comparison between SHL and RCNN(ours) for the finetuning strategy

From figures 16, 17 and 18, we find that RCNN(ours) performs better than SHL on all the accuracy metrics.


Figure 12: t-SNE diagram for objects (red) and background (blue) for SSL

Figure 13: Comparison of OCA for SHL and RCNN(ours) with the LwF strategy

7.2.3 Comparison of SHL for the finetuning and LwF transfer learning strategies

From figures 19, 20 and 21, we find that SHL with LwF performs better than SHL with finetuning on OCA, and worse on the CA and BC metrics.

7.2.4 Discussion

The observations in 7.2.1 and 7.2.2 are not surprising. The reason is that we use a pre-trained AlexNet which is trained to perform classification. SHL imposes drastic constraints on the embeddings of the neural network, whereas RCNN(ours) can simply build off the weight structures produced by pre-training on ImageNet. Also, since we do not train all the network weights, this effect is even more pronounced.

Figure 14: Comparison of CA for SHL and RCNN(ours) with the LwF strategy

Figure 15: Comparison of BC for SHL and RCNN(ours) with the LwF strategy

Figure 16: Comparison of OCA for SHL and RCNN(ours) with the finetuning strategy


Figure 17: Comparison of CA for SHL and RCNN(ours) with the finetuning strategy

Figure 18: Comparison of BC for SHL and RCNN(ours) with the finetuning strategy

Figure 19: Comparison of OCA for SHL with the finetuning strategy against SHL with the LwF strategy

Figure 20: Comparison of CA for SHL with the finetuning strategy against SHL with the LwF strategy

Figure 21: Comparison of BC for SHL with the finetuning strategy against SHL with the LwF strategy

8. Conclusions

We base our exploration on the intuition that background should be treated differently from objects. We incorporate this by using special loss functions, the Spherical Hinge Loss and the Spherical Softmax Loss, on the L2 norm of the embeddings of the base network.

We perform two experiments (6.1 and 6.2) and find that when we train networks from random initialization, using the Spherical Hinge Loss on the L2 norm is more effective than RCNN(ours), where background is treated as another class.

However, when warm-starting the learning process (with both finetuning and LwF) from ImageNet pre-trained networks, RCNN(ours) performs better than SHL. The reason for this has been discussed in 7.2.4. We believe that training SHL on a large scale object detection dataset like ImageNet (for detection) and then using transfer learning techniques for smaller datasets like PASCAL VOC might actually perform better than the original RCNN [4].


Figure 22: Detection with the LSDA network. Given an image, extract region proposals, reshape the regions to fit the network input size and finally produce detection scores per category for each region. Layers with red dots/fill have been modified/learned during finetuning with the available bounding box annotated data. Learning without Forgetting can protect these layers from losing information about classes in set A. Background detection can be done using the spherical hinge loss.

9. Future Directions

9.1. Fast and Faster-RCNN

The methods we describe can also be run on the Fast and Faster-RCNN networks. Their current implementations are in Caffe. We started out working in TensorFlow, and implementing Fast and Faster-RCNN requires the Region of Interest Pooling layer (introduced in Fast-RCNN), which is not currently implemented in TensorFlow; the open-source implementations we found did not work well. Therefore, we decided to run our experiments on the R-CNN network instead. It is important to note that the improvements made by the Fast and Faster-RCNN networks are orthogonal in purpose to our modifications and can therefore be combined with them to create an object detection system.

9.2. LSDA

LSDA: Large Scale Detection through Adaptation [6] solves a more general problem. They consider a scenario where you have a dataset with training data for classification, but only a subset of the classes has bounding box training data for object detection. Their method allows them to train a network which can solve the object detection problem for the whole dataset.

We describe their approach here. The set of classes is split in two depending on whether bounding-box-labeled images are available: say set A does not have them and set B does. They start out with a network trained for classification on the whole dataset (A ∪ B). Unlike a usual classification network, they do not use normalized softmax values and instead use linear scores which can lie anywhere over the reals. After this training, the final layer provides object detection scores for the whole dataset. f_A denotes the cells which provide scores for classes in set A; similarly we have f_B.

Next, they initialize new empty cells in the last layer to encode object detection information for the background and for the classes with bounding box data available, δ_B. The object detection score for classes in B is computed by adding the classification scores and the scores from the new cells, f_B + δ_B. For classes in set A, they find the nearest neighbors in set B (according to the weights in the last layer) and average their scores to approximate the scores a δ_A layer would have produced if the data were available. A sketch of this transfer step is given below.
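A sketch of this nearest-neighbor transfer step, under our own assumptions about unspecified details (cosine similarity on the last-layer weights and a fixed neighbor count):

import numpy as np

def approximate_delta_A(w_A, w_B, delta_B, num_neighbors=5):
    # w_A: (nA, d) last-layer weights for classes in A (no box labels);
    # w_B: (nB, d) weights for classes in B; delta_B: (nB, d) learned
    # detection offsets for classes in B. Returns (nA, d) approximate
    # offsets for classes in A by averaging their nearest neighbors in B.
    a = w_A / np.linalg.norm(w_A, axis=1, keepdims=True)
    b = w_B / np.linalg.norm(w_B, axis=1, keepdims=True)
    sim = a @ b.T                                    # (nA, nB) cosine
    nn = np.argsort(-sim, axis=1)[:, :num_neighbors]
    return delta_B[nn].mean(axis=1)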

The additions we consider for R-CNN, the Spherical Hinge Loss and LwF, can be used on the LSDA network as well. While training, all the previous layers are finetuned over data corresponding to set B. The LwF loss would act as a regularizer and prevent knowledge about set A from being lost. Similarly, for background classification, the spherical hinge loss can be used.

References

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.

[3] R. Girshick. Fast R-CNN. In International Conference on Computer Vision (ICCV), 2015.

[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014.

[5] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. CoRR, abs/1406.4729, 2014.

[6] J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell, and K. Saenko. LSDA: Large scale detection through adaptation. In Advances in Neural Information Processing Systems 27, pages 3536–3544. Curran Associates, Inc., 2014.

[7] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), 2015.

[8] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.

[9] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems 26, pages 2553–2561. Curran Associates, Inc., 2013.

[10] Z. Li and D. Hoiem. Learning without forgetting. arXiv preprint arXiv:1606.09282, 2016.