
Part-Stacked CNN for Fine-Grained Visual Categorization

Shaoli Huang*
University of Technology, Sydney
Sydney, Ultimo, NSW 2007, Australia
[email protected]

Zhe Xu*
Shanghai Jiao Tong University
Shanghai, 200240, China
[email protected]

Dacheng Tao
University of Technology, Sydney
Sydney, Ultimo, NSW 2007, Australia
[email protected]

Ya Zhang
Shanghai Jiao Tong University
Shanghai, 200240, China
ya [email protected]

August 17, 2019

arXiv:1512.08086v1 [cs.CV] 26 Dec 2015

    Abstract

In the context of fine-grained visual categorization, the ability to interpret models as human-understandable visual manuals is sometimes as important as achieving high classification accuracy. In this paper, we propose a novel Part-Stacked CNN architecture that explicitly explains the fine-grained recognition process by modeling subtle differences from object parts. Based on manually-labeled strong part annotations, the proposed architecture consists of a fully convolutional network to locate multiple object parts and a two-stream classification network that encodes object-level and part-level cues simultaneously. By adopting a set of sharing strategies between the computation of multiple object parts, the proposed architecture is very efficient, running at 20 frames/sec during inference. Experimental results on the CUB-200-2011 dataset reveal the effectiveness of the proposed architecture, from both the perspective of classification accuracy and model interpretability.

    1 Introduction

Fine-grained visual categorization aims to distinguish objects at the subordinate level, e.g. different species of birds [43, 41, 4], pets [17, 29], flowers [28, 1] and cars [35, 25].


Figure 1: Overview of the proposed approach. We propose to classify fine-grained categories by modeling the subtle difference from specific object parts. Beyond classification results, the proposed PS-CNN architecture also offers human-understandable instructions on how to classify highly similar object categories explicitly.

It is a highly challenging task due to the small inter-class variance caused by highly similar subordinate categories, and the large intra-class variance caused by nuisance factors such as pose, viewpoint and occlusion. Inspiringly, huge progress has been made over the last few years [40, 4, 39, 18, 45], making fine-grained recognition techniques a large step closer to practical use in various applications, such as wildlife observation and surveillance systems.


Whilst numerous attempts have been made to boost the classification accuracy of fine-grained visual categorization [10, 9, 6, 21], we argue that another important aspect of the problem has so far been largely overlooked, i.e., the ability to generate a human-understandable "manual" on how to distinguish fine-grained categories in detail. For example, volunteers for ecological protection may certainly benefit from an algorithm that could not only classify bird species accurately, but also provide brief instructions on how to distinguish a category from its most similar subspecies - e.g., a salient difference between a Ring-billed gull and a California gull lies in the pattern on their beaks (Figure 1) - with some intuitive illustration examples. Existing fine-grained recognition methods that aim to provide a visual field guide mostly follow the routine of "part-based one-vs-one features" (POOFs) [3, 2, 4] or employ human-in-the-loop methods [20, 7, 38]. Since dataset sizes have been increasing drastically, a method that simultaneously implements and interprets fine-grained visual categorization using the latest deep learning methods [19] is therefore highly desirable.

It is widely acknowledged that the subtle difference between fine-grained categories mostly resides in the unique properties of object parts [31, 3, 9, 26, 47]. Therefore, a practical solution to interpret classification results as human-understandable manuals is to discover classification criteria from object parts. Some existing fine-grained datasets provide detailed part annotations including part landmarks and attributes [41, 25]. However, they are usually associated with a large number of object parts, which poses a heavy computational burden for both part detection and classification. From this perspective, one would like to seek a method that follows the object-part-aware strategy to provide interpretable prediction criteria, while requiring minimum computational effort to deal with a possibly large number of parts.

In this paper, we propose a new part-based CNN architecture for fine-grained visual categorization that models multiple object parts in a unified framework with high efficiency. Similar to previous fine-grained recognition approaches, the proposed method consists of a localization module to detect object parts (the "where" pathway) and a classification module to classify fine-grained categories at the subordinate level (the "what" pathway). In particular, we employ a fully convolutional network (FCN) to perform object part localization. The inferred part locations are fed into the classification network, in which a two-stream architecture is proposed to analyze images at both the object level (bounding boxes) and the part level (part landmarks). The computation of multiple parts is first conducted via a shared feature extraction route, then separated through a novel part crop layer, concatenated, and fed into a shallower network to perform object classification. Beyond categorical predictions, the proposed method also generates interpretable classification instructions based on object parts. Since the proposed architecture employs a sharing strategy that stacks the computation of multiple parts together, we call it Part-Stacked CNN (PS-CNN).

The contributions of this paper include: 1) we present a novel and efficient part-based CNN architecture for fine-grained recognition; 2) our architecture adopts an FCN to localize object parts, which has seldom been studied before in the context of object recognition; 3) our classification network follows a two-stream structure that captures both object-level and part-level information, in which a new share-and-divide strategy is presented for the computation of multiple object parts. As a result, the proposed architecture is very efficient, with a capacity of 20 frames/sec^1 on a Tesla K80 to classify images at test time using 15 object parts; 4) to the best of our knowledge, the proposed method is the first attempt to both implement and explicitly interpret the process of fine-grained visual categorization using deep learning methods. The effectiveness of the proposed method is demonstrated through systematic experiments on the Caltech-UCSD Birds-200-2011 [41] dataset, on which we achieve 76% classification accuracy. We also present practical examples of human-understandable manuals generated by the proposed method for the task of fine-grained visual categorization.

The rest of the paper is organized as follows. Section 2 summarizes related work. The proposed architecture, including the localization network and the classification network, is described in Section 3. Detailed performance studies and analysis are conducted in Section 4. Section 5 concludes the paper and discusses application scenarios of the proposed PS-CNN.

^1 For reference, a single CaffeNet runs at 50 frames/sec under the same experimental setting.


2 Related Work

Fine-Grained Visual Categorization. A number of methods have been developed to classify object categories at the subordinate level. Recently, the best performing methods mostly sought improvement from the following three aspects: more discriminative features, including deep CNNs, for better visual representation [5, 32, 19, 36, 34]; explicit alignment approaches to eliminate pose displacements [6, 14]; and part-based methods to study the impact of object parts [3, 48, 26, 47, 15]. Another line of research explored human-in-the-loop methods [8, 10, 42] to identify the most discriminative regions for classifying fine-grained categories. Although such methods provide direct references on how people perform fine-grained recognition in real life, they are impossible to scale to large systems due to the need for human interaction at test time.

Current state-of-the-art methods for fine-grained recognition are part-based R-CNN by Zhang et al. [47] and Bilinear CNN by Lin et al. [21], both of which employ a two-stage pipeline of part detection and part-based object classification. The main idea of the proposed PS-CNN is largely inherited from [47], which first detects the locations of two object parts and then trains an individual CNN based on the unique properties of each part. Compared to part-based R-CNN, the proposed method is far more efficient in both the detection and classification phases. As a result, we are able to employ many more object parts than [47], while still being significantly faster at test time.

On the other hand, Lin et al. [21] argued that manually defined parts are sub-optimal for the task of object recognition, and thus proposed a bilinear model consisting of two streams whose roles are interchangeable as detectors or features. Although this design enjoys a data-driven nature that could possibly lead to optimal classification performance, it also makes the resulting model hard to interpret. On the contrary, our method tries to balance the needs of both classification accuracy and model interpretability in fine-grained recognition systems.

Fully Convolutional Networks. A fully convolutional network (FCN) is a fast and effective approach to produce dense predictions with convolutional networks. Successful examples can be found on tasks including sliding window detection [33], semantic segmentation [22], and human pose estimation [37]. We find the problem of part landmark localization in fine-grained recognition closely related to human pose estimation, in which a critical step is to detect a set of key points indicating multiple components of the human body.

3 Part-Stacked CNN

We present the model architecture of the proposed Part-Stacked CNN in this section. In accordance with the common framework for fine-grained recognition, the proposed architecture is decomposed into a Localization Network (Section 3.1) and a Classification Network (Section 3.2). We adopt CaffeNet [16], a slightly modified version of the standard seven-layer AlexNet [19] architecture, as the basic structure of the network; deeper networks could potentially lead to better recognition accuracy, but may also result in lower efficiency.

A unique design in our architecture is that the message transfer from the localization network to the classification network, i.e. using detected part locations to perform part-based classification, is conducted directly on the conv5 output feature maps within the process of data forwarding. This is a significant difference compared to the standard two-stage pipeline of part-based R-CNN [47], which consecutively localizes object parts and then trains part-specific CNNs on the detected regions. Based on this design, a set of sharing schemes are performed to make the proposed PS-CNN fairly efficient for both learning and inference. Figure 2 illustrates the overall network architecture.

3.1 Localization Network

The first stage of the proposed architecture is a localization network that aims to detect the locations of object parts. We employ the simplest form of part landmark annotation, i.e. a 2D key point annotated at the center of each object part. We assume that M, the number of object parts labeled in the dataset, is sufficiently large to offer a complete set of object parts on which fine-grained categories usually differ from each other. Motivated by recent progress in human pose estimation [37] and semantic segmentation [22], we adopt a fully convolutional network (FCN) [27] to generate dense output feature maps for locating object parts.



Figure 2: Network architecture of the proposed Part-Stacked CNN model. The model consists of: 1) a fully convolutional network for part landmark localization; 2) a part stream where multiple parts share the same feature extraction procedure, while being separated by a novel part crop layer given detected part locations; 3) an object stream with lower spatial-resolution input images to capture bounding-box level supervision; and 4) three fully connected layers to achieve the final classification results based on a concatenated feature map containing information from all parts and the bounding box.

Fully convolutional network. A fully convolutional network is achieved by replacing the parameter-rich fully connected layers in standard CNN architectures with convolutional layers with 1 × 1 kernels. Given an input RGB image, the output of a fully convolutional network is a feature map of reduced dimension compared to the input. The computation of each unit in the feature map only corresponds to pixels inside a fixed-size region of the input image, which is called its receptive field. An FCN is preferred in our framework for three reasons: 1) feature maps generated by the FCN can be directly utilized as the part locating results in the classification network, which will be detailed in Section 3.2; 2) results for multiple object parts can be obtained simultaneously with one FCN; 3) the FCN is very efficient in both learning and inference.

Learning. We model the part localization process as a multi-class classification problem over dense output spatial positions. In particular, suppose the output of the last convolutional layer in the FCN has size h × w × d, where h and w are spatial dimensions and d is the number of channels. We set d = M + 1, where M is the number of object parts and the additional channel models the background. To generate corresponding ground-truth labels in the form of feature maps, units indexed by the h × w spatial positions are labeled by their nearest object part; units that are not close to any of the labeled parts (with an overlap < 0.5 with respect to the receptive field) are labeled as background.
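To make the label-map construction concrete, below is a minimal numpy sketch of how such dense ground-truth maps might be generated from 2D key-point annotations. The stride, offset, and distance radius (used here in place of the 0.5 receptive-field-overlap criterion) are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def make_label_map(keypoints, h=27, w=27, stride=16, offset=8, radius=80):
    """Build an h x w ground-truth map for the FCN localization loss.

    keypoints : (M, 2) array of (x, y) part centers in input-image pixels;
                rows with negative coordinates mark parts that are not visible.
    Each output unit is labeled 1..M by its nearest visible part, or 0
    (background) when no part center lies within `radius` pixels of the
    unit's receptive-field center.
    """
    labels = np.zeros((h, w), dtype=np.int64)                  # 0 = background
    for i in range(h):
        for j in range(w):
            cx, cy = j * stride + offset, i * stride + offset  # RF center
            best, best_d = 0, radius
            for c, (x, y) in enumerate(keypoints, start=1):
                if x < 0 or y < 0:                             # invisible part
                    continue
                d = np.hypot(x - cx, y - cy)
                if d < best_d:
                    best, best_d = c, d
            labels[i, j] = best
    return labels
```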



Figure 3: Demonstration of the localization network. The training process is denoted inside the dashed box. For inference, a Gaussian kernel is introduced to remove noise. The results are M 2D part locations in the 27 × 27 conv5 feature map.

A practical problem here is to determine the model depth and the size of input images for training the FCN. Generally speaking, layers at later stages carry more discriminative power and thus are more likely to generate promising localization results; however, their receptive fields are also much larger than those of earlier layers. For example, the receptive field of the conv5 layer in CaffeNet has a size of 163 × 163 compared to the 227 × 227 input image, which is too large to model an object part. We propose a simple trick to deal with this problem: upsampling the input images so that the fixed-size receptive fields denoting object parts become relatively smaller compared to the whole object, while still being able to use layers at later stages to guarantee enough discriminative power.
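As a sanity check of the numbers used in this section, the conv5 receptive field can be derived with the standard recurrence over kernel sizes and strides. The short Python snippet below (ignoring padding, which does not change the receptive-field size) reproduces the 163-pixel extent and the 16-pixel output stride.

```python
# Receptive-field size r and stride j of CaffeNet/AlexNet conv5, using the
# recurrence r_out = r_in + (k - 1) * j_in and j_out = j_in * s per layer.
layers = [  # (kernel, stride): conv1, pool1, conv2, pool2, conv3, conv4, conv5
    (11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1),
]
r, j = 1, 1
for k, s in layers:
    r, j = r + (k - 1) * j, j * s
print(r, j)  # 163 16: each conv5 unit sees a 163x163 patch, sampled every 16 px
```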

The localization network in the proposed PS-CNN is illustrated in Figure 3. The input of the FCN is a bounding-box-cropped RGB image, warped and resized to a fixed size of 454 × 454. The structure of the first five layers is identical to that in CaffeNet, which leads to a 27 × 27 × 256 output after the conv5 layer. Afterwards, we further introduce a 1 × 1 convolutional layer with 512 output channels as conv6, and another 1 × 1 convolutional layer with M + 1 outputs termed conv7 to perform classification. By adopting a spatially preserving softmax that normalizes predictions at each spatial location of the feature map, the final loss function is a sum of softmax losses over all 27 × 27 positions:

L = -\sum_{h=1}^{27} \sum_{w=1}^{27} \log \sigma(h, w, \hat{c}),    (1)

where

\sigma(h, w, \hat{c}) = \frac{\exp(f_{conv7}(h, w, \hat{c}))}{\sum_{c=0}^{M} \exp(f_{conv7}(h, w, c))}.

Here, ĉ ∈ {0, 1, ..., M} is the part label of the patch at location (h, w), where label 0 denotes background, and f_conv7(h, w, c) stands for the output of the conv7 layer at spatial position (h, w) and channel c.
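In other words, the loss in Eq. (1) is an independent softmax cross-entropy at every spatial position of the conv7 output. A minimal numpy sketch, assuming a ground-truth label map as described above, could look as follows.

```python
import numpy as np

def localization_loss(f_conv7, labels):
    """Sum of per-location softmax losses over all 27 x 27 positions (Eq. 1).

    f_conv7 : (27, 27, M + 1) raw scores from the conv7 layer.
    labels  : (27, 27) integer map with 0 = background and 1..M = part index.
    """
    scores = f_conv7 - f_conv7.max(axis=-1, keepdims=True)   # numerical stability
    log_prob = scores - np.log(np.exp(scores).sum(axis=-1, keepdims=True))
    h_idx, w_idx = np.indices(labels.shape)
    return -log_prob[h_idx, w_idx, labels].sum()
```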

Inference. The inference process starts from the output of the learned FCN, i.e., (M + 1) part-specific heat maps of size 27 × 27, in which we introduce a Gaussian kernel G to remove isolated noise in the feature maps. The final output of the localization network is M locations in the 27 × 27 conv5 feature map, each of which is computed as the location with the maximum response for one object part.

Meanwhile, considering that object parts may be missing in some images due to varied poses and occlusion, we set a threshold µ such that if the maximum response of a part is below µ, we simply discard this part's channel in the classification network for this image. Let g(h, w, c) = σ(h, w, c) ∗ G; the inferred part locations are then given as:



Figure 4: Demonstration of the part crop layer with three object parts. The input is a 27 × 27 heat map and three object parts' anchor positions, and the output is three 6 × 6 regions centered at the respective anchor positions. The image on the left shows the receptive field for each part after performing part cropping.

(h_c^*, w_c^*) =
\begin{cases}
\arg\max_{h,w} g(h, w, c), & \text{if } g(h_c^*, w_c^*, c) > \mu, \\
(-1, -1), & \text{otherwise.}
\end{cases}    (2)
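A rough numpy sketch of this inference step is given below. scipy's gaussian_filter stands in for the fixed Gaussian kernel used in the paper, and the sigma and mu values are illustrative rather than the ones used in the experiments.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def infer_part_locations(heatmaps, mu=0.5, sigma=1.0):
    """Pick one (h, w) location per part from the (M + 1) 27 x 27 heat maps (Eq. 2).

    heatmaps : (27, 27, M + 1) softmax outputs; channel 0 is background.
    Returns a list of (h, w) tuples, with (-1, -1) for parts whose smoothed
    maximum response falls below the threshold mu.
    """
    locations = []
    for c in range(1, heatmaps.shape[-1]):               # skip background channel
        g = gaussian_filter(heatmaps[..., c], sigma)      # remove isolated noise
        h, w = np.unravel_index(np.argmax(g), g.shape)
        locations.append((h, w) if g[h, w] > mu else (-1, -1))
    return locations
```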

3.2 Classification Network

The second stage of the proposed PS-CNN is a classification network that takes the inferred part locations as an input. It follows a two-stream architecture with a Part Stream and an Object Stream to capture semantics at multiple levels. A sub-network consisting of three fully connected layers then acts as the object classifier, as shown in Figure 2.

Part stream. The part stream acts as the core of the proposed PS-CNN architecture. To capture object-part-dependent differences between fine-grained categories, one could train a set of part CNNs, each of which conducts classification on one part separately, as proposed by Zhang et al. [47]. Although such a method worked well for [47], who employed only two object parts, we argue that it is not applicable when the number of object parts is much larger, as in our case, because of the high time and space complexity.

In PS-CNN, we introduce two strategies to improve the efficiency of the part stream. The first is model parameter sharing. Specifically, the model parameters of the first five convolutional layers are shared among all object parts, which can be regarded as a generic part-level feature extractor. This strategy leads to fewer parameters in the proposed architecture and thus reduces the risk of overfitting.

In addition to model parameter sharing, we also adopt a computational sharing strategy. The goal is to make sure that the feature extraction procedure for all parts requires only one pass through the convolutional layers. Analogous to the localization network, the input images of the part stream have doubled resolution, 454 × 454, so that the receptive fields are not too large to model object parts; forwarding the network to the conv5 layer generates output feature maps of size 27 × 27. Up to this point, the computation of all object parts is completely shared.

After the shared feature extraction procedure, the computation of each object part is partitioned through a part crop layer to model part-specific classification cues. As shown in Figure 4, the input to the part crop layer is a set of feature maps, e.g., the output of the conv5 layer in our architecture, together with the predicted part locations from the preceding localization network, which also reside in the conv5 feature maps. For each part, the part crop layer extracts a local neighborhood region centered at the detected part location; features outside the cropped region are simply dropped. In practice, we crop 6 × 6 neighborhood regions out of the 27 × 27 conv5 feature maps to match the output size of the object stream. The resulting receptive field of a cropped feature map has a width of 163 + 16 × 5 = 243 pixels.
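The part crop layer itself amounts to indexed slicing on the shared feature maps. The sketch below shows one possible forward pass; border handling by clipping the crop window into the map is an assumption, since the paper does not specify it.

```python
import numpy as np

def part_crop(features, part_locations, size=6):
    """Crop a size x size window around each detected part (the part crop layer).

    features       : (27, 27, C) shared feature map, e.g. the conv5_1 output.
    part_locations : list of (h, w) conv5 coordinates, (-1, -1) for missing parts.
    Returns an (M, size, size, C) stack; missing parts yield all-zero crops.
    """
    H, W, C = features.shape
    half = size // 2
    crops = np.zeros((len(part_locations), size, size, C), dtype=features.dtype)
    for m, (h, w) in enumerate(part_locations):
        if h < 0:                                  # part discarded at inference
            continue
        h0 = int(np.clip(h - half, 0, H - size))   # keep the window inside the map
        w0 = int(np.clip(w - half, 0, W - size))
        crops[m] = features[h0:h0 + size, w0:w0 + size, :]
    return crops
```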

Object stream. The object stream utilizes bounding-box-level supervision to capture object-level semantics for fine-grained recognition. It follows the general architecture of CaffeNet, in which the input of the network is a 227 × 227 RGB image and the output of the pool5 layer is a set of 6 × 6 feature maps.

We find the design of the two-stream architecture in PS-CNN analogous to the famous Deformable Part-based Models [12], in which object-level features are captured through a root filter at a coarser scale, while detailed part-level information is modeled by several part filters at a finer scale. We find it critical to measure visual cues from multiple semantic levels in an object recognition algorithm.

Dimension reduction and fully connected layers. The aforementioned two-stream architecture generates an individual feature map for each object part and for the bounding box. When conducting classification, they serve as an over-complete set of CNN features from multiple scales. Following the standard CaffeNet architecture, we employ a DNN including three fully connected layers as the object classifier. The first fully connected layer, fc6, now becomes a part concatenation layer whose input is generated by stacking the output feature maps of the part stream and the object stream together. However, such a concatenation requires M + 1 times more model parameters than the original fc6 layer in CaffeNet, which leads to a huge memory cost.

To reduce the number of model parameters, we introduce a 1 × 1 convolutional layer termed conv5_1 in the part stream that projects the 256-dimensional conv5 output to 32 dimensions. It is equivalent to a low-rank projection of the layer output and thus could be initialized through standard PCA. Nevertheless, in our experiments, we find that directly initializing the weights of this additional convolution by PCA worsens performance in practice. To enable domain-specific fine-tuning from pre-trained CNN weights, we instead train an auxiliary CNN to initialize the weights of the additional convolutional layer.

Let X_c ∈ ℝ^{N×M×6×6} be the c-th 6 × 6 region cropped around the center point (h_c^*, w_c^*) from the conv5_1 feature maps X ∈ ℝ^{N×M×27×27}, where (h_c^*, w_c^*) is the predicted location for part c. The output of the part concatenation layer fc6 can be formulated as:

f_{out}(X) = \sigma\Big( \sum_{c=1}^{M} (W^c)^T X_c \Big),    (3)

where W^c denotes the model parameters for part c in the fc6 layer, and σ is an activation function.
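Concretely, Eq. (3) applies one weight block per part to its cropped, dimension-reduced features and sums the results. The sketch below is an assumed forward pass of this part-concatenation fc6: the object-stream term, the bias, and the choice of ReLU as σ are added for completeness, and the 4096-dimensional output simply follows the CaffeNet fc6 width.

```python
import numpy as np

def part_concat_fc6(part_crops, bbox_pool5, W_parts, W_bbox, b):
    """fc6 as a part-concatenation layer (cf. Eq. 3).

    part_crops : (M, 6, 6, 32) cropped part features (zeros for missing parts).
    bbox_pool5 : (6, 6, 256) object-stream pool5 output.
    W_parts    : (M, 6*6*32, 4096) one weight block per part.
    W_bbox     : (6*6*256, 4096) weight block for the object stream.
    b          : (4096,) bias.
    """
    out = bbox_pool5.reshape(-1) @ W_bbox + b
    for c in range(part_crops.shape[0]):
        out += part_crops[c].reshape(-1) @ W_parts[c]
    return np.maximum(out, 0.0)                    # ReLU as the activation sigma
```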

We use standard gradient descent to train the classification network. The most complicated part of computing gradients lies in the dimension reduction layer, due to the impact of part cropping. Specifically, the gradient of each cropped part feature map (of 6 × 6 spatial resolution) is projected back to the original size of conv5 (27 × 27 feature maps) according to the respective part location, and the projected gradients are then summed up. Note that the proposed PS-CNN is implemented as a two-stage framework, i.e., after training the FCN, the weights of the localization network are fixed while training the classification network.

Model architecture                MPK    MRK    APK
conv5+cls                         70.0   80.6   83.5
conv5+conv6(256)+cls              71.3   81.8   84.7
conv5+conv6(512)+cls              71.5   81.9   84.8
conv5+conv6(512)+cls+gaussian     80.0   83.8   86.6

Table 1: Comparison of different model architectures on localization results. "conv5" stands for the first 5 convolutional layers in CaffeNet; "conv6(256)" stands for the additional 1 × 1 convolutional layer with 256 output channels; "cls" denotes the classification layer with M + 1 output channels; "gaussian" represents a Gaussian kernel for smoothing.

    4 Experiments

We present experimental results and analysis of the proposed method in this section. Specifically, we evaluate the performance from four different aspects: localization accuracy, classification accuracy, inference efficiency, and model interpretation.

    4.1 Dataset and implementation details

Experiments are conducted on the widely used fine-grained classification benchmark, the Caltech-UCSD Birds dataset (CUB-200-2011) [41]. The dataset contains 200 bird categories with roughly 30 training images per category. In the training phase we adopt the strong supervision available in the dataset, i.e., we employ 2D key point annotations of altogether M = 15 object parts, together with image-level labels and object bounding boxes.

The proposed Part-Stacked CNN architecture is implemented using the open-source package Caffe [16]. Specifically, bounding-box-cropped input images are warped to a fixed size of 512 × 512, randomly cropped to 454 × 454, and then fed into the localization network and the part stream of the classification network. We employ a pooling layer in the object stream that downsamples the 454 × 454 input to 227 × 227 to keep the two streams of the classification network synchronized.
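The input pipeline described above might be sketched as follows. The function and its parameters are illustrative; in particular, strided subsampling is used as a cheap stand-in for the pooling layer that feeds the object stream.

```python
import numpy as np
from PIL import Image

def prepare_inputs(bbox_crop, train=True):
    """Prepare the two network inputs from a bounding-box-cropped PIL image.

    Returns the 454 x 454 crop for the localization network / part stream and
    a 227 x 227 stride-2 downsampled version for the object stream.
    """
    img = np.asarray(bbox_crop.resize((512, 512), Image.BILINEAR), dtype=np.float32)
    if train:                                      # random 454 x 454 crop
        y, x = np.random.randint(0, 512 - 454 + 1, size=2)
    else:                                          # center crop at test time
        y = x = (512 - 454) // 2
    crop454 = img[y:y + 454, x:x + 454, :]
    crop227 = crop454[::2, ::2, :]                 # stand-in for the pooling layer
    return crop454, crop227
```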


part   throat   beak    crown   forehead   right eye   nape        left eye    back
APK    0.908    0.894   0.894   0.885      0.861       0.857       0.850       0.807
part   breast   belly   right leg   tail   left leg    right wing  left wing   overall
APK    0.799    0.794   0.775       0.760  0.750       0.678       0.670       0.866

Table 2: APK for each object part on the CUB-200-2011 test set, in descending order.

Figure 5: Typical localization results on the CUB-200-2011 test set. We show 6 of the 15 detected parts here: beak (red), belly (green), crown (blue), right eye (yellow), right leg (magenta), tail (cyan). Best viewed in color.

4.2 Localization results

We quantitatively assess the localization correctness using three metrics. The first two are MPK (Mean Precision of Key points over images) and MRK (Mean Recall of Key points over images), which calculate precision and recall for the detected key points in each image and then average them over all test images. Suppose that image I has n_gt ground-truth parts and the proposed method predicts n_pd parts, of which n_tp are correctly located; MPK and MRK are then computed as:

MPK = \frac{1}{N} \sum_{I \in \mathcal{I}} \frac{n_{tp}}{n_{pd}}, \qquad MRK = \frac{1}{N} \sum_{I \in \mathcal{I}} \frac{n_{tp}}{n_{gt}},    (4)

where N is the number of test images in the dataset.

We also adopt APK (Average Precision of Key points) [46] to explicitly study the localization performance for each part. Following [23], we consider a key point to be correctly predicted if the prediction lies within a Euclidean distance of α times the maximum of the bounding box width and height from the ground truth. We set α = 0.1 in all the analysis below.
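For reference, a simple implementation of the per-image precision and recall behind MPK and MRK (Eq. 4) is sketched below; APK additionally ranks detections by confidence and is omitted here. The array layout is an assumption made for illustration.

```python
import numpy as np

def mpk_mrk(preds, gts, bbox_sizes, alpha=0.1):
    """Compute MPK and MRK over a set of test images (Eq. 4).

    preds, gts : lists of (M, 2) keypoint arrays per image; rows of (-1, -1)
                 denote undetected (preds) or unannotated (gts) parts.
    bbox_sizes : list of (width, height) of the object bounding box per image.
    """
    precisions, recalls = [], []
    for p, g, (bw, bh) in zip(preds, gts, bbox_sizes):
        pred_mask, gt_mask = p[:, 0] >= 0, g[:, 0] >= 0
        thresh = alpha * max(bw, bh)
        dist = np.linalg.norm(p - g, axis=1)
        tp = np.sum(pred_mask & gt_mask & (dist <= thresh))
        precisions.append(tp / max(pred_mask.sum(), 1))
        recalls.append(tp / max(gt_mask.sum(), 1))
    return float(np.mean(precisions)), float(np.mean(recalls))
```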

The results of the different FCN architectures evaluated in this paper are outlined in Table 1. Introducing an additional 1 × 1 convolutional layer after the first five layers of CaffeNet yields a notable performance improvement. The Gaussian kernel also contributes to the localization accuracy by removing isolated noise, bringing in particular a nearly 10% improvement in MPK. The final localization network achieves an encouraging 86.6% APK on the test set of CUB-200-2011 for 15 object parts.

Furthermore, we present per-part APKs in Table 2. An interesting phenomenon here is that parts residing near the head of the bird tend to be located more accurately. It turns out that the bird's head has a relatively stable structure with fewer deformations and a lower probability of being occluded. On the contrary, parts that are highly deformable, such as wings and legs, get lower APK values. Figure 5 shows typical localization results of the proposed method.


BBox only   +2 parts   +4 parts   +8 parts   +15 parts
69.08       73.72      74.84      76.22      76.15

Table 3: The effect of increasing the number of object parts on the classification accuracy.

4.3 Classification results

We begin the analysis of classification results with a study of the discriminative power of each object part. Each time, we select one object part as the input and discard the computation of all other parts. Different parts yield significantly different classification results: the most discriminative part, crown, by itself achieves a quite impressive accuracy of 57%, while the lowest accuracy is only 10%, for the part beak. Therefore, to obtain better classification results, it may be beneficial to find a rational combination or ordering of object parts instead of directly running the experiments on all parts together.

We therefore introduce a strategy that incrementally adds object parts to the whole framework and iteratively trains the model. Specifically, starting from a model trained with bounding-box supervision only, which is also the baseline of the proposed method, we iteratively insert object parts into the framework and re-finetune the PS-CNN model. The number of parts inserted in each iteration increases exponentially, i.e., in the i-th iteration, 2^i parts are selected and inserted. When starting from an initialized model with relatively high performance, introducing a new object part into the framework does not require running a brand-new classification procedure based on this specific part alone; ideally, only the classification of highly confusing categories that may be distinguished through the new part will be impacted and amended. As a result, this procedure overcomes the drawback caused by object parts with low discriminative power. Table 3 shows that as the number of object parts increases from 0 to 8, the classification accuracy improves gradually and then becomes saturated. Further increasing the number of parts does not lead to better accuracy; however, it does provide more resources for explicit model interpretation.

Method                 Train Anno.   Test Anno.   Acc.
Alignment [13]         n/a           n/a          53.6
Attention [44]         n/a           n/a          69.7
Bilinear-CNN [21]      n/a           n/a          72.5
CNNaug [30]            BBox          BBox         61.8
Alignment [13]         BBox          BBox         67.0
No parts [18]          BBox          BBox         74.9
Bilinear-CNN [21]      BBox          BBox         77.2
Part R-CNN [47]        BBox+Parts    n/a          73.9
PoseNorm CNN [6]       BBox+Parts    n/a          75.7
POOF [3]               BBox+Parts    BBox         56.8
DPD+DeCAF [11]         BBox+Parts    BBox         65.0
Part R-CNN [47]        BBox+Parts    BBox         76.4
PS-CNN (this paper)    BBox+Parts    BBox         76.2

Table 4: Comparison with state-of-the-art methods on the CUB-200-2011 dataset. To conduct fair comparisons, for all methods using deep features, we report their results on the standard seven-layer architecture (AlexNet) where possible.

Table 4 shows the performance comparison between PS-CNN and existing fine-grained recognition methods. The complete PS-CNN model with a bounding box and 15 object parts achieves 76% accuracy, which is comparable with state-of-the-art methods including part-based R-CNN [47] and bilinear CNN [21]. In particular, our model is over two orders of magnitude faster than [47], requiring only 0.05 seconds to perform end-to-end classification on a test image. This number is quite encouraging, especially considering the number of parts used in the proposed method.

    4.4 Model interpretation

The proposed approach adopts a part-based strategy to provide visual manuals for fine-grained visual categorization. In particular, we discover the most discriminative object parts for distinguishing a category from all other bird species (one-versus-all) and also from its most similar categories (one-versus-most). This is an offline process achieved by calculating the classification performance gain obtained by inserting a part into the bounding-box-only training scheme.

The model interpretation routine is demonstrated in Figure 6.



Figure 6: Example of the prediction manual generated by the proposed approach. Given a test image, the system reports its predicted class label with some typical exemplar images. Part-based comparison criteria between the predicted class and its most similar classes are shown in the right part of the image. The number in brackets shows the confidence of distinguishing the two categories by introducing a specific part. We present the top three object parts for each pair of comparisons. For each of the parts, three part-center-cropped patches are shown for the predicted class (upper rows) and the compared class (lower rows), respectively.

When a test image is presented, the proposed method first conducts object classification through the PS-CNN architecture. The predicted category is presented by a set of images in the dataset that are closest to the test image according to their conv5_1 outputs. Beyond the classification result, the proposed method also presents classification criteria for distinguishing the predicted category from its most similar neighbor classes based on object parts. Again we use the output of the conv5_1 layer, but after performing part cropping, to retrieve the nearest-neighbor part patches of the input test image. The procedure described above provides an intuitive visual guide for distinguishing fine-grained categories.

5 Conclusion

In this paper, we proposed a novel model for fine-grained recognition called Part-Stacked CNN. The model exploits detailed part-level supervision, in which object parts are first located by a fully convolutional network, followed by a two-stream classification network that explicitly captures object-level and part-level information. Experiments on the CUB-200-2011 dataset revealed the effectiveness and efficiency of PS-CNN, especially the impact of introducing object parts on fine-grained visual categorization tasks. Meanwhile, we presented human-understandable interpretations of the proposed method, which can be used as a visual field guide for studying fine-grained categorization.

We have discussed the application of the proposed Part-Stacked CNN to fine-grained visual categorization with strong supervision. In fact, PS-CNN can be easily generalized to various applications. Examples include:

1) Discarding the requirement of strong supervision. The goal of introducing manually-labeled part annotations here is to generate human-understandable visual guides. However, one can also exploit unsupervised part discovery methods [18] to define object parts automatically. This strategy has the potential to achieve classification accuracy comparable to strongly supervised methods, while requiring far less human labeling effort.

2) Attribute learning. The application scenario of PS-CNN is not restricted to fine-grained recognition tasks. For instance, the performance of recommendation systems for online shopping [24] could definitely benefit from the analysis of clothing attributes from local parts. The proposed PS-CNN, in this case, could be an effective and efficient choice to perform part-based garment analysis.

3) Context-based CNN. The role of local "parts" in PS-CNN can also be played by global contexts. For objects that are small in size and have no obvious object parts, such as volleyballs or tennis balls, modeling global contexts instead of local parts could be a practical solution. Such an architecture can be derived from the proposed PS-CNN without significant structural changes.

We regard these directions as future work.

    References

[1] A. Angelova, S. Zhu, and Y. Lin. Image segmentation for large-scale subcategory flower recognition. In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pages 39–45. IEEE, 2013.

[2] T. Berg and P. N. Belhumeur. How do you tell a blackbird from a crow? In Computer Vision (ICCV), 2013 IEEE International Conference on, pages 9–16. IEEE, 2013.

[3] T. Berg and P. N. Belhumeur. POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In CVPR 2013, pages 955–962. IEEE, 2013.

[4] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In Computer Vision and Pattern Recognition (CVPR), 2014, pages 2019–2026. IEEE, 2014.

[5] L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. In NIPS, pages 244–252, 2010.

[6] S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.

[7] S. Branson, G. Van Horn, C. Wah, P. Perona, and S. Belongie. The ignorant led by the blind: A hybrid human–machine vision system for fine-grained categorization. International Journal of Computer Vision, 108(1-2):3–29, 2014.

[8] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In Computer Vision–ECCV 2010, pages 438–451. Springer, 2010.

[9] Y. Chai, V. Lempitsky, and A. Zisserman. Symbiotic segmentation and part localization for fine-grained categorization. In ICCV 2013, pages 321–328. IEEE, 2013.

[10] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 580–587. IEEE, 2013.

[11] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.

[12] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. PAMI, 32(9):1627–1645, 2010.

[13] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars. Fine-grained categorization by alignments. In ICCV 2013, pages 1713–1720. IEEE, 2013.

[14] E. Gavves, B. Fernando, C. G. Snoek, A. W. Smeulders, and T. Tuytelaars. Local alignments for fine-grained categorization. International Journal of Computer Vision, 111(2):191–212, 2014.

[15] G. Gkioxari, R. Girshick, and J. Malik. Actions and attributes from wholes and parts. arXiv preprint arXiv:1412.2604, 2014.

[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.

[17] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR 2011. Citeseer, 2011.

[18] J. Krause, H. Jin, J. Yang, and L. Fei-Fei. Fine-grained recognition without part annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5546–5555, 2015.

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.

[20] N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C. Lopez, and J. V. Soares. Leafsnap: A computer vision system for automatic plant species identification. In Computer Vision–ECCV 2012, pages 502–516. Springer, 2012.

[21] T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. arXiv preprint arXiv:1504.07889, 2015.

[22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[23] J. L. Long, N. Zhang, and T. Darrell. Do convnets learn correspondence? In Advances in Neural Information Processing Systems, pages 1601–1609, 2014.

[24] K. M. Hadi, H. Xufeng, L. Svetlana, B. Alexander, and B. Tamara. Where to buy it: Matching street clothing photos in online shops. In Computer Vision (ICCV), 2015 IEEE International Conference on, 2015.

[25] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.

[26] S. Maji and G. Shakhnarovich. Part and attribute discovery from relative annotations. International Journal of Computer Vision, 108(1-2):82–96, 2014.

[27] O. Matan, C. J. Burges, Y. Le Cun, and J. S. Denker. Multi-digit recognition using a space displacement neural network. 1995.

[28] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICVGIP'08, pages 722–729. IEEE, 2008.

[29] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. Jawahar. Cats and dogs. In CVPR 2012, pages 3498–3505. IEEE, 2012.

[30] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In CVPRW 2014, pages 512–519. IEEE, 2014.

[31] E. Rosch, C. B. Mervis, W. D. Gray, D. M. Johnson, and P. Boyes-Braem. Basic objects in natural categories. Cognitive Psychology, 8(3):382–439, 1976.

[32] J. Sánchez, F. Perronnin, and Z. Akata. Fisher vectors for fine-grained visual categorization. In FGVC Workshop in IEEE Computer Vision and Pattern Recognition (CVPR), 2011.

[33] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.

[34] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[35] M. Stark, J. Krause, B. Pepik, D. Meger, J. J. Little, B. Schiele, and D. Koller. Fine-grained categorization for 3D scene understanding. International Journal of Robotics Research, 30(13):1543–1552, 2011.

[36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.

[37] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.

[38] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 595–604, 2015.

[39] A. Vedaldi, S. Mahendran, S. Tsogkas, S. Maji, R. Girshick, J. Kannala, E. Rahtu, I. Kokkinos, M. B. Blaschko, D. Weiss, et al. Understanding objects in detail with fine-grained attributes. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 3622–3629. IEEE, 2014.

[40] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localization with humans in the loop. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 2524–2531. IEEE, 2011.

[41] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.

[42] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie. Similarity comparisons for interactive fine-grained categorization. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 859–866. IEEE, 2014.

[43] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. 2010.

[44] T. Xiao, Y. Xu, K. Yang, J. Zhang, Y. Peng, and Z. Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. arXiv preprint arXiv:1411.6447, 2014.

[45] Z. Xu, S. Huang, Y. Zhang, and D. Tao. Augmenting strong supervision using web data for fine-grained categorization. In Computer Vision (ICCV), 2015 IEEE International Conference on, 2015.

[46] Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(12):2878–2890, 2013.

[47] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In ECCV 2014, pages 834–849. Springer, 2014.

[48] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. PANDA: Pose aligned networks for deep attribute modeling. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1637–1644. IEEE, 2014.