
A Multi-Modal, Discriminative and Spatially Invariant CNN for RGB-D Object Labeling

Umar Asif, Senior Member, IEEE, Mohammed Bennamoun, and Ferdous A. Sohel, Senior Member, IEEE

Abstract—While deep convolutional neural networks have shown remarkable success in image classification, the problems of inter-class similarities, intra-class variances, the effective combination of multi-modal data, and the spatial variability in images of objects remain major challenges. To address these problems, this paper proposes a novel framework to learn a discriminative and spatially invariant classification model for object and indoor scene recognition using multi-modal RGB-D imagery. This is achieved through three postulates: 1) spatial invariance: this is achieved by combining a spatial transformer network with a deep convolutional neural network to learn features which are invariant to spatial translations, rotations, and scale changes; 2) high discriminative capability: this is achieved by introducing Fisher encoding within the CNN architecture to learn features which have small inter-class similarities and large intra-class compactness; and 3) multi-modal hierarchical fusion: this is achieved through the regularization of semantic segmentation to a multi-modal CNN architecture, where class probabilities are estimated at different hierarchical levels (i.e., image- and pixel-levels), and fused into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces consistent class labels in RGB-D images. Extensive experimental evaluations on RGB-D object and scene datasets, and live video streams (acquired from Kinect) show that our framework produces superior object and scene classification results compared to the state-of-the-art methods.

Index Terms—RGB-D object recognition, 3D scene labeling, semantic segmentation


1 INTRODUCTION

The emergence of affordable RGB-D sensors (e.g., Microsoft Kinect), together with the recent advancements in real-time surface reconstruction techniques (e.g., [1]), have increased the demand for the generation of models for the understanding of indoor environments at the object- and scene-levels. These models can be used in several applications such as augmented reality and robotics (e.g., object grasping). In recent years, deep convolutional neural networks have produced a high performance in image classification [2], [3]. However, object recognition and scene classification using real-world RGB-D data is still challenging due to a number of factors such as: large inter-class similarities, large intra-class variances, variations in viewpoints, changes in lighting conditions, occlusions, and background clutter [4], [5]. Another major challenge with existing CNN models is the requirement of large labelled datasets to avoid over-fitting during model training. The collection of large labelled datasets can be very time-consuming and expensive (especially in the case of multiple input modalities). To address these challenges, we propose a novel framework with a superior classification performance, while automatically learning discriminative features with a limited amount of training data using existing deep CNN models. In developing our framework for scene analysis, we particularly focus on object labeling in 3D reconstructed scenes, arguably the most challenging scene-understanding task. The final outcome of our framework is an object-aware scene reconstruction encompassing both object-level segmentation and class labels. In summary, we extend our work in [6] to propose a novel framework and contribute to the task of 3D scene analysis as follows:

1) We propose a novel classification model (Section 4.2) termed Multi-modal Discriminative Spatially Invariant CNN (MDSI-CNN) with regularizations of spatial invariance (Section 4.2.1) and Fisher encoding (Section 4.2.4) to learn highly discriminative multi-modal feature representations at different hierarchical levels (i.e., image- and pixel-levels).

2) We propose an enhanced version of the inference algorithm in [6] (Section 4.3), which fuses the outputs of our classification model into a Conditional Random Field (CRF)-based inference hypothesis, the optimization of which produces a semantic labeling of the scene. The enhancement is achieved by the incorporation of novel clique potentials which complement the unary and pairwise terms of the CRF and produce improved labeling of objects in real-world scenes.

3) We present an extensive evaluation of our framework on four challenging datasets: i) the WRGBD object dataset [7] for object recognition, ii) the WRGBD scene dataset [8]

U. Asif is with IBM Research, Melbourne, Vic 3053, Australia. E-mail: [email protected].

M. Bennamoun is with the University of Western Australia, Crawley, WA 6009, Australia. E-mail: [email protected].

F.A. Sohel is with Murdoch University, Murdoch, WA 6150, Australia. E-mail: [email protected].

Manuscript received 22 Jan. 2017; revised 7 Aug. 2017; accepted 24 Aug. 2017. Date of publication 29 Aug. 2017; date of current version 13 Aug. 2018. (Corresponding author: Umar Asif.) Recommended for acceptance by A. Geiger. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPAMI.2017.2747134


for object labeling in 3D reconstructed scenes, iii) the SUNRGBD dataset [9], and iv) the NYU-V2 dataset [10] for scene recognition and semantic segmentation. We also performed experiments for object labeling of real-world scenes using live videos captured by a Kinect. The performance of our framework on the real-world scenes reveals its potential for robotics applications (see the video in the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2017.2747134).

2 RELATED WORK

One stream of scene understanding work uses off-the-shelf 3D models and object-level segmentation. For instance, methods such as [11] perform segmentation into semantic regions, which are then individually matched to 3D CAD models of a model database. Although these methods produce high-quality object recognition results, they require the construction of a 3D model database offline, which is expensive and time consuming for large-scale classification problems. Furthermore, these methods are computationally expensive (their runtimes generally scale from a few seconds to minutes), which restricts their use in real-time scene analysis applications. Another stream of scene understanding work adopts an image-classification approach. These methods (e.g., [10], [12]) build dense scene reconstructions and perform pixel-wise labeling on the 2D image space. The labeling results are then projected on the final 3D model. These methods only use image-level class information during the training and the inference processes, and produce considerable classification errors in the case of objects with missing information (e.g., due to occlusions). In contrast, our model combines image-level classification and pixel-level semantic labeling, and produces considerably higher recognition accuracy compared to the aforementioned methods.

Recently, multi-view-based object recognition has gained much consideration due to its robustness to scale and viewpoint changes. In this context, methods (e.g., [8], [13]) use Markov Random Fields (MRFs) to label point clouds using cues from view-based sliding window object detectors. For instance, Lai et al. [8] used HOG-based sliding-window detectors (trained from object views on the RGB-D dataset [7] and background images from the RGB-D scene dataset [8]) to compute object-class predictions in a voxelized merged point cloud. Although these methods achieve excellent 3D labeling results, their reliance on sliding window detectors (where the learned model is scanned over the entire image in a sliding-window fashion across multiple scales for each target class) limits their scalability for real-time applications (because the runtime performance directly depends on the number of target object classes). Furthermore, these methods are based on hand-crafted features, which means that feature extraction and classification are not performed within a unified optimization framework.

Our method is also related to recent popular and successful CNN-based object detection and recognition approaches such as R-CNN [3]. These methods work as follows: First, they generate initial object proposals (which typically range from one thousand to several thousands per image) in an image [14]. Next, they extract CNN-based features from the generated object proposals, and finally run an object classifier on the extracted features to generate classification scores. These methods dramatically reduce the search space compared to the existing sliding-window based approaches and have demonstrated state-of-the-art performance on the Pascal VOC object detection and segmentation tasks. Inspired by R-CNN, our framework complements the work of RGB-D object recognition in the following ways: First, our classification model provides regularization of spatial invariance on the learned features. This is achieved by the introduction and learning of a Spatial Transformer Network [15] at the start of our model architecture to learn features which are robust to spatial variations (e.g., translations, rotations, and scale changes), occlusions and background clutter. Second, our classification model computes class probabilities (termed CNN scores) at image- and pixel-levels, and fuses these probabilities into a cumulative probabilistic output that is integrated into a CRF-based inference algorithm to produce the semantic labeling of 3D reconstructed scenes. In experiments (Section 5.11), we show that our hierarchical fusion approach produces considerable improvements in recognition accuracy for semantic labeling of 3D reconstructed scenes, compared to image-wise or pixel-wise classification methods (e.g., [16] or [17]).

Our hierarchical fusion is inspired by the work in [18], where the authors propose a regularization of pixel-level segmentation to improve their image-level recognition accuracies. There are, however, several differences between our work and the work in [18]. First, our classification model provides an additional regularization of Fisher encoding within the CNN architecture and learns features that have large inter-class scatter and small intra-class variance. Second, our model employs an efficient multi-modal fusion architecture, where the proposed model uses multiple CNNs (one for each input modality) to learn modality-specific features at the lower layers and multi-modal features at the higher fusion layers. This enables our model to learn both the discriminative and the complementary information between different input modalities, compared to the method in [18], which treats multi-modal data as an undifferentiated multichannel input to a single CNN.

3 OVERVIEW OF THE PROPOSED FRAMEWORK

The basic setting of our problem is 3D scene labeling using an RGB-D sensor. Fig. 1 shows the overall architecture of our framework. The input is a continuous stream of RGB-D images of an indoor scene (see Fig. 1A), and the output is a full 3D reconstructed scene with object-class information (see Fig. 1I). Our framework consists of three main sub-systems as described below.

1) A camera-pose tracking sub-system (Fig. 1B), which produces the camera pose of the current image frame using our algorithm in [6].

2) An object proposal generation sub-system (Fig. 1D), which takes the current image as input and generates object proposals using our algorithms in [6], [19]. Specifically, the framework first generates an over-segmentation of the scene into patches (Fig. 1C), and then merges them into object proposals


using graph cuts and merging relations (e.g., shape-convexity and co-planarity [20]).

3) An object labeling sub-system (Fig. 1F), which computes object-class scores per image using the proposed classification model (Fig. 1G) and infers the final labeling using our CRF-based inference algorithm (Fig. 1H).

In the following, we describe our object labeling sub-system (Section 4), which is one of the main contributions of this paper.

4 PROPOSED OBJECT-LABELING (FIG. 1F)

4.1 Input Image Representation (Fig. 1E)

We use a 224 × 224 × 3 × N_t-dimensional representation of an RGB-D image as an input to our classification model. This representation is computed using our method in [21], which generates N_t = 6 three-dimensional feature maps from RGB-D point clouds as shown in Fig. 1E. The first feature map f_rgb corresponds to the RGB values of the points of the point cloud. The second feature map f_lab represents the channels of the CIELab colorspace. The third feature map f_pntcloud represents the x, y, and z values of the point cloud. The fourth feature map f_normals represents the x, y, and z components of the local surface normals of the points of the point cloud. The fifth and the sixth feature maps (f_dist and f_angles) represent projected distances and angular orientations of the points of the point cloud with respect to the centroid of the object, respectively. Specifically, f_angles encodes the orientations (V_1, V_2, and V_3) of the surface normals of the points in a point cloud P with respect to its centroid c_P. For a point p_i ∈ P, the orientations are computed with respect to the centroid of the point cloud as in [22], but without the viewpoint component of [22] to retain rotational invariance (with respect to the viewpoint), which is required to build a general representation of an object when viewed from different viewpoints. Mathematically, the orientations are computed as

V_1(p_i) = acos(n_i · v_n),   V_2(p_i) = acos(n_P · v_n),   V_3(p_i) = atan2(r, n),
with r = (n_P × (v_n × n_i)) · n_i and n = n_P · n_i,    (1)

where n_i represents the surface normal of the point p_i, v_n is a normalized vector between p_i and the centroid c_P, and n_P represents the surface normal of the point cloud (estimated by taking the mean of the surface normals of all points in the point cloud). Finally, f_angles is constructed by normalizing the V_1, V_2, and V_3 values between 0 and 255:

f_angles = { ||V_1(p_i)||, ||V_2(p_i)||, ||V_3(p_i)|| },   ∀ p_i ∈ P.    (2)

The feature map f_dist captures the 3D geometry of a point cloud. It is composed of three feature channels D_λ1, D_λ2, and D_λ3. These channels represent the signed distances of the points of a point cloud with respect to its centroid, along each of the three eigenvectors (v_λ1, v_λ2, and v_λ3) of the scatter matrix of the points of the point cloud. Mathematically, D_λ1, D_λ2, and D_λ3 of a point p_i ∈ P are computed as

D_λ1(p_i) = (p_i − c_P) · v_λ1,   D_λ2(p_i) = (p_i − c_P) · v_λ2,   D_λ3(p_i) = (p_i − c_P) · v_λ3.    (3)

Finally, f_dist is constructed by normalizing the D_λ1(p_i), D_λ2(p_i), and D_λ3(p_i) values between 0 and 255. It is given by

f_dist = { ||D_λ1(p_i)||, ||D_λ2(p_i)||, ||D_λ3(p_i)|| },   ∀ p_i ∈ P.    (4)
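To make the construction of these two geometric feature maps concrete, the following is a minimal NumPy sketch of how per-point f_angles and f_dist values could be computed from a point cloud and its surface normals. The per-channel min-max scaling to [0, 255] and the use of SVD to obtain the scatter-matrix eigenvectors are assumptions about implementation details not spelled out in the text, and the projection of the per-point values onto the 224 × 224 image grid is omitted.

```python
import numpy as np

def angle_feature_map(points, normals):
    """Per-point orientation features (Eqs. 1-2): angles of each surface normal
    with respect to the cloud centroid and the mean cloud normal."""
    c = points.mean(axis=0)                                   # centroid c_P
    n_cloud = normals.mean(axis=0)
    n_cloud /= np.linalg.norm(n_cloud) + 1e-8                 # mean surface normal n_P
    v = c - points
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-8      # unit vectors v_n

    v1 = np.arccos(np.clip((normals * v).sum(1), -1.0, 1.0))
    v2 = np.arccos(np.clip(v @ n_cloud, -1.0, 1.0))
    r = (np.cross(n_cloud, np.cross(v, normals)) * normals).sum(1)
    n = normals @ n_cloud
    v3 = np.arctan2(r, n)

    feats = np.stack([v1, v2, v3], axis=1)
    feats -= feats.min(axis=0)                                # normalize each channel
    return 255.0 * feats / (feats.max(axis=0) + 1e-8)         # to [0, 255] as in Eq. (2)

def distance_feature_map(points):
    """Per-point signed distances along the scatter-matrix eigenvectors (Eqs. 3-4)."""
    centered = points - points.mean(axis=0)                   # p_i - c_P
    _, _, vt = np.linalg.svd(centered, full_matrices=False)   # rows of vt: v_lambda1..3
    d = centered @ vt.T                                       # D_lambda1..3 per point
    d -= d.min(axis=0)
    return 255.0 * d / (d.max(axis=0) + 1e-8)

# toy usage with a random cloud standing in for an object proposal
pts = np.random.rand(1000, 3)
nrm = np.random.randn(1000, 3)
nrm /= np.linalg.norm(nrm, axis=1, keepdims=True)
f_angles, f_dist = angle_feature_map(pts, nrm), distance_feature_map(pts)
```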

The individual feature maps of our image representation provide complementary information during CNN learning. To illustrate, we visualize the responses learnt by the first convolutional layer of a CNN model for our feature maps. From Fig. 2, one can observe that the features which qualitatively appear in the CNN responses are considerably different for different feature maps and therefore provide complementary information during feature learning. In experiments we show that the CNN activations extracted from the proposed feature maps are more discriminative compared to the CNN activations extracted from raw RGB and depth images.

4.2 Proposed Classification Model (Fig. 1G)

Fig. 3 illustrates the training process of our Multi-modal Discriminative Spatially Invariant CNN (MDSI-CNN) for two input modalities. It consists of two stages.

Fig. 1. An overview of our framework. Given an input RGB-D video (A), our framework first computes the pose of the camera (B) on a frame-by-frame basis. Next, it generates patches (C) and combines them into object proposals (D). Next, each object proposal is transformed into an image representation (E) and fed as input to our classification model (G), to compute object-class scores, which are integrated into our CRF-based inference algorithm (H) to produce a labeling of the scene (I). Note that for clarity, we show the image representation (E) for only one object proposal.


The first stage generates a batch of augmented training samples using simple rotation, translation, and crop operations for each input modality (a sketch of such an augmentation step is given below). The second stage uses the generated batch and trains our model for image-wise and pixel-wise classification. Specifically, our model is composed of four main branches: i) a Spatial Transformation (ST) branch (Section 4.2.1), ii) an image-wise classification (CNN-I) branch (Section 4.2.2), iii) a pixel-wise classification (CNN-P) branch (Section 4.2.3), and iv) a Fisher-encoding based fusion branch (Section 4.2.4). The ST branch provides a regularization of spatial invariance on the learned features. The CNN-I and the CNN-P branches learn modality-specific features, and the fusion branch combines the high-level CNN activations from multiple modalities through Fisher encoding and feature concatenation, and learns highly discriminative feature representations for image-level and pixel-level classification.
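A minimal PyTorch-style sketch of the augmentation stage, assuming torchvision transforms; the rotation, translation, and crop ranges and the batch size are illustrative assumptions (the paper only names the three operations). Stacking all N_t feature maps along the channel axis ensures that one geometric transform is applied consistently to every modality of a sample.

```python
import torch
import torchvision.transforms as T

# assumed ranges; the paper only names rotation, translation, and crop
augment = T.Compose([
    T.RandomRotation(degrees=15),
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
])

def make_augmented_batch(sample, n_aug=8):
    """sample: tensor of shape (3 * Nt, 224, 224), i.e., all Nt feature maps stacked
    along the channel axis. Returns a batch of shape (n_aug, 3 * Nt, 224, 224) that
    can be split back into per-modality batches before the modality-specific branches."""
    return torch.stack([augment(sample) for _ in range(n_aug)])

batch = make_augmented_batch(torch.rand(18, 224, 224))   # Nt = 6 modalities
```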

4.2.1 Spatial Transformation (ST) Branch

The Spatial Transformation branch uses the Spatial Transformer Network (STN) of [15] and learns to extract the most interesting regions from the input data. It also helps in removing geometric noise from the input data (e.g., due to background clutter). The STN branch is mainly composed of a localization network which takes the original image and a set of transformation parameters (e.g., scaling, cropping, and rotations) as input and outputs a transformed image and a set of desired transformation parameters, which are learnt through the standard back-propagation algorithm in an end-to-end fashion. We apply a trained STN (for each input modality, with scale, rotation and translation parameters) at the start of our network, thus producing transformed image data (for the rest of the network) which is more object-centric and contains significantly less geometric noise. This enables our model to learn features which are invariant to object scale, rotation, and translation changes.
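The following is a minimal PyTorch sketch of an affine spatial transformer in the spirit of [15]: a small localization network predicts a 2 × 3 affine matrix that is used to resample the input feature map. The layer sizes and the initialization to the identity transform are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Affine STN: localization net -> 2x3 affine parameters -> grid resampling."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.loc = nn.Sequential(
            nn.Conv2d(in_ch, 16, 7, stride=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
            nn.Flatten(),
            nn.Linear(32 * 4 * 4, 6),
        )
        # start from the identity transform so early training does not warp the input
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)                    # predicted affine parameters
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)    # resampled (transformed) input

# one STN per input modality, applied before the modality-specific CNN branch
stn_rgb = SpatialTransformer(in_ch=3)
warped = stn_rgb(torch.randn(2, 3, 224, 224))                 # -> (2, 3, 224, 224)
```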

4.2.2 Image-Level Classification: CNN-I Branch

Let us denote the data structure of a single input sample as (X_T, y_k, Y_g), where X_T ∈ R^{H×W×C} represents the image of an object proposal o_i ∈ O, H is the height, W is the width, C is the number of channels, and T denotes the input modality. The term y_k ∈ Z^{|Y|} is the image-level ground truth class label, and Y is the set of ground truth classes (including the background class). The term Y_g ∈ Z^{H×W×|Y|} is the pixel-wise segmentation ground truth with the same height and width as those of X_T. The CNN-I branch is composed of five successive convolutional layers Conv1c ... Conv5c followed by two fully connected layers FC6c and FC7c (see Fig. 3). The layers Conv1c ... FC7c are modality-specific and are trained independently for each input modality.

Fig. 2. CNN responses from the first convolutional layer for the individual feature maps of our input image representation (Section 4.1). Note that the CNN responses are considerably different for different feature maps and therefore provide complementary information.

Fig. 3. An overview of our Multi-modal Discriminative Spatially Invariant CNN (Fig. 1G) training process. It starts with the generation of a batch of augmented training samples for each input modality. Next, the augmented batch is used to train a CNN model (composed of a Spatial Transformation branch (Section 4.2.1), a CNN-I branch (Section 4.2.2), a CNN-P branch (Section 4.2.3), and a Fisher fusion branch (Section 4.2.4)), for image-wise and pixel-wise classification. Note that for clarity, we show the training process for only two input modalities (i.e., RGB color and surface normal feature maps). In experiments, we evaluated our model using six input feature maps generated from RGB-D images (see Fig. 1E and Section 4.1).


The layers FC6c and FC7c compute

H^c_t = f_c(W^c_t H^c_{t−1} + B^c_t),   t ∈ {6, 7},   f_c(H^c_t) = max(0, H^c_t),    (5)

where H^c_t denotes the output of the t-th layer, W^c_t and B^c_t are the trainable parameters of the t-th layer, and f_c denotes the rectification non-linearity (ReLU) [23].

4.2.3 Pixel-Level Classification: CNN-P Branch

The CNN-P branch originates from the Conv5c layer of the CNN-I branch. It is composed of two convolution layers Conv6s and Conv7s, which are modality-specific and are trained independently for each input modality.
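A hedged PyTorch sketch of one modality-specific branch, covering the shared convolutional trunk (Conv1c ... Conv5c), the image-level head (FC6c, FC7c), and the pixel-level head (Conv6s, Conv7s). The trunk is approximated with a torchvision VGG-16 backbone and the layer widths are indicative only; the paper builds its branches on the architecture of [16], whose exact configuration is not reproduced here.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ModalityBranch(nn.Module):
    """One per-modality branch of the classification model (layer sizes are assumptions)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=None)
        self.trunk = vgg.features                       # stands in for Conv1c .. Conv5c
        self.fc_head = nn.Sequential(                   # FC6c, FC7c (image-level features)
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.px_head = nn.Sequential(                   # Conv6s, Conv7s (pixel-level features)
            nn.Conv2d(512, 1024, 3, padding=1), nn.ReLU(),
            nn.Conv2d(1024, 1024, 1), nn.ReLU(),
        )

    def forward(self, x):
        feat = self.trunk(x)                            # (N, 512, 7, 7) for 224x224 input
        return self.fc_head(feat), self.px_head(feat)   # FC7c and Conv7s activations

branch = ModalityBranch()
img_act, px_act = branch(torch.randn(1, 3, 224, 224))
print(img_act.shape, px_act.shape)                      # (1, 4096), (1, 1024, 7, 7)
```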

4.2.4 Multi-Modal Fusion Branches (CNN-FV and FS2)

There are two fusion branches in our model: CNN-FV and FS2. The fusion branch CNN-FV consists of a Fisher encoding module, a concatenation module, a fully connected layer FCA, and a fully connected layer FCB. The layer FCA learns a multi-modal feature representation H_FV by fusing the FC7c activations and the Conv6s activations (from the CNN-P branch) for all input modalities through Fisher encoding [24] and feature concatenation (as shown in Fig. 3). FCA has a size of 4096 × N_t, where N_t denotes the number of input modalities. Its output can be written as H_A = f(W_A H_FV + B_A). The layer FCB has a size equal to the number of target classes (i.e., |Y|). It uses the output vector H_A of the layer FCA and computes |Y|-dimensional scores, given by

H_B = φ(W_B H_A + B_B),   φ(H_t) = e^{H_t} / Σ_{k=1}^{|Y|} e^{H_{t_k}},    (6)

where φ(·) represents the SoftMax function. The computation of the Fisher vector representation H_FV for an input image proceeds as follows: First, we extract local activations from the Conv6s feature maps and activations from the FC7c layer for all the data samples in the augmented batch of the input image (shown in Fig. 3). The Conv6s layer outputs a feature map of 7 × 7 × 512 × 4096 dimensions, which can be regarded as 25,088 (7 × 7 × 512) local activations, each of which is a 4,096-dimensional vector. Next, we randomly pick 200 samples from the extracted activations, and pool the activations into a Fisher vector representation using a vocabulary (of activations) with 128 Gaussian components (built in a separate offline step). The Fisher vector encoding of a set of CNN activations is based on fitting a Gaussian Mixture Model (GMM) to the activations and encoding the derivatives of the log-likelihood of the model with respect to its parameters. Given a GMM with K Gaussians, parametrized by {ϑ_k, μ_k, σ_k; k = 1, ..., K}, the Fisher encoding captures the average first and second order differences between the activations and each of the GMM centres. For an activation x ∈ R^D, we define a vector P(x) = [t_1(x), ..., t_K(x), h_1(x), ..., h_K(x)] ∈ R^{2KD}. The terms t(x) and h(x) are defined as

t_k(x) = (1/√ϑ_k) κ_k(x) (x − μ_k)/σ_k ∈ R^D,
h_k(x) = (1/√(2ϑ_k)) κ_k(x) [ (x − μ_k)² / σ_k² − 1 ] ∈ R^D,    (7)

where ϑ_k, μ_k, and σ_k are the weights, means, and diagonal covariance of the k-th Gaussian, respectively. These terms are computed on the training set. The term κ_k(x) = ϑ_k m_k(x) / Σ_{k=1}^{K} ϑ_k m_k(x) is the posterior probability [24]. The Fisher vector FV for a set of N_d activations (X = {x_1, ..., x_{N_d}}) is computed by the mean-pooling of all the activations in X followed by l2 normalization, given by

FV(X) = (1/N_d) Σ_{i=1}^{N_d} P(x_i),   FV(X) = FV(X) / ||FV(X)||_2.    (8)

Finally, we compress FV to a 4,096-dimensional image-level Fisher representation H_FV using PCA. Our Fisher encoding based fusion layer FCA is trainable; its parameters are initialized through unsupervised Expectation Maximization [48], and updated through gradient descent jointly with the CNN-I and CNN-P branches during training. The fusion branch FS2 is composed of two convolution layers ConvCs and ConvDs, and a deconvolution layer Deconvs. It learns a multi-modal feature representation by fusing the Conv7s activations for all the input modalities through feature concatenation (as shown in Fig. 3). The fusion branch FS2 outputs a prediction map S_p ∈ Z^{H×W×|Y|}, indicating the normalized CNN scores of each pixel with respect to the classes of the target dataset. Fig. 3 shows the max-pooled (across the |Y| classes) scores map for the input image.
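The following is a minimal NumPy/scikit-learn sketch of the Fisher-vector encoding of Eqs. (7) and (8), using a diagonal-covariance GMM as the activation vocabulary. The toy dimensions (64-d activations, 8 components) are placeholders for the paper's 4,096-d activations and 128-component vocabulary, and the subsequent PCA compression to 4,096 dimensions is omitted; scikit-learn is used here only for illustration (the paper's implementation is based on MatConvNet/VLFeat).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(activations, gmm):
    """Fisher-vector encoding (Eqs. 7-8) of a set of D-dimensional CNN activations
    under a diagonal-covariance GMM fitted offline on training activations."""
    X = np.atleast_2d(activations)                              # (Nd, D)
    q = gmm.predict_proba(X)                                    # posteriors kappa_k(x), (Nd, K)
    pi, mu, var = gmm.weights_, gmm.means_, gmm.covariances_    # (K,), (K, D), (K, D)
    diff = (X[:, None, :] - mu[None]) / np.sqrt(var)[None]      # (x - mu_k) / sigma_k

    # first- and second-order statistics, mean-pooled over the Nd activations
    t = (q[..., None] * diff).mean(axis=0) / np.sqrt(pi)[:, None]
    h = (q[..., None] * (diff ** 2 - 1.0)).mean(axis=0) / np.sqrt(2 * pi)[:, None]

    fv = np.concatenate([t.ravel(), h.ravel()])                 # length 2 * K * D
    return fv / (np.linalg.norm(fv) + 1e-12)                    # l2 normalization

# toy usage: a vocabulary fitted offline, then 200 sampled activations encoded per image
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
gmm.fit(rng.normal(size=(2000, 64)))
fv = fisher_vector(rng.normal(size=(200, 64)), gmm)
print(fv.shape)                                                 # (2 * 8 * 64,) = (1024,)
```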

4.2.5 Model Learning

In order to minimize uncertainties in both the pixel-level and the object-level CNN outputs, we define a loss function L_object which is a weighted combination of a pixel-level classification loss function L_seg and an image-level classification loss function L_class. It is given by

L_object = β L_seg + (1 − β) L_class,    (9)

where β controls the relative influence of the pixel-level regularization on the overall training objective function. We empirically found that a value of β = 0.0001 produced the best performance of our model. The pixel-level classification loss function for N_s labeled training images is given by

L_seg = (1/N_s) Σ_{n=1}^{N_s} S_{p_n},    (10)

where S_{p_n} represents the (H × W × |Y|)-dimensional pixel-level CNN normalized scores map for an image n. It is computed by applying a SoftMax to the output of the Deconv layer of the model. The object-level classification loss function is defined as

L_class = (1/N_s) Σ_{n=1}^{N_s} S_{X_n},    (11)

where S_{X_n} represents the |Y|-dimensional image-level CNN normalized scores vector for an image n. It is computed by applying a SoftMax to the output of the layer FCB of the model. The two loss functions are learned by maximizing the log-likelihood functions with respect to the training parameters (W^c_t, B^c_t, W^s_t, B^s_t, ∀t) over the entire training dataset using stochastic gradient descent (SGD).
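As a hedged sketch of this objective, the snippet below combines a pixel-level and an image-level term with the weight β = 0.0001 reported above. Using cross-entropy on the SoftMax outputs for both terms is an assumption consistent with the log-likelihood maximization described in the text, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

seg_criterion = nn.CrossEntropyLoss()   # pixel-wise term over |Y| classes (assumption)
cls_criterion = nn.CrossEntropyLoss()   # image-wise term over |Y| classes (assumption)
beta = 1e-4                             # relative weight of the pixel-level regularization

def object_loss(pixel_logits, image_logits, pixel_gt, image_gt):
    """pixel_logits: (N, |Y|, H, W), image_logits: (N, |Y|),
    pixel_gt: (N, H, W) integer labels, image_gt: (N,) integer labels."""
    l_seg = seg_criterion(pixel_logits, pixel_gt)      # L_seg of Eq. (10)
    l_cls = cls_criterion(image_logits, image_gt)      # L_class of Eq. (11)
    return beta * l_seg + (1.0 - beta) * l_cls         # L_object of Eq. (9)

loss = object_loss(torch.randn(2, 6, 224, 224), torch.randn(2, 6),
                   torch.randint(0, 6, (2, 224, 224)), torch.randint(0, 6, (2,)))
loss.backward()
```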


4.2.6 Model Initialization and Implementation

The CNN-I and the CNN-P branches of our model are built upon the network architecture in [16]. To adapt the common layers of our model to the new object categories and with the new multi-modal input representation (Section 4.1), we followed a two-step procedure. In the first step, for each input modality, we initialized the parameters Conv1c ... FC7c and Conv6s ... Conv7s of our model using the parameters of the model trained over ImageNet, and adjusted the parameters by fine-tuning the model over the target datasets using a learning rate of 0.0001. In the second step, we initialized the parameters of the layers FCA, FCB, ConvCs, and ConvDs of our model from zero-mean Gaussian distributions with a standard deviation of 0.01, and biases set to 0, and trained the parameters using a learning rate of 0.01 and a parameter decay of 0.0005 (on the weights and biases). Our model was implemented using the MatConvNet toolbox [25] in the VLFeat library [26]. For model learning, we used stochastic gradient descent (SGD) with momentum fixed to 0.9.
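A brief sketch of this two-step schedule in a PyTorch-style API (an assumption; the paper's implementation uses MatConvNet). The placeholder modules only stand in for the pretrained layers and the newly added layers; the learning rates, momentum, and weight decay follow the values given above.

```python
import torch

backbone = torch.nn.Linear(10, 10)   # placeholder for Conv1c..FC7c / Conv6s..Conv7s (ImageNet init)
new_head = torch.nn.Linear(10, 5)    # placeholder for FCA, FCB, ConvCs, ConvDs
torch.nn.init.normal_(new_head.weight, std=0.01)   # zero-mean Gaussian init, sigma = 0.01
torch.nn.init.zeros_(new_head.bias)                # biases set to 0

optimizer = torch.optim.SGD(
    [{"params": backbone.parameters(), "lr": 1e-4},   # step 1: fine-tune pretrained layers
     {"params": new_head.parameters(), "lr": 1e-2}],  # step 2: train newly added layers
    momentum=0.9, weight_decay=5e-4)
```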

4.2.7 Model Inference (Fig. 4-Left)

During testing, we apply the proposed CNN model to each object proposal extracted from the test scene (see Fig. 4-left-A). Specifically, for a query proposal o_i ∈ O, a 224 × 224 × 3 × N_t-dimensional image representation X_i is generated (Section 4.1) and passed through the model, which produces a (224 × 224 × |Y|)-dimensional pixel-level scores map (as shown in Fig. 4-left-B) and a |Y|-dimensional image-level scores vector. Given our limited training data and the difficulties of background clutter, large inter-class similarities, and large intra-class variances, we observed that labels inferred directly from the CNN scores are subject to false positives (as shown in Fig. 4-left-C and the zoomed portions of Fig. 4-left-E). To effectively deal with these challenges, we propose a CRF-based inference algorithm (Section 4.3) which uses the outputs of our model with additional pairwise and clique potentials to generate highly accurate semantic labeling of 3D reconstructed scenes.

4.3 Semantic Labeling (Fig. 4-left-G and Fig. 1H)

Our CRF-based inference algorithm is an enhanced version of our voxel-based inference method in [6], which computes a 3D patch-wise posterior distribution over the set of object labels. The enhancement is achieved through our object clique potentials, which complement the unary and pairwise terms of [6] and provide better object-level information, especially for confusing object parts. Specifically, the class scores produced by our model are integrated into the CRF through unary and object-clique potentials. The pairwise terms of the CRF ensure a smooth labeling and allow the labels to propagate outwards to the strong object boundaries in the scene. To achieve this, we first over-segment the scene into 3D semantic patches termed surfels (V). For this, we use the algorithm in [6] (chosen for its high accuracy and real-time runtime). Let each surfel ν_i ∈ V be associated with a label y_i ∈ Y. We define a labeling of surfels V into labels Y by Y_o. The general form of our CRF energy function E is given by

E(Y_o) = Σ_{i∈V} ψ_i(y_i) + Σ_{c∈C} ϕ_c(y_c) + Σ_{{i,j}∈N} ρ_ij(y_i, y_j).    (12)

The term ψ_i in Eq. (12) is the surfel-wise unary potential. It is given by

ψ_i(y_i) = S_{X_i} + (1/|ν_i|) Σ_{(m,n)∈ν_i} α S_{p_i}(m,n),    (13)

where S_{X_i} denotes the image-level CNN scores and S_{p_i}(m,n) denotes the pixel-level CNN scores for a pixel at location (m,n) which belongs to the surfel ν_i. The term α in Eq. (13) is a user-defined weighting parameter, which controls the relative influence of the image-level class information and the pixel-level class information on the overall unary potential. The second term in Eq. (12) is our object clique potential. It is formulated as

ϕ_c(y_c) = Σ_{(m,n)∈c} S_{p_c}(m,n),    (14)

where S_{p_c}(m,n) denotes the pixel-level CNN scores for a pixel at location (m,n) which belongs to the object clique c. To generate object cliques C, we start with the computation of an initial pixel-level label map (using a SoftMax operation on the pixel-level score map produced by the CNN model), as shown in Fig. 4-right. This label map provides a set of regions which give a rough indication of the presence of objects, and

Fig. 4. Left: An illustration of the test process of our classification model for a test image of the WRGBD scene dataset [8]. The object proposals generated from the test image (A) are fed as input to our classification model, which computes pixel-level (B) and object-level (D) class scores. The CNN scores are then fed as unary potentials to our CRF-based inference algorithm (G), which generates the final labeling (H). Note that the outputs of our model are class scores; we show the labelings in (C) and (E) to highlight the differences between the two outputs. Right: Illustration of object cliques. The predicted label for each clique is listed at the bottom for clique identification.


serve as initial candidates of our clique set C. We remove cliques with sizes smaller than 500 points. Next, these initial cliques are merged based on the adjacency of regions in the point cloud to produce new cliques. When two adjacent cliques are merged, the resultant clique is fed to the CNN model to compute its image-level score. The pairwise term ρ_ij(y_i, y_j) in Eq. (12) encodes the interactions between adjacent surfels (we considered a neighborhood of N = 4 adjacent surfels), and favours smooth label consistency around object boundaries. It is defined by a Potts model [27]:

ρ_ij(y_i, y_j) = λ_ij if y_i ≠ y_j, and 0 otherwise.    (15)

This adds a constant penalty (i.e., λ_ij) when adjacent surfels do not share the same label. The term λ_ij is modeled as a linear combination of K kernel functions:

λ_ij = Σ_K θ_K W_K(ν_i, ν_j),    (16)

where W_K(ν_i, ν_j) is a kernel function that captures the contextual relations between adjacent surfels, i.e., how likely two surfels are to carry the same label. The parameters θ_K ∈ Θ control the relative importance of the individual kernel functions and are equally weighted in our experiments. In this work, we define four kernels that altogether capture the intuition that object boundaries tend to have abrupt variations in spatial proximities, colors, and surface normal orientations, and that objects are supported by flat surfaces, leading to concave surface transitions [6]. Specifically, the first kernel is a spatial-proximity kernel which favours assigning the same label to two surfels that are spatially close in the point cloud. It is defined as W_1(ν_i, ν_j) = θ_1 exp(−||p_{ν_i} − p_{ν_j}||²), where p_ν represents the location of a surfel. The second kernel is an appearance kernel which favours the assignment of the same label between surfels with similar colors. It is defined as W_2(ν_i, ν_j) = θ_2 exp(−||c_{ν_i} − c_{ν_j}||²), where c_{ν_i} represents the mean normalized color values of the points of the surfel. The third kernel is a structural kernel which favours the assignment of the same label between surfels with similar surface normal orientations. It is defined as W_3(ν_i, ν_j) = θ_3 exp(−(1 − n_{ν_i} · n_{ν_j})), where n_{ν_i} represents the mean local surface normal of the points of the surfel. The fourth kernel is a shape kernel which is an indicator variable expressing whether the surfels ν_i and ν_j exhibit a convex shape (e.g., the top/side of a box) or a concave shape (e.g., when a box touches its supporting surface). Similar to [6], we compute the shape relationship between two surfels as W_4(ν_i, ν_j) = θ_4 [ (ν_i − ν_j) · n_{ν_i} ∧ (ν_i − ν_j) · n_{ν_j} ], where the indicator function [·] returns 1 if both conditions (combined by the logical AND operator ∧) are satisfied and zero otherwise. We minimize the energy function E using standard graph-cut-based algorithms (e.g., [27]). During testing, we perform an additional voxel-based max-pooling operation on the voxelized reconstructed scene to remove outliers (due to the noise in the depth information). Specifically, after every N_f frames, we first voxelize the reconstructed scene. Next, for each voxel, we re-assign the labels of its constituent 3D points to the value of the label that occurs the most in that voxel. In experiments, we found that a voxel resolution of 1 cm³ with a max-pooling after every N_f = 20 frames produced the best results.
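To make the CRF terms concrete, below is a minimal NumPy sketch of the unary potential of Eq. (13) and the kernel-weighted pairwise penalty of Eqs. (15) and (16) for a pair of adjacent surfels. The value of α, the kernel bandwidths, and the convexity test used in the shape kernel are assumptions; minimizing the full energy of Eq. (12) additionally requires a graph-cut solver (e.g., alpha-expansion), which is not shown.

```python
import numpy as np

def unary_potential(image_scores, pixel_scores, alpha=0.5):
    """Eq. (13): unary term of one surfel from its image-level score vector S_X
    (shape (|Y|,)) and the pixel-level scores S_p (shape (n_pixels, |Y|)) of its pixels.
    alpha = 0.5 is an assumed value for the user-defined weighting parameter."""
    return image_scores + alpha * pixel_scores.mean(axis=0)

def pairwise_weight(p_i, c_i, n_i, p_j, c_j, n_j, theta=(1.0, 1.0, 1.0, 1.0)):
    """Eq. (16): penalty lambda_ij added by the Potts term (Eq. 15) when two adjacent
    surfels take different labels. p_*: surfel positions, c_*: mean colours,
    n_*: mean unit normals. The convexity test is an assumed convention."""
    w1 = theta[0] * np.exp(-np.sum((p_i - p_j) ** 2))       # spatial-proximity kernel
    w2 = theta[1] * np.exp(-np.sum((c_i - c_j) ** 2))       # appearance kernel
    w3 = theta[2] * np.exp(-(1.0 - float(n_i @ n_j)))       # structural kernel
    d = p_i - p_j
    convex = (d @ n_j) > 0 and (-(d @ n_i)) > 0             # shape kernel (assumption)
    w4 = theta[3] * float(convex)
    return w1 + w2 + w3 + w4

# toy usage for one surfel and one adjacent pair
rng = np.random.default_rng(1)
u = unary_potential(rng.random(6), rng.random((50, 6)))
lam = pairwise_weight(np.zeros(3), np.full(3, 0.5), np.array([0., 0., 1.]),
                      np.array([0.02, 0., 0.]), np.full(3, 0.45), np.array([0., 0., 1.]))
```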

5 EXPERIMENTAL EVALUATION

We evaluated our framework for four tasks: i) object labeling in 2D and 3D reconstructed scenes (Section 5.2) using the WRGBD scene dataset [8], ii) object category and instance recognition (Section 5.3) using the WRGBD object dataset [7], iii) scene classification (Section 5.4) using the SUNRGBD dataset [9] and the NYU V2 dataset [10], and iv) semantic segmentation (Section 5.5) using the SUNRGBD dataset. We also present an experimental evaluation using RGB-D videos of real-world scenes acquired from a Kinect (Section 5.11).

5.1 Datasets

The WRGBD object dataset contains RGB-D images of 300 objects organized into 51 categories. The dataset is challenging because it contains a large variety of textured objects (e.g., food bags or cereal boxes) as well as texture-less objects (e.g., bowls, fruits, or vegetables) captured from different viewpoints. The WRGBD scene dataset consists of eight video sequences of office, kitchen, and meeting room environments. Each sequence contains between 1,700 and 3,000 RGB-D image frames containing objects placed on flat surfaces. The scene images contain multiple instances of objects in the presence of background clutter, under different scales and viewpoints. The SUNRGBD dataset contains RGB-D images of indoor scenes categorized into 47 classes. Specifically, the dataset contains 10,335 images including 1,449 images taken from the NYU Depth dataset V2 [10], 554 images taken from the Berkeley B3DO dataset [28], and 3,389 images taken from the SUN3D videos [29]. The dataset is very challenging due to the large intra-class variations, different spatial layouts, and variable lighting conditions [9].

5.2 Object Labeling in 2D and 3D Reconstructed Scenes

For the labeling of objects in real-world scenes, we used the WRGBD scene dataset [8] and evaluated our framework for five object categories (bowl, cap, cereal box, coffee mug, and soda can) using RGB-D images of objects from the WRGBD object dataset [7]. Table 1 shows the results of these experiments and a comparison with the state-of-the-art deep learning models (e.g., the deep CNN model (DCNN) in [16] and the fully convolutional model (FCN) in [17]). For the evaluation, we used the same overall framework (shown in Fig. 1), the same image representation, and the same training data to train the compared models.

TABLE 1
Average Precision and Recall Scores on the WRGBD Scene Dataset [8], Computed over the Five Selected Object Categories

Method            | Avg. Precision/Recall, 2D (Single view) | Avg. Precision/Recall, 3D (Multiple views)
[8] Det3DMRF      | 54.2/86.6                               | 89.4/86.7
[30] CaRFs        | 76.4/58.5                               | -
[16] DCNN         | 78.8/80.5                               | 88.9/84.4
[17] FCN          | 79.2/81.5                               | 88.7/83.4
[18] SS-CNN       | 80.4/82.6                               | 89.6/86.8
[6] CRFFs         | 81.5/82.7                               | 89.8/87.0
(Ours) MDSI-CNN   | 82.8/84.2                               | 91.2/88.6


As shown in Table 1, we achieved considerably higher average precision and recall scores compared to the state-of-the-art. Figs. 6 and 5 show qualitative representations of the 2D and the 3D labeling results, respectively, of our model compared to the models in [16] and [17]. From the results, we see that both the DCNN and FCN models have their own pros and cons. The DCNN model uses global image-level information to compute class predictions. The inability of the DCNN model to consider the local parts of the object results in a considerable reduction of the true positives (see the labeling results of soda can and coffee mug shown in Fig. 6B). Consequently, the labeling produced by DCNN is highly affected by occlusion and incorrect segmentation. The FCN model considers the individual image pixels to compute class predictions. Since the model does not learn the spatial relationships between neighboring pixels, it considerably misclassifies pixels which share similar appearance or structural characteristics between different classes. Consequently, the labeling produced by FCN contains a much higher number of false positives. This can be understood from the labeling results shown in Fig. 6C, where the pixels belonging to "laptop" are considerably misclassified as cereal box pixels. Our CNN model combines the positive aspects of both the global and local models and outperforms both by a significant margin (see Table 1). This is attributed to the fusion of object-level and pixel-level class information in the proposed inference algorithm (see Section 4.3), where the combined probabilistic distributions are heavily influenced by the most confident pixels within the individual object proposals. This considerably reduces the false positives and yields a labeling that is consistent and constrained within the true object boundaries.

Fig. 5. Qualitative comparison of the 3D labeling results of the FCN model in [17] (C), the CNN model in [16] (D), and our classification model (E). The ground truths for the input scenes (A) are shown in (B). Figure best viewed in color.


5.3 Object Category and Instance Recognition

We evaluated our model for object category and instance recognition on turntable images of the WRGBD object dataset [7]. For this, we used the experimental setup described in [7], which uses every fifth frame in the dataset for training and evaluation. Specifically, for categorization, the setup in [7] computes the cross-validation accuracy for ten predefined folds over the objects (i.e., in each fold, the test instances are completely unknown to the system). For instance recognition, the setup in [7] uses the Leave-Sequence-Out scheme of Bo et al. [31], where training is performed on the 30 and 60 degree sequences, while testing is performed on the 45 degree sequence of every instance.

Table 2 shows the results of these experiments, which show that our classification model outperforms the state-of-the-art methods. Table 2 also shows that we have improved over our previous results (reported in [30] and [21]). These improvements are attributed to the regularizations of the spatial invariance (Section 4.2.1) and the Fisher encoding (Section 4.2.4) on the learned features, which prove to be more discriminative in dealing with the challenges of small inter-class scatter and large intra-class variances, compared to the features learnt from CNN models without these regularizations (e.g., [21], [36]).

5.4 Indoor Scene Recognition

For indoor scene recognition, we used the SUNRGBD dataset [9] and the NYU dataset V2 [10]. For the SUNRGBD dataset, we used the publicly available split configuration provided in the toolbox of the dataset. The benchmark split considers scene recognition on a subset of 19 selected classes from the dataset. Specifically, there are 4,845 images for training and 4,659 images for testing. For the NYU dataset, we used the publicly available split, which has 795 images for training and 654 images for testing. For both datasets, we report the overall mean accuracy over all the test images. For model initialization, we used the parameters of the CNN model of [38] pretrained on the Places dataset [38] (2.5 million RGB images with 205 scene classes). The results of these experiments are reported in Tables 3 and 4, which show that our model produces considerably higher recognition accuracies compared to the state-of-the-art methods. This improved performance demonstrates that the discriminative capability of the CNN-based features can effectively be increased through the regularizations of the spatial invariance (Section 4.2.1) and Fisher encoding (Section 4.2.4) during training. Furthermore, the complementary information of the different input modalities can be better extracted with the proposed fusion architecture (Section 4.2).

Fig. 6. Qualitative comparison of 2D labeling results of our model (A), the CNN model in [16] (B), and the FCN model in [17] (C).

TABLE 2
Comparisons with the State-of-the-Art Methods on the WRGBD Object Dataset [7] for Object Category and Instance Recognition

                        |         Category accuracy          |   Instance accuracy
Method                  | Depth     | RGB       | RGB-D      | Depth | RGB  | RGB-D
[32] HMP                | 70.3±2.2  | 74.7±2.5  | 82.1±3.3   | 39.8  | 75.8 | 78.9
[33] CKM descriptor     | -         | -         | 86.4±2.3   | -     | 82.9 | 90.4
[34] Kernel descriptor  | 80.3±2.9  | 80.7±2.1  | 86.5±2.1   | 54.7  | 90.8 | 91.2
[35] CNN-RNN            | 78.9±3.8  | 80.8±4.2  | 86.8±3.3   | -     | -    | -
[31] SP+HMP             | 81.2±2.3  | 82.4±3.1  | 87.5±2.9   | 51.7  | 92.1 | 92.8
[30] CaRFs              | -         | -         | 88.1±2.4   | -     | -    | -
[36] Multi-modal        | -         | 83.1±2.0  | 89.4±1.3   | -     | 92.0 | 94.1
[37] Deep learning      | 83.8±2.7  | 84.1±2.7  | 91.3±1.4   | -     | -    | -
[21] STEM-CaRFs         | 80.8±2.1  | 88.8±2.0  | 92.2±1.3   | 56.3  | 97.0 | 97.6
(Ours) MDSI-CNN         | 84.9±1.7  | 89.9±1.8  | 92.8±1.2   | 57.6  | 97.7 | 97.9

TABLE 3
Scene Recognition Accuracies (%) on the SUNRGBD Dataset [9]

Method                               | Depth | RGB  | RGB-D
[39] GIST + RBF kernel SVM           | 20.1  | 19.7 | 23.0
[32] HMP + linear SVM                | -     | 21.7 | 25.7
[40] Places-CNN + linear SVM         | 25.5  | 35.6 | 37.2
[40] Places-CNN + RBF kernel SVM     | 27.7  | 38.1 | 39.0
[18] SSCNN                           | -     | 36.1 | 41.3
[41] Multi-modal CNN                 | -     | 37.0 | 41.5
(Ours) MDSI-CNN                      | 35.2  | 39.6 | 45.2



5.5 Semantic Scene Segmentation

We evaluated the performance of our model for semantic segmentation on the SUNRGBD dataset [9] using the pixel-wise annotations available in the dataset for 37 target classes. For the evaluation, we used the pixel-wise mean accuracy metric, which is the average of the class-wise accuracies. The results of these experiments are reported in Table 5, which shows that all the deep architectures achieve a low overall accuracy. For these experiments, we used the same input image representation and training data to train our model and the compared models. Table 5 shows that our model achieved the highest accuracy compared to the state-of-the-art deep architectures. For instance, our model outperformed the well-known SegNet [45] by almost 4 percent.

5.6 Significance of Our Image Representation

We evaluated our model for object recognition, scene recognition, and semantic segmentation by independently using the feature maps of our image representation (Section 4.1) for feature learning. The results of these experiments are shown in Table 6, which shows that the CNN model generalizes well to the individual feature maps. For instance, for object recognition, we achieve a mean accuracy of 88.6, 85.7, 86.3, and 82.5 percent for the cases of f_lab, f_normals, f_angles, and f_dist, respectively. While color-based feature maps are important in capturing the characteristics of textured objects (e.g., stapler, battery, cell phone), feature maps derived from depth data are equally important in capturing large depth discontinuities (between the object and the background) and the geometry of the objects. From Table 6, we also observe that the accuracies for the depth-derived features are relatively low compared to the accuracies for the color-derived features. This is attributed to the fact that color information provides better discrimination across objects or scenes with high inter-class similarities. Nonetheless, this problem can effectively be mitigated by the fusion of the color-derived and depth-derived feature maps, yielding an average accuracy of 92.8 and 45.2 percent for object recognition and scene classification, respectively. These results validate our hypothesis that the feature maps of our image representation provide complementary information and that it is appropriate to use the image representation as an initialization for the fine-tuning of a large CNN pre-trained on RGB images.

5.7 Significance of the Proposed Regularizations

To elaborate on the significance of the proposed regularizations within our model architecture, we evaluated our model: i) trained without the regularizations, ii) with only the spatial transformation (Section 4.2.1), iii) with only the Fisher encoding based fusion (Section 4.2.4), and iv) with all the regularizations. The results of these experiments are shown in Table 7.

5.7.1 Spatial Invariance (Section 4.2.1)

Table 7 shows that the incorporation of our Spatial Transformation branch (Section 4.2.1) within our model architecture produces significant improvements for both object recognition and scene classification. Fig. 7 visualizes the effect of the spatial transformation on the labeling produced by the model. From Fig. 7 (top row), we observe that the ST branch acts as an attention mechanism and effectively refines the object proposals. This results in an improved labeling with fewer false positives compared to the model which does not use the ST branch (e.g., the instances of soda-can that were misclassified as cap, as shown in Fig. 7E).

5.7.2 Fisher Encoding (Section 4.2.4)

From Table 7, we observe that the incorporation of our Fisher-encoding based fusion branch (Section 4.2.4) within our model architecture brings improvements to both the object recognition and the scene classification accuracy compared to the model which does not use this regularization.

TABLE 4
Scene Recognition Accuracies (%) on the NYU Dataset V2 [10]

Method                                               | Accuracy
[42] Spatial Pyramid (SPM) on Textons                | 33.8
[42] Spatial Pyramid (SPM) on SIFT                   | 38.9
[42] Spatial Pyramid (SPM) on SIFT + Textons         | 44.9
[42] Spatial Pyramid (SPM) on semantic segmentation  | 45.4
(Ours) MDSI-CNN                                      | 50.1

TABLE 5
Semantic Segmentation Results on the SUNRGBD Dataset [9]

Method                         | Mean accuracy (%)
[43] DeconvNet                 | 32.28
[44] DeepLab-denseCRF          | 33.06
[12] RGBD-kernel descriptors   | 36.33
[17] FCN-32s                   | 41.13
[45] SegNet                    | 46.16
(Ours) MDSI-CNN                | 50.02

TABLE 6
Accuracy (%) of Our Model for the Individual Feature Maps of Our Image Representation (Section 4.1) on the WRGBD Object Dataset [7] and the SUNRGBD Dataset [9]

Input feature map   | Object recognition | Scene recognition | Segmentation
f_rgb               | 89.9               | 39.6              | 46.3
f_lab               | 88.6               | 32.5              | 41.2
f_pntcloud          | 84.9               | 35.2              | 38.9
f_normals           | 85.7               | 36.1              | 39.4
f_angles            | 86.3               | 35.9              | 37.8
f_dist              | 82.5               | 34.3              | 36.6
All feature maps    | 92.8               | 45.2              | 50.0

TABLE 7
Accuracy (%) of Our Model Trained with and without the Proposed Regularizations (Section 4.2) on the WRGBD Object Dataset [7] and the SUNRGBD Dataset [9]

Regularization                           | Object recognition | Scene recognition
No regularization                        | 85.2               | 35.3
Spatial transformation (Section 4.2.1)   | 89.9               | 39.6
Fisher encoding (Section 4.2.4)          | 88.6               | 38.8
All regularizations                      | 92.8               | 45.2


These improvements validate our hypothesis that the mapping of multi-modal CNN activations (from fully connected and convolutional layers) to a Fisher-vector based feature space, and the fusion of the Fisher-encoded features at image- and pixel-levels, produce features which are more discriminative for classes with large inter-class similarities and small intra-class variance, thereby yielding a higher classification accuracy compared to the models which do not use the proposed Fisher-based regularization (see Table 7). In Fig. 8, we show the confusion matrices of our model trained with and without the regularizations of spatial invariance and Fisher encoding on the SUNRGBD dataset [9]. It can be seen that there is a performance improvement for almost every class when using the proposed regularizations, suggesting their importance for indoor scene recognition. On one hand, the proposed STN branch (Section 4.2.1) learns to extract more representative regions from the input scene images. On the other hand, Fisher-encoded features contain pertinent local information which is not represented in the full-image-based features (e.g., in [40]). The combination of these two produces features which are highly discriminative for classes with high inter-class visual and semantic similarities (e.g., “lecture theatre” versus “classroom”).
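The following sketch illustrates the kind of Fisher-vector mapping referred to above, using a GMM as the visual vocabulary. It is a generic implementation of the standard (improved) Fisher vector, not the VLFeat-based [26] pipeline used in the paper; the descriptor dimensions, GMM size, and the function name fisher_vector are illustrative assumptions.

```python
# Minimal sketch of Fisher-vector encoding of local CNN activations, the
# kind of mapping used by the fusion branch described in Section 4.2.4.
# The GMM size and the dummy activations are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(x, gmm):
    """Encode a set of D-dim local descriptors x (T x D) into a 2*K*D Fisher vector."""
    T = x.shape[0]
    gamma = gmm.predict_proba(x)                        # (T, K) soft assignments
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    sigma = np.sqrt(var)
    diff = (x[:, None, :] - mu[None, :, :]) / sigma      # (T, K, D) whitened residuals
    g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sigma = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))               # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)             # L2 normalisation

# Dummy "convolutional activations": 500 local descriptors of dimension 64.
acts = np.random.randn(500, 64)
gmm = GaussianMixture(n_components=16, covariance_type='diag', random_state=0).fit(acts)
print(fisher_vector(acts, gmm).shape)                    # (2048,) = 2 * 16 * 64
```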

5.8 Significance of the Proposed CRF Regularization

To elaborate on the significance of the proposed CRF-based inference algorithm (Section 4.2.7) for object recognition, we evaluated our framework: i) with only the unary potentials (U-only), ii) with the unary and object clique potentials (U+OCP), and iii) with the unary, object clique, and pairwise potentials (U+OCP+PW). The results of these experiments are reported in Table 8, which shows that both the object recognition and the scene labeling accuracies increase with the incorporation of our unary, clique, and pairwise potentials. Figs. 7D and 7E show qualitative representations of the 2D and 3D labeling results, respectively, produced with our CRF-based inference hypothesis.

Fig. 7. Qualitative visualization of the labeling produced by our model with and without the proposed Spatial transformation regularization (Section 4.2.1). Note that the labeling produced with the Spatial transformation branch contains fewer false positives (top row) compared to the labeling produced without the spatial transformation branch (bottom row). For instance, instances of soda-can were classified as cap (B-E).

Fig. 8. Confusion matrices of our model without the regularizations of the spatial invariance and Fisher encoding (left), with the regularization of only spatial invariance (middle), and with all the regularizations (right), on the SUNRGBD dataset [9]. These results show that there is a performance improvement for almost every class when using all the proposed regularizations, suggesting their importance for indoor scene recognition.

TABLE 8
Accuracy (%) of Our Framework Trained with the Proposed Unary (U), Object Clique (OCP), and Pairwise (PW) Potentials

Method                 Scene labeling (2D)    Scene labeling (3D)    Object recognition (Category)    Object recognition (Instance)
MDSI-CNN (No-CRF)      80.8/81.3              88.6/85.4              90.3                              95.2
MDSI-CNN (U-only)      81.3/82.6              89.8/86.9              91.7                              96.5
MDSI-CNN (U+OCP)       82.3/83.9              90.8/88.4              92.1                              96.8
MDSI-CNN (U+OCP+PW)    82.8/84.2              91.2/88.6              92.8                              97.7


As seen in these figures, our inference hypothesis provides an additional regularization of spatial smoothing, thereby producing a more accurate semantic labeling compared to the labeling produced from image-level or pixel-level classification models (e.g., see Fig. 7C). It improves the labeling in two ways: i) it forces pixels with a low probability of being part of an object to be labeled as background, thereby reducing the number of false positives, and ii) it guarantees local label consistency within smooth surface patches (separated by concave surface transitions), thereby increasing the number of true positives. Fig. 9 shows a qualitative illustration of the labeling results produced by our model, the FCN model, and the DCNN model on the WRGBD object dataset [7]. From the figure, we observe that our unary and object clique potentials successfully classify the confusing cases (e.g., mushroom, sponge, garlic, and potato), which are difficult for the FCN and DCNN models. The FCN model consists of several spatially invariant operations (such as max-pooling and convolutions) and produces a labeling which does not preserve the accurate object contours. Our clique potential provides a different, top-down view of objects, and thus gives additional confidence to the energy term in Eq. (12). Furthermore, it incorporates object evidence from candidates which are different from the initial object proposals, and thus corrects segmentation errors and encourages a high object recall. Our unary, clique, and pairwise potentials work collaboratively and produce predictions which are locally consistent and smooth around object boundaries.
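For intuition, the sketch below writes down a toy version of such an energy: a unary data term, a Potts pairwise term for local label consistency, and a clique term that rewards agreement with an object-level hypothesis inside a proposal region. The exact potentials and weights of Eq. (12) are defined in Section 4.3; everything here (function name, weights, dummy data) is an illustrative assumption.

```python
# Minimal, illustrative CRF energy with unary, pairwise (Potts), and
# object-clique terms. This is not the paper's Eq. (12); weights and data
# below are dummy assumptions.
import numpy as np

def crf_energy(labels, unary, clique_mask, clique_label, w_pair=1.0, w_clique=1.0):
    """Energy of a label map (H x W) given unary costs (H x W x C)."""
    h, w = labels.shape
    # Unary term: cost of the chosen label at every pixel.
    e_unary = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # Pairwise Potts term: penalise label changes between 4-neighbours.
    e_pair = (labels[:, 1:] != labels[:, :-1]).sum() + (labels[1:, :] != labels[:-1, :]).sum()
    # Object-clique term: penalise pixels inside an object proposal that
    # disagree with the object-level class hypothesis for that proposal.
    e_clique = (labels[clique_mask] != clique_label).sum()
    return e_unary + w_pair * e_pair + w_clique * e_clique

H, W, C = 60, 80, 5
unary = np.random.rand(H, W, C)            # dummy per-pixel class costs
labels = unary.argmin(axis=2)              # initial labeling from unaries alone
mask = np.zeros((H, W), dtype=bool)
mask[20:40, 30:60] = True                  # dummy object proposal region
print(crf_energy(labels, unary, mask, clique_label=2))
```

In practice such an energy would be minimised with an efficient solver (e.g., graph cuts [27]) rather than evaluated for a single labeling as done here.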

5.9 Significance of the Proposed Hierarchical Fusion

To elaborate on the significance of the fusion of the image-level and pixel-level CNN scores within a unified architecture, we evaluated two variants of our model: i) CNN-I, where we only use the image-level classification loss L_class to estimate the class labels at the image level, and ii) CNN-P, where we only use the pixel-level loss L_seg to estimate the pixel-level class labels. We aggregate the pixel-level class labels into an image-level label to allow a fair comparison with the CNN-I model. Table 9 shows the results of these experiments, where we observe that the accuracies significantly increase for the MDSI-CNN model compared to the CNN-I or CNN-P models. The CNN-I branch learns the image description at the fully connected layers. These layers are more domain-specific, and are therefore less transferable than the shallow layers. On the other hand, the CNN-P branch is substantially less sensitive to domain shifts because the fully convolutional layers in the CNN-P branch are less constrained to a specific dataset. Hence, the features learnt by the CNN-P branch tend to be more general descriptors than the features learnt by the CNN-I branch. The fusion of the CNN-I and CNN-P branches through the proposed Fisher encoding effectively combines the strengths of both the fully connected and the convolutional features. This allows our model to learn modality-specific features at the lower layers (Sections 4.2.2 and 4.2.3) and multi-modal features at the higher fusion layers (Section 4.2.4), thereby maximizing the learning of both the discriminative and the complementary information between the different input modalities.
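A minimal two-head network in the spirit of the CNN-I / CNN-P comparison is sketched below. It trains an image-level head (loss L_class) and a pixel-level head (loss L_seg) on a shared backbone; the backbone size is an assumption, and the simple sum of losses stands in for the Fisher-encoding based fusion of Section 4.2.4 rather than reproducing it.

```python
# Minimal sketch of a two-head network in the spirit of CNN-I / CNN-P.
# Architecture sizes are assumptions, not the authors' MDSI-CNN.
import torch
import torch.nn as nn

class TwoHeadNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(True), nn.MaxPool2d(2),
        )
        # CNN-I style head: global pooling + fully connected layer -> image label.
        self.head_image = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(64, num_classes))
        # CNN-P style head: 1x1 convolution + upsampling -> per-pixel labels.
        self.head_pixel = nn.Sequential(nn.Conv2d(64, num_classes, 1),
                                        nn.Upsample(scale_factor=4, mode='bilinear',
                                                    align_corners=False))

    def forward(self, x):
        feat = self.backbone(x)
        return self.head_image(feat), self.head_pixel(feat)

net = TwoHeadNet()
x = torch.randn(2, 3, 64, 64)
img_logits, pix_logits = net(x)
loss = nn.CrossEntropyLoss()(img_logits, torch.randint(0, 10, (2,))) + \
       nn.CrossEntropyLoss()(pix_logits, torch.randint(0, 10, (2, 64, 64)))
loss.backward()
print(img_logits.shape, pix_logits.shape)   # [2, 10] and [2, 10, 64, 64]
```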

5.10 Parameter Selection

Here, we investigate the performance of our inference algorithm for different values of the parameter α of Eq. (13) (which weighs the influence of the object-level class scores over the pixel-level class scores). Fig. 10-left shows the results of this experiment. From the figure, we observe that when α is small, importance is placed more on object-level classification than on semantic pixel-level labeling. When α increases, the regularization of the semantic labeling is added to the overall unary cost, and the performance increases. However, the performance of the model drops noticeably for very large values of α. One possible reason is that the model then classifies the object based mainly on the semantic pixel-level class information, so the classification performance suffers from the false positives of the semantic labeling. The best performance is achieved for α = 0.7. For a better illustration, the labeling results of our model with different values of α are shown in Fig. 10-right, which demonstrates that the object classification results are considerably improved with semantic regularization.
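The sketch below illustrates the role of α as a weighting between object-level and pixel-level class scores in the unary cost, and how one might sweep it on validation data. The convex combination used here is an illustrative assumption, not the exact form of Eq. (13).

```python
# Illustrative blending of object-level and pixel-level class scores by a
# parameter alpha. This is an assumed convex combination, not the paper's
# Eq. (13); all data below are dummy values.
import numpy as np

def combined_unary(obj_scores, pix_scores, alpha):
    """Blend object-level (C,) and per-pixel (H, W, C) class scores into unary costs."""
    blended = (1.0 - alpha) * obj_scores[None, None, :] + alpha * pix_scores
    return -np.log(blended + 1e-8)        # higher probability -> lower unary cost

H, W, C = 60, 80, 5
obj_scores = np.random.dirichlet(np.ones(C))            # dummy object-level distribution
pix_scores = np.random.dirichlet(np.ones(C), (H, W))    # dummy pixel-level distributions
for alpha in (0.1, 0.5, 0.7, 0.9):                      # 0.7 performed best in Fig. 10-left
    labels = combined_unary(obj_scores, pix_scores, alpha).argmin(axis=2)
    print(alpha, np.bincount(labels.ravel(), minlength=C))
```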

5.11 Real-World Experimentation

We evaluated our framework for object labeling using videos (captured with a Kinect) of scenes with different layouts of objects belonging to 15 different classes. The video (in the supplementary material, available online) best demonstrates the capabilities of our framework in these new scenes. For each video, the resulting reconstruction was hand-labeled with object class labels. Fig. 11 shows our labeling results for these experiments. It is clear from the figure that our framework yields a consistent labeling of scenes with background clutter.

Fig. 9. Qualitative comparison of the labelings produced using FCN [17] (C), DCNN [16] (D), our clique potentials (E), and all three of our potentials (F) for the images shown in (A). The labels produced by FCN are estimated by max-pooling over the score maps shown in (B). Our CRF potentials ensure a smooth and accurate labeling for the cases of “confusing classes” compared to the results produced by the FCN and DCNN models.

TABLE 9
Comparison of Our Model to the CNN-I and the CNN-P Models

Method       Scene labeling (2D)    Scene labeling (3D)    Object recognition (Category)    Object recognition (Instance)    Scene recognition
CNN-I        80.3/80.8              87.3/85.5              88.2                              91.6                             38.9
CNN-P        80.2/81.6              88.7/86.3              88.4                              92.5                             41.1
MDSI-CNN     82.8/84.2              91.2/88.6              92.8                              97.7                             45.2


Table 10 shows a comparison of the mean accuracy (for the scenes shown in Fig. 11) of our CNN model, the model in [16], and the FCN model in [17]. From Table 10, it can be seen that our model achieved a higher accuracy for all the tested scenes compared to the other models. These improvements further validate our hypothesis that the regularizations of the spatial invariance and the Fisher encoding based multi-modal feature fusion within a unified CNN architecture enable the learning of highly discriminative features for accurate class predictions. Furthermore, the proposed CRF-based inference refines the CNN predictions to produce consistent and highly accurate labelings in real-world scenes.

5.12 Runtime Performance

Our framework parallelizes efficiently on CPU and GPU architectures. In our experiments, we employed an Nvidia Titan X, although the framework can run on the CPU alone for smaller scenes. Table 11 summarizes the approximate runtimes of the individual sub-systems of our framework. The runtime of our multi-threaded implementation is dominated by the initial proposal generation and the CRF-based semantic labeling, which require around 220 and 270 ms per frame, respectively. The feature extraction and the inference algorithms operate on GPU threads and take around 250 and 160 ms per frame, respectively. Although the runtimes change as a function of the number of object proposals and the image resolution, we observed interactive frame rates in our real-world experiments.

6 CONCLUSION

In this paper, we address the challenging task of labeling objects in 3D reconstructed scenes using a multi-modal, discriminative and spatially invariant CNN. This is achieved through the regularizations of a spatial transformation branch and a Fisher encoding based multi-modal fusion branch to a CNN architecture for image-level and pixel-level classification.

Fig. 10. Left: Average precision scores for different values of α on the WRGBD Scene dataset [8]. Right: Qualitative comparison of per-frame labeling results produced for different values of the parameter α on the WRGBD Scene dataset [8].

Fig. 11. Qualitative representation of 3D labeling results on real-world scenes.

TABLE 10
3D Object Labeling Results on the Real-World Scenes Shown in Fig. 11

Method       Scene 1    Scene 2    Scene 3    Scene 4    Scene 5    Scene 6    Average
CNN [16]     88.7       88.4       89.1       87.9       88.7       89.3       88.3
FCN [17]     89.3       90.1       91.2       88.3       90.1       88.8       88.9
MDSI-CNN     91.5       92.7       93.7       92.8       91.4       92.7       92.4

The measure used is the percentage of correctly classified pixels.

TABLE 11
Runtimes of Different Steps of Our Framework (QVGA Resolution)

Step                                 Thread No.    Thread    Runtime
Camera pose tracking [6]             1             GPU       ~60 ms
Object proposal generation [6]       2             CPU       ~220 ms
Feature extraction (Section 4.1)     3             GPU       ~250 ms
Model Inference (Section 4.2.7)      3             GPU       160 ms
Semantic Labeling (Section 4.3)      4             CPU       270 ms
Overall                                                      < 960 ms


The proposed regularizations enable our model to learn highly discriminative features which better deal with the difficulties of large inter-class similarities and large intra-class variances, compared to the features extracted from the fully connected or the convolutional layers. Experimental results show that we achieved state-of-the-art results on a variety of recognition tasks, such as object recognition, object labeling in 3D reconstructed scenes, indoor scene recognition, and semantic segmentation, using challenging datasets and live videos from Kinect. We foresee numerous potential applications of our framework, including scene understanding for augmented reality and object grasping. For future work, we plan to extend our framework to incorporate other relevant tasks, such as object grasp estimation, within a unified architecture. We also intend to investigate the integration of loop closure within our framework for high-quality large-scale scene labeling.

ACKNOWLEDGMENTS

The authors are grateful to NVIDIA for providing a Titan X GPU to conduct this research. This work was supported by Australian Research Council grants (DP150100294, DE120102960).

REFERENCES

[1] R. A. Newcombe, "KinectFusion: Real-time dense surface mapping and tracking," in Proc. 10th IEEE Int. Symp. Mixed Augmented Reality, 2011, pp. 127–136.

[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[3] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 580–587.

[4] U. Asif, M. Bennamoun, and F. Sohel, "Discriminative feature learning for efficient RGB-D object recognition," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 272–279.

[5] U. Asif, M. Bennamoun, and F. Sohel, "A model-free approach for the segmentation of unknown objects," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2014, pp. 4914–4921.

[6] U. Asif, M. Bennamoun, and F. Sohel, "Simultaneous dense scene reconstruction and object labeling," in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 2255–2262.

[7] K. Lai, L. Bo, X. Ren, and D. Fox, "A large-scale hierarchical multi-view RGB-D object dataset," in Proc. IEEE Int. Conf. Robot. Autom., 2011, pp. 1817–1824.

[8] K. Lai, L. Bo, X. Ren, and D. Fox, "Detection-based object labeling in 3D scenes," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 1330–1337.

[9] S. Song, S. P. Lichtenberg, and J. Xiao, "SUN RGB-D: A RGB-D scene understanding benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 567–576.

[10] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, "Indoor segmentation and support inference from RGBD images," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 746–760.

[11] Y. Wang, J. Feng, Z. Wu, J. Wang, and S. F. Chang, "From low-cost depth sensors to CAD: Cross-domain 3D shape retrieval via regression tree fields," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 489–504.

[12] X. Ren, L. Bo, and D. Fox, "RGB-(D) scene labeling: Features and algorithms," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2759–2766.

[13] K. Lai, L. Bo, and D. Fox, "Unsupervised feature learning for 3D scene labeling," in Proc. IEEE Int. Conf. Robot. Autom., 2014, pp. 3050–3057.

[14] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, no. 2, pp. 154–171, 2013.

[15] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 2017–2025.

[16] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014, http://arxiv.org/abs/1409.1556

[17] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, 2017.

[18] Y. Liao, S. Kodagoda, Y. Wang, L. Shi, and Y. Liu, "Understand scene categories by objects: A semantic regularized scene classifier using convolutional neural networks," in Proc. IEEE Int. Conf. Robot. Autom., 2016, pp. 2318–2325.

[19] U. Asif, M. Bennamoun, and F. Sohel, "Unsupervised segmentation of unknown objects in complex environments," Auton. Robots, vol. 40, no. 5, pp. 805–829, 2016.

[20] U. Asif, M. Bennamoun, and F. Sohel, "Model-free segmentation and grasp selection of unknown stacked objects," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 659–674.

[21] U. Asif, M. Bennamoun, and F. A. Sohel, "RGB-D object recognition and grasp detection using hierarchical cascaded forests," IEEE Trans. Robot., vol. 33, no. 3, pp. 547–564, Jun. 2017.

[22] R. B. Rusu, G. Bradski, R. Thibaux, and J. Hsu, "Fast 3D recognition and pose using the viewpoint feature histogram," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2010, pp. 2155–2162.

[23] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 807–814.

[24] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, "Describing textures in the wild," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2014, pp. 3606–3613.

[25] A. Vedaldi and K. Lenc, "MatConvNet: Convolutional neural networks for MATLAB," in Proc. ACM Int. Conf. Multimedia, 2015.

[26] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," 2008. [Online]. Available: http://www.vlfeat.org/

[27] Y. Boykov, O. Veksler, and R. Zabih, "Fast approximate energy minimization via graph cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 11, pp. 1222–1239, Nov. 2001.

[28] A. Janoch, et al., "A category-level 3D object dataset: Putting the Kinect to work," in Consumer Depth Cameras for Computer Vision. Berlin, Germany: Springer, 2013, pp. 141–165.

[29] J. Xiao, A. Owens, and A. Torralba, "SUN3D: A database of big spaces reconstructed using SfM and object labels," in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1625–1632.

[30] U. Asif, M. Bennamoun, and F. Sohel, "Efficient RGB-D object categorization using cascaded ensembles of randomized decision trees," in Proc. IEEE Int. Conf. Robot. Autom., 2015, pp. 1295–1302.

[31] L. Bo, X. Ren, and D. Fox, "Unsupervised feature learning for RGB-D based object recognition," in Proc. 13th Int. Symp. Exp. Robot., 2013, pp. 387–402.

[32] L. Bo, X. Ren, and D. Fox, "Hierarchical matching pursuit for image classification: Architecture and fast algorithms," in Proc. Int. Conf. Neural Inf. Process. Syst., 2011, pp. 2115–2123.

[33] M. Blum, J. T. Springenberg, J. Wulfing, and M. Riedmiller, "A learned feature descriptor for object recognition in RGB-D data," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 1298–1303.

[34] L. Bo, X. Ren, and D. Fox, "Depth kernel descriptors for object recognition," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2011, pp. 821–826.

[35] R. Socher, B. Huval, B. Bath, C. D. Manning, and A. Y. Ng, "Convolutional-recursive deep learning for 3D object classification," in Proc. Int. Conf. Neural Inf. Process. Syst., 2012, pp. 665–673.

[36] M. Schwarz, H. Schulz, and S. Behnke, "RGB-D object recognition and pose estimation based on pre-trained convolutional neural network features," in Proc. IEEE Int. Conf. Robot. Autom., 2015, pp. 1329–1335.

[37] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, "Multimodal deep learning for robust RGB-D object recognition," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 681–687.

[38] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., 2017.

[39] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.


[40] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 487–495.

[41] H. Zhu, J. B. Weibel, and S. Lu, "Discriminative multi-modal feature fusion for RGBD indoor scene recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2969–2976.

[42] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik, "Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation," Int. J. Comput. Vis., vol. 112, no. 2, pp. 133–149, 2015.

[43] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1520–1528.

[44] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in ICLR, 2015, http://arxiv.org/abs/1412.7062

[45] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., 2017.

Umar Asif received the master's degree in mechatronics engineering from the University of New South Wales, Australia, and the PhD degree in computer vision from the University of Western Australia, Australia. He is currently a research scientist at IBM Research Australia. His research interests include computer vision, machine learning, and robotics. He is a senior member of the IEEE.

Mohammed Bennamoun received the MSc degree in control theory from Queen's University, Canada, and the PhD degree in computer vision from QUT, Brisbane, Australia. He is currently a Winthrop professor with the University of Western Australia, Australia. He is the co-author of the book Object Recognition: Fundamentals and Case Studies (Springer-Verlag, 2001). He has published more than 300 journal and conference publications. His areas of interest include control theory, robotics, object recognition, artificial neural networks, and computer vision. He served as a guest editor for a couple of special issues in international journals such as the International Journal of Pattern Recognition and Artificial Intelligence. He was selected to give tutorials at the European Conference on Computer Vision (ECCV 02), the International Conference on Acoustics, Speech and Signal Processing (2003), Interspeech (2014), and CVPR (2016).

Ferdous A. Sohel received the PhD degree from Monash University, Australia, in 2009. He is currently a senior lecturer with Murdoch University, Australia. Prior to joining Murdoch University, he was a research assistant professor in the School of Computer Science and Software Engineering, The University of Western Australia, from January 2008 to mid-2015. His research interests include computer vision, multimodal biometrics, scene understanding, and robotics. He is a recipient of the prestigious Discovery Early Career Researcher Award (DECRA) funded by the Australian Research Council. He is also a recipient of the Early Career Investigators Award (UWA) and the best PhD thesis medal from Monash University. He is a member of the Australian Computer Society and a senior member of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.
