
Bridging the Robot Perception Gap With Mid-Level Vision

Chi Li, Jonathan Bohren, and Gregory D. Hager

Abstract The practical application of machine perception to support physical manipulation in unstructured environments remains a barrier to the development of intelligent robotic systems. Recently, great progress has been made by the large-scale machine perception community, but these methods have made few contributions to applied robotic perception. This is in part because such large-scale systems are designed to recognize category labels of large numbers of objects from a single image, rather than to provide the highly accurate, efficient, and robust pose estimation needed in environments for which a robot has reliable prior knowledge. In this paper, we illustrate the potential for synergistic integration of modern computer vision methods into robotics by augmenting a RANSAC-based registration method with a state-of-the-art semantic segmentation algorithm. We detail a convolutional architecture for semantic labeling of the scene, modified to operate efficiently using integral images. We combine this labeling with two novel scene parsing variants of RANSAC, and show, on a new RGB-D dataset that contains complex configurations of textureless and highly specular objects, that our method demonstrates improved pose estimation performance over the unaugmented algorithms.

1 Introduction

Despite the substantial progress made in computer vision for object recognition over the past few years, practical "turn-key" use of object recognition and localization for robotic manipulation in unstructured environments remains elusive. Traditional recognition methods that have been optimized for large-scale object classification [23, 24] from contextless images simply do not provide the reliability and precision for object pose estimation necessary to support physical manipulation.

Chi Li e-mail: [email protected] · Jonathan Bohren e-mail: [email protected] · Gregory D. Hager e-mail: [email protected]
Johns Hopkins University, 3400 N. Charles St, Baltimore, MD 21218.


(a) Robotic assembly. (b) Atomic objects. (c) Well-separated. (d) Densely cluttered.

Fig. 1: In a robotic assembly scenario (a), for the objects (b) that have little or no distinguishing texture features, most object recognition systems would rely on the objects being well-separated (c) and fail when objects are densely packed (d).

In robotics, estimates of both the identity and 3D pose of an object must be highly reliable, since errors can result in costly failures. Consequently, roboticists are either forced to redesign tasks to accommodate the capabilities of the available object registration algorithms, or they need to modify the objects used in a task for easier recognition. Modifications to the objects usually involve adding easily-classifiable colors, artificial texture, or easily-recognizable artificial planar markers or marker constellations [19, 27]. Unfortunately, such modifications are often impractical and sometimes even infeasible, for example in manufacturing and assembly applications, robotic search-and-rescue, and any operation in hazardous or extreme environments.

As with many industrial automation domains, we face an assembly task in which a robot is required to construct structures from rigid components which have no discriminative texture, as seen in Figure 1a. The lattice structures are built out of truss-like "links" which are joined together with coupling "nodes" via gendered magnetic surfaces, as seen in Figure 1b. While these components were originally designed for open-loop assembly via quadcopter robots [31], their mechanical properties make them ideal for autonomous and semi-autonomous [32] manipulation in assembly.

Unfortunately, many object registration algorithms [10, 11, 25, 8] are developed to perform well only in "partially cluttered" scenes where individual objects are well-separated, like that shown in Figure 1c. Furthermore, few existing recognition algorithms are designed to reliably detect and estimate the poses of such textureless objects once they have been assembled into composite structures as shown in Figure 1d. Even if the application allowed for it, augmenting the parts with 2D planar markers is still insufficient for precise pose estimation due to the small size of the parts and the range at which they need to be observed.

While object recognition for these dense, textureless scenes is a challenging problem, we can take advantage of the inherent constraints imposed by physical environments to find a solution. In particular, many tasks only involve manipulation of a small set of known objects, the poses of these objects evolve continuously over time [29], and often multiple views of the scene are available.

In this paper, we support this line of attack by describing the process of adapting a state-of-the-art RGBD object classification algorithm [1] to support robot manipulation. We show that by redesigning this algorithm to compute efficient semantic segmentation on cluttered scenes containing a small number of known objects, we can significantly improve a standard RANSAC-based pose estimation algorithm by exploiting semantic labels of the scene.

In the process of creating this algorithm we have introduced three critical innovations. First, we have adapted our previous state-of-the-art feature-pooling-based architecture [1] to operate efficiently enough for on-line robotic use on small sets of known objects. Second, we have created and tested two scene parsing variants that make use of the semantic segmentation provided by feature pooling and improve the existing RANSAC-based recognition method [11] on highly cluttered and occluded scenes. Lastly, we quantitatively and qualitatively evaluate this hybrid algorithm on a new dataset containing complex, dense configurations of textureless objects in cluttered and occluded scenes. On this dataset, our method demonstrates a dramatic improvement in pose estimation compared with the RANSAC-based algorithms without object class segmentation.

The remainder of this paper is organized as follows. Sec. 2 provides a review of object instance detection and pose estimation. Sec. 3 introduces a new hybrid algorithm which performs scene semantic segmentation and object pose recognition. Experiments are presented in Sec. 4 and we conclude the paper in Sec. 5.

2 Related Work

Most existing 3D object recognition and pose estimation methods [13, 10, 17, 14, 15, 18] employ robust local feature descriptors such as SIFT [20] (2D) and SHOT/CSHOT [12] (3D) to reduce the search space of object hypotheses, combined with constraints imposed by 3D object structures. In particular, Hough voting [13, 10, 9] has been applied to find hypotheses that preserve consistent geometric configurations of matched feature points on each model. Hand-crafted global feature descriptors which model partial views have also been examined [17, 15] to filter hypotheses. A more principled framework is proposed in [5, 14] to select the optimal subset of hypotheses yielding a solution that is globally consistent with the scene while handling the interactions among objects. Other approaches rely only on simple geometric features [25, 11, 28] for fast model-scene registration. However, this line of work suffers from the limited power of local features to robustly infer accurate object poses, especially in the context of textureless objects, occlusions, foreground and background clutter, and large viewpoint variations.

Another line of work exploits detection results on 2D images for pose estimation. One representative is the LINE-MOD system [25], which uses gradient templates to match sliding windows to object partial views and initialize Iterative Closest Point (ICP) for pose refinement. This template-based design does not capture fine-grained visual cues between similar objects and does not scale well to multiple object instances which occlude and/or are in close contact with each other. Furthermore, the precision of LINE-MOD's similarity measure decreases linearly with the percentage of occlusion [30]. Additionally, some unsupervised segmentation techniques [6, 7] partition the scene into different objects without any prior model knowledge.


Fig. 2: Illustration of failure cases of ObjRecRANSAC. From left to right: the testing scene, the estimated poses from ObjRecRANSAC, and the ground truth.

The approach of [29] updates scene models for generic shapes like boxes and balls over time. These methods are hard to generalize to segmenting objects with arbitrary shapes.

Recently, the use of deep convolutional architectures [3, 4, 22, 2, 21, 16] and the availability of huge datasets with hundreds to thousands of object classes have led to rapid and revolutionary developments in large-scale object recognition. Although these frameworks significantly boost object classification performance, accurate and efficient pose estimation still remains an unsolved problem. Our recent work [1] shows how color pooling, within a convolutional architecture, is effective for developing robust object representations which are insensitive to out-of-plane rotations. This is the foundation of the proposed method in this paper, which combines the advantages of a pose-insensitive convolutional architecture with an efficient pose estimation method [11].

2.1 RANSAC-Based Object Pose Recognition

In this section, we briefly review an efficient pose estimation algorithm originally reported in [11]. We use it as one option for object registration in our pipeline due to its efficiency and robustness to complex occlusions. The reference implementation of this algorithm is called "ObjRecRANSAC" and is available for academic use under an open-source license.¹

ObjRecRANSAC is designed to perform fast object pose prediction using oriented point pair features ((p_i, n_i), (p_j, n_j)), where p_i, p_j are the 3D positions of the points and n_i, n_j are their associated surface normals. In turn, a simple descriptor f(i, j) is defined by:

f(i, j) = \big( \|p_i - p_j\|,\ \angle(n_i, n_j),\ \angle(n_i, p_j - p_i),\ \angle(n_j, p_j - p_i) \big)    (1)

where ∠(a, b) denotes the angle between a and b. A hash table is then constructed for fast matching of point pairs from object models to the scene. We refer the reader to [11] for more details.

¹ See http://github.com/tum-mvp/ObjRecRANSAC.git for the reference implementation of [11].

Fig. 3: Overview of the hybrid algorithm for object detection and pose estimation.
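
As a concrete illustration of Eq. (1), the sketch below computes the pair descriptor with plain standard-library types; it is not taken from the ObjRecRANSAC code, and the helper names (pairFeature, angleBetween) are our own.

```cpp
#include <algorithm>
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;

static double dot(const Vec3& a, const Vec3& b) { return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]; }
static double norm(const Vec3& a)               { return std::sqrt(dot(a, a)); }
static Vec3   sub(const Vec3& a, const Vec3& b) { return {a[0]-b[0], a[1]-b[1], a[2]-b[2]}; }

// Angle between two vectors in radians, clamped for numerical safety.
static double angleBetween(const Vec3& a, const Vec3& b) {
  const double c = std::clamp(dot(a, b) / (norm(a) * norm(b)), -1.0, 1.0);
  return std::acos(c);
}

// Eq. (1): the 4D descriptor of an oriented point pair ((p_i, n_i), (p_j, n_j)).
// In ObjRecRANSAC such descriptors key a hash table built from model point pairs.
std::array<double, 4> pairFeature(const Vec3& pi, const Vec3& ni,
                                  const Vec3& pj, const Vec3& nj) {
  const Vec3 d = sub(pj, pi);            // p_j - p_i
  return { norm(d),                      // ||p_i - p_j||
           angleBetween(ni, nj),         // angle between the two normals
           angleBetween(ni, d),          // angle between n_i and the pair direction
           angleBetween(nj, d) };        // angle between n_j and the pair direction
}
```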

In ObjRecRANSAC, only oriented point pair features with a fixed predefined distance d are used for RANSAC sampling. This prevents the algorithm from recognizing scenes composed of objects with significantly different characteristic lengths. If one object has a highly eccentric shape, it is best localized by sampling point pairs which span its widest axis. This large pair separation, however, prevents any smaller objects in the scene from being recognized. Moreover, for objects situated in cluttered and occluded scenes, the probability of sampling point pairs from single object instances significantly decreases, which leads to a strong degradation in performance. One failure case of ObjRecRANSAC is shown in Figure 2.

From a high-level perspective, the information needed to improve recognition accuracy in heterogeneous scenes is object class membership. If such class membership could be determined independently from object pose, it could be used to partition the input data into independent RANSAC pipelines, each specifically and optimally parameterized. Semantic segmentation techniques are well-suited to provide this crucial information.

3 Mid-Level Perception and Robust Pose Estimation

This section presents details of a two-stage algorithm in which semantic segmentation first partitions the scene, and then ObjRecRANSAC, or one of its variants, is applied to estimate the poses of instances within each semantic class. Figure 3 shows the flow chart for this hybrid algorithm.
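
As a minimal sketch of this two-stage flow (not the reference implementation), the fragment below wires a semantic segmenter to a per-class registration routine. The types, the function name detectAndEstimate, and the callable parameters are illustrative assumptions; the callables stand in for the labeling of Sec. 3.1 and one of the RANSAC variants of Sec. 3.2.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical minimal types; the real pipeline operates on PCL point clouds.
struct Point3D { float x, y, z; };
struct Pose { float T[4][4]; };          // rigid transform in SE(3)
using Cloud = std::vector<Point3D>;

// Two-stage hybrid pipeline (Fig. 3): a semantic segmenter splits the scene
// into per-class sub-clouds, then a registration routine is run on each class
// independently.
std::map<std::string, std::vector<Pose>> detectAndEstimate(
    const Cloud& scene,
    const std::function<std::map<std::string, Cloud>(const Cloud&)>& segmentByClass,
    const std::function<std::vector<Pose>(const std::string&, const Cloud&)>& registerClass) {
  std::map<std::string, std::vector<Pose>> results;
  for (const auto& [objClass, points] : segmentByClass(scene)) {
    // Each class gets its own RANSAC pass, parameterized for that object.
    results[objClass] = registerClass(objClass, points);
  }
  return results;
}
```

In the actual pipeline, the per-class registration is one of the B, GB or GO variants described in Sec. 3.2, parameterized for the geometry of that object class.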


Fig. 4: Illustration of the feature extraction for semantic segmentation, including from right to left: convolution of local feature codes (a→b), color pooling (b→c), integral image construction (c→d), and final feature concatenation (d→e).

3.1 Semantic Scene Segmentation using Mid-Level Visual Cues

There are three major challenges to achieving accurate and efficient semantic segmentation. First, robust and discriminative features need to be learned to distinguish different object classes, even those with similar textureless appearances. Second, "mid-level" object models should be produced in order to handle clutter and occlusions caused by interactions among objects. Third, state-of-the-art recognition techniques in the large-scale setting typically do not operate at time scales consistent with robot manipulation.

Here, we develop an algorithm for semantic segmentation based on the idea of color pooling in our previous work [1], but modified to make use of integral images to speed up the color pooling for sliding windows in the image domain. This enables the algorithm to perform efficient dense feature extraction in practice. We also detail how we exploit adaptive scales of sliding windows to achieve scale invariance for dense scene classification. The overview of the entire semantic segmentation pipeline is illustrated in Figure 4.

3.1.1 Review of Color Pooling

Pooling, which groups local filter responses within neighborhoods in a certain domain, is a key component in convolutional architectures to reduce the variability of raw input signals while preserving dominant visual characteristics. Color pooling [1] yields features which achieve superior 3D rotation invariance over pooling in the spatial domain.

First, the convolutional feature is constructed as follows. Given a point cloud² P = {p_1, ..., p_n}, we compute the rotationally invariant 3D feature CSHOT [12] for each 3D point p_i. Next, the CSHOT descriptor is divided into color and depth components, f_c and f_d. Dictionaries D = {d_1, d_2, ..., d_K} with K filters for each component are learned via hierarchical K-means over randomly sampled CSHOT features across different object classes. Finally, each CSHOT component f is transformed into a feature code μ = {μ_1, ..., μ_K} by the hard-assignment encoder³, and the final local feature x_i = [μ_c, μ_d] for each p_i is constructed by concatenating the two transformed CSHOT codes μ_c and μ_d. The hard-assignment coding is defined as follows:

² For efficiency, raw point clouds are downsampled via an octree with a leaf size of 0.005 m.

\mu_j = \begin{cases} 1 & : d_j \in \mathcal{N}_1(f) \\ 0 & : d_j \notin \mathcal{N}_1(f) \end{cases}    (2)

where N_1(f) returns the set containing the first nearest neighbor of the CSHOT component f in dictionary D. The convolution and encoding process is shown in stage (a→b) of Fig. 4.
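
A minimal sketch of the hard-assignment encoder in Eq. (2), assuming a brute-force nearest-neighbor search over the dictionary; the name encodeHard is illustrative and not taken from the paper's implementation.

```cpp
#include <cstddef>
#include <vector>

// Squared Euclidean distance between two equally sized feature vectors.
static double sqDist(const std::vector<double>& a, const std::vector<double>& b) {
  double s = 0.0;
  for (std::size_t k = 0; k < a.size(); ++k) { const double d = a[k] - b[k]; s += d * d; }
  return s;
}

// Eq. (2): map a CSHOT component f to a K-dimensional one-hot code indicating
// its nearest dictionary filter. Assumes a non-empty dictionary.
std::vector<double> encodeHard(const std::vector<double>& f,
                               const std::vector<std::vector<double>>& dictionary) {
  std::size_t best = 0;
  double bestDist = sqDist(f, dictionary[0]);
  for (std::size_t k = 1; k < dictionary.size(); ++k) {
    const double d = sqDist(f, dictionary[k]);
    if (d < bestDist) { bestDist = d; best = k; }
  }
  std::vector<double> mu(dictionary.size(), 0.0);
  mu[best] = 1.0;                 // one-hot code for the nearest filter
  return mu;
}
```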

Next, we pool features over the LAB color space S because it achieves better recognition performance than both RGB and HSV, based on our experiments reported in [1]. For a set of pooling regions R = {R_1, ..., R_m}, where each R_j (1 ≤ j ≤ m) occupies a certain subspace of S, the pooled feature vector y_j associated with R_j is computed by sum pooling over the local feature codes {x_i}:

y_j = \sum_i x_i \cdot \mathbb{1}(c_i \in R_j)    (3)

where 1(·) is the indicator function that decides whether the color signature c_i (a LAB value) at p_i falls in R_j. We choose sum pooling instead of max pooling (used in [1]) because it is computationally expensive to compute maximum values over the integral structure. Lastly, each y_j is L2-normalized in order to suppress noise [2], and the pooled feature vector Y = [y_1, ..., y_m] is constructed as the final representation for the given point cloud.
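
The sketch below illustrates the sum pooling of Eq. (3) together with the L2 normalization, treating each pooling region as a predicate over LAB values; the names poolByColor and inRegion are assumptions for illustration, not the paper's code.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

struct Lab { double L, a, b; };

// Eq. (3): sum-pool the local feature codes x_i into one pooled vector per
// LAB pooling region R_j, then L2-normalize each pooled vector.
std::vector<std::vector<double>> poolByColor(
    const std::vector<std::vector<double>>& codes,                 // x_i, one per point
    const std::vector<Lab>& colors,                                // c_i, one per point
    const std::vector<std::function<bool(const Lab&)>>& inRegion)  // test for each R_j
{
  const std::size_t dim = codes.empty() ? 0 : codes.front().size();
  std::vector<std::vector<double>> pooled(inRegion.size(), std::vector<double>(dim, 0.0));
  for (std::size_t i = 0; i < codes.size(); ++i)
    for (std::size_t j = 0; j < inRegion.size(); ++j)
      if (inRegion[j](colors[i]))                                  // 1(c_i in R_j)
        for (std::size_t k = 0; k < dim; ++k)
          pooled[j][k] += codes[i][k];                             // sum pooling

  for (auto& y : pooled) {                                         // L2 normalization
    double n = 0.0;
    for (double v : y) n += v * v;
    n = std::sqrt(n);
    if (n > 0.0) for (double& v : y) v /= n;
  }
  return pooled;
}
```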

3.1.2 Efficient Computation via Integral Images

Integral images are often used for fast feature computation in real-time object detection, such as [26]. We build the integral image structure for fast dense feature extraction. To do so, we first project each scene point cloud onto a 2D image using the camera's intrinsic parameters⁴. Suppose we obtain the local feature vector x_i (in Sec. 3.1.1) for each p_i. For each pooling region R_j, the corresponding integral image I_j is constructed as follows:

I_j(u, v) = \sum_i x_i \cdot \mathbb{1}(c_i \in R_j \wedge u_i \le u \wedge v_i \le v)    (4)

where (u, v) is a 2D coordinate in the integral image and (u_i, v_i) is the projected 2D location of the 3D point p_i in the point cloud.

³ We replace the soft encoder used in [1] with the hard encoder to speed up the computation.
⁴ In our implementation, a PrimeSense Carmine 1.08 depth sensor is used. We found no difference in performance between using the default camera parameters and manual calibration.

The total complexity to construct all integral images is O((K_d + K_c)WHm), where K_d and K_c are the numbers of codewords for the depth and color components, respectively, and W and H are the width and height of the integral images. Thus, with I_j, the pooled feature y_j(B) for a sliding window B = {u_l, v_l, u_r, v_r} can be computed in O(1):

y_j(B) = I_j(u_l, v_l) + I_j(u_r, v_r) - I_j(u_l, v_r) - I_j(u_r, v_l)    (5)

where (u_l, v_l) and (u_r, v_r) are the 2D coordinates of the top-left and bottom-right corners of window B on the projection of the 3D scene. Stages (c→d) and (d→e) in Fig. 4 show the process of integral image construction and pooled feature extraction, respectively.
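
A minimal sketch of Eqs. (4) and (5) for a single pooling region and a single scalar code channel (the real structure stores one (K_c + K_d)-dimensional vector per cell and one image per region). The index bookkeeping below follows the standard exclusive-corner convention for integral images, which differs slightly from the corner notation of Eq. (5).

```cpp
#include <vector>

struct IntegralImage {
  int W = 0, H = 0;
  std::vector<double> I;   // row-major, W * H cumulative sums

  // Eq. (4): cumulative sum of per-pixel code values over u' <= u and v' <= v.
  // The per-pixel values are assumed to already include the 1(c_i in R_j) filter.
  void build(const std::vector<double>& perPixelCode, int width, int height) {
    W = width; H = height; I.assign(W * H, 0.0);
    for (int v = 0; v < H; ++v)
      for (int u = 0; u < W; ++u) {
        double s = perPixelCode[v * W + u];
        if (u > 0)          s += I[v * W + (u - 1)];
        if (v > 0)          s += I[(v - 1) * W + u];
        if (u > 0 && v > 0) s -= I[(v - 1) * W + (u - 1)];
        I[v * W + u] = s;
      }
  }

  // Eq. (5): pooled value of window B = {u_l, v_l, u_r, v_r} in O(1).
  double windowSum(int ul, int vl, int ur, int vr) const {
    double s = I[vr * W + ur];
    if (ul > 0)           s -= I[vr * W + (ul - 1)];
    if (vl > 0)           s -= I[(vl - 1) * W + ur];
    if (ul > 0 && vl > 0) s += I[(vl - 1) * W + (ul - 1)];
    return s;
  }
};
```

Building each image is a single O(WH) pass, consistent with the total complexity stated above, after which every sliding window reduces to the four lookups of Eq. (5).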

3.1.3 Scale Invariant Modeling for Object Parts

Modeling object partial views from complete object segments does not account for missing object parts due to occlusion or for outliers from background clutter. To overcome this, we train object models based on generic object parts randomly sampled from object segments at different viewpoints. In order to achieve scale invariance for the learned models, all sampled parts are encompassed by a predefined fixed-size 3D bounding box B. In turn, the sliding windows extracted from the testing scene adopt scales which are consistent with B. Specifically, the scale of the ith sliding window (w_i, h_i) with center (u_i, v_i) is equal to the scale of the bounding box B projected onto the same location:

(w_i, h_i) = \frac{\tilde{f}}{z_i}\,(w_B, h_B)    (6)

where (w_B, h_B) is the predefined (x, y) size of B in 3D, z_i is the depth corresponding to (u_i, v_i), and \tilde{f} is the focal length of the camera. We note that object parts here do not necessarily have specific semantic correspondences. Next, we directly train a discriminative classification model using a linear SVM over object parts, with semantic labels inherited from the corresponding partial views.

Given a new scene, we extract features with adaptive scales for all sliding windows on the integral images. Each window is classified into one of the trained semantic classes and votes for all 3D points it contains. The final semantic label of each 3D point is the class with the most votes.
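
The voting step can be sketched as follows, assuming each classified window already knows which 3D points project inside it; the types Window and labelPointsByVoting are illustrative, not from the released code.

```cpp
#include <cstddef>
#include <vector>

// Every classified sliding window casts one vote of its predicted class for
// each 3D point it contains; each point then takes the class with most votes.
struct Window {
  int predictedClass;                 // SVM output for this window
  std::vector<std::size_t> pointIds;  // indices of 3D points inside the window
};

std::vector<int> labelPointsByVoting(const std::vector<Window>& windows,
                                     std::size_t numPoints, int numClasses) {
  std::vector<std::vector<int>> votes(numPoints, std::vector<int>(numClasses, 0));
  for (const Window& w : windows)
    for (std::size_t id : w.pointIds)
      ++votes[id][w.predictedClass];

  std::vector<int> labels(numPoints, -1);   // -1: no window covered the point
  for (std::size_t i = 0; i < numPoints; ++i) {
    int best = -1, bestCount = 0;
    for (int c = 0; c < numClasses; ++c)
      if (votes[i][c] > bestCount) { bestCount = votes[i][c]; best = c; }
    labels[i] = best;
  }
  return labels;
}
```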


Fig. 5: Illustration of the algorithm pipelines of B, GB and GO.

3.2 Recursive RANSAC-Based Registration for Pose Estimation

Although the semantic segmentation narrows the space of RANSAC sampling down to a single semantic class, the ratio of inlier correspondences may still be small due to multiple adjacent or connected object instances. In this section, we introduce two recursive pipelines that improve the performance of the RANSAC-based registration algorithm detailed in Sec. 2.1 in terms of stability and recall rate. In what follows, we denote the original ObjRecRANSAC as B (short for Batch Matching) and introduce two improved variants, GB and GO.

Greedy-Batch Matching (GB): In this approach, we run ObjRecRANSAC recursively over the parts of the scene that have not been well explained by previously detected models. Specifically, the initial input to ObjRecRANSAC is the set of segmented points P_0 that share the same class label. At the ith round of recognition (i ≥ 1), the working space P_i is constructed by removing the points in P_{i-1} that can be explained by the models M_{i-1} detected at the (i−1)th round:

P_i = \{\, p \mid \min_{m \in M_{i-1}} \|p - m\|_2 > T_d \wedge p \in P_{i-1} \,\}    (7)

where T_d is the threshold (set to 0.01 m) used to determine inliers. The detected models M_{i-1} are transformed point clouds uniformly sampled from the full object meshes. This greedy registration pipeline stops once no more instances are detected, and the final set of estimated poses is the union of all previously detected poses: M_final = ∪_i M_i.
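
The GB recursion can be sketched as below, with ObjRecRANSAC abstracted as a callable that returns detections together with the transformed model points they explain; greedyBatch, Detection and runObjRec are hypothetical names, and the inlier test implements Eq. (7) with brute-force distances.

```cpp
#include <functional>
#include <vector>

struct Point3D { double x, y, z; };
struct Detection {
  std::vector<Point3D> modelPoints;   // transformed model cloud of one detection
};

static double sqDist(const Point3D& a, const Point3D& b) {
  const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return dx * dx + dy * dy + dz * dz;
}

// Greedy-Batch matching: repeatedly run the detector, then drop the points of
// P that lie within T_d of any detected model (Eq. (7)), until nothing is found.
std::vector<Detection> greedyBatch(
    std::vector<Point3D> P,                                            // P_0: one semantic class
    const std::function<std::vector<Detection>(const std::vector<Point3D>&)>& runObjRec,
    double Td = 0.01) {
  std::vector<Detection> all;
  while (!P.empty()) {
    const std::vector<Detection> M = runObjRec(P);                     // detections at round i
    if (M.empty()) break;                                              // stop: nothing detected
    all.insert(all.end(), M.begin(), M.end());

    std::vector<Point3D> next;                                         // P_i per Eq. (7)
    for (const Point3D& p : P) {
      bool explained = false;
      for (const Detection& d : M) {
        for (const Point3D& m : d.modelPoints)
          if (sqDist(p, m) < Td * Td) { explained = true; break; }
        if (explained) break;
      }
      if (!explained) next.push_back(p);
    }
    if (next.size() == P.size()) break;   // no progress; avoid an infinite loop
    P.swap(next);
  }
  return all;
}
```

The GO variant described next differs only in keeping the single highest-confidence detection at each round before removing its explained points and re-running.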

Greedy-One Matching (GO): The GB approach can fail to discover some object instances because false positives in early iterations can lead to false negatives later on. In order to achieve higher precision and recall, we adopt a more conservative greedy approach in which we choose only the detected object candidate with the highest confidence score from ObjRecRANSAC as the detected model M_i at the ith round. The rest follows the same implementation as GB. Simple flow charts for B, GB and GO are illustrated in Figure 5.


4 Experiments

In all our experiments, we use data collected with a PrimeSense Carmine 1.09 depth sensor. We choose 0.03 m as the radius for both normal estimation and the CSHOT descriptor.⁵ The depth and color components in the raw CSHOT feature are decoupled into two feature vectors. Dictionaries with 200 codewords for each component are learned by hierarchical K-means. For the LAB pooling domain, we adopt a 4-level architecture where the gridding over the entire domain at the kth level is performed by equally dividing each channel of LAB into k bins. Therefore, in this 4-level architecture we have ∑_{k=1}^{4} k³ = 100 pooling regions. Pooled features from different levels and domains are concatenated as the final feature vector. Integral images are constructed at 1/5 the size of the original RGB-D frame for efficiency. Sliding windows with a step size of 1 px are extracted on the integral images. For ObjRecRANSAC, we build models separately for each object by setting the oriented point pair feature separation to 60% of the largest diameter of the corresponding object. The rest of the parameters are the same as the defaults used in [11]. Lastly, we capture object partial views under both fixed and random viewpoints as the training data for the SVM classifier in semantic segmentation. Specifically, three data sequences at fixed viewing angles of 30, 45 and 60 degrees as well as one random viewing sequence are captured. This follows the same data collection procedure as the JHUIT-50 dataset [1]. In each partial view, we randomly sample 30 object patches encompassed by a predefined 3D bounding box with size w_B = h_B = 0.03 m (see details in Sec. 3.1.3). This size is also applied to compute the scale of sliding windows on the integral images for testing examples.

Next, unlike the matching function designed for LINE-MOD [25], we introduce a more flexible matching criterion to determine whether an estimated pose is correct. In the task of object manipulation, a good pose estimate needs to achieve high matching accuracy with respect to the 3D geometry, but not the surface texture. This implies that for objects with a certain symmetric structure (rotational and/or reflective), there should exist multiple pose candidates that match the ground truth perfectly. Thus, we design a new distance function between two estimated poses (i.e., 3D transformations in SE(3)) T_1 and T_2 for a model point cloud P_M with N 3D points uniformly sampled from the full object mesh:

D(T_1, T_2; P_M) = \frac{1}{N} \sum_{p_i \in P_M} \mathbb{1}\big( \min_{p_j \in P_M} \|T_1(p_i) - T_2(p_j)\|_2 < \delta_D \big)    (8)

where the threshold δ_D controls the matching degree. Another threshold R_D is used to judge an estimated pose T against the ground truth T_g via the criterion D(T, T_g; P_M) ≥ R_D. We set δ_D = 0.01 and R_D = 0.7 for all our experiments.
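
A brute-force sketch of Eq. (8), assuming a simple rotation-plus-translation pose representation; a k-d tree over the transformed model would accelerate the inner nearest-neighbor search, and the names poseDistance and apply are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct Point3D { double x, y, z; };
struct Pose { double R[3][3]; double t[3]; };    // rotation + translation

static Point3D apply(const Pose& T, const Point3D& p) {
  return { T.R[0][0]*p.x + T.R[0][1]*p.y + T.R[0][2]*p.z + T.t[0],
           T.R[1][0]*p.x + T.R[1][1]*p.y + T.R[1][2]*p.z + T.t[1],
           T.R[2][0]*p.x + T.R[2][1]*p.y + T.R[2][2]*p.z + T.t[2] };
}

static double dist(const Point3D& a, const Point3D& b) {
  const double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
  return std::sqrt(dx*dx + dy*dy + dz*dz);
}

// Eq. (8): fraction of model points which, under T1, land within delta_D of
// the model transformed by T2. Returns a value in [0, 1].
double poseDistance(const Pose& T1, const Pose& T2,
                    const std::vector<Point3D>& PM, double deltaD = 0.01) {
  if (PM.empty()) return 0.0;
  std::vector<Point3D> moved2;                   // model under T2, computed once
  moved2.reserve(PM.size());
  for (const Point3D& p : PM) moved2.push_back(apply(T2, p));

  std::size_t matched = 0;
  for (const Point3D& pi : PM) {
    const Point3D q = apply(T1, pi);
    double dmin = std::numeric_limits<double>::max();
    for (const Point3D& m : moved2) dmin = std::min(dmin, dist(q, m));
    if (dmin < deltaD) ++matched;
  }
  return static_cast<double>(matched) / static_cast<double>(PM.size());
}
```

With this function, a pose T is counted as correct when poseDistance(T, Tg, PM) >= R_D = 0.7 using δ_D = 0.01.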

The algorithm presented in this paper is implemented in C++, and all tests are performed on a desktop with an Intel Xeon E5-2690 CPU at 3.00 GHz.

⁵ The implementations of normal estimation and CSHOT come from the PCL library.


(a) Testing Scene. (b) Semantic Labels. (c) Confidence Map.

Fig. 6: An example of semantic scene segmentation.

4.1 LN-66 Dataset for Textureless Industrial Objects

We first evaluate our method on our new LN-66 dataset, which contains 66 scenes with various complex configurations of the two textureless "link" and "node" objects shown in Figure 1b. We combine the training and testing sequences (corresponding to fixed and random viewpoints) of the "link" and "node" objects in JHUIT-50 [1] as the training data, so that each object has 300 training samples. We note that our algorithm can easily be applied to scenes composed of more than two objects by simply adding more training classes in the semantic classification stage. The LN-66 dataset and the object training data are available at http://cirl.lcsr.jhu.edu/jhu-visual-perception-datasets/. An example testing scene is shown in Figure 6a. There are 6 to 10 example point clouds for each static scene from a fixed viewpoint, where each cloud is the average of ten raw RGB-D images. This gives a total of 614 testing examples across all scenes. In our dataset, the background has been removed from each example by RANSAC plane estimation and by defining workspace limits in 3D space. Background subtraction can also be done with the semantic segmentation stage if object models are trained along with a background class. Therefore, the points in the remaining point cloud only belong to instances of the "link" or "node" objects. However, robust object detection and pose estimation are still challenging in this scenario due to the similar appearances of the objects, clutter, occlusion and sensor noise. To quantitatively analyze our method, we manually label the ground truth object poses for each scene and propagate them to all testing examples. The ground truth poses are then projected onto 2D to generate the ground truth for the semantic segmentation at each frame.

The overall segmentation accuracy is measured as the average ratio of correctly labeled 3D points to all points in a testing scene point cloud. Running the classification algorithm (Sec. 3.1) over all 614 testing frames, the average accuracy of the semantic segmentation reaches 91.2%. One example of semantic scene labeling is shown in Figure 6. The red and blue regions represent the "link" and "node" object classes, respectively. In Figure 6c, we also show the confidence scores returned by the SVM classifier for each class. A brighter color in either red or blue indicates stronger confidence from the corresponding classifier. We can visually observe that the semantic segmentation obtains high classification accuracy.

           Precision (%)    Recall (%)      F-Measure
NS+B       84.47 ± 0.36     61.75 ± 0.27    71.30 ± 0.28
NS+GB      79.88 ± 0.47     79.42 ± 0.37    79.65 ± 0.40
NS+GO      88.63 ± 0.31     83.13 ± 0.32    85.80 ± 0.30
S+B        87.77 ± 0.20     81.31 ± 0.27    84.42 ± 0.22
S+GB       91.89 ± 0.24     89.27 ± 0.19    90.56 ± 0.21
S+GO       94.50 ± 0.16     91.71 ± 0.13    93.09 ± 0.12
GS+B       97.27 ± 0.06     87.03 ± 0.14    91.87 ± 0.08
GS+GB      95.29 ± 0.10     92.33 ± 0.13    93.79 ± 0.11
GS+GO      98.79 ± 0.20     94.33 ± 0.13    96.51 ± 0.16

Table 1: Precision, recall and F-measure reported by different methods on the LN-66 dataset.

S-CSHOT       S-Int         S-Det         B(NS)         GB(NS)        GO(NS)        B(S)          GB(S)         GO(S)
0.39 ± 0.12   0.13 ± 0.03   0.31 ± 0.12   0.86 ± 0.16   1.49 ± 0.40   7.69 ± 3.43   0.85 ± 0.20   2.16 ± 0.68   4.40 ± 1.72

Table 2: Means and standard deviations of the running times (in seconds) of different methods on the LN-66 dataset.

Next, we report the means and standard deviations of precision, recall and F-measure⁶ of our algorithm on LN-66 in Table 1. For comparison, we run experiments for different variants of our algorithm, whose names are formatted as 'S+O'. The first entry 'S' indicates the degree of semantic segmentation used, with three options: 'NS' (no segmentation), 'S' (standard segmentation, Sec. 3.1) and 'GS' (ground truth segmentation). The second entry 'O' stands for the three choices of ObjRecRANSAC: 'B', 'GB' and 'GO'. Due to the randomized process in ObjRecRANSAC, we run 50 trials of each method over all testing data.

From Table 1, we observe that: 1) the semantic segmentation significantly improves all three RANSAC-based pose estimation methods in terms of precision and recall rates; 2) when using the segmentation computed by our algorithm, the RANSAC stage performs only 2-4% behind in the final F-measure compared to using the ground truth segmentation; and 3) both GO and GB are more accurate (higher F-measure) and more stable (smaller standard deviation) than the standard ObjRecRANSAC (B), regardless of whether they are supported by semantic labeling.

Furthermore, we show an example comparison between the different methods in Figure 7, and more results from S+GB are shown in Figure 8. In each sub-figure of Figure 7, the gray points represent the point cloud of the testing scene. The estimated poses for the "link" and "node" objects are shown as yellow and blue meshes, respectively. We can see that methods that work on semantically segmented scenes achieve a noticeable improvement over the ones without scene classification. In addition, the computed semantic segmentation yields similar results ((d), (e), (f) in Figure 7) compared with the ground truth ((g), (h), (i) in Figure 7), which shows the effectiveness of our semantic segmentation algorithm. Also, GO and GB outperform B whether or not semantic segmentation is used. From Figure 8, we can see that S+GB reliably detects and estimates object poses in cluttered and occluded scenes. Finer pose refinement can be made by incorporating physical constraints between adjacent objects.

⁶ F-measure is a joint measure computed from precision and recall as 2·precision·recall / (precision + recall).

(a) NS+B (b) NS+GB (c) NS+GO
(d) S+B (e) S+GB (f) S+GO
(g) GS+B (h) GS+GB (i) GS+GO

Fig. 7: An example comparison of the poses estimated by different methods.

Fig. 8: Example results of S+GB on LN-66. The left, middle and right columns show the testing scenes, segmentation results and estimated poses, respectively.

Finally, Table 2 reports the means and standard deviations of the running times of all main modules in the semantic segmentation, as well as of B, GB and GO in two contexts: (S) and (NS), indicating with and without semantic segmentation, respectively. For semantic segmentation, we evaluate all three components: CSHOT extraction (S-CSHOT), integral image construction (S-Int) and classification of sliding windows (S-Det). From Table 2, we can see that the semantic segmentation runs efficiently compared to the overall runtime of the algorithm. Furthermore, all three sub-stages can be trivially parallelized and dramatically accelerated with GPU-based implementations. We also observe that the semantic segmentation reduces the runtime of GO(NS) by half because it decreases the number of RANSAC hypotheses in this greedy approach. For pose estimation, the two proposed greedy approaches GB and GO are slower than the standard B due to multiple runs of ObjRecRANSAC. Additionally, GB performs only slightly worse than GO (shown in Table 1) while being much more efficient. These times were computed for the CPU-based implementation of ObjRecRANSAC and do not use the GPU-accelerated implementation, which is already available under the same open-source license. The choice among these three methods in practice can be decided based on the specific performance requirements of a given application.


Although the overall runtime of the entire perception system is more than 1 s even without the semantic segmentation, this need not be a major obstacle to integrating our algorithm into a real-time robotic system. First, GPU-based parallel programming techniques could significantly speed up the current implementation. Second, standard object tracking methods [8] can be initialized by our algorithm to track object poses in real time, and reinitialized when they fail.

5 Conclusion

In this paper, we present a novel robot perception pipeline which brings the advantages of state-of-the-art large-scale recognition methods to constrained rigid object registration for robotics. Mid-level visual cues are modeled via an efficient convolutional architecture built on integral images and are in turn used for robust semantic segmentation. Additionally, two greedy approaches are introduced to further augment the RANSAC sampling procedure for pose estimation within the constrained semantic classes. This pipeline effectively bridges the gap between powerful large-scale vision techniques and task-dependent robot perception.

Although the recursive nature of our modification to RANSAC-based registration slows down the entire detection process, it achieves a better trade-off between time and accuracy. Moreover, the most suitable strategy can be selected adaptively according to the requirements of the robotic system in practice. We believe that this approach can be used with other registration algorithms which are hindered by large search spaces, and we plan to investigate other such compositions in the future.

Acknowledgment

This work is supported by the National Science Foundation under Grant No. NRI-1227277 and the National Aeronautics and Space Administration under Grant No. NNX12AM45H.

References

1. Li, C., Reiter, A., Hager, G.D.: Beyond Spatial Pooling: Fine-Grained Representation Learning in Multiple Domains. In: CVPR, 2015.
2. Bo, L., Ren, X., Fox, D.: Unsupervised feature learning for RGB-D based object recognition. In: ISER, 2013.
3. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS, 2012.
4. Socher, R., Huval, B., Bhat, B., Manning, C.D., Ng, A.Y.: Convolutional-Recursive Deep Learning for 3D Object Classification. In: NIPS, 2012.
5. Aldoma, A., Tombari, F., Di Stefano, L., Vincze, M.: A Global Hypotheses Verification Method for 3D Object Recognition. In: ECCV, 2012.
6. Richtsfeld, A., Morwald, T., Prankl, J., Zillich, M., Vincze, M.: Segmentation of unknown objects in indoor environments. In: IROS, 2012.
7. Uckermann, A., Haschke, R., Ritter, H.: Real-time 3D segmentation for human-robot interaction. In: IROS, 2013.
8. Pauwels, K., Ivan, V., Ros, E., Vijayakumar, S.: Real-time object pose recognition and tracking with an imprecisely calibrated moving RGB-D camera. In: IROS, 2014.
9. Woodford, O.J., Pham, M.T., Maki, A., Perbet, F., Stenger, B.: Demisting the Hough transform for 3D shape recognition and registration. IJCV, 2014.
10. Drost, B., Ulrich, M., Navab, N., Ilic, S.: Model globally, match locally: Efficient and robust 3D object recognition. In: CVPR, 2010.
11. Papazov, C., Burschka, D.: An efficient RANSAC for 3D object recognition in noisy and occluded scenes. In: ACCV, 2010.
12. Tombari, F., Salti, S., Di Stefano, L.: A combined texture-shape descriptor for enhanced 3D feature matching. In: ICIP, 2011.
13. Knopp, J., Prasad, M., Willems, G., Timofte, R., Van Gool, L.: Hough transform and 3D SURF for robust three dimensional classification. In: ECCV, 2010.
14. Aldoma, A., Tombari, F., Prankl, J., Richtsfeld, A., Di Stefano, L., Vincze, M.: Multimodal cue integration through Hypotheses Verification for RGB-D object recognition and 6DOF pose estimation. In: ICRA, 2013.
15. Xie, Z., Singh, A., Uang, J., Narayan, K.S., Abbeel, P.: Multimodal blending for high-accuracy instance recognition. In: IROS, 2013.
16. Gupta, S., Girshick, R., Arbelaez, P., Malik, J.: Learning rich features from RGB-D images for object detection and segmentation. In: ECCV, 2014.
17. Tang, J., Miller, S., Singh, A., Abbeel, P.: A textured object recognition pipeline for color and depth image data. In: ICRA, 2012.
18. Fischer, J., Bormann, R., Arbeiter, G., Verl, A.: A feature descriptor for texture-less object representation using 2D and 3D cues from RGB-D data. In: ICRA, 2013.
19. Macias, N., Wen, J.: Vision guided robotic block stacking. In: IROS, 2014.
20. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV, 2004.
21. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR, 2014.
22. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. In: ICML, 2014.
23. Lai, K., Bo, L., Ren, X., Fox, D.: A large-scale hierarchical multi-view RGB-D object dataset. In: ICRA, 2011.
24. Singh, A., Sha, J., Narayan, K.S., Achim, T., Abbeel, P.: BigBIRD: A large-scale 3D database of object instances. In: ICRA, 2014.
25. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: ACCV, 2012.
26. Viola, P., Jones, M.: Robust real-time object detection. IJCV, 2001.
27. Niekum, S., Osentoski, S., Konidaris, G., Chitta, S., Marthi, B., Barto, A.G.: Learning grounded finite-state representations from unstructured demonstrations. IJRR, 2014.
28. Rusu, R.B., Bradski, G., Thibaux, R., Hsu, J.: Fast 3D recognition and pose using the viewpoint feature histogram. In: IROS, 2010.
29. Hager, G.D., Wegbreit, B.: Scene parsing using a prior world model. IJRR, 2011.
30. Hinterstoisser, S., Cagniart, C., Ilic, S., Sturm, P., Navab, N., Fua, P., Lepetit, V.: Gradient response maps for real-time detection of textureless objects. PAMI, 2012.
31. Lindsey, Q., Mellinger, D., Kumar, V.: Construction with quadrotor teams. Autonomous Robots, 2012.
32. Bohren, J., Papazov, C., Burschka, D., Krieger, K., Parusel, S., Haddadin, S., Shepherdson, W.L., Hager, G.D., Whitcomb, L.L.: A pilot study in vision-based augmented telemanipulation for remote assembly over high-latency networks. In: ICRA, 2013.