

Object Codetection from Mobile Robot Video

Scott Alan Bronikowski, Daniel Paul Barrett, and Jeffrey Mark Siskind
Purdue University
School of Electrical and Computer Engineering
465 Northwestern Avenue
West Lafayette, IN 47907-2035, USA
Email: {sbroniko,dpbarret,qobi}@purdue.edu

Abstract— We present a method for detecting, localizing, and labeling objects encountered by a ground-based mobile robot in its environment. This method does not require any object detectors or models, which allows it to codetect objects that have never been seen. Our experimental results show that our system can detect 90% of the objects presented, localize such with a mean accuracy of 16.5cm, and label such with an accuracy of at least 74.1%.

I. INTRODUCTION

We present a method for detecting, localizing, and labeling novel objects encountered by a mobile robot in its environment. A crucial aspect of our method is that it does not require any pretrained object detectors or models, allowing it to detect, localize, and label objects that have never been seen. An established computer-vision community studies codetection, detecting, localizing, and labeling previously unseen objects in images (e.g., [1], [2], [3], [4]) and video (e.g., [5], [6], [7], [8]). Our work differs from this prior work in three crucial ways.

a) We process a video feed from a ground-based mobile robot. The properties of this video are very different from that of static images or video from a stationary camera at eye level. We capitalize on these properties, particularly that we get multiple views of the same objects as the robot moves.

b) We avail ourselves of the odometry and IMU information from the robot, integrating such into the codetection process.

c) The above allow localization of the objects in a 3D world coordinate system, not just the 2D image frame.

We use a general-purpose object proposal-generation mechanism [9], a variety of similarity measures between proposals [10], and multiple graphical models [11] to determine which proposals are most likely to be objects, locate such objects in the world coordinate system, and find consistent object class labels across several different floor plans, all without human intervention.

This research was sponsored, in part, by the Army Research Laboratory, accomplished under Cooperative Agreement Number W911NF-10-2-0060, and by the National Science Foundation under Grant No. 1522954-IIS. The views, opinions, findings, conclusions, and recommendations contained in this document are those of the authors and should not be interpreted as representing the official policies, either express or implied, of the Army Research Laboratory, the National Science Foundation, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein.

We present experiments conducted on a custom-built robot that demonstrate the feasibility of our approach. Our system can detect 90% of the objects presented, localize such with a mean accuracy of 16.5cm, and label such with an accuracy of at least 74.1%.

II. RELATED WORK

To the best of our knowledge, codetection of objects from the video feed of a mobile robot has not yet been done. Most existing codetection methods (e.g., [1], [2], [3], [4], [5], [6], [7]) operate on images or video taken from a stationary human-centric point of view, require the object of interest to be prominent and close to the center of the field of view, and localize objects only within the 2D image frame. Conversely, our method operates on video from a ground-based mobile robot, which has a visual perspective that is both moving and much closer to the ground. Our system is also able to both detect small objects that are off-center and localize objects in both the 2D image and 3D world frames.

The codetection approach of Blaschko et al. [1] uses weakly-annotated data to detect and localize objects by training SVM classifiers for each object class independently. This weak annotation can be either a simple binary indication of the presence or absence of an object or an indication of an object location without precise bounding box coordinates. Additionally, the fact that the training data is split into individual classes is another form of annotation of the dataset. Our approach uses no human annotation and does not require independent training of object classifiers.

Lee and Grauman [2] present an iterative clustering approach to learn object models individually over many passes through an image corpus. Their system estimates the likelihood both that an image region contains an object and that nearby regions in that image contain instances for which their system has previously-trained models. This requires them to seed their system with a small number of pretrained object models to start, to which their system adds the models learned in each iteration. Our system, on the other hand, does not require pretrained object models and can classify multiple object classes simultaneously.


Rubinstein et al. [3] discover and segment out common objects from diverse image collections. Their system takes as input collections of images retrieved from Internet search, where each collection comprises the results of a search for a specific object, such as a car. They are able to automatically localize the common object in each image or decide that an image does not contain such an object. Like the approach of Blaschko et al. [1], their approach requires the manual separation of these image collections by object class; they cannot learn multiple object models simultaneously.

Tang et al. [4] perform codetection by solving a joint image-box formulation for each object class. Like Blaschko et al. [1] and Rubinstein et al. [3], their system operates on data that has been manually grouped by object class, where each image in a set either contains the object of interest or is considered noise because it does not contain such an object. Our system uses an unsorted collection of object images and is able to separate them into object-based groups automatically.

Prest et al. [5] describe an approach for codetection in videos which uses motion segmentation to create spatio-temporal tubes that identify candidate objects. Like several previously-mentioned approaches (e.g., [1], [3], [4]), this approach requires input data that is grouped by class and thus must learn each object individually. Additionally, this approach also requires each input video to have a binary annotation indicating whether or not it contains the class of interest. Our approach does not require such grouping or annotation and learns multiple objects at once.

Schulter et al. [6] present a system which discovers and segments unknown objects in videos through the use of optical flow to estimate a motion segmentation. From this they learn appearance models by clustering both superpixels and bounding boxes. They can thus classify multiple objects simultaneously, as we can, although they have fewer object classes (4) than we do (6).

Joulin et al. [7] build upon their previous work [4] to extend such work to video in addition to images. This new approach incorporates temporal consistency between the frames of a video. However, like their previous work [4], this approach also requires input data that is manually grouped by object class. Our approach requires no such manual input.

Unlike most previous codetection work, which can only detect objects that are large, prominent, and centered in the field of view, Srikantha and Gall [8] present an approach which can detect and localize objects which are small and off-center. However, their approach relies upon human pose estimation and thus can only detect objects with which humans interact. Our approach can also detect small and off-center objects, but is independent of human activity within a video.

III. OVERALL CONCEPT

The work presented here is a component of our larger effort to automatically drive—and learn to drive—a mobile robot under natural-language command [12]. This effort learns the meanings of the nouns and prepositions in sentential descriptions of paths driven by the robot as measured with odometry. These learned word meanings support two kinds of subsequent natural-language interaction with the robot: generation of new sentential descriptions of newly driven paths and automatically driving the robot to achieve navigational goals specified in natural-language instructions.

Fig. 1. The custom mobile robot used in our experiments.

Our current approach to this effort requires, as input, a manually-crafted floor plan consisting of a set of 2D object coordinates, each labeled with an object class. The objective of the work described here is to eliminate the need for such human intervention, producing all information in the floor plan automatically from sensor input. Different floor plans can contain objects of the same class at different coordinates. Our learning algorithm learns a mapping from nouns, such as box, to object class labels, such as OBJECT3. As such, it is imperative that the class labels produced automatically by our codetection process be consistent across all floor plans.

IV. EXPERIMENTAL PLATFORM

In order to conduct our experiments, we built a custom mobile robot, or rover, shown in Fig. 1. The rover measures 45cm long by 33cm wide, and the wheels are 12cm tall. We use the front-facing camera on the rover to collect observations of objects. Through the use of an Extended Kalman Filter [13], which takes inertial guidance data from an IMU and odometry data from motor shaft encoders, the rover is able to localize itself in real time. Such localization is not perfectly accurate due to sensor noise, wheel slippage, and other mechanical factors, but is generally within 15–20cm (or approximately one-half the length or width of the rover) of the actual location.
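To make the sensor-fusion step concrete, the following is a minimal sketch of a planar pose EKF in Python, assuming a state of (x, y, heading), a velocity/turn-rate odometry model, and a direct heading measurement derived from the IMU; the rover's actual filter state and noise parameters are not specified here, so all names and values in the sketch are illustrative.

```python
import numpy as np

# Minimal planar EKF sketch (illustrative only): state s = [x, y, theta].
# Prediction uses odometry (forward speed v, turn rate w over time step dt);
# the update fuses a heading measurement such as one derived from the IMU.

def ekf_predict(s, P, v, w, dt, Q):
    x, y, th = s
    s_pred = np.array([x + v * dt * np.cos(th),
                       y + v * dt * np.sin(th),
                       th + w * dt])
    # Jacobian of the motion model with respect to the state.
    F = np.array([[1.0, 0.0, -v * dt * np.sin(th)],
                  [0.0, 1.0,  v * dt * np.cos(th)],
                  [0.0, 0.0,  1.0]])
    return s_pred, F @ P @ F.T + Q

def ekf_update_heading(s, P, z_theta, R):
    H = np.array([[0.0, 0.0, 1.0]])                      # observe heading only
    innov = np.array([z_theta - s[2]])
    innov[0] = (innov[0] + np.pi) % (2 * np.pi) - np.pi  # wrap to [-pi, pi]
    S = H @ P @ H.T + R                                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)                       # Kalman gain
    return s + K @ innov, (np.eye(3) - K @ H) @ P

if __name__ == "__main__":
    s, P = np.zeros(3), np.eye(3) * 0.01
    Q = np.diag([1e-3, 1e-3, 1e-4])                      # process noise (assumed)
    R = np.array([[1e-2]])                               # heading noise (assumed)
    for _ in range(100):                                 # drive a gentle arc
        s, P = ekf_predict(s, P, v=0.2, w=0.05, dt=0.1, Q=Q)
        s, P = ekf_update_heading(s, P, z_theta=s[2], R=R)
    print("estimated pose (x, y, theta):", s)
```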

To collect data, a human teleoperator drives the rover within a floor plan (Fig. 2) in order to accomplish a navigational goal specified by a sentence, such as "The robot went in front of the stool then went away from the box which is right of another box then went left of the table then went behind the box which is left of another box then went right of the table." During such operation, the video feed, localization data, and all other sensor and actuator data are logged in a synchronized time-stamped format. It is this synchronization of data which allows us to find real-world locations for objects observed in video (see Sec. V).

Fig. 2. An example floor plan with a trace of robot motion. This trace satisfies the sentence "The robot went in front of the stool then went away from the box which is right of another box then went left of the table then went behind the box which is left of another box then went right of the table."

The use of a physical robot with noisy real-world sensor data increases the difficulty of our task. The noisy robot position data is densely sampled in space and time.

V. TECHNICAL DETAILS

Our approach to codetection comprises four steps, described in greater detail in the following subsections:

detection Detect objects within video frames. There will necessarily be fewer detections than frames, since objects are not visible in every frame.
localization Find 3D world locations for detected objects.
clustering Perform clustering on the detection locations to find object locations within the floor plan.
labeling Learn class labels that are consistent within and across different floor plans.

A. Detecting and localizing objects

The tasks of detecting and localizing objects are performed simultaneously with a three-step process:

1) Generating scored bounding-box proposals: We first apply an off-the-shelf proposal-generation mechanism [9] to each video frame. This uses general-purpose visual cues such as edges, within-region similarity, and closed contours to place bounding boxes around candidate objects. No class-specific object detectors are used. This allows detecting, localizing, and labeling previously unseen objects.

The proposal-generation mechanism is highly inaccurate; it often produces both false positives and false negatives. To compensate for this, we bias the proposal-generation mechanism to overgenerate, producing ten proposals per frame in an attempt to reduce false negatives at the expense of false positives. These are filtered out by subsequent processing.

The proposal-generation mechanism associates a score with each proposal. This is nominally intended to give an indication of the degree to which the proposal actually encloses an object. We find, however, that this score, taken by itself, is not a reliable indicator. Thus we need to integrate this score with other sources of information and aggregate it over many samples in order to make reliable determinations.

The raw proposal score b is in [0,1]. We modify b by applying penalty terms that encode the additional sources of information. Because the video feed from our robot is time stamped and synchronized with localization data from odometry and the IMU, each video frame is associated with the camera location in the world coordinate frame. The camera location, together with camera calibration and perspective geometry [14], constrains every image pixel to a world location on a ray of known angle emanating from the known camera center. The world location (w_x, w_y, w_z) and world width w_w of an object proposal can be derived from this under the assumption that the object is resting on the ground. This assumption corresponds to constraining the bottom edge of the proposal bounding box to have w_z = 0. This location is then used to compute a penalty applied to the raw proposal score.
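To illustrate this back-projection, the Python sketch below intersects the viewing ray through the bottom-center pixel of a proposal with the ground plane w_z = 0, assuming a pinhole camera with known intrinsics K, a known camera-to-world rotation, and a known camera center; the calibration values in the example are placeholders rather than the rover's actual calibration.

```python
import numpy as np

def proposal_world_location(box, K, R_wc, cam_center):
    """Intersect the ray through the bottom-center pixel of a bounding box
    with the ground plane z = 0, giving the proposal's world location.

    box        -- (x_min, y_min, x_max, y_max) in pixels
    K          -- 3x3 pinhole intrinsic matrix
    R_wc       -- 3x3 rotation taking camera coordinates to world coordinates
    cam_center -- camera center in world coordinates (from odometry/EKF)
    """
    x_min, y_min, x_max, y_max = box
    u, v = 0.5 * (x_min + x_max), y_max         # bottom-center pixel
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray_world = R_wc @ ray_cam                   # ray direction in world frame
    if abs(ray_world[2]) < 1e-9:
        return None                              # ray parallel to the ground
    t = -cam_center[2] / ray_world[2]            # solve cam_center + t*ray, z = 0
    if t <= 0:
        return None                              # intersection behind the camera
    return cam_center + t * ray_world

def proposal_world_width(box, K, R_wc, cam_center):
    """Approximate world width from the two bottom corners of the box."""
    x_min, y_min, x_max, y_max = box
    left = proposal_world_location((x_min, y_min, x_min, y_max), K, R_wc, cam_center)
    right = proposal_world_location((x_max, y_min, x_max, y_max), K, R_wc, cam_center)
    if left is None or right is None:
        return None
    return float(np.linalg.norm(right[:2] - left[:2]))

if __name__ == "__main__":
    # Placeholder calibration: 640x480 image, ~60 degree FOV, camera 0.3 m high.
    K = np.array([[554.0, 0.0, 320.0], [0.0, 554.0, 240.0], [0.0, 0.0, 1.0]])
    R_wc = np.array([[0.0, 0.0, 1.0],    # camera looks along world +x,
                     [-1.0, 0.0, 0.0],   # image x maps to world -y,
                     [0.0, -1.0, 0.0]])  # image y maps to world -z
    cam_center = np.array([0.0, 0.0, 0.3])
    print(proposal_world_location((300, 200, 340, 260), K, R_wc, cam_center))
```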

The penalty is composed from five factors, each encoding a different undesirable object property. These penalties allow construction of a unary factor f, used by the graphical model (Sec. V-A.3), from the raw proposal score b:

f = b e^{-(p_z + p_p + p_w + p_x + p_y)}

Note that f ∈ [0, 1] since b ∈ [0, 1]. Proposals that do not exhibit specific undesirable properties have an associated parameter p = 0.

We first observe that proposals whose bottom edge is above the horizon line have a world coordinate behind the robot. Such proposals are penalized with p_z = 10.

We next observe that the proposal-generation mechanism tends to give high scores to proposals that occupy a large fraction of the field of view, in part, because it considers the image perimeter to satisfy the contour closure criterion. To avoid such, we then penalize proposals that are too close to the image perimeter or are too wide or tall. Proposals that are close to any two image boundaries, ones that are close to any single image boundary and also exceed a specified height or width, and ones that exceed both the specified height and width are penalized with p_p = 10.

We next observe that the proposal-generation mechanism tends to give high scores to proposals with unduly high width in the world coordinate frame. Thus we penalize proposals with world width w_w > t_w with a penalty p_w = w_w − t_w.

We finally penalize proposals that reside outside our floor plan. Proposals with world coordinate w_x outside the boundary X are penalized with p_x = |w_x − X| and those with world coordinate w_y outside the boundary Y are penalized with p_y = |w_y − Y|.
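A compact Python sketch of how these penalties might combine into the unary factor f is given below; the paper specifies only the forms above (with p_z = p_p = 10), so the thresholds, margins, and the exact reading of the perimeter and boundary tests here are illustrative assumptions.

```python
import math

def unary_factor(b, wx, wy, wz, ww, box, img_w, img_h,
                 t_w=0.6, X=3.0, Y=2.0, margin=10, max_frac=0.8):
    """Combine the raw proposal score b with the five penalty terms,
    f = b * exp(-(p_z + p_p + p_w + p_x + p_y)).
    All threshold values here (t_w, X, Y, margin, max_frac) are illustrative."""
    x_min, y_min, x_max, y_max = box

    # p_z: no valid ground intersection (bottom edge above the horizon,
    # i.e., the implied world location is behind the robot).
    p_z = 10.0 if wz is None else 0.0

    # p_p: too close to the image perimeter and/or too large.
    near = [x_min < margin, y_min < margin,
            img_w - x_max < margin, img_h - y_max < margin]
    too_wide = (x_max - x_min) > max_frac * img_w
    too_tall = (y_max - y_min) > max_frac * img_h
    p_p = 10.0 if (sum(near) >= 2
                   or (any(near) and (too_wide or too_tall))
                   or (too_wide and too_tall)) else 0.0

    # p_w: unduly wide in the world coordinate frame.
    p_w = max(0.0, ww - t_w)

    # p_x, p_y: outside the floor-plan boundary (one reading of
    # p_x = |w_x - X|, applied only when the proposal lies outside).
    p_x = max(0.0, abs(wx) - X)
    p_y = max(0.0, abs(wy) - Y)

    return b * math.exp(-(p_z + p_p + p_w + p_x + p_y))

# Example: a well-placed, modestly sized proposal keeps most of its raw score.
print(unary_factor(0.7, wx=1.2, wy=-0.4, wz=0.0, ww=0.4,
                   box=(200, 180, 320, 300), img_w=640, img_h=480))
```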

2) Measuring similarity between proposals in different frames: We next compute a similarity measure between pairs of proposals k and l in different frames to use as the binary factors g in the graphical model (Sec. V-A.3), as a weighted sum of three terms s_{k,l}, d_{k,l}, and w_{k,l} that denote visual similarity, distance, and width, respectively:

g_{k,l} = \alpha s_{k,l} + \beta d_{k,l} + \gamma w_{k,l}    (1)

All similarity measures are scaled to the range [0,1], so every g score is also in the range [0,1].

The first similarity measure, s_{k,l}, encodes visual similarity. We first compute dense SIFT descriptors [10] for the image portion inside each proposal in each frame. These descriptors are represented as normalized histograms, which allows us to measure similarity with χ² distance. This measure is negated and normalized to [0,1] so that 1 represents a perfect match between the two histograms.

The second similarity measure, d_{k,l}, encodes the Euclidean distance between the world coordinates of two proposed objects, reflecting the constraint that objects viewed from different viewpoints in different video frames can only be the same if they are in the same position. To scale d_{k,l} to [0,1], with 1 representing zero distance, we pass the computed Euclidean distance through a zero-mean Gaussian, whose variance has been empirically determined.

The final similarity measure, w_{k,l}, encodes the difference in the world width of two proposals, reflecting the constraint that our objects exhibit low variance in width when viewed from different viewpoints. Again, we scale w_{k,l} to [0,1] using a zero-mean Gaussian, whose variance has been empirically determined.
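The Python sketch below illustrates one way the three terms of (1) could be computed, assuming each proposal already carries an L1-normalized SIFT histogram, a world location, and a world width; the weights α, β, γ, the Gaussian variances, and the particular negation of the χ² distance are placeholders, since the paper states only that such parameters were determined empirically.

```python
import numpy as np

def chi2_similarity(h1, h2, eps=1e-12):
    """Visual similarity s: 1 minus the (half) chi-squared distance between
    two L1-normalized histograms, so 1 means a perfect match."""
    chi2 = 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
    return 1.0 - min(chi2, 1.0)

def gaussian_similarity(delta, sigma):
    """Map a nonnegative difference to (0, 1], with 1 meaning zero difference."""
    return float(np.exp(-0.5 * (delta / sigma) ** 2))

def binary_factor(p_k, p_l, alpha=0.5, beta=0.3, gamma=0.2,
                  sigma_dist=0.5, sigma_width=0.2):
    """g_{k,l} = alpha*s + beta*d + gamma*w for proposals in different frames.
    Each proposal is a dict with 'hist', 'loc' (world xyz), and 'width'."""
    s = chi2_similarity(p_k["hist"], p_l["hist"])
    d = gaussian_similarity(np.linalg.norm(p_k["loc"] - p_l["loc"]), sigma_dist)
    w = gaussian_similarity(abs(p_k["width"] - p_l["width"]), sigma_width)
    return alpha * s + beta * d + gamma * w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    h = rng.random(128); h /= h.sum()
    a = {"hist": h, "loc": np.array([1.0, 0.5, 0.0]), "width": 0.35}
    b = {"hist": h, "loc": np.array([1.1, 0.45, 0.0]), "width": 0.33}
    print(binary_factor(a, b))   # close to 1 for two very similar proposals
```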

3) Graphical-model formulation and solution: In the context of the video feed from the front-facing camera of a mobile robot that is following natural-language instructions, the bulk of the video feed contains frames with prominent objects in the field of view that correspond to nouns in the instructions. For example, when the robot is following the portion of the instructions that include towards the chair, a chair will be prominent in the field of view. Thus we seek a single proposal per frame for the most prominent object in that frame. That object will tend to persist for the portion of the robot navigation that corresponds to that portion of the instructions and then switch to a different object after that navigational goal is met and the robot begins meeting the navigational goal specified by the next phrase, such as right of the table. Thus the video feed will tend to be partitioned into segments where each segment has the same prominent object. We encode this property of our domain in a graphical model that seeks to find a single prominent object in each frame.

We construct a graphical model with a vertex for each frame that ranges over a set of labels that denote the proposals generated for that frame. Solving this graphical model produces an assignment from vertices to labels selecting a single proposal as depicting the most prominent object in that frame. One problem arises, however. As the robot turns to transition from one navigational goal to the next, there might not be a prominent object in the field of view. Further, as the robot gets closer to a target of certain instructions, like right of the chair, the target object may pass out of the field of view. To allow the solution to the graphical model to represent these cases, we augment the potential label set of each vertex to include a dummy proposal that indicates that no object is prominent in the field of view. The graphical-model formulation, however, requires that there be unary and binary factors for label assignments to the dummy proposal, which does not correspond to an image region and thus does not allow computation of such factors. Thus we set such unary and binary factors to empirically determined nominal values.

Fig. 3. A visualization of the graphical-model framework for proposal selection. Here f_{1,1} represents the unary factor of proposal 1 in frame 1 and g_{(1,1),(1,2)} represents the binary factor between proposal 1 in frame 1 and proposal 1 in frame 2. For clarity, only the binary factors for proposal 1 in frame 1 are labeled and shown in black. All other binary factors are shown in gray.

Our graphical model optimizes the score

\max_{v_1 \in L_1} \cdots \max_{v_T \in L_T} \sum_{i < j} \left( f_{v_i} + g_{v_i, v_j} \right)

where i and j denote frames from a video feed of T frames, v_i denotes the vertex constructed for frame i, L_i denotes the set of proposals generated for frame i, k and l denote particular proposals, f_l denotes the unary factor for proposal l, and g_{k,l} denotes the binary factor for a pair of proposals k and l. A visualization of this is shown in Fig. 3.
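As a concrete illustration, the self-contained Python sketch below scores an assignment of one proposal (or the dummy) per frame under the objective above and improves it by simple coordinate ascent; this greedy search merely stands in for the belief propagation that is actually run through OpenGM, and the nominal dummy-factor values and random factor tables are assumptions.

```python
import itertools
import numpy as np

DUMMY = -1          # label meaning "no prominent object in this frame"
F_DUMMY = 0.1       # nominal unary value for the dummy label (assumed)
G_DUMMY = 0.1       # nominal binary value involving a dummy label (assumed)

def score(assign, unary, binary):
    """Objective: sum over frame pairs i<j of f_{v_i} + g_{v_i, v_j}.
    unary[i][k] is f for proposal k of frame i; binary[(i,k),(j,l)] is g."""
    total = 0.0
    for i, j in itertools.combinations(range(len(assign)), 2):
        vi, vj = assign[i], assign[j]
        total += F_DUMMY if vi == DUMMY else unary[i][vi]
        if vi == DUMMY or vj == DUMMY:
            total += G_DUMMY
        else:
            total += binary[(i, vi), (j, vj)]
    return total

def coordinate_ascent(unary, binary, sweeps=10):
    """Greedy stand-in for belief propagation: repeatedly re-pick the best
    label for each frame with the other frames held fixed."""
    T = len(unary)
    assign = [DUMMY] * T
    for _ in range(sweeps):
        changed = False
        for i in range(T):
            labels = [DUMMY] + list(range(len(unary[i])))
            best = max(labels, key=lambda k: score(assign[:i] + [k] + assign[i+1:],
                                                   unary, binary))
            if best != assign[i]:
                assign[i], changed = best, True
        if not changed:
            break
    return assign

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    T, K = 4, 3                                   # 4 frames, 3 proposals each
    unary = [list(rng.random(K)) for _ in range(T)]
    binary = {((i, k), (j, l)): float(rng.random())
              for i in range(T) for j in range(i + 1, T)
              for k in range(K) for l in range(K)}
    print(coordinate_ascent(unary, binary))
```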

We solve this graphical model using the OpenGM library [11], [15] and its built-in belief propagation [16] algorithm. Some visualizations of these results are shown in Fig. 4. All images are taken from navigational paths driven on a single floor plan. These results show that both false positives (Fig. 4 bottom right) and false negatives (Fig. 4 top right, middle right) can arise. Note, however, that our overall goal is simply to determine the collection of objects present in each floor plan, their world positions, and a unique labeling. For this, it is not necessary to have correct detections of prominent objects in every frame of the video feed. Subsequent processing is resilient to such false positives and negatives because missed and spurious detections are usually corrected in temporally adjacent frames. For example, Fig. 4 (top left, top right) and Fig. 4 (middle left, middle right) depict temporally adjacent frame pairs.

Fig. 4. Visualizations of the solution to the graphical model used to select prominent objects from proposals. The left column shows correct results, while the right column shows failure modes. (top left, middle left) Correct detections of a chair and cone. (bottom left) Correct selection of the dummy proposal when no object is visible. (top right, middle right) False negatives: missed detections of the chair and cone. (bottom right) A false positive: spurious detection of what appears to be the shadow of the table.

B. Clustering detected objects

We gather video feed from multiple navigational paths driven in each floor plan. Since our goal is to find the object locations in each floor plan, we first solve a graphical model for the video feed from each navigation path and plot the floor projections of the world locations of the proposals selected by the graphical model for all navigational paths driven on a particular floor plan. For the floor plan used in Fig. 4, the plot of selected proposals from all ten navigational paths driven on this floor plan is shown in Fig. 5. Here, the boundaries of the experimental area are shown by the solid lines that form the boundary of the plot. The origin of the world frame is at the intersection of the other two solid lines, and the dashed lines indicate a grid imposed on the floor plan on which we placed objects. The clustering of selected proposals around actual objects is apparent.

Fig. 5. Plot of selected proposal locations derived from multiple navigational paths, indicated by color, driven on the sample floor plan. Only the point positions are used in subsequent processing, not the navigational path as indicated by color.

To determine these cluster centers, we construct a density function (Fig. 6) by computing

S_{x,y} = \sum_{n=1}^{N} \frac{f_n \, \|(x, y) - (x_n, y_n)\|}{v_n}    (2)

for each point (x, y) in the floor plan, where n ranges over all nondummy selected proposals, (x_n, y_n) denotes the world location of proposal n, f_n denotes the unary factor of proposal n, v_n denotes a visibility measure of proposal n, and the Euclidean distance is scaled to [0,1] with a zero-mean Gaussian, whose variance has been empirically determined. The visibility measure is taken as the number of times (x_n, y_n) was in the camera's field of view. Normalizing the density function by the visibility measure raises the importance of rare observations over repeated common observations of the same object in the robot's path.
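A Python sketch of evaluating the density (2) on a grid and extracting its peaks is shown below; the grid resolution, the Gaussian variance, and the simple local-maximum peak finder are illustrative choices (the paper reports finding peaks by weighted centroid).

```python
import numpy as np

def density_map(points, scores, visibility, x_range, y_range, res=0.05, sigma=0.3):
    """Evaluate S_{x,y} of Eq. (2) on a grid over the floor plan.
    points[n] = (x_n, y_n), scores[n] = f_n, visibility[n] = v_n."""
    xs = np.arange(x_range[0], x_range[1], res)
    ys = np.arange(y_range[0], y_range[1], res)
    gx, gy = np.meshgrid(xs, ys, indexing="ij")
    S = np.zeros_like(gx)
    for (xn, yn), fn, vn in zip(points, scores, visibility):
        dist = np.hypot(gx - xn, gy - yn)
        # Euclidean distance scaled to [0,1] by a zero-mean Gaussian.
        S += fn * np.exp(-0.5 * (dist / sigma) ** 2) / max(vn, 1)
    return xs, ys, S

def find_peaks(xs, ys, S, threshold_frac=0.3):
    """Crude local-maximum peak finder on the gridded density."""
    peaks, thresh = [], threshold_frac * S.max()
    for i in range(1, S.shape[0] - 1):
        for j in range(1, S.shape[1] - 1):
            window = S[i - 1:i + 2, j - 1:j + 2]
            if S[i, j] >= thresh and S[i, j] == window.max():
                peaks.append((xs[i], ys[j]))
    return peaks

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    truth = [(-1.3, 0.0), (0.8, 0.6)]               # two pretend objects
    pts = [(x + rng.normal(0, 0.1), y + rng.normal(0, 0.1))
           for x, y in truth for _ in range(20)]
    xs, ys, S = density_map(pts, scores=[0.8] * len(pts),
                            visibility=[5] * len(pts),
                            x_range=(-2.0, 2.0), y_range=(-1.5, 1.5))
    print(find_peaks(xs, ys, S))
```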

Fig. 6 shows surface and contour plots of (2) for the floor plan in Fig. 5. It clearly shows five peaks of the density function, corresponding to the locations of the five objects in the floor plan. The ground truth object locations are labeled with magenta diamonds in the contour plot. The peaks, found by weighted centroid, are marked with blue squares. The peak furthest from ground truth, corresponding to the BAG object at (−1.3, 0), is approximately 30cm from the actual object; all other peaks are considerably closer to ground truth. We report the distance error for all floor plans used in our experiment in Sec. VI.

C. Labeling object classes

Having found object locations, as described above, the next step is to assign labels to these objects based on their visual similarity. We do this in a two-step process: first with temporary labels within a single floor plan, and then with final labels common to all floor plans.

Fig. 6. Surface plot (left) and corresponding contour plot (right) of (2) for the floor plan in Fig. 5.

To assign the temporary labels for each floor plan, we first assign each nondummy selected proposal, and its corresponding image region, to the closest object location (peak) determined in the previous step, rejecting outliers based on a distance threshold. Then we create a similarity matrix Q between pairs i, j of objects for each floor plan h. If floor plan h has B objects, each object i with a set C_i of associated image regions, let U_{a,b} denote the visual similarity between pairs a, b of image regions, where a is associated with object i and b is associated with object j. Visual similarity is measured by the same methods as s in (1). We compute Q as

Q_{i,j} = \frac{\sum_{a \in C_i} \max_{b \in C_j} U_{a,b} + \sum_{b \in C_j} \max_{a \in C_i} U_{a,b}}{|C_i| + |C_j|}    (3)

We cluster object instances into object classes by binarizing Q, taking the off-diagonal element Q_{i,j} to be one when it is within a tolerance of either diagonal element Q_{i,i} or Q_{j,j}, and computing connected components in the resulting adjacency matrix. This computes a common labeling within a single floor plan.
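The Python sketch below computes Q as in (3) for one floor plan and clusters its objects with the thresholding-plus-connected-components rule just described; the visual-similarity function, the tolerance, and the toy data are placeholders.

```python
import numpy as np

def q_matrix(regions, visual_sim):
    """Eq. (3): regions[i] is the list of image regions assigned to object i;
    visual_sim(a, b) returns the visual similarity of two regions."""
    B = len(regions)
    Q = np.zeros((B, B))
    for i in range(B):
        for j in range(B):
            fwd = sum(max(visual_sim(a, b) for b in regions[j]) for a in regions[i])
            bwd = sum(max(visual_sim(a, b) for a in regions[i]) for b in regions[j])
            Q[i, j] = (fwd + bwd) / (len(regions[i]) + len(regions[j]))
    return Q

def cluster_objects(Q, tol=0.1):
    """Binarize Q (an off-diagonal entry counts as an edge when it is within
    tol of either diagonal entry) and return connected-component labels."""
    B = Q.shape[0]
    adj = [[i == j or Q[i, j] >= min(Q[i, i], Q[j, j]) - tol
            for j in range(B)] for i in range(B)]
    labels, current = [-1] * B, 0
    for i in range(B):
        if labels[i] != -1:
            continue
        stack = [i]
        while stack:                      # depth-first search over the graph
            k = stack.pop()
            if labels[k] != -1:
                continue
            labels[k] = current
            stack.extend(j for j in range(B) if adj[k][j] and labels[j] == -1)
        current += 1
    return labels

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    # Pretend regions are feature vectors; objects 0 and 1 share an appearance.
    base = [rng.random(16) for _ in range(2)]
    regions = [[base[0] + rng.normal(0, 0.01, 16) for _ in range(4)],
               [base[0] + rng.normal(0, 0.01, 16) for _ in range(4)],
               [base[1] + rng.normal(0, 0.01, 16) for _ in range(4)]]
    sim = lambda a, b: float(1.0 / (1.0 + np.linalg.norm(a - b)))
    print(cluster_objects(q_matrix(regions, sim)))   # expected grouping: [0, 0, 1]
```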

To compute a common labeling across all floor plans, we create another P × P similarity matrix, R, for all P object locations across all floor plans. The elements of R are computed using a variant of (3) where the means are only over the top 50% of the maxes, to avoid noise.
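The following sketch gives one plausible reading of this variant, assuming that "top 50% of the maxes" means keeping the highest-scoring half of the per-region best matches on each side before averaging; the similarity function and toy data are again placeholders.

```python
import numpy as np

def r_entry(regions_p, regions_q, visual_sim, keep_frac=0.5):
    """Variant of Eq. (3) for the cross-floor-plan matrix R: average only the
    top fraction of the per-region maxes, to suppress noisy matches."""
    fwd = sorted((max(visual_sim(a, b) for b in regions_q) for a in regions_p),
                 reverse=True)
    bwd = sorted((max(visual_sim(a, b) for a in regions_p) for b in regions_q),
                 reverse=True)
    fwd = fwd[:max(1, int(keep_frac * len(fwd)))]
    bwd = bwd[:max(1, int(keep_frac * len(bwd)))]
    return (sum(fwd) + sum(bwd)) / (len(fwd) + len(bwd))

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    sim = lambda a, b: float(1.0 / (1.0 + np.linalg.norm(a - b)))
    p = [rng.random(16) for _ in range(6)]
    q = [v + rng.normal(0, 0.05, 16) for v in p]   # a noisier view of the same object
    print(r_entry(p, q, sim))
```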

Since P is much larger than B, the simple thresholding method for clustering object instances that was used for Q will not work with R. Therefore, we construct a graphical model with a vertex for each of the P object locations. The values over which these vertex variables range are a set of abstract class labels, one label for each object class in the dataset. Since abstract class labels are interchangeable, there are no unary factors in this graphical model. The binary factors, however, encode the similarity measure reflected in R.

Let \ell(p) represent the abstract class label selected for object p ∈ P = {1, . . . , P}. We assign a factor t(p_1, p_2) for each pair p_1, p_2 ∈ P, p_1 ≠ p_2, computed as follows:

t(p_1, p_2) = \begin{cases} -\log(R_{p_1 p_2}) & \text{if } \ell(p_1) = \ell(p_2) \\ -\log(1 - R_{p_1 p_2}) & \text{if } \ell(p_1) \neq \ell(p_2) \end{cases}

We then seek the set of labels that maximizes the sum of all t scores:

\max_{\ell} \sum_{p_1, p_2 \in P,\; p_1 \neq p_2} t(p_1, p_2)

For this graphical model, belief propagation often gets stuck in local optima. Therefore, we instead use branch and bound [17].

VI. EXPERIMENTAL RESULTS

We conducted an experiment using a set of six different floor plans, each with five object instances. We utilized a total of six object classes: BAG, BOX, CHAIR, CONE, STOOL, and TABLE. In each floor plan, two of the five object instances were of the same object class (e.g., Fig. 6 right shows a floor plan with two chairs, a table, a bag, and a cone). We drove ten navigational paths in each floor plan, with each path satisfying a sentential path description, as in the caption of Fig. 2. We created a dataset from the video feed and odometry data for these 60 paths and applied the method described in Sec. V to this dataset.

Fig. 7 shows both the number of objects detected and the location error of such detections. Our dataset contains a total of 30 object instances, of which 27 are detected. This yields a detection rate of 90%, with no false alarms.

The right side of Fig. 7 gives statistics on localization error, the distance between ground truth and detected object location. The overall mean localization error is 16.5cm with a standard deviation of 10.6cm. Note that while ground truth locations are taken as object center projected to the ground plane, all objects have a nonzero physical footprint. These footprints range from 21cm by 26cm for the BOX to 46cm by 46cm for the TABLE. Thus localization error is within object size. One cannot expect better accuracy since our method localizes on the basis of proposals which enclose entire objects within bounding boxes. Such bounding boxes necessarily enclose an edge, rather than the center, of an object.

               number of objects          location error (cm)
  floor plan   present   detected     min     max    mean   std dev
      1           5         5         7.3    33.9    14.3     11.1
      2           5         5         4.1    23.2    14.7      8.3
      3           5         4         3.9    16.1     9.2      5.9
      4           5         5         8.7    37.9    17.1     12.2
      5           5         4         8.0    25.9    15.1      7.7
      6           5         4        17.2    37.6    29.7      9.8
   overall       30        27         3.9    37.9    16.5     10.6

Fig. 7. Object detection and localization results. Note that while ground truth locations are taken as object center projected to the ground plane, all objects have a nonzero physical footprint. These footprints range from 21cm by 26cm for the BOX to 46cm by 46cm for the TABLE.

  class    ground truth   detected   number labeled (6 abstract labels)   number labeled (12 abstract labels)
  BAG           6             6              4, 1, 1                            4, 2
  BOX           5             5              2, 1, 2                            1, 2, 2
  CHAIR         6             6              1, 5                               3, 1, 1, 1
  CONE          4             3              3                                  3
  STOOL         4             3              3                                  3
  TABLE         5             4              4                                  4

Fig. 8. Labeling accuracy by object class. (left) Number of ground truth and detected object instances, by object class. (middle) Matrix showing the number of objects labeled with each abstract label value when using the minimum number of abstract label values. (right) Matrix showing the number of objects labeled with each abstract label value when using double the minimum number of abstract label values.

We evaluate object class labeling (Sec. V-C) under two different conditions with regard to the number of abstract class labels used. We first test with the minimum number of abstract class labels needed to distinguish all ground-truth classes, which is six for our dataset. In order to allow for some degree of variability in appearance of the objects in our dataset under different viewpoints and lighting conditions, we also test using twice the minimum number of abstract class labels, namely twelve. This also models the case when the number of object classes is not known a priori.

Fig. 8 shows the labeling accuracy by object class. The left section shows the number of instances detected compared to the number of ground truth instances. The middle section of Fig. 8 shows how many detected instances are assigned to each abstract label value by our system when we use the minimum number of abstract class labels. For convenience, these label values are shown as the numbers 1 through 6. In this case, a perfect labeling would have all object instances on the diagonal, which would correspond to each object class having a unique abstract label value. To calculate accuracy, we divide the total number of on-diagonal instances, 20, by the total number of instances, 27. This yields a labeling accuracy of 74.1%.

The right section of Fig. 8 shows the same data for the test with twelve abstract class labels. Since in this case our matrix of classes to labels is not square, we cannot use the diagonal to judge correctness. However, we can judge correctness by checking if any of the abstract class labels are assigned to more than one ground-truth object class. The columns of Fig. 8 (right) show that each abstract label value is unique to a single physical object class. Therefore, this test has a labeling accuracy of 100%.

Fig. 9 shows three example images for each of the six object classes, as grouped by the object class labeling test with twelve class labels. Note that the images only depict the contents of proposal boxes, like those shown in red in Fig. 4. Fig. 9 clearly illustrates the effect of appearance variability on object class labeling. Object instances that exhibit a markedly different appearance based on the camera's perspective, such as the BAG, the BOX, and the CHAIR, require multiple abstract class labels. On the other hand, object instances that exhibit a very similar appearance regardless of the camera's perspective, such as the CONE, the STOOL, and the TABLE, can be labeled with a single abstract class label.

Fig. 9. Examples of object images as grouped by our labeling system using twelve abstract class labels (Fig. 8 right): BAG (abstract class labels 1, 2), BOX (abstract class labels 3, 4, 5), CHAIR (abstract class labels 6, 7, 8, 9), CONE (abstract class label 10), STOOL (abstract class label 11), and TABLE (abstract class label 12). The ground-truth object class names have been included at the top of the columns for clarity. Note that the images shown represent only the contents of proposal boxes (shown in red in Fig. 4). Objects with high appearance variability (left three columns) require multiple abstract class labels, while objects with low appearance variability (right three columns) need only a single abstract class label.

VII. CONCLUSIONS AND FUTURE WORK

We demonstrate a method for detecting, localizing, and labeling previously unseen objects in a video feed from a mobile robot camera during navigation. Our approach differs from prior approaches to codetection in static images and video from a stationary eye-level camera in that it effectively utilizes the combination of egocentric video and odometry and IMU data to localize novel objects in the 3D world coordinate frame. Using a general-purpose object proposal-generation mechanism, several similarity measures between proposals, and multiple graphical models, we are able to automatically determine which proposals are most likely to be objects, find 3D world locations for such objects, and consistently label them with abstract class labels unique to a single physical object class. The results in Sec. VI show that our system is able to solve this codetection problem with a reasonable degree of accuracy.

Our plans for the future of this work include expanding our dataset with more object classes and instances, more floor plans, and more navigational paths per floor plan, in order to test our system at a larger scale. We will also integrate this system as the automatic floor plan detector in our larger natural-language based robot control system [12] described in Sec. III. This integration will allow us to automatically associate object class names to abstract class labels. From such a corpus of object images associated with the names of the objects, we plan to explore extending the capabilities of our system to include automatic training of object detectors.

REFERENCES

[1] M. Blaschko, A. Vedaldi, and A. Zisserman, "Simultaneous object detection and ranking with weak supervision," in NIPS, 2010, pp. 235–243.
[2] Y. J. Lee and K. Grauman, "Learning the easy things first: self-paced visual category discovery," in CVPR, 2011, pp. 1721–1728.



[3] M. Rubinstein, A. Joulin, J. Kopf, and C. Liu, "Unsupervised joint object discovery and segmentation in internet images," in CVPR, 2013, pp. 1939–1946.
[4] K. Tang, A. Joulin, J. Li, and L. Fei-Fei, "Co-localization in real-world images," in CVPR, 2014, pp. 1464–1471.
[5] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari, "Learning object class detectors from weakly annotated video," in CVPR, 2012, pp. 3282–3289.
[6] S. Schulter, C. Leistner, P. M. Roth, and H. Bischof, "Unsupervised object discovery and segmentation in videos," in Proceedings of the British Machine Vision Conference, 2013, pp. 53.1–53.12.
[7] A. Joulin, K. Tang, and L. Fei-Fei, "Efficient image and video co-localization with Frank-Wolfe algorithm," in ECCV, 2014, pp. 253–268.
[8] A. Srikantha and J. Gall, "Discovering object classes from activities," in ECCV, 2014, pp. 415–430.
[9] P. Arbelaez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik, "Multiscale combinatorial grouping," in CVPR, 2014, pp. 328–335.
[10] A. Vedaldi and B. Fulkerson, "VLFeat: An open and portable library of computer vision algorithms," in Proceedings of the International Conference on Multimedia, 2010, pp. 1469–1472.
[11] B. Andres, T. Beier, and J. H. Kappes, "OpenGM: A C++ library for discrete graphical models," arXiv, vol. abs/1206.0111, 2012.
[12] D. P. Barrett, S. A. Bronikowski, H. Yu, and J. M. Siskind, "Robot language learning, generation, and comprehension," arXiv, vol. abs/1508.06161, 2015.
[13] A. H. Jazwinski, Stochastic Processes and Filtering Theory. Academic Press, 1970.
[14] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
[15] B. Andres, J. H. Kappes, U. Köthe, C. Schnörr, and F. A. Hamprecht, "An empirical comparison of inference algorithms for graphical models with higher order factors using OpenGM," in Pattern Recognition, 2010, pp. 353–362.
[16] J. Pearl, "Reverend Bayes on inference engines: a distributed hierarchical approach," in AAAI, 1982, pp. 133–136.
[17] A. H. Land and A. G. Doig, "An automatic method of solving discrete programming problems," Econometrica: Journal of the Econometric Society, pp. 497–520, 1960.