DeMoN: Depth and Motion Network for Learning Monocular Stereoopenaccess.thecvf.com/content_cvpr_2017/papers/Ummen... · 2017-05-31 · DeMoN: Depth and Motion Network for Learning

DeMoN: Depth and Motion Network for Learning Monocular Stereo

Benjamin Ummenhofer*,1 Huizhong Zhou*,1

{ummenhof, zhouh}@cs.uni-freiburg.de

Jonas Uhrig1,2 Nikolaus Mayer1 Eddy Ilg1 Alexey Dosovitskiy1 Thomas Brox1

1University of Freiburg 2Daimler AG R&D

{uhrigj, mayern, ilg, dosovits, brox}@cs.uni-freiburg.de

Abstract

In this paper we formulate structure from motion as a

learning problem. We train a convolutional network end-

to-end to compute depth and camera motion from succes-

sive, unconstrained image pairs. The architecture is com-

posed of multiple stacked encoder-decoder networks, the

core part being an iterative network that is able to improve

its own predictions. The network estimates not only depth

and motion, but additionally surface normals, optical flow

between the images and confidence of the matching. A cru-

cial component of the approach is a training loss based on

spatial relative differences. Compared to traditional two-

frame structure from motion methods, results are more ac-

curate and more robust. In contrast to the popular depth-

from-single-image networks, DeMoN learns the concept of

matching and, thus, better generalizes to structures not seen

during training.

1. Introduction

Structure from motion (SfM) is a long standing task in

computer vision. Most existing systems, which represent

the state of the art, are carefully engineered pipelines con-

sisting of several consecutive processing steps. A funda-

mental building block of these pipelines is the computation

of the structure and motion for two images. Present imple-

mentations of this step have some inherent limitations. For

instance, it is common to start with the estimation of the

camera motion before inferring the structure of the scene by

dense correspondence search. Thus, an incorrect estimate of

the camera motion leads to wrong depth predictions. More-

over, the camera motion is estimated from sparse corre-

spondences computed via keypoint detection and descriptor

matching. This low-level process is prone to outliers and

does not work in non-textured regions. Finally, all exist-

∗Equal contribution

Figure 1. Illustration of DeMoN. The input to the network is two

successive images from a monocular camera. The network esti-

mates the depth in the first image and the camera motion.

ing SfM approaches fail in case of small camera translation.

This is because it is hard to integrate priors that could pro-

vide reasonable solutions in those degenerate cases.

In this paper, we succeed for the first time in training a

convolutional network to jointly estimate the depth and the

camera motion from an unconstrained pair of images. This

approach is very different from the typical SfM pipeline in

that it solves the problems of motion and dense depth esti-

mation jointly. We cannot yet provide a full learning-based

system for large-scale SfM, but the two-frame case is a cru-

cial first step towards this goal. In the longer term, the learn-

ing approach has large potential, since it integrates naturally

all the shape from X approaches: multi-view, silhouettes,

texture, shading, defocus, haze. Moreover, strong priors on

objects and structure can be learned efficiently from data

and regularize the problem in degenerate cases; see Fig. 6

for example. This potential is indicated by our results for

the two-frame scenario, where the learning approach clearly

outperforms traditional methods.

Convolutional networks recently have shown to excel at

depth prediction from a single image [7, 8, 24]. By learning

priors about objects and their shapes these networks reach

remarkably good performance in restricted evaluation sce-

15038

narios such as indoor or driving scenes. However, single-

image methods have more problems generalizing to previ-

ously unseen types of images. This is because they do not

exploit stereopsis. Fig. 9 shows an example, where depth

from a single image fails, because the network did not see

similar structures before. Our network, which learned to ex-

ploit the motion parallax, does not have this restriction and

generalizes well to very new scenarios.

To exploit the motion parallax, the network must put the

two input images in correspondence. We found that a sim-

ple encoder-decoder network fails to make use of stereo:

when trained to compute depth from two images it ends up

using only one of them. Depth from a single image is a

shortcut to satisfy the training objective without putting the

two images into correspondence and deriving camera mo-

tion and depth from these correspondences.

In this paper, we present a way to avoid this shortcut and

elaborate on it to obtain accurate depth maps and camera

motion estimates. The key to the problem is an architecture

that alternates optical flow estimation with the estimation

of camera motion and depth; see Fig. 3. In order to solve

for optical flow, the network must use both images. To this

end, we adapted the FlowNet architecture [5] to our case.

Our network architecture has an iterative part that is com-

parable to a recurrent network, since weights are shared.

Instead of the typical unrolling, which is common practice

when training recurrent networks, we append predictions of

previous training iterations to the current minibatch. This

training technique saves much memory and allows us to in-

clude more iterations for training. Another technical contri-

bution of this paper is a special gradient loss to deal with the

scale ambiguity in structure from motion. The network was

trained on a mixture of real images from a Kinect camera,

including the SUN3D dataset [43], and a variety of rendered

scenes that we created for this work.

2. Related Work

Estimation of depth and motion from pairs of images

goes back to Longuet-Higgins [25]. The underlying 3D ge-

ometry is a consolidated field, which is well covered in text-

books [17, 10]. State-of-the-art systems [14, 42] allow for

reconstructions of large scenes including whole cities. They

consist of a long pipeline of methods, starting with descrip-

tor matching for finding a sparse set of correspondences

between images [26], followed by estimating the essential

matrix to determine the camera motion. Outliers among

the correspondences are typically filtered out via RANSAC

[11]. Although these systems use bundle adjustment [39]

to jointly optimize camera poses and structure of many im-

ages, they depend on the quality of the estimated geometry

between image pairs for initialization. Only after estimation

of the camera motion and a sparse 3D point cloud, dense

depth maps are computed by exploiting the epipolar geom-

etry [4]. LSD-SLAM [9] deviates from this approach by

jointly optimizing semi-dense correspondences and depth

maps. It considers multiple frames from a short temporal

window but does not include bundle adjustment. DTAM

[30] can track camera poses reliably for critical motions by

matching against dense depth maps. However, an external

depth map initialization is required, which in turn relies on

classic structure and motion methods.

Camera motion estimation from dense correspondences

has been proposed by Valgaerts et al. [41]. In this paper,

we deviate completely from these previous approaches by

training a single deep network that includes computation of

dense correspondences, estimation of depth, and the camera

motion between two frames.

Eigen et al. [7] trained a ConvNet to predict depth from

a single image. Depth prediction from a single image is an

inherently ill-posed problem which can only be solved us-

ing priors and semantic understanding of the scene – tasks

ConvNets are known to be very good at. Liu et al. [24]

combined a ConvNet with a superpixel-based conditional

random field, yielding improved results. Our two-frame

network also learns to exploit the same cues and priors as

the single-frame networks, but in addition it makes use of a

pair of images and the motion parallax between those. This

enables generalization to arbitrary new scenes.

ConvNets have been trained to replace the descriptor

matching module in aforementioned SfM systems [6, 44].

The same idea was used by Zbontar and LeCun [45] to es-

timate dense disparity maps between stereo images. Com-

putation of dense correspondences with a ConvNet that is

trained end-to-end on the task, was presented by Dosovit-

skiy et al. [5]. Mayer et al. [28] applied the same concept

to dense disparity estimation in stereo pairs. We, too, make

use of the FlowNet idea [5], but in contrast to [28, 45], the

motion between the two views is not fixed, but must be es-

timated to derive depth estimates. This makes the learning

problem much more difficult.

Flynn et al. [12] implicitly estimated the 3D structure

of a scene from a monocular video using a convolutional

network. They assume known camera poses – a large sim-

plification which allows them to use the plane-sweeping

approach to interpolate between given views of the scene.

Moreover, they never explicitly predict the depth, only RGB

images from intermediate viewpoints.

Agrawal et al. [2] and Jayaraman & Grauman [19] ap-

plied ConvNets to estimating camera motion. The main fo-

cus of these works is not on the camera motion itself, but on

learning a feature representation useful for recognition. The

accuracy of the estimated camera motion is not competitive

with classic methods. Kendall et al. [21] trained a ConvNet

for camera relocalization — predicting the location of the

camera within a known scene from a single image. This is

mainly an instance recognition task and requires retraining

5039

3x

r,t

egomotionimage pairbootstrap net refinement netiterative net

depth

Figure 2. Overview of the architecture. DeMoN takes an image pair as input and predicts the depth map of the first image and the relative

pose of the second camera. The network consists of a chain of encoder-decoder networks that iterate over optical flow, depth, and egomotion

estimation; see Fig. 3 for details. The refinement network increases the resolution of the final depth map.

Figure 3. Schematic representation of the encoder-decoder pair used in the bootstrapping and iterative network. Inputs with gray font are

only available for the iterative network. The first encoder-decoder predicts optical flow and its confidence from an image pair and previous

estimates. The second encoder-decoder predicts the depth map and surface normals. A fully connected network appended to the encoder

estimates the camera motion r, t and a depth scale factor s. The scale factor s relates the scale of the depth values to the camera motion.

for each new scene. All these works do not provide depth

estimates.

3. Network Architecture

The overall network architecture is shown in Fig. 2.

DeMoN is a chain of encoder-decoder networks solving dif-

ferent tasks. The architecture consists of three main com-

ponents: the bootstrap net, the iterative net and the refine-

ment net. The first two components are pairs of encoder-

decoder networks, where the first one computes optical flow

while the second one computes depth and camera motion;

see Fig. 3. The iterative net is applied recursively to succes-

sively refine the estimates of the previous iteration. The last

component is a single encoder-decoder network that gener-

ates the final upsampled and refined depth map.

Bootstrap net. The bootstrap component gets the image

pair as input and outputs the initial depth and motion es-

timates. Internally, first an encoder-decoder network com-

putes optical flow and a confidence map for the flow (the

left part of Fig. 3). The encoder consists of pairs of convo-

lutional layers with 1D filters in y and x-direction. Using

pairs of 1D filters as suggested in [37] allows us to use spa-

tially large filter while keeping the number of parameters

and runtime manageable. We gradually reduce the spatial

resolution with a stride of 2 while increasing the number

of channels. The decoder part generates the optical flow

estimate from the encoder’s representation via a series of

up-convolutional layers with stride 2 followed by two con-

volutional layers. It outputs two components of the optical

Method L1-inv sc-inv L1-rel

Single image 0.080 0.159 0.696

Naıve image pair 0.079 0.165 0.722

DeMoN 0.012 0.131 0.097

Table 1. Naıve two-frame depth estimation does not perform bet-

ter than depth from a single image on any of the error measures

(smaller is better). The architecture of DeMoN forces the network

to use both images, yielding a large performance improvement.

flow field and an estimate of their confidence. Details on the

loss and the training procedure are described in Section 5.

The second encoder-decoder, shown in the right part of

Fig. 3, takes as input the optical flow, its confidence, the im-

age pair, and the second image warped with the estimated

flow field. Based on these inputs it estimates depth, sur-

face normals, and camera motion. The architecture is the

same as above, apart from the extra 3 fully connected lay-

ers that compute the camera motion and a scaling factor for

the depth prediction. The latter reflects the inherent con-

nection between depth and motion predictions due to scale

ambiguity; see Section 4.

By feeding optical flow estimate into the second

encoder-decoder we let it make use of motion parallax.

Tab. 1 shows that an encoder-decoder network trained to es-

timate depth and camera motion directly from an image pair

(naıve image pair) fails to make use of stereo cues and per-

forms on par with a single-image network. DeMoN, on the

other hand, performs significantly better.

Iterative net. The iterative net is trained to improve ex-

5040

isting depth, normal, and motion estimates. The architec-

ture of this encoder-decoder pair is identical to the boot-

strap net, but it takes additional inputs. We convert the

depth map and camera motion estimated by the bootstrap

net or a previous iteration of the iterative net into an op-

tical flow field, and feed it into the first encoder-decoder

together with other inputs. Likewise, we convert the optical

flow to a depth map using the previous camera motion pre-

diction and pass it along with the optical flow to the second

encoder-decoder. In both cases the networks are presented

with a prediction proposal generated from the predictions of

the previous encoder-decoder.

Fig. 4 shows how the optical flow and depth improve

with each iteration of the network. The iterations enable

sharp discontinuities, improve the scale of the depth values,

and can even correct wrong estimates of the initial boot-

strapping network. The improvements largely saturate after

3 or 4 iterations. Quantitative analysis is shown in the sup-

plementary material.

During training we simulate 4 iterations by appending

predictions from previous training iterations to the mini-

batch. Unlike unrolling, there is no backpropagation of the

gradient through iterations. Instead, the gradient of each

iteration is described by the losses on the well defined net-

work outputs: optical flow, depth, normals, and camera mo-

tion. Compared to backpropagation through time this saves

a lot of memory and allows us to have a larger network

and more iterations. A similar approach was taken by Li et

al. [23], who train each iteration in a separate step and there-

fore need to store predictions as input for the next stage. We

also train the first iteration on its own, but then train all iter-

ations jointly which avoids intermediate storage.

Refinement net. While the previous network compo-

nents operate at a reduced resolution of 64 × 48 to save

parameters and to reduce training and test time, the final re-

finement net upscales the predictions to the full input image

resolution (256 × 192). It gets as input the full resolution

first image and the nearest-neighbor-upsampled depth and

normal field. Fig. 5 shows the low-resolution input and the

refined high-resolution output.

A forward pass through the network with 3 iterations

takes 110ms on an Nvidia GTX Titan X. Implementation

details and exact network definitions of all network compo-

nents are provided in the supplementary material.

4. Depth and Motion Parameterization

The network computes the depth map in the first view

and the camera motion to the second view. We represent

the relative pose of the second camera with r, t ∈ R3. The

rotation r = θv is an angle axis representation with angle θand axis v. The translation t is given in Cartesian coordi-

nates.

It is well-known that the reconstruction of a scene from

Bootstrap Iterative 1 2 3 GT

Dep

thF

low

Figure 4. Top: Iterative depth refinement. The bootstrap net fails

to accurately estimate the scale of the depth. The iterations refine

the depth prediction and strongly improve the scale of the depth

values. The L1-inverse error drops from 0.0137 to 0.0072 after

the first iteration. Bottom: Iterative refinement of optical flow.

Images show the x component of the optical flow for better visi-

bility. The flow prediction of the bootstrap net misses the object

completely. Motion edges are retrieved already in the first iteration

and the endpoint error is reduced from 0.0176 to 0.0120.

Prediction Refined prediction Ground Truth

Figure 5. The refinement net generates a high-resolution depth

map (256 × 192) from the low-resolution estimate (64 × 48) and

the input image. The depth sampling preserves depth edges and

can even repair wrong depth measurements.

images with unknown camera motion can be determined

only up to scale. We resolve the scale ambiguity by nor-

malizing translations and depth values such that ‖t‖ = 1.

This way the network learns to predict a unit norm transla-

tion vector.

Rather than the depth z, the network estimates the in-

verse depth ξ = 1/z. The inverse depth allows represen-

tation of points at infinity and accounts for the growing lo-

calization uncertainty of points with increasing distance.To

match the unit translation, our network predicts a scalar

scaling factor s, which we use to obtain the final depth val-

ues sξ.

5. Training Procedure

5.1. Loss functions

The network estimates outputs of very different na-

ture: high-dimensional (per-pixel) depth maps and low-

dimensional camera motion vectors. The loss has to bal-

ance both of these objectives, and stimulate synergy of the

two tasks without over-fitting to a specific scenario.

Point-wise losses. We apply point-wise losses to our

outputs: inverse depth ξ, surface normals n, optical flow

w, and optical flow confidence c. For depth we use an L1

loss directly on the inverse depth values:

Ldepth =∑

i,j |sξ(i, j)− ξ(i, j)|, (1)

5041

with ground truth ξ. Note that we apply the predicted scale

s to the predicted values ξ.

For the loss function of the normals and the optical flow

we use the (non-squared) L2 norm to penalize deviations

from the respective ground truths n and w

Lnormal =∑

i,j ‖n(i, j)− n(i, j)‖2

Lflow =∑

i,j ‖w(i, j)− w(i, j)‖2 .(2)

For optical flow this amounts to the usual endpoint error.

We train the network to assess the quality of its own flow

prediction by predicting a confidence map for each optical

flow component. The ground truth of the confidence for the

x component is

cx(i, j) = e−|wx(i,j)−wx(i,j)|, (3)

and the corresponding loss function reads as

Lflow confidence =∑

i,j |cx(i, j)− cx(i, j)| . (4)

Motion losses. We use a minimal parameterization of

the camera motion with 3 parameters for rotation r and

translation t each. The losses for the motion vectors are

Lrotation = ‖r− r‖2

Ltranslation = ‖t− t‖2.(5)

The translation ground truth is always normalized such that

‖t‖2 = 1, while the magnitude of r encodes the angle of

the rotation.

Scale invariant gradient loss. We define a discrete scale

invariant gradient g as

gh[f ](i, j) =(

f(i+h,j)−f(i,j)|f(i+h,j)|+|f(i,j)| ,

f(i,j+h)−f(i,j)|f(i,j+h)|+|f(i,j)|

)⊤

.

(6)

Based on this gradient we define a scale invariant loss that

penalizes relative depth errors between neighbouring pixels:

Lgrad ξ =∑

h∈{1,2,4,8,16}

∑

i,j

∥

∥

∥gh[ξ](i, j)− gh[ξ](i, j)

∥

∥

∥

2.

(7)

To cover gradients at different scales we use 5 different

spacings h. This loss stimulates the network to compare

depth values within a local neighbourhood for each pixel.

It emphasizes depth discontinuities, stimulates sharp edges

in the depth map and increases smoothness within homo-

geneous regions as seen in Fig. 10. Note that due to the

relation gh[ξ](i, j) = −gh[z](i, j) for ξ, z > 0, the loss is

the same for the actual non-inverse depth values z.

We apply the same scale invariant gradient loss to each

component of the optical flow. This enhances the smooth-

ness of estimated flow fields and the sharpness of motion

discontinuities.

Weighting. We individually weigh the losses to balance

their importance. The weight factors were determined em-

pirically and are listed in the supplementary material.

5.2. Training Schedule

The network training is based on the Caffe frame-

work [20]. We train our model from scratch with Adam [22]

using a momentum of 0.9 and a weight decay of 0.0004.

The whole training procedure consists of three phases.

First, we sequentially train the four encoder-decoder

components in both bootstrap and iterative nets for 250k

iterations each with a batch size of 32. While training an

encoder-decoder we keep the weights for all previous com-

ponents fixed. For encoder-decoders predicting optical flow,

the scale invariant loss is applied after 10k iterations.

Second, we train only the encoder-decoder pair of the it-

erative net. In this phase we append outputs from previous

three training iterations to the minibatch. In this phase the

bootstrap net uses batches of size 8. The outputs of the pre-

vious three network iterations are added to the batch, which

yields a total batch size of 32 for the iterative network. We

run 1.6 million training iterations.

Finally, the refinement net is trained for 600k iterations

with all other weights fixed. The details of the training pro-

cess, including the learning rate schedules, are provided in

the supplementary material.

6. Experiments

6.1. Datasets

SUN3D [43] provides a diverse set of indoor images to-

gether with depth and camera pose. The depth and camera

pose on this dataset are not perfect. Thus, we sampled im-

age pairs from the dataset and automatically discarded pairs

with a high photoconsistency error. We split the dataset so

that the same scenes do not appear in both the training and

the test set.

RGB-D SLAM [36] provides high quality camera poses

obtained with an external motion tracking system. Depth

maps are disturbed by measurement noise, and we use the

same preprocessing as for SUN3D. We created a training

and a test set.

MVS includes several outdoor datasets. We used the

Citywall and Achteckturm datasets from [15] and the

Breisach dataset [40] for training and the datasets provided

with COLMAP [33, 34] for testing. The depth maps of the

reconstructed scenes are often sparse and can comprise re-

construction errors.

Scenes11 is a synthetic dataset with generated images of

virtual scenes with random geometry, which provide perfect

depth and motion ground truth, but lack realism.

Thus, we introduce the Blendswap dataset which is

based on 150 scenes from blendswap.com. The dataset

provides a large variety of scenes, ranging from cartoon-like

to photorealistic scenes. The dataset contains mainly indoor

scenes. We used this dataset only for training.

5042

NYUv2 [29] provides depth maps for diverse indoor

scenes but lacks camera pose information. We did not train

on NYU and used the same test split as in Eigen et al. [7]. In

contrast to Eigen et al., we also require a second input im-

age that should not be identical to the previous one. Thus,

we automatically chose the next image that is sufficiently

different from the first image according to a threshold on

the difference image.

In all cases where the surface normals are not available,

we generated them from the depth maps. We trained De-

MoN specifically for the camera intrinsics used in SUN3D

and adapted all other datasets by cropping and scaling to

match these parameters.

6.2. Error metrics

While single-image methods aim to predict depth at the

actual physical scale, two-image methods typically yield the

scale relative to the norm of the camera translation vector.

Comparing the results of these two families of methods re-

quires a scale-invariant error metric. We adopt the scale-

invariant error of [8], which is defined as

sc-inv(z, z) =√

1n

∑

i d2i −

1n2 (

∑

i di)2, (8)

with di = log zi−log zi. For comparison with classic struc-

ture from motion methods we use the following measures:

L1-rel(z, z) = 1n

∑

i|zi−zi|

zi(9)

L1-inv(z, z) = 1n

∑

i|ξi − ξi| =1n

∑

i

∣

∣

∣

1zi

− 1zi

∣

∣

∣(10)

L1-rel computes the depth error relative to the ground truth

depth and therefore reduces errors where the ground truth

depth is large and increases the importance of close objects

in the ground truth. L1-inv behaves similarly and resembles

our loss function for predicted inverse depth values (1).

For evaluating the camera motion estimation, we report

the angle (in degrees) between the prediction and the ground

truth for both the translation and the rotation. The length of

the translation vector is 1 by definition.

The accuracy of optical flow is measured by the aver-

age endpoint error (EPE), that is, the Euclidean norm of the

difference between the predicted and the true flow vector,

averaged over all image pixels. The flow is scaled such that

the displacement by the image size corresponds to 1.

6.3. Comparison to classic structure from motion

We compare to several strong baselines implemented by

us from state-of-the-art components (“Base-*”). For these

baselines, we estimated correspondences between images,

either by matching SIFT keypoints (“Base-SIFT”) or with

the FlowFields optical flow method from Bailer et al. [3]

(“Base-FF”). Next, we computed the essential matrix with

the normalized 8-point algorithm [16] and RANSAC. To

Depth Motion

Method L1-inv sc-inv L1-rel rot trans

MV

S

Base-Oracle 0.019 0.197 0.105 0 0

Base-SIFT 0.056 0.309 0.361 21.180 60.516

Base-FF 0.055 0.308 0.322 4.834 17.252

Base-Matlab - - - 10.843 32.736

Base-Mat-F - - - 5.442 18.549

DeMoN 0.047 0.202 0.305 5.156 14.447

Sce

nes

11

Base-Oracle 0.023 0.618 0.349 0 0

Base-SIFT 0.051 0.900 1.027 6.179 56.650

Base-FF 0.038 0.793 0.776 1.309 19.425

Base-Matlab - - - 0.917 14.639

Base-Mat-F - - - 2.324 39.055

DeMoN 0.019 0.315 0.248 0.809 8.918

RG

B-D

Base-Oracle 0.026 0.398 0.336 0 0

Base-SIFT 0.050 0.577 0.703 12.010 56.021

Base-FF 0.045 0.548 0.613 4.709 46.058

Base-Matlab - - - 12.831 49.612

Base-Mat-F - - - 2.917 22.523

DeMoN 0.028 0.130 0.212 2.641 20.585

Sun3D

Base-oracle 0.020 0.241 0.220 0 0

Base-SIFT 0.029 0.290 0.286 7.702 41.825

Base-FF 0.029 0.284 0.297 3.681 33.301

Base-Matlab - - - 5.920 32.298

Base-Mat-F - - - 2.230 26.338

DeMoN 0.019 0.114 0.172 1.801 18.811

NY

Uv2

Base-oracle - - - - -

Base-SIFT - - - - -

Base-FF - - - - -

Base-Matlab - - - - -

Base-Mat-F - - - - -

DeMoN - - - - -

Depth

Method sc-inv

Liu indoor 0.260

Liu outdoor 0.341

Eigen VGG 0.225

DeMoN 0.203

Liu indoor 0.816

Liu outdoor 0.814

Eigen VGG 0.763

DeMoN 0.303

Liu indoor 0.338

Liu outdoor 0.428

Eigen VGG 0.272

DeMoN 0.134

Liu indoor 0.214

Liu outdoor 0.401

Eigen VGG 0.175

DeMoN 0.126

Liu indoor 0.210

Liu outdoor 0.421

Eigen VGG 0.148

DeMoN 0.180

Table 2. Left: Comparison of two-frame depth and motion es-

timation methods. Lower is better for all measures. For a fair

comparison with the baseline methods, we evaluate depth only at

pixels visible in both images. For Base-Matlab depth is only avail-

able as a sparse point cloud and is therefore not compared to here.

We do not report the errors on NYUv2 since motion ground truth

(and therefore depth scale) is not available. Right: Comparison

to single-frame depth estimation. Since the scale estimates are not

comparable, we report only the scale invariant error metric.

further improve accuracy we minimized the reprojection er-

ror using the ceres library [1]. Finally, we generated the

depth maps by plane sweep stereo and used the approach

of Hirschmueller et al. [18] for optimization. We also re-

port the accuracy of the depth estimate when providing

the ground truth camera motion (“Base-Oracle”). (“Base-

Matlab”) and (“Base-Mat-F”) are implemented in Matlab.

(“Base-Matlab”) uses Matlab implementations of the KLT

algorithm [38, 27, 35] for correspondence search while

(“Base-Mat-F”) uses the predicted flow from DeMoN. The

essential matrix is computed with RANSAC and the 5-point

algorithm [31] for both.

Tab. 2 shows that DeMoN outperforms all baseline

methods both on motion and depth accuracy by a factor of

1.5 to 2 on most datasets. The only exception is the MVS

dataset where the motion accuracy of DeMoN is on par with

the strong baseline based on FlowFields optical flow. This

demonstrates that traditional methods work well on the tex-

ture rich scenes present in MVS, but do not perform well for

example on indoor scenes, with large homogeneous regions

or small baselines where priors may be very useful. Besides

5043

Reference

Sec

on

d

Figure 6. Qualitative performance gain by increasing the baseline

between the two input images for DeMoN. The depth map is pro-

duced with the top left reference image and the second image be-

low. The first output is obtained with two identical images as input,

which is a degenerate case for traditional structure from motion.

First frame Frontal view Top view

Figure 7. Result on a sequence of the RGB-D SLAM dataset [36].

The accumulated pairwise pose estimates by our network (red) are

locally consistent with the ground truth trajectory (black). The

depth prediction of the first frame is shown. The network also

separates foreground and background in its depth output.

that, all Base-* methods use images at the full resolution of

640 × 480, while our method uses downsampled images

of 256 × 192 as input. Higher resolution gives the Base-*

methods an advantage in depth accuracy, but on the other

hand these methods are more prone to outliers. For detailed

error distributions see the supplemental material. Remark-

ably, on all datasets except for MVS the depth estimates of

DeMoN are better than the ones a traditional approach can

produce given the ground truth motion. This is supported

by qualitative results in Fig. 8. We also note that DeMoN

has smaller motion errors than (“Base-Mat-F”), showing its

advantage over classical methods in motion estimation.

In contrast to classical approaches, we can also han-

dle cases without and with very little camera motion, see

Fig. 6. We used our network to compute camera trajecto-

ries by simple concatenation of the motion of consecutive

frames, as shown in Fig. 7. The trajectory shows mainly

translational drift. We also did not apply any drift correc-

tion which is a crucial component in SLAM systems, but

results convince us that DeMoN can be integrated into such

systems.

6.4. Comparison to depth from single image

To demonstrate the value of the motion parallax, we

additionally compare to the single-image depth estimation

methods by Eigen & Fergus [7] and Liu et al. [24]. We com-

pare to the improved version of the Eigen & Fergus method,

which is based on the VGG network architecture, and to two

Image GT Base-O Eigen DeMoN

Su

n3

DR

GB

DM

VS

Sce

nes

11

NY

Uv

2

Figure 8. Top: Qualitative depth prediction comparison on var-

ious datasets. The predictions of DeMoN are very sharp and de-

tailed. The Base-Oracle prediction on NYUv2 is missing because

the motion ground truth is not available. Results on more methods

and examples are shown in the supplementary material.

models by Liu et al.: one trained on indoor scenes from the

NYUv2 dataset (“indoor”) and another, trained on outdoor

images from the Make3D dataset [32] (“outdoor”).

The comparison in Fig. 8 shows that the depth maps pro-

duced by DeMoN are more detailed and more regular than

the ones produced by other methods. This becomes even

more obvious when the results are visualized as a point

cloud; see the videos in the supplemental material.

On all but one dataset, DeMoN outperforms the single-

frame methods also by numbers, typically by a large mar-

gin. Notably, a large improvement can be observed even on

the indoor datasets, Sun3D and RGB-D, showing that the

additional stereopsis complements the other cues that can

be learned from the large amounts of training data avail-

able for this scenario. Only on the NYUv2 dataset, DeMoN

is slightly behind the method of Eigen & Fergus. This is

because the comparison is not totally fair: the network of

Eigen & Fergus as well as Liu indoor was trained on the

training set of NYUv2, whereas the other networks have

not seen this kind of data before.

6.4.1 Generalization to new data

Scene-specific priors learned during training may be use-

less or even harmful when being confronted with a scene

that is very different from the training data. In contrast, the

geometric relations between a pair of images are indepen-

dent of the content of the scene and should generalize to

unknown scenes. To analyze the generalization properties

of DeMoN, we compiled a small dataset of images show-

ing uncommon or complicated scenes, for example abstract

sculptures, close-ups of people and objects, images rotated

by 90 degrees.

5044

Image GT Eigen Liu DeMoN

Eigen Liu DeMoN

Figure 9. Visualization of DeMoN’s generalization capabilities to

previously unseen configurations. Single-frame methods have se-

vere problems in such cases, as most clearly visible in the point

cloud visualization of the depth estimate for the last example.

Method L1-inv sc-inv L1-rel

Liu [24] 0.055 0.247 0.194

Eigen [7] 0.062 0.238 0.185

DeMoN 0.041 0.183 0.130

Table 3. Quantitative generalization performance on previously

unseen scenes, objects, and camera rotations, using a self-recorded

and reconstructed dataset. Errors after optimal log-scaling. The

best model of Eigen et al. [7] for this task is based on VGG, for

Liu et al. [24], the model trained on Make3D [13] performed best.

DeMoN achieved best performance after two iterations.

Fig. 9 and Tab. 3 show that DeMoN, as to be expected,

generalizes better to these unexpected scenes than single-

image methods. It shows that the network has learned to

make use of the motion parallax.

6.5. Ablation studies

Our architecture contains some design decisions that we

justify by the following ablation studies. All results have

been obtained on the Sun3D dataset with the bootstrap net.

Choice of the loss function. Tab. 4 shows the influence

of the loss function on the accuracy of the estimated depth

and motion. Interestingly, while the scale invariant loss

greatly improves the prediction qualitatively (see Fig. 10),

it has negative effects on depth scale estimation. This leads

to weak performance on non-scale-invariant metrics and the

motion accuracy. Estimation of the surface normals slightly

improves all results. Finally, the full architecture with the

scale invariant loss, normal estimation, and a loss on the

flow, leads to the best results.

a b c d

Figure 10. Depth prediction comparison with different outputs and

losses. (a) Just L1 loss on the absolute depth values. (b) Addi-

tional output of normals and L1 loss on the normals. (c) Like (b)

but with the proposed gradient loss. (d) Ground truth.

Depth Motion

grad norm flow L1-inv sc-inv L1-rel rot tran

no no no 0.040 0.211 0.354 3.127 30.861

yes no no 0.057 0.159 0.437 4.585 39.819

no yes no 0.037 0.190 0.336 2.570 29.607

no yes yes 0.029 0.184 0.266 2.359 23.578

yes yes yes 0.032 0.150 0.276 2.479 24.372

Table 4. The influence of the loss function on the performance.

The gradient loss improves the scale invariant error, but degrades

the scale-sensitive measures. Surface normal prediction improves

the depth accuracy. A combination of all components leads to the

best tradeoff.

Depth Motion Flow

Confidence L1-inv sc-inv L1-rel rot tran EPE

no 0.030 0.028 0.26 2.830 25.262 0.027

yes 0.032 0.027 0.28 2.479 24.372 0.027

Table 5. The influence of confidence prediction on the overall per-

formance of the different outputs.

Flow confidence. Egomotion estimation only requires

sparse but high-quality correspondences. Tab. 5 shows that

given the same flow, egomotion estimation improves when

given the flow confidence as an extra input. Our interpreta-

tion is that the flow confidence helps finding most accurate

correspondences.

7. Conclusions and Future Work

DeMoN is the first deep network that has learned to es-

timate depth and camera motion from two unconstrained

images. Unlike networks that estimate depth from a single

image, DeMoN can exploit the motion parallax, which is a

powerful cue that generalizes to new types of scenes, and al-

lows to estimate the egomotion. This network outperforms

traditional structure from motion techniques on two frames,

since in contrast to those, it is trained end-to-end and learns

to integrate other shape from X cues. When it comes to han-

dling cameras with different intrinsic parameters it has not

yet reached the flexibility of classic approaches. The next

challenge is to lift this restriction and extend this work to

more than two images. As in classical techniques, this is

expected to significantly improve the robustness and accu-

racy.

Acknowledgements We acknowledge funding by the

ERC Starting Grant VideoLearn, the DFG grant BR-

3815/5-1, and the EU project Trimbot2020.

5045

References

[1] S. Agarwal, K. Mierle, and Others. Ceres Solver. 6

[2] P. Agrawal, J. Carreira, and J. Malik. Learning to see by

moving. In IEEE International Conference on Computer Vi-

sion (ICCV), Dec. 2015. 2

[3] C. Bailer, B. Taetz, and D. Stricker. Flow Fields: Dense Cor-

respondence Fields for Highly Accurate Large Displacement

Optical Flow Estimation. In IEEE International Conference

on Computer Vision (ICCV), Dec. 2015. 6

[4] R. T. Collins. A space-sweep approach to true multi-image

matching. In Proceedings CVPR ’96, 1996 IEEE Computer

Society Conference on Computer Vision and Pattern Recog-

nition, 1996, pages 358–363, June 1996. 2

[5] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazırbas,

V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. Flownet:

Learning optical flow with convolutional networks. In IEEE

International Conference on Computer Vision (ICCV), Dec.

2015. 2

[6] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Ried-

miller, and T. Brox. Discriminative unsupervised feature

learning with exemplar convolutional neural networks. IEEE

Transactions on Pattern Analysis and Machine Intelligence,

38(9):1734–1747, Oct 2016. TPAMI-2015-05-0348.R1. 2

[7] D. Eigen and R. Fergus. Predicting Depth, Surface Normals

and Semantic Labels With a Common Multi-Scale Convo-

lutional Architecture. In IEEE International Conference on

Computer Vision (ICCV), Dec. 2015. 1, 2, 6, 7, 8

[8] D. Eigen, C. Puhrsch, and R. Fergus. Depth Map Predic-

tion from a Single Image using a Multi-Scale Deep Network.

In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence,

and K. Q. Weinberger, editors, Advances in Neural Informa-

tion Processing Systems 27, pages 2366–2374. Curran Asso-

ciates, Inc., 2014. 1, 6

[9] J. Engel, T. Schops, and D. Cremers. LSD-SLAM: Large-

scale direct monocular SLAM. In European Conference on

Computer Vision (ECCV), September 2014. 2

[10] O. Faugeras. Three-dimensional Computer Vision: A Geo-

metric Viewpoint. MIT Press, Cambridge, MA, USA, 1993.

2

[11] M. A. Fischler and R. C. Bolles. Random Sample Consen-

sus: A Paradigm for Model Fitting with Applications to Im-

age Analysis and Automated Cartography. Commun. ACM,

24(6):381–395, June 1981. 2

[12] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deep-

stereo: Learning to predict new views from the world’s im-

agery. In Conference on Computer Vision and Pattern Recog-

nition (CVPR), 2016. 2

[13] D. A. Forsyth. Make3d: Learning 3d scene structure from

a single still image. IEEE Transactions on Pattern Analysis

and Machine Intelligence, 31(5):824–840, May 2009. 8

[14] J.-M. Frahm, P. Fite-Georgel, D. Gallup, T. Johnson,

R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazeb-

nik, and M. Pollefeys. Building Rome on a Cloudless Day.

In K. Daniilidis, P. Maragos, and N. Paragios, editors, Eu-

ropean Conference on Computer Vision (ECCV), number

6314 in Lecture Notes in Computer Science, pages 368–381.

Springer Berlin Heidelberg, 2010. 2

[15] S. Fuhrmann, F. Langguth, and M. Goesele. Mve-a mul-

tiview reconstruction environment. In Proceedings of the

Eurographics Workshop on Graphics and Cultural Heritage

(GCH), volume 6, page 8, 2014. 5

[16] R. I. Hartley. In defense of the eight-point algorithm. IEEE

Transactions on Pattern Analysis and Machine Intelligence,

19(6):580–593, June 1997. 6

[17] R. I. Hartley and A. Zisserman. Multiple View Geometry

in Computer Vision. Cambridge University Press, ISBN:

0521540518, second edition, 2004. 2

[18] H. Hirschmuller. Accurate and efficient stereo processing

by semi-global matching and mutual information. In IEEE

International Conference on Computer Vision and Pattern

Recognition (CVPR), volume 2, pages 807–814, June 2005.

6

[19] D. Jayaraman and K. Grauman. Learning image representa-

tions tied to egomotion. In ICCV, 2015. 2

[20] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Gir-

shick, S. Guadarrama, and T. Darrell. Caffe: Convolutional

Architecture for Fast Feature Embedding. arXiv preprint

arXiv:1408.5093, 2014. 5

[21] A. Kendall and R. Cipolla. Modelling Uncertainty in Deep

Learning for Camera Relocalization. In International Conv-

erence on Robotics and Automation (ICRA), 2016. 2

[22] D. Kingma and J. Ba. Adam: A Method for Stochastic

Optimization. arXiv:1412.6980 [cs], Dec. 2014. arXiv:

1412.6980. 5

[23] K. Li, B. Hariharan, and J. Malik. Iterative Instance Segmen-

tation. In 2016 IEEE Conference on Computer Vision and

Pattern Recognition (CVPR), pages 3659–3667, June 2016.

4

[24] F. Liu, C. Shen, G. Lin, and I. Reid. Learning Depth from

Single Monocular Images Using Deep Convolutional Neural

Fields. In IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, 2015. 1, 2, 7, 8

[25] H. C. Longuet-Higgins. A computer algorithm for re-

constructing a scene from two projections. Nature,

293(5828):133–135, Sept. 1981. 2

[26] D. G. Lowe. Distinctive Image Features from Scale-

Invariant Keypoints. International Journal of Computer Vi-

sion, 60(2):91–110, Nov. 2004. 2

[27] B. D. Lucas and T. Kanade. An Iterative Image Registration

Technique with an Application to Stereo Vision. In Proceed-

ings of the 7th International Joint Conference on Artificial

Intelligence - Volume 2, IJCAI’81, pages 674–679, San Fran-

cisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc. 6

[28] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers,

A. Dosovitskiy, and T. Brox. A large dataset to train convo-

lutional networks for disparity, optical flow, and scene flow

estimation. In IEEE International Conference on Computer

Vision and Pattern Recognition (CVPR), 2016. 2

[29] P. K. Nathan Silberman, Derek Hoiem and R. Fergus. Indoor

Segmentation and Support Inference from RGBD Images. In

European Conference on Computer Vision (ECCV), 2012. 6

[30] R. A. Newcombe, S. Lovegrove, and A. Davison. DTAM:

Dense tracking and mapping in real-time. In 2011 IEEE In-

ternational Conference on Computer Vision (ICCV), pages

2320–2327, 2011. 2

5046

[31] D. Nister. An efficient solution to the five-point relative pose

problem. IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, 26(6):756–770, June 2004. 6

[32] A. Saxena, S. H. Chung, and A. Y. Ng. Learning depth from

single monocular images. In In NIPS 18. MIT Press, 2005.

7

[33] J. L. Schonberger and J.-M. Frahm. Structure-from-motion

revisited. In IEEE Conference on Computer Vision and Pat-

tern Recognition (CVPR), 2016. 5

[34] J. L. Schonberger, E. Zheng, M. Pollefeys, and J.-M. Frahm.

Pixelwise view selection for unstructured multi-view stereo.

In European Conference on Computer Vision (ECCV), 2016.

5

[35] J. Shi and C. Tomasi. Good features to track. In 1994

IEEE Conference on Computer Vision and Pattern Recog-

nition (CVPR’94), pages 593 – 600, 1994. 6

[36] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cre-

mers. A benchmark for the evaluation of rgb-d slam systems.

In Proc. of the International Conference on Intelligent Robot

Systems (IROS), Oct. 2012. 5, 7

[37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna.

Rethinking the Inception Architecture for Computer Vision.

arXiv:1512.00567 [cs], Dec. 2015. 3

[38] C. Tomasi and T. Kanade. Detection and tracking of point

features. Technical report, International Journal of Computer

Vision, 1991. 6

[39] B. Triggs, P. McLauchlan, R. Hartley, and A. Fitzgibbon.

Bundle Adjustment A Modern Synthesis Vision Algorithms:

Theory and Practice. In B. Triggs, A. Zisserman, and

R. Szeliski, editors, Vision Algorithms: Theory and Practice,

volume 1883, pages 153–177. Springer Berlin / Heidelberg,

Apr. 2000. 2

[40] B. Ummenhofer and T. Brox. Global, dense multiscale re-

construction for a billion points. In IEEE International Con-

ference on Computer Vision (ICCV), Dec 2015. 5

[41] L. Valgaerts, A. Bruhn, M. Mainberger, and J. Weickert.

Dense versus Sparse Approaches for Estimating the Funda-

mental Matrix. International Journal of Computer Vision,

96(2):212–234, Jan. 2012. 2

[42] C. Wu. Towards Linear-Time Incremental Structure from

Motion. In International Conference on 3D Vision (3DV),

pages 127–134, June 2013. 2

[43] J. Xiao, A. Owens, and A. Torralba. SUN3D: A Database of

Big Spaces Reconstructed Using SfM and Object Labels. In

IEEE International Conference on Computer Vision (ICCV),

pages 1625–1632, Dec. 2013. 2, 5

[44] S. Zagoruyko and N. Komodakis. Learning to compare im-

age patches via convolutional neural networks. In Confer-

ence on Computer Vision and Pattern Recognition (CVPR),

2015. 2

[45] J. Zbontar and Y. LeCun. Computing the Stereo Matching

Cost With a Convolutional Neural Network. In IEEE Confer-

ence on Computer Vision and Pattern Recognition (CVPR),

June 2015. 2

5047

Documents

DeMoN: Depth and Motion Network for Learning Monocular Stereoopenaccess.thecvf.com/content_cvpr_2017/papers/Ummen... · 2017-05-31 · DeMoN: Depth and Motion Network for Learning