Virtual Rephotography: Novel View Prediction Error for 3D Reconstruction

MICHAEL WAECHTER, MATE BELJAN, SIMON FUHRMANN, and NILS MOEHRLE
Technische Universität Darmstadt
JOHANNES KOPF
Facebook
and
MICHAEL GOESELE
Technische Universität Darmstadt

The ultimate goal of many image-based modeling systems is to render photo-realistic novel views of a scene without visible artifacts. Existing evaluation metrics and benchmarks focus mainly on the geometric accuracy of the reconstructed model, which is, however, a poor predictor of visual accuracy. Furthermore, geometric accuracy by itself does not allow evaluating systems that either lack a geometric scene representation or utilize coarse proxy geometry. Examples include light fields and most image-based rendering systems. We propose a unified evaluation approach based on novel view prediction error that is able to analyze the visual quality of any method that can render novel views from input images. One key advantage of this approach is that it does not require ground truth geometry. This dramatically simplifies the creation of test datasets and benchmarks. It also allows us to evaluate the quality of an unknown scene during the acquisition and reconstruction process, which is useful for acquisition planning. We evaluate our approach on a range of methods including standard geometry-plus-texture pipelines as well as image-based rendering techniques, compare it to existing geometry-based benchmarks, demonstrate its utility for a range of use cases, and present a new virtual rephotography-based benchmark for image-based modeling and rendering systems.

CCS Concepts: • Computing methodologies → Reconstruction; Image-based rendering;

Additional Key Words and Phrases: image-based rendering, image-based modeling, 3D reconstruction, multi-view stereo, novel view prediction error

ACM Reference Format:
Michael Waechter, Mate Beljan, Simon Fuhrmann, Nils Moehrle, Johannes Kopf, and Michael Goesele. 2017. Virtual rephotography: Novel view prediction error for 3D reconstruction. ACM Trans. Graph. 36, 1, Article 8 (January 2017), 11 pages.
DOI: http://dx.doi.org/10.1145/2999533

Part of this work was done while Johannes Kopf was with Microsoft Research.

Fig. 1. Castle Ruin with different 3D reconstruction representations: (a) mesh with vertex colors (rephoto error: 0.47), (b) mesh with texture (rephoto error: 0.29), (c) point cloud splatting (rephoto error: 0.51), (d) image-based rendering (IBR) (rephoto error: 0.62). Geometry-based evaluation methods [Seitz et al. 2006; Strecha et al. 2008; Aanæs et al. 2016] cannot distinguish (a) from (b) as both have the same geometry. While the splatted point cloud (c) could in principle be evaluated with these methods, the IBR solution (d) cannot be evaluated at all.

1. INTRODUCTION

Intense research in the computer vision and computer graphics communities has led to a wealth of image-based modeling and rendering systems that take images as input, construct a model of the scene, and then create photo-realistic renderings for novel viewpoints. The computer vision community contributed tools such as structure from motion and multi-view stereo to acquire geometric models that can subsequently be textured. The computer graphics community proposed various geometry- or image-based rendering systems. Some of these, such as the Lumigraph [Gortler et al. 1996], synthesize novel views directly from the input images (plus a rough geometry approximation), producing photo-realistic results without relying on a detailed geometric model of the scene. Even though remarkable progress has been made in the area of modeling and rendering of real scenes, a wide range of issues remain, especially when dealing with complex datasets under uncontrolled conditions. In order to measure and track the progress of this ongoing research, it is essential to perform objective evaluations.


Fig. 2. Four submissions to the Middlebury benchmark to which we added texture [Waechter et al. 2014]: (a) Merrell Confidence [Merrell et al. 2007] (geom. error: 0.83 mm, completeness: 0.88%) and (b) Generalized-SSD [Calakli et al. 2012] (geom. error: 0.81 mm, completeness: 0.96%) have similar geometric error and different visual quality. (c) Campbell [2008] (geom. error: 0.48 mm, completeness: 0.99%) and (d) Hongxing [2010] (geom. error: 0.79 mm, completeness: 0.96%) have different geometric error and similar visual quality.

Existing evaluation efforts focus on systems that acquire mesh models. They compare the reconstructed meshes with ground truth geometry and evaluate measures such as geometric completeness and accuracy [Seitz et al. 2006; Strecha et al. 2008; Aanæs et al. 2016]. This approach falls short in several regards: First, only scenes with available ground-truth models can be analyzed. Ground-truth models are typically not available for large-scale reconstruction projects outside the lab such as PhotoCity [Tuite et al. 2011], but there is, nevertheless, a need to evaluate reconstruction quality and identify problematic scene parts. Second, evaluating representations other than meshes is problematic: Point clouds can only be partially evaluated (only reconstruction accuracy can be measured but not completeness), and image-based rendering representations or light fields [Levoy and Hanrahan 1996] cannot be evaluated at all. And third, it fails to measure properties that are complementary to geometric accuracy. While there are a few applications where only geometric accuracy matters (e.g., reverse engineering or 3D printing), most applications that produce renderings for human consumption are arguably more concerned with visual quality. This is for instance the main focus in image-based rendering, where the geometric proxy does not necessarily have to be accurate and only the visual accuracy of the resulting renderings matters.

If we consider, e.g., multi-view stereo reconstruction pipelines, for which geometric evaluations such as the common multi-view benchmarks [Seitz et al. 2006; Strecha et al. 2008; Aanæs et al. 2016] were designed, we can easily see that visual accuracy is complementary to geometric accuracy. The two measures are intrinsically related since errors in the reconstructed geometry tend to be visible in renderings (e.g., if a wall's depth has not been correctly estimated, this may not be visible in a frontal view but definitely when looking at it at an angle), but they are not fully correlated and are therefore distinct measures. Evaluating the visual quality adds a new element to multi-view stereo reconstruction: recovering a good surface texture in addition to the scene geometry. Virtual scenes are only convincing if effort is put into texture acquisition, which is challenging, especially with datasets captured under uncontrolled, real-world conditions with varying exposure, illumination, or foreground clutter. If this is done well, the resulting texture may even hide small geometric errors.

A geometric reconstruction evaluation metric does not allow directly measuring the achieved visual quality of the textured model, and it is not always a good indirect predictor for visual quality: In Figure 2 we textured four submissions to the Middlebury benchmark using Waechter et al.'s approach [2014]. The textured model in Figure 2(a) is strongly fragmented while the model in 2(b) is not. Thus, the two renderings exhibit very different visual quality. Their similar geometric error, however, does not reflect this. Conversely, Figures 2(c) and 2(d) have very different geometric errors despite similar visual quality. Close inspection shows that 2(d) has a higher geometric error because its geometry is too smooth. This is hidden by the texture, at least from the viewpoints of the renderings. In both cases the similarity of geometric error is a poor predictor for the similarity of visual quality, clearly demonstrating that the purely distance-based Middlebury evaluation is by design unable to capture certain aspects of 3D reconstructions. Thus, a new methodology that evaluates visual reconstruction quality is needed.

Most 3D reconstruction pipelines are modular: Typically, structure from motion (SfM) is followed by dense reconstruction, texturing, and rendering (in image-based rendering the latter two steps are combined). While each of these steps can be evaluated individually, our proposed approach is holistic and evaluates the complete pipeline including the rendering step by scoring the visual quality of the final renderings. This is more consistent with the way humans would assess quality: They are not concerned with the quality of intermediate representations (e.g., a 3D mesh model), but instead consider 2D projections, i.e., renderings of the final model from different viewpoints.

Leclerc et al. [2000] pointed out that in the real world the inferences that people make from one viewpoint are consistent with the observations from other viewpoints. Consequently, good models of real-world scenes must be self-consistent as well. We exploit this self-consistency property as follows: We divide the set of captured images into a training set and an evaluation set. We then reconstruct the training set with an image-based modeling pipeline, render novel views with the camera parameters of the evaluation images, and compare those renderings with the evaluation images using selected image difference metrics.

This can be seen as a machine learning view on 3D reconstruction: Algorithms infer a model of the world based on training images and make predictions for evaluation images. For systems whose purpose is the production of realistic renderings, our approach is the most natural evaluation scheme because it directly evaluates the output instead of internal models. In fact, our approach encourages future reconstruction techniques to take photo-realistic renderability into account. This line of thought is also advocated by Shan et al.'s Visual Turing Test [2013] and Vanhoey et al.'s appearance-preserving mesh simplification [2015].

Our work draws inspiration from Szeliski [1999], who proposed to use intermediate frames for optical flow evaluation. We extend this into a complete evaluation paradigm which is able to handle a diverse set of image-based modeling and rendering methods. Since the idea of imitating image poses resembles computational rephotography [Bae et al. 2010] as well as the older concept of repeat photography [Webb et al. 2010], we call our technique virtual rephotography and call the renderings rephotos. By enabling the evaluation of visual accuracy, virtual rephotography puts visual accuracy on a level with geometric accuracy as a quality measure for 3D reconstructions.

In summary, our contributions are as follows:

—A flexible evaluation paradigm using the novel view prediction error that can be applied to any renderable scene representation,

—quantitative view-based error metrics in terms of image difference and completeness for the evaluation of photo-realistic renderings,

—a thorough evaluation of our methodology on several datasets and with different reconstruction and rendering techniques, and

—a virtual rephotography-based benchmark.


Our approach has the following advantages over classical evaluation systems:

—It allows measuring aspects complementary to geometry, such as texture quality in a multi-view stereo plus texturing pipeline,

—it dramatically simplifies the creation of new benchmarking datasets since it does not require a ground truth geometry acquisition and vetting process,

—it enables direct error visualization and localization on the scene representation (see, e.g., Figure 11), which is useful for error analysis and acquisition guidance, and

—it makes reconstruction quality directly comparable across different scene representations (see, e.g., Figure 1) and thus closes a gap between computer vision and graphics techniques.

2. RELATED WORK

Photo-realistic reconstruction and rendering is a topic that spans both computer graphics and vision. On the reconstruction side, the geometry-based Middlebury multi-view stereo benchmark [Seitz et al. 2006] and other factors have triggered research on different reconstruction approaches. On the image-based rendering (IBR) side, many evaluation approaches have been proposed [Schwarz and Stamminger 2009; Berger et al. 2010; Guthe et al. 2016] but most of them are too specialized to be directly transferable to 3D reconstruction. This paper takes a wider perspective by considering complete reconstruction and rendering pipelines.

In the following we first provide a general overview of evaluation techniques for image-based modeling and rendering, before we discuss image comparison metrics with respect to their suitability for our proposed virtual rephotography framework.

2.1 Evaluating Image-Based Modeling & Rendering

Algorithms need to be objectively evaluated in order to prove that they advance their field [Förstner 1996]. For the special case of multi-view stereo (MVS) this is, e.g., done using the Middlebury MVS benchmark [Seitz et al. 2006]: It evaluates algorithms by comparing reconstructed geometry with scanned ground truth, and measures accuracy (distance of the mesh vertices to the ground truth) and completeness (percentage of ground truth nodes within a threshold of the reconstruction). The downsides of this purely geometric evaluation have been discussed in Section 1. In addition, this benchmark has aged and cannot capture most aspects of recent MVS research (e.g., preserving fine details when merging depth maps with drastically different scales or recovering texture). Strecha et al. [2008] released a more challenging benchmark with larger and more realistic architectural outdoor scenes and larger images. They use laser-scanned ground truth geometry and compute measures similar to the Middlebury benchmark. Most recently, Aanæs et al. [2016] released a more comprehensive dataset of controlled indoor scenes with larger, higher quality images, more accurate camera positions, much denser ground truth geometry, and a modified evaluation protocol that is still based on geometric accuracy. They therefore address no fundamentally new challenges, and all objections stated against Middlebury above apply here as well. Thus, researchers who address more challenging conditions or non-geometric aspects still have to rely mostly on qualitative comparison, letting readers judge their results by visual inspection.

Szeliski [1999] encountered the same problem in the evaluation of optical flow and stereo and proposed novel view prediction error as a solution: Instead of measuring how well algorithms estimate flow, he measures how well the estimated flow performs in frame rate doubling. Given two video frames for time t and t+2, flow algorithms predict the frame t+1, and this is compared with the non-public ground truth frame. Among other metrics this has been implemented in the Middlebury flow benchmark [Baker et al. 2011]. Leclerc et al. [2000] use a related concept for stereo evaluation: They call a stereo algorithm self-consistent if its depth hypotheses for image I1 are the same when inferred from image pairs (I0, I1) and (I1, I2). Szeliski's (and our) criterion is more flexible: It allows the internal model (a flow field for Szeliski, depth for Leclerc) to be wrong as long as the resulting rendering looks correct, a highly relevant case as demonstrated by Hofsetz et al. [2004]. Virtual rephotography is clearly related to these approaches. However, Szeliski only focused on view interpolation in stereo and optical flow. We extend novel view prediction error to the much more challenging general case of image-based modeling and rendering, where views have to be extrapolated over much larger distances.

Novel view prediction has previously been used in image-based modeling and rendering, namely for BRDF recovery [Yu et al. 1999, Section 7.2.3]. However, Yu et al. only showed qualitative comparisons. The same holds for the Visual Turing Test: Shan et al. [2013] ask users in a study to compare renderings and original images at different resolutions to obtain a qualitative judgment of realism. In contrast, we automate this process by comparing renderings and original images from the evaluation set using several image difference metrics to quantitatively measure and localize reconstruction defects. We suggest the novel view prediction error as a general, quantitative evaluation method for the whole field of image-based modeling and rendering, and present a comprehensive evaluation to shed light on its usefulness for this purpose.

Fitzgibbon et al. [2005] use the novel view prediction error to quantify their IBR method's error, but they neither reflect on this method's properties nor do they use the obtained difference images to compare their results with other IBR methods. Further, their image difference metric (RGB difference) only works in scenarios where all images have identical camera response curves, illumination, and exposure. We show in the supplemental material that a very similar metric (Cb+Cr difference in YCbCr color space) fails in settings that are less constrained than Fitzgibbon et al.'s.

One 3D reconstruction quality metric that, similarly to ours, does not require geometric ground truth data is Hoppe et al.'s [2012a] metric for view planning in multi-view stereo. For a triangle mesh it checks each triangle's degree of visibility redundancy and maximal resolution. In contrast to our method it does not measure visual reconstruction quality itself, but rather circumstances that are assumed to cause reconstruction errors.

In image-based rendering a variety of works cover the automatic or manual evaluation of result quality. Schwarz and Stamminger [2009], Berger et al. [2010], and Guthe et al. [2016] automatically detect ghosting and popping artifacts in IBR. Vangorp et al. [2011] investigate how users rate the severity of ghosting, blurring, popping, and parallax distortion artifacts in street-level IBR subject to parameters such as scene coverage by the input images, number of images used to blend output pixels, or viewing angle. In later work, Vangorp et al. [2013] evaluate how users perceive parallax distortions in street-level IBR (more specifically, distortions of rectangular protrusions in synthetic facade scenes rendered with planar IBR proxies) and derive metrics that indicate where in an IBR scene a user should be allowed to move or where additional images of the scene should be captured to minimize distortions. Like in Hoppe et al.'s [2012a] work, their final metric is independent of the actual visual appearance of the rendered scene and only takes its geometry into account. In video-based rendering, Tompkin et al. [2013] analyze user preferences for different types of transitions between geometrically connected videos, and how users perceive the artifacts that can occur.


Fig. 3. Left to right: Input image, rendering from the same viewpoint, and HDR-VDP-2 result (color scale from 0 to 1: VDP error detection probability).

None of the above papers' results can be easily transferred to non-IBR scene representations such as textured MVS reconstructions, and some of them cannot even be easily generalized to IBR of general scenes.

2.2 Image Comparison Metrics

Our approach reduces the task of evaluating 3D reconstructions to comparing evaluation images with their rendered counterparts using image difference metrics. A basic image metric is the pixel-wise mean squared error (MSE), which is, however, sensitive to typical image transformations such as global luminance changes. Wang et al. [2004] proposed the structural similarity index (SSIM), which compares similarity in a local neighborhood around pixels after correcting for luminance and contrast differences. Stereo and optical flow methods also use metrics that compare image patches instead of individual pixels for increased stability: Normalized cross-correlation (NCC), Census [Zabih and Woodfill 1994], and zero-mean sum of squared differences are all to some degree invariant under low-frequency luminance and contrast changes.

From a conceptual point of view, image comparison in the photo-realistic rendering context should be performed with human perception in mind. The visual difference predictor VDP [Daly 1993] and its high dynamic range variant HDR-VDP-2 [Mantiuk et al. 2011] are based on the human visual system and detect differences near the visibility threshold. Since reconstruction defects are typically significantly above this threshold, these metrics are too sensitive and unsuitable for our purpose. As Figure 3 shows, HDR-VDP-2 marks almost the whole rendering as defective compared to the input image. For HDR imaging and tone mapping Aydın et al. [2008] introduced the dynamic range independent metric DRIM, which is invariant under exposure changes. However, DRIM's as well as HDR-VDP-2's range of values is hard to interpret in the context of reconstruction evaluation. Finally, the visual equivalence predictor [Ramanarayanan et al. 2007] measures whether two images have the same high-level appearance even in the presence of structural differences. However, it requires knowledge about scene geometry and materials, which we explicitly want to avoid. Given these limitations of perceptually-based methods, we utilize more basic image correlation methods. Our experiments show that they work well in our context.

3. EVALUATION METHODOLOGY

Before introducing our own evaluation methodology, we would like to propose and discuss a set of general desiderata that should ideally be fulfilled: First, an evaluation methodology should evaluate the actual use case of the approach under examination. For example, if the purpose of a reconstruction is to record accurate geometric measurements, a classical geometry-based evaluation [Seitz et al. 2006; Strecha et al. 2008; Aanæs et al. 2016] is the correct methodology. In contrast, if the reconstruction is used as a proxy in an image-based rendering setting, one should instead evaluate the achieved rendering quality. This directly implies that the results of different evaluation methods will typically not be consistent unless a reconstruction is perfect.

Second, an evaluation methodology should be able to operate solely on the data used for the reconstruction itself, without requiring additional data that is not available per se. It should be applicable on site while capturing new data. Furthermore, the creation of new benchmark datasets should be efficiently possible. One of the key problems of existing geometry-based evaluations is that the creation of ground-truth geometry models with high enough accuracy requires a lot of effort (e.g., high quality scanning techniques in a very controlled environment). Thus, it is costly and typically only used to benchmark an algorithm's behavior in a specific setting. An even more important limitation is that it cannot be applied to evaluate an algorithm's performance on newly captured data.

Third, an evaluation methodology should cover a large number of reconstruction methods. The computer graphics community's achievements in image-based modeling and rendering are fruitful for the computer vision community and vice versa. Being able to only evaluate a limited set of reconstruction methods restricts inter-comparability and the exchange of ideas between the communities.

Finally, the metric for the evaluation should be linear (i.e., if the quality of model B lies exactly between that of models A and C, its score should also lie exactly between their scores). If this is not possible, the metric should at least give reconstructions an ordering (i.e., if model A is better than B, its error score should be lower than B's). In the following we will always refer to this as the ordering criterion.

Fulfilling all of these desiderata simultaneously and completely is challenging. In fact, the classical geometry-based evaluation methods [Seitz et al. 2006; Strecha et al. 2008; Aanæs et al. 2016] satisfy only the first and the last item above (in geometry-only scenarios they evaluate the use case and they are linear). In contrast, our virtual rephotography approach fulfills all except for the last requirement (it provides an ordering but is not necessarily linear). We therefore argue that it is an important contribution that is complementary to existing evaluation methodologies.

In the following, we first describe our method, the overall workflow, and the metrics used in detail before evaluating our method in Section 4 using a set of controlled experiments.

3.1 Overview and Workflow

The key idea of our proposed evaluation methodology is that the performance of each algorithm stage is measured in terms of its impact on the final rendering result. This makes the specific system to be evaluated largely interchangeable and allows evaluating different combinations of components end-to-end. The only requirement is that the system builds a model of the world from given input images and can then produce (photo-realistic) renderings for the viewpoints of test images.

We now give a brief overview of the complete workflow: Given a set of input images of a scene such as the one depicted in Figure 4a, we perform reconstruction using n-fold cross-validation. In each of the n cross-validation instances we put 1/n-th of the images into an evaluation set E and the remaining images into a training set T. The training set T is then handed to the reconstruction algorithm, which produces a 3D representation of the scene. This could, e.g., be a textured mesh, a point cloud with vertex colors, or the internal representation of an image-based rendering approach such as the set of training images combined with a geometric proxy of the scene. In Sections 4 and 5 we show multiple examples of reconstruction algorithms which we used for evaluation.


Fig. 4. An overview of the complete virtual rephotography pipeline: (a) photo, (b) virtual rephoto, (c) completeness mask, (d) error image (1-NCC), (e) error projection.

The reconstruction algorithm then rephotographs the scene, i.e., renders it photo-realistically using its own, native rendering system with the exact extrinsic and intrinsic camera parameters of the images in E (see Figure 4b for an example). If desired, this step can also provide a completeness mask that marks pixels not included in the rendered reconstruction (see Figure 4c). Note that we regard obtaining the camera parameters as part of the reconstruction algorithm. However, since the test images are disjoint from the training images, camera calibration (e.g., using structure from motion [Snavely et al. 2006]) must be done beforehand to have all camera parameters in the same coordinate frame. State-of-the-art structure from motion systems are sub-pixel accurate for the most part (otherwise multi-view stereo would not work on them) and are assumed to be accurate enough for the purpose of this evaluation.

Next, we compare rephotos and test images with image difference metrics and ignore those image regions that the masks mark as unreconstructed (see Figure 4d for an example of a difference image). We also compute completeness as the fraction of valid mask pixels. We then average completeness and error scores over all rephotos to obtain global numerical scores for the whole dataset.

Finally, we can project the difference images onto the reconstructed model to visualize local reconstruction error (see Fig. 4e).
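To make the workflow concrete, the following is a minimal sketch of the cross-validation loop in Python. The callbacks reconstruct, render, and masked_error are hypothetical placeholders for the system under test and the chosen image difference metric; only the evaluation bookkeeping is shown.

```python
import numpy as np

def evaluate_pipeline(images, cameras, reconstruct, render, masked_error, n_folds=20):
    """Virtual rephotography evaluation via n-fold cross-validation.

    reconstruct, render, and masked_error are placeholders for the
    pipeline under test and the chosen image difference metric.
    """
    indices = np.random.permutation(len(images))
    folds = np.array_split(indices, n_folds)
    errors, completenesses = [], []
    for eval_idx in folds:
        train_idx = np.setdiff1d(indices, eval_idx)
        # Build the scene representation from the training images only.
        model = reconstruct([images[i] for i in train_idx],
                            [cameras[i] for i in train_idx])
        for i in eval_idx:
            # Rephoto: render with the exact camera parameters of the test image.
            rephoto, mask = render(model, cameras[i])
            errors.append(masked_error(rephoto, images[i], mask))
            completenesses.append(mask.mean())   # fraction of rendered pixels
    return np.mean(errors), np.mean(completenesses)
```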

3.2 Accuracy and Completeness

In order to evaluate the visual accuracy of rephotos we measure their similarity to the test images using image difference metrics. The simplest choice would be the pixel-wise mean squared error. The obvious drawback is that it is not invariant to luminance changes. If we declared differences in luminance as a reconstruction error, we would effectively require all image-based reconstruction and rendering algorithms to bake illumination effects into their reconstructed models or produce them during rendering. However, currently only few reconstruction algorithms recover the true albedo and reflection properties of surfaces as well as the scene lighting (examples include Haber et al.'s [2009] and Shan et al.'s [2013] works). An evaluation metric that only works for such methods would have a very small scope. Furthermore, in real-world datasets illumination can vary among the input images, and capturing the ground truth illumination for the test images would drastically complicate our approach. Thus, it seems adequate to use luminance-invariant image difference metrics.

We therefore use the YCbCr color space and sum up the absolute errors in the two chrominance channels. We call this the Cb+Cr error in the following. This metric takes, however, only single pixels into consideration and in practice mostly detects minor color noise.

Thus, we also analyze patch-based metrics, some of which are frequently used in stereo and optical flow: zero-mean sum of squared differences (ZSSD), Wang et al.'s [2004] structural dissimilarity index DSSIM = (1−SSIM)/2, normalized cross-correlation (we use 1-NCC instead of NCC to obtain a dissimilarity metric), Census [Zabih and Woodfill 1994], and six variants of Preiss et al.'s [2014] improved color image difference iCID. For presentation clarity we will only discuss one metric throughout this paper: 1-NCC. This is because it is a well-established tool in computer vision (e.g., in stereo or optical flow) precisely for our purpose of comparing image similarity in the presence of changes in illumination, exposure, etc., and because (as we shall see in Sections 4 and 5.1.1) it best fulfills properties such as the ordering criterion defined in Section 3. ZSSD, DSSIM, Census, and iCID fulfill the same properties but to a lesser degree and are therefore shown in the supplemental material. Further, we show in the supplemental material that the pixel-based Cb+Cr error is in general unable to detect larger structural defects and does not fulfill the ordering criterion.

In conjunction with the above accuracy measures one must always compute some completeness measure, which states the fraction of the test set for which the algorithm made a prediction. Otherwise, algorithms could resort to rendering only those scene parts where they are certain about their prediction's correctness. For the same reason machine learning authors report precision and recall, and geometric reconstruction benchmarks [Seitz et al. 2006; Strecha et al. 2008] report geometric accuracy and completeness. For our purpose we use the percentage of rendered pixels as completeness. It is hereby understood that many techniques cannot reach a completeness score of 100% since they do not model the complete scene visible in the input images.
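As an illustration, below is a minimal sketch of the masked, patch-based 1-NCC error together with the completeness score for a single rephoto (grayscale images for simplicity). The dense, non-overlapping patch grid and the skipping of partially masked patches are assumptions made for brevity, not a specification of the exact implementation.

```python
import numpy as np

def rephoto_error_1_ncc(rephoto, photo, mask, patch=9):
    """1-NCC error and completeness for one rephoto (float grayscale arrays).

    mask is True where the reconstruction produced a pixel. Patch handling
    (non-overlapping grid, skipping partially masked patches) is an assumption.
    """
    errors = []
    h, w = photo.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            m = mask[y:y+patch, x:x+patch]
            if not m.all():                      # ignore unreconstructed regions
                continue
            a = rephoto[y:y+patch, x:x+patch].ravel()
            b = photo[y:y+patch, x:x+patch].ravel()
            a = a - a.mean()
            b = b - b.mean()
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            ncc = (a @ b) / denom if denom > 1e-8 else 0.0
            errors.append(1.0 - ncc)             # dissimilarity, 0 = identical patch
    completeness = mask.mean()                   # fraction of rendered pixels
    return (float(np.mean(errors)) if errors else None), completeness
```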

4. EVALUATION

In the following, we perform an evaluation of our proposed methodology using a range of experiments. In Sections 4.1 and 4.2 we first demonstrate how degradations in the reconstructed model or the input data influence the computed accuracy. We show in particular that our metric fulfills the ordering criterion defined in Section 3. In Section 4.3 we discuss the relation between our methodology and the standard Middlebury MVS benchmark [Seitz et al. 2006]. Finally, we analyze in Section 4.4 to what extent deviating from the classical, strict separation of training and test set decreases the validity of our evaluation methodology.

4.1 Evaluation with Synthetic Degradation

In this section we show that virtual rephotography fulfills the aforementioned ordering criterion on very controlled data. If we have a dataset's ground truth geometry and can provide a good quality texture, this should receive zero or a very small error. Introducing artificial defects into this model decreases the model's quality, which should in turn be detected by the virtual rephotography approach.

We take Strecha et al.'s [2008] Fountain-P11 dataset, for which camera parameters and ground truth geometry are given. We compensate for exposure differences in the images (using the images' exposure times and an approximate response curve for the camera used) and project them onto the mesh to obtain a near-perfectly colored model with vertex colors (the ground truth mesh's resolution is so high that this is effectively equivalent to applying a high-resolution texture to the model). Rephotographing this colored model with precise camera calibration (including principal point offset and pixel aspect ratio) yields pixel-accurate rephotos.


Fig. 5. Fountain-P11 [Strecha et al. 2008]. Left to right: Ground truth, texture noise (ntex ≈ 30%), geometry noise (ngeom ≈ 0.5%), and simplified mesh (nsimp ≈ 99%).

Fig. 6. 1-NCC error for (left to right) texture noise, geometry noise, and mesh simplification applied to Fountain-P11.

Starting from the colored ground truth we synthetically apply three different kinds of defects to the model (see Figure 5 for examples) to evaluate their effects on our metrics:

—Texture noise: In order to model random photometric distortions, we change vertex colors using simplex noise [Perlin 2002]. The noise parameter ntex is the maximum offset per RGB channel.

—Geometry noise: Geometric errors in reconstructions are hard to model. We therefore use a very general model and displace vertices along their normals using simplex noise (see the sketch after this list). The parameter ngeom is the maximum offset as a fraction of the scene's extent.

—Mesh simplification: To simulate different reconstruction resolutions, we simplify the mesh using edge collapse operations. The parameter nsimp is the fraction of eliminated vertices.
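As referenced in the geometry noise item above, a minimal sketch of that degradation is given below. It assumes the mesh is available as NumPy vertex and normal arrays and uses the third-party noise package for simplex noise (an assumption; any smooth 3D noise would serve); the frequency parameter is likewise a hypothetical choice.

```python
import numpy as np
import noise  # third-party simplex noise package (assumed available)

def add_geometry_noise(vertices, normals, n_geom, frequency=10.0):
    """Displace each vertex along its normal by smooth simplex noise.

    n_geom is the maximum offset as a fraction of the scene extent,
    mirroring the parameter described in the list above.
    """
    extent = float(np.linalg.norm(vertices.max(axis=0) - vertices.min(axis=0)))
    scale = frequency / extent                    # hypothetical noise frequency
    offsets = np.array([noise.snoise3(float(v[0]) * scale,
                                      float(v[1]) * scale,
                                      float(v[2]) * scale)
                        for v in vertices])       # noise values in [-1, 1]
    return vertices + (n_geom * extent) * offsets[:, None] * normals
```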

In Figure 6 we apply all three defects with increasing strength and evaluate the resulting meshes using our method. The 1-NCC error increases with the strength of each degradation. One reason why the error does not vanish for ntex = ngeom = nsimp = 0 is that we cannot easily produce realistic shading effects.

4.2 Evaluation with Multi-View Stereo Data

In the following experiment we demonstrate the ordering criterion on MVS reconstructions with more uncontrolled data. Starting from the full set of training images at full resolution, we decrease reconstruction quality by (a) reducing the number of images used for reconstruction and (b) reducing the resolution of the images, and show that virtual rephotography detects these degradations.

We evaluate our system with two MVS reconstruction pipelines. In the first pipeline we generate a dense, colored point cloud with CMVS [Furukawa et al. 2010; Furukawa and Ponce 2010], mesh the points using Poisson surface reconstruction (PSR) [Kazhdan et al. 2006], and remove superfluous triangles generated from low octree levels. We refer to this pipeline as CMVS+PSR. In the second pipeline we generate depth maps for all views with an algorithm for community photo collections (CPCs) [Goesele et al. 2007; Fuhrmann et al. 2015] and merge them into a global mesh using a multi-scale (MS) depth map fusion approach [Fuhrmann and Goesele 2011] to obtain a high-resolution output mesh with vertex colors. We refer to this pipeline as CPC+MS.

Fig. 7. City Wall virtual rephotography results for the CMVS+PSR pipeline: (a) completeness and (b) 1-NCC error, plotted against training set size |T| for image resolutions h ∈ {375, 750, 1500}. Boxplots show minimum, lower quartile, median, upper quartile, and maximum of 20 cross-validation iterations. Points for |T| = 561 have been generated without cross-validation (see Section 4.4).

We use a challenging dataset of 561 photos of an old city wall (downloadable from www.gcc.tu-darmstadt.de/home/proj/mve) with fine details, non-rigid parts (e.g., people and plants), and moderate illumination changes (it was captured over the course of two days). We apply SfM [Snavely et al. 2006] once to the complete dataset and use the recovered camera parameters for all subsets of training and test images. We then split all images into 533 training images (T) and 28 evaluation images (E). To incrementally reduce the number of images used for reconstruction we divide T three times in half. We vary image resolution by using images of size h ∈ {375, 750, 1500} (h is the images' shorter side length). For evaluation we always render the reconstructed models with h = 750 and use a patch size of 9×9 pixels for all patch-based metrics.

Figure 7 shows results for CMVS+PSR. CPC+MS behaves similarly and its results are shown in the supplemental material. The graphs give multiple insights: First (Figure 7a), giving the reconstruction algorithm more images, i.e., increasing |T|, increases our completeness measure. The image resolution h, on the other hand, has only a small impact on completeness. And second (Figure 7b), increasing the image resolution h decreases the 1-NCC error: If we look at the boxplots for a fixed |T| and varying h, they are separated and ordered. Analogously, 1-NCC can distinguish between datasets of varying |T| and fixed h: Datasets with identical h and larger |T| have a lower median error. These results clearly fulfill the desired ordering criterion stated above.

4.3 Comparison with Geometry-based Benchmark

The Middlebury MVS benchmark [Seitz et al. 2006] provides images of two objects, Temple and Dino, together with accurate camera parameters. We now investigate the correlation between virtual rephotography and Middlebury's geometry-based evaluation using the TempleRing and DinoRing variants of the datasets, which are fairly stable to reconstruct and are most frequently submitted for evaluation. With the permission of their respective submitters, we analyzed 41 TempleRing and 39 DinoRing submissions with publicly available geometric error scores. We transformed the models with the ICP alignment matrices obtained from the Middlebury evaluation, removed superfluous geometry below the model base, textured the models [Waechter et al. 2014], and evaluated them.

Figure 8 shows the 1-NCC rephoto error plotted against geometric error scores. Analyzing the correlation between 1-NCC and geometric error yields correlation coefficients of 0.63 for the TempleRing and 0.69 for the DinoRing, respectively.


Fig. 8. 1-NCC rephoto error against geometric error (in mm) for 41 TempleRing and 39 DinoRing datasets. Datasets marked with a–d are shown in Figure 2a–d. The dashed line is a linear regression fitted to the data points.

Figure 8 has, however, several outliers that deserve a closer look: E.g., the geometric errors for the methods Merrell Confidence [Merrell et al. 2007] (marked with an a) and Generalized-SSD [Calakli et al. 2012] (marked b) are similar, whereas their 1-NCC errors differ strongly. Conversely, the geometric errors of Campbell [2008] (c) and Hongxing [2010] (d) are very different, but their 1-NCC errors are identical. We showed renderings of these models in Figure 2 and discussed the visual dissimilarity of 2a and 2b and the similarity of 2c and 2d in Section 1. Apparently, visual accuracy explains why for the pairs a/b and c/d the rephoto error does not follow the geometric error. Clearly, virtual rephotography captures aspects complementary to the purely geometric Middlebury evaluation.

4.4 Disjointness of Training and Test Set

Internal models of reconstruction algorithms can be placed along a continuum between local and global methods. Global methods such as MVS plus surface reconstruction and texture mapping produce a single model that explains the underlying scene for all views as a whole. Local methods such as image-based rendering produce a set of local models (e.g., images plus corresponding depth maps), each of which describes only a local aspect of the scene.

It is imperative for approaches with local models to separate training and test images since they could otherwise simply display the known test images for evaluation and receive a perfect score. We therefore randomly split the set of all images into disjoint training and test sets (as is generally done in machine learning) and use cross-validation to be robust to artifacts caused by unfortunate splitting. However, using all available images for reconstruction typically yields the best results, and it may therefore be undesirable to "waste" perfectly good images by solely using them for the evaluation. This is particularly relevant for datasets which contain only few images to begin with and for which reconstruction may fail when removing images. Also, even though cross-validation is an established statistical tool, it is very resource- and time-consuming.

We now show for the global reconstruction methods CPC+MS and CMVS+PSR that evaluation can be done without cross-validation with no significant change in the results. On the City Wall dataset we omit cross-validation and obtain the data points for |T| = 561 in Figure 7. Again, we only show the CMVS+PSR results here and show the CPC+MS results in the supplemental material. Most data points have a slightly smaller error than the median of the largest cross-validation experiments (|T| = 533), which is reasonable as the algorithm has slightly more images to reconstruct from. Neither CMVS+PSR nor CPC+MS seems to overfit the input images. We want to point out that, although this may not be generally applicable, it seems safe to not use cross-validation for the global reconstruction approaches used here. In contrast, image-based rendering approaches such as the Unstructured Lumigraph [Buehler et al. 2001] will just display the input images and thus break an evaluation without cross-validation.

Table I. Rephoto errors for different reconstruction representations of the Castle Ruin (Figure 1).

Representation              Completeness   1-NCC
Mesh with texture           0.66           0.29
View-dependent texturing    0.66           0.39
Mesh with vertex color      0.66           0.47
Point cloud                 0.67           0.51
Image-based rendering       0.66           0.62

5. APPLICATIONS

Here we show two important applications of virtual rephotography: In Section 5.1 we demonstrate our approach's versatility by applying it to different reconstruction representations. One possible use case for this is a 3D reconstruction benchmark open to all image-based reconstruction and rendering techniques, such as the one that we introduce in Section 6. Finally, in Section 5.2 we use virtual rephotography to locally highlight defects in MVS reconstructions, which can, e.g., be used to guide users to regions where additional images need to be captured to improve the reconstruction.

5.1 Different Reconstruction Representations

One major advantage of our approach is that it handles arbitrary reconstruction representations as long as they can be rendered from novel views. We demonstrate this using the Castle Ruin dataset (286 images) and five different reconstruction representations, four of which are shown in Figure 1:

—Point cloud: Multi-view stereo algorithms such as CMVS [Furukawa et al. 2010] output oriented point clouds that can be rendered directly with surface splatting [Zwicker et al. 2001]. As splat radius we use the local point cloud density.

—Mesh with vertex color: This is typically the result of a surface reconstruction technique run on a point cloud.

—Mesh with texture: Meshes can be textured using the input images. Waechter et al. [2014] globally optimize the texture layout subject to view proximity, orthogonality, and focus, and perform global luminance adjustment and local seam optimization.

—Image-based rendering: Using per-view geometry proxies (depth maps) we reproject all images into the novel view and render color images as well as per-pixel weights derived from a combination of angular error [Buehler et al. 2001] and TS3 error [Kopf et al. 2014]. We then fuse the color and weight image stack by computing the weighted per-channel median of the color images (see the sketch after this list).

—View-dependent texturing: We turn the previous IBR algorithm into a view-dependent texturing algorithm by replacing the local geometric proxies with a globally reconstructed mesh.
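The weighted per-channel median fusion mentioned in the image-based rendering item above can be sketched as follows. This is a straightforward, unoptimized per-pixel formulation assuming a stack of reprojected color images and matching weight images; it illustrates the fusion rule only, not the actual rendering implementation.

```python
import numpy as np

def weighted_median(values, weights):
    """Weighted median of a 1-D array: the smallest value whose cumulative
    weight reaches half of the total weight."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]

def fuse_ibr_stack(colors, weights):
    """Fuse a stack of reprojected images per pixel and per channel.

    colors:  (n_images, H, W, 3) reprojected color images
    weights: (n_images, H, W)    per-pixel blending weights
    """
    n, h, w, _ = colors.shape
    out = np.zeros((h, w, 3), colors.dtype)
    for y in range(h):
        for x in range(w):
            wt = weights[:, y, x]
            if wt.sum() <= 0:              # pixel not covered by any input image
                continue
            for c in range(3):
                out[y, x, c] = weighted_median(colors[:, y, x, c], wt)
    return out
```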

Except for the IBR case we base all representations on the 3D mesh reconstructed with the CPC+MS pipeline. For IBR, we reconstruct per-view geometry (i.e., depth maps) for each input image using a standard MVS algorithm. For IBR as well as view-dependent texturing we perform leave-one-out cross-validation; the other representations are evaluated without cross-validation.

The resulting errors are shown in Table I. The 1-NCC error ranks the datasets in an order that is consistent with our visual intuition when manually examining the representations' rephotos: The textured mesh is ranked best.


Table II. Rank correlations between human judgment and rephoto error for various image difference metrics.

Metric           Rank correlation
1-NCC            0.64
Census           0.64
ZSSD             0.39
DSSIM            0.48
iCID per.        0.40
iCID hue.        0.40
iCID sat.        0.39
iCID per. csf    0.41
iCID hue. csf    0.40
iCID sat. csf    0.38
Cb+Cr            0.18

View-dependent texturing, which was ranked second, could in theory produce a lower error, but our implementation does not perform luminance adjustments at texture seams (in contrast to the texturing algorithm), which are detected as erroneous. The point cloud has almost the same error as the mesh with vertex colors since both are based on the same mesh and thus contain the same geometry and color information. Only their rendering algorithms differ, as the point cloud rendering algorithm discards vertex connectivity information. The image-based rendering algorithm is ranked lowest because our implementation suffers from strong artifacts caused by imperfectly reconstructed planar depth maps used as geometric proxies. We note that our findings only hold for our implementations and parameter choices and that no representation is in general superior to others.

In the following we will take a closer look at the relation between virtual rephotography scores and human judgment.

5.1.1 Correlation with Human Judgment. We performed a user study to determine the correlation between virtual rephotography and human ratings for the five different reconstruction representations of the Castle Ruin. We used two 1080p monitors which we calibrated to the CIE D65 white point and 160 cd/m² to ensure similar luminance and color display. Out of our 25 study participants, 9 had a background in 3D reconstruction or computer graphics and 16 did not. Prior to the main study each participant underwent a tutorial explaining the task and letting them test the interface on a dataset of five examples. We reduced the Castle Ruin dataset to 142 randomly selected views. For each of the 142 views we first showed users the input photo, and clicking revealed all five rephotos (one for each reconstruction representation). Those five rephotos were shown simultaneously on the two monitors at the same resolution at which the virtual rephotography framework evaluated them. Users then had to drag and drop the rephotos into an order ranging from the rephoto they perceived as most similar to the one they perceived as least similar to the input photo. They were instructed to not take black regions in the rephotos into account. No time limit was imposed on them and they could toggle back and forth between photo and rephotos at any time to compare them precisely. After users were satisfied with the order of the rephotos they could proceed to the next view. All 142 views were shown in random temporal order and for each view the five rephotos were shown in random spatial on-screen order.

On average, each user took about 71 min for all decisions. After the study we removed the results of two users who took a lot less time for their decisions than all others and of three users who left rephotos at their (random) initial position significantly more often than random chance would predict. Thereby, we obtained a total of (25−2−3)·142 rankings. The machine-computed virtual rephoto scores and the human rankings are not directly comparable. We therefore converted the machine scores into per-view, per-scene-representation ranks. For the human rankings we eliminated the users as a variable of the data by computing the mode (i.e., the most common value) over all users, also giving us per-view, per-scene-representation ranks. We then computed Spearman's rank correlation between machine and human ranks and obtained the correlations shown in Table II. 1-NCC and Census exhibit the strongest correlation with human judgment. The other metrics have a lower correlation, with Cb+Cr having the lowest.

cision times as well as the rank correlations we found almost nodifference between the participants with a background in 3D re-construction or computer graphics and the ones without.
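As an illustration of this rank-based comparison, the following Python sketch (our own; the array names and shapes are assumptions, not the code used for the study) converts machine scores to per-view ranks, collapses the per-user rankings via the mode, and computes Spearman's rank correlation:

    import numpy as np
    from scipy.stats import rankdata, spearmanr

    def correlate_with_humans(machine_scores, human_ranks):
        """machine_scores: (n_views, n_reprs) rephoto errors, lower = better.
        human_ranks: (n_users, n_views, n_reprs) ranks 1..n_reprs per user and view."""
        # Machine scores -> per-view ranks (rank 1 = lowest error).
        machine_rank = np.apply_along_axis(rankdata, 1, machine_scores)

        # Collapse the user dimension: most common (mode) rank per view/representation.
        n_users, n_views, n_reprs = human_ranks.shape
        human_mode = np.empty((n_views, n_reprs))
        for v in range(n_views):
            for r in range(n_reprs):
                vals, counts = np.unique(human_ranks[:, v, r], return_counts=True)
                human_mode[v, r] = vals[np.argmax(counts)]

        # Spearman's rank correlation over all (view, representation) pairs.
        rho, _ = spearmanr(machine_rank.ravel(), human_mode.ravel())
        return rho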

There is a variety of potential reasons why the correlation between the users and, e.g., 1-NCC is smaller than 1, including the following: In a questionnaire that users filled out directly after the study, we asked them whether they preferred a globally consistent but slightly blurry rendering (e.g., from the point cloud) over a sharp, detailed rendering that had distinct image regions with different exposures and sharp transitions between these regions (e.g., from our view-dependent texturing implementation). 37% of all users affirmed this question, which indicates that humans at least partially take global effects such as consistency into account when judging image similarity. In contrast, the image difference metrics employed in this paper are strictly local (i.e., restricted to their patch size) in nature and almost always prefer the sharp image, because sharp regions have a small local error and only the relatively small transition regions have a high local error.

5.2 Error Localization

If the evaluated reconstruction contains an explicit geometry model (which is, e.g., not the case for a traditional light field [Levoy and Hanrahan 1996]), we can project the computed error images onto the model to visualize reconstruction defects directly. Multiple error images projected to the same location are averaged. To improve visualization contrast we normalize all errors between the 2.5th and 97.5th percentile to the range [0, 1], clamp errors outside this range, and map all values to colors using the “jet” color map.
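The percentile normalization is straightforward; a minimal sketch follows (assuming the projected error images have already been averaged into one value per vertex; the function name is ours):

    import numpy as np

    def normalize_vertex_errors(vertex_errors, lo_pct=2.5, hi_pct=97.5):
        """Stretch averaged per-vertex errors to [0, 1] for color mapping.

        Errors below the 2.5th / above the 97.5th percentile are clamped
        to improve visualization contrast before applying a 'jet' color map.
        """
        lo, hi = np.percentile(vertex_errors, [lo_pct, hi_pct])
        normalized = (vertex_errors - lo) / max(hi - lo, 1e-12)
        return np.clip(normalized, 0.0, 1.0)

The resulting values in [0, 1] can then be mapped to RGB with any implementation of the “jet” color map.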

Figure 9 shows a 1-NCC projection on a reconstruction of the Lying Statue dataset. It highlights blob-like Poisson surface reconstruction artifacts behind the bent arm, to the right of the pedestal, and above the head. Less pronounced, it highlights the ground and the pedestal's top, which were photographed at acute angles.

Figure 10 shows that color defects resulting from combining images with different exposure are detected by the 1-NCC error.

Figure 11 shows a textured City Wall model and its 1-NCC error projection. The error projection properly highlights hard-to-reconstruct geometry (grass on the ground or the tower's upper half, which was only photographed from a distance) and mistextured parts (branches on the tower or the pedestrian on the left). The supplemental material and the video show these results in more detail.

A point worth discussing is occluders (e.g., pedestrians or plants) in a scene. Our metrics only measure differences between rephoto and test image and can therefore not tell which of the two is “erroneous”. If a test image contains an occluder and the rephoto does not (because moving or thin objects tend to reconstruct badly), the difference will be high in that image region even though the reconstruction is in essence good. Interestingly, this is often not visible in the error projection: If the occluder has not been reconstructed, the high difference is projected to the scene part behind the occluder. Typically, scene parts are seen redundantly by many views, and these views may or may not consistently see the occluder in front of the same scene part, depending on the occluder's and the camera's movement. If an occluder stays at the same spot and is close to the geometry behind it, all views will consistently project their error to the same scene part, as is the case for the plant in Figure 10. But if the occluder moves or the camera moves around the occluder, the error will be projected to inconsistent places and averaged out, as is the case for the pedestrians in the City Wall: They only show up in the error projection wherever they are present in rephotos (i.e., where they were baked into the texture in Figure 11's far left) but not where they are present in test images. So, for error localization, occluders in test images are not a big issue if there are enough test images to average them out.


Fig. 9. Lying Statue: CMVS+PSR reconstruction and 1-NCC error projection (color scale: per-vertex average 1-NCC error, from ≤ 0.08 to ≥ 0.92).

Fig. 10. Left to right: Photo, rephoto with color defect framed in red, and 1-NCC error projection of the framed segment.


Fig. 11. Textured City Wall reconstruction. Top: A pedestrian and a tree being used as texture. Bottom: The 1-NCC error projection highlights both (color scale: ≤ 0.10 to ≥ 0.75).


6. BENCHMARK

Based on our proposed framework we built a benchmark for image-based modeling and rendering which is available online at ibmr-benchmark.gcc.informatik.tu-darmstadt.de. It provides training images with known camera parameters, asks submitters to render images using the camera parameters of secret test images, and computes a completeness and 1-NCC error score for the test images. For authors focusing on individual reconstruction aspects instead of a complete pipeline we provide shortcuts such as a decent mesh, depth maps for IBR, and texturing code. We believe that such a unified benchmark will be fruitful since it provides a common base for image-based modeling and rendering methods from both the graphics and the vision community. It opens up novel and promising research directions by “allow[ing] us to measure progress in our field and motivat[ing] us to develop better algorithms” [Szeliski 1999].
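For illustration, a hypothetical scoring loop could look as follows. This is our own sketch, not the benchmark's code; we assume here that completeness is the fraction of test-image pixels the submission actually rendered and that the 1-NCC error is averaged over those pixels:

    import numpy as np

    def score_submission(rendered_views, test_images, masks, ncc_error_fn):
        """Aggregate completeness and 1-NCC error over all secret test views.

        rendered_views, test_images: lists of H x W (x 3) arrays
        masks: lists of H x W booleans marking pixels covered by the rendering
        ncc_error_fn(a, b): returns a per-pixel 1-NCC error image
        """
        completeness, errors = [], []
        for rendering, photo, mask in zip(rendered_views, test_images, masks):
            completeness.append(mask.mean())       # fraction of rendered pixels
            err = ncc_error_fn(rendering, photo)   # per-pixel 1-NCC error
            errors.append(err[mask].mean())        # averaged over covered pixels
        return float(np.mean(completeness)), float(np.mean(errors))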

7. DISCUSSION

Evaluating the quality of a 3D reconstruction is a key component required in many areas. In this paper we proposed an approach that focuses on the rendered quality of a reconstruction, or more precisely its ability to predict unseen views, without requiring a ground truth geometry model of the scene. It makes renderable reconstruction representations directly comparable and therefore accessible for a general reconstruction benchmark.

While the high-level idea has already been successfully used for evaluation in other areas (optical flow [Szeliski 1999]), we are the first to use it for quantitative evaluation in image-based modeling and rendering, and our experiments show that it exhibits several desirable properties: First, it reports quality degradation when we introduce artificial defects into a model. Second, it reports degradation when we lower the number of images used for reconstruction or reduce their resolution. Third, it ranks different 3D reconstruction representations in an order that is consistent with our visual intuition, which is also supported by the correlation between human judgment and rephoto error that we found. And fourth, its errors are correlated with the purely geometric Middlebury errors but also capture complementary aspects which are consistent with our visual intuition (Figure 2). Measuring visual accuracy solely from input images leads to useful results and enables many use cases, such as measuring the reconstruction quality improvement while tweaking parameters of a complete pipeline or while replacing pipeline parts with other algorithms. Highlighting reconstruction defects locally on the mesh can be used for acquisition planning.

Although we focused on the 1-NCC error throughout this paper, we note that all patch-based metrics (1-NCC, Census, ZSSD, DSSIM, iCID) behave similarly: They all show a strictly monotonic error increase in the experiment from Section 4.1, and they are close to scaled versions of 1-NCC in the experiments from Sections 4.2 and 4.3. In the supplemental material we show the full statistics for all experiments, all metrics, and the CPC+MS pipeline. The fact that all patch-based metrics behave largely similarly does not come as a complete surprise: In the context of detecting global illumination and rendering artifacts, Mantiuk [2013] “did not find evidence in [his] data set that any of the metrics [...] is significantly better than any other metric.” We picked 1-NCC for presentation in this paper because it separated the box plots in the experiment of Section 4.2 most clearly, it correlated most strongly with human rankings in Section 5.1.1, and in our experience its error projections (Section 5.2) are most consistent with local reconstruction errors that we found manually while inspecting the reconstructions.
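For readers unfamiliar with the metric, the following unoptimized sketch computes a per-pixel 1-NCC error for grayscale images (our illustration; the patch radius and the exact normalization used for the numbers in this paper may differ):

    import numpy as np

    def one_minus_ncc(rephoto, photo, radius=4):
        """Per-pixel 1-NCC error between two grayscale images.

        Each pixel compares the (2*radius+1)^2 patch around it in rephoto
        and photo. Both patches are mean-centered and normalized, so the
        score is invariant to local brightness/contrast changes; the error
        is 0 for a perfect match. Since NCC lies in [-1, 1], the error here
        lies in [0, 2]; the paper's scores may be scaled differently.
        """
        h, w = photo.shape
        r = radius
        err = np.full((h, w), 1.0)  # neutral value where no full patch fits
        for y in range(r, h - r):
            for x in range(r, w - r):
                a = rephoto[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64).ravel()
                b = photo[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64).ravel()
                a -= a.mean()
                b -= b.mean()
                denom = np.linalg.norm(a) * np.linalg.norm(b)
                if denom > 0:
                    err[y, x] = 1.0 - (a @ b) / denom
        return err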

Based on our experiments, a set of basic insights for future reconstruction algorithms to produce reconstructions with low rephoto error is the following: Models should be complete and unfragmented (in contrast to Figure 2a), there should not be any small pieces floating around in front of the actual model (in contrast to the supplemental material's Figure 3.3), and models do not need to be too accurate in regions where inaccuracies will never be uncovered by any test image. For example, if a reconstruction will be used for a computer game with a first-person perspective, test images (i.e., potential player positions) will all be close to the ground plane and may even be restricted to a certain path through the scene. Note that the training images do not need to be restricted in the same way.

Fig. 12. Detail of a reconstruction (left) and its 1-NCC error projection (right) of Notre Dame based on 699 Flickr images.

7.1 Limitations

An important limitation of our approach is the following: Since it has a holistic view on 3D reconstruction, it cannot distinguish between different error sources. If a reconstruction is flawed, it can detect this but is unable to pinpoint which effect or which reconstruction step caused the problem. Thus, by construction, virtual rephotography does not replace but instead complements other evaluation metrics that focus on individual pipeline steps or reconstruction representations. For example, if we evaluate an MVS plus texturing pipeline which introduces an error in the MVS step, virtual rephotography can give an indication of the error since visual and geometric error are correlated (Section 4.3). By projecting the error images onto the geometric model we can find and highlight geometric errors (Section 5.2). But to precisely debug only the MVS step, we have to resort to a geometric measure. Whether we use a specialized evaluation method such as Middlebury or a general method such as virtual rephotography is a trade-off between the types and sources of errors the method can detect and its generality (and thus the comparability of different reconstruction methods).

Another limitation revolves around various connected issues: Currently we do not evaluate whether a system handles surface albedo, BRDF, shading, interreflection, illumination, or camera response curves correctly. These can in principle be evaluated, but one would have to use metrics that are less luminance-invariant than the investigated ones, e.g., mean squared error. Furthermore, one would need to measure and provide all information about the evaluation images that is necessary for correct novel-view prediction but cannot be inferred from the training images, e.g., illumination, exposure time, camera response curve, or even shadows cast by objects not visible in any of the images. Acquiring ground truth datasets for benchmarking, etc., would become significantly more complicated and time-consuming. In certain settings it may be appropriate to incorporate the above effects, but given that most 3D reconstruction systems do not currently consider them, and that virtual rephotography already enables a large number of new applications when using the metrics we investigated, we believe that our choice of metrics in this paper is appropriate for the time being.
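To make the luminance-invariance point concrete: if a rephoto differs from the test image only by an affine exposure change, 1-NCC reports (almost) no error, whereas mean squared error does. A tiny self-contained illustration (our own, purely for exposition):

    import numpy as np

    rng = np.random.default_rng(0)
    patch = rng.random((9, 9))            # stand-in for an image patch
    brighter = 1.5 * patch + 0.2          # same content, different exposure

    def ncc(a, b):
        a = a.ravel() - a.mean()
        b = b.ravel() - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(1.0 - ncc(patch, brighter))      # ~0: 1-NCC ignores the exposure change
    print(np.mean((patch - brighter)**2))  # > 0: MSE penalizes it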

7.2 Future Work

Similar to, e.g., Hoppe et al.'s [2012b] two-stage reconstruction procedure, our system could be used to localize errors based on the 3D reconstruction of a previous image capture iteration and guide the user to low-quality areas in a subsequent capture iteration.

Furthermore, we found that datasets based on community photo collections (tourist photos from photo sharing sites) are very challenging for virtual rephotography. Most problematic here are test images with occluders or challenging illumination that are effectively impossible for reconstruction algorithms to predict. We achieve partial robustness towards this by (a) using image difference metrics that are (to a certain extent) luminance/contrast invariant, and (b) averaging all error images that contribute to the error of a mesh vertex during the error projection step. This can average out the contribution of challenging test images. In our experience, for some community photo collection datasets this seems to work, and the errors highlighted by the 1-NCC projection partially correspond to errors we found during manual inspection of the reconstructions (e.g., in Figure 12 the 1-NCC projection detects dark, blurry texture from a nighttime image on the model's left side and heavily distorted texture on the model's right), while for others it clearly does not. In the future we would like to investigate means to make our evaluation scheme more robust with respect to such challenging datasets.

ACKNOWLEDGMENTS

We are very grateful to Daniel Scharstein for providing us with the Middlebury benchmark submissions, the benchmark submitters who kindly permitted us to use their datasets, Rick Szeliski for valuable discussions about this paper, Florian Kadner for helping with the user study, and Benjamin Richter, Andre Schulz, Daniel Hitzel, Stepan Konrad, and Daniel Thul for helping with various programming and dataset acquisition tasks. This work was supported in part by the Intel Visual Computing Institute (project RealityScan).

REFERENCES

Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. 2016. Large-scale data for multiple-view stereopsis. IJCV (2016).

Tunc Ozan Aydın, Rafał Mantiuk, Karol Myszkowski, and Hans-Peter Seidel. 2008. Dynamic range independent image quality assessment. In SIGGRAPH.

Soonmin Bae, Aseem Agarwala, and Fredo Durand. 2010. Computational rephotography. ACM Transactions on Graphics 29, 3 (July 2010).

Simon Baker, Daniel Scharstein, J. P. Lewis, Stefan Roth, Michael J. Black, and Richard Szeliski. 2011. A database and evaluation methodology for optical flow. IJCV 92, 1 (2011).

Kai Berger, Christian Lipski, Christian Linz, Anita Sellent, and Marcus Magnor. 2010. A ghosting artifact detector for interpolated image quality assessment. In International Symposium on Consumer Electronics.

Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael F. Cohen. 2001. Unstructured Lumigraph rendering. In SIGGRAPH.

Fatih Calakli, Ali O. Ulusoy, Maria I. Restrepo, Gabriel Taubin, and Joseph L. Mundy. 2012. High resolution surface reconstruction from multi-view aerial imagery. In 3DIMPVT.

Neill D. Campbell, George Vogiatzis, Carlos Hernandez, and Roberto Cipolla. 2008. Using multiple hypotheses to improve depth-maps for multi-view stereo. In ECCV.

Scott Daly. 1993. The visible differences predictor: An algorithm for the assessment of image fidelity. In Digital Images and Human Vision.

Andrew Fitzgibbon, Yonatan Wexler, and Andrew Zisserman. 2005. Image-based rendering using image-based priors. IJCV 63, 2 (2005).

Wolfgang Forstner. 1996. 10 pros and cons against performance characterization of vision algorithms. In Workshop on Performance Characteristics of Vision Algorithms.

Simon Fuhrmann and Michael Goesele. 2011. Fusion of depth maps with multiple scales. In SIGGRAPH Asia.

Simon Fuhrmann, Fabian Langguth, Nils Moehrle, Michael Waechter, and Michael Goesele. 2015. MVE – An image-based reconstruction environment. Computers & Graphics 53, Part A (2015).

Yasutaka Furukawa, Brian Curless, Steven M. Seitz, and Richard Szeliski. 2010. Towards Internet-scale multi-view stereo. In CVPR.

Yasutaka Furukawa and Jean Ponce. 2010. Accurate, dense, and robust multi-view stereopsis. PAMI 32, 8 (2010).

Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, and Steven M. Seitz. 2007. Multi-view stereo for community photo collections. In ICCV.

Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. 1996. The Lumigraph. In SIGGRAPH.

Stefan Guthe, Douglas Cunningham, Pascal Schardt, and Michael Goesele. 2016. Ghosting and popping detection for image-based rendering. In 3DTV Conference.

Tom Haber, Christian Fuchs, Philippe Bekaert, Hans-Peter Seidel, Michael Goesele, and Hendrik P. A. Lensch. 2009. Relighting objects from image collections. In CVPR.

Christian Hofsetz, Kim Ng, George Chen, Peter McGuinness, Nelson Max, and Yang Liu. 2004. Image-based rendering of range data with estimated depth uncertainty. Computer Graphics and Applications 24, 4 (2004).

Yuan Hongxing, Guo Li, Yu Li, and Cheng Long. 2010. Multi-view reconstruction using band graph-cuts. Journal of Computer-Aided Design & Computer Graphics 4 (2010).

Christof Hoppe, Manfred Klopschitz, Markus Rumpler, Andreas Wendel, Stefan Kluckner, Horst Bischof, and Gerhard Reitmayr. 2012a. Online feedback for structure-from-motion image acquisition. In BMVC.

Christof Hoppe, Andreas Wendel, Stefanie Zollmann, Katrin Pirker, Arnold Irschara, Horst Bischof, and Stefan Kluckner. 2012b. Photogrammetric camera network design for micro aerial vehicles. In Computer Vision Winter Workshop.

Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. 2006. Poisson surface reconstruction. In SGP.

Johannes Kopf, Michael F. Cohen, and Richard Szeliski. 2014. First-person hyper-lapse videos. In SIGGRAPH.

Yvan G. Leclerc, Quang-Tuan Luong, and Pascal Fua. 2000. Measuring the self-consistency of stereo algorithms. In ECCV.

Marc Levoy and Pat Hanrahan. 1996. Light field rendering. In SIGGRAPH.

Rafał Mantiuk. 2013. Quantifying image quality in graphics: Perspective on subjective and objective metrics and their performance. In SPIE, Vol. 8651.

Rafał Mantiuk, Kil Joong Kim, Allan G. Rempel, and Wolfgang Heidrich. 2011. HDR-VDP-2: A calibrated visual metric for visibility and quality predictions in all luminance conditions. In SIGGRAPH.

Paul Merrell, Amir Akbarzadeh, Liang Wang, Philippos Mordohai, Jan-Michael Frahm, Ruigang Yang, David Nister, and Marc Pollefeys. 2007. Real-time visibility-based fusion of depth maps. In ICCV.

Ken Perlin. 2002. Improving noise. In SIGGRAPH.

Jens Preiss, Felipe Fernandes, and Philipp Urban. 2014. Color-image quality assessment: From prediction to optimization. IEEE Transactions on Image Processing 23, 3 (2014).

Ganesh Ramanarayanan, James Ferwerda, Bruce Walter, and Kavita Bala. 2007. Visual equivalence: Towards a new standard for image fidelity. In SIGGRAPH.

Michael Schwarz and Marc Stamminger. 2009. On predicting visual popping in dynamic scenes. In Applied Perception in Graphics and Visualization.

Steven M. Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. 2006. A comparison and evaluation of multi-view stereo reconstruction algorithms. In CVPR.

Qi Shan, Riley Adams, Brian Curless, Yasutaka Furukawa, and Steven M. Seitz. 2013. The visual Turing test for scene reconstruction. In 3DV.

Noah Snavely, Steven M. Seitz, and Richard Szeliski. 2006. Photo tourism: Exploring photo collections in 3D. In SIGGRAPH.

Christoph Strecha, Wolfgang von Hansen, Luc Van Gool, Pascal Fua, and Ulrich Thoennessen. 2008. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In CVPR.

Richard Szeliski. 1999. Prediction error as a quality metric for motion and stereo. In ICCV.

James Tompkin, Min H. Kim, Kwang In Kim, Jan Kautz, and Christian Theobalt. 2013. Preference and artifact analysis for video transitions of places. ACM Transactions on Applied Perception 10, 3 (2013).

Kathleen Tuite, Noah Snavely, Dun-yu Hsiao, Nadine Tabing, and Zoran Popovic. 2011. PhotoCity: Training experts at large-scale image acquisition through a competitive game. In SIGCHI.

Peter Vangorp, Gaurav Chaurasia, Pierre-Yves Laffont, Roland Fleming, and George Drettakis. 2011. Perception of visual artifacts in image-based rendering of facades. In Eurographics Symposium on Rendering.

Peter Vangorp, Christian Richardt, Emily A. Cooper, Gaurav Chaurasia, Martin S. Banks, and George Drettakis. 2013. Perception of perspective distortions in image-based rendering. In SIGGRAPH.

Kenneth Vanhoey, Basile Sauvage, Pierre Kraemer, Frederic Larue, and Jean-Michel Dischler. 2015. Simplification of meshes with digitized radiance. The Visual Computer 31 (2015).

Michael Waechter, Nils Moehrle, and Michael Goesele. 2014. Let there be color! Large-scale texturing of 3D reconstructions. In ECCV.

Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004).

Robert H. Webb, Diane E. Boyer, and Raymond M. Turner. 2010. Repeat photography. Island Press.

Yizhou Yu, Paul Debevec, Jitendra Malik, and Tim Hawkins. 1999. Inverse global illumination: Recovering reflectance models of real scenes from photographs. In SIGGRAPH.

Ramin Zabih and John Woodfill. 1994. Non-parametric local transforms for computing visual correspondence. In ECCV.

Matthias Zwicker, Hanspeter Pfister, Jeroen van Baar, and Markus Gross. 2001. Surface splatting. In SIGGRAPH.

Received November 2015; accepted September 2016
