
Temporal Coherence in Image-based Visual Hull Rendering

Stefan Hauswiesner, Student Member, IEEE, Matthias Straka, and Gerhard Reitmayr, Member, IEEE

Abstract—Image-based visual hull rendering is a method for generating depth maps of a desired viewpoint from a set of silhouette images captured by calibrated cameras. It does not compute a view-independent data representation, such as a voxel grid or a mesh, which makes it particularly efficient for dynamic scenes. When users are captured, the scene is usually dynamic, but does not change rapidly because people move smoothly within a sub-second time frame. Exploiting this temporal coherence to avoid redundant calculations is challenging because of the lack of an explicit data representation. This paper analyzes the image-based visual hull algorithm to find intermediate information that stays valid over time and is therefore worth making explicit. We then derive methods that exploit this information to improve the rendering performance. Our methods reduce the execution time by up to 25%. When the user’s motions are very slow, reductions of up to 50% are achieved.

Index Terms—mixed reality, image-based visual hull rendering, temporal coherence


1 Introduction

Real-time reconstruction and rendering of people captured by a set of cameras is an important step for a number of applications, for example, virtual try-on and telepresence systems. Shape-from-silhouette methods can provide both speed and sufficient reconstruction quality when a number of cameras, typically between four and ten, is used. The rendering performance of these methods is especially important, because users perceive latency to their own body motions as disturbing.

Image-based visual hull (IBVH) rendering is an efficient shape-from-silhouette method because it generates a depth map reconstruction from a desired viewpoint directly from a set of silhouette images. To produce the high resolutions that future applications will require, however, it is important to reduce the number of calculations that are performed for every output image.

People in front of a set of cameras usually move smoothly. As a result, the reconstructed and rendered output image does not change drastically over time. This is usually called temporal coherence and is exploited in many visual computing algorithms. Exploiting coherence for IBVH rendering, however, is challenging because the user is reconstructed every frame and can move and deform arbitrarily.

The contribution of this paper is the first attempt to utilize previous computation results to improve the performance of the IBVH algorithm even during user motion. To achieve this, we first analyzed the IBVH algorithm to find intermediate computation results that are coherent over time and verified their persistence by measurements. We found that visual hull points have an association to silhouette edges that persists over several frames. These associations are more stable than the surface point locations themselves. This knowledge can be used to save computations in regions with stable camera associations (see Section 4).

• The authors are with the Graz University of Technology. E-mail: hauswiesner—straka—[email protected]


However, with an increasing number of cameras, these regions become smaller. To maintain the benefit of temporal coherence, we make use of the observation that users usually do not move quickly in relation to the camera update rates. We analyzed the camera associations of ray-silhouette intersection intervals in the light of an assumed maximal user motion speed. Based on this constraint, we introduce a method to determine irrelevant and stable ray-silhouette intersections to reduce the amount of calculations over time (see Section 5).

Finally, we observed that the angle between the cameras and the desired viewing direction has a strong impact on the foreground/background decision during rendering. Based on this observation we derived a method that reduces the execution time of the IBVH algorithm (see Section 6). We evaluated the effectiveness of the suggested methods under varying degrees of coherence and different resolutions.

2 Related work

Extracting geometry from a set of calibrated camera images when a foreground segmentation is available is a fairly common technique in computer vision called shape-from-silhouette. When using this information for display, the process is often called visual hull rendering. Other reconstruction algorithms, for example, shape-from-stereo, are less suited for interactive reconstruction of high resolution depth maps from ten or more cameras.

2.1 Visual hull rendering

Most approaches are based on voxel carving. A recent voxel-based approach using CUDA on four PCs was shown [1]. Another implementation [2] uses shader programs. However, voxels can not efficiently provide higher output resolutions. Workarounds have been suggested, for example, using higher resolutions for the user’s head [3]. More efficient systems use shape-from-silhouette to reconstruct the surface geometry as meshes [4].


Fig. 1. (a) is a rendered illustration of the capturing room used in this work. (b) shows an image-based visual hull rendering of a user that is visualized using Phong shading and view-dependent texture mapping. (c) shows a rendering of a user moving her arms. Our motion detector indicates different motion magnitudes in shades of red.


When immediate rendering of novel viewpoints is required, the detour of explicit geometry can be avoided to reduce latency. To directly create novel views from silhouette images, the image-based visual hull (IBVH) [5] method was introduced. It involves on-the-fly ray-silhouette intersection and CSG operations to recover a depth map of the current view. Recent implementations of the IBVH algorithm utilize GPU hardware [6].

Other methods utilize the conventional graphics pipeline of modern GPUs. Usually, texture-mapped proxy geometry [7] or depth peeling [8] is used. These implementations are usually not optimal due to the mapping to the pipeline.

Cameras only provide color information of the captured scene, and therefore require a reconstruction algorithm to obtain 3D surfaces. With active depth sensors, a surface reconstruction is provided directly because the required sensing and processing is already performed by the device.

2.2 Active depth sensors

Time-of-flight depth sensors were used [9], [10] to capture and render scenes, but the devices were usually expensive. The Microsoft Kinect has profoundly changed the possibilities of sensing for games or virtual reality applications. This device delivers rich information on the scene structure and color without intensive processing at a competitive price.

The capabilities of the Microsoft Kinect have been explored for a variety of applications. Kinect-based body scanning [11] also enables virtual try-on applications at low costs. Newcombe et al. have shown with their work on KinectFusion [12] that dense volumetric reconstructions can be created in real time. To enable free viewpoint rendering of dynamic scenes it is necessary to combine multiple devices. For example, Wilson et al. [13] and Berger et al. [14] use up to four depth sensors to monitor a room. To avoid interferences between the sensors, both ensure that the infrared light patterns do not overlap.

The problem of interference has been alleviated by Maimone et al. [15] and Butler et al. [16] with a similar approach.

Letting the device vibrate blurs the light pattern for other, concurrently capturing sensors. The rigid connection of the vibrating sensor and the light pattern supports a clear reconstruction without interferences from other Kinects.

In contrast to 3D reconstruction from color cameras, active depth sensors can be more noisy and inaccurate at depth discontinuities [17]. Moreover, they contain background surfaces that should often be removed. Therefore, systems that combine the strengths of both approaches were proposed. The FreeCam system [18] combines color cameras and depth cameras in a system for free-viewpoint rendering. The depth hull rendering method [19] improves a reconstructed visual hull with depth information. A similar method was suggested in a multi-Kinect setup [17]. These systems show that visual hull rendering has not become obsolete due to recent sensor technology.

2.3 Reconstruction of human body models

Reconstruction algorithms that are specialized in human bodies usually adapt a template mesh to the camera images. At the same time, the user’s body pose needs to be determined.

Motion capture, or human pose tracking, is the task of determining the user’s body pose. It usually involves a pose and shape model that is adapted to the sensor data and therefore comprises a model-based tracking problem. We only consider marker-less pose tracking, because we assume that any sort of markers is too obtrusive for end users. Marker-less pose tracking and shape adaption is usually achieved by using multiple cameras [20]. These systems first adapt a template skeleton to the observed image data by adjusting pose parameters and comparing the rigged template mesh to the images. Then the mesh shape is adapted to the observation, which in turn yields a better skeleton pose. The process is repeated until convergence. Often, visual hulls are reconstructed from the camera images to guide the adaption. The shape and pose adaption is often formulated as an optimization problem to achieve robustness against noisy or wrong input data.

Shape reconstruction and motion capture can be used to create new animations of humans [21]. New animations can be embedded into the original video, which makes it possible to change the shape of humans in videos [22].


Image-based rendering can be used for realistic rendering of such sequences [23].

Most of the systems described above require a human template model, for example, the SCAPE model [24]. Such a shape prior limits the applicability of the system to humans.

2.4 Temporal coherence

Exploiting temporal coherence was shown to be a successful method for interactive raytracing. For example, the Tapestry system [25] builds a 3D mesh that acts as a cache for previous rendering results. The cache is updated incrementally and rendered instead of the scene. Adaptive frameless rendering [26] combines two efficient rendering techniques. Frameless rendering updates single pixels or patches instead of the whole image at each time step. Adaptive rendering guides this process to favor important pixels, for example, at depth discontinuities. The render cache [27] and the reverse reprojection cache [28] follow a similar notion. These systems are tailored to synthetic scenes and can not handle the 3D reconstruction of arbitrarily deforming objects.

For voxel-based visual hull reconstruction, temporal coherence has also been exploited to improve the performance [29]. It is the most similar method to ours and works by incrementally updating a voxel grid. For each new frame, only the voxels whose corresponding silhouette pixels have changed are updated. This approach is limited in resolution due to using a voxel representation, and slower than our IBVH method despite the incremental update scheme (see Section 8).

For image-based visual hull rendering, temporal coherence was only exploited for surface patches that remain static over time [30]. The desired viewpoint does not change very fast in a virtual try-on or telepresence system. Therefore, many of the output pixels of the previous frame can be reused, as long as the foreground object has not moved or deformed at these pixels. Object or user motion is also usually relatively slow, because it requires physical motion and is often restricted to parts of the user, like the arms or the head. An IBVH setup is very well suited for exploiting these types of coherence, because motion of the foreground object can be detected very efficiently: finding a bounding volume of changed parts is equivalent to computing a visual hull from differences in the silhouette images. Figure 1(c) illustrates the result of such a motion detector. Frame-to-frame forward image warping was used along with an aging mechanism as an efficient algorithm for reusing previous rendering results at surface patches that did not deform or move. Our system builds on the suggested motion detector.

2.5 Prerequisites

The multi-camera room that is used for this project consists of a 2 × 3 meter footprint cabin with green walls [31] (Figure 1(a)). Ten cameras are mounted on the walls: two at the back, two at the sides and six at the front. The cameras are synchronized and focused at the center of the cabin, where the user is allowed to move freely inside a certain volume. All cameras are calibrated intrinsically and extrinsically and connected to a single PC. The output device is a 42” TV that is mounted to the front wall in a portrait orientation.

[Figure 2 flowchart: Set up viewing ray → Project viewing ray to camera → Find ray-silhouette intersections → Append intersection intervals to list → All cameras processed? (No: next camera; Yes: continue) → Intersect intervals from list → Store result. Red extension boxes: Fetch persistent information, Decide which cameras to process, Compute and store persistent data.]

Fig. 2. The black boxes describe the process of image-based visual hull (IBVH) rendering for a single viewing ray. The red boxes indicate where extensions are required to exploit temporal coherence.


In such a scenario, silhouettes can be extracted from the camera images quickly and robustly by background subtraction. Silhouettes allow for efficient novel view synthesis. The image-based visual hull (IBVH) algorithm generates a depth map for arbitrary viewpoints that can be textured to produce the final output. Figure 2 gives an overview of the process. First, a viewing ray is set up for each pixel. The viewing ray is projected on each of the image planes of the cameras. There, the ray-silhouette intersection intervals are found and stored in a list. Finally, this list is intersected to obtain the intervals where the ray enters and exits the visual hull. The front-most entry point can be used to compute a depth value.
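The CSG step at the heart of this loop reduces to intersecting sorted interval lists along the viewing ray. The following is a minimal, self-contained sketch of that operation; the interval representation, the silhouette lookup that produces the per-camera lists, and the function names are illustrative, not the actual implementation. The full per-pixel loop folds this intersection over all cameras and, as discussed in Section 6, can stop as soon as the running result becomes empty.

```cpp
// Minimal sketch: CSG intersection of two sorted interval lists along a
// viewing ray, the core operation of the IBVH per-pixel loop. The
// silhouette lookup that produces the per-camera lists is omitted.
#include <vector>
#include <algorithm>
#include <cstdio>

struct Interval { float tEnter, tExit; };

std::vector<Interval> intersectIntervalLists(const std::vector<Interval>& a,
                                             const std::vector<Interval>& b)
{
    std::vector<Interval> out;
    size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        float enter = std::max(a[i].tEnter, b[j].tEnter);
        float exit  = std::min(a[i].tExit,  b[j].tExit);
        if (enter < exit)
            out.push_back({enter, exit});       // overlapping part survives
        // advance the interval that ends first
        if (a[i].tExit < b[j].tExit) ++i; else ++j;
    }
    return out;
}

int main() {
    // Ray-silhouette intervals from two hypothetical cameras.
    std::vector<Interval> cam0 = {{1.0f, 4.0f}, {6.0f, 9.0f}};
    std::vector<Interval> cam1 = {{2.0f, 7.0f}};
    std::vector<Interval> hull = intersectIntervalLists(cam0, cam1);
    if (!hull.empty())
        std::printf("front-most depth along ray: %.2f\n", hull.front().tEnter);
    else
        std::printf("background pixel\n");     // empty result: ray misses the hull
    return 0;
}
```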

While our experiments are focused on rendering humans, the suggested methods are not limited to humans. Every object that can be distinguished from the background is suitable for IBVH rendering. We do not employ a human shape prior, like a template mesh model. However, our temporal coherence methods assume a smooth motion in relation to the camera update rate.

The IBVH algorithm does not compute an explicit data representation, such as a voxel grid or a mesh. As a result, unnecessary memory traffic is avoided. The resolution of the computations fits the output resolution exactly: for every pixel a viewing ray is generated to compute a surface point. However, every frame is computed from scratch. To improve performance further, the temporal coherence in the input video streams can be utilized.


[Figure 3 plot: fraction (0 to 1, y-axis) over frame number (x-axis); series: Coherent Camera Associations, Coherent Locations.]

Fig. 3. Coherence analysis: the fraction of camera associations and surface point coordinates that stay valid between two consecutive frames are compared. Camera associations show more coherence.

3 Temporal coherence in IBVH rendering

Our previous work [30] was limited to static surface patches. Exploiting coherence for surface patches that move or deform over time is much harder. Due to the unpredictable deformation, previous depth maps can not be reused reliably. Figure 3 shows that a large fraction of visual hull surface points (computed from the depth maps) can become invalid between two consecutive frames. The IBVH algorithm computes the depth map directly from the silhouette images, which means that without modifications to the original approach there are no intermediate results that reliably stay valid over two consecutive frames.

The contribution of this paper is that we found such information and methods for determining its reliability. Figure 2 indicates how the conventional IBVH algorithm can be extended to exploit temporal coherence. First, the information that is persistent over consecutive frames is fetched. It helps to decide which cameras should be processed by the IBVH algorithm. Finally, the persistent data for the next frame needs to be extracted and stored.

The following sections measure the stability of camera associations over time and derive strategies for exploiting coherence from the measurements. First, we describe how surface points of a visual hull are associated with the cameras and show that this association is more stable than the surface points themselves. This knowledge can be used to save computations in regions with stable camera associations.

4 Camera association coherence

One important property of visual hulls is that every surface point projects to a silhouette edge in at least one of the camera images. Most surface points do not lie on visual hull ridges or junctions and therefore project to a silhouette edge in exactly one camera (Figure 4(a)). Therefore, most surface points can be computed by finding ray-silhouette intersections in only one instead of all camera images.

The association of a surface point with a single camera is not static over time. However, we found it to vary smoothly. To determine the amount of coherence, we therefore measured the percentage of surface points (equivalent to pixels for the IBVH algorithm) that do not change their association between consecutive frames.


Fig. 4. (a) shows how silhouette edges generate surface points. (b) illustrates the stability estimation used in our method: blocks with inhomogeneous camera associations are marked as unstable (red).

We evaluated this measure over a whole sequence of user body motions. Figure 3 shows the fraction of surface points that keep their camera association over a recorded sequence. It can be observed that most associations stay valid between consecutive frames. In contrast, the surface points’ location does not provide as much coherence.

To quickly compute a surface point from a coherent camera association it is enough to execute a subset of the original IBVH algorithm: the ray-silhouette intersection. It returns a list of intervals that describe where the viewing ray runs through the reconstructed object. The front-most interval is not necessarily the correct one. In complex cases where some of the intersection intervals would be carved by other cameras, it is necessary to know which of the intervals is correct. This means that the temporally persistent information of this method is a camera ID and an interval ID. In addition, we store a stability measure.

We therefore extended the conventional image generation algorithm by three stages (Figure 2). First, the two ID numbers and the stability value are fetched from a buffer for each pixel. The IDs can be used directly to decide which camera is processed. When a pixel is flagged as unstable, the conventional IBVH algorithm is executed by processing all cameras. After the surface point is computed, the associated IDs are stored as the new persistent information. Finally, the expected stability needs to be computed for the next frame. The stability computation is subject to a tradeoff.

4.1 Quality/speed tradeoff

While many surface points keep their camera and intersection association between frames, not all of them do. Especially at the borders between different camera associations, surface points change their sides frequently. To avoid artifacts in these regions it is therefore important to identify and mark them as unstable, meaning that they require a full IBVH computation that considers all cameras. Figure 4(b) shows an IBVH rendered image with augmented stability information. The width of the unstable area around patches with inhomogeneous camera associations determines the execution time and the result quality. Increasing the width causes more full IBVH computations and therefore increases quality at the cost of performance. The result of setting too small a width can be seen in Figure 5(a).
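As an illustration, the stability estimation for one 8 × 8 block can be sketched as a simple homogeneity test over the per-pixel camera associations. The buffer layout, the background marker and the block size constant are assumptions of this sketch; the actual implementation runs as part of the GPU pipeline (Section 7.1).

```cpp
// Sketch of the stability estimation from Figure 4(b): an 8x8 block of
// pixels is marked unstable when the camera associations inside it are not
// all identical (or when any pixel is background). Layout is illustrative.
#include <vector>
#include <cstdint>

constexpr int BLOCK = 8;

bool blockIsStable(const std::vector<int8_t>& cameraId,  // -1 = background
                   int width, int blockX, int blockY)
{
    int8_t first = cameraId[(blockY * BLOCK) * width + blockX * BLOCK];
    for (int y = 0; y < BLOCK; ++y)
        for (int x = 0; x < BLOCK; ++x) {
            int8_t id = cameraId[(blockY * BLOCK + y) * width + blockX * BLOCK + x];
            if (id < 0 || id != first)
                return false;   // inhomogeneous or background: unstable
        }
    return true;                // homogeneous block: safe to reuse the association
}
```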

5 User motion coherence

The camera association method assumes that associations vary slowly over time. This proved to be true for large surface chunks like the upper body of people. However, body parts like hands and legs usually move much faster and are relatively thin. When considering the maximum motion speed of, for example, hands, it becomes obvious that the association to a single camera can change quickly. As a result, body parts, like hands or feet, can be missing.

The camera association method covers these changes by a stability predictor that marks border regions in the association buffer as unstable. If we set the three dimensional extent of these border regions to the maximum motion distance that we expect in our system then most of the surface would be identified as being unstable. As a result, the method can be configured to favor either speed or quality, but does not achieve an optimal tradeoff between these two goals. The situation is better for a rather low number of cameras, but for quality reasons we want to use as many cameras as possible.

Because a single camera association per surface point is not stable enough to handle fast user motion and many cameras, we use a list of camera associations instead. In this list we keep all cameras that contribute important ray-object intersection intervals for a surface point. A list containing all cameras is maximally stable, because it allows the visual hull to be computed conventionally even without any temporal coherence. Removing cameras from the list improves the performance of subsequent frames. The output stays the same if the removed cameras do not contribute to the result.

To identify which cameras can be removed, we make assumptions about our application. It mainly focuses on reconstructing and rendering people. We can safely assume a certain maximum velocity at which the user moves his or her body (parts). Most likely the hands will move the fastest. We usually assume a maximum velocity of 135 centimeters per second, which was sufficient during our experiments. From this we can derive a maximum distance that the surface can move between consecutive frames. Cameras that only contribute ray-object intersection intervals with a greater distance to the surface can be removed from the list.
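A minimal sketch of this relevance test follows. It assumes the ray direction is normalized so that distances along the ray are metric, and that maxMotionDist is the per-frame motion bound derived from the velocity limit (for example, 135 cm/s at a hypothetical 30 Hz camera rate would give 4.5 cm per frame); the names and the exact criterion are illustrative, not the paper's actual code.

```cpp
// Sketch of the relevance test from Section 5: a camera stays in the
// per-pixel list only if one of its ray-silhouette intervals covers, or
// lies within the maximal expected motion distance of, the reconstructed
// surface point along the viewing ray.
#include <vector>

struct Interval { float tEnter, tExit; };

bool cameraStaysRelevant(const std::vector<Interval>& camIntervals,
                         float surfaceT, float maxMotionDist)
{
    for (const Interval& iv : camIntervals) {
        // distance along the (normalized) ray between interval and surface point
        float dist = 0.0f;
        if (surfaceT < iv.tEnter)      dist = iv.tEnter - surfaceT;
        else if (surfaceT > iv.tExit)  dist = surfaceT - iv.tExit;
        if (dist <= maxMotionDist)
            return true;   // interval touches the band the surface can reach
    }
    return false;          // camera only carves space the surface cannot reach
}
```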

Similar to the single camera association method, we extended the conventional image generation algorithm by three stages (Figure 2). First, for every pixel the camera list is fetched in a certain neighborhood, because camera associations should be distributed to all surface points inside the maximal motion range. When the new image is generated, only these cameras need to be processed. All other cameras can safely be skipped.


Fig. 5. (a) and (b) show the impact of too small an instability width for camera association coherence (a), and too small a maximally expected motion for user motion coherence (b). Red areas augment regions with artifacts. In (c) two cameras (red and blue) with their frustums and a viewing ray (black arrow) are illustrated. Camera 1 (red) covers a larger interval on the viewing ray and therefore can carve more empty space than camera 0. Camera 1 is therefore more useful for skipping background pixels. a_i can be used to sort cameras accordingly.

During camera processing, we keep track of all cameras which generate intervals that lie within the maximal motion range around the surface point. After the surface point is computed, the new camera list is stored. As long as the expected range is sufficient, the resulting algorithm computes the exact same result as the conventional algorithm, but faster.

Compared to the single camera association method, we get less benefit per surface patch because the processing is reduced to a number of cameras instead of just one. However, many more surface patches can be reconstructed with a reduced workload.

5.1 Quality/speed tradeoff

The maximum motion that we expect from the user has a strong impact on both the visual hull quality and the execution time. Usually, it is estimated conservatively, which guarantees that the result will be the same as with conventional IBVH rendering. However, the maximum motion limit can be lowered to reduce execution times when certain errors in the resulting visual hull are tolerable during fast user motions. Figure 5(b) illustrates what such artifacts look like. In the evaluation section we provide measurements for a range of motion limits to illustrate the behavior of the algorithm.

Usually, different body parts and thus image regions are subject to different maximum velocities. For example, it is hard to move the head as quickly as the hands. Moreover, the motion velocity can vary over time as the user performs different actions. Therefore, the performance of exploiting temporal coherence can often be improved by applying different motion limits across space and time. To achieve this it is necessary to estimate the user’s motion magnitude before reconstructing and rendering his body. We employ a motion detector [30] (Figure 1(c)) which was previously used to identify surface patches suitable for image warping. It computes an approximate probability that a surface patch has moved between two consecutive camera image sets. Patches that move quickly are more likely to be detected and achieve higher scores, which allows us to estimate the velocity from this score. In our temporal coherence algorithm we adapt the maximally expected user motion according to it.

6 Improved camera order

The IBVH algorithm computes every pixel’s depth value by iterating over all silhouette images. Previous IBVH implementations process the cameras in an arbitrary order, because the result is independent of it. However, the execution time is not.

To detect background pixels quickly, the interval finding and intersection steps should be interleaved. Once the intersection operation results in an empty set, processing can be stopped and the pixel can be marked as background. As was shown in previous work [30], this early termination strategy can speed up the algorithm considerably. To improve the performance even further, the order in which cameras are processed should facilitate early termination.

When cameras that cover large intervals of a viewing ray are processed first then empty intersection sets are found earlier. We found that the optical axes of the cameras can be used to sort them. Figure 5(c) illustrates how the optical axis of a camera relates to the size of the interval it covers: cameras with steep angles relative to the current viewing direction tend to cover larger intervals. Therefore, to quickly approximate the percentage of average ray coverage of a camera i, we use the metric a_i.

a_i = |axis_i · axis_view|

It is the absolute value of the cosine of the angle between the normalized camera’s optical axis axis_i and the normalized current viewing direction axis_view. An arccos is not required because it does not change the order of the values. The absolute value of the dot product computes the shortest arc between the axes.
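A small sketch of this ordering step follows, with illustrative vector and camera types; the optical axes and the viewing direction are assumed to be normalized.

```cpp
// Sketch of the camera ordering by a_i = |axis_i . axis_view| (Section 6).
// Cameras are sorted by descending a_i so that cameras covering large ray
// intervals are processed first, which favors early termination on
// background pixels. Types and names are illustrative.
#include <vector>
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

static float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

struct Camera { int id; Vec3 opticalAxis; };   // axis assumed normalized

void sortCamerasForEarlyTermination(std::vector<Camera>& cams,
                                    const Vec3& viewAxis)   // normalized
{
    std::sort(cams.begin(), cams.end(),
              [&](const Camera& a, const Camera& b) {
                  return std::fabs(dot(a.opticalAxis, viewAxis)) >
                         std::fabs(dot(b.opticalAxis, viewAxis));
              });
}
```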

Figure 6 shows how a_i relates to the relative number of surface points that are associated to a camera i. From the measurements we can conclude that most surface points are associated to cameras with low to medium a_i. From this we can derive that cameras with low a_i are more likely to produce a surface fragment. Conversely, the most space along the viewing rays is carved away by cameras with relatively high a_i. This means that cameras with high a_i are likely to be more powerful at identifying background pixels.

All our methods therefore sort the cameras according to their a_i in descending order before rendering. The average performance gain over several tested camera orders reached up to 12%, depending on the amount of background surrounding the visual hull. The suggested sorting method does not exploit temporal coherence like the methods described above. It is an improvement based on statistical data instead.

[Figure 6 plot: percentage (y-axis, 0 to 80) over the angular metric a_i (x-axis, 0 to 1); series: Ray Length Carved, Surface Fragments Associated.]

Fig. 6. The plot shows statistical data that was captured over a user motion sequence with a rotating viewpoint. The x-axis denotes a_i, an angular metric computed from a camera’s optical axis and the viewing direction. The y-axis indicates the average percentages that are carved away from the viewing rays and the relative number of surface point associations.

7 Implementation

The IBVH pipeline is implemented as CUDA kernels. It starts by uploading the camera images to the GPU, performs background segmentation and compensates for radial distortion. Then, an angular cache [30] is built from the silhouette edges to improve the performance of the subsequent IBVH step. This work focuses on the IBVH step, which we extended by three stages (see Figure 2): fetching the persistent information, deciding how to compute the pixel and storing the results. In addition, background pixels that are outside the scene bounding box or too far from previous surface points are excluded from further processing.

All coherence methods use the CUDA warp voting functionality to find decisions that avoid diverging branches. Therefore, all decisions are made conservatively: if one thread in a warp requires full reconstruction then all receive full reconstruction. Shared memory is used when data needs to be available to all threads in a block for decision making.
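A minimal sketch of such a conservative warp decision using the CUDA warp vote intrinsic; the per-thread predicate stands for whatever per-pixel stability test a method uses, and a full 32-lane warp is assumed.

```cuda
// Sketch of the conservative warp decision: a pixel falls back to the full
// IBVH reconstruction if any thread in its warp requires it, so the warp
// never diverges between the fast and the full code path.
__device__ bool warpNeedsFullReconstruction(bool threadNeedsFull)
{
    // CUDA warp vote: true for every lane if the predicate is true for any lane.
    return __any_sync(0xffffffffu, threadNeedsFull ? 1 : 0) != 0;
}
```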

7.1 Camera association coherence

The persistent information in this approach is a camera ID and an interval ID that are associated to the surface point at each pixel. The stability buffer only stores one value per block of 8 × 8 pixels to facilitate divergence free execution. Rendering with camera association coherence is a three-fold process. First, the modified IBVH rendering algorithm checks the stability buffer at each pixel. To compensate for viewpoint changes between two consecutive frames, the read location is transformed to the image plane of the previous frame. This process is similar to image warping and therefore might suffer from occlusions. In such cases, the lowest stability is assumed.

Second, the algorithm decides whether a patch of pixels should be reconstructed fully using the conventional IBVH algorithm or if it is safe to reuse information from the last frame. In the latter case, the ray-surface intersection is computed by only looking up the camera that each surface point is associated to. Otherwise the conventional algorithm is run. The new surface point location and its camera association are written to the result buffers.


[Figure 7 plots. (a) Execution time in ms of Voxel Carving, Raycasting and IBVH at 270k, 117k and 43k surface points. (b) Relative execution time of IBVH Conventional, Improved Cam Order, User Motion (Fixed) and User Motion (Detected) over the whole sequence and over subsets with weak and strong coherence. (c) Execution time in ms over surface point resolution (up to one million) for Conventional IBVH, Improved Cam Order, Only Static Patches, Camera Association and User Motion.]

Fig. 7. Evaluation of the suggested methods. (a) compares IBVH rendering with (incremental) voxel carving at three different image and voxel resolutions. (b) shows the performance of methods with no quality loss. (c) shows performance measurements for an increasing resolution.


Third, the persistent camera association buffer is checked for surface patches containing visual hull ridges, i.e. inhomogeneous blocks. These blocks are marked as unstable for future coherence decisions. Nearby blocks are also marked unstable. The object space range of this distribution is fixed before runtime. When transformed to image space, the range depends on the resolution and zoom level. It can be used to control the quality/speed ratio of the algorithm (see Section 8). In addition to inhomogeneous blocks, the background is also marked as being unstable to allow the visual hull to move into previously unoccupied space.
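For illustration, the previous-frame lookup used in the first stage can be sketched as a reprojection of the pixel's surface point into the previous view; the matrix layout, the names and the treatment of out-of-view points are assumptions of this sketch, not the actual implementation.

```cpp
// Sketch of the coherence lookup in Section 7.1: the pixel's 3D surface
// point from the previous frame is reprojected into the previous view to
// find where the persistent camera/interval IDs and the stability value
// were stored.
struct Vec3 { float x, y, z; };
struct Mat4 { float m[16]; };   // row-major 4x4 view-projection matrix

// Returns true and the previous-frame pixel coordinates of a 3D point, or
// false if the point falls behind the camera or outside the image (treated
// as lowest stability, i.e. full reconstruction).
bool reprojectToPreviousFrame(const Vec3& p, const Mat4& prevViewProj,
                              int width, int height, int& px, int& py)
{
    const float* M = prevViewProj.m;
    float cx = M[0]  * p.x + M[1]  * p.y + M[2]  * p.z + M[3];
    float cy = M[4]  * p.x + M[5]  * p.y + M[6]  * p.z + M[7];
    float cw = M[12] * p.x + M[13] * p.y + M[14] * p.z + M[15];
    if (cw <= 0.0f)
        return false;
    // Normalized device coordinates mapped to pixel coordinates.
    float u = (cx / cw * 0.5f + 0.5f) * static_cast<float>(width);
    float v = (cy / cw * 0.5f + 0.5f) * static_cast<float>(height);
    if (u < 0.0f || u >= static_cast<float>(width) ||
        v < 0.0f || v >= static_cast<float>(height))
        return false;
    px = static_cast<int>(u);
    py = static_cast<int>(v);
    return true;
}
```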

7.2 User motion coherence

This method also modifies the conventional IBVH algorithm at three points. First, the persistent information in the form of a camera list is fetched per pixel. The list describes which cameras should be traversed. We store the cameras’ active/discarded status in a bit array per 8 × 8 pixel thread block. This ensures coherent warp decisions in the next frame. The bit mask also covers background regions and therefore helps to skip empty blocks quickly. The camera IDs are aggregated by a logical OR operation in an image space neighborhood that corresponds to the maximal expected user motion distance. This distance is defined in object space and therefore needs to be transformed to image space to account for varying zoom levels and resolutions. To compensate for viewpoint changes between two consecutive frames, the read location for the camera ID lookup is transformed by image warping. In the case of occlusions, all camera IDs are reported to trigger a full reconstruction.

The current implementation utilizes the a_i metric to sort the cameras before processing. For pixels that showed relatively thick parts of the visual hull and that are not close to the background we assume that they will contain foreground in the next frame too. There we sort the cameras with low a_i first. This way, the resulting surface is found quicker (Figure 6, red dots) and the remaining cameras can be analyzed for being necessary in the future. For pixels that are likely to show background, we sort the cameras with high a_i first. This helps to discard them quicker (Figure 6, blue dots).

During IBVH processing the cameras in the list are traversed in the specified order. The algorithm keeps track of the ray-silhouette intersection intervals and their camera associations. Cameras that are not necessary to find the surface intersection along a viewing ray are discarded for the next frame. A camera may be unnecessary for a ray because its intervals do not cover the surface point and are far away from it. For example, a camera is unnecessary if it only carves space that is more than the maximal motion distance in front of or behind the visual hull. Another reason can be that the ray does not intersect the camera’s viewing volume.

Finally, the new camera list is aggregated per block of pixels by an atomic OR operation and stored in the buffer. Before the user enters the scene, all lists are empty. Therefore, when the user enters the scene, the algorithm would not notice. To bootstrap the process, we activate random camera IDs in random blocks every other frame.
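A minimal CUDA sketch of this per-block aggregation, assuming one 32-bit mask per pixel with bit c set when camera c is still needed for that viewing ray; buffer names and layout are illustrative, and the kernel is meant to be launched with 8 × 8 thread blocks.

```cuda
// Sketch of the per-block camera list aggregation: each 8x8 thread block
// reduces its pixels' camera masks into one 32-bit mask (ten cameras fit
// easily) with a shared-memory atomic OR, then stores it for the next frame.
__global__ void aggregateCameraLists(const unsigned int* perPixelCameraMask,
                                     unsigned int* perBlockCameraMask,
                                     int imageWidth, int imageHeight)
{
    __shared__ unsigned int blockMask;
    if (threadIdx.x == 0 && threadIdx.y == 0)
        blockMask = 0u;
    __syncthreads();

    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px < imageWidth && py < imageHeight)
        atomicOr(&blockMask, perPixelCameraMask[py * imageWidth + px]);
    __syncthreads();

    if (threadIdx.x == 0 && threadIdx.y == 0)
        perBlockCameraMask[blockIdx.y * gridDim.x + blockIdx.x] = blockMask;
}
```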

8 Evaluation

We evaluated our methods by measuring execution times and relating them to conventional IBVH rendering and voxel-based visual hull reconstruction. We measured how our methods scale with different resolutions and how their result quality degrades with different parameter settings. The test system is equipped with an Nvidia Quadro 6000 GPU and executes all tests on previously recorded video streams instead of live camera data to generate reproducible results.

8.1 IBVH rendering vs. voxel carving

As a first test, we compare IBVH rendering to voxel carving. Voxel carving is a very popular method for shape-from-silhouette reconstruction. It consists of two stages: first, each voxel is projected into all camera images and labeled as outside if it projects to the background in any of the images. Second, the voxel grid needs to be rendered. We use raycasting in our tests because extracting a mesh from a frequently updated voxel grid is less efficient.
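For reference, the carving stage of this baseline can be sketched as follows; the projection matrix layout and the silhouette mask format are assumptions, and voxels projecting outside an image are treated as outside here, which is one of several possible conventions.

```cpp
// Sketch of the baseline voxel carving stage: every voxel center is
// projected into each camera image and the voxel is labeled outside as soon
// as it falls on background in any image.
#include <vector>
#include <cstdint>

struct Mat34 { float m[12]; };          // row-major 3x4 projection matrix

struct SilhouetteImage {
    int width, height;
    std::vector<uint8_t> mask;          // 1 = foreground, 0 = background
    Mat34 proj;                         // world -> pixel projection
};

bool voxelInsideHull(float x, float y, float z,
                     const std::vector<SilhouetteImage>& cams)
{
    for (const SilhouetteImage& c : cams) {
        const float* P = c.proj.m;
        float u = P[0]*x + P[1]*y + P[2]*z  + P[3];
        float v = P[4]*x + P[5]*y + P[6]*z  + P[7];
        float w = P[8]*x + P[9]*y + P[10]*z + P[11];
        if (w <= 0.0f) return false;                 // behind the camera
        int px = static_cast<int>(u / w);
        int py = static_cast<int>(v / w);
        if (px < 0 || px >= c.width || py < 0 || py >= c.height)
            return false;                            // outside the image
        if (c.mask[py * c.width + px] == 0)
            return false;                            // background: carve voxel
    }
    return true;                                     // survives all cameras
}
```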

For the tests we selected three different image and voxel resolutions. The voxel resolutions are selected such that the visible voxels project to an equal number of pixels. Figure 7(a) shows how voxel carving and rendering compares to the IBVH approach: the voxel carving method is only competitive for small resolutions.


[Figure 8 scatter plots: execution time in ms (y-axis) over quality error (x-axis) for Only Static Patches, Camera Association, User Motion and IBVH Conventional; (a) the whole sequence, (b) frame subsets with strong and weak coherence.]

Fig. 8. Quality versus execution time evaluation of the suggested methods. (a) shows a sequence with various user motions. (b) shows subsets of frames with little coherence and much coherence. Quality is measured in millimeters deviation from the ground truth (IBVH Conventional). Results toward the lower left corner are therefore better.

Voxel resolutions were 700 × 350 × 1400, 450 × 225 × 900 and 250 × 125 × 500 for the three test runs.

Voxel carving can be improved by exploiting temporal coherence. This approach is known as incremental visual hull reconstruction [29]. It works by casting rays through the volume at silhouette pixel locations that have changed between two consecutive frames. If a pixel becomes activated, the voxel grid along the according viewing ray is updated. When a pixel gets removed, the voxel grid along the viewing ray gets carved. It is possible to subsample the silhouette images prior to carving to gain performance. We implemented it on our CUDA-based platform and compared our methods to it. Figure 7(a) illustrates that voxel carving, even when executed incrementally and subsampled to a quarter of the original resolution, does not match the IBVH rendering performance. We even observed that incremental updating can have a severe performance impact. We assume that the reason for this is the scattered writing pattern in the voxel grid that is not efficient on the CUDA platform. This becomes especially apparent at higher resolutions. The suggested subsampling scheme [29] alleviates the problem but degrades the result quality.

8.2 Methods without quality loss

In our second test, we compare execution times to the conventional IBVH algorithm. We use only methods that do not degrade the visual hull quality in this test run: the improved camera order and the user motion coherence approach with a sufficient maximal motion range. The camera association coherence method is less optimal in terms of the quality/speed tradeoff and could not be configured to yield a speed-up when loss-less reconstruction is required. Figure 7(b) shows the results for a recorded sequence of various user motions. Averages are built over the whole sequence, over rather static frames with much coherence and over dynamic frames with much user motion. The improved camera sorting by a_i reduces execution times by around 7%. Considering the maximally expected user motion (135 cm/s in this case), the run times can be decreased by another 20%. When estimating the maximum motion distance by utilizing our motion detector, we could decrease run times by up to another 5%. For frames with much user motion, however, the performance gains do not cover the additional computation time of the motion detector.

8.3 Scaling with resolution

The execution times of the IBVH algorithm and our temporal coherence extensions scale strongly with the output resolution [32], because for every pixel a surface point is computed. To draw conclusions from our evaluation runs it is therefore important to check whether the relative performance gains are representative for a larger resolution range. Figure 7(c) shows the execution time of all our temporal coherence methods for a wide resolution range. Execution times scale sublinearly with the number of pixels and maintain their relative order. It can be observed that the relative performance gains from the coherence methods and previous work (Only Static Patches) increase until around 700 × 700 surface points, stay constant and then slightly decrease at roughly one million surface points resolution.

8.4 Quality/speed tradeoff

To evaluate temporal coherence methods it is not enough to measure execution times alone. It is also important to see how an algorithm performs when the coherence assumptions are broken. This is important when either the user’s motion speed was underestimated or the parameters were set to intentionally favor speed over full quality.

We therefore decided to measure the execution times of the suggested methods and relate them to the output quality. To do so, we render the same video frame with and without exploiting coherence and measure the differences. We do this by finding pairs of closest surface points between ground truth and temporal coherence output and compute the average distance in millimeters. Please note that even small values have a strong impact on the visual result: average errors above 1 millimeter can amount to missing arms when moved quickly.
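A simple sketch of this error measure, computing a one-sided variant from the temporal coherence output to the ground truth by brute force; point formats and units are illustrative, and the actual evaluation matches closest point pairs between both outputs.

```cpp
// Sketch of the quality metric in Section 8.4: for each surface point of
// the temporal-coherence output, find the closest ground-truth point and
// average the distances (in millimeters). Brute force for clarity.
#include <vector>
#include <cmath>
#include <limits>
#include <algorithm>

struct Point3 { float x, y, z; };   // in millimeters

float averageSurfaceError(const std::vector<Point3>& groundTruth,
                          const std::vector<Point3>& output)
{
    if (output.empty() || groundTruth.empty()) return 0.0f;
    double sum = 0.0;
    for (const Point3& p : output) {
        float best = std::numeric_limits<float>::max();
        for (const Point3& q : groundTruth) {
            float dx = p.x - q.x, dy = p.y - q.y, dz = p.z - q.z;
            best = std::min(best, dx*dx + dy*dy + dz*dz);
        }
        sum += std::sqrt(best);                 // distance to closest GT point
    }
    return static_cast<float>(sum / output.size());
}
```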


Fig. 9. This figure shows output images with relatively bad quality, augmented with the difference to the reference image in red. It contains our previous approach for static visual hull patches (left) and the new temporal coherence methods. The relative performance improvement over conventional IBVH rendering is shown and error measurements are given.

Figure 8 shows quality and speed measurements taken during an evaluation sequence of various user motions. The frames were grouped into all frames (Figure 8(a)), mostly dynamic frames and mostly static frames (Figure 8(b)). In the dynamic frames, users entered the scene or moved their whole bodies. The static frames usually show users standing still and performing no intended motions. In this quality vs. speed scatter plot, measurements in the lower left region have a good quality/speed tradeoff. The conventional IBVH algorithm is used as the ground truth and therefore has 0 mm error. The temporal coherence methods in this plot use the improved camera order and are executed for a wide range of quality/speed settings. It can be observed that the methods suggested in this paper outperform the conventional IBVH algorithm easily without quality loss (also see Figure 7(b)) and also perform better than the previously used coherence method that only works on static surface patches. Figure 9 shows output images with a relatively high error level to illustrate the error metric. It can be observed that the camera association method is more prone to artifacts on the surface, while the user motion coherence method begins to fail at the borders when, for example, limbs move too quickly.

9 Conclusions

This paper is the first attempt to exploit temporal coherence in image-based visual hull rendering even during user motion. We successfully showed that the surface point camera association has a relatively high stability over time and can be used to improve the performance of the original algorithm. We described how to set an upper limit to the expected user’s body motions and how such a limit can be used to reduce the number of processed silhouette images. By using our motion detector, the motion limit can even be estimated online. Furthermore, we introduced the a_i metric that is computed from the angle between a camera’s optical axis and the viewing direction as a reliable statistic for skipping empty space and thus unnecessary computations. We evaluated all suggested methods in terms of the achievable speed-up and illustrated the quality/speed tradeoff for a wide range of parameters. Our methods reduce the execution time by up to 50% when sub-millimeter deviations in the visual hull are tolerable, which is usually the case for rendering applications.

References

[1] A. Ladikos, S. Benhimane, and N. Navab, “Efficient visual hull computation for real-time 3D reconstruction using CUDA,” Computer Vision and Pattern Recognition Workshop, vol. 0, pp. 1–8, 2008.

[2] C. Nitschke, A. Nakazawa, and H. Takemura, “Real-time space carving using graphics hardware,” IEICE - Trans. Inf. Syst., vol. E90-D, no. 8, pp. 1175–1184, 2007.

[3] D. Knoblauch and F. Kuester, “Region-of-interest volumetric visual hull refinement,” in Proceedings of the 17th ACM Symposium on Virtual Reality Software and Technology, ser. VRST ’10. New York, NY, USA: ACM, 2010, pp. 143–150.

[4] J.-S. Franco and E. Boyer, “Efficient polyhedral modeling from silhouettes,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 3, pp. 414–427, March 2009.

[5] W. Matusik, C. Buehler, R. Raskar, S. J. Gortler, and L. McMillan, “Image-based visual hulls,” in SIGGRAPH ’00 proceedings. New York, NY, USA: ACM Press/Addison-Wesley, 2000, pp. 369–374.

[6] W. Waizenegger, I. Feldmann, P. Eisert, and P. Kauff, “Parallel high resolution real-time visual hull on GPU,” in ICIP’09: Proceedings of the 16th IEEE international conference on Image processing. Piscataway, NJ, USA: IEEE Press, 2009, pp. 4245–4248.

[7] C. Lee, J. Cho, and K. Oh, “Hardware-accelerated jaggy-free visual hulls with silhouette maps,” in VRST ’06: Proceedings of the ACM symposium on Virtual reality software and technology. New York, NY, USA: ACM, 2006, pp. 87–90.

[8] M. Li, “Towards real-time novel view synthesis using visual hulls,” Ph.D. dissertation, Universität des Saarlandes, 2004.

[9] Y. M. Kim, D. Chan, C. Theobalt, and S. Thrun, “Design and calibration of a multi-view TOF sensor fusion system,” in Proc. IEEE CVPR Workshops, June 2008, pp. 1–7.

[10] L. Guan, J.-S. Franco, and M. Pollefeys, “3D Object Reconstruction with Heterogeneous Sensor Data,” in Proc. 3DPVT, 2008. [Online]. Available: http://hal.inria.fr/inria-00349099

[11] J. Tong, J. Zhou, L. Liu, Z. Pan, and H. Yan, “Scanning 3D full human bodies using Kinects,” IEEE Transactions on Visualization and Computer Graphics (Proceedings of IEEE Virtual Reality), 2012.

[12] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon, “KinectFusion: Real-time dense surface mapping and tracking,” in Proc. IEEE ISMAR ’11, 2011, pp. 127–136. [Online]. Available: http://dx.doi.org/10.1109/ISMAR.2011.6092378

[13] A. D. Wilson and H. Benko, “Combining multiple depth cameras and projectors for interactions on, above and between surfaces,” in Proc. ACM UIST ’10, 2010, pp. 273–282. [Online]. Available: http://doi.acm.org/10.1145/1866029.1866073

[14] K. Berger, K. Ruhl, C. Brummer, Y. Schroder, A. Scholz, and M. Magnor, “Markerless motion capture using multiple color-depth sensors,” in Proc. VMV 2011, Oct. 2011, pp. 317–324.

[15] A. Maimone and H. Fuchs, “Reducing interference between multiple structured light depth sensors using motion,” in Proc. IEEE VR, 2012, pp. 51–54. [Online]. Available: http://dblp.uni-trier.de/db/conf/vr/vr2012.html#MaimoneF12

[16] A. Butler, S. Izadi, O. Hilliges, D. Molyneaux, S. Hodges, and D. Kim, “Shake’n’sense: reducing interference for overlapping structured light depth cameras,” in Proc. CHI ’12, 2012, pp. 1933–1936.

[17] B. Kainz, S. Hauswiesner, G. Reitmayr, M. Steinberger, R. Grasset, L. Gruber, E. Veas, D. Kalkofen, H. Seichter, and D. Schmalstieg, “OmniKinect: Real-time dense volumetric data acquisition and applications,” in Symposium on Virtual Reality Software and Technology (VRST), 2012.

[18] C. Kuster, T. Popa, C. Zach, C. Gotsman, and M. Gross, “FreeCam: A hybrid camera system for interactive free-viewpoint video,” in Proc. VMV, 2011.

[19] A. Bogomjakov, C. Gotsman, and M. Magnor, “Free-viewpoint video from depth cameras,” Proc. Vision, Modeling, and Visualization (VMV’06), Aachen, Germany, pp. 89–96, Nov. 2006.

[20] J. Gall, C. Stoll, E. de Aguiar, C. Theobalt, B. Rosenhahn, and H.-P. Seidel, “Motion capture using joint skeleton tracking and surface estimation,” in CVPR, 2009.

[21] J. Starck, G. Miller, and A. Hilton, “Video-based character animation,” in Proceedings of the 2005 ACM SIGGRAPH/Eurographics symposium on Computer animation, ser. SCA ’05. New York, NY, USA: ACM, 2005, pp. 49–58. [Online]. Available: http://doi.acm.org/10.1145/1073368.1073375

[22] A. Jain, T. Thormahlen, H.-P. Seidel, and C. Theobalt, “MovieReshape: Tracking and reshaping of humans in videos,” ACM Trans. Graph. (Proc. SIGGRAPH Asia 2010), vol. 29, no. 5, 2010.

[23] F. Xu, Y. Liu, C. Stoll, J. Tompkin, G. Bharaj, Q. Dai, H.-P. Seidel, J. Kautz, and C. Theobalt, “Video-based characters: creating new human performances from a multi-view video database,” in ACM SIGGRAPH 2011 papers, ser. SIGGRAPH ’11. New York, NY, USA: ACM, 2011, pp. 32:1–32:10. [Online]. Available: http://doi.acm.org/10.1145/1964921.1964927

[24] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, “SCAPE: shape completion and animation of people,” ACM Trans. Graph., vol. 24, no. 3, pp. 408–416, 2005.

[25] M. Simmons and C. H. Sequin, “Tapestry: A dynamic mesh-based display representation for interactive rendering,” in Proceedings of the Eurographics Workshop on Rendering Techniques 2000. London, UK: Springer-Verlag, 2000, pp. 329–340.

[26] A. Dayal, C. Woolley, B. Watson, and D. Luebke, “Adaptive frameless rendering,” in ACM SIGGRAPH 2005 Courses, ser. SIGGRAPH ’05. New York, NY, USA: ACM, 2005.

[27] B. Walter, G. Drettakis, and D. P. Greenberg, “Enhancing and optimizing the render cache,” in Proceedings of the 13th Eurographics workshop on Rendering, ser. EGRW ’02. Aire-la-Ville, Switzerland: Eurographics Association, 2002, pp. 37–42.

[28] D. Nehab, P. V. Sander, J. Lawrence, N. Tatarchuk, and J. R. Isidoro, “Accelerating real-time shading with reverse reprojection caching,” in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS symposium on Graphics hardware, ser. GH ’07. Aire-la-Ville, Switzerland: Eurographics Association, 2007, pp. 25–35.

[29] A. Bigdelou, A. Ladikos, and N. Navab, “Incremental visual hull reconstruction,” in BMVC’09, 2009.

[30] S. Hauswiesner, M. Straka, and G. Reitmayr, “Coherent image-based rendering of real-world objects,” in Symposium on Interactive 3D Graphics and Games. New York, USA: ACM, 2011, pp. 183–190.

[31] M. Straka, S. Hauswiesner, M. Ruether, and H. Bischof, “A free-viewpoint virtual mirror with marker-less user interaction,” in Proc. of the 17th Scandinavian Conference on Image Analysis (SCIA), 2011.

[32] S. Hauswiesner, R. Khlebnikov, M. Steinberger, M. Straka, and D. Schmalstieg, “Multi-GPU image-based visual hull rendering,” in Eurographics Symposium on Parallel Graphics and Visualization (EGPGV), 2012.

Stefan Hauswiesner received his master’s degree in 2009 and his PhD in 2013 from the Graz University of Technology. He is working at the Institute for Computer Graphics and Vision, Graz University of Technology as a researcher and teaching assistant. His research interests include image-based rendering and image processing in the context of mixed reality applications.

Matthias Straka received his bachelor’s degree in 2007 and his master’s degree in 2009 from the Graz University of Technology. He is currently working towards his PhD degree at the Institute for Computer Graphics and Vision, Graz University of Technology. His research interests include interactive 3D human shape and pose estimation from multi-view images with focus on applications for virtual dressing rooms and visual body measurements.

Gerhard Reitmayr is professor for Augmented Reality at the Graz University of Technology. He received his Dipl.-Ing. (2000) and Dr. techn. (2004) degrees from Vienna University of Technology. He worked as a research associate at the Department of Engineering at the University of Cambridge, UK until May 2009, where he was researcher and principal investigator. Research interests include the development of augmented reality user interfaces, wearable computing, ubiquitous computing environments and the integration of these. Research directions include computer vision techniques for localisation and tracking and interaction methods.