

Page 1: Accurate Full Body Scanning from a Single Fixed 3D Camera (3DimPVT2012.pdf)

Accurate Full Body Scanning from a Single Fixed 3D Camera

Ruizhe Wang, Jongmoo Choi, Gerard Medioni
Computer Vision Lab, Institute for Robotics and Intelligent Systems
University of Southern California
{ruizhewa, jongmooc, medioni}@usc.edu

Abstract

3D body modeling has long been studied in computer vision and computer graphics. While several solutions have been proposed using either multiple sensors or a moving sensor, we propose here an approach in which the user turns, in a natural motion, in front of a fixed low-cost 3D camera. This opens the door to a wide range of applications where scanning is performed at home. Our scanning system is easy to set up and the instructions are straightforward to follow. We propose an articulated, part-based cylindrical representation for the body model, and show that an accurate 3D shape can be automatically estimated from 4 key views detected in a depth video sequence. The registration between the 4 key views is performed in a top-bottom-top manner which fully considers the kinematic constraints. We validate our approach on a large number of users, and compare accuracy to that of a reference laser scan. We show that even with a simplified model (5 cylinders) an average error of 5 mm can be consistently achieved.

1. Introduction

3D body modeling is of interest to both computer vision and computer graphics. A precise 3D body model is necessary in many applications, such as animation, virtual reality, and human-computer interaction. However, obtaining such an accurate model is not an easy task. Early systems are based on either laser scanning or structured light. While these systems can provide very accurate models, they are expensive.

Image-based approaches play an important role in body modeling. Shape-from-silhouette (SFS) methods [13, 14] can give very good results given many synchronized cameras. The advantage is that only silhouette information is used, hence no extra attention needs to be paid to textures. The disadvantage, on the other hand, is also obvious: accuracy relies heavily on the number of synchronized cameras, which restricts its application in real life.

Figure 1: (a) System setup (b) Articulated part-based cylindrical representation of the human body (c) Tree representation of the human body

The advent of a new type of range sensor, e.g. the Microsoft Kinect [12], has drawn significant attention in computer graphics and computer vision. Several methods and two commercial systems (i.e. Styku [22] and Bodymetrics [2]) based on the depth camera have been proposed for body modeling. These methods can be categorized into the following scenarios: 1) multiple fixed sensors with a static person; 2) multiple fixed sensors with a moving person; 3) a single moving sensor with a static person; 4) a single fixed sensor with a moving person. Our main focus in this paper is to establish a convenient home body modeling system so that a naive user can easily scan his/her body by him/herself. This requirement leaves the 4th option as the only viable one. Therefore we try to address 4) in this



Figure 2: General pipeline of our scanning system

paper, i.e. single fixed sensor with a moving person.

The complete setup of our home scanning system is illustrated in Figure 1(a). The camera is mounted vertically to maximize the field of view so that a subject can stand as close as possible. The required initial pose of the subject is also shown in Figure 1(a). As the subject turns from the initial pose, he/she is required to stay static for 1 second at approximately every 1/4 circle. The subject can turn naturally as long as his/her two arms stay in the torso's plane and do not occlude the legs. The system and instructions are easy to set up and follow, and the whole data-recording process takes no more than 10 seconds.

This user-friendly scenario imposes serious algorithmic challenges. Articulation: Articulation needs to be considered even when the person is turning on an automatic turntable [4, 26], let alone when the person is turning naturally in front of the Kinect sensor. The more cylinders we use in the model, the more degrees of freedom (DOF) need to be considered. Noisy nature of the Kinect: While the Kinect can obtain reasonable results at close distances, its accuracy decreases quadratically as we move further away, as described in [11]. Our normal working distance is 2 m, which gives a statistical error ε ∼ N(0, σ) where σ is approximately 7 mm. Lack of texture and accurate silhouette information: The regular RGB image is not used, in order to extend the applicability of our scanning system. The large variance on the occluding boundaries of the object in range images makes the silhouette noisy.

In this paper, we propose a body modeling system with a single fixed range camera which can be conveniently used at home. The general pipeline of the working system is shown in Figure 2. Since the scheme of using as many frames as possible fails (Section 4), we instead detect 4 key poses in the whole depth video sequence: front (reference pose), back, and the two profiles. They cover sufficient information to describe the main body shape. The 4 key frames are registered in a top-bottom-top manner instead of with other well-developed non-rigid registration methods (Section 2). Top-bottom means that registration goes from the root node to all leaf nodes in the tree model of the human body, as shown in Figure 1(c), while bottom-top means the opposite. Top-bottom registration first aligns the torso or the whole body and then aligns succeeding rigid body parts.


Bottom-top registration first refines the alignment of rigid body parts and then propagates the refinement all the way to the root node, i.e. the torso. After registering the 4 key frames, an articulated part-based cylindrical body model (as shown in Figure 1(b)), which supports a set of operations, can be used to process the rough and noisy registered point cloud of the body. Figure 2 shows a flow chart of the modeling process. The key here is the 2D part-based unwrapped cylindrical representation, which enables computationally efficient 2D interpolation and 2D filtering.

The contributions of this paper can be summarized as follows: 1) a body scanning system which can be conveniently used at home by a single naive user; 2) a new articulated part-based representation of the 3D body model which supports a set of operations, such as composition, decomposition, filtering, and interpolation; 3) a simple method to automatically detect 4 key poses from a sequence of range images; 4) a top-bottom-top registration between the 4 key frames; 5) a quantitative evaluation of depth quality.

The rest of the paper is organized as follows. Section 2 covers related state-of-the-art algorithms. Section 3 details our proposed body model, i.e. the articulated, part-based cylindrical representation and the set of operations that it supports. In Section 4, we give our top-bottom-top registration process. Section 5 includes our experimental results and a quantitative comparison with a laser-scanned result. We conclude briefly in Section 6.

2. Related Work

At first look, this is a non-rigid registration problem, on which a huge amount of work has been published. In [6], the iterative closest point (ICP) algorithm [30] is extended by softening hard links between points and formulating the registration process under a probabilistic scheme. The embedded deformation model [23] has proved successful in non-rigid registration [8, 15, 26]. A similar approach [16] retains smoothness of the warped surface by using the Laplace-Beltrami operator. All these methods require either enough overlapping area between consecutive frames or accurate range scans as inputs. In our case, however, we are registering four noisy frames with barely overlapping area, which restricts the application of the well-developed non-rigid registration methods in our scenario.

Several related works exist on body modeling. They can be roughly classified into four main categories based on the number of cameras used and whether the camera or the person is moving. Different scenarios make different underlying assumptions and hence lead to different research focuses.

Multiple Fixed Cameras with Static Person. Before the introduction of the now-popular range sensors (e.g. Kinect), people were able to obtain full models from several fixed intensity cameras with the Shape-From-Silhouette (SFS) algorithm [14, 19]. Now that more and more range cameras have hit the market, researchers have been working on 3D modeling with this type of sensor, since it directly provides the depth information of the object. Commercial systems use multiple calibrated depth cameras to model body shape. The person is required to stay static during the process, and it is quite straightforward to align the point clouds from the different cameras to get the final model. Both methods can provide the most accurate models so far. However, they both require multiple synchronized cameras, and the latter even needs to deal with interference between cameras [18]. These factors keep them far from home applications.

Multiple Fixed Cameras with Moving Person. The motion of a person can greatly decrease the number of required fixed cameras because, provided a good registration between consecutive frames, it is exactly the same as setting up more cameras. For SFS-based approaches, the registration is achieved by locating Colored Surface Points (CSP) [3, 4]; the corresponding CSPs between two consecutive frames incorporate the 6-DOF rigid motion information. For depth-based approaches, a good registration can be provided either by an articulated version of the Iterative Closest Point (ICP) algorithm [30] or by iteratively performing pair-wise non-rigid geometric registration and global registration [26]. Again, these methods still require 3 or 4 synchronized cameras, and the person must stand on a turntable and try to remain rigid.

Single Moving Camera with Static Person. The SFS-based approach only works when the camera is mounted on a robot arm and moves circularly around the static body. Under circular motion, the DOF of the fundamental matrix between consecutive frames is restricted to 4 [28]. The parameters of all fundamental matrices can then be estimated by minimizing the reprojection errors of corresponding epipolar tangents [29]. The model of the body can be reconstructed from the motion estimated from the fundamental matrices. Although the results are quite impressive, the system setup is still impractical. As for the depth-based approach, KinectFusion [9] has proved a great success. Based on dense tracking and a volumetric representation, it can reconstruct a static 3D scene in real time. While it works on a static scene, it fails when registering only the points of a moving person in the presence of articulation.

Single Fixed Camera with Moving Person. The only approach proposed so far [27] is depth-based and uses statistical learning. In [27], the authors estimate the body shape by fitting image silhouettes and depth data to the Shape Completion and Animation of People (SCAPE) model [1]. The work can be understood under a learning scheme where the best model is predicted from the observed data. The final model is the best match among all candidates of a limited subspace instead of being generated from the data itself, so the bias introduced by the learning scheme is inevitable. Moreover, the whole system takes approximately 65 minutes to optimize, which is too slow for practical applications. Our focus is also on a single fixed camera with a moving person, but we align the detected 4 key frames based entirely on prior geometric information.

3. Generic Human Body Model and Operations

The generic body model used in this paper and its supported operations are important for both the top-bottom-top registration and the modeling.

3.1. Generic Human Body Model

The generic human body model is depicted in Figure 1(b). The body model consists of a set of cylinders representing rigid body parts and respects kinematic constraints, i.e. each rigid part is connected to its parent via a joint. The model also has a tree representation (Figure 1(c)), which is important for understanding our top-bottom-top registration. In the tree representation, each node is a rigid body part and each edge is a joint. Assuming that we have n nodes and n edges, each node B_i and each edge J_i can be represented as

B_i = {P_i, I_i^{J_i}}, i = 1...n, (1)

J_i = {j_i^c, (j_i^x, j_i^y, j_i^main)}, i = 1...n, (2)

where J_i and B_i form a pair and J_i connects B_i to its parent node. Each joint J_i has four vectors which constitute the local Cartesian coordinate system: j_i^c is the location of joint i and the origin of the local system; j_i^main is the main direction of cylinder i and the z axis of the local system; j_i^x and j_i^y are the x axis and y axis of the local system. A specific body part B_i has two components. P_i is the point cloud, which can be represented in the world Cartesian coordinate system as P_i^W, in the local Cartesian coordinate system defined by J_i as P_i^{L_i}, or in the local cylindrical coordinate system defined by J_i as P_i^{C_i}. I_i^{J_i} is the cylindrical image, which can also be regarded as a compact and discretized version of P_i^{C_i}.
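The node/edge structure above can be sketched as a small tree data structure. This is an illustrative sketch, not the authors' implementation; the class and function names are our own.

```python
from dataclasses import dataclass, field
from typing import List, Optional

import numpy as np

@dataclass
class Joint:
    """J_i: local frame of a rigid part. j_c is the joint location
    (origin); j_x, j_y, j_main are the local axes, with j_main the
    cylinder's main direction (the local z axis)."""
    j_c: np.ndarray
    j_x: np.ndarray
    j_y: np.ndarray
    j_main: np.ndarray

@dataclass
class BodyPart:
    """B_i: a rigid body part, i.e. one node of the tree."""
    name: str
    joint: Joint                             # edge connecting B_i to its parent
    points_world: np.ndarray                 # P_i^W, an (N, 3) point cloud
    cyl_image: Optional[np.ndarray] = None   # I_i^{J_i}, discretized cylindrical image
    children: List["BodyPart"] = field(default_factory=list)

def top_bottom(root: BodyPart):
    """Top-bottom traversal: the root (torso-head) first, then the limbs."""
    yield root
    for child in root.children:
        yield from top_bottom(child)
```

For the simplified 5-cylinder model, the tree is a torso-head root with the four limbs as its children, and `top_bottom` visits the root before any leaf, matching the registration order of Section 4.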

3.2. Supported Operations

Decomposition (3D). We decompose the point cloud P_body^W of a whole human body and obtain P_i^W of each B_i in (1) by taking advantage of planar geometry. Assuming that accurate critical point information is given (Section 4.2), we can easily decompose the whole body into several rigid parts. A 2D example of decomposing at the crotch is given in Figure 3; the whole body can be decomposed in a similar way. Notice that after decomposition, connected rigid body parts have an overlapping area, i.e. P_i^W ∩ P_j^W ≠ ∅ if B_i and B_j are connected by joint J_i. The overlapping area is useful for blending between connected body parts when we compose everything into a whole model.

Figure 3: Decomposition at crotch

Local Structure Extraction (3D). After decomposition, every point in P_body^W is assigned to a rigid body part B_i. We use Principal Component Analysis (PCA) [10] to extract the structure (j_i^x, j_i^y, j_i^main) of each rigid part. The largest component defines the axis of the corresponding cylinder and hence is j_i^main. The order of j_i^x and j_i^y does not matter as long as (j_i^x, j_i^y, j_i^main) forms a Cartesian coordinate system in space. Notice that the axis information is used during the decomposition process; therefore in practice we apply decomposition and local structure extraction iteratively until all the axes converge.
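The PCA step can be sketched in a few lines of numpy, assuming the part's points are already decomposed; the function name is ours.

```python
import numpy as np

def local_structure(points: np.ndarray):
    """Extract (j_x, j_y, j_main) of one rigid part by PCA [10].

    points: (N, 3) point cloud of a single decomposed body part.
    The eigenvector with the largest eigenvalue is the cylinder's
    main axis j_main; the order of the two minor axes is arbitrary.
    """
    center = points.mean(axis=0)
    cov = np.cov((points - center).T)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    j_main = eigvecs[:, -1]                   # largest component -> main axis
    j_x, j_y = eigvecs[:, 0], eigvecs[:, 1]
    return center, j_x, j_y, j_main
```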

Global and Local Cartesian Coordinate System Transformation P_i^W → P_i^{L_i} (3D → 3D). For each rigid body part B_i, we need to represent its point cloud P_i^W in the local Cartesian coordinate system obtained from decomposition and local structure extraction. Given the new basis (j_i^x, j_i^y, j_i^main) and the center j_i^c, for any point p_ij^W ∈ P_i^W we have the transformation p_ij^{L_i} = R_i p_ij^W + t_i, where R_i = [j_i^x, j_i^y, j_i^main]^{-1} and t_i = −R_i j_i^c.

Local Cylindrical Image P_i^{L_i} → P_i^{C_i} (3D → 2D). After mapping each point to its corresponding local Cartesian coordinate system, p_ij^{L_i} = [p_x, p_y, p_z], it can be further transformed into the cylindrical coordinate system p_ij^{C_i} = [ρ, ϕ, z], where ρ = √(p_x² + p_y²), ϕ = arccos(p_x / ρ) and z = p_z (the ij subscripts are dropped inside the brackets for clarity). p_ij^{C_i} can be further discretized and mapped to I_i^{J_i} [7].

Interpolation (2D). Interpolation is a main reason for

our cylindrical representation. In many cases, even after we gather many frames and use all the points, we still have holes in the cylindrical image due to occlusions. We propose linear interpolation along the rows of the cylindrical image, which turns out to be circular interpolation in space.
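The 3D→2D mapping and the row-wise hole filling can be sketched as follows. The image resolution is a hypothetical choice, and we use atan2 rather than arccos so that ϕ covers the full [0, 2π) range; both function names are ours.

```python
import numpy as np

def to_cylindrical_image(points_local: np.ndarray, rows=64, cols=128):
    """Map points in a part's local Cartesian frame (z = main axis)
    to a discretized cylindrical image I[row, col] = radius rho."""
    x, y, z = points_local[:, 0], points_local[:, 1], points_local[:, 2]
    rho = np.hypot(x, y)
    phi = np.arctan2(y, x) % (2 * np.pi)
    r = np.clip(((z - z.min()) / (np.ptp(z) + 1e-9) * rows).astype(int), 0, rows - 1)
    c = (phi / (2 * np.pi) * cols).astype(int) % cols
    img = np.full((rows, cols), np.nan)
    img[r, c] = rho            # last write wins; averaging per bin is also common
    return img

def fill_rows(img: np.ndarray):
    """Linear interpolation along each row of the cylindrical image,
    which corresponds to circular interpolation in space."""
    out = img.copy()
    cols = img.shape[1]
    for row in out:
        bad = np.isnan(row)
        if bad.all() or not bad.any():
            continue
        good = np.flatnonzero(~bad)
        # period=cols makes the interpolation wrap around the cylinder
        row[bad] = np.interp(np.flatnonzero(bad), good, row[good], period=cols)
    return out
```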

Filtering (2D). Filtering is another reason for us to use the cylindrical representation. After mapping the 3D points to a 2D cylindrical image, we can easily take advantage of well-developed 2D image filters. We use the bilateral filter [25], which reduces noise while preserving edges, for spatial smoothing. For temporal filtering, we use a running mean on each pixel of the cylindrical image [7].
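The temporal running mean can be sketched as a per-pixel incremental update that skips hole pixels; the function name and the NaN-as-hole convention are our own assumptions.

```python
import numpy as np

def running_mean_update(mean_img, count_img, new_img):
    """One temporal-filtering step: per-pixel running mean over the
    cylindrical images of successive frames. NaN pixels in new_img
    (holes) leave the running mean for that pixel untouched."""
    valid = ~np.isnan(new_img)
    count_img = count_img + valid
    safe = np.maximum(count_img, 1)
    contrib = np.where(valid, new_img, 0.0)     # strip NaNs before arithmetic
    mean_img = np.where(valid, mean_img + (contrib - mean_img) / safe, mean_img)
    return mean_img, count_img
```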

Composition (2D → 3D). We compose the single cylinders to construct the meshed 3D body model. The gaps between connected rigid body parts must be blended. The 2D cylindrical images of connected body parts overlap (this is not reflected in Figure 2). Assuming B_i is the parent of B_j, we can decompose the overlapping point cloud P_ij^W out of P_i^W using the same planar geometry defined before. P_ij^W is mapped in the following directions: P_ij^W → P_ij^{L_j} → P_ij^{C_j} → I_ij^{J_j}. We linearly blend I_j^{J_j} with I_ij^{J_j} to update I_j^{J_j}. Connectivity on the 2D cylindrical image lets us mesh the rigid body parts easily.

Algorithm 1 Articulated registration based on EM-ICP

Input: a reference decomposed point cloud Q_ref = {P_i^ref, i = 1...n}, and a point cloud Q_in.
Output: a decomposed and registered point cloud Q_out = {P_i^out, i = 1...n}.

  register Q_in with Q_ref as a whole by EM-ICP
  Q_out = {P_i^out, i = 1...n} ← decompose Q_in in the same way as Q_ref
  initialize ε, t_ε and count_max
  count ← 0
  while ε ≥ t_ε do
    count ← count + 1
    ε ← 0
    for i = 1 → n do
      register P_i^out with P_i^ref in the local Cartesian coordinate system defined by J_i using EM-ICP
      project the transformation matrix to a 2-DOF subspace with rotation angles θ_i and δ_i
      ε ← ε + |θ_i| + |δ_i|
    end for
    register Q_out with Q_ref as a whole by EM-ICP
    if count ≥ count_max then
      display('Maximum iteration reached!')
      return Q_out = {P_i^out, i = 1...n}
    end if
  end while
  return Q_out = {P_i^out, i = 1...n}

4. A Top-Bottom-Top Registration Method

Iterative alignment failure. A straightforward and intuitive registration approach is to incrementally align each new frame with the existing model and add the frame to refine the model if the alignment proves good enough. The Iterative Closest Point (ICP) algorithm [30] has been well developed over the years, and its state-of-the-art variant EM-ICP [21, 5], implemented on the GPU [24], exhibits both robustness and speed.

In our case, an articulated registration (Algorithm 1) based on EM-ICP is employed due to the existence of articulation between any new frame and our model. The model is initialized with the front pose and all succeeding frames are registered to the model. The Surface Interpenetration Measure (SIM) [20] is used to check the consistency between a new frame and the model, and a hard threshold is used to remove outliers, i.e. bad frames.

After several experiments, however, we found that this approach fails for two main reasons: 1) the noisy nature of the input data as described before; the larger the noise, the harder it is for EM-ICP to register two point clouds; 2) the EM-ICP algorithm tends to underestimate rotations when registering cylindrical objects [17], i.e. EM-ICP tends to shrink the model as you move along the cylindrical surface.

The failure of the straightforward approach led us to think from a different perspective. In this paper, instead of registering all frames, we use 4 key poses: front (reference frame), back, left profile and right profile. We automatically detect and extract the 4 key frames from a complete depth video sequence in which a subject turns around 360°. While the 4 frames cannot be registered by Algorithm 1 due to their limited overlapping area, we register them in a top-bottom-top manner. Top-bottom means enforcing the kinematic constraints from a parent node to its child nodes, while bottom-top means the opposite. For our simplified body model, which consists of 5 cylinders (torso-head, left leg, right leg, left arm and right arm), the top-bottom method first aligns the whole body, after which the registration of each rigid limb in its local Cartesian coordinate system can be restricted to 2 DOF. The bottom-top method first aligns all rigid limbs, and then the registration of the torso-head is solved by considering the constraints of rigidity and connectivity. The top-bottom process puts every rigid body part in a roughly correct position, while the bottom-top process refines and completes the registration.

4.1. Detect 4 Key Postures

The front pose F_front is labeled as soon as we detect the critical points (defined in Section 4.2). The front pose is set as the reference frame and is used to initialize our model in the top-bottom-top registration process. Starting from the front frame, we assign a weighted normalized score to each succeeding frame and detect the other key poses by searching over the whole sequence.

Scores associated with each frame. There are three main scores. Score s_i^m indicates the motion between two consecutive frames; assuming that the user is the only moving object in the scene, we calculate s_i^m as the sum of absolute pixel-wise differences between two consecutive range images. Score s_i^w is the width of the bounding box in pixels, as shown in Figure 4(a). Score s_i^s measures the similarity in structure between the current frame i and the front frame F_front: the second largest normalized eigenvector v̂_i^arm of frame i gives the direction of the arms, hence s_i^s = v̂_i^arm · v̂_front^arm indicates to what extent the current pose looks like the front pose.

Normalized scores associated with each frame. We normalize all three scores of each frame i as follows: ŝ_i^m = s_i^m / max_{i=1...n} s_i^m, ŝ_i^w = s_i^w / max_{i=1...n} s_i^w, and ŝ_i^s = s_i^s / max_{i=1...n} s_i^s.

Detecting the back and profiles. Based on these normalized scores, we can detect the back and the two profiles:

s_i^back = α ŝ_i^m + (1 − α) ŝ_i^s, (3)

s_i^profile = β ŝ_i^m + (1 − β) ŝ_i^w. (4)

s_i^back and s_i^profile are the scores of frame i used to detect the back and the profiles, respectively. α and β are the weights of the corresponding base scores; both are set to 0.5 in our experiments. The two profiles and the back are then detected as

F_leftProfile = argmin_{i=1...n/2} s_i^profile, (5)

F_rightProfile = argmin_{i=n/2...n} s_i^profile, (6)

F_back = argmin_{i=F_leftProfile...F_rightProfile} s_i^back. (7)
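Equations (3)-(7) reduce to a few argmin calls over the per-frame score arrays; a minimal sketch, with the function name ours:

```python
import numpy as np

def detect_key_frames(s_m, s_w, s_s, alpha=0.5, beta=0.5):
    """Detect the back and two profile key frames (Eqs. 3-7).
    s_m = motion, s_w = bounding-box width, s_s = structural similarity
    to the front frame; each is a 1-D array over the frame sequence.
    Returns (left_profile, back, right_profile) frame indices."""
    # normalize each score by its maximum over the sequence
    sm, sw, ss = (np.asarray(a, float) / np.max(a) for a in (s_m, s_w, s_s))
    s_back = alpha * sm + (1 - alpha) * ss       # Eq. (3)
    s_profile = beta * sm + (1 - beta) * sw      # Eq. (4)
    n = len(sm)
    left = int(np.argmin(s_profile[: n // 2]))               # Eq. (5)
    right = n // 2 + int(np.argmin(s_profile[n // 2:]))      # Eq. (6)
    back = left + int(np.argmin(s_back[left: right + 1]))    # Eq. (7)
    return left, back, right
```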

4.2. Detect Critical Points

The critical points are the points used for decomposing the whole body (see Figure 4(b)). Our simplified model consists of 5 cylinders, hence we need a minimum of 3 points to separate the whole body, as shown in Figure 4(b). We detect these 3 critical points in the front frame by the following procedure: 1) detect the horizontal gaps in the front range image, e.g. the gap between a hand and the body or the gap between the two legs, as shown in Figure 4(b); 2) search upwards to find the highest white pixels, which are the 2D critical points; 3) project the 2D critical points to 3D critical points.

4.3. Top-bottom Registration

The top-bottom registration enforces the kinematic constraints from a parent to its children. We register the back with our model using 2D silhouette information, and register the profiles with our model using full 3D information.

Silhouette-based top-bottom registration of the back pose. First, the whole back pose (i.e. the back frame) is flipped by ϕ = arccos((v_arm^back · v_arm^front) / (‖v_arm^back‖ ‖v_arm^front‖)) = arccos(v̂_arm^back · v̂_arm^front) in space along the main axis of the torso and pasted at a certain depth µ. µ is either predefined (e.g. 150 mm) or roughly obtained from the two profiles. Then we register the back

Figure 4: (a) Width of the bounding box (b) Critical points detection

frame with the front frame (i.e. the reference frame) based on their silhouettes (i.e. 2D point clouds) obtained by orthographic projection, using Algorithm 1. The result is shown in Figure 5.

Figure 5: (a) Before registration (b) After registration

Depth-based top-bottom registration of the profile postures. The same silhouette-based method cannot be applied to register the two profile frames, since the silhouette of the arm cannot be found in a profile frame. We instead directly take advantage of the 3D information. After registering the back frame, we build an approximate model and fill the holes on the two sides by interpolating the cylindrical depth images. Algorithm 1 is then used again to register the two profile frames with the reference model.

4.4. Bottom-top Registration

The bottom-top registration first refines the alignment of the leaf nodes and then propagates the refinement inversely along the edges until it reaches the root node. The kinematic constraint is hence enforced from the child to the parent.

Refining leaf node registration. Each leaf node (besides the head) represents a rigid body part and contains information coming from 3 frames, since one profile frame is occluded. The 3 point clouds are only approximately aligned after the top-bottom process and are registered more precisely by an iterative local registration. Step a) We obtain the profile's silhouette by orthographically projecting the profile point cloud along the profile direction. The correct depth and angle of the back frame with respect to the front frame are then recovered from the profile's silhouette. Step b) Given the front and back, we generate a cylinder initialized with the front and back point clouds and fill the holes by interpolation. Then we register the profile point cloud to the cylinder with EM-ICP. By iteratively executing steps a) and b), we obtain a more compact cylinder at convergence, which experimentally occurs after 10 iterations. We believe the iterative process works for two main reasons: 1) the profile point cloud contains the width information of a rigid body part; 2) the front and back point clouds have overlapping spatial feature points with the profile point cloud.

Refinement propagation. After correctly registering the children, we propagate the registration improvement to the parent. In our experiment, the parent torso-head has 4 children, i.e. the 4 limbs. We register the decomposed torso-head from the back pose by enforcing rigidity and connectivity, i.e. the torso-head is rigid and it is connected to the four limbs. This is a typical problem of finding the transformation matrix from corresponding points, which has a closed-form solution.

5. Experimental Results

Figure 6 includes the modeled results of 4 people scanned by our system. When they turn in front of the depth camera, we observe obvious articulated motion. In Figure 6 each model is shown at 4 different views. Figure 6 shows clear and smooth body shapes as a whole, and contains personalized shapes such as knees, hips and clothes. The holes on the body are interpolated and the joints between rigid body parts are well blended. The average computing time of the whole process with an Intel Core i7 processor at 2.0 GHz is around 3 minutes. Although finer details of the body shape (e.g. lips, eyes) cannot be extracted due to the noisy nature of the Kinect, we believe that the current system can generate models accurate enough for applications such as online shopping and gaming. We believe that the models can be further refined by using a more complex model, adding more frames and taking advantage of the corresponding RGB images.

Besides the qualitative analysis, we also present a quantitative comparison between our model and a laser-scanned result. Due to the articulation between the two models, it is hard to compare them as a whole. We follow the decomposition idea (Section 4.2) and compare the segmented rigid parts separately. The heatmaps of the torso and the right leg are shown in Figure 7. We generate the point-wise error by mapping each point of our body part to the cylindrical image generated by the laser-scanned body part and looking for the closest pixel or interpolating the 4 neighboring pixels. The median absolute error on the torso is 5.84 mm, while the median absolute error on the right leg is 2.59 mm.
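The nearest-pixel variant of this error measure can be sketched as a lookup into the reference cylindrical image; the function name and image layout below are hypothetical.

```python
import numpy as np

def pointwise_error(part_points, ref_cyl_image):
    """Per-point error of a scanned body part against a reference
    (e.g. laser-scanned) part stored as a cylindrical image of radii.

    part_points: (N, 3) points already expressed in the reference
    part's local cylindrical frame as (rho, phi, z) triples.
    ref_cyl_image: (rows, cols) image whose pixel value is the
    reference radius; rows index z in [0, 1), columns index phi
    in [0, 2*pi) (an assumed layout).
    """
    rows, cols = ref_cyl_image.shape
    rho, phi, z = part_points.T
    r = np.clip((z * rows).astype(int), 0, rows - 1)
    c = (phi / (2 * np.pi) * cols).astype(int) % cols
    # error = measured radius minus reference radius at the nearest pixel
    return np.abs(rho - ref_cyl_image[r, c])
```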

Figure 6: Different people modeled by our scanning system


Figure 7: (a) Heatmap of the torso at 4 different views. (b) Heatmap of the right leg at 4 different views. For better visualization, please see the colored PDF file.

6. Conclusions and Future Work

We have presented a practical system for 3D body scanning that can be operated by a single naive user at home. While the accuracy results are very encouraging, we will investigate several ways to recover fine details of the body shape. These include the use of a larger number of cylinders, the exploitation of the information contained in the RGB data stream, and the application of super-resolution methods.

References

[1] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. SCAPE: shape completion and animation of people. In ACM Transactions on Graphics (TOG), volume 24, pages 408–416. ACM, 2005.
[2] Bodymetrics. http://www.bodymetrics.com/.
[3] G. Cheung, S. Baker, and T. Kanade. Visual hull alignment and refinement across time: A 3D reconstruction algorithm combining shape-from-silhouette with stereo. In CVPR 2003, volume 2, pages II-375. IEEE, 2003.
[4] K. Cheung, S. Baker, and T. Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In CVPR 2003, volume 1, pages I-77. IEEE, 2003.
[5] S. Granger and X. Pennec. Multi-scale EM-ICP: A fast and robust approach for surface registration. In ECCV 2002, pages 69–73, 2002.
[6] D. Hahnel, S. Thrun, and W. Burgard. An extension of the ICP algorithm for modeling nonrigid objects with mobile robots. In International Joint Conference on Artificial Intelligence, volume 18, pages 915–920, 2003.
[7] M. Hernandez, J. Choi, and G. Medioni. Laser scan quality 3-D face modeling using a low-cost depth camera. In EUSIPCO 2012, 2012.
[8] Q. Huang, B. Adams, M. Wicke, and L. Guibas. Non-rigid registration under isometric deformations. In Computer Graphics Forum, volume 27, pages 1449–1457. Wiley Online Library, 2008.
[9] S. Izadi, R. Newcombe, D. Kim, O. Hilliges, D. Molyneaux, S. Hodges, P. Kohli, J. Shotton, A. Davison, and A. Fitzgibbon. KinectFusion: real-time dynamic 3D surface reconstruction and interaction. In ACM SIGGRAPH 2011, page 23. ACM, 2011.
[10] I. Jolliffe. Principal Component Analysis, volume 2. Wiley Online Library, 2002.
[11] K. Khoshelham. Accuracy analysis of Kinect depth data. In ISPRS Workshop Laser Scanning, volume 38, page 1, 2011.
[12] Kinect. http://www.xbox.com/en-US/kinect/.
[13] K. Kolev, M. Klodt, T. Brox, and D. Cremers. Continuous global optimization in multiview 3D reconstruction. IJCV, 84(1):80–96, 2009.
[14] A. Laurentini. The visual hull concept for silhouette-based image understanding. TPAMI, 16(2):150–162, 1994.
[15] H. Li, R. Sumner, and M. Pauly. Global correspondence optimization for non-rigid registration of depth scans. In Computer Graphics Forum, volume 27, pages 1421–1430. Wiley Online Library, 2008.
[16] M. Liao, Q. Zhang, H. Wang, R. Yang, and M. Gong. Modeling deformable objects from a single depth camera. In ICCV 2009, pages 167–174. IEEE, 2009.
[17] Y. Liu. Automatic registration of overlapping 3D point clouds using closest points. Image and Vision Computing, 24(7):762–781, 2006.
[18] A. Maimone and H. Fuchs. Reducing interference between multiple structured light depth sensors using motion. In IEEE Virtual Reality Workshops (VR), pages 51–54. IEEE, 2012.
[19] W. Matusik, C. Buehler, R. Raskar, S. Gortler, and L. McMillan. Image-based visual hulls. In SIGGRAPH 2000, pages 369–374. ACM Press/Addison-Wesley, 2000.
[20] C. Queirolo, L. Silva, O. Bellon, and M. Segundo. 3D face recognition using simulated annealing and the surface interpenetration measure. TPAMI, 32(2):206–219, 2010.
[21] S. Rusinkiewicz and M. Levoy. Efficient variants of the ICP algorithm. In 3-D Digital Imaging and Modeling (3DIM), pages 145–152. IEEE, 2001.
[22] Styku. http://www.styku.com/.
[23] R. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. In ACM Transactions on Graphics (TOG), volume 26, page 80. ACM, 2007.
[24] T. Tamaki, M. Abe, B. Raytchev, and K. Kaneda. Softassign and EM-ICP on GPU. In Networking and Computing (ICNC), pages 179–183. IEEE, 2010.
[25] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In ICCV 1998, pages 839–846. IEEE, 1998.
[26] J. Tong, J. Zhou, L. Liu, Z. Pan, and H. Yan. Scanning 3D full human bodies using Kinects. IEEE Trans. Vis. Comput. Graph., 18(4):643–650, 2012.
[27] A. Weiss, D. Hirshberg, and M. Black. Home 3D body scans from noisy image and range data. In ICCV 2011, pages 1951–1958. IEEE, 2011.
[28] K. Wong and R. Cipolla. Structure and motion from silhouettes. In ICCV 2001, volume 2, pages 217–222. IEEE, 2001.
[29] K. Wong, P. Mendonca, and R. Cipolla. Head model acquisition from silhouettes. Visual Form 2001, pages 787–796, 2001.
[30] C. Yang and G. Medioni. Object modelling by registration of multiple range images. Image and Vision Computing, 10(3):145–155, 1992.