
Individual 3D Model Estimation for Realtime Human Motion Capture

Lianjun Liao∗‡§, Le Su† and Shihong Xia∗
∗Institute of Computing Technology, CAS, Beijing, 100190
Email: [email protected]
†Civil Aviation University of China, Tianjin, 300300
‡University of Chinese Academy of Sciences, Beijing, 100049
§North China University of Technology, Beijing, 100044

Abstract—In this paper, we present a practicable method to estimate an individual 3D human model in a low-cost multi-view realtime 3D human motion capture system. The key idea is: using a human geometric model database and a human motion database, establish human geometric priors and a pose prior model; given the geometric priors, the pose prior and a standard template geometry model, the individual human body model and its embedded skeleton can be estimated from the 3D point cloud captured by multiple depth cameras. Because the global prior model of human body pose and shape is introduced into a unified nonlinear optimization problem for human geometry model estimation, the accuracy of geometric model estimation is significantly improved. Experiments on synthesized data sets with and without noise and on real data captured from multiple depth cameras show that the estimation results of our method are more reasonable and accurate than those of the classical method, and that our method has better noise immunity. The proposed individual 3D geometric model estimation method is suitable for online realtime human motion tracking systems.

Keywords-Human Model Estimation, Data Driven, Human Motion Capture

I. Introduction

Human motion capture and tracking is a hot issue in computer vision and graphics. It mainly studies how to quickly reconstruct an accurate human geometry model and human motion sequence from the input depth data stream. Motion capture technology has important application value in movie stunts, animation, games, sports training and other fields; for example, the captured human motion sequence can be used to guide sports training or to improve the sense of reality of game characters. However, up to now, existing low-cost commercial RGB-D human capture systems such as Kinect suffer from the ill-posed problem caused by limb occlusions or self-occlusions, and cannot robustly reconstruct reasonably accurate 3D human motion sequences.

Compared with model-free 3D human motion capture systems, model-based systems can achieve higher accuracy due to the additional human shape prior. Similarly, in contrast with single-view-based systems such as [1], multi-view based methods such as [2] can achieve even higher accuracy by minimizing the influence of the ill-posed problem caused by limb occlusion or self-occlusion. At present, many motion capture systems can accurately and reasonably reconstruct the human geometry model, but they are too time-consuming (even several hours). Realtime human motion capture systems have also been reported. However, these methods need a pre-established human model; when the human body size changes significantly or a pose is not in the database, they fail.

To address this issue, we present a practicable method to estimate an individual 3D human model in a low-cost multi-view realtime 3D human motion capture system. The key idea is: based on a human geometric model database and a motion database, establish geometric priors and a pose prior model; with the geometric priors, the pose prior and a standard template geometry model, the individual human body model and its embedded skeleton can be estimated from the 3D point clouds captured by multiple depth cameras. The experiments show that, based on our automatic estimation method for the individual geometric model, it is easy to develop a low-cost, online realtime and accurate acquisition system for 3D human motion sequences. The main contributions of this work are as follows:

• We propose a new individual 3D geometric model estimation method suitable for online realtime human motion tracking systems.

• We successfully introduce the global prior model of human body pose and shape into the nonlinear optimization problem of human geometry model estimation, and consequently the accuracy of geometric model estimation is significantly improved.

II. Related Work

Model-free methods such as [3]–[5], considering no prior information about the human body, identify the human pose in a frame through image or mesh feature point detection. The drawback is that they neglect the influence of previous frames on the pose of the current frame, that is, they ignore the essence of human motion as a continuous process of spatial and temporal variation. Model-based methods need a 3D model scanned in advance, such as methods based on point cloud ICP [6]–[8], methods based on particle filtering [9], or methods based on multi-view depth cameras [10], [11].


The cost of the 3D scanner is high, and it is time-consuming to process the scanned data. These methods also suffer from error accumulation when tracking long movements.

Data-driven methods, with the help of a 3D pose database constructed in advance from captured motion data, can usually achieve compelling results. Siddiqui et al. [12] and Baak et al. [13] estimate the human pose in each frame by detecting feature points from the depth image. Since the model in the database is a standard 3D human body model, they cannot always obtain reasonable results when the size of the actor differs greatly from the standard model in the database. Ye et al. [14] retrieve the optimal match between the 3D point cloud captured from multiple depth cameras and the 3D human poses in the database, and then estimate the full-body pose by deforming the retrieved pose back to the captured 3D point cloud through non-rigid registration. Their human pose database is composed of 3D point clouds computed from depth images generated by projecting a standard 3D body model driven by an embedded skeleton. Zhang et al. [2] present an efficient physics-based motion reconstruction algorithm that, by integrating depth data from 3 Kinect cameras, foot pressure data from wearable pressure sensors and detailed full-body geometry, can reconstruct offline full-body motion (i.e. kinematic data) and human dynamics data. When tracking the 3D skeletal poses, global PCA for feature dimension reduction is performed on the 3D body posture set of the CMU motion database; the reconstructed 3D pose prior is equivalent to imposing additional range constraints on the human joint angles.

Zhu et al. [15] succeed in tracking human motion by combining semantic feature detection of human limbs based on Bayesian estimation with inverse kinematic optimization under constraints such as joint limit avoidance. However, the method assumes that the human head in the image is always located above the waist, so it cannot deal with poses that do not meet this condition. Wei et al. [1] provide a fast, automatic method for accurately capturing full-body motion data using a single depth camera. They formulate realtime 3D human posture reconstruction from a monocular depth image as a registration problem in a Maximum A Posteriori (MAP) framework, iteratively registering a 3D articulated human body model with the monocular depth image. The method can recover from failures by combining 3D tracking with 3D pose detection.

In general, from the above human motion capture systems, we can see that model-based methods are often better in accuracy, and human geometry model estimation plays an important role in these methods. Therefore, our motivation in this paper is to reasonably and accurately estimate an individual geometry model suitable for a realtime human motion system. The work most related to ours is that of D. Anguelov et al. [16], who introduce the SCAPE method, a data-driven method for building a human shape model that spans variation in both subject shape and pose. Generally, the SCAPE method can

reconstruct a reasonable and accurate 3D human geometric model and pose. However, the SCAPE method requires accurate point correspondences between the non-rigid model and the target 3D point cloud to ensure accurate 3D human pose estimation and large-scale deformation, and when the target 3D human pose differs greatly from the template model, the result of SCAPE is obviously unreasonable.

III. Overview

We propose an effective approach to estimate the individual human geometric model online for a low-cost realtime 3D human motion capture system. Fig. 1 gives the pipeline of our system. The input data are the 3D point cloud captured by multi-view calibrated depth cameras and a standard template human geometric model. The individual 3D geometric models and embedded skeletons consistent with the input point clouds are then accurately constructed by making full use of the human geometric model database and the human pose database. We will demonstrate that our method can quickly reconstruct reasonable and accurate individual 3D geometric models and their embedded skeletons, and that it is suitable for realtime human motion capture systems.

Figure 1. System overview

IV. Data Acquisition

This section focuses on the data acquisition method from multiple depth cameras and the representation of the pose database.

A. Spatial and Temporal Alignment

Four depth cameras (Microsoft Kinect v2.0) are connected to one PC, which is extended with three PCI cards to obtain three additional USB 3.0 ports (the mainboard has only one USB 3.0 hub). Since Microsoft's Kinect driver does not support connecting multiple Kinect cameras simultaneously, we use the open source device driver libfreenect2 [17].

Temporal alignment: The frame rate of the Kinect is 30 fps, that is, the acquisition cycle is about 33 ms. A timestamp is attached to every depth frame at capture time, and a synchronous group comprises depth frames whose time differences are less than half a period (≤ 15 ms).
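
As an illustration (not the authors' code), a minimal Python sketch of this grouping rule might look as follows; the DepthFrame container and field names are assumptions:

```python
# Sketch of the temporal-alignment step described above: depth frames from
# the four Kinects are timestamped on arrival and grouped into a synchronous
# group when their timestamps differ by no more than 15 ms.
from dataclasses import dataclass
from typing import List, Optional

HALF_PERIOD_MS = 15.0  # half of the ~33 ms Kinect v2 frame period

@dataclass
class DepthFrame:
    camera_id: int        # 0..3
    timestamp_ms: float   # host timestamp attached at capture time
    data: object          # raw depth buffer (placeholder)

def make_sync_group(latest: List[Optional[DepthFrame]]) -> Optional[List[DepthFrame]]:
    """Return the four most recent frames if they fall within 15 ms of each
    other, otherwise None (i.e., wait for more frames)."""
    if any(f is None for f in latest):
        return None
    times = [f.timestamp_ms for f in latest]
    if max(times) - min(times) <= HALF_PERIOD_MS:
        return list(latest)
    return None
```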

Camera parameters: The four Kinect cameras are located at the four corners of an approximately square capture scene, and all are pointed toward the center of the scene. The cameras' intrinsic parameters can be read from the devices using the libfreenect2 API [17]. The extrinsic parameters can be easily calibrated with the help of a chessboard. We choose


one camera's coordinate system as the reference coordinate system, and spatially align all depth frames into this reference coordinate system.

B. Data Preprocessing

The raw depth data from the cameras contain noise. We remove the scene background and noisy outliers with the following steps: to remove the scene background, we discard all 3D points outside a virtual 3D cylindrical bounding box (radius 2 m and height 2.5 m) centered at the scene center. After removing 3D points with no neighbours within a small radius (about 2 mm), we obtain the foreground object as the largest connected region in 3D space.
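
A minimal sketch of this preprocessing, assuming the merged cloud is an (N, 3) NumPy array expressed in the reference frame with the scene center at the origin and the vertical direction along z; the connected-component step is omitted and the function names are illustrative:

```python
import numpy as np

def crop_cylinder(points: np.ndarray, radius: float = 2.0, height: float = 2.5) -> np.ndarray:
    """Keep only points inside a vertical cylinder around the scene center."""
    r = np.linalg.norm(points[:, :2], axis=1)                 # horizontal distance
    keep = (r <= radius) & (np.abs(points[:, 2]) <= height / 2.0)
    return points[keep]

def remove_isolated(points: np.ndarray, radius: float = 0.002, min_neighbors: int = 1) -> np.ndarray:
    """Drop points with no neighbor within `radius` (brute force O(N^2);
    a k-d tree would be used for real point clouds)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbor_count = (d < radius).sum(axis=1) - 1             # exclude the point itself
    return points[neighbor_count >= min_neighbors]
```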

C. Pose Database

3D human pose representation. As in [18], the 3D human body pose is defined as a DoF (Degree of Freedom) vector q ∈ R^36, including root (6 DoF), upperback (3 DoF), r/lclavicle (2 DoF), r/lhumerus (3 DoF), r/lradius (1 DoF), neck (2 DoF), head (1 DoF), r/lfemur (3 DoF), r/ltibia (1 DoF) and r/lfoot (2 DoF).
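
For illustration only, the per-joint DoF counts listed above can be tallied to confirm the 36-dimensional pose vector; the joint identifiers below are shorthand, not the paper's exact names:

```python
# Quick check of the 36-DoF pose parameterization listed above.
DOF = {"root": 6, "upperback": 3, "rclavicle": 2, "lclavicle": 2,
       "rhumerus": 3, "lhumerus": 3, "rradius": 1, "lradius": 1,
       "neck": 2, "head": 1, "rfemur": 3, "lfemur": 3,
       "rtibia": 1, "ltibia": 1, "rfoot": 2, "lfoot": 2}
assert sum(DOF.values()) == 36  # matches q in R^36
```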

3D human pose database. As in [18], we select a motion sequence database of close to 2.5 hours in total from the CMU human motion sequence database [19]; its movement types include walking, running, boxing, kicking, jumping, dancing, waving, fitness, golf, etc.

V. Geometry Model Database

3D human geometry model database. We use the CAESAR human geometry model database [20] with detailed information, which contains 1517 male geometric models in A-pose and 1531 female geometric models in A-pose. The topology of all mesh models in the database is consistent. A human geometry model is represented by a long vector s_i composed of the vertex set of the mesh model, and the database is represented by S = {s_i, i = 1, · · · , N}.

3D human geometry model prior. With the human geometry model database S, a global linear prior model of the human geometry model is established by principal component analysis (PCA) [21], and it is formulated as:

s(β) = P_{β,k} · β + s̄    (1)

where β is the low-dimensional parameter vector of the human geometry model, P_{β,k} is the matrix composed of the first k principal component vectors, and s̄ is the average vector of the human geometric models in the database. In our experiments, the principal component ratio is set to 95%, and the mean model is used as the template model. Moreover, the priors of the male and female human geometry models are constructed separately.
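
A minimal sketch of how such a PCA prior could be built with NumPy, assuming S is an N × 3V matrix of stacked vertex coordinates (one row per CAESAR model); the function names are illustrative, not the authors' implementation:

```python
import numpy as np

def build_shape_prior(S: np.ndarray, ratio: float = 0.95):
    """Build the PCA shape prior of equation (1), keeping enough components
    to explain `ratio` of the variance."""
    s_mean = S.mean(axis=0)
    X = S - s_mean
    # SVD of the centered data gives principal directions and variances.
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    var = sigma ** 2 / (S.shape[0] - 1)
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), ratio)) + 1
    P_k = Vt[:k].T                 # (3V, k) basis, columns are components
    Lambda_k = var[:k]             # retained eigenvalues (used later by E_beta-prior)
    return s_mean, P_k, Lambda_k

def shape_from_beta(beta: np.ndarray, s_mean: np.ndarray, P_k: np.ndarray) -> np.ndarray:
    """Equation (1): s(beta) = P_k @ beta + s_mean."""
    return P_k @ beta + s_mean
```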

VI. Skeleton Estimation

We deform the human geometric model in the pose dimension with a skeleton-driven method, and propose a novel method to automatically and accurately estimate the embedded skeleton of a new human geometry model that shares the geometric topology of the template model, given a known geometric template model and its embedded skeleton.

Parameterization of embedded skeleton joints. Given the human geometric template model and its embedded skeleton, the coordinates J_i of the joint centers of the human skeleton can be expressed as weighted linear combinations of the coordinates of their "nearest neighbor" vertices, as formalized in equation (2). Therefore, we can obtain the vertex weights when given the template geometric model and its embedded skeleton.

J_i = Σ_{v_{i,j} ∈ V_i} w_{i,j} · v_{i,j}    (2)

where V_i is the set of "nearest neighbor" vertices of the i-th joint of the embedded skeleton, v_{i,j} is the coordinate of the j-th vertex in V_i, w_{i,j} is the weight corresponding to the vertex v_{i,j}, and J_i is the coordinate of the embedded skeleton joint i.

Solution of vertex weights w. Given the human template geometric model and its embedded skeleton, the vertex weights can be estimated by solving a constrained linear least squares problem:

arg min_{w_{i,j}}  ‖ Σ_{v_{i,j} ∈ V_i} w_{i,j} · v_{i,j} − J_i ‖²   subject to   w_{i,j} ≥ 0    (3)

where J_i is the coordinate of the embedded skeleton joint i of the template model, V_i is the set of "nearest neighbor" vertices of the i-th joint of the embedded skeleton, v_{i,j} is the coordinate of the j-th vertex in V_i, and w_{i,j} is the weight corresponding to the vertex v_{i,j}. Here V_i, v_{i,j} and J_i are known, and the w_{i,j} are the unknowns.

Estimation of the embedded skeleton of a human geometry model. The estimated individual geometry model has the same mesh topology as the template model, and both are in A-pose. Therefore, according to equation (2), given the vertex coordinates of the individual geometry model and the vertex weights w_{i,j}, we can find the joint coordinates J_i of the embedded skeleton for the new human geometry model.
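
A minimal sketch of equations (2) and (3), assuming SciPy's non-negative least squares solver is used for the per-joint weights; function and variable names are illustrative, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import nnls  # solves min ||A x - b||_2 subject to x >= 0

def fit_joint_weights(template_vertices: np.ndarray, joint: np.ndarray,
                      neighbor_ids: np.ndarray) -> np.ndarray:
    """Equation (3) for one joint: non-negative weights over its 'nearest
    neighbor' template vertices so their weighted sum reproduces the joint."""
    A = template_vertices[neighbor_ids].T     # shape (3, m)
    w, _residual = nnls(A, joint)             # w >= 0, shape (m,)
    return w

def estimate_joint(new_vertices: np.ndarray, neighbor_ids: np.ndarray,
                   w: np.ndarray) -> np.ndarray:
    """Equation (2) on the new model: J_i = sum_j w_ij * v_ij."""
    return new_vertices[neighbor_ids].T @ w
```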

VII. Geometric Model Estimation

In this section, we describe how to formulate the automatic estimation of the detailed individual geometric model as a nonlinear optimization problem, together with its iterative optimization method. Specifically, given the 3D point cloud P captured in the current frame, the human geometric model database S and the pose database Q, the individual human geometric model can be obtained (parameterized as


the human pose parameter vector q and the human geometric model parameter vector β). It can be formalized as:

q, β = arg min_{q,β}  λ_1 E_point + λ_2 E_plane + λ_3 E_{β-prior} + λ_4 E_{bone-balance} + λ_5 E_{q-prior}    (4)

where E_point and E_plane are the corresponding-point terms, E_{bone-balance} is the skeleton length symmetry term, E_{q-prior} is the human pose prior term, E_{β-prior} is the human geometry model prior term, and λ_1, λ_2, λ_3, λ_4, λ_5 are the weights of the respective energy terms. By introducing E_{q-prior} and E_{β-prior} into the optimization objective function as the global prior of human body pose and shape, the rationality and accuracy of the individual 3D model can be significantly improved.

Usually the pose of a real actor differs from the standard A-pose in the model database, and the difference may be small or large. In order to accurately estimate the detailed individual mesh model, the template model should deform in the shape dimension as well as in the pose dimension. In the iterative process, we assume that the topology of the embedded human skeleton remains the same, while the lengths of the skeleton segments change naturally with the changes of the human geometry model.

Corresponding-point term. This term minimizes the distance between the vertex set of the reconstructed human geometry model and the captured 3D point cloud. In order to improve the accuracy and rationality of closest-point matching, both the point-to-point and point-to-plane metrics [22], [23] are used.

E_point = Σ_i ‖v_i(q, β) − p*_i‖_p    (5)

E_plane = Σ_i ‖n_i^T(q, β) · (v_i(q, β) − p*_i)‖_p    (6)

where ‖·‖_p is the p-norm. The point-to-point distance is the closest Euclidean distance between a human geometry model vertex v_i(q, β) and its matched captured 3D point p*_i. The point-to-plane distance is the residual between the captured point p*_i and the model vertex v_i(q, β), measured along the vertex normal n_i(q, β). Usually, increasing the ratio of E_plane to E_point guarantees good convergence of the algorithm [23]; it is set to 1:3 in our experiments.
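
For concreteness, a small sketch of the two correspondence terms for a given set of precomputed closest-point matches, written here with the squared L2 norm rather than the paper's p-norm; names are illustrative:

```python
import numpy as np

def e_point(model_vertices: np.ndarray, matched_points: np.ndarray) -> float:
    """Point-to-point term (eq. (5)): distances between model vertices and
    their matched captured points."""
    return float(np.sum(np.linalg.norm(model_vertices - matched_points, axis=1) ** 2))

def e_plane(model_vertices: np.ndarray, model_normals: np.ndarray,
            matched_points: np.ndarray) -> float:
    """Point-to-plane term (eq. (6)): residuals projected on the vertex normals."""
    r = np.einsum("ij,ij->i", model_normals, model_vertices - matched_points)
    return float(np.sum(r ** 2))
```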

Symmetric skeleton length term. This term encourages symmetric skeletal segments to have the same length. In our experiments, we assume only the four limbs are symmetric. Therefore, in order to ensure the rationality of the reconstructed human geometry model, a skeleton length symmetry energy term is introduced:

E_{bone-balance} = Σ_{(m,n) ∈ (M,N)} ‖l_m − l_n‖²    (7)

where (M, N) is the set of symmetric skeletal segment pairs, and l_m and l_n are the lengths of the corresponding left and right bones, respectively.

Human geometry model prior term. This term penalizes the deviation of the reconstructed individual geometry model from the probability distribution of the global space spanned by the human geometry models in the database. Assuming that the human geometry model database follows a multivariate normal distribution in the global space, the human geometric model prior is defined as maximizing the conditional probability in equation (8). In general, this probability maximization problem can be converted into the energy minimization problem in equation (9):

Pr(s(β) | s_1, · · · , s_N) = exp(−½ (s(β) − s̄)^T Λ_β^{−1} (s(β) − s̄)) / ((2π)^{d/2} |Λ_β|^{1/2})    (8)

E_{β-prior} = (s(β) − s̄)^T Λ_β^{−1} (s(β) − s̄) = β^T P_{β,k}^T Λ_β^{−1} P_{β,k} β    (9)

where Λ_β is the covariance of the human geometry model database restricted to the first k principal components, β is the unknown low-dimensional parameter vector of the human body model, and P_{β,k} and s̄ are, respectively, the matrix of the first k principal component vectors and the average vector obtained in Section V.
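
Assuming orthonormal principal components, equation (9) reduces to a diagonally weighted sum over β, as in this illustrative sketch (Lambda_k denotes the eigenvalues retained by the PCA of Section V, as in the earlier build_shape_prior sketch):

```python
import numpy as np

def e_beta_prior(beta: np.ndarray, Lambda_k: np.ndarray) -> float:
    """E_beta-prior = beta^T diag(1/Lambda_k) beta, i.e. the Mahalanobis
    distance of s(beta) from the mean shape in the PCA subspace."""
    return float(np.sum(beta ** 2 / Lambda_k))
```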

Human pose prior term. This term penalizes the deviation of the reconstructed individual human pose from the probability distribution of the global space spanned by the human pose database. Given the human pose database Q = {q_i, i = 1, · · · , H}, we also employ principal component analysis (PCA) [21] to establish a global linear prior model of human pose. It can be formalized as:

q = P_{q,b} · w + q̄    (10)

where w is the low-dimensional parameter vector of the human body pose, P_{q,b} is the matrix composed of the first b principal component vectors, and q̄ is the mean vector of the human poses in the database. The principal component ratio is set to 95%. Similarly, assuming that the human pose database follows a multivariate normal distribution in the global space, the human pose prior can be defined as maximizing the conditional probability in equation (11), which can be converted into equation (12):

Pr(q | q_1, · · · , q_H) ∝ exp(−‖P_{q,b} · (P_{q,b}^T · (q − q̄)) + q̄ − q‖² / (2δ²_{q-prior}))    (11)

E_{q-prior} = ‖P_{q,b} · (P_{q,b}^T · (q − q̄)) + q̄ − q‖²    (12)

where q is the unknown human pose vector, and δ_{q-prior} controls the softness of this body pose prior constraint.
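
An illustrative sketch of equation (12), assuming P_qb has orthonormal columns and q_mean is the database mean pose; the names are not the authors':

```python
import numpy as np

def e_q_prior(q: np.ndarray, P_qb: np.ndarray, q_mean: np.ndarray) -> float:
    """Penalize the residual between a pose q and its reconstruction from
    the first b pose principal components."""
    reconstruction = P_qb @ (P_qb.T @ (q - q_mean)) + q_mean
    return float(np.sum((reconstruction - q) ** 2))
```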


Optimization method. Combining equations (5), (6), (7), (9) and (12), the energy in equation (4) can be converted into:

q, β = arg min_{q,β}  λ_1 Σ_i ‖v_i(q, β) − p*_i‖_p + λ_2 β^T P_{β,k}^T Λ_β^{−1} P_{β,k} β
                     + λ_3 Σ_i ‖n_i^T(q, β) · (v_i(q, β) − p*_i)‖_p
                     + λ_4 Σ_{(m,n) ∈ (M,N)} ‖l_m − l_n‖²
                     + λ_5 ‖P_{q,b} · (P_{q,b}^T · (q − q̄)) + q̄ − q‖²    (13)

Since the captured 3D point cloud usually contains much noise, some vertices of the human geometric model inevitably get matched to noise points in the captured point cloud, which affects the accuracy of the estimated human geometry model. Therefore, in our experiments the p-norm is taken as the robust L_{2,1} norm rather than the squared L_2 norm. The weights λ_1, λ_2, λ_3, λ_4, λ_5 of the energy terms are set to 1, 3, 1E3, 0.03 and 10, respectively.

The nonlinear optimization problem of equation (13) over the human pose parameter vector q and the human geometric model parameter vector β is solved with a joint optimization strategy based on Particle Swarm Optimization (PSO) [24]–[26]. The optimization runs for 10 iterations, and 2000 particles are generated randomly at each iteration. Because the particles are independent of each other, the GPU is used to accelerate the computation. In the experiments, the efficiency of the algorithm was tested 200 times; on average it converges to an approximately optimal solution within 6-7 iterations.
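
A minimal, generic PSO sketch of this joint optimization, with `energy` standing in for the weighted sum of the five terms of equation (13); the inertia and acceleration values and the initialization are assumptions, not the paper's settings:

```python
import numpy as np

def pso_minimize(energy, dim, n_particles=2000, n_iters=10,
                 w=0.7, c1=1.5, c2=1.5, init_scale=1.0, seed=0):
    """Generic particle swarm minimization over a dim-dimensional vector
    (here the concatenated pose q and shape beta parameters)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=init_scale, size=(n_particles, dim))  # particle positions
    v = np.zeros_like(x)                                       # particle velocities
    pbest = x.copy()
    pbest_f = np.array([energy(p) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()                       # global best
    for _ in range(n_iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        f = np.array([energy(p) for p in x])                   # independent; GPU-parallel in the paper
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g = pbest[np.argmin(pbest_f)].copy()
    return g, float(pbest_f.min())
```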

VIII. Results

In the following experiments, several virtual human models with large differences in stature and several real actors were randomly selected in a variety of poses (A-pose and other poses). By comparing with the SCAPE method [16], a state-of-the-art data-driven method and the work most related to ours, we demonstrate the accuracy and rationality of our individual geometry model estimation method on synthesized data sets with and without noise and on real data captured from multiple cameras. We then apply the individual geometry model estimation method in a real human motion capture system to verify its effectiveness in a realtime setting.

The experimental environment is a PC with an Intel Core i7-6850K CPU (3.6 GHz, 6 cores, 12 threads), 128 GB of main memory, and a CUDA-enabled NVIDIA GeForce GTX 1080 graphics card (8 GB memory, Pascal architecture, 1.733 GHz).

Evaluation of the pose and shape prior terms. The results of the experiment with and without E_{q-prior} and E_{β-prior} are shown in Fig. 2 and Fig. 3(a), respectively; they show that these two energy terms significantly improve the rationality and accuracy of the reconstructed model.

Figure 2. Evaluation of the pose prior. Top row: results without E_{q-prior}; bottom row: results with E_{q-prior}.

Figure 3. Reconstruction error. (a) Evaluation of the shape prior. (b) Reconstruction error comparison.

Synthesized clean 3D point cloud. In this experiment, the average model of the CAESAR database is used as the human template model. The SCAPE model is trained on the template model with 70 template models in different poses (based on the multi-view pose database supplied by SCAPE, obtained by the method of [27]).

The purpose of this experiment is to compare the modeling ability of our method with SCAPE [16]. The input data are 3D point clouds synthesized from 45 male human body models randomly selected from the CAESAR database, in three different poses (one similar to the template A-pose, the other two obviously different).

For the 45 test models in the 3 poses, the average reconstruction errors of the human geometrical models are shown in Table I. Pose1 is the pose in the first row of Fig. 4; Pose2 and Pose3 are the second and third rows, respectively. Partial comparison results are shown in Fig. 4; they show that our method is more accurate and robust than the classical SCAPE method.

Table I
Reconstruction error on clean data

         SCAPE's          Ours
Pose1    1.79 ± 0.88 cm   1.74 ± 0.85 cm
Pose2    2.87 ± 1.52 cm   2.13 ± 1.08 cm
Pose3    4.4 ± 5.6 cm     2.1 ± 1.02 cm

Synthesized noisy 3D point cloud. This experiment is the same as the previous one, except that the input data are the 3D point clouds with added random Gaussian noise. Table II and Fig. 5 again show that our method is more accurate and robust than the classical SCAPE method.

Comparing the reconstruction results on the noisy and clean synthetic data sets, as shown in Fig. 3(b), it is obvious that


Figure 4. Comparison test on synthetic clean data. Columns a1 and a2 are the input, b1 and b2 are the results of SCAPE [16], c1 and c2 are its reconstruction errors, d1 and d2 are our results, and e1 and e2 are our errors.

Table II
Reconstruction error on noisy data

         SCAPE's          Ours
Pose1    2.69 ± 1.21 cm   1.76 ± 0.87 cm
Pose2    9.04 ± 6.79 cm   2.19 ± 1.1 cm

the reconstruction accuracy of our method remains nearly unchanged, while the reconstruction error of SCAPE degrades considerably on noisy data. Therefore, our method performs better than SCAPE in anti-noise ability.

Online test on the point clouds of real actors. In the previous experiments using synthetic 3D point clouds, the virtual human models are scanned models wearing only underwear, but in actual 3D human motion capture scenarios the performers are normally dressed. The online test on normally dressed actors, comparing the modeling ability of our method with SCAPE [16], also shows that our method is more accurate and robust than the classical SCAPE method, as shown in Fig. 6.

According to the above experimental results, both our method and the SCAPE method [16] can reconstruct a reasonable and accurate 3D human geometric model and pose. However, when the target 3D human pose differs greatly from the template model, our method can still reasonably and accurately reconstruct the 3D body shape and pose, while the reconstruction result of SCAPE is obviously unreasonable (highlighted by the red circles in Fig. 5 and Fig. 6). The main reasons are as follows. Firstly, the SCAPE method requires accurate point correspondences between the non-rigid model and

Figure 5. Comparison test on synthetic noisy data. Columns a1 and a2 are the input, b1 and b2 are the results of SCAPE [16], c1 and c2 are its reconstruction errors, d1 and d2 are our results, and e1 and e2 are our errors.

Figure 6. Online test results. Column a is the captured point cloud, columns b and c are our results, and columns d and e are SCAPE's.

the target 3D point cloud to ensure accurate 3D human pose estimation and large-scale deformation. These accurate correspondences usually need to be specified by hand [16], or, as in [28], the actor's height and weight are given so that an initial pose very similar (or consistent) with the pose of the target point cloud can be estimated. Secondly, in our method the deformation of the limb segments of the 3D human pose is rigid, which guarantees reasonable large-scale deformation, while in the SCAPE method it is non-rigid and cannot guarantee large-scale pose deformation. In addition, the variables solved in our iterative optimization form a low-dimensional vector of shape and pose parameters (51-dimensional in the experiments), while in the SCAPE method they are the transformation matrices of each patch (close to 100,000 dimensions in the experiments); it is obvious that our efficiency is better than that of the SCAPE method. Therefore, our method is more suitable for online application than the SCAPE method.

Application. This experiment focuses on the effectiveness and realtime performance of our motion tracking method on real captured 3D point cloud data. In the experiment, actors of different sizes were asked to complete a variety of motions (including walking, walking and kicking, boxing, etc.). Some test results are shown in Fig. 7(a) and Fig. 7(b), respectively. The experimental results show that our motion tracking method can be applied to people of different sizes, is suitable for a variety of daily actions, and obtains accurate and reasonable motion estimation results, which verifies the effectiveness of the proposed method. In addition, the average frame rate on the above captured 3D human motion sequences is 20.25 fps, which verifies the realtime performance of the proposed method.

Figure 7. Application results. (a) Alternate rolling hands. (b) Heterogeneous motion.


IX. Conclusion

This paper focuses on how to quickly and reasonably estimate the individual geometric model of the human body, based on the global human pose and shape prior model, from the 3D point clouds captured by multiple depth cameras. Our method successfully integrates the prior information of the human geometric model and human pose into a unified optimization problem, making the estimated human model more accurate and reasonable. Due to the use of multi-view depth camera data, the ill-posed problem caused by occlusion or self-occlusion can be alleviated to some extent. The experiments show that, based on the proposed model estimation method, the multi-view 3D human motion capture system can reconstruct and track reasonable and accurate 3D human motion sequences in realtime and online.

However, the influence of human muscle and skin on the deformation of the human model geometry has not been considered in our method, which results in a lack of local detail in the reconstructed human model. We will try to address this problem in future research work.

Acknowledgment

This work was supported by the Knowledge Innovation Program of the Institute of Computing Technology of the Chinese Academy of Sciences under Grant No. ICT20166040, the Science and Technology Service Network Initiative of the Chinese Academy of Sciences under Grant No. KFJ-STS-ZDTP-017, and the National Natural Science Foundation of China under Grant No. 61772499.

References

[1] X. Wei, P. Zhang, and J. Chai, "Accurate realtime full-body motion capture using a single depth camera," ACM Trans. Graph., vol. 31, no. 6, pp. 1–12, 2012.

[2] P. Zhang, K. Siu, J. Zhang, C. K. Liu, and J. Chai, "Leveraging depth cameras and wearable pressure sensors for full-body kinematics and dynamics capture," ACM Trans. Graph., vol. 33, no. 6, p. 221, 2014.

[3] C. Plagemann, V. Ganapathi, D. Koller, and S. Thrun, "Real-time identification and localization of body parts from depth images," in IEEE International Conference on Robotics and Automation, 2010, pp. 3108–3113.

[4] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake, "Real-time human pose recognition in parts from single depth images," in Computer Vision and Pattern Recognition, 2011, pp. 1297–1304.

[5] R. Girshick, J. Shotton, P. Kohli, and A. Criminisi, "Efficient regression of general-activity human poses from depth images," in International Conference on Computer Vision, 2011, pp. 415–422.

[6] D. Grest, J. Woetzel, and R. Koch, "Nonlinear body pose estimation from depth images," in Joint Pattern Recognition Symposium, 2005, pp. 285–292.

[7] S. Knoop, S. Vacek, and R. Dillmann, "Fusion of 2d and 3d sensor data for articulated body tracking," Robotics and Autonomous Systems, vol. 57, no. 3, pp. 321–329, 2009.

[8] D. Grest, V. Krüger, and R. Koch, "Single view motion tracking by depth and silhouette information," in Image Analysis, Scandinavian Conference, SCIA 2007, Aalborg, Denmark, June 10-14, 2007, Proceedings, 2007, pp. 719–729.

[9] R. M. Friborg, S. Hauberg, and K. Erleben, GPU Accelerated Likelihoods for Stereo-Based Articulated Tracking. Springer Berlin Heidelberg, 2010.

[10] K. Berger, K. Ruhl, Y. Schroeder, C. Bruemmer, A. Scholz, and M. A. Magnor, "Markerless motion capture using multiple color-depth sensors," in Vision, Modeling, and Visualization Workshop 2011, Berlin, Germany, 4-6 October, 2011, pp. 317–324.

[11] G. Ye, Y. Liu, N. Hasler, X. Ji, Q. Dai, and C. Theobalt, Performance Capture of Interacting Characters with Handheld Kinects. Springer Berlin Heidelberg, 2012.

[12] M. Siddiqui and G. Medioni, "Human pose estimation from a single view point, real," Dissertations and Theses-Gradworks, vol. 32, no. 12, pp. 1–8, 2010.

[13] A. Baak, M. Müller, G. Bharaj, and H. P. Seidel, "A data-driven approach for real-time full body pose reconstruction from a depth camera," in International Conference on Computer Vision, 2011, pp. 1092–1099.

[14] M. Ye, R. Yang, and M. Pollefeys, "Accurate 3d pose estimation from a single depth image," in IEEE International Conference on Computer Vision, 2011, pp. 731–738.

[15] Y. Zhu, B. Dariush, and K. Fujimura, "Kinematic self retargeting: A framework for human pose estimation," Computer Vision and Image Understanding, vol. 114, no. 12, pp. 1362–1375, 2010.

[16] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, "SCAPE: shape completion and animation of people," ACM Transactions on Graphics, vol. 24, no. 3, pp. 408–416, 2005.

[17] "OpenKinect/libfreenect2," 2017. [Online]. Available: https://github.com/OpenKinect/libfreenect2

[18] L. Su, J. X. Chai, and S. H. Xia, "Local pose prior based 3d human motion capture from depth camera," Ruan Jian Xue Bao/Journal of Software (in Chinese), vol. 27, no. 2, pp. 172–183, 2016.

[19] "CMU mocap database," 2017. [Online]. Available: http://mocap.cs.cmu.edu/

[20] "CAESAR human geometric model database," 2017. [Online]. Available: http://store.sae.org/caesar/

[21] K. Pearson, "LIII. On lines and planes of closest fit to systems of points in space," Philosophical Magazine, vol. 2, no. 11, pp. 559–572, 1901.

[22] K. L. Low, "Linear least-squares optimization for point-to-plane ICP surface registration," Chapel Hill, 2004.

[23] H. Li, B. Adams, L. J. Guibas, and M. Pauly, "Robust single-view geometry and motion reconstruction," ACM Trans. Graph., vol. 28, no. 5, p. 175, 2009.

[24] J. Kennedy and R. C. Eberhart, "Particle swarm optimization," in Proceedings of IEEE International Conference on Neural Networks, vol. 4, 1995, pp. 1942–1948.

[25] F. Marini and B. Walczak, "Particle swarm optimization (PSO). A tutorial," Chemometrics and Intelligent Laboratory Systems, vol. 149, no. B, pp. 153–165, 2015.

[26] J. C. Bansal, P. K. Singh, M. Saraswat, A. Verma, S. S. Jadon, and A. Abraham, "Inertia weight strategies in particle swarm optimization," in Nature and Biologically Inspired Computing, 2011, pp. 633–640.

[27] R. W. Sumner and J. Popović, "Deformation transfer for triangle meshes," ACM Trans. Graph., vol. 23, no. 3, pp. 399–405, 2004.

[28] A. Weiss, D. Hirshberg, and M. Black, "Home 3d body scans from noisy image and range data," in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 1951–1958.