
3D Face Modeling From Diverse Raw Scan Data

Feng Liu, Luan Tran, Xiaoming Liu
Department of Computer Science and Engineering, Michigan State University, East Lansing MI 48824
{liufeng6, tranluan, liuxm}@msu.edu

Abstract

Traditional 3D face models learn a latent representation of faces using linear subspaces from limited scans of a single database. The main roadblock to building a large-scale face model from diverse 3D databases lies in the lack of dense correspondence among raw scans. To address these problems, this paper proposes an innovative framework to jointly learn a nonlinear face model from a diverse set of raw 3D scan databases and establish dense point-to-point correspondence among their scans. Specifically, by treating input scans as unorganized point clouds, we explore the use of PointNet architectures for converting point clouds to identity and expression feature representations, from which the decoder networks recover their 3D face shapes. Further, we propose a weakly supervised learning approach that does not require correspondence labels for the scans. We demonstrate the superior dense correspondence and representation power of our proposed method, and its contribution to single-image 3D face reconstruction.

1. Introduction

Robust and expressive 3D face modeling is valuable for computer vision problems, e.g., 3D reconstruction [7, 24, 41, 54] and face recognition [42, 43, 58], as well as computer graphics problems, e.g., character animation [15, 31]. The state-of-the-art 3D face representations mostly adopt linear transformations [39, 59, 60], e.g., the 3D Morphable Model (3DMM), or higher-order tensor generalizations [1, 13, 14, 67], e.g., blendshape models. However, these linear models fall short of capturing nonlinear deformations such as high-frequency details and extreme expressions. Recently, with the advent of deep learning, there have been several attempts at using deep neural networks for nonlinear data-driven face modeling [4, 32, 51, 65].

To model 3D face shapes, a large amount of high-quality 3D scans is required. The widely used 3DMM-based BFM2009 [48] is built from scans of merely 200 subjects in neutral expressions.

Figure 1: Comparison between 3D face modeling of (a) existing methods and (b) our proposed method. Dense point-to-point correspondence is a prerequisite for existing 3D face modeling methods. Our proposed CNN-based approach learns face models directly from raw scans of multiple 3D face databases and establishes dense point-to-point correspondence among all scans (best viewed in color). Despite the diversity of scans in resolution and expression, our model can express a fine level of detail.

Lack of expression may be compensated with expression bases from FaceWarehouse [14] or BU3DFE [70]. After more than a decade, almost all existing models use fewer than 300 training subjects. Such a small training set is far from adequate to describe the full variability of faces. Only recently, Booth et al. [11, 12] built the first Large-Scale Face Model (LSFM) from neutral scans of 9,663 subjects. Unfortunately, with only the resultant linear 3DMM bases being released instead of the original scans, we cannot fully leverage this large database to explore different 3D modeling techniques.

In fact, there are many publicly available 3D face databases, as shown in Fig. 1. However, these databases are often used individually, rather than jointly to create large-scale face models. The main hurdle lies in the challenge of estimating dense point-to-point correspondence for raw scans, which allows these scans to be organized in the same vector space, enabling analysis as a whole.

Dense point-to-point correspondence is one of the most fundamental problems in 3D face modeling [22, 26], and can be defined as in [22]: given two 3D faces S and S′, the correspondence should satisfy three criteria: i) S and S′ have the same number of vertices; ii) the corresponding points share the same semantic meaning; iii) the corresponding points lie in the same local topological triangle context. Prior dense correspondence methods [3, 7, 24, 47] lack either accuracy, robustness or automation. Moreover, few of them have shown success on multiple databases. Beyond the data scale, the challenge of dense correspondence for multiple databases is certainly escalated over a single database: the quality of scans is often inevitably corrupted with artifacts (e.g., hair and eyebrows), missing data and outliers; facial morphology varies significantly due to expressions such as mouth opening and closing; and different databases exhibit high variability in resolution.

To address these challenges, we propose a novel encoder-decoder framework to learn face models directly from raw 3D scans of multiple diverse databases, as well as establish dense correspondence among them. Our approach provides: i) a PointNet-based encoder that learns nonlinear identity and expression latent representations of 3D faces; ii) a corresponding decoder capable of establishing dense correspondence for scans with a variety of expressions and resolutions; iii) a decoder that can be plugged into existing image-based encoders for 3D face reconstruction. Specifically, by treating raw scans as unorganized point clouds, we explore the use of PointNet [50] for converting point clouds to identity and expression representations, from which the decoder recovers their 3D face shapes.

However, full supervision is often not available due to the lack of ground-truth dense correspondence. Thus, we propose a weakly-supervised approach with a mixture of synthetic and real 3D scans. Synthetic data with topological ground truth help to learn a shape correspondence prior in a supervised fashion, which allows us to incorporate order-invariant loss functions, e.g., the Chamfer distance [21], for unsupervised training on real data. Meanwhile, a surface normal loss retains the original high-frequency details. For regularization, we use an edge length loss to encourage the triangulation topology of the template and of the reconstructed point cloud to be the same. Finally, a Laplacian regularization loss improves the performance in mouth regions with extreme expressions. These strategies allow the network to learn from a large set of raw 3D scan databases without any correspondence labels. In summary, the contributions of this work include:

• We propose a new encoder-decoder framework that, for the first time, jointly learns face models directly from raw scans of multiple 3D face databases and establishes dense correspondence among all scans.

• We devise a weakly-supervised learning approach and several effective loss functions for the proposed framework, which leverage known correspondences from synthetic data and relax the Chamfer distance loss for vertex correspondence in an unsupervised fashion.

• We demonstrate the superiority of our nonlinear model in preserving high-frequency details of 3D scans, providing a compact latent representation, and its application to single-image 3D face reconstruction.

Table 1: Comparison of 3D face modeling from scans. 'Exp.' indicates whether the method learns an expression latent space; 'Corr.' indicates whether densely corresponded scans are required in training.

Method               Dataset            Lin./nonlin.   #Subj.      Exp.   Corr.
BFM [48]             BFM                Linear         200         No     Yes
GPMMs [45]           BFM                Linear         200         Yes    Yes
LSFM [11, 12]        LSFM               Linear         9,663       No     Yes
LYHM [19]            LYHM               Linear         1,212       No     Yes
Multil. model [14]   FWH                Linear         150         Yes    Yes
FLAME [39]           CAESAR, D3DFACS    Linear         3,800, 10   Yes    Yes
VAE [4]              Proprietary        Nonlin.        20          No     Yes
MeshAE [51]          COMA               Nonlin.        12          No     Yes
Jiang et al. [32]    FWH                Nonlin.        150         Yes    Yes
Proposed             7 datasets         Nonlin.        1,552       Yes    No

2. Related Work

3D Face Modeling. Traditional 3DMMs [7, 8] model geometry variation from limited data via PCA. Paysan et al. [48] build BFM2009, the publicly available morphable model in neutral expression, which is extended to emotive face shapes [2]. Gerig et al. [24, 45] propose the Gaussian Process Morphable Models (GPMMs) and release a new BFM2017. Facial expressions can also be represented with higher-order generalizations. Vlasic et al. [67] use a multilinear tensor-based model to jointly represent identity and expression variations. FaceWarehouse (FWH) [14] is a popular multilinear 3D face model. The recent FLAME model [39] additionally models head rotation. However, all these works adopt a linear space, which is over-constrained and might not represent high-frequency deformations well.

Deep models have been successfully used for 3D face fitting, which recovers 3D shape from 2D images [20, 33–35, 52, 59, 62, 72]. However, in these works the linear model is learned a priori and fixed during fitting, unlike ours where the nonlinear model is learned during training.

In contrast, applying CNNs to learn more powerful 3D face models has been largely overlooked. Recently, Tran et al. [63, 64] learn to regress the 3DMM representation along with decoder-based models. SfSNet [57] learns shape, albedo and lighting decomposition of a face from 2D images instead of 3D scans. Bagautdinov et al. [4] learn nonlinear face geometry representations directly from UV maps via a VAE. Ranjan et al. [51] introduce a convolutional mesh autoencoder to learn nonlinear variations in shape and expression.


Figure 2: Overview of our 3D face modeling method. A mixture of synthetic and real data is used to train the encoder-decoder network with supervised (green) and unsupervised (red) losses. Our network can be used for 3D dense correspondence and 3D face reconstruction.

Note that [4, 51] train with no more than 20 subjects and encode the 3D data into a single latent vector. Jiang et al. [32] extend [51] to decompose a 3D face into identity and expression parts. Unlike our work, these three methods require densely corresponded 3D scans in training. We summarize the comparison in Tab. 1.

3D Face Dense Correspondence. As a fundamental shape analysis task, correspondence has been well studied in the literature. Shape correspondence, a.k.a. registration, alignment or simply matching [66], finds a meaningful mapping between two surfaces. The granularity of the mapping varies greatly, from semantic parts [28, 53] and groups [17] to points [38]. Within this range, point-to-point correspondence for 3D faces is the most challenging and strict. In the original 3DMM [7], 3D face dense correspondence is solved with a regularized form of optical flow as a cylindrical image registration task. This is only effective in constrained settings, where subjects share similar ethnicities and ages. To overcome this limitation, Patel and Smith [47] use a Thin Plate Splines (TPS) [10] warp to register scans to a template. Alternatively, Amberg et al. [3] propose an optimal-step Nonrigid Iterative Closest Point (NICP) method for registering 3D shapes. Booth et al. [11, 12] quantitatively compare these three popular dense correspondence techniques in learning a 3DMM. Additional extensions have also been proposed [22, 24, 25, 71].

Many algorithms [1, 9, 27] treat dense correspondence as a 3D-to-3D model fitting problem. E.g., [9] proposes a multilinear groupwise model for 3D face correspondence to decouple identity and expression variations. Abrevaya et al. [1] propose a 3D face autoencoder with a CNN-based depth image encoder and a multilinear model as a decoder for 3D face fitting. However, these methods require 3D faces with an initial correspondence as input, and the correspondence problem is considered in the restrictive space expressed by the model.

Although insightful and useful, a chicken-and-egg problem still remains unsolved [22].

To summarize, prior work tackles the problems of 3D face modeling and 3D face dense correspondence separately. However, dense correspondence is a prerequisite for modeling: if the correspondence has errors, they will accumulate and propagate to the 3D model. These two problems are therefore highly related, and our framework for the first time tackles them simultaneously.

3. Proposed Method

This section first introduces a composite 3D face shape model with latent representations. We then present the mixed training data and our encoder-decoder network. We finally provide implementation details and face reconstruction inference. Figure 2 depicts an overview of our method.

3.1. Problem Formulation

In this paper, the output 3D face scans are represented as point clouds. Each densely aligned 3D face S ∈ R^{n×3} is represented by concatenating its n vertex coordinates as

\[
\mathbf{S} = [x_1, y_1, z_1;\ x_2, y_2, z_2;\ \cdots;\ x_n, y_n, z_n].
\tag{1}
\]

We assume that a 3D face shape is composed of an identity part and an expression deformation part,

\[
\mathbf{S} = \mathbf{S}_{Id} + \Delta\mathbf{S}_{Exp},
\tag{2}
\]

where SId is the identity shape and ∆SExp is the expression difference. Since the identity and expression spaces are independent, we further assume these two parts can be described by respective latent representations, fId and fExp.

Specifically, as shown in Fig. 2, we use two networks to decode the shape components SId and ∆SExp from their corresponding latent representations.


Table 2: Summary of training data from related databases.

Database         #Subj.   #Neu. scans   #Samples   #Exp. scans   #Samples
BU3DFE [70]      100      100           1,000      2,400         2,400
BU4DFE [69]      101      >101          1,010      >606          2,424
Bosphorus [56]   105      299           1,495      2,603         2,603
FRGC [49]        577      3,308         6,616      1,642         1,642
Texas-3D [30]    116      813           1,626      336           336
MICC [5]         53       103           515        −             −
BJUT-3D [6]      500      500           5,000      −             −
Real Data        1,552    5,224         17,262     7,587         9,405
Synthetic Data   1,500    1,500         15,000     9,000         9,000

Formally, given a set of raw 3D faces {S^raw_i}_{i=1}^N, we learn an encoder E: S^raw → (fId, fExp) that estimates the identity and expression shape parameters fId ∈ R^{lId} and fExp ∈ R^{lExp}, an identity shape decoder DId: fId → SId, and an expression shape decoder DExp: fExp → ∆SExp, which together decode the shape parameters into a 3D shape estimate S.

Recent attempts to encode 3D face shapes with deep learning use point clouds, depth maps [1], UV-map based meshes [4], and mesh surfaces [32, 51]. Point clouds are a standard and popular 3D face acquisition format used by Kinect, the iPhone's Face ID, and structured-light scanners. We thus design a deep encoder-decoder framework that directly consumes unorganized point sets as input and outputs densely corresponded 3D shapes. Before providing the algorithmic details, we first introduce the real and synthetic training data used for the weakly-supervised learning.

3.2. Training Data

To learn a robust and highly variable 3D face model, we construct training data from seven publicly available 3D databases with a wide variety of identity, age, ethnicity, expression and resolution, listed in Tab. 2. However, for these real scans there is no associated ground truth on dense correspondence. Recently, additional 3D databases have been released, such as 4DFAB [16], Multi-Dim [40] and UHDB3D [61, 68]. While including them might increase the amount of training data, they do not provide new types of variation beyond the seven databases. We do not use the occlusion and pose (self-occlusion) data of the Bosphorus database, since extreme occlusion or missing data would break the semantic correspondence consistency of 3D faces. For the BU4DFE database, we manually select one neutral and 24 expression scans per subject. To keep the balance between real and synthetic data, we use BFM2009 to synthesize 3D faces of 1,500 subjects, and use the 3DDFA [73] expression model to generate 6 random expressions for each subject. Figure 3 shows one example scan from each of the eight databases.
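To make the synthetic half of the training set concrete, below is a minimal NumPy sketch of how shapes could be sampled from a linear identity model plus a linear expression model, in the spirit of the BFM2009 + 3DDFA synthesis described above. The file name, array keys, and the Gaussian coefficient sampling are illustrative assumptions, not the actual BFM2009 / 3DDFA data formats.

```python
import numpy as np

# Hypothetical container for a linear face model; these names are placeholders.
model = np.load("linear_face_model.npz")
mu    = model["mean_shape"]   # (3n,) mean face with n vertices
W_id  = model["id_basis"]     # (3n, k_id) identity bases
s_id  = model["id_sigma"]     # (k_id,) per-component standard deviations
W_exp = model["exp_basis"]    # (3n, k_exp) expression bases
s_exp = model["exp_sigma"]    # (k_exp,) per-component standard deviations

rng = np.random.default_rng(0)

def sample_neutral():
    """Draw one synthetic neutral identity: mean shape plus identity deformation."""
    alpha = rng.normal(size=s_id.shape) * s_id
    return mu + W_id @ alpha

def add_random_expression(s_id_shape):
    """Add a random expression offset, i.e. S = S_Id + dS_Exp as in Eq. (2)."""
    beta = rng.normal(size=s_exp.shape) * s_exp
    return s_id_shape + W_exp @ beta

# The paper synthesizes 1,500 identities with 6 random expressions each (Tab. 2);
# here we only draw a handful of examples.
neutral = sample_neutral()
expressive = [add_random_expression(neutral) for _ in range(6)]
```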

Preprocessing and data augmentation  As visualized in Fig. 4, we first predefine a template of 3D face topology consisting of n = 29,495 vertices and 58,366 triangles, which is manually cropped from the BFM mean shape. Then, we normalize the template into a unit sphere.

Figure 3: One sample from each of the eight 3D face databases. They exhibit a wide variety of expressions and resolutions.

Figure 4: Preprocessing. (1) Automatic 3D landmark detection based on rendered images. (2) Guided by the landmarks and the predefined template, the 3D scan is aligned and cropped.

The original synthetic examples contain 53,215 vertices, after removing points on the tongue. For synthetic examples, we crop their face region with the same topological triangulation as the template, perform the same normalization, and denote this resultant 3D face set with ground-truth correspondence as {S^gt_i}_{i=1}^M, whose number of vertices is also n.

Since raw scans are acquired from different distances, orientations or sensors, their point clouds exhibit enormous variations in pose and scale. Thus, before feeding them to our network, we apply a similarity transformation to align raw scans to the template using five 3D landmarks. Following [11], we detect 2D landmarks on the corresponding rendered images, from which we obtain 3D landmarks by back-projection (Fig. 4 (1)). After alignment, the points outside the unit sphere are removed. Finally, we randomly sample n points as the input S^input_i ∈ R^{n×3}. If the vertex number is less than n, we apply interpolating subdivision [37] before sampling. As in Tab. 2, we perform data augmentation for neutral scans by repeating the random sampling several times so that each subject has 10 neutral training scans. Note that the above preprocessing is also applied to the synthetic data, except that their 3D landmarks are provided by BFM. As a result, the point ordering of both the raw and the synthetic input data is random.
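The cropping, normalization and fixed-size sampling above can be summarized in a few lines; the sketch below is a simplified stand-in (it resamples with replacement instead of applying the interpolating subdivision of [37] when a scan has fewer than n points).

```python
import numpy as np

def normalize_to_unit_sphere(points):
    """Center a point set and scale it to fit inside the unit sphere."""
    centered = points - points.mean(axis=0, keepdims=True)
    return centered / np.linalg.norm(centered, axis=1).max()

def sample_fixed_size(aligned_scan, n=29495, seed=0):
    """Crop an aligned scan to the unit sphere and randomly sample exactly n
    points, so every input has the template's vertex count in random order.
    Resampling with replacement stands in for the subdivision step of the paper."""
    rng = np.random.default_rng(seed)
    inside = aligned_scan[np.linalg.norm(aligned_scan, axis=1) <= 1.0]
    idx = rng.choice(len(inside), size=n, replace=len(inside) < n)
    return inside[idx]            # (n, 3) network input S_input
```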

3.3. Loss Function

This encoder-decoder architecture is trained end-to-end. We define three kinds of losses that constrain the correspondence between the output shape and the template, and that retain the original global and local information. The overall loss is:

\[
\mathcal{L} = \mathcal{L}_{vt} + \lambda_1 \mathcal{L}_{normal} + \lambda_2 \mathcal{L}_{edge},
\tag{3}
\]

where the vertex loss Lvt constrains the locations of mesh vertices, the normal loss Lnormal enforces the consistency of surface normals, and the edge length loss Ledge preserves the topology of 3D faces.

Here, we consider two training scenarios: synthetic and real data. Supervision is available for the synthetic data with ground truth (supervised case), whereas real scans come without correspondence labels (unsupervised case).

Supervised loss  In the supervised case, given the shape Sgt (and S) and the predefined triangle topology, we can easily compute the corresponding surface normals ngt (and n) and edge lengths egt (and e). For the vertex loss, we therefore use an L1 loss, Lvt(S, Sgt) = ||Sgt − S||1. We measure the normal loss by the cosine distance

\[
\mathcal{L}_{normal}(\mathbf{n}, \mathbf{n}^{gt}) = \frac{1}{n} \sum_{i} \left( 1 - \mathbf{n}^{gt}_{i} \cdot \mathbf{n}_{i} \right).
\]

If a predicted normal has an orientation similar to the ground truth, the dot product n^gt_i · n_i will be close to 1 and the loss will be small, and vice versa. The third term, Ledge, encourages the ratio between corresponding edge lengths in the predicted shape and the ground truth to be close to 1. Following [28], the edge length loss is defined as

\[
\mathcal{L}_{edge}(\mathbf{S}, \mathbf{S}^{gt}) = \frac{1}{\#E} \sum_{(i,j) \in E} \left| \frac{\left\| \mathbf{S}_{i} - \mathbf{S}_{j} \right\|}{\left\| \mathbf{S}^{gt}_{i} - \mathbf{S}^{gt}_{j} \right\|} - 1 \right|,
\tag{4}
\]

where E is the fixed edge graph of the template.
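A compact sketch of the three supervised terms and their weighted combination (Eqs. 3–4) is given below, assuming vertices, unit normals and the template edge list are already available as PyTorch tensors; the per-term normalization is our choice and may differ from the authors' implementation.

```python
import torch

def supervised_loss(S_pred, S_gt, n_pred, n_gt, edges, lam1=1.6e-4, lam2=1.6e-4):
    """S_pred, S_gt: (n, 3) vertices; n_pred, n_gt: (n, 3) unit normals;
    edges: (|E|, 2) long tensor of vertex index pairs from the template."""
    l_vt = (S_gt - S_pred).abs().mean()                          # L1 vertex loss
    l_normal = (1.0 - (n_gt * n_pred).sum(dim=1)).mean()         # 1 - cosine similarity
    d_pred = (S_pred[edges[:, 0]] - S_pred[edges[:, 1]]).norm(dim=1)
    d_gt   = (S_gt[edges[:, 0]]   - S_gt[edges[:, 1]]).norm(dim=1)
    l_edge = (d_pred / d_gt.clamp(min=1e-8) - 1.0).abs().mean()  # Eq. (4)
    return l_vt + lam1 * l_normal + lam2 * l_edge                # Eq. (3)
```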

Unsupervised loss  When the correspondences between the template and real scans are not available, we still optimize the reconstructions, but regularize the deformations toward correspondence. For reconstruction, we use the Chamfer distance as Lvt(S, Sraw) between the input scan Sraw and the prediction S,

\[
\mathcal{L}_{vt}(\mathbf{S}, \mathbf{S}^{raw}) = \sum_{\mathbf{p} \in \mathbf{S}} \min_{\mathbf{q} \in \mathbf{S}^{raw}} \left\| \mathbf{p} - \mathbf{q} \right\|_{2}^{2} + \sum_{\mathbf{q} \in \mathbf{S}^{raw}} \min_{\mathbf{p} \in \mathbf{S}} \left\| \mathbf{p} - \mathbf{q} \right\|_{2}^{2},
\tag{5}
\]

where p is a vertex in the predicted shape and q is a vertex in the input scan. When min_{q∈Sraw} ||p − q||²₂ > ε or min_{p∈S} ||p − q||²₂ > ε, we treat q as a flying vertex and the error is not counted.

In this unsupervised case, we further define a loss on the surface normals to characterize high-frequency properties, Lnormal(n, n^raw_(q)), where q is the closest vertex to p found when calculating the Chamfer distance, and n^raw_(q) is the observed normal from the real scan. For the edge length loss, Ledge is defined the same as in Eqn. 4.

Refine  In the unsupervised case, the normal loss Lnormal(n, n^raw_(q)) always relies on the closest vertex q in Sraw.

Figure 5: (a) q_i is the closest vertex of p_i, and q′_i is computed by the normal ray scheme. (b) The predefined mouth region.

The disadvantage of this closest-vertex scheme is that the counterpart q_i is not necessarily the true target for correspondence in high-curvature regions (Fig. 5 (a)). Therefore, the loss is not capable of capturing high-frequency details in Sraw. To remedy this issue, as suggested in [46], we consider the normal ray method, which computes the closest point of intersection of the normal ray originating from p_i with Sraw. As shown in Fig. 5 (a), the normal ray in sharp regions finds a better counterpart q′_i. At the early stage of the training process, we use the closest-vertex scheme, which is computationally more efficient; when the loss saturates, we switch to the normal-ray scheme.
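As a rough point-cloud stand-in for the normal-ray counterpart search (the paper intersects the ray with the scan surface, following [46]), one can pick, within a small radius, the raw point with the smallest perpendicular distance to the ray p_i + t·n_i. Treating the scan as a point cloud and the choice of radius are simplifications of ours.

```python
import torch

def normal_ray_counterpart(S_pred, n_pred, S_raw, max_dist=0.05):
    """For each predicted vertex p_i with unit normal n_i, return the index of
    the raw point closest to the ray p_i + t * n_i (smallest perpendicular
    distance), restricted to raw points within max_dist of p_i."""
    v = S_raw.unsqueeze(0) - S_pred.unsqueeze(1)                   # (n, m, 3): q - p
    t = (v * n_pred.unsqueeze(1)).sum(dim=2)                       # projection on the ray
    perp = (v - t.unsqueeze(2) * n_pred.unsqueeze(1)).norm(dim=2)  # distance to the ray
    perp = perp.masked_fill(v.norm(dim=2) > max_dist, float("inf"))
    return perp.argmin(dim=1)                                      # index of q'_i per p_i
```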

For many expressive scans, some tongue points are recorded when the mouth is open, and these are hard to put into correspondence. To address this issue, on top of the losses in Eqn. 3, we add a mouth-region Laplacian regularization loss Llap = ||L Smouth||₂ to maintain the relative locations of neighboring vertices. Here L is the discrete Laplace-Beltrami operator and Smouth contains the mouth-region vertices predefined in Fig. 5 (b). See [36] for details on the Laplacian regularization loss.
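The mouth-region regularizer can be sketched with a uniform graph Laplacian as a stand-in for the discrete Laplace-Beltrami operator; the adjacency list and mouth index set are assumed to come from the fixed template.

```python
import torch

def mouth_laplacian_loss(S_pred, adjacency, mouth_idx):
    """L_lap = ||L * S_mouth||_2 with a uniform Laplacian: each mouth vertex
    minus the centroid of its template neighbors. adjacency[i] lists the
    neighbor indices of vertex i; mouth_idx indexes the region of Fig. 5 (b)."""
    deltas = [S_pred[i] - S_pred[adjacency[i]].mean(dim=0) for i in mouth_idx]
    return torch.stack(deltas).norm()
```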

3.4. Implementation Detail

Encoder/Decoder Network  We employ the PointNet [50] architecture as the base encoder. As shown in Fig. 2, the encoder takes the 1024-dim output of PointNet and appends two parallel fully-connected (FC) layers to generate the identity and expression latent representations. We set lId = lExp = 512. The decoders are two-layer MLP networks, whose numbers of inputs and outputs are respectively {lId (lExp), 1024} (ReLU) and {1024, n×3}.

Training Process  We train our encoder-decoder network in three phases. First, we train the encoder and the identity decoder with neutral examples. Then, we fix the identity decoder and train the expression decoder with expression examples. Finally, end-to-end joint training is conducted. In the first two phases, we start training with only synthetic data. When the loss saturates (usually within 10 epochs), we continue training with a mixture of synthetic and real data for another 10 epochs. We optimize the networks via Adam with an initial learning rate of 0.0001. The learning rate is decreased by half every 5 epochs.


Figure 6: Qualitative results reflecting the contribution of the loss components. The first column is the input scan. Columns 2-4 show the reconstructed shapes with different loss combinations.

We explore different batch sizes, including strategies such as 50% synthetic and 50% real data per batch, and find the optimal batch size to be 1. λ1 and λ2 control the influence of the regularizations against Lvt; they are both set to 1.6×10⁻⁴ in our experiments. We set ε = 0.001 and the weight of Llap to 0.005.
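The decoder sizes and the two latent heads described above can be summarized in the following PyTorch sketch; the PointNet backbone is left as a placeholder (any module mapping a (B, n, 3) point cloud to a (B, 1024) feature), and the commented training lines mirror the reported Adam settings.

```python
import torch
import torch.nn as nn

class FaceDecoder(nn.Module):
    """Two-layer MLP decoder: latent (512) -> 1024 (ReLU) -> n x 3 vertices."""
    def __init__(self, latent_dim=512, n_verts=29495):
        super().__init__()
        self.n_verts = n_verts
        self.mlp = nn.Sequential(nn.Linear(latent_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, n_verts * 3))

    def forward(self, f):
        return self.mlp(f).view(-1, self.n_verts, 3)

class ShapeModel(nn.Module):
    """PointNet-style feature (1024-dim) -> two parallel FC heads (identity /
    expression codes) -> two decoders whose outputs are summed as in Eq. (2)."""
    def __init__(self, backbone, latent_dim=512, n_verts=29495):
        super().__init__()
        self.backbone = backbone                 # (B, n, 3) -> (B, 1024), placeholder
        self.fc_id = nn.Linear(1024, latent_dim)
        self.fc_exp = nn.Linear(1024, latent_dim)
        self.dec_id = FaceDecoder(latent_dim, n_verts)
        self.dec_exp = FaceDecoder(latent_dim, n_verts)

    def forward(self, points):
        feat = self.backbone(points)
        f_id, f_exp = self.fc_id(feat), self.fc_exp(feat)
        return self.dec_id(f_id) + self.dec_exp(f_exp), f_id, f_exp

# Reported optimizer settings: Adam, initial lr 1e-4, halved every 5 epochs.
# model = ShapeModel(backbone=pointnet_encoder)   # pointnet_encoder is assumed
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.5)
```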

3.5. Single-Image Shape Inference

As shown in Fig. 2, our identity and expression shape decoders can be used for image-to-shape inference. Specifically, we employ a SOTA face feature extraction network, SphereFace [44], as the base image encoder. This network consists of 20 convolutional layers and an FC layer, and takes the 512-dim output of the FC layer as the face representation. We append another two parallel FC layers to generate the identity and expression latent representations, respectively. Here, we use the raw scans from the 7 real databases to render images as our training samples. With the learned ground-truth identity and expression latent codes, we employ an L1 latent loss to fine-tune this image encoder. Since the encoder excels at face feature extraction and the latent loss provides strong supervision, the encoder is fine-tuned for 100 epochs with a batch size of 32.
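One fine-tuning step with the L1 latent loss could look as follows; the image encoder is assumed to be any module returning the two 512-dim codes (the SphereFace backbone itself is not reproduced here), and the target codes are those produced by the trained point-cloud encoder on the scan that each rendered image originates from.

```python
import torch
import torch.nn.functional as F

def latent_finetune_step(image_encoder, images, f_id_gt, f_exp_gt, optimizer):
    """Minimize the L1 distance between the image encoder's latent codes and the
    'ground-truth' codes obtained from the 3D scan encoder."""
    f_id, f_exp = image_encoder(images)            # assumed: (B, 512) each
    loss = F.l1_loss(f_id, f_id_gt) + F.l1_loss(f_exp, f_exp_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```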

4. Experimental Results

Experiments are conducted to evaluate our method in dense correspondence accuracy, shape and expression representation power, and single-image face reconstruction.

Evaluation Metric  The ideal evaluation metric for 3D shape analysis is the per-vertex error. However, this metric is not applicable to real scans due to the absence of dense correspondence ground truth. An alternative metric is the per-vertex fitting error, which has been widely used in 3D face reconstruction [43] and 3D face-to-face fitting, e.g., LSFM [11] and GPMMs [45]. The per-vertex fitting error is the distance between every vertex of the test shape and the nearest-neighbor vertex of the corresponding estimated shape. Generally, the value of this error can be very small due to the nearest-neighbor search; thus, it sometimes cannot faithfully reflect the accuracy of dense correspondence.

Figure 7: Raw scans (top) and their reconstructions with color-coded dense correspondences (bottom), for one BU3DFE subject in seven expressions: angry (AN), disgust (DI), fear (FE), happy (HA), neutral (NE), sad (SA), and surprise (SU).

To better evaluate the correspondence accuracy, prior 3D face correspondence works [22, 24, 55] adopt a semantic landmark error. With pre-labeled landmarks on the template face, it is easy to find the p 3D landmarks {l_i}_{i=1}^p with the same indices on the estimated shape. Comparing them with manual annotations {l*_i}_{i=1}^p, we compute the semantic landmark error as

\[
\frac{1}{p} \sum_{i=1}^{p} \left\| \mathbf{l}^{*}_{i} - \mathbf{l}_{i} \right\|.
\]

Note that this error is normally much larger than the per-vertex fitting error due to inconsistent and imprecise annotations. Tab. 6 compares these three evaluation metrics.
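The two metrics can be computed as follows; the per-vertex fitting error uses a nearest-neighbor search, while the semantic landmark error only needs the fixed landmark indices on the template (a sketch with assumed array shapes).

```python
import numpy as np
from scipy.spatial import cKDTree

def per_vertex_fitting_error(test_shape, est_shape):
    """Mean distance from each test-shape vertex to its nearest neighbor on the
    estimated shape; small values do not necessarily imply good correspondence."""
    dists, _ = cKDTree(est_shape).query(test_shape)
    return dists.mean()

def semantic_landmark_error(est_shape, landmark_idx, gt_landmarks):
    """(1/p) * sum_i ||l*_i - l_i||, where l_i are the estimated-shape vertices
    at the p landmark indices fixed on the template and l*_i are annotations."""
    l_est = est_shape[landmark_idx]               # (p, 3)
    return np.linalg.norm(gt_landmarks - l_est, axis=1).mean()
```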

4.1. Ablation Study

We qualitatively evaluate the function of each loss component. As seen in Fig. 6, using only the vertex loss severely impairs the surface smoothness and local details; adding the surface normal loss preserves the high-frequency details; adding the edge length term refines the local triangle topology. These results demonstrate that all the loss components presented in this work contribute to the final performance.

4.2. Dense Correspondence Accuracy

We first report the correspondence accuracy on the BU3DFE database. BU3DFE contains one neutral and six expression scans with four levels of strength for each of 100 subjects. Following the same setting as [55] and GPMMs [24], we use all p = 83 landmarks of all neutral scans and of expression scans at the highest level for evaluation. Specifically, the landmarks of the estimated shape are compared to the manually annotated landmarks provided with BU3DFE. We compare with four state-of-the-art dense correspondence methods: NICP [3], Bolkart et al. [9], Salazar et al. [55], and GPMMs [24]. Among them, NICP has been widely used for constructing neutral morphable models such as BFM 2009 and LSFM. For a fair comparison, we re-implement NICP with an extra landmark constraint so that it can establish dense correspondence for expressive 3D scans.


Table 3: Comparison of the mean and standard deviation of semantic landmark error (mm) on BU3DFE.

Face Region     NICP [3]     Bolkart et al. [9]   Salazar et al. [55]   GPMMs [24]   Proposed (out)   Proposed (in)   Relative Impr.
Left Eyebrow    7.49±2.04    8.71±3.32            6.28±3.30             4.69±4.64    6.25±2.58        4.18±1.62       10.9%
Right Eyebrow   6.92±2.39    8.62±3.02            6.75±3.51             5.35±4.69    4.57±3.03        3.97±1.70       25.8%
Left Eye        3.18±0.76    3.39±1.00            3.25±1.84             3.10±3.43    2.00±1.32        1.72±0.84       44.5%
Right Eye       3.49±0.80    4.33±1.16            3.81±2.06             3.33±3.53    2.88±1.29        2.16±0.82       35.1%
Nose            5.36±1.39    5.12±1.89            3.96±2.22             3.94±2.58    4.33±1.24        3.56±1.08       9.6%
Mouth           5.44±1.50    5.39±1.81            5.69±4.45             3.66±3.13    4.45±2.02        4.17±1.70       -13.9%
Chin            12.40±6.15   11.69±6.39           7.22±4.73             11.37±5.85   7.47±3.01        6.80±3.24       5.8%
Left Face       12.49±5.51   15.19±5.21           18.48±8.52            12.52±6.04   12.10±4.06       9.48±3.42       24.1%
Right Face      13.04±5.80   13.77±5.47           17.36±9.17            10.76±5.34   13.17±4.54       10.21±3.07      5.1%
Avg.            7.56±3.92    8.49±4.29            8.09±5.75             6.52±3.86    6.36±3.92        5.14±3.03       21.2%

Table 4: Comparison of semantic landmark error (mean±STD in mm) on FRGC v2.0. The landmarks are defined in [22].

Landmark   Creusot et al. [18]   Gilani et al. [74]   Fan et al. [22]   Proposed
ex(L)      5.87±3.11             4.50±2.97            2.62±1.54         1.79±1.01
en(L)      4.31±2.44             3.12±2.09            2.53±1.66         1.61±0.97
n          4.20±2.07             3.63±2.02            2.43±1.36         2.69±1.43
ex(R)      6.00±3.03             3.74±2.79            2.60±1.71         2.00±0.95
en(R)      4.29±2.03             2.73±2.14            2.49±1.65         1.82±0.93
prn        3.35±2.00             2.68±1.48            2.11±1.17         2.36±1.37
Ch(L)      5.47±3.45             5.31±2.05            2.93±2.14         2.58±2.61
Ch(R)      5.64±3.58             4.38±2.08            2.84±2.17         2.60±2.58
ls         4.23±3.21             3.31±2.65            2.35±2.86         2.75±2.77
li         5.46±3.29             4.02±3.80            4.35±3.93         4.02±3.97
Avg.       4.88±0.91             3.74±0.83            2.73±0.62         2.42±0.70

For the other three methods, we report results from their papers. Both Salazar et al. [55] and Bolkart et al. [9] are multilinear-model-based 3D face fitting methods. GPMMs [24] is a recent Gaussian process registration based method. Note that these four baselines require labeled 3D landmarks as input, while our method does not.

To further evaluate the generalization ability of the proposed method on new scan data, we conduct two series of experiments: (i) training using data from the BU3DFE database, denoted as Proposed (in), and (ii) training using data outside the BU3DFE database, denoted as Proposed (out).

As shown in Tab. 3, the Proposed (in) setting significantly reduces errors, by at least 21.2% w.r.t. the best baseline. These results demonstrate the superiority of the proposed method in dense correspondence. The error of the Proposed (out) setting shows a small increase, but is still lower than the baselines. The relatively high semantic landmark error is attributed to the imprecise manual annotations, especially on the semantically ambiguous contours, i.e., Chin, Left Face and Right Face. Some example dense correspondence results are shown in Fig. 7 and the Supp.

We further compare the semantic landmark error with the very recent SOTA correspondence method [22], an extension of ICP-based methods, on the high-resolution FRGC v2.0 database [49]. We also compare with two 3D landmark localization works [18, 74]. Following the same setting as [22], we compute the mean and standard deviation of p = 10 landmarks over 4,007 scans. The results of the baselines are from their papers. As shown in Tab. 4, our method improves over the SOTA [22] by 11.4%, and preserves high-frequency details for high-resolution 3D models (see Fig. 7 of the Supp.).

Table 5: 3D scan reconstruction comparison (per-vertex error, mm). lId denotes the dimension of the latent representation.

lId                   40      80      160
Linear 3DMM [48]      1.669   1.450   1.253
Nonlinear 3DMM [64]   1.440   1.227   1.019
Proposed              1.258   1.107   0.946

Figure 8: Shape representation power comparison. Our reconstructions closely match the face shapes, and higher-dimensional latent spaces capture more local details.

The landmark errors are much smaller than on BU3DFE since the annotations used here are more accurate than BU3DFE's. Thanks to the offline training process, our method is two orders of magnitude faster than the existing dense correspondence methods: 0.26s (2ms with GPU) vs. 57.48s for [3] vs. 164.60s for [22].

4.3. Representation Power

Identity shape  We compare the capabilities of the proposed 3D face model with linear and nonlinear 3DMMs on BFM. The BFM database provides 10 test face scans, which are not included in the training set. As these scans already have established dense correspondence, we use the per-vertex error for evaluation. For a fair comparison, we train models with different latent space sizes. As shown in Tab. 5, the proposed model has a smaller reconstruction error than the linear or nonlinear models. The proposed models are also more compact: they achieve similar performance to linear and nonlinear models whose latent space sizes are doubled. Figure 8 shows the visual quality of the three models' reconstructions.


Figure 9: Expression representation power comparison. Our results better match the expression deformations than 3DDFA.

Table 6: Evaluation metric comparison on two databases.

Metric                     BFM        BU3DFE
Per-vertex fitting error   0.572 mm   1.065 mm
Per-vertex error           0.946 mm   −
Semantic landmark error    1.493 mm   5.140 mm

Expression shape  We compare the expression representation power of the proposed 3D face model with the 3DDFA expression model [73], a 29-dim model originating from FaceWarehouse [14]. We use a 79-dim expression model from [29] to randomly generate an expression difference with Gaussian noise for each BFM test sample; these data are treated as the test set. For a fair comparison, we train a model with the same expression latent space size (lExp = 29). Our model has a significantly smaller per-vertex error than 3DDFA: 1.424 mm vs. 2.609 mm. Figure 9 shows the visual quality of four scans' reconstructions.

Shape representation on BU3DFE and BFM  Table 6 compares the shape expressiveness of our model under the three different metrics. Following the setting in Tab. 5, we further calculate the per-vertex fitting error and the semantic landmark error (p = 51) for the BFM test samples. We also provide the per-vertex fitting error for the BU3DFE reconstructions of Tab. 3. From Tab. 6, compared to the ideal per-vertex error, the semantic landmark error is much larger while the per-vertex fitting error is smaller.

Shape representation on COMA  We further evaluate our shape representation on the large-scale COMA database [51]. For a fair comparison with FLAME [39] and Jiang et al. [32], we follow the same setting as [32] and set our latent vector size to 4 for identity and 4 for expression. As shown in Tab. 7, our method shows better shape representation than these SOTA methods. While MeshAE [51] achieves a smaller error (1.160 mm), comparing ours with it is not fair, as it has the advantage of encoding 3D faces into a single vector without decomposing into identity and expression. Also, its mesh convolution requires densely corresponded 3D scans as input.

4.4. Single-image 3D Face Reconstruction

With the same setting as [59], we quantitatively compare our single-image shape inference with prior works on nine subjects (180 images) of the FaceWarehouse database.

Table 7: Comparison (per-vertex error, mm) with state-of-the-art 3D face modeling methods on the COMA database.

Sequence        Proposed   Jiang et al. [32]   FLAME [39]
bareteeth       1.609      1.695               2.002
cheeks in       1.561      1.706               2.011
eyebrow         1.400      1.475               1.862
high smile      1.556      1.714               1.960
lips back       1.532      1.752               2.047
lips up         1.529      1.747               1.983
mouth down      1.362      1.655               2.029
mouth extreme   1.442      1.551               2.028
mouth middle    1.383      1.757               2.043
mouth open      1.381      1.393               1.894
mouth side      1.502      1.748               2.090
mouth up        1.426      1.528               2.067
Avg.            1.474      1.643               1.615

Figure 10: Quantitative evaluation of single-image 3D face reconstruction on samples of the FaceWarehouse database.

Visual and quantitative comparisons are shown in Fig. 10. We achieve on-par results with nonlinear 3DMM [64], Tewari et al. [59] and Garrido et al. [23], while surpassing all other CNN-based regression methods [52, 62].

5. Conclusions

This paper proposes an innovative encoder-decoder framework to jointly learn a robust and expressive face model from a diverse set of raw 3D scan databases and establish dense correspondence among all scans. By using a mixture of synthetic and real 3D scan data with an effective weakly-supervised learning approach, our network preserves the high-frequency details of 3D scans. Comprehensive experimental results show that the proposed method can effectively establish point-to-point dense correspondence, achieves stronger representation power in identity and expression, and is applicable to 3D face reconstruction.

Acknowledgment  Research was sponsored by the Army Research Office and was accomplished under Grant Number W911NF-18-1-0330. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Office or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.


References

[1] Victoria Fernandez Abrevaya, Stefanie Wuhrer, and Edmond Boyer. Multilinear autoencoder for 3D face model learning. In WACV, 2018.
[2] Brian Amberg, Reinhard Knothe, and Thomas Vetter. Expression invariant 3D face recognition with a morphable model. In FG, 2008.
[3] Brian Amberg, Sami Romdhani, and Thomas Vetter. Optimal step nonrigid ICP algorithms for surface registration. In CVPR, 2007.
[4] Timur Bagautdinov, Chenglei Wu, Jason Saragih, Pascal Fua, and Yaser Sheikh. Modeling facial geometry using compositional VAEs. In CVPR, 2018.
[5] Andrew D Bagdanov, Alberto Del Bimbo, and Iacopo Masi. The Florence 2D/3D hybrid face dataset. In J-HGBU workshop, ACM, 2011.
[6] Yin Baocai, Sun Yanfeng, Wang Chengzhang, and Ge Yun. BJUT-3D large scale 3D face database and information processing. Journal of Computer Research and Development, 6:020, 2009.
[7] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999.
[8] Volker Blanz and Thomas Vetter. Face recognition based on fitting a 3D morphable model. TPAMI, 25(9):1063–1074, 2003.
[9] Timo Bolkart and Stefanie Wuhrer. A groupwise multilinear correspondence optimization for 3D faces. In ICCV, 2015.
[10] Fred L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. TPAMI, 11(6):567–585, 1989.
[11] James Booth, Anastasios Roussos, Allan Ponniah, David Dunaway, and Stefanos Zafeiriou. Large scale 3D morphable models. IJCV, 126(2-4):233–254, 2018.
[12] James Booth, Anastasios Roussos, Stefanos Zafeiriou, Allan Ponniah, and David Dunaway. A 3D morphable model learnt from 10,000 faces. In CVPR, 2016.
[13] Alan Brunton, Timo Bolkart, and Stefanie Wuhrer. Multilinear wavelets: A statistical shape space for human faces. In ECCV, 2014.
[14] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. TVCG, 20(3):413–425, 2014.
[15] Chen Cao, Hongzhi Wu, Yanlin Weng, Tianjia Shao, and Kun Zhou. Real-time facial animation with image-based dynamic avatars. TOG, 35(4):126:1–126:12, 2016.
[16] Shiyang Cheng, Irene Kotsia, Maja Pantic, and Stefanos Zafeiriou. 4DFAB: A large scale 4D database for facial expression analysis and biometric applications. In CVPR, 2018.
[17] Timothy F Cootes, Stephen Marsland, Carole J Twining, Kate Smith, and Christopher J Taylor. Groupwise diffeomorphic non-rigid registration for automatic model building. In ECCV, 2004.
[18] Clement Creusot, Nick Pears, and Jim Austin. A machine-learning approach to keypoint detection and landmarking on 3D meshes. IJCV, 102(1-3):146–179, 2013.
[19] Hang Dai, Nick Pears, William AP Smith, and Christian Duncan. A 3D morphable model of craniofacial shape and texture variation. In ICCV, 2017.
[20] Pengfei Dou, Shishir K. Shah, and Ioannis A. Kakadiaris. End-to-end 3D face reconstruction with deep neural networks. In CVPR, 2017.
[21] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3D object reconstruction from a single image. In CVPR, 2017.
[22] Zhenfeng Fan, Xiyuan Hu, Chen Chen, and Silong Peng. Dense semantic and topological correspondence of 3D faces without landmarks. In ECCV, 2018.
[23] Pablo Garrido, Michael Zollhofer, Dan Casas, Levi Valgaerts, Kiran Varanasi, Patrick Perez, and Christian Theobalt. Reconstruction of personalized 3D face rigs from monocular video. TOG, 35(3):28, 2016.
[24] Thomas Gerig, Andreas Morel-Forster, Clemens Blumer, Bernhard Egger, Marcel Luthi, Sandro Schonborn, and Thomas Vetter. Morphable face models - an open framework. In FG, 2018.
[25] Syed Zulqarnain Gilani, Ajmal Mian, and Peter Eastwood. Deep, dense and accurate 3D face correspondence for generating population specific deformable models. Pattern Recognition, 69:238–250, 2017.
[26] Syed Zulqarnain Gilani, Ajmal Mian, Faisal Shafait, and Ian Reid. Dense 3D face correspondence. TPAMI, 40(7):1584–1598, 2018.
[27] Carl Martin Grewe and Stefan Zachow. Fully automated and highly accurate dense correspondence for facial surfaces. In ECCV, 2016.
[28] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. 3D-CODED: 3D correspondences by deep deformation. In ECCV, 2018.
[29] Yudong Guo, Juyong Zhang, Jianfei Cai, Boyi Jiang, and Jianmin Zheng. CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. TPAMI, 2018.
[30] Shalini Gupta, Kenneth R Castleman, Mia K Markey, and Alan C Bovik. Texas 3D face recognition database. In SSIAI, 2010.
[31] Xiaoguang Han, Chang Gao, and Yizhou Yu. DeepSketch2Face: A deep learning based sketching system for 3D face and caricature modeling. TOG, 36(4):126, 2017.
[32] Zi-Hang Jiang, Qianyi Wu, Keyu Chen, and Juyong Zhang. Disentangled representation learning for 3D face shape. In CVPR, 2019.
[33] Amin Jourabloo and Xiaoming Liu. Large-pose face alignment via CNN-based dense 3D model fitting. In CVPR, 2016.
[34] Amin Jourabloo and Xiaoming Liu. Pose-invariant face alignment via CNN-based dense 3D model fitting. International Journal of Computer Vision, 124(2):187–203, 2017.
[35] Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. Pose-invariant face alignment with a single CNN. In ICCV, 2017.
[36] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
[37] Leif Kobbelt. √3-subdivision. In SIGGRAPH, 2000.
[38] Artiom Kovnatsky, Michael M Bronstein, Alexander M Bronstein, Klaus Glashoff, and Ron Kimmel. Coupled quasi-harmonic bases. Computer Graphics Forum, 32(24):439–448, 2013.
[39] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. Learning a model of facial shape and expression from 4D scans. TOG, 36(6):194, 2017.
[40] Feng Liu, Jun Hu, Jianwei Sun, Yang Wang, and Qijun Zhao. Multi-Dim: A multi-dimensional face database towards the application of 3D technology in real-world scenarios. In IJCB, 2017.
[41] Feng Liu, Dan Zeng, Qijun Zhao, and Xiaoming Liu. Joint face alignment and 3D face reconstruction. In ECCV, 2016.
[42] Feng Liu, Qijun Zhao, Xiaoming Liu, and Dan Zeng. Joint face alignment and 3D face reconstruction with application to face recognition. PAMI, 2018.
[43] Feng Liu, Ronghang Zhu, Dan Zeng, Qijun Zhao, and Xiaoming Liu. Disentangling features in 3D face shapes for joint face reconstruction and recognition. In CVPR, 2018.
[44] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. SphereFace: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[45] Marcel Luthi, Thomas Gerig, Christoph Jud, and Thomas Vetter. Gaussian process morphable models. TPAMI, 40(8):1860–1873, 2018.
[46] Gang Pan, Xiaobo Zhang, Yueming Wang, Zhenfang Hu, Xiaoxiang Zheng, and Zhaohui Wu. Establishing point correspondence of 3D faces via sparse facial deformable model. TIP, 22(11):4170–4181, 2013.
[47] Ankur Patel and William AP Smith. 3D morphable face models revisited. In CVPR, 2009.
[48] Pascal Paysan, Reinhard Knothe, Brian Amberg, Sami Romdhani, and Thomas Vetter. A 3D face model for pose and illumination invariant face recognition. In AVSS, 2009.
[49] P Jonathon Phillips, Patrick J Flynn, Todd Scruggs, Kevin W Bowyer, Jin Chang, Kevin Hoffman, Joe Marques, Jaesik Min, and William Worek. Overview of the face recognition grand challenge. In CVPR, 2005.
[50] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[51] Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black. Generating 3D faces using convolutional mesh autoencoders. In ECCV, 2018.
[52] Elad Richardson, Matan Sela, Roy Or-El, and Ron Kimmel. Learning detailed face reconstruction from a single image. In CVPR, 2017.
[53] Emanuele Rodola, Luca Cosmo, Michael M Bronstein, Andrea Torsello, and Daniel Cremers. Partial functional correspondence. Computer Graphics Forum, 36(1):222–236, 2017.
[54] Joseph Roth, Yiying Tong, and Xiaoming Liu. Adaptive 3D face reconstruction from unconstrained photo collections. PAMI, 39(11):2127–2141, 2016.
[55] Augusto Salazar, Stefanie Wuhrer, Chang Shu, and Flavio Prieto. Fully automatic expression-invariant face correspondence. Machine Vision and Applications, 25(4):859–879, 2014.
[56] Arman Savran, Nese Alyuz, Hamdi Dibeklioglu, Oya Celiktutan, Berk Gokberk, Bulent Sankur, and Lale Akarun. Bosphorus database for 3D face analysis. In Biometrics and Identity Management, pages 47–56, 2008.
[57] Soumyadip Sengupta, Angjoo Kanazawa, Carlos D Castillo, and David W Jacobs. SfSNet: Learning shape, reflectance and illuminance of faces in the wild. In CVPR, 2018.
[58] Yaniv Taigman, Ming Yang, Marc'Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
[59] Ayush Tewari, Michael Zollhofer, Hyeongwoo Kim, Pablo Garrido, Florian Bernard, Patrick Perez, and Christian Theobalt. MoFA: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In CVPR, 2017.
[60] Justus Thies, Michael Zollhofer, Matthias Nießner, Levi Valgaerts, Marc Stamminger, and Christian Theobalt. Real-time expression transfer for facial reenactment. TOG, 34(6):1–14, 2015.
[61] George Toderici, Georgios Evangelopoulos, Tianhong Fang, Theoharis Theoharis, and Ioannis A Kakadiaris. UHDB11 database for 3D-2D face recognition. In PSIVT, 2013.
[62] Anh Tuan Tran, Tal Hassner, Iacopo Masi, and Gerard Medioni. Regressing robust and discriminative 3D morphable models with a very deep neural network. In CVPR, 2017.
[63] Luan Tran, Feng Liu, and Xiaoming Liu. Towards high-fidelity nonlinear 3D face morphable model. In CVPR, 2019.
[64] Luan Tran and Xiaoming Liu. Nonlinear 3D face morphable model. In CVPR, 2018.
[65] Luan Tran and Xiaoming Liu. On learning 3D face morphable model from in-the-wild images. TPAMI, 2019. doi:10.1109/TPAMI.2019.2927975.
[66] Oliver Van Kaick, Hao Zhang, Ghassan Hamarneh, and Daniel Cohen-Or. A survey on shape correspondence. Computer Graphics Forum, 30(6):1681–1707, 2011.
[67] Daniel Vlasic, Matthew Brand, Hanspeter Pfister, and Jovan Popovic. Face transfer with multilinear models. TOG, 24(3):426–433, 2005.
[68] Yuhang Wu, Shishir K Shah, and Ioannis A Kakadiaris. Rendering or normalization? An analysis of the 3D-aided pose-invariant face recognition. In ISBA, 2016.
[69] Lijun Yin, Xiaochen Chen, Yi Sun, Tony Worm, and Michael Reale. A high-resolution 3D dynamic facial expression database. In FG, 2008.
[70] Lijun Yin, Xiaozhou Wei, Yi Sun, Jun Wang, and Matthew J Rosato. A 3D facial expression database for facial behavior research. In FG, 2006.
[71] Chao Zhang, William AP Smith, Arnaud Dessein, Nick Pears, and Hang Dai. Functional faces: Groupwise dense correspondence using functional maps. In CVPR, 2016.
[72] Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z. Li. Face alignment across large poses: A 3D solution. In CVPR, 2016.
[73] Xiangyu Zhu, Zhen Lei, Junjie Yan, Dong Yi, and Stan Z Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.
[74] Syed Zulqarnain Gilani, Faisal Shafait, and Ajmal Mian. Shape-based automatic detection of a large number of 3D facial landmarks. In CVPR, 2015.