DeeperForensics-1.0: A Large-Scale Dataset for Real-World Face Forgery Detection
Supplementary Material

Liming Jiang1  Ren Li2  Wayne Wu1,2  Chen Qian2  Chen Change Loy1†
1Nanyang Technological University  2SenseTime Research
liming002@ntu.edu.sg  tomobladelee@hotmail.com  wuwenyan@sensetime.com  qianchen@sensetime.com  ccloy@ntu.edu.sg
†Corresponding author.

Abstract

This document provides supplementary information that is not elaborated in our main paper due to space constraints: Section A illustrates the detailed derivations and experimental results of the proposed DF-VAE framework; Section B describes supplementary information on our real-world face forgery detection benchmark; Section C presents more examples of our source video data collection; Section D shows diverse perturbations in DeeperForensics-1.0 that better simulate real-world scenarios.

A. Method Details

To address the obvious low-quality problems of previous datasets, we propose a new learning-based, end-to-end face swapping framework, DeepFake Variational Auto-Encoder (DF-VAE), as our dataset construction method. In the main paper, we give a brief and intuitive introduction to DF-VAE and discuss its three key points (i.e., disentanglement, style matching, and temporal continuity). In this section, we elaborate on the details of DF-VAE (see Figure 1 for the main framework of DF-VAE).

A.1. Disentangled Module

To be consistent with our main paper, we refer to the identity in the driving video as the "target" face, and the identity of the face that is swapped onto the video as the "source" face.

The disentanglement of structure and appearance is rather difficult because appearance contains too much information. Besides, the structure and appearance representations are far from independent. As we claim in the main paper, face swapping can be considered a subsequent step of face reenactment, with suitable fusion modules for the reenacted face and the background. Thus, in the disentangled module, our main task is to animate the source face with an expression similar to that of the target face, i.e., face reenactment, without any paired data.

Let x_{1:T} ≡ {x_1, x_2, ..., x_T} ∈ X be a sequence of source face video frames, and y_{1:T} ≡ {y_1, y_2, ..., y_T} ∈ Y be the sequence of corresponding target face video frames. We first simplify our problem and only consider two specific snapshots at time t, x_t and y_t. Let x̃_t, ỹ_t, d_t represent the reconstructed source face, the reconstructed target face, and the reenacted face, respectively.

Consider the reconstruction procedure of the source face x_t. Let s_x denote the structure representation and a_x denote the appearance information. The face generator can be depicted as the posterior estimate p_θ(x_t | s_x, a_x). The solution of our reconstruction goal, the marginal log-likelihood x̃_t ∼ log p_θ(x_t), by a common Variational Auto-Encoder (VAE) [16] can be written as:

  \log p_\theta(x_t) = D_{KL}(q_\phi(s_x, a_x \mid x_t) \,\|\, p_\theta(s_x, a_x \mid x_t)) + \mathcal{L}(\theta, \phi; x_t),   (1)

where q_φ is an approximate posterior used to achieve the evidence lower bound (ELBO) in the intractable case, and the second RHS term L(θ, φ; x_t) is the variational lower bound w.r.t. both the variational parameters φ and the generative parameters θ. Since the first RHS term, the KL-divergence, is non-negative, we get:

  \log p_\theta(x_t) \ge \mathcal{L}(\theta, \phi; x_t) = \mathbb{E}_{q_\phi(s_x, a_x \mid x_t)}\left[-\log q_\phi(s_x, a_x \mid x_t) + \log p_\theta(x_t, s_x, a_x)\right],   (2)

where L(θ, φ; x_t) can also be written as:

  \mathcal{L}(\theta, \phi; x_t) = -D_{KL}(q_\phi(s_x, a_x \mid x_t) \,\|\, p_\theta(s_x, a_x)) + \mathbb{E}_{q_\phi(s_x, a_x \mid x_t)}\left[\log p_\theta(x_t \mid s_x, a_x)\right],   (3)

and we need to optimize L(θ, φ; x_t) w.r.t. φ and θ.

In Eq. 1, we assume that both s_x and a_x are latent priors computed from the same posterior input x_t. However, the separation of these two variables in the latent space is rather difficult without additional conditions.


Figure 1. The main framework of DeepFake Variational Auto-Encoder (DF-VAE). In training, we reconstruct the source and target faces (blue and orange arrows, respectively) by extracting landmarks and constructing an unpaired sample as the condition. Optical flow differences are minimized after reconstruction to improve temporal continuity. In inference, we swap the latent codes and get the reenacted face (green arrows). The subsequent MAdaIN module fuses the reenacted face and the original background, resulting in the swapped face.

Therefore, we employ a simple yet effective approach to disentangle these two variables.

The blue arrows in Figure 1 show the reconstruction procedure of the source face x_t. Instead of feeding a single source face x_t, we sample another source face x' to construct unpaired data in the source domain. To make the structure representation more evident, we use the stacked hourglass networks [17] to extract landmarks of x_t in the structure extraction module and obtain the heatmap x̂_t. Then, we feed the heatmap x̂_t to the Structure Encoder E_α and x' to the Appearance Encoder E_β. We concatenate the latent representations (small cubes in red and green) and feed them to the Decoder D_γ. Finally, we get the reconstructed face x̃_t, i.e., the marginal log-likelihood of x_t.
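To make the data flow concrete, below is a minimal PyTorch sketch of this reconstruction path. The module definitions, channel sizes, and names (ConvEncoder, ConvDecoder, the single-channel heatmap) are illustrative assumptions, not the released DF-VAE implementation.

```python
# Minimal sketch of the disentangled reconstruction path (assumed layer sizes).
import torch
import torch.nn as nn

class ConvEncoder(nn.Module):
    """Toy encoder standing in for E_alpha / E_beta."""
    def __init__(self, in_ch, latent_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_ch, 4, stride=2, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class ConvDecoder(nn.Module):
    """Toy decoder standing in for D_gamma."""
    def __init__(self, latent_ch, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, out_ch, 4, stride=2, padding=1), nn.Sigmoid(),
        )
    def forward(self, z):
        return self.net(z)

E_alpha = ConvEncoder(in_ch=1, latent_ch=128)   # structure encoder, takes the landmark heatmap
E_beta  = ConvEncoder(in_ch=3, latent_ch=128)   # appearance encoder, takes the unpaired face x'
D_gamma = ConvDecoder(latent_ch=256)            # decoder over the concatenated latent codes

heatmap_xt = torch.rand(1, 1, 256, 256)   # stand-in heatmap of x_t from the hourglass network
x_prime    = torch.rand(1, 3, 256, 256)   # another face of the same source identity

z_structure  = E_alpha(heatmap_xt)
z_appearance = E_beta(x_prime)
x_tilde = D_gamma(torch.cat([z_structure, z_appearance], dim=1))  # reconstructed source face
```

At inference time, the same modules are reused: the appearance code of x' is concatenated with the structure code extracted from the target's heatmap to produce the reenacted face d_t.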

Therefore, the latent structure representation s_x in Eq. 1 becomes a more evident heatmap representation x̂_t, which is introduced as a new condition. The unpaired sample x', with the same identity w.r.t. x_t, is another condition, serving as a substitute for a_x. Eq. 1 can be rewritten as a conditional log-likelihood:

  \log p_\theta(x_t \mid \hat{x}_t, x') = D_{KL}(q_\phi(z_x \mid x_t, \hat{x}_t, x') \,\|\, p_\theta(z_x \mid x_t, \hat{x}_t, x')) + \mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x').   (4)

Similarly,

  \log p_\theta(x_t \mid \hat{x}_t, x') \ge \mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x') = \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[-\log q_\phi(z_x \mid x_t, \hat{x}_t, x') + \log p_\theta(x_t, z_x \mid \hat{x}_t, x')\right],   (5)

and L(θ, φ; x_t, x̂_t, x') can also be written as:

  \mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x') = -D_{KL}(q_\phi(z_x \mid x_t, \hat{x}_t, x') \,\|\, p_\theta(z_x \mid \hat{x}_t, x')) + \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[\log p_\theta(x_t \mid z_x, \hat{x}_t, x')\right].   (6)

We let the variational approximate posterior be a multivariate Gaussian with a diagonal covariance structure:

  \log q_\phi(z_x \mid x_t, \hat{x}_t, x') \equiv \log \mathcal{N}(z_x; \mu, \sigma^2 I),   (7)

where I is an identity matrix. Exploiting the reparameterization trick [16], the non-differentiable sampling operation becomes differentiable via an auxiliary variable with an independent marginal. In this case, z_x ∼ q_φ(z_x | x_t, x̂_t, x') is implemented by z_x = μ + σε, where ε is an auxiliary noise variable, ε ∼ N(0, 1). Finally, the approximate posterior q_φ(z_x | x_t, x̂_t, x') is estimated by the separated encoders, the Structure Encoder E_α and the Appearance Encoder E_β, in an end-to-end training process by standard gradient descent.
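The reparameterization step can be sketched in a few lines; the helper below assumes the encoders predict μ and log σ² (a common parameterization choice, not stated in the paper).

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Sample z ~ N(mu, sigma^2 I) differentiably: z = mu + sigma * eps, eps ~ N(0, I)."""
    sigma = torch.exp(0.5 * log_var)   # encoders often predict log(sigma^2) for numerical stability
    eps = torch.randn_like(sigma)      # auxiliary noise with an independent marginal
    return mu + sigma * eps
```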

We have discussed the whole workflow of reconstructing the source face. In the target face domain, the reconstruction procedure is the same, as shown by the orange arrows in Figure 1.

During training, the network learns structure and appearance information in both the source and the target domains. It is noteworthy that even if both y_t and x' belong to arbitrary identities, our disentangled module is capable of learning meaningful structure and appearance information for each identity. During inference, we concatenate the appearance prior of x' and the structure prior of y_t (small cubes in red and orange) in the latent space, and the reconstructed face d_t shares the same structure as y_t while keeping the appearance of x'. Our framework allows concatenations of structure and appearance latent codes extracted from arbitrary identities in inference, and thus permits many-to-many face reenactment.

In summary, DF-VAE is a new conditional variational auto-encoder [15] with robustness and scalability. It conditions on two posteriors in different domains. In the disentangled module, the separated design of the two encoders E_α and E_β, the explicit structure heatmap, and the unpaired data construction jointly force E_α to learn structure information and E_β to learn appearance information.

A.2. Derivation

The core equations of DF-VAE are Eq. 4, Eq. 5, and Eq. 6. We provide their detailed mathematical derivations in this section. The earlier equations, Eq. 1, Eq. 2, and Eq. 3, have very similar derivations, so we do not repeat them.

Derivation of Eq. 4 and Eq. 5:

  \log p_\theta(x_t \mid \hat{x}_t, x')
  = \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[\log p_\theta(x_t \mid \hat{x}_t, x')\right]
  = \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[\log \frac{p_\theta(x_t, z_x \mid \hat{x}_t, x')}{p_\theta(z_x \mid x_t, \hat{x}_t, x')}\right]
  = \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[\log \left(\frac{q_\phi(z_x \mid x_t, \hat{x}_t, x')}{p_\theta(z_x \mid x_t, \hat{x}_t, x')} \cdot \frac{p_\theta(x_t, z_x \mid \hat{x}_t, x')}{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\right)\right]
  = \int q_\phi(z_x \mid x_t, \hat{x}_t, x') \left[\log \frac{q_\phi(z_x \mid x_t, \hat{x}_t, x')}{p_\theta(z_x \mid x_t, \hat{x}_t, x')} + \log \frac{p_\theta(x_t, z_x \mid \hat{x}_t, x')}{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\right] dz_x
  = D_{KL}(q_\phi(z_x \mid x_t, \hat{x}_t, x') \,\|\, p_\theta(z_x \mid x_t, \hat{x}_t, x')) + \mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x'),

where

  \mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x') = \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[\log \frac{p_\theta(x_t, z_x \mid \hat{x}_t, x')}{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\right],
  D_{KL}(q_\phi(z_x \mid x_t, \hat{x}_t, x') \,\|\, p_\theta(z_x \mid x_t, \hat{x}_t, x')) \ge 0.

Derivation of Eq. 6:

  \mathcal{L}(\theta, \phi; x_t, \hat{x}_t, x')
  = \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[\log \frac{p_\theta(x_t, z_x \mid \hat{x}_t, x')}{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\right]
  = \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[\log \frac{p_\theta(x_t \mid z_x, \hat{x}_t, x')\, p_\theta(z_x \mid \hat{x}_t, x')}{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\right]
  = \int q_\phi(z_x \mid x_t, \hat{x}_t, x') \left[-\log \frac{q_\phi(z_x \mid x_t, \hat{x}_t, x')}{p_\theta(z_x \mid \hat{x}_t, x')} + \log p_\theta(x_t \mid z_x, \hat{x}_t, x')\right] dz_x
  = -D_{KL}(q_\phi(z_x \mid x_t, \hat{x}_t, x') \,\|\, p_\theta(z_x \mid \hat{x}_t, x')) + \mathbb{E}_{q_\phi(z_x \mid x_t, \hat{x}_t, x')}\left[\log p_\theta(x_t \mid z_x, \hat{x}_t, x')\right].

A.3. Objective

Reconstruction loss. In the reconstruction, the source face and target face share the same forms of loss functions. The reconstruction loss of the source face, L_recon_x, can be written as:

  \mathcal{L}_{recon_x} = \lambda_{r1} \mathcal{L}_{pixel}(x, \tilde{x}) + \lambda_{r2} \mathcal{L}_{ssim}(x, \tilde{x}).   (8)

L_pixel indicates the pixel loss. It calculates the Mean Absolute Error (MAE) after reconstruction, which can be written as:

  \mathcal{L}_{pixel}(x, \tilde{x}) = \frac{1}{CHW} \|x - \tilde{x}\|_1.   (9)

L_ssim denotes the SSIM loss. It computes the Structural Similarity (SSIM) of the reconstructed face and the original face, which has the form:

  \mathcal{L}_{ssim}(x, \tilde{x}) = \frac{(2\mu_x \mu_{\tilde{x}} + C_1)(2\sigma_{x\tilde{x}} + C_2)}{(\mu_x^2 + \mu_{\tilde{x}}^2 + C_1)(\sigma_x^2 + \sigma_{\tilde{x}}^2 + C_2)}.   (10)

λ_r1 and λ_r2 are two hyperparameters that control the weights of the two parts of the reconstruction loss. For the target face, we have a similar form of reconstruction loss:

  \mathcal{L}_{recon_y} = \lambda_{r1} \mathcal{L}_{pixel}(y, \tilde{y}) + \lambda_{r2} \mathcal{L}_{ssim}(y, \tilde{y}).   (11)

Thus, the full reconstruction loss can be written as:

  \mathcal{L}_{recon} = \mathcal{L}_{recon_x} + \mathcal{L}_{recon_y}.   (12)
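A hedged sketch of Eq. 8-12 in PyTorch follows. The SSIM term is computed globally over each image for brevity (the paper does not specify a window scheme), it is turned into a loss as 1 − SSIM, and the weights λ_r1, λ_r2 are placeholders.

```python
import torch

def pixel_loss(x, x_rec):
    """Eq. 9: mean absolute error; torch's 'mean' reduction normalizes by C*H*W."""
    return torch.mean(torch.abs(x - x_rec))

def ssim_loss(x, x_rec, c1=0.01 ** 2, c2=0.03 ** 2):
    """Eq. 10 evaluated globally per image in [0, 1] (a simplification of windowed SSIM)."""
    mu_x, mu_r = x.mean(dim=[1, 2, 3]), x_rec.mean(dim=[1, 2, 3])
    var_x = x.var(dim=[1, 2, 3], unbiased=False)
    var_r = x_rec.var(dim=[1, 2, 3], unbiased=False)
    cov = ((x - mu_x[:, None, None, None]) * (x_rec - mu_r[:, None, None, None])).mean(dim=[1, 2, 3])
    ssim = ((2 * mu_x * mu_r + c1) * (2 * cov + c2)) / ((mu_x ** 2 + mu_r ** 2 + c1) * (var_x + var_r + c2))
    return (1.0 - ssim).mean()  # the paper defines L_ssim as the similarity; here it is used as 1 - SSIM

def recon_loss(x, x_rec, lambda_r1=1.0, lambda_r2=1.0):
    """Eq. 8 / Eq. 11; Eq. 12 sums this term over the source and target domains."""
    return lambda_r1 * pixel_loss(x, x_rec) + lambda_r2 * ssim_loss(x, x_rec)
```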

KL loss. Since DF-VAE is a new conditional variational auto-encoder, the reparameterization trick is utilized to make the sampling operation differentiable via an auxiliary variable with an independent marginal. We use the typical KL loss in [16], with the form:

  \mathcal{L}_{KL}(q_\phi(z) \,\|\, p_\theta(z)) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log (\sigma_j)^2 - (\mu_j)^2 - (\sigma_j)^2\right),   (13)

where J is the dimensionality of the latent prior z, and μ_j and σ_j are the j-th elements of the variational mean and standard deviation vectors, respectively.
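A corresponding sketch of Eq. 13, again assuming the encoders output μ and log σ²:

```python
import torch

def kl_loss(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Eq. 13: KL divergence of N(mu, sigma^2 I) from N(0, I), summed over the J latent dims."""
    kl_per_sample = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    return kl_per_sample.mean()  # average over the batch
```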

MAdaIN loss. The MAdaIN module is jointly trained with the disentangled module in an end-to-end manner. We apply an MAdaIN loss for this module in a form similar to [10], and use VGG-19 [20] to compute it when training Decoder D_δ:

  \mathcal{L}_{MAdaIN} = \mathcal{L}_c + \lambda_{ma} \mathcal{L}_s.   (14)

L_c denotes the content loss, which is the Euclidean distance between the target features and the features of the swapped face. L_c has the form:

  \mathcal{L}_c = \|o - c\|_2,   (15)

where o = m_t^b · d̃_t is the masked swapped face produced by Decoder D_δ, c = m_t^b · d_t is the masked reenacted face, and m_t^b is the blurred mask described in the main paper.

L_s represents the style loss, which matches the mean and standard deviation of the style features. Like [10], we match the IN [22] statistics instead of using a Gram matrix loss, which produces similar results. L_s can be written as:

  \mathcal{L}_s = \sum_{i=1}^{L} \|\mu(\Phi_i(o)) - \mu(\Phi_i(s))\|_2 + \sum_{i=1}^{L} \|\sigma(\Phi_i(o)) - \sigma(\Phi_i(s))\|_2,   (16)

where o = m_t^b · d̃_t, s = m_t^b · y_t, m_t^b is the blurred mask, and Φ_i denotes a layer of VGG-19 [20]. Similar to [10], we use the relu1_1, relu2_1, relu3_1, and relu4_1 layers with equal weights. λ_ma is the weight of the style loss, balancing the two parts of the MAdaIN loss.
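The statistics-matching style term of Eq. 16 can be sketched as follows; `vgg_features` stands for a function returning the relu1_1 to relu4_1 activations of a VGG-19 and is assumed to be provided elsewhere.

```python
import torch
from typing import Callable, List

def channel_stats(feat: torch.Tensor, eps: float = 1e-5):
    """Per-channel mean and std over spatial positions, as in instance normalization."""
    mean = feat.mean(dim=[2, 3])
    std = (feat.var(dim=[2, 3]) + eps).sqrt()
    return mean, std

def style_loss(o: torch.Tensor, s: torch.Tensor,
               vgg_features: Callable[[torch.Tensor], List[torch.Tensor]]) -> torch.Tensor:
    """Eq. 16: match IN statistics of the masked output o and the masked style target s per VGG layer."""
    loss = o.new_zeros(())
    for feat_o, feat_s in zip(vgg_features(o), vgg_features(s)):
        mu_o, sigma_o = channel_stats(feat_o)
        mu_s, sigma_s = channel_stats(feat_s)
        loss = loss + torch.norm(mu_o - mu_s, p=2) + torch.norm(sigma_o - sigma_s, p=2)
    return loss
```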

Temporal loss. We have given a detailed introduction of the temporal consistency constraint in our main paper. The form of the temporal loss is the same as the newly proposed temporal consistency constraint, so we do not repeat it here.

Total objective. DF-VAE is an end-to-end, many-to-many face swapping framework, and we jointly train all parts of the networks. The problem can be described as the optimization of the following total objective:

  \mathcal{L}_{total} = \lambda_1 \mathcal{L}_{recon} + \lambda_2 \mathcal{L}_{KL} + \lambda_3 \mathcal{L}_{MAdaIN} + \lambda_4 \mathcal{L}_{temporal},   (17)

where λ_1, λ_2, λ_3, λ_4 are the weight hyperparameters of the four types of loss functions introduced above.

A.4. Implementation Details

The whole DF-VAE framework is end-to-end. We use the pretrained stacked hourglass networks [17] to extract landmarks; the numbers of stacks and blocks are set to 4 and 1, respectively. We exploit the FlowNet 2.0 network [11] to estimate optical flows. The typical AdaIN network [10] is applied in our style matching and fusion module. The learning rate is set to 0.00005 for all parts of DF-VAE. We utilize Adam [14] and set β_1 = 0.5, β_2 = 0.999. All the experiments are conducted on NVIDIA Tesla V100 GPUs.
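Putting Section A.3 and these settings together, a minimal sketch of the optimization setup is shown below; the stand-in model and the loss weights λ_1 to λ_4 are placeholders, since their exact values are not reported.

```python
import torch

model = torch.nn.Linear(4, 4)  # stand-in for the full set of DF-VAE networks
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.5, 0.999))  # values from Section A.4

# Assumed loss weights; the exact lambda_1..lambda_4 values are not reported in the paper.
lambda_1 = lambda_2 = lambda_3 = lambda_4 = 1.0

def total_loss(l_recon, l_kl, l_madain, l_temporal):
    """Eq. 17: weighted sum of the four loss terms."""
    return lambda_1 * l_recon + lambda_2 * l_kl + lambda_3 * l_madain + lambda_4 * l_temporal
```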

A.5. User Study of Methods

In addition to the user study based on datasets that examines the quality of the DeeperForensics-1.0 dataset, we also carry out a user study to compare DF-VAE with state-of-the-art face manipulation methods. We present this user study of methods in this section.

Baselines. We choose three learning-based, open-source methods as our baselines: DeepFakes [1], faceswap-GAN [2], and ReenactGAN [25]. These three methods are representative and are based on different architectures. DeepFakes [1] is a well-known method based on Auto-Encoders (AE); it uses a shared encoder and two separate decoders to perform face swapping. faceswap-GAN [2] is based on Generative Adversarial Networks (GAN) [6]; it has a structure similar to DeepFakes [1] but also uses paired discriminators to improve face swapping quality. ReenactGAN [25] makes a boundary latent space assumption and uses a transformer to adapt the boundary of the source face to that of the target face; as a result, ReenactGAN can perform many-to-one face reenactment. After getting the reenacted faces, we use our carefully designed fusion method to obtain the swapped faces. For a fair comparison, DF-VAE utilizes the same fusion method when compared to ReenactGAN [25].

Results. We randomly choose 30 real videos from DeeperForensics-1.0 as the source videos and 30 real videos from FaceForensics++ [18] as the target videos; thus, each method generates 30 fake videos. As in the user study based on datasets, we conduct the user study based on methods among 100 professional participants who specialize in computer vision research. Because there are corresponding fake videos, we let the users directly choose their preferred fake video between the one generated by another method and the one generated by DF-VAE. In total, we got 3210 answers for each compared pair. The results are shown in Figure 2. DF-VAE shows an impressive advantage over the baselines, underscoring the high quality of DF-VAE-generated fake videos.

Figure 2. Results of the user study comparing methods. The bar charts show the number of users who prefer each video in every compared pair of manipulated videos (DeepFakes vs. Ours: 319 vs. 2891; faceswap-GAN vs. Ours: 390 vs. 2820; ReenactGAN vs. Ours: 72 vs. 3138).

A.6. Quantitative Evaluation Metrics

Fréchet Inception Distance (FID) [8] is a widely used metric for generative models. FID evaluates the similarity between the distribution of the generated images and that of the real images, and it correlates well with the visual quality of the generated samples. A lower FID means better quality.

Inception Score (IS) [19] is an early and widely adopted objective evaluation metric for generated images. IS evaluates two aspects of generation quality: articulation and diversity. A higher IS means better quality.

Table 1 shows the FID and IS scores of our method compared to the other methods. DF-VAE outperforms all three baselines in the quantitative evaluations by FID and IS.
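For reference, a sketch of the FID computation from pre-extracted Inception activations (the feature-extraction step itself is assumed to be available and is not shown):

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(act_real: np.ndarray, act_fake: np.ndarray) -> float:
    """FID between two sets of Inception activations, each of shape (N, D)."""
    mu_r, mu_f = act_real.mean(axis=0), act_fake.mean(axis=0)
    cov_r = np.cov(act_real, rowvar=False)
    cov_f = np.cov(act_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r.dot(cov_f), disp=False)
    if np.iscomplexobj(covmean):      # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff.dot(diff) + np.trace(cov_r + cov_f - 2.0 * covmean))
```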

Figure 3. Many-to-many (three-to-three) face swapping by a single model, with an obvious reduction of style mismatch problems. This figure shows the results between three source identities and three target identities. The whole process is end-to-end.

Table 1. The FID and IS scores of DeepFakes [1], faceswap-GAN [2], ReenactGAN [25], and DF-VAE (Ours).

Method              FID       IS
DeepFakes [1]       257.71    1.711
faceswap-GAN [2]    247.18    1.685
ReenactGAN [25]     263.25    1.690
DF-VAE (Ours)       220.97    1.714

A.7. Many-to-Many Face Swapping

With a single model, DF-VAE can perform many-to-many face swapping with an obvious reduction of style mismatch and facial boundary artifacts (see Figure 3 for face swapping between three source identities and three target identities). Even if there are multiple identities in both the source domain and the target domain, the quality of face swapping does not degrade.

A.8. Ablation Studies

Ablation study of temporal loss. Since the swapped faces have no ground truth, we evaluate the effectiveness of the temporal consistency constraint, i.e., the temporal loss, in a self-reenactment setting. Similar to [13], we quantify the re-rendering error by the per-pixel Euclidean distance in RGB channels ([0, 255]). Visualized results are shown in Figure 4. Without the temporal loss, the re-rendering error is higher, demonstrating the effectiveness of the temporal consistency constraint.

Ablation study of different components. We conduct further ablation studies w.r.t. different components of our DF-VAE framework under the many-to-many face swapping setting (see Figure 5). The source and target faces are shown in Column 1 and Column 2. In Column 3, our full method, DF-VAE, shows high-fidelity face swapping results. In Column 4, style mismatch problems are very obvious if we remove the MAdaIN module. If we remove the hourglass (structure extraction) module, the disentanglement of structure and appearance is not thorough, and the swapped face becomes a mixture of multiple identities, as shown in Column 5. When we perform face swapping without constructing unpaired data in the same domain (see Column 6), the disentangled module completely reconstructs the faces on the side of E_β; thus, the disentanglement is not established at all. Therefore, the quality of face swapping degrades if we remove any component of the DF-VAE framework.

B. Supplementary Information of Benchmark

In this section, we provide supplementary information on our real-world face forgery detection benchmark. First, we elaborate on the basic information of the five baselines in our benchmark in Section B.1. Then, we provide evaluation results of face forgery detection model performance w.r.t. dataset size in Section B.2.

Table 2. The binary detection accuracy (%) of the baseline methods on the DeeperForensics-1.0 standard test set (std) and hidden test set (hidden), when trained on the standard training set with different dataset sizes. FS denotes the full dataset size of the standard training set reported in the main paper. FS ÷ x indicates the reduction of the dataset size to 1/x of the full size.

Method          C3D [21]        TSN [23]        I3D [4]         ResNet+LSTM [7, 9]  XceptionNet [5]
Test (acc)      std     hidden  std     hidden  std     hidden  std     hidden      std     hidden
FS (Full Size)  98.50   74.75   99.25   77.00   100.00  79.25   100.00  78.25       100.00  77.00
FS ÷ 2          97.50   75.00   98.75   78.50   100.00  78.13   100.00  78.88       100.00  76.75
FS ÷ 4          96.00   75.25   99.25   76.50   99.50   77.00   100.00  78.25       100.00  76.50
FS ÷ 8          92.25   72.13   91.25   70.88   95.00   73.50   98.25   75.63       100.00  76.13
FS ÷ 16         84.25   66.88   84.75   69.50   95.25   74.25   98.50   76.25       100.00  74.88
FS ÷ 32         62.50   54.50   68.00   58.25   80.50   67.88   95.50   74.13       95.00   71.38
FS ÷ 64         61.25   53.88   52.75   50.00   62.00   57.25   90.75   70.25       88.50   67.75

B.1. Details of Benchmark Baselines

We elaborate on the five baselines used in our face forgery detection benchmark in this section. Our benchmark contains four video-level face forgery detection methods: C3D [21], Temporal Segment Networks (TSN) [23], Inflated 3D ConvNet (I3D) [4], and ResNet+LSTM [7, 9]. One image-level detection method, XceptionNet [5], which achieves the best performance in FaceForensics++ [18], is evaluated as well.

• C3D [21] is a simple but effective method that incorporates 3D convolution to capture the spatiotemporal features of videos. It includes 8 convolutional, 5 max-pooling, and 2 fully connected layers. The size of the 3D convolutional kernels is 3 × 3 × 3. When training C3D, the videos are divided into non-overlapping 16-frame clips, and the original face images are resized to 112 × 112.

• TSN [23] is a 2D convolutional network that splits the video into short segments and randomly selects a snippet from each segment as the input. Long-range temporal structure modeling is achieved by fusing the class scores corresponding to these snippets. In our experiment, we choose BN-Inception [12] as the backbone and only train our model on the RGB stream. The number of segments is set to 3 by default, and the original images are resized to 224 × 224.

• I3D [4] is derived from Inception-V1 [12]. It inflates the 2D ConvNet by endowing the filters and pooling kernels with an additional temporal dimension. In training, we use 64-frame snippets as the input, whose starting frames are randomly selected from the videos. The face images are resized to 224 × 224.

• ResNet+LSTM [7, 9] is based on the ResNet [7] architecture. As a 2D convolutional framework, ResNet [7] is used to extract spatial features (the output of the last convolutional layer) for each face image. To encode the temporal dependency between images, we place an LSTM [9] module with 512 hidden units after ResNet-50 [7] to aggregate the spatial features, and an additional fully connected layer serves as the classifier (see the sketch after this list). All the videos are downsampled with a ratio of 5, and the images are resized to 224 × 224 before being fed into the network. During training, the loss is the summation of the binary cross-entropy on the outputs at all time steps, while only the output of the last frame is used for the final classification in inference.

• XceptionNet [5] is a CNN based on depthwise separable convolutions, which has been used in [18] for image-level face forgery detection. We exploit the same XceptionNet model as [18], but without freezing the weights of any layer during training. The face images are resized to 299 × 299. In the test phase, the prediction is made by averaging the classification scores of all frames within a video.
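As referenced in the ResNet+LSTM item above, the following is a minimal sketch of that baseline; the classifier head and weight-initialization details are assumptions based on the description rather than the benchmark code.

```python
import torch
import torch.nn as nn
import torchvision

class ResNetLSTM(nn.Module):
    """ResNet-50 spatial features per frame, aggregated by an LSTM with 512 hidden units."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled spatial features
        self.backbone = backbone
        self.lstm = nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 3, 224, 224)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))   # (b*t, 2048)
        feats = feats.view(b, t, -1)
        outputs, _ = self.lstm(feats)                # (b, t, 512)
        return self.classifier(outputs[:, -1])       # last-time-step prediction, as used at inference

model = ResNetLSTM()
logits = model(torch.rand(2, 8, 3, 224, 224))        # two clips of eight frames each
```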

B.2. Evaluation of Dataset Size

We include additional evaluation results of the face forgery detection methods w.r.t. dataset size, as shown in Table 2. The experiments in this setting are conducted on DeeperForensics-1.0. All the models are trained on the standard training set with different dataset sizes.

The reduction of dataset size hurts the accuracy of all the baselines on both the standard test set (std) and the hidden test set (hidden). This indicates that a larger dataset size is helpful for detection model performance. It is noteworthy that the accuracy drops little if we do not reduce the dataset size significantly. These observations concerning dataset size are in line with those of [18, 24].

In contrast to previous works, we also find that the image-level detection method is less sensitive to dataset size (i.e., has a stronger ability to retain high accuracy) than the pure video-level methods. One possible explanation is that despite a significant reduction of video samples, there still exist enough image samples (frames in total) for the image-level method to achieve good performance.

C. More Examples of Data Collection

In this section, we show more examples of our extensive source video data collection (see Figure 6). Our high-quality collected data vary in identities, poses, expressions, emotions, lighting conditions, and 3DMM blendshapes [3]. The source videos will also be released for further research.

D. Perturbations

We also show some examples of the perturbations in DeeperForensics-1.0. Seven types of perturbations, as well as mixtures of two (Gaussian blur, JPEG compression), three (Gaussian blur, JPEG compression, white Gaussian noise in color components), and four (Gaussian blur, JPEG compression, white Gaussian noise in color components, change of color saturation) perturbations, are shown in Figure 7. These perturbations are common distortions in real-world settings. The comprehensiveness of the perturbations in DeeperForensics-1.0 ensures its diversity, better simulating fake videos in real-world scenarios.
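As an illustration of how such mixed perturbations can be reproduced, the snippet below applies Gaussian blur followed by JPEG compression with OpenCV; the kernel size and quality factor are arbitrary choices, not the dataset's actual distortion parameters.

```python
import cv2
import numpy as np

def blur_and_jpeg(image_bgr: np.ndarray, ksize: int = 7, jpeg_quality: int = 30) -> np.ndarray:
    """Apply Gaussian blur, then JPEG compression (a 'mixed (2)'-style perturbation)."""
    blurred = cv2.GaussianBlur(image_bgr, (ksize, ksize), 0)
    ok, buf = cv2.imencode('.jpg', blurred, [int(cv2.IMWRITE_JPEG_QUALITY), jpeg_quality])
    assert ok, "JPEG encoding failed"
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

frame = (np.random.rand(256, 256, 3) * 255).astype(np.uint8)  # stand-in for a video frame
perturbed = blur_and_jpeg(frame)
```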

References

[1] Deepfakes. https://github.com/deepfakes/faceswap. Accessed 2019-08-16.
[2] faceswap-GAN. https://github.com/shaoanlu/faceswap-GAN. Accessed 2019-08-16.
[3] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. TVCG, 20:413-425, 2013.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[5] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[9] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735-1780, 1997.
[10] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
[11] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[13] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Perez, Christian Richardt, Michael Zollhofer, and Christian Theobalt. Deep video portraits. ACM TOG, 37:163, 2018.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NeurIPS, 2014.
[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[17] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[18] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. FaceForensics++: Learning to detect manipulated facial images. arXiv preprint arXiv:1901.08971, 2019.
[19] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[22] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[23] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[24] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. arXiv preprint arXiv:1912.11035, 2019.
[25] Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy. ReenactGAN: Learning to reenact faces via boundary transfer. In ECCV, 2018.

Figure 4. The quantitative evaluation of the effectiveness of the temporal loss. Similar to [13], we use the re-rendering error in a self-reenactment setting, where the ground truth (GT) is known. Panels: GT, the output and error map without the temporal loss, and the output and error map with the temporal loss. The error maps show the per-pixel Euclidean distance in RGB channels ([0, 255]). The mean errors are shown above the images, and the corresponding color scale (0 to 30) w.r.t. error values is shown on the right side of the images.

Figure 5. Ablation studies of different components of the DF-VAE framework in the many-to-many face swapping setting. Column 1 and Column 2 show the source face and the target face, respectively. Column 3 shows the results of the full method. Columns 4, 5, and 6 show the results when removing the MAdaIN module, the hourglass (structure extraction) module, and the unpaired data construction, respectively.

Figure 6. More examples of the source video data collection. Our high-quality collected data vary in identities, poses, expressions, emotions, lighting conditions, and 3DMM blendshapes [3].

Figure 7. Seven types of perturbations and the mixtures of two (Gaussian blur, JPEG compression), three (Gaussian blur, JPEG compression, white Gaussian noise in color components), and four (Gaussian blur, JPEG compression, white Gaussian noise in color components, change of color saturation) perturbations in DeeperForensics-1.0. Panels: raw video, w/o perturbation, color saturation change, local block-wise distortion, color contrast change, Gaussian blur, white Gaussian noise in color components, JPEG compression, video compression rate change, and mixed (2) / (3) / (4) perturbations.

Page 2: DeeperForensics-1.0: A Large-Scale Dataset for Real-World ... · Real-World Face Forgery Detection Supplementary Material Liming Jiang1 Ren Li2 Wayne Wu1,2 Chen Qian2 Chen Change

Figure 1 The main framework of DeepFake Variational Auto-Encoder In training we reconstruct the source and target faces in blue andorange arrows respectively by extracting landmarks and constructing an unpaired sample as the condition Optical flow differences areminimized after reconstruction to improve temporal continuity In inference we swap the latent codes and get the reenacted face in greenarrows Subsequent MAdaIN module fuses the reenacted face and the original background resulting in the swapped face

simple yet effective approach to disentangle these two vari-ables

The blue arrows in Figure 1 show the reconstruction pro-cedure of the source face xt Instead of feeding a singlesource face xt we sample another source face xprime to con-struct unpaired data in the source domain To make thestructure representation more evident we use the stackedhourglass networks [17] to extract landmarks of xt in thestructure extraction module and get the heatmap xt Thenwe feed the heatmap xt to the Structure Encoder Eα and xprime

to the Appearance Encoder Eβ We concatenate the latentrepresentations (small cubes in red and green) and feed it tothe Decoder Dγ Finally we get the reconstructed face xtie marginal log-likelihood of xt

Therefore the latent structure representation sx in Eq 1becomes a more evident heatmap representation xt whichis introduced as a new condition The unpaired sample xprime

with the same identity wrt xt is another condition beinga substitute for ax Eq 1 can be rewritten as a conditionallog-likelihood

log pθ (xt|xt xprime) = DKL (qφ (zx|xt xt x

prime) 983042pθ (zx|xt xt xprime))

+L (θφxt xt xprime)

(4)

similarly

log pθ983043xt|xt x

prime983044 ge L(θφxt xt xprime)

= Eqφ(zx|xtxtxprime)983045minus log qφ

983043zx|xt xt x

prime983044+ log pθ983043xt zx|xt x

prime983044983046 (5)

and L(θφxt xt xprime) can also be written as

L (θφxt xt xprime) =minusDKL (qφ (zx|xt xt x

prime) 983042pθ (zx|xt xprime))

+ Eqφ(zx|xtxtxprime) [log pθ (xt|zx xt xprime)]

(6)

We let the variational approximate posterior be a multi-variate Gaussian with a diagonal covariance structure

log qφ (zx|xt xt xprime) equiv logN

983043zxmicroσ

2I983044 (7)

where I is an identity matrix Exploiting the reparam-eterization trick [16] the non-differentiable operation ofsampling can become differentiable by an auxiliary vari-able with independent marginal In this case zx simqφ (zx|xt xt x

prime) is implemented by zx = micro + σ983171 where983171 is an auxiliary noise variable 983171 sim N (0 1) Finally theapproximate posterior qφ(zx|xt xt x

prime) is estimated by theseparated encoders Structure Encoder Eα and AppearanceEncoder Eβ in an end-to-end training process by standardgradient descent

We discuss the whole workflow of reconstructing thesource face In the target face domain the reconstructionprocedure is the same as shown by orange arrows in Fig-ure 1

During training the network learns structure and appear-ance information in both the source and the target domainsIt is noteworthy that even if both yt and xprime belong to arbi-trary identities our effective disentangled module is capableof learning meaningful structure and appearance informa-tion of each identity During inference we concatenate the

appearance prior of xprime and the structure prior of yt (smallcubes in red and orange) in the latent space and the re-constructed face dt shares the same structure with yt andkeeps the appearance of xprime Our framework allows concate-nations of structure and appearance latent codes extractedfrom arbitrary identities in inference and permits many-to-many face reenactment

In summary DF-VAE is a new conditional variationalauto-encoder [15] with robustness and scalability It condi-tions on two posteriors in different domains In the disen-tangled module the separated design of two encoders Eα

and Eβ the explicit structure heatmap and the unpaireddata construction jointly force Eα to learn structure infor-mation and Eβ to learn appearance information

A2 DerivationThe core equations of DF-VAE are Eq 4 Eq 5 and

Eq 6 We will provide the detailed mathematical deriva-tions in this section The previous three equations Eq 1Eq 2 and Eq 3 have very similar derivations therefore wewill not be repeating them

Derivation of Eq 4 and Eq 5

log pθ

983059xt|xt x

prime983060

= Eqφ

983043zx|xtxtx

prime983044983059log pθ

983059xt|xt x

prime983060983060

= Eqφ

983043zx|xtxtx

prime983044983093

983095logpθ

983059xt zx|xt x

prime983060

pθ983043zx|xt xt x

prime983044

983094

983096

= Eqφ

983043zx|xtxtx

prime983044983093

983095logqφ

983059zx|xt xt x

prime983060

pθ983043zx|xt xt x

prime983044middot

983059xt zx|xt x

prime983060

qφ983043zx|xt xt x

prime983044

983094

983096

=

983133qφ

983059zx|xt xt x

prime983060983093

983095logqφ

983059zx|xt xt x

prime983060

pθ983043zx|xt xt x

prime983044+ log

983059xt zx|xt x

prime983060

qφ983043zx|xt xt x

prime983044

983094

983096 dzx

= DKL

983059qφ

983059zx|xt xt x

prime983060 983042pθ983059zx|xt xt x

prime983060983060+ L

983059θφ xt xt x

prime983060

where

L983043θφ xt xt x

prime983044= Eqφ(zx|xtxtx

prime)

983077log

983043xt zx|xt x

prime983044

qφ (zx|xt xt xprime)

983078

DKL

983043qφ

983043zx|xt xt x

prime983044 983042pθ

983043zx|xt xt x

prime983044983044 ge 0

Derivation of Eq 6

L983059θφ xt xt x

prime983060

= Eqφ

983043zx|xtxtx

prime983044983093

983095logpθ

983059xt zx|xt x

prime983060

qφ983043zx|xt xt x

prime983044

983094

983096

= Eqφ

983043zx|xtxtx

prime983044983093

983095logpθ

983059xt|zx xt x

prime983060pθ

983059zx|xt x

prime983060

qφ983043zx|xt xt x

prime983044

983094

983096

=

983133qφ

983059zx|xt xt x

prime983060983093

983095minus logqφ

983059zx|xt xt x

prime983060

pθ983043zx|xt x

prime983044+ log pθ

983059xt|zx xt x

prime983060983094

983096 dzx

= minusDKL

983059qφ

983059zx|xt xt x

prime983060 983042pθ983059zx|xt x

prime983060983060

+ Eqφ

983043zx|xtxtx

prime983044983147log pθ

983059xt|zx xt x

prime983060983148

A3 ObjectiveReconstruction loss In the reconstruction the source faceand target face share the same forms of loss functions Thereconstruction loss of the source face Lreconx can be writ-ten as

Lreconx = λr1Lpixel (x x) + λr2Lssim (x x) (8)

Lpixel indicates pixel loss It calculates the Mean Abso-lute Error (MAE) after reconstruction which can be writtenas

Lpixel (x x) =1

CHW983042xminus x9830421 (9)

Lssim denotes ssim loss It computes the Structural Sim-ilarity (SSIM) of the reconstructed face and the originalface which has the form of

Lssim (x x) =(2microxmicrox + C1) (2σxx + C2)

(micro2x + micro2

x + C1) (σ2x + σ2

x + C2) (10)

λr1 and λr2 are two hyperparameters that control theweights of two parts of the reconstruction loss For the tar-get face we have the similar form of reconstruction loss

Lrecony= λr1Lpixel (y y) + λr2Lssim (y y) (11)

Thus the full reconstruction loss can be written as

Lrecon = Lreconx+ Lrecony

(12)

KL loss Since DF-VAE is a new conditional variationalauto-encoder reparameterization trick is utilized to makethe sampling operation differentiable by an auxiliary vari-able with independent marginal We use the typical KL lossin [16] with the form of

LKL (qφ (z) pθ (z)) =1

2

J983131

j=1

9830431 + log

983043(σj)

2983044minus (microj)2 minus (σj)

2983044

(13)

where J is the dimensionality of the latent prior z microj andσj are the j-th element of variational mean and sd vectorsrespectivelyMAdaIN loss The MAdaIN module is jointly trained withthe disentangled module in an end-to-end manner We applyMAdaIN loss for this module in a similar form as describedin [10] We use the VGG-19 [20] to compute MAdaIN lossto train Decoder Dδ

LMAdaIN = Lc + λmaLs (14)

Lc denotes the content loss which is the Euclidean dis-tance between the target features and the features of theswapped face Lc has the form of

Lc = 983042ominus c9830422 (15)

where o = mbt middot dt c = mb

t middot dt mbt is the blurred mask

described in the main paper

Ls represents the style loss which matches the meanand standard deviation of the style features Like [10] wematch the IN [22] statistics instead of using Gram matrixloss which can produce similar results Ls can be writtenas

Ls =

L983131

i=1

983042micro(Φi(o))minus micro(Φi(s))9830422

+

L983131

i=1

983042σ(Φi(o))minus σ(Φi(s))9830422

(16)

where o = mbt middot dt s = mb

t middot yt mbt is the blurred mask Φi

denotes the layer used in VGG-19 [20] Similar to [10] weuse relu1 1 relu2 1 relu3 1 relu4 1 layers withequal weights

λma is the weight of style loss to balance two parts ofMAdaIN lossTemporal loss We have given a detailed introduction oftemporal consistency constraint in our main paper Theform of temporal loss is the same as the newly proposedtemporal consistency constraint We will not repeat it hereTotal objective DF-VAE is an end-to-end many-to-manyface swapping framework We jointly train all parts of thenetworks The problem can be described as the optimizationof the following total objective

Ltotal = λ1Lrecon + λ2LKL + λ3LMAdaIN + λ4Ltemporal(17)

where λ1 λ2 λ3 λ4 are the weight hyperparameters of fourtypes of loss functions introduced above

A4 Implementation DetailsThe whole DF-VAE framework is end-to-end We use

the pretrained stacked hourglass networks [17] to extractlandmarks The numbers of stacks and blocks are set to4 and 1 respectively We exploit FlowNet 20 network [11]to estimate optical flows The typical AdaIN network [10]is applied to our style matching and fusion module Thelearning rate is set to 000005 for all parts of DF-VAE Weutilize Adam [14] and set β1 = 05 β2 = 0999 All theexperiments are conducted on NVIDIA Tesla V100 GPUs

A5 User Study of MethodsIn addition to user study based on datasets to examine

the quality of DeeperForensics-10 dataset we also carryout a user study to compare DF-VAE with state-of-the-artface manipulation methods We will present the user studyof methods in this sectionBaselines We choose three learning-based open-sourcemethods as our baselines DeepFakes [1] faceswap-GAN[2] and ReenactGAN [25] These three methods are repre-sentative which are based on different architectures Deep-Fakes [1] is a well-known method based on Auto-Encoders(AE) It uses a shared encoder and two separated decoders

DeepFakes DF-VAE (Ours)

DeepFakes vs Ours

DF-VAE (Ours)faceswap-GAN

faceswap-GAN vs Ours ReenactGAN vs Ours

ReenactGAN DF-VAE (Ours)

319 390

72

2891 2820

3138

Figure 2 Results of user study comparing methods The bar chartsshow the number of users who give preference in each comparedpair of manipulated videos

to perform face swapping faceswap-GAN [2] is based onGenerative Adversarial Networks (GAN) [6] which has asimilar structure as DeepFakes [1] but also uses a paired dis-criminators to improve face swapping quality ReenactGAN[25] makes a boundary latent space assumption and uses atransformer to adapt the boundary of source face to that oftarget face As a result ReenactGAN can perform many-to-one face reenactment After getting the reenacted faceswe use our carefully designed fusion method to obtain theswapped faces For a fair comparison DF-VAE utilizes thesame fusion method when compared to ReenactGAN [25]

Results We randomly choose 30 real videos fromDeeperForensics-10 as the source videos and 30 real videosfrom FaceForensics++ [18] as the target videos Thus eachmethod generates 30 fake videos Same as the user studybased on datasets we conduct the user study based on meth-ods among 100 professional participants who specialize incomputer vision research Because there are correspond-ing fake videos we let the users directly choose their pre-ferred fake videos between those generated by other meth-ods and those generated by DF-VAE Finally we got 3210answers for each compared pair The results are shown inFigure 2 We can see that DF-VAE shows an impressive ad-vantage over the baselines underscoring the high quality ofDF-VAE-generated fake videos

A6 Quantitative Evaluation Metrics

Frechet Inception Distance (FID) [8] is a widely exploitedmetric for generative models FID evaluates the similarityof distribution between the generated images and the realimages FID correlates well with the visual quality of thegenerated samples A lower value of FID means a betterquality

Inception Score (IS) [19] is an early and somewhat widelyadopted objective evaluation metric for generated imagesIS evaluates two aspects of generation quality articulationand diversity A higher value of IS means a better quality

Table 1 shows the FID and IS scores of our method com-pared to other methods DF-VAE outperforms all the threebaselines in quantitative evaluations by FID and IS

1313

1313

1313

Figure 3 Many-to-many (three-to-three) face swapping by a single model with obvious reduction of style mismatch problems This figureshows the results between three source identities and three target identities The whole process is end-to-end

Table 1 The FID and IS scores of DeepFakes [1] faceswap-GAN[2] ReenactGAN [25] and DF-VAE (Ours)

Method FID ISDeepFakes [1] 25771 1711

faceswap-GAN [2] 24718 1685ReenactGAN [25] 26325 1690DF-VAE (Ours) 22097 1714

A7 Many-to-Many Face SwappingBy a single model DF-VAE can perform many-to-many

face swapping with obvious reduction of style mismatchand facial boundary artifacts (see Figure 3 for the face swap-ping between three source identities and three target identi-ties) Even if there are multiple identities in both the sourcedomain and the target domain the quality of face swappingdoes not degrade

A8 Ablation StudiesAblation study of temporal loss Since the swapped facesdo not have the ground truth we evaluate the effectivenessof temporal consistency constraint ie temporal loss in aself-reenactment setting Similar to [13] we quantify the re-rendering error by Euclidean distance of per pixel in RGBchannels ([0 255]) Visualized results are shown in Fig-ure 4 Without the temporal loss the re-rendering error ishigher hence demonstrating the effectiveness of temporal

consistency constraintAblation study of different components We conduct fur-ther ablation studies wrt different components of our DF-VAE framework under many-to-many face swapping set-ting (see Figure 5) The source and target faces are shownin Column 1 and Column 2 In Column 3 our full methodDF-VAE shows high-fidelity face swapping results In Col-umn 4 style mismatch problems are very obvious if weremove the MAdaIN module If we remove the hourglass(structure extraction) module the disentanglement of struc-ture and appearance is not very thorough The swapped facewill be a mixture of multiple identities as shown in Column5 When we perform face swapping without constructingunpaired data in the same domain (see Column 6) the dis-entangled module will completely reconstruct the faces onthe side of Eβ thus the disentanglement is not establishedat all Therefore the quality of face swapping will degradeif we remove any component in DF-VAE framework

B Supplementary Information of Benchmark

In this section we will provide some supplementary in-formation of our real-world face forgery detection bench-mark First we will elaborate on the basic informationof five baselines in our benchmark in Section B1 Thenwe will provide evaluation results of face forgery detectionmodel performance wrt dataset size in Section B2

Table 2 The binary detection accuracy of the baseline methods on DeeperForensics-10 standard test set (std) hidden test set (hidden)when trained on the standard training set with different dataset sizes FS denotes the full dataset size of the standard training set reportedin the main paper FS divide x indicates the reduction of dataset size to 1x of the full dataset size

Method C3D [21] TSN [23] I3D [4] ResNet+LSTM [7 9] XceptionNet [5]Test (acc) std hidden std hidden std hidden std hidden std hidden

FS (Full Size) 9850 7475 9925 7700 10000 7925 10000 7825 10000 7700FS divide 2 9750 7500 9875 7850 10000 7813 10000 7888 10000 7675FS divide 4 9600 7525 9925 7650 9950 7700 10000 7825 10000 7650FS divide 8 9225 7213 9125 7088 9500 7350 9825 7563 10000 7613FS divide 16 8425 6688 8475 6950 9525 7425 9850 7625 10000 7488FS divide 32 6250 5450 6800 5825 8050 6788 9550 7413 9500 7138FS divide 64 6125 5388 5275 5000 6200 5725 9075 7025 8850 6775

B1 Details of Benchmark BaselinesWe elaborate on five baselines used in our face forgery

detection benchmark in this section Our benchmark con-tains four video-level face forgery detection methods C3D[21] Temporal Segment Networks (TSN) [23] Inflated 3DConvNet (I3D) [4] and ResNet+LSTM [7 9] One image-level detection method XceptionNet [5] which achievesthe best performance in FaceForensics++ [18] is evaluatedas well

bull C3D [21] is a simple but effective method which in-corporates 3D convolution to capture the spatiotempo-ral feature of videos It includes 8 convolutional 5max-pooling and 2 fully connected layers The size ofthe 3D convolutional kernels is 3times 3times 3 When train-ing C3D the videos are divided into non-overlappedclips with 16-frames length and the original face im-ages are resized to 112times 112

bull TSN [23] is a 2D convolutional network which splitsthe video into short segments and randomly selects asnippet from each segment as the input The long-range temporal structure modeling is achieved by thefusion of the class scores corresponding to these snip-pets In our experiment we choose BN-Inception [12]as the backbone and only train our model with theRGB stream The number of segments is set to 3 as de-fault and the original images are resized to 224times 224

bull I3D [4] is derived from Inception-V1 [12] It inflatesthe 2D ConvNet by endowing the filters and poolingkernels with an additional temporal dimension In thetraining we use 64-frame snippets as the input whosestarting frames are randomly selected from the videosThe face images are resized to 224times 224

bull ResNet+LSTM [7 9] is based on ResNet [7] architec-ture As a 2D convolutional framework ResNet [7] isused to extract spatial features (the output of the lastconvolutional layer) for each face image In order toencode the temporal dependency between images weplace an LSTM [9] module with 512 hidden units af-ter ResNet-50 [7] to aggregate the spatial features An

additional fully connected layer serves as the classi-fier All the videos are downsampled with a ratio of 5and the images are resized to 224times224 before feedinginto the network During training the loss is the sum-mation of the binary entropy on the output at all timesteps while only the output of the last frame is usedfor the final classification in inference

bull XceptionNet [5] is a depthwise-separable-convolutionbased CNN which has been used in [18] for image-level face forgery detection We exploit the sameXceptionNet model as [18] but without freezing theweights of any layer during training The face imagesare resized to 299times 299 In the test phase the predic-tion is made by averaging classification scores of allframes within a video

B2 Evaluation of Dataset SizeWe include additional evaluation results of face forgery

detection methods wrt dataset size as shown in Table 2The experiments are conducted on DeeperForensics-10 inthis setting All the models are trained on the standard train-ing set with different dataset sizes

The reduction of dataset size hurts the accuracy of all thebaselines on either standard test set (std) or hidden test set(hidden) This justifies that a larger dataset size can be help-ful for detection model performance It is noteworthy thataccuracy drops little if we do not reduce the dataset size sig-nificantly These observations concerning dataset size are inline with that of [18 24]

In contrast to previous works we also find that theimage-level detection method is less sensitive (ie has astronger ability to remain high accuracy) to dataset size thanpure video-level method One possible explanation for thisis that despite a significant reduction of video samples therestill exist enough image samples (frames in total) for theimage-level method to achieve good performance

C More Examples of Data CollectionIn this section we show more examples of our exten-

sive source video data collection (see Figure 6) Our high-

quality collected data vary in identities poses expressionsemotions lighting conditions and 3DMM blendshapes [3]The source videos will also be released for further research

D PerturbationsWe also show some examples of perturbations in

DeeperForensics-10 Seven types of perturbations and themixture of two (Gaussian blur JPEG compression) three(Gaussian blur JPEG compression white Gaussian noisein color components) four (Gaussian blur JPEG compres-sion white Gaussian noise in color components changeof color saturation) perturbations are shown in Figure 7These perturbations are very common distortions existingin real life The comprehensiveness of perturbations inDeeperForensics-10 ensures its diversity to better simulatefake videos in real-world scenarios

References

[1] Deepfakes. https://github.com/deepfakes/faceswap. Accessed: 2019-08-16.
[2] faceswap-GAN. https://github.com/shaoanlu/faceswap-GAN. Accessed: 2019-08-16.
[3] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. TVCG, 20:413–425, 2013.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[5] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[9] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[10] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
[11] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[13] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Perez, Christian Richardt, Michael Zollhofer, and Christian Theobalt. Deep video portraits. ACM TOG, 37:163, 2018.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NeurIPS, 2014.
[16] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[17] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[18] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. arXiv preprint arXiv:1901.08971, 2019.
[19] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[22] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[23] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[24] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. arXiv preprint arXiv:1912.11035, 2019.
[25] Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy. ReenactGAN: Learning to reenact faces via boundary transfer. In ECCV, 2018.

[Figure 4 panels: ground truth (GT), output and error map w/o temporal loss (mean errors 41.29 and 41.48), and output and error map w/ temporal loss (mean errors 40.24 and 39.67); the error color scale ranges from 0 to 30.]

Figure 4. Quantitative evaluation of the effectiveness of the temporal loss. Similar to [13], we use the re-rendering error in a self-reenactment setting where the ground truth is known. The error maps show the per-pixel Euclidean distance in RGB channels ([0, 255]). The mean errors are shown above the images. The corresponding color scale w.r.t. error values is shown on the right side of the images.
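As a small worked example of this error metric, the sketch below computes the per-pixel Euclidean distance in RGB and its mean for one frame; the function name and array shapes are assumptions for illustration.

import numpy as np

def rerendering_error(output, ground_truth):
    # output, ground_truth: uint8 arrays of shape (H, W, 3) in RGB, values in [0, 255].
    diff = output.astype(np.float32) - ground_truth.astype(np.float32)
    error_map = np.sqrt((diff ** 2).sum(axis=-1))  # per-pixel Euclidean distance, (H, W)
    return error_map, float(error_map.mean())      # error map and its mean error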

[Figure 5 column labels: Source, Target, Full DF-VAE, w/o MAdaIN, w/o hourglass, w/o unpaired.]

Figure 5. Ablation studies of different components of the DF-VAE framework in the many-to-many face swapping setting. Columns 1 and 2 show the source face and the target face, respectively. Column 3 shows the results of the full method. Columns 4, 5, and 6 show the results when removing the MAdaIN module, the hourglass (structure extraction) module, and the unpaired data construction, respectively.

Figure 6. More examples of the source video data collection. Our high-quality collected data vary in identities, poses, expressions, emotions, lighting conditions, and 3DMM blendshapes [3].

[Figure 7 panel labels: Raw Video, w/o Perturbation, Color Saturation Change, Local Block-Wise Distortion, Color Contrast Change, Gaussian Blur, White Gaussian Noise in Color Components, JPEG Compression, Video Compression Rate Change, Mixed (2) Perturbations, Mixed (3) Perturbations, Mixed (4) Perturbations.]

Figure 7. Seven types of perturbations and the mixtures of two (Gaussian blur, JPEG compression), three (Gaussian blur, JPEG compression, white Gaussian noise in color components), and four (Gaussian blur, JPEG compression, white Gaussian noise in color components, change of color saturation) perturbations in DeeperForensics-1.0.



bull XceptionNet [5] is a depthwise-separable-convolutionbased CNN which has been used in [18] for image-level face forgery detection We exploit the sameXceptionNet model as [18] but without freezing theweights of any layer during training The face imagesare resized to 299times 299 In the test phase the predic-tion is made by averaging classification scores of allframes within a video

B2 Evaluation of Dataset SizeWe include additional evaluation results of face forgery

detection methods wrt dataset size as shown in Table 2The experiments are conducted on DeeperForensics-10 inthis setting All the models are trained on the standard train-ing set with different dataset sizes

The reduction of dataset size hurts the accuracy of all thebaselines on either standard test set (std) or hidden test set(hidden) This justifies that a larger dataset size can be help-ful for detection model performance It is noteworthy thataccuracy drops little if we do not reduce the dataset size sig-nificantly These observations concerning dataset size are inline with that of [18 24]

In contrast to previous works we also find that theimage-level detection method is less sensitive (ie has astronger ability to remain high accuracy) to dataset size thanpure video-level method One possible explanation for thisis that despite a significant reduction of video samples therestill exist enough image samples (frames in total) for theimage-level method to achieve good performance

C More Examples of Data CollectionIn this section we show more examples of our exten-

sive source video data collection (see Figure 6) Our high-

quality collected data vary in identities poses expressionsemotions lighting conditions and 3DMM blendshapes [3]The source videos will also be released for further research

D PerturbationsWe also show some examples of perturbations in

DeeperForensics-10 Seven types of perturbations and themixture of two (Gaussian blur JPEG compression) three(Gaussian blur JPEG compression white Gaussian noisein color components) four (Gaussian blur JPEG compres-sion white Gaussian noise in color components changeof color saturation) perturbations are shown in Figure 7These perturbations are very common distortions existingin real life The comprehensiveness of perturbations inDeeperForensics-10 ensures its diversity to better simulatefake videos in real-world scenarios

References[1] Deepfakes httpsgithubcomdeepfakes

faceswap Accessed 2019-08-16 4 5[2] faceswap-gan httpsgithubcomshaoanlu

faceswap-GAN Accessed 2019-08-16 4 5[3] Chen Cao Yanlin Weng Shun Zhou Yiying Tong and Kun

Zhou Facewarehouse A 3d facial expression database forvisual computing TVCG 20413ndash425 2013 6 9

[4] Joao Carreira and Andrew Zisserman Quo vadis actionrecognition a new model and the kinetics dataset In CVPR2017 6

[5] Francois Chollet Xception Deep learning with depthwiseseparable convolutions In CVPR 2017 6

[6] Ian Goodfellow Jean Pouget-Abadie Mehdi Mirza BingXu David Warde-Farley Sherjil Ozair Aaron Courville andYoshua Bengio Generative adversarial nets In NeurIPS2014 4

[7] Kaiming He Xiangyu Zhang Shaoqing Ren and Jian SunDeep residual learning for image recognition In CVPR2016 6

[8] Martin Heusel Hubert Ramsauer Thomas UnterthinerBernhard Nessler and Sepp Hochreiter Gans trained by atwo time-scale update rule converge to a local nash equilib-rium In NeurIPS 2017 4

[9] Sepp Hochreiter and Jurgen Schmidhuber Long short-termmemory Neural computation 91735ndash1780 1997 6

[10] Xun Huang and Serge Belongie Arbitrary style transfer inreal-time with adaptive instance normalization In ICCV2017 3 4

[11] Eddy Ilg Nikolaus Mayer Tonmoy Saikia Margret KeuperAlexey Dosovitskiy and Thomas Brox Flownet 20 Evolu-

tion of optical flow estimation with deep networks In CVPR2017 4

[12] Sergey Ioffe and Christian Szegedy Batch normalizationAccelerating deep network training by reducing internal co-variate shift In ICML 2015 6

[13] Hyeongwoo Kim Pablo Carrido Ayush Tewari WeipengXu Justus Thies Matthias Niessner Patrick Perez ChristianRichardt Michael Zollhofer and Christian Theobalt Deepvideo portraits ACM TOG 37163 2018 5 8

[14] Diederik P Kingma and Jimmy Ba Adam A method forstochastic optimization arXiv preprint arXiv141269802014 4

[15] Durk P Kingma Shakir Mohamed Danilo Jimenez Rezendeand Max Welling Semi-supervised learning with deep gen-erative models In NeurIPS 2014 3

[16] Diederik P Kingma and Max Welling Auto-encoding vari-ational bayes arXiv preprint arXiv13126114 2013 1 23

[17] Alejandro Newell Kaiyu Yang and Jia Deng Stacked hour-glass networks for human pose estimation In ECCV 20162 4

[18] Andreas Rossler Davide Cozzolino Luisa Verdoliva Chris-tian Riess Justus Thies and Matthias Nieszligner Faceforen-sics++ Learning to detect manipulated facial images arXivpreprint arXiv190108971 2019 4 6

[19] Tim Salimans Ian Goodfellow Wojciech Zaremba VickiCheung Alec Radford and Xi Chen Improved techniquesfor training gans In NeurIPS 2016 4

[20] Karen Simonyan and Andrew Zisserman Very deep convo-lutional networks for large-scale image recognition arXivpreprint arXiv14091556 2014 3 4

[21] Du Tran Lubomir Bourdev Rob Fergus Lorenzo Torresaniand Manohar Paluri Learning spatiotemporal features with3d convolutional networks In ICCV 2015 6

[22] Dmitry Ulyanov Andrea Vedaldi and Victor Lempitsky In-stance normalization The missing ingredient for fast styliza-tion arXiv preprint arXiv160708022 2016 4

[23] Limin Wang Yuanjun Xiong Zhe Wang Yu Qiao DahuaLin Xiaoou Tang and Luc Van Gool Temporal segmentnetworks Towards good practices for deep action recogni-tion In ECCV 2016 6

[24] Sheng-Yu Wang Oliver Wang Richard Zhang AndrewOwens and Alexei A Efros Cnn-generated images aresurprisingly easy to spot for now arXiv preprintarXiv191211035 2019 6

[25] Wayne Wu Yunxuan Zhang Cheng Li Chen Qian and ChenChange Loy Reenactgan Learning to reenact faces viaboundary transfer In ECCV 2018 4 5

30

0

GTwo temporal loss w temporal loss

output error map output error map

mean error 4129mean error 4148

mean error 4024 mean error 3967

Figure 4 The quantitative evaluation of the effectiveness of temporal loss Similar to [13] we use the re-rendering error in a self-reenactment setting where the ground truth is known The error maps show Euclidean distance of per pixel in RGB channels ([0 255])The mean errors are shown above the images The corresponding color scale wrt error values is shown on the right side of the images

Source Target Full DF-VAE wo MAdaIN wo hourglass wo unpaired

Figure 5 The ablation studies of different components in DF-VAE framework in the many-to-many face swapping setting Column 1 andColumn 2 show the source face and the target face respectively Column 3 shows the results of the full method Column 4 5 6 show theresults when removing MAdaIN module hourglass (structure extraction) module and unpaired data construction respectively

Figure 6 More examples of the source video data collection Our high-quality collected data vary in identities poses expressionsemotions lighting conditions and 3DMM blendshapes [3]

wo Perturbation Color Saturation Change

Local Block-WiseDistortion

Color ContrastChange Gaussian Blur

White Gaussian Noise in Color Components

JPEG Compression Video Compression Rate Change Mixed (3) Perturbations

Raw Video

Mixed (2) Perturbations Mixed (4) Perturbations

Figure 7 Seven types of perturbations and the mixture of two (Gaussian blur JPEG compression) three (Gaussian blur JPEG compressionwhite Gaussian noise in color components) four (Gaussian blur JPEG compression white Gaussian noise in color components changeof color saturation) perturbations in DeeperForensics-10

Page 4: DeeperForensics-1.0: A Large-Scale Dataset for Real-World ... · Real-World Face Forgery Detection Supplementary Material Liming Jiang1 Ren Li2 Wayne Wu1,2 Chen Qian2 Chen Change

Ls represents the style loss which matches the meanand standard deviation of the style features Like [10] wematch the IN [22] statistics instead of using Gram matrixloss which can produce similar results Ls can be writtenas

Ls =

L983131

i=1

983042micro(Φi(o))minus micro(Φi(s))9830422

+

L983131

i=1

983042σ(Φi(o))minus σ(Φi(s))9830422

(16)

where o = mbt middot dt s = mb

t middot yt mbt is the blurred mask Φi

denotes the layer used in VGG-19 [20] Similar to [10] weuse relu1 1 relu2 1 relu3 1 relu4 1 layers withequal weights

λma is the weight of style loss to balance two parts ofMAdaIN lossTemporal loss We have given a detailed introduction oftemporal consistency constraint in our main paper Theform of temporal loss is the same as the newly proposedtemporal consistency constraint We will not repeat it hereTotal objective DF-VAE is an end-to-end many-to-manyface swapping framework We jointly train all parts of thenetworks The problem can be described as the optimizationof the following total objective

Ltotal = λ1Lrecon + λ2LKL + λ3LMAdaIN + λ4Ltemporal(17)

where λ1 λ2 λ3 λ4 are the weight hyperparameters of fourtypes of loss functions introduced above

A4 Implementation DetailsThe whole DF-VAE framework is end-to-end We use

the pretrained stacked hourglass networks [17] to extractlandmarks The numbers of stacks and blocks are set to4 and 1 respectively We exploit FlowNet 20 network [11]to estimate optical flows The typical AdaIN network [10]is applied to our style matching and fusion module Thelearning rate is set to 000005 for all parts of DF-VAE Weutilize Adam [14] and set β1 = 05 β2 = 0999 All theexperiments are conducted on NVIDIA Tesla V100 GPUs

A5 User Study of MethodsIn addition to user study based on datasets to examine

the quality of DeeperForensics-10 dataset we also carryout a user study to compare DF-VAE with state-of-the-artface manipulation methods We will present the user studyof methods in this sectionBaselines We choose three learning-based open-sourcemethods as our baselines DeepFakes [1] faceswap-GAN[2] and ReenactGAN [25] These three methods are repre-sentative which are based on different architectures Deep-Fakes [1] is a well-known method based on Auto-Encoders(AE) It uses a shared encoder and two separated decoders

DeepFakes DF-VAE (Ours)

DeepFakes vs Ours

DF-VAE (Ours)faceswap-GAN

faceswap-GAN vs Ours ReenactGAN vs Ours

ReenactGAN DF-VAE (Ours)

319 390

72

2891 2820

3138

Figure 2 Results of user study comparing methods The bar chartsshow the number of users who give preference in each comparedpair of manipulated videos

to perform face swapping faceswap-GAN [2] is based onGenerative Adversarial Networks (GAN) [6] which has asimilar structure as DeepFakes [1] but also uses a paired dis-criminators to improve face swapping quality ReenactGAN[25] makes a boundary latent space assumption and uses atransformer to adapt the boundary of source face to that oftarget face As a result ReenactGAN can perform many-to-one face reenactment After getting the reenacted faceswe use our carefully designed fusion method to obtain theswapped faces For a fair comparison DF-VAE utilizes thesame fusion method when compared to ReenactGAN [25]

Results We randomly choose 30 real videos fromDeeperForensics-10 as the source videos and 30 real videosfrom FaceForensics++ [18] as the target videos Thus eachmethod generates 30 fake videos Same as the user studybased on datasets we conduct the user study based on meth-ods among 100 professional participants who specialize incomputer vision research Because there are correspond-ing fake videos we let the users directly choose their pre-ferred fake videos between those generated by other meth-ods and those generated by DF-VAE Finally we got 3210answers for each compared pair The results are shown inFigure 2 We can see that DF-VAE shows an impressive ad-vantage over the baselines underscoring the high quality ofDF-VAE-generated fake videos

A6 Quantitative Evaluation Metrics

Frechet Inception Distance (FID) [8] is a widely exploitedmetric for generative models FID evaluates the similarityof distribution between the generated images and the realimages FID correlates well with the visual quality of thegenerated samples A lower value of FID means a betterquality

Inception Score (IS) [19] is an early and somewhat widelyadopted objective evaluation metric for generated imagesIS evaluates two aspects of generation quality articulationand diversity A higher value of IS means a better quality

Table 1 shows the FID and IS scores of our method com-pared to other methods DF-VAE outperforms all the threebaselines in quantitative evaluations by FID and IS

1313

1313

1313

Figure 3 Many-to-many (three-to-three) face swapping by a single model with obvious reduction of style mismatch problems This figureshows the results between three source identities and three target identities The whole process is end-to-end

Table 1 The FID and IS scores of DeepFakes [1] faceswap-GAN[2] ReenactGAN [25] and DF-VAE (Ours)

Method FID ISDeepFakes [1] 25771 1711

faceswap-GAN [2] 24718 1685ReenactGAN [25] 26325 1690DF-VAE (Ours) 22097 1714

A7 Many-to-Many Face SwappingBy a single model DF-VAE can perform many-to-many

face swapping with obvious reduction of style mismatchand facial boundary artifacts (see Figure 3 for the face swap-ping between three source identities and three target identi-ties) Even if there are multiple identities in both the sourcedomain and the target domain the quality of face swappingdoes not degrade

A8 Ablation StudiesAblation study of temporal loss Since the swapped facesdo not have the ground truth we evaluate the effectivenessof temporal consistency constraint ie temporal loss in aself-reenactment setting Similar to [13] we quantify the re-rendering error by Euclidean distance of per pixel in RGBchannels ([0 255]) Visualized results are shown in Fig-ure 4 Without the temporal loss the re-rendering error ishigher hence demonstrating the effectiveness of temporal

consistency constraintAblation study of different components We conduct fur-ther ablation studies wrt different components of our DF-VAE framework under many-to-many face swapping set-ting (see Figure 5) The source and target faces are shownin Column 1 and Column 2 In Column 3 our full methodDF-VAE shows high-fidelity face swapping results In Col-umn 4 style mismatch problems are very obvious if weremove the MAdaIN module If we remove the hourglass(structure extraction) module the disentanglement of struc-ture and appearance is not very thorough The swapped facewill be a mixture of multiple identities as shown in Column5 When we perform face swapping without constructingunpaired data in the same domain (see Column 6) the dis-entangled module will completely reconstruct the faces onthe side of Eβ thus the disentanglement is not establishedat all Therefore the quality of face swapping will degradeif we remove any component in DF-VAE framework

B Supplementary Information of Benchmark

In this section we will provide some supplementary in-formation of our real-world face forgery detection bench-mark First we will elaborate on the basic informationof five baselines in our benchmark in Section B1 Thenwe will provide evaluation results of face forgery detectionmodel performance wrt dataset size in Section B2

Table 2 The binary detection accuracy of the baseline methods on DeeperForensics-10 standard test set (std) hidden test set (hidden)when trained on the standard training set with different dataset sizes FS denotes the full dataset size of the standard training set reportedin the main paper FS divide x indicates the reduction of dataset size to 1x of the full dataset size

Method C3D [21] TSN [23] I3D [4] ResNet+LSTM [7 9] XceptionNet [5]Test (acc) std hidden std hidden std hidden std hidden std hidden

FS (Full Size) 9850 7475 9925 7700 10000 7925 10000 7825 10000 7700FS divide 2 9750 7500 9875 7850 10000 7813 10000 7888 10000 7675FS divide 4 9600 7525 9925 7650 9950 7700 10000 7825 10000 7650FS divide 8 9225 7213 9125 7088 9500 7350 9825 7563 10000 7613FS divide 16 8425 6688 8475 6950 9525 7425 9850 7625 10000 7488FS divide 32 6250 5450 6800 5825 8050 6788 9550 7413 9500 7138FS divide 64 6125 5388 5275 5000 6200 5725 9075 7025 8850 6775

B1 Details of Benchmark BaselinesWe elaborate on five baselines used in our face forgery

detection benchmark in this section Our benchmark con-tains four video-level face forgery detection methods C3D[21] Temporal Segment Networks (TSN) [23] Inflated 3DConvNet (I3D) [4] and ResNet+LSTM [7 9] One image-level detection method XceptionNet [5] which achievesthe best performance in FaceForensics++ [18] is evaluatedas well

bull C3D [21] is a simple but effective method which in-corporates 3D convolution to capture the spatiotempo-ral feature of videos It includes 8 convolutional 5max-pooling and 2 fully connected layers The size ofthe 3D convolutional kernels is 3times 3times 3 When train-ing C3D the videos are divided into non-overlappedclips with 16-frames length and the original face im-ages are resized to 112times 112

bull TSN [23] is a 2D convolutional network which splitsthe video into short segments and randomly selects asnippet from each segment as the input The long-range temporal structure modeling is achieved by thefusion of the class scores corresponding to these snip-pets In our experiment we choose BN-Inception [12]as the backbone and only train our model with theRGB stream The number of segments is set to 3 as de-fault and the original images are resized to 224times 224

bull I3D [4] is derived from Inception-V1 [12] It inflatesthe 2D ConvNet by endowing the filters and poolingkernels with an additional temporal dimension In thetraining we use 64-frame snippets as the input whosestarting frames are randomly selected from the videosThe face images are resized to 224times 224

bull ResNet+LSTM [7 9] is based on ResNet [7] architec-ture As a 2D convolutional framework ResNet [7] isused to extract spatial features (the output of the lastconvolutional layer) for each face image In order toencode the temporal dependency between images weplace an LSTM [9] module with 512 hidden units af-ter ResNet-50 [7] to aggregate the spatial features An

additional fully connected layer serves as the classi-fier All the videos are downsampled with a ratio of 5and the images are resized to 224times224 before feedinginto the network During training the loss is the sum-mation of the binary entropy on the output at all timesteps while only the output of the last frame is usedfor the final classification in inference

bull XceptionNet [5] is a depthwise-separable-convolutionbased CNN which has been used in [18] for image-level face forgery detection We exploit the sameXceptionNet model as [18] but without freezing theweights of any layer during training The face imagesare resized to 299times 299 In the test phase the predic-tion is made by averaging classification scores of allframes within a video

B2 Evaluation of Dataset SizeWe include additional evaluation results of face forgery

detection methods wrt dataset size as shown in Table 2The experiments are conducted on DeeperForensics-10 inthis setting All the models are trained on the standard train-ing set with different dataset sizes

The reduction of dataset size hurts the accuracy of all thebaselines on either standard test set (std) or hidden test set(hidden) This justifies that a larger dataset size can be help-ful for detection model performance It is noteworthy thataccuracy drops little if we do not reduce the dataset size sig-nificantly These observations concerning dataset size are inline with that of [18 24]

In contrast to previous works we also find that theimage-level detection method is less sensitive (ie has astronger ability to remain high accuracy) to dataset size thanpure video-level method One possible explanation for thisis that despite a significant reduction of video samples therestill exist enough image samples (frames in total) for theimage-level method to achieve good performance

C More Examples of Data CollectionIn this section we show more examples of our exten-

sive source video data collection (see Figure 6) Our high-

quality collected data vary in identities poses expressionsemotions lighting conditions and 3DMM blendshapes [3]The source videos will also be released for further research

D PerturbationsWe also show some examples of perturbations in

DeeperForensics-10 Seven types of perturbations and themixture of two (Gaussian blur JPEG compression) three(Gaussian blur JPEG compression white Gaussian noisein color components) four (Gaussian blur JPEG compres-sion white Gaussian noise in color components changeof color saturation) perturbations are shown in Figure 7These perturbations are very common distortions existingin real life The comprehensiveness of perturbations inDeeperForensics-10 ensures its diversity to better simulatefake videos in real-world scenarios

References[1] Deepfakes httpsgithubcomdeepfakes

faceswap Accessed 2019-08-16 4 5[2] faceswap-gan httpsgithubcomshaoanlu

faceswap-GAN Accessed 2019-08-16 4 5[3] Chen Cao Yanlin Weng Shun Zhou Yiying Tong and Kun

Zhou Facewarehouse A 3d facial expression database forvisual computing TVCG 20413ndash425 2013 6 9

[4] Joao Carreira and Andrew Zisserman Quo vadis actionrecognition a new model and the kinetics dataset In CVPR2017 6

[5] Francois Chollet Xception Deep learning with depthwiseseparable convolutions In CVPR 2017 6

[6] Ian Goodfellow Jean Pouget-Abadie Mehdi Mirza BingXu David Warde-Farley Sherjil Ozair Aaron Courville andYoshua Bengio Generative adversarial nets In NeurIPS2014 4

[7] Kaiming He Xiangyu Zhang Shaoqing Ren and Jian SunDeep residual learning for image recognition In CVPR2016 6

[8] Martin Heusel Hubert Ramsauer Thomas UnterthinerBernhard Nessler and Sepp Hochreiter Gans trained by atwo time-scale update rule converge to a local nash equilib-rium In NeurIPS 2017 4

[9] Sepp Hochreiter and Jurgen Schmidhuber Long short-termmemory Neural computation 91735ndash1780 1997 6

[10] Xun Huang and Serge Belongie Arbitrary style transfer inreal-time with adaptive instance normalization In ICCV2017 3 4

[11] Eddy Ilg Nikolaus Mayer Tonmoy Saikia Margret KeuperAlexey Dosovitskiy and Thomas Brox Flownet 20 Evolu-

tion of optical flow estimation with deep networks In CVPR2017 4

[12] Sergey Ioffe and Christian Szegedy Batch normalizationAccelerating deep network training by reducing internal co-variate shift In ICML 2015 6

[13] Hyeongwoo Kim Pablo Carrido Ayush Tewari WeipengXu Justus Thies Matthias Niessner Patrick Perez ChristianRichardt Michael Zollhofer and Christian Theobalt Deepvideo portraits ACM TOG 37163 2018 5 8

[14] Diederik P Kingma and Jimmy Ba Adam A method forstochastic optimization arXiv preprint arXiv141269802014 4

[15] Durk P Kingma Shakir Mohamed Danilo Jimenez Rezendeand Max Welling Semi-supervised learning with deep gen-erative models In NeurIPS 2014 3

[16] Diederik P Kingma and Max Welling Auto-encoding vari-ational bayes arXiv preprint arXiv13126114 2013 1 23

[17] Alejandro Newell Kaiyu Yang and Jia Deng Stacked hour-glass networks for human pose estimation In ECCV 20162 4

[18] Andreas Rossler Davide Cozzolino Luisa Verdoliva Chris-tian Riess Justus Thies and Matthias Nieszligner Faceforen-sics++ Learning to detect manipulated facial images arXivpreprint arXiv190108971 2019 4 6

[19] Tim Salimans Ian Goodfellow Wojciech Zaremba VickiCheung Alec Radford and Xi Chen Improved techniquesfor training gans In NeurIPS 2016 4

[20] Karen Simonyan and Andrew Zisserman Very deep convo-lutional networks for large-scale image recognition arXivpreprint arXiv14091556 2014 3 4

[21] Du Tran Lubomir Bourdev Rob Fergus Lorenzo Torresaniand Manohar Paluri Learning spatiotemporal features with3d convolutional networks In ICCV 2015 6

[22] Dmitry Ulyanov Andrea Vedaldi and Victor Lempitsky In-stance normalization The missing ingredient for fast styliza-tion arXiv preprint arXiv160708022 2016 4

[23] Limin Wang Yuanjun Xiong Zhe Wang Yu Qiao DahuaLin Xiaoou Tang and Luc Van Gool Temporal segmentnetworks Towards good practices for deep action recogni-tion In ECCV 2016 6

[24] Sheng-Yu Wang Oliver Wang Richard Zhang AndrewOwens and Alexei A Efros Cnn-generated images aresurprisingly easy to spot for now arXiv preprintarXiv191211035 2019 6

[25] Wayne Wu Yunxuan Zhang Cheng Li Chen Qian and ChenChange Loy Reenactgan Learning to reenact faces viaboundary transfer In ECCV 2018 4 5

30

0

GTwo temporal loss w temporal loss

output error map output error map

mean error 4129mean error 4148

mean error 4024 mean error 3967

Figure 4 The quantitative evaluation of the effectiveness of temporal loss Similar to [13] we use the re-rendering error in a self-reenactment setting where the ground truth is known The error maps show Euclidean distance of per pixel in RGB channels ([0 255])The mean errors are shown above the images The corresponding color scale wrt error values is shown on the right side of the images

Source Target Full DF-VAE wo MAdaIN wo hourglass wo unpaired

Figure 5 The ablation studies of different components in DF-VAE framework in the many-to-many face swapping setting Column 1 andColumn 2 show the source face and the target face respectively Column 3 shows the results of the full method Column 4 5 6 show theresults when removing MAdaIN module hourglass (structure extraction) module and unpaired data construction respectively

Figure 6 More examples of the source video data collection Our high-quality collected data vary in identities poses expressionsemotions lighting conditions and 3DMM blendshapes [3]

wo Perturbation Color Saturation Change

Local Block-WiseDistortion

Color ContrastChange Gaussian Blur

White Gaussian Noise in Color Components

JPEG Compression Video Compression Rate Change Mixed (3) Perturbations

Raw Video

Mixed (2) Perturbations Mixed (4) Perturbations

Figure 7 Seven types of perturbations and the mixture of two (Gaussian blur JPEG compression) three (Gaussian blur JPEG compressionwhite Gaussian noise in color components) four (Gaussian blur JPEG compression white Gaussian noise in color components changeof color saturation) perturbations in DeeperForensics-10

Page 5: DeeperForensics-1.0: A Large-Scale Dataset for Real-World ... · Real-World Face Forgery Detection Supplementary Material Liming Jiang1 Ren Li2 Wayne Wu1,2 Chen Qian2 Chen Change

1313

1313

1313

Figure 3 Many-to-many (three-to-three) face swapping by a single model with obvious reduction of style mismatch problems This figureshows the results between three source identities and three target identities The whole process is end-to-end

Table 1 The FID and IS scores of DeepFakes [1] faceswap-GAN[2] ReenactGAN [25] and DF-VAE (Ours)

Method FID ISDeepFakes [1] 25771 1711

faceswap-GAN [2] 24718 1685ReenactGAN [25] 26325 1690DF-VAE (Ours) 22097 1714

A7 Many-to-Many Face SwappingBy a single model DF-VAE can perform many-to-many

face swapping with obvious reduction of style mismatchand facial boundary artifacts (see Figure 3 for the face swap-ping between three source identities and three target identi-ties) Even if there are multiple identities in both the sourcedomain and the target domain the quality of face swappingdoes not degrade

A8 Ablation StudiesAblation study of temporal loss Since the swapped facesdo not have the ground truth we evaluate the effectivenessof temporal consistency constraint ie temporal loss in aself-reenactment setting Similar to [13] we quantify the re-rendering error by Euclidean distance of per pixel in RGBchannels ([0 255]) Visualized results are shown in Fig-ure 4 Without the temporal loss the re-rendering error ishigher hence demonstrating the effectiveness of temporal

consistency constraintAblation study of different components We conduct fur-ther ablation studies wrt different components of our DF-VAE framework under many-to-many face swapping set-ting (see Figure 5) The source and target faces are shownin Column 1 and Column 2 In Column 3 our full methodDF-VAE shows high-fidelity face swapping results In Col-umn 4 style mismatch problems are very obvious if weremove the MAdaIN module If we remove the hourglass(structure extraction) module the disentanglement of struc-ture and appearance is not very thorough The swapped facewill be a mixture of multiple identities as shown in Column5 When we perform face swapping without constructingunpaired data in the same domain (see Column 6) the dis-entangled module will completely reconstruct the faces onthe side of Eβ thus the disentanglement is not establishedat all Therefore the quality of face swapping will degradeif we remove any component in DF-VAE framework

B Supplementary Information of Benchmark

In this section we will provide some supplementary in-formation of our real-world face forgery detection bench-mark First we will elaborate on the basic informationof five baselines in our benchmark in Section B1 Thenwe will provide evaluation results of face forgery detectionmodel performance wrt dataset size in Section B2

Table 2 The binary detection accuracy of the baseline methods on DeeperForensics-10 standard test set (std) hidden test set (hidden)when trained on the standard training set with different dataset sizes FS denotes the full dataset size of the standard training set reportedin the main paper FS divide x indicates the reduction of dataset size to 1x of the full dataset size

Method C3D [21] TSN [23] I3D [4] ResNet+LSTM [7 9] XceptionNet [5]Test (acc) std hidden std hidden std hidden std hidden std hidden

FS (Full Size) 9850 7475 9925 7700 10000 7925 10000 7825 10000 7700FS divide 2 9750 7500 9875 7850 10000 7813 10000 7888 10000 7675FS divide 4 9600 7525 9925 7650 9950 7700 10000 7825 10000 7650FS divide 8 9225 7213 9125 7088 9500 7350 9825 7563 10000 7613FS divide 16 8425 6688 8475 6950 9525 7425 9850 7625 10000 7488FS divide 32 6250 5450 6800 5825 8050 6788 9550 7413 9500 7138FS divide 64 6125 5388 5275 5000 6200 5725 9075 7025 8850 6775

B1 Details of Benchmark BaselinesWe elaborate on five baselines used in our face forgery

detection benchmark in this section Our benchmark con-tains four video-level face forgery detection methods C3D[21] Temporal Segment Networks (TSN) [23] Inflated 3DConvNet (I3D) [4] and ResNet+LSTM [7 9] One image-level detection method XceptionNet [5] which achievesthe best performance in FaceForensics++ [18] is evaluatedas well

bull C3D [21] is a simple but effective method which in-corporates 3D convolution to capture the spatiotempo-ral feature of videos It includes 8 convolutional 5max-pooling and 2 fully connected layers The size ofthe 3D convolutional kernels is 3times 3times 3 When train-ing C3D the videos are divided into non-overlappedclips with 16-frames length and the original face im-ages are resized to 112times 112

bull TSN [23] is a 2D convolutional network which splitsthe video into short segments and randomly selects asnippet from each segment as the input The long-range temporal structure modeling is achieved by thefusion of the class scores corresponding to these snip-pets In our experiment we choose BN-Inception [12]as the backbone and only train our model with theRGB stream The number of segments is set to 3 as de-fault and the original images are resized to 224times 224

bull I3D [4] is derived from Inception-V1 [12] It inflatesthe 2D ConvNet by endowing the filters and poolingkernels with an additional temporal dimension In thetraining we use 64-frame snippets as the input whosestarting frames are randomly selected from the videosThe face images are resized to 224times 224

bull ResNet+LSTM [7 9] is based on ResNet [7] architec-ture As a 2D convolutional framework ResNet [7] isused to extract spatial features (the output of the lastconvolutional layer) for each face image In order toencode the temporal dependency between images weplace an LSTM [9] module with 512 hidden units af-ter ResNet-50 [7] to aggregate the spatial features An

additional fully connected layer serves as the classi-fier All the videos are downsampled with a ratio of 5and the images are resized to 224times224 before feedinginto the network During training the loss is the sum-mation of the binary entropy on the output at all timesteps while only the output of the last frame is usedfor the final classification in inference

bull XceptionNet [5] is a depthwise-separable-convolutionbased CNN which has been used in [18] for image-level face forgery detection We exploit the sameXceptionNet model as [18] but without freezing theweights of any layer during training The face imagesare resized to 299times 299 In the test phase the predic-tion is made by averaging classification scores of allframes within a video

B2 Evaluation of Dataset SizeWe include additional evaluation results of face forgery

detection methods wrt dataset size as shown in Table 2The experiments are conducted on DeeperForensics-10 inthis setting All the models are trained on the standard train-ing set with different dataset sizes

The reduction of dataset size hurts the accuracy of all thebaselines on either standard test set (std) or hidden test set(hidden) This justifies that a larger dataset size can be help-ful for detection model performance It is noteworthy thataccuracy drops little if we do not reduce the dataset size sig-nificantly These observations concerning dataset size are inline with that of [18 24]

In contrast to previous works we also find that theimage-level detection method is less sensitive (ie has astronger ability to remain high accuracy) to dataset size thanpure video-level method One possible explanation for thisis that despite a significant reduction of video samples therestill exist enough image samples (frames in total) for theimage-level method to achieve good performance

C More Examples of Data CollectionIn this section we show more examples of our exten-

sive source video data collection (see Figure 6) Our high-

quality collected data vary in identities poses expressionsemotions lighting conditions and 3DMM blendshapes [3]The source videos will also be released for further research

D PerturbationsWe also show some examples of perturbations in

DeeperForensics-10 Seven types of perturbations and themixture of two (Gaussian blur JPEG compression) three(Gaussian blur JPEG compression white Gaussian noisein color components) four (Gaussian blur JPEG compres-sion white Gaussian noise in color components changeof color saturation) perturbations are shown in Figure 7These perturbations are very common distortions existingin real life The comprehensiveness of perturbations inDeeperForensics-10 ensures its diversity to better simulatefake videos in real-world scenarios

References

[1] Deepfakes. https://github.com/deepfakes/faceswap. Accessed: 2019-08-16.
[2] faceswap-GAN. https://github.com/shaoanlu/faceswap-GAN. Accessed: 2019-08-16.
[3] Chen Cao, Yanlin Weng, Shun Zhou, Yiying Tong, and Kun Zhou. FaceWarehouse: A 3D facial expression database for visual computing. TVCG, 20(3):413–425, 2013.
[4] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, 2017.
[5] Francois Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[6] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NeurIPS, 2014.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[8] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[9] Sepp Hochreiter and Jurgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[10] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In ICCV, 2017.
[11] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[13] Hyeongwoo Kim, Pablo Garrido, Ayush Tewari, Weipeng Xu, Justus Thies, Matthias Niessner, Patrick Perez, Christian Richardt, Michael Zollhofer, and Christian Theobalt. Deep video portraits. ACM TOG, 37(4):163, 2018.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In NeurIPS, 2014.
[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[17] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
[18] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Niessner. FaceForensics++: Learning to detect manipulated facial images. arXiv preprint arXiv:1901.08971, 2019.
[19] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NeurIPS, 2016.
[20] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[21] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
[22] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
[23] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
[24] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A. Efros. CNN-generated images are surprisingly easy to spot... for now. arXiv preprint arXiv:1912.11035, 2019.
[25] Wayne Wu, Yunxuan Zhang, Cheng Li, Chen Qian, and Chen Change Loy. ReenactGAN: Learning to reenact faces via boundary transfer. In ECCV, 2018.

[Figure 4 layout: GT | w/o temporal loss (output, error map) | w/ temporal loss (output, error map); mean errors 41.29, 41.48, 40.24, 39.67; color scale 0–30.]

Figure 4: Quantitative evaluation of the effectiveness of the temporal loss. Similar to [13], we use the re-rendering error in a self-reenactment setting, where the ground truth is known. The error maps show the per-pixel Euclidean distance in RGB channels ([0, 255]). The mean errors are shown above the images. The corresponding color scale w.r.t. the error values is shown on the right side of the images.
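A minimal sketch of how such an error map could be computed is given below, assuming two aligned H x W x 3 uint8 RGB frames; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def rendering_error_map(gt, pred):
    """Per-pixel Euclidean distance between two RGB images in [0, 255]."""
    diff = gt.astype(np.float32) - pred.astype(np.float32)
    error_map = np.sqrt((diff ** 2).sum(axis=-1))  # shape (H, W)
    return error_map, float(error_map.mean())

# Usage with two aligned frames (names are placeholders):
# error_map, mean_error = rendering_error_map(gt_frame, rerendered_frame)
```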

[Figure 5 columns: Source | Target | Full DF-VAE | w/o MAdaIN | w/o hourglass | w/o unpaired.]

Figure 5: Ablation studies of different components of the DF-VAE framework in the many-to-many face swapping setting. Column 1 and Column 2 show the source face and the target face, respectively. Column 3 shows the results of the full method. Columns 4, 5, and 6 show the results when removing the MAdaIN module, the hourglass (structure extraction) module, and the unpaired data construction, respectively.

Figure 6: More examples of the source video data collection. Our high-quality collected data vary in identities, poses, expressions, emotions, lighting conditions, and 3DMM blendshapes [3].

[Figure 7 panels: Raw Video (w/o perturbation); Color Saturation Change; Local Block-Wise Distortion; Color Contrast Change; Gaussian Blur; White Gaussian Noise in Color Components; JPEG Compression; Video Compression Rate Change; Mixed (2) Perturbations; Mixed (3) Perturbations; Mixed (4) Perturbations.]

Figure 7: Seven types of perturbations, and the mixtures of two (Gaussian blur, JPEG compression), three (Gaussian blur, JPEG compression, white Gaussian noise in color components), and four (Gaussian blur, JPEG compression, white Gaussian noise in color components, change of color saturation) perturbations in DeeperForensics-1.0.


