Deep Face Recognition
Omkar M. Parkhi, Andrea Vedaldi, Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford
Objective
Results
• Build a large-scale face dataset with minimal human intervention
• Train a convolutional neural network for face recognition that can compete with those of Internet giants like Google and Facebook
Results
Incorporating Attributes
Contributions
Representation Learning
Comparison with the state of the art
• New face dataset of 2.6 million faces with minimal manual labeling
• State-of-the-art results on the YouTube Faces in the Wild dataset (EER 3%) [better than Google's FaceNet and Facebook's DeepFace]
• Comparable to state-of-the-art results on the Labeled Faces in the Wild dataset (EER 0.87%) [comparable to Google's FaceNet and better than Facebook's DeepFace]
Dataset construction

1. Candidate list generation: 5000 identities
• Tap the knowledge on the web
• Collect representative images for each celebrity: 200 images per identity
2. Candidate list filtering
• Remove people with low representation in Google image search
• Remove overlap with public benchmarks
• 2622 celebrities in the final dataset
3. Rank image sets
• Download 2000 images per identity
• Learn a classifier using the data obtained in step 1
• Rank the 2000 images and select the top 1000
4. Near-duplicate removal
• Remove near duplicates using VLAD descriptors
5. Manual filtering
• Curate the dataset further using manual checks
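The ranking step (step 3) above can be sketched as follows. This is a minimal, hypothetical stand-in: the features are generic vectors and the scorer is a simple nearest-class-mean linear rule, whereas the paper ranks with a one-vs-rest linear classifier trained on the seed images from step 1.

```python
import numpy as np

def rank_and_select(seed_feats, neg_feats, candidate_feats, top_k=1000):
    """Rank downloaded candidate images for one identity, keep the top_k.

    seed_feats: features of the ~200 trusted seed images (positives).
    neg_feats:  features of images of other identities (negatives).
    candidate_feats: features of the ~2000 downloaded candidates.
    """
    # Nearest-class-mean linear scorer: w = mean(pos) - mean(neg).
    # (A training-free substitute for the SVM used in the paper.)
    w = seed_feats.mean(axis=0) - neg_feats.mean(axis=0)
    scores = candidate_feats @ w
    order = np.argsort(-scores)   # sort candidates by descending score
    return order[:top_k]          # indices of the retained images

# Toy usage with synthetic features: true matches cluster around +1,
# distractors around -1, so the true matches should be ranked first.
rng = np.random.default_rng(0)
pos = rng.normal(+1.0, 0.5, size=(200, 16))
neg = rng.normal(-1.0, 0.5, size=(200, 16))
cand = np.vstack([rng.normal(+1.0, 0.5, size=(50, 16)),   # true matches
                  rng.normal(-1.0, 0.5, size=(50, 16))])  # distractors
kept = rank_and_select(pos, neg, cand, top_k=50)
```

With well-separated features the retained set is dominated by the true matches, which is the point of the ranking stage: it lets the pipeline download liberally and filter automatically.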
| No. | Aim | Mode | # Persons | # Images/person | Total # images | Annotation effort |
|-----|-----|------|-----------|-----------------|----------------|-------------------|
| 1 | Candidate list generation | Auto | 5000 | 200 | 1,000,000 | - |
| 2 | Candidate list filtering | Manual | 2622 | - | - | 4 days |
| 3 | Rank image sets | Auto | 2622 | 1000 | 2,622,000 | - |
| 4 | Near-duplicate removal | Auto | 2622 | 623 | 1,635,159 | - |
| 5 | Manual filtering | Manual | 2622 | 375 | 982,803 | 10 days |
| No. | Dataset | # Persons | Total # images |
|-----|---------|-----------|----------------|
| 1 | Labeled Faces in the Wild | 5,749 | 13,233 |
| 2 | WDRef | 2,995 | 99,773 |
| 3 | CelebFaces | 10,177 | 202,599 |
| 4 | Ours | 2,622 | 2.6M |
| 5 | Facebook | 4,030 | 4.4M |
| 6 | Google | 8M | 200M |
Dataset Comparison
Example images from our dataset
Dataset Statistics
Network architecture (configuration A; full details in Table 3):

image → conv-64 → conv-64 → maxpool → conv-128 → conv-128 → maxpool → conv-256 → conv-256 → conv-256 → maxpool → conv-512 → conv-512 → conv-512 → maxpool → conv-512 → conv-512 → conv-512 → maxpool → fc-4096 → fc-4096 → fc-2622 → softmax
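Assuming a 224×224 RGB input (as in Table 3: 3×3 convolutions with stride 1 and padding 1, 2×2 max pooling with stride 2), the sequence above can be walked to check the shape that reaches fc6. The `CONFIG_A` name is ours, not the paper's:

```python
# Walk the conv/maxpool sequence and track tensor shape.
# Integers are conv output channels; 'M' marks a 2x2 maxpool.
CONFIG_A = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
            512, 512, 512, 'M', 512, 512, 512, 'M']

def output_shape(config, size=224, channels=3):
    for layer in config:
        if layer == 'M':          # 2x2 max pool, stride 2: halves H and W
            size //= 2
        else:                     # 3x3 conv, stride 1, pad 1: H, W unchanged
            channels = layer
    return channels, size, size

shape = output_shape(CONFIG_A)    # shape fed to fc6
```

The result is a 512×7×7 map, which is exactly why fc6 appears in Table 3 as a "convolution" with support 7.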
| Objective | 100%-EER |
|-----------|----------|
| Softmax classification | 97.30 |
| Embedding learning | 99.10 |
| No. | Method | # Training images | # Networks | Accuracy |
|-----|--------|-------------------|------------|----------|
| 1 | Fisher Vector Faces | - | - | 93.10 |
| 2 | DeepFace (Facebook) | 4M | 3 | 97.35 |
| 3 | DeepFace Fusion (Facebook) | 500M | 5 | 98.37 |
| 4 | DeepID-2,3 | Full | 200 | 99.47 |
| 5 | FaceNet (Google) | 200M | 1 | 98.87 |
| 6 | FaceNet + Alignment (Google) | 200M | 1 | 99.63 |
| 7 | Ours (VGG Face) | 2.6M | 1 | 98.78 |
| No. | Method | # Training images | # Networks | 100%-EER | Accuracy |
|-----|--------|-------------------|------------|----------|----------|
| 1 | Video Fisher Vector Faces | - | - | 87.7 | 83.8 |
| 2 | DeepFace (Facebook) | 4M | 1 | 91.4 | 91.4 |
| 4 | DeepID-2,2+,3 | - | 200 | - | 93.2 |
| 5 | FaceNet (Google) | 200M | 1 | - | 95.1 |
| 7 | Ours (VGG Face) | 2.6M | 1 | 97.60 | 97.4 |
[Figure: ROC curves on the LFW dataset (true positive rate vs. false positive rate) for Our Method, DeepID3, DeepFace, and Fisher Vector Faces.]
Dataset
YouTube Faces In the Wild: Unrestricted Protocol
Labeled Faces In the Wild: Unrestricted Protocol
• "Very Deep" architecture
  • Different from previous architectures proposed for face recognition:
    • locally connected layers (Facebook)
    • inception (Google)
• Learning a multi-way classifier
  • Softmax objective
  • 2622-way classification
  • 4096-d descriptor
• Learning a task-specific embedding
  • The embedding is learnt by minimizing the triplet loss
  • Learning a projection layer from 4096 to 1024 dimensions
  • Online triplet formation at the beginning of each iteration
  • Fine-tuned on target datasets
• Experiments: Labeled Faces in the Wild, Unrestricted Protocol
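The "online triplet formation" bullet above can be sketched for one mini-batch as follows. The exact selection heuristic in the paper may differ; this is an assumption that keeps only triplets with non-zero loss, i.e. negatives that violate the margin:

```python
import numpy as np

def form_triplets(x, labels, alpha=0.2):
    """Form (anchor, positive, negative) triplets from one batch.

    x: batch of embedding vectors, one per row.
    Keeps a triplet (a, p, n) only if the negative violates the margin:
        ||x_a - x_n||^2 < ||x_a - x_p||^2 + alpha,
    since margin-satisfying triplets contribute zero loss and zero gradient.
    """
    d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    triplets = []
    n = len(x)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue                       # positives share the identity
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue                   # negatives must differ
                if d[a, neg] < d[a, p] + alpha:
                    triplets.append((a, p, neg))
    return triplets

# Toy 1-d embeddings: a hard negative (0.15) is kept, an easy one (5.0) is not.
hard = form_triplets(np.array([[0.0], [0.1], [0.15]]), [0, 0, 1])
easy = form_triplets(np.array([[0.0], [0.1], [5.0]]), [0, 0, 1])
```

Filtering out easy triplets at the start of each iteration keeps the gradient signal concentrated on examples the current embedding still gets wrong.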
6 PARKHI et al.: DEEP FACE RECOGNITION
| layer | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| type | input | conv | relu | conv | relu | mpool | conv | relu | conv | relu | mpool | conv | relu | conv | relu | conv | relu | mpool | conv |
| name | - | conv1_1 | relu1_1 | conv1_2 | relu1_2 | pool1 | conv2_1 | relu2_1 | conv2_2 | relu2_2 | pool2 | conv3_1 | relu3_1 | conv3_2 | relu3_2 | conv3_3 | relu3_3 | pool3 | conv4_1 |
| support | - | 3 | 1 | 3 | 1 | 2 | 3 | 1 | 3 | 1 | 2 | 3 | 1 | 3 | 1 | 3 | 1 | 2 | 3 |
| filt dim | - | 3 | - | 64 | - | - | 64 | - | 128 | - | - | 128 | - | 256 | - | 256 | - | - | 256 |
| num filts | - | 64 | - | 64 | - | - | 128 | - | 128 | - | - | 256 | - | 256 | - | 256 | - | - | 512 |
| stride | - | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 |
| pad | - | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |

| layer | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| type | relu | conv | relu | conv | relu | mpool | conv | relu | conv | relu | conv | relu | mpool | conv | relu | conv | relu | conv | softmx |
| name | relu4_1 | conv4_2 | relu4_2 | conv4_3 | relu4_3 | pool4 | conv5_1 | relu5_1 | conv5_2 | relu5_2 | conv5_3 | relu5_3 | pool5 | fc6 | relu6 | fc7 | relu7 | fc8 | prob |
| support | 1 | 3 | 1 | 3 | 1 | 2 | 3 | 1 | 3 | 1 | 3 | 1 | 2 | 7 | 1 | 1 | 1 | 1 | 1 |
| filt dim | - | 512 | - | 512 | - | - | 512 | - | 512 | - | 512 | - | - | 512 | - | 4096 | - | 4096 | - |
| num filts | - | 512 | - | 512 | - | - | 512 | - | 512 | - | 512 | - | - | 4096 | - | 4096 | - | 2622 | - |
| stride | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| pad | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

Table 3: Network configuration. Details of the face CNN configuration A. The FC layers are listed as "convolution" as they are a special case of convolution (see Section 4.3). For each convolution layer, the filter size, number of filters, stride and padding are indicated.
using a "triplet loss" training scheme, illustrated in the next section. While the latter is essential to obtain a good overall performance, bootstrapping the network as a classifier, as explained in this section, was found to make training significantly easier and faster.
4.2 Learning a face embedding using a triplet loss
Triplet-loss training aims at learning score vectors that perform well in the final application, i.e. identity verification by comparing face descriptors in Euclidean space. This is similar in spirit to "metric learning", and, like many metric learning approaches, is used to learn a projection that is at the same time distinctive and compact, achieving dimensionality reduction at the same time.
Our triplet-loss training scheme is similar in spirit to that of [17]. The output $f(\ell_t) \in \mathbb{R}^D$ of the CNN, pre-trained as explained in Section 4.1, is $\ell^2$-normalised and projected to an $L \ll D$ dimensional space using an affine projection $\mathbf{x}_t = W' f(\ell_t)/\|f(\ell_t)\|_2$, $W' \in \mathbb{R}^{L \times D}$. While this formula is similar to the linear predictor learned above, there are two key differences. The first one is that $L \neq D$ is not equal to the number of class identities, but it is the (arbitrary) size of the descriptor embedding (we set $L = 1{,}024$). The second one is that the projection $W'$ is trained to minimise the empirical triplet loss

$$E(W') = \sum_{(a,p,n) \in T} \max\left\{0,\ \alpha - \|\mathbf{x}_a - \mathbf{x}_n\|_2^2 + \|\mathbf{x}_a - \mathbf{x}_p\|_2^2\right\}, \qquad \mathbf{x}_i = W' \frac{f(\ell_i)}{\|f(\ell_i)\|_2}. \tag{1}$$

Note that, differently from the previous section, there is no bias being learned here as the differences in (1) would cancel it. Here $\alpha \ge 0$ is a fixed scalar representing a learning margin and $T$ is a collection of training triplets. A triplet $(a, p, n)$ contains an anchor face image $a$ as well as a positive $p \neq a$ and negative $n$ examples of the anchor's identity. The projection $W'$ is learned on target datasets such as LFW and YTF honouring their guidelines. The construction of the triplet training set $T$ is discussed in Section 4.4.
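A forward-pass evaluation of the loss in (1) can be sketched in numpy as follows; the dimensions and margin here are illustrative only (the paper sets $L = 1{,}024$ and $D = 4{,}096$):

```python
import numpy as np

def triplet_loss(W, feats, triplets, alpha=0.2):
    """Empirical triplet loss of Eq. (1).

    feats: CNN outputs f(l) in R^D, one per row.
    W:     the L x D projection W' (no bias: the differences in the
           loss would cancel it anyway).
    triplets: list of (anchor, positive, negative) row indices.
    """
    x = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # l2-normalise
    x = x @ W.T                                               # project to R^L
    loss = 0.0
    for a, p, n in triplets:
        loss += max(0.0, alpha
                    - np.sum((x[a] - x[n]) ** 2)
                    + np.sum((x[a] - x[p]) ** 2))
    return loss

# Toy usage with L = D = 2 and an identity projection.
# After normalisation the rows become [1,0], [1,0], [0,1].
W_id = np.eye(2)
feats = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
zero = triplet_loss(W_id, feats, [(0, 1, 2)])  # easy: negative is far away
viol = triplet_loss(W_id, feats, [(0, 2, 1)])  # violating: "positive" is far
```

The easy triplet contributes nothing (the hinge is inactive), while the violating one contributes $\alpha - 0 + 2 = 2.2$; only the latter produces a gradient on $W'$.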
4.3 Architecture
We consider three architectures based on the A, B, and D architectures of [19]. The CNN architecture A is given in full detail in Table 3. It comprises 11 blocks, each containing a linear operator followed by one or more non-linearities such as ReLU and max pooling. The first eight such blocks are said to be convolutional as the linear operator is a bank of linear filters (linear convolution). The last three blocks are instead called Fully Connected (FC);
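The equivalence Table 3 relies on when listing the FC layers as "convolutions" can be demonstrated numerically: applying fc6 to the pool5 map is the same as convolving it with filters whose support covers the whole map, yielding a 1×1 spatial output. The dimensions below are shrunk from the real 4096 × (512·7·7) for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
fmap = rng.normal(size=(16, 7, 7))          # stand-in for the 512x7x7 pool5 output
W_fc = rng.normal(size=(8, 16 * 7 * 7))     # stand-in for the fc6 weight matrix

# (a) FC view: flatten the map, then one matrix multiply.
fc_out = W_fc @ fmap.reshape(-1)

# (b) Conv view: 8 filters of support 7 and depth 16, evaluated at the
# single valid spatial position (no padding), i.e. a 1x1 output map.
W_conv = W_fc.reshape(8, 16, 7, 7)
conv_out = np.einsum('ochw,chw->o', W_conv, fmap)
```

Both views produce identical outputs; the convolutional view has the practical advantage that the same weights can then be slid over larger input images.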
Effect of dataset curation

| Dataset | # Images | 100%-EER |
|---------|----------|----------|
| Curated | 982,803 | 92.83 |
| Full | 2,622,000 | 95.80 |
Effect of embedding learning
More experiments in the paper.
Example correct classification: positive pairs
Example correct classification: negative pairs
Example incorrect classification
False positive
False negative

Acknowledgements
This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA).

CNN model and face detector available online! Visit: http://www.robots.ox.ac.uk/~vgg/software/vgg_face/