

  • ACFD: Asymmetric Cartoon Face Detector

Bin Zhang*, Jian Li*, Yabiao Wang, Zhipeng Cui {* Equal Contribution}

    Youtu Lab, Tencent

    Southeast University

  • Model size should not exceed 200M.

    Inference time for a single picture (1920×1080) should not exceed 50 ms.

    Multi-scale and multi-model ensembles are allowed; in that case, the inference time is the sum over all scales and models.

    Pretrained models cannot be used.

    WIDER Face may be used as a training dataset.

    Task Analysis

  • Dataset Analysis

    [Figure: distribution of box sizes, bins (0, 32], (32, 64], (64, 96], (96, 128], >128, iCartoonFace vs. WIDER Face]

    [Figure: distribution of boxes per image, bins 1, 2, (3, 10], (10, 25], (25, 40], >40, iCartoonFace vs. WIDER Face]

    [Figure: distribution of face height/width ratio, bins 0.1 to 2.9, iCartoonFace]

    [Figure: distribution of face height/width ratio, bins 0.1 to 2.9, WIDER Face]

  • Difficulties of iCartoonFace

    [Examples: animal-like faces, unclear boundaries, blur, non-human-like faces, robots, 'missing' faces]

  • • State-of-the-art methods: S3FD[1], PyramidBox[2], SRN[3], DSFD[4], RefineFace[5].

    • Pipeline:

    • One-stage anchor-based detectors are widely used.

    • Predictions are made on features with strides from 4 to 128 (a 6-level feature pyramid, to handle small faces).

    • Modules after the backbone fuse the pyramid features and enhance the semantic information sequentially.

    • Match strategy: match anchors and faces with a smaller IoU threshold (0.35-0.4, vs. 0.5 in generic object detection), so that each face is assigned enough anchors.

    • Data augmentation: random crop for multi-scale training.

    Related Face Detectors

    [1] S. Zhang, X. Zhu, et al. S3FD: Single Shot Scale-invariant Face Detector. ICCV, 2017.
    [2] X. Tang, D. K. Du, et al. PyramidBox: A Context-assisted Single Shot Face Detector. ECCV, 2018.
    [3] C. Chi, S. Zhang, et al. Selective Refinement Network for High Performance Face Detection. AAAI, 2019.
    [4] J. Li, Y. Wang, et al. DSFD: Dual Shot Face Detector. CVPR, 2019.
    [5] S. Zhang, C. Chi, et al. RefineFace: Refinement Neural Network for High Performance Face Detection. TPAMI, 2020.

  • • A one-stage anchor-based pipeline with the ability to extract diverse features by employing asymmetric convolution layers.

    • Data augmentation for better handling hard-to-detect faces, e.g., very small or very large faces, blurred and occluded faces.

    • A dynamic match strategy to sample high-quality anchors for training, providing enough anchors for each face.

    • A margin loss to enhance discrimination, especially for faces similar to the background, e.g., robot faces.

    Our ACFD

  • Pipeline

    [Diagram: A-VoVNet51 backbone producing features at strides 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, fused by A-BiFPN, followed by anchor heads and the regression and classification losses]

  • • Random crop to generate large training samples (zoom in).

    • Random expansion to generate small training samples (zoom out).

    • Randomly tile faces to the anchor scales to better align them with the receptive field.

    Data Augmentation
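The zoom-in/zoom-out steps above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: the function names, the mean-fill for the expanded canvas, and the keep-box-center crop rule are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_expand(image, boxes, max_ratio=2.0):
    """Zoom out: paste the image onto a larger mean-filled canvas so that
    faces become relatively smaller (generates small training samples)."""
    h, w, c = image.shape
    ratio = rng.uniform(1.0, max_ratio)
    canvas = np.full((int(h * ratio), int(w * ratio), c),
                     image.mean(), dtype=image.dtype)
    top = rng.integers(0, canvas.shape[0] - h + 1)
    left = rng.integers(0, canvas.shape[1] - w + 1)
    canvas[top:top + h, left:left + w] = image
    # Boxes are (N, 4) arrays of x1, y1, x2, y2; shift them with the paste offset.
    shifted = boxes + np.array([left, top, left, top])
    return canvas, shifted

def random_crop(image, boxes, crop_size):
    """Zoom in: take a random square crop, keeping boxes whose centers fall
    inside it (generates large training samples)."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2
    keep = ((centers[:, 0] >= left) & (centers[:, 0] < left + crop_size) &
            (centers[:, 1] >= top) & (centers[:, 1] < top + crop_size))
    clipped = boxes[keep] - np.array([left, top, left, top])
    clipped = np.clip(clipped, 0, crop_size)
    return patch, clipped
```

In a real pipeline both transforms would also rescale the result to the fixed 640×640 sample size before batching.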

  • Asymmetric Conv Block (ACB)

    [Diagram, left: an RFE[1]-style module with four parallel branches (1×1 conv, then a 1×3, 3×1, 1×5, or 5×1 conv, then a 1×1 conv) whose outputs are concatenated. Right: each asymmetric convolution is realized as an ACB[2], parallel branches with separate Norm layers summed before the ReLU during training, fused into a single convolution at inference.]

    [1] S. Zhang, C. Chi, et al. RefineFace: Refinement Neural Network for High Performance Face Detection. TPAMI, 2020.
    [2] X. Ding, Y. Guo, et al. ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks. ICCV, 2019.
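The reason ACB costs nothing at inference is the linearity of convolution: the 1×k and k×1 training-time kernels can be embedded into the square kernel and summed once. A minimal single-channel numpy sketch of this fusion (BN folding and multi-channel handling are omitted; `conv2d` is a toy 'valid' correlation written only for the demo):

```python
import numpy as np

def conv2d(x, k):
    """Toy 'valid' 2D cross-correlation for a single channel."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
k3x3 = rng.standard_normal((3, 3))
k1x3 = rng.standard_normal((1, 3))
k3x1 = rng.standard_normal((3, 1))

# Training-time ACB: sum of the three branch outputs. Matching spatial
# alignment is emulated by embedding the asymmetric kernels into 3x3 grids.
k1x3_pad = np.zeros((3, 3)); k1x3_pad[1:2, :] = k1x3
k3x1_pad = np.zeros((3, 3)); k3x1_pad[:, 1:2] = k3x1
branch_sum = conv2d(x, k3x3) + conv2d(x, k1x3_pad) + conv2d(x, k3x1_pad)

# Inference-time ACB: fuse once into a single 3x3 kernel.
fused = k3x3 + k1x3_pad + k3x1_pad
assert np.allclose(branch_sum, conv2d(x, fused))
```

By linearity, the fused kernel gives exactly the training-time output, so the diverse receptive fields come for free at test time.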

  • • 6 stages with strides from 4 to 128

    A-VoVNet51[1]

    Layer          Output    Stride  Repeat  Channel
    Image          640×640   -       -       -
    Conv1          320×320   2       1       64
    Conv2          320×320   1       1       64
    Conv3          160×160   2       1       128
    Stage1         160×160   1       1       256
    Down-sampling  80×80     2       1       256
    Stage2         80×80     1       1       512
    Down-sampling  40×40     2       1       512
    Stage3         40×40     1       2       768
    Down-sampling  20×20     2       1       768
    Stage4         20×20     1       2       1024
    Down-sampling  10×10     2       1       1024
    Stage5         10×10     1       1       128
    Down-sampling  5×5       2       1       128
    Stage6         5×5       1       1       128

    [1] Y. Lee, J. Park, et al. CenterMask: Real-Time Anchor-Free Instance Segmentation. CVPR, 2020.

  • • 6 stages with strides from 4 to 128 (same A-VoVNet51[1] architecture table as the previous slide)

    Backbone comparison:

    backbone         AP
    ResNet50         0.9018
    SE-ResNet50      0.9023
    Res2Net50        0.8959
    ResNeSt50        0.8997
    EfficientNet-B3  0.8863
    VoVNet51         0.9037
    A-VoVNet51       0.9074

    [1] Y. Lee, J. Park, et al. CenterMask: Real-Time Anchor-Free Instance Segmentation. CVPR, 2020.

  • • Top-down path:

      P_i^td = Conv((ω1·P_i^in + ω2·Resize(P_{i+1}^td)) / (ω1 + ω2 + ε))

    • Bottom-up path:

      P_i^bu = Conv((ω1·P_i^in + ω2·P_i^td + ω3·Resize(P_{i-1}^bu)) / (ω1 + ω2 + ω3 + ε))

    A-BiFPN[1]

    [Diagram: inputs P2^in to P7^in, top-down nodes P3^td to P6^td, bottom-up outputs P2^bu to P7^bu, connected by top-down, bottom-up, and skip edges]

    Feature Module   AP
    A-BiFPN          0.9036
    BiFPN            0.9018
    SEPC (CVPR'20)   0.8880
    FPN              0.8830

    [1] M. Tan, R. Pang, et al. EfficientDet: Scalable and Efficient Object Detection. CVPR, 2020.
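The weighted fusion inside each BiFPN node, before the Conv in the formulas above, is EfficientDet's "fast normalized fusion": learnable non-negative weights divided by their sum. The following numpy sketch uses scalar per-input weights and omits the trailing convolution; it illustrates the normalization only, not the trained module.

```python
import numpy as np

def fuse(features, raw_weights, eps=1e-4):
    """Fast normalized fusion: clamp the learnable weights to be
    non-negative, then normalize by their sum (a cheap softmax substitute)."""
    w = np.maximum(np.asarray(raw_weights, dtype=np.float64), 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))

p_in = np.ones((4, 4))          # stands in for P_i^in
p_td = np.full((4, 4), 2.0)     # stands in for P_i^td
p_below = np.full((4, 4), 3.0)  # stands in for Resize(P_{i-1}^bu)
out = fuse([p_in, p_td, p_below], [1.0, 1.0, 2.0])
# Weights normalize to roughly [0.25, 0.25, 0.5], so out is close to 2.25.
```

The eps term keeps the division stable when all weights are driven to zero during training.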

  • • First step: faces are matched to anchors with IoU higher than a small threshold, usually 0.35~0.45.

    • Second step: an anchor is additionally matched when the IoU of its regressed box with any ground-truth box is greater than a large threshold, usually 0.7~0.8.

    Dynamic Match Strategy

    [1] T. Kong, F. Sun, et al. Consistent Optimization for Single-Shot Object Detection. arXiv, 2019.
    [2] Y. Liu, X. Tang, et al. HAMBox: Delving Into Mining High-Quality Anchors on Face Detection. CVPR, 2020.
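A toy numpy sketch of the two matching steps, with the thresholds quoted above; `dynamic_match` and its signature are illustrative, not the training code:

```python
import numpy as np

def iou(boxes_a, boxes_b):
    """Pairwise IoU between (N, 4) and (M, 4) arrays of x1, y1, x2, y2 boxes."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def dynamic_match(anchors, regressed, faces, low_thr=0.35, high_thr=0.7):
    """First step: anchors whose IoU with some face exceeds low_thr are
    positives. Second step: promote anchors whose *regressed* box overlaps
    some face with IoU above high_thr."""
    first = iou(anchors, faces).max(axis=1) > low_thr
    second = iou(regressed, faces).max(axis=1) > high_thr
    return first, second & ~first
```

The second step is what makes the strategy dynamic: an anchor with a poor prior IoU can still be mined as a high-quality positive once its regression output lands on a face.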

  • • Regression loss

      ℓ_reg = (1/N1) Σ_{i∈ψ1} L_smoothL1(x_i, x_i*) + (λ_reg/N2) Σ_{i∈ψ2} L_smoothL1(x_i, x_i*)

    • Classification loss

      ℓ_cls = (1/N1) Σ_{i∉ψ2} L_focal(f_margin(p_i), p_i*) + (λ_cls/N2) Σ_{i∈ψ2} L_focal(f_margin(p_i), p_i*)

      f_margin(x, x*) = 1[x* = 1]·(x − m) + 1[x* = 0]·x

    Loss Design

    model            AP
    Baseline         0.8765
    + dynamic match  0.8890

    model            AP
    Baseline         0.9048
    + margin loss    0.9073

    [1] H. Wang, Y. Wang, et al. CosFace: Large Margin Cosine Loss for Deep Face Recognition. CVPR, 2018.
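The margin function above simply shifts the positive-class score down by m before the focal loss, so positives must beat the background by at least m (in the spirit of CosFace[1]). A small numpy sketch; the function name follows the slide's f_margin:

```python
import numpy as np

def f_margin(scores, labels, m=0.2):
    """Subtract a margin m from the predicted scores of positive samples
    before the (focal) classification loss, forcing positives to be
    separated from the background by at least m."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    return np.where(labels == 1, scores - m, scores)

# A positive must now score m higher to incur the same loss as before;
# negative (background) scores are left untouched.
assert np.allclose(f_margin([0.9, 0.9], [1, 0], m=0.2), [0.7, 0.9])
```

This is what sharpens the decision boundary for background-like faces such as robots.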

  • • Training

    • Split the 50,000 pictures into 45,000 for training and 5,000 for validation.

    • Sample size: 640×640; batch size: 16 on each of 4 Tesla V100 GPUs.

    • SGD with learning rate 0.04, multiplied by 0.1 at epochs 200, 250 and 280; training stops at epoch 300.

    • IoU thresholds for the first and second match: 0.35 and 0.7; λ_reg = λ_cls = 0.8.

    • The margin of the classification loss is 0.2.

    • Testing

    • Test scales: (480, 645), (640, 860), (800, 1075).

    • The top-1000 predictions with confidence scores higher than 0.08 are processed by NMS with IoU threshold 0.55; the top-100 boxes are kept as the final results.

    Experimental Details
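The test-time post-processing described above (score filter at 0.08, top-1000 candidates, NMS at IoU 0.55, keep top-100) can be sketched as greedy NMS. This is an illustrative numpy version, not the deployed code:

```python
import numpy as np

def nms(boxes, scores, score_thr=0.08, pre_top=1000, iou_thr=0.55, post_top=100):
    """Greedy NMS matching the described test pipeline: filter by score,
    keep the top-`pre_top` candidates, suppress overlaps above `iou_thr`,
    and return at most `post_top` boxes."""
    keep_mask = scores > score_thr
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)[:pre_top]
    boxes, scores = boxes[order], scores[order]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    kept = []
    idx = np.arange(len(boxes))
    while idx.size and len(kept) < post_top:
        i = idx[0]                       # highest remaining score
        kept.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[idx[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[idx[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[idx[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[idx[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[idx[1:]] - inter)
        idx = idx[1:][iou <= iou_thr]    # drop boxes overlapping the keeper
    return boxes[kept], scores[kept]
```

With multi-scale testing, predictions from all three scales would be pooled before this step.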

  • • Time-consuming optimization

    • Fuse conv. and BN layers (saves about 3 ms at batch=1).

    • Batch processing (brings the runtime to 60+ ms excluding post-processing).

    • Convert the PyTorch model to TensorRT with the simple torch2trt toolbox available at https://github.com/z-bingo/torch2trt (average runtime: 48~49 ms).

    Time-Consuming Analysis

    [Bar chart: runtime breakdown in ms (Backbone / BiFPN / Head, 0-100 ms axis) for four settings: w/o BN fuse at batch=1; w/ BN fuse at batch=1; w/ BN fuse at batch=48; w/ BN fuse at batch=48 with TensorRT]
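The conv-BN fusion trick works because both operations are affine at inference, so BN's per-channel scale and shift can be folded into the convolution's weights and bias once, before deployment. A numpy sketch of the algebra on a 1×1-conv-like linear map (shapes and names are ours, not the project's code):

```python
import numpy as np

rng = np.random.default_rng(2)

# A 1x1-conv-like linear map (C_out x C_in) plus per-channel BN parameters.
W = rng.standard_normal((4, 3)); b = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mu, var, eps = rng.standard_normal(4), rng.random(4) + 0.5, 1e-5

x = rng.standard_normal((3,))
scale = gamma / np.sqrt(var + eps)

# Separate conv + BN at inference time:
y_ref = scale * (W @ x + b - mu) + beta

# Folded into a single conv: W' = scale * W, b' = scale * (b - mu) + beta.
W_fused = scale[:, None] * W
b_fused = scale * (b - mu) + beta
assert np.allclose(y_ref, W_fused @ x + b_fused)
```

Skipping the BN pass on every feature map is where the roughly 3 ms saving at batch=1 comes from.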

  • Results on Leaderboard

    [Bar chart of leaderboard AP scores (axis 0.8800 to 0.9300): 0.8816, 0.8975, 0.9038, 0.9118, 0.9180, 0.9187, 0.9191, 0.9214, 0.9218, 0.9223, 0.9230, 0.9236, 0.9284, 0.9291]

  • Visualization

    [Qualitative detection results on hard cases: scale, robot, crowded, animal-like, blur, ratio/pose, occlusion, non-human-like]

  • • Asymmetric Conv Block (ACB)

    • Provides the backbone network with the ability to generate features with more diverse receptive fields.

    • BiFPN with ACB can aggregate and enhance multi-scale features at the same time; previous methods achieved this with two separate modules.

    • Dynamic Match Strategy

    • Instead of matching anchors and faces by a single IoU threshold, it mines enough high-quality anchors for each face.

    • Margin Loss

    • Enhances discrimination, especially for faces that are similar to the background, e.g., blurred faces, robot faces.

    • Data Augmentation

    • Augments faces that are too small or too large to be accurately located.

    Conclusion

  • Q&A

    Bin Zhang*, Jian Li*, Yabiao Wang, Zhipeng Cui {* Equal Contribution}

    Contact e-mail: [email protected]

    Youtu Lab, Tencent

    Southeast University