

  • ACFD: Asymmetric Cartoon Face Detector

Bin Zhang*, Jian Li*, Yabiao Wang, Zhipeng Cui {* Equal Contribution}

    Youtu Lab, Tencent

    Southeast University

  • Model size should not exceed 200M.

    Inference time for a single picture (1920×1080) should not exceed 50 ms.

    Multi-scale and multi-model ensembles are allowed; in that case, the inference time is the sum over all scales and models.

    Pretrained models cannot be used.

    WIDER Face may be used as a training dataset.

    Task Analysis

  • Dataset Analysis

    [Figure: distribution of box sizes, bins (0, 32], (32, 64], (64, 96], (96, 128], >128, iCartoonFace vs. WIDER Face]

    [Figure: distribution of boxes per image, bins 1, 2, (3, 10], (10, 25], (25, 40], >40, iCartoonFace vs. WIDER Face]

    [Figure: distribution of face height/width ratio, bins 0.1 to 2.9, iCartoonFace]

    [Figure: distribution of face height/width ratio, bins 0.1 to 2.9, WIDER Face]

  • Difficulties of iCartoonFace

    [Examples: animal-like faces, unclear boundaries, blur, non-human-like faces, robots, 'missing' faces]

  • • State-of-the-art methods: S3FD[1], PyramidBox[2], SRN[3], DSFD[4], RefineFace[5].

    • Pipeline:

    • One-stage anchor-based detectors are widely used.

    • Predictions are made on features with strides from 4 to 128 (a 6-level feature pyramid, to handle small faces).

    • Modules after the backbone fuse the pyramid features and enhance the semantic information sequentially.

    • Match strategy: match anchors and faces with a smaller IoU threshold (0.35-0.4, vs. 0.5 in generic object detection), so that each face is assigned enough anchors.

    • Data augmentation: random crop for multi-scale training.

    Related Face Detectors

    [1] S. Zhang, X. Zhu, et al. S3FD: Single Shot Scale-invariant Face Detector. ICCV, 2017.
    [2] X. Tang, D. K. Du, et al. PyramidBox: A Context-assisted Single Shot Face Detector. ECCV, 2018.
    [3] C. Chi, S. Zhang, et al. Selective Refinement Network for High Performance Face Detection. AAAI, 2019.
    [4] J. Li, Y. Wang, et al. DSFD: Dual Shot Face Detector. CVPR, 2019.
    [5] S. Zhang, C. Chi, et al. RefineFace: Refinement Neural Network for High Performance Face Detection. TPAMI, 2020.

  • • A one-stage anchor-based pipeline with the ability to extract diverse features by employing asymmetric convolution layers.

    • Data augmentation for better handling hard-to-detect faces, e.g., very small or very large faces, blurred and occluded faces.

    • A dynamic match strategy to sample high-quality anchors for training, providing enough anchors for each face.

    • A margin loss to enhance discrimination, especially for faces similar to the background, e.g., robot faces.

    Our ACFD

  • Pipeline

    [Diagram: A-VoVNet51 backbone producing features at strides 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, fused by A-BiFPN, followed by anchor heads and the regression and classification losses]

  • • Random crop to generate large training samples (zoom in).

    • Random expansion to generate small training samples (zoom out).

    • Randomly tile faces to the anchor scales to better align them with the receptive field.

    Data Augmentation
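The zoom-in/zoom-out steps above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: the function names, the mean-fill for the expanded canvas, and the keep-box-center crop rule are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_expand(image, boxes, max_ratio=2.0):
    """Zoom out: paste the image onto a larger mean-filled canvas so that
    faces become relatively smaller (generates small training samples)."""
    h, w, c = image.shape
    ratio = rng.uniform(1.0, max_ratio)
    canvas = np.full((int(h * ratio), int(w * ratio), c),
                     image.mean(), dtype=image.dtype)
    top = rng.integers(0, canvas.shape[0] - h + 1)
    left = rng.integers(0, canvas.shape[1] - w + 1)
    canvas[top:top + h, left:left + w] = image
    # Boxes are (N, 4) arrays of x1, y1, x2, y2; shift them with the paste offset.
    shifted = boxes + np.array([left, top, left, top])
    return canvas, shifted

def random_crop(image, boxes, crop_size):
    """Zoom in: take a random square crop, keeping boxes whose centers fall
    inside it (generates large training samples)."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    patch = image[top:top + crop_size, left:left + crop_size]
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2
    keep = ((centers[:, 0] >= left) & (centers[:, 0] < left + crop_size) &
            (centers[:, 1] >= top) & (centers[:, 1] < top + crop_size))
    clipped = boxes[keep] - np.array([left, top, left, top])
    clipped = np.clip(clipped, 0, crop_size)
    return patch, clipped
```

In a real pipeline both transforms would also rescale the result to the fixed 640×640 sample size before batching.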

  • Asymmetric Conv Block (ACB)

    [Diagram, left: an RFE[1]-style module with four parallel branches (1×1 conv, then a 1×3, 3×1, 1×5, or 5×1 conv, then a 1×1 conv) whose outputs are concatenated. Right: each asymmetric convolution is realized as an ACB[2], parallel branches with separate Norm layers summed before the ReLU during training, fused into a single convolution at inference.]

    [1] S. Zhang, C. Chi, et al. RefineFace: Refinement Neural Network for High Performance Face Detection. TPAMI, 2020.
    [2] X. Ding, Y. Guo, et al. ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks. ICCV, 2019.
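The reason ACB costs nothing at inference is the linearity of convolution: the 1×k and k×1 training-time kernels can be embedded into the square kernel and summed once. A minimal single-channel numpy sketch of this fusion (BN folding and multi-channel handling are omitted; `conv2d` is a toy 'valid' correlation written only for the demo):

```python
import numpy as np

def conv2d(x, k):
    """Toy 'valid' 2D cross-correlation for a single channel."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
k3x3 = rng.standard_normal((3, 3))
k1x3 = rng.standard_normal((1, 3))
k3x1 = rng.standard_normal((3, 1))

# Training-time ACB: sum of the three branch outputs. Matching spatial
# alignment is emulated by embedding the asymmetric kernels into 3x3 grids.
k1x3_pad = np.zeros((3, 3)); k1x3_pad[1:2, :] = k1x3
k3x1_pad = np.zeros((3, 3)); k3x1_pad[:, 1:2] = k3x1
branch_sum = conv2d(x, k3x3) + conv2d(x, k1x3_pad) + conv2d(x, k3x1_pad)

# Inference-time ACB: fuse once into a single 3x3 kernel.
fused = k3x3 + k1x3_pad + k3x1_pad
assert np.allclose(branch_sum, conv2d(x, fused))
```

By linearity, the fused kernel gives exactly the training-time output, so the diverse receptive fields come for free at test time.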

  • • 6 stages with strides from 4 to 128

    A-VoVNet51[1]

    Layer          Output    Stride  Repeat  Channel
    Image          640×640   -       -       -
    Conv1          320×320   2       1       64
    Conv2          320×320   1       1       64
    Conv3          160×160   2       1       128
    Stage1         160×160   1       1       256
    Down-sampling  80×80     2       1       256
    Stage2         80×80     1       1       512
    Down-sampling  40×40     2       1       512
    Stage3         40×40     1       2       768
    Down-sampling  20×20     2       1       768
    Stage4         20×20     1       2       1024
    Down-sampling  10×10     2       1       1024
    Stage5         10×10     1       1       128
    Down-sampling  5×5       2       1       128
    Stage6         5×5       1       1       128

    [1] Y. Lee, J. Park, et al. CenterMask: Real-Time Anchor-Free Instance Segmentation. CVPR, 2020.

  • • 6 stages with strides from 4 to 128 (same A-VoVNet51[1] architecture table as the previous slide)

    Backbone comparison:

    backbone         AP
    ResNet50         0.9018
    SE-ResNet50      0.9023
    Res2Net50        0.8959
    ResNeSt50        0.8997
    EfficientNet-B3  0.8863
    VoVNet51         0.9037
    A-VoVNet51       0.9074

    [1] Y. Lee, J. Park, et al. CenterMask: Real-Time Anchor-Free Instance Segmentation. CVPR, 2020.

  • • Top-down path:

      P_i^td = Conv((ω1·P_i^in + ω2·Resize(P_{i+1}^td)) / (ω1 + ω2 + ε))

    • Bottom-up path:

      P_i^bu = Conv((ω1·P_i^in + ω2·P_i^td + ω3·Resize(P_{i-1}^bu)) / (ω1 + ω2 + ω3 + ε))

    A-BiFPN[1]

    [Diagram: inputs P2^in to P7^in, top-down nodes P3^td to P6^td, bottom-up outputs P2^bu to P7^bu, connected by top-down, bottom-up, and skip edges]

    Feature Module   AP
    A-BiFPN          0.9036
    BiFPN            0.9018
    SEPC (CVPR'20)   0.8880
    FPN              0.8830

    [1] M. Tan, R. Pang, et al. EfficientDet: Scalable and Efficient Object Detection. CVPR, 2020.
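The weighted fusion inside each BiFPN node, before the Conv in the formulas above, is EfficientDet's "fast normalized fusion": learnable non-negative weights divided by their sum. The following numpy sketch uses scalar per-input weights and omits the trailing convolution; it illustrates the normalization only, not the trained module.

```python
import numpy as np

def fuse(features, raw_weights, eps=1e-4):
    """Fast normalized fusion: clamp the learnable weights to be
    non-negative, then normalize by their sum (a cheap softmax substitute)."""
    w = np.maximum(np.asarray(raw_weights, dtype=np.float64), 0.0)
    w = w / (w.sum() + eps)
    return sum(wi * f for wi, f in zip(w, features))

p_in = np.ones((4, 4))          # stands in for P_i^in
p_td = np.full((4, 4), 2.0)     # stands in for P_i^td
p_below = np.full((4, 4), 3.0)  # stands in for Resize(P_{i-1}^bu)
out = fuse([p_in, p_td, p_below], [1.0, 1.0, 2.0])
# Weights normalize to roughly [0.25, 0.25, 0.5], so out is close to 2.25.
```

The eps term keeps the division stable when all weights are driven to zero during training.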

  • • First step: faces are matched to anchors with IoU higher than a small threshold, usually 0.35~0.45.

    • Second step: an anchor is additionally matched when the IoU of its regressed box with any ground-truth box is greater than a large threshold, usually 0.7~0.8.

    Dynamic Match Strategy

    [1] T. Kong, F. Sun, et al. Consistent Optimization for Single-Shot Object Detection. arXiv, 2019.
    [2] Y. Liu, X. Tang, et al. HAMBox: Delving Into Mining High-Quality Anchors on Face Detection. CVPR, 2020.
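A toy numpy sketch of the two matching steps, with the thresholds quoted above; `dynamic_match` and its signature are illustrative, not the training code:

```python
import numpy as np

def iou(boxes_a, boxes_b):
    """Pairwise IoU between (N, 4) and (M, 4) arrays of x1, y1, x2, y2 boxes."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def dynamic_match(anchors, regressed, faces, low_thr=0.35, high_thr=0.7):
    """First step: anchors whose IoU with some face exceeds low_thr are
    positives. Second step: promote anchors whose *regressed* box overlaps
    some face with IoU above high_thr."""
    first = iou(anchors, faces).max(axis=1) > low_thr
    second = iou(regressed, faces).max(axis=1) > high_thr
    return first, second & ~first
```

The second step is what makes the strategy dynamic: an anchor with a poor prior IoU can still be mined as a high-quality positive once its regression output lands on a face.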

  • • Regression loss

      ℓ_reg = (1/N1) Σ_{i∈ψ1} L_smoothL1(x_i, x_i*) + (λ_reg/N2) Σ_{i∈ψ2} L_smoothL1(x_i, x_i*)

    • Classification loss

      ℓ_cls = (1/N1) Σ_{i∉ψ2} L_focal(f_margin(p_i), p_i*) + (λ_cls/N2) Σ_{i∈ψ2} L_focal(f_margin(p_i), p_i*)

      f_margin(x, x*) = 1[x* = 1]·(x − m) + 1[x* = 0]·x

    Loss Design

    model            AP
    Baseline         0.8765
    + dynamic match  0.8890

    model            AP
    Baseline         0.9048
    + margin loss    0.9073

    [1] H. Wang, Y. Wang, et al. CosFace: Large Margin Cosine Loss for Deep Face Recognition. CVPR, 2018.
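The margin function above simply shifts the positive-class score down by m before the focal loss, so positives must beat the background by at least m (in the spirit of CosFace[1]). A small numpy sketch; the function name follows the slide's f_margin:

```python
import numpy as np

def f_margin(scores, labels, m=0.2):
    """Subtract a margin m from the predicted scores of positive samples
    before the (focal) classification loss, forcing positives to be
    separated from the background by at least m."""
    scores = np.asarray(scores, dtype=np.float64)
    labels = np.asarray(labels)
    return np.where(labels == 1, scores - m, scores)

# A positive must now score m higher to incur the same loss as before;
# negative (background) scores are left untouched.
assert np.allclose(f_margin([0.9, 0.9], [1, 0], m=0.2), [0.7, 0.9])
```

This is what sharpens the decision boundary for background-like faces such as robots.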

  • • Training

    • Split the 50,000 pictures into 45,000 for training and 5,000 for validation.

    • Sample size: 640×640; batch size: 16 on each of 4 Tesla V100 GPUs.

    • SGD with learning rate 0.04, multiplied by 0.1 at epochs 200, 250 and 280; training stops at epoch 300.

    • IoU thresholds for the first and second match: 0.35 and 0.7; λ_reg = λ_cls = 0.8.

    • The margin of the classification loss is 0.2.

    • Testing

    • Test scales: (480, 645), (640, 860), (800, 1075).

    • The top-1000 predictions with confidence scores higher than 0.08 are processed by NMS with IoU threshold 0.55; the top-100 boxes are kept as the final results.

    Experimental Details
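The test-time post-processing described above (score filter at 0.08, top-1000 candidates, NMS at IoU 0.55, keep top-100) can be sketched as greedy NMS. This is an illustrative numpy version, not the deployed code:

```python
import numpy as np

def nms(boxes, scores, score_thr=0.08, pre_top=1000, iou_thr=0.55, post_top=100):
    """Greedy NMS matching the described test pipeline: filter by score,
    keep the top-`pre_top` candidates, suppress overlaps above `iou_thr`,
    and return at most `post_top` boxes."""
    keep_mask = scores > score_thr
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = np.argsort(-scores)[:pre_top]
    boxes, scores = boxes[order], scores[order]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    kept = []
    idx = np.arange(len(boxes))
    while idx.size and len(kept) < post_top:
        i = idx[0]                       # highest remaining score
        kept.append(i)
        x1 = np.maximum(boxes[i, 0], boxes[idx[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[idx[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[idx[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[idx[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[idx[1:]] - inter)
        idx = idx[1:][iou <= iou_thr]    # drop boxes overlapping the keeper
    return boxes[kept], scores[kept]
```

With multi-scale testing, predictions from all three scales would be pooled before this step.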

  • • Time-consuming optimization

    • Fuse conv. and BN layers (saves about 3 ms at batch=1).

    • Batch processing (brings the runtime to 60+ ms excluding post-processing).

    • Convert the PyTorch model to TensorRT with the simple torch2trt toolbox available at https://github.com/z-bingo/torch2trt (average runtime: 48~49 ms).

    Time-Consuming Analysis

    [Bar chart: runtime breakdown in ms (Backbone / BiFPN / Head, 0-100 ms axis) for four settings: w/o BN fuse at batch=1; w/ BN fuse at batch=1; w/ BN fuse at batch=48; w/ BN fuse at batch=48 with TensorRT]
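The conv-BN fusion trick works because both operations are affine at inference, so BN's per-channel scale and shift can be folded into the convolution's weights and bias once, before deployment. A numpy sketch of the algebra on a 1×1-conv-like linear map (shapes and names are ours, not the project's code):

```python
import numpy as np

rng = np.random.default_rng(2)

# A 1x1-conv-like linear map (C_out x C_in) plus per-channel BN parameters.
W = rng.standard_normal((4, 3)); b = rng.standard_normal(4)
gamma, beta = rng.standard_normal(4), rng.standard_normal(4)
mu, var, eps = rng.standard_normal(4), rng.random(4) + 0.5, 1e-5

x = rng.standard_normal((3,))
scale = gamma / np.sqrt(var + eps)

# Separate conv + BN at inference time:
y_ref = scale * (W @ x + b - mu) + beta

# Folded into a single conv: W' = scale * W, b' = scale * (b - mu) + beta.
W_fused = scale[:, None] * W
b_fused = scale * (b - mu) + beta
assert np.allclose(y_ref, W_fused @ x + b_fused)
```

Skipping the BN pass on every feature map is where the roughly 3 ms saving at batch=1 comes from.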

  • Results on Leaderboard

    [Bar chart of leaderboard AP scores (axis 0.8800 to 0.9300): 0.8816, 0.8975, 0.9038, 0.9118, 0.9180, 0.9187, 0.9191, 0.9214, 0.9218, 0.9223, 0.9230, 0.9236, 0.9284, 0.9291]

  • Visualization

    [Qualitative detection results on hard cases: scale, robot, crowded, animal-like, blur, ratio/pose, occlusion, non-human-like]

  • • Asymmetric Conv Block (ACB)

    • Provides the backbone network with the ability to generate features with more diverse receptive fields.

    • BiFPN with ACB can aggregate and enhance multi-scale features at the same time; previous methods achieved this with two separate modules.

    • Dynamic Match Strategy

    • Instead of matching anchors and faces by a single IoU threshold, it mines enough high-quality anchors for each face.

    • Margin Loss

    • Enhances discrimination, especially for faces that are similar to the background, e.g., blurred faces, robot faces.

    • Data Augmentation

    • Augments faces that are too small or too large to be accurately located.

    Conclusion

  • Q&A

    Bin Zhang*, Jian Li*, Yabiao Wang, Zhipeng Cui {* Equal Contribution}

    Contact e-mail: [email protected]

    Youtu Lab, Tencent

    Southeast University