
DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model

Eldar Insafutdinov¹, Leonid Pishchulin¹, Bjoern Andres¹, Mykhaylo Andriluka¹,², and Bernt Schiele¹

¹Max Planck Institute for Informatics, Saarbrücken, Germany    ²Stanford University, Stanford, USA

Goal

• Multi-person pose estimation in monocular images

State of the Art

• DeepCut [5]: joint body part labeling and grouping

+ joint reasoning at the finest level of detail

– weak pairwise terms based on geometry only

– infeasible run-time: takes hours to complete

Contributions

• A deeper, stronger and faster multi-person model

+ “deeper”: strong part detectors based on ResNet [3]

+ “stronger”: novel image-conditioned pairwise terms

+ “faster”: dramatic speed-ups due to strong pairwise terms and incremental optimization

+ NEW: heuristic solver for real-time inference

Unary Terms

• deeper architectures based on Residual Networks [3]

• dilation and de-convolution reduce stride to 8 px

• intermediate supervision: auxiliary losses added at intermediate layers

• joint training of classification and regression tasks (see the sketch below)
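A minimal PyTorch sketch of the unary architecture outlined above, assuming a torchvision ResNet-101 backbone with a dilated last block, a deconvolution down to stride 8, and two output heads (part score maps and location regression). Layer choices, channel counts, the number of parts, and the intermediate-supervision tap are illustrative assumptions, not the released DeeperCut code.

```python
# Sketch of the "deeper" unary part detector: dilated ResNet-101 backbone [3],
# deconvolution to stride-8 feature maps, classification + regression heads,
# and an auxiliary head for intermediate supervision (all names illustrative).
import torch
import torch.nn as nn
import torchvision

class PartDetectionHead(nn.Module):
    def __init__(self, num_parts=14):
        super().__init__()
        resnet = torchvision.models.resnet101(
            replace_stride_with_dilation=[False, False, True])  # keep layer4 at stride 16
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        # deconvolution: stride-16 features -> stride-8 features
        self.upsample = nn.ConvTranspose2d(2048, 512, kernel_size=4, stride=2, padding=1)
        self.part_scores = nn.Conv2d(512, num_parts, kernel_size=1)   # classification head
        self.offsets = nn.Conv2d(512, 2 * num_parts, kernel_size=1)   # location regression head
        # intermediate supervision: auxiliary score maps from a mid-level feature map
        self.aux_scores = nn.Conv2d(1024, num_parts, kernel_size=1)

    def forward(self, img):
        mid = self.backbone[:-1](img)     # output of layer3 (1024 channels, stride 16)
        feat = self.backbone[-1](mid)     # layer4 (2048 channels, dilated, still stride 16)
        feat = torch.relu(self.upsample(feat))
        return {
            "scores": self.part_scores(feat),    # per-part score maps at stride 8
            "offsets": self.offsets(feat),       # per-part (dx, dy) regression at stride 8
            "aux_scores": self.aux_scores(mid),  # an extra loss is attached here in training
        }
```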

DeeperCut Overview

• Joint part labeling and grouping via 0/1 variables

• [Overview figure: (I) detection candidates form a dense graph; (II) body part labeling produces labeled body parts; (III, IV) subset partitioning groups parts into joint person clusters]

• Cost to minimize, with $x_{dc} = 1$ iff detection $d$ is labeled part class $c$, $y_{dd'} = 1$ iff detections $d$ and $d'$ belong to the same person, and $x, y \in \{0,1\}$:

  $\min_{(x,y)\in X_{DC}} \; \sum_{d\in D}\sum_{c\in C} \alpha_{dc}\, x_{dc} \;+\; \sum_{dd'\in\binom{D}{2}} \sum_{c,c'\in C} \beta_{dd'cc'}\, x_{dc}\, x_{d'c'}\, y_{dd'}$

  – first sum: detection and part labeling cost
  – second sum: part clustering (subset partitioning) cost

I. Unary terms

• Body part detection candidates

• Capture distribution of scores over all part classes

II. Pairwise terms

• Capture part relationships within/across people

– proximity: same body part class (c = c′)

– kinematic relations: different part classes (c ≠ c′)

III. Integer Linear Program (ILP)

• Substitute $z_{dd'cc'} = x_{dc}\, x_{d'c'}\, y_{dd'}$ to linearize the objective

• NP-hard problem solved via branch-and-cut (to within a 1% optimality gap)

• Linear constraints on 0/1 labelings: plausible poses

– uniqueness: $\forall d \in D: \; \sum_{c\in C} x_{dc} \le 1$

– consistency: $\forall dd' \in \binom{D}{2}: \; y_{dd'} \le \sum_{c\in C} x_{dc}$ and $y_{dd'} \le \sum_{c\in C} x_{d'c}$

– transitivity: $\forall dd'd'' \in \binom{D}{3}: \; y_{dd'} + y_{d'd''} - 1 \le y_{dd''}$
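To make the objective, the linearization, and the constraints concrete, here is a toy instantiation of the ILP in Python using PuLP and the open-source CBC solver. This is not the branch-and-cut setup used in the paper: the detection candidates, part classes, and α/β costs are made up, and the problem size is tiny.

```python
# Toy version of the ILP above: 4 hypothetical detections, 2 part classes,
# random alpha/beta costs, solved with CBC via PuLP. The substitution
# z_{dd'cc'} = x_{dc} x_{d'c'} y_{dd'} is enforced with standard inequalities.
import itertools
import random
import pulp

D = range(4)   # detection candidates (toy)
C = range(2)   # part classes (toy)
random.seed(0)
alpha = {(d, c): random.uniform(-1, 1) for d in D for c in C}
beta = {(d, dp, c, cp): random.uniform(-1, 1)
        for d, dp in itertools.combinations(D, 2) for c in C for cp in C}

prob = pulp.LpProblem("deepercut_toy", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", [(d, c) for d in D for c in C], cat="Binary")
y = pulp.LpVariable.dicts("y", list(itertools.combinations(D, 2)), cat="Binary")
z = pulp.LpVariable.dicts("z", list(beta.keys()), cat="Binary")

# objective: unary labeling cost + pairwise clustering cost (via z)
prob += (pulp.lpSum(alpha[d, c] * x[d, c] for d in D for c in C)
         + pulp.lpSum(beta[k] * z[k] for k in beta))

for (d, dp, c, cp) in beta:                      # z = x * x' * y linearization
    prob += z[d, dp, c, cp] <= x[d, c]
    prob += z[d, dp, c, cp] <= x[dp, cp]
    prob += z[d, dp, c, cp] <= y[d, dp]
    prob += z[d, dp, c, cp] >= x[d, c] + x[dp, cp] + y[d, dp] - 2

for d in D:                                      # uniqueness
    prob += pulp.lpSum(x[d, c] for c in C) <= 1
for d, dp in itertools.combinations(D, 2):       # consistency
    prob += y[d, dp] <= pulp.lpSum(x[d, c] for c in C)
    prob += y[d, dp] <= pulp.lpSum(x[dp, c] for c in C)
for d, dp, dpp in itertools.combinations(D, 3):  # transitivity
    prob += y[d, dp] + y[dp, dpp] - 1 <= y[d, dpp]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status], pulp.value(prob.objective))
```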

Pairwise Terms

• image conditioned pairwise using CNN regression

– train CNN to regress body part locations

– use regressed offsets and angles as features to train a logistic regression that outputs the pairwise probability (see the sketch below)

• [Figure: part location regression from the left shoulder and from the right knee; pairwise vs. unary predictions for the right knee and for all parts, comparing regression from all parts against unary-only predictions]
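As a rough illustration of the pipeline above, the sketch below builds distance/angle features from regressed part locations and scores them with a scikit-learn logistic regression. The exact feature set and the conversion from probability to ILP cost are assumptions, not the paper's precise definitions.

```python
# Rough sketch of an image-conditioned pairwise term: for two detection
# candidates, offsets/angles derived from the CNN's regressed part locations
# are fed to a logistic regression predicting whether the two candidates
# belong together. Feature choices here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_features(loc_d, loc_dp, pred_cp_from_d, pred_c_from_dp):
    """loc_d, loc_dp: (x, y) locations of detections d and d'.
    pred_cp_from_d: location of part class c' regressed from d.
    pred_c_from_dp: location of part class c regressed from d'."""
    def dist_and_angle(a, b):
        off = np.asarray(b, float) - np.asarray(a, float)
        return [np.linalg.norm(off), np.arctan2(off[1], off[0])]
    feats = []
    feats += dist_and_angle(loc_d, loc_dp)           # observed offset between the candidates
    feats += dist_and_angle(pred_cp_from_d, loc_dp)  # how well d predicts the location of d'
    feats += dist_and_angle(pred_c_from_dp, loc_d)   # how well d' predicts the location of d
    return np.array(feats)

# training: one logistic regression per part-class pair (c, c');
# X_train: (n_pairs, n_features), y_train: 1 if the pair belongs together
def fit_pairwise_model(X_train, y_train):
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)

# at inference, the predicted probability is turned into a pairwise cost beta
def pairwise_cost(model, feats):
    p = model.predict_proba(feats.reshape(1, -1))[0, 1]
    return np.log((1 - p) / np.clip(p, 1e-9, 1))     # logit used as cost (an assumption)
```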

Multi-stage optimization

• speed up inference via incremental optimization (see the sketch below)

1. solve for head and shoulder locations

2. add elbows/wrists to stage 1 solution, re-optimize

3. add rest of body parts to stage 2 solution, re-optimize

– Stage 1: head, shoulder
– Stage 2: elbow, wrist
– Stage 3: hip, knee, ankle
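Schematically, the staged inference can be written as the loop below. `solve_ilp` and the `part_class` attribute are placeholders for illustration, not the authors' actual API.

```python
# Schematic of the incremental (multi-stage) optimization: at each stage the
# newly added part classes are optimized while earlier decisions stay fixed.
STAGES = [
    ["head", "shoulder"],      # stage 1
    ["elbow", "wrist"],        # stage 2
    ["hip", "knee", "ankle"],  # stage 3
]

def multi_stage_inference(detections, unary, pairwise, solve_ilp):
    fixed = {}          # labeling/clustering decisions frozen in earlier stages
    active_parts = []
    for stage_parts in STAGES:
        active_parts = active_parts + stage_parts
        candidates = [d for d in detections if d.part_class in active_parts]
        # re-optimize over all currently active parts, keeping the earlier
        # stages' decisions as hard constraints
        solution = solve_ilp(candidates, unary, pairwise, fixed=fixed)
        fixed.update(solution)
    return fixed
```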

Quantitative Multi-Person Results

• MPII Multi-Person [1]

– Mean Average Precision (mAP) metric

Setting Head Sho Elb Wri Hip Knee Ank mAP s/frame

subset of 288 images

DeepCut [5] 73.4 71.8 57.9 39.9 56.7 44.0 32.0 54.1 57995

DeeperCut

+image cond. pw. 83.1 75.8 64.6 54.0 60.6 52.0 44.9 62.6 2336

+deeper archit. 83.3 79.4 66.1 57.9 63.5 60.5 49.9 66.2 1333

+multi-st. opt. 87.5 82.8 70.2 61.6 66.0 60.6 56.5 69.7 230

Iqbal&Gall, ECCVw’16 70.0 65.2 56.4 46.1 52.7 47.9 44.5 54.7 10

full set

DeeperCut 79.1 72.2 59.7 50.0 56.0 51.0 44.6 59.4 485

+heuristic solver 79.6 74.0 62.8 52.5 60.0 53.3 44.6 61.4 0.15

FR-CNN [6] + unary 64.9 62.9 53.4 44.1 50.7 43.1 35.2 51.0 1

Iqbal&Gall, ECCVw’16 58.4 53.9 44.5 35.0 42.2 36.7 31.1 43.1 10

• We are Family (WAF) [2]

– Percentage of Correct Parts (PCP) metric

Setting Head Upper arms Lower arms Torso mPCP AOP s/frame

DeepCut [5] 99.3 81.5 79.5 87.1 84.7 86.5 22000

DeeperCut 99.3 83.8 81.9 87.1 86.3 88.1 13

Ghiasi et al., CVPR’14 - - - - 63.6 74.0 -

Eichner&Ferrari, ECCV’10 97.6 68.2 48.1 86.1 69.4 80.0 -

Chen&Yuille, CVPR’15 98.5 77.2 71.3 88.5 80.7 84.9 -

Qualitative Multi-Person Results

• Successful cases

• Failure cases

– limbs across people, symmetry confusion, hard poses

Single Person Results

• Percentage of Correct Keypoints (PCK) metric (see the PCKh sketch after the tables)

• MPII Single Person dataset [1]

Setting Head Sho Elb Wri Hip Knee Ank PCKh AUC

DeepCut [5] (unary) 94.1 90.2 83.4 77.3 82.6 75.7 68.6 82.4 56.5

DeeperCut (unary) 96.6 94.6 88.5 84.4 87.6 83.9 79.4 88.3 60.7

Newell et al., ECCV’16 98.2 96.3 91.2 87.1 90.1 87.4 83.6 90.9 62.9

• Leeds Sports Poses (LSP) [4]

Setting Head Sho Elb Wri Hip Knee Ank PCK AUC

DeepCut [5] (unary) 97.0 91.0 83.8 78.1 91.0 86.7 82.0 87.1 63.5

DeeperCut (unary) 97.4 92.7 87.5 84.4 91.5 89.9 87.2 90.1 66.1

Bulat&Tzimiropoulos, ECCV’16 97.2 92.1 88.1 85.2 92.2 91.4 88.7 90.7 63.4

• More comparisons at human-pose.mpi-inf.mpg.de
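For reference, a minimal numpy sketch of the PCKh check used in the MPII table above: a keypoint counts as correct if it lies within a fraction of the ground-truth head segment length (0.5 for PCKh@0.5). Array shapes and the visibility mask are assumptions.

```python
# Minimal PCKh sketch: fraction of annotated keypoints whose prediction falls
# within thresh * head segment length of the ground truth.
import numpy as np

def pckh(pred, gt, head_sizes, visible, thresh=0.5):
    """pred, gt: (N, K, 2) keypoint coordinates; head_sizes: (N,) head segment
    lengths; visible: (N, K) boolean mask of annotated keypoints."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, K) pixel distances
    correct = dist <= thresh * head_sizes[:, None]   # normalize per person
    return correct[visible].mean()                   # fraction of correct keypoints
```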

References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR’14.

[2] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In ECCV’10.

[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv’15.

[4] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC’10.

[5] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In CVPR’16.

[6] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS’15.