Models for Multi-View Object Class Detection
Han-Pang Chiu
Multi-View Object Class Detection
(Figure: training and test sets for three settings — multi-view same object, multi-view object class, and single-view object class.)
The Roadblock
• All existing methods for multi-view object class detection require many real training images of objects for many viewpoints.
- The learning processes for each viewpoint of the same object class should be related.
The Potemkin Model
The Potemkin model1 can be viewed as a collection of parts, which are oriented 3D primitives.
- a 3D class skeleton: the arrangement of part centroids in 3D.
- 2D projective transforms: the shape change of each part from one view to another.
1So-called "Potemkin villages" were artificial villages, constructed only of facades. Our models, too, are constructed of facades.
Related Approaches
- multiple 2D models [Crandall07, Torralba04, Leibe07]
- explicit 3D models [Hoiem07, Yan07]
- cross-view constraints [Thomas06, Savarese07, Kushal07]
(Figure: the approaches arranged along a 2D–3D spectrum, trading off data-efficiency and compatibility; the Potemkin model sits between the two extremes.)
Two Uses of the Potemkin Model
1. Generate virtual training data for a multi-view object class detection system (2D test image → detection result).
2. Reconstruct 3D shapes of detected objects (3D understanding).
Outline
- Potemkin Model: Basic | Generalized | 3D
- Estimation: Class Skeleton; Real Training Data; Supervised Part Labeling
- Use: Virtual Training Data Generation
Definition of the Basic Potemkin Model
• A basic Potemkin model for an object class with N parts consists of:
3D Space
- K view bins
- K projection matrices
- a class skeleton (S1, S2, …, SN): class-dependent
2D Transforms
- N·K² transformation matrices
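The ingredients above can be collected into a small data structure. This is a hypothetical sketch — the field names (skeleton, transforms) and the identity-initialized transforms are ours, not the thesis implementation:

```python
# Hypothetical sketch of the basic model's contents; the field names
# (skeleton, transforms) are ours, not the thesis implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class BasicPotemkinModel:
    K: int                  # number of view bins
    N: int                  # number of parts
    skeleton: np.ndarray    # (N, 3) part centroids S_1..S_N (class-dependent)
    transforms: np.ndarray  # (N, K, K, 3, 3): the N*K^2 projective transforms

def empty_model(K, N):
    # Identity transforms as a neutral starting point; estimation fills them in.
    return BasicPotemkinModel(
        K=K, N=N,
        skeleton=np.zeros((N, 3)),
        transforms=np.tile(np.eye(3), (N, K, K, 1, 1)),
    )
```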
Estimating the Basic Potemkin Model: Phase 1
- Learn 2D projective transforms (8 degrees of freedom each) from a 3D oriented primitive: one transform T1, T2, T3, … between each pair of view bins (view → view).
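A transform with 8 degrees of freedom is a 2D homography. A minimal sketch of estimating one from point correspondences via the standard direct linear transform (our illustration, not the thesis code):

```python
# Minimal DLT sketch (our illustration, not the thesis code): estimate the
# 8-DOF projective transform mapping points in one view bin to another.
import numpy as np

def estimate_homography(src, dst):
    """src, dst: (n, 2) corresponding points, n >= 4, general position."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of this system (smallest singular value).
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]   # fix the overall scale: 8 degrees of freedom remain

def apply_homography(H, pts):
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:]
```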
Estimating the Basic Potemkin Model Phase 2
- We compute the 3D class skeleton for the target object class.
- Each part needs to be visible in at least two of the view bins we are interested in.
- We need to label the view bins and the parts of objects in real training images.
Using the Basic Potemkin Model
- 3D model (synthetic, class-independent): shape primitives → 2D synthetic views → generic transforms
- Target object class (real, class-specific): few labeled images → skeleton and part transforms
- All labeled images → combine parts with the part transforms → virtual images (virtual, view-specific)
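The "combine parts" step can be sketched as pushing each labeled part outline through its learned transform. This is a simplified illustration — the actual system composites image regions, which polygon outlines stand in for here:

```python
# Simplified "combine parts" sketch: push each labeled part outline through
# its learned part transform to assemble a virtual view. The real system
# composites image regions; polygon outlines stand in for them here.
import numpy as np

def warp_points(H, pts):
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:]

def generate_virtual_outlines(part_outlines, part_transforms):
    """part_outlines: list of (n_i, 2) polygons, one per labeled part.
    part_transforms: matching list of 3x3 source-to-target homographies."""
    return [warp_points(H, p) for p, H in zip(part_outlines, part_transforms)]
```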
Problem of the Basic Potemkin Model
(Figure: a 3D class skeleton with six numbered parts on x/y/z axes, and the single primitive's projected part outlines in six view bins.)
Outline
- Potemkin Model: Basic | Generalized | 3D
- Estimation: Class Skeleton; Multiple Primitives; Real Training Data; Supervised Part Labeling
- Use: Virtual Training Data Generation
Multiple Oriented Primitives
(Figure: multiple primitives, each with its own 2D transforms between 2D views.)
• An oriented primitive is determined by the 3D shape and the starting view bin.
(Figure: K views — View 1, View 2, …, View K — swept over azimuth and elevation for several 3D shapes.)
The Potemkin Model: Estimating and Using
- 3D model (synthetic, class-independent): shape primitives → 2D synthetic views → generic 2D transforms between the K view bins → primitive selection
- Target object class (real, class-specific): few labeled images → skeleton and part transforms
- All labeled images → infer part indicator → combine parts with the part transforms → virtual images (virtual, view-specific)
Greedy Primitive Selection
- Find the best set of M primitives to model all parts.
- Four primitives are enough for modeling four object classes (21 object parts).
(Figure: quality of transformation (0.5–0.9) versus number of greedily selected primitives (1–8), plotted for chair, bicycle, car, aircraft, and all classes.)
Greedy Selection
- For each part, apply each candidate primitive's transform T_m (m = 1, …, M) from view A to view B, and score how well the transformed part outline matches the true outline in view B; greedily add the primitive with the best marginal gain.
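The greedy loop can be sketched as follows. The score matrix `quality` and the gain definition (average best-per-part quality) are our illustrative assumptions, not the thesis's exact criterion:

```python
# Greedy selection sketch. We assume a precomputed score quality[m, p] in
# [0, 1] measuring how well primitive m's transforms model part p; the gain
# definition (average best-per-part quality) is our illustrative assumption.
import numpy as np

def greedy_select(quality, k):
    M, P = quality.shape
    chosen, best = [], np.zeros(P)   # best[p]: quality of part p so far
    for _ in range(k):
        gains = [np.maximum(best, quality[m]).mean() - best.mean()
                 if m not in chosen else -np.inf
                 for m in range(M)]
        m_star = int(np.argmax(gains))
        chosen.append(m_star)
        best = np.maximum(best, quality[m_star])
    return chosen
```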
Primitive-Based Representation
• Multiple primitives better predict what objects look like in novel views.
(Figure: predicted part shapes with a single primitive versus multiple primitives.)
The Influence of Multiple Primitives
Virtual Training Images
(Figure: the model pipeline, now highlighting the virtual, view-specific training images produced by combining parts.)
Outline
- Potemkin Model: Basic | Generalized
- Estimation: Class Skeleton; Multiple Primitives; Real Training Data; Supervised Part Labeling; Self-Supervised Part Labeling
- Use: Virtual Training Data Generation
Self-Supervised Part Labeling
• For the target view, choose one model object and label its parts.
• The model object is then deformed to other objects in the target view for part labeling.
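A much-simplified stand-in for the deformation step: the slides use shape-context matching, whereas this sketch just transfers labels by nearest boundary point, assuming the objects are already roughly aligned:

```python
# Much-simplified stand-in for the deformation step: the slides use
# shape-context matching, while here labels are transferred by nearest
# boundary point, assuming the objects are already roughly aligned.
import numpy as np

def transfer_part_labels(model_pts, model_labels, new_pts):
    """model_pts: (m, 2) boundary points of the labeled model object;
    model_labels: (m,) part index per point; new_pts: (n, 2)."""
    d2 = ((new_pts[:, None, :] - model_pts[None, :, :]) ** 2).sum(axis=-1)
    return model_labels[d2.argmin(axis=1)]
```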
(Figure: shape-context matching between the model object and new objects — sampled boundary points (100 samples), recovered correspondences (93 and 75), and the associated affine and shape-context matching costs.)
Multi-View Class Detection Experiment
• Detector: Crandall's system (CVPR05, CVPR07)
• Dataset: cars (partial PASCAL), chairs (collected by LIS)
• Real/virtual training images per view: 20/100 (chairs), 15/50 (cars)
• Task: object/no object, no viewpoint identification
(Figure: ROC curves — true positive rate versus false positive rate — for object classes Chair and Car, comparing training on: real images; real images from all views; real + virtual (single primitive); real + virtual (multiple primitives); real + virtual (self-supervised).)
Outline
- Potemkin Model: Basic | Generalized | 3D
- Estimation: Class Skeleton; Multiple Primitives; Class Planes; Real Training Data; Supervised Part Labeling; Self-Supervised Part Labeling
- Use: Virtual Training Data Generation
Definition of the 3D Potemkin Model
• A 3D Potemkin model for an object class with N parts consists of:
3D Space
- K view bins
- K projection matrices, K rotation matrices
- a class skeleton (S1, S2, …, SN)
- K part-labeled images
- N 3D planes Qi (i = 1, …, N): ai X + bi Y + ci Z + di = 0
3D Representation
• Efficiently captures prior knowledge of the 3D shapes of the target object class.
• The object class is represented as a collection of parts, which are oriented 3D primitive shapes.
• This representation is only approximately correct.
Estimating 3D Planes
(Figure: estimated 3D part planes projected into several views, comparing no occlusion handling, occlusion handling, and self-occlusion handling.)
3D Potemkin Model: Car
- Minimum requirement: four views of one instance
- Number of parts: 8 (right side, grille, hood, windshield, roof, back windshield, back grille, left side)
(Figure: the estimated 3D car model rendered from three viewpoints.)
Outline
- Potemkin Model: Basic | Generalized | 3D
- Estimation: Class Skeleton; Multiple Primitives; Class Planes; Real Training Data; Supervised Part Labeling; Self-Supervised Part Labeling
- Use: Virtual Training Data Generation; Single-View 3D Reconstruction
Single-View Reconstruction
• 3D reconstruction of (X, Y, Z) from a single 2D image point (x_im, y_im), given a camera matrix M and a 3D plane:

M = [ m11 m12 m13 m14 ; m21 m22 m23 m24 ; m31 m32 m33 m34 ]

x_im = (m11 X + m12 Y + m13 Z + m14) / (m31 X + m32 Y + m33 Z + m34)
y_im = (m21 X + m22 Y + m23 Z + m24) / (m31 X + m32 Y + m33 Z + m34)
a X + b Y + c Z + d = 0
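The two projection equations plus the plane equation form a 3×3 linear system in (X, Y, Z). A minimal sketch of the solve (our illustration):

```python
# The two projection equations plus the part's plane give a 3x3 linear
# system in (X, Y, Z); a minimal sketch (our illustration).
import numpy as np

def backproject(M, plane, x_im, y_im):
    """M: 3x4 camera matrix; plane: (a, b, c, d) with aX + bY + cZ + d = 0."""
    a, b, c, d = plane
    A = np.array([
        [M[0, 0] - x_im * M[2, 0], M[0, 1] - x_im * M[2, 1], M[0, 2] - x_im * M[2, 2]],
        [M[1, 0] - y_im * M[2, 0], M[1, 1] - y_im * M[2, 1], M[1, 2] - y_im * M[2, 2]],
        [a, b, c],
    ])
    rhs = np.array([x_im * M[2, 3] - M[0, 3], y_im * M[2, 3] - M[1, 3], -d])
    return np.linalg.solve(A, rhs)
```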
Automatic 3D Reconstruction
• 3D class-specific reconstruction from a single 2D image, given a camera matrix M and a 3D ground plane (ag X + bg Y + cg Z + dg = 0).
(Figure: pipeline from the 2D input through detection (Leibe et al. 07), segmentation (Li et al. 05), geometric context (Hoiem et al. 05), and self-supervised part registration to the 3D output, using the 3D Potemkin model's part planes Pi: ai X + bi Y + ci Z + di = 0 and occluded-part prediction (P1, P2, offset).)
• Hoiem et al. classified image regions into three geometric classes (ground, vertical surfaces, and sky).
• They treated detected objects as vertical planar surfaces in 3D.
• They set a default camera matrix and a default 3D ground plane.
Application: Photo Pop-up
Object Pop-up
Demo videos: http://people.csail.mit.edu/chiu/demos.htm
Depth Map Prediction
• Match a predicted depth map against available 2.5D data.
• Improve the performance of existing 2D detection systems.
(Figure: example test images with predicted depth maps.)
Application: Object Detection
• 109 test images and stereo depth maps, 127 annotated cars
• 15 candidates per image (each candidate ci: bounding box bi, likelihood li from the 2D detector, predicted depth map zi)
• Each candidate is scored by combining the detector likelihood with a depth-consistency term:

D_i = min over a1, a2 of Σ_s ( z_s − (a1 · z_i,s + a2) )²   (a1, a2 absorb depth scale and offset)
score_i = w · log(l_i) + (1 − w) · log(exp(−D_i))

(Figure: a test image and its stereo depth map; stereo hardware: Videre Designs.)
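In code, the depth-consistency term is a least-squares fit over the unknown scale and offset. This is a sketch under our reading of the reconstructed formula:

```python
# Sketch of the depth-consistency term under our reading of the formula:
# fit the scale a1 and offset a2 aligning predicted to observed depths,
# keep the residual as D_i, and blend with the detector likelihood l_i.
import numpy as np

def depth_consistency(z_pred, z_obs):
    """Least-squares fit z_obs ~ a1 * z_pred + a2; returns mean sq. residual."""
    A = np.stack([z_pred.ravel(), np.ones(z_pred.size)], axis=1)
    coef, *_ = np.linalg.lstsq(A, z_obs.ravel(), rcond=None)
    r = z_obs.ravel() - A @ coef
    return float(np.mean(r ** 2))

def candidate_score(l_i, D_i, w=0.5):
    # w * log(l_i) + (1 - w) * log(exp(-D_i)) = w * log(l_i) - (1 - w) * D_i
    return w * np.log(l_i) - (1.0 - w) * D_i
```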
Experimental Results
• Number of car training/test images: 155/109
• Murphy-Torralba-Freeman detector (w = 0.5)
• Dalal-Triggs detector (w = 0.6)
(Figure: detection rate versus false positives per image for the Murphy-Torralba-Freeman and Dalal-Triggs detectors, each with and without the depth term.)
Quality of Reconstruction
• Calibration: camera and 3D ground plane (1 m by 1.2 m table)
• 20 diecast model cars

              Average overlap   Centroid error   Orientation error
Potemkin      77.5 %            8.75 mm          2.34°
Single Plane                    73.95 mm         16.26°

Ferrari F1: 26.56 %, 24.89 mm, 3.37°
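For reference, an overlap score can be computed as intersection-over-union on occupancy grids. This is an assumption — the thesis's exact "average overlap" metric may be defined differently (e.g. on 2D projections):

```python
# Assumed metric sketch: "average overlap" computed as intersection-over-
# union of occupancy grids; the thesis's exact definition may differ.
import numpy as np

def overlap(recon, truth):
    """recon, truth: boolean occupancy arrays of the same shape."""
    inter = np.logical_and(recon, truth).sum()
    union = np.logical_or(recon, truth).sum()
    return inter / union if union else 1.0
```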
Application: Robot Manipulation
• 20 diecast model cars, 60 trials
• Successful grasp: 57/60 (Potemkin), 6/60 (Single Plane)
Demo videos: http://people.csail.mit.edu/chiu/demos.htm
(Figure: object-centered extrinsic camera parameters for the grasping trials.)
Occluded Part Prediction
• A basket instance
(Figure: the basket's 3D class skeleton with six numbered parts, and the object-centered extrinsic camera parameters.)
Demo videos: http://people.csail.mit.edu/chiu/demos.htm
Contributions
• The Potemkin Model:
- Provide a middle ground between 2D and 3D
- Construct a relatively weak 3D model
- Generate virtual training data
- Reconstruct 3D objects from a single image
• Applications
- Multi-view object class detection
- Object pop-up
- Object detection using 2.5D data
- Robot Manipulation
Acknowledgements
• Thesis committee members
- Tomás Lozano-Pérez, Leslie Kaelbling, Bill Freeman
• Experimental Help
- LabelMe and detection system: Sam Davies
- Robot system: Kaijen Hsiao and Huan Liu
- Data collection: Meg A. Lippow and Sarah Finney
- Stereo vision: Tom Yeh and Sybor Wang
- Others: David Huynh, Yushi Xu, and Hung-An Chang
• All LIS people • My parents and my wife, Ju-Hui
Thank you!