Models for Multi-View Object Class Detection
Han-Pang Chiu
Multi-View Object Class Detection
(Figure: training and test sets for three settings — multi-view same object, multi-view object class, and single-view object class.)
The Roadblock
• All existing methods for multi-view object class detection require many real training images of objects for many viewpoints.
- The learning processes for each viewpoint of the same object class should be related.
The Potemkin Model
The Potemkin model1 can be viewed as a collection of parts, which are oriented 3D primitives.
- a 3D class skeleton: the arrangement of part centroids in 3D.
- 2D projective transforms: the shape change of each part from one view to another.
1So-called "Potemkin villages" were artificial villages, constructed only of facades. Our models, too, are constructed of facades.
Related Approaches
- multiple 2D models [Crandall07, Torralba04, Leibe07]
- explicit 3D models [Hoiem07, Yan07]
- cross-view constraints [Thomas06, Savarese07, Kushal07]
(Figure: the approaches arranged along a 2D–3D spectrum, trading off data-efficiency and compatibility; the Potemkin model sits between the two extremes.)
Two Uses of the Potemkin Model
1. Generate virtual training data for a multi-view object class detection system (2D test image → detection result).
2. Reconstruct 3D shapes of detected objects (3D understanding).
Outline
- Potemkin Model: Basic | Generalized | 3D
- Estimation: Class Skeleton; Real Training Data; Supervised Part Labeling
- Use: Virtual Training Data Generation
Definition of the Basic Potemkin Model
• A basic Potemkin model for an object class with N parts consists of:
3D Space
- K view bins
- K projection matrices
- a class skeleton (S1, S2, …, SN): class-dependent
2D Transforms
- N·K² transformation matrices
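The ingredients above can be collected into a small data structure. This is a hypothetical sketch — the field names (skeleton, transforms) and the identity-initialized transforms are ours, not the thesis implementation:

```python
# Hypothetical sketch of the basic model's contents; the field names
# (skeleton, transforms) are ours, not the thesis implementation.
from dataclasses import dataclass
import numpy as np

@dataclass
class BasicPotemkinModel:
    K: int                  # number of view bins
    N: int                  # number of parts
    skeleton: np.ndarray    # (N, 3) part centroids S_1..S_N (class-dependent)
    transforms: np.ndarray  # (N, K, K, 3, 3): the N*K^2 projective transforms

def empty_model(K, N):
    # Identity transforms as a neutral starting point; estimation fills them in.
    return BasicPotemkinModel(
        K=K, N=N,
        skeleton=np.zeros((N, 3)),
        transforms=np.tile(np.eye(3), (N, K, K, 1, 1)),
    )
```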
Estimating the Basic Potemkin Model: Phase 1
- Learn 2D projective transforms (8 degrees of freedom each) from a 3D oriented primitive: one transform T1, T2, T3, … between each pair of view bins (view → view).
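A transform with 8 degrees of freedom is a 2D homography. A minimal sketch of estimating one from point correspondences via the standard direct linear transform (our illustration, not the thesis code):

```python
# Minimal DLT sketch (our illustration, not the thesis code): estimate the
# 8-DOF projective transform mapping points in one view bin to another.
import numpy as np

def estimate_homography(src, dst):
    """src, dst: (n, 2) corresponding points, n >= 4, general position."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The homography is the null vector of this system (smallest singular value).
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]   # fix the overall scale: 8 degrees of freedom remain

def apply_homography(H, pts):
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:]
```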
Estimating the Basic Potemkin Model Phase 2
- We compute the 3D class skeleton for the target object class.
- Each part needs to be visible in at least two of the view bins we are interested in.
- We need to label the view bins and the parts of objects in real training images.
Using the Basic Potemkin Model
- 3D model (synthetic, class-independent): shape primitives → 2D synthetic views → generic transforms
- Target object class (real, class-specific): few labeled images → skeleton and part transforms
- All labeled images → combine parts with the part transforms → virtual images (virtual, view-specific)
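The "combine parts" step can be sketched as pushing each labeled part outline through its learned transform. This is a simplified illustration — the actual system composites image regions, which polygon outlines stand in for here:

```python
# Simplified "combine parts" sketch: push each labeled part outline through
# its learned part transform to assemble a virtual view. The real system
# composites image regions; polygon outlines stand in for them here.
import numpy as np

def warp_points(H, pts):
    ph = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return ph[:, :2] / ph[:, 2:]

def generate_virtual_outlines(part_outlines, part_transforms):
    """part_outlines: list of (n_i, 2) polygons, one per labeled part.
    part_transforms: matching list of 3x3 source-to-target homographies."""
    return [warp_points(H, p) for p, H in zip(part_outlines, part_transforms)]
```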
Problem of the Basic Potemkin Model
(Figure: a 3D class skeleton with six numbered parts on x/y/z axes, and the single primitive's projected part outlines in six view bins.)
Outline
- Potemkin Model: Basic | Generalized | 3D
- Estimation: Class Skeleton; Multiple Primitives; Real Training Data; Supervised Part Labeling
- Use: Virtual Training Data Generation
Multiple Oriented Primitives
(Figure: multiple primitives, each with its own 2D transforms between 2D views.)
• An oriented primitive is determined by the 3D shape and the starting view bin.
(Figure: K views — View 1, View 2, …, View K — swept over azimuth and elevation for several 3D shapes.)
The Potemkin Model: Estimating and Using
- 3D model (synthetic, class-independent): shape primitives → 2D synthetic views → generic 2D transforms between the K view bins → primitive selection
- Target object class (real, class-specific): few labeled images → skeleton and part transforms
- All labeled images → infer part indicator → combine parts with the part transforms → virtual images (virtual, view-specific)
Greedy Primitive Selection
- Find the best set of M primitives to model all parts.
- Four primitives are enough for modeling four object classes (21 object parts).
(Figure: quality of transformation (0.5–0.9) versus number of greedily selected primitives (1–8), plotted for chair, bicycle, car, aircraft, and all classes.)
Greedy Selection
- For each part, apply each candidate primitive's transform T_m (m = 1, …, M) from view A to view B, and score how well the transformed part outline matches the true outline in view B; greedily add the primitive with the best marginal gain.
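The greedy loop can be sketched as follows. The score matrix `quality` and the gain definition (average best-per-part quality) are our illustrative assumptions, not the thesis's exact criterion:

```python
# Greedy selection sketch. We assume a precomputed score quality[m, p] in
# [0, 1] measuring how well primitive m's transforms model part p; the gain
# definition (average best-per-part quality) is our illustrative assumption.
import numpy as np

def greedy_select(quality, k):
    M, P = quality.shape
    chosen, best = [], np.zeros(P)   # best[p]: quality of part p so far
    for _ in range(k):
        gains = [np.maximum(best, quality[m]).mean() - best.mean()
                 if m not in chosen else -np.inf
                 for m in range(M)]
        m_star = int(np.argmax(gains))
        chosen.append(m_star)
        best = np.maximum(best, quality[m_star])
    return chosen
```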
Primitive-Based Representation
• Multiple primitives better predict what objects look like in novel views.
(Figure: predicted part shapes with a single primitive versus multiple primitives.)
The Influence of Multiple Primitives
Virtual Training Images
(Figure: the model pipeline, now highlighting the virtual, view-specific training images produced by combining parts.)
Outline
- Potemkin Model: Basic | Generalized
- Estimation: Class Skeleton; Multiple Primitives; Real Training Data; Supervised Part Labeling; Self-Supervised Part Labeling
- Use: Virtual Training Data Generation
Self-Supervised Part Labeling
• For the target view, choose one model object and label its parts.
• The model object is then deformed to other objects in the target view for part labeling.
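A much-simplified stand-in for the deformation step: the slides use shape-context matching, whereas this sketch just transfers labels by nearest boundary point, assuming the objects are already roughly aligned:

```python
# Much-simplified stand-in for the deformation step: the slides use
# shape-context matching, while here labels are transferred by nearest
# boundary point, assuming the objects are already roughly aligned.
import numpy as np

def transfer_part_labels(model_pts, model_labels, new_pts):
    """model_pts: (m, 2) boundary points of the labeled model object;
    model_labels: (m,) part index per point; new_pts: (n, 2)."""
    d2 = ((new_pts[:, None, :] - model_pts[None, :, :]) ** 2).sum(axis=-1)
    return model_labels[d2.argmin(axis=1)]
```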
(Figure: shape-context matching between the model object and new objects — sampled boundary points (100 samples), recovered correspondences (93 and 75), and the associated affine and shape-context matching costs.)
Multi-View Class Detection Experiment
• Detector: Crandall's system (CVPR05, CVPR07)
• Dataset: cars (partial PASCAL), chairs (collected by LIS)
• Real/virtual training images per view: 20/100 (chairs), 15/50 (cars)
• Task: object/no object, no viewpoint identification
(Figure: ROC curves — true positive rate versus false positive rate — for object classes Chair and Car, comparing training on: real images; real images from all views; real + virtual (single primitive); real + virtual (multiple primitives); real + virtual (self-supervised).)
Outline
- Potemkin Model: Basic | Generalized | 3D
- Estimation: Class Skeleton; Multiple Primitives; Class Planes; Real Training Data; Supervised Part Labeling; Self-Supervised Part Labeling
- Use: Virtual Training Data Generation
Definition of the 3D Potemkin Model
• A 3D Potemkin model for an object class with N parts consists of:
3D Space
- K view bins
- K projection matrices, K rotation matrices
- a class skeleton (S1, S2, …, SN)
- K part-labeled images
- N 3D planes Qi (i = 1, …, N): ai X + bi Y + ci Z + di = 0
3D Representation
• Efficiently captures prior knowledge of the 3D shapes of the target object class.
• The object class is represented as a collection of parts, which are oriented 3D primitive shapes.
• This representation is only approximately correct.
Estimating 3D Planes
(Figure: estimated 3D part planes projected into several views, comparing no occlusion handling, occlusion handling, and self-occlusion handling.)
3D Potemkin Model: Car
- Minimum requirement: four views of one instance
- Number of parts: 8 (right side, grille, hood, windshield, roof, back windshield, back grille, left side)
(Figure: the estimated 3D car model rendered from three viewpoints.)
Outline
- Potemkin Model: Basic | Generalized | 3D
- Estimation: Class Skeleton; Multiple Primitives; Class Planes; Real Training Data; Supervised Part Labeling; Self-Supervised Part Labeling
- Use: Virtual Training Data Generation; Single-View 3D Reconstruction
Single-View Reconstruction
• 3D reconstruction of (X, Y, Z) from a single 2D image point (x_im, y_im), given a camera matrix M and a 3D plane:

M = [ m11 m12 m13 m14 ; m21 m22 m23 m24 ; m31 m32 m33 m34 ]

x_im = (m11 X + m12 Y + m13 Z + m14) / (m31 X + m32 Y + m33 Z + m34)
y_im = (m21 X + m22 Y + m23 Z + m24) / (m31 X + m32 Y + m33 Z + m34)
a X + b Y + c Z + d = 0
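The two projection equations plus the plane equation form a 3×3 linear system in (X, Y, Z). A minimal sketch of the solve (our illustration):

```python
# The two projection equations plus the part's plane give a 3x3 linear
# system in (X, Y, Z); a minimal sketch (our illustration).
import numpy as np

def backproject(M, plane, x_im, y_im):
    """M: 3x4 camera matrix; plane: (a, b, c, d) with aX + bY + cZ + d = 0."""
    a, b, c, d = plane
    A = np.array([
        [M[0, 0] - x_im * M[2, 0], M[0, 1] - x_im * M[2, 1], M[0, 2] - x_im * M[2, 2]],
        [M[1, 0] - y_im * M[2, 0], M[1, 1] - y_im * M[2, 1], M[1, 2] - y_im * M[2, 2]],
        [a, b, c],
    ])
    rhs = np.array([x_im * M[2, 3] - M[0, 3], y_im * M[2, 3] - M[1, 3], -d])
    return np.linalg.solve(A, rhs)
```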
Automatic 3D Reconstruction
• 3D class-specific reconstruction from a single 2D image, given a camera matrix M and a 3D ground plane (ag X + bg Y + cg Z + dg = 0).
(Figure: pipeline from the 2D input through detection (Leibe et al. 07), segmentation (Li et al. 05), geometric context (Hoiem et al. 05), and self-supervised part registration to the 3D output, using the 3D Potemkin model's part planes Pi: ai X + bi Y + ci Z + di = 0 and occluded-part prediction (P1, P2, offset).)
• Hoiem et al. classified image regions into three geometric classes (ground, vertical surfaces, and sky).
• They treated detected objects as vertical planar surfaces in 3D.
• They set a default camera matrix and a default 3D ground plane.
Application: Photo Pop-up
Object Pop-up
Demo videos: http://people.csail.mit.edu/chiu/demos.htm
Depth Map Prediction
• Match a predicted depth map against available 2.5D data.
• Improve the performance of existing 2D detection systems.
(Figure: example test images with predicted depth maps.)
Application: Object Detection
• 109 test images and stereo depth maps, 127 annotated cars
• 15 candidates per image (each candidate ci: bounding box bi, likelihood li from the 2D detector, predicted depth map zi)
• Each candidate is scored by combining the detector likelihood with a depth-consistency term:

D_i = min over a1, a2 of Σ_s ( z_s − (a1 · z_i,s + a2) )²   (a1, a2 absorb depth scale and offset)
score_i = w · log(l_i) + (1 − w) · log(exp(−D_i))

(Figure: a test image and its stereo depth map; stereo hardware: Videre Designs.)
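In code, the depth-consistency term is a least-squares fit over the unknown scale and offset. This is a sketch under our reading of the reconstructed formula:

```python
# Sketch of the depth-consistency term under our reading of the formula:
# fit the scale a1 and offset a2 aligning predicted to observed depths,
# keep the residual as D_i, and blend with the detector likelihood l_i.
import numpy as np

def depth_consistency(z_pred, z_obs):
    """Least-squares fit z_obs ~ a1 * z_pred + a2; returns mean sq. residual."""
    A = np.stack([z_pred.ravel(), np.ones(z_pred.size)], axis=1)
    coef, *_ = np.linalg.lstsq(A, z_obs.ravel(), rcond=None)
    r = z_obs.ravel() - A @ coef
    return float(np.mean(r ** 2))

def candidate_score(l_i, D_i, w=0.5):
    # w * log(l_i) + (1 - w) * log(exp(-D_i)) = w * log(l_i) - (1 - w) * D_i
    return w * np.log(l_i) - (1.0 - w) * D_i
```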
Experimental Results
• Number of car training/test images: 155/109
• Murphy-Torralba-Freeman detector (w = 0.5)
• Dalal-Triggs detector (w = 0.6)
(Figure: detection rate versus false positives per image for the Murphy-Torralba-Freeman and Dalal-Triggs detectors, each with and without the depth term.)
Quality of Reconstruction
• Calibration: camera and 3D ground plane (1 m by 1.2 m table)
• 20 diecast model cars

              Average overlap   Centroid error   Orientation error
Potemkin      77.5 %            8.75 mm          2.34°
Single Plane                    73.95 mm         16.26°

Ferrari F1: 26.56 %, 24.89 mm, 3.37°
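For reference, an overlap score can be computed as intersection-over-union on occupancy grids. This is an assumption — the thesis's exact "average overlap" metric may be defined differently (e.g. on 2D projections):

```python
# Assumed metric sketch: "average overlap" computed as intersection-over-
# union of occupancy grids; the thesis's exact definition may differ.
import numpy as np

def overlap(recon, truth):
    """recon, truth: boolean occupancy arrays of the same shape."""
    inter = np.logical_and(recon, truth).sum()
    union = np.logical_or(recon, truth).sum()
    return inter / union if union else 1.0
```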
Application: Robot Manipulation
• 20 diecast model cars, 60 trials
• Successful grasp: 57/60 (Potemkin), 6/60 (Single Plane)
Demo videos: http://people.csail.mit.edu/chiu/demos.htm
(Figure: object-centered extrinsic camera parameters for the grasping trials.)
Occluded Part Prediction
• A basket instance
(Figure: the basket's 3D class skeleton with six numbered parts, and the object-centered extrinsic camera parameters.)
Demo videos: http://people.csail.mit.edu/chiu/demos.htm
Contributions
• The Potemkin Model:
- Provide a middle ground between 2D and 3D
- Construct a relatively weak 3D model
- Generate virtual training data
- Reconstruct 3D objects from a single image
• Applications
- Multi-view object class detection
- Object pop-up
- Object detection using 2.5D data
- Robot Manipulation
Acknowledgements
• Thesis committee members
- Tomás Lozano-Pérez, Leslie Kaelbling, Bill Freeman
• Experimental Help
- LabelMe and detection system: Sam Davies
- Robot system: Kaijen Hsiao and Huan Liu
- Data collection: Meg A. Lippow and Sarah Finney
- Stereo vision: Tom Yeh and Sybor Wang
- Others: David Huynh, Yushi Xu, and Hung-An Chang
• All LIS people • My parents and my wife, Ju-Hui
Thank you!