Real-Time Computer Vision. Microsoft Computer Vision School. Vincent Lepetit - CVLab - EPFL (Lausanne, Switzerland)


Page 1: Vincent Lepetit - Real-time computer vision

Real-Time Computer Vision

Microsoft Computer Vision School

Vincent Lepetit - CVLab - EPFL (Lausanne, Switzerland)

1

Page 2: Vincent Lepetit - Real-time computer vision

demo

2

Page 3: Vincent Lepetit - Real-time computer vision

applications

...

3

Page 4: Vincent Lepetit - Real-time computer vision

• How the demo works (including Randomized Trees);

• More recent work.

4

Page 5: Vincent Lepetit - Real-time computer vision

Background

• 3D world to 2D images (projection matrix, internal parameters, external parameters, homography, ...);

• Robust estimation (non-linear least-squares, RANSAC, robust estimators, ...);

• Feature point matching (affine region detectors, SIFT, ...).

5

Page 6: Vincent Lepetit - Real-time computer vision

From the 3D World to a 2D Image

M

m

World coordinate system

What is the relation between the 3D coordinates of a point M and its corresponding point m in the image captured by the camera?

6

Page 7: Vincent Lepetit - Real-time computer vision

Perspective Projection

C

M

World coordinate system

Camera center

The image formation is modeled as a perspective projection, which is realistic for standard cameras:

The rays passing through a 3D point M and its correspondent m in the image all intersect at a single point C, the camera center.

m

7

Page 8: Vincent Lepetit - Real-time computer vision

Expressing M in the Camera Coordinate System

[Figure: 3D point M and its projection m; world coordinate system; camera coordinate system with axes X, Y, Z and center C; Mcam]

Step 1: Express the coordinates of M in the camera coordinate system as Mcam.

This transformation corresponds to a Euclidean displacement (a rotation plus a translation):

$$M_{cam} = RM + T$$

where R is a 3x3 rotation matrix and T is a 3-vector.

8

Page 9: Vincent Lepetit - Real-time computer vision

Homogeneous Coordinates

Let's replace M by the homogeneous 4-vector $\tilde{M}$: just add a 1 as the fourth coordinate.

$$M = \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} \;\rightarrow\; \tilde{M} = \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

Now, the Euclidean displacement can be expressed as a linear transformation instead of an affine one:

$$M_{cam} = RM + T \;\rightarrow\; \begin{pmatrix} X_{cam} \\ Y_{cam} \\ Z_{cam} \end{pmatrix} = R \begin{pmatrix} X \\ Y \\ Z \end{pmatrix} + T = (R \mid T) \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \;\rightarrow\; M_{cam} = (R \mid T)\,\tilde{M}$$

[Figure: world coordinate system; camera coordinate system with axes X, Y, Z and center C; projection m; Mcam]

(R | T) is a 3x4 matrix.

9

Page 10: Vincent Lepetit - Real-time computer vision

Projection

Computation of the coordinates of m in the image plane, from Mcam (expressed in the camera coordinates system): Simply use Thales' theorem:

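[Figure: camera coordinate system seen from the side, with camera center C, focal length f, the point Mcam = (X, Y, Z) and the abscissa mX of its projection m]

$$\frac{m_X}{f} = \frac{X}{Z} \;\rightarrow\; m_X = f\,\frac{X}{Z}$$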

10

Page 11: Vincent Lepetit - Real-time computer vision

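From Projection to Image

Coordinates of m in pixels?

[Figure: camera coordinate system with center C and focal length f; image coordinate system (u, v) whose origin is offset by (u0, v0); one pixel measures 1/ku by 1/kv]

$$m_X = f\,\frac{X}{Z}, \qquad m_Y = f\,\frac{Y}{Z}$$

$$m_u = u_0 + k_u m_X, \qquad m_v = v_0 + k_v m_Y$$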

11

Page 12: Vincent Lepetit - Real-time computer vision

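Putting
• the perspective projection and
• the transformation into pixel coordinates
together:

$$m_X = f\,\frac{X}{Z}, \qquad m_Y = f\,\frac{Y}{Z}$$

$$m_u = u_0 + k_u m_X, \qquad m_v = v_0 + k_v m_Y$$

In matrix form:

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \end{pmatrix}$$

where $(u, v, w)^\top$ defines m in homogeneous coordinates:

$$m_u = \frac{u}{w} = u_0 + k_u f\,\frac{X}{Z}, \qquad m_v = \frac{v}{w} = v_0 + k_v f\,\frac{Y}{Z}$$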

12

Page 13: Vincent Lepetit - Real-time computer vision

The Full Transformation

The two transformations are chained to form the full transformation from a 3D point in the world coordinate system to its projection in the image:

$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R_{11} & R_{12} & R_{13} & T_1 \\ R_{21} & R_{22} & R_{23} & T_2 \\ R_{31} & R_{32} & R_{33} & T_3 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} P_{11} & P_{12} & P_{13} & P_{14} \\ P_{21} & P_{22} & P_{23} & P_{24} \\ P_{31} & P_{32} & P_{33} & P_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$

The product of the internal calibration matrix and the external calibration matrix is a 3x4 matrix called the "projection matrix".

The projection matrix is defined up to a scale factor.
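The chain "world coordinates, camera coordinates, homogeneous image coordinates, pixels" can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `Intrinsics` struct, the function name `project_point` and the row-major storage of R are choices made for this example, not part of the original slides.

```c
/* Internal parameters: the products ku*f and kv*f, and the principal point.
   The struct and its field names are illustrative. */
typedef struct { double kuf, kvf, u0, v0; } Intrinsics;

/* Projects the world point M to pixel coordinates (mu, mv).
   R is a 3x3 rotation stored row by row, T is a 3-vector.
   Returns 0 on success, -1 if w == 0 (the point cannot be projected). */
int project_point(const Intrinsics *K, const double R[9], const double T[3],
                  const double M[3], double *mu, double *mv)
{
    /* Step 1: camera coordinates, Mcam = R M + T. */
    double Xc = R[0] * M[0] + R[1] * M[1] + R[2] * M[2] + T[0];
    double Yc = R[3] * M[0] + R[4] * M[1] + R[5] * M[2] + T[1];
    double Zc = R[6] * M[0] + R[7] * M[1] + R[8] * M[2] + T[2];

    /* Step 2: homogeneous image coordinates (u, v, w), obtained by
       multiplying Mcam by the internal calibration matrix. */
    double u = K->kuf * Xc + K->u0 * Zc;
    double v = K->kvf * Yc + K->v0 * Zc;
    double w = Zc;
    if (w == 0.0)
        return -1;

    /* Step 3: pixel coordinates, mu = u / w and mv = v / w. */
    *mu = u / w;
    *mv = v / w;
    return 0;
}
```

With an identity rotation, a zero translation, kuf = kvf = 500 and a principal point at (320, 240), the point (1, 2, 10) projects to (370, 340), which matches u0 + ku f X/Z and v0 + kv f Y/Z.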

13

Page 14: Vincent Lepetit - Real-time computer vision

The Full Transformation

R, T, and the products kuf and kvf can be extracted from the projection matrix.

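$$\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \begin{pmatrix} k_u f & 0 & u_0 \\ 0 & k_v f & v_0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R_{11} & R_{12} & R_{13} & T_1 \\ R_{21} & R_{22} & R_{23} & T_2 \\ R_{31} & R_{32} & R_{33} & T_3 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = \begin{pmatrix} P_{11} & P_{12} & P_{13} & P_{14} \\ P_{21} & P_{22} & P_{23} & P_{24} \\ P_{31} & P_{32} & P_{33} & P_{34} \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$$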

14

Page 15: Vincent Lepetit - Real-time computer vision

Homography

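For a 3D point M lying on the plane Z = 0, the projection reduces to a 3x3 homography between the plane and the image:

$$m' = PM = [P_1\,P_2\,P_3\,P_4] \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} = [P_1\,P_2\,P_3\,P_4] \begin{pmatrix} X \\ Y \\ 0 \\ 1 \end{pmatrix} = [P_1\,P_2\,P_4] \begin{pmatrix} X \\ Y \\ 1 \end{pmatrix} = H_{3\times 3}\, m$$

where $P_k$ is the k-th column of P and $m = (X, Y, 1)^\top$.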

15

Page 16: Vincent Lepetit - Real-time computer vision

Computing a Projection Matrix or a Homography from Point Correspondences

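by solving a linear system

With $m = [u, v, 1]^\top$, $m' = [u', v', 1]^\top$ and $m' = Hm$ (up to scale), each correspondence gives two linear equations in the entries of H:

$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$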

16

Page 17: Vincent Lepetit - Real-time computer vision

Computing a Projection Matrix or a Homography from Point Correspondences with a non-linear optimization

• Non-linear least-squares minimization: Minimization of a physical, meaningful error (reprojection error, in pixels)

$$\min_{R,T} \sum_i \mathrm{dist}^2\left(H_{R,T}\, m_i,\; m'_i\right) \qquad \text{or} \qquad \min_{R,T} \sum_i \mathrm{dist}^2\left(P_{R,T}\, M_i,\; m'_i\right)$$

• Minimization algorithms: Gauss-Newton or Levenberg-Marquardt (very efficient).

17

Page 18: Vincent Lepetit - Real-time computer vision

A Look at the Reprojection Error

[Figure: plot of the reprojection error]

1D camera under 2D translation

100 "3D points" taken at random in [400;1000]x[-500;+500]

True camera position at (0, 0)

18

Page 19: Vincent Lepetit - Real-time computer vision

Gaussian Noise on the Projections

White cross: true camera position;
Black cross: global minimum of the objective function.

In that case, the global minimum of the objective function is close to the true camera pose.

19

Page 20: Vincent Lepetit - Real-time computer vision

What if there are Outliers ?

[Figure: 3D points M1, M2, M3, M4 and their measured projections m1, m2, m3, m4 through the camera center C; one of the measures is incorrect (outlier)]

20

Page 21: Vincent Lepetit - Real-time computer vision

Gaussian Noise on the Projections + 20% outliers

White cross: true camera position;
Black cross: global minimum of the objective function.

The global minimum is now far from the true camera pose.

21

Page 22: Vincent Lepetit - Real-time computer vision

What Happened ?

The error on the 2D point locations mi is assumed to have a Gaussian (Normal) distribution with identical covariance matrices σI, and independent;

This assumption is violated when mi is an outlier.

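Bayesian interpretation:

$$\arg\min_{R,T} \sum_i \mathrm{dist}^2\left(P_{R,T}\, M_i,\; m'_i\right) = \arg\max_{R,T} \prod_i \mathcal{N}\left(m'_i;\; P_{R,T}\, M_i,\; \sigma I\right)$$

[Figure: 3D points M1 to M4 and their measured projections m1 to m4 through the camera center C]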

22

Page 23: Vincent Lepetit - Real-time computer vision

Robust estimation

Idea: Replace the Normal distribution by a more suitable distribution, or equivalently replace the least-squares estimator by a "robust estimator" or "M-estimator":

$$\arg\min_{R,T} \sum_i \mathrm{dist}^2\left(P_{R,T}\, M_i,\; m'_i\right) \;\rightarrow\; \arg\min_{R,T} \sum_i \rho\left(\mathrm{dist}\left(P_{R,T}\, M_i,\; m'_i\right)\right)$$

23

Page 24: Vincent Lepetit - Real-time computer vision

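Example of an M-estimator: The Tukey Estimator

[Figure: plot of ρ(x) compared with x²]

The Tukey estimator assumes the measures follow a distribution that is a mixture of:
• a Normal distribution, for the inliers,
• a uniform distribution, for the outliers.

$$\rho(x) = \begin{cases} \dfrac{c^2}{6}\left(1 - \left(1 - \left(\dfrac{x}{c}\right)^2\right)^3\right) & \text{if } |x| \le c \\[2ex] \dfrac{c^2}{6} & \text{if } |x| > c \end{cases}$$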
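The ρ function above translates directly into code. The sketch below assumes a strictly positive cut-off c; the function name `tukey_rho` is chosen for this example only.

```c
#include <math.h>

/* Tukey rho function with cut-off c (assumed > 0).
   Close to x*x/2 for small |x|, constant (c*c/6) beyond the cut-off,
   so that gross outliers all pay the same bounded penalty. */
double tukey_rho(double x, double c)
{
    if (fabs(x) > c)
        return c * c / 6.0;
    double r = x / c;
    double t = 1.0 - r * r;
    return (c * c / 6.0) * (1.0 - t * t * t);
}
```

Because the penalty saturates at c²/6, a residual of 10 pixels and a residual of 1000 pixels contribute the same amount when c is smaller than both, which is also why the objective function becomes flat when all correspondences are treated as outliers.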

24

Page 25: Vincent Lepetit - Real-time computer vision

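[Figure: Normal distribution (inliers) + uniform distribution (outliers) = mixture. Applying -log(.) to the Normal distribution gives the least-squares cost; applying -log(.) to the mixture gives the Tukey estimator]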

25

Page 26: Vincent Lepetit - Real-time computer vision

Gaussian Noise on the Projections + 20% outliers + Tukey estimator

White cross: true camera position;
Black cross: global minimum of the objective function.

The global minimum is very close to the true camera pose.
BUT:
- there are local minima;
- the objective function is flat where all the correspondences are considered outliers.

26

Page 27: Vincent Lepetit - Real-time computer vision

Gaussian Noise on the Projections + 50% outliers + Tukey estimator

Even more local minima.
Numerical optimization can get trapped in a local minimum.

27

Page 28: Vincent Lepetit - Real-time computer vision

RANSAC

28

Page 29: Vincent Lepetit - Real-time computer vision

How to Optimize ?

Idea: sampling the space of solutions (the camera pose space here):

29

Page 30: Vincent Lepetit - Real-time computer vision

How to Optimize ?

Idea: sampling the space of solutions:

+ Numerical Optimization from the best sampled pose.

Problem: Exhaustive regular sampling is too expensive in 6 dimensions.
Can we do a smarter sampling?

30

Page 31: Vincent Lepetit - Real-time computer vision

RANSAC

RANSAC: RANdom SAmple Consensus

Line fitting: the "throwing out the worst residual" heuristic can fail (example from the original paper [Fischler81]):

outlier

final least-squares solution

Ideal line

31

Page 32: Vincent Lepetit - Real-time computer vision

RANSAC

As before, we could do a regular sampling, but it would not be optimal:

Ideal line

32

Page 33: Vincent Lepetit - Real-time computer vision

Idea:

Generate hypotheses from subsets of the measurements.
If a subset contains no gross errors, the estimated parameters (the hypothesis) are close to the true ones.

Take several subsets at random, retain the best one.

Ideal line

33

Page 34: Vincent Lepetit - Real-time computer vision

The quality of a hypothesis is evaluated by the number of measures that lie "close enough" to the predicted line.

We need to choose a threshold (T) to decide if the measure is "close enough". RANSAC returns the best hypothesis, i.e. the hypothesis p with the largest number of inliers:

$$\sum_i \begin{cases} 1 & \text{if } \mathrm{dist}(m_i, \mathrm{line}(p)) \le T \\ 0 & \text{if } \mathrm{dist}(m_i, \mathrm{line}(p)) > T \end{cases}$$

34

Page 35: Vincent Lepetit - Real-time computer vision

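RANSAC for Homographies

To apply RANSAC to homography estimation, we need a way to compute a homography from a subset of measurements:

$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Since RANSAC only provides a solution estimated from a limited number of measurements, it must be followed by a robust minimization to refine the solution.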

35

Page 36: Vincent Lepetit - Real-time computer vision

How to Get the Correspondences ?

• Extract Feature Points / Keypoints / Regions (Harris corner detector, extrema of Laplacian, affine region detectors, ...);

• standard approach: Match them based on Euclidean distances between descriptors such as SIFT, SURF, ...

m�m

36

Page 37: Vincent Lepetit - Real-time computer vision

Affine Region Detectors

Hessian-Affine detector MSER detector

37

Page 38: Vincent Lepetit - Real-time computer vision

Affine Normalization

Warp by $M_1^{1/2}$        Warp by $M_2^{1/2}$

We still have to correct for the orientation !

38

Page 39: Vincent Lepetit - Real-time computer vision

Select Canonical Orientation

• Create histogram of local gradient directions computed over the image patch;
• Each gradient contributes for its norm, weighted by its distance to patch center;
• Assign canonical orientation at peak of smoothed histogram.

0 2π

39

Page 40: Vincent Lepetit - Real-time computer vision

Select Canonical Orientation

40

Page 41: Vincent Lepetit - Real-time computer vision

Description Vector

?

...

41

Page 42: Vincent Lepetit - Real-time computer vision

SIFT Description Vector

Made of local histograms of gradients:

In practice: 8 orientations x 4 x 4 histograms = a 128-dimensional vector.
Normalised to be robust to light changes.

...

42

Page 43: Vincent Lepetit - Real-time computer vision

Matching Regions

m�

...

...

...

...

...

...

...

?

43

Page 44: Vincent Lepetit - Real-time computer vision

Matching: Approximate Nearest Neighbour

Best-Bin-First: Approximate nearest-neighbour search in k-d tree

44

Page 45: Vincent Lepetit - Real-time computer vision

Keypoint Matching

Pre-processing
Make the actual classification easier

Nearest neighbor classification

The standard approach is a particular case of classification:

Search in the Database

Idea: let’s try another classification method!

45

Page 46: Vincent Lepetit - Real-time computer vision

One Class per Keypoint

One class per keypoint: the set of the keypoint’s possible appearances under various perspective, lighting, noise...

class 1

class 2

46

Page 47: Vincent Lepetit - Real-time computer vision

Training phase

Classifier

Classifierclass 1

class 1

class 2

...

Run-Time

47

Page 48: Vincent Lepetit - Real-time computer vision

Which Classifier?

We want a classifier that:

• can handle many classes;
• is very fast;
• has reasonable recognition performance (a very high recognition rate is not a necessary requirement).

48

Page 49: Vincent Lepetit - Real-time computer vision

Which Classifier?

• Randomized Trees [Amit & Geman, 1997];
• Random forests [Breiman, 2001].

49

Page 50: Vincent Lepetit - Real-time computer vision

An (Ideal) Single Tree

binary test

binary test

binary test

class #

50

Page 51: Vincent Lepetit - Real-time computer vision

How to Build the Tree ?

binary test ?

training set

51

Page 52: Vincent Lepetit - Real-time computer vision

binary test ?

training set

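found by minimizing the entropy after the test, which splits the training set S into S_left and S_right:

$$\arg\min_{\text{test}} \; \frac{|S_{left}|}{|S|}\,\mathrm{Entropy}(S_{left}) + \frac{|S_{right}|}{|S|}\,\mathrm{Entropy}(S_{right})$$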

52

Page 53: Vincent Lepetit - Real-time computer vision

binary test

training set

S

Problem: runs quickly out of training samples for the deeper tests

53

Page 54: Vincent Lepetit - Real-time computer vision

Idea: Use Several Sub-Optimal Trees

Each tree is trained with a random subset of the training set.

54

Page 55: Vincent Lepetit - Real-time computer vision

Idea: Use Several Sub-Optimal Trees

The leaves contain the probabilities over the classes, computed from the training set.

55

Page 56: Vincent Lepetit - Real-time computer vision

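Classification with Several Sub-Optimal Trees

The test sample is dropped into each tree, and the probabilities in the leaves it reached are averaged:

[Figure: 1/3 × (distribution from tree 1 + distribution from tree 2 + distribution from tree 3) = averaged distribution]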

56

Page 57: Vincent Lepetit - Real-time computer vision

Visual Interpretation

Each tree partitions the space in a different way and computes the probability of each class for each cell of the partition:

57

Page 58: Vincent Lepetit - Real-time computer vision

Visual Interpretation

Combining the trees gives a fine partition with a better estimate of the class probabilities:

58

Page 59: Vincent Lepetit - Real-time computer vision

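For Patches

Possible tests: compare the intensities of two pixels around the keypoint after Gaussian smoothing:

$$f_i = \begin{cases} 1 & \text{if } I(m + dm_{i,1}) \le I(m + dm_{i,2}) \\ 0 & \text{otherwise} \end{cases}$$

where I is the image after Gaussian smoothing, m is the keypoint, and m + dm_{i,1} and m + dm_{i,2} are the two pixel locations compared by test i.

• Very efficient to compute;
• Invariant to light change by any increasing function.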

59

Page 60: Vincent Lepetit - Real-time computer vision

Results

60

Page 61: Vincent Lepetit - Real-time computer vision

Randomized Trees (and Random Ferns) applied to image patches are becoming a powerful tool for Computer Vision.

61

Page 62: Vincent Lepetit - Real-time computer vision

[Shotton et al, CVPR’11]

Used to infer body parts in the Kinect body tracking system.

The tests rely on the depth map.

62

Page 63: Vincent Lepetit - Real-time computer vision

Tests in [Shotton et al, CVPR’11]

Classes are the body parts. The goal is to label each pixel with the label of the part it belongs to.

Tests compare the depth of two pixels around the considered pixel.

The displacements are normalized by the depth of the considered pixel for invariance:

$$f_i(m) = \begin{cases} 1 & \text{if } \mathrm{depth}\left(m + \dfrac{dm_1}{\mathrm{depth}(m)}\right) \le \mathrm{depth}\left(m + \dfrac{dm_2}{\mathrm{depth}(m)}\right) \\ 0 & \text{otherwise} \end{cases}$$

m

63

Page 64: Vincent Lepetit - Real-time computer vision

3D Pose Estimation

Mean-Shift is used to find the joint locations from the body parts.

64

Page 65: Vincent Lepetit - Real-time computer vision

Training

“Training 3 trees to depth 20 from 1 million images takes about 1 day on a 1000 core cluster” [Shotton et al, CVPR’11]

Most of the training data is synthetic:

65

Page 66: Vincent Lepetit - Real-time computer vision

A Subtree

Average of the patches that reach this node

66

Page 67: Vincent Lepetit - Real-time computer vision

[Gall et Lempitsky, CVPR’09; Barinova et al, CVPR’10]

Hough Forest for Object Detection:
• Random Forests used to make each patch vote for the object centroid;
• The tests compare the output of filters and histograms-of-gradient between 2 pixels;
• The leaves contain the displacement toward the object center.

[Figure: each patch votes for the object centroid; votes from the 3 patches; accumulated votes from all patches; final detection]

67

Page 68: Vincent Lepetit - Real-time computer vision

Tests used in [Gall et Lempitsky, CVPR’09]

Channels: the 3 color channels, absolute values of the first and second derivatives of the image, and 9 channels from HoG (Histograms-of-Gradients).

$$f_i(m) = \begin{cases} 1 & \text{if } \mathrm{channel}_i(m + dm_1) < \mathrm{channel}_i(m + dm_2) + \tau \\ 0 & \text{otherwise} \end{cases}$$

HoG

68

Page 69: Vincent Lepetit - Real-time computer vision

Image Classification using Random Forests and Ferns [Bosch et al, ICCV’07]

Use a sliding window to detect objects.
Much faster than SVMs, with similar recognition performance.

69

Page 70: Vincent Lepetit - Real-time computer vision

[Bosch et al, ICCV’07]

Tests:

$$f_i(m) = \begin{cases} 1 & \text{if } n^\top x_m + b \le 0 \\ 0 & \text{otherwise} \end{cases}$$

n and b: random vector and scalar.
x_m: vector computed from a Pyramidal Histogram-of-Gradients.

70

Page 71: Vincent Lepetit - Real-time computer vision

[Kalal et al, CVPR’10]

TLD (aka Predator), for Track, Learn, Detect:

• Random Ferns used to speed up detection;

• Trained online: the distributions in the leaves are updated online, using the incoming images.

71

Page 72: Vincent Lepetit - Real-time computer vision

[Kalal et al, CVPR’10]

• Tests: 2bit binary patterns
• Trained online: the distributions in the leaves are updated online, using the incoming images.

72

Page 73: Vincent Lepetit - Real-time computer vision

Random Ferns: A Simplified Tree-Like Classifier

73

Page 74: Vincent Lepetit - Real-time computer vision

For Keypoint Recognition, We Can Use Random Tests!

Number of trees

Recognition rate

Comparison of the recognition rates for 200 keypoints:

tests selected by minimizing entropy

tests with random locations

74

Page 75: Vincent Lepetit - Real-time computer vision

We can use random tests

• For a small number of classes
– we can try several tests, and
– retain the best one according to some criterion.

75

Page 76: Vincent Lepetit - Real-time computer vision

We can use random tests

• For a small number of classes
– we can try several tests, and
– retain the best one according to some criterion.

• When the number of classes is large
– any test does a decent job:

76

Page 77: Vincent Lepetit - Real-time computer vision

Why it is Interesting

• Building the trees takes no time (we still have to estimate the posterior probabilities);

• Allows incremental learning;

• Simplifies the classifier structure.

77

Page 78: Vincent Lepetit - Real-time computer vision

The Tree Structure is not Needed

78

Page 79: Vincent Lepetit - Real-time computer vision

The Tree Structure is not Needed

f1

f2

f3

79

Page 80: Vincent Lepetit - Real-time computer vision

The Tree Structure is not Needed

f1

f2

f3

Results of pixel comparisons (0 or 1) Class Label

The distributions can be expressed simply, as:

80

Page 81: Vincent Lepetit - Real-time computer vision

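We are looking for

$$\arg\max_i P(C = c_i \mid \text{patch})$$

If patch can be represented by a set of image features { f_i }:

$$P(C = c_i \mid \text{patch}) = P(C = c_i \mid f_1, f_2, \ldots, f_n, f_{n+1}, \ldots, f_N)$$

which is proportional to

$$P(f_1, f_2, \ldots, f_N \mid C = c_i)$$

but a complete representation of the joint distribution is infeasible.

Naive Bayesian ignores the correlation:

$$P(f_1, f_2, \ldots, f_N \mid C = c_i) \approx \prod_j P(f_j \mid C = c_i)$$

Compromise: group the features into Ferns of n features each, and assume independence only between the groups:

$$P(f_1, f_2, \ldots, f_N \mid C = c_i) \approx \prod_k P(F_k \mid C = c_i), \qquad F_k = \{f_{(k-1)n+1}, \ldots, f_{kn}\}$$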

81

Page 82: Vincent Lepetit - Real-time computer vision

Training

82

Page 83: Vincent Lepetit - Real-time computer vision

Training

83

Page 84: Vincent Lepetit - Real-time computer vision

Training

[Figure: a training patch gives the test results 0, 1, 1 and therefore reaches leaf 6]

84

Page 85: Vincent Lepetit - Real-time computer vision

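Training

[Figure: a second training patch gives the test results 1, 0, 0 and reaches leaf 1; the first patch gave 0, 1, 1 and reached leaf 6]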

85

Page 86: Vincent Lepetit - Real-time computer vision

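Training

[Figure: a third training patch gives the test results 1, 0, 1 and reaches leaf 5; the previous patches gave 1, 0, 0 (leaf 1) and 0, 1, 1 (leaf 6)]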

86

Page 87: Vincent Lepetit - Real-time computer vision

Training

87

Page 88: Vincent Lepetit - Real-time computer vision

Training

88

Page 89: Vincent Lepetit - Real-time computer vision

Training Results

Normalize:

$$\sum_{(f_1, \ldots, f_n) = 000}^{111} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$

89

Page 90: Vincent Lepetit - Real-time computer vision

Training Results

Normalize:

$$\sum_{(f_1, \ldots, f_n) = 000}^{111} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$

90

Page 91: Vincent Lepetit - Real-time computer vision

Recognition

91

Page 92: Vincent Lepetit - Real-time computer vision

Normalization

Normalize:

$$\sum_{(f_1, \ldots, f_n) = 000}^{111} P(f_1, f_2, \ldots, f_n \mid C = c_i) = 1$$

92

Page 93: Vincent Lepetit - Real-time computer vision

Subtlety with Normalization

$$p_{\text{leaf, class}} = \frac{\text{Number of samples(leaf, class)}}{\text{Number of samples(class)}}$$

is too selective: Number of samples(leaf, class) can be 0 simply because the training set is finite.

We use:

$$p_{\text{leaf, class}} = \frac{\text{Number of samples(leaf, class)} + N_{\text{regularization}}}{\text{Number of samples(class)} + \text{Number of leaves} \times N_{\text{regularization}}}$$

This can be done by simply initializing the counters to Nregularization instead of 0.
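As an illustration of the regularized estimate, the sketch below turns the per-leaf sample counts of one class into probabilities. The function name `leaf_posteriors` and the array layout are assumptions made for this example.

```c
#include <stddef.h>

/* counts[l]: number of training samples of one class that reached leaf l.
   Writes p[l] = (counts[l] + n_reg) / (total + num_leaves * n_reg),
   where total is the number of samples of that class. */
void leaf_posteriors(const int *counts, size_t num_leaves, double n_reg,
                     double *p)
{
    double total = 0.0;
    for (size_t l = 0; l < num_leaves; l++)
        total += counts[l];
    double denom = total + (double)num_leaves * n_reg;
    for (size_t l = 0; l < num_leaves; l++) {
        if (denom > 0.0)
            p[l] = (counts[l] + n_reg) / denom;
        else
            p[l] = 1.0 / (double)num_leaves;   /* no data and no prior: uniform */
    }
}
```

With counts (3, 0, 1, 0) and a regularization term of 1, the estimates are (0.5, 0.125, 0.25, 0.125): they still sum to 1, and the leaves that received no sample keep a small non-zero probability.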

93

Page 94: Vincent Lepetit - Real-time computer vision

Influence of Nregularization

$$p_{\text{leaf, class}} = \frac{\text{Number of samples(leaf, class)} + N_{\text{regularization}}}{\text{Number of samples(class)} + \text{Number of leaves} \times N_{\text{regularization}}}$$

[Figure: recognition rate as a function of Nregularization (log scale)]

94

Page 95: Vincent Lepetit - Real-time computer vision

Implementation of Feature Point Recognition with Ferns

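    // H: number of classes, M: number of Ferns, S: number of tests per Fern.
    // K points to the keypoint pixel, D holds the 2 * S pixel offsets of each Fern,
    // PF holds the class scores stored in the leaves (strides shift1 and shift2).
    for (int i = 0; i < H; i++) P[i] = 0.;
    for (int k = 0; k < M; k++) {
        int index = 0, *d = D + k * 2 * S;
        for (int j = 0; j < S; j++) {
            index <<= 1;                        // make room for the next test result
            if (*(K + d[0]) < *(K + d[1]))
                index++;
            d += 2;
        }
        p = PF + k * shift2 + index * shift1;   // scores of the leaf reached in Fern k
        for (int i = 0; i < H; i++) P[i] += p[i];
    }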

• Very simple to implement;
• No need for orientation, perspective, light correction.
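A self-contained variant of the same loop may make the data layout easier to follow. It is a sketch under assumptions: the `Ferns` struct, its field names, and the choice to store one score per class and per leaf (for example log-probabilities, so that adding them corresponds to the product rule) are illustrative and not taken from the original code.

```c
#include <stddef.h>

/* Hypothetical container for a trained set of Ferns. */
typedef struct {
    int num_ferns;        /* M */
    int tests_per_fern;   /* S */
    int num_classes;      /* H */
    const int *offsets;   /* 2 * M * S pixel offsets relative to the keypoint */
    const float *scores;  /* M * 2^S * H values: [fern][leaf][class] */
} Ferns;

/* Accumulates the class scores of the leaf reached in each Fern and returns
   the index of the best class. class_scores must hold num_classes floats. */
int ferns_classify(const Ferns *f, const unsigned char *keypoint,
                   float *class_scores)
{
    const size_t num_leaves = (size_t)1 << f->tests_per_fern;

    for (int h = 0; h < f->num_classes; h++)
        class_scores[h] = 0.0f;

    for (int k = 0; k < f->num_ferns; k++) {
        const int *d = f->offsets + (size_t)k * 2 * f->tests_per_fern;
        size_t leaf = 0;
        for (int j = 0; j < f->tests_per_fern; j++) {
            leaf <<= 1;                              /* one bit per test */
            if (keypoint[d[0]] < keypoint[d[1]])
                leaf++;
            d += 2;
        }
        const float *p = f->scores + ((size_t)k * num_leaves + leaf) * f->num_classes;
        for (int h = 0; h < f->num_classes; h++)
            class_scores[h] += p[h];
    }

    int best = 0;
    for (int h = 1; h < f->num_classes; h++)
        if (class_scores[h] > class_scores[best])
            best = h;
    return best;
}
```

The offsets are applied directly to a pointer to the keypoint pixel, so in a real image they would be precomputed from the (dx, dy) displacements and the image row stride.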

95

Page 96: Vincent Lepetit - Real-time computer vision

Number of inliers for Ferns

Number of inliers for SIFT

each point corresponds to an image from a 1000-frame sequence

Ferns are much faster, sometimes more accurate, but SIFT does not need training.

Ferns versus SIFT

96

Page 97: Vincent Lepetit - Real-time computer vision

Randomized Trees vs Ferns

Different combination strategies: average (RT) / product (Ferns)

[Figure: recognition rate versus number of structures, for Ferns with product, RT (with random tests) with product, Ferns with average, and RT (with random tests) with average]

Ferns more discriminant but more sensitive to outliers.

97

Page 98: Vincent Lepetit - Real-time computer vision

Randomized Trees vs Ferns

Influence of the number of classes:

[Figure: recognition rate for Ferns with product and Ferns with average]

98

Page 99: Vincent Lepetit - Real-time computer vision

Memory and Computation Time

• Recognition time grows linearly with the number of Trees/Ferns and the number of classes.

• Recognition time grows linearly with the depth of Trees/Ferns, i.e. with the logarithm of the number of leaves.

• Memory grows linearly with the number of Trees/Ferns and the number of classes.

• Memory grows exponentially with the depth of Trees/Ferns.

• Increasing the depth may result in overfitting.

• Increasing the number of Trees/Ferns (usually) improves recognition.

99

Page 100: Vincent Lepetit - Real-time computer vision

Influence of the Number of Ferns

[Figure: recognition rate versus number of structures, for Ferns with product, RT (with random tests) with product, Ferns with average, and RT (with random tests) with average]

Increasing the number of Ferns/Trees improves the recognition rate, but increases the computation time and memory.

100

Page 101: Vincent Lepetit - Real-time computer vision

Number of Ferns / Number of Leaves / Memory / Computation Time

[Figure: recognition rate as a function of the Fern size and the number of Ferns; computation time as a function of the Fern size]

101

Page 102: Vincent Lepetit - Real-time computer vision

Conclusions on Randomized Trees and Ferns

• Simple to implement, Ferns even simpler;

• Both very fast, but dumb: need a lot of training examples to learn.

• Use a lot of memory to store the posterior distributions in the leaves.

102

Page 103: Vincent Lepetit - Real-time computer vision

We now have correspondences between a reference image of the object and the input image:

Some correspondences are correct, some are not.
We can estimate the homography between the 2 images by applying RANSAC on subsets of 4 correspondences.

103

Page 104: Vincent Lepetit - Real-time computer vision

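Computing a Homography from Point Correspondences by solving a linear system

$$\tilde{m}' = H\,\tilde{m}, \qquad \tilde{m} = [u, v, 1]^\top, \qquad \tilde{m}' = [ku', kv', k]^\top$$

$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$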

104

Page 105: Vincent Lepetit - Real-time computer vision

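Computing a Homography from Point Correspondences by solving a linear system

$$\tilde{m}' = H\,\tilde{m}, \qquad \tilde{m} = [u, v, 1]^\top, \qquad \tilde{m}' = [ku', kv', k]^\top$$

$$u' = \frac{H_{11}u + H_{12}v + H_{13}}{H_{31}u + H_{32}v + H_{33}}$$

$$v' = \frac{H_{21}u + H_{22}v + H_{23}}{H_{31}u + H_{32}v + H_{33}}$$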

105

Page 106: Vincent Lepetit - Real-time computer vision

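Computing a Homography from Point Correspondences by solving a linear system

$$u' = \frac{H_{11}u + H_{12}v + H_{13}}{H_{31}u + H_{32}v + H_{33}}, \qquad v' = \frac{H_{21}u + H_{22}v + H_{23}}{H_{31}u + H_{32}v + H_{33}}$$

Multiplying by the denominator and moving everything to the left-hand side gives, for each correspondence:

$$\begin{pmatrix} u & v & 1 & 0 & 0 & 0 & -uu' & -vu' & -u' \\ 0 & 0 & 0 & u & v & 1 & -uv' & -vv' & -v' \end{pmatrix} \begin{pmatrix} H_{11} \\ H_{12} \\ H_{13} \\ H_{21} \\ H_{22} \\ H_{23} \\ H_{31} \\ H_{32} \\ H_{33} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

Using four correspondences:

$$BX = 0_8$$

with $X = [H_{11}, H_{12}, H_{13}, H_{21}, H_{22}, H_{23}, H_{31}, H_{32}, H_{33}]^\top$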
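To make the structure of B concrete, the sketch below builds the two rows contributed by one correspondence, with the signs written out explicitly, and applies a homography to a point. Solving BX = 0 itself requires an eigenvalue or SVD routine and is left out. The function names `dlt_rows` and `apply_homography` are chosen for this example.

```c
/* Fills the two rows (9 coefficients each) that the correspondence
   (u, v) <-> (up, vp) contributes to the matrix B of the system B X = 0,
   with X = [H11, H12, H13, H21, H22, H23, H31, H32, H33]. */
void dlt_rows(double u, double v, double up, double vp,
              double row0[9], double row1[9])
{
    const double a[9] = { u, v, 1.0, 0.0, 0.0, 0.0, -u * up, -v * up, -up };
    const double b[9] = { 0.0, 0.0, 0.0, u, v, 1.0, -u * vp, -v * vp, -vp };
    for (int i = 0; i < 9; i++) {
        row0[i] = a[i];
        row1[i] = b[i];
    }
}

/* Applies the homography H (3x3, stored row by row) to the point (u, v).
   Returns 0 on success, -1 if the point is mapped to infinity. */
int apply_homography(const double H[9], double u, double v,
                     double *up, double *vp)
{
    double w = H[6] * u + H[7] * v + H[8];
    if (w == 0.0)
        return -1;
    *up = (H[0] * u + H[1] * v + H[2]) / w;
    *vp = (H[3] * u + H[4] * v + H[5]) / w;
    return 0;
}
```

A quick consistency check is that, for a correspondence generated with `apply_homography`, the dot product of each row with the nine entries of H is zero up to rounding.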

106

Page 107: Vincent Lepetit - Real-time computer vision

How to Solve this Linear System ?

• X is a null vector of B, i.e. an eigenvector of BᵀB with eigenvalue 0.

• In practice: the eigenvector of BᵀB corresponding to the smallest eigenvalue.

$$BX = 0_8$$

with $X = [H_{11}, H_{12}, H_{13}, H_{21}, H_{22}, H_{23}, H_{31}, H_{32}, H_{33}]^\top$

107

Page 108: Vincent Lepetit - Real-time computer vision

Computing a Homography from Point Correspondences with a non-linear optimization

• Non-linear least-squares minimization: Minimization of a physical, meaningful error (reprojection error, in pixels)

$$\min_{R,T} \sum_i \mathrm{dist}^2\left(H_{R,T}\, m_i,\; m'_i\right)$$

• Minimization algorithms: Gauss-Newton or Levenberg-Marquardt (very efficient).

108

Page 109: Vincent Lepetit - Real-time computer vision

Numerical Optimization

p0

Start from an initial guess p0:

p0 can be taken randomly but should be as close as possible to the global minimum:

- pose computed at time t-1;
- pose predicted from pose computed at time t-1 and a motion model;
- ...

p1p2

109

Page 110: Vincent Lepetit - Real-time computer vision

Numerical Optimization

General methods:
• Gradient descent / Steepest Descent;
• Conjugate Gradient;
• ...

Non-linear Least-squares optimization:
• Gauss-Newton;
• Levenberg-Marquardt;
• ...

110

Page 111: Vincent Lepetit - Real-time computer vision

Numerical Optimization

We want to find p that minimizes:

$$E(p) = \sum_i \mathrm{dist}^2\left(H_{R(p),T(p)}\, m_i,\; m'_i\right) = \|f(p) - b\|^2$$

where
• p is a vector of parameters that define the camera pose (translation vector + parameters of the rotation matrix);
• b is a vector made of the measurements (here the m'_i);
• f is the function that relates the camera pose to these measurements.

$$f(p) = \begin{pmatrix} u(H_{R(p),T(p)}\, m_1) \\ v(H_{R(p),T(p)}\, m_1) \\ \vdots \end{pmatrix}, \qquad b = \begin{pmatrix} u(m'_1) \\ v(m'_1) \\ \vdots \end{pmatrix}$$

111

Page 112: Vincent Lepetit - Real-time computer vision

Gradient descent / Steepest Descent

$$p_{i+1} = p_i - \lambda \nabla E(p_i)$$

$$E(p_i) = \|f(p_i) - b\|^2 = (f(p_i) - b)^\top (f(p_i) - b) \;\rightarrow\; \nabla E(p_i) = 2 J^\top (f(p_i) - b)$$

with J the Jacobian matrix of f, computed at p_i.

Weaknesses:
- How to choose λ?
- Needs a lot of iterations in long and narrow valleys.

112

Page 113: Vincent Lepetit - Real-time computer vision

The Gauss-Newton and the Levenberg-Marquardt algorithms

But first, the Linear Least-Squares Case:

If the function f is linear, i.e. f(p) = Ap, the minimizer of

$$E(p) = \|f(p) - b\|^2$$

can be estimated as:

$$p = A^+ b$$

where $A^+$ is the pseudo-inverse of A: $A^+ = (A^\top A)^{-1} A^\top$

113

Page 114: Vincent Lepetit - Real-time computer vision

Non-Linear Least-Squares: The Gauss-Newton algorithm

Iteration steps:

$$p_{i+1} = p_i + \Delta_i$$

∆i is chosen to minimize the residual ||f(p_{i+1}) − b||². It is computed by approximating f to the first order, f(p_i + ∆) ≈ f(p_i) + J∆:

$$\Delta_i = \arg\min_\Delta \|f(p_i + \Delta) - b\|^2 = \arg\min_\Delta \|f(p_i) + J\Delta - b\|^2 = \arg\min_\Delta \|\varepsilon_i + J\Delta\|^2$$

where $\varepsilon_i = f(p_i) - b$ denotes the residual at iteration i.

∆i is the solution of the system $J\Delta = -\varepsilon_i$ in the least-squares sense:

$$\Delta_i = -J^+ \varepsilon_i$$

where $J^+$ is the pseudo-inverse of J.

114

Page 115: Vincent Lepetit - Real-time computer vision

Non-Linear Least-Squares: The Levenberg-Marquardt Algorithm

In the Gauss-Newton algorithm:

$$\Delta_i = -(J^\top J)^{-1} J^\top \varepsilon_i$$

In the Levenberg-Marquardt algorithm:

$$\Delta_i = -(J^\top J + \lambda I)^{-1} J^\top \varepsilon_i$$

Levenberg-Marquardt Algorithm:

0. Initialize λ with a small value: λ = 0.001

1. Compute ∆i and E(pi + ∆i)

2. If E(pi + ∆i) > E(pi): λ ← 10 λ and go back to 1 [happens when the linear approximation of f is too coarse]

3. If E(pi + ∆i) < E(pi): λ ← λ / 10, pi+1 ← pi + ∆i and go back to 1.

Once converged, set λ ← 0 and continue up to convergence.

115

Page 116: Vincent Lepetit - Real-time computer vision

Non-Linear Least-Squares: the Levenberg-Marquardt Algorithm

$$\Delta_i = -(J^\top J + \lambda I)^{-1} J^\top \varepsilon_i$$

• When λ is small, LM behaves similarly to the Gauss-Newton algorithm.
• When λ becomes large, LM behaves similarly to a steepest descent to guarantee convergence.

116

Page 117: Vincent Lepetit - Real-time computer vision

Another Way to Refine the Pose:Template Matching

117

Page 118: Vincent Lepetit - Real-time computer vision

Global region tracking by minimizing cross-correlation:
• Useful for objects difficult to model using local features;
• Accurate.

Template T

Input Image Ip

118

Page 119: Vincent Lepetit - Real-time computer vision

Lucas-Kanade Algorithm

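$$\min_p \sum_j \left(W(I, p)[m_j] - T[m_j]\right)^2$$

Gauss-Newton step:

$$\Delta_i = J_p^+ \cdot \varepsilon_{p,I}$$

where $J_p^+$ is the pseudo-inverse of the Jacobian of W(I, p) evaluated at p and the m_j, and

$$\varepsilon_{p,I} = (\ldots,\; T[m_j] - W(I, p)[m_j],\; \ldots)^\top$$

[Figure: template T with sample locations m_j; input image I warped with parameters p]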

119

Page 120: Vincent Lepetit - Real-time computer vision

Template T

Lucas-Kanade Algorithm

Computing J and J+ is computationally expensive.

p0

p

120

Page 121: Vincent Lepetit - Real-time computer vision

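Inverse Compositional Algorithm [Baker et al. IJCV03]

[Figure: template T and input image I_t, with the previous parameters p_{i-1} and the update dp_i]

$$p_i = p_{i-1} + dp_i$$

$$dp_i = J_{p=0}^+\, \varepsilon_{p=0,I}$$

$J_{p=0}$ is a constant matrix and can therefore be precomputed!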

121

Page 122: Vincent Lepetit - Real-time computer vision

ESM (Efficient Second-order Method)

(1) $I = T + J_{p=0}\,dp + dp^\top H_{p=0}\,dp$ [second-order Taylor expansion]

(2) $J_{p=dp} = J_{p=0} + 2\,dp^\top H_{p=0}$ [differentiation of (1) with respect to p]

(3) $dp^\top H_{p=0} = \tfrac{1}{2}(J_{p=dp} - J_{p=0})$ [from Equation (2)]

(4) $I = T + J_{p=0}\,dp + \tfrac{1}{2}(J_{p=dp} - J_{p=0})\,dp$ [by injecting (3) in (1)]

(5) $dp = \left[\tfrac{1}{2}(J_{p=0} + J_{p=dp})\right]^+ (I - T)$ [from Equation (4)]

Like Gauss-Newton, but $J_{p=0}$ is replaced by $\tfrac{1}{2}(J_{p=0} + J_{p=dp})$. $J_{p=dp}$ and a pseudo-inverse must be computed at each iteration, but far fewer iterations are needed.

122

Page 123: Vincent Lepetit - Real-time computer vision

BRIEF [ECCV’10]

very fast feature point descriptor

123

Page 124: Vincent Lepetit - Real-time computer vision

Remark

• Moving legacy code to new CPUs does not result in a speed-up anymore;

• Should consider the features of new platforms: parallelism (multi-cores, GPU), locality, ...

124

Page 125: Vincent Lepetit - Real-time computer vision

[Figure: an image patch after Gaussian smoothing; pairwise intensity comparisons give the bits 1, 1, 0, ..., 0, 1 of the BRIEF descriptor]

125

Page 126: Vincent Lepetit - Real-time computer vision

[Figure: an image patch after Gaussian smoothing; pairwise intensity comparisons give the bits 1, 1, 0, ..., 0, 1 of the BRIEF descriptor]

Alternatively, using integral images:

126

Page 127: Vincent Lepetit - Real-time computer vision

Integral Images

$$\text{Integral Image}(u, v) = \sum_{i=1..u} \; \sum_{j=1..v} \text{Image}(i, j)$$

127

Page 128: Vincent Lepetit - Real-time computer vision

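How to Use Integral Images

[Figure: the sum of the pixels inside a rectangle is obtained from four values of the integral image, as one value minus two values plus one value]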

128

Page 129: Vincent Lepetit - Real-time computer vision

[Viola & Jones, IJCV’01]

Features computed in constant time


Computing Integral Images

IntegralImage[u][v] = IntegralImage[u][v-1] + LineBuffer[u] + Image[u][v]
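A single-pass sketch of this recurrence, under one reading of it: `LineBuffer[u]` is taken to be the sum of the pixels of the current line to the left of column u, and is kept here as a running scalar `line_sum`. The row-major indexing `image[v][u]` is also a choice made for this sketch.

```python
def integral_image(image):
    """Compute the integral image in one pass over the pixels.

    image[v][u] is the pixel at row v, column u; the result has the same
    size, and entry [v][u] sums all pixels in rows <= v and columns <= u.
    """
    h, w = len(image), len(image[0])
    ii = [[0] * w for _ in range(h)]
    for v in range(h):
        line_sum = 0  # sum of row v to the left of column u
        for u in range(w):
            above = ii[v - 1][u] if v > 0 else 0
            ii[v][u] = above + line_sum + image[v][u]
            line_sum += image[v][u]
    return ii

# Usage on a 2 x 2 image.
table = integral_image([[1, 2],
                        [3, 4]])
```

Each pixel is read once and each table entry needs two additions, so the table is built in time linear in the number of pixels.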


Evaluation


Computation Speed

For BRIEF, most of the time is spent in Gaussian smoothing.


Matching Speed

distance(BRIEF descriptor 1, BRIEF descriptor 2)
= Hamming distance(BRIEF descriptor 1, BRIEF descriptor 2)
= number of bits set to 1 in (BRIEF descriptor 1 xor BRIEF descriptor 2)
= popcount(BRIEF descriptor 1 xor BRIEF descriptor 2)

Computing the popcount in hardware gives a 10- to 15-fold speed increase on Intel's Bloomfield (SSE 4.2) and AMD's Phenom (SSE 4a).
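The chain of equalities above maps directly to code. In this sketch descriptors are stored as Python integers and the set bits are counted in software; the brute-force matcher `nearest_descriptor` is a hypothetical helper added for illustration.

```python
def hamming_distance(d1, d2):
    """Number of bit positions in which two binary descriptors differ."""
    return bin(d1 ^ d2).count("1")  # popcount of the xor

def nearest_descriptor(query, database):
    """Index of the database descriptor closest to `query` (brute force)."""
    return min(range(len(database)),
               key=lambda k: hamming_distance(query, database[k]))

# Usage: 0b1011 and 0b0010 differ in two bit positions.
distance = hamming_distance(0b1011, 0b0010)
best = nearest_descriptor(0b1100, [0b0011, 0b1101, 0b0000])
```

A compiled implementation would typically replace the string-based count with a hardware popcount applied to machine words.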


Picking the Locations

• uniform distribution;
• Gaussian distribution;
• Gaussian distribution for location and length;
• uniform distribution on polar coordinates;
• census transform locations.
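Two of these strategies can be sketched as samplers of test-location pairs, with coordinates expressed relative to the patch center. The spread `sigma_ratio`, the clamping to the patch, and the function names are assumptions made for this sketch rather than values taken from the slides.

```python
import random

def uniform_pairs(patch_size, n_pairs, rng):
    """Both points of each pair are drawn uniformly over the patch."""
    half = patch_size // 2

    def point():
        return rng.randint(-half, half), rng.randint(-half, half)

    return [(point(), point()) for _ in range(n_pairs)]

def gaussian_pairs(patch_size, n_pairs, rng, sigma_ratio=0.2):
    """Both points are drawn from an isotropic Gaussian around the center."""
    half = patch_size // 2
    sigma = sigma_ratio * patch_size  # hypothetical spread

    def coord():
        # Round to a pixel and clamp so the point stays inside the patch.
        return max(-half, min(half, int(round(rng.gauss(0.0, sigma)))))

    def point():
        return coord(), coord()

    return [(point(), point()) for _ in range(n_pairs)]

# Usage: 256 test pairs for a 31 x 31 patch, reproducible through the seed.
rng = random.Random(0)
pairs_uniform = uniform_pairs(31, 256, rng)
pairs_gaussian = gaussian_pairs(31, 256, rng)
```

Whatever the strategy, the locations are drawn once and then fixed, so every descriptor uses the same set of tests.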



Rotation and Scale Invariance

Duplicate the descriptors: 18 rotations x 3 scales


Code released under the GPL on the CVLab website.


DOT [CVPR’10]

Dense descriptor for object detection

Joint work with Stefan Hinterstoisser (TU Munich)


Template matching with an efficient representation of the images and the templates.

Object detection with a sliding window and template matching.


Initial Similarity Measure


Making the Similarity Measure Robust to Small Motions


Downsampling


Ignoring the Dependencies between the Regions...


Lists of Dominant Orientations


Fast Computation with Bitwise Operations

0000110000010000
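One way such a bitwise comparison can work, as a sketch: assume each region is summarized by a set of quantized orientations stored as bits of an integer, so that testing whether an image region and the corresponding template region share an orientation is a single AND. The encoding and the names `to_mask` and `count_matching_regions` are illustrative assumptions, not the released DOT code.

```python
def to_mask(orientations):
    """Encode a collection of quantized orientation indices as a bitmask."""
    mask = 0
    for o in orientations:
        mask |= 1 << o
    return mask

def count_matching_regions(image_masks, template_masks):
    """Count the regions whose image and template masks share a set bit."""
    return sum(1 for a, b in zip(image_masks, template_masks) if a & b)

# Usage: three regions; the first and the last share an orientation.
image_masks = [to_mask([2]), to_mask([0]), to_mask([3])]
template_masks = [to_mask([2, 3]), to_mask([1, 2]), to_mask([3])]
score = count_matching_regions(image_masks, template_masks)
```

With masks packed into machine words, several regions can be tested by one wide AND instruction.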


Code available under the LGPL license at http://campar.in.tum.de/personal/hinterst/index/


New Method, LINE [PAMI, under revision]


Initial Similarity Measure

Previous measure:

$E_{\text{Steger}}(I, O, c) = \sum_{r} \left| \cos\big(\operatorname{orientation}(O, r) - \operatorname{orientation}(I, c + r)\big) \right|$


Making the Similarity Measure Robust to Small Motions

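$E_{\text{Steger}}(I, O, c) = \sum_{r} \left| \cos\big(\operatorname{orientation}(O, r) - \operatorname{orientation}(I, c + r)\big) \right|$

$E(I, O, c) = \sum_{r} \max_{t \in \text{region}(c + r)} \left| \cos\big(\operatorname{orientation}(O, r) - \operatorname{orientation}(I, t)\big) \right|$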
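Both measures can be written directly from their definitions. In this sketch the template is a list of `((dy, dx), orientation)` entries, orientations are angles in radians, and `region(c + r)` is taken to be a square window of half-width `radius`; these representations and the function names are assumptions made for illustration.

```python
import math

def steger_similarity(image_ori, template, c):
    """Sum over template locations r of |cos(ori(O, r) - ori(I, c + r))|."""
    cy, cx = c
    return sum(abs(math.cos(o - image_ori[cy + dy][cx + dx]))
               for (dy, dx), o in template)

def robust_similarity(image_ori, template, c, radius):
    """Same sum, but each term takes the best response in a window."""
    h, w = len(image_ori), len(image_ori[0])
    cy, cx = c
    total = 0.0
    for (dy, dx), o in template:
        y, x = cy + dy, cx + dx
        best = 0.0
        for ty in range(max(0, y - radius), min(h, y + radius + 1)):
            for tx in range(max(0, x - radius), min(w, x + radius + 1)):
                best = max(best, abs(math.cos(o - image_ori[ty][tx])))
        total += best
    return total

# Usage: an orientation map and the same map shifted right by one pixel.
template = [((0, 0), 0.0), ((1, 1), 0.5)]
aligned = [[0.0, 1.0, 1.0], [1.0, 0.5, 1.0], [1.0, 1.0, 1.0]]
shifted = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.5], [1.0, 1.0, 1.0]]
strict = steger_similarity(shifted, template, (0, 0))
tolerant = robust_similarity(shifted, template, (0, 0), radius=1)
```

With `radius=0` the two functions agree; a larger window keeps the score high when the object moves by a pixel or so, at the price of a maximum evaluated for every term, which is the cost the next slides set out to avoid.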


Avoiding Recomputing the max Operator

1. Spread the gradients


2. Precompute response maps

Because
• we consider only a discrete set of gradient directions;
• we do not consider the gradient norms,
we can precompute a response for each region in the image and each gradient direction in the template.


Optimized Version

1. The sets of orientations in the image regions are encoded with a binary representation, for example:

11010


2. The binary representation is used as an index into lookup tables holding the precomputed responses for each gradient direction in the template.
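The two steps can be sketched together: spreading ORs the one-hot orientation code of each pixel over a neighborhood, and one lookup table per template direction maps every possible bit pattern to the best |cos| response. The number of bins (`N_BINS = 5`, chosen to match the 5-bit example), the bin angles, and all function names are assumptions made for this sketch.

```python
import math

N_BINS = 5  # hypothetical number of quantized gradient directions

def spread(quantized, radius):
    """OR the one-hot code of every pixel over a square neighborhood.

    quantized[y][x] is a bin index, or None where there is no gradient.
    """
    h, w = len(quantized), len(quantized[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            q = quantized[y][x]
            if q is None:
                continue
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    out[yy][xx] |= 1 << q
    return out

def build_lookup_tables():
    """tables[i][mask]: best |cos| between direction i and the bits of mask."""
    tables = []
    for i in range(N_BINS):
        table = []
        for mask in range(1 << N_BINS):
            best = 0.0
            for o in range(N_BINS):
                if (mask >> o) & 1:
                    angle = (i - o) * math.pi / N_BINS
                    best = max(best, abs(math.cos(angle)))
            table.append(best)
        tables.append(table)
    return tables

def response_map(spread_image, table):
    """Replace every bit pattern by its precomputed response."""
    return [[table[m] for m in row] for row in spread_image]

# Usage: one gradient of direction 3 in the middle of a 3 x 3 image.
quantized = [[None, None, None], [None, 3, None], [None, None, None]]
tables = build_lookup_tables()
responses = response_map(spread(quantized, 1), tables[3])
```

After this precomputation, evaluating one term of the similarity measure is a single read from a response map; no maximum is computed at matching time.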


Avoiding Cache Misses

The response maps are rearranged into linear memories.


Using the Linear Memories

The similarity measure can be computed for all the image locations by summing linear memories, shifted by an offset that depends on the template.
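A simplified sketch of the idea: each response map is flattened in row-major order, and a template feature at offset (dy, dx) becomes a constant shift of `dy * width + dx` in that flat array. The single row-major flattening, the restriction to non-negative offsets, and the function names are assumptions of this sketch; the memory layout on the slides may differ.

```python
def flatten(response_map):
    """Row-major flattening of a 2D response map into a linear memory."""
    return [v for row in response_map for v in row]

def similarity_all_locations(linear_maps, width, template):
    """Accumulate the similarity for every location at once.

    linear_maps[o] is the flattened response map for template direction o;
    template is a list of (dy, dx, o) features with dy, dx >= 0.
    Locations near the right and bottom borders receive incomplete or
    wrapped-around sums and should be ignored by the caller.
    """
    size = len(linear_maps[0])
    total = [0.0] * size
    for dy, dx, o in template:
        offset = dy * width + dx  # the shift depends only on the template
        memory = linear_maps[o]
        for idx in range(size - offset):
            total[idx] += memory[idx + offset]
    return total

# Usage: two directions on a 2 x 3 image, template with two features.
maps = [flatten([[1, 2, 3], [4, 5, 6]]),
        flatten([[10, 20, 30], [40, 50, 60]])]
scores = similarity_all_locations(maps, 3, [(0, 0, 0), (1, 1, 1)])
```

The inner loop reads two arrays sequentially, which is the access pattern that caches and SIMD units handle best.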


Advantage of Linearizing the Memory

[Figure: speed-up factor]


[Figure panels: DOT [CVPR’10]; LINE]


LINE-MOD [Hinterstoisser et al., ICCV’11]

Extension to the Kinect: the templates combine the image and the depth map.


thanks!
