ROBOT VISION Lesson 10: Object Tracking and Visual Servoing Matthias Rüther

Robot Vision SS 2005, Matthias Rüther



Contents

Object Tracking
– Appearance based tracking
  • Kalman filtering
  • Condensation algorithm
– Model based tracking
  • Model fitting and tracking

Visual Servoing
– Principle
– Servoing Types


Tracking



Definition of Tracking

Tracking:

– Generate some conclusions about the motion of the scene, objects, or the camera, given a sequence of images.

– Knowing this motion, predict where things are going to project in the next image, so that we don’t have so much work looking for them.


Why Track?


Tracking a Silhouette by Measuring Edge Positions

Observations are positions of edges along normals to tracked contour


Why not Wait and Process the Set of Images as a Batch?

E.g. in a car system, detecting and tracking pedestrians in real time is important.

Recursive methods require less computing


Implicit Assumptions of Tracking

Physical cameras do not move instantly from a viewpoint to another.

Objects do not teleport between places around the scene.

Relative position between camera and scene changes incrementally.

We can model motion


Related Fields

Signal Detection and Estimation

Radar technology


The Problem: Signal Estimation

We have a system with parameters
– Scene structure, camera motion, automatic zoom
– System state is unknown (“hidden”)

We have measurements
– Components of stable “feature points” in the images.
– “Observations”, projections of the state.

We want to recover the state components from the observations


Necessary Models


A Simple Example of Estimation by Least Square Method


Recursive Least Square Estimation

We don’t want to wait until all data have been collected to get an estimate of the depth.

We don’t want to reprocess old data when we make a new measurement.

Recursive method: data at step i are obtained from data at step i-1
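A minimal sketch of the recursive idea, assuming the simplest possible case: a constant scalar (e.g. the depth) measured repeatedly with equal noise, so that the least-squares estimate is the mean. The function name and the depth readings are hypothetical. Each new measurement corrects the previous estimate without touching old data.

```python
def recursive_mean(measurements):
    """Yield the running least-squares estimate after each new measurement."""
    estimate = 0.0
    for i, z in enumerate(measurements, start=1):
        # estimate_i = estimate_{i-1} + (1/i) * (z_i - estimate_{i-1})
        estimate += (z - estimate) / i
        yield estimate


# Hypothetical depth readings (metres); the values are made up for illustration.
depths = [2.1, 1.9, 2.0, 2.2, 1.8]
estimates = list(recursive_mean(depths))
batch_estimate = sum(depths) / len(depths)  # batch least-squares answer
print(estimates[-1], batch_estimate)
```

The last recursive estimate agrees with the batch estimate up to rounding, although only the previous estimate and the step counter were stored.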


Recursive Least Square Estimation 2


Recursive Least Square Estimation 3


Least Square Estimation of the State Vector of a Static System


Least Square Estimation of the State Vector of a Static System 2


Dynamic System


Recursive Least Square Estimation for a Dynamic System (Kalman Filter)


Estimation when System Model is Nonlinear (Extended Kalman Filter)


Tracking Steps


Recursive Least Square Estimation for a Dynamic System (Kalman Filter)


Tracking as a Probabilistic Inference Problem

Find distributions for state vector a_i and for measurement vector x_i. Then we are able to compute the expectations â_i and x̂_i.

• Simplifying assumptions (same as for HMM)


Tracking as Inference


Model based tracking


IDEA: if motion is caused by a known 3-D object, we can track 3-D motion parameters, not just individual features!

ADVANTAGES:
- low dimensionality (3 rotations, 3 translations, independent of number of features tracked)
- mutually constrained motion instead of independently moving points

LIMITATIONS:
- 6 params only with rigid objects! Not articulated, not deformable.
- assumes 3-D model known a priori

MODEL-BASED 3-D TRACKING


[Wunsch,Hirzinger IEEE RA 1997]

SKETCH OF ALGORITHM:

0. Initialize 3-D pose R_0, t_0 (rotation, translation)

1. Extract features from image I_t

2. Match image features with features of the 3-D model positioned at R_{t-1}, t_{t-1}

3. Evaluate global error metric in 3-D space (notice, not in image space)

4. Estimate R_t, t_t aligning image and model features

5. Next frame and go to 1.

Example Algorithm


FEATURES: for instance, using image edges with orientation θ and offset d (and camera scale factors s_x, s_y), then

\[ n = (s_x \cos\theta,\; s_y \sin\theta,\; -d)^T \]

is the normal of the 3-D plane through the image edge.

[Figure labels: corresponding model edge; 3-D plane through image edge]

ERROR METRIC: in 3-D space for efficiency (no back-projection): orthogonality of n and the model edge, with p and q two points on the model edge:

\[ E = [\,n^T (R p + t)\,]^2 + [\,n^T (R q + t)\,]^2 \]

Some Details


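MINIMISATION: using, say, 3 types of features:

\[ \min_{R,t} \Big\{ \sum_j w_j E_j^{f_1} + \sum_j w_j E_j^{f_2} + \sum_j w_j E_j^{f_3} \Big\} \]

Trick 1: Approximating R with differential rotations:

\[ R x \approx x + \omega \times x = x + [\omega]_\times x, \qquad [\omega]_\times = \begin{pmatrix} 0 & -dz & dy \\ dz & 0 & -dx \\ -dy & dx & 0 \end{pmatrix} \]

All E terms can be linearized, a linear system obtained from the quadratic minimization, and a solution computed in closed form: e.g., for edges,

\[ \sum_k \begin{pmatrix} n_k n_k^T & n_k v_k^T \\ v_k n_k^T & v_k v_k^T \end{pmatrix} \begin{pmatrix} t \\ \omega \end{pmatrix} = -\sum_k \begin{pmatrix} (n_k^T p_k)\, n_k \\ (n_k^T p_k)\, v_k \end{pmatrix} \]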

Some Details


... where v_k = p_k × n_k, i.e. v_k^T = −n_k^T [p_k]_×.

The resulting linear system A [t ω]^T = b, with the translation t and the differential rotation ω stacked as unknowns, is (trick 2) applied iteratively at each time instant to reduce errors; a few iterations should suffice for small frame-to-frame displacements.

NOTICE ASSUMPTIONS MADE:
- rigid object
- model known a priori
- small frame-to-frame displacements
- img-model feature correspondences known (if small displacements, by min distance)
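A minimal sketch of the iterated linearized pose update, under stated assumptions: synthetic data, known correspondences, and only point-to-plane constraints n_k^T (R p_k + t) = 0. It is not the implementation of Wunsch and Hirzinger; all function names are hypothetical, and the linear system is solved with a generic least-squares routine, which gives the same minimizer as the normal equations.

```python
import numpy as np


def skew(w):
    """Return [w]_x, the matrix with skew(w) @ x == np.cross(w, x)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])


def rotation_from_vector(w):
    """Rodrigues' formula: rotation by angle |w| about the axis w/|w|."""
    theta = np.linalg.norm(w)
    K = skew(w)
    if theta < 1e-12:
        return np.eye(3) + K
    return (np.eye(3) + np.sin(theta) / theta * K
            + (1.0 - np.cos(theta)) / theta**2 * (K @ K))


def estimate_pose(points, normals, iterations=8):
    """Find R, t so that normals[k] . (R @ points[k] + t) is close to 0."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iterations):
        x = points @ R.T + t                 # model points at the current pose
        v = np.cross(x, normals)             # v_k = x_k x n_k
        A = np.hstack([normals, v])          # one row [n_k^T, v_k^T] per point
        b = -np.sum(normals * x, axis=1)     # right-hand side -n_k^T x_k
        delta, *_ = np.linalg.lstsq(A, b, rcond=None)
        dR = rotation_from_vector(delta[3:])
        R, t = dR @ R, dR @ t + delta[:3]    # compose increment with the pose
    return R, t


# Synthetic test scene (all values are made up for illustration).
rng = np.random.default_rng(0)
points = rng.uniform([-1, -1, 4], [1, 1, 6], size=(30, 3))
R_true = rotation_from_vector(np.array([0.01, -0.015, 0.005]))
t_true = np.array([0.02, -0.01, 0.02])
moved = points @ R_true.T + t_true
# Each plane passes through the camera centre and the moved model point.
normals = np.cross(moved, rng.normal(size=(30, 3)))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
R_est, t_est = estimate_pose(points, normals)
print(np.abs(R_est - R_true).max(), np.abs(t_est - t_true).max())
```

The sketch runs a fixed number of iterations for simplicity; a practical tracker would more likely stop once the increment becomes negligible.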

Some Details


Problems with Tracking

Initial detection
– If it is too slow we will never catch up
– If it is fast, why not do detection at every frame?

Even if raw detection can be done in real time, tracking saves processing cycles compared to raw detection.

The CPU has other things to do.

Detection is needed again if you lose tracking

Most vision tracking prototypes use initial detection done by hand


Visual Servoing

Vision System operates in a closed control loop.

Better accuracy than “Look and Move” systems

Figures from S. Hutchinson: A Tutorial on Visual Servo Control


Visual Servoing

Example: Maintaining relative Object Position

Figures from P. Wunsch and G. Hirzinger. Real-Time Visual Tracking of 3-D Objects with Dynamic Handling of Occlusion


Visual Servoing

Camera Configurations:

– End-effector mounted
– Fixed



Visual Servoing

Servoing Architectures



Visual Servoing

Position-based and image-based control

– Position based:
  • Alignment in target coordinate system
  • The 3D structure of the target is reconstructed
  • The end-effector is tracked
  • Sensitive to calibration errors
  • Sensitive to reconstruction errors

– Image based:
  • Alignment in image coordinates
  • No explicit reconstruction necessary
  • Insensitive to calibration errors
  • Only special problems solvable
  • Depends on initial pose
  • Depends on selected features

[Figure labels: target, end-effector, image of target, image of end-effector]


Visual Servoing

EOL and ECL control

– EOL: endpoint open-loop; only the target is observed by the camera

– ECL: endpoint closed-loop; target as well as end-effector are observed by the camera



Visual Servoing

Position Based Algorithm:

1. Estimation of relative pose

2. Computation of error between current pose and target pose

3. Movement of robot

Example: point alignment

[Figure labels: p1, p2]


Visual Servoing

Position based point alignment

Goal: bring e to 0 by moving p1

e = |p2m – p1m|

u = k*(p2m – p1m)

pxm is subject to the following measurement errors: sensor position, sensor calibration, sensor measurement error

pxm is independent of the following errors: end effector position, target position
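A minimal sketch of the control law u = k*(p2m – p1m) as a simulated loop. It assumes noise-free measurements and an idealized robot that moves p1 by exactly u in every step; the function name, the gain and the point coordinates are hypothetical.

```python
import numpy as np


def servo_point(p1, p2, k=0.5, tol=1e-6, max_steps=100):
    """Move p1 towards p2 with a proportional law until the error is below tol."""
    p1 = np.asarray(p1, dtype=float).copy()
    p2 = np.asarray(p2, dtype=float)
    for step in range(max_steps):
        diff = p2 - p1                  # stands in for the measured p2m - p1m
        e = np.linalg.norm(diff)        # error e = |p2m - p1m|
        if e < tol:
            return p1, step
        u = k * diff                    # control input
        p1 = p1 + u                     # idealized robot: executes u exactly
    return p1, max_steps


final, steps = servo_point([0.0, 0.0, 0.0], [1.0, 2.0, 0.5])
print(final, steps)
```

With a gain 0 < k < 1 the error shrinks by the factor (1 - k) in every step, so the loop converges geometrically in this idealized setting.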

[Figure labels: p1m, p2m, d]


Visual Servoing

Image based point alignment

Goal: bring e to 0 by moving p1

e = |u1m – v1m| + |u2m – v2m|

uxm, vxm are subject only to sensor measurement error

uxm, vxm are independent of the following measurement errors: sensor position, end effector position, sensor calibration, target position

[Figure labels: p1, p2; c1, c2; u1, u2; v1, v2; d1, d2]


Visual Servoing

Example Laparoscopy

Figures from A. Krupa: Autonomous 3-D Positioning of Surgical Instruments in Robotized Laparoscopic Surgery Using Visual Servoing




Tracking using CONDENSATION

CONditional DENSity PropagATION

M. Isard and A. Blake, CONDENSATION – Conditional density propagation for visual tracking, Int. J. Computer Vision 29(1), 1998, pp. 5-28.


Goal

Model-based visual tracking in dense clutter at near video frame rates


Example


Approach

Probabilistic framework for tracking objects such as curves in clutter using an iterative sampling algorithm.

Model motion and shape of target

Top-down approach

Simulation instead of analytic solution


Probabilistic Framework

Object dynamics form a temporal Markov chain

Observations z_t are independent (mutually and with respect to the process)

Use Bayes’ rule


Notation

X State vector, e.g., curve’s position and orientation

Z Measurement vector, e.g., image edge locations

p(X) Prior probability of state vector; summarizes prior domain knowledge, e.g., by independent measurements

p(Z) Probability of measuring Z; fixed for any given image

p(Z | X) Probability of measuring Z given that the state is X; compares image to expectation based on state

p(X | Z) Probability of X given that measurement Z has occurred; called state posterior


Tracking as Estimation

Compute state posterior, p(X|Z), and select next state to be the one that maximizes this (Maximum a Posteriori (MAP) estimate)

Measurements are complex and noisy, so posterior cannot be evaluated in closed form

Particle filter (iterative sampling) idea:
– Stochastically approximate the state posterior with a set of N weighted particles (s, π), where s is a sample state and π is its weight

Use Bayes’ rule to compute p(X|Z)


Factored Sampling

Generate a set of samples that approximates the posterior p(X|Z)

Sample set s={s(1), …, s(N)} generated from p(X); each sample has a weight (“probability”)
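A minimal sketch of factored sampling for a one-dimensional state, assuming a Gaussian prior p(X) and a Gaussian likelihood p(Z | X). Both densities, the measurement value and the function name are hypothetical choices for illustration. Samples are drawn from the prior and weighted by the likelihood; the weighted set then approximates the posterior.

```python
import numpy as np


def factored_sampling(prior_sampler, likelihood, n, rng):
    """Draw n samples from the prior and weight each by its likelihood."""
    samples = prior_sampler(n, rng)
    weights = likelihood(samples)
    weights = weights / weights.sum()    # normalize so the weights sum to 1
    return samples, weights


rng = np.random.default_rng(1)
z = 1.0                                  # hypothetical measurement
samples, weights = factored_sampling(
    prior_sampler=lambda n, rng: rng.normal(0.0, 2.0, size=n),   # p(X)
    likelihood=lambda x: np.exp(-0.5 * ((z - x) / 0.5) ** 2),    # p(Z | X)
    n=20000,
    rng=rng,
)
posterior_mean = np.sum(weights * samples)
print(posterior_mean)
```

For this Gaussian case the exact posterior mean is z * 4 / (4 + 0.25), so the weighted sample mean can be checked against a closed-form value.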


Factored Sampling

N=15

• CONDENSATION for one image


Estimating Target State

State samples Mean of weighted state samples


Bayes’ Rule

p(X | Z) = p(Z | X) p(X) / p(Z)

– p(Z | X): this is what you can evaluate
– p(X): this is what you may know a priori, or what you can predict
– p(X | Z): this is what you want. Knowing p(X | Z) will tell us what is the most likely state X.
– p(Z): this is a constant for a given image


CONDENSATION Algorithm

1. Select: Randomly select N particles from {s_{t-1}^(n)} based on weights π_{t-1}^(n); the same particle may be picked multiple times (factored sampling)

2. Predict: Move particles according to deterministic dynamics (drift), then perturb individually (diffuse)

3. Measure: Get a likelihood for each new sample by comparing it with the image’s local appearance, i.e., based on p(z_t | x_t); then update the weights accordingly to obtain {(s_t^(n), π_t^(n))}
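A minimal sketch of the three steps (select, predict, measure) for a one-dimensional state. It assumes a known constant drift, Gaussian diffusion and a Gaussian likelihood; these choices, the parameter values and the function name are hypothetical and are not taken from Isard and Blake's implementation.

```python
import numpy as np


def condensation_step(particles, weights, z, rng,
                      drift=1.0, diffusion=0.3, sigma=0.5):
    """One select / predict / measure cycle for scalar particles."""
    n = len(particles)
    # 1. Select: resample indices in proportion to the old weights.
    idx = rng.choice(n, size=n, p=weights)
    selected = particles[idx]
    # 2. Predict: deterministic drift, then individual random diffusion.
    predicted = selected + drift + rng.normal(0.0, diffusion, size=n)
    # 3. Measure: weight by an assumed Gaussian likelihood p(z | x).
    new_weights = np.exp(-0.5 * ((z - predicted) / sigma) ** 2) + 1e-300
    new_weights /= new_weights.sum()
    return predicted, new_weights


rng = np.random.default_rng(2)
n = 500
particles = rng.normal(0.0, 1.0, size=n)   # initial sample set
weights = np.full(n, 1.0 / n)
truth = 0.0
for _ in range(20):
    truth += 1.0                           # target moves one unit per frame
    z = truth + rng.normal(0.0, 0.1)       # synthetic noisy measurement
    particles, weights = condensation_step(particles, weights, z, rng)
estimate = np.sum(weights * particles)     # mean of weighted state samples
print(estimate, truth)
```

The tiny constant added to the weights only guards against division by zero when every particle is far from the measurement.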


CONDENSATION Scheme


Notes on Updating

Enforcing plausibility: Particles that represent impossible configurations are discarded

Diffusion modeled with a Gaussian

Likelihood function: Convert “goodness of prediction” score to pseudo-probability

– More markings closer to predicted markings -> higher likelihood


State Posterior


State Posterior Animation


Object Motion Model

For video tracking we need a way to propagate probability densities, so we need a “motion model” such as X_{t+1} = A X_t + B W_t, where W is a noise term and A and B are state transition matrices that can be learned from training sequences

The state, X, of an object, e.g., a B-spline curve, can be represented as a point in a 6D state space of possible 2D affine transformations of the object
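A minimal sketch of propagating a sample set through the motion model X_{t+1} = A X_t + B W_t. It assumes a two-dimensional state (position, velocity) and hand-picked matrices A and B instead of matrices learned from training sequences; the function name and all numbers are hypothetical.

```python
import numpy as np


def propagate(X, A, B, rng):
    """Apply X_{t+1} = A X_t + B W_t to each row of X (one sample per row)."""
    W = rng.standard_normal((X.shape[0], B.shape[1]))   # unit Gaussian noise
    return X @ A.T + W @ B.T


A = np.array([[1.0, 1.0],      # position += velocity
              [0.0, 1.0]])     # velocity stays constant
B = np.array([[0.0],           # no noise on the position
              [0.1]])          # small noise on the velocity
rng = np.random.default_rng(3)
X = np.tile([0.0, 1.0], (5, 1))   # five samples at position 0, velocity 1
X_next = propagate(X, A, B, rng)
print(X_next)
```

Setting B to zero reduces the model to the deterministic drift A X_t, which makes the roles of the two matrices easy to see.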


Evaluating p(Z | X)

\[ p(z \mid x) = q\, p(z \mid \text{clutter}) + \sum_{m=1}^{M} p(z \mid x, \phi_m)\, p(\phi_m) \]

where φ_m = {true measurement is z_m} for m = 1, …, M, and q = 1 − Σ_m p(φ_m) is the probability that the target is not visible

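[Equation: piecewise definition in terms of x − z_m, with one case that applies if a condition on x and z_m holds and another case otherwise]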


Dancing Example


Hand Example


Pointing Hand Example


3D Model-based Example

3D state space: image position + angle

Polyhedral model of object


Advantages of Particle Filtering

Nonlinear dynamics, measurement model easily incorporated

Copes with lots of false positives

Multi-modal posterior okay (unlike Kalman filter)

Multiple samples provide multiple hypotheses

Fast and simple to implement
