Tracking with Local Spatio-Temporal Motion Patterns in Extremely Crowded Scenes

Presented by 陳群元

Outline

o Introduction
o Previous work
o Predicting motion patterns
o Spatio-temporal transition distribution
o Discerning pedestrians
o Experimental results
o Conclusion

Introduction

Tracking individuals in extremely crowded scenes is a challenging task. We predict the local spatio-temporal motion patterns that describe pedestrian movement at each space-time location in the video.

We robustly model each individual's unique motion and appearance to discern them from surrounding pedestrians.

Previous work

Previous work tracks features and associates similar trajectories to detect individual moving entities within crowded scenes.

We encode many possible motions in an HMM and derive a full distribution of the motion at each spatio-temporal location in the video.


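Markov Model

An example: a 3-state Markov chain λ

o State 1 generates symbol A only, State 2 generates symbol B only, and State 3 generates symbol C only.

o The transition matrix A and the initial distribution π are

$$A = \begin{bmatrix} 0.6 & 0.3 & 0.1 \\ 0.1 & 0.7 & 0.2 \\ 0.3 & 0.2 & 0.5 \end{bmatrix}, \qquad \pi = \begin{bmatrix} 0.4 & 0.5 & 0.1 \end{bmatrix}^T$$

o Given a sequence of observed symbols O = {CABBCABC}, the only corresponding state sequence is {S3 S1 S2 S2 S3 S1 S2 S3}, and the corresponding probability is

$$P(O|\lambda) = P(q_0 = S_3)\,P(S_1|S_3)\,P(S_2|S_1)\,P(S_2|S_2)\,P(S_3|S_2)\,P(S_1|S_3)\,P(S_2|S_1)\,P(S_3|S_2)$$
$$= 0.1 \times 0.3 \times 0.3 \times 0.7 \times 0.2 \times 0.3 \times 0.3 \times 0.2 = 0.00002268$$

[Figure: state-transition diagram of the 3-state chain; s1, s2, and s3 emit A, B, and C respectively.]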
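The product in the Markov chain example can be checked with a few lines of Python. This is a minimal sketch: the matrix and initial distribution are the values read from the slide's example, while the names `chain_probability` and `STATE_OF` are hypothetical helpers introduced here.

```python
# Transition matrix and initial distribution as read from the slide's example.
A = [[0.6, 0.3, 0.1],
     [0.1, 0.7, 0.2],
     [0.3, 0.2, 0.5]]
PI = [0.4, 0.5, 0.1]

# Each state emits exactly one symbol, so a symbol identifies its state.
STATE_OF = {"A": 0, "B": 1, "C": 2}


def chain_probability(symbols):
    """Probability of a symbol string under the fully observable chain."""
    states = [STATE_OF[s] for s in symbols]
    prob = PI[states[0]]
    # Multiply one transition probability per consecutive state pair.
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev][cur]
    return prob


p = chain_probability("CABBCABC")
print(f"{p:.8f}")  # 0.00002268
```

Because every state emits a single symbol, there is no sum over state sequences here; that sum only appears once the states become hidden.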

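Hidden Markov Model

An example: a 3-state discrete HMM λ

$$A = \begin{bmatrix} 0.6 & 0.3 & 0.1 \\ 0.1 & 0.7 & 0.2 \\ 0.3 & 0.2 & 0.5 \end{bmatrix}, \qquad \pi = \begin{bmatrix} 0.4 & 0.5 & 0.1 \end{bmatrix}^T$$

$$b_1 = \{A: 0.3,\ B: 0.2,\ C: 0.5\}, \quad b_2 = \{A: 0.7,\ B: 0.1,\ C: 0.2\}, \quad b_3 = \{A: 0.3,\ B: 0.6,\ C: 0.1\}$$

o Given a sequence of observations O = {ABC}, there are 27 possible corresponding state sequences, and therefore the corresponding probability is

$$P(O|\lambda) = \sum_{i=1}^{27} P(O, q_i|\lambda) = \sum_{i=1}^{27} P(O|q_i, \lambda)\,P(q_i|\lambda), \qquad q_i: \text{state sequence}$$

o E.g., when q_i = {S2 S2 S3}:

$$P(O|q_i, \lambda) = P(A|S_2)\,P(B|S_2)\,P(C|S_3) = 0.7 \times 0.1 \times 0.1 = 0.007$$
$$P(q_i|\lambda) = P(q_0 = S_2)\,P(S_2|S_2)\,P(S_3|S_2) = 0.5 \times 0.7 \times 0.2 = 0.07$$

[Figure: state-transition diagram of the 3-state HMM with states s1, s2, s3 and their emission probabilities.]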
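The sum over all 27 state sequences can be written out directly and compared with the forward algorithm, which gives the same value without enumerating sequences. The sketch below uses the parameters of the slide's example; the function names `joint`, `brute_force`, and `forward` are hypothetical.

```python
from itertools import product

# Parameters of the 3-state discrete HMM from the slide's example.
A = [[0.6, 0.3, 0.1],
     [0.1, 0.7, 0.2],
     [0.3, 0.2, 0.5]]
PI = [0.4, 0.5, 0.1]
B = [{"A": 0.3, "B": 0.2, "C": 0.5},
     {"A": 0.7, "B": 0.1, "C": 0.2},
     {"A": 0.3, "B": 0.6, "C": 0.1}]


def joint(obs, states):
    """P(O, q | lambda) for a single state sequence q."""
    p = PI[states[0]] * B[states[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[states[t - 1]][states[t]] * B[states[t]][obs[t]]
    return p


def brute_force(obs):
    """Sum P(O, q | lambda) over every possible state sequence."""
    n = len(A)
    return sum(joint(obs, q) for q in product(range(n), repeat=len(obs)))


def forward(obs):
    """Forward algorithm: same quantity in O(T * N^2) time."""
    n = len(A)
    alpha = [PI[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)


print(joint("ABC", (1, 1, 2)))  # the {S2 S2 S3} term, 0.007 * 0.07
print(brute_force("ABC"), forward("ABC"))
```

For three observations the brute-force sum is cheap, but its cost grows as N^T, which is why the forward recursion is the usual choice.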

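Predicting Motion Patterns

Spatio-temporal gradient

The derivative at each position is approximated by the central difference, i.e. the average of the forward and backward differences:

$$f'(\mathrm{Pos}) = \frac{\big(f(\mathrm{Pos}+1) - f(\mathrm{Pos})\big) + \big(f(\mathrm{Pos}) - f(\mathrm{Pos}-1)\big)}{2} = \frac{f(\mathrm{Pos}+1) - f(\mathrm{Pos}-1)}{2}$$

The gradient is computed for each pixel i in the cuboid, where I is the intensity.

Spatio-temporal motion pattern

The local spatio-temporal motion pattern is represented by a 3D Gaussian of the spatio-temporal gradients.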
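As an illustration of how a cuboid could be turned into a 3D Gaussian of gradients, here is a minimal pure-Python sketch. It assumes central differences at interior pixels only and a population (1/N) covariance; the layout `cuboid[t][y][x]` and the names `gradients` and `gaussian_fit` are assumptions made for this example, not details taken from the slides.

```python
def gradients(cuboid):
    """Central-difference gradients (I_x, I_y, I_t) at interior pixels.

    cuboid[t][y][x] holds pixel intensities (assumed layout).
    """
    T, H, W = len(cuboid), len(cuboid[0]), len(cuboid[0][0])
    grads = []
    for t in range(1, T - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                ix = (cuboid[t][y][x + 1] - cuboid[t][y][x - 1]) / 2
                iy = (cuboid[t][y + 1][x] - cuboid[t][y - 1][x]) / 2
                it = (cuboid[t + 1][y][x] - cuboid[t - 1][y][x]) / 2
                grads.append((ix, iy, it))
    return grads


def gaussian_fit(grads):
    """Mean vector and 3x3 population covariance of the gradients."""
    n = len(grads)
    mu = [sum(g[k] for g in grads) / n for k in range(3)]
    cov = [[sum((g[a] - mu[a]) * (g[b] - mu[b]) for g in grads) / n
            for b in range(3)] for a in range(3)]
    return mu, cov


# Toy cuboid with intensity I = 2x + 3y + 5t: every gradient is (2, 3, 5).
cuboid = [[[2 * x + 3 * y + 5 * t for x in range(4)]
           for y in range(4)] for t in range(4)]
mu, cov = gaussian_fit(gradients(cuboid))
print(mu)  # [2.0, 3.0, 5.0]
```

A linear intensity ramp gives identical gradients everywhere, so the fitted covariance is zero; real video cuboids produce a spread of gradients whose covariance captures the variation in motion.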

Training HMM

The hidden states of the HMM are represented by a set of motion patterns.

The probability of an observed motion pattern given a hidden state s is

Kullback–Leibler divergence

Kullback–Leibler divergence is a non-symmetric measure of the difference between two probability distributions P and Q.
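The asymmetry is easy to see numerically. The sketch below uses the closed-form KL divergence between two Gaussians, restricted to diagonal covariances to keep the code short; that restriction and the name `kl_diag_gaussians` are assumptions of this example (the motion patterns in the slides are full 3D Gaussians).

```python
import math


def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL(P || Q) for Gaussians with diagonal covariances.

    Each argument is a list with one entry per dimension; var_* hold
    the per-axis variances.
    """
    total = 0.0
    for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q):
        total += 0.5 * (vp / vq + (mq - mp) ** 2 / vq - 1.0
                        + math.log(vq / vp))
    return total


p = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
q = ([1.0, 0.0, 0.0], [4.0, 1.0, 1.0])

kl_pq = kl_diag_gaussians(*p, *q)
kl_qp = kl_diag_gaussians(*q, *p)
print(kl_pq, kl_qp)  # the two directions give different values
```

Swapping the arguments changes the result, which is why the direction of the divergence has to be fixed when comparing an observed pattern with a hidden state's pattern.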

Predictive distribution

After training a collection of HMMs on a video of typical crowd motion, we predict the motion pattern at each space-time location that contains the tracked subject.

Here S is the set of hidden states and w(s) is a per-state weight; its definition uses the vector of scaled messages.

Reference: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.

Predicted local spatio-temporal motion pattern

The predicted pattern is a weighted sum of the 3D Gaussian distributions associated with the HMM's hidden states.

The centroid we are interested in is a multivariate normal density that minimizes the total distortion. Formally, a centroid c is defined as

Reference: On Divergence Based Clustering of Normal Distributions and Its Application to HMM Adaptation

Predicted motion pattern

The predicted motion pattern is expressed in terms of the mean and covariance of each hidden state s.
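One common way to collapse a weighted sum of Gaussians into a single Gaussian is moment matching, which minimizes the weighted sum of KL divergences from each component to the result. Whether this matches the exact centroid definition on the slide cannot be confirmed from the text, so the sketch below should be read as an assumption; `moment_match` is a hypothetical name.

```python
def moment_match(weights, means, covs):
    """Single Gaussian with the same mean and covariance as the mixture.

    weights: per-state weights summing to one
    means:   per-state mean vectors
    covs:    per-state covariance matrices (lists of rows)
    """
    d = len(means[0])
    mu = [sum(w * m[a] for w, m in zip(weights, means)) for a in range(d)]
    cov = [[sum(w * (c[a][b] + m[a] * m[b])
                for w, m, c in zip(weights, means, covs)) - mu[a] * mu[b]
            for b in range(d)] for a in range(d)]
    return mu, cov


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mu, cov = moment_match(
    weights=[0.5, 0.5],
    means=[[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
    covs=[identity, identity],
)
print(mu)         # [0.0, 0.0, 0.0]
print(cov[0][0])  # 2.0: spread of the means adds to the variance
```

The example shows the characteristic effect: when the hidden states disagree about the motion, the combined Gaussian becomes wider along the axis of disagreement.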


Bayesian probabilities

We use the gradient information to estimate the optical flow within each specific sub-volume and track the target in a Bayesian framework.

Bayesian tracking can be formulated as maximizing the posterior distribution of the state x_t of the target at time t given the available measurements z_{1:t} = {z_i; i = 1, ..., t} by

$$p(x_t | z_{1:t}) \propto p(z_t | x_t) \int p(x_t | x_{t-1})\, p(x_{t-1} | z_{1:t-1})\, dx_{t-1}$$

where z_t is the image at time t, p(x_t|x_{t-1}) is the transition distribution, and p(z_t|x_t) is the likelihood.

We define the state vector x_t as the width, height, and 2D location of the target within the image.

We focus on the target's movement between frames and use a second-order autoregressive model for the transition distribution of the target's width and height.

Ideally, the state transition distribution p(x_t|x_{t-1}) directly reflects the two-dimensional motion of the target between frames t-1 and t. It is parameterized by the 2D optical flow vector and a covariance matrix.

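Optical flow

Assuming the movement to be small, the image constraint at I(x, y, t) can be expanded with a Taylor series to get

$$I(x + \Delta x, y + \Delta y, t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \text{H.O.T.}$$

where H.O.T. denotes the higher-order terms.

The predicted motion pattern is defined by a mean gradient vector and a covariance matrix.

The motion information encoded in the spatio-temporal gradients can be expressed in the form of the structure tensor matrix.

The optical flow can then be estimated from the structure tensor by solving for w = [u, v, z]^T, the 3D optical flow.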
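A rough sketch of how flow can be read off a mean gradient and covariance follows. It assumes the structure tensor is the mean outer product of the gradients, which equals covariance plus the outer product of the mean, and it swaps in a least-squares (Lucas-Kanade-style) solution with the temporal component fixed to 1; the solver used in the original work may differ. The names `structure_tensor` and `flow_from_tensor` are hypothetical.

```python
def structure_tensor(mu, cov):
    """G = cov + mu mu^T, i.e. the mean outer product of the gradients."""
    return [[cov[a][b] + mu[a] * mu[b] for b in range(3)] for a in range(3)]


def flow_from_tensor(G):
    """Least-squares 2D flow (u, v) with the temporal component set to 1.

    Solves the first two rows of G [u, v, 1]^T = 0 by Cramer's rule.
    """
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    if abs(det) < 1e-12:
        raise ValueError("spatial gradients are degenerate (aperture problem)")
    u = (-G[0][2] * G[1][1] + G[0][1] * G[1][2]) / det
    v = (-G[0][0] * G[1][2] + G[1][0] * G[0][2]) / det
    return u, v


# Gradients (I_x, I_y, I_t) consistent with a flow of (1, -2):
# each satisfies I_x * u + I_y * v + I_t = 0.
grads = [(1.0, 0.0, -1.0), (0.0, 1.0, 2.0), (1.0, 1.0, 1.0)]
n = len(grads)
mu = [sum(g[a] for g in grads) / n for a in range(3)]
cov = [[sum((g[a] - mu[a]) * (g[b] - mu[b]) for g in grads) / n
        for b in range(3)] for a in range(3)]

u, v = flow_from_tensor(structure_tensor(mu, cov))
print(round(u, 6), round(v, 6))  # 1.0 -2.0
```

Fixing the temporal component to 1 corresponds to dividing the 3D flow by its last entry, which is only meaningful when the spatial gradients are not all parallel; the explicit determinant check guards against that case.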

Covariance matrix


Typical models of the likelihood distribution p(z_t|x_t) are parameterized by a variance, a distance measure, and a normalization term Z.

The distance measure is the difference between a region R (defined by state x_t) of the observed image z_t and the template.

We assume pedestrians exhibit consistency in their appearance and their motion, and model them in a joint likelihood that combines p_A and p_M, the appearance and motion likelihoods.
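To make the joint likelihood concrete, the sketch below assumes each likelihood has the exponential form exp(-d / variance) and that the joint likelihood is the product of the appearance and motion terms. Both forms are assumptions for illustration, as are the names `likelihood`, `joint_likelihood`, and `normalized_weights`; the default variances of 0.05 are the values reported in the experiments.

```python
import math


def likelihood(distance, variance):
    """Unnormalized likelihood exp(-d / variance) (assumed form)."""
    return math.exp(-distance / variance)


def joint_likelihood(d_app, d_mot, var_app=0.05, var_mot=0.05):
    """Product of appearance and motion likelihoods (assumed combination)."""
    return likelihood(d_app, var_app) * likelihood(d_mot, var_mot)


def normalized_weights(candidates, var_app=0.05, var_mot=0.05):
    """Normalize joint likelihoods over candidate states so they sum to 1."""
    raw = [joint_likelihood(da, dm, var_app, var_mot) for da, dm in candidates]
    z = sum(raw)  # plays the role of the normalization term Z
    return [r / z for r in raw]


# Hypothetical (appearance distance, motion distance) pairs for 3 candidates.
candidates = [(0.02, 0.03), (0.10, 0.01), (0.01, 0.20)]
weights = normalized_weights(candidates)
best = weights.index(max(weights))
print(best)  # 0: small distances in both appearance and motion
```

Multiplying the two terms means a candidate must match the template in both appearance and motion; the third candidate has the best appearance match but is penalized heavily for its motion distance.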

Update motion template

After tracking in frame t, we update each pixel i in the motion template. The update involves the motion template at time t, the region of spatio-temporal gradients defined by the tracking result (i.e., the expected value of the posterior), and the learning rate.
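A per-pixel update with a learning rate is often written as an exponential moving average. The sketch below assumes that form, new = alpha * region + (1 - alpha) * old; the exact update rule is not recoverable from the slide text, and `update_template` is a hypothetical name. The default learning rate of 0.05 is the value reported in the experiments.

```python
def update_template(template, region, alpha=0.05):
    """Blend the tracked region into the template, pixel by pixel.

    template, region: lists of per-pixel gradient vectors (I_x, I_y, I_t)
    alpha: learning rate; larger values adapt faster to the new region
    """
    return [[alpha * r + (1 - alpha) * t for r, t in zip(r_vec, t_vec)]
            for r_vec, t_vec in zip(region, template)]


template = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
region = [[1.0, 2.0, 4.0], [2.0, 2.0, 2.0]]

print(update_template(template, region, alpha=0.5))
# [[0.5, 1.0, 2.0], [2.0, 2.0, 2.0]]
```

With a small learning rate the template changes slowly, so a few frames of occlusion or noise do not overwrite the target's characteristic motion.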

Update the error measurement

The error at pixel i and time t is computed from t_i and r_i, the normalized gradient vectors of the motion template and the tracking result at time t.

To reduce the contributions of frequently changing pixels to the computation of the motion likelihood, we weight each pixel in the likelihood's distance measure, using a normalization term Z.

Distance measure

The distance measure of the motion likelihood distribution then incorporates these per-pixel weights.
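The weighting idea can be sketched as follows. Everything specific here is an assumption made for illustration: the error is taken as the angle between the two gradient vectors, the raw weight is taken as pi minus the error, and the weights are normalized to sum to one. The names `angular_error`, `pixel_weights`, and `weighted_distance` are hypothetical.

```python
import math


def angular_error(t_vec, r_vec):
    """Angle in radians between two gradient vectors (0 if either is zero)."""
    nt = math.sqrt(sum(c * c for c in t_vec))
    nr = math.sqrt(sum(c * c for c in r_vec))
    if nt == 0.0 or nr == 0.0:
        return 0.0
    cos = sum(a * b for a, b in zip(t_vec, r_vec)) / (nt * nr)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding


def pixel_weights(errors):
    """Weights that shrink as the error grows and sum to one (assumed form)."""
    raw = [math.pi - e for e in errors]
    z = sum(raw)
    if z == 0.0:
        return [1.0 / len(errors)] * len(errors)  # fall back to uniform
    return [r / z for r in raw]


def weighted_distance(template, region, weights):
    """Weighted sum of per-pixel angular errors."""
    return sum(w * angular_error(t, r)
               for w, t, r in zip(weights, template, region))


template = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
region = [(2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]

errors = [angular_error(t, r) for t, r in zip(template, region)]
weights = pixel_weights(errors)
print(weights)  # the stable first pixel gets the larger weight
```

Pixels whose gradients keep changing direction accumulate large errors and therefore contribute less to the distance, which is the stated purpose of the weighting.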


The training video for the concourse scene contains 300 frames (about 10 seconds of video), and the video for the ticket gate scene contains 350 frames.

We set the cuboid size to 10 × 10 × 10 for both scenes.

The learning rate, appearance variance, and motion variance are all set to 0.05.


Conclusion

In this paper, we derived a novel probabilistic method that exploits the inherent spatially and temporally varying structured pattern of a crowd's motion to track individuals in extremely crowded scenes.

The end Thank you
