
Real-time on-line learning of transformed hidden Markov models

Nemanja Petrovic, Nebojsa Jojic, Brendan Frey and Thomas Huang

Microsoft, University of Toronto, University of Illinois


Six break points vs. six things in video

• Traditional video segmentation: find break points. Example: MovieMaker (cut and paste)

• Our goal: Find possibly recurring scenes or objects

[Figure: timeline with representative frames; class labels recur along the timeline (e.g. 1, 3, 2, 4, ..., 5, 6), so the same scene can appear in several segments]


Transformed hidden Markov model

[Graphical model: class c, latent image z, transformation T, observed frame x]

Class c with prior P(c = k) = π_k

P(z | c) = N(z; μ_c, Φ_c)

x = T z

E[x] = T μ_c

Var[x] = T Φ_c Tᵀ

p(x | c, T) = N(x; T μ_c, T Φ_c Tᵀ)

Generation is repeated for each frame of the sequence, with the pair (T,c) being the state of a Markov chain.

Translation T with uniform prior; latent image z; observed frame x.
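To make the generative process concrete, here is a minimal NumPy sketch of drawing a single frame from the model. The image size, class parameters, and the use of circular shifts for T are illustrative assumptions, and the Markov dependence of (T, c) across frames is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, K = 8, 8, 3                        # illustrative image size / class count
pi = np.full(K, 1.0 / K)                 # class prior P(c = k) = pi_k
mu = rng.normal(size=(K, H, W))          # class means mu_c
phi = np.full((K, H, W), 0.1)            # diagonal class variances Phi_c

def generate_frame():
    """Sample one frame: draw class c, latent image z ~ N(mu_c, Phi_c),
    then apply a translation T drawn from a uniform prior (modeled here
    as a circular shift, so x = T z)."""
    c = rng.choice(K, p=pi)
    z = mu[c] + np.sqrt(phi[c]) * rng.normal(size=(H, W))
    dy, dx = rng.integers(0, H), rng.integers(0, W)
    x = np.roll(z, shift=(dy, dx), axis=(0, 1))
    return x, c, (dy, dx)

x, c, T = generate_frame()
```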


Goal: maximize total likelihood of a dataset

log p(X) = log Σ_{T,c} Σ_Z p(X, {T,c}, Z)

= log Σ_{T,c} Σ_Z q({T,c}, Z) p(X, {T,c}, Z) / q({T,c}, Z)

≥ Σ_{T,c} Σ_Z q({T,c}, Z) log p(X, {T,c}, Z) − Σ_{T,c} Σ_Z q({T,c}, Z) log q({T,c}, Z) = B

(the inequality is Jensen's, applied to the concave log)

We factorize q({T,c}, Z) = q({T,c}) · q(Z | {T,c})

{T,c} represents values of transformation and class for all frames, i.e., the path that the video sequence takes through the state space of the model.

Instead of the likelihood, we optimize the bound B, which is tight for q=p({T,c},Z|X)


Posterior approximation

We allow q({T,c}) to have non-zero probability only on the M most probable paths:

q({T,c}) = Σ_{m=1..M} r_m δ({T,c} − {T,c}*_m)

(Viterbi 1982)

This also sidesteps a number of problems with adaptive scaling that arise in exact forward-backward inference.
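A minimal sketch of this M-best approximation, assuming the joint (T, c) state has been flattened to S discrete states and that per-frame log-likelihoods and log transition probabilities are given as dense arrays (the real system scores all shifts with FFTs instead):

```python
import numpy as np

def m_best_paths(log_lik, log_trans, log_prior, M=5):
    """Greedy M-best (beam) approximation to the path posterior.
    log_lik:  (F, S) per-frame log p(x_t | s) over joint states s = (T, c)
    log_trans: (S, S) log p(s_{t+1} | s_t);  log_prior: (S,).
    Returns the M best paths and their normalized weights r_m."""
    F, S = log_lik.shape
    scores = log_prior + log_lik[0]
    top = np.argsort(scores)[-M:]                      # M best length-1 paths
    paths = [[int(s)] for s in top]
    scores = scores[top]
    for t in range(1, F):
        cand_scores, cand_paths = [], []
        for p, sc in zip(paths, scores):
            ext = sc + log_trans[p[-1]] + log_lik[t]   # extend to all states
            for s in np.argsort(ext)[-M:]:
                cand_scores.append(ext[s])
                cand_paths.append(p + [int(s)])
        keep = np.argsort(cand_scores)[-M:]            # prune back to M paths
        scores = np.array([cand_scores[i] for i in keep])
        paths = [cand_paths[i] for i in keep]
    r = np.exp(scores - scores.max())
    return paths, r / r.sum()
```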


Expensive part of the E step

Find a quick way to calculate

log p(x | c, T) = −(N/2) log(2π) − ½ log|T Φ_c Tᵀ| − ½ (x − T μ_c)ᵀ (T Φ_c Tᵀ)⁻¹ (x − T μ_c)

for all possible shifts T in the E step of the EM algorithm.

[Figure: the shifted cluster mean Tμ is compared against the frame x; each shift T yields one value of log p]


Computing Mahalanobis distance using FFTs

All terms that have to be evaluated for all T can be expressed as correlations. Using the identity

xᵀ (T Φ Tᵀ)⁻¹ T μ = xᵀ T (Φ⁻¹ μ) = xᵀ T (diag(Φ)⁻¹ .* μ),

each such term is a sum over pixels of the form Σ x .* Tv, which can be evaluated for all shifts T at once:

Σ x .* Tv = IFFT( FFT(x) .* conj(FFT(v)) )

(where the summation is over pixels)

N log N versus N² !
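A sketch of this FFT trick in NumPy, assuming diagonal covariances stored as per-pixel variance images and circular (wrap-around) shifts; the function name and interface are illustrative:

```python
import numpy as np

def loglik_all_shifts(x, mu, phi):
    """log p(x | c, T) for every circular shift T at once.
    x: observed frame; mu, phi: mean and per-pixel variance of class c.
    Expanding the Mahalanobis term gives one shift-independent constant
    plus two correlations, each computable in N log N with FFTs."""
    # x^T T (Phi^-1 mu): correlate x with mu / phi over all shifts
    corr_x_mu = np.real(np.fft.ifft2(np.fft.fft2(x) *
                                     np.conj(np.fft.fft2(mu / phi))))
    # x^T T Phi^-1 T^T x: correlate x.^2 with 1 / phi over all shifts
    corr_xx = np.real(np.fft.ifft2(np.fft.fft2(x * x) *
                                   np.conj(np.fft.fft2(1.0 / phi))))
    const = -0.5 * np.sum(np.log(2 * np.pi * phi)) \
            - 0.5 * np.sum(mu * mu / phi)
    return const + corr_x_mu - 0.5 * corr_xx   # (H, W) map: one value per shift
```

Entry (dy, dx) of the returned map is the log-likelihood of the shift (dy, dx).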


Parameter optimization

F = Σ_{T,c} Σ_Z q({T,c}, Z) log p(X, {T,c}, Z)

= Σ_{T,c} Σ_Z q({T,c}) · q(Z | {T,c}) × ( log π_{T_1,c_1} + Σ_t log p(x_t, z_t | T_t, c_t) + Σ_t [ log p(c_{t+1} | c_t) + log p(T_{t+1} | T_t, c_t) ] )

Solve ∂F/∂θ = 0 for each model parameter θ, given the estimated q.


On-line vs. batch EM

Example: update equation for the class mean

Σ_t Σ_T q(T_t, c_t) E[z | x_t, c_t, T_t] = ( Σ_t q(c_t) ) μ_c

Batch EM:
– Solve for μ using all frames.
– Inference and parameter optimization are iterated.

On-line EM:
– Rewrite the equation for one extra frame.
– Establish the relationship between μ(t+1) and μ(t).
– Parameters are updated after each frame; no iteration is needed (see the sketch below).
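One way such a recursion can look, as a hedged sketch: a running weighted average in which each new frame pulls the mean toward its expected latent image. The interface and the absence of a forgetting factor are assumptions, not the paper's exact update.

```python
import numpy as np

def online_mean_update(mu, weight_sum, q_c, z_hat):
    """Running-average update for the class mean mu_c.
    q_c: posterior q(c_t = c) for the new frame;
    z_hat: expected latent image sum_T q(T_t, c) E[z | x_t, c, T_t] / q_c.
    Equivalent to the batch solution over all frames seen so far."""
    weight_sum += q_c                         # accumulate sum_t q(c_t = c)
    mu += (q_c / weight_sum) * (z_hat - mu)   # mu(t+1) in terms of mu(t)
    return mu, weight_sum
```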


Reducing the complexity of the M step

Σ_T q(T_t, c_t) E[z | x_t, c_t, T_t] can be expressed as a sum of convolutions.

For example, when there is no observation noise, E[z | x_t, c_t, T_t] = T_tᵀ x_t, and

Σ_T q(T_t, c_t) E[z | x_t, c_t, T_t] = IFFT( FFT(q) .* FFT(x) )

(a similar trick applies to the variance estimates).
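The same idea as a one-line NumPy sketch, with q(T_t, c_t) stored as an image over shifts; whether the convolution or its conjugate (correlation) form is needed depends on the sign convention chosen for indexing shifts:

```python
import numpy as np

def expected_latent_sum(q_T, x):
    """Sum_T q(T_t, c_t) E[z | x_t, c_t, T_t] with E[z | ...] = T^T x:
    a circular convolution of the shift posterior q with the frame x,
    computed in N log N via FFTs (shift sign convention assumed)."""
    return np.real(np.fft.ifft2(np.fft.fft2(q_T) * np.fft.fft2(x)))
```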


How to deal with scale and rotation?

Represent pixels on a log-polar grid! Shifts in log-polar coordinates correspond to scale and rotation changes in the Cartesian coordinate system.

[Figure: rotation and scale correspond to the two axes of the log-polar grid]
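A minimal sketch of resampling an image onto a log-polar grid (nearest-neighbour interpolation, centre of the image as origin; the grid resolution is an arbitrary choice):

```python
import numpy as np

def to_log_polar(img, n_r=64, n_theta=64):
    """Resample an image onto a log-polar grid (nearest neighbour),
    centred on the image centre, so that scalings and rotations about
    the centre become shifts along the two grid axes."""
    H, W = img.shape
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    rho = np.exp(np.linspace(0.0, np.log(min(cy, cx)), n_r))  # log-spaced radii
    theta = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    ys = np.clip(np.round(cy + rho[:, None] * np.sin(theta)), 0, H - 1)
    xs = np.clip(np.round(cx + rho[:, None] * np.cos(theta)), 0, W - 1)
    return img[ys.astype(int), xs.astype(int)]    # (n_r, n_theta) grid
```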


Estimating the number of classes

• The algorithm is initialized with a single class

• A new class is introduced whenever the frame likelihood drops below a threshold (see the sketch below)

• Classes can be merged at the end to achieve a more compact representation
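A hedged sketch of the class-growing rule; `log_lik_fn`, the threshold, and the initial variance are hypothetical placeholders, and seeding the new mean with the unexplained frame is an assumption about the initialization:

```python
import numpy as np

def maybe_add_class(means, variances, frame, log_lik_fn, threshold,
                    init_var=0.1):
    """If no existing class explains the new frame well enough, seed a
    new class from the frame itself (hypothetical initialization)."""
    best = max(log_lik_fn(frame, m, v) for m, v in zip(means, variances))
    if best < threshold:
        means.append(frame.copy())                        # new mean = frame
        variances.append(np.full_like(frame, init_var))   # broad initial variance
    return means, variances
```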


Clustering a 20-minute whale watching video

Clustering a 20-minute beach video


[Figure: shots from the first class, spanning the timeline from 0 min to 9 min]


Discovering objects using motion priors

A different motion prior is predefined for each class.

Three characteristic frames from a 240×320 input sequence

Learned means and variances


Tracking results


Summary

Before (CVPR 99/00)                        Now
28×44 images                               120×160 images
Grayscale images                           Full-color images
1 day of computation for 15 sec video      5-10 frames/sec
Batch EM                                   On-line EM
Exact inference                            Approximate inference
Fixed number of clusters                   Variable number of clusters
Limited number of translations             All possible translations
Memory inefficient                         Memory efficient


Sneak preview: Panoramic THMMs

[Graphical model: as before, with an additional operator W applied to the transformed latent image Tz]

P(c = k) = π_k

P(z | c) = N(z; μ_c, Φ_c)

x = W T z

E[x] = W T μ_c

Var[x] = W T Φ_c Tᵀ Wᵀ

p(x | c, T) = N(x; W T μ_c, W T Φ_c Tᵀ Wᵀ)


Video clustering - model

• Appearance: mean and variance

• Camera/object motion

• Temporal constraints

• Unsupervised learning – the only input is the video


Current implementation

• DirectShow filter for frame clustering (5-15 frames/sec!)

• Translation invariance
• On-line learning algorithm
• Classes repeating across video
• Potential applications:
  – Video segmentation
  – Content-based search/retrieval
  – Short video summary creation
  – DVD chapter creation


Comparing with layered sprites

“Perfect” segmentation

Layered sprites. Jojic, CVPR 01

But THMM is hundreds to thousands of times faster!

Example with more content