Tv parser an automatic tv video parsing method_liang_20100309


TVParser: An Automatic TV Video Parsing Method

Chao Liang

National Laboratory of Pattern Recognition (NLPR)
Chinese Academy of Sciences, Institute of Automation (CASIA)

March 9, 2011


Outline

1 Introduction: Motivation, Related work

2 Our solution: Basic ideas, Role histogram

3 TVParser model: Model formulation, Parameter estimation, State inference

4 Experimental Results: Data sets, Face naming, Scene segmentation

5 Conclusion





Introduction

Motivation

Voluminous TV videos vs. efficient management






Introduction

TV video

Story plot (scene structure)

[Scene: Monica and Rachel's, Carol and Susan are showing off Ben to the gang.]

Phoebe: Oh my God, oh, ok, was that too much pressure for him?
Susan: Oh, is he hungry already?
Carol: I guess so. (Carol starts to breast feed Ben.)
… …

[Scene: Central Perk, the gang is all there.]

Julie: Rachel, do you have any muffins left?
Rachel: Yeah, I forget which ones.
Julie: Oh, you're busy, that's ok, I'll get it. Anybody else want one?
… …

Characters (named faces)

RACH MNCA PHBE JOY CHAN ROSS






Related work

Movie/Script alignment

Script-subtitle alignment

[Figure: script-subtitle alignment, with panels labeled script, subtitle and movie]

Script:
[Scene: Rachel is entering the living room.]
Monica: Julie.
Rachel: What?!

Subtitle:
00:10:44,210 --> 00:10:45,177
Monica: Julie.
00:10:45,444 --> 00:10:46,775
Rachel: What?!

Disadvantages

Syntax and wording discrepancies between the script and the subtitle
Availability of the subtitle






Related work (cont.)

Face naming

Fully supervised
Weakly supervised

[Figure: (a) weakly supervised, with the script excerpt "[Scene: Rachel is entering the living room.] Monica: Julie. Rachel: What?!"; (b) fully supervised, with the labels Monica and Rachel]

Disadvantages

Expensive manual labels
Hard to extend to large-scale applications






Related work (cont.)

Scene segmentation

Content-based method
Script-guided method

[Figure: script-guided scene segmentation with an HMM. Hidden state sequence: scenes q1, q2, q3, q4, linked by transition probabilities a_{q1,q2}, a_{q2,q3}, a_{q3,q4}, a_{q1,q3}, a_{q2,q4}, a_{q1,q4}. Observation sequence: shots 1 to 4 at t = 1, 2, 3, 4, linked to the states by b_{q1,shot1}, b_{q2,shot2}, b_{q3,shot3}, b_{q4,shot4}.]

HMM: λ = {A, B, π} = {A(a_{qi,qj}), B(b_{qi,shotj}), π}

Viterbi alignment: Q = {q1, q2, q3, q4, q5, ...}

Disadvantages

Matching units are asymmetric
Implicit (latent) geometric distribution of state duration
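The duration issue can be made explicit with a standard HMM property that the slide does not spell out. Assuming a scene state q has a self-transition probability a_{q,q} (an assumption here, since the slide's figure only shows transitions between different scenes), the number of consecutive shots d that stay in q follows a geometric law:

```latex
P(d \mid q) = a_{q,q}^{\,d-1}\,\bigl(1 - a_{q,q}\bigr), \qquad d = 1, 2, \dots
```

This probability is largest at d = 1 and decays monotonically, which is a questionable fit for scenes that usually span many shots.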





Our solution

Basic ideas

A generative TVParser model to align video and script by mining face-name correspondence.

[Figure: name histograms of script scenes C1, C2, C3 and face histograms of video shots S1 to S12 over the roles JOEY, MNCA, RACH and CHAN, with C1: {S1, …, S4}, C2: {S6, …, S8}, C3: {S10, …, S12}]

Advantages

Face names can be identified in an unsupervised way (learning)
Globally optimal scene segmentation can be inferred (inference)
Fast algorithms for both parameter learning and state inference






Role histogram

Basic idea

Bag-of-Words (BoW) representation
Role composition is a generic and semantic feature for both video (as face histogram) and script (as name histogram)

Name clustering

Face clustering
Difficulty: varying environmental conditions, e.g. pose
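A minimal sketch of the two role histograms, assuming a script scene is available as a list of (speaker, line) pairs and a video segment as a list of face-cluster ids, one per detected face. The function names, the input layout and the choice to count speaker turns are illustrative assumptions, not details given on the slides.

```python
from collections import Counter


def name_histogram(scene_lines, roles):
    """Count how often each role speaks in one script scene (bag-of-words over names)."""
    counts = Counter(speaker.upper() for speaker, _ in scene_lines)
    return [counts.get(role, 0) for role in roles]


def face_histogram(face_cluster_ids, n_clusters):
    """Count how many detected faces fall into each face cluster within a video segment."""
    counts = Counter(face_cluster_ids)
    return [counts.get(k, 0) for k in range(n_clusters)]


roles = ["MONICA", "RACHEL", "JULIE"]  # hypothetical role vocabulary
scene = [
    ("Julie", "Rachel, do you have any muffins left?"),
    ("Rachel", "Yeah, I forget which ones."),
    ("Julie", "Oh, you're busy, that's ok, I'll get it."),
]
print(name_histogram(scene, roles))         # [0, 1, 2]
print(face_histogram([0, 2, 2, 1, 2], 4))   # [1, 1, 3, 0]
```

Both histograms live in different spaces (names vs. face clusters), which is why the model later needs a matrix that relates the two.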






Role histogram

Face clustering

Solution I: Semi-supervised kernel k-means clustering

Key points

Incorporate pairwise constraints (must-link and cannot-link)
Adopt manifold-manifold distance

[Figure panels: must-link and cannot-link constraints; manifold-manifold distance]






Role histogram

Face clustering

Solution II: Loose clustering number

Key points

Allowing purified substructures





Model formulation

Graphical TVParser model

[Figure: graphical model. Each script scene s_{i−1}, s_i, s_{i+1} is linked to a video shot subsequence v_(i−1), v_(i), v_(i+1); the subsequence for s_i runs from t_i to t_i + d_i on the shot axis, as set by the partition p_i = (t_i, d_i).]

S = {s_i | i = 1, …, r} is the observed script scene sequence;
V = {v_j | j = 1, …, u} is the observed video shot sequence;
P = {p_i = (t_i, d_i) | i = 1, …, r} is the hidden video scene partition sequence, where t_1 = 1, Σ_i d_i = u and t_i = t_{i−1} + d_{i−1} (i > 1).
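The constraints on the partition sequence can be sketched as follows. This is an illustrative helper, not code from the talk: it assumes every scene covers at least one shot and that scene i covers shots t_i, …, t_i + d_i − 1 (1-based), which is the reading under which t_i = t_{i−1} + d_{i−1} and Σ_i d_i = u hold at the same time.

```python
def partitions_from_durations(durations, u):
    """Turn scene durations d_1..d_r into partitions p_i = (t_i, d_i) with 1-based t_i."""
    if any(d < 1 for d in durations):
        raise ValueError("assumed here: every scene covers at least one shot")
    if sum(durations) != u:
        raise ValueError("durations must sum to the number of shots u")
    partitions, t = [], 1          # t_1 = 1
    for d in durations:
        partitions.append((t, d))
        t += d                     # t_i = t_{i-1} + d_{i-1}
    return partitions


def shot_subsequence(shots, partition):
    """Return the shots covered by p_i = (t_i, d_i), with t_i counted from 1."""
    t, d = partition
    return shots[t - 1:t - 1 + d]


shots = list("abcdefghi")                      # u = 9 hypothetical shots
P = partitions_from_durations([3, 2, 4], len(shots))
print(P)                                       # [(1, 3), (4, 2), (6, 4)]
print(shot_subsequence(shots, P[1]))           # ['d', 'e']
```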






Model formulation

Complete TVParser model

$$P(V,S,P) = P(s_1)\,P(p_1 \mid s_1)\,P(v_{(1)} \mid p_1, s_1) \times \prod_{i=2}^{r} P(s_i \mid s_{i-1})\,P(p_i \mid s_i)\,P(v_{(i)} \mid p_i, s_i)$$

The generative process

(1) Enter the i-th script scene s_i from its predecessor s_{i−1};

(2) Decide the partition p_i = (t_i, d_i) related to s_i;

(3) Generate the corresponding video shot subsequence v_(i) = v[t_i : t_i + d_i], indexing from t_i to t_i + d_i.






Model formulation

Additional constraint

P(s_1) = 1 ⇔ s_1 = 1
P(s_i | s_{i−1}) = 1 ⇔ s_i = i, s_{i−1} = i − 1

Simplified TVParser model

$$P(V,S,P) = \prod_{i=1}^{r} \underbrace{P(p_i \mid s_i)}_{\text{duration}}\; \underbrace{P(v_{(i)} \mid p_i, s_i)}_{\text{observation}}$$






Model formulation

Scene duration probability

Poisson distribution

$$P(p_i \mid s_i; \lambda_i) = \frac{\lambda_i^{d_i} e^{-\lambda_i}}{d_i!} = e^{-\lambda_i} \cdot \frac{\lambda_i^{d_i}}{d_i!}$$

Reasons

Poisson is a plausible model of state duration;
The model parameter, λ = {λ_i}, is the expected duration of the scenes;
The parameter can be estimated by the maximum likelihood method.
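A small sketch of the Poisson duration term, under the simplifying assumption that some scene durations are directly observed. In the model itself the partitions are hidden, so the estimate becomes a posterior-weighted average rather than the plain sample mean used here; the sample values are made up for illustration.

```python
import math


def poisson_pmf(d, lam):
    """P(d; lam) = exp(-lam) * lam**d / d!  for a scene lasting d shots."""
    return math.exp(-lam) * lam ** d / math.factorial(d)


def poisson_mle(durations):
    """Maximum likelihood estimate of lam from fully observed durations: the sample mean."""
    return sum(durations) / len(durations)


observed = [20, 25, 15]            # hypothetical scene lengths in shots
lam = poisson_mle(observed)
print(lam)                         # 20.0
# The pmf is largest near d = lam and falls off on both sides.
print(poisson_pmf(20, lam) > poisson_pmf(5, lam))   # True
```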






Model formulation

Observation probability

Gaussian distribution

$$P(v_{(i)} \mid p_i, s_i; A, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{-\frac{(s_i - A v_{(i)})^\top (s_i - A v_{(i)})}{2\sigma_i^2}\right\}$$

Meaning of parameter A

A = [A_ij] ∈ R^{M×N} is the face-name relation matrix that associates M names with N face clusters. By constraining the entries of A so that A_ij ≥ 0 and Σ_i A_ij = 1, we can treat each column as an identity distribution of the corresponding face cluster.
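The observation term can be sketched in plain Python. The sketch assumes that s_i stands for the M-dimensional name histogram of script scene i and v_(i) for the N-dimensional face histogram of the matched video segment, which is how the role-histogram slides suggest the symbols are used; the matrix values below are invented for illustration.

```python
import math


def normalize_columns(A):
    """Scale each column of a non-negative M x N matrix so that it sums to 1."""
    n_cols = len(A[0])
    col_sums = [sum(row[j] for row in A) for j in range(n_cols)]
    return [[row[j] / col_sums[j] for j in range(n_cols)] for row in A]


def matvec(A, v):
    """Map a face histogram v (length N) to predicted name counts A v (length M)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def log_observation(s, v, A, sigma2):
    """log of 1/sqrt(2*pi*sigma2) * exp(-(s - A v)^T (s - A v) / (2*sigma2))."""
    residual = [si - pi for si, pi in zip(s, matvec(A, v))]
    squared = sum(x * x for x in residual)
    return -0.5 * math.log(2 * math.pi * sigma2) - squared / (2 * sigma2)


A = normalize_columns([[2.0, 0.0, 1.0],      # 2 names x 3 face clusters (toy values)
                       [0.0, 3.0, 1.0]])
v = [2, 1, 2]                                # face histogram of a video segment
print(matvec(A, v))                          # [3.0, 2.0]
print(log_observation([3, 2], v, A, 1.0) > log_observation([4, 2], v, A, 1.0))  # True
```

Column j of the normalized A reads as a distribution over names for face cluster j, so a cluster split evenly between two names contributes half a count to each.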






Parameter estimation

Model parameters Ψ = {{λ_i}, {σ_i^2}, A}

Maximum likelihood estimation (MLE)

$$\max_{\hat{\Psi}} \sum_{P} P(P \mid V, S; \Psi) \cdot \log P(V, S, P; \hat{\Psi}) \qquad \text{s.t.}\quad \mathbf{1}_M^\top A = \mathbf{1}_N^\top,\; A \ge 0$$

Optimization problem

For {λ_i} and {σ_i}: unconstrained optimization
For A: constrained optimization







Re-estimation for {λi}

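$$\lambda_i = \frac{\sum_{p_i} P(p_i \mid V, S; \Psi) \cdot d_i}{\sum_{p_i} P(p_i \mid V, S; \Psi)}$$

Re-estimation for {σ_i}

$$\sigma_i^2 = \frac{\sum_{p_i} P(p_i \mid V, S; \Psi) \cdot (s_i - A v_{(i)})^\top (s_i - A v_{(i)})}{\sum_{p_i} P(p_i \mid V, S; \Psi)}$$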
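Both re-estimation formulas are posterior-weighted averages over the candidate partitions of one scene. A minimal sketch, assuming the posteriors P(p_i | V, S; Ψ) are already available as a dictionary keyed by (t_i, d_i) and that the squared residual of a candidate segment is supplied by a caller-provided function (both are illustrative choices):

```python
def reestimate_lambda(posteriors):
    """Posterior-weighted mean duration for one scene.

    posteriors: dict mapping a candidate partition (t, d) to P(p_i | V, S).
    """
    total = sum(posteriors.values())
    return sum(w * d for (_, d), w in posteriors.items()) / total


def reestimate_sigma2(posteriors, squared_residual):
    """Posterior-weighted mean of (s_i - A v_(i))^T (s_i - A v_(i)) for one scene.

    squared_residual(t, d) returns that squared norm for the candidate segment.
    """
    total = sum(posteriors.values())
    return sum(w * squared_residual(t, d) for (t, d), w in posteriors.items()) / total


post = {(1, 2): 0.25, (1, 3): 0.5, (1, 4): 0.25}     # toy posteriors for one scene
print(reestimate_lambda(post))                                  # 3.0
print(reestimate_sigma2(post, lambda t, d: (d - 3) ** 2))       # 0.5
```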







Re-estimation for A

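$$A_{ij} \leftarrow A_{ij} \sqrt{\frac{(W - \mathbf{1}_M \boldsymbol{\eta}^\top)^{+}_{ij}}{2(AU)_{ij} + (W - \mathbf{1}_M \boldsymbol{\eta}^\top)^{-}_{ij}}}$$

where

$$W = \sum_{P} P(P \mid V, S; \Psi) \sum_{i=1}^{r} \frac{1}{\sigma_i^2}\, s_i v_{(i)}^\top$$

$$U = \sum_{P} P(P \mid V, S; \Psi) \sum_{i=1}^{r} \frac{1}{2\sigma_i^2}\, v_{(i)} v_{(i)}^\top$$

$$\boldsymbol{\eta}^\top = \frac{1}{M} \cdot \left(\mathbf{1}_M^\top W - 2\,\mathbf{1}_N^\top U\right)$$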







Summation in both W and U

$$\sum_{P} P(P \mid V, S; \Psi)$$

Sum over the whole space of possible partition sequences
Typical example: r = 15 (scenes) and u = 300 (shots), then the number of possible segmentations is C_{299}^{15} ≈ O(10^{24}) (intractable!)

Solution: Sequence ⇒ segments

$$\sum_{P} P(P \mid V, S; \Psi) \sum_{i=1}^{r} (\cdot) = \sum_{i=1}^{r} \sum_{p_i} P(p_i \mid V, S; \Psi)\, (\cdot)$$
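The size gap between the two summations can be checked numerically. The sketch below rests on an assumption of its own: each of the 15 scenes covers at least one of the 300 shots, so a segmentation corresponds to choosing 14 cut points among 299 gaps. Under that convention the count is C(299, 14), a slightly different binomial from the one quoted on the slide, but it leads to the same practical conclusion.

```python
import math
from itertools import combinations


def count_partitions(u, r):
    """Ways to cut u shots into r contiguous, non-empty scenes: C(u - 1, r - 1)."""
    return math.comb(u - 1, r - 1)


def enumerate_partitions(u, r):
    """Yield every duration tuple (d_1, ..., d_r); only feasible for tiny u and r."""
    for cuts in combinations(range(1, u), r - 1):
        bounds = (0,) + cuts + (u,)
        yield tuple(bounds[k + 1] - bounds[k] for k in range(r))


def count_segments(u, r):
    """Upper bound on (scene, start, duration) triples visited by the per-segment sum."""
    return r * u * (u + 1) // 2


print(len(list(enumerate_partitions(6, 3))), count_partitions(6, 3))   # 10 10
print(count_partitions(300, 15) > 10 ** 23)                            # True
print(count_segments(300, 15))                                         # 677250
```

Summing over segments therefore touches well under a million terms, whereas enumerating whole partition sequences is out of reach.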







Posterior probability P(pi |V,S; Ψ)

Forward-backward algorithm

Forward-backward variables

$$\alpha_{p_i}(s_i) \triangleq P(s_i, p_i, v_{[1:t_i+d_i]}; \Psi), \qquad \beta_{p_i}(s_i) \triangleq P(v_{[t_i+d_i+1:u]} \mid s_i, p_i; \Psi)$$

Forward-backward recursion
Initial conditions
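The slide only names the recursion and its initial conditions, so the following is one possible realization rather than the author's algorithm. It assumes 0-based shots, at least one shot per scene, and a caller-supplied `seg(i, t, d)` that returns the duration probability times the observation probability for scene i covering shots t, …, t + d − 1; the toy features and parameters are invented.

```python
import math


def segment_posteriors(u, r, seg):
    """Return {(i, t, d): P(p_i = (t, d) | V, S)} for r scenes over u shots."""
    # alpha[i][e]: total score of scenes 0..i exactly covering shots 0..e-1
    alpha = [[0.0] * (u + 1) for _ in range(r)]
    for e in range(1, u + 1):
        alpha[0][e] = seg(0, 0, e)
    for i in range(1, r):
        for e in range(i + 1, u + 1):
            alpha[i][e] = sum(alpha[i - 1][e - d] * seg(i, e - d, d)
                              for d in range(1, e - i + 1))
    # beta[i][e]: total score of scenes i+1..r-1 exactly covering shots e..u-1
    beta = [[0.0] * (u + 1) for _ in range(r)]
    beta[r - 1][u] = 1.0
    for i in range(r - 2, -1, -1):
        for e in range(i + 1, u):
            beta[i][e] = sum(seg(i + 1, e, d) * beta[i + 1][e + d]
                             for d in range(1, u - e + 1))
    z = alpha[r - 1][u]          # score summed over all partition sequences
    post = {}
    for i in range(r):
        for t in range(u):
            left = alpha[i - 1][t] if i > 0 else float(t == 0)
            if left == 0.0:
                continue
            for d in range(1, u - t + 1):
                p = left * seg(i, t, d) * beta[i][t + d] / z
                if p > 0.0:
                    post[(i, t, d)] = p
    return post


def poisson_pmf(d, lam):
    return math.exp(-lam) * lam ** d / math.factorial(d)


shots = [0.1, 0.0, 1.1, 0.9, 1.0]        # hypothetical 1-D shot features
means, lams = [0.0, 1.0], [2.0, 3.0]     # hypothetical per-scene parameters


def seg(i, t, d):
    squared = sum((x - means[i]) ** 2 for x in shots[t:t + d])
    return poisson_pmf(d, lams[i]) * math.exp(-squared)


post = segment_posteriors(len(shots), 2, seg)
# For each scene the posteriors over its candidate segments sum to one.
print(round(sum(p for (i, _, _), p in post.items() if i == 0), 6))   # 1.0
```

Working with raw probabilities keeps the sketch short but will underflow on long videos; a practical version would use scaling or log-domain sums.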






State inference

Hidden partition sequence P*

Viterbi Algorithm

Local optimal

$$\delta_\tau(s_i; \theta) \triangleq \max_{p_{[1:i-1]}} P(p_{[1:i-1]}, s_{[1:i-1]}, \tau \in q_i, o_{[1:\tau]}; \theta)$$

Forward recursion
Backtracking
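A minimal dynamic-programming sketch of the forward recursion and backtracking, not the exact recursion from the talk. It assumes at least one shot per scene and a caller-supplied `log_seg(i, t, d)` giving the log duration probability plus the log observation probability of scene i covering d shots from 0-based shot t; the returned start indices are 1-based to match t_1 = 1. The toy features and parameters are made up.

```python
import math


def log_poisson(d, lam):
    """log of exp(-lam) * lam**d / d!"""
    return -lam + d * math.log(lam) - math.lgamma(d + 1)


def viterbi_partition(u, r, log_seg):
    """Return the best partition [(t_1, d_1), ...] (1-based t_i) and its log-score."""
    neg_inf = float("-inf")
    # best[i][e]: best log-score of scenes 0..i exactly covering shots 0..e-1
    best = [[neg_inf] * (u + 1) for _ in range(r)]
    back = [[0] * (u + 1) for _ in range(r)]   # duration of scene i on that best path
    for e in range(1, u + 1):
        best[0][e], back[0][e] = log_seg(0, 0, e), e
    for i in range(1, r):                      # forward recursion
        for e in range(i + 1, u + 1):
            for d in range(1, e - i + 1):
                cand = best[i - 1][e - d] + log_seg(i, e - d, d)
                if cand > best[i][e]:
                    best[i][e], back[i][e] = cand, d
    partition, e = [], u                       # backtracking from the last shot
    for i in range(r - 1, -1, -1):
        d = back[i][e]
        partition.append((e - d + 1, d))
        e -= d
    partition.reverse()
    return partition, best[r - 1][u]


shots = [0, 0, 0, 5, 5, 9, 9, 9, 9]            # hypothetical 1-D shot features
means, lams = [0, 5, 9], [3.0, 3.0, 3.0]       # hypothetical per-scene parameters


def log_seg(i, t, d):
    squared = sum((x - means[i]) ** 2 for x in shots[t:t + d])
    return log_poisson(d, lams[i]) - squared


partition, score = viterbi_partition(len(shots), 3, log_seg)
print(partition)    # [(1, 3), (4, 2), (6, 4)]
```

The table has r × u entries and each entry scans at most u durations, so the search is polynomial even though the number of whole partition sequences is astronomically large.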





Data sets

Two TV series

6 episodes from the American TV series “Friends”
5 episodes from the Chinese TV series “I Love My Family” (Family)

Data details (average per episode)

Length: 30 min
Role number: 10
Face number: 2 × 10^5
Shot number: 300






Face naming

Baselines

Face clustering

Unconstrained kernel K-means (KK)
Constrained K-means (CK)
Completely positive factorization (CP)
Constrained spectral learning (SL)

Face recognition

K-nearest neighbor (KNN)
Support vector machine (SVM)






Face naming

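Criteria

Face clustering

$$\mathrm{NMI} = \frac{\sum_{l}\sum_{h} n_{l,h} \log\left(\frac{n \cdot n_{l,h}}{n_l n_h}\right)}{\sqrt{\left(\sum_{l} n_l \log\frac{n_l}{n}\right)\left(\sum_{h} n_h \log\frac{n_h}{n}\right)}}$$

where n is the number of objects, n_l is the size of the l-th class in the ground truth, n_h is the size of the h-th cluster in the result, and n_{l,h} is the size of their intersection.

Face recognition

$$F_w = \sum_{i} w_i \cdot \frac{2 \times \mathrm{precision}_i \times \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i}$$

where w_i denotes the weight of the i-th role according to his/her spoken lines in the script.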
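Both criteria are straightforward to compute. The sketch below follows the two formulas as written, taking label lists for NMI and per-role dictionaries for the weighted F-measure; the input layout and the toy numbers are assumptions made for illustration.

```python
import math
from collections import Counter


def nmi(truth, result):
    """Normalized mutual information between ground-truth classes and result clusters."""
    n = len(truth)
    n_l, n_h = Counter(truth), Counter(result)
    n_lh = Counter(zip(truth, result))
    mutual = sum(c * math.log(n * c / (n_l[l] * n_h[h])) for (l, h), c in n_lh.items())
    ent_l = sum(c * math.log(c / n) for c in n_l.values())   # non-positive
    ent_h = sum(c * math.log(c / n) for c in n_h.values())   # non-positive
    denom = math.sqrt(ent_l * ent_h)
    return mutual / denom if denom > 0 else 0.0


def weighted_f(weights, precision, recall):
    """F_w = sum_i w_i * 2 * P_i * R_i / (P_i + R_i), skipping roles with P_i + R_i = 0."""
    total = 0.0
    for role, w in weights.items():
        p, r = precision[role], recall[role]
        if p + r > 0:
            total += w * 2 * p * r / (p + r)
    return total


print(round(nmi([0, 0, 1, 1], [1, 1, 0, 0]), 6))   # 1.0 (same grouping, renamed clusters)
print(round(nmi([0, 0, 1, 1], [0, 1, 0, 1]), 6))   # 0.0 (independent grouping)
print(round(weighted_f({"a": 0.75, "b": 0.25},
                       {"a": 1.0, "b": 0.5},
                       {"a": 0.5, "b": 0.5}), 6))  # 0.625
```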






Face naming

Face clustering

Constrained vs. unconstrained
Clustering number variation

[Figure: NMI score vs. cluster number (× times, from ×0.0 to ×5.0) for CK, KK, SSKK, SL and CP; panels: Friends, Family]






Face naming

Face recognition (naming)

Optimal recognition is achieved when the cluster number is approximately 2 times the character number

[Figure: A purifying rate, precision, recall and Fw-measure vs. cluster number (× times, from ×0.0 to ×5.0); panels: Friends, Family]






Face naming

Main character naming result

Accuracy
Robustness

[Figure: weighted F-measure vs. cluster number (× times, from ×0.0 to ×5.0) for the 1st, 2nd, 3rd and 4th main characters; panels: Friends, Family]






Face naming

Comparison with supervised methods

Comparable to supervised methods
Even better when the training set is limited

[Figure: weighted F-measure vs. training-test ratio (0.1 to 0.9) for KNN, SVM and TVParser (1st, 2nd and 3rd best); panels: Friends, Family]






Scene segmentation

Baselines

Scene segmentation methods (algorithms)

Shot similarity graph (SSG)
Dynamic time warping (DTW)
Hidden Markov model (HMM)






Scene segmentation

Criteria

Scene segmentation

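$$\rho = \left(\sum_{i=1}^{r} \frac{d_i}{u} \sum_{j=1}^{r} \frac{d_{ij}^2}{d_i^2}\right) \cdot \left(\sum_{j=1}^{r} \frac{d_j^*}{u} \sum_{i=1}^{r} \frac{d_{ij}^2}{d_j^{*2}}\right)$$

where d_ij is the length of the overlap between the scene segments p_i and p*_j, d_i is the length of the scene p_i, d*_j is the length of the ground-truth scene p*_j, and u is the total length of all scenes. This purity value ranges from 0 to 1, and the larger the value, the closer the segmentation is to the ground truth.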
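The purity score can be computed directly from the scene durations of the result and of the ground truth. A minimal sketch, assuming both segmentations are given as lists of positive durations (in shots) that cover the same u shots:

```python
def intervals(durations):
    """Convert durations into half-open (start, end) intervals on the shot axis."""
    out, start = [], 0
    for d in durations:
        out.append((start, start + d))
        start += d
    return out


def purity(result, truth):
    """Purity rho between a result segmentation and the ground truth (durations in shots)."""
    u = sum(result)
    if u != sum(truth):
        raise ValueError("both segmentations must cover the same number of shots")
    res, gt = intervals(result), intervals(truth)

    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))

    first = sum((d_i / u) * sum(overlap(p, q) ** 2 for q in gt) / d_i ** 2
                for p, d_i in zip(res, result))
    second = sum((d_j / u) * sum(overlap(p, q) ** 2 for p in res) / d_j ** 2
                 for q, d_j in zip(gt, truth))
    return first * second


print(round(purity([3, 2, 4], [3, 2, 4]), 6))   # 1.0
print(round(purity([2, 2], [1, 3]), 6))         # 0.5
```

In the second call the first factor is 0.75 and the second is 2/3, so moving a single boundary by one shot in a four-shot video already halves the score.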






Scene segmentation

Scene segmentation result

Segmentation   Sources        Purity scores
method         (video +)      Friends        Family
SSG            -              0.55 ± 0.11    0.53 ± 0.07
DTW            sub. + scr.    0.60 ± 0.13    -
HMM            scr.           0.59 ± 0.08    0.53 ± 0.05
TVParser       scr.           0.67 ± 0.07    0.58 ± 0.03






Scene segmentation

Scene segmentation result under various role histograms

Name histogram: first four characters are dominant
Face histogram: more clusters are generally better

[Figure: purity score over name histogram dimension and face histogram dimension; average purity vs. face histogram size; annotations: ↑0.12 (≈71%) and ↑0.05 (≈29%)]





Conclusion

We propose a generative model to formulate story plot development in TV videos, which solves face naming and scene segmentation in a unified framework.

Key novelties

Unsupervised face naming through model parameter learning
Globally optimal scene segmentation by hidden state inference
Fast algorithms for both parameter learning and state inference

Future work

Personalized applications, e.g. TV video synthesis
Generic cross-media analysis and association methods





Q & A

Thanks!

