TVParser: An Automatic TV Video Parsing Method

Chao Liang

National Laboratory of Pattern Recognition (NLPR)
Chinese Academy of Sciences, Institute of Automation (CASIA)

March 9, 2011


Outline

1 Introduction
    Motivation
    Related work

2 Our solution
    Basic ideas
    Role histogram

3 TVParser model
    Model formulation
    Parameter estimation
    State inference

4 Experimental Results
    Data sets
    Face naming
    Scene segmentation

5 Conclusion


Introduction

Motivation
Voluminous TV videos vs. efficient management


Introduction

TV video

Story plot (scene structure)

[Scene: Monica and Rachel's, Carol and Susan are showing off Ben to the gang.]

Phoebe: Oh my God, oh, ok, was that too much pressure for him?
Susan: Oh, is he hungry already?
Carol: I guess so. (Carol starts to breast feed Ben.)
… …

[Scene: Central Perk, the gang is all there.]

Julie: Rachel, do you have any muffins left?
Rachel: Yeah, I forget which ones.
Julie: Oh, you're busy, that's ok, I'll get it. Anybody else want one?
… …

Characters (named faces)

RACH MNCA PHBE JOY CHAN ROSS


Related work

Movie/Script alignment

Script-subtitle alignment

[Figure: alignment chain script -> subtitle -> movie]

Script:
[Scene: Rachel is entering the living room.]
Monica: Julie.
Rachel: What?!

Subtitle:
00:10:44,210 --> 00:10:45,177
Monica: Julie.
00:10:45,444 --> 00:10:46,775
Rachel: What?!

Disadvantages

Syntax and wording discrepancies between the script and the subtitle
Availability of the subtitle


Related work (cont.)

Face naming

Fully supervised
Weakly supervised

[Figure: (a) weakly supervised, illustrated by a script excerpt ([Scene: Rachel is entering the living room.] Monica: Julie. Rachel: What?!); (b) fully supervised, illustrated by faces labeled Monica and Rachel.]

Disadvantages

Expensive manual labels
Difficult to apply at large scale


Related work (cont.)

Scene segmentation

Content-based method
Script-guided method

[Figure: script-guided scene segmentation with an HMM. Hidden state sequence: scenes q1, q2, q3, q4 with transition probabilities a_{qi,qj}; observation sequence: shots 1 to 4 (t = 1, ..., 4) with emission probabilities b_{qi,shotj}.]

HMM: $\lambda = \{A, B, \pi\} = \{A(a_{q_i,q_j}),\ B(b_{q_i,\mathrm{shot}_j}),\ \pi\}$

Viterbi alignment: $Q = \{q_1, q_2, q_3, q_4, q_5, \ldots\}$

Disadvantages

Matching units are asymmetric
Latent geometric distribution of state durations


Our solution

Basic ideas
A generative TVParser model that aligns video and script by mining face-name correspondence.

[Figure: name histogram (rows JOEY, MNCA, RACH, CHAN) and face histogram over the shots S1 to S12, grouped as C1: {S1, ..., S4}, C2: {S6, ..., S8}, C3: {S10, ..., S12}.]

Advantages
Face names can be identified in an unsupervised way (learning)
Globally optimal scene segmentation can be inferred (inference)
Fast algorithms for both parameter learning and state inference


Role histogram

Basic idea
Bag-of-Words (BoW) representation
Role composition is a generic and semantic feature for both video (as face histogram) and script (as name histogram)

Name clustering

Face clustering
Difficulty: varying environmental conditions, e.g. pose
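As a rough illustration of the script side of the role histogram, the sketch below counts how many lines each role speaks in one script scene. Counting spoken lines is an assumption made here for illustration; the slides do not say exactly how the name histogram is built. The function name `name_histogram` and the role list are hypothetical.

```python
from collections import Counter


def name_histogram(scene_lines, roles):
    """Bag-of-words over speaker names: spoken-line count per role (assumed scheme)."""
    counts = Counter()
    for line in scene_lines:
        speaker, sep, _ = line.partition(":")
        if sep:  # ignore lines that have no "Speaker:" prefix
            counts[speaker.strip()] += 1
    # Fixed role order so histograms of different scenes are comparable.
    return [counts[role] for role in roles]


# Hypothetical role order; the dialogue is the Central Perk excerpt from the slides.
roles = ["Rachel", "Monica", "Julie"]
scene = [
    "Julie: Rachel, do you have any muffins left?",
    "Rachel: Yeah, I forget which ones.",
    "Julie: Oh, you're busy, that's ok, I'll get it. Anybody else want one?",
]
print(name_histogram(scene, roles))  # [1, 0, 2]
```

The face histogram would play the same role on the video side, with face cluster indices in place of names.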


Role histogram

Face clustering

Solution I: Semi-supervised kernel k-means clustering

Key points

Incorporate pairwise constraints (must-link and cannot-link)
Adopt manifold-manifold distance

[Figure: must-link and cannot-link constraints; manifold-manifold distance]


Role histogram

Face clustering
Solution II: Loose clustering number

Key points
Allowing purified substructures


Model formulation

Graphical TVParser model

[Figure: graphical model. Each script scene $s_i$ is linked to a video shot subsequence $v_{(i)}$ spanning shots $t_i$ to $t_i + d_i$, with partitions $p_{i-1} = (t_{i-1}, d_{i-1})$, $p_i = (t_i, d_i)$, $p_{i+1} = (t_{i+1}, d_{i+1})$.]

$\mathcal{S} = \{s_i \mid i = 1, \cdots, r\}$ is the observed script scene sequence;
$\mathcal{V} = \{v_j \mid j = 1, \cdots, u\}$ is the observed video shot sequence;
$\mathcal{P} = \{p_i = (t_i, d_i) \mid i = 1, \cdots, r\}$ is the hidden video scene partition sequence, where $t_1 = 1$, $\sum_i d_i = u$ and $t_i = t_{i-1} + d_{i-1}$ $(i > 1)$.
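The constraints on the partition sequence can be checked mechanically. The following is a minimal sketch, with hypothetical helper names, that builds (t_i, d_i) pairs from a list of durations and verifies t_1 = 1, contiguity, and that the durations sum to u. Requiring d_i >= 1 is an extra assumption; the slides do not state it.

```python
def partition_from_durations(durations):
    """Build p_i = (t_i, d_i) with t_1 = 1 and t_i = t_{i-1} + d_{i-1}."""
    partitions, start = [], 1
    for d in durations:
        partitions.append((start, d))
        start += d
    return partitions


def is_valid_partition(partitions, u):
    """Check t_1 = 1, contiguity, d_i >= 1 (assumed), and sum of d_i equal to u."""
    expected_start = 1
    for t, d in partitions:
        if t != expected_start or d < 1:
            return False
        expected_start = t + d
    return expected_start == u + 1


P = partition_from_durations([4, 3, 5])
print(P)                            # [(1, 4), (5, 3), (8, 5)]
print(is_valid_partition(P, u=12))  # True
```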


Model formulation

Complete TVParser model

$$P(\mathcal{V}, \mathcal{S}, \mathcal{P}) = P(s_1)\, P(p_1 \mid s_1)\, P(v_{(1)} \mid p_1, s_1) \times \prod_{i=2}^{r} P(s_i \mid s_{i-1})\, P(p_i \mid s_i)\, P(v_{(i)} \mid p_i, s_i)$$

The generative process

(1) Enter the $i$th script scene $s_i$ from its predecessor $s_{i-1}$;

(2) Decide $s_i$'s related partition $p_i = (t_i, d_i)$;

(3) Generate the corresponding video shot subsequence $v_{(i)} = v_{[t_i : t_i + d_i]}$, indexing from $t_i$ to $t_i + d_i$.


Model formulation

Additional constraint

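$P(s_1) = 1 \Leftrightarrow s_1 = 1$
$P(s_i \mid s_{i-1}) = 1 \Leftrightarrow s_i = i,\ s_{i-1} = i - 1$

Simplified TVParser model

$$P(\mathcal{V}, \mathcal{S}, \mathcal{P}) = \prod_{i=1}^{r} \underbrace{P(p_i \mid s_i)}_{\text{duration}}\ \underbrace{P(v_{(i)} \mid p_i, s_i)}_{\text{observation}}$$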


Model formulation

Scene duration probability

Poisson distribution

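$$P(p_i \mid s_i; \lambda_i) = \frac{\lambda_i^{d_i} e^{-\lambda_i}}{d_i!} = e^{-\lambda_i} \cdot \frac{\lambda_i^{d_i}}{d_i!}$$

Reasons

Poisson is a plausible model of state duration;
The model parameter, $\lambda = \{\lambda_i\}$, is the expected duration of the scenes;
The parameter can be estimated by the maximum likelihood method.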
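The Poisson duration term is easy to evaluate in log space. The sketch below is only an illustration: it computes the log probability of a duration and the textbook maximum likelihood estimate (the sample mean) from fully observed durations. The sample durations are made up, and in the model itself each scene has its own $\lambda_i$ and the durations are hidden, so this is not the estimation procedure used in the slides.

```python
import math


def poisson_log_pmf(d, lam):
    """log P(d; lam) = d * log(lam) - lam - log(d!)."""
    return d * math.log(lam) - lam - math.lgamma(d + 1)


def poisson_mle(durations):
    """Maximum likelihood estimate of lambda for observed durations: the sample mean."""
    return sum(durations) / len(durations)


# Hypothetical scene lengths, measured in shots.
durations = [18, 22, 20, 25, 15]
lam = poisson_mle(durations)
print(lam)  # 20.0
print(math.exp(poisson_log_pmf(20, lam)))  # probability of a 20-shot scene
```

Working with log probabilities avoids underflow when many scene terms are multiplied together.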


Model formulation

Observation probability

Gaussian distribution

$$P(v_{(i)} \mid p_i, s_i; A, \sigma_i) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left\{ -\frac{(s_i - A v_{(i)})^\top (s_i - A v_{(i)})}{2\sigma_i^2} \right\}$$

Meaning of parameter A

$A = [A_{ij}] \in \mathbb{R}^{M \times N}$ is the face-name relation matrix that associates $M$ names with $N$ face clusters. By constraining the entries of $A$ so that $A_{ij} \ge 0$ and $\sum_i A_{ij} = 1$, we can treat each column as an identity distribution of the corresponding face cluster.
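To make the observation term concrete, here is a small pure-Python sketch that evaluates the log of the Gaussian observation probability and checks the column constraints on A. The toy matrix and histograms are invented for illustration (M = 2 names, N = 3 face clusters), and the helper names are hypothetical.

```python
import math


def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def columns_are_distributions(A, tol=1e-9):
    """Check A_ij >= 0 and that every column of A sums to 1."""
    nonneg = all(a >= 0 for row in A for a in row)
    sums_ok = all(abs(sum(row[j] for row in A) - 1.0) <= tol
                  for j in range(len(A[0])))
    return nonneg and sums_ok


def observation_log_prob(s, v, A, sigma2):
    """log of 1/sqrt(2*pi*sigma2) * exp(-||s - A v||^2 / (2*sigma2))."""
    residual = [si - pi for si, pi in zip(s, mat_vec(A, v))]
    sq_err = sum(e * e for e in residual)
    return -0.5 * math.log(2 * math.pi * sigma2) - sq_err / (2 * sigma2)


# Toy example: face cluster 2 is split evenly between the two names.
A = [[1.0, 0.5, 0.0],
     [0.0, 0.5, 1.0]]
v = [2, 2, 1]  # face histogram of a candidate video scene
s = [3, 2]     # name histogram of the script scene
print(columns_are_distributions(A))          # True
print(observation_log_prob(s, v, A, 1.0))    # residual is zero here
```

In this toy case A v equals s exactly, so only the normalization constant remains in the log probability.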


Parameter estimation

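Model parameters $\Psi = \{\{\lambda_i\}, \{\sigma_i^2\}, A\}$

Maximum likelihood estimation (MLE)

$$\max_{\hat{\Psi}} \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot \log P(\mathcal{V}, \mathcal{S}, \mathcal{P}; \hat{\Psi}) \quad \text{s.t.} \quad \mathbf{1}_M^\top A = \mathbf{1}_N^\top,\ A \ge 0$$

Optimization problem

For $\{\lambda_i\}$ and $\{\sigma_i\}$: unconstrained optimization
For $A$: constrained optimization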


Parameter estimation

Re-estimation for {λi}

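$$\lambda_i = \frac{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot d_i}{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)}$$

Re-estimation for $\{\sigma_i\}$

$$\sigma_i^2 = \frac{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi) \cdot (s_i - A v_{(i)})^\top (s_i - A v_{(i)})}{\sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)}$$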
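Both re-estimation formulas are posterior-weighted averages over the candidate partitions of one scene. The sketch below assumes the posteriors P(p_i | V, S; Ψ) and the squared errors for each candidate (t_i, d_i) are already available as dictionaries; the numbers and function names are hypothetical.

```python
def reestimate_lambda(posteriors):
    """Posterior-weighted mean duration; posteriors maps (t_i, d_i) -> weight."""
    total = sum(posteriors.values())
    return sum(w * d for (_, d), w in posteriors.items()) / total


def reestimate_sigma2(posteriors, sq_errors):
    """Posterior-weighted mean of the squared error ||s_i - A v_(i)||^2."""
    total = sum(posteriors.values())
    return sum(w * sq_errors[p] for p, w in posteriors.items()) / total


# Hypothetical posteriors for three candidate partitions of one scene.
post = {(1, 3): 0.2, (1, 4): 0.5, (1, 5): 0.3}
err = {(1, 3): 4.0, (1, 4): 1.0, (1, 5): 2.0}
print(reestimate_lambda(post))       # close to 4.1
print(reestimate_sigma2(post, err))  # close to 1.9
```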


Parameter estimation

Re-estimation for A

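$$A_{ij} \leftarrow A_{ij} \sqrt{\frac{(W - \mathbf{1}_M \boldsymbol{\eta}^\top)^{+}_{ij}}{2(AU)_{ij} + (W - \mathbf{1}_M \boldsymbol{\eta}^\top)^{-}_{ij}}}$$

where

$$W = \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} \frac{1}{\sigma_i^2}\, s_i v_{(i)}^\top$$

$$U = \sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} \frac{1}{2\sigma_i^2}\, v_{(i)} v_{(i)}^\top$$

$$\boldsymbol{\eta}^\top = \frac{1}{M} \cdot (\mathbf{1}_M^\top W - 2\, \mathbf{1}_N^\top U)$$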


Parameter estimation

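Summation in both W and U: $\sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi)$

Sums over the whole space of possible partition sequences
Typical example: r = 15 (scenes) and u = 300 (shots); the number of possible segmentations is then $C_{299}^{15} \approx O(10^{24})$ (intractable!)

Solution: Sequence ⇒ segments

$$\sum_{\mathcal{P}} P(\mathcal{P} \mid \mathcal{V}, \mathcal{S}; \Psi) \sum_{i=1}^{r} (\cdot) = \sum_{i=1}^{r} \sum_{p_i} P(p_i \mid \mathcal{V}, \mathcal{S}; \Psi)\, (\cdot)$$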
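The gap between the two summations can be checked with a few lines of arithmetic. This sketch follows the model's definitions (r scenes, u shots), evaluates the binomial coefficient quoted on the slide, and compares it with a simple upper bound on the number of per-scene segments (t_i, d_i). The bound u(u + 1)/2 per scene is my own count of start/length pairs that fit inside u shots, not a figure from the slides.

```python
import math

r, u = 15, 300  # scenes, shots (typical episode from the slides)

# Whole partition sequences: the binomial coefficient quoted on the slide.
sequence_count = math.comb(u - 1, r)
print(f"{sequence_count:.2e}")

# Per-scene segments: pairs (t, d) with t >= 1, d >= 1 and t + d - 1 <= u.
segments_per_scene = u * (u + 1) // 2
print(r * segments_per_scene)  # 677250
```

Summing over individual segments is what makes the forward-backward computation of the posteriors feasible.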


Parameter estimation

Posterior probability P(pi |V,S; Ψ)

Forward-backward algorithm

Forward-backward variables

$$\alpha_{p_i}(s_i) \triangleq P(s_i, p_i, v_{[1:t_i+d_i]}; \Psi)$$
$$\beta_{p_i}(s_i) \triangleq P(v_{[t_i+d_i+1:u]} \mid s_i, p_i; \Psi)$$

Forward-backward recursion
Initial conditions


State inference

Hidden partition sequence $\mathcal{P}^*$
Viterbi algorithm

Local optimum

$$\delta_\tau(s_i; \theta) \triangleq \max_{p_{[1:i-1]}} P(p_{[1:i-1]}, s_{[1:i-1]}, \tau \in q_i, o_{[1:\tau]}; \theta)$$

Forward recursion
Backtracking
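The forward recursion and backtracking can be illustrated with a generic segmental dynamic program over the simplified model (Poisson duration plus Gaussian observation). This is a sketch under several assumptions that the slides do not confirm: a scene's face histogram is the sum of its shots' histograms, every scene gets at least one shot, and segments are stored half-open internally. It is not necessarily the authors' exact recursion, and all names are hypothetical.

```python
import math


def log_poisson(d, lam):
    return d * math.log(lam) - lam - math.lgamma(d + 1)


def log_gauss(s, pred, sigma2):
    sq = sum((a - b) ** 2 for a, b in zip(s, pred))
    return -0.5 * math.log(2 * math.pi * sigma2) - sq / (2 * sigma2)


def viterbi_segment(scene_hists, shot_hists, A, lams, sigma2s):
    """Return the best partition as 1-based (t_i, d_i) pairs."""
    r, u = len(scene_hists), len(shot_hists)
    # Prefix sums: a segment's face histogram is a difference of two rows.
    prefix = [[0.0] * len(shot_hists[0])]
    for h in shot_hists:
        prefix.append([p + x for p, x in zip(prefix[-1], h)])

    def seg_score(i, start, end):  # scene i covers shots start..end-1 (0-based)
        v = [b - a for a, b in zip(prefix[start], prefix[end])]
        pred = [sum(a * x for a, x in zip(row, v)) for row in A]
        return (log_poisson(end - start, lams[i])
                + log_gauss(scene_hists[i], pred, sigma2s[i]))

    neg_inf = float("-inf")
    best = [[neg_inf] * (u + 1) for _ in range(r + 1)]
    back = [[0] * (u + 1) for _ in range(r + 1)]
    best[0][0] = 0.0
    # Forward recursion: best[i][end] = best score with scenes 1..i covering `end` shots.
    for i in range(1, r + 1):
        for end in range(i, u - (r - i) + 1):
            for start in range(i - 1, end):
                if best[i - 1][start] == neg_inf:
                    continue
                score = best[i - 1][start] + seg_score(i - 1, start, end)
                if score > best[i][end]:
                    best[i][end], back[i][end] = score, start
    # Backtracking from the last scene, which must end at shot u.
    parts, end = [], u
    for i in range(r, 0, -1):
        start = back[i][end]
        parts.append((start + 1, end - start))
        end = start
    return parts[::-1]


# Toy data: 2 names, 2 face clusters, A is the identity.
A = [[1.0, 0.0], [0.0, 1.0]]
shots = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
scenes = [[3, 0], [0, 2]]
print(viterbi_segment(scenes, shots, A, lams=[3.0, 2.0], sigma2s=[1.0, 1.0]))
# [(1, 3), (4, 2)]
```

The triple loop costs O(r u^2) segment evaluations, in contrast to enumerating every partition sequence.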


Data sets

Two TV series

6 episodes from the American TV series “Friends”
5 episodes from the Chinese TV series “I Love My Family” (Family)

Data details (average per episode)

Length: 30 min
Role number: 10
Face number: $2 \times 10^5$
Shot number: 300


Face naming

Baselines

Face clustering

Unconstrained kernel K-means (KK)
Constrained K-means (CK)
Completely positive factorization (CP)
Constrained spectral learning (SL)

Face recognition

K nearest neighbor (KNN)
Support vector machine (SVM)


Face naming

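Criteria

Face clustering

$$\mathrm{NMI} = \frac{\sum_{l} \sum_{h} n_{l,h} \log\left( \frac{n \cdot n_{l,h}}{n_l n_h} \right)}{\sqrt{\left( \sum_{l} n_l \log \frac{n_l}{n} \right) \left( \sum_{h} n_h \log \frac{n_h}{n} \right)}}$$

where $n$ is the number of objects, $n_l$ is the size of the $l$th class in the groundtruth, $n_h$ is the size of the $h$th cluster in the result, and $n_{l,h}$ is the size of their intersection.

Face recognition

$$F_w = \sum_{i} w_i \cdot \frac{2 \times \mathrm{precision}_i \times \mathrm{recall}_i}{\mathrm{precision}_i + \mathrm{recall}_i}$$

where $w_i$ denotes the weight of the $i$th role according to his/her spoken lines in the script.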
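Both criteria can be computed directly from label lists. The sketch below is a plain-Python reading of the two formulas; it assumes at least two groundtruth classes and two clusters (otherwise the NMI denominator is zero) and that the role weights already sum to 1, which the slides do not state. The function names are hypothetical.

```python
import math
from collections import Counter


def nmi(truth, pred):
    """Normalized mutual information between groundtruth classes and clusters."""
    n = len(truth)
    n_l = Counter(truth)             # class sizes
    n_h = Counter(pred)              # cluster sizes
    n_lh = Counter(zip(truth, pred)) # intersection sizes (non-empty ones only)
    num = sum(c * math.log(n * c / (n_l[l] * n_h[h]))
              for (l, h), c in n_lh.items())
    den_l = sum(c * math.log(c / n) for c in n_l.values())
    den_h = sum(c * math.log(c / n) for c in n_h.values())
    return num / math.sqrt(den_l * den_h)


def weighted_f(weights, precisions, recalls):
    """Weighted sum of per-role F-measures; weights assumed to sum to 1."""
    total = 0.0
    for w, p, rec in zip(weights, precisions, recalls):
        if p + rec > 0:  # treat F as 0 when precision and recall are both 0
            total += w * 2 * p * rec / (p + rec)
    return total


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))                 # same grouping, labels swapped
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))                 # clusters independent of classes
print(weighted_f([0.5, 0.5], [1.0, 0.5], [1.0, 0.5]))  # 0.75
```

NMI is invariant to how clusters are numbered, which is why the first call scores a relabeled but otherwise perfect clustering as (numerically) 1.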


Face naming

Face clustering
Constrained vs. unconstrained
Clustering number variance

[Figure: NMI score vs. cluster number (x times, from ×0.0 to ×5.0) for CK, KK, SSKK, SL and CP. Panels: Friends, Family.]


Face naming

Face recognition (naming)
Optimal recognition is achieved when the number of clusters is about twice the number of characters

[Figure: A purifying rate, precision, recall and Fw-measure vs. cluster number (x times, from ×0.0 to ×5.0). Panels: Friends, Family.]


Face naming

Main character naming result
Accuracy
Robustness

[Figure: weighted F-measure vs. cluster number (x times, from ×0.0 to ×5.0) for the 1st, 2nd, 3rd and 4th main characters. Panels: Friends, Family.]


Face naming

Comparison with supervised methods
Comparable to supervised methods
Even better when the training set is limited

[Figure: weighted F-measure vs. training-test ratio (0.1 to 0.9) for KNN, SVM and TVParser (1st, 2nd and 3rd best). Panels: Friends, Family.]


Scene segmentation

Baselines
Scene segmentation methods (algorithms)

Shot similarity graph (SSG)
Dynamic time warping (DTW)
Hidden Markov model (HMM)


Scene segmentation

Criteria

Scene segmentation

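$$\rho = \left( \sum_{i=1}^{r} \frac{d_i}{u} \sum_{j=1}^{r} \frac{d_{ij}^2}{d_i^2} \right) \cdot \left( \sum_{j=1}^{r} \frac{d_j^*}{u} \sum_{i=1}^{r} \frac{d_{ij}^2}{d_j^{*2}} \right)$$

where $d_{ij}$ is the length of the overlap between the scene segments $p_i$ and $p_j^*$, $d_i$ is the length of the scene $p_i$, $d_j^*$ is the length of the groundtruth scene $p_j^*$, and $u$ is the total length of all scenes. The purity value ranges from 0 to 1; the larger the value, the closer the segmentation is to the groundtruth.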
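The purity score can be computed from two lists of scene durations that cover the same u shots. This is a minimal sketch of the formula with hypothetical function names; it assumes all durations are positive and represents each scene as a half-open interval of shot indices.

```python
def boundaries(durations):
    """Turn a list of durations into half-open (start, end) intervals."""
    out, start = [], 0
    for d in durations:
        out.append((start, start + d))
        start += d
    return out


def purity(result_durations, truth_durations):
    """Product of the two overlap-weighted sums in the purity formula."""
    u = sum(result_durations)
    if u != sum(truth_durations):
        raise ValueError("both segmentations must cover the same number of shots")
    res, gt = boundaries(result_durations), boundaries(truth_durations)
    # overlap[i][j] = d_ij, the overlap between result scene i and groundtruth scene j
    overlap = [[max(0, min(e1, e2) - max(s1, s2)) for (s2, e2) in gt]
               for (s1, e1) in res]
    first = sum((e - s) / u * sum(o * o for o in row) / (e - s) ** 2
                for (s, e), row in zip(res, overlap))
    second = sum((e - s) / u
                 * sum(overlap[i][j] ** 2 for i in range(len(res))) / (e - s) ** 2
                 for j, (s, e) in enumerate(gt))
    return first * second


print(purity([3, 2], [3, 2]))  # identical segmentations, close to 1.0
print(purity([4, 1], [3, 2]))  # one boundary off by a shot, close to 0.56
```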


Scene segmentation

Scene segmentation result

Segmentation   Sources        Purity scores
methods        (video +)      Friends        Family

SSG            -              0.55 ± 0.11    0.53 ± 0.07
DTW            sub. + scr.    0.60 ± 0.13    -
HMM            scr.           0.59 ± 0.08    0.53 ± 0.05
TVParser       scr.           0.67 ± 0.07    0.58 ± 0.03


Scene segmentation

Scene segmentation result under various role histograms
Name histogram: first four characters are dominant
Face histogram: more clusters are generally better

[Figure: purity score over face histogram dimension and name histogram dimension; average purity vs. face histogram size. Annotations: ↑0.12 (≈71%), ↑0.05 (≈29%).]


Conclusion

We propose a generative model to formulate story plot development in TV videos, which solves face naming and scene segmentation in a unified framework.

Key novelties

Unsupervised face naming through model parameter learning
Globally optimal scene segmentation by hidden state inference
Fast algorithms for both parameter learning and state inference

Future work

Personalized applications, e.g. TV video synthesis;
Generic cross-media analysis and association methods.


Q & A

Thanks!
