Visual Event Recognition in Videos by Learning from Web Data Lixin Duan, Dong Xu, Ivor Tsang, Jiebo Luo ¶ Nanyang Technological University, Singapore ¶

Visual Event Recognition in Videos by Learning from Web Data

Lixin Duan†, Dong Xu†, Ivor Tsang†, Jiebo Luo¶

† Nanyang Technological University, Singapore

¶ Kodak Research Labs, Rochester, NY, USA

Outline

• Overview of the Event Recognition System

• Similarity between Videos

– Aligned Space-Time Pyramid Matching

• Cross-Domain Problem

– Adaptive Multiple Kernel Learning

• Experiments

• Conclusion

Overview

• GOAL: Recognize consumer videos

• Large intra-class variability; limited labeled videos

[Figure: example consumer videos for the events "Sports", "Picnic", and "Wedding"]

• GOAL: Recognize consumer videos by leveraging a large number of loosely labeled web videos (e.g., from YouTube)


Overview

[Figure: consumer videos alongside a large number of web videos]

Overview

• Flowchart of the system

[Figure: flowchart with the labels "Video Database", "Test video", "Classifier", and "Output"]

Similarity between Videos

• Pyramid matching methods

– Temporally aligned pyramid matching, D. Xu and S.-F. Chang [1]

– Unaligned space-time pyramid matching, I. Laptev [2]

[Figure: panels labeled "Time axis", "Space axes", and "Space-time axes"]


• Aligned Space-Time Pyramid Matching

– Each video is divided into $R$ non-overlapped space-time volumes at a given pyramid level.

– Greater variability

• Two-step approach

– Distances between space-time volumes: solved by existing methods such as the bag-of-words model, I. Laptev [2]


• Aligned Space-Time Pyramid Matching

– Level 1

[Figure: distances between the space-time volumes of videos $V_i$ and $V_j$]

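• Integer-flow Earth Mover’s Distance (EMD), Y. Rubner [3]

$$\hat{F}_{rc} = \arg\min_{F_{rc} \in \{0,1\}} \sum_{r=1}^{R} \sum_{c=1}^{R} F_{rc} D_{rc} \quad \text{s.t.} \quad \sum_{c=1}^{R} F_{rc} = 1,\ \forall r; \quad \sum_{r=1}^{R} F_{rc} = 1,\ \forall c.$$

$$D(V_i, V_j) = \frac{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc} D_{rc}}{\sum_{r=1}^{R} \sum_{c=1}^{R} \hat{F}_{rc}}$$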


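With binary flows and the two sum-to-one constraints, the integer-flow EMD reduces to a one-to-one assignment between the volumes of $V_i$ and the volumes of $V_j$. The following is a minimal sketch under that reading, not the authors' implementation: the function name `integer_flow_emd` and the toy distance matrix are hypothetical, and the exhaustive search over permutations is only practical for small $R$ (a polynomial-time assignment solver such as the Hungarian algorithm would be the usual choice for larger problems).

```python
from itertools import permutations


def integer_flow_emd(dist):
    """Return (distance, matching) for a square volume-to-volume distance matrix.

    dist[r][c] is the distance D_rc between volume r of V_i and volume c of V_j.
    matching[r] is the column c matched to row r, i.e. F_rc = 1.
    """
    n = len(dist)
    best_cost, best_perm = None, None
    # Each permutation is one feasible binary flow: every row and every
    # column carries exactly one unit of flow.
    for perm in permutations(range(n)):
        cost = sum(dist[r][perm[r]] for r in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_perm = cost, perm
    # The total flow is n, so the normalised distance is the mean matched cost.
    return best_cost / n, best_perm


# Toy 3x3 example (hypothetical values).
toy = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
distance, matching = integer_flow_emd(toy)
```

In this toy case the cheapest matching pairs volume 0 with 1, volume 1 with 0, and volume 2 with 2, which shows that the aligned distance can match volumes that sit at different space-time positions.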

Cross-Domain Problem

• Data distribution mismatch between consumer videos and web videos

– Consumer videos: Naturally captured

– Web videos: Edited; Selected

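• Maximum Mean Discrepancy (MMD), K. M. Borgwardt [4]

$$\mathrm{DIST}_k(D^A, D^T) = \left\| \frac{1}{n_A} \sum_{i=1}^{n_A} \varphi(x_i^A) - \frac{1}{n_T} \sum_{i=1}^{n_T} \varphi(x_i^T) \right\|_{\mathcal{H}}$$

$$\Rightarrow \mathrm{DIST}_k^2(D^A, D^T) = \mathrm{tr}(KS)$$

where $K$ is the kernel matrix over the samples of both domains, $S = ss'$, and $s = [1/n_A, \ldots, 1/n_A, -1/n_T, \ldots, -1/n_T]'$.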
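The trace form of the squared MMD can be checked numerically. The sketch below assumes the standard rank-one construction $S = ss'$, with $s$ holding $1/n_A$ for the $n_A$ samples of $D^A$ and $-1/n_T$ for the $n_T$ samples of $D^T$, so that $\mathrm{tr}(KS) = s'Ks$. The Gaussian kernel on raw feature vectors, the parameter `gamma`, and the function names are illustrative assumptions, not taken from the slides.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Illustrative kernel on feature vectors; gamma is a hypothetical bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(first_domain, second_domain, kernel=gaussian_kernel):
    """Squared MMD between two sample sets, computed as tr(KS) = s' K s."""
    samples = list(first_domain) + list(second_domain)
    n_a, n_t = len(first_domain), len(second_domain)
    # s weights the first-domain samples by 1/n_A and the others by -1/n_T.
    s = [1.0 / n_a] * n_a + [-1.0 / n_t] * n_t
    n = len(samples)
    K = [[kernel(u, v) for v in samples] for u in samples]
    return sum(s[i] * K[i][j] * s[j] for i in range(n) for j in range(n))


# Hypothetical 2-D samples: identical sets give zero, shifted sets do not.
web = [(0.0, 0.0), (1.0, 0.0)]
consumer_same = [(0.0, 0.0), (1.0, 0.0)]
consumer_shifted = [(5.0, 5.0), (6.0, 5.0)]
same_value = mmd_squared(web, consumer_same)
shifted_value = mmd_squared(web, consumer_shifted)
```

A larger value indicates a larger mismatch between the two sample distributions under the chosen kernel.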


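• Suppose there are $P$ pre-learned classifiers $\{f_p(x)\}_{p=1}^{P}$

• Each $f_p(x)$ is learned by SVM with the labeled training data from both domains

• Proposed target decision function

$$f^T(x) = \sum_{p=1}^{P} \beta_p f_p(x) + \Delta f(x)$$

where $\beta_p$ is the linear combination coefficient and $\Delta f(x)$ is the perturbation function. The first term, built from the pre-learned classifiers, carries the prior information.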
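The structure of the target decision function, a weighted sum of pre-learned classifiers plus a perturbation term, can be sketched in a few lines. Everything here is a toy assumption for illustration: the name `target_decision`, the two stand-in classifiers, the coefficient values, and the constant perturbation are hypothetical, whereas in the slides the coefficients and the perturbation are learned.

```python
def target_decision(x, prior_classifiers, betas, perturbation):
    """Evaluate f_T(x) = sum_p beta_p * f_p(x) + delta_f(x)."""
    prior = sum(beta * f(x) for beta, f in zip(betas, prior_classifiers))
    return prior + perturbation(x)


# Hypothetical stand-ins for pre-learned classifiers and the perturbation.
f_1 = lambda x: x[0]
f_2 = lambda x: -x[1]
delta_f = lambda x: 0.1

score = target_decision((2.0, 4.0), [f_1, f_2], [0.5, 0.25], delta_f)
label = 1 if score >= 0 else -1
```

The sign of the score gives the predicted label, as with a standard SVM decision function.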


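• Motivated by Multiple Kernel Learning (MKL) (F. Bach [5]), the perturbation function is

$$\Delta f(\mathbf{x}) = \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(\mathbf{x}) + b$$

where $d_m$ is the combination coefficient of the $m$-th base kernel and $\mathbf{d} = [d_1, \ldots, d_M]'$.

• MKL: the kernel is a linear combination of base kernels, $k = \sum_{m=1}^{M} d_m k_m$

• MMD:

$$\Omega(\mathbf{d}) := \mathrm{DIST}_k^2(D^A, D^T) = \mathrm{tr}(KS) = \mathbf{h}'\mathbf{d}$$

where $K = \sum_{m=1}^{M} d_m K_m$ and $\mathbf{h} = [\mathrm{tr}(K_1 S), \ldots, \mathrm{tr}(K_M S)]'$.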
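Because the trace is linear, the MMD of a combined kernel $K = \sum_m d_m K_m$ is a linear function of the coefficients. The sketch below checks this numerically, assuming $S = ss'$ so that $\mathrm{tr}(K_m S) = s'K_m s$. The function names and the two small kernel matrices are hypothetical.

```python
def trace_ks(K, s):
    """tr(K S) with S = s s', which equals s' K s."""
    n = len(s)
    return sum(s[i] * K[i][j] * s[j] for i in range(n) for j in range(n))


def mmd_linear_in_d(base_kernels, s, d):
    """Return (h, omega) with h_m = tr(K_m S) and omega = h' d."""
    h = [trace_ks(K_m, s) for K_m in base_kernels]
    omega = sum(h_m * d_m for h_m, d_m in zip(h, d))
    return h, omega


def combine_kernels(base_kernels, d):
    """Weighted sum of kernel matrices, K = sum_m d_m K_m."""
    n = len(base_kernels[0])
    return [[sum(d_m * K_m[i][j] for d_m, K_m in zip(d, base_kernels))
             for j in range(n)] for i in range(n)]


# Hypothetical example: one sample per domain, two base kernels.
K_1 = [[1.0, 0.5], [0.5, 1.0]]
K_2 = [[1.0, 0.2], [0.2, 1.0]]
s = [1.0, -1.0]
d = [0.3, 0.7]

h, omega = mmd_linear_in_d([K_1, K_2], s, d)
direct = trace_ks(combine_kernels([K_1, K_2], d), s)
# omega and direct agree up to floating-point rounding.
```

This is why the MMD term can be precomputed once as the vector $\mathbf{h}$ and reused whenever $\mathbf{d}$ changes.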


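• Adaptive Multiple Kernel Learning (A-MKL)

$$\min_{\mathbf{d} \in \mathcal{D}} G(\mathbf{d}) = \frac{1}{2} \Omega^2(\mathbf{d}) + \theta \cdot J(\mathbf{d})$$

where

$$J(\mathbf{d}) = \min_{\mathbf{w}_m, \boldsymbol{\beta}, b, \xi_i} \frac{1}{2} \left( \sum_{m=1}^{M} d_m \|\mathbf{w}_m\|^2 + \lambda \|\boldsymbol{\beta}\|^2 \right) + C \sum_{i=1}^{n} \xi_i$$

$$\text{s.t.} \quad y_i \left( \sum_{p=1}^{P} \beta_p f_p(\mathbf{x}_i) + \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(\mathbf{x}_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

Here $\frac{1}{2}\Omega^2(\mathbf{d})$ is the MMD term and $J(\mathbf{d})$ is the structural risk functional.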


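• Dual form of $J(\mathbf{d})$

$$\max_{\boldsymbol{\alpha}} \ \boldsymbol{\alpha}'\mathbf{1} - \frac{1}{2} (\boldsymbol{\alpha} \circ \mathbf{y})' \left( \sum_{m=1}^{M} d_m \tilde{\mathbf{K}}_m \right) (\boldsymbol{\alpha} \circ \mathbf{y})$$

$$\text{s.t.} \quad \boldsymbol{\alpha}'\mathbf{y} = 0, \quad \mathbf{0} \leq \boldsymbol{\alpha} \leq C\mathbf{1}$$

• A-MKL algorithm

– Iteratively solve the linear coefficients $\mathbf{d}$ and the dual variables $\boldsymbol{\alpha}$ in the dual form of $J(\mathbf{d})$.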


• Feature Replication (FR), H. Daumé III [6]

– Augment features

• Domain Transfer SVM (DTSVM), L. Duan [7]

– No prior information

• Adaptive SVM (A-SVM), J. Yang [8]

– $\beta_p$ is pre-defined

– $\Delta f(x)$ is modeled by SVM

Experiments

• Data set

– 195 consumer videos and 906 web videos, collected by ourselves and from the Kodak Consumer Video Benchmark Data Set

– 6 events: “wedding”, “birthday”, “picnic”, “parade”, “show” and “sports”

– Training data: 3 videos per event from the consumer videos, plus all web videos

– Test data: the remaining consumer videos

Experiments

• Two types of features

– Space-time (ST) feature, Laptev et al. [2]

– SIFT feature, Lowe [9]

• Four types of base kernels

– Gaussian

– Laplacian

– Inverse Square Distance

– Inverse Distance
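The kernel formulas did not survive on this slide, so the sketch below uses common distance-based forms of the four named kernel types as an assumption, not as the definitions the authors used. Here `d` stands for a distance between two videos, such as $D(V_i, V_j)$, and `gamma` is a hypothetical kernel parameter.

```python
import math


def gaussian(d, gamma):
    """Assumed form: exp(-gamma * d^2)."""
    return math.exp(-gamma * d * d)


def laplacian(d, gamma):
    """Assumed form: exp(-sqrt(gamma) * d)."""
    return math.exp(-math.sqrt(gamma) * d)


def inverse_square_distance(d, gamma):
    """Assumed form: 1 / (gamma * d^2 + 1)."""
    return 1.0 / (gamma * d * d + 1.0)


def inverse_distance(d, gamma):
    """Assumed form: 1 / (sqrt(gamma) * d + 1)."""
    return 1.0 / (math.sqrt(gamma) * d + 1.0)


KERNELS = {
    "gaussian": gaussian,
    "laplacian": laplacian,
    "isd": inverse_square_distance,
    "id": inverse_distance,
}

# Kernel values for one hypothetical distance and parameter setting.
values = {name: k(0.5, 2.0) for name, k in KERNELS.items()}
```

Under these assumed forms every kernel equals 1 at zero distance and decreases as the distance grows, so each one turns a video-to-video distance into a similarity.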

Experiments

• Aligned Space-Time Pyramid Matching (ASTPM) vs. Unaligned Space-Time Pyramid Matching (USTPM)

– ASTPM is better than USTPM at Level 1

[Figure: comparison with the labels "Aligned" and "Unaligned"]

Experiments

• 80 base kernels in total: 2 pyramid levels, 2 types of features, 5 kernel parameters and 4 types of kernels

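• Average classifiers at each pyramid level

– SIFT features: 20 base classifiers learned by SVM

– ST features: 20 base classifiers learned by SVM

– Pre-learned classifiers $f_p(\mathbf{x})$: 4 average classifiers

$$f^T(\mathbf{x}) = \sum_{p=1}^{P} \beta_p f_p(\mathbf{x}) + \sum_{m=1}^{M} d_m \mathbf{w}_m' \varphi_m(\mathbf{x}) + b$$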
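The kernel and classifier counts on this slide follow from a simple product: 2 pyramid levels × 2 feature types × 4 kernel types × 5 kernel parameters = 80 base kernels, and grouping them by (level, feature) gives 4 groups of 20. The sketch below enumerates these combinations. The level labels, kernel names, and parameter indices are placeholder assumptions, since the slides give only the counts.

```python
from itertools import product

# Placeholder labels; only the counts come from the slides.
PYRAMID_LEVELS = (0, 1)
FEATURES = ("SIFT", "ST")
KERNEL_TYPES = ("gaussian", "laplacian", "isd", "id")
KERNEL_PARAM_INDICES = range(5)

base_kernels = list(
    product(PYRAMID_LEVELS, FEATURES, KERNEL_TYPES, KERNEL_PARAM_INDICES)
)

# Group the base kernels by (level, feature); each group would feed one
# average classifier built from the base classifiers in that group.
groups = {}
for level, feature, kernel_type, param_index in base_kernels:
    groups.setdefault((level, feature), []).append((kernel_type, param_index))

num_base_kernels = len(base_kernels)
num_average_classifiers = len(groups)
group_sizes = sorted(len(members) for members in groups.values())
```

Each group corresponds to one pre-learned average classifier, which matches the 4 average classifiers listed above.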

Experiments

• Comparisons of cross-domain learning methods

– (a) SIFT features

– (b) ST features

– (c) SIFT features and ST features

– “parade”: 75.7% (A-MKL) vs. 62.2% (FR)

Experiments

• Comparisons of cross-domain learning methods

• Relative improvements

– SVM_T: 36.9%

– SVM_AT: 8.6%

– Feature Replication (FR) [6]: 7.6%

– Adaptive SVM (A-SVM) [8]: 49.6%

– Domain Transfer SVM (DTSVM) [7]: 9.9%

• MKL-based methods

– Better fuse SIFT features and ST features

– Handle noise in the loose labels

Conclusion

• We propose a new event recognition framework for consumer videos by leveraging a large number of loosely labeled web videos.

• We develop a new aligned space-time pyramid matching method.

• We present a new cross-domain learning method A-MKL which handles the mismatch between the data distributions of the consumer video domain and the web video domain.

References

[1] D. Xu and S.-F. Chang. Video event recognition using kernel methods with multi-level temporal alignment. T-PAMI, 30(11):1985–1997, 2008.

[2] I. Laptev, M. Marszałek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.

[3] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth mover’s distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.

[4] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. Smola. Integrating structured biological data by kernel maximum mean discrepancy. In ISMB, 2006.

[5] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality and the SMO algorithm. In ICML, 2004.

[6] H. Daumé III. Frustratingly easy domain adaptation. In ACL, 2007.

[7] L. Duan, I. W. Tsang, D. Xu, and S. J. Maybank. Domain transfer SVM for video concept detection. In CVPR, 2009.

[8] J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In ACM MM, 2007.

[9] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.

Thank you!
