Rank pooling and variants for action and activity recognition

Basura Fernando
ARC Centre of Excellence for Robotic Vision, Research School of Engineering, The Australian National University



Outline

1 Introduction

2 Dynamic Images

3 Hierarchical rank pooling

4 Learning discriminative dynamic

Collaborators

(a) Stephen Gould (ANU)

(b) Tinne Tuytelaars (KUL)

(c) Efstratios Gavves (UVA)

(d) Andrea Vedaldi (OX)

(e) Hakan Bilen (OX)

(f) Marcus Hutter (ANU)

(g) Jose Oramas (KUL)

(h) Amir Ghodrati (KUL)

(i) Peter Anderson (ANU)

Action recognition from video sequences

A video is a sequence of $n$ frames $X = [x_1, x_2, \ldots, x_n]$.

The frame at time $t$ is represented by a vector $x_t \in \mathbb{R}^D$.

The training set consists of $N_{trn}$ videos from $C$ action classes.

Objective: classify each test video into the correct action class.

Temporal pooling and encoding methods

max-pooling - works well with CNN features

sum-pooling - works well with Fisher vectors

LSTM

Temporal pyramids

HMM, CRF, subspace-based methods
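One property of the first two methods is easy to check numerically: max-pooling and sum-pooling give the same encoding for any ordering of the frames, so they cannot by themselves capture how a video evolves over time. The sketch below uses synthetic frame features; the function names `max_pool` and `sum_pool` are illustrative and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))            # 8 frames, 4-D frame features (synthetic)

def max_pool(frames):
    """Element-wise maximum over the time axis."""
    return frames.max(axis=0)

def sum_pool(frames):
    """Element-wise sum over the time axis."""
    return frames.sum(axis=0)

shuffled = X[rng.permutation(len(X))]  # same frames, different temporal order

# Both encodings ignore the order in which the frames arrive.
same_max = np.allclose(max_pool(X), max_pool(shuffled))
same_sum = np.allclose(sum_pool(X), sum_pool(shuffled))
print(same_max, same_sum)
```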

Temporal encoding with Rank Pooling

Let $V$ be a sequence of smoothed data $V = [v_1 \prec v_2 \prec v_3 \cdots \prec v_n]$ obtained from $X$.

Let $D$ be the dynamics of a video sequence $V$.

Proposition: The dynamics of $V$ can be approximated by a linear function $\Phi_u = \Phi(V; u)$ parametrized by $u$:

\[ \arg\min_u \; \| D - \Phi_u \|. \qquad (1) \]

For a given definition of dynamics $D$, there exists a family of functions $\Phi$.

What is a good family of such functions?

Learning to rank

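A pairwise linear ranking machine $\Phi(v; u) = u^T \cdot v$ learns the parameter $u$ from the data such that

\[ \forall t_i, t_j: \quad v_{t_i} \succ v_{t_j} \iff u^T \cdot v_{t_i} > u^T \cdot v_{t_j} \]

Using structural risk minimization and the max-margin framework, the objective is then to optimize

\[ \arg\min_u \; \frac{1}{2}\|u\|^2 + C \sum_{\forall i,j:\; v_{t_i} \succ v_{t_j}} \epsilon_{ij} \qquad (2) \]

\[ \text{s.t.} \quad u^T \cdot (v_{t_i} - v_{t_j}) \ge 1 - \epsilon_{ij}, \qquad \epsilon_{ij} \ge 0. \]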

Observations:

Parameter (u) lies in the same space as the original data of V

Parameter (u) captures the information about ordering in V

Parameter (u) captures some information about the contents of V
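A minimal sketch of objective (2), assuming the slack variables are eliminated into a hinge loss and the result is minimized by plain subgradient descent. The solver, the step size `lr`, the iteration count `n_iter` and the function name `rank_pool` are choices made for this illustration, not details given in the slides.

```python
import numpy as np

def rank_pool(V, C=1.0, lr=1e-2, n_iter=500):
    """Subgradient descent on 0.5*||u||^2 + C * sum over pairs of max(0, 1 - u.(v_later - v_earlier))."""
    n, d = V.shape
    earlier, later = np.triu_indices(n, k=1)   # all index pairs with earlier < later
    diffs = V[later] - V[earlier]              # the later frame should score higher
    u = np.zeros(d)
    for _ in range(n_iter):
        margins = diffs @ u
        active = margins < 1.0                 # pairs that violate the margin
        grad = u - C * diffs[active].sum(axis=0)
        u -= lr * grad
    return u

# Synthetic sequence that drifts along the first axis, plus a little noise.
rng = np.random.default_rng(0)
w = np.array([1.0, 0.0, 0.0])
V = np.arange(1, 11)[:, None] * w + 0.01 * rng.normal(size=(10, 3))

u = rank_pool(V)
scores = V @ u                                 # u lives in the same space as the frames
print(np.all(np.diff(scores) > 0))             # do the scores respect the temporal order?
```

The returned `u` has the same dimension as a frame vector, which is what allows it to be used as a fixed-length descriptor of the sequence.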

Rank pooling: Main idea

Proposition

We propose to use the parameters $u_i \in \mathbb{R}^D$ of $\Phi_i$ as a new video representation that captures the specific appearance evolution of the video, i.e. $D_i$.

[Figure: frames in their natural order (appearance evolution) are fed to a ranking machine, whose parameters give the video representation.]

What is captured by rank pooling?

Rank Pooling - Algorithm

1 Extract dense or improved trajectories from the video.

2 Fisher encode each frame using HOG, HOF, MBH features (create $v_{t_i}$).

3 Smooth the video signal $X$ to obtain $V = [v_1 \prec v_2 \prec v_3 \cdots \prec v_n]$.

4 Learn the ranking function $\Phi(v; u)$ from $V$.

5 Represent each video with the vector $u$.

6 Use a standard classification framework for action classification.
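The smoothing step is not specified further on this slide. One option, assumed here purely for illustration, is a time-varying mean in which each smoothed vector is the average of all frame vectors seen so far. The function name `time_varying_mean` and the synthetic input are hypothetical.

```python
import numpy as np

def time_varying_mean(X):
    """Smoothed sequence with v_t = mean(x_1, ..., x_t); X has one row per frame."""
    counts = np.arange(1, len(X) + 1)[:, None]
    return np.cumsum(X, axis=0) / counts

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))          # 30 frames of 5-D per-frame features (synthetic)
V = time_varying_mean(X)

# Frame-to-frame jumps shrink as more frames are averaged.
raw_jump = np.linalg.norm(np.diff(X, axis=0), axis=1).mean()
smooth_jump = np.linalg.norm(np.diff(V, axis=0), axis=1).mean()
print(raw_jump, smooth_jump)
```

The ranking function of step 4 would then be fitted on the rows of `V` rather than on the raw frame vectors.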

Variants

Forward and reverse Rank Pooling

Non-linear Rank Pooling with non-linear feature maps

Data augmentation with mirrored videos

Dynamic Image Networks - CVPR 2016

What if we could summarize the motion information of a video in a single image?

Dynamic images of UCF 101

Figure 1: Dynamic images summarizing the actions and motions that happen in videos, in standard 2D image format.

Learning dynamic images and dynamic maps

End to end CNN training with approximate rank pooling

The gradients of rank pooling are complex

A faster approximation of the forward pass is needed

Dynamic image networks - temporal objective

Let the frames of a video be $I_1, \ldots, I_T$.

Let $\psi(I_t) \in \mathbb{R}^d$ be a representation of the individual frame $I_t$ in the video.

Let $V_t = \frac{1}{t} \sum_{\tau=1}^{t} \psi(I_\tau)$ be the time average of these features up to time $t$.

The ranking function associates to each time $t$ a score $S(t|u) = \langle u, V_t \rangle$, where $u \in \mathbb{R}^d$ is a vector of parameters.

\[ u = \rho(I_1, \ldots, I_T; \psi) = \arg\min_u E(u), \]

\[ E(u) = \frac{\lambda}{2}\|u\|^2 + \frac{2}{T(T-1)} \times \sum_{q>t} \max\{0,\; 1 - S(q|u) + S(t|u)\}. \qquad (3) \]

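Dynamic image networks - fast approximate rank pooling

Approximate rank pooling considers the first step in a gradient-based optimization of objective (3).

Initialize the video-specific parameters with $u = \vec{0}$.

For an arbitrary learning rate $\eta > 0$, the new $u$ will be

\[ u = \vec{0} - \eta \nabla E(u)|_{u=\vec{0}} \;\propto\; -\nabla E(u)|_{u=\vec{0}} \]

\[ \nabla E(\vec{0}) \;\propto\; \sum_{q>t} \nabla \max\{0,\; 1 - S(q|u) + S(t|u)\}|_{u=\vec{0}} = \sum_{q>t} \nabla \langle u, V_t - V_q \rangle = \sum_{q>t} V_t - V_q. \]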
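The single gradient step can be checked numerically. The sketch below evaluates the energy $E(u)$ on synthetic features, confirms that it equals 1 at $u = \vec{0}$ (every hinge term is active there and the normalizer cancels the pair count), and takes a small step along $\sum_{q>t} (V_q - V_t)$. The values of `lam` and the step size are arbitrary choices for the illustration.

```python
import numpy as np

def time_averages(psi):
    """V_t = (1/t) * sum of psi_1..psi_t, for t = 1..T."""
    return np.cumsum(psi, axis=0) / np.arange(1, len(psi) + 1)[:, None]

def energy(u, V, lam=1e-3):
    """E(u): L2 penalty plus the normalized pairwise hinge loss over all q > t."""
    T = len(V)
    s = V @ u                                   # scores S(t|u)
    t_idx, q_idx = np.triu_indices(T, k=1)      # all pairs with q > t
    hinge = np.maximum(0.0, 1.0 - s[q_idx] + s[t_idx])
    return 0.5 * lam * (u @ u) + 2.0 / (T * (T - 1)) * hinge.sum()

def first_step_direction(V):
    """Negative gradient of E at u = 0, up to a positive factor."""
    t_idx, q_idx = np.triu_indices(len(V), k=1)
    return (V[q_idx] - V[t_idx]).sum(axis=0)

rng = np.random.default_rng(0)
psi = rng.normal(size=(10, 5))                  # synthetic frame features
V = time_averages(psi)
d = first_step_direction(V)

e0 = energy(np.zeros(5), V)                     # energy at the initialization
e1 = energy(1e-4 * d, V)                        # energy after one small step
print(e0, e1)
```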

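We can further expand $u$ as follows:

\[ u \;\propto\; \sum_{q>t} V_q - V_t = \sum_{q>t} \left[ \frac{1}{q} \sum_{i=1}^{q} \psi_i - \frac{1}{t} \sum_{j=1}^{t} \psi_j \right] = \sum_{t=1}^{T} \alpha_t \psi_t \]

where the coefficients $\alpha_t$ are given by

\[ \alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}) \qquad (4) \]

and where $H_t = \sum_{i=1}^{t} 1/i$ is the $t$-th harmonic number (and where we define $H_0 = 0$). Hence the rank pooling operator reduces to

\[ \rho(I_1, \ldots, I_T; \psi) = \sum_{t=1}^{T} \alpha_t \psi(I_t). \qquad (5) \]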
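As a sanity check on equations (4) and (5), the following sketch computes the coefficients $\alpha_t$ and compares the weighted sum with the direct double sum over all pairs $q > t$. The frame features are synthetic and the function names are illustrative.

```python
import numpy as np

def alpha_coefficients(T):
    """alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}) for t = 1..T."""
    t = np.arange(1, T + 1)
    H = np.concatenate(([0.0], np.cumsum(1.0 / t)))   # H[k] is the k-th harmonic number, H[0] = 0
    return 2.0 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])

def approximate_rank_pool(psi):
    """Equation (5): a fixed weighted sum of the frame features."""
    return alpha_coefficients(len(psi)) @ psi

rng = np.random.default_rng(0)
psi = rng.normal(size=(7, 4))                         # synthetic frame features, T = 7
V = np.cumsum(psi, axis=0) / np.arange(1, 8)[:, None] # time averages V_t
t_idx, q_idx = np.triu_indices(7, k=1)                # all pairs with q > t

direct = (V[q_idx] - V[t_idx]).sum(axis=0)            # sum over q > t of (V_q - V_t)
pooled = approximate_rank_pool(psi)
print(np.allclose(direct, pooled))
```

Because the weights depend only on $T$, the operator is linear in the frame features, which is what makes it cheap to use as a layer in a network.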

UCF101 dataset

Results on UCF101 & HMDB51 datasets

Method        SPLIT1   SPLIT2   SPLIT3   MEAN
Mean Image    52.6     53.4     51.7     52.6
Max Image     48.0     46.0     42.3     45.4
SDI           57.2     58.7     57.7     57.9

Table 1: Comparing several video representative image models using UCF101.

Method            HMDB51   UCF101
MDI               32.3     68.6
MDM (conv1)       –        67.1
MDI end-to-end    35.8     70.9

Table 2: Evaluating the effect of end-to-end training for multiple dynamic images and multiple dynamic maps after convolutional layer 1.

Hierarchical rank pooling - CVPR 2016

PART III. Hierarchical rank pooling

Need for hierarchical rank pooling

The capacity of linear, flat rank pooling is limited.

Dynamics of multiple granularities can be captured (low-level, mid-level, and high-level dynamics).

Hierarchical networks of non-linear dynamic functions can be employed on sequence data.

Figure 2: Illustration of hierarchical rank pooling for encoding the temporal dynamics of a video sequence.

Hierarchical rank pooling - CVPR 2016

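Let $\Phi(x^{(m)}_n, x^{(m)}_{n+1}, \cdots, x^{(m)}_{n+k}) \to x^{(m+1)}_k$ be the rank pooling function applied at layer $m$ on the $k$-th sequence.

$\Psi(x^{(m)}_n)$ is a non-linear feature encoding.

[Figure: the input sequence $x_1, \ldots, x_5$ is mapped through $\psi$; $\phi()$ is applied to sub-sequences of $\psi(x^{(1)}_1), \ldots, \psi(x^{(1)}_5)$ to produce $x^{(2)}_1, x^{(2)}_2, x^{(2)}_3$; these are mapped through $\psi$ again and pooled by $\phi()$ into $x^{(3)}_1$.]

Figure 3: Illustration of hierarchical temporal encoding networks.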

Hierarchical rank pooling Φ() and Ψ

Sequence encoding machinery φ

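\[ \Phi: \quad u^\star \in \arg\min_u \left\{ \frac{1}{2}\|u\|^2 + \frac{C}{2} \sum_{t=1}^{J} \big[\, |t - u^T v_t| - \epsilon \,\big]^2_{\ge 0} \right\} \qquad (6) \]

Capturing non-linear dynamics ψ

\[ \psi(x) = \left( \sqrt{\max\{0, x\}},\; \sqrt{\max\{0, -x\}} \right) \qquad (7) \]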
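A rough sketch of how the two pieces fit together in one layer of the hierarchy. To keep it short, the encoder solves equation (6) only in the special case $\epsilon = 0$, where the objective becomes ridge regression onto the frame index and has a closed-form minimizer; a non-zero $\epsilon$ would need an iterative solver. The window and stride values, the function names and the synthetic input are assumptions made for illustration.

```python
import numpy as np

def signed_sqrt_map(x):
    """psi(x) = (sqrt(max(0, x)), sqrt(max(0, -x))); doubles the feature dimension."""
    return np.concatenate([np.sqrt(np.maximum(0.0, x)),
                           np.sqrt(np.maximum(0.0, -x))], axis=-1)

def encode(V, C=1.0):
    """Minimizer of equation (6) with eps = 0: solve (I + C V^T V) u = C V^T t."""
    J, d = V.shape
    t = np.arange(1, J + 1, dtype=float)
    return np.linalg.solve(np.eye(d) + C * V.T @ V, C * V.T @ t)

def rank_pool_layer(X, window, stride):
    """Apply psi to every vector, then encode each sliding window of the sequence."""
    Z = signed_sqrt_map(X)
    starts = range(0, len(Z) - window + 1, stride)
    return np.stack([encode(Z[s:s + window]) for s in starts])

rng = np.random.default_rng(0)
X1 = rng.normal(size=(20, 3))                  # layer-1 input: 20 frames, 3-D features (synthetic)
X2 = rank_pool_layer(X1, window=8, stride=4)   # layer-2 sequence, one vector per window
u = encode(signed_sqrt_map(X2))                # final pooling over the whole layer-2 sequence
print(X2.shape, u.shape)
```

Each layer shortens the sequence and, because of $\psi$, doubles the dimension, so the depth and window size trade temporal resolution against descriptor size.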

Results - on baseline methods

Method                        Hollywood2   HMDB51   UCF101
Average pooling               40.9         37.1     69.3
Max pooling                   42.4         39.1     72.5
Tempo. pyramid (avg. pool)    46.5         39.1     73.3
Tempo. pyramid (max pool)     48.7         39.8     74.8
LSTM [Srivastava2015]         –            42.8     74.5
LRCN [Donahue2015]            –            –        68.8
Rank pooling                  44.2         40.9     72.2
Recursive rank pooling        52.5         45.8     75.6
Hierarchical rank pooling     56.8         47.5     78.8
Improvement                   +8.1         +4.7     +4.0

Table 3: Comparing several temporal pooling methods for activity recognition using VGG-16 fc6 features.

Comparing to state-of-the-art

Method                            Hollywood2   HMDB51   UCF101
Our method                        76.7         66.9     91.4
[Zha et al., 2015]                –            –        89.6
[Fernando et al., 2015]           73.7         63.7     –
[Lan et al., 2015]                68.0         65.4     89.1
[Yue-Hei Ng et al., 2015]         –            –        88.6
[Simonyan and Zisserman, 2014]    –            59.4     88.0
[Hoai and Zisserman, 2014]        73.6         60.8     –
[Peng et al., 2014]               –            66.8     –
[Wu et al., 2014]                 –            56.4     84.2
[Jain et al., 2013]               62.5         52.1     –
[Wang and Schmid, 2013]           64.3         57.2     –
[Wang et al., 2013]               58.2         46.6     –
[Taylor et al., 2010]             46.6         –        –

Table 4: Comparison with state-of-the-art methods.

Parameter Evaluation : window size and stride

[Plots: performance versus window size (10 to 40, top row) and stride size (1 to 20, bottom row) for (a) Hollywood2 (mAP, %), (b) HMDB-51 (accuracy, %) and (c) UCF101 (accuracy, %).]

Figure 4: Activity recognition performance versus window size (top) and stride (bottom).

Parameter Evaluation : hierarchy depth

[Plots: performance versus hierarchy depth (1 to 3) for (a) Hollywood2 (mAP, %) and (b) HMDB-51 (accuracy, %).]

Figure 5: Activity recognition performance versus hierarchy depth on Hollywood2 and HMDB-51.

Discriminative dynamics learning with CNN - ICML 2016

PART IV. Discriminative dynamics learning with CNN

Discriminative dynamics learning with CNN - ICML 2016

[Figure: each input frame (Frame 1, Frame 2, ..., Frame t) passes through its own copy of a CNN stream (Conv-1 to Conv-5 with ReLU and pooling layers, FC-6, FC-7, L2-norm); the per-frame outputs feed a temporal pooling layer, followed by L2-norm and soft-max.]

Figure 6: The CNN architecture takes a sequence of frames from a video as input and feeds them forward to the end of the temporal pooling layer. At the temporal pooling layer, the sequence of vectors is encoded by the rank-pooling operator to produce a fixed-length video representation.

General learning framework

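$\vec{x}$ is the sequence of frames.

$\psi_\theta$ is the learnable feature function (e.g. a CNN).

$\phi$ is the sequence encoding function, or any learning program that returns an encoding.

$h_\beta$ is a classifier.

\[ \vec{x} = \langle x_t \rangle \;\overset{\psi_\theta}{\longmapsto}\; \langle v_t \rangle \;\overset{\phi}{\longmapsto}\; u \;\overset{h_\beta}{\longmapsto}\; y \qquad (8) \]

Given a dataset of sequence-label pairs $\{(\vec{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$, our goal is to learn all the parameters.

Let $\Delta(\cdot, \cdot)$ be a loss function.

\[ \Delta(y, h_\beta(u)) = -\log P(y \mid \vec{x}). \qquad (9) \]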

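Temporal encoding of the sequence $\langle v_t \rangle$

The sequence of CNN vectors is given by $\vec{v} = \langle v_1, \ldots, v_T \rangle$.

The temporal encoding is obtained by

\[ u = \phi(\vec{v}) \in \mathbb{R}^q. \qquad (10) \]

To get the encoding we need to solve the following optimization problem:

\[ u \in \arg\min_{u'} f(\vec{v}, u') \qquad (11) \]

For mean pooling the sequence encoder is given by

\[ \mathrm{avg}(\vec{v}) = \arg\min_u \left\{ \frac{1}{2} \sum_{t=1}^{T} \|u - v_t\|^2 \right\}. \qquad (12) \]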
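As a quick check that this optimization form really is mean pooling, set the gradient of the objective with respect to $u$ to zero. The objective is strictly convex in $u$, so the stationary point is the unique minimizer:

```latex
\nabla_u \, \frac{1}{2} \sum_{t=1}^{T} \|u - v_t\|^2
  = \sum_{t=1}^{T} (u - v_t)
  = T u - \sum_{t=1}^{T} v_t
  = 0
\quad \Longrightarrow \quad
u = \frac{1}{T} \sum_{t=1}^{T} v_t .
```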

Bilevel optimization inside CNN

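We jointly estimate the parameters of the feature function and the prediction function by minimizing the regularized empirical risk. Our learning problem is

\[ \begin{aligned} \underset{\theta, \beta}{\text{minimize}} \quad & \sum_{i=1}^{n} \Delta\big(y^{(i)}, h_\beta(u^{(i)})\big) + R(\theta, \beta) \\ \text{subject to} \quad & u^{(i)} \in \arg\min_u f(\vec{v}^{(i)}, u) \end{aligned} \qquad (13) \]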

Gradient computation

Lemma

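Let $f : \mathbb{R} \times \mathbb{R}^n \to \mathbb{R}$ be a continuous function with first and second derivatives. Let

\[ g(x) = \arg\min_{y \in \mathbb{R}^n} f(x, y), \]

then

\[ g'(x) = -f_{YY}(x, g(x))^{-1} f_{XY}(x, g(x)), \]

where $f_{YY} \doteq \nabla^2_{yy} f(x, y) \in \mathbb{R}^{n \times n}$ and $f_{XY} \doteq \frac{\partial}{\partial x} \nabla_y f(x, y) \in \mathbb{R}^n$.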
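The lemma can be checked on a toy problem where the argmin is known in closed form. The sketch below uses the quadratic $f(x, y) = \frac{1}{2} y^T A y - x\, b^T y$, chosen only for illustration, for which $g(x) = x A^{-1} b$, and compares the derivative given by the lemma with a finite difference. The matrix `A` and vector `b` are arbitrary.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (illustrative)
b = np.array([1.0, -1.0])

def g(x):
    """argmin over y of 0.5*y^T A y - x * b^T y, i.e. the solution of A y = x b."""
    return np.linalg.solve(A, x * b)

# Ingredients of the lemma for this particular f:
f_YY = A            # Hessian of f with respect to y
f_XY = -b           # derivative with respect to x of grad_y f = A y - x b

x0, h = 0.7, 1e-6
analytic = -np.linalg.solve(f_YY, f_XY)          # g'(x) according to the lemma
numeric = (g(x0 + h) - g(x0 - h)) / (2 * h)      # central finite difference
print(analytic, numeric)
```

In the bilevel setting the same formula lets the gradient pass through the encoder's argmin without unrolling its solver.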

Gradient computation of sequence encoder

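\[ f(\theta, u) = \frac{1}{2}\|u\|^2 + \frac{C}{2} \sum_{t=1}^{T} \big[\, |t - u^T v_t| - \epsilon \,\big]^2_{\ge 0} \qquad (14) \]

Now compute the gradient of the sequence encoder, $\frac{d}{d\theta} \arg\min_u f(\theta, u)$, using the lemma.

\[ \frac{d}{d\theta} \arg\min_u f(\theta, u) = \left( I + C \sum_{e_t \ne 0} v_t v_t^T \right)^{-1} \left( C \sum_{e_t \ne 0} \Big( e_t \, \psi'_\theta(x_t) - u^T \psi'_\theta(x_t) \, v_t \Big) \right) \qquad (15) \]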

Results of End-to-end learning of video representations

Table 5: Classification performance in average precision for activity recognition on the Hollywood2 dataset.

                 svm                         cnn
class            mean    max     rankpool    mean    max     rankpool
AnswerPhone      23.6    19.5    35.3        29.9    28.0    25.0
DriveCar         60.9    50.8    40.6        55.6    48.6    56.9
Eat              19.7    22.0    16.7        27.8    22.0    24.2
FightPerson      45.6    28.3    28.1        26.6    17.6    30.4
GetOutCar        39.5    29.2    28.1        48.9    43.8    55.5
HandShake        28.3    24.4    34.2        38.4    40.0    32.0
HugPerson        30.2    23.9    22.1        25.9    26.6    33.2
Kiss             38.2    27.5    36.8        50.6    45.7    54.2
Run              55.2    53.0    39.4        59.6    52.5    61.0
SitDown          30.0    28.8    32.1        30.6    30.0    39.6
SitUp            23.0    20.2    18.7        23.8    26.4    25.4
StandUp          34.6    32.4    39.9        37.4    34.8    49.9
AVG              35.7    30.0    31.0        37.9    34.7    40.6

Tools and Data

LEAR - Improved Trajectories Video Description - Heng Wang: lear.inrialpes.fr/people/wang/improved_trajectories

Fisher encoding - VLFeat: http://www.vlfeat.org/index.html

Rank pooling code: https://bitbucket.org/bfernando/videodarwin

Dynamic Image Nets: https://github.com/hbilen/dynamic-image-nets

End

Thank you!

References I

Fernando, B., Gavves, E., Oramas, J., Ghodrati, A., and Tuytelaars, T. (2015). Modeling video evolution for action recognition. In CVPR.

Hoai, M. and Zisserman, A. (2014). Improving human action recognition using score distribution and ranking. In ACCV.

Jain, M., Jegou, H., and Bouthemy, P. (2013). Better exploiting motion for better action recognition. In CVPR.

References II

Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3D convolutional neural networks for human action recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):221–231.

Lan, Z., Lin, M., Li, X., Hauptmann, A. G., and Raj, B. (2015). Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition. In CVPR.

Laptev, I. (2005). On space-time interest points. IJCV, 64:107–123.

Peng, X., Zou, C., Qiao, Y., and Peng, Q. (2014). Action recognition with stacked Fisher vectors. In ECCV.

References III

Perronnin, F., Liu, Y., Sanchez, J., and Poirier, H. (2010). Large-scale image retrieval with compressed Fisher vectors. In CVPR.

Pfister, T., Charles, J., and Zisserman, A. (2014). Domain-adaptive discriminative one-shot learning of gestures. In ECCV.

Rohrbach, M., Amin, S., Andriluka, M., and Schiele, B. (2012). A database for fine grained activity detection of cooking activities. In CVPR.

Simonyan, K. and Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576.

References IV

Taylor, G. W., Fergus, R., LeCun, Y., and Bregler, C. (2010). Convolutional learning of spatio-temporal features. In ECCV.

Wang, H., Klaser, A., Schmid, C., and Liu, C.-L. (2013). Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103:60–79.

Wang, H. and Schmid, C. (2013). Action recognition with improved trajectories. In ICCV.

Wu, J., Cheng, J., Zhao, C., and Lu, H. (2013). Fusing multi-modal features for gesture recognition. In ICMI.

References V

Wu, J., Zhang, Y., and Lin, W. (2014). Towards good practices for action video encoding. In CVPR.

Yao, A., Van Gool, L., and Kohli, P. (2014). Gesture recognition portfolios for personalization. In CVPR.

Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., and Toderici, G. (2015). Beyond short snippets: Deep networks for video classification. In CVPR.

References VI

Zha, S., Luisier, F., Andrews, W., Srivastava, N., and Salakhutdinov, R. (2015). Exploiting image-trained CNN architectures for unconstrained video classification. In BMVC.

Zhou, Y., Ni, B., Yan, S., Moulin, P., and Tian, Q. (2014). Pipelining localized semantic features for fine-grained action recognition. In ECCV.
