Conquering the CNN Over-Parameterization Dilemma: A Volterra Filtering Approach for Action Recognition

Siddharth Roheda, Hamid Krim
North Carolina State University at Raleigh, NC, USA
{sroheda, ahk}@ncsu.edu
Abstract
The importance of inference in Machine Learning (ML) has led to an explosive number of different proposals in ML, and particularly in Deep Learning. In an attempt to reduce the complexity of Convolutional Neural Networks, we propose a Volterra filter-inspired network architecture. This architecture introduces controlled non-linearities in the form of interactions between the delayed input samples of data. We propose a cascaded implementation of Volterra filtering so as to significantly reduce the number of parameters required to carry out the same classification task as that of a conventional Neural Network. We demonstrate an efficient parallel implementation of this Volterra Neural Network (VNN), along with its remarkable performance while retaining a relatively simpler and potentially more tractable structure. Furthermore, we show a rather sophisticated adaptation of this network to non-linearly fuse the RGB (spatial) information and the Optical Flow (temporal) information of a video sequence for action recognition. The proposed approach is evaluated on the UCF-101 and HMDB-51 datasets for action recognition, and is shown to outperform state of the art CNN approaches.
1 Introduction

Human action recognition is an important research topic in Computer Vision, and can be used towards surveillance, video retrieval, and man-machine interaction, to name a few applications. The survey on action recognition approaches (Kong and Fu 2018) provides a good progress overview. Video classification usually involves three stages (Wang et al. 2009; Liu, Luo, and Shah 2009; Niebles, Chen, and Fei-Fei 2010; Sivic and Zisserman 2003; Karpathy et al. 2014), namely, visual feature extraction (local features like Histograms of Oriented Gradients (HoG) (Dalal and Triggs 2005), or global features like hue, saturation, etc.), feature fusion/concatenation, and lastly classification. In (Yi, Krim, and Norris 2011), an intrinsic stochastic modeling of human activity on a shape manifold is proposed and an accurate analysis of the non-linear feature space of activity models is provided. The emergence of Convolutional Neural Networks (CNNs) (LeCun et al. 1998), along with the
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
availability of large training datasets and computational resources, has gone a long way toward obtaining the various steps with a single neural network. This approach has led to remarkable progress in action recognition in video sequences, as well as in other vision applications like object detection (Sermanet et al. 2013), scene labeling (Farabet et al. 2012), image generation (Goodfellow et al. 2014), image translation (Isola et al. 2017), information distillation (Roheda et al. 2018b; Hoffman, Gupta, and Darrell 2016), etc. In the action recognition domain, datasets like UCF-101 (Soomro, Zamir, and Shah 2012), Kinetics (Kay et al. 2017), HMDB-51 (Kuehne et al. 2011), and Sports-1M (Karpathy et al. 2014) have served as benchmarks for evaluating various solution performances. In action recognition applications the proposed CNN solutions generally align along two themes: 1. One Stream CNN (use only spatial or only temporal information), 2. Two Stream CNN (integrate both spatial and temporal information). Many implementations (Carreira and Zisserman 2017; Diba, Sharma, and Van Gool 2017; Feichtenhofer, Pinz, and Zisserman 2016; Simonyan and Zisserman 2014) have shown that integrating both streams leads to a significant boost in recognition performance. In Deep Temporal Linear Encoding (Diba, Sharma, and Van Gool 2017), the authors propose to use 2D CNNs (pre-trained on ImageNet (Deng et al. 2009)) to extract features from RGB frames (spatial information) and the associated optical flow (temporal information). The video is first divided into smaller segments for feature extraction via 2D CNNs. The extracted features are subsequently combined into a single feature map via a bilinear model. This approach, when using both streams, is shown to achieve 95.6 % accuracy on the UCF-101 dataset, while only achieving 86.3 % when relying on the RGB stream alone. Carreira et al. (Carreira and Zisserman 2017) adopt the GoogLeNet architecture, which was developed for image classification on ImageNet (Deng et al. 2009), and use 3D convolutions (instead of 2D ones) to classify videos. This implementation is referred to as the Inflated 3D CNN (I3D), and is shown to achieve a performance of 88.8 % on UCF-101 when trained from scratch, while achieving a 98.0 % accuracy when a larger dataset (Kinetics) was used for pre-training the entire network (except for the classification layer). While these
arXiv:1910.09616v3 [cs.CV] 12 Nov 2020
approaches achieve near perfect classification, the models are extremely heavy to train and have a tremendous number of parameters (25M in I3D, 22.6M in Deep Temporal Linear Encoding). This, in addition, makes the analysis, including the necessary degree of non-linearity, difficult to understand, and the tractability elusive. In this paper we explore the idea of introducing controlled non-linearities through interactions between delayed samples of a time series. We build on the formulation of the widely known Volterra Series (Volterra 2005) to accomplish this task. While prior attempts to introduce non-linearity based on the Volterra Filter have been proposed (Kumar et al. 2011; Zoumpourlis et al. 2017), most have limited the development to a quadratic form on account of the explosive number of parameters required to learn higher order complexity structure. While quadratic non-linearity is sufficient for some applications (e.g. system identification), it is highly inadequate to capture all the non-linear information present in videos.

Contributions: In this paper, we propose a Volterra Filter (Volterra 2005) based architecture where the non-linearities are introduced via the system response functions, and hence by controlled interactions between delayed frames of the video. The overall model is updated on the basis of a cross-entropy loss of the labels resulting from a linear classifier of the generated features. An efficiently cascaded implementation of a Volterra Filter is used in order to explore higher order terms while avoiding over-parameterization. The Volterra filter principle is also exploited to combine the RGB and the Optical Flow streams for action recognition, hence yielding a performance driven non-linear fusion of the two streams. We further show that the number of parameters required to realize such a model is significantly lower in comparison to a conventional CNN, hence leading to faster training and a significant reduction of the resources required to learn, store, or implement such a model.
2 Background and Related Work

2.1 Volterra Series

The physical notion of a system is that of a black box with an input/output relationship $y_t / x_t$. If a non-linear system is time invariant, the relation between the output and the input can be expressed in the following form (Volterra 2005; Schetzen 1980),

$$y_t = \sum_{\tau_1=0}^{L-1} W^1_{\tau_1} x_{t-\tau_1} + \sum_{\tau_1,\tau_2=0}^{L-1} W^2_{\tau_1,\tau_2}\, x_{t-\tau_1} x_{t-\tau_2} + \ldots + \sum_{\tau_1,\tau_2,\ldots,\tau_K=0}^{L-1} W^K_{\tau_1,\tau_2,\ldots,\tau_K}\, x_{t-\tau_1} x_{t-\tau_2} \cdots x_{t-\tau_K}, \quad (1)$$

where $L$ is the number of terms in the filter memory (also referred to as the filter length), $W^k$ are the weights for the $k$th order term, and $W^k_{\tau_1,\tau_2,\ldots,\tau_k} = 0$ for any $\tau_j < 0$, $j = 1, 2, \ldots, k$, $\forall k = 1, 2, \ldots, K$, due to causality. This functional form is due to the mathematician Vito Volterra (Volterra 2005), and is widely referred to as the Volterra Series. The calculation of the kernels is a computationally complex problem, and a $K$th order filter of length $L$ entails solving $L^K$ equations. The corresponding adaptive weights are the result of a target energy functional whose minimization iteratively adapts the filter taps, as shown in Figure 1.

Figure 1: Adaptive Volterra Filter (delay elements $z^{-1}$ produce $x_{t-1}, \ldots, x_{t-L}$ from the input $x_t$; the filter output $y_t$ is compared with the desired signal $d_t$ to form the error $E_t$)
It is worth observing from Equation 1 that the linear term is actually similar to a convolutional layer in CNNs. Non-linearities in CNNs are introduced only via activation functions, and not in the convolutional layer, while in Equation 1 we introduce non-linearities via higher order convolutions.
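To make the role of the low-order terms concrete, here is a minimal numerical sketch of Equation 1 truncated at K = 2 for a 1-D signal; the function name `volterra_forward` and the toy values are our illustration, not from the paper.

```python
import numpy as np

def volterra_forward(x, t, W1, W2):
    """Output y_t of a 2nd-order (K = 2) Volterra filter, per Equation 1.

    W1: shape (L,)   linear taps (the convolution-like term)
    W2: shape (L, L) quadratic taps over pairs of delayed samples
    """
    L = len(W1)
    # Delayed samples x_{t-tau}, tau = 0..L-1; terms vanish for t - tau < 0 (causality).
    d = np.array([x[t - tau] if t - tau >= 0 else 0.0 for tau in range(L)])
    linear = W1 @ d         # sum_tau W1_tau * x_{t-tau}
    quadratic = d @ W2 @ d  # sum_{tau1,tau2} W2_{tau1,tau2} * x_{t-tau1} * x_{t-tau2}
    return linear + quadratic

x = np.array([1.0, 2.0, -1.0, 0.5])
W1 = np.array([0.5, 0.25])               # L = 2
W2 = np.array([[0.1, 0.0], [0.0, 0.1]])
y2 = volterra_forward(x, 2, W1, W2)      # uses x_2 and x_1
```

The quadratic term is precisely the kind of controlled interaction between delayed samples that the rest of the paper builds on.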
2.2 Nested Volterra Filter

In (Osowski and Quang 1994), a nested reformulation of Equation 1 was used in order to construct a feedforward implementation of the Volterra Filter,

$$y_t = \sum_{\tau_1=0}^{L-1} x_{t-\tau_1}\Big[W^1_{\tau_1} + \sum_{\tau_2=0}^{L-1} x_{t-\tau_2}\big[W^2_{\tau_1,\tau_2} + \sum_{\tau_3=0}^{L-1} x_{t-\tau_3}\,[W^3_{\tau_1,\tau_2,\tau_3} + \ldots]\big]\Big]. \quad (2)$$

Each factor contained in the brackets can be interpreted as the output of a linear Finite Impulse Response (FIR) filter, thus allowing a layered representation of the filter. A nested filter implementation with L = 2 and K = 2 is shown in Figure 2. The length of the filter is increased by adding modules in parallel, while the order is increased by adding layers. Much like any multi-layer network, the weights of the synthesized filter are updated layer after layer according to a backpropagation scheme. The nested structure of the Volterra Filter yields much faster learning in comparison to that based on Equation 1. It does not, however, reduce the number of parameters to be learned, leading to potential over-parameterization when learning higher order filter approximations. Such a structure was used for a system identification problem in (Osowski and Quang 1994). The mean square error between the desired signal ($d_t$) and the output of the filter ($y_t$) was used as the cost functional to be minimized,

$$E_t = \frac{1}{2}(d_t - y_t)^2, \quad (3)$$
Figure 2: Nested Volterra Filter (Layer 1 and Layer 2)
and the weights for the $k$th layer are updated per the following equations,

$$W^k_{\tau_1,\tau_2,\ldots,\tau_k}(t+1) = W^k_{\tau_1,\tau_2,\ldots,\tau_k}(t) - \eta \frac{\partial E_t}{\partial W^k_{\tau_1,\tau_2,\ldots,\tau_k}}, \quad (4)$$

$$\frac{\partial E_t}{\partial W^k_{\tau_1,\tau_2,\ldots,\tau_k}} = x_{t-\tau_k} x_{t-\tau_{k-1}} \cdots x_{t-\tau_1}\,(y_t - d_t). \quad (5)$$
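The update rule of Equations 3-5 can be sketched for a 1-D, 2nd-order filter as follows; the function name, step size, and toy target are our choices for illustration.

```python
import numpy as np

def volterra_lms_step(x_win, d_t, W1, W2, eta=0.05):
    """One descent step of Equations 4-5 on the cost E_t of Equation 3.

    x_win: delayed samples [x_t, x_{t-1}, ..., x_{t-L+1}]
    d_t:   desired output
    """
    y_t = W1 @ x_win + x_win @ W2 @ x_win
    err = y_t - d_t
    # dE/dW1_tau = x_{t-tau}(y_t - d_t);  dE/dW2_{t1,t2} = x_{t-t1} x_{t-t2} (y_t - d_t)
    W1 = W1 - eta * err * x_win
    W2 = W2 - eta * err * np.outer(x_win, x_win)
    return W1, W2, 0.5 * err ** 2

# Driving the filter output toward a fixed target on one window:
x_win, d_t = np.array([1.0, -0.5]), 0.7
W1, W2 = np.zeros(2), np.zeros((2, 2))
losses = []
for _ in range(200):
    W1, W2, loss = volterra_lms_step(x_win, d_t, W1, W2)
    losses.append(loss)
```

Because the output is linear in the weights, the cost is a convex quadratic in $(W^1, W^2)$ and the iteration converges for a small enough step size.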
2.3 Bilinear Convolution Neural Networks

There has been work on introducing 2nd order non-linearities in the network by using a bilinear operation on the features extracted by convolutional networks. Bilinear-CNNs (B-CNNs) were introduced in (Lin, RoyChowdhury, and Maji 2015) and were used for image classification. In B-CNNs, a bilinear mapping is applied to the final features extracted by linear convolutions, leading to 2nd order features which are not localized. As a result, in the B-CNN case a feature extracted from the lower right corner of a frame may interact with a feature from the upper right corner, even though these two are not necessarily related (hence introducing erroneous additional characteristics); this is in contrast to our proposed approach, which highly controls such effects. Compact Bilinear Pooling was introduced in (Gao et al. 2016), where a bilinear approach was used to reduce feature dimensions. This was again performed after all the features had been extracted via linear convolutions, and was limited to quadratic non-linearities. In our work we explore non-linearities of much higher orders and also account for continuity of information between the video frames of a given time period and those of the immediately preceding period. This effectively achieves a Recurrent Network-like property which accounts for a temporal relationship.
3 Problem Statement

Let a set of actions $\mathcal{A} = \{a_1, \ldots, a_I\}$ be of interest given an observed sequence of frames $X_{T \times H \times W}$, where $T$ is the total number of frames, and $H$ and $W$ are the height and width of a frame. At time $t$, the features $F_t = g(X_{[t-L+1:t]})$ will be used for classification of the sequence of frames $X_{[t-L+1:t]}$ and mapped into one of the actions in $\mathcal{A}$, where $L$ is the number of frames in the memory of the system/process. A linear classifier with weights $W^{cl} = \{w^{cl}_i\}_{i=1,\ldots,I}$ and biases $b^{cl} = \{b^{cl}_i\}_{i=1,\ldots,I}$ will then be central to determining the classification scores for each action, followed by a soft-max function ($\sigma(.)$) to convert the scores into a probability measure. The probability that the sequence of frames is associated with the $i$th action class is hence the result of,

$$P_t(a_i) = \sigma(w^{clT}_i \cdot F_t + b^{cl}_i) = \frac{\exp(w^{clT}_i \cdot F_t + b^{cl}_i)}{\sum_{m=1}^{I} \exp(w^{clT}_m \cdot F_t + b^{cl}_m)}. \quad (6)$$
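A compact sketch of the classification head in Equation 6 (with the standard numerical stabilization of the soft-max; array shapes are our convention):

```python
import numpy as np

def classify(F_t, W_cl, b_cl):
    """Equation 6: per-action probabilities from features F_t.

    W_cl: (I, F) classifier weights, b_cl: (I,) biases, one row/entry per action.
    """
    scores = W_cl @ F_t + b_cl
    scores = scores - scores.max()  # subtracting the max leaves the soft-max unchanged
    p = np.exp(scores)
    return p / p.sum()

probs = classify(np.array([1.0, 2.0]),
                 np.array([[0.5, -0.2], [0.1, 0.3]]),
                 np.array([0.0, 0.1]))
```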
4 Proposed Solution

4.1 Volterra Filter based Classification

In our approach we propose a Volterra Filter structure to approximate the function g(.). Given that video data is of interest here, a spatio-temporal Volterra Filter must be applied. As a result, a 3D version of the Volterra Filter discussed in Section 2 is used to extract the features,

$$F_{[t,s_1,s_2]} = g\big(X_{[t-L+1:t,\; s_1-p_1:s_1+p_1,\; s_2-p_2:s_2+p_2]}\big) = \sum_{\tau_1,\tau_{11},\tau_{21}} W^1_{[\tau_1,\tau_{11},\tau_{21}]}\, x_{[t-\tau_1,\, s_1-\tau_{11},\, s_2-\tau_{21}]} + \sum_{\substack{\tau_1,\tau_{11},\tau_{21}\\ \tau_2,\tau_{12},\tau_{22}}} W^2_{[\tau_1,\tau_{11},\tau_{21}][\tau_2,\tau_{12},\tau_{22}]}\, x_{[t-\tau_1,\, s_1-\tau_{11},\, s_2-\tau_{21}]}\, x_{[t-\tau_2,\, s_1-\tau_{12},\, s_2-\tau_{22}]} + \ldots \quad (7)$$

where $\tau_j \in [0, L-1]$, $\tau_{1j} \in [-p_1, p_1]$, and $\tau_{2j} \in [-p_2, p_2]$. Following this formulation, and as discussed in Section 3, the linear classifier is used to determine the probability of each action in $\mathcal{A}$. Updating the filter parameters is pursued by minimizing some measure of discrepancy between the ground truth and the probability determined by the model. Our adopted measure herein is the cross-entropy loss computed as,
$$E = \sum_{t,I} -d_{ti} \log P_t(a_i), \quad (8)$$

where $t \in \{1, L+1, 2L+1, \ldots, T\}$, $i \in \{1, 2, \ldots, I\}$, and $d_{ti}$ is the ground truth label for $X_{[t-L+1:t]}$ belonging to the $i$th action class. In addition to minimizing the error, we also include a weight decay in order to ensure generalizability of the model by penalizing large weights. So, the overall cost functional which serves as a target metric is written as,

$$\min_g \sum_{t,I} -d_{ti} \log \sigma\big(w^{clT}_i \cdot g(X_{[t-L+1:t]}) + b^{cl}_i\big) + \frac{\lambda}{2}\Big[\sum_{k=1}^{K} \big\|W^k\big\|_2^2 + \big\|W^{cl}\big\|_2^2\Big], \quad (9)$$

where $\sigma$ is the soft-max function, and $K$ is the order of the filter.
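The overall objective of Equations 8-9 can be sketched as follows; the epsilon guard and the helper name are our additions.

```python
import numpy as np

def objective(P, D, weights, lam):
    """Equation 9: cross-entropy of Equation 8 plus the L2 weight-decay penalty.

    P: (N, I) predicted probabilities for N clips, D: (N, I) one-hot labels,
    weights: all parameter arrays (Volterra kernels W^k and classifier W^cl).
    """
    xent = -np.sum(D * np.log(P + 1e-12))  # small epsilon guards log(0)
    decay = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
    return xent + decay
```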
Figure 3: Block diagram for an Overlapping Volterra Neural Network
4.2 Non-Linearity Enhancement: Cascaded Volterra Filters

A major challenge in learning the afore-mentioned architecture arises when higher order non-linearities are sought. The number of required parameters for a $K$th order filter is,

$$\sum_{k=1}^{K} (L \cdot [2p_1+1] \cdot [2p_2+1])^k. \quad (10)$$

This complexity increases exponentially when the order is increased, thus making a higher order ($> 3$) Volterra Filter implementation impractical. To alleviate this limitation, we use a cascade of 2nd order Volterra Filters, wherein the second order filter is repeatedly applied until the desired order is attained. A $K$th order filter is realized by applying the 2nd order filter $\mathcal{Z}$ times, where $K = 2^{2^{(\mathcal{Z}-1)}}$. If the length of the first filter in the cascade is $L_1$, the input video $X_{[t-L+1:t]}$ can be viewed as a concatenation of a set of shorter videos,

$$X_{[t_L:t]} = \big[X_{[t_L:t_L+L_1]}\;\; X_{[t_L+L_1:t_L+2L_1]}\;\; \ldots\;\; X_{[t_L+(M_1-1)L_1:t_L+M_1L_1]}\big], \quad (11)$$

where $M_1 = \frac{L}{L_1}$, and $t_L = t - L + 1$. Now, a 2nd order filter $g_1(.)$ applied on each of the sub-videos leads to the features,

$$F^1_{t[1:M_1]} = \big[g_1(X_{[t_L:t_L+L_1]})\;\; g_1(X_{[t_L+L_1:t_L+2L_1]})\;\; \ldots\;\; g_1(X_{[t_L+(M_1-1)L_1:t_L+M_1L_1]})\big]. \quad (12)$$

A second filter $g_2(.)$ of length $L_2$ is then applied to the output of the first filter,

$$F^2_{t[1:M_2]} = \big[g_2(F^1_{t[1:L_2]})\;\; g_2(F^1_{t[L_2+1:2L_2]})\;\; \ldots\;\; g_2(F^1_{t[(M_2-1)L_2+1:M_2L_2]})\big], \quad (13)$$

where $M_2 = \frac{M_1}{L_2}$. Note that the features in the second layer are generated by taking quadratic interactions between those generated by the first layer, hence leading to 4th order terms.

Finally, for a cascade of $\mathcal{Z}$ filters, the final set of features is obtained as,

$$F^{\mathcal{Z}}_{t[1:M_{\mathcal{Z}}]} = \big[g_{\mathcal{Z}}(F^{\mathcal{Z}-1}_{t[1:L_{\mathcal{Z}}]})\;\; g_{\mathcal{Z}}(F^{\mathcal{Z}-1}_{t[L_{\mathcal{Z}}+1:2L_{\mathcal{Z}}]})\;\; \ldots\;\; g_{\mathcal{Z}}(F^{\mathcal{Z}-1}_{t[(M_{\mathcal{Z}}-1)L_{\mathcal{Z}}+1:M_{\mathcal{Z}}L_{\mathcal{Z}}]})\big], \quad (14)$$

where $M_{\mathcal{Z}} = \frac{M_{\mathcal{Z}-1}}{L_{\mathcal{Z}}}$.

Note that these filters can also be implemented in an overlapping fashion, leading to the following features for the $z$th layer, $z \in \{1, \ldots, \mathcal{Z}\}$,

$$F^z_{t[1:M_z]} = \big[g_z(F^{z-1}_{t[1:L_z]})\;\; g_z(F^{z-1}_{t[2:L_z+1]})\;\; \ldots\;\; g_z(F^{z-1}_{t[M_{z-1}-L_z+1:M_{z-1}]})\big], \quad (15)$$

where $M_z = M_{z-1} - L_z + 1$. The implementation of an Overlapping Volterra Neural Network (O-VNN) to find the corresponding feature maps for an input video is shown in Figure 3.
Proposition 1. The complexity of a $K$th order cascaded Volterra filter will consist of

$$\sum_{z=1}^{\mathcal{Z}} \big[(L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1]) + (L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1])^2\big] \quad (16)$$

parameters.

Proof. For a 2nd order filter ($K = 2$), the number of parameters required is $\big[(L \cdot [2p_1+1] \cdot [2p_2+1]) + (L \cdot [2p_1+1] \cdot [2p_2+1])^2\big]$ (from Equation 10). When such a filter is repeatedly applied $\mathcal{Z}$ times, it leads to $\sum_{z=1}^{\mathcal{Z}} \big[(L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1]) + (L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1])^2\big]$ parameters with order $K = 2^{2^{(\mathcal{Z}-1)}}$.

Furthermore, if a multi-channel input/output is considered, the number of parameters is,

$$\sum_{z=1}^{\mathcal{Z}} (n^{z-1}_{ch} \cdot n^z_{ch}) \big[(L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1]) + (L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1])^2\big], \quad (17)$$

where $n^z_{ch}$ is the number of channels in the output of the $z$th layer.
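The savings promised by the cascade can be checked directly by comparing the counts of Equation 10 (single Kth-order filter) against Equation 16 (cascade of quadratic filters); the helper names and toy sizes below are ours.

```python
def direct_params(L, p1, p2, K):
    """Equation 10: parameters of a single Kth-order Volterra filter."""
    n = L * (2 * p1 + 1) * (2 * p2 + 1)
    return sum(n ** k for k in range(1, K + 1))

def cascade_params(layers):
    """Equation 16: parameters of a cascade of 2nd-order filters.

    layers: list of (L_z, p1_z, p2_z) tuples, one per layer.
    """
    total = 0
    for L, p1, p2 in layers:
        n = L * (2 * p1 + 1) * (2 * p2 + 1)
        total += n + n ** 2
    return total

# A 4th-order filter with L = 2, p1 = p2 = 1, built directly vs. as a two-layer cascade:
direct = direct_params(2, 1, 1, 4)                # grows as n^K
cascade = cascade_params([(2, 1, 1), (2, 1, 1)])  # grows only as Z * n^2
```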
4.3 System Stability

The discussed system can be shown to be stable when the input is bounded, i.e. the system is Bounded Input Bounded Output (BIBO) stable.

Proposition 2. An O-VNN with $\mathcal{Z}$ layers is BIBO stable if $\forall z \in \{1, \ldots, \mathcal{Z}\}$,

$$\sum_{\tau_1,\tau_{11},\tau_{21}} \Big|W^{z1}_{[\tau_1,\tau_{11},\tau_{21}]}\Big| + \sum_{\substack{\tau_1,\tau_{11},\tau_{21}\\ \tau_2,\tau_{12},\tau_{22}}} \Big|W^{z2}_{[\tau_1,\tau_{11},\tau_{21}][\tau_2,\tau_{12},\tau_{22}]}\Big| < \infty. \quad (18)$$

Proof. Consider the $z$th layer in the cascaded implementation of the Volterra Filter,

$$F^z_{[1:M_z]} = \big[g_z(F^{z-1}_{t[1:L_z]})\;\; g_z(F^{z-1}_{t[2:L_z+1]})\;\; \ldots\;\; g_z(F^{z-1}_{t[M_{z-1}-L_z+1:M_{z-1}]})\big], \quad (19)$$

where $M_z = M_{z-1} - L_z + 1$. Then for $m_z \in \{1, \ldots, M_z\}$,

$$\Big|F^z_{[m_z, s_1, s_2]}\Big| = \Big|g_z\big(F^{z-1}_{[m_z-L_z+1:m_z,\; s_1-p_1:s_1+p_1,\; s_2-p_2:s_2+p_2]}\big)\Big| \quad (20)$$

$$= \Big|\sum_{\tau_1,\tau_{11},\tau_{21}} W^{z1}_{[\tau_1,\tau_{11},\tau_{21}]}\, f^{z-1}_{[(L_z+m_z-1)-\tau_1,\; s_1-\tau_{11},\; s_2-\tau_{21}]} + \sum_{\substack{\tau_1,\tau_{11},\tau_{21}\\ \tau_2,\tau_{12},\tau_{22}}} W^{z2}_{[\tau_1,\tau_{11},\tau_{21}][\tau_2,\tau_{12},\tau_{22}]}\, f^{z-1}_{[(L_z+m_z-1)-\tau_1,\; s_1-\tau_{11},\; s_2-\tau_{21}]}\, f^{z-1}_{[(L_z+m_z-1)-\tau_2,\; s_1-\tau_{12},\; s_2-\tau_{22}]}\Big| \quad (21)$$

$$\le \sum_{\tau_1,\tau_{11},\tau_{21}} \Big|W^{z1}_{[\tau_1,\tau_{11},\tau_{21}]}\Big|\, \Big|f^{z-1}_{[(L_z+m_z-1)-\tau_1,\; s_1-\tau_{11},\; s_2-\tau_{21}]}\Big| + \sum_{\substack{\tau_1,\tau_{11},\tau_{21}\\ \tau_2,\tau_{12},\tau_{22}}} \Big|W^{z2}_{[\tau_1,\tau_{11},\tau_{21}][\tau_2,\tau_{12},\tau_{22}]}\Big|\, \Big|f^{z-1}_{[(L_z+m_z-1)-\tau_1,\; s_1-\tau_{11},\; s_2-\tau_{21}]}\Big|\, \Big|f^{z-1}_{[(L_z+m_z-1)-\tau_2,\; s_1-\tau_{12},\; s_2-\tau_{22}]}\Big| \quad (22)$$

$$\le A \sum_{\tau_1,\tau_{11},\tau_{21}} \Big|W^{z1}_{[\tau_1,\tau_{11},\tau_{21}]}\Big| + A^2 \sum_{\substack{\tau_1,\tau_{11},\tau_{21}\\ \tau_2,\tau_{12},\tau_{22}}} \Big|W^{z2}_{[\tau_1,\tau_{11},\tau_{21}][\tau_2,\tau_{12},\tau_{22}]}\Big|. \quad (23)$$

Note that Equation 23 uses the fact that a bounded input yields $\big|f^{z-1}_{[(L_z+m_z-1)-\tau_1,\; s_1-\tau_{11},\; s_2-\tau_{21}]}\big| \le A$, for some $A < \infty$. Hence, the sufficient condition for the system to be BIBO stable is,

$$\sum_{\tau_1,\tau_{11},\tau_{21}} \Big|W^{z1}_{[\tau_1,\tau_{11},\tau_{21}]}\Big| + \sum_{\substack{\tau_1,\tau_{11},\tau_{21}\\ \tau_2,\tau_{12},\tau_{22}}} \Big|W^{z2}_{[\tau_1,\tau_{11},\tau_{21}][\tau_2,\tau_{12},\tau_{22}]}\Big| < \infty. \quad (24)$$

If the input data (i.e. video frames) is bounded, so is the output of each layer provided that Equation 24 is satisfied $\forall z \in \{1, \ldots, \mathcal{Z}\}$, making the entire system BIBO stable.
4.4 Synthesis and Implementation of Volterra Kernels

As noted earlier, the linear kernel (1st order) of the Volterra filter is similar to the convolutional layer in conventional CNNs. As a result, it can be easily implemented using the 3D convolution function in Tensorflow (Abadi et al. 2016). The second order kernel may be approximated as a product of two 3-dimensional matrices (i.e. a separable operator),

$$W^2_{L \times P_1 \times P_2 \times L \times P_1 \times P_2} = \sum_{q=1}^{Q} W^{2a_q}_{L \times P_1 \times P_2 \times 1}\, W^{2b_q}_{1 \times L \times P_1 \times P_2}, \quad (25)$$

where $P_1 = 2p_1 + 1$, and $P_2 = 2p_2 + 1$. Taking account of Equation 7 yields,

$$g\big(X_{[t-L+1:t,\; s_1-p_1:s_1+p_1,\; s_2-p_2:s_2+p_2]}\big) = \sum_{\tau_1,\tau_{11},\tau_{21}} W^1_{[\tau_1,\tau_{11},\tau_{21}]}\, x_{[t-\tau_1,\, s_1-\tau_{11},\, s_2-\tau_{21}]} + \sum_{\substack{\tau_1,\tau_{11},\tau_{21}\\ \tau_2,\tau_{12},\tau_{22}}} \sum_{q=1}^{Q} W^{2a_q}_{[\tau_1,\tau_{11},\tau_{21}]} W^{2b_q}_{[\tau_2,\tau_{12},\tau_{22}]}\, x_{[t-\tau_1,\, s_1-\tau_{11},\, s_2-\tau_{21}]}\, x_{[t-\tau_2,\, s_1-\tau_{12},\, s_2-\tau_{22}]} \quad (26)$$

$$= \sum_{\tau_1,\tau_{11},\tau_{21}} W^1_{[\tau_1,\tau_{11},\tau_{21}]}\, x_{[t-\tau_1,\, s_1-\tau_{11},\, s_2-\tau_{21}]} + \sum_{q=1}^{Q} \sum_{\substack{\tau_1,\tau_{11},\tau_{21}\\ \tau_2,\tau_{12},\tau_{22}}} \big(W^{2a_q}_{[\tau_1,\tau_{11},\tau_{21}]}\, x_{[t-\tau_1,\, s_1-\tau_{11},\, s_2-\tau_{21}]}\big)\big(W^{2b_q}_{[\tau_2,\tau_{12},\tau_{22}]}\, x_{[t-\tau_2,\, s_1-\tau_{12},\, s_2-\tau_{22}]}\big). \quad (27)$$

A larger $Q$ will provide a better approximation of the 2nd order kernel. The advantage of this class of approximation is two-fold. Firstly, the number of parameters can be further reduced if, for the $z$th layer, $(L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1])^2 > 2Q(L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1])$. The trade-off between performance and available computational resources must be accounted for when opting for such an approximation. Additionally, this makes it easier to implement the higher order kernels in Tensorflow (Abadi et al. 2016) by using the built-in convolutional operator.

The approximate quadratic layers in the Cascaded Volterra Filter (see Figure 3) yield the following number of parameters,

$$\sum_{z=1}^{\mathcal{Z}} \big[(L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1]) + 2Q(L_z \cdot [2p_{1_z}+1] \cdot [2p_{2_z}+1])\big]. \quad (28)$$

We will evaluate and compare both approaches to implementing the second order kernel (i.e. the approximation and the exact method) in Section 5.
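The separable approximation of Equations 25-27 can be checked numerically on a flattened patch: if $W^2$ is exactly a sum of Q outer products, the quadratic term equals a sum of products of two linear filter responses. Function names below are ours.

```python
import numpy as np

def quadratic_exact(x, W2):
    """Exact 2nd-order term sum_{i,j} W2[i, j] x_i x_j over a flattened patch x."""
    return x @ W2 @ x

def quadratic_separable(x, Wa, Wb):
    """Equations 25-27: with W2 ~ sum_q outer(Wa[q], Wb[q]), the quadratic term
    becomes sum_q (Wa[q] . x) * (Wb[q] . x), i.e. products of linear filter outputs."""
    return sum((Wa[q] @ x) * (Wb[q] @ x) for q in range(len(Wa)))

rng = np.random.default_rng(0)
n, Q = 6, 2
Wa, Wb = rng.standard_normal((Q, n)), rng.standard_normal((Q, n))
x = rng.standard_normal(n)
W2 = sum(np.outer(Wa[q], Wb[q]) for q in range(Q))  # rank-Q kernel: exact agreement
```

For a generic full-rank $W^2$ the identity holds only approximately, which is exactly the performance/parameter trade-off governed by Q.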
4.5 Two-Stream Volterra Networks

Most previous studies of action recognition in videos have noted the importance of using both the spatial and the temporal information for an improved recognition accuracy. As a result, we also propose the use of Volterra filtering to combine the two information streams, exploring a potential non-linear relationship between them. In Section 5 we will see that this actually boosts the performance, thereby indicating some non-linear relation between the two information streams. Independent Cascaded Volterra Filters are first used in order to extract features from each modality,

$$F^{\mathcal{Z}RGB}_{[1:M_{\mathcal{Z}}]} = g^{RGB}_{\mathcal{Z}}(\ldots g^{RGB}_2(g^{RGB}_1(X^{RGB}_{[t-L+1:t]}))) \quad (29)$$

$$F^{\mathcal{Z}OF}_{[1:M_{\mathcal{Z}}]} = g^{OF}_{\mathcal{Z}}(\ldots g^{OF}_2(g^{OF}_1(X^{OF}_{[t-L+1:t]}))). \quad (30)$$

Upon gleaning features from the two streams, an additional Volterra Filter is solely used for combining the generated feature maps from both modalities,

$$F^{(RGB+OF)}_t = \sum_{\tau_1,\tau_{11},\tau_{21},u_1} W^1_{[\tau_1,\tau_{11},\tau_{21},u_1]}\, f^{\mathcal{Z}u_1}_{[M_{\mathcal{Z}}-\tau_1,\; s_1-\tau_{11},\; s_2-\tau_{21}]} + \sum_{\substack{\tau_1,\tau_{11},\tau_{21},u_1\\ \tau_2,\tau_{12},\tau_{22},u_2}} W^2_{[\tau_1,\tau_{11},\tau_{21}][\tau_2,\tau_{12},\tau_{22}]}\, f^{\mathcal{Z}u_1}_{[M_{\mathcal{Z}}-\tau_1,\; s_1-\tau_{11},\; s_2-\tau_{21}]}\, f^{\mathcal{Z}u_2}_{[M_{\mathcal{Z}}-\tau_2,\; s_1-\tau_{12},\; s_2-\tau_{22}]}, \quad (31)$$

where $\tau_j \in [0, L_{\mathcal{Z}+1}]$, $\tau_{1j} \in [-p_1, p_1]$, $\tau_{2j} \in [-p_2, p_2]$, and $u_j \in \{RGB, OF\}$. Figure 6-(c) shows the block diagram for fusing the two information streams.
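The fusion layer of Equation 31 amounts to linear plus pairwise interactions over the stacked per-stream features, including RGB-with-Optical-Flow cross terms. The vectorized sketch below (our naming) suppresses the spatial indices for clarity:

```python
import numpy as np

def fuse_streams(f_rgb, f_of, W1, W2):
    """Quadratic fusion in the spirit of Equation 31.

    f_rgb, f_of: (n,) feature vectors from the two cascades (the index u
    in Equation 31 selects the stream); W1: (2n,), W2: (2n, 2n).
    """
    f = np.concatenate([f_rgb, f_of])
    # Off-diagonal blocks of W2 couple the RGB features with the Optical Flow ones.
    return W1 @ f + f @ W2 @ f

f_rgb, f_of = np.array([1.0, 2.0]), np.array([0.5, -1.0])
W1 = np.zeros(4)
W2 = np.zeros((4, 4))
W2[0, 2] = 1.0                  # a single cross-modality term: f_rgb[0] * f_of[0]
fused = fuse_streams(f_rgb, f_of, W1, W2)
```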
5 Experiments and Results

We proceed to evaluate the performance of this approach on two action recognition datasets, namely UCF-101 (Soomro, Zamir, and Shah 2012) and HMDB-51 (Kuehne et al. 2011); the comparison of the results with recent state of the art implementations is given in Tables 1 and 2. Table 1 shows the comparison with techniques that only exploit the RGB stream, while Table 2 shows the comparison when both information streams are used. Note that our comparable performance to the state of the art is achieved with a significantly lower number of parameters (see Table 4). Furthermore, a significant boost in performance is achieved by allowing non-linear interaction between the two information streams. The Optical Flow is computed using the TV-L1 algorithm (Zach, Pock, and Bischof 2007). Note that we train the network from scratch on both datasets, and do not use a larger dataset for pre-training, in contrast to some of the previous implementations. The implementations that take advantage of a different dataset for pre-training are indicated by a 'Y' in the pre-training column, while those that do not are indicated by 'N'. When training from scratch, the proposed solution is able to achieve the best performance in both scenarios: one stream networks (RGB frames only) and two-stream networks (RGB frames & Optical Flow). To fuse the two information streams (spatial and temporal), we evaluate the following techniques:

1. Decision Level Fusion (Figure 6-(a)): The decision probabilities $P^{RGB}_t(a_i)$ and $P^{OF}_t(a_i)$ are independently computed and are combined into a fused probability $P^f_t(a_i)$ using (a) Weighted Averaging: $P^f_t(a_i) = \beta^{RGB} P^{RGB}_t(a_i) + \beta^{OF} P^{OF}_t(a_i)$, where $\beta^{RGB} + \beta^{OF} = 1$; the weights control the importance/contribution of the RGB and Optical Flow streams towards making a final decision; or (b) Event Driven Fusion (Roheda et al. 2018a; Roheda et al. 2019): $P^f_t(a_i) = \gamma P^{MAX\,MI}_t(a^{RGB}_i, a^{OF}_i) + (1-\gamma) P^{MIN\,MI}_t(a^{RGB}_i, a^{OF}_i)$, where $\gamma$ is a pseudo measure of correlation between the two information streams, $P^{MAX\,MI}_t(.)$ is the joint distribution with maximal mutual information, and $P^{MIN\,MI}_t(.)$ is the joint distribution with minimal mutual information.

2. Feature Level Fusion: Features are extracted from each stream independently, and are subsequently merged before making a decision. For this level of fusion we consider a simple Feature Concatenation (see Figure 6-(b)), and Two-Stream Volterra Filtering (see Section 4.5, Figure 6-(c)).
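The weighted-averaging variant of decision-level fusion (technique 1a) is a convex combination of the two per-stream probability vectors; the default weight below is an arbitrary illustration, not the paper's tuned value.

```python
import numpy as np

def weighted_average_fusion(p_rgb, p_of, beta_rgb=0.5):
    """Decision-level fusion: P^f = beta_RGB * P^RGB + beta_OF * P^OF,
    with beta_RGB + beta_OF = 1 so the fused vector is still a distribution."""
    return beta_rgb * p_rgb + (1.0 - beta_rgb) * p_of

fused = weighted_average_fusion(np.array([0.7, 0.3]), np.array([0.2, 0.8]))
```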
Figure 4: (a): Input Video, (b): Features extracted by only the RGB stream, (c): Features extracted by Two-Stream Volterra Filtering
Figure 5: (a): Input Video, (b): Features extracted by only the RGB stream, (c): Features extracted by Two-Stream Volterra Filtering
The techniques are summarized in Figure 6. In our implementation we use an O-VNN with 8 layers on both the RGB
Figure 6: (a): Decision Level Fusion, (b): Feature Concatenation, (c): Two-Stream Volterra Filtering
Method | Pre-Training | Avg Accuracy UCF-101 | Avg Accuracy HMDB-51
Slow Fusion (Karpathy et al. 2014) | Y (Sports-1M) | 64.1 % | -
Deep Temporal Linear Encoding Networks (Diba, Sharma, and Van Gool 2017) | Y (Sports-1M) | 86.3 % | 60.3 %
Inflated 3D CNN (Carreira and Zisserman 2017) | Y (ImageNet + Kinetics) | 95.1 % | 74.3 %
Soomro et al. 2012 | N | 43.9 % | -
Single Frame CNN (Karpathy et al. 2014; Krizhevsky, Sutskever, and Hinton 2012) | N | 36.9 % | -
Slow Fusion (Karpathy et al.; Baccouche et al.; Ji et al.) | N | 41.3 % | -
3D-ConvNet (Carreira and Zisserman 2017; Tran et al. 2015) | N | 51.6 % | 24.3 %
Volterra Filter | N | 38.19 % | 18.76 %
O-VNN (exact) | N | 58.73 % | 29.33 %
O-VNN (Q=7) | N | 53.77 % | 25.76 %

Table 1: Performance evaluation for one-stream networks (RGB only): The proposed algorithm achieves the best performance when trained from scratch
Method | Pre-Training | Avg Accuracy UCF-101 | Avg Accuracy HMDB-51
Two-Stream CNNs (Simonyan and Zisserman 2014) | Y (ILSVRC-2012) | 88.0 % | 72.7 %
Deep Temporal Linear Encoding Networks (Diba, Sharma, and Van Gool 2017) | Y (BN-Inception + ImageNet) | 95.6 % | 71.1 %
Two Stream Inflated 3D CNN (Carreira and Zisserman 2017) | Y (ImageNet + Kinetics) | 98.0 % | 80.9 %
Two-Stream O-VNN (Q=15) | Y (Kinetics) | 98.49 % | 82.63 %
Two Stream Inflated 3D CNN (Carreira and Zisserman 2017) | N | 88.8 % | 62.2 %
Weighted Averaging: O-VNN (exact) | N | 85.79 % | 59.13 %
Weighted Averaging: O-VNN (Q=7) | N | 81.53 % | 55.67 %
Event Driven Fusion: O-VNN (exact) | N | 85.21 % | 60.36 %
Event Driven Fusion: O-VNN (Q=7) | N | 80.37 % | 57.89 %
Feature Concatenation: O-VNN (exact) | N | 82.31 % | 55.88 %
Feature Concatenation: O-VNN (Q=7) | N | 78.79 % | 51.08 %
Two-Stream O-VNN (exact) | N | 90.28 % | 65.61 %
Two-Stream O-VNN (Q=7) | N | 86.16 % | 62.45 %

Table 2: Performance evaluation for two-stream networks (RGB & Optical Flow): The proposed algorithm achieves the best performance on both datasets.
stream and the optical stream. Each layer uses $L_z = 2$ and $p_{1_z}, p_{2_z} \in \{0, 1, 2\}$. The outputs of the two filters are fed into the fusion layer, which combines the two streams. The fusion layer uses $L_{Fuse} = 2$ and $p_{1_{Fuse}}, p_{2_{Fuse}} \in \{0, 1, 2\}$. It is clear from Table 2 that performing fusion using Volterra Filters significantly boosts the performance of the system. This shows that there does exist a non-linear relationship between the two modalities. This can also be confirmed by the significant values in the weights of the fusion layer (see Table 3).

 | u = RGB | u = OF | u = Fusion
$\frac{1}{2}\|W^u\|_2^2$ | 352.15 | 241.2 | 341.3

Table 3: Norm of $W^u$, where $u \in \{RGB, OF, Fusion\}$.

Figure 7: Epochs vs Loss for various numbers of multipliers for a Cascaded Volterra Filter (Q = 1, 5, 7, and exact)

Figures 4 and 5 each show one of the feature maps for an archery video and a fencing video. From Figures 4, 5-(b),(c) it can be seen that when only the RGB stream is used, a lot of the background area has high values, while on the other hand, when both streams are jointly used, the system is able to concentrate on more relevant features. In 4-(c), the system is seen to concentrate on the bow and arrow, which are central to recognizing the action, while in 5-(c) the system is seen to concentrate on the pose of the human, which is central to identifying a fencing action. Figure 7 shows the Epochs vs Loss graph for a Cascaded Volterra Filter when different numbers of multipliers ($Q$) are used to approximate the 2nd order kernel. The green plot shows the loss when the exact kernel is learned, and it can be seen that the performance approaches that of the exact kernel as $Q$ is increased.
6 Conclusion

We proposed a novel network architecture for the recognition of actions in videos, where the non-linearities are introduced by the Volterra Series formulation. We proposed a Cascaded Volterra Filter, which leads to a significant reduction in parameters while exploring the same complexity of non-linearities in the data. Such a Cascaded Volterra Filter was also shown to be a BIBO stable system. In addition, we also proposed the use of the Volterra Filter to fuse the spatial and temporal streams, hence leading to a non-linear fusion of the two streams. The network architecture inspired by Volterra Filters achieves performance accuracies which are comparable to the state of the art approaches while using a fraction of the number of parameters used by them. This makes such an architecture very attractive from a training point of view, with an added advantage of reduced storage space to save the model.

Method | Number of Parameters | Processing Speed (secs/video)
Deep Temporal Linear Encoding | 22.6M | -
Inflated 3D CNN | 25M | 3.52
O-VNN (exact) | 4.6M | 0.73
O-VNN (Q=7) | 3.7M | 0.61
Two-Stream O-VNN (exact) | 10.1M | 1.88
Two-Stream O-VNN (Q=7) | 8.2M | 1.54
Two-Stream O-VNN (Q=1) | 2.5M | 0.34

Table 4: Comparison of the number of parameters required and the processing speed with the state of the art. A video with 60 frames is evaluated.
References

[Abadi et al. 2016] Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. 2016. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 16), 265–283.

[Baccouche et al. 2011] Baccouche, M.; Mamalet, F.; Wolf, C.; Garcia, C.; and Baskurt, A. 2011. Sequential deep learning for human action recognition. In International Workshop on Human Behavior Understanding, 29–39. Springer.

[Carreira and Zisserman 2017] Carreira, J., and Zisserman, A. 2017. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6299–6308.

[Dalal and Triggs 2005] Dalal, N., and Triggs, B. 2005. Histograms of oriented gradients for human detection.

[Deng et al. 2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE.

[Diba, Sharma, and Van Gool 2017] Diba, A.; Sharma, V.; and Van Gool, L. 2017. Deep temporal linear encoding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2329–2338.

[Farabet et al. 2012] Farabet, C.; Couprie, C.; Najman, L.; and LeCun, Y. 2012. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(8):1915–1929.

[Feichtenhofer, Pinz, and Zisserman 2016] Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2016. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1933–1941.

[Gao et al. 2016] Gao, Y.; Beijbom, O.; Zhang, N.; and Darrell, T. 2016. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 317–326.

[Goodfellow et al. 2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.

[Hoffman, Gupta, and Darrell 2016] Hoffman, J.; Gupta, S.; and Darrell, T. 2016. Learning with side information through modality hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 826–834.

[Isola et al. 2017] Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1125–1134.

[Ji et al. 2012] Ji, S.; Xu, W.; Yang, M.; and Yu, K. 2012. 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1):221–231.

[Karpathy et al. 2014] Karpathy, A.; Toderici, G.; Shetty, S.; Leung, T.; Sukthankar, R.; and Fei-Fei, L. 2014. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1725–1732.

[Kay et al. 2017] Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. 2017. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.

[Kong and Fu 2018] Kong, Y., and Fu, Y. 2018. Human action recognition and prediction: A survey. arXiv preprint arXiv:1806.11230.

[Krizhevsky, Sutskever, and Hinton 2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

[Kuehne et al. 2011] Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; and Serre, T. 2011. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, 2556–2563. IEEE.

[Kumar et al. 2011] Kumar, R.; Banerjee, A.; Vemuri, B. C.; and Pfister, H. 2011. Trainable convolution filters and their application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(7):1423–1436.
[LeCun et al. 1998] LeCun, Y.; Bottou, L.; Bengio, Y.;Haffner, P.; et al. 1998. Gradient-based learning ap-
plied to document recognition. Proceedings of the IEEE86(11):2278โ2324.
[Lin, RoyChowdhury, and Maji 2015] Lin, T.-Y.; Roy-Chowdhury, A.; and Maji, S. 2015. Bilinear cnnsfor fine-grained visual recognition. arXiv preprintarXiv:1504.07889.
[Liu, Luo, and Shah 2009] Liu, J.; Luo, J.; and Shah, M.2009. Recognizing realistic actions from videos in the wild.Citeseer.
[Niebles, Chen, and Fei-Fei 2010] Niebles, J. C.; Chen, C.-W.; and Fei-Fei, L. 2010. Modeling temporal structureof decomposable motion segments for activity classifica-tion. In European conference on computer vision, 392โ405.Springer.
[Osowski and Quang 1994] Osowski, S., and Quang, T. V.1994. Multilayer neural network structure as volterra fil-ter. In Proceedings of IEEE International Symposium onCircuits and Systems-ISCASโ94, volume 6, 253โ256. IEEE.
[Roheda et al. 2018a] Roheda, S.; Krim, H.; Luo, Z.-Q.; andWu, T. 2018a. Decision level fusion: An event driven ap-proach. In 2018 26th European Signal Processing Confer-ence (EUSIPCO), 2598โ2602. IEEE.
[Roheda et al. 2018b] Roheda, S.; Riggan, B. S.; Krim, H.;and Dai, L. 2018b. Cross-modality distillation: A case forconditional generative adversarial networks. In 2018 IEEEInternational Conference on Acoustics, Speech and SignalProcessing (ICASSP), 2926โ2930. IEEE.
[Roheda et al. 2019] Roheda, S.; Krim, H.; Luo, Z.-Q.; andWu, T. 2019. Event driven fusion. arXiv preprintarXiv:1904.11520.
[Schetzen 1980] Schetzen, M. 1980. The volterra and wienertheories of nonlinear systems.
[Sermanet et al. 2013] Sermanet, P.; Eigen, D.; Zhang, X.;Mathieu, M.; Fergus, R.; and LeCun, Y. 2013. Overfeat:Integrated recognition, localization and detection using con-volutional networks. arXiv preprint arXiv:1312.6229.
[Simonyan and Zisserman 2014] Simonyan, K., and Zisser-man, A. 2014. Two-stream convolutional networks for ac-tion recognition in videos. In Advances in neural informa-tion processing systems, 568โ576.
[Sivic and Zisserman 2003] Sivic, J., and Zisserman, A.2003. Video google: A text retrieval approach to objectmatching in videos. In null, 1470. IEEE.
[Soomro, Zamir, and Shah 2012] Soomro, K.; Zamir, A. R.;and Shah, M. 2012. Ucf101: A dataset of 101 humanactions classes from videos in the wild. arXiv preprintarXiv:1212.0402.
[Tran et al. 2015] Tran, D.; Bourdev, L.; Fergus, R.; Torre-sani, L.; and Paluri, M. 2015. Learning spatiotemporal fea-tures with 3d convolutional networks. In Proceedings of theIEEE international conference on computer vision, 4489โ4497.
[Volterra 2005] Volterra, V. 2005. Theory of functionals andof integral and integro-differential equations. Courier Cor-poration.
[Wang et al. 2009] Wang, H.; Ullah, M. M.; Klaser, A.;Laptev, I.; and Schmid, C. 2009. Evaluation of local spatio-temporal features for action recognition.
[Yi, Krim, and Norris 2011] Yi, S.; Krim, H.; and Norris,L. K. 2011. Human activity modeling as brownian motionon shape manifold. In International Conference on ScaleSpace and Variational Methods in Computer Vision, 628โ639. Springer.
[Zach, Pock, and Bischof 2007] Zach, C.; Pock, T.; andBischof, H. 2007. A duality based approach for realtimetv-l 1 optical flow. In Joint pattern recognition symposium,214โ223. Springer.
[Zoumpourlis et al. 2017] Zoumpourlis, G.; Doumanoglou,A.; Vretos, N.; and Daras, P. 2017. Non-linear convolutionfilters for cnn-based learning. In Proceedings of the IEEEInternational Conference on Computer Vision, 4761โ4769.