Deep Learning for Computer Vision Pr. Jenny Benois-Pineau ...benois-p/DeepLearningCompVisionIPCV/Co… · Deep Learning for Computer Vision Pr. Jenny Benois-Pineau LABRI UMR 5800/Université

Deep Learning for Computer Vision
Pr. Jenny Benois-Pineau, LABRI UMR 5800/Université Bordeaux
Chapter 6. Temporal aspects. Applications

Summary

1.  Temporal aspects: RNN, LSTM
2.  Applications: 3D Conv. nets…

Video Analysis & Coding/Computer Vision 2

1. RNN

➔  Recurrent neural networks (RNNs) are a family of neural networks for processing sequential data.

➔  Formally, an RNN is a neural network specialized for processing a sequence of values.

➔  Advantage: parameters are shared across different parts of the model (the same parameters are applied to the observations at every time step).

➔  We consider an RNN operating on a sequence of vectors.

➔  In practice, RNNs usually operate on minibatches of such sequences.


$x^{(1)}, x^{(2)}, \ldots, x^{(\tau)}$

Rumelhart, D.E., McClelland, J.L., and the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge.

The idea of computational graph unfolding

➔  Consider the classical form of a dynamical system (in computer vision, e.g. the dynamic model of a moving object in a video sequence, for instance with constant velocity):

$$s^{(t)} = f(s^{(t-1)}; \theta)$$

➔  $s^{(t)}$ is called the state of the system.

➔  The equation is recurrent: the state at time $t$ is defined from the state at time $t-1$.

Unfolding

➔  For a finite number of time steps, e.g. 3:

$$s^{(3)} = f(s^{(2)}; \theta) = f(f(s^{(1)}; \theta); \theta)$$

➔  Unfolding the equation by repeatedly applying the definition in this way yields an expression that does not involve recurrence.

[Figure: unfolded graph $s^{(\ldots)} \to s^{(t-1)} \to s^{(t)} \to s^{(t+1)} \to s^{(\ldots)}$, each arrow applying $f$]
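The unfolding can be checked numerically. The sketch below assumes the constant-velocity example mentioned earlier, with a state made of (position, velocity) and $\theta$ taken to be the time step; the names `f`, `unfold` and the numeric values are illustrative choices, not part of the course material.

```python
import numpy as np

def f(s, theta):
    """One transition of a constant-velocity model; s = (position, velocity)."""
    dt = theta                       # assumption: theta is the time step
    position, velocity = s
    return np.array([position + dt * velocity, velocity])

def unfold(s1, theta, tau):
    """Return s^(tau) by applying f exactly tau - 1 times to s^(1)."""
    s = np.asarray(s1, dtype=float)
    for _ in range(tau - 1):
        s = f(s, theta)
    return s

s1 = np.array([0.0, 2.0])            # position 0, velocity 2
s3_loop = unfold(s1, theta=0.5, tau=3)
s3_nested = f(f(s1, 0.5), 0.5)       # s^(3) = f(f(s^(1); theta); theta)
print(s3_loop, s3_nested)
```

Both expressions give the same state: the loop is the recurrent form, the nested call is the unfolded one.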

RNN as


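➔  The equation using an external signal $x^{(t)}$ ($h$ is the state):

$$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$$

It is possible to use the same transition function $f$, with the same parameters $\theta$, at every time step.

[Figure: unfolded graph $h^{(\ldots)} \to h^{(t-1)} \to h^{(t)} \to h^{(t+1)} \to h^{(\ldots)}$, each transition applying $f$, with the input $x^{(t)}$ feeding $h^{(t)}$ at each step]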

And finally

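➔  Unfolded recurrent network with loss

[Figure: unfolded RNN. At each time step $t$, the input $x^{(t)}$ feeds the hidden state $h^{(t)}$ through $U$, the previous state $h^{(t-1)}$ feeds $h^{(t)}$ through $W$, $h^{(t)}$ produces the output $o^{(t)}$ through $V$, and the loss $L^{(t)}$ compares $o^{(t)}$ with the target $y^{(t)}$]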

Equations of RNN

➔  Forward propagation:

$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$$

$$h^{(t)} = f(a^{(t)})$$

$$o^{(t)} = c + V h^{(t)}$$

➔  Parameter estimation: backpropagation and gradient descent

➔  Difficult to train
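The forward-propagation equations for $a^{(t)}$, $h^{(t)}$ and $o^{(t)}$ can be sketched in NumPy as follows. This is a minimal illustration: the activation $f$ is assumed to be tanh, and the layer sizes, the random initialization and the name `rnn_forward` are hypothetical choices made for the example.

```python
import numpy as np

def rnn_forward(xs, h0, U, W, V, b, c, f=np.tanh):
    """Forward pass over a sequence xs of shape (tau, n_in)."""
    h = h0
    states, outputs = [], []
    for x in xs:
        a = b + W @ h + U @ x        # a(t) = b + W h(t-1) + U x(t)
        h = f(a)                     # h(t) = f(a(t)), f assumed to be tanh
        o = c + V @ h                # o(t) = c + V h(t)
        states.append(h)
        outputs.append(o)
    return np.stack(states), np.stack(outputs)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, tau = 3, 4, 2, 5      # illustrative sizes
U = rng.normal(scale=0.1, size=(n_hidden, n_in))
W = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
V = rng.normal(scale=0.1, size=(n_out, n_hidden))
b, c = np.zeros(n_hidden), np.zeros(n_out)
xs = rng.normal(size=(tau, n_in))

hs, outs = rnn_forward(xs, np.zeros(n_hidden), U, W, V, b, c)
print(hs.shape, outs.shape)
```

The same matrices U, W and V are reused at every time step, which is the parameter sharing that defines an RNN.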

LSTM – Long Short-Term Memory

➔  Gated RNNs

➔  The idea: creating paths through time that have derivatives that neither vanish nor explode

➔  Connection weights may change at each time step

➔  For video analysis, LSTMs have been mainly replaced by 3D convolutional neural networks

Hochreiter and Schmidhuber, 1997
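To see why derivatives along paths through time can vanish or explode, consider a simplified linear recurrence $h^{(t)} = W h^{(t-1)}$ (an assumption made only for this illustration: no input, no non-linearity). The derivative of $h^{(k+1)}$ with respect to $h^{(1)}$ is then $W^k$. The sketch below takes $W$ as a scaled orthogonal matrix so that the norm of $W^k$ is exactly the scale raised to the power $k$; the name `jacobian_norm` is hypothetical.

```python
import numpy as np

def jacobian_norm(W, steps):
    """Spectral norm of d h(steps+1) / d h(1) for the recurrence h(t) = W h(t-1)."""
    J = np.linalg.matrix_power(W, steps)
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # orthogonal: all singular values equal 1

vanishing = jacobian_norm(0.9 * Q, 50)   # equals 0.9**50, roughly 5e-3
exploding = jacobian_norm(1.1 * Q, 50)   # equals 1.1**50, roughly 1.2e2
preserved = jacobian_norm(Q, 50)         # stays at 1
print(vanishing, exploding, preserved)
```

Only the case where the norm stays close to 1 lets gradient information travel over many time steps, which is the behaviour gated architectures try to approach.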

Goal

Improve athletes' performances through tools for teachers and athletes

CBMI - September 6th, 2018 Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis

Goal


[Figure: input video along the time axis t; output: the stroke class, e.g. "Offensive Forehand Loop"]

-  Extract strokes in the temporal dimension

-  Classify the strokes

1 - A new dataset : TTStroke-21


-  129 videos at 120 fps
-  1 387 / 1 074 annotations before / after filtering for 20 classes
-  1 048 strokes + 272 negative samples extracted

[Figure panels: Acquisition, Annotation platform, Samples (TTStroke-21)]

2 - Related Work

-  Use of Dynamic Images [1]
-  Very deep 3D CNN [2]
-  Long-term Temporal Convolutions [3]

[1] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, “Action recognition with dynamic image networks,” CoRR, vol. abs/1612.00738, 2016.
[2] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CoRR, vol. abs/1705.07750, 2017.
[3] G. Varol, I. Laptev, and C. Schmid, “Long-term temporal convolutions for action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, 2018.


3 - Proposed method


Goal: good classification of the extracted strokes

-  Use of a deep learning model

-  Need for temporal and spatial segmentation

-  Data augmentation

Best accuracy: 91.4% against 43.1% for the state-of-the-art method [2]


3.a - Model Architecture


Siamese Spatio-Temporal Convolutional Neural Network

Input: (W,H,T) = (100,120,120)

Training:

-  Stochastic gradient descent
-  Cross-entropy loss: $\mathrm{loss}(x, \mathrm{class}) = -x[\mathrm{class}] + \log\left(\sum_j \exp(x[j])\right)$
-  Learning rate: 0.001 for the Siamese network and 0.01 for one branch
-  Nesterov momentum
-  Epochs: 2000
-  Momentum: 0.5, decreased to 0.1 and 0.05 at epochs 1000 and 1500
-  Datasets: training 70%, validation 20%, test 10%
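The cross-entropy formula used for training can be evaluated directly on a vector of class scores $x$. The sketch below is a plain NumPy illustration, not the training code of the model; the function name `cross_entropy` and the example scores are hypothetical, and the shift by the maximum score is a standard numerical-stability step that leaves the value unchanged.

```python
import numpy as np

def cross_entropy(x, target):
    """loss = -x[target] + log(sum_j exp(x[j])) for a vector of class scores x."""
    x = np.asarray(x, dtype=float)
    m = x.max()                                    # shift for numerical stability
    log_sum_exp = m + np.log(np.sum(np.exp(x - m)))
    return -x[target] + log_sum_exp

scores = np.array([2.0, 0.5, -1.0])                # illustrative scores for 3 classes
loss = cross_entropy(scores, target=0)
print(loss)
```

The result equals the negative log of the softmax probability assigned to the target class, so the loss approaches 0 when the target score dominates the others.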

3D convolutions*

* “IMAGE SUPER RESOLUTION KERAS” from impremedia.net
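As an illustration of what a 3D convolution computes on video, the naive sketch below slides a spatio-temporal kernel over a (T, H, W) volume. It is written as a cross-correlation without padding or stride, which is what deep learning frameworks usually call convolution. The function name `conv3d_valid`, the toy clip and the averaging kernel are assumptions for the example; the layer sizes of the actual model are not given here.

```python
import numpy as np

def conv3d_valid(volume, kernel):
    """Naive 'valid' 3D cross-correlation of a (T, H, W) volume with a (kt, kh, kw) kernel."""
    T, H, W = volume.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                # Each output value mixes kt successive frames and a kh x kw patch
                out[t, i, j] = np.sum(volume[t:t + kt, i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
clip = rng.normal(size=(8, 12, 12))          # toy clip: 8 frames of 12 x 12 pixels
kernel = np.ones((3, 3, 3)) / 27.0           # spatio-temporal averaging filter
features = conv3d_valid(clip, kernel)
print(features.shape)
```

Unlike a 2D convolution applied frame by frame, each output value depends on several successive frames, so the filter can respond to motion.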

3.b - Input Data


[Figure: from the original frame, motion estimation [4] gives the motion map and foreground estimation [5] gives the foreground map]

[4] C. Liu, “Beyond pixels: Exploring new representations and applications for motion analysis,” Ph.D. dissertation, Massachusetts Institute of Technology, May 2009.
[5] Z. Zivkovic and F. van der Heijden, “Efficient adaptive density estimation per image pixel for the task of background subtraction,” Pattern Recognition Letters, vol. 27, no. 7, pp. 773–780, 2006.


3.b - Input Data

Spatial segmentation using foreground motion

[Figure labels: $X_{max}$, $X_g$; final segmentation: $X_{roi}$]

Smoothing over the temporal dimension using a Gaussian kernel of size 40 and standard deviation 4.44.
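The temporal smoothing step can be sketched as a 1D Gaussian filter applied along time. The kernel size (40) and standard deviation (4.44) come from the slide; everything else is an assumption for illustration: that the smoothed quantity is a per-frame (x, y) region centre, the edge padding at the sequence borders, and the names `gaussian_kernel` and `smooth_trajectory`.

```python
import numpy as np

def gaussian_kernel(size=40, sigma=4.44):
    """Normalized 1D Gaussian kernel with `size` taps centred on the window."""
    t = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-0.5 * (t / sigma) ** 2)
    return k / k.sum()

def smooth_trajectory(centers, size=40, sigma=4.44):
    """Smooth an array of shape (T, 2) of per-frame (x, y) centres along time."""
    centers = np.asarray(centers, dtype=float)
    k = gaussian_kernel(size, sigma)
    left, right = size // 2, size - 1 - size // 2
    # Repeat the first and last values so the output keeps T samples
    padded = np.pad(centers, ((left, right), (0, 0)), mode="edge")
    return np.stack(
        [np.convolve(padded[:, d], k, mode="valid") for d in range(centers.shape[1])],
        axis=1,
    )

rng = np.random.default_rng(0)
noisy = np.column_stack([np.linspace(50, 70, 200), np.full(200, 60.0)])
noisy += rng.normal(scale=3.0, size=noisy.shape)   # jittery per-frame estimates
smoothed = smooth_trajectory(noisy)
print(smoothed.shape)
```

Because the kernel is normalized, a region centre that does not move is left unchanged, while frame-to-frame jitter is attenuated.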

3.c - Data Augmentation


Online augmentation, applied before spatial segmentation to avoid padding

Spatial:

-  random rotation in range ±10°
-  random translation in range ±0.1 in x and y directions
-  random homothety (scaling) in range 1 ± 0.1

Temporal:

-  100 successive frames, with the 50th frame selected according to a normal probability distribution along the temporal dimension of the extracted stroke
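The augmentation parameters can be drawn as in the sketch below. The ranges come from the slide; the rest is assumed for illustration: uniform sampling within each range, translation expressed as a fraction of the frame size, a normal law centred on the middle of the stroke with a standard deviation of one sixth of its length, and all function names.

```python
import numpy as np

def sample_spatial_params(rng):
    """Draw one rotation (degrees), translation (fraction of frame size) and scale."""
    angle = rng.uniform(-10.0, 10.0)
    tx, ty = rng.uniform(-0.1, 0.1, size=2)
    scale = rng.uniform(0.9, 1.1)
    return angle, (tx, ty), scale

def affine_matrix(angle_deg, translation, scale, width, height):
    """2x3 matrix: rotate and scale about the frame centre, then translate."""
    a = np.deg2rad(angle_deg)
    cx, cy = width / 2.0, height / 2.0
    cos, sin = scale * np.cos(a), scale * np.sin(a)
    tx, ty = translation[0] * width, translation[1] * height
    return np.array([
        [cos, -sin, cx - cos * cx + sin * cy + tx],
        [sin,  cos, cy - sin * cx - cos * cy + ty],
    ])

def sample_window(stroke_start, stroke_end, n_frames, rng, length=100, sigma_frac=1 / 6):
    """Pick `length` successive frames whose 50th frame is drawn from a normal law."""
    mid = (stroke_start + stroke_end) / 2.0
    sigma = sigma_frac * (stroke_end - stroke_start)      # assumed spread
    center = int(round(rng.normal(mid, sigma)))
    center = int(np.clip(center, stroke_start, stroke_end))
    start = center - (length // 2 - 1)                    # 50th frame has index start + 49
    start = int(np.clip(start, 0, n_frames - length))     # keep the window inside the video
    return np.arange(start, start + length)

rng = np.random.default_rng(0)
angle, shift, scale = sample_spatial_params(rng)
M = affine_matrix(angle, shift, scale, width=320, height=180)
frames = sample_window(stroke_start=400, stroke_end=520, n_frames=2000, rng=rng)
print(M.shape, frames[0], frames[-1])
```

Drawing new parameters for every training sample is what makes the augmentation online: the network rarely sees exactly the same clip twice.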

4 - Results

[Figure: training of our SSTC model]

[Figure: training of the I3D model]


Conclusion

➔  This course is very far from being complete

➔  It was an attempt to give fundamentals and some examples from the author's research

➔  Happy adventure with Deep Learning for your visual data and your problems.

➔  Jenny Benois-Pineau

