Deep Learning for Computer Vision Pr. Jenny Benois-Pineau LABRI UMR 5800/Université Bordeaux Chapter 6. Temporal aspects. Applications

Chapter 6

Summary.
1.  Temporal aspects: RNN, LSTM
2.  Applications: 3D ConvNets…

Video Analysis & Coding/Computer Vision 2

1. RNN

➔  Recurrent neural networks (RNNs) are a family of neural networks for processing sequential data.

➔  Formally, an RNN is a neural network specialized for processing a sequence of values.

➔  Advantage: parameters are shared across different parts of the model (the same parameters are applied to the observations at every time step).

➔  We consider an RNN operating on a sequence of vectors $x^{(1)}, x^{(2)}, \ldots, x^{(\tau)}$.

➔  In practice, RNNs usually operate on minibatches of such sequences.

Rumelhart, D.E., McClelland, J.L., and the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge.

The idea of computational graph unfolding

➔  Consider the classical form of a dynamical system (in computer vision, e.g. the dynamic model of an object moving with constant velocity in a video sequence):

$$s^{(t)} = f(s^{(t-1)}; \theta)$$

➔  $s^{(t)}$ is called the state of the system.

➔  The equation is recurrent: the state at time $t$ is defined from the state at time $t-1$.

Unfolding

➔  For a finite number of time steps, the recurrence can be unfolded, e.g. for 3 steps:

$$s^{(3)} = f(s^{(2)}; \theta) = f(f(s^{(1)}; \theta); \theta)$$

➔  Unfolding the equation by repeatedly applying the definition in this way yields an expression that does not involve recurrence.

[Figure: unfolded computational graph, $s^{(\ldots)} \to s^{(t-1)} \to s^{(t)} \to s^{(t+1)} \to s^{(\ldots)}$, each arrow applying $f$]
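The unfolding idea can be sketched in a few lines of Python. This is only an illustration: the constant-velocity transition `f`, the choice of the time step as the parameter `theta`, and the helper name `unfold` are assumptions made for the example, not something given on the slides.

```python
def f(state, theta):
    """Transition of a constant-velocity model (assumed for illustration).

    state is (position, velocity); theta plays the role of the time step.
    """
    position, velocity = state
    return (position + theta * velocity, velocity)


def unfold(s1, theta, t):
    """Return s^(t) by applying f exactly (t - 1) times, starting from s^(1)."""
    state = s1
    for _ in range(t - 1):
        state = f(state, theta)
    return state


s1 = (0.0, 2.0)
s3 = unfold(s1, theta=0.5, t=3)
# The unfolded form contains no recurrence: s^(3) = f(f(s^(1); theta); theta)
assert s3 == f(f(s1, 0.5), 0.5)
```

The loop and the nested call compute the same value, which is the point of unfolding: for a fixed number of steps the recurrent definition becomes an ordinary feed-forward composition of the same function with the same parameters.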

RNN as

➔  The equation using external signals $x^{(t)}$ ($h$ is the state):

$$h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$$

It is possible to use the same transition function $f$, with the same parameters $\theta$, at every time step.

[Figure: unfolded graph in which each state $h^{(t)}$ is computed by $f$ from the previous state $h^{(t-1)}$ and the input $x^{(t)}$]

And finally

➔  Unfolded recurrent network with Loss

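[Figure: unfolded recurrent network with loss. At each time step $t$, the input $x^{(t)}$ enters the hidden state $h^{(t)}$ through weights $U$, consecutive hidden states are connected through weights $W$, $h^{(t)}$ produces the output $o^{(t)}$ through weights $V$, and the loss $L^{(t)}$ compares $o^{(t)}$ with the target $y^{(t)}$.]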

Equations of RNN

➔  Forward propagation:

$$a^{(t)} = b + W h^{(t-1)} + U x^{(t)}$$

$$h^{(t)} = f(a^{(t)})$$

$$o^{(t)} = c + V h^{(t)}$$

➔  Parameter estimation: backpropagation and gradient descent

➔  Difficult to train: gradients propagated through many time steps tend to vanish or explode
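The forward propagation equations can be turned into a short pure-Python sketch. The slides leave the activation $f$ unspecified, so `tanh` is an assumption here, and the helper names `matvec` and `rnn_forward` are hypothetical.

```python
import math


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rnn_forward(xs, h0, W, U, V, b, c):
    """Run the recurrence over a sequence xs = [x^(1), ..., x^(tau)].

    a^(t) = b + W h^(t-1) + U x^(t)
    h^(t) = tanh(a^(t))          # tanh is an assumed choice for f
    o^(t) = c + V h^(t)
    Returns the list of outputs o^(t) and the final hidden state.
    """
    h = h0
    outputs = []
    for x in xs:
        Wh = matvec(W, h)
        Ux = matvec(U, x)
        a = [bi + whi + uxi for bi, whi, uxi in zip(b, Wh, Ux)]
        h = [math.tanh(ai) for ai in a]
        o = [ci + vhi for ci, vhi in zip(c, matvec(V, h))]
        outputs.append(o)
    return outputs, h
```

The same `W`, `U`, `V`, `b` and `c` are reused at every iteration of the loop, which is the parameter sharing across time steps described earlier in the chapter.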

LSTM – Long Short-Term Memory

➔  Gated RNNs

➔  The idea: creating paths through time whose derivatives neither vanish nor explode

➔  Connection weights may change at each time step

➔  For video analysis, LSTMs have largely been replaced by 3D convolutional neural networks

Hochreiter and Schmidhuber, 1997
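The slides do not give the LSTM equations, so the following is only a sketch of the commonly used formulation, reduced to a scalar cell to keep it short. The parameter layout `p` (gate name mapped to input weight, recurrent weight and bias) and the function names are assumptions made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell (standard formulation, assumed here).

    p maps a gate name to a tuple (w_x, w_h, bias).
    """
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x + w_h * h_prev + bias

    i = sigmoid(pre("input"))        # input gate
    f = sigmoid(pre("forget"))       # forget gate
    o = sigmoid(pre("output"))       # output gate
    g = math.tanh(pre("candidate"))  # candidate cell update
    # Additive cell-state path: the derivative of c with respect to c_prev is f,
    # so a forget gate close to 1 lets information and gradients flow through time.
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c
```

Because the gates depend on the current input and the previous state, the effective weight on the path from `c_prev` to `c` changes at each time step, which is one way to read the bullet "connection weights may change at each time step".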

Goal

Improve athletes' performance through tools for teachers and athletes.

CBMI - September 6th, 2018 Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis

Goal

-  Extract strokes in the temporal dimension

-  Classify the strokes

[Figure: input video along the time axis t; output: an extracted stroke labelled "Offensive Forehand Loop"]

1 - A new dataset : TTStroke-21

-  129 videos at 120 fps

-  1,387 annotations before filtering and 1,074 after filtering, for 20 classes

-  1,048 strokes + 272 negative samples extracted

[Figure: TTStroke-21: acquisition, annotation platform, samples]

2 - Related Work

-  Use of Dynamic Images [1]

-  Very deep 3D CNN [2]

-  Long-term Temporal Convolutions [3]

[1] H. Bilen, B. Fernando, E. Gavves, and A. Vedaldi, “Action recognition with dynamic image networks,” CoRR, vol. abs/1612.00738, 2016.
[2] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CoRR, vol. abs/1705.07750, 2017.
[3] G. Varol, I. Laptev, and C. Schmid, “Long-term temporal convolutions for action recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1510–1517, 2018.

3 - Proposed method

Goal: good classification of the extracted strokes

-  Use of a deep learning model

-  Need for temporal and spatial segmentation

-  Data augmentation

Best accuracy: 91.4% against 43.1% for the state-of-the-art method [2]

[Figure: example stroke, "Offensive Forehand Loop"]

3.a - Model Architecture

Siamese Spatio-Temporal Convolutional Neural Network

Input: (W,H,T) = (100,120,120)

Training:

-  Stochastic gradient descent with Nesterov momentum

-  Cross-entropy loss: $\mathrm{loss}(x, \mathrm{class}) = -x[\mathrm{class}] + \log\left(\sum_j \exp(x[j])\right)$

-  Learning rate: 0.001 for the Siamese network and 0.01 for one branch

-  Epochs: 2000

-  Momentum: 0.5, decreased to 0.1 and 0.05 at epochs 1000 and 1500

-  Datasets: training 70%, validation 20%, test 10%
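The cross-entropy loss above can be written directly in Python. The sketch below follows the slide's formula for a single sample; the max-subtraction used to evaluate the log-sum-exp safely is a standard numerical trick added here, not something stated on the slide, and the function name is hypothetical.

```python
import math


def cross_entropy(x, target):
    """Cross-entropy for one sample: -x[target] + log(sum_j exp(x[j])).

    x is the list of raw class scores; target is the index of the true class.
    """
    m = max(x)
    # Subtracting the maximum before exponentiating avoids overflow
    # without changing the value of the log-sum-exp.
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return -x[target] + log_sum_exp
```

With two equal scores the loss equals log 2, the value expected when the model is undecided between two classes.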

[Figure: 3D convolutions*]

* “IMAGE SUPER RESOLUTION KERAS” from impremedia.net
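To make the 3D convolution operation concrete, here is a deliberately naive pure-Python sketch for a single channel. It is not the architecture of the slides: the "valid" border handling, the stride of 1, the `[t][y][x]` index order and the function name are all assumptions chosen for clarity, and, as in most deep learning libraries, the kernel is not flipped (cross-correlation).

```python
def conv3d_valid(volume, kernel):
    """Naive single-channel 3D cross-correlation, "valid" mode, stride 1.

    volume and kernel are nested lists indexed as [t][y][x].
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel slides over time as well as over space,
                # so each output value mixes several consecutive frames.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += volume[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out
```

A real network would rely on an optimized library layer with many channels and learned kernels; the loop form is only meant to show that the temporal axis is treated like a third spatial axis.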

3.b - Input Data

-  Motion estimation [4]

-  Foreground estimation [5]

[Figure: original frame, estimated motion, estimated foreground]

[4] C. Liu, “Beyond pixels: Exploring new representations and applications for motion analysis,” Ph.D. dissertation, Massachusetts Institute of Technology, May 2009.
[5] Z. Zivkovic and F. van der Heijden, “Efficient adaptive density estimation per image pixel for the task of background subtraction,” Pattern Recognition Letters, vol. 27, no. 7, pp. 773–780, 2006.

3.b - Input Data

Spatial segmentation using foreground motion

[Figure: spatial segmentation, with points labelled $X_{max}$, $X_g$ and $X_{roi}$, and the final segmentation]

Smoothing over the temporal dimension using a Gaussian kernel of size 40 and standard deviation 4.44.
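The temporal smoothing step can be sketched as a 1D Gaussian filter applied along time. The slide gives only the kernel size (40) and standard deviation (4.44); what exactly is smoothed is not stated, so the sketch assumes a per-frame scalar such as one coordinate of the region-of-interest center. The edge padding and the function names are also assumptions.

```python
import math


def gaussian_kernel(size=40, sigma=4.44):
    """Normalized 1D Gaussian kernel with the size and sigma from the slide."""
    center = (size - 1) / 2.0
    weights = [math.exp(-0.5 * ((k - center) / sigma) ** 2) for k in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_over_time(values, size=40, sigma=4.44):
    """Smooth a per-frame sequence along time, keeping its length.

    The sequence is padded by repeating its first and last values (an
    assumed border policy). With an even kernel size the window is
    off-center by half a frame.
    """
    kernel = gaussian_kernel(size, sigma)
    left = size // 2
    right = size - 1 - left
    padded = [values[0]] * left + list(values) + [values[-1]] * right
    return [
        sum(k * padded[t + j] for j, k in enumerate(kernel))
        for t in range(len(values))
    ]
```

For a 2D position, the same function would be applied separately to the x and y coordinates of each frame.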

3.c - Data Augmentation

Online augmentation applied before spatial segmentation to avoid padding

Spatial:

-  random rotation in range ±10°

-  random translation in range ±0.1 in x and y directions

-  random homothety (scaling) in range 1 ± 0.1

Temporal:

-  100 successive frames, with the 50th frame selected according to a normal probability distribution along the temporal dimension of the extracted stroke
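The augmentation parameters listed above could be sampled as follows. Several details are assumptions made for this sketch and are not given on the slide: the spatial parameters are drawn uniformly within their ranges, the normal distribution for the temporal window is centered on the middle of the stroke with a standard deviation of one sixth of its length, and the function and key names are hypothetical.

```python
import random


def sample_spatial_params(rng):
    """Draw one set of spatial augmentation parameters (uniform draws assumed)."""
    return {
        "rotation_deg": rng.uniform(-10.0, 10.0),
        "translate_x": rng.uniform(-0.1, 0.1),
        "translate_y": rng.uniform(-0.1, 0.1),
        "scale": rng.uniform(0.9, 1.1),
    }


def sample_temporal_window(n_frames, rng, length=100, sigma_fraction=1.0 / 6.0):
    """Pick `length` successive frames from a stroke of n_frames frames.

    The 50th frame of the window is drawn from a normal distribution centered
    on the middle of the stroke (mean and sigma are assumptions), then the
    window is clipped so that it stays inside the stroke.
    Returns (start, end) with end exclusive.
    """
    if n_frames < length:
        raise ValueError("stroke is shorter than the requested window")
    center = rng.gauss(n_frames / 2.0, n_frames * sigma_fraction)
    start = int(round(center)) - (length // 2 - 1)  # 50th frame has offset 49
    start = max(0, min(start, n_frames - length))
    return start, start + length
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the augmentation reproducible between runs.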

4 - Results

[Figure: training of our SSTC model]

[Figure: training of the I3D model]

Conclusion

➔  This course is very far from complete.

➔  It was an attempt to give the fundamentals and some examples from the author's research.

➔  Happy adventures with Deep Learning for your visual data and your problems.

➔  Jenny Benois-Pineau