
Eidetic 3D LSTM: A Model for Video Prediction and Beyond

Yunbo Wang1, Lu Jiang2, Ming-Hsuan Yang2,3, Li-Jia Li4, Mingsheng Long1, Li Fei-Fei4

1Tsinghua University, 2Google AI, 3University of California, Merced, 4Stanford University

Summary

• We build space-time models of the world through predictive unsupervised learning.

• Task 1: Future frame prediction. Applications: urban computing, weather forecasting, learning the dynamics of complex environments.

• Task 2: Early action recognition, i.e., predicting future percepts from the available information. We ask: can pixel-level predictive learning help percept-level tasks?

• Code/models available: github.com/google/e3d_lstm

Motivations

| Task | 2D CNN | 3D CNN | LSTM | Conv-in-LSTM | 3DConv-in-LSTM |
|---|---|---|---|---|---|
| Video Pred. | Mathieu ICLR16 | VideoGAN | Srivastava NIPS15 | ConvLSTM | E3D-LSTM |
| Video Recog. | Two-Stream CNN | C3D, I3D | LRCN | ConvGRU | E3D-LSTM |

• Common feature of pixel-level and percept-level future prediction → long- and short-term dependencies.

• Our point: jointly learn long- and short-term video representations via recurrent 3D convolutions.

Modeling Short-Term Video Representations

• 3DConv-in-LSTM: integrating 3D convolutions into the LSTM's recurrent transitions to alleviate the vanishing gradient problem (a minimal sketch follows the figure caption below).

[Figure: (a) 3D-CNN at Bottom; (b) 3D-CNN on Top; (c) E3D-LSTM Network]
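The sketch below illustrates the 3DConv-in-LSTM transition. It is not the authors' released TensorFlow code: the class and parameter names are ours, and the gate fusion is one common way to realize the idea. The point it demonstrates is that both the input-to-state and state-to-state updates are 3D convolutions, so each recurrent step sees a short clip rather than a single frame.

```python
import torch
import torch.nn as nn

class Conv3DLSTMCell(nn.Module):
    """Hypothetical LSTM cell whose recurrent transitions are 3D convolutions."""
    def __init__(self, in_ch, hid_ch, k=(3, 5, 5)):
        super().__init__()
        pad = tuple(s // 2 for s in k)  # 'same' padding for odd kernels
        # Fused convolutions produce the i, f, g, o gate pre-activations at once.
        self.x_conv = nn.Conv3d(in_ch, 4 * hid_ch, k, padding=pad)
        self.h_conv = nn.Conv3d(hid_ch, 4 * hid_ch, k, padding=pad)

    def forward(self, x, h, c):
        # x: (B, in_ch, T, H, W); h, c: (B, hid_ch, T, H, W)
        i, f, g, o = torch.chunk(self.x_conv(x) + self.h_conv(h), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```

Because the states keep a temporal axis of depth T, short-term motion is modeled inside each step by the convolution kernels, while the recurrence carries information across steps.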

Modeling Long-Term Video Representations

• Most prior work handles long-term video relations via the recursion of feed-forward networks (weak at learning temporal dependencies) or via the temporal state transitions of recurrent networks (which easily leads to saturated forget gates).

• Our point: we introduce the Recall Gate, a Transformer-like attention mechanism, into the LSTM's memory transitions, replacing the traditional forget gate.

[Figure: (d) Spatiotemporal LSTM; (e) Eidetic 3D LSTM]

$$
\begin{aligned}
R_t &= \sigma(W_{xr} \ast X_t + W_{hr} \ast H_{t-1}^k + b_r)\\
\operatorname{RECALL}(R_t, C_{t-\tau:t-1}^k) &= \operatorname{softmax}\!\big(R_t \cdot (C_{t-\tau:t-1}^k)^{\top}\big) \cdot C_{t-\tau:t-1}^k\\
C_t^k &= I_t \odot G_t + \operatorname{LayerNorm}\!\big(C_{t-1}^k + \operatorname{RECALL}(R_t, C_{t-\tau:t-1}^k)\big)
\end{aligned}
\tag{1}
$$
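A hedged sketch of the recall mechanism in Eq. (1). For brevity it treats each of the τ past cell states as a single flattened vector and attends over the history entries, which is coarser than the full attention over all positions that the equation's matrix products allow; all function and argument names are ours, not the released code's.

```python
import torch
import torch.nn.functional as F

def recall(r_t, c_hist):
    # r_t:    (B, d)       recall-gate activation, spatial dims flattened
    # c_hist: (B, tau, d)  the tau preceding cell states C_{t-tau:t-1}
    scores = torch.einsum('bd,btd->bt', r_t, c_hist)   # R_t . C^T
    attn = F.softmax(scores, dim=-1)                   # attend over history
    return torch.einsum('bt,btd->bd', attn, c_hist)    # softmax(.) . C

def eidetic_update(i_t, g_t, r_t, c_prev, c_hist):
    # Eq. (1): C_t = I_t ⊙ G_t + LayerNorm(C_{t-1} + RECALL(R_t, C_hist))
    recalled = recall(r_t, c_hist)
    normed = F.layer_norm(c_prev + recalled, c_prev.shape[-1:])
    return i_t * g_t + normed
```

Unlike a forget gate, which can only exponentially decay C_{t-1}, the attention term can revive any of the τ stored states, which is what makes the memory "eidetic".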

Moving MNIST Dataset

| Model | SSIM | MSE |
|---|---|---|
| ConvLSTM | 0.713 | 96.5 |
| DFN | 0.726 | 89.0 |
| FRNN | 0.819 | 68.4 |
| VPN baseline | 0.870 | 64.1 |
| PredRNN | 0.869 | 56.5 |
| E3D-LSTM | 0.910 | 41.3 |

[Figure: (f) 10 → 10 frame prediction; (g) copy test]
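For reference, this is how such numbers are typically computed on Moving MNIST; the exact protocol is our assumption, not stated on the poster. SSIM (higher is better) is averaged over the predicted frames, and MSE (lower is better) is the per-frame squared error summed over pixels.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate(pred, truth):
    # pred, truth: (num_frames, H, W) arrays with pixel values in [0, 1]
    mse = ((pred - truth) ** 2).sum() / len(pred)  # squared error per frame
    s = float(np.mean([ssim(p, t, data_range=1.0)
                       for p, t in zip(pred, truth)]))
    return s, mse
```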

KTH Action Dataset: Video Prediction and Replay

Early Action Recognition

| Model | Front 25% | Front 50% |
|---|---|---|
| Baseline 1: 3D-CNN at bottom | 10.28 | 16.05 |
| Baseline 2: 3D-CNN on top | 9.63 | 14.82 |
| Baseline 3: Ours w/o 3D convolutions | 9.58 | 13.92 |
| Baseline 4: Ours w/o memory attention | 11.39 | 18.84 |
| Trained only on the recognition task | 13.78 | 20.91 |
| Pre-trained on the prediction task | 14.00 | 22.15 |
| Trained on both tasks with a fixed loss ratio | 13.57 | 20.46 |
| E3D-LSTM (trained on both tasks with a scheduled ratio) | 14.59 | 22.73 |
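Scores are recognition performance given only the front 25% or 50% of each video (higher is better). The best row combines the recognition loss with the frame-prediction loss under a scheduled ratio. The poster does not give the schedule, so the sketch below is purely illustrative; the linear decay and all names are our assumptions.

```python
import torch.nn.functional as F

def joint_loss(pred_frames, true_frames, logits, labels, step, total_steps):
    l_pred = F.mse_loss(pred_frames, true_frames)  # pixel-prediction loss
    l_cls = F.cross_entropy(logits, labels)        # action-recognition loss
    # Scheduled ratio: start prediction-heavy, shift toward recognition.
    w = max(0.0, 1.0 - step / total_steps)         # linear decay (assumed)
    return w * l_pred + (1.0 - w) * l_cls
```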

Tsinghua University - Beijing - China - 100084 Mail: wangyb15@mails.tsinghua.edu.cn
