

Eidetic 3D LSTM: A Model for Video Prediction and Beyond

Yunbo Wang1, Lu Jiang2, Ming-Hsuan Yang2,3, Li-Jia Li4, Mingsheng Long1, Li Fei-Fei4

1Tsinghua University, 2Google AI, 3University of California, Merced, 4Stanford University

Summary

• We build space-time models of the world through predictive unsupervised learning.

• Task 1: Future frame prediction. Applications: urban computing, weather forecasting, learning the dynamics of complex environments.

• Task 2: Early action recognition. Predicting future percepts from available information. We ask: can pixel-level predictive learning help percept-level tasks?

• Code/models available: github.com/google/e3d_lstm

Motivations

Task         | 2D CNN         | 3D CNN   | LSTM              | Conv-in-LSTM | 3DConv-in-LSTM
Video Pred.  | Mathieu ICLR16 | VideoGAN | Srivastava NIPS15 | ConvLSTM     | E3D-LSTM
Video Recog. | Two-Stream CNN | C3D, I3D | LRCN              | ConvGRU      | E3D-LSTM

• Common features in pixel-level and percept-level future prediction → long- and short-term dependencies.

• Our point: jointly learning long- and short-term video representations via recurrent 3D convolutions.

Modeling Short-Term Video Representations

• 3DConv-in-LSTM: integrating 3D convolutions into the LSTM's recurrent transitions to reduce the vanishing gradient problem.

(a) 3D-CNN at Bottom; (b) 3D-CNN on Top; (c) E3D-LSTM Network
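In the E3D-LSTM network (c), the video is consumed as overlapping windows: the 3D-CNN encoder sees frames 1:T, then τ:T+τ, then 2τ:T+2τ, i.e., fixed-length clips sliding by a stride τ. A minimal NumPy sketch of that windowing (the frame size, clip length, and stride are illustrative assumptions, not values from the poster):

```python
import numpy as np

def make_clips(frames, clip_len, stride):
    """Split a (time, H, W) frame array into overlapping clips of
    length `clip_len`, advancing the window by `stride` each step,
    as in frames 1:T, tau:T+tau, 2*tau:T+2*tau, ..."""
    starts = range(0, frames.shape[0] - clip_len + 1, stride)
    return np.stack([frames[s:s + clip_len] for s in starts])

video = np.arange(12 * 4 * 4, dtype=float).reshape(12, 4, 4)  # 12 toy frames
clips = make_clips(video, clip_len=4, stride=2)
print(clips.shape)  # (5, 4, 4, 4): five overlapping 4-frame clips
```

Each clip would then be encoded by the 3D-CNN before entering the recurrent unit, so consecutive recurrent steps share clip_len − stride frames of context.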

Modeling Long-Term Video Representations

• Most prior work handles long-term video relations either via the recursions of feed-forward networks (weak at learning temporal dependencies) or via the temporal state transitions of recurrent networks (which easily lead to saturated forget gates).

• Our point: we introduce the Recall Gate, a Transformer-like attention mechanism, into the LSTM's memory transitions, replacing the traditional forget gate.

(d) Spatiotemporal LSTM; (e) Eidetic 3D LSTM

R_t = σ(W_xr ∗ X_t + W_hr ∗ H^k_{t−1} + b_r)

RECALL(R_t, C^k_{t−τ:t−1}) = softmax(R_t · (C^k_{t−τ:t−1})^⊤) · C^k_{t−τ:t−1}

C^k_t = I_t ⊙ G_t + LayerNorm(C^k_{t−1} + RECALL(R_t, C^k_{t−τ:t−1}))    (1)
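If the 3D cell states are flattened into vectors, the recall mechanism in Eq. (1) is softmax attention of the recall gate over the τ previous cell states, followed by the layer-normalized residual update. A minimal NumPy sketch (treating states as flat vectors is a simplifying assumption; the model actually operates on 3D-convolutional tensors, and ∗ in Eq. (1) denotes convolution):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def recall(r_t, c_hist):
    """RECALL(R_t, C_{t-tau:t-1}): attention weights of the recall
    gate over the tau past cell states, then their weighted sum."""
    weights = softmax(r_t @ c_hist.T)   # (tau,)
    return weights @ c_hist             # (d,)

rng = np.random.default_rng(0)
d, tau = 8, 5
r_t = rng.standard_normal(d)            # recall gate (post-sigmoid in Eq. 1)
c_hist = rng.standard_normal((tau, d))  # C_{t-tau}, ..., C_{t-1}
i_t, g_t = rng.random(d), rng.standard_normal(d)

# Eq. (1): C_t = I_t * G_t + LayerNorm(C_{t-1} + RECALL(R_t, C_{t-tau:t-1}))
c_t = i_t * g_t + layer_norm(c_hist[-1] + recall(r_t, c_hist))
print(c_t.shape)  # (8,)
```

Unlike a scalar forget gate, the attention weights let the cell revisit any of the τ past states directly, which is why saturated forgetting is avoided.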

Moving MNIST Dataset

Model        | SSIM ↑ | MSE ↓
ConvLSTM     | 0.713  | 96.5
DFN          | 0.726  | 89.0
FRNN         | 0.819  | 68.4
VPN baseline | 0.870  | 64.1
PredRNN      | 0.869  | 56.5
E3D-LSTM     | 0.910  | 41.3

(f) 10 → 10 Prediction; (g) Copy Test (the prior context is identical to Seq. 2; the test is whether models recall it when predicting Seq. 2)

KTH Action Dataset: Video Prediction and Replay

[Figure: predicted frames (t = 1 … 49) from ConvLSTM, PredRNN++, and ours, against inputs and ground truth.]

Early Action Recognition

Model                                                   | Front 25% | Front 50%
Baseline 1: 3D-CNN at bottom                            | 10.28     | 16.05
Baseline 2: 3D-CNN on top                               | 9.63      | 14.82
Baseline 3: Ours w/o 3D convolutions                    | 9.58      | 13.92
Baseline 4: Ours w/o memory attention                   | 11.39     | 18.84
Trained only on the recognition task                    | 13.78     | 20.91
Pre-trained on the prediction task                      | 14.00     | 22.15
Trained on both tasks with a fixed loss ratio           | 13.57     | 20.46
E3D-LSTM (trained on both tasks with a scheduled ratio) | 14.59     | 22.73
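The last two rows of the table compare joint training with a fixed loss ratio against a scheduled one, where the balance between the frame-prediction loss and the recognition loss shifts over training. The schedule itself is not specified on the poster, so the linear ramp below is purely a hypothetical illustration of the idea (weight pixel prediction early, recognition late):

```python
def loss_weight(step, total_steps, start=1.0, end=0.1):
    """Hypothetical linear schedule for the prediction-loss weight:
    starts at `start`, decays to `end` by the last step."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

def joint_loss(pred_loss, cls_loss, step, total_steps):
    # Weighted sum of the two task losses under the current schedule.
    w = loss_weight(step, total_steps)
    return w * pred_loss + (1.0 - w) * cls_loss

print(round(loss_weight(0, 100), 3))    # 1.0  (prediction dominates early)
print(round(loss_weight(100, 100), 3))  # 0.1  (recognition dominates late)
```

Any monotone decay (cosine, step) would express the same intent; the point the table supports is that scheduling the ratio beats fixing it.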

Tsinghua University - Beijing - China - 100084 Mail: [email protected]


[Figure: qualitative early-recognition examples. For ground-truth actions such as "Poking a stack of [sth.] so the stack collapses", "Poking a stack of [sth.] without the stack collapsing", "Pouring [sth.] into [sth.] until it overflows", and "Trying to pour [sth.] into [sth.], but missing so it spills next to it", predicted labels from the 3D-CNN baseline and E3D-LSTM are shown at points (25%, 50%, 100%) along each video's timeline; E3D-LSTM revises early partial guesses (e.g., "Pouring [sth.] into [sth.]" → "Pouring [sth.] … overflows") as more frames arrive, while the 3D-CNN baseline tends to keep the incomplete label.]
