

Eidetic 3D LSTM: A Model for Video Prediction and Beyond

Yunbo Wang1, Lu Jiang2, Ming-Hsuan Yang2,3, Li-Jia Li4, Mingsheng Long1, Li Fei-Fei4

1Tsinghua University, 2Google AI, 3University of California, Merced, 4Stanford University

Summary

• We build space-time models of the world through predictive unsupervised learning.

• Task 1: Future frame prediction. Applications: urban computing, weather forecasting, learning the dynamics of complex environments.

• Task 2: Early action recognition. Predicting future percepts from available information. We ask: can pixel-level predictive learning help percept-level tasks?

• Code/models available: github.com/google/e3d_lstm

Motivations

Task         | 2D CNN         | 3D CNN   | LSTM              | Conv-in-LSTM | 3DConv-in-LSTM
Video Pred.  | Mathieu ICLR16 | VideoGAN | Srivastava NIPS15 | ConvLSTM     | E3D-LSTM
Video Recog. | Two-Stream CNN | C3D, I3D | LRCN              | ConvGRU      | E3D-LSTM

• Common features in pixel-level and percept-level future prediction → long- and short-term dependencies.

• Our point: jointly learning long- and short-term video representations via recurrent 3D convolutions.

Modeling Short-Term Video Representations

• 3DConv-in-LSTM: integrating 3D convolutions into the LSTM's recurrent transitions to reduce the vanishing gradient problem.

(a) 3D-CNN at Bottom; (b) 3D-CNN on Top; (c) E3D-LSTM Network
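In the E3D-LSTM network (c), the video is consumed as overlapping windows: the 3D-CNN encoder sees frames 1:T, then τ:T+τ, then 2τ:T+2τ, i.e., fixed-length clips sliding by a stride τ. A minimal NumPy sketch of that windowing (the frame size, clip length, and stride are illustrative assumptions, not values from the poster):

```python
import numpy as np

def make_clips(frames, clip_len, stride):
    """Split a (time, H, W) frame array into overlapping clips of
    length `clip_len`, advancing the window by `stride` each step,
    as in frames 1:T, tau:T+tau, 2*tau:T+2*tau, ..."""
    starts = range(0, frames.shape[0] - clip_len + 1, stride)
    return np.stack([frames[s:s + clip_len] for s in starts])

video = np.arange(12 * 4 * 4, dtype=float).reshape(12, 4, 4)  # 12 toy frames
clips = make_clips(video, clip_len=4, stride=2)
print(clips.shape)  # (5, 4, 4, 4): five overlapping 4-frame clips
```

Each clip would then be encoded by the 3D-CNN before entering the recurrent unit, so consecutive recurrent steps share clip_len − stride frames of context.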

Modeling Long-Term Video Representations

• Most prior work handles long-term video relations either via the recursions of feed-forward networks (weak at learning temporal dependencies) or via the temporal state transitions of recurrent networks (which easily lead to saturated forget gates).

• Our point: we introduce the Recall Gate, a Transformer-like attention mechanism, into the LSTM's memory transitions, replacing the traditional forget gate.

(d) Spatiotemporal LSTM; (e) Eidetic 3D LSTM

R_t = σ(W_xr ∗ X_t + W_hr ∗ H^k_{t−1} + b_r)

RECALL(R_t, C^k_{t−τ:t−1}) = softmax(R_t · (C^k_{t−τ:t−1})^⊤) · C^k_{t−τ:t−1}

C^k_t = I_t ⊙ G_t + LayerNorm(C^k_{t−1} + RECALL(R_t, C^k_{t−τ:t−1}))    (1)
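If the 3D cell states are flattened into vectors, the recall mechanism in Eq. (1) is softmax attention of the recall gate over the τ previous cell states, followed by the layer-normalized residual update. A minimal NumPy sketch (treating states as flat vectors is a simplifying assumption; the model actually operates on 3D-convolutional tensors, and ∗ in Eq. (1) denotes convolution):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def recall(r_t, c_hist):
    """RECALL(R_t, C_{t-tau:t-1}): attention weights of the recall
    gate over the tau past cell states, then their weighted sum."""
    weights = softmax(r_t @ c_hist.T)   # (tau,)
    return weights @ c_hist             # (d,)

rng = np.random.default_rng(0)
d, tau = 8, 5
r_t = rng.standard_normal(d)            # recall gate (post-sigmoid in Eq. 1)
c_hist = rng.standard_normal((tau, d))  # C_{t-tau}, ..., C_{t-1}
i_t, g_t = rng.random(d), rng.standard_normal(d)

# Eq. (1): C_t = I_t * G_t + LayerNorm(C_{t-1} + RECALL(R_t, C_{t-tau:t-1}))
c_t = i_t * g_t + layer_norm(c_hist[-1] + recall(r_t, c_hist))
print(c_t.shape)  # (8,)
```

Unlike a scalar forget gate, the attention weights let the cell revisit any of the τ past states directly, which is why saturated forgetting is avoided.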

Moving MNIST Dataset

Model        | SSIM ↑ | MSE ↓
ConvLSTM     | 0.713  | 96.5
DFN          | 0.726  | 89.0
FRNN         | 0.819  | 68.4
VPN baseline | 0.870  | 64.1
PredRNN      | 0.869  | 56.5
E3D-LSTM     | 0.910  | 41.3

(f) 10 → 10 Prediction; (g) Copy Test (the prior context is identical to Seq. 2; the test is whether models recall it when predicting Seq. 2)

KTH Action Dataset: Video Prediction and Replay

[Figure: predicted frames (t = 1 … 49) from ConvLSTM, PredRNN++, and ours, against inputs and ground truth.]

Early Action Recognition

Model                                                   | Front 25% | Front 50%
Baseline 1: 3D-CNN at bottom                            | 10.28     | 16.05
Baseline 2: 3D-CNN on top                               | 9.63      | 14.82
Baseline 3: Ours w/o 3D convolutions                    | 9.58      | 13.92
Baseline 4: Ours w/o memory attention                   | 11.39     | 18.84
Trained only on the recognition task                    | 13.78     | 20.91
Pre-trained on the prediction task                      | 14.00     | 22.15
Trained on both tasks with a fixed loss ratio           | 13.57     | 20.46
E3D-LSTM (trained on both tasks with a scheduled ratio) | 14.59     | 22.73
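The last two rows of the table compare joint training with a fixed loss ratio against a scheduled one, where the balance between the frame-prediction loss and the recognition loss shifts over training. The schedule itself is not specified on the poster, so the linear ramp below is purely a hypothetical illustration of the idea (weight pixel prediction early, recognition late):

```python
def loss_weight(step, total_steps, start=1.0, end=0.1):
    """Hypothetical linear schedule for the prediction-loss weight:
    starts at `start`, decays to `end` by the last step."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

def joint_loss(pred_loss, cls_loss, step, total_steps):
    # Weighted sum of the two task losses under the current schedule.
    w = loss_weight(step, total_steps)
    return w * pred_loss + (1.0 - w) * cls_loss

print(round(loss_weight(0, 100), 3))    # 1.0  (prediction dominates early)
print(round(loss_weight(100, 100), 3))  # 0.1  (recognition dominates late)
```

Any monotone decay (cosine, step) would express the same intent; the point the table supports is that scheduling the ratio beats fixing it.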

Tsinghua University - Beijing - China - 100084 Mail: [email protected]


[Figure: qualitative early-recognition examples. For ground-truth actions such as "Poking a stack of [sth.] so the stack collapses", "Poking a stack of [sth.] without the stack collapsing", "Pouring [sth.] into [sth.] until it overflows", and "Trying to pour [sth.] into [sth.], but missing so it spills next to it", predicted labels from the 3D-CNN baseline and E3D-LSTM are shown at points (25%, 50%, 100%) along each video's timeline; E3D-LSTM revises early partial guesses (e.g., "Pouring [sth.] into [sth.]" → "Pouring [sth.] … overflows") as more frames arrive, while the 3D-CNN baseline tends to keep the incomplete label.]
