Efficient and Tractable System Identification through Supervised Learning

Ahmed Hefny
Machine Learning Department, Carnegie Mellon University
8/1/2017

Joint work with Geoffrey Gordon, Carlton Downey, Byron Boots (Georgia Tech), Zita Marinho, Wen Sun


Outline

- Problem Statement:
  ◦ Learning Dynamical Systems
  ◦ Solution Properties

- Formulation:
  ◦ A Taxonomy of Dynamical System Models
  ◦ Predictive State Models: Formulation and Learning
  ◦ Connection to Recurrent Networks

- Extensions:
  ◦ Controlled Systems
  ◦ Reinforcement Learning

Learning Dynamical Systems

[Figure: the latent system state $s_t$ (e.g. position & speed) evolves to $s_{t+1}$ through the system dynamics and emits the observation $o_t$ through the observation model.]

Belief state: $q_t \equiv P(s_t \mid o_{1:t-1})$

The belief state $q_t$ summarizes the history $o_{1:t-1}$ for predicting the future $o_{t:\infty}$:

$P(o_{t+\tau} \mid o_{1:t-1}) \equiv P(o_{t+\tau} \mid q_t)$

Learning a Recursive Filter

Given:
◦ Training sequences $(o_1, o_2, \ldots, o_T)$

Output:
◦ Initial belief $q_1$
◦ Filtering function $f$: $q_{t+1} = f(q_t, o_t)$
◦ Observation function $g$: $E[o_t \mid o_{1:t-1}] = g(q_t)$
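The three outputs listed above are enough to run a filter over a sequence. The following is a minimal sketch of that loop, not the talk's implementation: `run_filter` is a hypothetical helper name, and the toy `f` and `g` (an exponential moving average with an identity observation function) are assumptions chosen only so the example runs.

```python
def run_filter(q1, observations, f, g):
    """Roll q_{t+1} = f(q_t, o_t) over one sequence.

    Returns the predictions g(q_t), each made before o_t is seen,
    together with the state reached after the last observation.
    """
    q = q1
    predictions = []
    for o in observations:
        predictions.append(g(q))  # E[o_t | o_{1:t-1}] = g(q_t)
        q = f(q, o)               # q_{t+1} = f(q_t, o_t)
    return predictions, q


# Toy instance (assumption): the state is a moving average of observations.
alpha = 0.5
f = lambda q, o: (1 - alpha) * q + alpha * o
g = lambda q: q

preds, q_final = run_filter(0.0, [1.0, 1.0, 1.0], f, g)
```

Learning a recursive filter then means choosing `q1`, `f` and `g` so that the predictions match the training sequences.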

System
- Non-linear
- Partially observable
- Controlled

Algorithm
- Theoretical Guarantees
- Scalability

Learning a Recursive Filter

[Figure: recursive filter. $f$ maps the state $q_t$ and the observation $o_t$ to the next state $q_{t+1}$; $g$ maps a state to the expected observation.]

| | Fix $g$ (Predictive State) | Learn $g$ (Latent State) |
|---|---|---|
| Learn a model, then derive $f$ | Predictive State Models [Method of moments: Two-stage regression] | HMM [EM, Tensor Decomp.] |
| Directly learn $f$ | PSIM [DAgger] | RNN [BPTT] |

Predictive state: $q_t \equiv E[\psi(o_{t:\infty}) \mid o_{1:t-1}]$, i.e. state = E[sufficient future stats].

Why restrict $f$ and $g$?

* Predictive State (a.k.a. Observable Representation):
  ◦ State is a prediction of future observation statistics.
  ◦ Future statistics are noisy estimates of the state.
  ◦ Reduction to supervised learning.

* Additional assumptions on dynamics facilitate the development of an efficient algorithm with provable guarantees.

* Local improvement is still possible.

[Figure labels: PSM, PSIM, RNN]

Predictive State Model (Formulation)

Predictive State: $q_t \equiv P(o_{t:t+k-1} \mid o_{1:t-1}) \equiv E[\psi_t \mid o_{1:t-1}]$

Extended Predictive State: $p_t \equiv P(o_{t:t+k} \mid o_{1:t-1}) \equiv E[\zeta_t \mid o_{1:t-1}]$

Linear Dynamics (learned): $p_t = W q_t$

Filtering (fixed): $f(q_t, o_t) = f_{filter}(p_t, o_t) = f_{filter}(W q_t, o_t)$

[Figure: timeline of observations $o_{t-1}, o_t, \ldots, o_{t+k-1}, o_{t+k}$. The future $\psi_t$ covers $o_{t:t+k-1}$; the extended future $\zeta_t$ covers $o_{t:t+k}$.]

| $\psi_t$ | $q_t$ | $f_{filter}$ |
|---|---|---|
| Indicator Vector | Joint Probability Table | Bayes Rule: $P(o_{t+1:t+k} \mid o_{1:t}) \propto P(o_{t:t+k} \mid o_{1:t-1})$ |
| 1st and 2nd moments | Gaussian Distribution | Gaussian conditional mean and covariance |

Why linear $W$? Crucial to the consistency of the learning algorithm.

Why this particular filtering formulation? Matches existing models (HMM, Kalman filter, PSR).

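Predictive State Model (Learning)

$\psi_t$ and $\zeta_t$ are unbiased estimates of $q_t$ and $p_t$:

• $\psi_t = q_t + \epsilon_t$
• $\zeta_t = p_t + \nu_t$

Learning Procedure:

• $\hat{q}_0 = \frac{1}{N} \sum_i \psi_i$
• Learn $W$ using linear regression with examples $(\psi_t, \zeta_t)$

$\epsilon_t$ and $\nu_t$ are correlated: $Cov(q_t, p_t) \neq Cov(\psi_t, \zeta_t)$

[Figure: timeline of observations $o_{t-1}, o_t, \ldots, o_{t+k-1}, o_{t+k}$ with the future $\psi_t$ and the extended future $\zeta_t$.]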

Learning Procedure with denoising:

• $\hat{q}_0 = \frac{1}{N} \sum_i \psi_i$
• Denoise examples $(\psi_t, \zeta_t)$ to obtain $(\hat{q}_t, \hat{p}_t)$
• Learn $W$ using linear regression with examples $(\hat{q}_t, \hat{p}_t)$

[Figure: the future $\psi_t$ and the extended future $\zeta_t$ are replaced by the denoised future $\hat{q}_t$ and the denoised extended future $\hat{p}_t$.]

How to denoise? Use regression on the history. Since $p_t = W q_t$, taking conditional expectations gives:

$E[\zeta_t \mid o_{1:t-1}] = W E[\psi_t \mid o_{1:t-1}]$

$E[\zeta_t \mid o_{t-k:t-1}] = W E[\psi_t \mid o_{t-k:t-1}]$

$\hat{p}_t = W \hat{q}_t$
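The role of the history can be made explicit with a short derivation. This is a sketch under stated assumptions rather than the talk's own argument: $h_t$ denotes a feature vector of the history $o_{t-k:t-1}$ (a symbol introduced here), and the noise terms are assumed to satisfy $E[\epsilon_t \mid o_{1:t-1}] = E[\nu_t \mid o_{1:t-1}] = 0$, which follows from $\psi_t$ and $\zeta_t$ being unbiased estimates.

```latex
% Regressing zeta_t directly on psi_t: the noise cross-covariance does not vanish.
\begin{align*}
E[\zeta_t \psi_t^\top]\, E[\psi_t \psi_t^\top]^{-1}
  &= \bigl(W\, E[q_t q_t^\top] + E[\nu_t \epsilon_t^\top]\bigr)
     \bigl(E[q_t q_t^\top] + E[\epsilon_t \epsilon_t^\top]\bigr)^{-1}
   \neq W \quad \text{in general.}
\end{align*}

% Correlating with the history instead: both noise terms drop out.
\begin{align*}
E[\zeta_t h_t^\top]
  &= W\, E[q_t h_t^\top] + E[\nu_t h_t^\top]
   = W\, E[q_t h_t^\top], \\
E[\psi_t h_t^\top]
  &= E[q_t h_t^\top] + E[\epsilon_t h_t^\top]
   = E[q_t h_t^\top], \\
\Rightarrow\quad E[\zeta_t h_t^\top] &= W\, E[\psi_t h_t^\top].
\end{align*}
```

Under these assumptions $W$ can be recovered from the last identity whenever $E[\psi_t h_t^\top]$ has full row rank.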

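Learning Dynamical Systems Using Instrument Regression

[Figure: timeline of observations $o_{t-1}, o_t, \ldots, o_{t+k-1}, o_{t+k}$ with the history $h_t$, the future $\psi_t$ and the extended future $\zeta_t$.]

Learning:

• S1A regression: history $h_t$ → denoised future $E[q_t \mid h_t]$
• S1B regression: history $h_t$ → denoised extended future $E[p_t \mid h_t]$
• S2 regression (learn $W$): denoised future → denoised extended future

Filtering:

• Apply $W$: estimated future $\hat{q}_t$ → estimated extended future $\hat{p}_t$
• Condition on $o_t$ (filter): $q_{t+1}$
• Marginalize $o_t$ (predict): $q_{t+1|t-1}$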

In a nutshell

* Predictive State:
  ◦ State is a prediction of future observations
  ◦ Future observations are noisy estimates of the state

* Two stage regression:
  ◦ Use history features to “denoise” states (S1 Regression)
  ◦ Use denoised states to learn dynamics (S2 Regression)
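The two stages can be illustrated on a scalar toy problem. This is a minimal sketch on synthetic data, not the talk's code: the data generator, the helper `slope`, and the choice of plain least squares for every stage are all assumptions. The state is taken to equal the history feature, and the two noise terms share a common component so that they are correlated.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x without intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


rng = random.Random(0)
W_true, n = 2.0, 20000
h, psi, zeta = [], [], []
for _ in range(n):
    h_t = rng.gauss(0.0, 1.0)               # history feature
    q_t = h_t                               # toy assumption: state = history feature
    eps = rng.gauss(0.0, 1.0)               # noise in the future
    nu = eps + 0.1 * rng.gauss(0.0, 1.0)    # correlated noise in the extended future
    h.append(h_t)
    psi.append(q_t + eps)                   # psi_t  = q_t + eps_t
    zeta.append(W_true * q_t + nu)          # zeta_t = W q_t + nu_t

# Naive regression of zeta on psi is biased by the correlated noise.
w_naive = slope(psi, zeta)

# S1A / S1B: denoise both targets by regressing them on the history.
a = slope(h, psi)
b = slope(h, zeta)
q_hat = [a * x for x in h]
p_hat = [b * x for x in h]

# S2: learn W from the denoised pairs.
w_2sr = slope(q_hat, p_hat)
```

In this construction the naive estimate converges to 1.5 while the two-stage estimate converges to the true value 2, which is the effect the denoising step is meant to achieve.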

What do we gain?

* More understanding of existing algorithms:
  ◦ Spectral algorithms for learning HMMs, Kalman filters, PSRs are two stage regression algorithms with linear regression in all stages.

* Theoretical Results (Asymptotic and finite sample):
  ◦ Error in estimating $W$ is $\tilde{O}(1/\sqrt{N})$ [Under mild assumptions]
  ◦ Exact rate depends on S1 regression error

* New flavors of dynamical systems learning algorithms:
  ◦ HMM with logistic regression.
  ◦ Online learning of linear dynamical systems (Sun et al. 2015).
  ◦ Linear dynamical systems with sparse dynamics (Hefny et al. 2015, Gus Xia 2016).

Predictive State Models as RNNs

[Figure: recurrent unit. $W$ maps $q_t$ to $p_t$; $f_{filter}$ combines $p_t$ with $o_t$ to give $q_{t+1}$. The states $q_t$ and $q_{t+1}$ are predictions of $\psi_t$ and $\psi_{t+1}$.]

Predictive state models define RNNs that are easy to initialize!

Back to Special Case: Modeling Discrete Systems with Indicator Vectors

Assume the discrete case: $o_t$ is an indicator vector.

Let $\psi_t = o_t$ and $\zeta_t = o_t \otimes o_{t+1}$.

Then:

• $q_t$: probability vector
• $p_t$: joint probability table
• $f(p_t, o_t)$: choose the column of $p_t$ corresponding to $o_t$, then renormalize
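The discrete filtering step is small enough to write out. The following is a minimal sketch, assuming the joint table is stored with rows indexed by $o_{t+1}$ and columns indexed by $o_t$ (the layout is an assumption, as is the function name `condition_on_observation`). The table values reuse the numbers from this deck's illustration of the discrete case.

```python
def condition_on_observation(p, o):
    """f(p_t, o_t): pick the column of p selected by the indicator o, then renormalize."""
    j = o.index(1)                      # position of the observed symbol
    column = [row[j] for row in p]      # unnormalized P(o_{t+1}, o_t = j | history)
    total = sum(column)
    return [v / total for v in column]  # P(o_{t+1} | o_t = j, history)


# Joint table p_t (rows: o_{t+1}, columns: o_t) and an indicator observation.
p_t = [[0.10, 0.20, 0.10],
       [0.05, 0.05, 0.25],
       [0.05, 0.15, 0.05]]
o_t = [0, 1, 0]

q_next = condition_on_observation(p_t, o_t)
```

Here the selected column is (0.2, 0.05, 0.15), so `q_next` is approximately (0.5, 0.125, 0.375).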

Back to Special Case: Modeling Discrete Systems with Indicator Vectors

[Figure: one filtering step in the discrete case. $W$ maps the probability vector $q_t$ to the joint probability table $p_t$; the indicator observation $o_t$ selects a slice of $p_t$, which is normalized ($\|\cdot\|$) to give $q_{t+1}$. Illustrative values on the slide: $q_t = (0.2, 0.5, 0.3)^\top$, $o_t = (0, 1, 0)^\top$, and $p_t$ with rows $(0.1, 0.2, 0.1)$, $(0.05, 0.05, 0.25)$, $(0.05, 0.15, 0.05)$.]

Multiplicative unit: $q_{t+1}^{(k)} \propto C^\top (A q_t \circ B o_t)$

Predictive State Models as RNNs

Predictive units have a multiplicative structure, similar to LSTMs and GRUs.
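The multiplicative update can be sketched in a few lines. This is an illustration under assumptions, not the PSRNN implementation: the matrices `A`, `B`, `C` below are hand-picked toy values, the helper names are hypothetical, and the normalization divides by the sum of the entries, which suits the discrete case but is only one possible choice of norm.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def multiplicative_update(A, B, C, q, o):
    """q_next proportional to C^T (A q * B o), with * the elementwise product."""
    gate = [x * y for x, y in zip(matvec(A, q), matvec(B, o))]
    C_transposed = [list(col) for col in zip(*C)]
    raw = matvec(C_transposed, gate)
    total = sum(raw)                     # assumption: normalize to sum to one
    return [v / total for v in raw]


# Toy values (assumptions): identity A and C, B holds per-state observation weights.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.9, 0.2], [0.1, 0.8]]
C = [[1.0, 0.0], [0.0, 1.0]]
q_t = [0.5, 0.5]
o_t = [1, 0]

q_next = multiplicative_update(A, B, C, q_t, o_t)
```

The state and the observation interact only through the elementwise product, which is the gating-like structure the comparison with LSTMs and GRUs refers to.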

What about the continuous case?

Mean-maps for Continuous Observations

Mean-maps provide a powerful tool to model non-parametric distributions using the feature map of a universal kernel.

A discrete distribution is a special case that uses the delta kernel and indicator feature map. Continuous distributions can be modeled using, e.g., an RBF kernel.

| Discrete Case | General Case | RFF (random Fourier features) Approximation |
|---|---|---|
| Indicator Vector | Kernel feature map $\phi(x)$ | RFF Feature Vector |
| Joint Probability Table $P(X,Y)$ | Covariance Operator $C_{XY}$ | Covariance Matrix |
| Conditional Probability Table | Conditional Operator $C_{X \mid Y}$ | Conditional Matrix |
| Normalization $P(X,Y) \to P(X \mid Y)$ | Kernel Bayes Rule $C_{X \mid Y} = C_{XY} C_{YY}^{-1}$ | Solve Linear System |
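The RFF feature vector in the last column can be illustrated with the standard random Fourier feature construction for the RBF kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$. This is a minimal sketch; the function name `make_rff`, the bandwidth, the number of features and the test points are assumptions made for illustration.

```python
import math
import random


def make_rff(dim, num_features, bandwidth, seed=0):
    """Return a feature map phi with phi(x) . phi(y) approximating the RBF kernel."""
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 1 / bandwidth^2), phases uniformly from [0, 2*pi).
    freqs = [[rng.gauss(0.0, 1.0 / bandwidth) for _ in range(dim)]
             for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def phi(x):
        return [scale * math.cos(sum(w * xi for w, xi in zip(ws, x)) + b)
                for ws, b in zip(freqs, phases)]

    return phi


phi = make_rff(dim=2, num_features=2000, bandwidth=1.0)
x, y = [0.3, -0.2], [0.1, 0.4]

approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
```

With finite-dimensional features, covariance and conditional operators become ordinary matrices, which is why the kernel Bayes rule reduces to solving a linear system.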

Results

[Figure: two panels. Character prediction accuracy on English text; error predicting handwriting trajectories.]

Extension to Controlled Systems

In controlled systems we have observations and actions (e.g. car velocity and pressure on pedals).

[Figure: graphical model over the latent states $s_t$ and $s_{t+1}$, the action $a_t$ and the observation $o_t$.]

Recursive Filter:

$E[o_t \mid q_t, a_t] = g(q_t, a_t)$

$E[q_{t+1} \mid q_t, o_t, a_t] = f(q_t, a_t, o_t)$

Extension to Controlled Systems

Same principle: $P_t = W(Q_t)$

This time, the predictive state is a linear operator encoding the conditional distribution of future observations given future actions.

Example: Think of $Q_t$ and $P_t$ as conditional probability tables.

Requires appropriate modifications to S1 regression. S2 regression remains the same.

Extension to Controlled Systems

Two stage regression: It is all about finding

$E[Q_t \mid o_{t-k:t-1}, a_{t-k:t-1}]$

Example $Q_t$ as a conditional probability table (rows: observation, columns: action):

| 0.1 | 0.8 | 0.5 | 0.2 |
| 0.3 | 0.1 | 0.2 | 0.1 |
| 0.6 | 0.1 | 0.3 | 0.7 |

Problem:

At each time step we observe a noisy version of a slice of $Q_t$ (rows: observation, columns: action):

| ? | ? | 1 | ? |
| ? | ? | 0 | ? |
| ? | ? | 0 | ? |

The target of the regression is a function of the history:

$E[Q_t \mid o_{t-k:t-1}, a_{t-k:t-1}] = Q(o_{t-k:t-1}, a_{t-k:t-1})$

Solution 1 (Joint Modeling):

Predict the joint distribution of observations and actions.

Manually convert to a conditional table (e.g. normalize columns).

Regression target (rows: observation, columns: action):

| 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 |
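The conversion step of Solution 1 amounts to column normalization. The following is a minimal sketch, assuming rows index observations and columns index actions as in the tables above; the joint table values and the function name `normalize_columns` are made up for illustration, and mapping an all-zero column to zeros is a choice of this sketch rather than something the slides specify.

```python
def normalize_columns(joint):
    """Turn a joint table P(o, a) into a conditional table P(o | a)."""
    n_rows, n_cols = len(joint), len(joint[0])
    col_sums = [sum(joint[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[joint[i][j] / col_sums[j] if col_sums[j] > 0 else 0.0
             for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical joint distribution over 2 observations and 3 actions.
joint = [[0.05, 0.20, 0.10],
         [0.15, 0.20, 0.30]]

cond = normalize_columns(joint)
```

Each column of `cond` sums to one, so it can be read as the distribution of the observation given the corresponding action.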

Solution 2 (Conditional Modeling):

Train a regression model to fit the observed slice of $Q_t$:

$\min_Q \sum_t \| Q(o_{t-k:t-1}, a_{t-k:t-1}) \psi_t^a - \psi_t^o \|^2$

Regression target (rows: observation, columns: action; D/C = don't care):

| D/C | D/C | 1 | D/C |
| D/C | D/C | 0 | D/C |
| D/C | D/C | 0 | D/C |

Results

Predicting the pose of a swimming robot:
• Ground Truth
• Linear ARX
• RFFPSR (our model)

Reinforcement Learning with Predictive State Policy Networks

[Figure: a PSRNN receives $(o_t, a_t)$ and feeds a feed-forward policy that outputs $a_{t+1}$.]

• Can be initialized
• States have a meaning: can be trained to match observations

Preliminary Results

Conclusions

- Predictive State Models for filtering in dynamical systems:
  ◦ Predictive State: State is a prediction of future observations.
  ◦ Two Stage Regression: Learning predictive state models can be reduced to supervised learning.

- Predictive State Models are a special type of recurrent network.

- Can be extended to controlled systems and employed in reinforcement learning.

… upon that to develop a principled and efficient approach for learning to predict and act in continuous systems?

Thank you!

* Hefny, Downey and Gordon, “Supervised Learning for Dynamical System Learning” (NIPS 2015), https://arxiv.org/abs/1505.05310

* Hefny, Downey and Gordon, “Supervised Learning for Controlled Dynamical System Learning”, https://arxiv.org/abs/1702.03537

* Downey, Hefny, Li, Boots and Gordon, “Predictive State Recurrent Neural Networks”, https://arxiv.org/abs/1705.09353