Efficient and Tractable System Identification through Supervised Learning
Ahmed Hefny
Machine Learning Department
Carnegie Mellon University
8/1/2017

Joint work with
Geoffrey Gordon, Carlton Downey,
Byron Boots (Georgia Tech),
Zita Marinho, Wen Sun
Outline
- Problem Statement:
  ◦ Learning Dynamical Systems
  ◦ Solution Properties
- Formulation:
  ◦ A Taxonomy of Dynamical System Models
  ◦ Predictive State Models: Formulation and Learning
  ◦ Connection to Recurrent Networks
- Extensions:
  ◦ Controlled Systems
  ◦ Reinforcement Learning
Learning Dynamical Systems
[Figure: a latent system state s_t (e.g. position & speed) evolves to s_{t+1} via the system dynamics; each state emits an observation o_t via the observation model.]
Learning Dynamical Systems
Belief: b_t ≡ P(s_t | o_{1:t-1})
[Figure: timeline split into history o_{1:t-1} and future o_{t:∞}, with belief b_t at time t; P(o_{t+k} | o_{1:t-1}) ≡ P(o_{t+k} | b_t).]
Learning a Recursive Filter
Given training sequences: (o_1, o_2, …, o_N)
Output:
◦ Initial belief q_1
◦ Filtering function f: q_{t+1} = f(q_t, o_t)
◦ Observation function g: E[o_t | o_{1:t-1}] = g(q_t)
Desired properties:
◦ System: non-linear, partially observable, controlled
◦ Algorithm: theoretical guarantees, scalability
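To make the (q_1, f, g) interface concrete, here is a minimal sketch (not from the talk) of a generic recursive filter loop, instantiated with a toy 2-state HMM whose transition and observation matrices are made up for illustration:

```python
import numpy as np

def run_filter(q1, f, g, observations):
    """Run a generic recursive filter.

    q1: initial belief
    f:  filtering function, q_{t+1} = f(q_t, o_t)
    g:  observation function, E[o_t | o_{1:t-1}] = g(q_t)
    Returns the one-step-ahead observation predictions.
    """
    q, predictions = q1, []
    for o in observations:
        predictions.append(g(q))   # predict before seeing o_t
        q = f(q, o)                # update belief with o_t
    return predictions

# Toy 2-state HMM written in the (f, g) interface.
T = np.array([[0.9, 0.2],          # transition matrix P(s' | s)
              [0.1, 0.8]])
O = np.array([[0.8, 0.3],          # observation matrix P(o | s)
              [0.2, 0.7]])

def f(q, o):
    b = O[o] * q                   # condition belief on observation o
    b /= b.sum()
    return T @ b                   # propagate through the dynamics

def g(q):
    return O @ q                   # predicted observation distribution

preds = run_filter(np.array([0.5, 0.5]), f, g, observations=[0, 0, 1, 0])
print(preds[0])                    # ≈ [0.55, 0.45]
```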
Learning a Recursive Filter: A Taxonomy
[Diagram: recursive filter with belief q_t; f maps (q_t, o_t) to q_{t+1}; g maps q_t to the predicted observation.]

                       | Fix g (Predictive State)     | Learn g (Latent State)
Learn a model,         | Predictive State Models      | HMM
then derive f          | [Method of moments:          | [EM, Tensor Decomp.]
                       |  two-stage regression]       |
Directly learn f       | PSIM [DAgger]                | RNN [BPTT]

Predictive state: q_t ≡ E[ψ(o_{t:∞}) | o_{1:t-1}], i.e. state = E[sufficient future stats].
Why restrict f and g?
* Predictive state (a.k.a. observable representation):
  ◦ The state is a prediction of future observation statistics.
  ◦ Future statistics are noisy estimates of the state.
  ◦ This yields a reduction to supervised learning.
* Additional assumptions on the dynamics facilitate an efficient learning algorithm with provable guarantees.
* Local improvement is still possible. [Spectrum: PSM, PSIM, RNN]
Predictive State Model (Formulation)
Predictive state:          q_t ≡ P(o_{t:t+k-1} | o_{1:t-1}) ≡ E[ψ_t | o_{1:t-1}]
Extended predictive state: p_t ≡ P(o_{t:t+k} | o_{1:t-1}) ≡ E[ζ_t | o_{1:t-1}]
Linear dynamics (learned): p_t = W q_t
Filtering (fixed):         q_{t+1} = f(q_t, o_t) = f_filter(W q_t, o_t)

[Timeline: …, o_{t-1}, o_t, …, o_{t+k-1}, o_{t+k}, …; the future window ψ_t spans o_{t:t+k-1} and the extended future ζ_t spans o_{t:t+k}.]

Examples:
◦ Discrete systems: ψ_t is an indicator vector and p_t a joint probability table; filtering is Bayes rule, P(o_{t+1:t+k} | o_{1:t}) ∝ P(o_{t:t+k} | o_{1:t-1}).
◦ Gaussian systems: ψ_t collects 1st and 2nd moments; filtering uses the Gaussian conditional mean and covariance.

Why a linear W? It is crucial to the consistency of the learning algorithm.
Why this particular filtering formulation? It matches existing models (HMM, Kalman filter, PSR).
Predictive State Model (Learning)
ψ_t and ζ_t are unbiased estimates of q_t and p_t:
• ψ_t = q_t + ε_t
• ζ_t = p_t + ν_t

Naive learning procedure:
• q̂_0 = (1/N) Σ_t ψ_t
• Learn W by linear regression with examples (ψ_t, ζ_t)

Problem: ε_t and ν_t are correlated, since ψ_t and ζ_t share observations, so Cov(ψ_t, ζ_t) ≠ Cov(q_t, p_t) and the regression is biased.
Predictive State Model (Learning, continued)
Learning procedure with denoising:
• q̂_0 = (1/N) Σ_t ψ_t
• Denoise the examples (ψ_t, ζ_t) to obtain (ψ̄_t, ζ̄_t)
• Learn W by linear regression with examples (ψ̄_t, ζ̄_t)

How to denoise? Since p_t = W q_t,
E[ζ_t | o_{t-k:t-1}] = W E[ψ_t | o_{t-k:t-1}]
so use regression on history features to estimate the denoised future ψ̄_t and denoised extended future ζ̄_t, then fit ζ̄_t ≈ W ψ̄_t.
Learning Dynamical Systems Using Instrumental Regression
[Diagram: history h_t, future ψ_t, extended future ζ_t on a timeline.
S1A regression: history h_t → denoised future E[ψ_t | h_t] (from estimated futures ψ_t).
S1B regression: history h_t → denoised extended future E[ζ_t | h_t] (from estimated extended futures ζ_t).
S2 regression: learn W mapping denoised futures to denoised extended futures.
Then apply W and either condition on o_t (filter) to obtain q_{t+1}, or marginalize o_t (predict) to obtain q_{t+1|t-1}.]
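The two-stage (instrumental) regression pipeline can be sketched on synthetic data. Everything below is a hypothetical illustration, not the authors' code: the true states are made linear in the history features, the future and extended-future features carry correlated noise, and ridge regression stands in for all regression stages.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic setup: true predictive states Q are linear in
# history features H, extended states obey p_t = W q_t, and the observed
# features Psi, Zeta carry *correlated* noise, which biases naive regression.
d, n = 3, 5000
W_true = rng.normal(size=(d, d))
H = rng.normal(size=(n, d))                 # history features (instruments)
Q = H @ rng.normal(size=(d, d))             # true predictive states
P = Q @ W_true.T                            # linear dynamics p_t = W q_t
noise = rng.normal(size=(n, d))
Psi = Q + noise                             # noisy future features
Zeta = P + 2.0 * noise                      # noise correlated with Psi's

def linreg(X, Y, reg=1e-6):
    """Ridge-regularized least squares: B such that Y ~ X @ B."""
    return np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ Y)

# S1: denoise futures and extended futures by regressing them on history.
Psi_bar = H @ linreg(H, Psi)
Zeta_bar = H @ linreg(H, Zeta)

# S2: learn the dynamics W from the denoised pairs.
W_2stage = linreg(Psi_bar, Zeta_bar).T
W_naive = linreg(Psi, Zeta).T               # ignores the noise correlation

err_2stage = np.linalg.norm(W_2stage - W_true)
err_naive = np.linalg.norm(W_naive - W_true)
print(err_2stage, err_naive)                # two-stage error is far smaller
```

The point of the comparison: regressing Zeta on Psi directly is biased because their noise terms are correlated, while projecting both onto the history features first removes the noise in the limit.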
In a nutshell
* Predictive state:
  ◦ State is a prediction of future observations.
  ◦ Future observations are noisy estimates of the state.
* Two-stage regression:
  ◦ Use history features to "denoise" states (S1 regression).
  ◦ Use denoised states to learn dynamics (S2 regression).
What do we gain?
* A better understanding of existing algorithms:
  ◦ Spectral algorithms for learning HMMs, Kalman filters, and PSRs are two-stage regression algorithms with linear regression in all stages.
* Theoretical results (asymptotic and finite-sample):
  ◦ The error in estimating W is Õ(1/√N) [under mild assumptions].
  ◦ The exact rate depends on the S1 regression error.
* New flavors of dynamical system learning algorithms:
  ◦ HMMs with logistic regression.
  ◦ Online learning of linear dynamical systems (Sun et al. 2015).
  ◦ Linear dynamical systems with sparse dynamics (Hefny et al. 2015; Gus Xia 2016).
Predictive State Models as RNNs
[Diagram: an RNN cell that maps (q_t, o_t) to q_{t+1} via q_{t+1} = f_filter(W q_t, o_t).]
Predictive state models define RNNs that are easy to initialize.
Back to the Special Case: Modeling Discrete Systems with Indicator Vectors
Assume the discrete case: o_t is an indicator vector.
Let ψ_t = o_t and ζ_t = o_t ⊗ o_{t+1}. Then:
◦ q_t is a probability vector.
◦ p_t is a joint probability table.
◦ f(p_t, o_t): choose the column of p_t corresponding to o_t, then renormalize.
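The discrete filtering step can be sketched in a few lines, reusing the illustrative numbers from the following figure: select the joint table's column for the observed symbol and renormalize.

```python
import numpy as np

# The extended state p_t as a joint probability table P(o_t, o_{t+1} | history);
# columns index o_t, rows index o_{t+1}. Numbers are illustrative only.
P_t = np.array([[0.10, 0.20, 0.10],
                [0.05, 0.05, 0.25],
                [0.05, 0.15, 0.05]])

def filter_step(P_t, o_t):
    """Condition the joint table on the observed symbol o_t (Bayes rule)."""
    col = P_t[:, o_t]          # slice for the observed symbol
    return col / col.sum()     # renormalize to get q_{t+1}

q_next = filter_step(P_t, o_t=1)   # observed the second symbol
print(q_next)                      # ≈ [0.5, 0.125, 0.375]
```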
Back to the Special Case: Modeling Discrete Systems with Indicator Vectors
[Figure (three build slides): the state q_t = (0.2, 0.5, 0.3) is mapped by W to a joint probability table p_t; observing o_t, the indicator (0, 1, 0), selects the corresponding column, which is renormalized (|| . ||) to give q_{t+1}.]

q_{t+1} ∝ C^⊤ (A q_t ⊙ B o_t): a multiplicative unit.
Predictive State Models as RNNs
Predictive units have a multiplicative structure, similar to LSTMs and GRUs.
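The multiplicative update above can be sketched as follows; all dimensions and parameter matrices here are made-up placeholders, not learned values:

```python
import numpy as np

rng = np.random.default_rng(1)
d_q, d_o, d_h = 4, 3, 5                 # hypothetical state/obs/hidden sizes

# Hypothetical parameters of a multiplicative (PSRNN-style) unit:
# q_{t+1} is proportional to C^T (A q_t * B o_t), with * elementwise.
A = rng.normal(size=(d_h, d_q))
B = rng.normal(size=(d_h, d_o))
C = rng.normal(size=(d_h, d_q))

def update(q, o):
    z = (A @ q) * (B @ o)               # multiplicative state/observation mix
    q_new = C.T @ z
    return q_new / np.linalg.norm(q_new)   # the || . || normalization unit

q = np.ones(d_q) / d_q                  # uniform initial state
o = np.array([0.0, 1.0, 0.0])           # one-hot observation
q = update(q, o)
print(q.shape)                          # (4,)
```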
What about the continuous case?
Mean Maps for Continuous Observations
Mean maps provide a powerful tool for modeling non-parametric distributions using the feature map of a universal kernel. A discrete distribution is a special case that uses the delta kernel and an indicator feature map; continuous distributions can be modeled using e.g. an RBF kernel.

Discrete Case                   | General Case                                   | RFF Approximation
Indicator vector                | Kernel feature map φ(x)                        | RFF feature vector
Joint probability table P(X,Y)  | Covariance operator C_{XY}                     | Covariance matrix
Conditional probability table   | Conditional operator C_{X|Y}                   | Conditional matrix
Normalization P(X,Y) → P(X|Y)   | Kernel Bayes rule C_{X|Y} = C_{XY} C_{YY}^{-1} | Solve a linear system
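The RFF column of the table rests on the standard random Fourier features construction, which turns kernel evaluations into inner products of finite feature vectors so that covariance operators become ordinary covariance matrices. A generic sketch (not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)

def rff_features(X, n_features=200, gamma=1.0, rng=rng):
    """Random Fourier features approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2 / 2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(gamma), size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Kernel evaluations become inner products of feature vectors.
X = rng.normal(size=(5, 2))
Z = rff_features(X, n_features=5000)
approx = Z @ Z.T                        # approximates the kernel Gram matrix
exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(approx - exact).max())     # small for large n_features
```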
Results
[Figures: character prediction accuracy on English text; error predicting handwriting trajectories.]
Extension to Controlled Systems
In controlled systems we have both observations and actions (e.g. car velocity and pressure on the pedals).
[Diagram: state s_t evolves to s_{t+1}, with action a_t as input and observation o_t as output.]
Recursive filter:
E[o_t | q_t, a_t] = g(q_t, a_t)
q_{t+1} = f(q_t, o_t, a_t)
Extension to Controlled Systems: Same Principle
p_t = W(q_t)
This time, the predictive state is a linear operator encoding the conditional distribution of future observations given future actions.
Example: think of q_t and p_t as conditional probability tables.
This requires appropriate modifications to S1 regression; S2 regression remains the same.
Extension to Controlled Systems
Two-stage regression: it is all about finding E[q_t | o_{t-k:t-1}, a_{t-k:t-1}].
[Figure: q_t as a conditional probability table, actions as columns, observations as rows:
            Action
Observation 0.1  0.8  0.5  0.2
            0.3  0.1  0.2  0.1
            0.6  0.1  0.3  0.7]
Extension to Controlled Systems
Two-stage regression: it is all about finding E[q_t | o_{t-k:t-1}, a_{t-k:t-1}].
Problem: at each time step we observe only a noisy version of a slice of q_t.
[Figure: the conditional table with all entries unknown ("?") except the column for the action actually taken, observed as (1, 0, 0).]
Extension to Controlled Systems
Two-stage regression: it is all about finding E[q_t | o_{t-k:t-1}, a_{t-k:t-1}] = W(o_{t-k:t-1}, a_{t-k:t-1}).
Solution 1 (joint modeling):
◦ Predict the joint distribution of observations and actions.
◦ Manually convert it to a conditional table (e.g. normalize columns).
[Figure: the observed joint table has a single 1 entry at the (observation, action) pair that occurred, and 0 elsewhere.]
Extension to Controlled Systems
Two-stage regression: it is all about finding E[q_t | o_{t-k:t-1}, a_{t-k:t-1}] = W(o_{t-k:t-1}, a_{t-k:t-1}).
Solution 2 (conditional modeling):
Train the regression model to fit only the observed slice of q_t:
min_W Σ_t ‖ W(o_{t-k:t-1}, a_{t-k:t-1}) a_t^f − o_t^f ‖²
[Figure: unobserved entries of the table are "don't care" (D/C); only the observed column (1, 0, 0) contributes to the loss.]
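Solution 2 can be sketched on synthetic data. This is a hypothetical illustration, not the authors' code: assuming the conditional operator is linear in the history features, the slice W(h_t) a_t is linear in the Kronecker product of action and history features, so the objective above reduces to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d_h, d_a, d_o = 4000, 4, 3, 3        # hypothetical dimensions

# Synthetic data: the conditional operator Q(h) is linear in history
# features h, and we only observe the slice Q(h) @ a for the taken action a.
T_true = rng.normal(size=(d_o, d_a * d_h))
H = rng.normal(size=(n, d_h))
A = np.eye(d_a)[rng.integers(d_a, size=n)]             # one-hot actions
feats = np.einsum('na,nh->nah', A, H).reshape(n, -1)   # kron(a_t, h_t)
O = feats @ T_true.T + 0.1 * rng.normal(size=(n, d_o)) # noisy observed slices

# The conditional-modeling objective is least squares over these features.
T_hat, *_ = np.linalg.lstsq(feats, O, rcond=None)
T_hat = T_hat.T
print(np.abs(T_hat - T_true).max())     # close to the true operator
```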
Results
Predicting the pose of a swimming robot:
[Figure comparing ground truth, linear ARX, and RFFPSR (our model).]
Reinforcement Learning with Predictive State Policy Networks
[Diagram: a PSRNN filter maps (o_t, a_t) to q_{t+1}, which feeds a feed-forward policy.]
• The filter can be initialized (via two-stage regression).
• The states have a meaning: they can be trained to match observations.
Conclusions
- Predictive State Models for filtering in dynamical systems:
  ◦ Predictive state: the state is a prediction of future observations.
  ◦ Two-stage regression: learning predictive state models can be reduced to supervised learning.
- Predictive State Models are a special type of recurrent network.
- They can be extended to controlled systems and employed in reinforcement learning. Can we build upon that to develop a principled and efficient approach for learning to predict and act in continuous systems?
Thank you!
* Hefny, Downey, and Gordon, "Supervised Learning for Dynamical System Learning" (NIPS 2015), https://arxiv.org/abs/1505.05310
* Hefny, Downey, and Gordon, "Supervised Learning for Controlled Dynamical System Learning", https://arxiv.org/abs/1702.03537
* Downey, Hefny, Li, Boots, and Gordon, "Predictive State Recurrent Neural Networks", https://arxiv.org/abs/1705.09353