Invited by Tianyou Chai - UTA 05 NEU short course / 2016 DT ADP

Day 1: Markov Decision Processes and Discrete-time Approximate Dynamic Programming (ADP)

Page 1

Page 2

Reinforcement Learning and Adaptive Dynamic Programming (ADP) for Discrete-Time Systems

F.L. Lewis, NAI
Moncrief-O'Donnell Chair, UTA Research Institute (UTARI), The University of Texas at Arlington, USA
and
Qian Ren Consulting Professor, State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang, China

Talk available online at http://www.UTA.edu/UTARI/acs

Supported by: NSF, AFOSR Europe, ONR (Marc Steinberg), US TARDEC
Supported by: China NNSF, China Project 111

Page 3

Invited by Tianyou Chai

Page 4

Page 5

Kung Tz (Confucius), 500 BC
孔子

Archery
Chariot driving
Music
Rites and Rituals
Poetry
Mathematics

Man's relations to: Family, Friends, Society, Nation, Emperor, Ancestors

Tian xia da tong: Harmony under heaven

Page 6

He who exerts his mind to the utmost knows nature's pattern.

The way of learning is none other than finding the lost mind.

Meng Tz (Mencius), 500 BC

Man's task is to understand patterns in nature and society.

Page 7

SHORT COURSE OVERVIEW

Day 1- Markov Decision Processes and Discrete-time Approximate Dynamic Programming (ADP)

Day 2- ADP and Online Differential Games for Continuous-time Systems

Day 3- Data-driven Real-time Process Optimization, Supervisory Control, and Structured Controllers

Page 8

Books

F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012.
New chapters on: Reinforcement Learning, Differential Games

D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.

Page 9

F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.

F. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control," IEEE Control Systems Magazine, Dec. 2012.

Page 10

Importance of Feedback Control

Darwin: FB and natural selection
Volterra: FB and fish population balance
Adam Smith: FB and the international economy
James Watt: FB and the steam engine
FB and cell homeostasis

The resources available to most species for their survival are meager and limited.

Nature uses optimal control.

Page 11

Optimality in Biological Systems

Cell Homeostasis: the individual cell is a complex feedback control system. It pumps ions across the cell membrane to maintain homeostasis, and has only limited energy to do so.

Cellular Metabolism
Permeability control of the cell membrane

http://www.accessexcellence.org/RC/VL/GG/index.html

Page 12

Optimality in Control Systems Design
R. Kalman, 1960

Rocket Orbit Injection

http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf

Dynamics (radial position $r$, radial velocity $w$, tangential velocity $v$, thrust $F$, mass $m$, thrust angle $\phi$):
$\dot r = w$
$\dot w = \dfrac{v^2}{r} - \dfrac{\mu}{r^2} + \dfrac{F}{m}\sin\phi$
$\dot v = -\dfrac{wv}{r} + \dfrac{F}{m}\cos\phi$

Objectives: get to orbit in minimum time, use minimum fuel.

Page 13

Adaptive Control is generally not Optimal

Optimal Control is off-line, and needs to know the system dynamics to solve design eqs.

We want ONLINE DIRECT ADAPTIVE OPTIMAL control, for any performance cost of our own choosing.

Reinforcement Learning turns out to be the key to this!

F. Lewis, Draguna Vrabie, and Vassilis Syrmos, “Optimal Control,” 3rd edition, 2012. Chapter 11

F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.

Page 14

Different Methods of Learning

[Figure: actor-critic learning structure. An Actor applies control inputs to the System (the environment) and produces outputs; a Critic compares the outputs with the desired performance and generates a reinforcement signal used to tune the actor. The result is an adaptive learning system.]

Reinforcement learning: Ivan Pavlov, 1890s

Actor-Critic Learning

We want OPTIMAL performance: ADP, Approximate Dynamic Programming

Page 15

Outline
• Data-Based Online Optimal Control for Discrete-Time Systems
• Policy Iteration
• Value Iteration (HDP: Heuristic Dynamic Programming)
• Q Learning

Page 16

Discrete-Time Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = A x_k + B u_k$

Cost: $V(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u_i^T R u_i \right)$

Fact: the cost is quadratic, $V(x_k) = x_k^T P x_k$ for some symmetric matrix P.

The Hamiltonian gives the Bellman equation
$V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$

The optimal control comes from the stationarity condition for $\min_{u_k}\left( x_k^T Q x_k + u_k^T R u_k + V(x_{k+1}) \right)$:
$0 = 2 R u_k + 2 B^T P (A x_k + B u_k)$

Optimal control: $u_k = -(R + B^T P B)^{-1} B^T P A x_k$

DT Riccati equation: $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$
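As a concrete companion to the equations above, here is a minimal numerical sketch, with small illustrative A, B, Q, R (not values from the course), that iterates the DT Riccati recursion to a fixed point and then forms the optimal feedback gain.

```python
# Minimal sketch: solve the DT algebraic Riccati equation by iterating the
# Riccati recursion, then form the optimal LQR gain.  A, B, Q, R below are
# illustrative placeholders, not values from the course slides.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

P = np.zeros((2, 2))
for _ in range(500):                      # fixed-point iteration on the Riccati map
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = A.T @ P @ A + Q - A.T @ P @ B @ K

K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal control u_k = -K x_k
print("P =\n", P)
print("K =", K)
```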

Page 17

Bellman Equation: Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = A x_k + B u_k$

Cost: $V(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u_i^T R u_i \right)$

The cost is quadratic: $V(x_k) = x_k^T P x_k$ for some symmetric matrix P.

Bellman equation:
$V(x_k) = x_k^T Q x_k + u_k^T R u_k + V(x_{k+1})$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}$
$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$

Assume ANY stabilizing fixed feedback policy $u_k = -K x_k$. Then the Bellman equation is a Lyapunov equation:
$0 = x_k^T \left[ (A - BK)^T P (A - BK) - P + Q + K^T R K \right] x_k$

Then $V(x_k) = x_k^T P x_k$ is the cost of using the SVFB $u_k = -K x_k$.
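A similar sketch for this slide: for an assumed stabilizing gain K, iterate the DT Lyapunov map to obtain the P that prices the policy; the matrices are again illustrative only.

```python
# Minimal sketch: cost of a fixed stabilizing state feedback u_k = -K x_k.
# The Bellman equation reduces to the DT Lyapunov equation
#   (A - BK)^T P (A - BK) - P + Q + K^T R K = 0,
# solved here by simple iteration.  A, B, Q, R, K are illustrative assumptions.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[1.0, 1.5]])               # assumed stabilizing gain (checked below)

Acl = A - B @ K
assert np.all(np.abs(np.linalg.eigvals(Acl)) < 1), "K must be stabilizing"

P = np.zeros((2, 2))
for _ in range(2000):                     # P <- Acl^T P Acl + Q + K^T R K
    P = Acl.T @ P @ Acl + Q + K.T @ R @ K

x0 = np.array([1.0, 0.0])
print("cost of policy from x0:", x0 @ P @ x0)   # V(x0) = x0^T P x0
```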

Page 18

Optimal Control: The Linear Quadratic Regulator (LQR)

On-line real-time control loop: $u_k = -L x_k$ applied to the system $x_{k+1} = A x_k + B u_k$.

Off-line design loop using the ARE:
$0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$
$L = (R + B^T P B)^{-1} B^T P A$

User-prescribed optimization criterion: $J(Q, R)$

An offline design procedure that requires knowledge of the system dynamics model (A, B).

System modeling is expensive, time consuming, and inaccurate.

Page 19

1. System dynamics
2. Value/cost function
3. Bellman equation
4. HJ solution equation (Riccati eq.)
5. Policy iteration: gives the structure we need

We want to find optimal control solutions:
• Data-based method
• Online, in real time
• Using adaptive control techniques
• Without knowing the full dynamics

For nonlinear systems and general performance indices.

Page 20

Discrete-Time Systems Optimal Adaptive Control

System: $x_{k+1} = f(x_k) + g(x_k) u_k$

Cost: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i) = r(x_k, u_k) + \sum_{i=k+1}^{\infty} r(x_i, u_i)$

Example: $r(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k$

Control policy: $u_k = h(x_k)$ = the prescribed control input function.
Example: $u_k = -K x_k$, linear state variable feedback.

Bellman equation: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}), \quad V_h(0) = 0$

Difference equation equivalent: $V_h(x_k) = x_k^T Q x_k + u_k^T R u_k + V_h(x_{k+1})$

Page 21

Discrete-Time Optimal Adaptive Control

System: $x_{k+1} = f(x_k) + g(x_k) u_k$

Cost: $V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i)$

Control policy: $u_k = h(x_k)$ = the prescribed control policy

Bellman eq.: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$

Hamiltonian: $H(x_k, V_h, h) = r(x_k, h(x_k)) + V_h(x_{k+1}) - V_h(x_k)$

Optimal cost (Bellman's optimality principle gives the DT HJB equation):
$V^*(x_k) = \min_h \left( r(x_k, h(x_k)) + V^*(x_{k+1}) \right) = \min_{u_k} \left( r(x_k, u_k) + V^*(x_{k+1}) \right)$

Optimal control:
$h^*(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V^*(x_{k+1}) \right)$

Focus on these two equations.

Offline solution: difficult to solve, contains the dynamics.

Page 22

Discrete-Time Optimal Control: Solutions by the Computational Intelligence Community

Bellman eq.: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1}), \quad V_h(0) = 0$

$u_k = h(x_k)$ = the prescribed control policy; the policy must be stabilizing.

Theorem: Let $V_h(x_k)$ solve the Bellman equation. Then
$V_h(x_k) = \sum_{i=k}^{\infty} r(x_i, h(x_i))$

This gives the value of any prescribed control policy: Policy Evaluation for any given current policy.

The dynamics does not appear.
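To make the policy-evaluation theorem concrete, a scalar sketch (all numbers assumed, not from the slides) compares the Bellman-equation solution with the directly summed infinite-horizon cost of the same policy.

```python
# Minimal sketch of policy evaluation: for a scalar system x_{k+1} = a x_k + b u_k
# with policy u_k = -l x_k and cost r(x,u) = q x^2 + r u^2, the Bellman equation
#   V(x_k) = r(x_k, h(x_k)) + V(x_{k+1})  with V(x) = p x^2
# gives p = (q + r l^2) / (1 - (a - b l)^2).  Compare with the summed trajectory cost.
# All numbers below are illustrative assumptions.
a, b, q, r, l = 0.9, 1.0, 1.0, 1.0, 0.5
acl = a - b * l                              # closed-loop pole, must satisfy |acl| < 1

p_bellman = (q + r * l**2) / (1 - acl**2)    # closed-form Bellman solution

x, total = 1.0, 0.0
for _ in range(500):                         # directly sum the infinite-horizon cost
    u = -l * x
    total += q * x**2 + r * u**2
    x = a * x + b * u

print(p_bellman, total)                      # both ~ value of the policy at x0 = 1
```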

Page 23

Policy Improvement

Bellman's result, the optimal control:
$h^*(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V^*(x_{k+1}) \right)$

What about
$h'(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V_h(x_{k+1}) \right)$
for a given policy h(.)?

Theorem (Bertsekas). Let $V_h(x_k)$ be the value of any given policy $h(x_k)$. Then
$V_{h'}(x_k) \le V_h(x_k)$

This is Policy Improvement, the one-step improvement property of rollout algorithms.

Page 24

DT Policy Iteration

The cost of any given control policy $h(x_k)$ satisfies the recursion (Bellman eq., a recursive form / consistency equation):
$V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$

Recursive solution with an Actor/Critic structure. Pick a stabilizing initial control, then iterate:

Policy Evaluation: $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$

Policy Improvement: $h_{j+1}(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V_{j+1}(x_{k+1}) \right)$

Howard (1960) proved convergence for MDP.

f(.) and g(.) do not appear.

e.g. control policy = SVFB: $h(x_k) = -L x_k$

Temporal difference: $e_k = -V_{j+1}(x_k) + r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$

Page 25

DT Policy Iteration: Linear Systems, Quadratic Cost (LQR)

For any stabilizing policy, the cost is
$V(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u(x_i)^T R u(x_i) \right), \quad V(x) = x^T P x$

DT policy iterations:
Policy evaluation: $V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
Policy improvement: $u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \dfrac{d V_{j+1}(x_{k+1})}{d x_{k+1}}$

The policy evaluation step solves a Lyapunov equation without knowing A and B.

Equivalent to an underlying problem, the DT LQR:
$x_{k+1} = A x_k + B u_k = (A - B L) x_k, \quad u_k = -L x_k$
LQR value is quadratic: $V(x) = x^T P x$

Underlying DT Lyapunov eq. and gain update:
$(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0$
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$

Hewer proved convergence in 1971.

Policy iteration solves the Lyapunov equation WITHOUT knowing the system dynamics.
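A minimal sketch of Hewer's algorithm under the conventions above, with illustrative A, B, Q, R and an assumed stabilizing initial gain: policy evaluation by a Lyapunov solve, policy improvement by the gain formula.

```python
# Minimal sketch of DT policy iteration (Hewer's algorithm) for the LQR case:
# policy evaluation solves the Lyapunov equation for the current gain L_j,
# policy improvement sets L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A.
# A, B, Q, R and the initial stabilizing gain are illustrative assumptions.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
L = np.array([[1.0, 1.5]])                 # initial stabilizing gain (assumed)

def lyapunov(Acl, W, iters=3000):
    """Solve P = Acl^T P Acl + W by fixed-point iteration (Acl stable)."""
    P = np.zeros_like(W)
    for _ in range(iters):
        P = Acl.T @ P @ Acl + W
    return P

for j in range(20):
    P = lyapunov(A - B @ L, Q + L.T @ R @ L)           # policy evaluation
    L = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # policy improvement
print("converged gain L =", L)
```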

Page 26

DT Policy Iteration: Linear Systems, Quadratic Cost (LQR)

DT policy iterations:
Policy evaluation: $V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
Policy improvement: $u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \dfrac{d V_{j+1}(x_{k+1})}{d x_{k+1}}$

How to implement online?

Page 27

DT Policy Iteration: How to Implement Online? Linear Systems, Quadratic Cost (LQR)

System: $x_{k+1} = A x_k + B u_k$

LQR cost is quadratic:
$V(x_k) = \sum_{i=k}^{\infty} \left( x_i^T Q x_i + u(x_i)^T R u(x_i) \right) = x_k^T P x_k$ for some matrix P.

Policy evaluation step:
$V_{j+1}(x_k) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k) + V_{j+1}(x_{k+1})$
$x_k^T P_{j+1} x_k - x_{k+1}^T P_{j+1} x_{k+1} = x_k^T Q x_k + u_j^T R u_j$

Write the quadratic form in terms of a quadratic basis set. For a two-dimensional state, for example,
$x^T P x = p_{11} x_1^2 + 2 p_{12} x_1 x_2 + p_{22} x_2^2 = W^T \phi(x)$
with $\phi(x) = [\,x_1^2 \;\; x_1 x_2 \;\; x_2^2\,]^T$ and $W = [\,p_{11} \;\; 2 p_{12} \;\; p_{22}\,]^T$, so the policy evaluation step becomes
$W_{j+1}^T \left( \phi(x_k) - \phi(x_{k+1}) \right) = x_k^T Q x_k + u_j(x_k)^T R u_j(x_k)$

This solves the Lyapunov equation without knowing A and B.

Then update the control using
$h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A x_k$
Need to know A AND B for the control update.

Page 28

Implementation: DT Policy Iteration, Nonlinear Case

Value Function Approximation (VFA): $V(x) = W^T \phi(x)$, with weights $W$ and basis functions $\phi(x)$.

LQR case: V(x) is quadratic,
$V(x) = x^T P x = W^T \phi(x)$
with quadratic basis functions $\phi(x)$ and weights $W^T = [\,p_{11} \;\; 2 p_{12} \;\; \dots \;\; p_{nn}\,]$.

Nonlinear system case: use a neural network.
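A small sketch of the quadratic-basis bookkeeping for a two-dimensional state (dimensions and numbers are illustrative): it checks that $W^T\phi(x)$ reproduces $x^T P x$ when the cross terms carry the factor 2.

```python
# Minimal sketch of the quadratic basis used for VFA in the LQR case:
# V(x) = x^T P x = W^T phi(x), with phi collecting the distinct quadratic
# monomials and W carrying p_ii on the diagonal terms and 2*p_ij on the
# cross terms.  Dimension n = 2 here purely for illustration.
import numpy as np

def phi(x):
    """Quadratic basis [x1^2, x1*x2, x2^2] for a 2-dimensional state."""
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

def weights_from_P(P):
    """W = [p11, 2*p12, p22] so that W^T phi(x) = x^T P x."""
    return np.array([P[0, 0], 2*P[0, 1], P[1, 1]])

P = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -2.0])
print(x @ P @ x, weights_from_P(P) @ phi(x))   # identical values
```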

Page 29

Implementation: DT Policy Iteration

VFA: $V_j(x_k) = W_j^T \phi(x_k)$

Value function update for a given control (Bellman equation):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$

Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T \left( \phi(x_k) - \phi(x_{k+1}) \right) = r(x_k, h_j(x_k))$
with regression matrix $\phi(x_k) - \phi(x_{k+1})$.

Solve for the weights in real time using RLS, or by batch LS over many trajectories with different initial conditions over a compact set.

Since $x_{k+1}$ is measured, knowledge of f(x) or g(x) is not needed for the value function update.

Then update the control using
$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \dfrac{d V_{j+1}(x_{k+1})}{d x_{k+1}} = -\tfrac{1}{2} R^{-1} g(x_k)^T \nabla\phi(x_{k+1})^T W_{j+1}$
Need to know $g(x_k)$ for the control update.

This is indirect adaptive control with identification of the optimal value.

Page 30

Solving the IRL Bellman Equation

$W_{j+1}^T \left( \phi(x_k) - \phi(x_{k+1}) \right) = r(x_k, h_j(x_k))$

For a two-dimensional state,
$P = \begin{bmatrix} p_{11} & p_{12} \\ p_{12} & p_{22} \end{bmatrix}, \quad W = [\,p_{11} \;\; 2 p_{12} \;\; p_{22}\,]^T$

Solve for the value function parameters: need data from 3 time intervals to get 3 equations for the 3 unknowns,
$W_{j+1}^T \left( \phi(x_k) - \phi(x_{k+1}) \right) = r(x_k, h_j(x_k))$
$W_{j+1}^T \left( \phi(x_{k+1}) - \phi(x_{k+2}) \right) = r(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T \left( \phi(x_{k+2}) - \phi(x_{k+3}) \right) = r(x_{k+2}, h_j(x_{k+2}))$

Now solve by batch least-squares.
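A minimal data-based sketch of this batch least-squares step for a two-state example (system, gain, and weighting matrices are assumptions): stack the scalar Bellman equations from several intervals and solve for the three value parameters.

```python
# Minimal sketch of the batch least-squares policy-evaluation step: collect
# state transitions under the current policy u_k = -L x_k, stack the scalar
# Bellman equations W^T (phi(x_k) - phi(x_{k+1})) = r(x_k, u_k), and solve for
# the value weights W.  System, gain, and weighting matrices are assumed.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
L = np.array([[1.0, 1.5]])                          # current stabilizing policy

phi = lambda x: np.array([x[0]**2, x[0]*x[1], x[1]**2])

rows, rhs = [], []
rng = np.random.default_rng(0)
for _ in range(10):                                  # >= 3 intervals for 3 unknowns
    x = rng.standard_normal(2)                       # varied initial states give PE
    u = -(L @ x)
    x_next = A @ x + B @ u
    rows.append(phi(x) - phi(x_next))
    rhs.append(x @ Q @ x + u @ R @ u)

W, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
print("value weights W =", W)                        # [p11, 2*p12, p22] for this policy
```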

Page 31

Reinforcement Learning (IRL): A DATA-BASED APPROACH

Solve the Bellman equation
$W_{j+1}^T \left( \phi(x_k) - \phi(x_{k+1}) \right) = r(x_k, h_j(x_k))$
This solves the Lyapunov eq. without knowing the dynamics.

Data set at time [k, k+1): $\{\, x_k, \; r(x_k, h_j(x_k)), \; x_{k+1} \,\}$

Timeline:
• At time k: observe $x_k$, apply $u_k$, observe the cost $r(x_k, h_j(x_k))$, then observe $x_{k+1}$ and update $W_{j+1}$.
• At time k+1: apply $u_{k+1}$, observe the cost $r(x_{k+1}, h_j(x_{k+1}))$, observe $x_{k+2}$, update $W_{j+1}$.
• At time k+2: apply $u_{k+2}$, observe the cost $r(x_{k+2}, h_j(x_{k+2}))$, observe $x_{k+3}$, update $W_{j+1}$.
• Do RLS until convergence to $W_{j+1}$, or use batch least-squares; then update the control
$u_{j+1}(x_k) = -\tfrac{1}{2} R^{-1} g(x_k)^T \nabla\phi(x_{k+1})^T W_{j+1}$

This is a data-based approach that uses measurements of $x_k, u_k$ instead of the plant dynamical model. $f(x_k)$ is not needed anywhere.

Page 32

Continuous control feedback gain with discrete gain updates (LQR case): the control $u_k(x_k) = -L_j x_k$ is applied every sample period T, while the gain (policy) $L_j$ is updated at the ends of update intervals j = 1, 2, ...

The update intervals need not be the same length; they can be selected online in real time.

Page 33

Persistence of Excitation

The regression vector $\phi(x_k) - \phi(x_{k+1})$ in
$W_{j+1}^T \left( \phi(x_k) - \phi(x_{k+1}) \right) = r(x_k, h_j(x_k))$
must be persistently exciting (PE).
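One common way to keep the regressor persistently exciting is to superimpose a small probing (dither) signal on the policy while data are collected; the sketch below illustrates the idea, with the noise level and decay schedule being assumptions rather than prescriptions from the course.

```python
# Minimal sketch of one common way to maintain persistence of excitation during
# learning: superimpose a small decaying probing (dither) signal on the policy
# while data are collected.  Noise level and schedule are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def control_with_probing(L, x, k, noise_std=0.05):
    """Policy u_k = -L x_k plus a decaying exploration term for excitation."""
    probe = noise_std * rng.standard_normal(L.shape[0]) / (1 + 0.01 * k)
    return -(L @ x) + probe
```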

Page 34

The Adaptive Critic Architecture: Adaptive Critics

[Figure: the Action network applies the control policy $h_j(x_k)$ to the System; the Critic network performs Policy Evaluation from the observed cost.]

Value update (use RLS until convergence):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$

Control policy update:
$h_{j+1}(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V_{j+1}(x_{k+1}) \right)$

Leads to an ONLINE FORWARD-IN-TIME implementation of optimal control: Optimal Adaptive Control.

Page 35

Adaptive Control

[Figure: plant with control input and output.]

• Identify the controller: Direct Adaptive Control
• Identify the system model: Indirect Adaptive Control
• Identify the performance value $V(x) = W^T \phi(x)$: Optimal Adaptive Control

Page 36

Simulation Example
• Linear system: aircraft longitudinal dynamics. Unstable, two-input system.

A =
[  1.0722   0.0954   0       -0.0541  -0.0153
   4.1534   1.1175   0       -0.8000  -0.1010
   0.1359   0.0071   1.0      0.0039   0.0097
   0        0        0        0.1353   0
   0        0        0        0        0.1353 ]

B =
[ -0.0453  -0.0175
  -1.0042  -0.1131
   0.0075   0.0134
   0.8647   0
   0        0.8647 ]

• The HJB, i.e. ARE, solution
$0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$:

P =
[ 55.8348   7.6670  16.0470  -4.6754  -0.7265
   7.6670   2.3168   1.4987  -0.8309  -0.1215
  16.0470   1.4987  25.3586  -0.6709   0.0464
  -4.6754  -0.8309  -0.6709   1.5394   0.0782
  -0.7265  -0.1215   0.0464   0.0782   1.0240 ]

L =
[ -4.1136  -0.7170  -0.3847   0.5277   0.0707
  -0.6315  -0.1003   0.1236   0.0653   0.0798 ]

Page 37

• Simulation
• The cost function approximation uses a quadratic basis set:
$\hat V_{i+1}(x_k, W_{V,i+1}) = W_{V,i+1}^T \bar\phi(x_k)$
$\bar\phi(x) = [\,x_1^2,\; x_1 x_2,\; x_1 x_3,\; x_1 x_4,\; x_1 x_5,\; x_2^2,\; x_2 x_3,\; x_2 x_4,\; x_2 x_5,\; x_3^2,\; x_3 x_4,\; x_3 x_5,\; x_4^2,\; x_4 x_5,\; x_5^2\,]^T$
$W_V = [\,w_{V1}\;\; w_{V2}\;\; \dots \;\; w_{V15}\,]^T$

• The policy approximation uses a linear basis set:
$\hat u_i(x_k) = W_{ui}^T \sigma(x_k), \quad \sigma(x) = [\,x_1\;\; x_2\;\; x_3\;\; x_4\;\; x_5\,]^T$
$W_u = \begin{bmatrix} w^u_{11} & w^u_{12} & w^u_{13} & w^u_{14} & w^u_{15} \\ w^u_{21} & w^u_{22} & w^u_{23} & w^u_{24} & w^u_{25} \end{bmatrix}$

Page 38

• Simulation: the convergence of the value

$W_V^T$ = [55.5411  15.2789  31.3032  -9.3255  -1.4536  2.3142  2.9234  -1.6594  -0.2430  24.8262  -1.3076  0.0920  1.5388  0.1564  1.0240]

The P matrix is recovered from the weights, with diagonal entries equal to the corresponding $w_{Vi}$ and off-diagonal entries equal to $0.5\,w_{Vi}$:
$P = \begin{bmatrix} w_{V1} & 0.5 w_{V2} & 0.5 w_{V3} & 0.5 w_{V4} & 0.5 w_{V5} \\ 0.5 w_{V2} & w_{V6} & 0.5 w_{V7} & 0.5 w_{V8} & 0.5 w_{V9} \\ 0.5 w_{V3} & 0.5 w_{V7} & w_{V10} & 0.5 w_{V11} & 0.5 w_{V12} \\ 0.5 w_{V4} & 0.5 w_{V8} & 0.5 w_{V11} & w_{V13} & 0.5 w_{V14} \\ 0.5 w_{V5} & 0.5 w_{V9} & 0.5 w_{V12} & 0.5 w_{V14} & w_{V15} \end{bmatrix}$

Actual ARE solution, $0 = A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A$:

P =
[ 55.8348   7.6670  16.0470  -4.6754  -0.7265
   7.6670   2.3168   1.4987  -0.8309  -0.1215
  16.0470   1.4987  25.3586  -0.6709   0.0464
  -4.6754  -0.8309  -0.6709   1.5394   0.0782
  -0.7265  -0.1215   0.0464   0.0782   1.0240 ]

This solves the ARE online using real-time data measurements $\{\, x_k, \; r(x_k, h_j(x_k)), \; x_{k+1} \,\}$, without knowing the A matrix.

Page 39

Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof

• Simulation: the convergence of the control policy

$W_u = \begin{bmatrix} 4.1068 & 0.7164 & 0.3756 & -0.5274 & -0.0707 \\ 0.6330 & 0.1005 & -0.1216 & -0.0653 & -0.0798 \end{bmatrix}$

With the policy parameterized as $\hat u_i(x) = W_{ui}^T \sigma(x)$ and the optimal control written as $u = -Lx$, the learned weights match the actual optimal control (so $W_u \approx -L$):

L =
[ -4.1136  -0.7170  -0.3847   0.5277   0.0707
  -0.6315  -0.1003   0.1236   0.0653   0.0798 ]

Note: in this example, the drift dynamics matrix A is NOT needed. The Riccati equation is solved online without knowing the A matrix.

Page 40

Page 41

Greedy Value Function Update: Approximate Dynamic Programming
Value Iteration = Heuristic Dynamic Programming (HDP), Paul Werbos

Policy Iteration:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$
$h_{j+1}(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V_{j+1}(x_{k+1}) \right)$
For LQR, the underlying recursion is a Lyapunov equation:
$(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0$
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$
Hewer (1971) proved convergence.
An initial stabilizing control is needed.

Value Iteration:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$
$h_{j+1}(x_k) = \arg\min_{u_k} \left( r(x_k, u_k) + V_{j+1}(x_{k+1}) \right)$
For LQR, the underlying recursion is a simple recursion:
$P_{j+1} = (A - B L_j)^T P_j (A - B L_j) + Q + L_j^T R L_j$
$L_{j+1} = (R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A$
Lancaster & Rodman proved convergence.
An initial stabilizing control is NOT needed.

Having two occurrences of the cost in the Bellman equation allows definition of the greedy update.

Page 42

Motivation for Value Iteration: the Idea of GPI

PI policy evaluation step (needs a stabilizing gain):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$
For LQR, a Lyapunov equation (LE):
$(A - B L_j)^T P_{j+1} (A - B L_j) - P_{j+1} + Q + L_j^T R L_j = 0$

VI policy evaluation step (does not need a stabilizing gain):
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$
For LQR, a matrix recursion (MR):
$P_{i+1} = (A - B L_j)^T P_i (A - B L_j) + Q + L_j^T R L_j$

Theorem. Let the gain L be fixed and (A - BL) stable. Let $P_0 = 0$ in the MR. Then $P_i \to P_{j+1}$, the solution of the LE.

That is, repeated application of the VI policy evaluation step is the same as one application of the PI policy evaluation step, IF THE CONTROL POLICY IS NOT UPDATED.
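The theorem can be checked numerically; the sketch below (illustrative A, B, Q, R, L, not values from the course) runs the matrix recursion from $P_0 = 0$ and compares the result with the Lyapunov-equation solution obtained by vectorization.

```python
# Minimal sketch of the statement above: with the gain L fixed and A - BL stable,
# iterating the VI evaluation step P <- (A-BL)^T P (A-BL) + Q + L^T R L from P = 0
# converges to the solution of the PI Lyapunov equation for that same gain.
# A, B, Q, R, L are illustrative assumptions.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
L = np.array([[1.0, 1.5]])                 # fixed stabilizing gain
Acl, W = A - B @ L, Q + L.T @ R @ L

P_mr = np.zeros((2, 2))
for _ in range(5000):                      # matrix recursion (repeated VI evaluation)
    P_mr = Acl.T @ P_mr @ Acl + W

# Lyapunov equation solved directly by vectorization:
# (I - Acl^T kron Acl^T) vec(P) = vec(W)
n = 2
vecP = np.linalg.solve(np.eye(n * n) - np.kron(Acl.T, Acl.T), W.flatten(order="F"))
P_le = vecP.reshape((n, n), order="F")

print(np.max(np.abs(P_mr - P_le)))         # ~ 0: both give the same P
```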

Page 43

Implementation: DT HDP (Value Iteration)

VFA: $V_j(x_k) = W_j^T \phi(x_k)$

Value function update for a given control:
$V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_j(x_{k+1})$
(Policy Iteration was $V_{j+1}(x_k) = r(x_k, h_j(x_k)) + V_{j+1}(x_{k+1})$.)

Assume measurements of $x_k$ and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T \phi(x_k) = r(x_k, h_j(x_k)) + W_j^T \phi(x_{k+1})$
with regression matrix $\phi(x_k)$ and the old weights $W_j$ on the right-hand side.

Solve for the weights using RLS, or over many trajectories with different initial conditions over a compact set.

Since $x_{k+1}$ is measured, knowledge of f(x) or g(x) is not needed for the value function update.

Then update the control using
$h_{j+1}(x_k) = -L_{j+1} x_k = -(R + B^T P_{j+1} B)^{-1} B^T P_{j+1} A x_k$
Need to know $f(x_k)$ AND $g(x_k)$ for the control update.

Page 44

Example 11.3-4: Policy Iteration and Value Iteration for the DT LQR

1. Policy Iteration: Hewer's Algorithm

Value update:
$V^{j+1}(x_k) = x_k^T Q x_k + u_k^T R u_k + V^{j+1}(x_{k+1})$
$x_k^T P^{j+1} x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1}$
$0 = (A + B K^j)^T P^{j+1} (A + B K^j) - P^{j+1} + Q + (K^j)^T R K^j$   (Lyapunov equation)

Policy improvement:
$h^{j+1}(x_k) = K^{j+1} x_k = \arg\min_{u_k} \left( x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^{j+1} x_{k+1} \right)$
$K^{j+1} = -(B^T P^{j+1} B + R)^{-1} B^T P^{j+1} A$

2. Value Iteration: Lyapunov recursions
$x_k^T P^{j+1} x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P^{j} x_{k+1}$
$P^{j+1} = (A + B K^j)^T P^{j} (A + B K^j) + Q + (K^j)^T R K^j$   (Lyapunov recursion)

3. Iterative Policy Evaluation
$P^{j+1} = (A + B K)^T P^{j} (A + B K) + Q + K^T R K$
This recursion converges to the solution of the Lyapunov equation.

All algorithms have a model-free scalar form based on measuring states, and a model-based matrix form based on canceling states.

For MDPs, the corresponding value update is
$V_{j+1}(x) = \sum_{u} \pi(x, u) \sum_{x'} P^u_{xx'} \left( R^u_{xx'} + V_j(x') \right)$
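For the MDP form of the value update, a tiny tabular sketch can help; the one below uses made-up transition probabilities, costs, and a discount factor (none from the slides) and runs the greedy, cost-minimizing variant of the iteration rather than averaging over a fixed policy π.

```python
# Minimal sketch of value iteration on a tiny made-up MDP (2 states, 2 actions,
# cost minimization): V_{j+1}(x) = min_u sum_{x'} P[u,x,x'] * (R[u,x,x'] + g*V_j(x')).
# Transition probabilities, costs, and the discount g are illustrative assumptions.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[u, x, x'] transition probabilities
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[[1.0, 2.0], [2.0, 0.5]],    # R[u, x, x'] stage costs
              [[0.5, 1.0], [1.5, 1.0]]])
gamma = 0.9                                 # discount factor (assumed)

V = np.zeros(2)
for _ in range(200):
    Qsa = np.einsum("uxy,uxy->ux", P, R) + gamma * np.einsum("uxy,y->ux", P, V)
    V = Qsa.min(axis=0)                     # greedy (minimizing) value update

print("V* =", V, " greedy policy =", Qsa.argmin(axis=0))
```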

Page 45

Page 46

Page 47

Oscillation is a fundamental property of neural tissue. The brain has multiple adaptive clocks with different timescales.

• Theta rhythm, 4-10 Hz (hippocampus, thalamus): sensory processing, memory, and voluntary control of movement.
• Gamma rhythms, 30-100 Hz (hippocampus and neocortex): high cognitive activity,
  • consolidation of memory
  • spatial mapping of the environment (place cells)

The high-frequency processing is due to the large amounts of sensory data to be processed.

Motor control: 200 Hz (spinal cord).

Work with Dan Levine, UTA Dept. of Psychology.

D. Vrabie and F.L. Lewis, "Neural network approach to continuous-time direct adaptive optimal control for partially-unknown nonlinear systems," Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.

D. Vrabie, F.L. Lewis, and D. Levine, "Neural Network-Based Adaptive Optimal Controller: A Continuous-Time Formulation," Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.

Page 48

Doya, Kimura, Kawato 2001

Limbic system

Page 49

Summary of Motor Control in the Human Nervous System

[Figure (picture by E. Stingu and D. Vrabie): hierarchy of multiple parallel loops linking the cerebral cortex motor areas, thalamus and basal ganglia, limbic system and hippocampus, cerebellum, brainstem, and spinal cord, down to muscle contraction and movement, with interoceptive and exteroceptive receptors feeding back. Labels in the figure: reflex (spinal cord); supervised learning; reinforcement learning, dopamine; unsupervised learning; inferior olive (eye movement); memory functions, long term and short term; motor control 200 Hz; theta rhythms 4-10 Hz; gamma rhythms 30-100 Hz.]

Page 50

Adaptive Actor-Critic Structure
Paul Werbos

• Reinforcement learning: theta waves, 4-8 Hz
• Motor control: 200 Hz

Page 51

Q Learning

Page 52

Adaptive (Approximate) Dynamic Programming: Four ADP Methods proposed by Paul Werbos

Critic NN to approximate:
• Heuristic dynamic programming (HDP, Value Iteration): the value $V(x_k)$
• Dual heuristic programming (DHP): the gradient $\partial V / \partial x$
• AD heuristic dynamic programming (ADHDP; Watkins Q Learning): the Q function $Q(x_k, u_k)$
• AD dual heuristic programming (ADDHP): the gradients $\partial Q / \partial x$, $\partial Q / \partial u$

Action NN to approximate the control.

Bertsekas: Neurodynamic Programming.
Barto & Bradtke: Q-learning proof (imposed a settling time).

Page 53

Q Learning: Action Dependent ADP

Value function recursion for a given policy $h(x_k)$: $V_h(x_k) = r(x_k, h(x_k)) + V_h(x_{k+1})$

Define the Q function
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$
with $u_k$ arbitrary and the policy h(.) used after time k.

Note: $Q_h(x_k, h(x_k)) = V_h(x_k)$

Recursion for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$
$h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k)$

Optimal adaptive control for completely unknown DT systems.

[Timeline: at time k, apply the arbitrary $u_k$ in state $x_k$; thereafter follow h(x), reaching $x_{k+1}$ at time k+1.]

Page 54

Q Function Definition

Specify a control policy $u_j = h(x_j), \; j = k, k+1, \dots$

Define the Q function
$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$
with $u_k$ arbitrary and the policy h(.) used after time k.

Note: $Q_h(x_k, h(x_k)) = V_h(x_k)$

Bellman equation for Q: $Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Optimal Q function:
$Q^*(x_k, u_k) = r(x_k, u_k) + V^*(x_{k+1})$
$Q^*(x_k, u_k) = r(x_k, u_k) + Q^*(x_{k+1}, h^*(x_{k+1}))$

Optimal control solution:
$V^*(x_k) = Q^*(x_k, h^*(x_k)) = \min_h Q^*(x_k, h(x_k))$

Simple expression of Bellman's principle:
$V^*(x_k) = \min_{u_k} Q^*(x_k, u_k)$
$h^*(x_k) = \arg\min_{u_k} Q^*(x_k, u_k) = \arg\min_h Q^*(x_k, h(x_k))$

Page 55

Q Function HDP: Action Dependent HDP

The Q function for any given control policy $h(x_k)$ satisfies the Bellman equation
$Q_h(x_k, u_k) = r(x_k, u_k) + Q_h(x_{k+1}, h(x_{k+1}))$

Recursive solution: pick a stabilizing initial control policy, then iterate.

Find the Q function: $Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, h_j(x_{k+1}))$

Update the control: $h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k)$

Bradtke & Barto (1994) proved convergence for LQR.

Now $f(x_k, u_k)$ is not needed.

Page 56

Q Learning does not need to know $f(x_k)$ or $g(x_k)$

$Q_h(x_k, u_k) = r(x_k, u_k) + V_h(x_{k+1})$

For the LQR: $V(x) = W^T \phi(x) = x^T P x$, i.e. V is quadratic in x.

$Q(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)$

So Q is quadratic in x and u:
$Q(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} A^T P A + Q & A^T P B \\ B^T P A & B^T P B + R \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}$

The control update is found from
$0 = \dfrac{\partial Q}{\partial u_k} = 2 (B^T P A) x_k + 2 (R + B^T P B) u_k = 2 H_{ux} x_k + 2 H_{uu} u_k$

so
$u_k = -(R + B^T P B)^{-1} B^T P A x_k = -H_{uu}^{-1} H_{ux} x_k = -L x_k$

The control is found only from the Q function; A and B are not needed.
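A short sketch of the LQR Q-kernel construction with illustrative matrices: build H from A, B, P, Q, R and verify that the gain read off the Q function matches the Riccati gain.

```python
# Minimal sketch of the LQR Q-function kernel: with V(x) = x^T P x the Q function
# is quadratic, Q(x,u) = [x;u]^T H [x;u] with blocks built from A, B, P, Q, R, and
# the greedy control is u = -H_uu^{-1} H_ux x.  Matrices below are illustrative.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

# P from the Riccati recursion (as in the earlier sketch)
P = np.zeros((2, 2))
for _ in range(500):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = A.T @ P @ A + Q - A.T @ P @ B @ K

H = np.block([[A.T @ P @ A + Q, A.T @ P @ B],
              [B.T @ P @ A,     B.T @ P @ B + R]])
Hux, Huu = H[2:, :2], H[2:, 2:]
L_from_Q = np.linalg.solve(Huu, Hux)       # equals (R + B^T P B)^{-1} B^T P A
print("gain from Q kernel:", L_from_Q)
print("gain from Riccati :", K)
```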

Page 57

Implementation: DT Q Function Policy Iteration

QFA (Q function approximation), Bradtke and Barto: $Q(x, u) = W^T \phi(x, u)$
For the LQR case, $\phi(x, u)$ is the quadratic basis set in $(x, u)$.
Now u is an input to the NN (Werbos: action dependent NN).

The Q function update for the control $u_k = -L_j x_k$ is
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})$

Assume measurements of $u_k$, $x_k$, and $x_{k+1}$ are available to compute $u_{k+1}$. Then
$W_{j+1}^T \left( \phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1}) \right) = r(x_k, -L_j x_k)$
with regression matrix $\phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1})$.

Solve for the weights using RLS or backpropagation.

Since $x_{k+1}$ is measured in the training phase, knowledge of f(x) or g(x) is not needed for the value function update.

Page 58

Q Policy Iteration (model-free policy iteration; Bradtke, Ydstie, Barto)

Q function update:
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_{j+1}(x_{k+1}, -L_j x_{k+1})$
$W_{j+1}^T \left( \phi(x_k, u_k) - \phi(x_{k+1}, -L_j x_{k+1}) \right) = r(x_k, -L_j x_k)$

Control policy update:
$h_{j+1}(x_k) = \arg\min_{u_k} Q_{j+1}(x_k, u_k), \quad u_k = -H_{uu}^{-1} H_{ux} x_k = -L_{j+1} x_k$

A stable initial control is needed.

Greedy Q Function Update: Approximate Dynamic Programming, ADP Method 3, Q Learning
Action-Dependent Heuristic Dynamic Programming (ADHDP); Paul Werbos, model-free HDP

Greedy Q update:
$Q_{j+1}(x_k, u_k) = r(x_k, u_k) + Q_j(x_{k+1}, h_j(x_{k+1}))$
$W_{j+1}^T \phi(x_k, u_k) = r(x_k, -L_j x_k) + W_j^T \phi(x_{k+1}, -L_j x_{k+1}) \equiv \text{target}_{j+1}$

Update the weights by RLS or backpropagation.

A stable initial control is NOT needed.
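A minimal model-free sketch of the greedy Q update with batch least squares (all numerical values and the probing-noise level are assumptions; A and B are used only to simulate the plant, never in the learning update).

```python
# Minimal sketch of model-free Q learning (ADHDP / greedy Q update) for the LQR
# case: the Q kernel H is identified from data by least squares on a quadratic
# basis in z = [x; u], then the gain is read off as L = H_uu^{-1} H_ux.
# All numerical values are illustrative; exploration noise keeps the regression
# persistently exciting; A and B appear only in the plant simulation.
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])      # used only to simulate the plant
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

def phi(z):                                  # quadratic basis in z = [x1, x2, u]
    return np.array([z[0]**2, z[0]*z[1], z[0]*z[2],
                     z[1]**2, z[1]*z[2], z[2]**2])

def H_from_W(W):                             # symmetric 3x3 kernel from the weights
    return np.array([[W[0],   W[1]/2, W[2]/2],
                     [W[1]/2, W[3],   W[4]/2],
                     [W[2]/2, W[4]/2, W[5]]])

rng = np.random.default_rng(0)
L = np.array([[1.0, 1.5]])                   # initial gain (assumed)
W = np.zeros(6)

for j in range(60):                          # outer loop: greedy Q updates
    rows, rhs = [], []
    for _ in range(30):
        x = rng.standard_normal(2)
        u = -(L @ x) + 0.1 * rng.standard_normal(1)   # probing noise for PE
        x1 = A @ x + B @ u
        u1 = -(L @ x1)                                 # policy used after time k
        target = x @ Q @ x + u @ R @ u + W @ phi(np.concatenate([x1, u1]))
        rows.append(phi(np.concatenate([x, u])))
        rhs.append(target)
    W, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    H = H_from_W(W)
    L = np.linalg.solve(H[2:, 2:], H[2:, :2])          # policy improvement
print("learned gain L =", L)
```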

Page 59

Direct OPTIMAL ADAPTIVE CONTROL

Q learning actually solves the Riccati equation WITHOUT knowing the plant dynamics.

Model-free ADP. Works for nonlinear systems.

Proofs? Robustness? Comparison with adaptive control methods?

Page 60

ADHDP Application to a Power System

• System description: a load-frequency control model with state
$x(t) = [\,\Delta f(t) \;\; \Delta P_g(t) \;\; \Delta X_g(t) \;\; \Delta F(t)\,]^T$
whose continuous-time dynamics are built from the plant, turbine, and governor time constants $T_p, T_T, T_G$, the plant gain $K_p$, the speed regulation $R$, and the integral control gain $K_E$.

• The discrete-time model is obtained by applying a ZOH to the CT model.

Parameter ranges:
$1/T_p \in [0.033, 0.1]$
$K_p/T_p \in [4, 12]$
$1/T_T \in [2.564, 4.762]$
$1/T_G \in [9.615, 17.857]$
$1/(R\,T_G) \in [3.081, 10.639]$

Page 61

ADHDP Application to a Power System

• The system state:
- Δf: incremental frequency deviation (Hz)
- ΔPg: incremental change in generator output (p.u. MW)
- ΔXg: incremental change in governor position (p.u. MW)
- ΔF: incremental change in integral control
- ΔPd: the load disturbance (p.u. MW)

• The system parameters:
- TG: the governor time constant
- TT: turbine time constant
- TP: plant model time constant
- Kp: plant model gain
- R: speed regulation due to governor action
- KE: integral control gain

Page 62

ADHDP Application to a Power System

• ADHDP policy tuning

[Figure: convergence of the critic parameters P11, P12, P13, P22, P23, P33, P34, P44 and of the control gains L11, L12, L13, L14 over roughly 3000 time steps k.]

Converges to the optimal GARE solution and the optimal game control.

Page 63

ADHDP Application to a Power System

• Comparison: the ADHDP controller design vs. the design from [1]

[Figure: state responses x1-x4 (frequency deviation, incremental change of the generator output, incremental change of the governor position, incremental change of the integral control) over 20 s; the ADHDP controller peaks at about -0.2024 near t = 0.5 s, while the controller of [1] peaks at about -0.2507 near t = 0.58 s.]

• The maximum frequency deviation when using the ADHDP controller is improved by 19.3% over the controller designed in [1].

[1] Y. Wang, R. Zhou, and C. Wen, "Robust load-frequency controller design for power systems," IEE Proc.-C, vol. 140, no. 1, 1993.

Page 64