How to Do Research

F.L. Lewis, NAI
Moncrief-O'Donnell Chair, UTA Research Institute (UTARI)
The University of Texas at Arlington, USA
and
Distinguished Foreign Consulting Professor, Chongqing University, China
Invited by Y.D. Song

Supported by: NSF - Paul Werbos; AFOSR Europe; ONR - Marc Steinberg; US TARDEC
Supported by: China NNSF; China Project 111

Talk available online at http://www.UTA.edu/UTARI/acs
“Some are born with knowledge, some derive it from study, and some acquire it only after a painful realization of their ignorance.
But the knowledge being possessed, it comes to the same thing.
Some study with a natural ease, some from a desire for advantages, and some by strenuous effort.
But the achievement being made, it comes to the same thing.”
Pythagoras 500 BC
Natural Philosophy, Ethics, and Mathematics
Numbers and the harmony of the spheres
Music, Mathematics, Gymnastics, Astronomy, Medicine
Translate music to mathematical equations
Ratios in music
Fire, air, water, earth
The School of Pythagoras: esoterikoi and exoterikoi
Mathematikoi - learners; Akousmatikoi - listeners
Patterns in Nature
Serenity and Self-Possession
He who exerts his mind to the utmost knows nature’s pattern.
The way of learning is none other than finding the lost mind.
Meng Zi (Mencius), c. 300 BC
Man's task is to understand patterns in nature and society.
Mencius
It is man's obligation to explore the most difficult questions in the clearest possible way and use reason and intellect to arrive at the best answer.
Man’s task is to understand patterns in nature and society.
The first task is to understand the individual problem, then to analyze symptoms and causes, and only then to design treatment and controls.
Ibn Sina 980-1037 (Avicenna)
Read papers‐ Look for Patterns
Read Papers!Find the Missing Link!
Purpose of Publishing Papers
• Disseminate research results for academic exchange
• At stake is your reputation
• It is not "the more, the better"
• Write papers only when you have good results, and when you do, do your best
• Write it, rewrite it, and rewrite it again  止于至善 ("Stop only at perfection")
Jie Huang, Chinese University of Hong Kong
Ingredients of Good Research
• An open problem, or a self-formulated problem with impact on the field or society.
• The research is motivated or stimulated by the urge of knowledge discovery or by the needs of society.
• New tools are developed, or existing tools are modified, to solve the problem.
• No pain, no gain.
Jie Huang, Chinese University of Hong Kong
Importance of Writing
• A good paper is always based on a piece of good research, but good research may not necessarily lead to a good paper.
• A good paper can add significant value to a research work.
• The beginning of the writing is not the end of the research, but the beginning of a deeper phase of the research.
• The process of writing is the process of soul-searching, and may provoke new ideas and new directions.
• 4 C's: Clarity, Conciseness, Creativity, and Courtesy
Jie Huang, Chinese University of Hong Kong
Optimal Control
Adaptive Control
Find the Key - Reinforcement Learning
Functions of the Brain - multiple oscillations
Game Theory - Sun Tzu
Applications to Manufacturing
Optimality in Control Systems DesignR. Kalman 1960
Rocket Orbit Injection
http://microsat.sm.bmstu.ru/e-library/Launch/Dnepr_GEO.pdf
Dynamics (radius r, radial velocity w, tangential velocity v, thrust F, mass m, thrust angle phi):

  \dot{r} = w
  \dot{w} = v^2/r - \mu/r^2 + (F/m) \sin\phi
  \dot{v} = -vw/r + (F/m) \cos\phi

Objectives:
Get to orbit in minimum time
Use minimum fuel
Optimal Control is Effective for:
Aircraft Autopilots
Vehicle engine control
Aerospace Vehicles
Ship Control
Industrial Process Control

Multi-player Games Occur in:
Economics
Control Theory - disturbance rejection
Team games
International politics
Sports strategy

But optimal control and game solutions are found by
offline solution of matrix design equations.
A full dynamical model of the system is needed.
Optimality and Games

Optimal Control - The Linear Quadratic Regulator (LQR)

System dynamics:

  \dot{x} = A x + B u

Performance Index (minimum energy, minimum control effort):

  V(x(t)) = \int_t^\infty (x^T Q x + u^T R u) \, d\tau = x^T(t) P x(t)

The Solution: solve the Algebraic Riccati Equation (ARE)

  0 = A^T P + P A + Q - P B R^{-1} B^T P

Then the optimal feedback control is

  u = -R^{-1} B^T P x = -K x

Result: Let Q and R be symmetric and positive definite. Then the control u = -Kx minimizes V(x) and makes the system stable and robust.

MATLAB function: [K,P] = lqr(A,B,Q,R)
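The LQR design above can be sketched numerically; here is a minimal Python version using SciPy in place of the MATLAB lqr call. The double-integrator plant and the weights Q, R are illustrative assumptions, not values from the talk.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative plant: a double integrator (A, B, Q, R chosen for this sketch).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# Solve the ARE:  0 = A'P + PA + Q - PBR^{-1}B'P
P = solve_continuous_are(A, B, Q, R)

# Optimal feedback u = -Kx with K = R^{-1}B'P
K = np.linalg.solve(R, B.T @ P)

# The closed loop A - BK is stable: all eigenvalues in the open left half-plane.
print(np.all(np.linalg.eigvals(A - B @ K).real < 0))  # True
```

This mirrors the MATLAB call [K,P] = lqr(A,B,Q,R); for this plant the result is K = [1, sqrt(3)].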
[Block diagram: control u = -Kx in the on-line real-time control loop around the system \dot{x} = A x + B u; off-line design loop using the ARE]

  0 = A^T P + P A + Q - P B R^{-1} B^T P,   K = R^{-1} B^T P
Optimal Control- The Linear Quadratic Regulator (LQR)
An Offline Design Procedure that requires knowledge of the system dynamics model (A, B)
System modeling is expensive, time consuming, and inaccurate
User-prescribed optimization criterion (Q, R):

  V(x(t)) = \int_t^\infty (x^T Q x + u^T R u) \, d\tau
A. Optimal control B. Zero-sum games C. Non zero-sum games
1. System dynamics
2. Value/cost function
3. Bellman equation
4. HJ solution equation (Riccati eq.)
5. Policy iteration – gives the structure we need
Adaptive Control Structures for:
We want to find optimal control solutions
- Online, in real time
- Using adaptive control techniques
- Without knowing the full dynamics
For nonlinear systems and general performance indices
Adaptive Control is online and works for unknown systems. Generally not optimal.

Optimal Control is off-line, and needs to know the system dynamics to solve the design equations.

We want ONLINE DIRECT ADAPTIVE OPTIMAL Control
for any performance cost of our own choosing.
Reinforcement Learning turns out to be the key to this!
Apply New Tools
Look in a different discipline:
Math
Computational Intelligence
Machine Learning
Thermodynamics
Fluid Mechanics
Physics
Find The Key!
Reinforcement Learning
Every living organism improves its control actions based on rewards received from the environment
The resources available to living organisms are usually meager.Nature uses optimal control.
1. Apply a control. Evaluate the benefit of that control.
2. Improve the control policy.
RL finds optimal policies by evaluating the effects of suboptimal policies
Optimality in Biological Systems
Different methods of learning
[Diagram: Actor-Critic structure. The Actor, an adaptive learning system, applies control inputs to the System in its environment; the Critic compares the system outputs with the desired performance and produces a reinforcement signal used to tune the Actor.]

Reinforcement learning
Ivan Pavlov 1890s
Actor-Critic Learning
We want OPTIMAL performance.
ADP - Approximate Dynamic Programming - Paul Werbos
Sutton & Barto book
Books
D. Vrabie, K. Vamvoudakis, and F.L. Lewis, Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles, IET Press, 2012.
F.L. Lewis, D. Vrabie, and V. Syrmos, Optimal Control, third edition, John Wiley and Sons, New York, 2012. New chapters on: Reinforcement Learning, Differential Games.

F.L. Lewis and D. Vrabie, "Reinforcement learning and adaptive dynamic programming for feedback control," IEEE Circuits & Systems Magazine, Invited Feature Article, pp. 32-50, Third Quarter 2009.

F. Lewis, D. Vrabie, and K. Vamvoudakis, "Reinforcement learning and feedback control," IEEE Control Systems Magazine, Dec. 2012.
Derivation of Linear Quadratic Regulator

System:

  \dot{x} = A x + B u

Cost:

  V(x(t)) = \int_t^\infty (x^T Q x + u^T R u) \, d\tau = x^T(t) P x(t)

Differentiate using Leibniz' formula; the differential equivalent is the Bellman equation (scalar equation):

  0 = H(x, u, \partial V/\partial x) = \dot{V} + x^T Q x + u^T R u = 2 x^T P (A x + B u) + x^T Q x + u^T R u

Stationarity condition:

  0 = \partial H/\partial u = 2 R u + 2 B^T P x

Optimal Control is

  u = -R^{-1} B^T P x = -K x

Substituting the optimal control gives the HJB equation:

  0 = 2 x^T P (A x - B R^{-1} B^T P x) + x^T Q x + x^T P B R^{-1} B^T P x
  0 = x^T P A x + x^T A^T P x + x^T Q x - x^T P B R^{-1} B^T P x

This holds for all x, so as a matrix equation it is the Algebraic Riccati Equation:

  0 = A^T P + P A + Q - P B R^{-1} B^T P
Standard Solution for Linear Quadratic Regulator

System:

  \dot{x} = A x + B u

Cost:

  V(x(t)) = \int_t^\infty (x^T Q x + u^T R u) \, d\tau = x^T(t) P x(t)

Given any stabilizing FB policy u = -K x, the differential equivalent is the Bellman equation (scalar equation):

  0 = H(x, u, \partial V/\partial x) = 2 x^T P (A - B K) x + x^T Q x + x^T K^T R K x
    = x^T [ (A - B K)^T P + P (A - B K) + Q + K^T R K ] x

The cost value is found by solving the Lyapunov equation = Bellman equation (matrix equation):

  0 = (A - B K)^T P + P (A - B K) + Q + K^T R K

Optimal Control is

  u = -R^{-1} B^T P x = -K x

which gives the HJB equation (scalar equation):

  0 = x^T P A x + x^T A^T P x + x^T Q x - x^T P B R^{-1} B^T P x

and the Algebraic Riccati equation (matrix equation):

  0 = A^T P + P A + Q - P B R^{-1} B^T P

Full system dynamics must be known. Off-line solution.
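The policy-evaluation step above, i.e. the matrix Lyapunov = Bellman equation, can be checked numerically. A minimal sketch with an arbitrary stabilizing gain; the plant, weights, and gain are illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative plant and an arbitrary (not optimal) stabilizing gain K.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K = np.array([[2.0, 3.0]])          # u = -Kx; A - BK has eigenvalues -1, -2

Ac = A - B @ K
# SciPy solves  a X + X a' = q,  so pass Ac' to obtain
#   0 = (A-BK)'P + P(A-BK) + Q + K'RK
P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))

# V(x) = x'Px is the cost of the policy u = -Kx from initial state x.
print(np.all(np.linalg.eigvals(P) > 0))  # True: P is positive definite
```

Because the gain is not the ARE solution, this P is larger (in the matrix sense) than the optimal cost; the Bellman equation only evaluates the given policy.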
Derivation of Nonlinear Optimal Regulator

Nonlinear System dynamics:

  \dot{x} = f(x, u) = f(x) + g(x) u

Cost/value:

  V(x(t)) = \int_t^\infty r(x, u) \, dt = \int_t^\infty (Q(x) + u^T R u) \, dt,   V(0) = 0

Leibniz gives the differential equivalent, the Bellman Equation, in terms of the Hamiltonian function:

  0 = H(x, u, \partial V/\partial x) = \dot{V} + r(x, u) = (\partial V/\partial x)^T (f(x) + g(x) u) + r(x, u)

Stationarity condition:  0 = \partial H/\partial u

Stationary Control Policy:

  u = h(x) = -(1/2) R^{-1} g^T(x) \partial V/\partial x

HJB equation:

  0 = Q(x) + (\partial V^*/\partial x)^T f(x) - (1/4) (\partial V^*/\partial x)^T g(x) R^{-1} g^T(x) (\partial V^*/\partial x),   V^*(0) = 0

Off-line solution. The HJB equation is hard to solve and may not have a smooth solution. Dynamics must be known.
To find online methods for optimal control, focus on these two equations:

  0 = (\partial V/\partial x)^T f(x, u) + r(x, u) = H(x, u, \partial V/\partial x)

  u = h(x) = -(1/2) R^{-1} g^T(x) \partial V/\partial x
CT Policy Iteration - a Reinforcement Learning Technique

Utility:  r(x, u) = Q(x) + u^T R u

Policy Iteration Solution

Pick a stabilizing initial control policy; given any admissible policy u(x) = h_0(x):

Policy Evaluation - find the cost by solving the CT Bellman equation:

  0 = (\partial V_j/\partial x)^T f(x, h_j(x)) + r(x, h_j(x)),   V_j(0) = 0

Policy Improvement - update the control:

  h_{j+1}(x) = -(1/2) R^{-1} g^T(x) \partial V_j/\partial x

Full system dynamics must be known. Off-line solution.

Converges to the solution of the HJB equation:

  0 = Q(x) + (\partial V^*/\partial x)^T f(x) - (1/4) (\partial V^*/\partial x)^T g(x) R^{-1} g^T(x) (\partial V^*/\partial x)

Can avoid knowledge of the drift term f(x).

• Convergence proved by Leake and Liu 1967, Saridis 1979 if the Lyapunov eq. is solved exactly
• Beard & Saridis used Galerkin Integrals to solve the Lyapunov eq.
• Abu-Khalaf & Lewis used NN to approximate V for nonlinear systems and proved convergence

M. Abu-Khalaf, F.L. Lewis, and J. Huang, "Policy iterations on the Hamilton-Jacobi-Isaacs equation for H-infinity state feedback control with input saturation," IEEE Trans. Automatic Control, vol. 51, no. 12, pp. 1989-1995, Dec. 2006.
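Specialized to the LQR case, the policy iteration above becomes Kleinman's algorithm: each policy evaluation is a Lyapunov solve and each improvement is K = R^{-1}B'P. A minimal sketch; the plant and initial gain are illustrative, and the full model is still needed here (the offline version).

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.array([[1.0, 1.0]])               # stabilizing initial policy
for _ in range(15):
    Ac = A - B @ K
    # Policy evaluation: 0 = (A-BK)'P + P(A-BK) + Q + K'RK
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
    # Policy improvement
    K = np.linalg.solve(R, B.T @ P)

# K converges to the optimal LQR gain; for this plant K* = [1, sqrt(3)].
print(np.round(K, 4))
```

Each iterate is stabilizing and the iteration converges quadratically to the ARE solution, matching the convergence results cited above for the exactly-solved Lyapunov equation.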
Work of Draguna Vrabie
Policy iteration requires repeated solution of the CT Bellman equation:

  0 = r(x, u(x)) + \dot{V} = Q(x) + u^T R u + (\partial V/\partial x)^T f(x, u(x)) = H(x, u(x), \partial V/\partial x)

This can be done online, without knowing f(x), using measurements of x(t), u(t) along the system trajectories.
Integral Reinforcement Learning
  \dot{x} = f(x) + g(x) u
D. Vrabie, O. Pastravanu, M. Abu-Khalaf, and F. L. Lewis, “Adaptive optimal control for continuous-time linear systems based on policy iteration,” Automatica, vol. 45, pp. 477-484, 2009.
Lemma 1 - Draguna Vrabie

The Integral Reinforcement form (IRL) of the CT Bellman equation,

  V(x(t)) = \int_t^{t+T} r(x, u) \, d\tau + V(x(t+T)),   V(0) = 0

is equivalent to

  0 = (\partial V/\partial x)^T f(x, u) + r(x, u) = H(x, u, \partial V/\partial x),   V(0) = 0

and so solves the Bellman equation without knowing f(x, u).

Allows definition of a temporal difference error for CT systems:

  e(t) = -V(x(t)) + \int_t^{t+T} r(x, u) \, d\tau + V(x(t+T))

Key Idea: split the value integral over the interval [t, t+T):

  V(x(t)) = \int_t^\infty r(x, u) \, d\tau = \int_t^{t+T} r(x, u) \, d\tau + \int_{t+T}^\infty r(x, u) \, d\tau
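The key idea can be verified in closed form for a scalar example; a sketch with illustrative numbers. For a scalar plant xdot = a x + b u under a fixed gain k, the state decays exponentially, so V(x(t)), the cost integral over [t, t+T], and V(x(t+T)) can all be computed exactly and the interval split checked.

```python
import numpy as np

# Illustrative scalar closed loop: xdot = ac*x with ac = a - b*k < 0.
a, b, q, r, k = -1.0, 1.0, 1.0, 1.0, 0.5
ac = a - b * k                      # closed-loop pole
w = q + r * k**2                    # running-cost weight: x'(Q + K'RK)x
P = -w / (2 * ac)                   # scalar Lyapunov eq.: 2*ac*P + w = 0

x0, T = 2.0, 0.7
xT = x0 * np.exp(ac * T)            # state one reinforcement interval later
cost_0_to_T = w * x0**2 * (np.exp(2 * ac * T) - 1) / (2 * ac)

# V(x(t)) = integral over [t, t+T] + V(x(t+T))
print(np.isclose(P * x0**2, cost_0_to_T + P * xT**2))  # True
```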
Integral Reinforcement Learning (IRL) - Draguna Vrabie

IRL Policy iteration (an initial stabilizing control is needed)

Policy evaluation - IRL Bellman Equation (cost update):

  V_k(x(t)) = \int_t^{t+T} r(x, u_k) \, dt + V_k(x(t+T))

Policy improvement (control gain update):

  u_{k+1} = h_{k+1}(x) = -(1/2) R^{-1} g^T(x) \partial V_k/\partial x

f(x) and g(x) do not appear in the policy evaluation; g(x) is needed only for the control update.

The IRL Bellman equation is equivalent to the CT Bellman equation

  0 = (\partial V/\partial x)^T f(x, u) + r(x, u) = H(x, u, \partial V/\partial x)

so it solves the Bellman eq. (nonlinear Lyapunov eq.) without knowing the system dynamics.

D. Vrabie proved convergence to the optimal value and control, i.e., to the solution of the HJB equation:

  0 = Q(x) + (\partial V^*/\partial x)^T f(x) - (1/4) (\partial V^*/\partial x)^T g(x) R^{-1} g^T(x) (\partial V^*/\partial x)
Shi nian shu mu, bai nian shu ren  十年树木，百年树人
"It takes ten years to grow a tree, but a hundred years to cultivate a person."
Keshi - wu nian shu xuesheng: "But only five years to train a student."
Integral Reinforcement Learning (IRL)

Online algorithm, one reinforcement interval at a time:

  [t, t+T):     apply u_k = -K_k x; observe x(t) and the cost integral; update P
  [t+T, t+2T):  apply u_k = -K_k x; observe x(t+T) and the cost integral; update P
  [t+2T, t+3T): apply u_k = -K_k x; observe x(t+2T) and the cost integral; update P
  ...

Do RLS until convergence to P_k, then update the control gain:

  K_{k+1} = R^{-1} B^T P_k

Data set at time [t, t+T): x(t), the measured cost integral over [t, t+T), and x(t+T).

Or use batch least-squares. With V_k(x) = W_k^T \bar{x}(x) for a quadratic basis \bar{x}, solve the Bellman Equation:

  W_k^T [\bar{x}(t) - \bar{x}(t+T)] = \int_t^{t+T} x^T (Q + K_k^T R K_k) x \, d\tau

This solves the Lyapunov eq. without knowing the dynamics: a data-based approach that uses measurements of x(t), u(t) instead of the plant dynamical model. A is not needed anywhere.
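The IRL procedure above can be sketched end-to-end for the LQR case. In this sketch the A matrix appears only inside the simulator, standing in for the unknown plant; the learning equations use only measured trajectories and B. The plant, weights, interval T, and number of experiments are all illustrative choices.

```python
import numpy as np
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # "unknown" plant: used only to simulate
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
T = 0.5                                   # reinforcement interval
rng = np.random.default_rng(0)

def phi(x):                               # quadratic basis: V(x) = x'Px = theta . phi(x)
    return np.array([x[0]**2, 2.0 * x[0] * x[1], x[1]**2])

K = np.array([[1.0, 1.0]])                # initial stabilizing gain
for _ in range(10):
    W = Q + K.T @ R @ K                   # running-cost weight for the current policy
    rows, d = [], []
    for _ in range(12):                   # 12 short experiments per iteration
        x0 = rng.normal(size=2)
        def f(t, z):                      # closed-loop plant + running-cost integrator
            x = z[:2]
            return np.concatenate(((A - B @ K) @ x, [x @ W @ x]))
        z = solve_ivp(f, (0.0, T), np.concatenate((x0, [0.0])),
                      rtol=1e-10, atol=1e-10).y[:, -1]
        rows.append(phi(x0) - phi(z[:2])) # W_k' [xbar(t) - xbar(t+T)] = cost integral
        d.append(z[2])
    theta = np.linalg.lstsq(np.array(rows), np.array(d), rcond=None)[0]
    P = np.array([[theta[0], theta[1]], [theta[1], theta[2]]])
    K = np.linalg.solve(R, B.T @ P)       # policy improvement: uses only B, not A
```

With exact cost integrals this reproduces the Lyapunov solutions of Kleinman's iteration, so K converges to the optimal gain ([1, sqrt(3)] for this plant) without A ever entering the learning equations.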
Direct Optimal Adaptive Controller
Draguna Vrabie

A hybrid continuous/discrete dynamic controller whose internal state is the observed cost over the interval.

[Diagram: Actor gain K in feedback with the system \dot{x} = A x + B u; a Critic integrates the cost rate \dot{V} = x^T Q x + u^T R u, sampled through a ZOH with period T.]

Run RLS, or use batch L.S., to identify the value of the current control. Update the FB gain K = R^{-1} B^T P after the Critic has converged.

The reinforcement interval T can be selected online, on the fly, and can change.

Solves the Riccati Equation

  0 = A^T P + P A + Q - P B R^{-1} B^T P

online without knowing the A matrix.
Data-driven Online Adaptive Optimal Control

[Diagram: on-line control loop u = -K x around the system \dot{x} = A x + B u, plus an on-line performance loop.]

An Online Supervisory Control Procedure that requires no knowledge of the system dynamics matrix A.

Automatically tunes the control gains in real time to optimize a user-prescribed optimization criterion J(Q, R). Uses measured data (u(t), x(t)) along the system trajectories.

Policy evaluation:

  x^T(t) P_k x(t) = \int_t^{t+T} x^T (Q + K_k^T R K_k) x \, d\tau + x^T(t+T) P_k x(t+T)

Policy improvement:

  K_{k+1} = R^{-1} B^T P_k

Data set at time [t, t+T): x(t), the measured cost integral over [t, t+T), and x(t+T).
Kung Tz (Confucius) 500 BC  孔子

Archery
Chariot driving
Music
Rites and Rituals
Poetry
Mathematics

Man's relations to:
Family, Friends, Society, Nation, Emperor, Ancestors
Tian xia da tongHarmony under heaven
124 BC ‐ Han Imperial University in Chang‐an
Motor control 200 Hz
Oscillation is a fundamental property of neural tissue
Brain has multiple adaptive clocks with different timescales
theta rhythm, 4-10 Hz - Hippocampus, Thalamus: sensory processing, memory, and voluntary control of movement
gamma rhythms, 30-100 Hz - hippocampus and neocortex: high cognitive activity
• consolidation of memory
• spatial mapping of the environment - place cells
The high-frequency processing is due to the large amounts of sensory data to be processed.
D. Vrabie and F.L. Lewis, “Neural network approach to continuous-time direct adaptive optimal control forpartially-unknown nonlinear systems,” Neural Networks, vol. 22, no. 3, pp. 237-246, Apr. 2009.
D. Vrabie, F.L. Lewis, D. Levine, “Neural Network-Based Adaptive Optimal Controller- A Continuous-Time Formulation -,” Proc. Int. Conf. Intelligent Control, Shanghai, Sept. 2008.
Summary of Motor Control in the Human Nervous System (after Doya, Kimura, Kawato 2001)

[Diagram: hierarchy from Cerebral cortex (motor areas), Thalamus, Basal ganglia (Reinforcement Learning - dopamine), Cerebellum (Supervised learning; eye movement, inf. olive), Limbic system and Hippocampus (Unsupervised learning), Brainstem and Spinal cord (reflex), down to muscle contraction and movement; interoceptive and exteroceptive receptors feed back.]
Hierarchy of multiple parallel loops - Adaptive Critic structures
Motor control 200 Hz
Theta rhythms 4-8 Hz - Reinforcement learning
Gamma rhythms 30-100 Hz
Memory functions: long term, short term
picture by E. Stingu, D. Vrabie
Sun Tz bing fa (The Art of War)  孙子兵法
500 BC
Manufacturing as the Interactions of Multiple Agents
Each machine has its own dynamics and cost function.
Neighboring machines influence each other most strongly.
There are local optimization requirements as well as global necessities.

Each process has its own dynamics:

  \dot{\delta}_i = A \delta_i - (d_i + g_i) B_i u_i + \sum_{j \in N_i} e_{ij} B_j u_j

and cost function:

  J_i(\delta_i(0), u_i, u_{N_i}) = (1/2) \int_0^\infty ( \delta_i^T Q_{ii} \delta_i + u_i^T R_{ii} u_i + \sum_{j \in N_i} u_j^T R_{ij} u_j ) \, dt

Each process helps other processes achieve optimality and efficiency.
F.L. Lewis, H. Zhang, A. Das, K. Hengster-Movric, Cooperative Control of Multi-Agent Systems: Optimal Design and Adaptive Control, Springer-Verlag, 2013
Key Point
Lyapunov Functions and Performance IndicesMust depend on graph topology
Hongwei Zhang, F.L. Lewis, and Abhijit Das“Optimal design for synchronization of cooperative systems: state feedback, observer and output feedback,”IEEE Trans. Automatic Control, vol. 56, no. 8, pp. 1948-1952, August 2011.
H. Zhang, F.L. Lewis, and Z. Qu, "Lyapunov, Adaptive, and Optimal Design Techniques for Cooperative Systems on Directed Communication Graphs," IEEE Trans. Industrial Electronics, vol. 59, no. 7, pp. 3026‐3041, July 2012.
Real-Time Solution of Multi-Player NZS Games
Kyriakos Vamvoudakis

Multi-Player Nonlinear Systems (continuous-time, N players):

  \dot{x} = f(x) + \sum_{j=1}^{N} g_j(x) u_j

Optimal control:

  V_i^*(x(0), u_1, u_2, ..., u_N) = \min_{u_i} \int_0^\infty ( Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j ) \, dt,   i \in N

Nash equilibrium:

  V_i^*(u_1^*, u_2^*, ..., u_N^*) \le V_i(u_1^*, ..., u_{i-1}^*, u_i, u_{i+1}^*, ..., u_N^*),   i \in N

Control policies:

  u_i = h_i(x) = -(1/2) R_{ii}^{-1} g_i^T(x) \nabla V_i,   i \in N

Requires offline solution of the coupled Hamilton-Jacobi equations:

  0 = (\nabla V_i)^T ( f(x) - (1/2) \sum_{j=1}^{N} g_j(x) R_{jj}^{-1} g_j^T(x) \nabla V_j ) + Q_i(x)
      + (1/4) \sum_{j=1}^{N} (\nabla V_j)^T g_j(x) R_{jj}^{-1} R_{ij} R_{jj}^{-1} g_j^T(x) \nabla V_j,   V_i(0) = 0

Linear Quadratic Regulator Case - coupled AREs:

  0 = P_i A_c + A_c^T P_i + Q_i + \sum_{j=1}^{N} P_j B_j R_{jj}^{-1} R_{ij} R_{jj}^{-1} B_j^T P_j,   i \in N,
  where A_c = A - \sum_{j=1}^{N} B_j R_{jj}^{-1} B_j^T P_j

These are hard to solve. In the nonlinear case, the HJ equations generally cannot be solved.
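The coupled AREs can be sketched in a simple case: a scalar two-player game, solved by a best-response iteration in which each player repeatedly solves a single-player ARE while holding the other player's gain fixed. The plant, weights, the zero cross-weightings R12 = R21 = 0, and the iteration scheme itself are all illustrative assumptions; best-response iteration is not guaranteed to converge in general.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative scalar plant xdot = a*x + b1*u1 + b2*u2 with R12 = R21 = 0.
a, b1, b2 = 0.0, 1.0, 1.0
q1 = q2 = 1.0
r11 = r22 = 1.0

K1 = K2 = 0.0
for _ in range(60):
    # Player 1 treats u2 = -K2*x as part of the drift and solves its own ARE.
    P1 = solve_continuous_are(np.array([[a - b2 * K2]]), np.array([[b1]]),
                              np.array([[q1]]), np.array([[r11]]))[0, 0]
    K1 = b1 * P1 / r11
    # Player 2 does the same against u1 = -K1*x.
    P2 = solve_continuous_are(np.array([[a - b1 * K1]]), np.array([[b2]]),
                              np.array([[q2]]), np.array([[r22]]))[0, 0]
    K2 = b2 * P2 / r22

# At the Nash point both coupled AREs hold: 0 = 2*P_i*(a - K1 - K2) + q_i + P_i^2
ac = a - b1 * K1 - b2 * K2
print(round(P1, 3), round(P2, 3))  # prints "0.577 0.577"
```

For this symmetric example the iteration settles at P1 = P2 = 1/sqrt(3), which satisfies both coupled AREs simultaneously; the closed loop a - K1 - K2 is stable.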
Lao Zi
The way that can be told is not the Constant WayThe name that can be named is not the Constant Name
For nameless is the true wayBeyond the myriad experiences of the world
To experience without intention is to sense the world
All experience is an arch
wherethrough gleams that untravelled land
whose margins fade forever as we move
Dao ke dao, feichang dao
Ming ke ming, feichang ming