

Reinforcement Learning on a Double Linked Inverted Pendulum: Towards a Human Posture Control System

R. Pienaar and A.J. van den Bogert
Department of Biomedical Engineering, The Cleveland Clinic Foundation, Cleveland, OH

• Problem Description
  • Mechanical system consisting of rigid segments articulating with each other via hinge joints (Figure 2)
  • Each segment is 2 m long
  • Each segment has a mass of 10 kg
  • The connection to the ground is by a hinge joint
  • All motion is constrained to two dimensions

• Problem Implementation
  • Software simulation
  • The ground contact link has possible torques of (−400, 0, 400) Nm
  • The free swinging link has possible torques of (−200, 0, 200) Nm
  • A torque action vector is selected and applied to the system for 50 ms (a minimal sketch of this action set and control step follows this list)

• Software architecture (Figure 3)
  • "Low level": equations of motion and mechanical behaviour are generated by SD/FAST for each pendulum system
  • "Intermediate level": the sdbio library connects the learning system with SD/FAST; the library was designed to allow the addition of muscle-type actuators and joint geometries (not used for the current system)
  • "Top level": defines the learning system and adaptive controller
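A minimal sketch of the discrete action set and the 50 ms control step described above. The torque values come from the poster; the `sim` object and its `set_joint_torques` / `advance` / `get_state` methods are hypothetical stand-ins for the SD/FAST model accessed through sdbio, not the actual interface.

```python
import itertools

# Discrete action space: every combination of ground-link and free-link torque (9 actions).
GROUND_TORQUES = (-400.0, 0.0, 400.0)   # Nm, ground contact link
FREE_TORQUES = (-200.0, 0.0, 200.0)     # Nm, free swinging link
ACTIONS = list(itertools.product(GROUND_TORQUES, FREE_TORQUES))

CONTROL_DT = 0.050    # each selected torque action vector is held for 50 ms
PHYSICS_DT = 0.001    # inner integration step (assumed, not reported in the poster)

def apply_action(sim, action):
    """Hold one torque action vector on the model for one 50 ms control step."""
    ground_torque, free_torque = action
    for _ in range(int(CONTROL_DT / PHYSICS_DT)):
        sim.set_joint_torques([ground_torque, free_torque])  # hypothetical sdbio-style call
        sim.advance(PHYSICS_DT)                              # hypothetical integrator step
    return sim.get_state()  # e.g. (theta0, omega0, theta1, omega1)
```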

• Learning shows "noisy" exponential convergence
• A learning episode continues until the pendulum moves into a "terminal" state
• Learning occurs during both exploration and exploitation
• Once the pendulum has been balanced for a continuous hour of "virtual time", the simulation is considered complete

• Strictly speaking, the pole-balancing problem is not Markovian
  • Discretization of the continuous state into a quantized table violates the Markov condition (see the quantization sketch after these bullets)
  • Q-learning is still able to learn an appropriate balancing strategy
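The quantization sketch referred to above: the continuous state (Θi, ωi) is mapped to integer bin indices that key a finite Q table. The bin counts and ranges below are illustrative assumptions; the poster does not report the resolution it used.

```python
import math

# Illustrative discretization of the double-link state into Q-table indices.
N_ANGLE_BINS = 12                          # assumed resolution
N_VEL_BINS = 8                             # assumed resolution
ANGLE_RANGE = (-math.pi / 2, math.pi / 2)  # rad, assumed "terminal" bounds
VEL_RANGE = (-4.0, 4.0)                    # rad/s, assumed

def to_bin(x, lo, hi, n_bins):
    """Clamp x to [lo, hi] and map it to an integer bin in 0..n_bins-1."""
    x = min(max(x, lo), hi)
    return min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)

def discretize(theta0, omega0, theta1, omega1):
    """Map the continuous state to a hashable key for a tabular Q function."""
    return (to_bin(theta0, *ANGLE_RANGE, N_ANGLE_BINS),
            to_bin(omega0, *VEL_RANGE, N_VEL_BINS),
            to_bin(theta1, *ANGLE_RANGE, N_ANGLE_BINS),
            to_bin(omega1, *VEL_RANGE, N_VEL_BINS))
```

Because two different continuous states can fall into the same bin, the table lookup hides information the true dynamics depend on, which is why the discretized problem is not strictly Markovian.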

• Learning performance
  • The single-link pendulum learns relatively quickly (about one hour of "virtual" time, or five minutes of real time)
  • The double-link pendulum requires more time (about 2-3 days of "virtual" time, or an hour of real time)
  • Reducing the action possibilities resulted in faster learning for the double-link pendulum in some cases

• The controller had no a priori knowledge of the system it was to control
  • It could only observe the current state and receive a "reward" value from the environment
• Q-learning was able to balance both the single-link and double-link pendulum systems

• Future work
  • Application to a human postural control model (Figure 5)
  • Generalize learning over higher-resolution spaces
  • Distributed vs. global control

REFERENCES

• Symbolic Dynamics Inc. SD/FAST. http://www.symdyn.com, 1996.
• L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4: 237-285, 1996.
• R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
• C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8: 279-292, 1992.

INTRODUCTION

• The ultimate purpose of this research is to develop an adaptive human posture control system
• The initial development "platforms" are inverted pendula
  • The human mechanical system can be simplistically modeled as an inverted pendulum
  • Single-link and double-link configurations
  • The double-link pendulum is a non-linear system that is comparatively simple to describe, yet difficult to control
• An adaptive human posture control system could be used to assist patients suffering from paraplegia and other degenerative neuro-muscular disorders
• The control system is Reinforcement Learning (RL) based
  • Does not require a priori information about the plant
  • Learns the plant's system dynamics through exploration / exploitation strategies

SPECIFIC GOALS

• Develop and test a software development environment that can be used for human posture RL control
• Control a double-link inverted pendulum with a specific RL algorithm called Q-learning

BACKGROUND

• Reinforcement learning
  • Learn from experience, using exploitation and exploration
  • Learn to maximize the sum of all future rewards
  • Biologically inspired
    • Behavioural learning, conditioned reflexes
    • Feedback does not fight the natural dynamics
    • Consistent with electrophysiological data from dopamine neurons during motor learning
  • Successful applications in artificial intelligence, few in robotics or biomechanics
• Q-learning (see the algorithm in Figure 1)
  • Q(s,a) = expected sum of future rewards for executing action a in system state s (written out after this list)
  • The controller learns to control the system from experience; it implicitly learns the system dynamics
  • Results in adaptive optimal control without an explicit system model
  • The Q(s,a) function (continuous or discrete) defines the value (or worth) of taking a particular action a in a particular state s
  • Assumes that the underlying system can be described as a Markov decision process
    • The Markov property implies that the decision on a future action depends only on the current state and not on the history that led up to the current state
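For reference, the quantities behind these bullets can be written out in the standard textbook form (Sutton & Barto, 1998; Watkins & Dayan, 1992); this is the conventional formulation, not an equation reproduced from the poster:

```latex
% Discounted return the controller learns to maximize
% (gamma is the discount factor labelled in Figure 1):
R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}

% Optimal action-value function approximated by the Q table:
Q^{*}(s,a) = \max_{\pi} \mathbb{E}\left[ R_t \mid s_t = s,\ a_t = a,\ \pi \right]

% Bellman optimality relation exploited by the Q-learning update:
Q^{*}(s,a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{b} Q^{*}(s_{t+1}, b) \mid s_t = s,\ a_t = a \right]
```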


Initialize Q(s,a) to random values
repeat
    observe state s (sensory information)
    for all actions a available in state s:
        find the action a with the highest Q(s,a)
    if random_number > ε:
        execute action a
    else:
        execute a random action a
    observe new state s'
    receive reward r
    Q(s,a) = Q(s,a) + α [r + γ max_b Q(s',b) − Q(s,a)]
until Q(s,a) has converged

α: learning rate
γ: discount factor
ε: exploration rate
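The following is a minimal, self-contained Python sketch of the tabular algorithm in Figure 1. The environment object and all numeric settings (learning rate, discount factor, exploration rate, episode count) are illustrative assumptions; the poster's actual implementation sits on top of sdbio / SD/FAST and is not reproduced here.

```python
import random
from collections import defaultdict

ALPHA = 0.1     # learning rate (α in Figure 1), assumed value
GAMMA = 0.95    # discount factor (γ in Figure 1), assumed value
EPSILON = 0.1   # exploration rate (ε in Figure 1), assumed value

def q_learning(env, actions, episodes=10000):
    """Tabular ε-greedy Q-learning.

    `env` is a hypothetical interface assumed to provide:
      reset() -> state                      (hashable, e.g. discretized bins)
      step(action) -> (state, reward, done) (done=True in a "terminal" state)
    """
    # Q table; initialized to zero here rather than the random values
    # used in Figure 1 -- either choice works for this sketch.
    Q = defaultdict(float)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # ε-greedy: exploit the best known action, otherwise explore.
            if random.random() > EPSILON:
                a = max(actions, key=lambda act: Q[(s, act)])
            else:
                a = random.choice(actions)

            s_next, r, done = env.step(a)

            # Q(s,a) = Q(s,a) + α [r + γ max_b Q(s',b) − Q(s,a)]
            best_next = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
            s = s_next
    return Q
```

In the pendulum setting, `actions` would be the nine torque vectors listed under Problem Implementation, and `env.step` would apply one of them for 50 ms as sketched earlier.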

[Figure 3 diagram: Reinforcement Controller, sdbio library, and SD/FAST system model, configured by a muscle description file and a mechanical system file.]

[Figure 2 diagram: pendulum segments labelled with their state variables (Θ0, ω0) and (Θ1, ω1).]

[Legend symbols: ball joint, hinge joint, universal joint, body mass.]

METHODS

Figure 1. Pseudo-code description of the Q-learning algorithm.

Θi: angle of segment i (rad); ωi: angular velocity of segment i (rad/s)

Figure 2. Schematic overview of the single- and double-link pendulum systems.

Figure 3. Conceptual overview of software architecture.

RESULTS

DISCUSSION

[Figure 5 diagram: human musculo-skeletal model with numbered muscle groups: 1a Quadriceps femoris; 1b Hamstrings; 2a Vasti; 2b Glutei; 3 Sartorius; 4 Gastrocnemius; 5a Tibialis anterior; 5b Tibialis posterior.]

Figure 5. Future work will apply the controller to a human musculo-skeletal system.

CONCLUSION


Figure 4. Controller GUI (on left) and performance graph (on right)

Ground link action vector (Nm)    Free link action vector (Nm)    Total learning time (days)
(−400, 0, 400)                    (−400, 0, 400)                  5.24
(−200, 0, 200)                    (−100, 0, 100)                  3.96
(−400, 0, 400)                    (−200, 0, 200)                  5.79

Table 1. Learning performance for different action vectors.