Reinforcement Learning on a Double Linked Inverted Pendulum: Towards a Human Posture Control System
R. Pienaar and A.J. van den Bogert, Department of Biomedical Engineering, The Cleveland Clinic Foundation, Cleveland, OH

INTRODUCTION
• Problem Description
  • Mechanical system consisting of rigid segments articulating with each other via hinge joints (Figure 2)
  • Each segment is 2 m long
  • Each segment has a mass of 10 kg
  • The connection to the ground is a hinge joint
  • All motion is constrained to two dimensions
• Problem Implementation
  • Software simulation
  • Ground contact link has possible torques of (−400, 0, 400) Nm
  • Free swinging link has possible torques of (−200, 0, 200) Nm
  • A torque action vector is selected and applied to the system for 50 ms (see the sketch below)
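The resulting discrete action set can be written out directly. A minimal Python sketch; the torque values are from the poster, the names are illustrative:

```python
from itertools import product

# Torque choices (Nm) stated on the poster for each joint.
GROUND_TORQUES = (-400.0, 0.0, 400.0)  # link hinged to the ground
FREE_TORQUES = (-200.0, 0.0, 200.0)    # free-swinging link

# The action set is the Cartesian product: 3 x 3 = 9 torque vectors.
ACTIONS = list(product(GROUND_TORQUES, FREE_TORQUES))

DT = 0.050  # each selected torque vector is held for 50 ms
```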
• Software architecture (Figure 3)
  • "Low level" equations of motion and mechanical behaviour are generated by SD/FAST for each pendulum system
  • "Intermediate level" sdbio library connects the learning system with SD/FAST; the library was designed to allow the addition of muscle-type actuators and joint geometries (not used for the current system)
  • "Top level" component defines the learning system and adaptive controller (a hypothetical sketch of this layering follows)
• Learning shows "noisy" exponential convergence
• A learning episode continues until the pendulum moves into a "terminal" state
• Learning occurs during both exploration and exploitation
• Once the pendulum has been balanced for a continuous hour of "virtual time" (72,000 consecutive 50 ms control steps), the simulation is considered complete; a sketch of this stopping rule follows
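A minimal sketch of the stopping criterion, assuming a generic env/policy interface; the names are illustrative, not the poster's software:

```python
VIRTUAL_HOUR_STEPS = int(3600 / 0.050)  # 72,000 consecutive 50 ms steps

def run_until_balanced(env, policy):
    """Loop episodes until the pendulum stays up for one hour of
    virtual time; env.reset/env.step and policy are hypothetical
    stand-ins."""
    upright_steps = 0
    state = env.reset()
    while upright_steps < VIRTUAL_HOUR_STEPS:
        state, reward, terminal = env.step(policy(state))
        if terminal:  # pendulum fell into a "terminal" state: new episode
            upright_steps = 0
            state = env.reset()
        else:
            upright_steps += 1
```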
• Strictly speaking, the pole balancing problem is not Markovian
  • Discretization of the continuous state into a quantized table violates the Markov condition
  • Q-learning is still able to learn an appropriate balancing strategy (a discretization sketch follows)
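A minimal sketch of such a quantization, with hypothetical bin edges; the poster does not give the actual table resolution:

```python
import numpy as np

# Hypothetical bin edges; the actual quantization is not specified.
THETA_BINS = np.linspace(-np.pi, np.pi, 13)  # segment angle (rad)
OMEGA_BINS = np.linspace(-6.0, 6.0, 13)      # angular velocity (rad/s)

def discretize(theta, omega):
    """Map a continuous (theta, omega) pair to a table cell.

    Collapsing many continuous states into one cell is what breaks the
    Markov property: two histories ending in the same cell may respond
    differently to the same action.
    """
    return int(np.digitize(theta, THETA_BINS)), int(np.digitize(omega, OMEGA_BINS))
```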
• Learning performance (Table 1)
  • Single link pendulum learns relatively quickly (about one hour of "virtual" time, or five minutes of real time)
  • Double link pendulum requires more time (about 2–3 days of "virtual" time, or an hour of real time)
  • Reducing the action possibilities resulted in faster learning for the double link pendulum in some cases
• The controller had no a priori knowledge of the system it was to control
  • It could only observe the current state and receive a "reward" value from the environment
• Q-learning was able to balance both the single and double link pendulum systems
• Future work
  • Application to a human postural control model (Figure 5)
  • Generalize learning over higher-resolution spaces
  • Distributed vs. global control
• Symbolic Dynamics Inc. SD/FAST. http://www.symdyn.com, 1996.
• L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
• R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
• C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8:279–292, 1992.
• The purpose of this research is ultimately to develop an adaptive human posture control system
  • Initial development "platforms" are inverted pendula
  • The human mechanical system can be simplistically modeled as an inverted pendulum
  • Single link and double link
  • The double link pendulum is a non-linear system that is comparatively simple to describe, yet difficult to control
  • An adaptive human posture control system can be used to assist patients suffering from paraplegia and other degenerative neuro-muscular disorders
• The control system is Reinforcement Learning (RL) based
  • Does not require a priori information about the plant
  • Learns the plant's system dynamics through exploration / exploitation strategies
• Develop and test a software development environment that can be used for human posture RL control
• Control a double link inverted pendulum with a specific RL algorithm called Q-learning
• Reinforcement learning
  • Learn from experience, using exploitation and exploration
  • Learn to maximize the sum of all future rewards
  • Biologically inspired
    • Behavioural learning, conditioned reflexes
    • Feedback does not fight the natural dynamics
    • Consistent with electrophysiological data of dopamine neurons during motor learning
  • Successful applications in artificial intelligence, few in robotics or biomechanics
• Q-learning (see algorithm in Figure 1)
  • Q(s,a) = expected sum of future rewards for executing action a in system state s
  • The controller learns from experience controlling the system; it implicitly learns the system dynamics
  • Results in adaptive optimal control without an explicit system model
  • The Q(s,a) function (continuous or discrete) defines the value (or worth) of taking a particular action a in a particular state s
  • Assumes that the underlying system can be described as a Markov decision process
  • The Markov property implies that, for a given state, a decision on future action depends only on the current state and not on the history that led up to it (see the equations below)
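In symbols; the discount factor γ is taken from the update rule in Figure 1:

```latex
% Value of taking action a in state s: expected discounted sum of
% future rewards (gamma from the Figure 1 update).
Q(s,a) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
  \,\middle|\, s_t = s,\ a_t = a \right]

% Markov property: the next state depends only on the current state
% and action, not on the history that led there.
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0)
  = P(s_{t+1} \mid s_t, a_t)
```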
SPECIFIC GOALS
BACKGROUND
• Initialize(Q(s,a), random_values)
• repeat
  • observe(state(s)) (sensory information)
  • forall(actions(a) in state(s))
    • find_action(a) with highest(Q(s,a))
  • if (random_number > ε)
    • execute(action(a))
  • else
    • execute(random(action(a)))
  • observe(new_state(s′))
  • receive(reward(r))
  • Q(s,a) = Q(s,a) + α [r + γ max_b Q(s′,b) − Q(s,a)]
• until (converged(Q(s,a)))
α: learning rate; γ: discount factor; ε: exploration rate
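A minimal, self-contained Python rendering of Figure 1; the environment interface and the α, γ, ε values are assumptions, not the poster's actual implementation:

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=0.99, epsilon=0.1,
               episodes=10_000):
    """Tabular Q-learning with epsilon-greedy exploration (Figure 1).

    env.reset() -> state and env.step(a) -> (next_state, reward,
    terminal) are illustrative names only.
    """
    # Figure 1 initializes Q(s,a) randomly; zero-initialization via a
    # defaultdict is a common simplification used here.
    Q = defaultdict(float)

    for _ in range(episodes):
        s = env.reset()
        terminal = False
        while not terminal:
            if random.random() > epsilon:
                # Exploit: the action with the highest Q(s,a).
                a = max(actions, key=lambda act: Q[(s, act)])
            else:
                # Explore: a random action.
                a = random.choice(actions)
            s_next, r, terminal = env.step(a)
            # Q(s,a) = Q(s,a) + alpha*[r + gamma*max_b Q(s',b) - Q(s,a)]
            best_next = 0.0 if terminal else max(Q[(s_next, b)]
                                                 for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```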
[Figure 3 components: ReinforcementController, sdbio, SD/FAST system model; input files: muscle description, mechanical system]
[Figure 2 labels: segment states (Θ0, ω0) on each pendulum and (Θ1, ω1) on the second link; joint labels: ball, hinge, universal joint, body mass]
METHODS
Figure 1. Pseudo-code description of the Q-learning algorithm.
Figure 2. Schematic overview of the single- and double-link pendulum systems. (Θi: angle of segment i, in rad; ωi: angular velocity of segment i, in rad/s.)
Figure 3. Conceptual overview of software architecture.
RESULTS
DISCUSSION
Figure 5. Future work will apply the controller to the human musculo-skeletal system. (Labeled muscles: 1a quadriceps femoris, 1b hamstrings, 2a vasti, 2b glutei, 3 sartorius, 4 gastrocnemius, 5a tibialis anterior, 5b tibialis posterior.)
CONCLUSION
REFERENCES
Figure 4. Controller GUI (left) and performance graph (right).

Table 1. Learning performance for different action vectors.

  Ground link action vector (Nm) | Free link action vector (Nm) | Total learning time (days)
  (−400, 0, 400)                 | (−400, 0, 400)               | 5.24
  (−200, 0, 200)                 | (−100, 0, 100)               | 3.96
  (−400, 0, 400)                 | (−200, 0, 200)               | 5.79