
IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING, VOL. 9, NO. 2, APRIL 2012 237

Real-Time Adaptive Control of a Flexible Manipulator Using Reinforcement Learning

Santanu Kumar Pradhan and Bidyadhar Subudhi, Senior Member, IEEE

Abstract—This paper exploits reinforcement learning (RL) for developing real-time adaptive control of the tip trajectory and deflection of a two-link flexible manipulator handling variable payloads. The proposed adaptive controller consists of a proportional derivative (PD) tracking loop and an actor-critic-based RL loop that adapts the actor and critic weights in response to payload variations while suppressing the tip deflection and tracking the desired trajectory. The actor-critic-based RL loop uses recursive least squares (RLS)-based temporal difference (TD) learning with an eligibility trace and an adaptive memory to estimate the critic weights, and a gradient-based estimator to estimate the actor weights. The tip trajectory tracking and tip deflection suppression performance of the proposed RL-based adaptive controller (RLAC) is compared with that of a nonlinear regression-based direct adaptive controller (DAC) and a fuzzy learning-based adaptive controller (FLAC). Simulation and experimental results show that the RLAC outperforms both the DAC and the FLAC.

Note to Practitioners—This paper shows how to control a system with distributed flexibility. The reinforcement learning approach to adaptive control described in the paper can also be applied to the control of complex flexible space shuttle systems and to the damping of many vibratory systems.

Index Terms—Adaptive control, flexible-link manipulator, reinforcement learning, tip trajectory tracking.

NOMENCLATURE

Joint position of the link.

Desired joint position of the link.

Modal displacement for the link.

Redefined tip trajectory of the link.

Tip reference velocity of the link.

Reference model output of the link.

Tip trajectory error of the link.

Change in tip trajectory error of the link.

Error w.r.t. reference model of the link.

Measure of tip trajectory tracking accuracy for link.

Critic output.

Manuscript received July 08, 2011; revised November 30, 2011; accepted February 11, 2012. Date of publication March 07, 2012; date of current version April 03, 2012. This paper was recommended for publication by Associate Editor M. K. Jeong and Editor Y. Narahari upon evaluation of the reviewers' comments.

The authors are with the Department of Electrical Engineering, National Institute of Technology, Rourkela, Orissa 769008, India (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASE.2012.2189004

Actor output.

Reward.

Value function.

Discount factor.

Actor weights.

Critic weights.

Temporal difference error at instant.

Critic regressor vector.

Actor regressor vector.

Eligibility trace.

Kalman gain matrix.

Covariance matrix of the temporal difference error.

Forgetting factor.

Actor adaptation gain.

Incremental adaptive memory.

Gradient vector.

Gradient updating vector.

Value of the eligibility trace.

I. INTRODUCTION

FLEXIBLE-LINK MANIPULATORS (FLMs), i.e., manipulators with thin and lightweight links, offer several advantages over rigid-link manipulators, such as high-speed operation, lower energy consumption, and increased payload-carrying capacity. These nonconventional manipulators find applications requiring a large workspace, such as assembly of free-flying space structures and hazardous material management from a safe distance [1]. All these advantages and applications motivate accelerated research in FLM control. However, controlling an FLM is difficult owing to distributed link flexibility, which makes this type of manipulator system dynamics nonminimum phase and underactuated [1]. Further, control of an FLM becomes more challenging when it has to handle a variable payload. In order to achieve good tip trajectory tracking while suppressing tip deflection under varied payloads, adaptive control should be employed, which can provide appropriate control torques to the actuators to achieve these two control tasks (good tip trajectory tracking and suppression of tip deflection).

In the past, several papers on the design of adaptive controllers for FLMs with variable payloads have been reported. A simple decoupled adaptive controller, comprising estimation of the link's natural frequency, is proposed for a single-link flexible manipulator under variable payload in [2]. Further advancement


in adaptive control has been proposed in [3], where a discrete-time nonlinear adaptive controller for a single-link flexible manipulator using RLS-based payload estimation is used. The above adaptive controllers suffer from dependency on an identification procedure and excessive tuning of adaptive gains. Intelligent controllers based on supervised learning using neural networks [4] and fuzzy logic have been designed by some investigators [5] for FLMs under parametric uncertainty. However, neural network-based controllers require training of the synaptic weights to optimal values, which consumes a considerable amount of time and adds computational complexity. Fuzzy logic-based adaptive controller design depends upon proper formulation of the control rule base. A hybrid neuro-fuzzy-based adaptive controller has been proposed in [6]. Although the above hybrid neuro-fuzzy controller shows better performance compared to neural network and fuzzy logic-based adaptive controllers, it needs a priori information about the input-output relationship, i.e., supervised and offline learning are essentially required. Also, adaptive control of a multilink flexible manipulator is more complex than the single-link flexible manipulator control problem owing to interlink coupling effects. The above discussion reveals that there is a need for precise real-time adaptive control of FLMs under payload variation. Hence, the development of a real-time adaptive control for both tip trajectory tracking and suppression of tip deflection for a two-link flexible manipulator (TLFM) handling variable payload is the objective of this paper. Unlike supervised learning, where the learning is driven by an error signal (the difference between the desired and current response), RL occurs when an agent (manipulator) learns behaviors (tip trajectory tracking) through trial-and-error interaction with the environment (workspace), based on "reinforcement" signals from the environment [7]–[10].

The contribution of this paper lies in developing a new RL-based real-time adaptive control for a TLFM. Motivated by the successful application of reinforcement learning to many complex systems, such as the acrobot, elevator dispatching, dynamic cellular channel allocation, and the inverted pendulum [7], this paper attempts to exploit actor-critic-based RL, with modifications in both the critic and the actor, to develop an adaptive control for a TLFM. Many of the previous works on RL-based control use a least squares (LS) approach to estimate the weights of the value function [7]. However, as LS is a batch-processing technique, it is unsuitable for real-time control. Therefore, the proposed actor-critic RLAC uses RLS-based TD learning to obtain the optimal weights of the value function in the critic. Further, a mechanism of eligibility trace [8] and an adaptive memory are embedded into this TD algorithm to enhance learning; we call the result the Recursive Least Squares-Eligibility Trace-Adaptive Memory (RLS-ET-AM) algorithm. The proposed algorithm calculates the initial critic parameters offline in order to reduce the computational overhead in real time, unlike previous approaches where either zero or random initial values were taken [8]. To ensure stability of the RLAC, a discrete-time PD controller is supplemented with the above RL learning. The proposed RLAC is compared with a DAC and a FLAC to validate its performance. The rest of this paper is organized as follows. Section II presents the dynamic model of the TLFM. Section III briefly reviews two other controllers, the DAC and the FLAC, to enable comparison with the proposed RLAC. In Section IV, the development of the proposed RLAC is presented. Simulation results are discussed in Section V-A, followed by experimental results in Section V-B. Section VI presents conclusions.

Fig. 1. Schematic diagram of a planar TLFM.

II. DYNAMIC MODEL OF THE TLFM

The schematic diagram of a planar TLFM is shown in Fig. 1, which indicates the actuated torque of each link, the joint angle of each joint, and the deflection along each link. The outer free end of the TLFM carries the payload mass. The dynamics of the TLFM are given by [11]

(1)

where the terms are, respectively, the positive-definite symmetric inertia matrix, the vector of Coriolis and centrifugal forces, the stiffness matrix, and the damping matrix. If the output is taken as the tip position, the overall manipulator system becomes nonminimum phase [1]; hence, the redefined output is given by

(2)

where the link length appears. The TLFM dynamics (1) can be rewritten in state-space form as

(3)

with the state vector containing the joint positions, the modal displacements, and their corresponding velocities for each link, together with the associated actual output vector. To express the dynamics of the TLFM in terms of the redefined tip position and tip velocity, the states are redefined accordingly.


Fig. 2. Nonlinear direct adaptive controller for TLFM.


The new state-space representation of the TLFM using the redefined output can be expressed as

(4)

where the torque input is expressed with respect to the redefined output.
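For concreteness, a commonly used closed-form model of this type, written as a sketch in generic notation based on the planar multilink lightweight-robot formulation of [11] (the symbols M, h, K, D, B, q, tau, theta_i, w_i, and l_i are assumed for illustration and are not necessarily those of (1)–(4)), is

$$ M(q)\,\ddot{q} + h(q,\dot{q}) + K q + D\dot{q} = B\,\tau, \qquad q = [\,\theta_1\ \ \theta_2\ \ \delta_{11}\ \cdots\,]^{\top}, $$

with the redefined tip output often taken as

$$ y_i = \theta_i + \frac{w_i(l_i,t)}{l_i}, \qquad i = 1,2, $$

i.e., the joint angle plus the tip deflection normalized by the link length, a choice intended to avoid the nonminimum-phase behavior associated with the true tip position.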

III. REVIEW OF DAC AND FLAC

A. Direct Adaptive Controller (DAC)

Fig. 2 shows the structure of the direct adaptive controller (DAC) [12]. It consists of a TLFM dynamic compensator and a PD feedback loop. The dynamic compensator provides the joint dynamic torques necessary to produce the desired motions, using real-time estimates of the TLFM parameters. The output of the PD feedback loop regulates the redefined tip trajectory about its desired value. The direct adaptive control law is derived as follows. Define a vector containing the parameters of the TLFM, given by

where the entries are the link inertia, the hub inertia, the total inertia of the link, the total mass of the link, and the total coupling mass.

The choice of this vector is made so as to keep the number of manipulator parameters to a minimum. Its online estimate is denoted accordingly.

The estimates of the inertia matrix and of the Coriolis and centrifugal force vector are formed from this parameter estimate. Then, the TLFM dynamics (4) can be written as

(5)

where the parameter estimation error is the difference between the estimated and true parameter vectors, and the regressor is a matrix independent of the TLFM dynamic parameters, with column dimension equal to the dimension of the parameter vector for each link.

Fig. 3. Fuzzy logic-based adaptive controller for TLFM.

The direct adaptive control law can be expressed as

(6)

(7)

where a constant positive-definite adaptation matrix and a positive-definite PD gain matrix appear, and the tracking-accuracy vector is defined as

(8)
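For reference, the adaptive law of [12], on which this DAC is patterned, has the following structure (a sketch in assumed generic notation, with e the redefined tip tracking error, Lambda, Gamma, and K_D gain matrices, Y the regressor, and a-hat the parameter estimate; the exact sign conventions and definitions in (6)–(8) may differ):

$$ s = \dot{e} + \Lambda e, \qquad \tau = Y(\cdot)\,\hat{a} - K_{D}\, s, \qquad \dot{\hat{a}} = -\,\Gamma\, Y^{\top}(\cdot)\, s, $$

so that the regressor term plays the role of the dynamic compensator in (6), the K_D s term plays the role of the PD feedback loop in (7), and s corresponds to the tracking-accuracy measure of (8).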

B. Fuzzy Learning Based Adaptive Controller (FLAC)

Fig. 3 shows the structure of the FLAC [4]. It utilizes a learning mechanism that automatically adjusts the rule base of the fuzzy controller (FC) so that the closed loop performs according to a user-defined reference model containing information about the desired behavior of the controlled system. It consists of three major components, namely an FC, a reference model, and a learning mechanism, described as follows.

1) FC: The FC takes the tip trajectory error and the change in error as inputs and produces the control action as output for each link, with fuzzy implications of the form

where the change in error is the difference between the tip trajectory errors for the link at the current and previous sampling instants.

The output fuzzy sets for both links are initialized at zero; the rule base is then filled online [4] using the learning mechanism shown in Fig. 3.

2) Reference Model: The reference model is chosen according to the desired closed-loop performance, such as rise time and settling time. Its choice is very important, as it dictates the behavior the FLAC is required to reproduce.

3) Learning Mechanism: The learning mechanism modifies the knowledge base of the FC so that the closed-loop performance follows the reference model. The learning block consists of a fuzzy inverse model and a knowledge-base modifier. These are explained next.

The fuzzy inverse model assesses the deviation of the current closed-loop system behavior from the behavior specified by the reference model. The design of the fuzzy inverse model requires knowledge of the closed-loop online TLFM tip position error profile and change-in-error profile. The knowledge-base modifier modifies the FC so that better payload adaptability can be achieved. To modify the knowledge base of the fuzzy controller, the rules that are "on" are determined, i.e., the reference-model error is measured,


Fig. 4. Structure of reinforcement learning-based adaptive control for a TLFM carrying a variable payload.

where the reference-model error and the reference-model output appear, respectively. The entries of the rule base are modified according to the value of this error, and a scalar output from the rule-base modifier shifts the centers of the fuzzy rule base using the following rule:

where the fuzzy inverse model output is the amount by which the rule-base centers are shifted.
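As an illustration only, the following minimal sketch shows this kind of center-shift step in the style of the fuzzy model-reference learning control of [4]; the function and variable names (modify_rule_base, centers, active_rules, p) are assumed for illustration and are not taken from the paper.

import numpy as np

def modify_rule_base(centers, active_rules, p):
    # Shift the output membership-function centers of the rules that were "on"
    # by the scalar fuzzy-inverse-model output p (knowledge-base modifier step).
    centers = centers.copy()
    centers[list(active_rules)] += p
    return centers

# Example: a 9 x 9 rule base flattened to 81 output centers, initialized at zero.
centers = np.zeros(81)
centers = modify_rule_base(centers, active_rules=[30, 31, 39, 40], p=0.05)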

IV. PROPOSED REAL-TIME RLAC FOR TLFM

RL takes place when an agent (the TLFM) understands and learns by observing the environment (workspace), based on a scalar reward and the TD error at each instant; the reward is the external reinforcement signal that comes from the environment, and the objective is to minimize a long-term value function described next. Fig. 4 shows the structure of the RLAC for real-time implementation on the TLFM carrying a variable payload. It consists of two important components: an actor-critic block and a PD control loop. The actor-critic block adapts the actor and critic weights in order to compensate the joint torque input under payload uncertainties.

The PD controller provides stable closed-loop performance by regulating the tip trajectory to its desired value. A zero-order hold (ZOH) block is used to obtain discrete values of the desired tip trajectory and the redefined output. Thus, the net adaptive torque for each link is given by

(9)

where the first term is the proportional-derivative control action and the second is the estimated actor output. The PD control law uses the past value of the tip trajectory tracking error for each link and the past value of the PD control output. Thus, the digital PD control action is generated using the following recursive law:

(10)

where the proportional and derivative gains appear together with the tracking errors at the current and previous sampling instants and the control action at the previous instant for each link.
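In generic notation (an illustrative sketch with assumed symbols rather than those of (9)–(10)), the net torque for link i at sample k is the sum of the two loops,

$$ \tau_i(k) = u_{\mathrm{PD},i}(k) + \hat{u}_{a,i}(k), $$

where u_PD,i(k) is obtained recursively from u_PD,i(k-1) and the tracking errors e_i(k) and e_i(k-1) as in (10), and the second term is the actor output of (18).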

Fig. 5. Actor-Critic-based Reinforcement learning.

A. Actor-Critic Block

Fig. 5 describes the actor-critic-based RL, where the control policy is applied to the actuators of the TLFM and the measured redefined tip trajectory of (2) is observed for each link at each instant. The reward at an instant is the result of the transition from the current state to its successor at the next instant. Let a value or cost be assigned to the total cumulative reward function, expressed as

(11)

where the discount factor at each instant decides how much weight is given to future rewards. The RL searches for a control policy in the actor that minimizes the value function defined in (11):

(12)

It is difficult to achieve minimization of the value function in real time, as (12) requires evaluation of an infinite sum backward in time. To provide a forward-in-time solution of (12), an approximation of the value function is necessary. For this purpose, (12) can be rewritten as the sum of the one-step reward and the discount factor times the infinite sum of future rewards, i.e., the future value function, in compact form as

(13)
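In standard discounted-return notation (a sketch consistent with the description above, with r the reward, V the value function, and gamma the discount factor of the Nomenclature), (11) and its one-step decomposition (13) correspond to

$$ V(k) = \sum_{n=k}^{\infty} \gamma^{\,n-k}\, r(n) = r(k) + \gamma \sum_{n=k+1}^{\infty} \gamma^{\,n-(k+1)}\, r(n) = r(k) + \gamma\, V(k+1), $$

while (12) expresses the search for the policy minimizing this value; the one-step form replaces the infinite backward-in-time sum by a recursion that can be evaluated forward in time.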


The difference equation equivalent of (13) is given in [8] as

(14)

Equation (14) is also known as the Bellman equation. Based on this equation, the temporal difference error can be defined as

(15)

The TD error is a prediction error between predicted and observed performance. If (15) holds for the optimal value function, the TD error must approach zero. Thus, (15) becomes

(16)

The RLAC-based actor-critic RL consists of two separate blocks, the actor and the critic. In the actor, the policy is updated, and in the critic, the value function is updated using a linear function approximator based on the RLS algorithm. Hence, the value function, expressed in terms of the critic weights, can be written as

(17)

where the critic regressor vector is constructed using the Kronecker product. Similarly, the control policy can be expressed in regressor form as

(18)

where the matrix of actor weight estimates and the actor regressor vector appear. These signals play a vital role in determining the performance of the control policy by minimizing the error defined in (16). The performance measure of the TLFM control is to achieve the desired tip trajectory tracking while simultaneously damping out the tip deflection. Therefore, the quantity defined in (8), which measures the accuracy of the tip trajectory tracking for each link, is used to formulate the reward and is given as

(19)

where a predefined tolerance value appears, and a negative reward is taken in (19) to improve the closed-loop performance. Substituting the value function from (17) into (15), one obtains

(20)
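With the linear critic of (17), the temporal difference error of (15) takes the familiar form (a sketch in assumed notation, with the critic weight estimate w_c-hat and the critic regressor phi_c)

$$ \delta_{\mathrm{TD}}(k) = r(k) + \gamma\, \hat{V}(k+1) - \hat{V}(k) = r(k) + \gamma\, \hat{w}_c^{\top}\phi_c(k+1) - \hat{w}_c^{\top}\phi_c(k), $$

which is the quantity the critic drives toward zero; the reward itself is taken negative in (19), based on the tracking-accuracy measure of (8) and the predefined tolerance.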

B. Critic Weight Update Using the Proposed RLS-ET-AM Algorithm

The objective of the critic is to estimate the value function using the proposed RLS-ET-AM algorithm. Let the critic provide an estimate of the value function, and let a cost function over the measurements be chosen so as to minimize the temporal difference error defined in (15). The cost function is given as

(21)

where the reward function, the estimated value function, and the eligibility trace appear; the eligibility trace is used to improve the temporal difference learning by selecting the eligible states embedded in the cost (21). We call this the RLS-ET algorithm. The eligibility trace is defined by a two-case expression involving the discount factor, the value of the eligibility trace, and the desired tip trajectory for each link; it is to be noted that both the discount factor and the eligibility trace value are less than unity. Substituting the value function from (17) into (21) gives

(22)

Equation (22) can be modified in terms of the predicted critic weights as

(23)

The least-squares solution of (23) is given as

(24)

A recursive form of the above equation with a forgetting factor can be obtained as follows:

(25)

with the Kalman gain and covariance matrix updates given as follows:

(26)

(27)

Equations (25)–(27) constitute the RLS-based TD learning with eligibility trace. Further, an incremental adaptive memory can be added to the above RLS-ET algorithm to enhance the learning speed of the critic. The resulting weight update


expressions for RLS-based TD learning with an eligibility trace and an adaptive memory (RLS-ET-AM) are given in (28)–(32):

(28)

(29)

(30)

(31)

The gradient vector is updated using the following expression:

(32)

where is the identity matrix and

(33)
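To make the critic recursion concrete, the following is a minimal sketch of RLS-based TD learning with an eligibility trace and a forgetting factor, in the spirit of (25)–(27); the adaptive-memory terms of (28)–(33) are omitted, and the class and variable names, the initializations, and the exact placement of the forgetting factor are assumptions rather than the authors' expressions.

import numpy as np

class RLSTDCritic:
    # Critic with linear value function V(x) ~ w^T phi(x), updated by RLS-based
    # TD learning with eligibility trace z and forgetting factor mu.
    def __init__(self, n_features, gamma=0.95, lam=0.7, mu=0.99, w0=None, p0=1e3):
        self.gamma, self.lam, self.mu = gamma, lam, mu
        self.w = np.zeros(n_features) if w0 is None else np.asarray(w0, dtype=float)
        self.P = p0 * np.eye(n_features)      # covariance matrix of the TD error
        self.z = np.zeros(n_features)         # eligibility trace

    def update(self, phi, phi_next, reward):
        self.z = self.gamma * self.lam * self.z + phi            # eligibility-trace recursion
        d = phi - self.gamma * phi_next                          # temporal regressor difference
        K = self.P @ self.z / (self.mu + d @ self.P @ self.z)    # Kalman-like gain
        td_error = reward - d @ self.w                           # r + gamma*w'phi' - w'phi
        self.w = self.w + K * td_error                           # critic-weight update
        self.P = (self.P - np.outer(K, d @ self.P)) / self.mu    # covariance update
        return td_error

Initializing the weights from an offline calculation, as proposed in Section I, simply amounts to passing a precomputed vector as w0 instead of zeros.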

C. Actor Weight Update Using a Gradient-Based Estimator and the Proposed RLS-ET-AM Algorithm

The actor weight vector can be updated using a gradient-based estimator as described below. The control policy can be written in parametric form as

(34)

and let its estimate be defined accordingly. Then, the control policy estimation error can be written as

(35)

(36)

The control policy can also be rewritten in terms of the critic parameters as follows:

(37)

where the adaptation gain appears. By measuring the external reinforcement signal (the reward) and the internal reinforcement signal (the TD error), the critic as well as the actor weights are updated. The learning terminates as soon as the approximation error tends towards zero. The proposed RLS-ET-AM algorithm is shown in Table I.
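Similarly, a minimal sketch of a TD-error-driven, gradient-style actor step for a linear policy u = W_a^T phi_a is given below; the sign convention, the learning rate alpha, and the names are assumptions, and the authors' exact estimator is (34)–(37), which also involves the critic parameters and an adaptation gain.

import numpy as np

def actor_update(W_a, phi_a, td_error, alpha=0.01):
    # One gradient-style step on the actor weights W_a (n_features x n_outputs).
    # The scalar TD error acts as the internal reinforcement signal: a positive
    # error (cost worse than predicted) pushes the weights against the regressor.
    return W_a - alpha * td_error * phi_a[:, None]

# Illustrative combined step (critic as in the previous sketch):
# td_error = critic.update(phi_c, phi_c_next, reward)
# W_a = actor_update(W_a, phi_a, td_error)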

D. Convergence Analysis of the Critic Weights Using the Proposed RLS-ET-AM Algorithm

The existing RLS-TD learning algorithm [8] is modified by adding an incremental adaptive memory to the RLS-based linear function approximator, with critic weights initialized offline. In

TABLE I
PROPOSED RLS-ET-AM ALGORITHM

order to prove the convergence of the above RLS-ET-AM algorithm, certain assumptions are used. These are as follows.

Assumption 1: The discrete sequence of states, with its transition probability matrix and stationary distribution, satisfies

(38)

Assumption 2: The transition reward satisfies

(39)

where the expectation is taken with respect to the stationary distribution.

Assumption 3: The regressor matrix has linearly independent columns.

Assumption 4: For every state, the basis function satisfies

(40)

Assumption 5: The associated limiting matrix is nonsingular.

Theorem 1: Under Assumptions 1–5 and using the proposed RLS-ET-AM algorithm given in (28)–(32), the critic weights converge to the optimal critic weights.

Proof: Applying the matrix inversion lemma to (30), it can be rewritten as

(41)

where the indicated substitutions are made, assuming a suitable initial value.

The quantity given in (29) is then multiplied through, giving

(42)

(43)

Equation (43) can be rewritten using the expression from (41) as

(44)


Using the results obtained for the update of the covariance matrix of the TD error in (41) and the expression in (44), the update of the critic weights defined in (28) can be rewritten as

(45)

Substituting the expression from (41) into (45) gives

(46)

Using (41) in (46) gives

(47)

Denoting the indicated sample-average quantities in (47) accordingly, one obtains

(48)

since

(49)

(50)

where the number of measurements appears and, from Assumptions 1–5, it is known that the limiting matrix is invertible, i.e.,

(51)

Thus, from (51) it is clear that the critic weight estimate converges to the optimal critic weights.
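For context, the limit referred to in Theorem 1 is the classical least-squares TD fixed point of [8]; in generic notation (a sketch, with z the eligibility trace, phi_c the critic regressor, r the reward, and expectations taken over the stationary state distribution),

$$ w_c^{\ast} = A^{-1} b, \qquad A = \mathrm{E}\!\left[ z(k)\,\big(\phi_c(k) - \gamma\,\phi_c(k+1)\big)^{\!\top} \right], \qquad b = \mathrm{E}\!\left[ z(k)\, r(k) \right], $$

which is the quantity the sample averages in (48)–(51) approach as the number of measurements grows.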

V. RESULTS AND DISCUSSIONS

A. Simulation Results

The numerical simulations of the DAC, FLAC, and RLAC have been performed using MATLAB/SIMULINK®. The DAC, FLAC, and the proposed RLAC have been applied to the TLFM available in the Advanced Robotics Research Laboratory, National Institute of Technology (NIT), Rourkela. The physical parameters of the studied TLFM are given in Table II, and the controller parameters for the RLAC, FLAC, and DAC are given in Table III.

TABLE II
PHYSICAL PARAMETERS OF THE TLFM

TABLE III
CONTROLLER PARAMETERS FOR THE TLFM

To validate the tip trajectory tracking performance, the desired trajectory vectors for the two joints are chosen as given in (52), where the initial positions of the links, the final positions for link-1 and link-2, and the time taken to reach the final positions (taken as 4 s) appear; the total simulation time is set to 10 s:

(52)

The universes of discourse for the FLAC tip position errors of link-1 and link-2 were chosen in rad, and the universe of discourse for the tip position change in error was bounded at 2. A control torque universe of discourse bounded at 5 Nm was chosen to keep the control input within reasonable limits. The fuzzy rule base taken for the TLFM is a 9 x 9 array. The same reference model is taken for both links. Gains of the discrete PD controller for the RLAC were determined by assuming the manipulator's links to be rigid, i.e., neglecting the link deflections. The gains were obtained from the closed-loop error equation (53), knowing the corresponding parameter values:

(53)


Fig. 6. Tip trajectory tracking errors (link-1).

Fig. 7. Tip trajectory tracking errors (link-2).

where the equivalent inertia of each joint appears. From (53), assuming critical damping, the proportional and derivative gains can be determined as

(54)

where is the link’s natural frequency1) Simulation Results for an Initial Payload of 0.157 Kg:

1) Simulation Results for an Initial Payload of 0.157 kg: Figs. 6 and 7 show the tip trajectory tracking error curves for link-1 and link-2, respectively. From Fig. 6, for link-1, it is seen that there exists a tracking error of 0.4 in the case of the FLAC and 1 in the case of the DAC, whereas the tracking error with the RLAC is almost zero. The link-2 tracking error profiles in Fig. 7 reveal that the tracking errors are 0.45 for both the DAC and the FLAC, whereas the error is almost zero in the case of the RLAC. Thus, the RLAC provides excellent tracking performance. Figs. 8 and 9 show the tip deflection trajectories for link-1 and link-2 carrying 0.157 kg of payload. From these figures, it is seen that the RLAC suppresses the tip deflection faster than the DAC and FLAC, damping it out within 4 s. Figs. 10 and 11 show the control torque profiles generated by the DAC, FLAC, and RLAC for joint-1 and joint-2, respectively. From Figs. 10 and 11, it is seen that the control input generated by the RLAC settles to zero, unlike those of the DAC and FLAC, for link-1 and link-2 once the desired tip position is tracked. Thus, the RLAC needs less control excitation for handling a payload of 0.157 kg compared to the DAC and FLAC.

Fig. 8. Comparison of link-1 tip deflection.

Fig. 9. Comparison of link-2 tip deflection.

Fig. 10. Torque profiles (joint-1).


Fig. 11. Torque profiles (joint-2).

Fig. 12. Tip trajectory tracking errors (Link-1).

2) Simulation Results for an Additional Payload of 0.3 kg: An additional payload of 0.3 kg is now attached to the existing initial payload of 0.157 kg, making the overall payload 0.457 kg. The performances of the three controllers (RLAC, FLAC, and DAC) for the 0.457 kg payload are compared in Figs. 12–18. Figs. 12 and 13 depict the tip trajectory tracking performance for link-1 and link-2. From Fig. 12, it can be seen that the time evolution of the error trajectory obtained with the DAC yields the maximum overshoot compared to the FLAC and RLAC. Fig. 13 shows that the FLAC yields the maximum overshoot compared to the DAC and RLAC. The tip deflection suppression performances of the RLAC, FLAC, and DAC are compared in Figs. 14 and 15 for link-1 and link-2, respectively. From Fig. 14, it is seen that the link-1 tip deflection is largest in the case of the DAC compared to the FLAC and RLAC when a payload of 0.457 kg is attached. From Fig. 15, it is seen that the link-2 tip deflection trajectory is more oscillatory when carrying a 0.457 kg payload in the case of the DAC compared to the FLAC and RLAC.

The joint torque signals generated by the DAC, FLAC, and RLAC are compared in Figs. 16 and 17. The adaptation of the actor and critic weights for the RLAC carrying a payload of 0.457 kg, obtained using the simulation model, is shown in Fig. 18. The results show that as the learning progresses, the updated critic weights converge to their optimal values.

Fig. 13. Tip trajectory tracking errors (link-2).

Fig. 14. Comparison of link-1 tip deflection.

Fig. 15. Comparison of link-2 tip deflection.


Fig. 16. Torque profiles (joint-1).

Fig. 17. Torque profiles (joint-2).

Fig. 18. Simulation results for adaptation of the actor and critic weights to op-timal values.

B. Experimental Setup

The experimental setup used to implement the proposed RLAC and the other two controllers for comparison is shown in Fig. 19. The setup has two links, two joints, and an end effector to carry the variable payload. The two joints are actuated by two

Fig. 19. Photograph of the experimental setup.

Fig. 20. Tip trajectory tracking errors (link-1).

DC servo motors rated at 42 W and driven by a DC power supply. The amplified drive signal is provided by a two-channel amplifier package (AMPAQ) power module shown in Fig. 19. The drives for link-1 and link-2 offer zero backlash with gear ratios of 100:1 and 50:1, respectively. Digital encoders are used to measure the angular speed and position, and two strain gauges on each link are used to measure the link deflections. The MATLAB real-time toolbox is used to generate the target-logic C code, and a Core 2 Duo processor running Real-Time Windows Target executes the compiled code in real time.

Analog-to-digital and digital-to-analog signals are processed using a built-in hardware-in-the-loop (HIL) board. All three controllers (RLAC, FLAC, and DAC) have been implemented using MATLAB/SIMULINK®.

C. Experimental Results

1) Experimental Results for an Initial Payload of 0.157 kg: Figs. 20–25 show a comparison of the experimental results for the TLFM obtained by employing the RLAC, FLAC, and DAC with an initial payload of 0.157 kg. Figs. 20 and 21 compare the tip trajectory tracking: after 4 s, when the tip attains the final position, the steady-state error is almost zero in the case of the RLAC for link-1 and link-2, whereas the DAC and FLAC yield steady-state errors of 0.1 and 0.2 for link-1 and link-2, respectively. Figs. 22 and 23 show the tip deflection trajectories for link-1 and link-2 when loaded with a 0.457 kg payload. From Fig. 22, it can be seen that the RLAC yields 0.1 m of initial deviation, as compared to the FLAC and DAC, for which the deflections are 0.16 and 0.18 m for link-1. The link-2 tip deflection characteristics are shown in Fig. 23, from which it is seen that the RLAC has 0.15 m of initial deviation, as compared to 0.18 and 0.22 m of initial deviation for the FLAC and DAC, respectively.

Fig. 21. Tip trajectory tracking errors (link-2).

Fig. 22. Comparison of link-1 tip deflection.

Fig. 23. Comparison of link-2 tip deflection.

Fig. 24. Torque profiles (joint-1).

Fig. 25. Torque profiles (joint-2).

Torque profiles for joint-1 generated by the three controllers are shown in Fig. 24, and those for joint-2 are shown in Fig. 25. The joint torque control input for link-1 obtained with the DAC reaches a maximum value (9 Nm) at 2 s and, when the tip reaches the final position at 4 s, reduces to 5 Nm. In the case of the FLAC, the control input reaches a maximum value of 2 Nm at 2 s (0.5 Nm for the RLAC), and the torque becomes zero when the tip reaches the final position at 4 s. From Figs. 24 and 25, the joint control torque signals generated by the DAC, FLAC, and RLAC for link-2 have maxima of 12, 10, and 2.5 Nm, respectively.

2) Experimental Results for an Additional Payload of 0.3 kg: An additional payload of 0.3 kg is added to the initial payload of 0.157 kg. Figs. 26–32 show a comparison of the experimental results for the TLFM obtained by employing the RLAC, FLAC, and DAC with a payload of 0.457 kg. Figs. 26 and 27 compare the tip trajectory tracking performances for link-1 and link-2, respectively.

From Figs. 26 and 27, it is clear that when the final position is attained, the steady-state error in the case of the RLAC is almost zero, while a finite steady-state error exists in the case of both the DAC and FLAC. The TLFM is an infinite-dimensional system due to distributed link flexure; since the higher modes have been neglected in the modeling, there is a difference in steady-state error between the simulation and experimental results. Figs. 28 and 29 show the tip deflection trajectories for link-1 and link-2 when loaded with a payload of 0.457 kg. From Fig. 28, it is seen that in the case of the RLAC there exists an initial deviation of 0.2 m, as compared to the FLAC and DAC, whose deflections are 0.2 and 0.25 m, respectively, for link-1. The link-2 tip deflection responses are shown in Fig. 29: the RLAC has 0.3 m of initial deviation, as compared to the FLAC and DAC, whose initial deviations are 0.28 and 0.32 m, respectively.

Fig. 26. Tip trajectory tracking errors (link-1).

Fig. 27. Tip trajectory tracking errors (link-2).

Fig. 28. Comparison of link-1 tip deflection.

Fig. 29. Comparison of link-2 tip deflection.

Fig. 30. Torque profiles (joint-1).

The torque profiles generated for joint-1 by the three controllers are shown in Fig. 30. From this figure, it is seen that the DAC torque signal reaches a maximum value of 9.5 Nm and reduces to 5 Nm at 4 s when the final position is tracked. The FLAC torque signal reaches its maximum (2.5 Nm) at 2 s and reduces almost to zero when the final position is tracked.

The FLAC torque signal reaches a maximum value of 9 Nm at 1.5 s and reduces to 2 Nm at 4 s, whereas the RLAC generates appropriate control torques that fall to zero at the final position.

The RLAC, in contrast, generates a control torque signal with less amplitude initially and zero magnitude once the desired position has been tracked. From Fig. 31, the torque profiles generated for joint-2, it is seen that the DAC torque signal reaches a maximum value of 15 Nm at 1 s and reduces to 6 Nm at 4 s; a maximum value of 2 Nm is reached at 1.5 s, with an almost zero value at the final position. The experimental values obtained while updating the actor and critic weights under a payload of 0.457 kg are shown in Fig. 32. The results show that, despite changes in payload, the critic weights converge to their optimal values. However, there is a difference between the critic weights obtained in experiment and in simulation. This is because of approximations in the modeling of the TLFM.

Fig. 31. Torque profiles (joint-2).

Fig. 32. Experimental results for adaptation of the actor and critic weights to optimal values.

VI. CONCLUSION

This paper has proposed a new RL-based real-time adaptive controller for tip trajectory tracking and tip deflection suppression for a TLFM handling a variable payload. The proposed RLAC provides better tracking and tip deflection damping performance compared to both a nonlinear DAC and a FLAC. The reason for the superior performance exhibited by the RLAC is the integration of optimality with adaptivity, i.e., its ability to adapt the actor and critic weights to their optimal values using the embedded Recursive Least Squares-Eligibility Trace-Adaptive Memory feature, whereas the DAC and FLAC are only adaptive controllers. The proposed RLAC has been applied successfully to a laboratory

flexible robot setup. The RLAC has exhibited excellent performance in real-time control of this manipulator, which has distributed flexure along its links. Thus, it is expected that the RLAC will be useful in similar control applications, such as space manipulators.

ACKNOWLEDGMENT

The authors thank the editor and reviewers for their valued comments in improving the quality of this paper.

REFERENCES

[1] M. O. Tokhi and A. K. M. Azad, Flexible Robot Manipulators: Modeling, Simulation and Control. London, U.K.: IET, 2008.

[2] V. Feliu, K. S. Rattan, and B. H. Brown, "Adaptive control of a single-link flexible manipulator," IEEE Contr. Syst. Mag., vol. 10, no. 2, pp. 29–33, 1990.

[3] M. R. Rokui and K. Khorsani, "Experimental results on discrete time nonlinear adaptive tracking control of a flexible-link manipulator," IEEE Trans. Syst., Man, Cybern., vol. 30, no. 1, pp. 151–164, 2000.

[4] V. G. Moudgal, W. A. Kwong, K. M. Passino, and S. Yurkovich, "Fuzzy learning control for a flexible-link robot," IEEE Trans. Fuzzy Syst., vol. 3, no. 2, pp. 199–210, May 1995.

[5] L. B. Gutierrez, F. L. Lewis, and J. A. Lowe, "Implementation of a neural network tracking control for a single flexible link: Comparison with PD and PID controllers," IEEE Trans. Ind. Electron., vol. 45, no. 2, pp. 307–318, Apr. 1998.

[6] B. Subudhi and A. S. Morris, "Soft computing methods applied to the control of a flexible robot manipulator," Appl. Soft Comput., vol. 9, no. 1, pp. 149–158, 2009.

[7] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.

[8] S. J. Bradtke and A. G. Barto, "Linear least-squares algorithms for temporal difference learning," Mach. Learn., vol. 22, pp. 33–57, 1996.

[9] L. A. Prashanth and S. Bhatnagar, "Reinforcement learning with function approximation for traffic signal control," IEEE Trans. Intell. Transp. Syst., vol. 12, no. 2, pp. 412–421, Jun. 2011.

[10] Q. Yang and S. Jagannathan, "Reinforcement learning control for affine nonlinear discrete-time systems using online approximators," IEEE Trans. Syst., Man, Cybern., Part B: Cybern., pp. 1–14, 2011.

[11] A. de Luca and B. Siciliano, "Closed-form dynamic model of planar multilink lightweight robots," IEEE Trans. Syst., Man, Cybern., vol. 21, no. 4, pp. 826–839, 1991.

[12] J. J. E. Slotine and L. Weiping, "Adaptive manipulator control: A case study," IEEE Trans. Automat. Control, vol. 33, no. 11, pp. 995–1003, Nov. 1988.

Santanu Kumar Pradhan received the B.E. degree in electrical and electronics engineering from Biju Patnaik University of Technology, Rourkela, India, in 2006 and the Master of Technology degree in energy systems from the Indian Institute of Technology (IIT), Roorkee, in 2009. He is currently working towards the Ph.D. degree in electrical engineering at the National Institute of Technology (NIT), Rourkela.

His research interests include adaptive control of flexible robots.

Bidyadhar Subudhi (M'94–SM'08) received the B.S. degree in electrical engineering from the National Institute of Technology (NIT), Rourkela, India, the Master of Technology in control and instrumentation from the Indian Institute of Technology (IIT), Delhi, in 1994, and the Ph.D. degree in control system engineering from the University of Sheffield, Sheffield, U.K., in 2003.

Currently, he is working as a Professor and Head of the Electrical Engineering Department, NIT. His research interests include adaptive control, robotics,

and industrial electronics.