A Reinforcement Learning Fuzzy Controller for the Ball and Plate System

Nima Mohajerin, Mohammad B. Menhaj, Member, IEEE, Ali Doustmohammadi


Abstract—In this paper, a new fuzzy logic controller, namely the Reinforcement Learning Fuzzy Controller (RLFC), is proposed and implemented. Based on fuzzy logic, this online-learning controller is capable of improving its behavior by learning from the experience it gains through interaction with the plant. RLFC is well suited for hardware implementation, with or without a priori knowledge about the plant. To support this claim, a hardware implementation of the Ball and Plate system was built, and RLFC was then developed and applied to it. The obtained results are illustrated in this paper.

Index Terms—Fuzzy Logic Controller, Reinforcement Learning, Ball and Plate system, Balancing Systems, Model-free optimization

I. INTRODUCTION
Balancing systems are among the most popular and challenging test platforms for control engineers. Such systems include the traditional cart-pole system (inverted pendulum), the ball and beam (BnB) system, multiple inverted pendulums, the ball and plate (BnP) system, etc. These systems are promising test-benches for investigating the performance of both model-free and model-based controllers. Considering the more complicated ones (such as multiple inverted pendulums or the BnP), even if one bothers to model them mathematically, the resulting model is likely to be too complicated to be used in a model-based design. One would much prefer to use an implemented version of such a system (if available and not risky) and observe its behavior while the proposed controller is applied to it. This paper is devoted to the efforts made in a project whose main goal is to control a ball over a flat rotary surface (the plate), mimicking a human's behavior when controlling the same plant, i.e. the BnP system. The proposed controller should neither depend on any physical characteristics of the BnP system nor be supervised by an expert. It should learn an optimal behavior from its experience interacting with the BnP system and improve its action-generation strategy; however, some prior knowledge about the overall behavior of the BnP system may also be available and can be used to reduce the time needed to reach the goal.

Manuscript received February 5, 2010. Nima Mohajerin is with the School of Science and Technology of Örebro University, Örebro, Sweden (e-mail: [email protected]).
Mohammad B. Menhaj is with the Electrical Engineering Department of Amirkabir University of Technology (e-mail: [email protected]).
Ali Doustmohammadi is with the Electrical Engineering Department of Amirkabir University of Technology (e-mail: [email protected]).

The few published papers on the BnP system are mainly devoted to achieving the defined goals for the BnP itself rather than to how those goals are achieved [3, 4, 5, 8, 9]. They can be categorized into two main groups: those based on mathematical modeling (with or without hardware implementation), and those proposing model-free controllers. Since simplification of the mathematical model of the BnP system yields two separate BnB systems, the first category is goal-oriented and of no interest for the current project [4, 5]. On the other hand, the hardware apparatus used in the second category is the CE151 [6] (or, in rare cases, another apparatus [9]), all of which use image feedback for ball position sensing. However, among them, [5] is devoted to a mechatronic design of the BnP system controlled by a classic model-based controller benefiting from touch-sensor feedback, while [3, 4] used the CE151 apparatus. Note that the image feedback is a time bottleneck, which will be discussed in Section III. In [3], a fuzzy logic controller (FLC) is designed which learns online from a conventional controller. Although the work in [4] is done through mathematical modeling and applied to the CE151 apparatus, it is of more interest because it tackles the problem of trajectory planning (to be stated in Section III). Reports [8] and [9] focus on motion planning and control, though they are less interesting for us.

To achieve the desired goal, a different approach is demonstrated in this paper. This approach is based on a fuzzy logic controller that learns on the basis of reinforcement learning. Additionally, a modified version of the BnP apparatus is implemented in this project as a test platform for the proposed controller.

In this paper, the fundamental concepts of RL are embodied into fuzzy logic control methodologies. This leads to a new controller, namely the Reinforcement Learning Fuzzy Controller (RLFC), which is capable of learning from its own experience. Inherited from fuzzy control methodologies, RLFC is a model-free controller and, of course, previous knowledge about the system can be included in RLFC so as to decrease the learning time. However, as will be seen, learning in RLFC is not a phase separate from its normal operation.

This paper is divided into six sections. After this introduction, section II explains RLFC completely, both conceptually and mathematically. In section III, the BnP system is introduced and the hardware specification of the implemented version of this system is outlined. In section IV, the modifications that make RLFC applicable to the BnP system are fully discussed. Section V is dedicated to illustrating and analyzing the results of RLFC performance on the implemented BnP system; in this section, RLFC performance is also compared with that of a human controlling the same plant. Finally, section VI concludes the paper.

II. CONTROLLER DESIGN
This section is dedicated to explaining the idea and mathematics of the proposed controller, i.e. RLFC. First, the behavior of RLFC is outlined conceptually, and then the mathematics is detailed.

A. Outline
According to Fig. 1, RLFC consists of two main blocks, a controller (FLC) and a critic. The task of the controller is to generate and apply actions in each given situation as well as to improve its state-action mapping, while the critic has to evaluate the current state of the plant. Neither the controller nor the critic necessarily knows anything about how the plant responds to actions, how the actions will influence its states, or what the best action is in any given situation. There is also no need for the critic to know how the actions are generated in the controller. The most important responsibility of the critic is to generate an evaluating signal (the reinforcement signal) that best represents the current state of the plant. The reinforcement signal is then fed to the controller.

Once the controller receives a reinforcement signal, it should realize whether its generated action was indeed a good action or a bad one, and in both cases how good or how bad it was. These notions are embodied into a measure named the reinforcement measure, which is a function of the reinforcement signal. According to this measure, the controller then attempts to update its knowledge of generating actions, i.e. it improves its mapping from states to actions. The process of generating actions is separate from the process of updating the knowledge, so they can be regarded as two parallel tasks. This implies that while the learning procedure is a discrete-time process, the controlling procedure can be continuous-time. However, without loss of generality it is assumed that the actions are generated in a discrete-time manner, and that each action is generated after the reinforcement signal describing the consequences of the previously generated action has been reported and the parameters have been updated. The dashed line in Fig. 1 indicates the controller's awareness of its generated actions. Although the controller is apparently aware of its generated actions, in a hardware implementation, inaccuracies in the mechanical structure and electronic devices and other unknown disturbances may impose distortions on the generated actions.
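To make the interaction concrete, the following Python sketch outlines the controller-critic loop just described. It is only an illustration of the data flow; the class and method names (RLFCController, Critic, plant.observe, and so on) are placeholders rather than an API defined in the paper.

```python
# Minimal sketch of the controller-critic interaction described above.
# All names here are illustrative placeholders, not the paper's implementation.

class Critic:
    def evaluate(self, error):
        """Map the plant error signal to a scalar reinforcement signal r(k)."""
        raise NotImplementedError

class RLFCController:
    def act(self, state):
        """Fuzzy inference: map the observed plant state to an action."""
        raise NotImplementedError

    def update(self, reinforcement):
        """Reward/punish the consequent choices of the previously fired rules."""
        raise NotImplementedError

def control_loop(plant, controller, critic, iterations):
    for k in range(iterations):
        r = critic.evaluate(plant.error())   # evaluates the effect of action y(k-1)
        controller.update(r)                 # update rule parameters before acting
        state = plant.observe()              # ball position/velocity, plate angles
        plant.apply(controller.act(state))   # generate and apply action y(k)
```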

B. The Controller
The aforementioned concept is general enough to be applicable to any fuzzy controller scheme; however, we assume that the fuzzy IF-THEN rules and the controller structure are of Mamdani type [7]. Imagine that the input to the fuzzy inference engine (FIS) is an n-element vector, each element of which is a fuzzy number produced by the fuzzification block [11]:

$$\mathbf{x} = \left[x_1 \;\; x_2 \;\; \dots \;\; x_n\right]^T \qquad (1)$$

Assume that in the universe of discourse of input variable $x_i$, i.e. $U_i$, a number $n_i$ of term sets are defined. The lth fuzzy rule is* (ignoring the consequence part for now):

$$\text{IF } x_1 \text{ is } A_1^{l_1} \text{ AND } x_2 \text{ is } A_2^{l_2} \text{ AND } \dots \text{ AND } x_n \text{ is } A_n^{l_n} \text{ THEN } y \text{ is } B^l \qquad (2)$$

where $1 \le l_i \le n_i$, and $A_i^{l_i}$ is the $l_i$th fuzzy set defined on the universe of discourse of $x_i$ ($1 \le i \le n$). The output variable is y, and there are M fuzzy sets ($M \in \mathbb{N}$) defined on the universe of discourse of y; V denotes the universe of discourse of y. Note that (2) is a fuzzy relation defined on $U \times V$ [11], where $U = U_1 \times U_2 \times \dots \times U_n$. Note also that in the real world all of the above variables express physical quantities, so all the corresponding universes of discourse are bounded and can be covered by a limited number of fuzzy sets.

For hardware implementation, what matters first is the processing speed of the controller. In other words, we have to establish a fuzzy controller architecture that offers an optimal trade-off between performance and complexity. For this reason, we propose the following elements for the FLC structure: singleton fuzzifier, product inference engine and center-average defuzzifier [11]. In this case, given an input vector $\mathbf{x}^*$, the output of the controller is given by:

$$y = f(\mathbf{x}^*) = \frac{\sum_{l=1}^{L} \bar{y}^{\,l} \prod_{i=1}^{n} A_i^{l_i}(x_i^*)}{\sum_{l=1}^{L} \prod_{i=1}^{n} A_i^{l_i}(x_i^*)} \qquad (3)$$

where $\bar{y}^{\,l}$ is the center of the normal fuzzy set $B^l$. However, as mentioned earlier, other FLC structures may be considered. Apparently, only those rules with a non-zero premise (IF part), i.e. the fired rules, participate in generating y. This fact does not depend on the FLC structure.
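As a concrete illustration of (3) restricted to the fired rules, the following Python sketch computes the controller output with the singleton fuzzifier, product inference engine and center-average defuzzifier proposed above. The rule storage format (a `premise_mfs` list of membership functions and a `y_bar` consequent center per rule) is an assumption made for the example, not the paper's data structure.

```python
def flc_output(fired_rules, x_star):
    """
    Center-average defuzzification of eq. (3), evaluated only over the fired
    rules. Each rule object is assumed (illustratively) to carry `premise_mfs`,
    a list of membership functions, and `y_bar`, the center of its consequent
    fuzzy set B^l.
    """
    numerator, denominator = 0.0, 0.0
    for rule in fired_rules:
        # product inference engine: firing strength is the product of the
        # premise membership values at the crisp (singleton-fuzzified) input
        strength = 1.0
        for mf, xi in zip(rule.premise_mfs, x_star):
            strength *= mf(xi)
        numerator += rule.y_bar * strength
        denominator += strength
    return numerator / denominator if denominator > 0 else 0.0
```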

At the design stage, the controller does not know which states it will observe; in other words, the designer can hardly know which rules are useful to include in the fuzzy rule base. Thus, all rules whose premises are made of all possible combinations of the input variables, using the AND operator, are included in the fuzzy rule base. The number of these rules is:

$$L = \prod_{i=1}^{n} n_i \qquad (4)$$

* Please note that superscripts are not powers, unless mentioned to be so.

Fig. 1. A typical application of the RLFC.


Obviously, L grows exponentially with respect to the number of variables and defined term sets, and consequently the processing time drastically increases. To solve this, we assume that for any given value of each variable, there are at most two term sets with non-zero membership values. This condition, which will be referred to as the covering condition, is necessary; if it holds, then the number of fuzzy rules contributing to the actual output, i.e. the fired rules, is at most $2^n$. Notably, to reduce the time needed for discovering the fired rules, we implement a set of conditional statements rather than an exhaustive search among all of the rules.

C. The Decision Making
As previously mentioned, the controller is discrete-time. So, in each iteration, as the controller observes the plant state, it distinguishes the fired rules. From then on, the L-rule FLC shrinks to a $2^n$-rule FLC, where $2^n \ll L$. The key point in generating the output, i.e. decision making, is how the consequences (THEN parts) of these fired rules are selected. Noting (2), the term set assigned to the consequence of the lth rule is $B^l$. It was also mentioned that there are M term sets defined on the universe of discourse of the output variable. Assume that these term sets are referred to as $W_i$, where $i = 1, 2, \dots, M$. Having fired the lth rule in the kth iteration, the probability of choosing $W_i$ for its consequence is:

$$P_k^l(j=i) \qquad (5)$$

where j is a random variable with an unknown discrete distribution over the indices. The aim of the reinforcement learning algorithm, which is discussed in the next sub-section, is to learn this probability for each rule such that the overall performance of the controller is (nearly) optimized.

Our objective in this section is to define $P_k^l$ sensibly, so that it is well suited both to applying the reward/punishment procedure and to software programming. To fulfill these objectives, for each rule a bounded sub-space of $\mathbb{R}$ (the set of real numbers) is chosen. Factors for how this one-dimensional sub-space should be chosen are discussed in section IV. Let the sub-space related to the lth rule be:

$$\Omega^l = \left[a_0^l,\; a_M^l\right) \qquad (6)$$

This distance is divided into M sub-distances as shown by (8-a), each of which is assigned to an index $i$, $i = 1, 2, \dots, M$. We have:

$$\Omega^l = \bigcup_{r=1}^{M} \omega_r^l \qquad (7)$$

$$\begin{aligned}
&\text{a)}\;\; \omega_r^l = \left[a_{r-1}^l,\; a_r^l\right), \quad r = 1, \dots, M \\
&\text{b)}\;\; a_0^l \le a_s^l \le a_{s+1}^l \le a_M^l, \quad s = 1, \dots, M-2 \\
&\text{c)}\;\; \forall\, l \in \{1, 2, \dots, L\} \text{ and } p, q \in \{1, 2, \dots, M\},\; p \ne q:\;\; \omega_p^l \cap \omega_q^l = \emptyset
\end{aligned} \qquad (8)$$

We calculate the probability $P^l$ by (9):

$$P^l(j=i) = \frac{\left|\omega_i^l\right|}{\left|\Omega^l\right|} \qquad (9)$$

where $\left|\omega_r^l\right|$ represents the numeric length of the sub-distance $\omega_r^l$ and is calculated by (10):

$$\left|\omega_r^l\right| = a_r^l - a_{r-1}^l \qquad (10)$$

Until now, the iteration counter k has been omitted. However, since the reinforcement learning procedure is performed by tuning the above parameters, they are all functions of k. Thus, (9) turns into:

$$P_k^l(j=i) = \frac{\left|\omega_i^l(k)\right|}{\left|\Omega^l(k)\right|} \qquad (11)$$

or:

$$P_k^l(j=i) = \frac{a_i^l(k) - a_{i-1}^l(k)}{a_M^l(k) - a_0^l(k)} \qquad (12)$$

By observing (7), (8) and (11), it is obvious that $P_k^l(j=i)$ is a probability function and satisfies the necessary axioms.
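A small sketch of the decision-making step described in this subsection is given below, assuming the boundaries $a_0^l, \dots, a_M^l$ of one rule are stored in a Python list. A uniform random number is drawn over $\Omega^l$ and the index of the sub-distance containing it is returned, so the chance of picking index i equals (12); the helper names are illustrative.

```python
import random
from bisect import bisect_right

def choose_consequent(a):
    """
    Pick a consequent index for one fired rule from its partition
    Omega^l = [a_0, a_M), split into sub-distances [a_{i-1}, a_i) as in (6)-(8).
    `a` is the boundary list [a_0, a_1, ..., a_M].
    """
    j = random.uniform(a[0], a[-1])   # uniform draw over Omega^l
    return bisect_right(a, j)         # index i of the sub-distance containing j

def consequent_probability(a, i):
    """P_k^l(j = i) as in eq. (12)."""
    return (a[i] - a[i - 1]) / (a[-1] - a[0])
```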

D. Reinforcement Learning Algorithm
In this sub-section, the proposed algorithm for tuning the above-defined parameters is presented. This algorithm is based on reinforcement learning methods and satisfies the six axioms mentioned by Barto in [12]. Let $r(k)$ be the reinforcement signal generated by the critic module in the kth iteration. Note that it represents the effect on the plant of the previously generated action of the FLC, i.e. $y(k-1)$, and that before generating the kth action the parameters of the related fired rules should be updated. In other words, this scalar can represent the change in the current state of the plant.

To be more specific, as a general assumption, imagine that a smaller reinforcement signal represents a better state of the plant. The change in the current state of the plant is then stated by (13); an improvement in the plant situation is indicated by $\Delta r(k) > 0$, while $\Delta r(k) < 0$ indicates that the plant situation has worsened.

$$\Delta r(k) = r(k-1) - r(k) \qquad (13)$$

However, since (13) is based solely on the output of the critic, it does not contain information about which rules have been fired, which term sets have been chosen for generating $y(k-1)$, etc. Thus, $\Delta r(k)$ is not immediately applicable for updating the parameters. The mentioned reward/punishment updating procedure means that if the generated action resulted in an improvement of the plant state, the probability of choosing the same term set for the consequence of each corresponding fired rule should be increased; if the action caused the plant state to worsen, these probabilities should be decreased.

$\Delta r(k)$ will be mathematically manipulated in order to shape the reinforcement measure. This measure is the amount by which the mentioned probabilities, (11), are affected.

As a first step, a simple modification is applied to $\Delta r(k)$. This step may be skipped if $\Delta r(k)$ is already a suitable representation of the change in the system state; a comprehensive example will illustrate this case in section IV. The manipulation is done by $f(\cdot)$, noting that $f: \mathbb{R} \to \mathbb{R}$:

$$\Delta r'(k) = f(\Delta r(k)) \qquad (14)$$

To modify the probabilities defined by (11), the corresponding sub-distances (8) are changed. The amount of change in the sub-distances relating to the lth rule is defined by $\vartheta_i^l(k)$ in (15):

$$\vartheta_i^l(k) = g \cdot \varepsilon(l, i) \cdot \alpha(l) \cdot \Delta r'(k) \qquad (15)$$

Regarding (15), $g$ is a gain and scaling factor, $\varepsilon(l,i)$ represents the exploration/exploitation policy [1], and $\alpha(l)$ is the firing rate of the lth rule, obtained by substituting $\mathbf{x}$ into the membership function formed by the premise part of the lth rule. Note that this factor expresses the contribution of this rule to generating the output.

There are a variety of exploration/exploitation policies [1, 2, 12]; however, here we propose a simple one:

$$\varepsilon(l, i) = 1 - e^{-n_i^l(k)/\theta} \qquad (16)$$

where $n_i^l(k)$ counts how many times $W_i$ has been chosen for the lth rule and $\theta$ is a scaling factor. Considering a particular rule, as $W_i$ is chosen more often for this rule, $\varepsilon(l,i)$ grows exponentially towards one, letting $\vartheta_i^l(k) \to g \cdot \alpha(l) \cdot \Delta r'(k)$.

The reinforcement measure introduced in subsection II.A is given by (17).

$$\Delta\omega_i^l(k) = \begin{cases} \vartheta_i^l(k), & \Delta r(k) \ge 0 \\[2pt] \max\left\{\vartheta_i^l(k),\; -\left|\omega_i^l(k)\right|\right\}, & \Delta r(k) < 0 \end{cases} \qquad (17)$$

In (17), the max operator is used in the case of $\Delta r(k) < 0$. This is to avoid large reinforcement of those sub-distances that have not been chosen. The reason becomes clearer if (18) is studied. Equation (18) depicts the updating rule:

$$a_q^l(k) = \begin{cases} a_q^l(k-1), & q < i \\[2pt] \max\left\{a_{q-1}^l(k),\; a_q^l(k-1) + \Delta\omega_i^l(k)\right\}, & q \ge i \end{cases} \qquad (18)$$

where $q = 0, 1, \dots, M$. Regarding (18), there are several points to be noted:
1- $i$ is the index of the term set chosen for the consequence part of the lth fired rule, i.e. $B^l = W_i$.
2- $q$ is a counter which starts from $i$ and ends at $M$. There is no need to update those parameters which are not modified, so $q$ may start from $i$; this indeed reduces the processing time.
3- There are M parameters for each rule in the rule base, but only the parameters of fired rules are modified. Hence there are at most $M \cdot 2^n$ modifications per iteration.
4- By (18) it should be understood that only the lengths of the subspaces $\omega_i^l$ and $\Omega^l$ are modified. Although the subspaces $\omega_q^l$ for $q = i+1, \dots, M$ are shifted, their lengths remain unchanged.
5- The modification of $\Omega^l$ always lets other subspaces (and hence other indices) be chosen. As the system learns more, a dominant subspace will be found, but the lengths of un-chosen subspaces remain non-zero as long as choosing them has not been observed to worsen the result. This feature is useful in the case of slightly changing plants.
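The reward/punishment update of (17)-(18) for a single fired rule can be sketched as follows; the boundary list `a` and the argument names are assumptions made for the example, and the reinforcement measure $\vartheta_i^l(k)$ is taken as already computed by (15).

```python
def update_rule_boundaries(a, i, vartheta_i, delta_r):
    """
    Apply eqs. (17)-(18) to the boundary list a = [a_0, ..., a_M] of one fired
    rule whose chosen consequent index is i.  vartheta_i is the reinforcement
    measure of eq. (15) and delta_r the change in the reinforcement signal (13).
    """
    # eq. (17): when punishing, never shrink the chosen sub-distance below zero length
    if delta_r >= 0:
        delta_omega = vartheta_i
    else:
        delta_omega = max(vartheta_i, -(a[i] - a[i - 1]))

    # eq. (18): boundaries a_i ... a_M are shifted; boundaries before a_i stay put.
    # The max keeps the ordering a_{q-1} <= a_q required by (8-b).
    for q in range(i, len(a)):
        a[q] = max(a[q - 1], a[q] + delta_omega)
    return a
```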

Theorem 1. By (18), if a term set receives reward/punishment, then the probability of choosing that term set is increased/decreased.

Proof. a) Reward. Assume that $W_i$ has been chosen for the lth rule and the resulting action has an improving effect. Thus, $P_k^l(j=i)$ should increase. Note that in this case $\Delta r(k) > 0$, and by (13), (14), (15), (16) and (17) it is obvious that $\Delta\omega_i^l(k) > 0$. Using (5) we have:

$$\Delta P_k^l(j=i) = P_k^l(j=i) - P_{k-1}^l(j=i) = \frac{\left|\omega_i^l(k)\right|}{\left|\Omega^l(k)\right|} - \frac{\left|\omega_i^l(k-1)\right|}{\left|\Omega^l(k-1)\right|} \qquad (19)$$

In case of an improvement, $\Delta P_k^l(j=i)$ must be a positive scalar. Using (9), (10), (11) and (12) in (19), we obtain:

$$\Delta P_k^l(j=i) = \frac{a_i^l(k) - a_{i-1}^l(k)}{a_M^l(k)} - \frac{a_i^l(k-1) - a_{i-1}^l(k-1)}{a_M^l(k-1)}$$

According to (18) the above equation yields:

$$\Delta P_k^l(j=i) = \frac{a_i^l + \Delta\omega_i^l(k) - a_{i-1}^l}{a_M^l + \Delta\omega_i^l(k)} - \frac{a_i^l - a_{i-1}^l}{a_M^l}$$

in which (k-1) is omitted on the right-hand side. It can easily be seen that:

$$\Delta P_k^l(j=i) = \frac{\Delta\omega_i^l(k)\left[a_M^l - a_i^l + a_{i-1}^l\right]}{a_M^l\left[a_M^l + \Delta\omega_i^l(k)\right]}$$

This equation, with regard to (8), implies that $\Delta P_k^l(j=i) > 0$.

b) Punishment. In this case $\Delta r(k) < 0$. Hence:

$$\Delta\omega_i^l(k) = \max\left\{\vartheta_i^l(k),\; -\left|\omega_i^l(k)\right|\right\}$$

Since a negative-length sub-distance is undefined, the max operators used in the above equation and in (17) ensure that the update rule (18) does not yield an undefined sub-distance; they are used to satisfy (8-b). In this case, $\Delta P_k^l(j=i)$ in (19) has to be a negative scalar. The procedure is the same as in part a. □

III. BALL AND PLATE SYSTEM
The Ball and Plate system, the aforementioned BnP, is an evolution of the traditional Ball and Beam (BnB) system [13, 14]. It consists of a plate whose slope can be manipulated in two perpendicular directions and a ball on top of it. The behavior of the ball is of interest and can be controlled by tilting the plate. According to this scheme, various structures may be proposed and applied in practice. Usually, image feedback is used to locate the position of the ball. However, due to its lower accuracy and slower sampling rate compared with touch-screen sensors (or simply touch sensors), we opted for a touch sensor.

The hardware structure implemented in this project, as outlined in Fig. 2, consists of five blocks. However, the whole system can be viewed as a single block whose input is a two-element vector, i.e. the target angles, (20). The output of this block is a six-element vector which contains the position and velocity of the ball and the current angles of the plate. Table I lists the related parameters.

$$\mathbf{u} = \left[\alpha_x \;\; \alpha_y\right]^T \qquad (20)$$

In this paper we are interested in the following problem:

Simple command of the ball. The objective is to place the ball at any desired location on the plate surface, starting from an arbitrary initial position.

Before explaining the control system, a summary of the hardware specifications of the BnP implemented for this project is given below. A complete or even brief description of how we built this apparatus is beyond the scope of this paper, but a summary is needed to show that the plant for RLFC is built roughly and contains many inaccuracies, so that a classic controller would definitely be unable to control it. Referring back to Fig. 2, the function of each block is summarized next.

The Actuating Block

• The actuating block consists of high accuracy stepping motors equipped with precise incremental encoders (3600 ppr*) coupled to their shafts plus accurate micro-stepping-enhanced drivers.

• The original step size of the steppers is 0.9 degrees, reducible by the drivers down to 1/200 of a step, i.e. $4.5 \times 10^{-3}$ degrees per step.

• Taking into account the mechanical limitations, the smallest measurable and applicable amount of rotation is 0.1 degrees.

The Sensor Block

• The sensor is a 15-inch touch-screen sensor.
• The sensor output is a message packet sent through RS-232 serial communication at 19200 bps†.
• Thus the fastest sampling rate of the whole sensor block is 384 samples per second. This implies that the maximum available time for decision making is $\frac{1}{384} \approx 2.604 \times 10^{-3}$ seconds.
• The area of the touch sensor surface on which pressure can be sensed is 30.41 × 22.81 cm.
• The sensor resolution is 1900 × 1900 pixels. If the sensor sensitivity is uniformly distributed over its sensitive area, then each pixel is assigned an area approximately equal to 1.8 × 1.2 mm² of the sensor surface.

The Interface Block
The third and main section of the BnP system is its electronic interface. This interface receives commands from an external device on which the controller is implemented and then takes the necessary corresponding actions. Each decision made by the controller algorithm is translated and formed into a message packet, which is then sent to the interface via a typical serial link (RS-232) or another communication platform. The interface then sends the necessary signals to the actuators. In addition to some low-level signal manipulation (such as low-pass filtering of the sensor readings and noise cancelling), upon request from the main controller the interface sends current information, such as ball position and velocity or actuator positions, to the main controller.
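As an illustration of this command/telemetry exchange, the sketch below uses the pyserial package to talk to the interface block over RS-232 at 19200 bps. The packet layout (start/end bytes, field sizes, reply length) is hypothetical, since the paper does not specify the message format.

```python
import serial  # pyserial

# Hypothetical sketch of talking to the interface block over RS-232.
# The real packet format of the implemented BnP interface is not given in the
# paper; the start/end bytes and field sizes below are placeholders.

def open_link(port="/dev/ttyUSB0"):
    return serial.Serial(port, baudrate=19200, timeout=0.01)

def send_target_angles(link, ux_steps, uy_steps):
    """Send the controller decision (plate target angles, in angle steps)."""
    packet = (b"\x02"
              + ux_steps.to_bytes(2, "big", signed=True)
              + uy_steps.to_bytes(2, "big", signed=True)
              + b"\x03")
    link.write(packet)

def request_state(link):
    """Ask the interface for ball position/velocity and actuator positions."""
    link.write(b"\x05")      # placeholder 'state request' byte
    return link.read(12)     # placeholder reply length
```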

* ppr: pulses per rotation. † bps: bits per second; a measuring unit for serial communication.

Fig. 2. Hardware structure of the implemented BnP system. The dashed square separates the electronics section from the mechanical parts.

TABLE I
PARAMETERS OF THE BALL AND PLATE SYSTEM AND THEIR UNITS

Symbol        Parameter                                      Unit
(x_d, y_d)    Target position of the ball                    Pixel
(x, y)        Current position of the ball                   Pixel
(v_x, v_y)    Current velocity of the ball                   Pixel per second
(α_x, α_y)    Current angles of the plate                    Angle step
α_x           Plate angle with regard to the x axis          Angle step
α_y           Plate angle with regard to the y axis          Angle step
(u_x, u_y)    Control signal                                 Angle step
u_x           Plate target angle with regard to the x axis   Angle step
u_y           Plate target angle with regard to the y axis   Angle step


IV. MODIFICATION OF RLFC FOR BALL AND PLATE
In section II, RLFC was discussed completely. In this section, the modifications necessary to make it applicable to controlling the implemented BnP system for the problem stated in the previous section are expressed.

A. Primary Design Stage
Fig. 1 depicts the control architecture as well as the signal flows. With regard to Table I, the illustrated signals are explained next. Control signal vector:

$$\mathbf{u} = \left[u_x \;\; u_y\right]^T \qquad (21)$$

Plant state vector:

$$\mathbf{s} = \left[x \;\; y \;\; u_x \;\; u_y\right]^T \qquad (22)$$

Plant target state vector:

$$\mathbf{s}_d = \left[x_d \;\; y_d \;\; 0 \;\; 0\right]^T \qquad (23)$$

Error vector:

$$\mathbf{e} = \left[e_x \;\; e_y \;\; e_{v_x} \;\; e_{v_y}\right]^T \qquad (24)$$

where $e_x = x_d - x$ and $e_y = y_d - y$, and since we want to make the ball steady, $e_{v_x} = v_x$ and $e_{v_y} = v_y$.

In Fig. 1 it is obvious that there are six input variables to the controller section of RLFC. Let us arrange them in the vector $\mathbf{x}$ as written in (25):

$$\mathbf{x} = \left[e_x \;\; e_y \;\; v_x \;\; v_y \;\; \alpha_x \;\; \alpha_y\right]^T \qquad (25)$$

On the universe of discourse of each input variable, a specific number of term sets is defined. Let these numbers be $n_x$, $n_y$, $n_{v_x}$, $n_{v_y}$, $n_{\alpha_x}$ and $n_{\alpha_y}$. According to (4), the fuzzy rule base contains L rules, where $L = n_x \cdot n_y \cdot n_{v_x} \cdot n_{v_y} \cdot n_{\alpha_x} \cdot n_{\alpha_y}$. For our specific implementation, the following quantities are assigned to these variables:

$$n_x = n_y = 7, \quad n_{v_x} = n_{v_y} = 5, \quad n_{\alpha_x} = n_{\alpha_y} = 2 \;\;\Rightarrow\;\; L = 4900$$

In Fig. 3, the corresponding term sets are presented graphically. The number, shape and distribution of the term sets are chosen based on logical sense and practical experience; there is no reason the system should not perform well if these selections are changed.

Assuming that the covering condition is satisfied, as is the case in Fig. 3, in each iteration there are at most $2^6 = 64$ fired rules. However, if an exhaustive search were done among the 4900 rules to discover the fired ones, i.e. calculating each premise membership value and checking whether it is zero, the processing time would grow out of tolerable bounds. Instead, since we know the exact location of each term set, if we locate each measured input value on its corresponding universe of discourse, the term sets with non-zero membership value are discovered. Doing this for all six input variables leaves at most two discovered term sets for each of them, and the combinations of these term sets using the AND operator directly show which rules are fired. The described locating procedure can be coded in a programming language using a set of conditional statements. If there are n defined term sets on a particular universe of discourse, then (n+1) conditional statements are needed in order to discover the fired rules. Hence, instead of L arithmetic calculations, we are faced with S logical comparisons, where:

$$S = \sum_{i=1}^{n} (n_i + 1) \qquad (26)$$

For our case study, S = 32. There is no special constraint on the distribution of term sets on the universes of discourse of the output variables. There are M = 51 triangular term sets, uniformly distributed over each universe of discourse. Fig. 4 illustrates the location of these term sets.
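The locating procedure described above can be sketched as follows. It assumes, for illustration only, that each input variable stores the sorted peak positions of its triangular term sets, so that under the covering condition at most two sets are active per variable and the fired rules are simply the Cartesian product of the active sets.

```python
from bisect import bisect_left
from itertools import product

def active_term_sets(value, peaks):
    """
    Indices of the (at most two) term sets with non-zero membership for one
    input variable, assuming the covering condition holds and that `peaks`
    holds the sorted peak positions of its triangular term sets (an
    illustrative storage format, not the paper's).
    """
    pos = bisect_left(peaks, value)
    if pos == 0:
        return [0]                      # left of the first peak
    if pos >= len(peaks):
        return [len(peaks) - 1]         # right of the last peak
    return [pos - 1, pos]               # between two neighbouring peaks

def fired_rules(x, peaks_per_input):
    """Cartesian product of active term sets over all inputs: at most 2^n rules."""
    actives = [active_term_sets(v, p) for v, p in zip(x, peaks_per_input)]
    return list(product(*actives))
```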

B. Critic: Generation of the Reinforcement Signal
Until now, the reinforcement signal has only been defined as an evaluation of the current system situation generated by the critic block. No more precise definition could be given so far, since the nature of this signal directly depends on the nature of the plant to be controlled. Referring back to Fig. 1, it is obvious that the input to the critic is the error signal (24). According to this figure, the reinforcement signal is a function of this vector:

$$r = g(\mathbf{e}) = g(e_x, e_y, v_x, v_y) \qquad (27)$$

Apparently, the critic is defined by $g(\cdot)$ and is the designer's choice. The only necessary condition is that this function should represent the current state of the plant as well as possible. A proposed general form of this function is depicted in (28).

Fig. 3. The defined term sets on the universes of discourse of (a) $e_x$ and $e_y$, (b) $v_x$ and $v_y$, (c) $\alpha_x$ and $\alpha_y$. For the respective units refer to Table I.

Fig. 4. The defined term sets on the universes of discourse of the output variables.

$$r = c_x\left(e_x^2 + c_v v_x^2\right) + c_y\left(e_y^2 + c_v v_y^2\right) \qquad (28)$$

Equation (28) can be regarded as a revised version of an LMS criterion; note that the aim of RLFC is to minimize (28). Three coefficients are defined in (28), explained next. $c_v$ is the balancing coefficient between the velocity of the ball and its position error; as $c_v$ increases, the controller gives more credit to stabilizing the ball than to guiding it to the desired location. $c_x$ and $c_y$ represent the mutual interaction between the two actuators. This interaction comes from the inevitable imprecision of the mechanical structure of the BnP; because of it, the motion of the ball in each direction is not only a function of the corresponding plate angle. The exact values of $c_x$ and $c_y$ are part of the mechanical specifications; they can be chosen and then tuned experimentally, or learned.
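A direct transcription of (28) is shown below; the default coefficient values are placeholders, since the paper leaves $c_x$, $c_y$ and $c_v$ to be tuned experimentally or learned.

```python
def critic(e_x, e_y, v_x, v_y, c_x=1.0, c_y=1.0, c_v=0.5):
    """Reinforcement signal r of eq. (28); coefficient values are placeholders."""
    return c_x * (e_x ** 2 + c_v * v_x ** 2) + c_y * (e_y ** 2 + c_v * v_y ** 2)
```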

C. Reinforcement Measure
Having proposed the reinforcement signal, we seek a suitable function to produce $\Delta\omega_i^l(k)$ according to (17). Equation (29) is a proposed function for (14):

$$\Delta r'(k) = f(\Delta r(k)) = \beta\,\frac{\Delta r(k)}{r_{\max}} + \lambda\,\frac{\operatorname{sgn}(\Delta r(k))}{r(k)} \qquad (29)$$

This function consists of two terms: the first scales the pure reinforcement signal received from the critic, and the second tunes the learning sensitivity when the plant is around the target state; actions receive more reward/punishment as they affect the plant state when it is around the desired target state. Note that $r_{\max}$ can easily be calculated using (28).

According to (29) and (15), the reinforcement measure to be substituted into (17) is given by (30):

$$\vartheta_i^l(k) = \operatorname{sgn}(\Delta r(k))\, g \left(1 - e^{-n_i^l(k)/\theta}\right)\left(\beta\,\frac{\left|\Delta r(k)\right|}{r_{\max}} + \frac{\lambda}{r(k)}\right) \qquad (30)$$

D. Adding a Priori Knowledge
From a very general point of view, the proposed algorithm is a search in the space of possible actions. However, it is possible to add a priori knowledge in order to increase the learning speed. To describe this, it helps to explain how the random selection of the output term sets takes place. Digital processors can produce uniformly distributed random numbers, and this is also used in RLFC. First, a random number is generated by the processor, and then it is checked to which sub-distance (note equation (8)) it belongs; the index of that sub-distance is then the index of the chosen term set. Let the randomly generated number for the lth rule be $j^l$. Equation (31) must hold:

$$j^l \in \Omega^l \qquad (31)$$

With this concept, adding a priori knowledge in terms of modifying the bounds on $j^l$ is a simple procedure. For a typical example regarding our implemented RLFC, observe the following rule form:

$$\text{IF } \left\{ e_x \text{ is } A_1^{1} \text{ AND } e_y \text{ is whatever AND } v_x \text{ is } A_3^{2} \text{ AND } v_y \text{ is whatever AND } \alpha_x \text{ is whatever AND } \alpha_y \text{ is whatever} \right\} \text{ THEN } u_x \text{ is } B^{j^l}$$

Apparently, this applies to a set of rules in which the first and third conditions are fixed as mentioned. Referring back to the term sets depicted in Fig. 3 and Fig. 4, the sensible choice for this consequence is a large deviation of $\alpha_x$ in the positive direction of the Cartesian coordinate system. Thus:

$$j^l \in [35, 50] \qquad (32)$$

This implies that the insensible choices are ignored for the mentioned rules.
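A sketch of how such prior knowledge can be imposed on the random selection is given below: the uniform draw of $j^l$ is simply confined to the quoted bounds instead of the whole of $\Omega^l$. The function reuses the boundary-list representation assumed earlier and is illustrative only.

```python
import random
from bisect import bisect_right

def choose_consequent_with_prior(a, j_low=35.0, j_high=50.0):
    """
    Random consequent selection for one rule, with the uniformly drawn number
    j^l confined to the a priori bounds of (32) instead of all of Omega^l.
    The default bounds are the ones quoted for the alpha_x example above.
    """
    lo = max(a[0], j_low)
    hi = min(a[-1], j_high)
    j = random.uniform(lo, hi)
    return bisect_right(a, j)
```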

V. PERFORMANCE RESULTS OF RLFC ON BALL AND PLATE
After the modified RLFC was implemented, it was experimentally applied to the implemented BnP system. The graphs in this section are the results of a series of experiments. In all the experiments, the position of the ball versus time is collected using the monitoring section of the implemented BnP. These raw data are then post-processed in software to avoid a huge amount of confusing graphs. After omitting time, the x position versus the y position is obtained, and the points corresponding to a series of iterations are drawn on a single graph. The units of x and y are pixels, and the origin of the coordinate system is as seen by the touch sensor. Each figure illustrates the touch-sensitive area of the plate, i.e. the 1900×1900 pixels of the touch sensor. The location of the ball is shown approximately around the mean of the acquired points. Dark areas in the figures indicate the presence of the ball over the corresponding areas of the plate; the darker a region of the figure, the more the ball was in the corresponding area over all observed iterations. The target (desired) location of the ball in all the illustrated experiments is the centre of the plate.

In Fig. 5, the improvement in the behavior of the ball under control of RLFC is shown. It is observed that after approximately 70000 iterations, an acceptable performance is obtained. Note that the time needed per iteration normally varies; in our experiments, 70000 iterations took around 20 minutes. Since the performance of the system is satisfactory around the 70000th iteration, at this stage RLFC is regarded as a trained system. However, since we do not switch off the learning procedure afterwards, the comments under the next figures do not include the term "trained"; instead, the 70000th iteration is mentioned as a reference point for a well-trained system. The control signals relating to the best performance illustrated in Fig. 5 are shown in Fig. 6.


In all the aforementioned experiments, the ball is initially located in the same place. To show that this is not a necessary condition, in Fig. 7 another starting point is chosen. It is seen that, since RLFC had not experienced these new states enough before, at first it could not perform well. However, as soon as the ball reaches a previously sufficiently experienced state (shown by an arrow), its behavior comes under control. The number of iterations in this figure is about 3500.

In order to compare the performance of RLFC with that of a human, 10 individuals (all healthy, normal adults with no apparent nervous or muscular disorder) were selected and asked to control the implemented BnP system. Each individual was allowed to try the system 10 times. Note that in all these experiments the steppers were released and the individuals controlled the plate with their own hands. This indeed removes the most complicated nonlinearity and imprecision of the system: the actuators. Fig. 8 illustrates the best performance, which was the 7th try of the individual who possessed the best control over the BnP system.

VI. CONCLUSION
The main idea of the discussed work was to propose a human-like controller, capable of learning from its own past experience as well as of embedding some prior knowledge and reasonable facts. Although still far from exact human-like behavior, applying this controller to a very complex and uncertain plant resulted in satisfactory performance, especially when compared with a skilled human trying to control the same plant. There is a variety of extensions and modifications to the proposed method, from the form of the fuzzy IF-THEN rules to the method of tuning the various defined parameters.

REFERENCES
[1] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, MIT Press/Bradford Books, Cambridge, MA, 1998.
[2] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, May 1996.
[3] A. B. Rad, P. T. Chan, Wai Lun Lo, and C. K. Mok, "An Online Learning Fuzzy Controller," IEEE Trans. Industrial Electronics, vol. 50, no. 5, pp. 1016-1021, October 2003.
[4] X. Fan, N. Zhang, and S. Teng, "Trajectory planning and tracking of ball and plate system using hierarchical fuzzy control scheme," Fuzzy Sets and Systems, vol. 144, pp. 297-312, 2003.
[5] S. Awtar and K. C. Craig, "Mechatronic Design of a Ball on Plate Balancing System," Proc. 7th Mechatronics Forum International Conference, Atlanta, GA, 2000.
[6] Humusoft, CE 151 Ball & Plate Apparatus: User's Manual, Humusoft.
[7] E. H. Mamdani, "Application of fuzzy algorithms for control of simple dynamic plant," Proc. IEE, vol. 121, pp. 1585-1588, 1974.
[8] M. Bai, H. Lu, J. Su, and Y. Tian, "Motion Control of Ball and Plate System Using Supervisory Fuzzy Controller," Proc. WCICA 2006, vol. 2, pp. 8127-8131, June 2006.
[9] H. Wang, Y. Tian, Z. Sui, X. Zhang, and C. Ding, "Tracking Control of Ball and Plate System with a Double Feedback Loop Structure," Proc. ICMA 2007, August 2007.
[10] R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, 1961.
[11] L. X. Wang, A Course in Fuzzy Systems and Control, Prentice-Hall International, 1997.
[12] A. G. Barto, "Reinforcement Learning," in The Handbook of Brain Theory and Neural Networks, M. A. Arbib, Ed., Jaico Publishing House / The MIT Press, Cambridge, MA, USA, 2006, pp. 804-809.
[13] H. K. Lam, F. H. F. Leung, and P. K. S. Tam, "Design of a fuzzy controller for stabilizing a ball and beam system," Proc. IECON 1999, vol. 2, pp. 520-524.
[14] E. Laukanen and S. Yurkovich, "A Ball and Beam Testbed for Fuzzy Identification and Control Design," Proc. American Control Conf., June 1993, pp. 665-669.

Fig. 5. Improvement in the behavior of the ball under RLFC control. The top left figure corresponds to the first 15000 iterations, where a priori knowledge is initially embedded. The top right and bottom left figures show the improvement in performance. After approximately 70000 iterations, the bottom right performance is regarded as acceptable.

Fig. 6. Control inputs with respect to the best performance in Fig. 5.

Fig. 7. Different starting point.

Fig. 8. Human best performance.