
Sudan Engineering Society Journal, March 2013, Volume 59, No. 1

A DETAILED APPROACH TO REINFORCEMENT LEARNING: A SEMI-BATCH REACTOR CASE STUDY

Mustafa Abbas Mustafa 1 and Tony Wilson 2

    1 Department of Chemical Engineering, Faculty of Engineering, University of Khartoum.

2 Chemical and Environmental Engineering, Faculty of Engineering, The University of Nottingham, United Kingdom

    Received Sep. 2012, accepted after revision Jan 2013


    Abstract

    The transient nature of semi-batch reactors, coupled with the unavailability of

    accurate mathematical models and online measurements, restricts achieving optimal

operation. However, one finds that operators have managed, through experience, to improve on previous performance. Reinforcement Learning (RL) has already been

    identified as an approach to mimic this interactive learning process. Core elements

    have not been presented in detail for direct application. This work aims to provide a

    blueprint of the RL approach and a validation, through MATLAB implementation,

    against a published case study. Moreover, the initial training data set is modified to

    confirm the convergence of the algorithm.

    Keywords: Reinforcement Learning, ANN, Optimization, Control, Semi-Batch Reactor  

    1. Introduction

    Batch processing is an important sector of

    the chemical process industry. In

comparison to continuous processes, batch processes are typically associated with the production of fine or

    specialty chemicals, pharmaceuticals,

    biochemicals, and polymers. There has been

    an increasing interest in multi-product

    batch production, so as to adjust better to

changing market conditions [1]. Although different degrees of instrumentation and

    automation could be found in industry,

    many batch reactors are still operated

    manually [2]. Despite the existence of an

substantial amount of literature on batch unit optimization using an exact process model and optimal control methods, these

    methodologies are rarely part of everyday

    industrial practice [3]. Conventional optimal

    control methods based on perfect process

models and continuous measurements are difficult to apply in the industrial


    environment. The main reasons are the

scarcity (and sometimes delay) of online

    measurements; unavailability of accurate

    mathematical models; batch to batch

    variations (including raw material

variability); and the inherent unsteady-state nature of batch processes [3].

Despite all this, many industrial processes

    are operated with acceptable levels of

    performance by human operators. The

    operators use a combination of good

    engineering insight, judgment and ability to

    learn incrementally to define, implement,

    and improve control policies for a great

    variety of process tasks. On the other hand,

    maintaining consistent quality becomes

    difficult, due to changes in operations from

shift to shift and differences in skill level among operators. This results in the need to

    develop methodologies and software tools

    that can provide automation in batch

    process units.

    Martinez et al. [4,5,6] recognized the

potential application of the Reinforcement Learning (RL) algorithm to batch chemical processes,

    and applied the algorithm to a semi-batch

    reactor case study. However, no detailed

explanation was provided, which gave the main impetus for this work. The RL

    algorithm is implemented using MATLAB

    and compared against the same case study

    reported by Martinez et al. [4], primarily to

    validate the RL algorithm, but also to extend

    results obtained.

    2. Reinforcement Learning

In the late 1980s, Reinforcement Learning

    emerged as an integration of three threads

    that had been pursued independently. The

    threads are: Trial and error learning;

    Optimal control; Temporal-difference

    learning methods.

    The first thread started in the psychology of

    animal learning, and revolved around the

    trial and error nature of their learning.

Sutton and Barto's [7] review shows that the first to express the essence of trial and error learning was Edward Thorndike [8], who wrote:

    "of several responses made to the same

    situation, those which are accompanied or

    closely followed by satisfaction to the

    animal will, other things being equal, be

    more firmly connected with the situation, so

    that, when it recurs, they will be more likely

    to recur; those which are accompanied or

    closely followed by discomfort to the animal

    will, other things being equal, have their

    connections with that situation weakened,

    so that, when it recurs, they will be less

    likely to occur. The greater the satisfaction

    or discomfort, the greater the strengthening

    or weakening of the bond ".

In essence, actions followed by good or bad outcomes have their tendency to be reselected strengthened or weakened accordingly. Thorndike called

    this the "Law of Effect", since it describes

the effect of reinforcing events on the tendency to select actions. The two most

    important aspects of trial and error learning

    are: It is selectional, in other words, it

    involves trying alternatives and selecting

    among them by comparing their

    consequences; it is associative, in the sense

    that alternatives found by selection are

    associated with particular situations.

Hence, the combination of those two aspects is essential to trial and error

    learning, as it is to the Law of Effect. In

    other words, the Law of Effect is an

    elementary way of combining search of

    many actions in a given situation, and

    memory of the best actions in the specific

    situations. This thread makes up a big part

    of the work in Artificial Intelligence, and led

    to renewed interest in Reinforcement

    Learning in the early 1980s.


    The second thread deals with the optimal

    control problem and the use of value

    functions by Dynamic Programming

    solutions. The term "optimal control" was

    first used in the 1950s to describe the

    problem of designing a controller, so as to

    minimize a measure of a dynamical system's

    behavior over time. In the mid-1950s,

    Richard Bellman developed one solution to

    this problem. His approach uses the concept

    of a dynamical system's state, and a Value

    Function to define a set of functional

    equations, referred to now as the Bellman

    equations. Later, the class of methods for

solving optimal control problems by solving the Bellman equations came to be known as

    Dynamic Programming [9].

    Markovian Decision Processes (MDPs), a

    discrete stochastic version of the optimal

control problem, were introduced by

    Bellman [10]. Howard [11] later devised the

    Policy Iteration Method for MDPs. All of

    these make up the basic elements

underlying the theory and algorithms of Reinforcement Learning. The literature

    contains many developments relating to

    Dynamic Programming e.g. Bertsekas [12],

    Puterman [13], Ross [14]. In addition,

    Bryson [15] provides a history of optimal

    control.

The final thread is smaller and less distinct, and concerns Temporal-Difference

    methods of learning. Temporal Difference

    (TD) methods are a general framework for

    solving sequential prediction and control

    problems, whereby an agent learns by

    comparing temporally successive

    predictions. The important part is that the

    agent can learn before seeing the final

    outcome. Nowadays, the field of Temporal

    Difference covers more general methods for

    learning to make long-term predictions

    about dynamical systems (Sutton and Barto

[7], Watkins [16], Werbos [17]). This may

    be particularly relevant in predicting

    financial data, life spans, and weather

    patterns.

    2.1 RL Optimization Problem

    Following the book on the subject by Sutton

and Barto [7], one could define Reinforcement Learning simply as the

    mapping of situations to actions so as to

    maximize a numerical reward. An important

    point to add is that the learner (e.g.

    operator) is not told which actions to take

but must explore different options, and exploit what they already know

    about the process, to discover actions that

    yield the highest reward.

    The main elements of Reinforcement

Learning comprise an agent (e.g.

    operator, software) and an environment.

    The agent is simply the controller, which

interacts with the environment by selecting certain actions. The environment then

    responds to those actions and presents new

    situations to the agent. The agent’s

    decisions are based on signals from the

    environment, called the environment's

    state. Figure 1 shows the main framework

    of Reinforcement Learning. This is a typical

    Reinforcement Learning problem,

    characterized by:

1. Setting of explicit goals.
2. Breaking of problem into decision steps.
3. Interaction with environment.
4. Sense of uncertainty.
5. Sense of cause and effect.
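To make this interaction framework concrete, the short sketch below implements a bare agent-environment loop of the kind described above. The class names, the episodic structure with a reward revealed only at the end, and the random action choice are illustrative assumptions, not part of the published algorithm.

```python
# Minimal sketch of the agent-environment interaction loop described above.
# All class and method names here are illustrative placeholders.

import random


class Environment:
    """Toy episodic environment: the agent acts at a fixed number of decision steps."""

    def __init__(self, n_stages=3):
        self.n_stages = n_stages

    def reset(self):
        self.stage = 0
        return 0.0  # initial state signal

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        self.stage += 1
        next_state = random.random()          # environment responds with a new situation
        done = self.stage >= self.n_stages    # episode ends after the last stage
        reward = next_state if done else 0.0  # reward only revealed at the end of the episode
        return next_state, reward, done


class Agent:
    """Chooses actions; a real agent would exploit a learned value function."""

    def select_action(self, state):
        return random.uniform(0.0, 1.0)       # here, pure exploration with random actions


env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.select_action(state)       # agent acts on the current state
    state, reward, done = env.step(action)    # environment returns new state and reward
print("terminal reward:", reward)
```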

  • 8/18/2019 1. a Detailed Approach to Reinforcement Learning - A Semi-batch Reactor Case Study 2013

    4/12

  • 8/18/2019 1. a Detailed Approach to Reinforcement Learning - A Semi-batch Reactor Case Study 2013

    5/12

     Sudan Engineering Society Journal, March 2013, Volume59; No.1 

    Mustafa Abbas Mustafa andTony Wilson

    25

    the approximations of Q* and a* become

    closer and closer to the actual values. After

    completion of learning of the Value

    Function, the Reinforcement Learning

    algorithm is used to compute the optimal

    action at every state.

    An overview of the RL algorithm identifies

    the following components and concepts:

1. Value Function: Objective function reflecting how good/bad it is to be at a certain state and taking a given action.
2. Bellman Optimality Equations: Convergence criteria.
3. Wire Fitting: Approximating the Value Function.
4. Neural Network: Part of the Wire Fitting approximation.
5. Predictive Models: Used at each stage, instead of the actual model, to provide a one-step-ahead prediction of states.

    An initial training data set is provided and

the RL algorithm is executed offline. Following the completion of the learning

    phase, the RL algorithm is implemented

    online. The control policy is then to

    calculate the optimal action a* for every

    state encountered during progress of the

    batch run. At the end of the batch run, the

    training data set is updated, followed by

    update of the predictive models and testing

    of convergence criteria.
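The batch-to-batch cycle just described can be summarised in a short sketch. The stub functions below merely stand in for the detailed steps of Sections 2.3.1-2.3.5 (the "learning" step is reduced to copying the best run so far, and the batch run is a toy function); none of this is the authors' MATLAB implementation.

```python
# Assumed sketch of the offline-learning / online-application cycle described above.
# The stub functions stand in for the detailed steps in Sections 2.3.1-2.3.5.

import random

N_STAGES = 3  # decisions taken at T-3, T-2 and T-1


def learn_value_function(training_data):
    """Offline phase stub: a real implementation fits the Wire Fitting / neural
    network approximation until the mean squared Bellman error is small."""
    best = max(training_data, key=lambda run: run["PI"])
    return {"best_actions": best["actions"]}


def optimal_action(value_fn, stage):
    """Online phase stub: the action currently believed optimal, with exploration noise."""
    return value_fn["best_actions"][stage] + random.gauss(0.0, 0.05)


def run_batch(actions):
    """Stub for one batch run; a real run would integrate the reactor model."""
    return {"actions": actions, "PI": 10.0 - sum((a - 0.6) ** 2 for a in actions)}


# Initial training data set of past batch runs (here: random action profiles).
training_data = [run_batch([random.random() for _ in range(N_STAGES)]) for _ in range(6)]

for batch in range(26):                                          # additional batch runs
    value_fn = learn_value_function(training_data)               # offline learning
    actions = [optimal_action(value_fn, t) for t in range(N_STAGES)]  # online policy
    training_data.append(run_batch(actions))                     # update training data
print("best PI so far:", round(max(run["PI"] for run in training_data), 3))
```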

    2.3 RL Methodology: A Detailed description

The criterion used for convergence is the Bellman Optimality Equation (Equation 3):

$$Q^*(s_t, a_t) = E\left\{ r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right\} \qquad (3)$$

Since the rewards (r_t) are not known in advance, until the run has been completed, r(s_t, a_t) = 0 for t < T was imposed. Also, the discount factor γ is set to 1, since the problem breaks down into episodes. Hence, the Bellman Optimality Equation could be rewritten as follows:

$$Q^*(s_t, a_t) = \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \qquad (4)$$

The Value Function is calculated in general using the following relationships:

$$Q(s_t, a_t) = \begin{cases} PI & \text{if the process goal is achieved at time } T \\ -1 & \text{otherwise} \end{cases} \qquad (5)$$

    Where PI is the Performance Index (a

    function of the final conditions at time T).

A penalty of -1 is a nominal value, and it may be appropriate to use other values in particular problems.

    Since the main aim of the algorithm is

    defining the optimal actions which result in

    the optimal value function, Equation 5 could

be rewritten as follows (since the goal is always achieved with an optimal policy (*), the Value Function never equals -1):

$$Q^*(s_t, a_t^*) = PI^* \qquad (6)$$

Equation 6 is true only when the RL

    algorithm converges to the actual optimal

    value function. During incremental learning

of the optimal value function, differences occur, which define the so-called Bellman error.

    The mean squared Bellman error, EB, is then

    used in the approach to drive the learning

process to the true optimal value function (Equation 7 defines E_B for a given state-action pair (s_t, a_t)):

$$E_B(s_t, a_t) = \left[ \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]^2 \qquad (7)$$
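A minimal sketch of these two relationships follows, assuming the episodic convention above (zero intermediate rewards, γ = 1); the toy Q-function and the candidate-action grid used in the example are illustrative only.

```python
# Sketch of the terminal reward (Equation 5) and the squared Bellman error
# (Equation 7) under the episodic convention used above: zero intermediate
# rewards and a discount factor of 1.

def terminal_value(goal_achieved, performance_index, penalty=-1.0):
    """Equation 5: the value of a run is its Performance Index if the process
    goal is achieved at time T, otherwise a nominal penalty of -1."""
    return performance_index if goal_achieved else penalty


def squared_bellman_error(q, state, action, next_state, candidate_actions):
    """Equation 7 (sketch): squared difference between the best value reachable
    from the next state and the current estimate Q(s_t, a_t)."""
    target = max(q(next_state, a) for a in candidate_actions)
    return (target - q(state, action)) ** 2


# Example with a toy quadratic Q-function (purely illustrative).
q = lambda s, a: -(a - 0.5 * s) ** 2
error = squared_bellman_error(q, state=1.0, action=0.4, next_state=0.9,
                              candidate_actions=[0.1 * k for k in range(11)])
print(terminal_value(True, 8.0), round(error, 4))
```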

       

    2.3.1 Implementation of Predictive Model

    Martinez et al. [4] proposed the use of a

    predictive model as part of the

    Reinforcement Learning algorithm. The


    most important requirement of a predictive

    model is the ability to generalise properly

    during learning. Otherwise the Value

    Function may provide a poor control policy.

    This problem is even worse for batch

    processes showing significant batch-to-

    batch variations in their behaviour (e.g.

    polymerisation reactors). In general, the use

    of pure inductive models (e.g. regression

    models) is quite prone to poor predictions,

    depending on the richness of experimental

    conditions contained in the training data

    set. The idea of hybrid modelling (Psichogios

    and Ungar [18]; Thompson and Kramer

[19]; Tsen et al. [3]) gives a different approach to improving the prediction

    capability of an inductive model. Two

    important requirements for any hybrid

    prediction model to be used in batch

    process optimization are:

    1.  The strong support of prediction in

    experimental data points.

    2.  The ability to exploit qualitative

information from an imperfect process model based on first principles or other

    sources of knowledge.

    Process models based on first principles,

    although numerically imprecise, are able to

    capture the qualitative trends of process

    variables quite well. Martinez et al. [4]

    recognized the potential of using hybrid

    predictive models in Reinforcement

    Learning, and adapted a hybrid schemeproposed by Tsen et al. [3]. The proposed

predictive model makes use of an imperfect process model f_mod to introduce correction terms based on local process trends. The state change from s_t^a to s_{t+1}^a caused by action a_t^a is extrapolated from an experimental pair (s_t^e, a_t^e) as follows:

$$\hat{s}_{t+1}^{\,a} = f_{reg}(s_t^e, a_t^e) + \left.\frac{\partial f_{mod}}{\partial s_t}\right|_{(s_t^e, a_t^e)} \left(s_t^a - s_t^e\right) + \left.\frac{\partial f_{mod}}{\partial a_t}\right|_{(s_t^e, a_t^e)} \left(a_t^a - a_t^e\right) \qquad (8)$$

where f_reg stands for the inductive model (local regression model). The main advantage of Equation 8 is that partial

    derivatives with respect to states and

    actions are calculated using first principles

    incorporated into the imperfect process

model. Hence, for example, a process model based on nominal parameters could be used to approximately calculate the expected change in the Performance Index when action a_t is taken and the process state is s_t + Δs_t in place of s_t. This would significantly

    improve the prediction capability of the

    regression model. If state-action data pairs

    are assumed to be stratified with regard to

    the stage-wise decision procedure, there

    will exist several pairs with the same index

    “t ” that are able to provide a prediction for

    the state at “t +1”. Furthermore, the average

    from extrapolations corresponding to all

those experimental pairs having the same index t is used. This average is defined by

    weighting each individual extrapolation

(Equation 9) with a factor that measures the proximity of the corresponding experimental condition to (s_t^e, a_t^e). The next state prediction is then calculated using:

$$\hat{s}_{t+1} = \sum_{j \in J_t} \omega_j \, \hat{s}_{t+1}^{\,(j)} \qquad (9)$$

where J_t is the set of all experimental pairs associated with the t-th decision stage, \hat{s}_{t+1}^{(j)} is the extrapolation of Equation 8 based on the j-th experimental pair, and each ω_j is a normalized weighting factor defined as

$$\omega_j = \frac{\sqrt{\left(\dfrac{1}{\left|s_t^a - s_t^{\,j}\right|}\right)\left(\dfrac{1}{\left|a_t^a - a_t^{\,j}\right|}\right)}}{\displaystyle\sum_{k \in J_t}\sqrt{\left(\dfrac{1}{\left|s_t^a - s_t^{\,k}\right|}\right)\left(\dfrac{1}{\left|a_t^a - a_t^{\,k}\right|}\right)}} \qquad (10)$$
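A compact sketch of this hybrid prediction scheme is given below. The scalar states and actions, the finite-difference evaluation of the model derivatives, the stand-in models f_mod and f_reg, and the inverse-distance style weights (matching the reconstructed Equation 10) are all assumptions for illustration, not the authors' code.

```python
# Sketch of the hybrid predictive model (Equations 8-10): extrapolate from each
# experimental pair using derivatives of an imperfect first-principles model,
# then combine the extrapolations with proximity-based weights.
# Scalar states/actions and finite-difference derivatives are simplifying assumptions.

import math


def f_mod(s, a):
    """Imperfect first-principles model of the next state (illustrative form)."""
    return 0.9 * s + 0.5 * a


def hybrid_predict(s, a, experimental_pairs, f_reg, h=1e-4, eps=1e-9):
    """experimental_pairs: list of (s_e, a_e) measured at the same decision stage.
    f_reg(s_e, a_e): inductive (local regression) prediction of the next state."""
    extrapolations, weights = [], []
    for s_e, a_e in experimental_pairs:
        # Equation 8: regression prediction corrected by model-based derivatives.
        dfds = (f_mod(s_e + h, a_e) - f_mod(s_e - h, a_e)) / (2 * h)
        dfda = (f_mod(s_e, a_e + h) - f_mod(s_e, a_e - h)) / (2 * h)
        extrapolations.append(f_reg(s_e, a_e) + dfds * (s - s_e) + dfda * (a - a_e))
        # Equation 10 (assumed form): weight grows as the pair gets closer to (s, a).
        weights.append(math.sqrt(1.0 / (abs(s - s_e) + eps) * 1.0 / (abs(a - a_e) + eps)))
    total = sum(weights)
    # Equation 9: weighted average of the individual extrapolations.
    return sum(w * x for w, x in zip(weights, extrapolations)) / total


pairs = [(0.50, 0.30), (0.62, 0.35), (0.45, 0.40)]
f_reg = lambda s_e, a_e: 0.88 * s_e + 0.55 * a_e   # stand-in regression model
print(round(hybrid_predict(0.55, 0.33, pairs, f_reg), 4))
```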


    2.3.2 Wire fitting function

    Wire Fitting (Baird and Klopf [20]) is a

    function approximation method, which is

specifically designed for self-learning control

    problems where simultaneous learning and

    fitting of a function takes place. It is

    particularly useful to Reinforcement

    Learning systems, due to the following

    reasons: It provides a way for approximating

    the Value Function; It allows the maximum

    value function to be calculated quickly and

    easily.

    Wire Fitting is an efficient way of

    representing a function because it fits

    surfaces using wires as shown in Figure 3. In

addition, it is even more attractive for use in Reinforcement Learning algorithms because it reduces the computational requirements even further by providing the Value Function directly.

Figure 3: Wire Fitting using 3 wires

    The function Q (s,a) is supported by 3 wires,

    although any number of wires could be

    specified. Taking a 2D slice of Figure 3

    shows the 3 wires, represented by the 3

    black circles, which are called support

points, as shown in Figure 4.

Figure 4: Two-dimensional slice of Wire Fitting surface

    Baird and Klopf [20] suggested the use of

    any general function approximation system

    for the "Learned Function" shown in Figure

    5, to learn the relation between different

states and control points (a_1(s), Q_1(s), a_2(s), Q_2(s), a_3(s), Q_3(s)).

    Figure 5: Wire Fitting Architecture

    The Interpolated Function is then defined by

    a weighted-nearest-neighbour interpolation

    of the control points as follows:

$$Q(s, a) = \frac{\displaystyle\sum_{i=1}^{m} \frac{Q_i(s)}{\left|a - a_i(s)\right|^2 + c\left(\max_j Q_j(s) - Q_i(s)\right)}}{\displaystyle\sum_{i=1}^{m} \frac{1}{\left|a - a_i(s)\right|^2 + c\left(\max_j Q_j(s) - Q_i(s)\right)}} \qquad (11)$$

where the constant c determines the amount of smoothing for the approximation and m defines the number of control wires.
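The sketch below implements this interpolation for a scalar action, following the reconstructed form of Equation 11; the small constant added to the denominator to avoid division by zero at a support point is an implementation assumption.

```python
# Sketch of the Wire Fitting interpolation (Equation 11, as reconstructed above):
# a weighted-nearest-neighbour interpolation through m control points (wires).

def wire_fit_q(a, wires, c=0.1, eps=1e-12):
    """a: candidate action (scalar).
    wires: list of (a_i, Q_i) control points supplied, for a given state s,
    by the learned function (e.g. a neural network)."""
    q_max = max(q_i for _, q_i in wires)
    num = den = 0.0
    for a_i, q_i in wires:
        # Distance term plus smoothing towards the best wire; eps avoids 0/0
        # when the query action coincides exactly with a support point.
        d = (a - a_i) ** 2 + c * (q_max - q_i) + eps
        num += q_i / d
        den += 1.0 / d
    return num / den


# Three wires for some state s, as in Figures 3 and 4.
wires = [(0.2, 3.0), (0.5, 7.5), (0.9, 5.0)]
print(round(wire_fit_q(0.5, wires), 3))   # ~7.5: the surface passes through each wire
print(round(wire_fit_q(0.7, wires), 3))   # smooth value between the wires
```

Because the smoothing term vanishes for the wire with the largest Q_i, the interpolated surface passes (essentially) through the highest support point, so the maximum of the Value Function and its action can be read off directly from the wires, which is what makes the scheme attractive for Reinforcement Learning.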

    Baird and Klopf [20] then suggested the use

    of arbitrary values for the control points at

    the beginning of the learning process. This

    results in a long duration for the learning

    process, which is impractical in industrial

operations where quick achievement and optimization of process goals is required.

    2.3.3 Calculations of Mean Squared

    Bellman Error

    To calculate the mean squared Bellman

    error for states at this stage, both the values

of PI* and Q*(s, a*) are required (Equation

    6) for each state at T-1 in the current

    training data set. The value PI* represents

    the optimal Value Function for a state at T-1

    and is calculated as shown in Figure 6 using


    the predictive model (PM1) and running an

    optimization routine (given the constraints)

    between the maximum and minimum

    values of the action (a min to a max).

Figure 6: Calculation of optimal value function (PI*) and optimal action (a*) given a state at T-1, where PM1 is the predictive model for states at stage T-1 to T (search between the minimum and maximum value of the action, a_min to a_max, using PM1).

    The optimization results in identification of

    the best action a* which would lead to the

    optimal value function PI* for each state at

    T-1. The next step is to calculate the Value

    Function, Q*(s, a*), which represents the

    current Wire Fitting approximation of PI*

    for states at T-1. This is achieved by

    supplying the optimal action (a*), together

with the current state to the Wire Fitting approximation function as input (Figure 7).

Figure 7: Calculation of optimal value function approximation using Wire Fitting for state s and optimal action a* at T-1 (the Neural Network supplies the control points, the predictive model PM1 supplies the support values Q_1(s), Q_2(s), Q_3(s), and the Interpolation Function of Equation 11 returns Q*(s, a*)).

    The calculation of the Value Functions Q 1(s),

    Q 2(s), and Q 3(s) in Figure 7 is further

demonstrated in Figure 8. Let us take a state at T-1:
1. The Neural Network will supply 3 actions (a_1, a_2, a_3, shown in Figure 8).
2. The predictive model will supply 3 final states (S_T1, S_T2, S_T3).
3. Consequently, given the 3 final states, the 3 value functions (Q_1(s), Q_2(s), and Q_3(s)) can be calculated by evaluating the corresponding performance indices.

Figure 8: Calculation of support points (Q_1(s), Q_2(s), and Q_3(s)) for use in Wire Fitting approximation of optimal value function for states at T-1

    A similar procedure is used to calculate the

mean squared Bellman error for all states

    at stages T-2 and T-3.
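The procedure of Figures 6-8 can be sketched for a single state at T-1 as follows; the stand-in predictive model and Performance Index, the grid search over the action range (in place of a dedicated optimization routine), and the three candidate actions are all assumptions used only for illustration.

```python
# Sketch of the Figure 6-8 procedure for one state at T-1: search the action
# range through the predictive model to obtain the target PI* and optimal action
# a*, evaluate the Wire Fitting support points, and compare the interpolated
# Q*(s, a*) with PI* to obtain the squared Bellman error for that state.
# The predictive model, PI and grid search are illustrative stand-ins.

def wire_fit_q(a, wires, c=0.1, eps=1e-12):
    """Wire Fitting interpolation (Equation 11), repeated here for completeness."""
    q_max = max(q_i for _, q_i in wires)
    num = den = 0.0
    for a_i, q_i in wires:
        d = (a - a_i) ** 2 + c * (q_max - q_i) + eps
        num += q_i / d
        den += 1.0 / d
    return num / den


def pm1(state, action):
    """Stand-in one-step predictive model from stage T-1 to T."""
    return state + 0.8 * action - 0.3 * action ** 2


def performance_index(final_state):
    """Stand-in PI evaluated on the predicted final state."""
    return 10.0 * final_state


def optimal_target(state, a_min=0.0, a_max=1.5, n=151):
    """Figure 6: search between a_min and a_max for the action maximising PI."""
    grid = [a_min + (a_max - a_min) * k / (n - 1) for k in range(n)]
    a_star = max(grid, key=lambda a: performance_index(pm1(state, a)))
    return a_star, performance_index(pm1(state, a_star))            # (a*, PI*)


state = 0.4
a_star, pi_star = optimal_target(state)
nn_actions = [0.2, 0.7, 1.2]                                         # supplied by the neural network
wires = [(a, performance_index(pm1(state, a))) for a in nn_actions]  # Figure 8 support points
q_star = wire_fit_q(a_star, wires)                                   # Figure 7: Q*(s, a*)
print(round(a_star, 2), round(pi_star, 2), round(q_star, 2), round((pi_star - q_star) ** 2, 4))
```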

    2.3.4 Back Propagation of Mean Squared

    Bellman Error

    After the calculation of the mean squared

    Bellman errors, the weights and biases in

    the Neural Network are modified as follows:

$$\Delta w = -\eta \left( \frac{\partial E_B}{\partial w} \right) \qquad (12)$$

using the chain rule,

$$\frac{\partial E_B}{\partial w} = \frac{\partial E_B}{\partial Q} \cdot \frac{\partial Q}{\partial p} \cdot \frac{\partial p}{\partial w} \qquad (13)$$

where p denotes the control-point outputs of the Neural Network. Furthermore, the derivative of the Value Function with respect to the output of the Neural Network is obtained analytically by differentiation of Equation 11.

    Back propagation of mean squared Bellman

    error is performed initially for states at T-1

    only. Once training is achieved for this

    subset of state-action pairs, states at T-2 are


    added in the training data set. This is then

    finally followed by inclusion of the initial

    state. During this stage-wise training

    procedure, the back propagation of mean

    squared Bellman error (EB) is continued until

    it is below a certain tolerance for all the

    states in the training data set. Furthermore,

    each time a new experimental run is

    available, the whole procedure is repeated.

    MATLAB [21] is used to implement the

    algorithm.
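As an illustration of this training step, the sketch below adjusts a small parameter vector by gradient descent on the mean squared Bellman error, using numerical gradients in place of the analytical chain-rule derivatives of Equations 12-13. The toy parameterised value function, the learning rate, the tolerance and the target values are assumptions.

```python
# Sketch of gradient descent on the mean squared Bellman error (Equations 12-13),
# with numerical gradients standing in for the analytical chain-rule derivatives.
# The toy parameterised Q-function and the training targets are illustrative only.

def q_value(params, state, action):
    """Toy parameterised value function standing in for the wire-fitted network."""
    w0, w1, w2 = params
    return w0 + w1 * state + w2 * action


def mean_squared_bellman_error(params, training_set):
    """training_set: list of (state, action, target), where target is PI* or the
    best value reachable from the next stage."""
    errs = [(target - q_value(params, s, a)) ** 2 for s, a, target in training_set]
    return sum(errs) / len(errs)


def train(params, training_set, lr=0.3, tol=1e-6, h=1e-6, max_iters=20000):
    """Equation 12: w <- w - lr * dE_B/dw, with dE_B/dw estimated numerically."""
    for _ in range(max_iters):
        e0 = mean_squared_bellman_error(params, training_set)
        if e0 < tol:
            break
        grad = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += h
            grad.append((mean_squared_bellman_error(bumped, training_set) - e0) / h)
        params = [w - lr * g for w, g in zip(params, grad)]
    return params


# (state, action, target) triples for states at T-1, added first in the stage-wise procedure.
targets_T1 = [(0.2, 0.9, 8.0), (0.8, 0.3, 9.0), (0.5, 0.9, 7.5)]
params = train([0.0, 0.0, 0.0], targets_T1)
print([round(w, 2) for w in params],
      round(mean_squared_bellman_error(params, targets_T1), 8))
```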

    2.3.5 On-line Application of RL Algorithm

    Following the learning of the optimal value

    function for the current training data set,

    the RL algorithm calculates the optimal

    actions for states at different time intervals

    and the corresponding optimal value

    function. Given a state at a certain stage,

    the algorithm will perform a forward sweep

    and select the path that will lead to the

    highest resultant value function as shown in

    Figures 9 to 11 (The black dots show

    examples of the optimal route for sample

states at the different time intervals). It should be noted that the use of only the

    Neural Network and Predictive Models

    (PM1, PM2, and PM3) is required for on-line

    applications.

    Figure 9: Search for optimal path for states

    at T-1

    Figure 10: Search for optimal path for

    states at T-2

    Figure 11: Search for optimal path for

    states at T-3
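A sketch of this on-line forward sweep is shown below; the stand-in predictive models, the discrete grid of candidate actions, and the way the final value is scored are assumptions used only to illustrate the search over paths.

```python
# Sketch of the on-line forward sweep (Figures 9-11): from the current state,
# evaluate candidate action sequences through the chain of predictive models and
# pick the path with the highest resultant value (here, the final Performance Index).
# Predictive models, the action grid and the PI are illustrative stand-ins.

import itertools

ACTION_GRID = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]    # within an assumed feed-rate range


def pm(stage, state, action):
    """Stand-in predictive models PM1, PM2, PM3: one model per remaining stage."""
    gains = {0: 0.6, 1: 0.7, 2: 0.8}
    return state + gains[stage] * action - 0.2 * action ** 2


def performance_index(final_state):
    return 10.0 * final_state                            # stand-in PI on the final state


def forward_sweep(state, stage, n_stages=3):
    """Return (best value, best action sequence) from `stage` to the end of the batch."""
    best_value, best_actions = float("-inf"), None
    remaining = n_stages - stage
    for actions in itertools.product(ACTION_GRID, repeat=remaining):
        s = state
        for k, a in enumerate(actions):                  # roll the predictive models forward
            s = pm(stage + k, s, a)
        value = performance_index(s)
        if value > best_value:
            best_value, best_actions = value, actions
    return best_value, best_actions


value, actions = forward_sweep(state=0.3, stage=0)       # a state at T-3, as in Figure 11
print(round(value, 2), actions)
```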

    3 Case Study

    The case study considered by Martinez et al.

[4] is a semi-batch reactor. The main

    product is formed according to an

    autocatalytic reaction scheme, with a

kinetic mechanism for the main reaction as follows:

A + B → 2B   (14)

Meanwhile, a slower irreversible degradation of the product simultaneously takes place as follows:

B → C   (15)

The concentration of Product (B) can be

    measured fast enough and hence is used for

    the on-line optimization. As for the

    assessment of the impurity level, it is

analyzed in the final product only, because it is costly and time-consuming. The quality of the product is on-specification if less than

    2% of B is lost to impurity. Otherwise, it is a

    wasted batch and a penalty of -1 is applied.

    In addition, there is a preference for

achieving the maximum possible conversion of A.

    In order to guarantee the quality of the

    batch, the feed flow rate is changed

    according to intermediate measurement

    samples of product B formation. Three

    samples corresponding to accumulated

    total liquid volumes of V=0.2 Vf , V=0.4 Vf ,

    and V=0.6 Vf   are taken to measure the

    concentration of B. The result of each

    sample is only available with a delay of 30



    minutes. Other relevant data for the case

study is provided in Table 1. The process

    goal is then defined as achieving the

    product within specifications in less than 5

hours, with a conversion higher than 90%

    for the reactant A fed. Once the goal is

achieved, the Performance Index (PI) is

    defined to have 3 units for each additional

    percent conversion over 90% obtained and

    1 unit for each hour reduction with respect

    to the maximum reaction time. Hence, if an

    on-specification product is obtained in 3

    hours and with 92 % of reactant conversion,

    then the goal is achieved and PI=8.
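As a check on this definition, the small function below computes the PI from the stated rule (3 units per percentage point of conversion above 90% and 1 unit per hour saved relative to the 5-hour maximum); the explicit -1 branch for an off-specification or late batch follows Equation 5 and is otherwise an assumption.

```python
# Performance Index as described above: 3 units per extra percent conversion
# over 90% and 1 unit per hour saved relative to the 5-hour maximum batch time.
# The -1 branch for a wasted (off-specification) batch follows Equation 5.

def performance_index(conversion_pct, batch_time_h, on_spec,
                      min_conversion=90.0, max_time_h=5.0):
    goal_achieved = (on_spec and conversion_pct >= min_conversion
                     and batch_time_h <= max_time_h)
    if not goal_achieved:
        return -1.0
    return 3.0 * (conversion_pct - min_conversion) + 1.0 * (max_time_h - batch_time_h)


# Worked example from the text: on-spec product in 3 hours at 92% conversion.
print(performance_index(92.0, 3.0, on_spec=True))   # 3*2 + 1*2 = 8.0
```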

Table 1: Semi-batch reactor case study

Initial conditions: Vol. = 0.5 m3; CA = 1.92 kmol m-3; CB = 0.550 kmol m-3
Reactor feed: CA = 1.42 kmol m-3; CB = 0.75 kmol m-3
Kinetic parameters: k1 = k11 exp(-E1/T) [m6 h-1 kmol-2]; k2 = k22 exp(-E2/T) [h-1]
Nominal parameters: k1 = 10 exp(-1,000/T) [m6 h-1 kmol-2]; k2 = 2.2×10^15 exp(-13,400/T) [h-1]
Operating constraints: Feed rate ≤ 1.5 m3 h-1; Temperature ≤ 50 °C; Max. volume = 5 m3
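For readers who wish to reproduce a comparable environment, the sketch below integrates a plausible semi-batch reactor model built from Table 1. The rate expressions (r1 = k1·CA·CB², r2 = k2·CB, inferred from the units of k1 and k2), isothermal operation at 50 °C, the constant feed rate, the explicit Euler integration and the impurity metric are all assumptions rather than the published model.

```python
# Sketch of a semi-batch reactor simulation consistent with Table 1.
# The rate laws r1 = k1*CA*CB**2 and r2 = k2*CB are inferred from the units of
# k1 [m6 h-1 kmol-2] and k2 [h-1]; isothermal operation at 50 degC, the explicit
# Euler integration and the constant feed rate are simplifying assumptions.

import math

T_K = 50.0 + 273.15                                    # operating temperature [K]
k1 = 10.0 * math.exp(-1000.0 / T_K)                    # [m6 h-1 kmol-2]
k2 = 2.2e15 * math.exp(-13400.0 / T_K)                 # [h-1]


def simulate(feed_rate, t_end=5.0, dt=1e-3):
    """Integrate mole balances for A, B and the impurity C in a fed-batch reactor."""
    V, CA, CB, CC = 0.5, 1.92, 0.550, 0.0              # initial conditions (Table 1)
    CA_feed, CB_feed = 1.42, 0.75                       # reactor feed (Table 1)
    nA_fed = V * CA                                     # total A charged, for conversion
    t = 0.0
    while t < t_end and V < 5.0:                        # stop at the 5 m3 volume limit
        r1 = k1 * CA * CB ** 2                          # autocatalytic main reaction (Eq. 14)
        r2 = k2 * CB                                    # product degradation (Eq. 15)
        dCA = (feed_rate / V) * (CA_feed - CA) - r1
        dCB = (feed_rate / V) * (CB_feed - CB) + r1 - r2
        dCC = -(feed_rate / V) * CC + r2
        V, CA, CB, CC = (V + feed_rate * dt, CA + dCA * dt,
                         CB + dCB * dt, CC + dCC * dt)
        nA_fed += feed_rate * CA_feed * dt
        t += dt
    conversion = 100.0 * (1.0 - V * CA / nA_fed)
    impurity_pct = 100.0 * CC / (CB + CC)               # rough measure of B lost to impurity
    return t, V, conversion, impurity_pct


print([round(x, 3) for x in simulate(feed_rate=1.0)])
```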

    4. Results and Discussion

    Using the same initial training data set of

    six batch runs used by Martinez et al. [4],

    the RL algorithm was executed for an

    additional 26 batch runs. The results

    obtained from implementing the predictive

    models are shown in Figure 12, together

    with the results obtained by Martinez et al.

    [4]. The same final PI value is approximately

achieved by both RL applications, with similar trends followed, though the RL

    algorithm developed here manages to

    converge more steadily.

    The initial training data set used by

    Martinez et al. [4] contained a very good

    batch run with a Performance Index equal

    to 15.09. Hence it could be argued that the

    RL algorithm was not actually incrementally

    improving (or learning), but was rather

    interpolating between previous good batch

    runs. So the best batch run in the training

    data set was replaced with another batch

    run, which leads to the initial training

data set having a maximum Performance Index value equal to 9.15.

    The implementation of the RL algorithm

was then repeated for the new training data set, which shows how the RL algorithm manages

    to achieve a much higher value of the

    Performance Index of 11.36 in comparison

Figure 12: Validation of RL Methodology (Performance Index vs. number of batches; series: results produced by Martinez [4] and results obtained from implementing the RL methodology)


    Figure 13: Performance Index as a function of number of additional batch runs (best batch run

    in initial training data set results in a PI equal to 11.36)

    to the previous best batch run in the

training data set (Figure 13). This is significant in that it demonstrates the capability

    of the RL algorithm to improve beyond the

    level observed in previous batches.

    5. Conclusion

    Reinforcement Learning provides a new

    approach towards the full automation of

    stochastic discrete systems such as batch

    chemical processes. Those systems provide

clear challenges due to their inherent transient behavior, unavailability of online

measurements, and delays in measurements, all of which degrade the

    performance of any process control system.

    This work presents an in-depth description

    of RL for direct application of the algorithm.

    Furthermore, MATLAB implementation of

    the RL approach was validated against a

published case study [4]. The results have shown good agreement with literature in

    addition to further assurance of

    improvement in performance beyond the

    best run already presented in the initial

    training data set.

    References

    1.  D. Bonvin, Optimal Operation of Batch

    Reactors: A Personal View, J. Process

    Control. 8 (1998) 355-368.

2. P. Terwiesch, M. Agarwal and D.W.T. Rippin, Batch Unit Optimization with Imperfect Modeling: a Survey, J. Process Control. 4 (1994) 238-258.

    3.  A.Y. Tsen, S.S. Jhang, D.S. Wong, and B.

    Joseph, Predictive Control of Quality in

    Batch Polymerization using ANN Models,

    AIChE Journal. 42 (1996) 455-465.

    4.  E.C. Martinez, J.A. Wilson and M.A.

    Mustafa, An Incremental Learning

    Approach to Batch Unit Optimisation,

    The 1998 IChemE Research Event,

    Newcastle, UK.

    5.  E.C. Martinez, R.A. Pulley and J.A.

    Wilson, Learning to Control the

    Performance of Batch Processes, Chem.

    Eng. Res. Des., 76 (1998) 711-722.

    6. 

    E.C. Martinez and J.A. Wilson, A Hybrid

    Neural Network First Principles Approach

    to Batch Unit Optimisation, Comput.

    Chem. Eng., Suppl. 22 (1998) S893-S896.

    7.  R.S. Sutton and A.G. Barto,

    Reinforcement Learning: An

    Introduction, The MIT Press, Cambridge,

    Massachusetts, London, England, 1998.

    8.  E.L. Thorndike, Animal Intelligence,

Hafner, Darien, Conn., 1911.



9. R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.

    10.  R.E. Bellman, A Markov Decision Process,

    J. Math. Mech. 6 (1957) 679-684.

    11.  R. Howard, Dynamic Programming and

    Markov Processes, MIT Press,

    Cambridge, MA, USA, 1960.

    12.  D.P. Bertsekas, Dynamic Programming

    and Optimal Control, Athena Scientific,

    Belmont, MA, USA, 1995.

13. M.L. Puterman, Markov Decision Processes, Wiley, New York, USA, 1994.

    14.  S. Ross, Introduction to Stochastic

    Dynamic Programming, Academic Press,

    New York, USA, 1983.

    15.  A.E. Bryson, Optimal control-1950 to

    1985, IEEE Control Systems Magazine. 16

    (1996) 26-33.

16. C.J.C.H. Watkins, Learning from Delayed

    Rewards, PhD thesis, Cambridge

    University, UK, 1989.

17. P.J. Werbos, Building and

    Understanding Adaptive Systems: A

    Statistical/Numerical Approach to

    Factory Automation and Brain Research,

    IEEE Transactions on Systems, Man, and

    Cybernetics. 17 (1987) 7-20.

    18.  D.C. Psichogios and L.H. Ungar, A Hybrid

    Neural Network First Principles Approach

    to Process Modelling, AIChE Journal. 38

    (1992) 1499-1511.

    19. 

    M.L. Thompson and M.A. Kramer,

    Modeling Chemical Processes using Prior

    Knowledge and Neural Networks, AIChE

Journal. 40 (1994) 1328-1340.
20. L.C. Baird and A.H. Klopf, Reinforcement Learning with High-dimensional Continuous Actions, Technical Report WL-

    TR-93-1147, Wright Laboratory, Wright

    Patterson Air Force Base, 1993.

    21.  MATLAB: Version 4 User's Guide. The

    Math Works Inc, Natick, Massachusetts,

    USA, 1995.