
Sudan Engineering Society Journal, March 2013, Volume 59, No. 1

A DETAILED APPROACH TO REINFORCEMENT LEARNING: A SEMI-BATCH REACTOR CASE STUDY

Mustafa Abbas Mustafa 1 and Tony Wilson 2

    1 Department of Chemical Engineering, Faculty of Engineering, University of Khartoum.

2 Chemical and Environmental Engineering, Faculty of Engineering, The University of Nottingham, United Kingdom

    Received Sep. 2012, accepted after revision Jan 2013


    Abstract

    The transient nature of semi-batch reactors, coupled with the unavailability of

    accurate mathematical models and online measurements, restricts achieving optimal

operation. However, one finds that operators have managed, through experience, to improve on previous performance. Reinforcement Learning (RL) has already been

    identified as an approach to mimic this interactive learning process. Core elements

    have not been presented in detail for direct application. This work aims to provide a

    blueprint of the RL approach and a validation, through MATLAB implementation,

    against a published case study. Moreover, the initial training data set is modified to

    confirm the convergence of the algorithm.

    Keywords: Reinforcement Learning, ANN, Optimization, Control, Semi-Batch Reactor  

    1. Introduction

    Batch processing is an important sector of

    the chemical process industry. In

comparison to continuous processes, batch processes are typically associated with the production of fine or

    specialty chemicals, pharmaceuticals,

    biochemicals, and polymers. There has been

    an increasing interest in multi-product

    batch production, so as to adjust better to

changing market conditions [1]. Although different degrees of instrumentation and

    automation could be found in industry,

    many batch reactors are still operated

    manually [2]. Despite the existence of an

substantial amount of literature on batch unit optimization using an exact process model and optimal control methods, these

    methodologies are rarely part of everyday

    industrial practice [3]. Conventional optimal

    control methods based on perfect process

models and continuous measurements are difficult to apply in the industrial


    environment. The main reasons are the

scarcity (and sometimes delay) of online

    measurements; unavailability of accurate

    mathematical models; batch to batch

    variations (including raw material

variability); and the inherent unsteady-state nature of batch processes [3].

Despite all this, many industrial processes

    are operated with acceptable levels of

    performance by human operators. The

    operators use a combination of good

    engineering insight, judgment and ability to

    learn incrementally to define, implement,

    and improve control policies for a great

    variety of process tasks. On the other hand,

    maintaining consistent quality becomes

    difficult, due to changes in operations from

shift to shift and differences in skill level among operators. This results in the need to

    develop methodologies and software tools

    that can provide automation in batch

    process units.

    Martinez et al. [4,5,6] recognized the

potential application of the Reinforcement Learning (RL) algorithm to batch chemical processes,

    and applied the algorithm to a semi-batch

    reactor case study. However, no detailed

explanation was provided, which gave the main impetus for this work. The RL

    algorithm is implemented using MATLAB

    and compared against the same case study

    reported by Martinez et al. [4], primarily to

    validate the RL algorithm, but also to extend

    results obtained.

    2. Reinforcement Learning

In the late 1980s, Reinforcement Learning

    emerged as an integration of three threads

    that had been pursued independently. The

    threads are: Trial and error learning;

    Optimal control; Temporal-difference

    learning methods.

    The first thread started in the psychology of

    animal learning, and revolved around the

    trial and error nature of their learning.

Sutton and Barto's [7] review shows that the first to express the essence of trial and error learning was Edward Thorndike [8], who wrote:

    "of several responses made to the same

    situation, those which are accompanied or

    closely followed by satisfaction to the

    animal will, other things being equal, be

    more firmly connected with the situation, so

    that, when it recurs, they will be more likely

    to recur; those which are accompanied or

    closely followed by discomfort to the animal

    will, other things being equal, have their

    connections with that situation weakened,

    so that, when it recurs, they will be less

    likely to occur. The greater the satisfaction

    or discomfort, the greater the strengthening

    or weakening of the bond ".

In essence, actions followed by good or bad outcomes have their tendency to be reselected strengthened or weakened accordingly. Thorndike called

    this the "Law of Effect", since it describes

the effect of reinforcing events on the tendency to select actions. The two most

    important aspects of trial and error learning

    are: It is selectional, in other words, it

    involves trying alternatives and selecting

    among them by comparing their

    consequences; it is associative, in the sense

    that alternatives found by selection are

    associated with particular situations.

Hence, the combination of those two aspects is essential to trial and error

    learning, as it is to the Law of Effect. In

    other words, the Law of Effect is an

    elementary way of combining search of

    many actions in a given situation, and

    memory of the best actions in the specific

    situations. This thread makes up a big part

    of the work in Artificial Intelligence, and led

    to renewed interest in Reinforcement

    Learning in the early 1980s.


    The second thread deals with the optimal

    control problem and the use of value

    functions by Dynamic Programming

    solutions. The term "optimal control" was

    first used in the 1950s to describe the

    problem of designing a controller, so as to

    minimize a measure of a dynamical system's

    behavior over time. In the mid-1950s,

    Richard Bellman developed one solution to

    this problem. His approach uses the concept

    of a dynamical system's state, and a Value

    Function to define a set of functional

    equations, referred to now as the Bellman

    equations. Later, the class of methods for

solving optimal control problems by solving the Bellman equations came to be known as

    Dynamic Programming [9].

    Markovian Decision Processes (MDPs), a

    discrete stochastic version of the optimal

control problem, were introduced by

    Bellman [10]. Howard [11] later devised the

    Policy Iteration Method for MDPs. All of

    these make up the basic elements

underlying the theory and algorithms of Reinforcement Learning. The literature

    contains many developments relating to

    Dynamic Programming e.g. Bertsekas [12],

    Puterman [13], Ross [14]. In addition,

    Bryson [15] provides a history of optimal

    control.

The final thread is smaller and less distinct, and concerns Temporal-Difference

    methods of learning. Temporal Difference

    (TD) methods are a general framework for

    solving sequential prediction and control

    problems, whereby an agent learns by

    comparing temporally successive

    predictions. The important part is that the

    agent can learn before seeing the final

    outcome. Nowadays, the field of Temporal

    Difference covers more general methods for

    learning to make long-term predictions

    about dynamical systems (Sutton and Barto

[7], Watkins [16], Werbos [17]). This may

    be particularly relevant in predicting

    financial data, life spans, and weather

    patterns.

    2.1 RL Optimization Problem

    Following the book on the subject by Sutton

and Barto [7], one could define Reinforcement Learning simply as the

    mapping of situations to actions so as to

    maximize a numerical reward. An important

    point to add is that the learner (e.g.

    operator) is not told which actions to take

but must explore different options, and exploit what they already know

    about the process, to discover actions that

    yield the highest reward.

    The main elements of Reinforcement

Learning comprise an agent (e.g.

    operator, software) and an environment.

    The agent is simply the controller, which

interacts with the environment by selecting certain actions. The environment then

    responds to those actions and presents new

    situations to the agent. The agent’s

    decisions are based on signals from the

    environment, called the environment's

    state. Figure 1 shows the main framework

    of Reinforcement Learning. This is a typical

    Reinforcement Learning problem,

    characterized by:

1. Setting of explicit goals.
2. Breaking of problem into decision steps.
3. Interaction with environment.
4. Sense of uncertainty.
5. Sense of cause and effect.
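To make this interaction framework concrete, the short sketch below implements a bare agent-environment loop of the kind described above. The class names, the episodic structure with a reward revealed only at the end, and the random action choice are illustrative assumptions, not part of the published algorithm.

```python
# Minimal sketch of the agent-environment interaction loop described above.
# All class and method names here are illustrative placeholders.

import random


class Environment:
    """Toy episodic environment: the agent acts at a fixed number of decision steps."""

    def __init__(self, n_stages=3):
        self.n_stages = n_stages

    def reset(self):
        self.stage = 0
        return 0.0  # initial state signal

    def step(self, action):
        """Apply an action; return (next_state, reward, done)."""
        self.stage += 1
        next_state = random.random()          # environment responds with a new situation
        done = self.stage >= self.n_stages    # episode ends after the last stage
        reward = next_state if done else 0.0  # reward only revealed at the end of the episode
        return next_state, reward, done


class Agent:
    """Chooses actions; a real agent would exploit a learned value function."""

    def select_action(self, state):
        return random.uniform(0.0, 1.0)       # here, pure exploration with random actions


env, agent = Environment(), Agent()
state, done = env.reset(), False
while not done:
    action = agent.select_action(state)       # agent acts on the current state
    state, reward, done = env.step(action)    # environment returns new state and reward
print("terminal reward:", reward)
```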

  • 8/18/2019 1. a Detailed Approach to Reinforcement Learning - A Semi-batch Reactor Case Study 2013

    4/12

  • 8/18/2019 1. a Detailed Approach to Reinforcement Learning - A Semi-batch Reactor Case Study 2013

    5/12

     Sudan Engineering Society Journal, March 2013, Volume59; No.1 

    Mustafa Abbas Mustafa andTony Wilson

    25

    the approximations of Q* and a* become

    closer and closer to the actual values. After

    completion of learning of the Value

    Function, the Reinforcement Learning

    algorithm is used to compute the optimal

    action at every state.

    An overview of the RL algorithm identifies

    the following components and concepts:

1. Value Function: Objective function reflecting how good/bad it is to be at a certain state and taking a given action.
2. Bellman Optimality Equations: Convergence criteria.
3. Wire Fitting: Approximating the Value Function.
4. Neural Network: Part of the Wire Fitting approximation.
5. Predictive Models: Used at each stage, instead of the actual model, to provide a one-step-ahead prediction of states.

    An initial training data set is provided and

the RL algorithm is executed offline. Following the completion of the learning

    phase, the RL algorithm is implemented

    online. The control policy is then to

    calculate the optimal action a* for every

    state encountered during progress of the

    batch run. At the end of the batch run, the

    training data set is updated, followed by

    update of the predictive models and testing

    of convergence criteria.
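The batch-to-batch cycle just described can be summarised in a short sketch. The stub functions below merely stand in for the detailed steps of Sections 2.3.1-2.3.5 (the "learning" step is reduced to copying the best run so far, and the batch run is a toy function); none of this is the authors' MATLAB implementation.

```python
# Assumed sketch of the offline-learning / online-application cycle described above.
# The stub functions stand in for the detailed steps in Sections 2.3.1-2.3.5.

import random

N_STAGES = 3  # decisions taken at T-3, T-2 and T-1


def learn_value_function(training_data):
    """Offline phase stub: a real implementation fits the Wire Fitting / neural
    network approximation until the mean squared Bellman error is small."""
    best = max(training_data, key=lambda run: run["PI"])
    return {"best_actions": best["actions"]}


def optimal_action(value_fn, stage):
    """Online phase stub: the action currently believed optimal, with exploration noise."""
    return value_fn["best_actions"][stage] + random.gauss(0.0, 0.05)


def run_batch(actions):
    """Stub for one batch run; a real run would integrate the reactor model."""
    return {"actions": actions, "PI": 10.0 - sum((a - 0.6) ** 2 for a in actions)}


# Initial training data set of past batch runs (here: random action profiles).
training_data = [run_batch([random.random() for _ in range(N_STAGES)]) for _ in range(6)]

for batch in range(26):                                          # additional batch runs
    value_fn = learn_value_function(training_data)               # offline learning
    actions = [optimal_action(value_fn, t) for t in range(N_STAGES)]  # online policy
    training_data.append(run_batch(actions))                     # update training data
print("best PI so far:", round(max(run["PI"] for run in training_data), 3))
```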

    2.3 RL Methodology: A Detailed description

The criterion used for convergence is the Bellman Optimality Equation (Equation 3):

$$Q^*(s_t, a_t) = E\left\{ r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \right\} \qquad (3)$$

Since the rewards (r_t) are not known in advance, until the run has been completed, r(s_t, a_t) = 0 for t < T was imposed. Also, the discount factor γ is set to 1, since the problem breaks down into episodes. Hence, the Bellman Optimality Equation could be rewritten as follows:

$$Q^*(s_t, a_t) = \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \qquad (4)$$

The Value Function is calculated in general using the following relationships:

$$Q(s_t, a_t) = \begin{cases} PI & \text{if the process goal is achieved at time } T \\ -1 & \text{otherwise} \end{cases} \qquad (5)$$

    Where PI is the Performance Index (a

    function of the final conditions at time T).

A penalty of -1 is a nominal value, and it may be appropriate to use other values in particular problems.

    Since the main aim of the algorithm is

    defining the optimal actions which result in

    the optimal value function, Equation 5 could

be rewritten as follows (since the goal is always achieved with an optimal policy (*), the Value Function never equals -1):

$$Q^*(s_t, a_t^*) = PI^* \qquad (6)$$

Equation 6 is true only when the RL

    algorithm converges to the actual optimal

    value function. During incremental learning

of the optimal value function, differences occur, which define the so-called Bellman error.

    The mean squared Bellman error, EB, is then

    used in the approach to drive the learning

process to the true optimal value function (Equation 7 defines E_B for a given state-action pair (s_t, a_t)):

$$E_B(s_t, a_t) = \left[ \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]^2 \qquad (7)$$
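A minimal sketch of these two relationships follows, assuming the episodic convention above (zero intermediate rewards, γ = 1); the toy Q-function and the candidate-action grid used in the example are illustrative only.

```python
# Sketch of the terminal reward (Equation 5) and the squared Bellman error
# (Equation 7) under the episodic convention used above: zero intermediate
# rewards and a discount factor of 1.

def terminal_value(goal_achieved, performance_index, penalty=-1.0):
    """Equation 5: the value of a run is its Performance Index if the process
    goal is achieved at time T, otherwise a nominal penalty of -1."""
    return performance_index if goal_achieved else penalty


def squared_bellman_error(q, state, action, next_state, candidate_actions):
    """Equation 7 (sketch): squared difference between the best value reachable
    from the next state and the current estimate Q(s_t, a_t)."""
    target = max(q(next_state, a) for a in candidate_actions)
    return (target - q(state, action)) ** 2


# Example with a toy quadratic Q-function (purely illustrative).
q = lambda s, a: -(a - 0.5 * s) ** 2
error = squared_bellman_error(q, state=1.0, action=0.4, next_state=0.9,
                              candidate_actions=[0.1 * k for k in range(11)])
print(terminal_value(True, 8.0), round(error, 4))
```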

       

    2.3.1 Implementation of Predictive Model

    Martinez et al. [4] proposed the use of a

    predictive model as part of the

    Reinforcement Learning algorithm. The


    most important requirement of a predictive

    model is the ability to generalise properly

    during learning. Otherwise the Value

    Function may provide a poor control policy.

    This problem is even worse for batch

    processes showing significant batch-to-

    batch variations in their behaviour (e.g.

    polymerisation reactors). In general, the use

    of pure inductive models (e.g. regression

    models) is quite prone to poor predictions,

    depending on the richness of experimental

    conditions contained in the training data

    set. The idea of hybrid modelling (Psichogios

    and Ungar [18]; Thompson and Kramer

[19]; Tsen et al. [3]) gives a different approach to improving the prediction

    capability of an inductive model. Two

    important requirements for any hybrid

    prediction model to be used in batch

    process optimization are:

    1.  The strong support of prediction in

    experimental data points.

    2.  The ability to exploit qualitative

information from an imperfect process model based on first principles or other

    sources of knowledge.

    Process models based on first principles,

    although numerically imprecise, are able to

    capture the qualitative trends of process

    variables quite well. Martinez et al. [4]

    recognized the potential of using hybrid

    predictive models in Reinforcement

    Learning, and adapted a hybrid schemeproposed by Tsen et al. [3]. The proposed

predictive model makes use of an imperfect process model f_mod to introduce correction terms based on local process trends. The state change from s_t^a to s_{t+1}^a caused by action a_t^a is extrapolated from an experimental pair (s_t^e, a_t^e) as follows:

$$\hat{s}_{t+1}^{\,a} = f_{reg}(s_t^e, a_t^e) + \left.\frac{\partial f_{mod}}{\partial s_t}\right|_{(s_t^e, a_t^e)} \left(s_t^a - s_t^e\right) + \left.\frac{\partial f_{mod}}{\partial a_t}\right|_{(s_t^e, a_t^e)} \left(a_t^a - a_t^e\right) \qquad (8)$$

where f_reg stands for the inductive model (local regression model). The main advantage of Equation 8 is that partial

    derivatives with respect to states and

    actions are calculated using first principles

    incorporated into the imperfect process

model. Hence, for example, a process model based on nominal parameters could be used to approximately calculate the expected change in the Performance Index when action a_t is taken and the process state is s_t + Δs_t in place of s_t. This would significantly

    improve the prediction capability of the

    regression model. If state-action data pairs

    are assumed to be stratified with regard to

    the stage-wise decision procedure, there

    will exist several pairs with the same index

    “t ” that are able to provide a prediction for

    the state at “t +1”. Furthermore, the average

    from extrapolations corresponding to all

those experimental pairs having the same index t is used. This average is defined by

    weighting each individual extrapolation

(Equation 9) with a factor that measures the proximity of the corresponding experimental condition to (s_t^e, a_t^e). The next state prediction is then calculated using:

$$\hat{s}_{t+1} = \sum_{j \in J_t} \omega_j \, \hat{s}_{t+1}^{\,(j)} \qquad (9)$$

where J_t is the set of all experimental pairs associated with the t-th decision stage, \hat{s}_{t+1}^{(j)} is the extrapolation of Equation 8 based on the j-th experimental pair, and each ω_j is a normalized weighting factor defined as

$$\omega_j = \frac{\sqrt{\left(\dfrac{1}{\left|s_t^a - s_t^{\,j}\right|}\right)\left(\dfrac{1}{\left|a_t^a - a_t^{\,j}\right|}\right)}}{\displaystyle\sum_{k \in J_t}\sqrt{\left(\dfrac{1}{\left|s_t^a - s_t^{\,k}\right|}\right)\left(\dfrac{1}{\left|a_t^a - a_t^{\,k}\right|}\right)}} \qquad (10)$$
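A compact sketch of this hybrid prediction scheme is given below. The scalar states and actions, the finite-difference evaluation of the model derivatives, the stand-in models f_mod and f_reg, and the inverse-distance style weights (matching the reconstructed Equation 10) are all assumptions for illustration, not the authors' code.

```python
# Sketch of the hybrid predictive model (Equations 8-10): extrapolate from each
# experimental pair using derivatives of an imperfect first-principles model,
# then combine the extrapolations with proximity-based weights.
# Scalar states/actions and finite-difference derivatives are simplifying assumptions.

import math


def f_mod(s, a):
    """Imperfect first-principles model of the next state (illustrative form)."""
    return 0.9 * s + 0.5 * a


def hybrid_predict(s, a, experimental_pairs, f_reg, h=1e-4, eps=1e-9):
    """experimental_pairs: list of (s_e, a_e) measured at the same decision stage.
    f_reg(s_e, a_e): inductive (local regression) prediction of the next state."""
    extrapolations, weights = [], []
    for s_e, a_e in experimental_pairs:
        # Equation 8: regression prediction corrected by model-based derivatives.
        dfds = (f_mod(s_e + h, a_e) - f_mod(s_e - h, a_e)) / (2 * h)
        dfda = (f_mod(s_e, a_e + h) - f_mod(s_e, a_e - h)) / (2 * h)
        extrapolations.append(f_reg(s_e, a_e) + dfds * (s - s_e) + dfda * (a - a_e))
        # Equation 10 (assumed form): weight grows as the pair gets closer to (s, a).
        weights.append(math.sqrt(1.0 / (abs(s - s_e) + eps) * 1.0 / (abs(a - a_e) + eps)))
    total = sum(weights)
    # Equation 9: weighted average of the individual extrapolations.
    return sum(w * x for w, x in zip(weights, extrapolations)) / total


pairs = [(0.50, 0.30), (0.62, 0.35), (0.45, 0.40)]
f_reg = lambda s_e, a_e: 0.88 * s_e + 0.55 * a_e   # stand-in regression model
print(round(hybrid_predict(0.55, 0.33, pairs, f_reg), 4))
```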


    2.3.2 Wire fitting function

    Wire Fitting (Baird and Klopf [20]) is a

    function approximation method, which is

specifically designed for self-learning control

    problems where simultaneous learning and

    fitting of a function takes place. It is

    particularly useful to Reinforcement

    Learning systems, due to the following

    reasons: It provides a way for approximating

    the Value Function; It allows the maximum

    value function to be calculated quickly and

    easily.

    Wire Fitting is an efficient way of

    representing a function because it fits

    surfaces using wires as shown in Figure 3. In

addition, it is even more attractive for use in Reinforcement Learning algorithms because it reduces the computational requirements even further by providing the Value Function directly.

Figure 3: Wire Fitting using 3 wires

    The function Q (s,a) is supported by 3 wires,

    although any number of wires could be

    specified. Taking a 2D slice of Figure 3

    shows the 3 wires, represented by the 3

    black circles, which are called support

points, as shown in Figure 4.

Figure 4: Two-dimensional slice of Wire Fitting surface

    Baird and Klopf [20] suggested the use of

    any general function approximation system

    for the "Learned Function" shown in Figure

    5, to learn the relation between different

states and control points (a_1(s), Q_1(s), a_2(s), Q_2(s), a_3(s), Q_3(s)).

    Figure 5: Wire Fitting Architecture

    The Interpolated Function is then defined by

    a weighted-nearest-neighbour interpolation

    of the control points as follows:

$$Q(s, a) = \frac{\displaystyle\sum_{i=1}^{m} \frac{Q_i(s)}{\left|a - a_i(s)\right|^2 + c\left(\max_j Q_j(s) - Q_i(s)\right)}}{\displaystyle\sum_{i=1}^{m} \frac{1}{\left|a - a_i(s)\right|^2 + c\left(\max_j Q_j(s) - Q_i(s)\right)}} \qquad (11)$$

where the constant c determines the amount of smoothing for the approximation and m defines the number of control wires.
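The sketch below implements this interpolation for a scalar action, following the reconstructed form of Equation 11; the small constant added to the denominator to avoid division by zero at a support point is an implementation assumption.

```python
# Sketch of the Wire Fitting interpolation (Equation 11, as reconstructed above):
# a weighted-nearest-neighbour interpolation through m control points (wires).

def wire_fit_q(a, wires, c=0.1, eps=1e-12):
    """a: candidate action (scalar).
    wires: list of (a_i, Q_i) control points supplied, for a given state s,
    by the learned function (e.g. a neural network)."""
    q_max = max(q_i for _, q_i in wires)
    num = den = 0.0
    for a_i, q_i in wires:
        # Distance term plus smoothing towards the best wire; eps avoids 0/0
        # when the query action coincides exactly with a support point.
        d = (a - a_i) ** 2 + c * (q_max - q_i) + eps
        num += q_i / d
        den += 1.0 / d
    return num / den


# Three wires for some state s, as in Figures 3 and 4.
wires = [(0.2, 3.0), (0.5, 7.5), (0.9, 5.0)]
print(round(wire_fit_q(0.5, wires), 3))   # ~7.5: the surface passes through each wire
print(round(wire_fit_q(0.7, wires), 3))   # smooth value between the wires
```

Because the smoothing term vanishes for the wire with the largest Q_i, the interpolated surface passes (essentially) through the highest support point, so the maximum of the Value Function and its action can be read off directly from the wires, which is what makes the scheme attractive for Reinforcement Learning.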

    Baird and Klopf [20] then suggested the use

    of arbitrary values for the control points at

    the beginning of the learning process. This

    results in a long duration for the learning

    process, which is impractical in industrial

operations where quick achievement and optimization of process goals is required.

    2.3.3 Calculations of Mean Squared

    Bellman Error

    To calculate the mean squared Bellman

    error for states at this stage, both the values

of PI* and Q*(s, a*) are required (Equation

    6) for each state at T-1 in the current

    training data set. The value PI* represents

    the optimal Value Function for a state at T-1

    and is calculated as shown in Figure 6 using


    the predictive model (PM1) and running an

    optimization routine (given the constraints)

    between the maximum and minimum

    values of the action (a min to a max).

Figure 6: Calculation of optimal value function (PI*) and optimal action (a*) given a state at T-1, where PM1 is the predictive model for states at stage T-1 to T (search between the minimum and maximum value of the action, a_min to a_max, using PM1).

    The optimization results in identification of

    the best action a* which would lead to the

    optimal value function PI* for each state at

    T-1. The next step is to calculate the Value

    Function, Q*(s, a*), which represents the

    current Wire Fitting approximation of PI*

    for states at T-1. This is achieved by

    supplying the optimal action (a*), together

with the current state to the Wire Fitting approximation function as input (Figure 7).

Figure 7: Calculation of optimal value function approximation using Wire Fitting for state s and optimal action a* at T-1 (the Neural Network supplies the control points, the predictive model PM1 supplies the support values Q_1(s), Q_2(s), Q_3(s), and the Interpolation Function of Equation 11 returns Q*(s, a*)).

    The calculation of the Value Functions Q 1(s),

    Q 2(s), and Q 3(s) in Figure 7 is further

demonstrated in Figure 8. Let us take a state at T-1:
1. The Neural Network will supply 3 actions (a_1, a_2, a_3, shown in Figure 8).
2. The predictive model will supply 3 final states (S_T1, S_T2, S_T3).
3. Consequently, given the 3 final states, the 3 value functions (Q_1(s), Q_2(s), and Q_3(s)) can be calculated by evaluating the corresponding performance indices.

Figure 8: Calculation of support points (Q_1(s), Q_2(s), and Q_3(s)) for use in Wire Fitting approximation of optimal value function for states at T-1

    A similar procedure is used to calculate the

mean squared Bellman error for all states

    at stages T-2 and T-3.
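The procedure of Figures 6-8 can be sketched for a single state at T-1 as follows; the stand-in predictive model and Performance Index, the grid search over the action range (in place of a dedicated optimization routine), and the three candidate actions are all assumptions used only for illustration.

```python
# Sketch of the Figure 6-8 procedure for one state at T-1: search the action
# range through the predictive model to obtain the target PI* and optimal action
# a*, evaluate the Wire Fitting support points, and compare the interpolated
# Q*(s, a*) with PI* to obtain the squared Bellman error for that state.
# The predictive model, PI and grid search are illustrative stand-ins.

def wire_fit_q(a, wires, c=0.1, eps=1e-12):
    """Wire Fitting interpolation (Equation 11), repeated here for completeness."""
    q_max = max(q_i for _, q_i in wires)
    num = den = 0.0
    for a_i, q_i in wires:
        d = (a - a_i) ** 2 + c * (q_max - q_i) + eps
        num += q_i / d
        den += 1.0 / d
    return num / den


def pm1(state, action):
    """Stand-in one-step predictive model from stage T-1 to T."""
    return state + 0.8 * action - 0.3 * action ** 2


def performance_index(final_state):
    """Stand-in PI evaluated on the predicted final state."""
    return 10.0 * final_state


def optimal_target(state, a_min=0.0, a_max=1.5, n=151):
    """Figure 6: search between a_min and a_max for the action maximising PI."""
    grid = [a_min + (a_max - a_min) * k / (n - 1) for k in range(n)]
    a_star = max(grid, key=lambda a: performance_index(pm1(state, a)))
    return a_star, performance_index(pm1(state, a_star))            # (a*, PI*)


state = 0.4
a_star, pi_star = optimal_target(state)
nn_actions = [0.2, 0.7, 1.2]                                         # supplied by the neural network
wires = [(a, performance_index(pm1(state, a))) for a in nn_actions]  # Figure 8 support points
q_star = wire_fit_q(a_star, wires)                                   # Figure 7: Q*(s, a*)
print(round(a_star, 2), round(pi_star, 2), round(q_star, 2), round((pi_star - q_star) ** 2, 4))
```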

    2.3.4 Back Propagation of Mean Squared

    Bellman Error

    After the calculation of the mean squared

    Bellman errors, the weights and biases in

    the Neural Network are modified as follows:

$$\Delta w = -\eta \left( \frac{\partial E_B}{\partial w} \right) \qquad (12)$$

using the chain rule,

$$\frac{\partial E_B}{\partial w} = \frac{\partial E_B}{\partial Q} \cdot \frac{\partial Q}{\partial p} \cdot \frac{\partial p}{\partial w} \qquad (13)$$

where p denotes the control-point outputs of the Neural Network. Furthermore, the derivative of the Value Function with respect to the output of the Neural Network is obtained analytically by differentiation of Equation 11.

    Back propagation of mean squared Bellman

    error is performed initially for states at T-1

    only. Once training is achieved for this

    subset of state-action pairs, states at T-2 are


    added in the training data set. This is then

    finally followed by inclusion of the initial

    state. During this stage-wise training

    procedure, the back propagation of mean

    squared Bellman error (EB) is continued until

    it is below a certain tolerance for all the

    states in the training data set. Furthermore,

    each time a new experimental run is

    available, the whole procedure is repeated.

    MATLAB [21] is used to implement the

    algorithm.
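As an illustration of this training step, the sketch below adjusts a small parameter vector by gradient descent on the mean squared Bellman error, using numerical gradients in place of the analytical chain-rule derivatives of Equations 12-13. The toy parameterised value function, the learning rate, the tolerance and the target values are assumptions.

```python
# Sketch of gradient descent on the mean squared Bellman error (Equations 12-13),
# with numerical gradients standing in for the analytical chain-rule derivatives.
# The toy parameterised Q-function and the training targets are illustrative only.

def q_value(params, state, action):
    """Toy parameterised value function standing in for the wire-fitted network."""
    w0, w1, w2 = params
    return w0 + w1 * state + w2 * action


def mean_squared_bellman_error(params, training_set):
    """training_set: list of (state, action, target), where target is PI* or the
    best value reachable from the next stage."""
    errs = [(target - q_value(params, s, a)) ** 2 for s, a, target in training_set]
    return sum(errs) / len(errs)


def train(params, training_set, lr=0.3, tol=1e-6, h=1e-6, max_iters=20000):
    """Equation 12: w <- w - lr * dE_B/dw, with dE_B/dw estimated numerically."""
    for _ in range(max_iters):
        e0 = mean_squared_bellman_error(params, training_set)
        if e0 < tol:
            break
        grad = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += h
            grad.append((mean_squared_bellman_error(bumped, training_set) - e0) / h)
        params = [w - lr * g for w, g in zip(params, grad)]
    return params


# (state, action, target) triples for states at T-1, added first in the stage-wise procedure.
targets_T1 = [(0.2, 0.9, 8.0), (0.8, 0.3, 9.0), (0.5, 0.9, 7.5)]
params = train([0.0, 0.0, 0.0], targets_T1)
print([round(w, 2) for w in params],
      round(mean_squared_bellman_error(params, targets_T1), 8))
```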

    2.3.5 On-line Application of RL Algorithm

    Following the learning of the optimal value

    function for the current training data set,

    the RL algorithm calculates the optimal

    actions for states at different time intervals

    and the corresponding optimal value

    function. Given a state at a certain stage,

    the algorithm will perform a forward sweep

    and select the path that will lead to the

    highest resultant value function as shown in

    Figures 9 to 11 (The black dots show

    examples of the optimal route for sample

states at the different time intervals). It should be noted that the use of only the

    Neural Network and Predictive Models

    (PM1, PM2, and PM3) is required for on-line

    applications.

    Figure 9: Search for optimal path for states

    at T-1

    Figure 10: Search for optimal path for

    states at T-2

    Figure 11: Search for optimal path for

    states at T-3
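A sketch of this on-line forward sweep is shown below; the stand-in predictive models, the discrete grid of candidate actions, and the way the final value is scored are assumptions used only to illustrate the search over paths.

```python
# Sketch of the on-line forward sweep (Figures 9-11): from the current state,
# evaluate candidate action sequences through the chain of predictive models and
# pick the path with the highest resultant value (here, the final Performance Index).
# Predictive models, the action grid and the PI are illustrative stand-ins.

import itertools

ACTION_GRID = [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5]    # within an assumed feed-rate range


def pm(stage, state, action):
    """Stand-in predictive models PM1, PM2, PM3: one model per remaining stage."""
    gains = {0: 0.6, 1: 0.7, 2: 0.8}
    return state + gains[stage] * action - 0.2 * action ** 2


def performance_index(final_state):
    return 10.0 * final_state                            # stand-in PI on the final state


def forward_sweep(state, stage, n_stages=3):
    """Return (best value, best action sequence) from `stage` to the end of the batch."""
    best_value, best_actions = float("-inf"), None
    remaining = n_stages - stage
    for actions in itertools.product(ACTION_GRID, repeat=remaining):
        s = state
        for k, a in enumerate(actions):                  # roll the predictive models forward
            s = pm(stage + k, s, a)
        value = performance_index(s)
        if value > best_value:
            best_value, best_actions = value, actions
    return best_value, best_actions


value, actions = forward_sweep(state=0.3, stage=0)       # a state at T-3, as in Figure 11
print(round(value, 2), actions)
```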

    3 Case Study

    The case study considered by Martinez et al.

[4] is a semi-batch reactor. The main

    product is formed according to an

    autocatalytic reaction scheme, with a

kinetic mechanism for the main reaction as follows:

A + B → 2B   (14)

Meanwhile, a slower irreversible degradation of the product simultaneously takes place as follows:

B → C   (15)

The concentration of Product (B) can be

    measured fast enough and hence is used for

    the on-line optimization. As for the

    assessment of the impurity level, it is

analyzed in the final product only, because it is costly and time-consuming. The quality of the product is on-specification if less than

    2% of B is lost to impurity. Otherwise, it is a

    wasted batch and a penalty of -1 is applied.

    In addition, there is a preference for

achieving the maximum possible conversion of A.

    In order to guarantee the quality of the

    batch, the feed flow rate is changed

    according to intermediate measurement

    samples of product B formation. Three

    samples corresponding to accumulated

    total liquid volumes of V=0.2 Vf , V=0.4 Vf ,

    and V=0.6 Vf   are taken to measure the

    concentration of B. The result of each

    sample is only available with a delay of 30



    minutes. Other relevant data for the case

study is provided in Table 1. The process

    goal is then defined as achieving the

    product within specifications in less than 5

hours, with a conversion higher than 90%

    for the reactant A fed. Once the goal is

achieved, the Performance Index (PI) is

    defined to have 3 units for each additional

    percent conversion over 90% obtained and

    1 unit for each hour reduction with respect

    to the maximum reaction time. Hence, if an

    on-specification product is obtained in 3

    hours and with 92 % of reactant conversion,

    then the goal is achieved and PI=8.
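As a check on this definition, the small function below computes the PI from the stated rule (3 units per percentage point of conversion above 90% and 1 unit per hour saved relative to the 5-hour maximum); the explicit -1 branch for an off-specification or late batch follows Equation 5 and is otherwise an assumption.

```python
# Performance Index as described above: 3 units per extra percent conversion
# over 90% and 1 unit per hour saved relative to the 5-hour maximum batch time.
# The -1 branch for a wasted (off-specification) batch follows Equation 5.

def performance_index(conversion_pct, batch_time_h, on_spec,
                      min_conversion=90.0, max_time_h=5.0):
    goal_achieved = (on_spec and conversion_pct >= min_conversion
                     and batch_time_h <= max_time_h)
    if not goal_achieved:
        return -1.0
    return 3.0 * (conversion_pct - min_conversion) + 1.0 * (max_time_h - batch_time_h)


# Worked example from the text: on-spec product in 3 hours at 92% conversion.
print(performance_index(92.0, 3.0, on_spec=True))   # 3*2 + 1*2 = 8.0
```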

Table 1: Semi-batch reactor case study

Initial conditions: Vol. = 0.5 m3; CA = 1.92 kmol m-3; CB = 0.550 kmol m-3
Reactor feed: CA = 1.42 kmol m-3; CB = 0.75 kmol m-3
Kinetic parameters: k1 = k11 exp(-E1/T) [m6 h-1 kmol-2]; k2 = k22 exp(-E2/T) [h-1]
Nominal parameters: k1 = 10 exp(-1,000/T) [m6 h-1 kmol-2]; k2 = 2.2×10^15 exp(-13,400/T) [h-1]
Operating constraints: Feed rate ≤ 1.5 m3 h-1; Temperature ≤ 50 °C; Max. volume = 5 m3
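For readers who wish to reproduce a comparable environment, the sketch below integrates a plausible semi-batch reactor model built from Table 1. The rate expressions (r1 = k1·CA·CB², r2 = k2·CB, inferred from the units of k1 and k2), isothermal operation at 50 °C, the constant feed rate, the explicit Euler integration and the impurity metric are all assumptions rather than the published model.

```python
# Sketch of a semi-batch reactor simulation consistent with Table 1.
# The rate laws r1 = k1*CA*CB**2 and r2 = k2*CB are inferred from the units of
# k1 [m6 h-1 kmol-2] and k2 [h-1]; isothermal operation at 50 degC, the explicit
# Euler integration and the constant feed rate are simplifying assumptions.

import math

T_K = 50.0 + 273.15                                    # operating temperature [K]
k1 = 10.0 * math.exp(-1000.0 / T_K)                    # [m6 h-1 kmol-2]
k2 = 2.2e15 * math.exp(-13400.0 / T_K)                 # [h-1]


def simulate(feed_rate, t_end=5.0, dt=1e-3):
    """Integrate mole balances for A, B and the impurity C in a fed-batch reactor."""
    V, CA, CB, CC = 0.5, 1.92, 0.550, 0.0              # initial conditions (Table 1)
    CA_feed, CB_feed = 1.42, 0.75                       # reactor feed (Table 1)
    nA_fed = V * CA                                     # total A charged, for conversion
    t = 0.0
    while t < t_end and V < 5.0:                        # stop at the 5 m3 volume limit
        r1 = k1 * CA * CB ** 2                          # autocatalytic main reaction (Eq. 14)
        r2 = k2 * CB                                    # product degradation (Eq. 15)
        dCA = (feed_rate / V) * (CA_feed - CA) - r1
        dCB = (feed_rate / V) * (CB_feed - CB) + r1 - r2
        dCC = -(feed_rate / V) * CC + r2
        V, CA, CB, CC = (V + feed_rate * dt, CA + dCA * dt,
                         CB + dCB * dt, CC + dCC * dt)
        nA_fed += feed_rate * CA_feed * dt
        t += dt
    conversion = 100.0 * (1.0 - V * CA / nA_fed)
    impurity_pct = 100.0 * CC / (CB + CC)               # rough measure of B lost to impurity
    return t, V, conversion, impurity_pct


print([round(x, 3) for x in simulate(feed_rate=1.0)])
```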

    4. Results and Discussion

    Using the same initial training data set of

    six batch runs used by Martinez et al. [4],

    the RL algorithm was executed for an

    additional 26 batch runs. The results

    obtained from implementing the predictive

    models are shown in Figure 12, together

    with the results obtained by Martinez et al.

    [4]. The same final PI value is approximately

achieved by both RL applications, with similar trends followed, though the RL

    algorithm developed here manages to

    converge more steadily.

    The initial training data set used by

    Martinez et al. [4] contained a very good

    batch run with a Performance Index equal

    to 15.09. Hence it could be argued that the

    RL algorithm was not actually incrementally

    improving (or learning), but was rather

    interpolating between previous good batch

    runs. So the best batch run in the training

    data set was replaced with another batch

    run, which leads to the initial training

data set having a maximum Performance Index value equal to 9.15.

    The implementation of the RL algorithm

was then repeated for the new training data set, which shows how the RL algorithm manages

    to achieve a much higher value of the

    Performance Index of 11.36 in comparison

Figure 12: Validation of RL Methodology (Performance Index vs. number of batches; series: results produced by Martinez [4] and results obtained from implementing the RL methodology)


    Figure 13: Performance Index as a function of number of additional batch runs (best batch run

    in initial training data set results in a PI equal to 11.36)

    to the previous best batch run in the

training data set (Figure 13). This is significant in that it demonstrates the capability

    of the RL algorithm to improve beyond the

    level observed in previous batches.

    5. Conclusion

    Reinforcement Learning provides a new

    approach towards the full automation of

    stochastic discrete systems such as batch

    chemical processes. Those systems provide

clear challenges due to their inherent transient behavior, unavailability of online

measurements, and delays in measurements, all of which degrade the

    performance of any process control system.

    This work presents an in-depth description

    of RL for direct application of the algorithm.

    Furthermore, MATLAB implementation of

    the RL approach was validated against a

published case study [4]. The results have shown good agreement with literature in

    addition to further assurance of

    improvement in performance beyond the

    best run already presented in the initial

    training data set.

    References

    1.  D. Bonvin, Optimal Operation of Batch

    Reactors: A Personal View, J. Process

    Control. 8 (1998) 355-368.

2. P. Terwiesch, M. Agarwal and D.W.T. Rippin, Batch Unit Optimization with Imperfect Modeling: a Survey, J. Process Control. 4 (1994) 238-258.

    3.  A.Y. Tsen, S.S. Jhang, D.S. Wong, and B.

    Joseph, Predictive Control of Quality in

    Batch Polymerization using ANN Models,

    AIChE Journal. 42 (1996) 455-465.

    4.  E.C. Martinez, J.A. Wilson and M.A.

    Mustafa, An Incremental Learning

    Approach to Batch Unit Optimisation,

    The 1998 IChemE Research Event,

    Newcastle, UK.

    5.  E.C. Martinez, R.A. Pulley and J.A.

    Wilson, Learning to Control the

    Performance of Batch Processes, Chem.

    Eng. Res. Des., 76 (1998) 711-722.

    6. 

    E.C. Martinez and J.A. Wilson, A Hybrid

    Neural Network First Principles Approach

    to Batch Unit Optimisation, Comput.

    Chem. Eng., Suppl. 22 (1998) S893-S896.

    7.  R.S. Sutton and A.G. Barto,

    Reinforcement Learning: An

    Introduction, The MIT Press, Cambridge,

    Massachusetts, London, England, 1998.

    8.  E.L. Thorndike, Animal Intelligence,

Hafner, Darien, Conn., 1911.



9. R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.

    10.  R.E. Bellman, A Markov Decision Process,

    J. Math. Mech. 6 (1957) 679-684.

    11.  R. Howard, Dynamic Programming and

    Markov Processes, MIT Press,

    Cambridge, MA, USA, 1960.

    12.  D.P. Bertsekas, Dynamic Programming

    and Optimal Control, Athena Scientific,

    Belmont, MA, USA, 1995.

13. M.L. Puterman, Markov Decision Processes, Wiley, New York, USA, 1994.

    14.  S. Ross, Introduction to Stochastic

    Dynamic Programming, Academic Press,

    New York, USA, 1983.

    15.  A.E. Bryson, Optimal control-1950 to

    1985, IEEE Control Systems Magazine. 16

    (1996) 26-33.

16. C.J.C.H. Watkins, Learning from Delayed

    Rewards, PhD thesis, Cambridge

    University, UK, 1989.

17. P.J. Werbos, Building and

    Understanding Adaptive Systems: A

    Statistical/Numerical Approach to

    Factory Automation and Brain Research,

    IEEE Transactions on Systems, Man, and

    Cybernetics. 17 (1987) 7-20.

    18.  D.C. Psichogios and L.H. Ungar, A Hybrid

    Neural Network First Principles Approach

    to Process Modelling, AIChE Journal. 38

    (1992) 1499-1511.

    19. 

    M.L. Thompson and M.A. Kramer,

    Modeling Chemical Processes using Prior

    Knowledge and Neural Networks, AIChE

Journal. 40 (1994) 1328-1340.
20. L.C. Baird and A.H. Klopf, Reinforcement Learning with High-dimensional Continuous Actions, Technical Report WL-

    TR-93-1147, Wright Laboratory, Wright

    Patterson Air Force Base, 1993.

    21.  MATLAB: Version 4 User's Guide. The

    Math Works Inc, Natick, Massachusetts,

    USA, 1995.