Hindawi Publishing Corporation, Journal of Applied Mathematics, Volume 2014, Article ID 238357, 8 pages
http://dx.doi.org/10.1155/2014/238357

Research Article
Intelligent Inventory Control via Ruminative Reinforcement Learning

Tatpong Katanyukul (1) and Edwin K. P. Chong (2)

(1) Faculty of Engineering, Khon Kaen University, Computer Engineering Building, 123 Moo 16, Mitraparb Road, Muang, Khon Kaen 40002, Thailand
(2) Department of Electrical and Computer Engineering, Colorado State University, 1373 Campus Delivery, Fort Collins, CO 80523-1373, USA

Correspondence should be addressed to Tatpong Katanyukul; tatpong@gmail.com

Received 30 December 2013; Accepted 29 May 2014; Published 7 July 2014

Academic Editor: Aderemi Oluyinka Adewumi

Copyright © 2014 T. Katanyukul and E. K. P. Chong. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Inventory management is a sequential decision problem that can be solved with reinforcement learning (RL). Although RL in its conventional form does not require domain knowledge, exploiting such knowledge of problem structure, usually available in inventory management, can be beneficial to improving the learning quality and speed of RL. Ruminative reinforcement learning (RRL) has been introduced recently based on this approach. RRL is motivated by how humans contemplate the consequences of their actions in trying to learn how to make a better decision. This study further investigates the issues of RRL and proposes new RRL methods applied to inventory management. Our investigation provides insight into different RRL characteristics, and our experimental results show the viability of the new methods.

1. Introduction

Inventory management is a crucial business activity and can be modeled as a sequential decision problem. Bertsimas and Thiele [1], among others, addressed the need for an efficient and flexible inventory solution that is also simple to implement in practice. This may be among the reasons for the extensive studies of reinforcement learning (RL) applied to inventory management.

RL [2, 3] is an approach to solving sequential decision problems based on learning the underlying state value or state-action value. Relying on a learning mechanism, RL in its typical form does not require knowledge of the structure of the problem. Therefore, RL has been studied in a wide range of sequential decision problems, for example, virtual machine configuration [4], robotics [5], helicopter control [6], ventilation, heating, and air conditioning control [7], electricity trade [8], financial management [9], water resource management [10], and inventory management [11]. Acceptance of RL is credited to RL's effectiveness, potential possibilities [12], link to mammal learning processes [13, 14], and its model-free property [15].

Despite fascination with RL's model-free property, most inventory management problems can naturally be formulated into a well-structured part interacting with another part that is less understood. That is, replenishment cost, holding cost, and penalty cost can be determined precisely in advance. On the other hand, customer demand or, in some cases, delivery time or availability of supplies is usually less predictable. However, once the value of a less predictable variable is known, the period cost can be determined precisely. Specifically, a warehouse would know its period inventory cost after its replenishment has arrived and all demand orders in the period have been observed. Calculation of the period cost is a well-defined formula, while the other part, for example, demand, is less predictable. Knowledge about the well-structured part can be exploited, while a learning mechanism can be used to handle the less understood part.

Utilizing this knowledge, Kim et al. [16] proposed asynchronous action-reward learning, which used simulation to evaluate the consequences of actions not taken in order to accelerate the learning process in a stateless system. Extending the idea to a state-based system, Katanyukul [17] developed ruminative reinforcement learning (RRL) methods, that is, ruminative SARSA (RSarsa) and policy-weighted RSarsa (PRS). The RRL approach is motivated by how humans contemplate the consequences of their actions to improve their learning, hoping to make a better decision. His study of RRL reveals good potential of the approach. However, the existing individual methods show strengths in different scenarios: RSarsa is shown to learn fast but leads to inferior learning quality in the long run, whereas PRS is shown to lead to superior long-run learning quality but at a slower rate.

Our proposed method here is developed to exploit the fast learning characteristic of RSarsa and the good long-run learning quality of PRS. Our experimental results show the effectiveness of the proposed method and support the assumption underlying the development of RRL.

2. Background

An objective of sequential inventory management is to minimize a long-term cost $C_0(s_0) = \min_{a_0, \ldots, a_T} \sum_{t=0}^{T} \gamma^t E[c_t \mid s_0, a_0, \ldots, a_T]$, subject to $a_t \in A_{s_t}$ for $t = 0, \ldots, T$, where $E[c_t \mid s_0, a_0, \ldots, a_T]$ is the expected period cost of period $t$ given an initial state $s_0$ and actions $a_0, \ldots, a_T$ over periods 0 to $T$, respectively; $\gamma$ is a discount factor; and $A_{s_t}$ is a feasible action set at state $s_t$. Under certain assumptions, the problem can be posed as a Markov decision problem (MDP) (see [15] for details). In this case, what we seek is an optimal policy, which maps each state to an optimal action. Given an arbitrary policy $\pi$, the long-term state cost for that policy can be written as

$$C^{\pi}(s) = r^{\pi}(s) + \gamma \sum_{s'} p^{\pi}(s' \mid s)\, C^{\pi}(s'), \quad (1)$$

where $r^{\pi}(s)$ is an expected period state cost and $p^{\pi}(s' \mid s)$ is a transition probability, that is, the probability of the next state being $s'$ when the current state is $s$. The superscript $\pi$ indicates dependence on the policy $\pi$. In practice, an exact solution to (1) is difficult to find. Reinforcement learning (RL) [2] provides a framework to find an approximate solution. An approximate long-term cost of state $s$ is obtained by summation of the period cost and the long-term cost of the next state, $Q(s) = r(s) + \gamma Q(s')$.

The RL approach is based on temporal difference (TD) learning, which uses the temporal difference error $\psi$, Eq. (2), to estimate the long-term cost, Eq. (3):

$$\psi = r + \gamma Q(s', a') - Q(s, a), \quad (2)$$

$$Q^{\text{(new)}}(s, a) = Q^{\text{(old)}}(s, a) + \alpha \cdot \psi, \quad (3)$$

where $r$ is the period cost that corresponds to taking action $a$ in state $s$, $\alpha$ is a learning rate, and $s'$ and $a'$ are the state and action taken in the next period, respectively.

Once the values of $Q(s, a)$ are thoroughly learned, they are good approximations of long-term costs. We often refer to $Q(s, a)$ as the "Q-value." Most RL methods determine the actions to take based on Q-values. These methods include SARSA [2], a widely used RL algorithm. We use SARSA as a benchmark representing a conventional RL method to compare with the other methods under investigation. In each period, given the observed state $s$, the action taken $a$, the observed period cost $r$, the observed next state $s'$, and the anticipated next action $a'$, the SARSA algorithm updates the Q-value based on TD learning, Eqs. (2) and (3).

[Figure 1: SARSA agent and interacting variables (this figure is adapted from Figure 6.15 of Sutton and Barto [2]).]

Based on the Q-value, we can define a policy $\pi$ to determine an action to take at each state. The policy is usually stochastic, defined by a probability $p(a \mid s)$ of taking an action $a$ given a state $s$. The policy has to balance between taking the best action based on the currently learned Q-value and trying another alternative. Trying another alternative gives the learning agent a chance to explore the consequences of its state-action space thoroughly. This helps to create a constructive cycle of improving the quality of the learned Q-values, which in turn helps the agent to choose better actions and reduces the chance of getting stuck in a local optimum. This is the issue of balancing exploitation and exploration, as discussed in Sutton and Barto [2]. (Since the RL algorithm is autonomous and interacts with its environment, we sometimes use the term "learning agent.")

An $\epsilon$-greedy policy is a common RL policy that is also easy to implement. With probability $\epsilon$, the policy chooses an action randomly from $a \in A(s)$, where $A(s)$ is the set of allowable actions given state $s$. Otherwise, it takes the action corresponding to the minimal current Q-value, $a^* = \arg\min_a Q(s, a)$.
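To make the update concrete, the following is a minimal Python sketch of one SARSA step with an $\epsilon$-greedy policy, written for the cost-minimization convention used here. The tabular dictionary `Q`, the `actions` list, and the function names are illustrative assumptions, not the authors' implementation; the constants are the values reported in Section 5.

```python
import random
from collections import defaultdict

# Learning constants taken from Section 5 of the paper.
GAMMA, ALPHA, EPSILON = 0.8, 0.7, 0.2

# Tabular Q-value keyed by (state, action); an illustrative stand-in for the
# tile-coding representation described later in the paper.
Q = defaultdict(float)

def epsilon_greedy(state, actions):
    """With probability EPSILON pick a random action, otherwise the minimal-cost action."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return min(actions, key=lambda a: Q[(state, a)])

def sarsa_step(s, a, r, s_next, actions):
    """One SARSA update, Eqs. (2)-(3): psi = r + gamma*Q(s',a') - Q(s,a); Q(s,a) += alpha*psi."""
    a_next = epsilon_greedy(s_next, actions)
    psi = r + GAMMA * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += ALPHA * psi
    return a_next, psi   # a_next becomes the action actually taken in the next period
```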

3. Ruminative Reinforcement Learning

The conventional RL approach, SARSA, assumes that the agent knows only the current state $s$, the action $a$ it takes, the period cost $r$, the next state $s'$, and the action $a'$ it will take in the next state. Each period, the SARSA agent updates the Q-value based on the TD error calculated with these five variables. Figure 1 illustrates the SARSA agent, the five variables it needs to update the Q-value, and its interaction with its environment.

[Figure 2: Environment, knowledge of its structure, and rumination.]

However, in inventory management problems, we usually have extra knowledge about the environment. That is, the problem structure can naturally be formulated such that the period cost $r$ and the next state $s'$ are determined by a function $k: s, a, \xi \mapsto r, s'$, where $\xi$ is an extra information variable. This variable $\xi$ captures the stochastic aspect of the problem. The process generating $\xi$ may be unknown, but the value of $\xi$ is fully observable after the period is over. Given a value of $\xi$, along with $s$ and $a$, the deterministic function $k$ can precisely determine $r$ and $s'$.

Without this extra knowledge, each period the SARSA agent updates only the one value of $Q(s, a)$ corresponding to the current state $s$ and the action taken $a$. However, with the function $k$ and an observed value of $\xi$, we can do "rumination": evaluating the consequences of other actions $a$, even those that were not taken. Figure 2 illustrates rumination and its associated variables. Given the rumination mechanism, we can provide the information required by SARSA's TD calculation for any underlying action. Katanyukul [17] introduced this rumination idea and incorporated it into the SARSA algorithm, resulting in the ruminative SARSA (RSarsa) algorithm. Algorithm 1 shows the RSarsa algorithm. It should be noted that RSarsa is similar to SARSA but with the inclusion of rumination from line 8 to line 13.

The experiments in [17] showed that RSarsa performed significantly better than SARSA in early periods (indicating faster learning), but its performance was inferior to SARSA in later periods (indicating poor convergence to the appropriate long-term state cost approximation). Katanyukul [17] attributed RSarsa's poor long-term learning quality to its lack of natural action visitation frequency.

TD learning, Eqs. (2) and (3), updates the Q-value as an approximation of the long-term state cost. The transition probability $p^{\pi}(s' \mid s)$ in (1) does not appear explicitly in the TD learning calculation. Conventional RL relies on sampling trajectories to reflect the natural frequency of visits to state-action pairs corresponding to the transition probability. It updates only the state-action pairs as they are actually visited; therefore, it does not require explicit calculation of the transition probability and still eventually converges to a good approximation.

However, because RSarsa does rumination for all actions, ignoring their sampling frequency, this is equivalent to disregarding the transition probability, which leads to RSarsa's poor long-term learning quality.

Algorithm 1: RSarsa algorithm.
(L00) Initialize Q(s, a)
(L01) Observe s
(L02) Determine a by policy π
(L03) For each period:
(L04)   observe r, s′, and ξ
(L05)   determine a′ by policy π
(L06)   calculate ψ = r + γQ(s′, a′) − Q(s, a)
(L07)   update Q(s, a) ← Q(s, a) + α·ψ
(L08)   for each a ∈ A(s):
(L09)     calculate r, s′ with k(s, a, ξ)
(L10)     determine a′
(L11)     calculate ψ = r + γQ(s′, a′) − Q(s, a)
(L12)     update Q(s, a) ← Q(s, a) + α·ψ
(L13)   until ruminated all a ∈ A(s)
(L14)   set s ← s′ and a ← a′
(L15) until termination
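A sketch of the rumination loop, lines (L08)-(L13) of Algorithm 1, is given below. It assumes a deterministic period model `k(s, a, xi) -> (r, s')` is available, as described above, and reuses the `Q`, `epsilon_greedy`, `GAMMA`, and `ALPHA` names from the earlier SARSA sketch; it is an illustrative reading, not the authors' code.

```python
def ruminate(s, xi, actions, k):
    """Full rumination (RSarsa), lines (L08)-(L13) of Algorithm 1: replay the observed
    extra information xi against every allowable action, including those not taken."""
    for a_hat in actions:
        r_hat, s_hat_next = k(s, a_hat, xi)         # well-structured part: exact cost and next state
        a_hat_next = epsilon_greedy(s_hat_next, actions)
        psi_hat = r_hat + GAMMA * Q[(s_hat_next, a_hat_next)] - Q[(s, a_hat)]
        Q[(s, a_hat)] += ALPHA * psi_hat            # same TD update as SARSA, applied to the untaken action
```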

To address this issue, Katanyukul [17] proposed policy-weighted RSarsa (PRS). PRS explicitly calculates the probabilities of the actions to be ruminated and adjusts the weights of their updates. PRS is similar to RSarsa, but the rumination update (line 12 in Algorithm 1) is replaced by

$$Q(s, a) \leftarrow Q(s, a) + \beta \cdot \psi, \quad (4)$$

where $\beta = \alpha \cdot p(a)$ and $p(a)$ is the probability of taking action $a$ in state $s$ with policy $\pi$. Given an $\epsilon$-greedy policy, we have $p(a) = \epsilon/|A(s)|$ for $a \neq a^*$ and $p(a) = \epsilon/|A(s)| + (1 - \epsilon)$ for $a = a^*$, where $|A(s)|$ is the number of allowable actions. PRS has been shown to perform well in early and later periods compared to SARSA. However, RSarsa is reported to significantly outperform PRS in early periods.
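A possible rendering of the PRS correction follows: the ruminative update for an action is scaled by its $\epsilon$-greedy selection probability, as in Eq. (4). The helper names below are assumptions, and the sketch builds on the `Q`, `ALPHA`, and `EPSILON` definitions from the earlier sketches.

```python
def epsilon_greedy_prob(state, action, actions):
    """Probability p(a) of selecting `action` in `state` under the epsilon-greedy policy."""
    a_star = min(actions, key=lambda a: Q[(state, a)])
    p = EPSILON / len(actions)          # exploration share, equal for every action
    if action == a_star:
        p += 1.0 - EPSILON              # greedy action gets the remaining probability mass
    return p

def prs_update(s, a_hat, psi_hat, actions):
    """Eq. (4): Q(s,a) <- Q(s,a) + beta*psi, with beta = alpha * p(a)."""
    beta = ALPHA * epsilon_greedy_prob(s, a_hat, actions)
    Q[(s, a_hat)] += beta * psi_hat
```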

4. New Methods

According to the results of [17], although RSarsa may converge to a wrong approximation, RSarsa was shown to perform impressively in the very early periods. This suggests that if we jump-start the learning agent with RSarsa and then later switch to PRS, before the Q-values settle into bad spots, we may be able to achieve both faster learning and a good approximation in the long run.

PRSBeta. We first introduce a straightforward idea, called PRSBeta, where we use a varying ruminative learning rate as the mechanism to shift from full rumination (RSarsa) to policy-weighted rumination (PRS). As in PRS, the rumination update is determined by (4). However, the value of the rumination learning rate $\beta$ is determined by

$$\beta = \alpha \cdot [1 - (1 - p(a)) \cdot f], \quad (5)$$

where $f$ is a function having a value between 0 and 1. When $f \to 0$, $\beta \to \alpha$ and the algorithm behaves like RSarsa. When $f \to 1$, $\beta \to \alpha \cdot p(a)$ and the algorithm behaves like PRS. We want $f$ to start out close to 0 and grow to 1 at a proper rate. As our preliminary experiments show, the TD error gets smaller as the learning converges; this is in fact a property of TD learning. Given this property, we can use the magnitude of the TD error, $|\psi|$, to control the shifting, such that

$$f(\psi) = \exp(-\tau \cdot |\psi|), \quad (6)$$

where $\tau$ is a scaling factor. Figure 3 illustrates the effects of different values of $\tau$. Since the magnitude of $\tau$ should be relative to $|\psi|$, we set $\tau = |2/(r + Q(s, a))|$, so that the magnitude of $\tau$ is on a proper scale relative to $|\psi|$ and automatically adjusted.

[Figure 3: The function $f = \exp(-\tau|\psi|)$ and the effects of different $\tau$ values.]
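The shift from RSarsa-like to PRS-like behavior can be sketched as below: $\beta$ interpolates between $\alpha$ and $\alpha \cdot p(a)$ according to $f(\psi) = \exp(-\tau|\psi|)$ with $\tau = |2/(r + Q(s, a))|$. This is an illustrative reading of Eqs. (5)-(6), not the authors' code; it reuses the earlier helpers, the zero-division guard is our own assumption, and whether $\psi$ here is the observed or the ruminated TD error is likewise an assumption.

```python
import math

def prsbeta_rate(s, a_hat, psi, r, a_taken, actions):
    """Eqs. (5)-(6): beta = alpha*[1 - (1 - p(a))*f], f = exp(-tau*|psi|), tau = |2/(r + Q(s,a))|."""
    denom = r + Q[(s, a_taken)]
    tau = abs(2.0 / denom) if denom != 0 else 1.0   # guard against division by zero (assumption)
    f = math.exp(-tau * abs(psi))                   # f near 0 while TD errors are large, near 1 at convergence
    p = epsilon_greedy_prob(s, a_hat, actions)
    return ALPHA * (1.0 - (1.0 - p) * f)            # use in place of ALPHA in the rumination update
```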

RSarsaTD. Building on the PRSBeta method above, we next propose another method, called RSarsaTD. The underlying idea is that, since SARSA performs well in the long run (see [2] for a theoretical discussion of SARSA's optimality and convergence properties), after we speed up the early learning process with rumination we can simply switch back to SARSA. This approach utilizes the fast learning characteristic of full rumination in early periods and avoids its poor long-term performance. In addition, as the computational cost of rumination is proportional to the size of the ruminative action space $|A(s)|$, this also helps to reduce the computational cost incurred by rumination. It is also intuitively appealing in the sense that we do rumination only when we need it.

The intuition to do rumination selectively was introduced in [17], in an attempt to reduce the extra computational cost of rumination. There, the probability of doing rumination was a function of the magnitude of the TD error:

$$p(\text{rumination}) = 1 - \exp\left(-\left|\frac{2\psi}{r + Q(s, a)}\right|\right). \quad (7)$$

However, Katanyukul [17] investigated this selective rumination only with the policy-weighted method, calling it PRSTD. Although PRSTD was able to improve the computational cost of the rumination approach, the inventory management performance of PRSTD was reported to have mixed results, implying that incorporating selective rumination may deteriorate the performance of PRS.

This performance deterioration may be due to using $p(\text{rumination})$ together with the policy-weighted correction. Both schemes use $|\psi|$ to control the effect of rumination; therefore, they might have an effect equivalent to overcorrecting the state-transition probability. Unlike PRS, RSarsa does not correct the state-transition probability, so selective rumination (7) is then the only scheme controlling rumination with $|\psi|$. Therefore, we expect that this approach may retain the advantage of RSarsa's fast learning while maintaining the long-term learning quality of SARSA.
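RSarsaTD can then be read as plain SARSA plus a stochastic gate on full rumination: rumination is performed with the probability given by Eq. (7), so it fades out as the TD error shrinks. The sketch below reuses the helpers defined earlier (`ruminate`, `Q`, `math`, `random`) and is an assumed reading of the method, not reference code.

```python
def maybe_ruminate(s, a, r, psi, xi, actions, k):
    """RSarsaTD gate, Eq. (7): after the ordinary SARSA step, do full rumination
    with probability 1 - exp(-|2*psi / (r + Q(s,a))|)."""
    denom = r + Q[(s, a)]
    p_rum = 1.0 - math.exp(-abs(2.0 * psi / denom)) if denom != 0 else 1.0
    if random.random() < p_rum:
        ruminate(s, xi, actions, k)   # unweighted rumination, as in RSarsa
```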

5. Experiments and Results

Our study uses computer simulations to conduct numerical experiments on three inventory management problem settings (P1, P2, and P3). All problems are periodic-review, single-echelon, with nonzero setup cost. P1 and P2 have a one-period lead time; P3 has a two-period lead time. The same Markov model is used to govern all problem environments, but with different settings. For P1 and P2, the problem state space is $I \times \{0, I^+\}$ for the on-hand and in-transit inventories $x$ and $b^{(1)}$, respectively. P3's state space is $I \times \{0, I^+\} \times \{0, I^+\}$ for $x$ and the in-transit inventories $b^{(1)}$ and $b^{(2)}$. The action space is $\{0, I^+\}$ for the replenishment order $a$.

The state transition is specified by

$$x_{t+1} = x_t + b^{(1)}_t - d_t,$$
$$b^{(i)}_{t+1} = b^{(i+1)}_t, \quad \text{for } i = 1, \ldots, L - 1,$$
$$b^{(L)}_{t+1} = a_t, \quad (8)$$

where $L$ is the number of lead time periods.

The inventory period cost is calculated from the equation

$$r_t = K \cdot \delta(a_t) + G \cdot a_t + H \cdot x_{t+1} \cdot \delta(x_{t+1}) - B \cdot x_{t+1} \cdot \delta(-x_{t+1}), \quad (9)$$

where $K$, $G$, $H$, and $B$ are the setup, unit, holding, and penalty costs, respectively, and $\delta(\cdot)$ is a step function. Five RL agents are studied: SARSA, RSarsa, PRS, RSarsaTD, and PRSBeta.
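For concreteness, a minimal sketch of the period model $k(s, a, \xi) \mapsto (r, s')$ implied by Eqs. (8)-(9) follows. The state is taken as on-hand inventory followed by the in-transit pipeline, the demand plays the role of $\xi = d_t$, and the default cost parameters are the P2/P3 values from the problem settings below; the function name and layout are illustrative assumptions, not the authors' simulator.

```python
def period_model(state, action, demand, K=200, G=50, H=20, B=200):
    """Illustrative period model k(s, a, xi) -> (r, s') built from Eqs. (8)-(9).
    `state` is (x, b1, ..., bL): on-hand inventory followed by the in-transit
    pipeline, oldest order first."""
    x, *pipeline = state
    x_next = x + pipeline[0] - demand            # Eq. (8): oldest order arrives, demand is subtracted
    pipeline_next = pipeline[1:] + [action]      # pipeline shifts; the new order enters at the back
    r = (K * (action > 0)                        # setup cost K*delta(a_t)
         + G * action                            # unit replenishment cost G*a_t
         + H * x_next * (x_next > 0)             # holding cost on positive inventory
         - B * x_next * (x_next < 0))            # penalty cost on backorders (x_next negative)
    return r, (x_next, *pipeline_next)
```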

Table 1: Experimental results.

Line        SARSA    RSarsa         PRS             RSarsaTD        PRSBeta
Relative computation time/epoch
1  P1       1        30             26              5               30
2  P2       1        20             21              3               19
3  P3       1        31             29              6               31
Average cost of early periods
4  P1       8421     7619 (W)       8379 (p=0.43)   7597 (W)        7450 (W)
5  P2       4935     4606 (W)       4792 (p=0.06)   4685 (W)        4411 (W)
6  P3       10502    8694 (W)       9958 (p=0.20)   9390 (p=0.07)   8472 (W)
Average cost of later periods
7  P1       7214     7355 (p=0.68)  7051 (W)        7110 (p=0.11)   7010 (W)
8  P2       4308     4388 (p=0.90)  4248 (p=0.14)   4375 (p=0.84)   4194 (W)
9  P3       8613     8139 (p=0.29)  8312 (p=0.37)   8486 (p=0.43)   7664 (p=0.18)

Each experiment is repeated 10 times. In each repetition, an agent is initialized with all-zero Q-values. Then the experiment is run consecutively for $N_E$ episodes. Each episode starts with an initial state and action as follows: for all problems, $b^{(1)}_1$ and $a_1$ are initialized with values randomly drawn between 0 and 100; in P1, $x_1$ is initialized to 50; in P2, $x_1$ is initialized with randomly drawn values between -50 and 400; in P3, $x_1$ is initialized to 100 and $b^{(2)}$ with randomly drawn values between 0 and 100. Each episode ends when $N_P$ periods are reached or the agent has visited a termination state, which is a state lying outside the valid range of the Q-value implementation. The maximum number of periods in each episode, $N_P$, defines the length of the problem horizon, while the number of episodes, $N_E$, specifies a variety of problem scenarios, that is, different initial states and actions.

Three problem settings are used in our experiments.

Problem 1 (P1) has $N_E = 100$, $N_P = 60$, $K = 200$, $G = 100$, $B = 200$, and $H = 20$. Demand $d_t$ is normally distributed with mean 50 and standard deviation 10, denoted as $d_t \sim N(50, 10^2)$. The environment state $[x, b^{(1)}]$ is set as the RL agent state, $s = [x, b^{(1)}]$. Problem 2 (P2) has $N_E = 500$, $N_P = 60$, $K = 200$, $G = 50$, $B = 200$, and $H = 20$, with demand $d_t \sim N(50, 10^2)$. The RL agent state is set as the inventory level, $s = x + b^{(1)}$; therefore, the RL agent state is one-dimensional. Problem 3 (P3) has $N_E = 500$, $N_P = 60$, $K = 200$, $G = 50$, $B = 200$, and $H = 20$. The demand $d_t$ is AR1/GARCH(1,1): $d_t = a_0 + a_1 \cdot d_{t-1} + \epsilon_t$, $\epsilon_t = e_t \cdot \sigma_t$, and $\sigma^2_t = \nu_0 + \nu_1 \cdot \epsilon^2_{t-1} + \nu_2 \cdot \sigma^2_{t-1}$, where $a_0$ and $a_1$ are AR1 model parameters, $\nu_0$, $\nu_1$, and $\nu_2$ are GARCH(1,1) parameters, and $e_t$ is white noise distributed according to $N(0, 1)$. The AR1/GARCH(1,1) values in our experiments are $a_0 = 2$, $a_1 = 0.8$, $\nu_0 = 100$, $\nu_1 = 0.1$, and $\nu_2 = 0.8$, with initial values $d_1 = 50$, $\sigma^2_1 = 100$, and $\epsilon_1 = 2$. The RL agent state in P3 is three-dimensional, $s = [x, b^{(1)}, b^{(2)}]$. In all three problem settings, the RL agent period cost and action are the inventory period cost and the replenishment order, respectively. For RSarsa, PRS, RSarsaTD, and PRSBeta, the extra information required by rumination is the inventory demand variable, $\xi = d_t$.
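The P3 demand process can be simulated roughly as follows; parameter names mirror the description above ($a_0$, $a_1$ for the AR1 part and $\nu_0$, $\nu_1$, $\nu_2$ for GARCH(1,1)), and the generator itself is an illustrative sketch rather than the authors' simulator.

```python
import math
import random

def ar1_garch_demand(T, a0=2.0, a1=0.8, nu0=100.0, nu1=0.1, nu2=0.8,
                     d1=50.0, sigma2_1=100.0, eps1=2.0):
    """Simulate T demands: d_t = a0 + a1*d_{t-1} + eps_t, eps_t = e_t*sigma_t,
    sigma_t^2 = nu0 + nu1*eps_{t-1}^2 + nu2*sigma_{t-1}^2, with e_t ~ N(0, 1)."""
    demands = [d1]
    eps, sigma2 = eps1, sigma2_1
    for _ in range(1, T):
        sigma2 = nu0 + nu1 * eps ** 2 + nu2 * sigma2   # conditional variance update
        eps = random.gauss(0.0, 1.0) * math.sqrt(sigma2)
        demands.append(a0 + a1 * demands[-1] + eps)    # AR1 mean plus GARCH noise
    return demands
```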

The Q-value is implemented using grid tile coding [2] without hashing. Tile coding is a function approximation method based on a linear combination of the weights of activated features, called "tiles." The approximation function with argument $\mathbf{z}$ is given by

$$f(\mathbf{z}) = w_1 \phi_1(\mathbf{z}) + w_2 \phi_2(\mathbf{z}) + \cdots + w_M \phi_M(\mathbf{z}), \quad (10)$$

where $w_1, w_2, \ldots, w_M$ are tile weights and $\phi_1(\mathbf{z}), \phi_2(\mathbf{z}), \ldots, \phi_M(\mathbf{z})$ are tile activation functions; $\phi_i(\mathbf{z}) = 1$ only when $\mathbf{z}$ lies inside the hypercube of the $i$th tile.

The tile configuration, that is, $\phi_1(\mathbf{z}), \ldots, \phi_M(\mathbf{z})$, is predefined. Each Q-value is stored using tile coding through the weights. Given a value $Q$ to store at an entry $\mathbf{z}$, the weights are updated according to

$$w_i = w_i^{\text{(old)}} + \frac{Q - Q^{\text{(old)}}}{N}, \quad (11)$$

where $w_i^{\text{(old)}}$ and $Q^{\text{(old)}}$ are the weight (of the $i$th tile) and the approximation before the new update. The variable $N$ is the number of tiling layers.

For P1, we use a tile coding with 10 tiling layers. Each layer has $8 \times 3 \times 4$ three-dimensional tiles covering the multidimensional state-action space $[-300, 500] \times [0, 150] \times [0, 150]$, corresponding to $s = [x, b^{(1)}]$ and $a$. This means that this tile coding allows only a state lying in $[-300, 500] \times [0, 150]$ and a value of the action between 0 and 150. The dimensions along $x$, $b^{(1)}$, and $a$ are partitioned into 8, 3, and 4 partitions, creating 96 three-dimensional hypercubes for each tiling layer. All layers overlap to constitute the entire tile coding set; the layer overlapping is arranged randomly. For P2, we use a tile coding with 5 tiling layers. Each tiling has $11 \times 5$ two-dimensional tiles covering the space $[-300, 650] \times [0, 150]$, corresponding to $s = (x + b^{(1)})$ and $a$. For P3, we use a tile coding with 10 tiling layers. Each tiling has $8 \times 3 \times 3 \times 4$ four-dimensional tiles covering the space $[-400, 1200] \times [0, 150] \times [0, 150] \times [0, 150]$, corresponding to $s = [x, b^{(1)}, b^{(2)}]$ and $a$.
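A minimal sketch of such a grid tile coder follows: each of $N$ randomly offset layers partitions the state-action box into hypercubes, the approximation is the sum of the weights of the $N$ active tiles (Eq. (10)), and storing a target value spreads the correction equally across them (Eq. (11)). The class name, the offset scheme, and the sparse data layout are assumptions for illustration.

```python
import random

class TileCoding:
    """Grid tile coding without hashing: N randomly offset layers over a state-action box."""

    def __init__(self, lows, highs, partitions, n_layers):
        self.lows = lows
        self.widths = [(h - l) / p for l, h, p in zip(lows, highs, partitions)]
        self.n = n_layers
        # each layer is shifted randomly by up to one tile width (random layer overlap)
        self.offsets = [[random.uniform(0.0, w) for w in self.widths] for _ in range(n_layers)]
        self.weights = [{} for _ in range(n_layers)]   # sparse weight table per layer

    def _active_tiles(self, z):
        for layer in range(self.n):
            yield layer, tuple(int((zi - l + o) // w) for zi, l, o, w in
                               zip(z, self.lows, self.offsets[layer], self.widths))

    def value(self, z):
        """Eq. (10): sum of the weights of the activated tiles (phi_i(z) = 1)."""
        return sum(self.weights[layer].get(idx, 0.0) for layer, idx in self._active_tiles(z))

    def store(self, z, q_target):
        """Eq. (11): w_i <- w_i + (Q - Q_old)/N for each of the N active tiles."""
        delta = (q_target - self.value(z)) / self.n
        for layer, idx in self._active_tiles(z):
            self.weights[layer][idx] = self.weights[layer].get(idx, 0.0) + delta
```

Under these assumptions, the P1 coder would be instantiated as `TileCoding(lows=[-300, 0, 0], highs=[500, 150, 150], partitions=[8, 3, 4], n_layers=10)`, and the tabular `Q[(s, a)]` of the earlier sketches would be replaced by `value()` and `store()` on the joint state-action vector.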

All RL agents use the $\epsilon$-greedy policy with $\epsilon = 0.2$. The learning update uses the learning rate $\alpha = 0.7$ and discount factor $\gamma = 0.8$.

Figures 4, 5, and 6 show moving averages (of degree 1000) of the period costs in P1, P2, and P3 obtained with the different learning agents, as indicated in the legends ("RTD" is short for RSarsaTD). Figures 7 and 8 show box plots of the average costs obtained with the different methods in early and later periods, respectively.

The results are summarized in Table 1. The computation costs of the methods are measured by the relative average computation time per epoch, shown in lines 1-3. Average costs are used as the inventory management performance; they are shown in lines 4-6 for early periods (periods 1-2000 in P1 and P2 and periods 1-4000 in P3) and in lines 7-9 for later periods (the periods after the early periods). The numbers in each entry indicate the average costs obtained from the corresponding methods. Parentheses report results from one-sided Wilcoxon rank sum tests: "W" indicates that the average cost is significantly lower than the average cost obtained from SARSA ($P < 0.05$); otherwise, the $P$ value is shown instead.

[Figure 4: Moving average of period costs, P1. Legend: S = SARSA, R = RSarsa, P = PRS, T = RTD, B = PRSBeta.]

[Figure 5: Moving average of period costs, P2. Legend: S = SARSA, R = RSarsa, P = PRS, T = RTD, B = PRSBeta.]

[Figure 6: Moving average of period costs, P3. Legend: S = SARSA, R = RSarsa, P = PRS, T = RTD, B = PRSBeta.]

The computation costs of RSarsa, PRS, and PRSBeta (full rumination) are about 20-30 times that of SARSA (RL without rumination). RSarsaTD (selective rumination) dramatically reduces the computation cost of rumination, by factors of 5-7. An evaluation of the effectiveness of each method (compared to SARSA) shows that RSarsa and PRSBeta significantly outperform SARSA in early periods for all 3 problems. Average costs obtained from RSarsaTD are lower than those from SARSA, but significance tests can confirm only the results in P1 and P2. It should be noted that the PRS results do not show significant improvement over SARSA; this agrees with the results of a previous study [17]. With respect to performance in later periods, the average costs of PRS and PRSBeta are lower than SARSA's in all 3 problems. However, significance tests can confirm only a few results (P1 for PRS, and P1 and P2 for PRSBeta).

Table 2 shows a summary of the results of significance tests comparing the previous study's RRL methods (RSarsa and PRS) to our proposed methods (RSarsaTD and PRSBeta). The entries with "W" indicate that our proposed method in the corresponding column significantly outperforms a previous method in the corresponding row ($P < 0.05$). Otherwise, the $P$ value is indicated.

6. Conclusions and Discussion

Our results show that PRSBeta achieves our goal: it addresses the slow learning rate of PRS, as it significantly outperforms PRS in early periods in all 3 problems, and it addresses the long-term learning quality of RSarsa, as it significantly outperforms RSarsa in later periods in P1 and P2 and its average cost is lower than RSarsa's in P3. It should be noted that, although the performance of RSarsaTD may not seem impressive compared to PRSBeta's, RSarsaTD requires less computational cost. Therefore, as RSarsaTD shows some improvement over SARSA, selective rumination is still worth further study.

[Figure 7: Average costs in early periods (P1: epochs 1-2000; P2: epochs 1-2000; P3: epochs 1-4000; methods S, R, P, T, B as in Figures 4-6).]

It should be noted that PRSBeta employs the TD error to control its behavior, via (6). The notion of extending the TD error to determine learning factors is not limited to rumination. It may be beneficial to use the TD error signal to determine other learning factors, such as the learning rate for an adaptive-learning-rate agent. A high TD error indicates that the agent has a lot to learn, that what it has learned is wrong, or that things are changing. For each of these cases, the goal is to make the agent learn more quickly. So a high TD error should be a clue to increase the learning rate, increase the degree of rumination, or increase the chance to do more exploration.

[Figure 8: Average costs in late periods (P1: epochs 2001+; P2: epochs 2001+; P3: epochs 4001+; methods S, R, P, T, B as in Figures 4-6).]

Table 2: Experimental results.

Line                RSarsaTD   PRSBeta
Early periods
1   P1   RSarsa     0.49       0.16
2        PRS        W          W
3   P2   RSarsa     0.95       W
4        PRS        0.10       W
5   P3   RSarsa     0.80       0.37
6        PRS        0.26       W
Later periods
7   P1   RSarsa     W          W
8        PRS        0.63       0.14
9   P2   RSarsa     0.46       W
10       PRS        0.97       0.12
11  P3   RSarsa     0.66       0.26
12       PRS        0.60       0.18

Among the RL issues worth investigating, more efficient Q-value representations should be a priority. Regardless of the action policy, every RL policy relies on Q-values to determine the action to take. Function approximations suitable for representing Q-values should facilitate efficient realization of an action policy. For example, an $\epsilon$-greedy policy has to search for an optimal action; a Q-value representation suitable for an $\epsilon$-greedy policy should allow an efficient search for the optimal action given a state. Another general RL action policy is the softmax policy [2]. Given a state, the softmax policy has to evaluate the probabilities of candidate actions based on their associated Q-values. A representation that facilitates an efficient mapping from Q-values to these probabilities would have great practical importance in this case. Due to the interaction between the Q-value representation and the action policy, there are considerable efforts to combine these two concepts. This is an active research direction under the rubric of policy gradient RL [18].

There are many issues in RL that need to be explored, both theoretically and in application. Our findings reported in this paper provide another step toward understanding and applying RL to practical inventory management problems. Even though we investigated only inventory management problems here, our methods can be applied beyond this specific domain. This early step in the study of using the TD error to control learning factors, along with investigation of other issues in RL, should yield a more robust learning agent that is useful in a wide range of practical applications.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to thank the Colorado State University Libraries Open Access Research and Scholarship Fund for supporting the publication of this paper.

References

[1] D. Bertsimas and A. Thiele, "A robust optimization approach to inventory theory," Operations Research, vol. 54, no. 1, pp. 150-168, 2006.

[2] R. S. Sutton and A. G. Barto, Reinforcement Learning, MIT Press, Boston, Mass, USA, 1998.

[3] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley Series in Probability and Statistics, Wiley-Interscience, Hoboken, NJ, USA, 2007.

[4] J. Rao, X. Bu, C. Z. Xu, L. Wang, and G. Yin, "VCONF: a reinforcement learning approach to virtual machines auto-configuration," in Proceedings of the 6th International Conference on Autonomic Computing (ICAC '09), pp. 137-146, ACM, Barcelona, Spain, June 2009.

[5] S. G. Khan, G. Herrmann, F. L. Lewis, T. Pipe, and C. Melhuish, "Reinforcement learning and optimal adaptive control: an overview and implementation examples," Annual Reviews in Control, vol. 36, no. 1, pp. 42-59, 2012.

[6] A. Coates, P. Abbeel, and A. Y. Ng, "Apprenticeship learning for helicopter control," Communications of the ACM, vol. 52, no. 7, pp. 97-105, 2009.

[7] C. W. Anderson, D. Hittle, M. Kretchmar, and P. Young, "Robust reinforcement learning for heating, ventilation, and air conditioning control of buildings," in Handbook of Learning and Approximate Dynamic Programming, J. Si, A. Barto, W. Powell, and D. Wunsch, Eds., pp. 517-534, John Wiley & Sons, New York, NY, USA, 2004.

[8] R. Lincoln, S. Galloway, B. Stephen, and G. Burt, "Comparing policy gradient and value function based reinforcement learning methods in simulated electrical power trade," IEEE Transactions on Power Systems, vol. 27, no. 1, pp. 373-380, 2012.

[9] Z. Tan, C. Quek, and P. Y. K. Cheng, "Stock trading with cycles: a financial application of ANFIS and reinforcement learning," Expert Systems with Applications, vol. 38, no. 5, pp. 4741-4755, 2011.

[10] A. Castelletti, F. Pianosi, and M. Restelli, "A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run," Water Resources Research, vol. 49, no. 6, pp. 3476-3486, 2013.

[11] T. Katanyukul, E. K. P. Chong, and W. S. Duff, "Intelligent inventory control: is bootstrapping worth implementing?" in Intelligent Information Processing VI, vol. 385 of IFIP Advances in Information and Communication Technology, pp. 58-67, Springer, New York, NY, USA, 2012.

[12] T. Akiyama, H. Hachiya, and M. Sugiyama, "Efficient exploration through active learning for value function approximation in reinforcement learning," Neural Networks, vol. 23, no. 5, pp. 639-648, 2010.

[13] K. Doya, "Metalearning and neuromodulation," Neural Networks, vol. 15, no. 4-6, pp. 495-506, 2002.

[14] P. Dayan and N. D. Daw, "Decision theory, reinforcement learning, and the brain," Cognitive, Affective and Behavioral Neuroscience, vol. 8, no. 4, pp. 429-453, 2008.

[15] T. Katanyukul, W. S. Duff, and E. K. P. Chong, "Approximate dynamic programming for an inventory problem: empirical comparison," Computers and Industrial Engineering, vol. 60, no. 4, pp. 719-743, 2011.

[16] C. O. Kim, I. H. Kwon, and J. G. Baek, "Asynchronous action-reward learning for nonstationary serial supply chain inventory control," Applied Intelligence, vol. 28, no. 1, pp. 1-16, 2008.

[17] T. Katanyukul, "Ruminative reinforcement learning: improve intelligent inventory control by ruminating on the past," in Proceedings of the 5th International Conference on Computer Science and Information Technology (ICCSIT '13), Academic Publisher, Paris, France, 2013.

[18] N. A. Vien, H. Yu, and T. Chung, "Hessian matrix distribution for Bayesian policy gradient reinforcement learning," Information Sciences, vol. 181, no. 9, pp. 1671-1685, 2011.

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 2: Research Article Intelligent Inventory Control via ...downloads.hindawi.com/journals/jam/2014/238357.pdf · Research Article Intelligent Inventory Control via Ruminative Reinforcement

2 Journal of Applied Mathematics

evaluate consequences of actions not taken in order to accel-erate the learning process in a stateless system Extending theidea to state-based system Katanyukul [17] developed rumi-native reinforcement learning (RRL) methods that is rumi-native SARSA (RSarsa) and policy-weighted RSarsa (PRS)The RRL approach is motivated by how humans contemplateconsequences of their actions to improve their learninghoping to make a better decision His study of RRL revealsgood potential of the approach However existing individualmethods show strengths in different scenarios RSarsa isshown to have fast learning but leads to inferior learningquality in a long-term run PRS is shown to lead to superiorlearning quality in a long-term run but with slower rate

Our proposed method here is developed to exploit thefast learning characteristic of RSarsa and good learningquality in a long-term run of PRS Our experimental resultsshow effectiveness of the proposed method and support ourassumption underlying development of RRL

2 Background

An objective of a sequential inventory management is tominimize a long-term cost 119862

0(1199040) = min

1198860 119886119879sum119879

119905=0120574119905119864[119888119905|

1199040 1198860 119886

119879] subject to 119886

119905isin 119860119904119905for 119905 = 0 119879 where119864[119888

119905|

1199040 1198860 119886

119879] is the expected period cost of period 119905 given

an initial state 1199040and actions 119886

0 119886

119879over periods 0 to 119879

respectively 120574 is a discount factor and119860119904119905is a feasible action

set at state 119904119905 Under certain assumptions the problem can

be posed as a Markov decision problem (MDP) (see [15] fordetails) In this case what we seek is an optimal policy whichmaps each state to an optimal action Given an arbitrarypolicy120587 the long-term state cost for that policy can bewrittenas

119862120587(119904) = 119903

120587(119904) + 120574sum

1199041015840

119901120587(1199041015840| 119904) 119862120587(1199041015840) (1)

where 119903120587(119904) is an expected period state cost 119901120587(1199041015840 | 119904)

is a transition probabilitymdashthe probability of the next statebeing 1199041015840when the current state is 119904The superscript 120587 notationindicates dependence on the policy 120587 In practice exactsolution to (1) is difficult to find Reinforcement learning (RL)[2] provides a framework to find an approximate solutionAn approximate long-term cost of state 119904 is obtained bysummation of the period cost and a long-term cost of the nextstate 119876(119904) = 119903(119904) + 120574119876(1199041015840)

The RL approach is based on temporal difference (TD)learning which uses temporal difference error 120595 (2) toestimate the long-term cost (3)

120595 = 119903 + 120574119876 (1199041015840 1198861015840) minus 119876 (119904 119886) (2)

119876(new)

(119904 119886) = 119876(old)

(119904 119886) + 120572 sdot 120595 (3)

where 119903 is the period cost which corresponds to taking action119886 in state 119904 120572 is a learning rate and 1199041015840 and 1198861015840 are the state andaction taken in the next period respectively

Once the values of 119876(119904 119886) are thoroughly learned theyare good approximations of long-term costs We often referto 119876(119904 119886) as the ldquoQ-valuerdquo Most RL methods determine the

Q

r(t)

s(t)

a(t)

s(tminus1)

a(tminus1)

Environment

s998400

a998400

r

a

sTD error

SARSA agent

Buffer

Buffer

120587

Figure 1 SARSA agent and interacting variables (this figure isadapted from Figure 615 of Sutton and Barto [2])

actions to take based on Q-values These methods includeSARSA [2] a widely used RL algorithm We use SARSA asa benchmark representing a conventional RL method tocompare with other methods under investigation In eachperiod given observed state 119904 action taken 119886 observedperiod cost 119903 observed next state 1199041015840 and anticipating nextaction taken 119886

1015840 the SARSA algorithm updates the Q-valuebased on TD learning (2) and (3)

Based on the Q-value we can define a policy 120587 todetermine an action to take at each stateThe policy is usuallystochastic defined by a probability 119901(119886 | 119904) to take anaction 119886 given a state 119904 The policy has to balance betweentaking the best action based on the currently learnedQ-valueand trying another alternative Trying another alternativegives the learning agent a chance to explore thoroughly theconsequences of its state-action space This helps to createa constructive cycle of improving the quality of learned Q-values which in turn will help the agent to choose betteractions and reduce the chance to get stuck in a local optimumThis is an issue of balancing between exploitation and explo-ration as discussed in Sutton and Barto [2] (Since the RLalgorithm is autonomous and interacts with its environmentwe sometimes use the term ldquolearning agentrdquo)

An 120598-greedy policy is a general RL policy which also iseasy to implement With probability 120598 the policy choosesan action randomly from 119886 isin 119860(119904) where 119860(119904) is a setof allowable actions given state 119904 Otherwise it takes anaction corresponding to the minimal current Q-value 119886lowast =argmin

119886119876(119904 119886)

3 Ruminative Reinforcement Learning

The conventional RL approach SARSA assumes that theagent knows only the current state 119904 the action 119886 it takesthe period cost 119903 the next state 1199041015840 and the action 119886

1015840 it willtake in the next state Each period the SARSA agent updatesthe Q-value based on the TD error calculated with thesefive variables Figure 1 illustrates the SARSA agent the fivevariables it needs to update the Q-value and its interactionwith its environment

Journal of Applied Mathematics 3

120585120585

kk

Buffer

Environment

s998400

r

as

s

r

s998400

a

Rumination

Figure 2 Environment knowledge of its structure and rumination

However in inventorymanagement problems we usuallyhave extra knowledge about the environment That is theproblem structure can naturally be formulated such that theperiod cost 119903 and next state 1199041015840 are determined by a function119896 119904 119886 120585 997891rarr 119903 119904

1015840 where 120585 is an extra information variableThis variable 120585 captures the stochastic aspect of the problemThe process generating 120585may be unknown but the value of 120585is fully observable after the period is over Given a value of 120585along with 119904 and 119886 the deterministic function 119896 can preciselydetermine 119903 and 1199041015840

Without this extra knowledge each period the SARSAagent updates only one value of 119876(119904 119886) corresponding tocurrent state 119904 and action taken 119886 However with the function119896 and an observed value of 120585 we can do ldquoruminationrdquoevaluating the consequences of other actions 119886 even thosethat were not taken Figure 2 illustrates rumination and itsassociated variables Given the rumination mechanism wecan provide information required by SARSArsquos TD calculationfor any underlying action Katanyukul [17] introduced thisrumination idea and incorporated it into the SARSA algo-rithm resulting in the ruminative SARSA (RSarsa) algorithmAlgorithm 1 shows the RSarsa algorithm It should be notedthat RSarsa is similar to SARSA but with inclusion ofrumination from line 8 to line 13

The experiments in [17] showed that RSarsa had per-formed significantly better than SARSA in early periods(indicating faster learning) but its performance was inferiorto SARSA in later periods (indicating poor convergence to theappropriate long-term state cost approximation) Katanyukul[17] attributed RSarsarsquos poor long-term learning quality to itslack of natural action visitation frequency

TD learning (2) and (3) update theQ-value as an approx-imation of the long-term state costThe transition probability119901120587(1199041015840| 119904) in (1) does not appear explicitly in the TD learning

calculation Conventional RL relies on sampling trajectoriesto reflect the natural frequency of visits to state-action pairscorresponding to the transition probability It updates onlythe state-action pairs as they are actually visited therefore itdoes not require explicit calculation of the transition proba-bility and still eventually converges to a good approximation

However because RSarsa does rumination for all actionsignoring their sampling frequency this is equivalent to

(L00) Initialize 119876(119904 119886)(L01) Observe 119904(L02) Determine 119886 by policy 120587(L03) For each period(L04) observe 119903 1199041015840 and 120585(L05) determine 1198861015840 by policy 120587(L06) calculate 120595 = 119903 + 120574119876(119904

1015840 1198861015840) minus 119876(119904 119886)

(L07) update 119876(119904 119886) larr 119876(119904 119886) + 120572 sdot 120595(L08) for each 119886 isin 119860(119904)(L09) calculate 119903 1199041015840 with 119896(119904 119886 120585)(L10) determine 1198861015840(L11) calculate 120595 = 119903 + 120574119876(119904

1015840 1198861015840) minus 119876(119904 119886)

(L12) update 119876(119904 119886) larr 119876(119904 119886) + 120572 sdot 120595

(L13) until ruminated all 119886 isin 119860(L14) set 119904 larr 119904

1015840 and 119886 larr 1198861015840

(L15) until termination

Algorithm 1 RSarsa algorithm

disregarding the transition probability which leads toRSarsarsquos poor long-term learning quality

To address this issue Katanyukul [17] proposed policy-weighted RSarsa (PRS) PRS explicitly calculates probabilitiesof actions to be ruminated and adjusts the weights of theirupdates PRS is similar to RSarsa but the rumination update(line 12 in Algorithm 1) is replaced by

119876 (119904 119886) larr997888 119876 (119904 119886) + 120573 sdot 120595 (4)

where 120573 = 120572 sdot 119901(119886) and 119901(119886) is the probability of takingaction 119886 in state 119904 with policy 120587 Given an 120598-greedy policywe have 119901(119886) = 120598|119860(119904)| for 119886 = 119886

lowast and 119901(119886) = 120598|119860(119904)| +

(1 minus 120598) otherwise where |119860(119904)| is a number of allowableactions PRS has been shown to performwell in early and laterperiods compared to SARSA However RSarsa is reported tosignificantly outperform PRS in early periods

4 New Methods

According to the results of [17] although RSarsa mayconverge to a wrong approximation RSarsa was shown toperform impressively in the very early periods This suggeststhat if we jump-start the learning agent with RSarsa and thenlater switch to PRS before the Q-values settle into bad spotswe may be able to achieve both faster learning and goodapproximation for a long-term run

PRSBeta We first introduce a straightforward idea calledPRSBeta where we will use a varying ruminative learn-ing rate as a mechanism to shift from full rumination(RSarsa) to policy-weighted rumination (PRS) Similar toPRS the rumination update is determined by (4) Howeverthe value of the rumination learning rate 120573 is determined by

120573 = 120572 sdot 1 minus (1 minus 119901 (119886)) sdot 119891 (5)

where 119891 is a function having a value between 0 and 1 When119891 rarr 0 120573 rarr 120572 and the algorithm will behave like RSarsa

4 Journal of Applied Mathematics

00

02

04

06

08

10

f

1e6 8e5 6e5 4e5 2e5 0|120595|

f exp(minus120591 |120595|)

120591 = 1

120591 = 1e minus 5

120591 = 05e minus 5

Figure 3 Function exp(minus120591|120595|) and effects of different 120591 values

When 119891 rarr 1 120573 rarr 120572 sdot 119901(119886) and the algorithm will behavelike PRS We want 119891 to start out close to 0 and grow to 1

at a proper rate By examining our preliminary experimentsthe TD error will get smaller as the learning converges Thisis actually a property of TD learning Given this propertywe can use the magnitude of the TD error |120595| to control theshifting such that

119891 (120595) = exp (minus120591 sdot 10038161003816100381610038161205951003816100381610038161003816) (6)

where 120591 is a scaling factor Figure 3 illustrates the effects ofdifferent values of 120591 Since the magnitude of 120591 should berelative to |120595| we set 120591 = |2(119903 + 119876(119904 119886))| so that themagnitude of 120591 will be in a proper scale relative to |120595| andautomatically adjusted

RSarsaTD Building on the PRSBeta method above we nextpropose another method called RSarsaTD The underlyingidea is that since SARSA performs well in a long-term run(see [2] for theoretical discussion of SARSArsquos optimalityand convergence properties) then after we speed up theearly learning process with rumination we can just switchback to SARSA This approach is to utilize the fast learningcharacteristic of full rumination in early periods and toavoid its poor long-term performance In addition as acomputational cost of rumination is proportional to the sizeof the ruminative action space |119860(119904)| this also helps to reducethe computational cost incurred by rumination It is alsointuitively appealing in the sense that we do rumination onlywhen we need it

The intuition to selectively do ruminationwas introducedin [17] in an attempt to reduce the extra computational costfrom ruminationThere the probability to do ruminationwasa function of the magnitude of the TD error

119901 (rumination) = 1 minus exp(minus10038161003816100381610038161003816100381610038161003816

2120595

119903 + 119876 (119904 119886)

10038161003816100381610038161003816100381610038161003816

) (7)

However Katanyukul [17] investigated this selective rumi-nation only with the policy-weighted method and called itPRSTD Although PRSTD was able to improve the compu-tational cost of the rumination approach the inventory man-agement performance of PRSTDwas reported to have mixedresults implying that incorporation of selective ruminationmay deteriorate performance of PRS

This performance deterioration may be due to using119901(rumination) with policy weighted correction Bothschemes use |120595| to control their effect of rumination there-fore they might have an effect equivalent to overcorrectingthe state-transition probability Unlike PRS RSarsa doesnot correct the state-transition probability Incorporatingselective rumination (7) will be the only scheme controllingrumination with |120595| Therefore we expect that this approachmay allow the advantage of RSarsarsquos fast learning whilemaintaining the long-term learning quality of SARSA

5. Experiments and Results

Our study uses computer simulations to conduct numerical experiments on three inventory management problem settings (P1, P2, and P3). All problems are periodic-review, single-echelon, with nonzero setup cost. P1 and P2 have a one-period lead time; P3 has a two-period lead time. The same Markov model governs all problem environments, but with different settings. For P1 and P2, the problem state space is I × I₀⁺ for the on-hand and in-transit inventories x and b⁽¹⁾, respectively, where I denotes the integers and I₀⁺ the nonnegative integers. P3's state space is I × I₀⁺ × I₀⁺ for x and the in-transit inventories b⁽¹⁾ and b⁽²⁾. The action space is I₀⁺ for the replenishment order a.

The state transition is specified by

$$x_{t+1} = x_t + b^{(1)}_t - d_t,$$
$$b^{(i)}_{t+1} = b^{(i+1)}_t, \quad i = 1, \ldots, L - 1,$$
$$b^{(L)}_{t+1} = a_t, \quad (8)$$

where L is the number of lead-time periods. The inventory period cost is calculated from

$$r_t = K \cdot \delta(a_t) + G \cdot a_t + H \cdot x_{t+1} \cdot \delta(x_{t+1}) - B \cdot x_{t+1} \cdot \delta(-x_{t+1}), \quad (9)$$

where K, G, H, and B are the setup, unit, holding, and penalty costs, respectively, and δ(·) is a step function. Five RL agents are studied: SARSA, RSarsa, PRS, RSarsaTD, and PRSBeta.
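For clarity, here is a minimal sketch of one simulated inventory period implementing the transition (8) and the period cost (9). The function names are ours, and the convention δ(x) = 1 for x > 0 (and 0 otherwise) is an assumption about the step function.

```python
def unit_step(x):
    """delta(.): 1 for a positive argument, 0 otherwise (our reading of the step function)."""
    return 1.0 if x > 0 else 0.0

def step(x, b, a, d, K, G, H, B):
    """One inventory period: state transition (8) and period cost (9).

    x : on-hand inventory, b : in-transit orders [b(1), ..., b(L)],
    a : replenishment order, d : realized demand.
    """
    x_next = x + b[0] - d                       # (8): b(1) arrives, demand d is subtracted
    b_next = b[1:] + [a]                        # (8): pipeline shifts, new order enters at b(L)
    r = (K * unit_step(a) + G * a
         + H * x_next * unit_step(x_next)       # holding cost on positive inventory
         - B * x_next * unit_step(-x_next))     # penalty cost on backlog (x_next < 0)
    return x_next, b_next, r
```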

Each experiment is repeated 10 times. In each repetition, an agent is initialized with all-zero Q-values; then the experiment is run consecutively for N_E episodes. Each episode starts with the initial state and action set as follows: for all problems, b^(1)_1 and a_1 are initialized with values randomly drawn between 0 and 100; in P1, x_1 is initialized to 50; in P2, x_1 is initialized with values randomly drawn between −50 and 400; and in P3, x_1 is initialized to 100 and b^(2)_1 with values randomly drawn between 0 and 100. Each episode ends when N_P periods are reached or the agent has visited a termination state, that is, a state lying outside the valid range of the Q-value implementation.


Table 1: Experimental results.

Line | Problem | SARSA | RSarsa | PRS | RSarsaTD | PRSBeta
Relative computation time/epoch
1 | P1 | 1 | 30 | 26 | 5 | 30
2 | P2 | 1 | 20 | 21 | 3 | 19
3 | P3 | 1 | 31 | 29 | 6 | 31
Average cost of early periods
4 | P1 | 8421 | 7619 (W) | 8379 (p = 0.43) | 7597 (W) | 7450 (W)
5 | P2 | 4935 | 4606 (W) | 4792 (p = 0.06) | 4685 (W) | 4411 (W)
6 | P3 | 10502 | 8694 (W) | 9958 (p = 0.20) | 9390 (p = 0.07) | 8472 (W)
Average cost of later periods
7 | P1 | 7214 | 7355 (p = 0.68) | 7051 (W) | 7110 (p = 0.11) | 7010 (W)
8 | P2 | 4308 | 4388 (p = 0.90) | 4248 (p = 0.14) | 4375 (p = 0.84) | 4194 (W)
9 | P3 | 8613 | 8139 (p = 0.29) | 8312 (p = 0.37) | 8486 (p = 0.43) | 7664 (p = 0.18)

The maximum number of periods in each episode, N_P, defines the length of the problem horizon, while the number of episodes, N_E, specifies the variety of problem scenarios, that is, different initial states and actions. Three problem settings are used in our experiments.

Problem 1 (P1) has N_E = 100, N_P = 60, K = 200, G = 100, B = 200, and H = 20. Demand d_t is normally distributed with mean 50 and standard deviation 10, denoted d_t ~ N(50, 10²). The environment state [x, b^(1)] is used directly as the RL agent state, s = [x, b^(1)]. Problem 2 (P2) has N_E = 500, N_P = 60, K = 200, G = 50, B = 200, and H = 20, with demand d_t ~ N(50, 10²). The RL agent state is set to the inventory level, s = x + b^(1); therefore the RL agent state is one-dimensional. Problem 3 (P3) has N_E = 500, N_P = 60, K = 200, G = 50, B = 200, and H = 20. The demand d_t follows an AR(1)/GARCH(1,1) process:

$$d_t = a_0 + a_1 \cdot d_{t-1} + \epsilon_t, \quad \epsilon_t = e_t \cdot \sigma_t, \quad \sigma^2_t = \nu_0 + \nu_1 \cdot \epsilon^2_{t-1} + \nu_2 \cdot \sigma^2_{t-1},$$

where a_0 and a_1 are the AR(1) parameters, ν_0, ν_1, and ν_2 are the GARCH(1,1) parameters, and e_t is white noise distributed according to N(0, 1). The AR(1)/GARCH(1,1) values in our experiments are a_0 = 2, a_1 = 0.8, ν_0 = 100, ν_1 = 0.1, and ν_2 = 0.8, with initial values d_1 = 50, σ²_1 = 100, and ε_1 = 2. The RL agent state in P3 is three-dimensional, s = [x, b^(1), b^(2)]. In all three problem settings, the RL agent period cost and action are the inventory period cost and replenishment order, respectively. For RSarsa, PRS, RSarsaTD, and PRSBeta, the extra information required by rumination is the inventory demand variable, ξ = d_t.
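The following sketch shows how P3's AR(1)/GARCH(1,1) demand could be simulated with the stated parameter values; the function name and the use of Python's standard random module are our own choices, not part of the original experiments.

```python
import random

def simulate_p3_demand(n_periods, a0=2.0, a1=0.8, nu0=100.0, nu1=0.1, nu2=0.8,
                       d1=50.0, sigma2_1=100.0, eps1=2.0):
    """AR(1)/GARCH(1,1) demand of P3:
    d_t = a0 + a1*d_{t-1} + eps_t,  eps_t = e_t*sigma_t,
    sigma_t^2 = nu0 + nu1*eps_{t-1}^2 + nu2*sigma_{t-1}^2,  e_t ~ N(0, 1).
    """
    d, sigma2, eps = d1, sigma2_1, eps1
    demands = [d]
    for _ in range(n_periods - 1):
        sigma2 = nu0 + nu1 * eps**2 + nu2 * sigma2   # GARCH(1,1) variance recursion
        eps = random.gauss(0.0, 1.0) * sigma2**0.5   # eps_t = e_t * sigma_t
        d = a0 + a1 * d + eps                        # AR(1) mean recursion
        demands.append(d)
    return demands
```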

The Q-value is implemented using grid tile coding [2] without hashing. Tile coding is a function approximation method based on a linear combination of weights of activated features called "tiles." The approximation function with argument z is given by

$$f(\mathbf{z}) = w_1 \phi_1(\mathbf{z}) + w_2 \phi_2(\mathbf{z}) + \cdots + w_M \phi_M(\mathbf{z}), \quad (10)$$

where w_1, w_2, ..., w_M are tile weights and φ_1(z), φ_2(z), ..., φ_M(z) are tile activation functions; φ_i(z) = 1 only when z lies inside the hypercube of the i-th tile.

The tile configuration, that is, φ_1(z), ..., φ_M(z), is predefined. Each Q-value is stored via tile coding through the weights: given a value Q to store at an entry z, the weights are updated according to

$$w_i = w_i^{(\text{old})} + \frac{Q - Q^{(\text{old})}}{N}, \quad (11)$$

where w_i^(old) and Q^(old) are the weight (of the i-th tile) and the approximation before the update, and N is the number of tiling layers.

For P1, we use a tile coding with 10 tiling layers. Each layer has 8 × 3 × 4 three-dimensional tiles covering the state-action space [−300, 500] × [0, 150] × [0, 150] corresponding to s = [x, b^(1)] and a. This means the tile coding allows only states lying in [−300, 500] × [0, 150] and action values between 0 and 150. The dimensions along x, b^(1), and a are partitioned into 8, 3, and 4 partitions, creating 96 three-dimensional hypercubes for each tiling layer. All layers overlap to constitute the entire tile coding set; the layer offsets are arranged randomly. For P2, we use a tile coding with 5 tiling layers; each tiling has 11 × 5 two-dimensional tiles covering the space [−300, 650] × [0, 150] corresponding to s = (x + b^(1)) and a. For P3, we use a tile coding with 10 tiling layers; each tiling has 8 × 3 × 3 × 4 four-dimensional tiles covering the space [−400, 1200] × [0, 150] × [0, 150] × [0, 150] corresponding to s = [x, b^(1), b^(2)] and a.
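A minimal sketch of grid tile coding without hashing, following (10) and (11), is given below. The class name, the dictionary-based weight storage, and the way layer offsets are randomized are illustrative assumptions; only the linear form (10) and the update (11) come from the text.

```python
import random

class TileCodingQ:
    """Minimal grid tile coding without hashing, following (10) and (11).

    Each of the N layers partitions the (low, high) box into a uniform grid;
    layer offsets are drawn randomly, echoing the random layer arrangement above.
    """

    def __init__(self, lows, highs, partitions, n_layers, seed=0):
        rng = random.Random(seed)
        self.lows, self.n = lows, n_layers
        self.widths = [(h - l) / p for l, h, p in zip(lows, highs, partitions)]
        # one random offset per layer and dimension, within one tile width
        self.offsets = [[rng.uniform(0, w) for w in self.widths] for _ in range(n_layers)]
        self.weights = [{} for _ in range(n_layers)]   # tile-index tuple -> weight

    def _tiles(self, z):
        for layer in range(self.n):
            yield layer, tuple(
                int((zi - l + off) // w)
                for zi, l, off, w in zip(z, self.lows, self.offsets[layer], self.widths))

    def value(self, z):
        # (10): sum of the weights of the activated tiles (phi_i(z) = 1)
        return sum(self.weights[layer].get(idx, 0.0) for layer, idx in self._tiles(z))

    def store(self, z, q_target):
        # (11): spread the approximation error evenly over the N activated tiles
        delta = (q_target - self.value(z)) / self.n
        for layer, idx in self._tiles(z):
            self.weights[layer][idx] = self.weights[layer].get(idx, 0.0) + delta
```

For example, a P2-like coder could be built as TileCodingQ(lows=[-300, 0], highs=[650, 150], partitions=[11, 5], n_layers=5), with the last dimension carrying the action.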

All RL agents use the ε-greedy policy with ε = 0.2. The learning update uses the learning rate α = 0.7 and discount factor γ = 0.8.
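As a small usage sketch (not the authors' code), ε-greedy action selection over a discretized action set might look as follows; since the Q-values here approximate expected costs, the greedy choice minimizes rather than maximizes.

```python
import random

def epsilon_greedy(q_value, state, actions, epsilon=0.2):
    """epsilon-greedy over a finite action set; q_value(state, a) returns the
    approximate expected cost, so the greedy action is the arg-min."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore uniformly
    return min(actions, key=lambda a: q_value(state, a))    # exploit: lowest predicted cost
```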

Figures 4, 5, and 6 show moving averages (of degree 1000) of period costs in P1, P2, and P3 obtained with the different learning agents, as indicated in the legends ("RTD" is short for RSarsaTD). Figures 7 and 8 show box plots of average costs obtained with the different methods in early and later periods, respectively.

The results are summarized in Table 1. The computation costs of the methods are measured by the relative average computation time per epoch, shown in lines 1–3.

Figure 4: Moving average of period costs, P1.

Figure 5: Moving average of period costs, P2.

Average costs are used as the inventory management performance measure; they are shown in lines 4–6 for early periods (periods 1–2000 in P1 and P2, periods 1–4000 in P3) and in lines 7–9 for later periods (the periods after the early periods). The numbers in each entry are the average costs obtained from the corresponding methods. Parentheses report results of one-sided Wilcoxon rank sum tests: "W" indicates that the average cost is significantly lower than the average cost obtained from SARSA (P < 0.05); otherwise the P value is shown.

The computation costs of RSarsa, PRS, and PRSBeta (full rumination) are about 20–30 times that of SARSA (RL without rumination).

Figure 6: Moving average of period costs, P3.

RSarsaTD (selective rumination) dramatically reduces the computation cost of rumination, by a factor of about 5–7. An evaluation of the effectiveness of each method (compared to SARSA) shows that RSarsa and PRSBeta significantly outperform SARSA in early periods for all three problems. Average costs obtained from RSarsaTD are lower than those from SARSA, but the significance tests confirm only the results in P1 and P2. It should be noted that the PRS results do not show significant improvement over SARSA; this agrees with the results of a previous study [17]. With respect to performance in later periods, the average costs of PRS and PRSBeta are lower than SARSA's in all three problems; however, the significance tests confirm only a few of these results (P1 for PRS; P1 and P2 for PRSBeta).

Table 2 summarizes the results of significance tests comparing the previous study's RRL methods (RSarsa and PRS) to our proposed methods (RSarsaTD and PRSBeta). An entry of "W" indicates that the proposed method in the corresponding column significantly outperforms the previous method in the corresponding row (P < 0.05); otherwise the P value is given.

6. Conclusions and Discussion

Our results show that PRSBeta achieves our goal: it addresses the slow learning rate of PRS, as it significantly outperforms PRS in early periods in all three problems, and it addresses the long-term learning quality of RSarsa, as it significantly outperforms RSarsa in later periods in P1 and P2 and its average cost is lower than RSarsa's in P3.

Figure 7: Average costs in early periods (P1 epochs 1–2000, P2 epochs 1–2000, P3 epochs 1–4000).

It should be noted that, although the performance of RSarsaTD may not seem impressive compared to PRSBeta's, RSarsaTD requires less computational cost. Therefore, as RSarsaTD shows some improvement over SARSA, selective rumination is still worth further study.

It should be noted that PRSBeta employs the TD error to control its behavior, via (6). The idea of using the TD error to determine learning factors is not limited to rumination. It may be beneficial to use the TD error signal to set other learning factors, such as the learning rate of an adaptive-learning-rate agent. A high TD error indicates that the agent has a lot to learn, that what it has learned is wrong, or that things are changing. In each of these cases the goal is to make the agent learn more quickly, so a high TD error should be a cue to increase the learning rate, increase the degree of rumination, or increase the chance of exploration.
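Purely as an illustration of this suggestion (not something evaluated in the paper), a TD-error-driven learning rate could be scheduled as follows; the bounds alpha_min and alpha_max and the scaling form are our own assumptions, reusing the relative TD-error magnitude from (7).

```python
import math

def adaptive_alpha(psi, r, q_sa, alpha_min=0.1, alpha_max=0.7):
    """Illustrative only: scale the learning rate with the relative TD-error magnitude,
    echoing the suggestion that a large |psi| signals there is much left to learn."""
    scale = 1.0 - math.exp(-abs(2.0 * psi / (r + q_sa))) if (r + q_sa) != 0 else 1.0
    return alpha_min + (alpha_max - alpha_min) * scale   # large |psi| -> alpha near alpha_max
```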

Among the RL issues worth investigating, more efficient Q-value representations should be a priority. Regardless of the action policy, every RL policy relies on Q-values to determine the action to take. Function approximations suitable for representing Q-values should therefore facilitate efficient realization of an action policy.

Figure 8: Average costs in later periods (P1 epochs 2001+, P2 epochs 2001+, P3 epochs 4001+).

Table 2: Experimental results.

Line | Comparison | RSarsaTD | PRSBeta
Early periods
1 | P1, RSarsa | 0.49 | 0.16
2 | P1, PRS | W | W
3 | P2, RSarsa | 0.95 | W
4 | P2, PRS | 0.10 | W
5 | P3, RSarsa | 0.80 | 0.37
6 | P3, PRS | 0.26 | W
Later periods
7 | P1, RSarsa | W | W
8 | P1, PRS | 0.63 | 0.14
9 | P2, RSarsa | 0.46 | W
10 | P2, PRS | 0.97 | 0.12
11 | P3, RSarsa | 0.66 | 0.26
12 | P3, PRS | 0.60 | 0.18

For example, the ε-greedy policy has to search for the optimal action; a Q-value representation suitable for an ε-greedy policy should allow an efficient search for the optimal action given a state. Another common RL action policy is the softmax policy [2]. Given a state, the softmax policy evaluates the probabilities of candidate actions based on their associated Q-values; a representation that facilitates an efficient mapping from Q-values to these probabilities would have great practical importance in this case. Due to the interaction between the Q-value representation and the action policy, there have been considerable efforts to combine the two concepts; this is an active research direction under the rubric of policy gradient RL [18].
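For instance, a softmax (Boltzmann) policy maps the Q-values of the candidate actions to selection probabilities as sketched below; the temperature parameter and the sign flip (because the Q-values in this setting are costs) are our assumptions rather than details from the paper.

```python
import math
import random

def softmax_policy(q_values, temperature=1.0):
    """Boltzmann/softmax action probabilities from a list of Q-values.
    Negated because the Q-values here are costs (lower is better)."""
    prefs = [-q / temperature for q in q_values]
    m = max(prefs)                               # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    probs = [e / total for e in exps]
    # sample an action index according to the probabilities
    return random.choices(range(len(q_values)), weights=probs, k=1)[0], probs
```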

Many issues in RL remain to be explored, both theoretically and in application. The findings reported in this paper provide another step toward understanding and applying RL to practical inventory management problems. Although we investigated only inventory management problems here, our methods can be applied beyond this specific domain. This early step in using the TD error to control learning factors, along with investigation of other RL issues, should lead to more robust learning agents that are useful in a wide range of practical applications.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to thank the Colorado State University Libraries Open Access Research and Scholarship Fund for supporting the publication of this paper.

References

[1] D. Bertsimas and A. Thiele, "A robust optimization approach to inventory theory," Operations Research, vol. 54, no. 1, pp. 150–168, 2006.

[2] R. S. Sutton and A. G. Barto, Reinforcement Learning, MIT Press, Boston, Mass, USA, 1998.

[3] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley Series in Probability and Statistics, Wiley-Interscience, Hoboken, NJ, USA, 2007.

[4] J. Rao, X. Bu, C. Z. Xu, L. Wang, and G. Yin, "VCONF: a reinforcement learning approach to virtual machines auto-configuration," in Proceedings of the 6th International Conference on Autonomic Computing (ICAC '09), pp. 137–146, ACM, Barcelona, Spain, June 2009.

[5] S. G. Khan, G. Herrmann, F. L. Lewis, T. Pipe, and C. Melhuish, "Reinforcement learning and optimal adaptive control: an overview and implementation examples," Annual Reviews in Control, vol. 36, no. 1, pp. 42–59, 2012.

[6] A. Coates, P. Abbeel, and A. Y. Ng, "Apprenticeship learning for helicopter control," Communications of the ACM, vol. 52, no. 7, pp. 97–105, 2009.

[7] C. W. Anderson, D. Hittle, M. Kretchmar, and P. Young, "Robust reinforcement learning for heating, ventilation, and air conditioning control of buildings," in Handbook of Learning and Approximate Dynamic Programming, J. Si, A. Barto, W. Powell, and D. Wunsch, Eds., pp. 517–534, John Wiley & Sons, New York, NY, USA, 2004.

[8] R. Lincoln, S. Galloway, B. Stephen, and G. Burt, "Comparing policy gradient and value function based reinforcement learning methods in simulated electrical power trade," IEEE Transactions on Power Systems, vol. 27, no. 1, pp. 373–380, 2012.

[9] Z. Tan, C. Quek, and P. Y. K. Cheng, "Stock trading with cycles: a financial application of ANFIS and reinforcement learning," Expert Systems with Applications, vol. 38, no. 5, pp. 4741–4755, 2011.

[10] A. Castelletti, F. Pianosi, and M. Restelli, "A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run," Water Resources Research, vol. 49, no. 6, pp. 3476–3486, 2013.

[11] T. Katanyukul, E. K. P. Chong, and W. S. Duff, "Intelligent inventory control: is bootstrapping worth implementing?" in Intelligent Information Processing VI, vol. 385 of IFIP Advances in Information and Communication Technology, pp. 58–67, Springer, New York, NY, USA, 2012.

[12] T. Akiyama, H. Hachiya, and M. Sugiyama, "Efficient exploration through active learning for value function approximation in reinforcement learning," Neural Networks, vol. 23, no. 5, pp. 639–648, 2010.

[13] K. Doya, "Metalearning and neuromodulation," Neural Networks, vol. 15, no. 4–6, pp. 495–506, 2002.

[14] P. Dayan and N. D. Daw, "Decision theory, reinforcement learning, and the brain," Cognitive, Affective and Behavioral Neuroscience, vol. 8, no. 4, pp. 429–453, 2008.

[15] T. Katanyukul, W. S. Duff, and E. K. P. Chong, "Approximate dynamic programming for an inventory problem: empirical comparison," Computers and Industrial Engineering, vol. 60, no. 4, pp. 719–743, 2011.

[16] C. O. Kim, I. H. Kwon, and J. G. Baek, "Asynchronous action-reward learning for nonstationary serial supply chain inventory control," Applied Intelligence, vol. 28, no. 1, pp. 1–16, 2008.

[17] T. Katanyukul, "Ruminative reinforcement learning: improve intelligent inventory control by ruminating on the past," in Proceedings of the 5th International Conference on Computer Science and Information Technology (ICCSIT '13), Academic Publisher, Paris, France, 2013.

[18] N. A. Vien, H. Yu, and T. Chung, "Hessian matrix distribution for Bayesian policy gradient reinforcement learning," Information Sciences, vol. 181, no. 9, pp. 1671–1685, 2011.



PRSBeta We first introduce a straightforward idea calledPRSBeta where we will use a varying ruminative learn-ing rate as a mechanism to shift from full rumination(RSarsa) to policy-weighted rumination (PRS) Similar toPRS the rumination update is determined by (4) Howeverthe value of the rumination learning rate 120573 is determined by

120573 = 120572 sdot 1 minus (1 minus 119901 (119886)) sdot 119891 (5)

where 119891 is a function having a value between 0 and 1 When119891 rarr 0 120573 rarr 120572 and the algorithm will behave like RSarsa

4 Journal of Applied Mathematics

00

02

04

06

08

10

f

1e6 8e5 6e5 4e5 2e5 0|120595|

f exp(minus120591 |120595|)

120591 = 1

120591 = 1e minus 5

120591 = 05e minus 5

Figure 3 Function exp(minus120591|120595|) and effects of different 120591 values

When 119891 rarr 1 120573 rarr 120572 sdot 119901(119886) and the algorithm will behavelike PRS We want 119891 to start out close to 0 and grow to 1

at a proper rate By examining our preliminary experimentsthe TD error will get smaller as the learning converges Thisis actually a property of TD learning Given this propertywe can use the magnitude of the TD error |120595| to control theshifting such that

119891 (120595) = exp (minus120591 sdot 10038161003816100381610038161205951003816100381610038161003816) (6)

where 120591 is a scaling factor Figure 3 illustrates the effects ofdifferent values of 120591 Since the magnitude of 120591 should berelative to |120595| we set 120591 = |2(119903 + 119876(119904 119886))| so that themagnitude of 120591 will be in a proper scale relative to |120595| andautomatically adjusted

RSarsaTD Building on the PRSBeta method above we nextpropose another method called RSarsaTD The underlyingidea is that since SARSA performs well in a long-term run(see [2] for theoretical discussion of SARSArsquos optimalityand convergence properties) then after we speed up theearly learning process with rumination we can just switchback to SARSA This approach is to utilize the fast learningcharacteristic of full rumination in early periods and toavoid its poor long-term performance In addition as acomputational cost of rumination is proportional to the sizeof the ruminative action space |119860(119904)| this also helps to reducethe computational cost incurred by rumination It is alsointuitively appealing in the sense that we do rumination onlywhen we need it

The intuition to selectively do ruminationwas introducedin [17] in an attempt to reduce the extra computational costfrom ruminationThere the probability to do ruminationwasa function of the magnitude of the TD error

119901 (rumination) = 1 minus exp(minus10038161003816100381610038161003816100381610038161003816

2120595

119903 + 119876 (119904 119886)

10038161003816100381610038161003816100381610038161003816

) (7)

However Katanyukul [17] investigated this selective rumi-nation only with the policy-weighted method and called itPRSTD Although PRSTD was able to improve the compu-tational cost of the rumination approach the inventory man-agement performance of PRSTDwas reported to have mixedresults implying that incorporation of selective ruminationmay deteriorate performance of PRS

This performance deterioration may be due to using119901(rumination) with policy weighted correction Bothschemes use |120595| to control their effect of rumination there-fore they might have an effect equivalent to overcorrectingthe state-transition probability Unlike PRS RSarsa doesnot correct the state-transition probability Incorporatingselective rumination (7) will be the only scheme controllingrumination with |120595| Therefore we expect that this approachmay allow the advantage of RSarsarsquos fast learning whilemaintaining the long-term learning quality of SARSA

5 Experiments and Results

Our study uses computer simulations to conduct numeri-cal experiments on three inventory management problemsettings (P1 P2 and P3) All problems are periodic reviewsingle-echelon with nonzero setup cost P1 and P2 have one-period lead time P3 has two-period lead time The sameMarkov model is used to govern all problem environmentsbut with different settings For P1 and P2 the problem statespace is I times 0 I+ for on-hand and in-transit inventories 119909and 119887(1) respectively P3rsquos state space is I times 0 I+ times 0 I+ for119909 and in-transit inventories 119887(1) and 119887(2) The action space is0 I+ for replenishment order 119886

The state transition is specified by

119909119905+1

= 119909119905+ 119887(1)

119905minus 119889119905

119887(119894)

119905+1= 119887(119894+1)

119905 for 119894 = 1 119871 minus 1

119887(119871)

119905+1= 119886119905

(8)

where 119871 is a number of lead time periodsThe inventory period cost is calculated from the equation

119903119905= 119870 sdot 120575 (119886

119905) + 119866 sdot 119886

119905+ 119867 sdot 119909

119905+1sdot 120575 (119909119905+1)

minus 119861 sdot 119909119905+1

sdot 120575 (minus119909119905+1)

(9)

where 119870 119866 119867 and 119861 are setup unit holding and penaltycosts respectively and 120575(sdot) is a step function Five RL agentsare studied SARSA RSarsa PRS RSarsaTD and PRSBeta

Each experiment is repeated 10 times In each repeti-tion an agent is initialized with all zero Q-values Thenthe experiment is run consecutively for 119873

119864episodes Each

episode starts with initial state and action as follows for allproblems 119887(1)

1and 119886

1are initialized with values randomly

drawn between 0 and 100 In P1 1199091is initialized to 50 in

P21199091is initialized from randomly drawn values betweenminus50

and 400 in P3 1199091is initialized to 100 and randomly drawn

values of 119887(2) between 0 and 100 Each episode ends when119873119875periods are reached or an agent has visited a termination

Journal of Applied Mathematics 5

Table 1 Experimental results

Line MethodsSARSA RSarsa PRS RSarsaTD PRSBeta

Relative computation timeepoch1 P1 1 30 26 5 302 P2 1 20 21 3 193 P3 1 31 29 6 31

Average cost of early periods4 P1 8421 7619 (W) 8379 (p043) 7597 (W) 7450 (W)5 P2 4935 4606 (W) 4792 (p006) 4685 (W) 4411 (W)6 P3 10502 8694 (W) 9958 (p020) 9390 (p007) 8472 (W)

Average cost of later periods7 P1 7214 7355 (p068) 7051 (W) 7110 (p011) 7010 (W)8 P2 4308 4388 (p090) 4248 (p014) 4375 (p084) 4194 (W)9 P3 8613 8139 (p029) 8312 (p037) 8486 (p043) 7664 (p018)

state which is a state lying outside a valid range of Q-valueimplementation The maximum number of periods in eachepisode119873

119875 defines the length of the problem horizon while

the number of episodes 119873119864specifies a variety of problem

scenarios that is different initial states and actionsThree problem settings are used in our experiments

Problem 1 (P1) has 119873119864= 100 119873

119875= 60 119870 = 200 119866 = 100

119861 = 200 and 119867 = 20 Demand 119889119905is normally distributed

with mean 50 and standard deviation 10 denoted as 119889119905sim

N(50 102) The environment state [119909 119887(1)] is set as the RL

agent state 119904 = [119909 119887(1)] Problem 2 (P2) has 119873

119864= 500

119873119875= 60 119870 = 200 119866 = 50 119861 = 200 and 119867 = 20 with

demand 119889119905sim N(50 10

2) The RL agent state is set as the

inventory level 119904 = 119909 + 119887(1) Therefore the RL agent state is

one-dimensional Problem 3 (P3) has 119873119864= 500 119873

119875= 60

119870 = 200 119866 = 50 119861 = 200 and 119867 = 20 The demand 119889119905is

AR1GARCH(11) 119889119905= 1198860+ 1198861sdot 119889119905minus1

+ 120598119905 120598119905= 119890119905sdot 120590119905and

1205902

119905= ]0+ ]1sdot 1205982

119905minus1+ ]2sdot 1205902

119905minus1 where 119886

0and 1198861are AR1 model

parameters ]0 ]1 and ]

2are GARCH(11) parameters and 119890

119905

is white noise distributed according toN(0 1) The values ofAR1GARCH(11) in our experiments are 119886

0= 2 119886

1= 08

]0= 100 ]

1= 01 and ]

2= 08 with initial values 119889

1= 50

1205902

1= 100 and 120598

1= 2 The RL agent state in P3 is three-

dimensional 119904 = [119909 119887(1) 119887(2)] In all three problem settings

the RL agent period cost and action are the inventory periodcost and replenishment order respectively For RSarsa PRSRSarsaTD and PRSBeta the extra information required byrumination is the inventory demand variable 120585 = 119889

119905

The Q-value is implemented using grid tile coding [2]without hashing Tile coding is a function approximationmethod based on a linear combination of weights of activatedfeatures called ldquotilesrdquo The approximation function withargument z is given by

119891 (z) = 11990811206011(z) + 119908

21206012(z) + sdot sdot sdot + 119908

119872120601119872(z) (10)

where 1199081 1199082 119908

119872are tile weights and 120601

1(z) 1206012(z)

120601119872(z) are tile activation functions 120601

119894(z) = 1 only when z lies

inside the hypercube of the 119894th tile

The tile configuration that is 1206011(z) 120601

119872(z) is prede-

fined Each Q-value is stored using tile coding through theweights Given a value119876 to store at any entry of z the weightsare updated according to

119908119894= 119908(old)119894

+

(119876 minus 119876(old)

)

119873 (11)

where 119908(old)119894

and 119876(old) are the weight (of the 119894th tile) and

approximation before the new update Variable 119873 is for anumber of tiling layers

For P1 we use a tile codingwith 10 tiling layers Each layerhas 8 times 3 times 4 three-dimensional tiles covering multidimen-sional state-action space of [minus300 500] times [0 150] times [0 150]

corresponding to 119904 = [119909 119887(1)] and 119886 This means that this tile

coding allows only a state lying in [minus300 500]times[0 150] and avalue of action between 0 and 150 The dimensions along 119909119887(1) and 119886 are partitioned into 8 3 and 4 partitions creating96 three-dimensional hypercubes for each tiling layer Alllayers are overlapping to constitute an entire tile coding setLayer overlapping is arranged randomly For P2 we use atile coding with 5 tiling layers Each tiling has 11 times 5 two-dimensional tiles covering the space of [minus300 650]times [0 150]corresponding to 119904 = (119909 + 119887

(1)) and 119886 For P3 we use a tile

coding with 10 tiling layers Each tiling has 8 times 3 times 3 times 4

four-dimensional tiles covering the space of [minus400 1200] times[0 150]times[0 150]times[0 150] corresponding to 119904 = [119909 119887(1) 119887(2)]and 119886

All RL agents use the 120598-greedy policy with 120598 = 02 Thelearning update uses the learning rate 120572 = 07 and discountfactor 120574 = 08

Figures 4 5 and 6 showmoving averages (of degree 1000)of period costs in P1 P2 and P3 obtained with differentlearning agents as indicated in the legends (ldquoRTDrdquo is shortfor RSarsaTD) Figures 7 and 8 show box plots of averagecosts obtained with the different methods in early and laterperiods respectively

The results are summarized in Table 1 The computationcosts of the methods are measured by relative averagecomputation time per epoch shown in lines 1ndash3 Average

6 Journal of Applied Mathematics

0 1000 2000 3000 4000 5000 6000

8000

10000

12000

14000

ma (cost 1000)

Epoch

Cos

t

S

R

P

T

B

SR

PTB

S SARSAR RSarsaP PRS

T RTDB PRSBeta

Figure 4 Moving average of period costs P1

0 5000 10000 15000 20000 25000

4000

4500

5000

5500

6000

ma (cost 1000)

Cos

t

R T

SPB

eEpoch

S SARSAR RSarsaP PRS

T RTDB PRSBeta

Figure 5 Moving average of period costs P2

costs are used as the inventory management performanceand they are shown in lines 4ndash6 for early periods (periods1ndash2000 in P1 and P2 and periods 1ndash4000 in P3) and lines 7ndash9for later periods (periods after early periods) The numbersin each entry indicate average costs obtained from thecorrespondingmethods Parentheses reveal results from one-sideWilcoxonrsquos rank sum tests ldquoWrdquo indicates that the averagecost is significantly lower than an average cost obtainedfrom SARSA (119875 lt 005) otherwise the 119875 value is showninstead

The computation costs of RSarsa PRS and PRSBeta (fullrumination) are about 20ndash30 times of SARSA (RL without

0 5000 10000 15000 20000

8000

10000

12000

14000

ma (cost 1000)

Cos

t

BR

S

P

T

Epoch

S SARSAR RSarsaP PRS

T RTDB PRSBeta

Figure 6 Moving average of period costs P3

rumination) RSarsaTD (selective rumination) dramaticallyreduces the computation cost of rumination at scales of 5ndash7times An evaluation of the effectiveness of each method(compared to SARSA) shows that RSarsa and PRSBetasignificantly outperform SARSA in early periods for all 3problems Average costs obtained from RSarsaTD are lowerthan ones from SARSA but significance tests can confirmonly results in P1 and P2 It should be noted that PRSresults do not show significant improvement over SARSAThis agrees with results in a previous study [17] With respectto performance in later periods average costs of PRS andPRSBeta are lower than SARSArsquos in all 3 problems Howeversignificance tests can confirmonly few results (P1 for PRS andP1 and P2 for PRSBeta)

Table 2 shows a summary of results from significance testscomparing the previous studyrsquos RRL methods (RSarsa andPRS) to our proposed methods (RSarsaTD and PRSBeta)The entries with ldquoWrdquo indicate that our proposed methodon the corresponding column significantly outperforms aprevious method on the corresponding row (119875 lt 005)Otherwise the 119875 value is indicated

6 Conclusions and Discussion

Our results have shown that PRSBeta achieves our goalwhich is to address the slow learning rate of PRS as it sig-nificantly outperforms PRS in early periods in all 3 problemsand to address the long-term learning quality of RSarsa as itsignificantly outperforms RSarsa in later periods in P1 and P2and its average cost is lower than RSarsarsquos in P3 It should benoted that although the performance of RSarsaTD may notseem impressive when compared to PRSBetarsquos RSarsaTD

Journal of Applied Mathematics 7

7500

8000

8500

P1 epochs 1minus2000

4200

4600

5000

P2 epochs 1minus2000

8000

12000

P3 epochs 1minus4000

S R P T B

S R P T B

S R P T B

Figure 7 Average costs in early periods

requires less computational cost Therefore as RSarsaTDshows some improvement over SARSA this reveals thatselective rumination is still worth further study

It should be noted that PRSBeta employs TD error tocontrol its behavior (6) The notion to extend TD error todetermine learning factors is not limited only to ruminationIt may be beneficial to use the TD error signal to determineother learning factors such as the learning rate for anadaptive-learning-rate agent A high TD error indicates thatthe agent has a lot to learn that what it has learned is wrongor that things are changing For each of these cases the goalis to make the agent learn more quickly So a high TD errorshould be a clue to increase the learning rate increase thedegree of rumination or increase the chance to do moreexploration

To address issues inRLworth investigationmore efficientQ-value representations should be among the prioritiesRegardless of the action policy every RL policy relies on Q-values to determine the action to take Function approxima-tions suitable to represent Q-values should facilitate efficient

6800

7200

7600

8000

P1 epochs 2001+

4100

4300

4500

4700P2 epochs 2001+

6000

8000

11000

P3 epochs 4001+

S R P T B

S R P T B

S R P T B

Figure 8 Average costs in late periods

Table 2 Experimental results

Line RSarsaTD PRSBetaEarly periods

1 P1 RSarsa 049 0162 PRS W W3 P2 RSarsa 095 W4 PRS 010 W5 P3 RSarsa 080 0376 PRS 026 W

Later periods7 P1 RSarsa W W8 PRS 063 0149 P2 RSarsa 046 W10 PRS 097 01211 P3 RSarsa 066 02612 PRS 060 018

realization of an action policy For example 120598-greedy policyhas to search for an optimal action AQ-value representation

8 Journal of Applied Mathematics

suitable for an 120598-greedy policy should allow efficient searchfor an optimal action given a state Another general RL actionpolicy is the softmax policy [2] Given a state the softmaxpolicy has to evaluate the probabilities of candidate actionsbased on their associated Q-values A representation thatfacilitates efficientmapping fromQ-values to the probabilitieswould have great practical importance in this case Due tothe interaction between the Q-value representation and theaction policy there are considerable efforts to combine thesetwo concepts This is an active research direction under therubric of policy gradient RL [18]

There are many issues in RL needed to be explored the-oretically and for application Our findings reported in thispaper provide another step in understanding and applying RLto practical inventory management problems Even thoughwe only investigated inventory management problems hereourmethods can be applied beyond this specific domainThisearly step in the study of using TD error to control learningfactors along with investigation of other issues in RL wouldyield a more robust learning agent that is useful in a widerange of practical applications

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgment

Theauthorswould like to thank theColorado StateUniversityLibraries Open Access Research and Scholarship Fund forsupporting the publication of this paper

References

[1] D Bertsimas and AThiele ldquoA robust optimization approach toinventory theoryrdquo Operations Research vol 54 no 1 pp 150ndash168 2006

[2] R S Sutton andAG BartoReinforcement LearningMITPressBoston Mass USA 1998

[3] W B Powell Approximate Dynamic Programming Solvingthe Curses of Dimensionality Wiley Series in Probability andStatistics Wiley-Interscience Hoboken NJ USA 2007

[4] J Rao X Bu C Z Xu L Wang and G Yin ldquoVCONF areinforcement learning approach to virtual machines auto-configurationrdquo in Proceedings of the 6th International Confer-ence on Autonomic Computing (ICAC 09) pp 137ndash146 ACMBarcelona Spain June 2009

[5] S G Khan G Herrmann F L Lewis T Pipe and C MelhuishldquoReinforcement learning and optimal adaptive control anoverview and implementation examplesrdquo Annual Reviews inControl vol 36 no 1 pp 42ndash59 2012

[6] A Coates P Abbeel and A Y Ng ldquoApprenticeship learning forhelicopter controlrdquo Communications of the ACM vol 52 no 7pp 97ndash105 2009

[7] C W Anderson D Hittle M Kretchmar and P YoungldquoRobust reinforcement learning for heating ventilation and airconditioning control of buildingsrdquo inHandbook of Learning andApproximate Dynamic Programming J Si A Barto W Powell

and D Wunsch Eds pp 517ndash534 John Wiley amp Sons NewYork NY USA 2004

[8] R Lincoln S Galloway B Stephen and G Burt ldquoCompar-ing policy gradient and value function based reinforcementlearning methods in simulated electrical power traderdquo IEEETransactions on Power Systems vol 27 no 1 pp 373ndash380 2012

[9] Z Tan C Quek and P Y K Cheng ldquoStock trading with cyclesa financial application of ANFIS and reinforcement learningrdquoExpert Systems with Applications vol 38 no 5 pp 4741ndash47552011

[10] A Castelletti F Pianosi and M Restelli ldquoA multiobjectivereinforcement learning approach to water resources systemsoperation pareto frontier approximation in a single runrdquoWaterResources Research vol 49 no 6 pp 3476ndash3486 2013

[11] T Katanyukul E K P Chong and W S Duff ldquoIntelligentinventory control is bootstrapping worth implementingrdquo inIntelligent Information Processing VI vol 385 of IFIP Advancesin Information and Communication Technology pp 58ndash67Springer New York NY USA 2012

[12] T Akiyama H Hachiya and M Sugiyama ldquoEfficient explo-ration through active learning for value function approximationin reinforcement learningrdquo Neural Networks vol 23 no 5 pp639ndash648 2010

[13] K Doya ldquoMetalearning and neuromodulationrdquo Neural Net-works vol 15 no 4ndash6 pp 495ndash506 2002

[14] P Dayan and N D Daw ldquoDecision theory reinforcementlearning and the brainrdquo Cognitive Affective and BehavioralNeuroscience vol 8 no 4 pp 429ndash453 2008

[15] T Katanyukul W S Duff and E K P Chong ldquoApproximatedynamic programming for an inventory problem empiricalcomparisonrdquo Computers and Industrial Engineering vol 60 no4 pp 719ndash743 2011

[16] C O Kim I H Kwon and J G Baek ldquoAsynchronous action-reward learning for nonstationary serial supply chain inventorycontrolrdquo Applied Intelligence vol 28 no 1 pp 1ndash16 2008

[17] T Katanyukul ldquoRuminative reinforcement learning improveintelligent inventory control by ruminating on the pastrdquo inProceedings of the 5th International Conference on ComputerScience and Information Technology (ICCSIT 13) AcademicPublisher Paris France 2013

[18] N A Vien H Yu and T Chung ldquoHessian matrix distributionfor Bayesian policy gradient reinforcement learningrdquo Informa-tion Sciences vol 181 no 9 pp 1671ndash1685 2011

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical Problems in Engineering

Hindawi Publishing Corporationhttpwwwhindawicom

Differential EquationsInternational Journal of

Volume 2014

Applied MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

OptimizationJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Operations ResearchAdvances in

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Function Spaces

Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of Mathematics and Mathematical Sciences

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Algebra

Discrete Dynamics in Nature and Society

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Decision SciencesAdvances in

Discrete MathematicsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom

Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Stochastic AnalysisInternational Journal of

Page 4: Research Article Intelligent Inventory Control via ...downloads.hindawi.com/journals/jam/2014/238357.pdf · Research Article Intelligent Inventory Control via Ruminative Reinforcement

4 Journal of Applied Mathematics

00

02

04

06

08

10

f

1e6 8e5 6e5 4e5 2e5 0|120595|

f exp(minus120591 |120595|)

120591 = 1

120591 = 1e minus 5

120591 = 05e minus 5

Figure 3 Function exp(minus120591|120595|) and effects of different 120591 values

When 119891 rarr 1 120573 rarr 120572 sdot 119901(119886) and the algorithm will behavelike PRS We want 119891 to start out close to 0 and grow to 1

at a proper rate By examining our preliminary experimentsthe TD error will get smaller as the learning converges Thisis actually a property of TD learning Given this propertywe can use the magnitude of the TD error |120595| to control theshifting such that

119891 (120595) = exp (minus120591 sdot 10038161003816100381610038161205951003816100381610038161003816) (6)

where 120591 is a scaling factor Figure 3 illustrates the effects ofdifferent values of 120591 Since the magnitude of 120591 should berelative to |120595| we set 120591 = |2(119903 + 119876(119904 119886))| so that themagnitude of 120591 will be in a proper scale relative to |120595| andautomatically adjusted

RSarsaTD Building on the PRSBeta method above we nextpropose another method called RSarsaTD The underlyingidea is that since SARSA performs well in a long-term run(see [2] for theoretical discussion of SARSArsquos optimalityand convergence properties) then after we speed up theearly learning process with rumination we can just switchback to SARSA This approach is to utilize the fast learningcharacteristic of full rumination in early periods and toavoid its poor long-term performance In addition as acomputational cost of rumination is proportional to the sizeof the ruminative action space |119860(119904)| this also helps to reducethe computational cost incurred by rumination It is alsointuitively appealing in the sense that we do rumination onlywhen we need it

The intuition to selectively do ruminationwas introducedin [17] in an attempt to reduce the extra computational costfrom ruminationThere the probability to do ruminationwasa function of the magnitude of the TD error

119901 (rumination) = 1 minus exp(minus10038161003816100381610038161003816100381610038161003816

2120595

119903 + 119876 (119904 119886)

10038161003816100381610038161003816100381610038161003816

) (7)

However Katanyukul [17] investigated this selective rumi-nation only with the policy-weighted method and called itPRSTD Although PRSTD was able to improve the compu-tational cost of the rumination approach the inventory man-agement performance of PRSTDwas reported to have mixedresults implying that incorporation of selective ruminationmay deteriorate performance of PRS

This performance deterioration may be due to using119901(rumination) with policy weighted correction Bothschemes use |120595| to control their effect of rumination there-fore they might have an effect equivalent to overcorrectingthe state-transition probability Unlike PRS RSarsa doesnot correct the state-transition probability Incorporatingselective rumination (7) will be the only scheme controllingrumination with |120595| Therefore we expect that this approachmay allow the advantage of RSarsarsquos fast learning whilemaintaining the long-term learning quality of SARSA

5 Experiments and Results

Our study uses computer simulations to conduct numeri-cal experiments on three inventory management problemsettings (P1 P2 and P3) All problems are periodic reviewsingle-echelon with nonzero setup cost P1 and P2 have one-period lead time P3 has two-period lead time The sameMarkov model is used to govern all problem environmentsbut with different settings For P1 and P2 the problem statespace is I times 0 I+ for on-hand and in-transit inventories 119909and 119887(1) respectively P3rsquos state space is I times 0 I+ times 0 I+ for119909 and in-transit inventories 119887(1) and 119887(2) The action space is0 I+ for replenishment order 119886

The state transition is specified by

119909119905+1

= 119909119905+ 119887(1)

119905minus 119889119905

119887(119894)

119905+1= 119887(119894+1)

119905 for 119894 = 1 119871 minus 1

119887(119871)

119905+1= 119886119905

(8)

where 119871 is a number of lead time periodsThe inventory period cost is calculated from the equation

119903119905= 119870 sdot 120575 (119886

119905) + 119866 sdot 119886

119905+ 119867 sdot 119909

119905+1sdot 120575 (119909119905+1)

minus 119861 sdot 119909119905+1

sdot 120575 (minus119909119905+1)

(9)

where 119870 119866 119867 and 119861 are setup unit holding and penaltycosts respectively and 120575(sdot) is a step function Five RL agentsare studied SARSA RSarsa PRS RSarsaTD and PRSBeta

Each experiment is repeated 10 times In each repeti-tion an agent is initialized with all zero Q-values Thenthe experiment is run consecutively for 119873

119864episodes Each

episode starts with initial state and action as follows for allproblems 119887(1)

1and 119886

1are initialized with values randomly

drawn between 0 and 100 In P1 1199091is initialized to 50 in

P21199091is initialized from randomly drawn values betweenminus50

and 400 in P3 1199091is initialized to 100 and randomly drawn

values of 119887(2) between 0 and 100 Each episode ends when119873119875periods are reached or an agent has visited a termination

Journal of Applied Mathematics 5

Table 1 Experimental results

Line MethodsSARSA RSarsa PRS RSarsaTD PRSBeta

Relative computation timeepoch1 P1 1 30 26 5 302 P2 1 20 21 3 193 P3 1 31 29 6 31

Average cost of early periods4 P1 8421 7619 (W) 8379 (p043) 7597 (W) 7450 (W)5 P2 4935 4606 (W) 4792 (p006) 4685 (W) 4411 (W)6 P3 10502 8694 (W) 9958 (p020) 9390 (p007) 8472 (W)

Average cost of later periods7 P1 7214 7355 (p068) 7051 (W) 7110 (p011) 7010 (W)8 P2 4308 4388 (p090) 4248 (p014) 4375 (p084) 4194 (W)9 P3 8613 8139 (p029) 8312 (p037) 8486 (p043) 7664 (p018)

state which is a state lying outside a valid range of Q-valueimplementation The maximum number of periods in eachepisode119873

119875 defines the length of the problem horizon while

the number of episodes 119873119864specifies a variety of problem

scenarios that is different initial states and actionsThree problem settings are used in our experiments

Problem 1 (P1) has 119873119864= 100 119873

119875= 60 119870 = 200 119866 = 100

119861 = 200 and 119867 = 20 Demand 119889119905is normally distributed

with mean 50 and standard deviation 10 denoted as 119889119905sim

N(50 102) The environment state [119909 119887(1)] is set as the RL

agent state 119904 = [119909 119887(1)] Problem 2 (P2) has 119873

119864= 500

119873119875= 60 119870 = 200 119866 = 50 119861 = 200 and 119867 = 20 with

demand 119889119905sim N(50 10

2) The RL agent state is set as the

inventory level 119904 = 119909 + 119887(1) Therefore the RL agent state is

one-dimensional Problem 3 (P3) has 119873119864= 500 119873

119875= 60

119870 = 200 119866 = 50 119861 = 200 and 119867 = 20 The demand 119889119905is

AR1GARCH(11) 119889119905= 1198860+ 1198861sdot 119889119905minus1

+ 120598119905 120598119905= 119890119905sdot 120590119905and

1205902

119905= ]0+ ]1sdot 1205982

119905minus1+ ]2sdot 1205902

119905minus1 where 119886

0and 1198861are AR1 model

parameters ]0 ]1 and ]

2are GARCH(11) parameters and 119890

119905

is white noise distributed according toN(0 1) The values ofAR1GARCH(11) in our experiments are 119886

0= 2 119886

1= 08

]0= 100 ]

1= 01 and ]

2= 08 with initial values 119889

1= 50

1205902

1= 100 and 120598

1= 2 The RL agent state in P3 is three-

dimensional 119904 = [119909 119887(1) 119887(2)] In all three problem settings

the RL agent period cost and action are the inventory periodcost and replenishment order respectively For RSarsa PRSRSarsaTD and PRSBeta the extra information required byrumination is the inventory demand variable 120585 = 119889

119905

The Q-value is implemented using grid tile coding [2]without hashing Tile coding is a function approximationmethod based on a linear combination of weights of activatedfeatures called ldquotilesrdquo The approximation function withargument z is given by

119891 (z) = 11990811206011(z) + 119908

21206012(z) + sdot sdot sdot + 119908

119872120601119872(z) (10)

where 1199081 1199082 119908

119872are tile weights and 120601

1(z) 1206012(z)

120601119872(z) are tile activation functions 120601

119894(z) = 1 only when z lies

inside the hypercube of the 119894th tile

The tile configuration that is 1206011(z) 120601

119872(z) is prede-

fined Each Q-value is stored using tile coding through theweights Given a value119876 to store at any entry of z the weightsare updated according to

119908119894= 119908(old)119894

+

(119876 minus 119876(old)

)

119873 (11)

where 119908(old)119894

and 119876(old) are the weight (of the 119894th tile) and

approximation before the new update Variable 119873 is for anumber of tiling layers

For P1, we use a tile coding with 10 tiling layers. Each layer has 8 × 3 × 4 three-dimensional tiles covering the multidimensional state-action space [−300, 500] × [0, 150] × [0, 150], corresponding to s = [x, b^(1)] and a. This means that the tile coding allows only a state lying in [−300, 500] × [0, 150] and an action value between 0 and 150. The dimensions along x, b^(1), and a are partitioned into 8, 3, and 4 partitions, creating 96 three-dimensional hypercubes for each tiling layer. All layers overlap to constitute the entire tile coding set, and the layer overlap is arranged randomly. For P2, we use a tile coding with 5 tiling layers. Each tiling has 11 × 5 two-dimensional tiles covering the space [−300, 650] × [0, 150], corresponding to s = (x + b^(1)) and a. For P3, we use a tile coding with 10 tiling layers. Each tiling has 8 × 3 × 3 × 4 four-dimensional tiles covering the space [−400, 1200] × [0, 150] × [0, 150] × [0, 150], corresponding to s = [x, b^(1), b^(2)] and a.
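To make equations (10) and (11) and the grids just described concrete, the sketch below implements a minimal grid tile coder without hashing. The class and method names are ours, and the random per-layer offsets are only one plausible way to arrange the layer overlap; the paper does not specify its exact implementation.

```python
import numpy as np

class GridTileCoding:
    """Minimal grid tile coding without hashing: one weight per tile per layer."""

    def __init__(self, lows, highs, partitions, n_layers, seed=0):
        rng = np.random.default_rng(seed)
        self.lows = np.asarray(lows, dtype=float)
        self.partitions = np.asarray(partitions, dtype=int)
        self.n_layers = n_layers
        self.widths = (np.asarray(highs, dtype=float) - self.lows) / self.partitions
        # Random per-layer offsets give the randomly arranged layer overlap.
        self.offsets = rng.uniform(0.0, 1.0, (n_layers, len(lows))) * self.widths
        self.weights = np.zeros((n_layers, int(np.prod(self.partitions))))

    def _active_tiles(self, z):
        """Index of the single active tile per layer, i.e. the phi_i(z) = 1 features of (10)."""
        tiles = []
        for layer in range(self.n_layers):
            cell = np.floor((np.asarray(z, dtype=float) - self.lows
                             + self.offsets[layer]) / self.widths).astype(int)
            cell = np.clip(cell, 0, self.partitions - 1)
            tiles.append(int(np.ravel_multi_index(tuple(cell), tuple(self.partitions))))
        return tiles

    def value(self, z):
        # Equation (10): sum of the weights of the activated tiles.
        return sum(self.weights[layer, tile]
                   for layer, tile in enumerate(self._active_tiles(z)))

    def store(self, z, q_target):
        # Equation (11): each of the N active weights moves by (Q - Q_old) / N.
        delta = (q_target - self.value(z)) / self.n_layers
        for layer, tile in enumerate(self._active_tiles(z)):
            self.weights[layer, tile] += delta

# P1 configuration: 10 layers of 8 x 3 x 4 tiles over [-300, 500] x [0, 150] x [0, 150].
q_p1 = GridTileCoding(lows=[-300, 0, 0], highs=[500, 150, 150],
                      partitions=[8, 3, 4], n_layers=10)
```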

All RL agents use the ε-greedy policy with ε = 0.2. The learning update uses the learning rate α = 0.7 and the discount factor γ = 0.8.
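The following sketch shows how these settings could drive the SARSA baseline on top of the GridTileCoding sketch above. It assumes the Q-value estimates expected discounted cost, so the greedy choice is the action with the lowest predicted cost, and it assumes a discretized grid of order quantities; the helper names and the action grid are ours, not the paper's.

```python
import numpy as np

ACTIONS = np.arange(0, 151, 10)   # assumed discretization of the order quantity in [0, 150]

def epsilon_greedy_action(q, state, epsilon=0.2, rng=None):
    """epsilon-greedy over Q(s, a): random order with probability epsilon, else cheapest predicted."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.choice(ACTIONS))
    values = [q.value(np.append(state, a)) for a in ACTIONS]
    return int(ACTIONS[int(np.argmin(values))])

def sarsa_update(q, state, action, cost, next_state, next_action, alpha=0.7, gamma=0.8):
    """One SARSA step: move Q(s, a) a fraction alpha toward cost + gamma * Q(s', a')."""
    z = np.append(state, action)
    td_target = cost + gamma * q.value(np.append(next_state, next_action))
    td_error = td_target - q.value(z)
    q.store(z, q.value(z) + alpha * td_error)   # store() applies the equation (11) update
    return td_error
```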

Figures 4, 5, and 6 show moving averages (of degree 1000) of the period costs in P1, P2, and P3 obtained with the different learning agents, as indicated in the legends ("RTD" is short for RSarsaTD). Figures 7 and 8 show box plots of the average costs obtained with the different methods in early and later periods, respectively.
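For reference, the curves in Figures 4–6 can be reproduced from the logged period costs with a simple trailing average; we assume here that a moving average "of degree 1000" means a window of 1,000 periods.

```python
import numpy as np

def moving_average(period_costs, window=1000):
    """Trailing moving average of logged period costs (assumed 1000-period window)."""
    kernel = np.ones(window) / window
    return np.convolve(period_costs, kernel, mode="valid")
```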

The results are summarized in Table 1. The computation costs of the methods are measured by the relative average computation time per epoch, shown in lines 1–3.

Figure 4: Moving average of period costs, P1 (moving average of cost versus epoch; S: SARSA, R: RSarsa, P: PRS, T: RTD, B: PRSBeta).

Figure 5: Moving average of period costs, P2 (moving average of cost versus epoch; legend as in Figure 4).

Average costs are used as the inventory management performance measure; they are shown in lines 4–6 for early periods (periods 1–2000 in P1 and P2, and periods 1–4000 in P3) and in lines 7–9 for later periods (the periods after the early periods). The number in each entry is the average cost obtained with the corresponding method. Parentheses report the results of one-sided Wilcoxon rank sum tests: "W" indicates that the average cost is significantly lower than the average cost obtained from SARSA (P < 0.05); otherwise, the P value is shown instead.
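Comparisons of this kind can be reproduced along the following lines. The sketch assumes the per-period costs of a method and of SARSA are treated as two independent samples and compared with SciPy's one-sided rank sum test; the exact test setup used for Table 1 may differ.

```python
from scipy.stats import ranksums

def better_than_sarsa(method_costs, sarsa_costs, alpha=0.05):
    """One-sided Wilcoxon rank sum test: are the method's period costs significantly lower?"""
    # alternative="less" requires SciPy >= 1.7
    _, p_value = ranksums(method_costs, sarsa_costs, alternative="less")
    return "W" if p_value < alpha else round(p_value, 2)
```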

The computation costs of RSarsa, PRS, and PRSBeta (full rumination) are about 20–30 times that of SARSA (RL without rumination).

Figure 6: Moving average of period costs, P3 (moving average of cost versus epoch; legend as in Figure 4).

RSarsaTD (selective rumination) dramatically reduces the computation cost of rumination, by a factor of about 5–7. An evaluation of the effectiveness of each method (compared to SARSA) shows that RSarsa and PRSBeta significantly outperform SARSA in early periods for all 3 problems. Average costs obtained from RSarsaTD are lower than those from SARSA, but the significance tests confirm only the results in P1 and P2. It should be noted that the PRS results do not show significant improvement over SARSA; this agrees with the results of a previous study [17]. With respect to performance in later periods, the average costs of PRS and PRSBeta are lower than SARSA's in all 3 problems. However, the significance tests confirm only a few of these results (P1 for PRS, and P1 and P2 for PRSBeta).

Table 2 summarizes the results of significance tests comparing the previous study's RRL methods (RSarsa and PRS) with our proposed methods (RSarsaTD and PRSBeta). An entry marked "W" indicates that the proposed method in the corresponding column significantly outperforms the previous method in the corresponding row (P < 0.05); otherwise, the P value is indicated.

6. Conclusions and Discussion

Our results show that PRSBeta achieves our goal: it addresses the slow learning rate of PRS, as it significantly outperforms PRS in early periods in all 3 problems, and it addresses the long-term learning quality of RSarsa, as it significantly outperforms RSarsa in later periods in P1 and P2 and its average cost is lower than RSarsa's in P3. It should be noted that, although the performance of RSarsaTD may not seem impressive when compared to PRSBeta's, RSarsaTD requires less computational cost.

Figure 7: Average costs in early periods (box plots for P1, epochs 1–2000; P2, epochs 1–2000; P3, epochs 1–4000; S, R, P, T, B as in Figure 4).

Therefore, as RSarsaTD still shows some improvement over SARSA, selective rumination remains worth further study.

It should be noted that PRSBeta employs the TD error to control its behavior (6). The idea of using the TD error to determine learning factors is not limited to rumination. It may be beneficial to use the TD error signal to set other learning factors as well, such as the learning rate of an adaptive-learning-rate agent. A high TD error indicates that the agent has a lot to learn, that what it has learned is wrong, or that things are changing. In each of these cases the goal is to make the agent learn more quickly, so a high TD error should be a cue to increase the learning rate, increase the degree of rumination, or increase the chance of exploration.
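One way to make this idea concrete is sketched below. The squashing function, the scale, and the ranges are illustrative choices of ours and are not taken from the paper; PRSBeta's actual mapping from TD error to rumination is the one given by (6).

```python
import math

def modulated_factors(td_error, scale=100.0,
                      alpha_range=(0.1, 0.9),
                      epsilon_range=(0.05, 0.3),
                      rumination_range=(0.1, 1.0)):
    """Map the magnitude of the TD error to learning rate, exploration, and rumination settings."""
    surprise = math.tanh(abs(td_error) / scale)   # 0: as expected, ~1: very surprising

    def blend(bounds):
        return bounds[0] + surprise * (bounds[1] - bounds[0])

    return {
        "alpha": blend(alpha_range),                  # learn faster when surprised
        "epsilon": blend(epsilon_range),              # explore more when surprised
        "rumination_prob": blend(rumination_range),   # ruminate more when surprised
    }
```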

Among the issues in RL worth investigating, more efficient Q-value representations should be a priority. Regardless of the action policy, every RL policy relies on Q-values to determine the action to take. Function approximations suitable for representing Q-values should facilitate efficient realization of the action policy.

Figure 8: Average costs in late periods (box plots for P1, epochs 2001+; P2, epochs 2001+; P3, epochs 4001+; S, R, P, T, B as in Figure 4).

Table 2: Experimental results.

Line                      RSarsaTD    PRSBeta
Early periods
1    P1    RSarsa         0.49        0.16
2          PRS            W           W
3    P2    RSarsa         0.95        W
4          PRS            0.10        W
5    P3    RSarsa         0.80        0.37
6          PRS            0.26        W
Later periods
7    P1    RSarsa         W           W
8          PRS            0.63        0.14
9    P2    RSarsa         0.46        W
10         PRS            0.97        0.12
11   P3    RSarsa         0.66        0.26
12         PRS            0.60        0.18

For example, an ε-greedy policy has to search for an optimal action, so a Q-value representation suitable for an ε-greedy policy should allow an efficient search for the optimal action given a state. Another common RL action policy is the softmax policy [2]. Given a state, the softmax policy has to evaluate the probabilities of candidate actions based on their associated Q-values; a representation that facilitates efficient mapping from Q-values to these probabilities would have great practical importance in this case. Because of the interaction between the Q-value representation and the action policy, there are considerable efforts to combine these two concepts; this is an active research direction under the rubric of policy gradient RL [18].

Many issues in RL still need to be explored, both theoretically and in application. The findings reported in this paper provide another step toward understanding and applying RL to practical inventory management problems. Although we investigated only inventory management problems here, our methods can be applied beyond this specific domain. This early step in the study of using the TD error to control learning factors, along with the investigation of other issues in RL, should yield a more robust learning agent that is useful in a wide range of practical applications.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

The authors would like to thank the Colorado State University Libraries Open Access Research and Scholarship Fund for supporting the publication of this paper.

References

[1] D. Bertsimas and A. Thiele, "A robust optimization approach to inventory theory," Operations Research, vol. 54, no. 1, pp. 150–168, 2006.

[2] R. S. Sutton and A. G. Barto, Reinforcement Learning, MIT Press, Boston, Mass, USA, 1998.

[3] W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Wiley Series in Probability and Statistics, Wiley-Interscience, Hoboken, NJ, USA, 2007.

[4] J. Rao, X. Bu, C. Z. Xu, L. Wang, and G. Yin, "VCONF: a reinforcement learning approach to virtual machines auto-configuration," in Proceedings of the 6th International Conference on Autonomic Computing (ICAC '09), pp. 137–146, ACM, Barcelona, Spain, June 2009.

[5] S. G. Khan, G. Herrmann, F. L. Lewis, T. Pipe, and C. Melhuish, "Reinforcement learning and optimal adaptive control: an overview and implementation examples," Annual Reviews in Control, vol. 36, no. 1, pp. 42–59, 2012.

[6] A. Coates, P. Abbeel, and A. Y. Ng, "Apprenticeship learning for helicopter control," Communications of the ACM, vol. 52, no. 7, pp. 97–105, 2009.

[7] C. W. Anderson, D. Hittle, M. Kretchmar, and P. Young, "Robust reinforcement learning for heating, ventilation, and air conditioning control of buildings," in Handbook of Learning and Approximate Dynamic Programming, J. Si, A. Barto, W. Powell, and D. Wunsch, Eds., pp. 517–534, John Wiley & Sons, New York, NY, USA, 2004.

[8] R. Lincoln, S. Galloway, B. Stephen, and G. Burt, "Comparing policy gradient and value function based reinforcement learning methods in simulated electrical power trade," IEEE Transactions on Power Systems, vol. 27, no. 1, pp. 373–380, 2012.

[9] Z. Tan, C. Quek, and P. Y. K. Cheng, "Stock trading with cycles: a financial application of ANFIS and reinforcement learning," Expert Systems with Applications, vol. 38, no. 5, pp. 4741–4755, 2011.

[10] A. Castelletti, F. Pianosi, and M. Restelli, "A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run," Water Resources Research, vol. 49, no. 6, pp. 3476–3486, 2013.

[11] T. Katanyukul, E. K. P. Chong, and W. S. Duff, "Intelligent inventory control: is bootstrapping worth implementing?" in Intelligent Information Processing VI, vol. 385 of IFIP Advances in Information and Communication Technology, pp. 58–67, Springer, New York, NY, USA, 2012.

[12] T. Akiyama, H. Hachiya, and M. Sugiyama, "Efficient exploration through active learning for value function approximation in reinforcement learning," Neural Networks, vol. 23, no. 5, pp. 639–648, 2010.

[13] K. Doya, "Metalearning and neuromodulation," Neural Networks, vol. 15, no. 4–6, pp. 495–506, 2002.

[14] P. Dayan and N. D. Daw, "Decision theory, reinforcement learning, and the brain," Cognitive, Affective and Behavioral Neuroscience, vol. 8, no. 4, pp. 429–453, 2008.

[15] T. Katanyukul, W. S. Duff, and E. K. P. Chong, "Approximate dynamic programming for an inventory problem: empirical comparison," Computers and Industrial Engineering, vol. 60, no. 4, pp. 719–743, 2011.

[16] C. O. Kim, I. H. Kwon, and J. G. Baek, "Asynchronous action-reward learning for nonstationary serial supply chain inventory control," Applied Intelligence, vol. 28, no. 1, pp. 1–16, 2008.

[17] T. Katanyukul, "Ruminative reinforcement learning: improve intelligent inventory control by ruminating on the past," in Proceedings of the 5th International Conference on Computer Science and Information Technology (ICCSIT '13), Academic Publisher, Paris, France, 2013.

[18] N. A. Vien, H. Yu, and T. Chung, "Hessian matrix distribution for Bayesian policy gradient reinforcement learning," Information Sciences, vol. 181, no. 9, pp. 1671–1685, 2011.


