OR II (GSLM 52800)
Policy and Action
policy: the rule specifying what to do at every state
action: what to do at a given state, as dictated by the policy
examples
policy "replacement only at state 3": do nothing at states 0, 1, and 2; replace at state 3
policy "overhaul at state 2 and replacement at state 3": do nothing at states 0 and 1, overhaul at state 2, and replace at state 3
Expected Reward
p_{ij}(k) = the probability of changing from state i to state j when action k is taken
q_{ij}(k) = the expected cost incurred at state i when action k is taken and the state changes to j
C_{ik} = the expected cost at state i with action k
[figure: states i and j joined by an arrow labeled p_{ij}(k)]

$$C_{ik} = \sum_{j=0}^{M} q_{ij}(k)\, p_{ij}(k)$$
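A minimal numeric sketch of this expectation in Python; the q_{ij}(k) values are hypothetical, since the deck's example does not list them:

    # One action k at one state i: hypothetical transition-dependent costs
    # q_ij(k) and transition probabilities p_ij(k) for j = 0..M.
    q_i = [0.0, 500.0, 1500.0, 3000.0]
    p_i = [0.0, 3/4, 1/8, 1/8]  # rows of a transition matrix sum to 1

    # C_ik = sum_j q_ij(k) * p_ij(k), the expected one-step cost
    C_ik = sum(q * p for q, p in zip(q_i, p_i))
    print(C_ik)  # 937.5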
Definition of Variables
policy R
g(R) = the long-term average cost per unit time of policy R
objective: find the policy that minimizes g(R)
v_i(R) = the effect on the total expected cost when adopting policy R and starting at state i
v_i^n(R) = the total cost of starting at state i and adopting policy R with n periods to go

$$g(R) = \sum_{i=0}^{M} \pi_i C_{ik}$$

where \pi_i is the steady-state probability of state i under policy R
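A sketch checking this formula on the example that appears later in the deck (the policy "replacement only at state 3"); the steady-state solution method and names are mine:

    import numpy as np

    # Transition matrix and costs under "replacement only at state 3"
    P = np.array([[0, 7/8, 1/16, 1/16],
                  [0, 3/4, 1/8,  1/8 ],
                  [0, 0,   1/2,  1/2 ],
                  [1, 0,   0,    0   ]])
    C = np.array([0, 1000, 3000, 6000])

    # Steady state: solve pi P = pi together with sum(pi) = 1
    A = np.vstack([P.T - np.eye(4), np.ones(4)])
    b = np.array([0, 0, 0, 0, 1])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)

    print(pi)      # approximately [2/13, 7/13, 2/13, 2/13]
    print(pi @ C)  # approximately 1923, the g(R) found in the example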
Relationship Between $v_i^n(R)$ and $v_i^{n-1}(R)$
$$v_i^n(R) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j^{n-1}(R), \quad i = 0, 1, \ldots, M, \qquad v_i^1(R) = C_{ik}$$

For large n, intuitively

$$v_i^n(R) \approx n\, g(R) + v_i(R)$$

Substituting into the recursion,

$$n\, g(R) + v_i(R) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k) \left[ (n-1)\, g(R) + v_j(R) \right], \quad i = 0, 1, \ldots, M$$

and since $\sum_{j=0}^{M} p_{ij}(k) = 1$, this reduces to

$$g(R) + v_i(R) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R), \quad i = 0, 1, \ldots, M$$
Claim: The intuitive idea is exact
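A numerical illustration of the claim, iterating the recursion under the example's first policy; the check and the names are mine, and g is the value the example derives later:

    import numpy as np

    P = np.array([[0, 7/8, 1/16, 1/16],
                  [0, 3/4, 1/8,  1/8 ],
                  [0, 0,   1/2,  1/2 ],
                  [1, 0,   0,    0   ]])
    C = np.array([0.0, 1000.0, 3000.0, 6000.0])
    g = 25000 / 13  # long-run average cost of this policy (about 1923)

    v = C.copy()          # v^1
    for n in range(2, 200):
        v = C + P @ v     # v^n = C_ik + sum_j p_ij(k) v_j^(n-1)

    # v^n - n*g stays bounded, and the differences v_i^n - v_M^n converge
    # to the relative values v_i(R) with v_M(R) pinned to 0; this prints
    # approximately (-4077, -2615, 2154, 0), the later slides' numbers.
    print(v - v[3])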
Key Result in Policy Improvement
M+1 equations, M+2 unknowns
g(R) = the long-term average cost of policy R
v_i(R) = the effect on the total expected cost when adopting policy R and starting at state i

$$g(R) + v_i(R) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R), \quad i = 0, 1, \ldots, M$$
Idea of Policy Improvement
the equations are unchanged when every v_i(R) is shifted by a constant c, i.e., when v_i(R) is replaced by v_i + c:

$$g(R) + (v_i + c) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\,(v_j + c), \quad i = 0, 1, \ldots, M$$

since $\sum_{j=0}^{M} p_{ij}(k) = 1$, the constant c cancels:

$$g(R) + v_i = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j, \quad i = 0, 1, \ldots, M$$

hence the set of equations can be solved by arbitrarily setting v_M(R) = 0
Idea of Policy Improvement
given policy R with action k at each state, the key result gives

$$g(R) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R) - v_i(R)$$

suppose that there exists a policy R_o with action k_o such that, for every state i,

$$C_{ik_o} + \sum_{j=0}^{M} p_{ij}(k_o)\, v_j(R) - v_i(R) \le C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R) - v_i(R)$$

with strict inequality for at least one state; then it can be shown that g(R_o) < g(R)
Policy Improvement
1 Value Determination: fix policy R; set v_M(R) = 0 and solve

$$g(R) = C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R) - v_i(R), \quad \text{for } i = 0, 1, \ldots, M$$

2 Policy Improvement: for each state i, find the action k attaining

$$\min_{k = 1, 2, \ldots, K} \left[ C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R) - v_i(R) \right]$$

3 Form a new policy from the actions found in step 2. Stop if this policy is the same as R; else go to step 1 (a code sketch of the full loop follows)
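A minimal runnable sketch of the loop in Python, applied to the deck's machine-maintenance example; the data come from the example slides, while the function and variable names are mine:

    import numpy as np

    # States 0..3; actions 1 = do nothing, 2 = overhaul, 3 = replace.
    # feasible[i][k] = (C_ik, transition row p_i.(k)).
    feasible = {
        0: {1: (0.0,    [0, 7/8, 1/16, 1/16])},
        1: {1: (1000.0, [0, 3/4, 1/8,  1/8 ]),
            3: (6000.0, [1, 0,   0,    0   ])},
        2: {1: (3000.0, [0, 0,   1/2,  1/2 ]),
            2: (4000.0, [0, 1,   0,    0   ]),
            3: (6000.0, [1, 0,   0,    0   ])},
        3: {3: (6000.0, [1, 0,   0,    0   ])},
    }
    M = 3  # highest state index

    def value_determination(policy):
        # Solve g + v_i = C_ik + sum_j p_ij(k) v_j with v_M = 0.
        # Unknown vector x = (v_0, ..., v_{M-1}, g).
        A = np.zeros((M + 1, M + 1))
        b = np.zeros(M + 1)
        for i in range(M + 1):
            cost, p = feasible[i][policy[i]]
            if i < M:
                A[i, i] += 1       # coefficient of v_i
            A[i, M] = 1            # coefficient of g
            for j in range(M):
                A[i, j] -= p[j]    # move sum_j p_ij(k) v_j to the left side
            b[i] = cost
        x = np.linalg.solve(A, b)
        return x[M], np.append(x[:M], 0.0)   # g, (v_0, ..., v_M)

    def improve(policy, v):
        # Pick, at each state, the action minimizing the test quantity
        # C_ik + sum_j p_ij(k) v_j - v_i.
        return {i: min(feasible[i],
                       key=lambda k: feasible[i][k][0]
                                     + np.dot(feasible[i][k][1], v) - v[i])
                for i in range(M + 1)}

    policy = {0: 1, 1: 1, 2: 1, 3: 3}   # replacement only at state 3
    while True:
        g, v = value_determination(policy)
        print(policy, round(g), np.round(v))
        new_policy = improve(policy, v)
        if new_policy == policy:
            break                        # no change: current policy is optimal
        policy = new_policy
    # Ends with do nothing at 0 and 1, overhaul at 2, replace at 3; g about 1667.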
Idea of Policy Improvement
it can be proven that g(R) is non-increasing from one iteration to the next
R is optimal if the improvement step produces no change in policy
the algorithm stops after a finite number of iterations
Example
Policy: Replacement only at state 3
transition probability matrix
C_{01} = 0, C_{11} = 1000, C_{21} = 3000, C_{33} = 6000
$$P = \begin{bmatrix} 0 & 7/8 & 1/16 & 1/16 \\ 0 & 3/4 & 1/8 & 1/8 \\ 0 & 0 & 1/2 & 1/2 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$
Example
Iteration 1: Value Determination
$$\begin{aligned} g(R) + v_0(R) &= 0 + \tfrac{7}{8} v_1(R) + \tfrac{1}{16} v_2(R) + \tfrac{1}{16} v_3(R) \\ g(R) + v_1(R) &= 1000 + \tfrac{3}{4} v_1(R) + \tfrac{1}{8} v_2(R) + \tfrac{1}{8} v_3(R) \\ g(R) + v_2(R) &= 3000 + \tfrac{1}{2} v_2(R) + \tfrac{1}{2} v_3(R) \\ g(R) + v_3(R) &= 6000 + v_0(R) \end{aligned}$$
Setting v_3(R) = 0 and solving:

$$v_3(R) = 0, \quad g(R) = 1923, \quad v_0(R) = -4077, \quad v_1(R) = -2615, \quad v_2(R) = 2154$$

(a solver check follows)
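A sketch solving the four value-determination equations directly; the matrix encoding is mine:

    import numpy as np

    # Unknowns (v_0, v_1, v_2, g) with v_3 fixed at 0; each row encodes
    # g + v_i - sum_j p_ij v_j = C_i under "replacement only at state 3".
    A = np.array([[ 1, -7/8, -1/16, 1],
                  [ 0,  1/4, -1/8,  1],
                  [ 0,  0,    1/2,  1],
                  [-1,  0,    0,    1]])
    b = np.array([0, 1000, 3000, 6000])

    v0, v1, v2, g = np.linalg.solve(A, b)
    print(round(g))                         # 1923
    print(round(v0), round(v1), round(v2))  # -4077 -2615 2154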
Example
Iteration 1: Policy Improvement
nothing can be done at state 0, and the machine must be replaced at state 3
possible decisions
state 1: decision 1 (do nothing, $1000)
         decision 3 (replace, $6000)
state 2: decision 1 (do nothing, $3000)
         decision 2 (overhaul, $4000)
         decision 3 (replace, $6000)
Example
Iteration 1: Policy Improvement: the general expressions

Using g(R) = 1923, v_0(R) = -4077, v_1(R) = -2615, v_2(R) = 2154, and v_3(R) = 0, the test quantity $C_{ik} + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R) - v_i(R)$ becomes

$$\begin{aligned} \text{State 0: } & C_{0k} - p_{00}(k)(4077) - p_{01}(k)(2615) + p_{02}(k)(2154) + 4077 \\ \text{State 1: } & C_{1k} - p_{10}(k)(4077) - p_{11}(k)(2615) + p_{12}(k)(2154) + 2615 \\ \text{State 2: } & C_{2k} - p_{20}(k)(4077) - p_{21}(k)(2615) + p_{22}(k)(2154) - 2154 \\ \text{State 3: } & C_{3k} - p_{30}(k)(4077) - p_{31}(k)(2615) + p_{32}(k)(2154) - 0 \end{aligned}$$
Example
Iteration 1: Policy Improvement
State 1
Decision  C_1k   p_10(k)  p_11(k)  p_12(k)  p_13(k)  E(value)
1         1000   0        3/4      1/8      1/8      1923
3         6000   1        0        0        0        4538

State 2
Decision  C_2k   p_20(k)  p_21(k)  p_22(k)  p_23(k)  E(value)
1         3000   0        0        1/2      1/2      1923
2         4000   0        1        0        0        -769
3         6000   1        0        0        0        -231

new policy: do nothing at states 0 and 1, overhaul at state 2, and replace at state 3 (see the check below)
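A quick numeric check of the E(value) columns, using the relative values from value determination; the helper name is mine:

    import numpy as np

    v = np.array([-4076.92, -2615.38, 2153.85, 0.0])  # v_i(R), v_3(R) = 0

    def e_value(i, cost, p):
        # C_ik + sum_j p_ij(k) v_j(R) - v_i(R)
        return cost + np.dot(p, v) - v[i]

    print(e_value(1, 1000, [0, 3/4, 1/8, 1/8]))  # about 1923 (do nothing)
    print(e_value(1, 6000, [1, 0, 0, 0]))        # about 4538 (replace)
    print(e_value(2, 3000, [0, 0, 1/2, 1/2]))    # about 1923 (do nothing)
    print(e_value(2, 4000, [0, 1, 0, 0]))        # about -769 (overhaul), the minimum
    print(e_value(2, 6000, [1, 0, 0, 0]))        # about -231 (replace)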
Example
Iteration 2: Value Determination
$$P = \begin{bmatrix} 0 & 7/8 & 1/16 & 1/16 \\ 0 & 3/4 & 1/8 & 1/8 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}$$

$$\begin{aligned} g(R) + v_0(R) &= 0 + \tfrac{7}{8} v_1(R) + \tfrac{1}{16} v_2(R) + \tfrac{1}{16} v_3(R) \\ g(R) + v_1(R) &= 1000 + \tfrac{3}{4} v_1(R) + \tfrac{1}{8} v_2(R) + \tfrac{1}{8} v_3(R) \\ g(R) + v_2(R) &= 4000 + v_1(R) \\ g(R) + v_3(R) &= 6000 + v_0(R) \end{aligned}$$

Setting v_3(R) = 0 and solving:

$$v_3(R) = 0, \quad g(R) = 1667, \quad v_0(R) = -4333, \quad v_1(R) = -3000, \quad v_2(R) = -667$$
It can be shown that there is no further improvement in policy, so doing nothing at states 0 and 1, overhauling at state 2, and replacing at state 3 is an optimal policy (a verification sketch follows)
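A sketch of that verification: applying the improvement test with the iteration-2 values shows the current action already attains the minimum at every state; the helper name is mine:

    import numpy as np

    v = np.array([-4333.33, -3000.0, -666.67, 0.0])  # iteration-2 values, v_3 = 0

    def test(i, cost, p):
        # C_ik + sum_j p_ij(k) v_j(R) - v_i(R)
        return cost + np.dot(p, v) - v[i]

    # State 1: "do nothing" stays best
    print(test(1, 1000, [0, 3/4, 1/8, 1/8]))  # about 1667 = g(R)
    print(test(1, 6000, [1, 0, 0, 0]))        # about 4667

    # State 2: "overhaul" stays best
    print(test(2, 3000, [0, 0, 1/2, 1/2]))    # about 3333
    print(test(2, 4000, [0, 1, 0, 0]))        # about 1667 = g(R)
    print(test(2, 6000, [1, 0, 0, 0]))        # about 2333

    # No action changes, so the algorithm stops: the policy is optimal.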