Stochastic Control and Games
Thor Stead
29 April 2020
What is Stochastic Control?
● Optimizing a sequence of decisions in a random process that evolves under uncertainty
How is it studied?
● Typically as a discrete-time, Markov or non-Markov decision process with a reward at each step or at the end
Applications:
● Games of chance, air traffic control, reinforcement learning
Markov Decision Process (MDP)
● An agent G occupies some state s and can take an action from a set A = (a₁, a₂, ..., aₙ) available at that state s.
○ Memoryless (Markov) property: P(s' | sₙ, aₙ) = P(s' | sₙ, aₙ, ..., s₁, a₁)
M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley and Sons, Inc., 2005.
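As a concrete illustration (not from the original slides), a small tabular MDP can be stored as plain Python dictionaries; the states, actions, transition probabilities, and rewards below are made-up values for the sketch:

# A minimal tabular MDP: states, actions, transition probabilities
# P(s' | s, a), and immediate rewards r(s, a). All values are illustrative.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# P[(s, a)] maps each successor state s' to its probability.
P = {
    ("s1", "a1"): {"s1": 0.3, "s2": 0.7},
    ("s1", "a2"): {"s1": 0.9, "s2": 0.1},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}

# r[(s, a)] is the immediate reward for taking action a in state s.
r = {
    ("s1", "a1"): 1.0, ("s1", "a2"): 0.0,
    ("s2", "a1"): 0.0, ("s2", "a2"): 2.0,
}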
Policies and Feedback Control
● To denote a strategy in our MDP, we introduce π, our policy
○ Set of actions we will take at each possible state s
● Feedback control:
○ Actions depend on the output of the current state
○ The value of a state depends on which states can be reached from it and their respective rewards
[Diagram: feedback loop: action → state transition → "value" of current state, which converges]
https://en.wikipedia.org/wiki/Control_theory#Open-loop_and_closed-loop_(feedback)_control
Optimizing our Policy:
● π = policy = (sₜ, aₜ) ∀ t ∈ [0, n]
● Define the final value W(s, π) = the expected sum of all discounted rewards
○ This is what we want to maximize
● Need to determine the set of optimal actions
Distinction between W(i,j) and r(i,j):
r(i,j) refers to the reward function at a single point
W(i,j) refers to the recursive formula:
W(i,j) = r(i,j) + γ ∑ P(i',j') · W(i',j')
and can thus be interpreted as a recursive sum of discounted rewards along the path from (i,j) outwards
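To make the r vs. W distinction concrete, here is a minimal sketch of one backup at a grid point (i, j), assuming a hypothetical transition table P[(i, j)] that maps successor cells to probabilities and a discount factor gamma (these names are mine, not the slides'):

# One Bellman backup at grid point (i, j): r[i][j] is only the immediate
# reward there; W folds in the expected discounted value of successors.
def backup(i, j, r, P, W, gamma=0.9):
    return r[i][j] + gamma * sum(p * W[ip][jp]
                                 for (ip, jp), p in P[(i, j)].items())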
The Bellman Equation
● We can solve the previous equation for A if we obtain W(i,j) for all points (vectors) within our sample space (here, we use ℝ² for simplicity)
● This equation is known as the Bellman equation, after Richard Bellman (1957), and is central to dynamic programming
Bellman also formulated a solution for W, as we will see...
Solving for W
● Bellman's value iteration algorithm: Wₙ₊₁ = 𝚽(Wₙ), where 𝚽 applies the max over all actions a to the RHS of the Bellman equation:
𝚽(W)(i,j) = maxₐ [ r(i,j) + γ ∑ P(i',j') · W(i',j') ]
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning, The MIT Press, 2018.
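A minimal sketch of value iteration, continuing the toy dictionary MDP from the earlier sketch (states, actions, P, r are assumed names); gamma and tol are illustrative choices:

# Value iteration: repeatedly apply the Bellman operator Phi, i.e. take the
# max over actions a of the RHS of the Bellman equation, until a fixed point.
def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-8):
    W = {s: 0.0 for s in states}          # initial guess W_0
    while True:
        W_new = {
            s: max(r[(s, a)] + gamma * sum(p * W[sp]
                                           for sp, p in P[(s, a)].items())
                   for a in actions)
            for s in states
        }
        if max(abs(W_new[s] - W[s]) for s in states) < tol:
            return W_new                  # numerically at the fixed point W = Phi(W)
        W = W_new

For the toy MDP above, value_iteration(states, actions, P, r) converges because each sweep is one application of 𝚽, which the next slide shows is a contraction.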
Why this works: Because the Bellman operator 𝚽 is a contraction mapping:
||𝚽(x) − 𝚽(y)|| ≤ 𝛂 ||x − y||, for some 0 ≤ 𝛂 < 1
● Banach fixed-point theorem: for any contraction mapping 𝚽, the sequence xₙ = 𝚽(xₙ₋₁) converges to the unique fixed point x* with 𝚽(x*) = x*, from any starting point x₀
○ This is the heart of value iteration in vector space
An adapted proof: https://people.eecs.berkeley.edu/~ananth/223Spr07/ee223_spr07_lec19.pdf
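As an informal numeric check (my illustration, reusing the toy MDP and gamma = 0.9 from the earlier sketches), the sup-norm gap between successive iterates should shrink by at least the factor 𝛂 = γ each step:

# Empirical check of the contraction: the sup-norm gap between successive
# value-iteration iterates shrinks by at least a factor of gamma each step.
gamma = 0.9
W = {s: 0.0 for s in states}
prev_gap = None
for _ in range(10):
    W_new = {
        s: max(r[(s, a)] + gamma * sum(p * W[sp]
                                       for sp, p in P[(s, a)].items())
               for a in actions)
        for s in states
    }
    gap = max(abs(W_new[s] - W[s]) for s in states)
    if prev_gap is not None:
        assert gap <= gamma * prev_gap + 1e-12
    prev_gap, W = gap, W_new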
Phases of this Project:
1: Card Game Proposal
Initially proposed an application to the card game of Bluff (BS)
This is in fact a non-Markovian process, so it requires a different framework, and coding the AI posed many hurdles unrelated to the math
2: One-player Optimization
Optimized a path for a particle moving on a grid under uncertainty
● Reward function based on grid location (x,y)
3: Competing AIs
Had two AIs play 'tag' on a torus-shaped grid, with varying degrees of uncertainty
● Reward function based on player-player distance
The Optimal Path
1. Define a reward function based on grid location
2. Start a particle at a random (x,y) on the grid, adding in stochasticity
a. For example, P(goes in intended direction) = 0.7
3. Iteratively solve the Bellman eq. to get the value W of each location
4. Simulate
5. Repeat (3) and (4) until (x,y) = (x*, y*), the location of maximum value W (a code sketch of steps 1 to 4 follows below)
[Figure: heatmap of W over the grid; yellow = high value, purple = low value]
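A hedged sketch of steps 1 to 4 in Python: the grid size, the reward grid (one rewarding corner), and how the leftover 0.3 slip probability is split among the other moves are all my assumptions; the slides only fix P(goes in intended direction) = 0.7:

import random

N = 8                       # assumed grid size
gamma, slip = 0.9, 0.7      # slide: P(goes in intended direction) = 0.7
moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]

# Step 1: reward based on grid location (made up: one rewarding corner).
def reward(i, j):
    return 10.0 if (i, j) == (N - 1, N - 1) else -1.0

def clamp(i, j):
    # keep the particle on the grid
    return (min(max(i, 0), N - 1), min(max(j, 0), N - 1))

def successors(i, j, a):
    # Intended move with prob 0.7; the other three moves share the rest.
    out = {}
    for m in moves:
        p = slip if m == a else (1 - slip) / (len(moves) - 1)
        s = clamp(i + m[0], j + m[1])
        out[s] = out.get(s, 0.0) + p
    return out

# Step 3: iteratively solve the Bellman equation for W at every location.
W = [[0.0] * N for _ in range(N)]
for _ in range(200):
    W = [[max(reward(i, j) + gamma * sum(p * W[ip][jp]
                                         for (ip, jp), p in successors(i, j, a).items())
              for a in moves)
          for j in range(N)]
         for i in range(N)]

# Steps 2 and 4: random start, then simulate motion under the slip model,
# greedily picking the action with the best expected W.
pos = (random.randrange(N), random.randrange(N))
for _ in range(50):
    best = max(moves, key=lambda a: sum(p * W[ip][jp]
                                        for (ip, jp), p in successors(*pos, a).items()))
    succ = successors(*pos, best)
    pos = random.choices(list(succ), weights=list(succ.values()))[0]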
Playing Tag?
1. Define a reward function based on the other player's location (a distance-reward sketch follows below)
2. Start 2 particles at random (x,y) on the grid, adding in stochasticity
3. Solve the Bellman eq. to get the value W of each location
4. Simulate player 1's turn
5. Solve the Bellman eq. using player 1's location to get W
6. Simulate player 2's turn
7. In this example, # of turns = 20 each
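The one new ingredient versus the previous slide is the reward: player-player distance on a wrap-around (torus) grid. A small sketch; the grid size and the sign convention (the chaser wants small distance, the runner wants large) are my assumptions:

N = 10  # assumed torus size

def torus_dist(p, q):
    # Manhattan distance with wrap-around on an N x N torus.
    dx = min(abs(p[0] - q[0]), N - abs(p[0] - q[0]))
    dy = min(abs(p[1] - q[1]), N - abs(p[1] - q[1]))
    return dx + dy

def reward_chaser(me, other):
    return -torus_dist(me, other)   # the chaser prefers to be close

def reward_runner(me, other):
    return torus_dist(me, other)    # the runner prefers to be far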
Going Forward
● Computing and solving the Bellman equation for an MDP is a fundamental tool of reinforcement learning and of optimization under randomness
● Want to extend the principles here to include and solve:
○ other Markovian games
○ non-Markovian games
○ games with varying win conditions, where there are many ways to attain the maximum reward
Thank you!
Credit to Patrick Flynn for helping to educate, develop, and revise versions of this work throughout the semester.