Stochastic Control and Games
Thor Stead
29 April 2020
What is Stochastic Control?
● Optimizing a sequence of decisions in a random process that evolves under uncertainty
How is it studied?
● Typically as a discrete-time, Markov or non-Markov decision process with a reward at each step or at the end
Applications:
● Games of chance, air traffic control, reinforcement learning
Markov Decision Process (MDP)
● An agent G occupies some state s and can take an action from a set A = (a₁, a₂, ..., aₙ) available at that state s.
○ Memoryless (Markov) property: P(s' | sₙ, aₙ) = P(s' | sₙ, aₙ, ..., s₁, a₁)
M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley and Sons, Inc., 2005.
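As a concrete illustration (not from the original slides), a small tabular MDP can be stored as plain Python dictionaries; the states, actions, transition probabilities, and rewards below are made-up values for the sketch:

# A minimal tabular MDP: states, actions, transition probabilities
# P(s' | s, a), and immediate rewards r(s, a). All values are illustrative.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# P[(s, a)] maps each successor state s' to its probability.
P = {
    ("s1", "a1"): {"s1": 0.3, "s2": 0.7},
    ("s1", "a2"): {"s1": 0.9, "s2": 0.1},
    ("s2", "a1"): {"s1": 0.5, "s2": 0.5},
    ("s2", "a2"): {"s2": 1.0},
}

# r[(s, a)] is the immediate reward for taking action a in state s.
r = {
    ("s1", "a1"): 1.0, ("s1", "a2"): 0.0,
    ("s2", "a1"): 0.0, ("s2", "a2"): 2.0,
}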
Policies and Feedback Control
● To denote a strategy in our MDP, we introduce π, our policy
○ Set of actions we will take at each possible state s
● Feedback control:
○ Actions depend on the output of the current state
○ The value of a state depends on which states can be reached from it and their respective rewards
[Diagram: feedback loop: action → state transition → "value" of current state, which converges]
https://en.wikipedia.org/wiki/Control_theory#Open-loop_and_closed-loop_(feedback)_control
Optimizing our Policy:
● π = policy = (sₜ, aₜ) ∀ t ∈ [0, n]
● Define the final value W(s, π) = the expected sum of all discounted rewards
○ This is what we want to maximize
● Need to determine the set of optimal actions
Distinction between W(i,j) and r(i,j):
r(i,j) refers to the reward function at a single point
W(i,j) refers to the recursive formula:
W(i,j) = r(i,j) + γ ∑ P(i',j') · W(i',j')
and can thus be interpreted as a recursive sum of discounted rewards along the path from (i,j) outwards
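To make the r vs. W distinction concrete, here is a minimal sketch of one backup at a grid point (i, j), assuming a hypothetical transition table P[(i, j)] that maps successor cells to probabilities and a discount factor gamma (these names are mine, not the slides'):

# One Bellman backup at grid point (i, j): r[i][j] is only the immediate
# reward there; W folds in the expected discounted value of successors.
def backup(i, j, r, P, W, gamma=0.9):
    return r[i][j] + gamma * sum(p * W[ip][jp]
                                 for (ip, jp), p in P[(i, j)].items())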
The Bellman Equation
● We can solve the previous equation for A if we obtain W(i,j) for all points (vectors) within our sample space (here, we use ℝ² for simplicity)
● This equation is known as the Bellman equation, after Richard Bellman (1957), and is central to dynamic programming
Bellman also formulated a solution for W, as we will see...
Solving for W
● Bellman's value iteration algorithm: Wₙ₊₁ = 𝚽(Wₙ), where 𝚽 applies the max over all actions a to the RHS of the Bellman equation:
𝚽(W)(i,j) = maxₐ [ r(i,j) + γ ∑ P(i',j') · W(i',j') ]
Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar, Foundations of Machine Learning, The MIT Press, 2018.
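A minimal sketch of value iteration, continuing the toy dictionary MDP from the earlier sketch (states, actions, P, r are assumed names); gamma and tol are illustrative choices:

# Value iteration: repeatedly apply the Bellman operator Phi, i.e. take the
# max over actions a of the RHS of the Bellman equation, until a fixed point.
def value_iteration(states, actions, P, r, gamma=0.9, tol=1e-8):
    W = {s: 0.0 for s in states}          # initial guess W_0
    while True:
        W_new = {
            s: max(r[(s, a)] + gamma * sum(p * W[sp]
                                           for sp, p in P[(s, a)].items())
                   for a in actions)
            for s in states
        }
        if max(abs(W_new[s] - W[s]) for s in states) < tol:
            return W_new                  # numerically at the fixed point W = Phi(W)
        W = W_new

For the toy MDP above, value_iteration(states, actions, P, r) converges because each sweep is one application of 𝚽, which the next slide shows is a contraction.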
Why this works: Because the Bellman operator 𝚽 is a contraction mapping:
||𝚽(x) − 𝚽(y)|| ≤ 𝛂 ||x − y||, for some 0 ≤ 𝛂 < 1
● Banach fixed-point theorem: for any contraction mapping 𝚽, the sequence xₙ = 𝚽(xₙ₋₁) converges to the unique fixed point x* with 𝚽(x*) = x*, from any starting point x₀
○ This is the heart of value iteration in vector space
An adapted proof: https://people.eecs.berkeley.edu/~ananth/223Spr07/ee223_spr07_lec19.pdf
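As an informal numeric check (my illustration, reusing the toy MDP and gamma = 0.9 from the earlier sketches), the sup-norm gap between successive iterates should shrink by at least the factor 𝛂 = γ each step:

# Empirical check of the contraction: the sup-norm gap between successive
# value-iteration iterates shrinks by at least a factor of gamma each step.
gamma = 0.9
W = {s: 0.0 for s in states}
prev_gap = None
for _ in range(10):
    W_new = {
        s: max(r[(s, a)] + gamma * sum(p * W[sp]
                                       for sp, p in P[(s, a)].items())
               for a in actions)
        for s in states
    }
    gap = max(abs(W_new[s] - W[s]) for s in states)
    if prev_gap is not None:
        assert gap <= gamma * prev_gap + 1e-12
    prev_gap, W = gap, W_new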
Phases of this Project:
1: Card Game Proposal
Initially proposed an application to the card game of Bluff (BS)
This is in fact a non-Markovian process, so it requires a different framework, and coding the AI posed many hurdles unrelated to the math
2: One-player Optimization
Optimized a path for a particle moving on a grid under uncertainty
● Reward function based on grid location (x,y)
3: Competing AIs
Had two AIs play 'tag' on a torus-shaped grid, with varying degrees of uncertainty
● Reward function based on player-player distance
The Optimal Path
1. Define a reward function based on grid location
2. Start a particle at a random (x,y) on the grid, adding in stochasticity
a. For example, P(goes in intended direction) = 0.7
3. Iteratively solve the Bellman eq. to get the value W of each location
4. Simulate
5. Repeat (3) and (4) until (x,y) = (x*, y*), the location of maximum value W (a code sketch of steps 1 to 4 follows below)
[Figure: heatmap of W over the grid; yellow = high value, purple = low value]
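A hedged sketch of steps 1 to 4 in Python: the grid size, the reward grid (one rewarding corner), and how the leftover 0.3 slip probability is split among the other moves are all my assumptions; the slides only fix P(goes in intended direction) = 0.7:

import random

N = 8                       # assumed grid size
gamma, slip = 0.9, 0.7      # slide: P(goes in intended direction) = 0.7
moves = [(0, 1), (0, -1), (1, 0), (-1, 0)]

# Step 1: reward based on grid location (made up: one rewarding corner).
def reward(i, j):
    return 10.0 if (i, j) == (N - 1, N - 1) else -1.0

def clamp(i, j):
    # keep the particle on the grid
    return (min(max(i, 0), N - 1), min(max(j, 0), N - 1))

def successors(i, j, a):
    # Intended move with prob 0.7; the other three moves share the rest.
    out = {}
    for m in moves:
        p = slip if m == a else (1 - slip) / (len(moves) - 1)
        s = clamp(i + m[0], j + m[1])
        out[s] = out.get(s, 0.0) + p
    return out

# Step 3: iteratively solve the Bellman equation for W at every location.
W = [[0.0] * N for _ in range(N)]
for _ in range(200):
    W = [[max(reward(i, j) + gamma * sum(p * W[ip][jp]
                                         for (ip, jp), p in successors(i, j, a).items())
              for a in moves)
          for j in range(N)]
         for i in range(N)]

# Steps 2 and 4: random start, then simulate motion under the slip model,
# greedily picking the action with the best expected W.
pos = (random.randrange(N), random.randrange(N))
for _ in range(50):
    best = max(moves, key=lambda a: sum(p * W[ip][jp]
                                        for (ip, jp), p in successors(*pos, a).items()))
    succ = successors(*pos, best)
    pos = random.choices(list(succ), weights=list(succ.values()))[0]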
Playing Tag?
1. Define a reward function based on the other player's location (a distance-reward sketch follows below)
2. Start 2 particles at random (x,y) on the grid, adding in stochasticity
3. Solve the Bellman eq. to get the value W of each location
4. Simulate player 1's turn
5. Solve the Bellman eq. using player 1's location to get W
6. Simulate player 2's turn
7. In this example, # of turns = 20 each
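The one new ingredient versus the previous slide is the reward: player-player distance on a wrap-around (torus) grid. A small sketch; the grid size and the sign convention (the chaser wants small distance, the runner wants large) are my assumptions:

N = 10  # assumed torus size

def torus_dist(p, q):
    # Manhattan distance with wrap-around on an N x N torus.
    dx = min(abs(p[0] - q[0]), N - abs(p[0] - q[0]))
    dy = min(abs(p[1] - q[1]), N - abs(p[1] - q[1]))
    return dx + dy

def reward_chaser(me, other):
    return -torus_dist(me, other)   # the chaser prefers to be close

def reward_runner(me, other):
    return torus_dist(me, other)    # the runner prefers to be far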
Going Forward
● Computing and solving the Bellman equation for an MDP is a fundamental tool of reinforcement learning and of optimization under randomness
● Want to extend the principles here to include and solve:
○ other Markovian games
○ non-Markovian games
○ games with varying win conditions, where there are many ways to attain the maximum reward
Thank you!
Credit to Patrick Flynn for helping to educate, develop, and revise versions of this work throughout the semester.