17
Lecture 5: Windy Frozen Lake Nondeterministic world! Reinforcement Learning with TensorFlow&OpenAI Gym Sung Kim <[email protected]>

Lecture 5: Windy Frozen Lake Nondeterministic world!hunkim.github.io/ml/RL/rl05.pdf · Lecture 5: Windy Frozen Lake Nondeterministic world! Reinforcement Learning with TensorFlow&OpenAI

  • Upload
    buiminh

  • View
    222

  • Download
    0

Embed Size (px)

Citation preview

Lecture 5: Windy Frozen LakeNondeterministic world!

Reinforcement Learning with TensorFlow&OpenAI GymSung Kim <[email protected]>

S

Windy Frozen Lake

Deterministic VS Stochastic (nondeterministic)

• In deterministic models the output of the model is fully determined by the parameter values and the initial conditions initial conditions

• Stochastic models possess some inherent randomness. - The same set of parameter values and initial conditions will lead to an

ensemble of different outputs.

Deterministic

Stochastic (non-deterministic)

Stochastic (non-deterministic) worlds

• Unfortunately, our Q-learning (for deterministic worlds) does not work anymore

• Why not?

Our previous Q-learning does not work

Score over time: 0.0165

Why does not work in stochastic (non-deterministic) worlds?

a

s

Stochastic (non-deterministic) world

• Solution?- Listen to Q (s`) (just a little bit)

- Update Q(s) little bit (learning rate)

• Like our life mentors- Don’t just listen and follow one mentor

- Need to listen from many mentors

http://m.kauppalehti.fi/uutiset/your-career-needs-many-mentors--not-just-one/gp3Q4rTp

Stochastic (non-deterministic) world

a

s

Learning incrementally

• Learning rate, -

Q(s, a) r + �max

a0Q(s0, a0)

↵ = 0.1↵

Q(s, a) Q(s, a) + [r + �max

a0Q(s0, a0)]

Learning with learning rate

Q(s, a) (1� ↵)Q(s, a) + ↵[r + �max

a0Q(s0, a0)]

Q(s, a) r + �max

a0Q(s0, a0)

Learning with learning rate

Q(s, a) (1� ↵)Q(s, a) + ↵[r + �max

a0Q(s0, a0)]

Q(s, a) Q(s, a) + ↵[r + �max

a0Q(s0, a0)�Q(s, a)]

Q(s, a) r + �max

a0Q(s0, a0)

Q-learning algorithm

Q(s, a) (1� ↵)Q(s, a) + ↵[r + �max

a0Q(s0, a0)]

Convergence

Machine Learning, Tom Mitchell, 1997

ˆQ(s, a) (1� ↵) ˆQ(s, a) + ↵[r + �max

a0ˆQ(s0, a0)]

Next

Lab: Stochastic worlds