NUS-ISS Learning Day 2017 - How About a Game of Chess?


HOW ABOUT A NICE GAME OF CHESS?

Lee Chuk Munn, chukmunnlee@nus.edu.sg

http://bit.ly/learningday2017

What is this talk about?

An introduction to key ideas in reinforcement learning

Conceptual, fairly high level

Intuition about the maths without the equations (not quite: there are 4 equations)

Demos

What is Machine Learning?

“Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed”

Arthur Samuel

Wrote the first program that learned to play checkers (1959)

Used the minimax algorithm

Programmed vs Learnt

Traditional program: Machine + Program + Data → Answer

Machine Learning (training phase): Machine + Data + Answers → Program

Machine Learning (after training): Machine + Program + New Data → New Answer

How Do Machines Learn?

Slides adapted from https://www.slideshare.net/aaalm/2016-01-27-a-gentle-and-structured-introduction-to-machine-learning-v105-for-release-58329371

By generalising: Supervised Learning

Give samples

Give answers to the samples

Infer rules from the samples and the answers

By comparing: Unsupervised Learning

Give samples

Do not give answers

Use some metric to infer similarity by grouping the samples

By reward: Reinforcement Learning

Do not give samples

Do not give answers

Infer the rules from positive or negative feedback

Reinforcement Learning circa 1977

[Diagram: the Agent takes an Action on the Environment, then observes the new State and a Reward]
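A minimal sketch of this loop in Python, assuming a hypothetical Environment with reset()/step() methods (an interface in the style of OpenAI Gym, not something defined in the talk):

```python
import random

class Environment:
    """Toy 1-D corridor: step LEFT or RIGHT; reaching position 3 ends the game."""
    def reset(self):
        self.position = 0
        return self.position                    # initial state

    def step(self, action):
        self.position += 1 if action == "RIGHT" else -1
        done = self.position == 3               # reached the goal?
        reward = 100 if done else -1            # end reward vs living reward
        return self.position, reward, done      # observe: state, reward, done

env = Environment()
state, done = env.reset(), False
while not done:
    action = random.choice(["LEFT", "RIGHT"])   # a (deliberately dumb) policy
    state, reward, done = env.step(action)      # act, then observe
```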

Reinforcement Learning

State

A ‘snapshot’ of the environment at a point in time

What are the possible actions to take?

State

[Diagram: a game state in which LEFT and UP are the possible actions]

State Transition

[Diagram: taking an Action from a State leads to one of several possible next States, e.g. via LEFT or UP]
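One simple way to hold such a transition model in code: a dict mapping (state, action) to a list of (probability, next state) pairs. The states and probabilities below are invented for illustration:

```python
# T[(state, action)] -> [(probability, next_state), ...]
T = {
    ("s0", "LEFT"): [(0.8, "s1"), (0.1, "s2"), (0.1, "s3")],
    ("s0", "UP"):   [(0.8, "s2"), (0.2, "s0")],
}

for prob, next_state in T[("s0", "UP")]:
    print(f"UP from s0 reaches {next_state} with probability {prob}")
```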

Which of these 2 states is the better one to be in?

State

Features of this state: straight ahead, no reduction in speed

Features of this state: proximity to the power pill, and around the corner (an obstacle), but reduced speed

Utility

[Diagram: of the possible next States, the Action leads to the state with the best features]

Utility is how we calculate the ‘goodness’ of a state

Using the utility function we can express the agent’s preference

u(state0) > u(state1): the agent prefers state0 over state1

To Win

Maximize our utility
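A toy sketch of one way to score a state: a weighted sum of its features. The feature names and weights are invented for illustration, not taken from the talk:

```python
def utility(features, weights):
    """Weighted sum of feature values: higher means a 'better' state."""
    return sum(weights[name] * value for name, value in features.items())

weights = {"speed": 1.0, "near_power_pill": 5.0, "near_obstacle": -2.0}

state0 = {"speed": 1.0, "near_power_pill": 0.0, "near_obstacle": 0.0}
state1 = {"speed": 0.5, "near_power_pill": 1.0, "near_obstacle": 1.0}

# The agent prefers whichever state has the higher utility.
print(utility(state0, weights), utility(state1, weights))   # 1.0 vs 3.5
```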

Utility, Value and Rewards - how do they differ?

What Actions to Take?

Image from http://ai.berkeley.edu/home.html

Rewards

- Can be either positive or negative
- Given at the end
- Given at every step - the living reward

Prefer now to later

- Discounting - earlier rewards have higher utility than later rewards

Image from http://ai.berkeley.edu/home.html
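A small sketch of discounting, assuming the standard discounted return G = r0 + γ·r1 + γ²·r2 + …:

```python
def discounted_return(rewards, gamma=0.9):
    """Earlier rewards count in full; later rewards are scaled down."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# The same +10 reward is worth less the later it arrives.
print(discounted_return([10, 0, 0]))   # 10.0
print(discounted_return([0, 0, 10]))   # 8.100000000000001 (10 x 0.9^2)
```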

How will the Agent Behave?

Scenario #1

Living reward is -1

End game reward is 100

Scenario #2

Living reward is 1

End game reward is 100
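A quick sketch of why the sign of the living reward changes behaviour: the total (undiscounted) reward for ending the game after n steps in each scenario:

```python
def total_reward(steps, living_reward, end_reward=100):
    return steps * living_reward + end_reward

# Scenario #1 (living reward -1): shorter games score higher,
# so the agent hurries to finish.
print(total_reward(5, -1), total_reward(50, -1))    # 95 50

# Scenario #2 (living reward +1): longer games score higher,
# so the agent dawdles and puts off finishing.
print(total_reward(5, +1), total_reward(50, +1))    # 105 150
```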

Uncertainty

Controller problem - actions are noisy. For UP:

- 80% move in the correct direction
- 10% go left
- 10% go right

What is the Value?

[Diagram: UP moves as intended 80% of the time; the other 20% the agent slips sideways]
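A sketch of sampling one outcome under the 80/10/10 controller model above:

```python
import random

def noisy_outcome(intended):
    """Where the agent actually goes when it tries the intended direction."""
    r = random.random()
    if r < 0.8:
        return intended          # 80%: moves in the correct direction
    elif r < 0.9:
        return "slip LEFT"       # 10%: goes left instead
    else:
        return "slip RIGHT"      # 10%: goes right instead

print([noisy_outcome("UP") for _ in range(5)])
```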

Time to Changi Airport

Normal traffic = 30 mins

Probability = 60%

Heavy traffic = 45 mins

Probability = 40%

Time to Changi Airport

Expected time = (30 ✕ 0.6) + (45 ✕ 0.4) = 18 + 18 = 36 mins

Bellman Equation

Richard Bellman

The Bellman Equation

[Diagram: from state0, choosing an action gives the Q-state (state0, action); each transition (state0, action, state1) leads to a successor state1 and its Value]
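The diagram corresponds to the standard Bellman optimality equations (the form used in the Berkeley AI course cited on these slides), where T is the transition probability, R the reward and γ the discount factor:

```latex
\begin{align}
  V^*(s)    &= \max_a Q^*(s, a) \\
  Q^*(s, a) &= \sum_{s'} T(s, a, s')\,\bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr]
\end{align}
```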

Policy Extraction or How to Win

Step 1 - start by being optimal

Step 2 - keep being optimal
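A compact sketch of "being optimal": value iteration to compute V, then policy extraction to read off the best action in every state. The model below is a toy, with invented states, rewards and discount:

```python
GAMMA = 0.9
STATES = ["s0", "s1", "s2"]
ACTIONS = ["LEFT", "UP"]

# Toy transition model: T[(state, action)] -> [(probability, next_state), ...]
T = {
    ("s0", "LEFT"): [(0.8, "s1"), (0.2, "s0")],
    ("s0", "UP"):   [(0.8, "s2"), (0.2, "s0")],
    ("s1", "LEFT"): [(1.0, "s1")], ("s1", "UP"): [(1.0, "s1")],
    ("s2", "LEFT"): [(1.0, "s2")], ("s2", "UP"): [(1.0, "s2")],
}
R = lambda s, a, s2: 10 if (s2 == "s2" and s != "s2") else 0  # exit reward

def q_value(V, s, a):
    """Expected utility of taking action a in state s (one Bellman backup)."""
    return sum(p * (R(s, a, s2) + GAMMA * V[s2]) for p, s2 in T[(s, a)])

# Step 1 - value iteration: repeat the Bellman update until values settle.
V = {s: 0.0 for s in STATES}
for _ in range(100):
    V = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}

# Step 2 - policy extraction: in every state, pick the action with best Q.
policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
print(policy)    # {'s0': 'UP', ...}
```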

Demo

1. Prefer the closer exit (1), risking the cliff (-10)

2. Prefer the closer exit (1), avoiding the cliff (-10)

3. Prefer the distant exit (10), risking the cliff (-10)

4. Prefer the distant exit (10), avoiding the cliff (-10)

Model Free Learning

Trial and Error

Don’t know the transitions

Don’t know the rewards

Model Free Learning

Learn by trial and error

Eventually will approximate Bellman updates

Explore vs Exploit
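A sketch of the model-free idea in code: tabular Q-learning with an epsilon-greedy explore/exploit rule. The update below is the standard one; the hyperparameters and example experience are invented:

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1
ACTIONS = ["LEFT", "RIGHT"]
Q = defaultdict(float)                   # Q[(state, action)] -> estimated value

def choose_action(state):
    """Explore with probability epsilon, otherwise exploit the best known Q."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)                      # explore
    return max(ACTIONS, key=lambda a: Q[(state, a)])       # exploit

def q_update(state, action, reward, next_state):
    """One trial-and-error step: nudge Q towards the sampled Bellman target."""
    target = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (target - Q[(state, action)])

# One hypothetical experience: from s0, RIGHT landed in s1 with reward -1.
q_update("s0", "RIGHT", -1, "s1")
print(Q[("s0", "RIGHT")])    # -0.5: nudged towards the Bellman target
```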

Demo

Where to Learn?

Reinforcement Learning: An Introduction by Richard S Sutton and Andrew G Barto

Berkeley AI Course - http://ai.berkeley.edu/home.html

David Silver - http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html

Thank You. Have a great day

Maximize your utility
