
Mastering the game of go with deep neural networks and tree search


Page 1: Mastering the game of go with deep neural networks and tree search


Mastering the game of Go with deep neural networks and tree search

Speaker: San-Feng Chang

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Nature, 529(7587):484–489, 2016.

Page 2: Mastering the game of go with deep neural networks and tree search


Outline

• AI in Game Playing
• Previous Work of Go Research
• Architecture of AlphaGo
• AlphaGo’s methods
• The playing strength of AlphaGo
• Conclusion

Page 3: Mastering the game of go with deep neural networks and tree search


AI in Game Playing (1/3)

• Game playing is a long-standing benchmark problem for measuring the performance of an AI.

• One classification of outcomes for an AI test is:
  – Optimal: it is not possible to perform better
  – Strong super-human: performs better than all humans
  – Super-human: performs better than most humans
  – Sub-human: performs worse than most humans

Page 4: Mastering the game of go with deep neural networks and tree search


AI in Game Playing (2/3)

Game    Players                           Branching factor (b)   Depth (d)   Complexity (b^d)
Chess   Deep Blue vs. Kasparov (1997)     35                     80          35^80 ≈ 10^123
Go      AlphaGo vs. Lee Sedol (2016)      250                    150         250^150 ≈ 10^360

Evolution of game-tree search:
Brute force → Minimax & alpha-beta pruning → MCTS → AlphaGo’s method

Page 5: Mastering the game of go with deep neural networks and tree search


AI in Game Playing (3/3)

• Minimax & alpha-beta pruning: even with pruning, the search complexity of Go is still far too high.

https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/AB_pruning.svg/1280px-AB_pruning.svg.png?1458451165542
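For reference, here is a minimal sketch of minimax with alpha-beta pruning, assuming a hypothetical game-state API with `children()`, `is_terminal()`, and `evaluate()`; it illustrates the classic algorithm, not AlphaGo’s search:

```python
def alphabeta(state, depth, alpha=float("-inf"), beta=float("inf"), maximizing=True):
    """Minimax with alpha-beta pruning; skips branches that cannot change the result."""
    if depth == 0 or state.is_terminal():
        return state.evaluate()                  # heuristic value of the position
    if maximizing:
        value = float("-inf")
        for child in state.children():
            value = max(value, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:                    # beta cutoff: opponent avoids this line
                break
        return value
    else:
        value = float("inf")
        for child in state.children():
            value = min(value, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:                    # alpha cutoff
                break
        return value
```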

Page 6: Mastering the game of go with deep neural networks and tree search


Previous Work of Go Research (1/4)

• Monte Carlo rollouts search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p.

• Monte Carlo tree search (MCTS) uses Monte Carlo rollouts to estimate the value of each state in a search tree.
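To make the rollout idea concrete, here is a minimal sketch of estimating a state’s value from Monte Carlo rollouts; the `state` API (`is_terminal()`, `play()`, `winner()`, `to_move`) is a hypothetical stand-in:

```python
import random

def rollout_value(state, policy, n_rollouts=100):
    """Estimate a state's value by sampling complete games from a fast policy p."""
    wins = 0
    for _ in range(n_rollouts):
        s = state
        while not s.is_terminal():        # search to maximum depth, no branching
            s = s.play(policy(s))         # both players sample actions from p
        wins += (s.winner() == state.to_move)
    return wins / n_rollouts              # empirical winning rate
```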

Page 7: Mastering the game of go with deep neural networks and tree search


Previous Work of Go Research (2/4)

• Monte Carlo Tree Search:

[Figure: a small two-player game tree labelled with wins/visits counts (root 2/3; children 1/1 and 1/2; leaves 1/1 and 0/1), with levels alternating between Player 1 and Player 2. The left tree shows the Selection phase (descending from the root, breaking ties randomly); the right tree shows the Expansion phase, which adds a fresh 0/0 node.]

Page 8: Mastering the game of go with deep neural networks and tree search


Previous Work of Go Research (3/4)

• Monte Carlo Tree Search:

[Figure: the same tree during the Simulation phase (a rollout played from the newly expanded 0/0 node) and the Back-Propagation phase, which updates the wins/visits counts along the visited path (root 2/3 → 3/4; the new node 0/0 → 1/1), again alternating Player 1 and Player 2 levels.]
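Putting the four phases together, a compact MCTS sketch is given below; it assumes a hypothetical game-state API (`moves()`, `play()`, `is_terminal()`, `winner()`, `to_move`, `last_move`) and uses the standard UCT rule for selection, which is one concrete way to implement the selection step pictured above:

```python
import math, random

class Node:
    """One tree node carrying the wins/visits statistics shown in the diagram."""
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.wins, self.visits = [], 0, 0

def mcts(root_state, n_simulations=1000, c=1.4):
    root = Node(root_state)
    for _ in range(n_simulations):
        node = root
        # 1. Selection: descend while the node is fully expanded, using UCT
        while node.children and len(node.children) == len(node.state.moves()):
            node = max(node.children,
                       key=lambda ch: ch.wins / ch.visits
                       + c * math.sqrt(math.log(node.visits) / ch.visits))
        # 2. Expansion: add one untried move as a fresh 0/0 child
        tried = [ch.state.last_move for ch in node.children]
        untried = [m for m in node.state.moves() if m not in tried]
        if untried:
            node.children.append(Node(node.state.play(random.choice(untried)), node))
            node = node.children[-1]
        # 3. Simulation: play a random rollout to the end of the game
        s = node.state
        while not s.is_terminal():
            s = s.play(random.choice(s.moves()))
        winner = s.winner()
        # 4. Back-propagation: update wins/visits along the visited path;
        #    each node's stats are from the perspective of the player who moved into it
        while node is not None:
            node.visits += 1
            node.wins += (winner != node.state.to_move)
            node = node.parent
    return max(root.children, key=lambda ch: ch.visits).state.last_move
```

Selection by UCT is one standard choice; AlphaGo replaces both the selection priors and the rollout policy with learned networks, as described next.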

Page 9: Mastering the game of go with deep neural networks and tree search


Previous Work of Go Research (4/4)

• The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves.

• However, prior work has been limited to shallow policies or value functions based on a linear combination of input features.

Page 10: Mastering the game of go with deep neural networks and tree search


Architecture of AlphaGo

Neural Network Training Pipeline

• s: board position; a: a legal move
• p(a|s): probability distribution over legal moves; v(s): scalar value of a position
• Two “brains”: a policy network p(a|s) and a value network v(s)
• Human expert dataset: ~160,000 games from the KGS server (29.4 million positions)

Page 11: Mastering the game of go with deep neural networks and tree search


Convolutional Neural Network (1/2)

[Figure: a regular 3-layer neural network vs. a convolutional neural network]

• Input volume of size W1 × H1 × D1
• Requires four hyperparameters:
  1. Number of filters K (output depth)
  2. Spatial extent F (kernel size)
  3. Stride S
  4. Amount of zero padding P
• Output volume of size W2 × H2 × D2, where:
  W2 = (W1 − F + 2P)/S + 1
  H2 = (H1 − F + 2P)/S + 1
  D2 = K
• Parameter sharing: total weights = (F × F × D1) × K

http://cs231n.github.io/convolutional-networks/

Page 12: Mastering the game of go with deep neural networks and tree search


Convolutional Neural Network (2/2)

• Number of filters K: 2
• Spatial extent F: 3 × 3
• Stride S: 2
• Zero padding P: 1

http://cs231n.github.io/convolutional-networks/
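Plugging these hyperparameters into the output-size formula from the previous slide (the cs231n animation these numbers come from uses a 5 × 5 × 3 input; that input size is an assumption here):

```python
def conv_output_size(w1, h1, d1, k, f, s, p):
    """Output volume W2 x H2 x D2 and weight count for one conv layer."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    weights = (f * f * d1) * k   # parameter sharing: one F x F x D1 filter per output channel
    return (w2, h2, k), weights

# K=2, F=3, S=2, P=1 on an assumed 5 x 5 x 3 input -> 3 x 3 x 2 output, 54 weights
print(conv_output_size(5, 5, 3, k=2, f=3, s=2, p=1))   # ((3, 3, 2), 54)
```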

Page 13: Mastering the game of go with deep neural networks and tree search


AlphaGo’s methods – Trained by Human Expert (1/6)

• Rollout policy pπ:
  – Takes about 2 μs to select an action, but reaches only 24.2% accuracy in predicting expert moves
  – A linear softmax of small pattern features, with weights π

[Figure: a three-input softmax unit computing $n_{1,\text{out}} = \frac{e^{n_{1,\text{in}}}}{e^{n_{1,\text{in}}} + e^{n_{2,\text{in}}} + e^{n_{3,\text{in}}}}$]

https://qph.fs.quoracdn.net/main-qimg-9e2d012ef7cb8b29d2bed14d2975c986
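A minimal sketch of such a linear softmax policy; the feature matrix and weights below are made-up placeholders, not AlphaGo’s actual pattern features:

```python
import numpy as np

def rollout_policy(features, weights):
    """Linear softmax over pattern features: features is an
    (n_actions x n_features) 0/1 matrix, weights is pi."""
    logits = features @ weights              # one linear score per legal action
    exp = np.exp(logits - logits.max())      # numerically stabilized softmax
    return exp / exp.sum()                   # probability of each action

# e.g. 3 candidate moves, 4 binary pattern features each (made-up numbers)
p = rollout_policy(np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]]),
                   np.array([0.5, -0.2, 0.3, 0.1]))
print(p, p.sum())   # probabilities summing to 1
```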

Page 14: Mastering the game of go with deep neural networks and tree search


AlphaGo’s methods – Trained by Human Expert (2/6)

• SL policy pσ:
  – Takes about 3 ms to select an action, with 57.0% accuracy in predicting expert moves
  – A 13-layer convolutional neural network, with weights σ

Input: 19 × 19 board, 48 feature planes
Layer 1: Conv + ReLU, kernel size 5 × 5
Layers 2–12: Conv + ReLU, kernel size 3 × 3
Layer 13: kernel size 1 × 1, 1 filter, softmax over board positions
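A hedged sketch of this 13-layer network in PyTorch; the filter count of 192 and the padding choices that preserve the 19 × 19 board are assumptions consistent with the paper, not the authors’ code:

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Sketch of the 13-layer SL policy network (192 filters assumed)."""
    def __init__(self, planes=48, filters=192):
        super().__init__()
        layers = [nn.Conv2d(planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):  # layers 2-12: 3x3 conv + ReLU
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1)]  # layer 13: 1x1, 1 filter
        self.body = nn.Sequential(*layers)

    def forward(self, x):                      # x: (batch, 48, 19, 19)
        logits = self.body(x).flatten(1)       # (batch, 361)
        return torch.softmax(logits, dim=1)    # distribution over board points

p_sigma = SLPolicyNet()
probs = p_sigma(torch.zeros(1, 48, 19, 19))
print(probs.shape, probs.sum().item())         # torch.Size([1, 361]) ~1.0
```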

Page 15: Mastering the game of go with deep neural networks and tree search


AlphaGo’s methods – Reinforcement Learning pρ (3/6)

• Initialize the RL policy network weights from the SL policy: ρ = ρ- = σ
• Play games between the current policy pρ and an opponent pρ- sampled from a pool of previous iterations
• Play each game to the end and observe the reward r
• Update ρ with the policy gradient method
• Add the current pρ to the opponent pool
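A minimal sketch of one policy-gradient (REINFORCE) iteration of this loop; `play_game` and the opponent-pool handling are hypothetical helpers, not AlphaGo’s training code:

```python
import copy, random
import torch

def rl_iteration(policy, opponent_pool, play_game, optimizer):
    """One self-play policy-gradient step. `play_game` is assumed to return
    the current policy's (move-probabilities, action) pairs and outcome z."""
    opponent = random.choice(opponent_pool)         # sample p_rho^- from the pool
    moves, z = play_game(policy, opponent)          # z = +1 for a win, -1 for a loss
    loss = torch.zeros(())
    for probs, action in moves:                     # the current policy's moves only
        loss = loss - torch.log(probs[0, action]) * z   # REINFORCE: grad log p * reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Periodically snapshot the current policy into the opponent pool
# (the paper does this every 500 iterations):
# opponent_pool.append(copy.deepcopy(policy))
```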

Page 16: Mastering the game of go with deep neural networks and tree search


AlphaGo’s methods – Value Network vθ (4/6)

• Supervised learning (regression):
  – Used to estimate the winning rate of the current position
  – A 15-layer CNN:
    Input: 19 × 19 board, 48 planes + 1 plane for the current colour
    Layers 1–13: the same as the RL policy network
    Layer 14: fully connected, 256 ReLU units
    Layer 15: fully connected, 1 tanh unit
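A sketch of this 15-layer value network under the same assumptions as the policy-network sketch above (192 filters, board-preserving padding), following the slide’s simplification that layers 1–13 mirror the policy network:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Sketch of the 15-layer value network v_theta."""
    def __init__(self, planes=49, filters=192):        # 48 planes + 1 current-colour plane
        super().__init__()
        layers = [nn.Conv2d(planes, filters, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                            # 3x3 conv layers, as in the policy net
            layers += [nn.Conv2d(filters, filters, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(filters, 1, kernel_size=1), nn.ReLU()]   # layer 13: 1x1 conv
        self.body = nn.Sequential(*layers)
        self.fc1 = nn.Linear(19 * 19, 256)             # layer 14: fully connected, 256 ReLU
        self.fc2 = nn.Linear(256, 1)                   # layer 15: fully connected, 1 tanh

    def forward(self, x):                              # x: (batch, 49, 19, 19)
        h = torch.relu(self.fc1(self.body(x).flatten(1)))
        return torch.tanh(self.fc2(h))                 # v(s) in [-1, 1]
```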

Page 17: Mastering the game of go with deep neural networks and tree search


AlphaGo’s methods – Value Network vθ (5/6)

• Randomly sample an integer U in 1 ~ 450:
  – t = 1 ~ U−1: moves played by the SL policy network pσ
  – t = U: a uniformly random action
  – t = U+1 ~ end of game: moves played by the RL policy network pρ
• Reward: z_t = ± r(s_T), the final game outcome from the current player’s perspective
• Only a single training example (s_{U+1}, z_{U+1}) is added to the data set from each game, since successive positions within a game are strongly correlated.
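The recipe above, as a sketch; `new_game`, `p_sigma`, `p_rho`, and `random_move` are hypothetical helpers, not AlphaGo’s interfaces:

```python
import random

def value_training_example(new_game, p_sigma, p_rho, random_move):
    """Generate one (s_{U+1}, z_{U+1}) training pair from a single game."""
    game, U = new_game(), random.randint(1, 450)
    for t in range(1, U):                     # t = 1 .. U-1: SL policy moves
        game.play(p_sigma(game.state()))
    game.play(random_move(game.state()))      # t = U: uniformly random action
    s_train = game.state()                    # s_{U+1}, the position to train on
    while not game.over():                    # t = U+1 .. T: RL policy self-play
        game.play(p_rho(game.state()))
    z = game.outcome_for(s_train)             # z_{U+1} = +/- r(s_T)
    return s_train, z
```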

Page 18: Mastering the game of go with deep neural networks and tree search


AlphaGo’s methods – Searching (6/6)

• Q: action value, the accumulated winning scores (exploitation)
• u(P): an upper-confidence-bound term that trades off exploration vs. exploitation
• P: prior probability, computed with pσ (the SL policy performed better here than the RL policy)

More: see the searching formulas in Formula (2/2).
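A sketch of the selection rule, a_t = argmax_a (Q(s,a) + u(s,a)); the `node.edges` structure is assumed, and the exploration constant follows the paper’s reported c_puct = 5:

```python
import math

def select_action(node, c_puct=5.0):
    """One PUCT selection step over a node whose `edges` dict maps each
    action to an edge with fields Q, N, P (assumed structure)."""
    sqrt_total = math.sqrt(sum(e.N for e in node.edges.values()))
    def score(e):
        u = c_puct * e.P * sqrt_total / (1 + e.N)   # prior-weighted exploration bonus
        return e.Q + u
    return max(node.edges, key=lambda a: score(node.edges[a]))
```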

Page 19: Mastering the game of go with deep neural networks and tree search


The playing strength of AlphaGo

Page 20: Mastering the game of go with deep neural networks and tree search


Conclusion

• Reaching a milestone is the beginning of the next milestone.

• Stay hungry, stay foolish!

Page 21: Mastering the game of go with deep neural networks and tree search


References (1/2)

• Nature:
  – Mastering the game of Go with deep neural networks and tree search
• Mark Chang:
  – http://www.slideshare.net/ckmarkohchang/alphago-in-depth
• CNN:
  – http://cs231n.github.io/convolutional-networks/

Page 23: Mastering the game of go with deep neural networks and tree search


End

Thank You!

Page 24: Mastering the game of go with deep neural networks and tree search


Formula (1/2)

• Policy Network: classification (supervised learning)

$$\Delta\sigma \propto \frac{1}{m}\sum_{k=1}^{m}\frac{\partial \log p_\sigma(a^k \mid s^k)}{\partial \sigma}$$

• Policy Network: reinforcement learning

$$\Delta\rho \propto \frac{1}{n}\sum_{i=1}^{n}\sum_{t=1}^{T^i}\frac{\partial \log p_\rho(a_t^i \mid s_t^i)}{\partial \rho}\,\bigl(z_t^i - v(s_t^i)\bigr)$$

• Value Network: regression

$$\Delta\theta \propto \frac{1}{m}\sum_{k=1}^{m}\bigl(z^k - v_\theta(s^k)\bigr)\,\frac{\partial v_\theta(s^k)}{\partial \theta}$$

Page 25: Mastering the game of go with deep neural networks and tree search


Formula (2/2)

• Searching:

$$a_t = \operatorname*{argmax}_{a}\bigl(Q(s_t, a) + u(s_t, a)\bigr)$$

$$u(s,a) = c_{\text{puct}}\,P(s,a)\,\frac{\sqrt{\sum_b N(s,b)}}{1 + N(s,a)} \;\propto\; \frac{P(s,a)}{1 + N(s,a)}$$

$$N(s,a) = \sum_{i=1}^{n}\mathbf{1}(s,a,i)$$

$$Q(s,a) = \frac{1}{N(s,a)}\sum_{i=1}^{n}\mathbf{1}(s,a,i)\,V\bigl(s_L^i\bigr)$$

$$V(s_L) = (1-\lambda)\,v_\theta(s_L) + \lambda\,z_L$$

where $\mathbf{1}(s,a,i)$ indicates whether edge $(s,a)$ was traversed during the $i$th simulation, and $s_L^i$ is the leaf node reached in the $i$th simulation.

Back
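For completeness, the mixed leaf evaluation as code (the paper reports λ = 0.5 performing best):

```python
def leaf_value(v_theta_sL, z_L, lam=0.5):
    """Mixed leaf evaluation V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    return (1 - lam) * v_theta_sL + lam * z_L
```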

Page 26: Mastering the game of go with deep neural networks and tree search


How AlphaGo selected its move

Page 27: Mastering the game of go with deep neural networks and tree search


The playing strength of AlphaGo (Bonus 1)

Page 28: Mastering the game of go with deep neural networks and tree search


The playing strength of AlphaGo (Bonus 2)