Randomized Strategies and Temporal Difference Learning in Poker
Michael Oder
April 4, 2002
Advisor: Dr. David Mutchler
Overview
• Perfect vs. Imperfect Information Games
• Poker as Imperfect Information Game
• Randomization
• Neural Nets and Temporal Difference
• Experiments
• Conclusions
• Ideas for Further Study
Perfect vs. Imperfect Information
• World-class AI agents exist for many popular games
– Checkers
– Chess
– Othello
• These are games of perfect information
• All relevant information is available to each player
• Good understanding of imperfect information games would be a breakthrough
Poker as an Imperfect Information Game
• Other players’ hands affect how much will be won or lost; however, no player can see this vital information.
• There are non-deterministic aspects as well
Enter Loki
• One of the most successful computer poker players created
• Produced at the University of Alberta by Jonathan Schaeffer et al.
• Employs a randomized strategy
– Makes the player less predictable
– Allows for bluffing
Probability Triples
• At any point in a poker game, player has 3 choices
– Bet/Raise
– Check/Call
– Fold
• Assign a probability to each possible move
• Single move is now a probability triple
• Problem: Associate payoff with hand, betting history, and triple (move selected)
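To make this concrete, here is a minimal Python sketch of sampling a move from a probability triple. The probability values are hypothetical placeholders, not Loki's actual numbers.

import random

# A probability triple assigns a probability to each of the three
# possible moves. These values are hypothetical placeholders.
triple = {"bet_raise": 0.5, "check_call": 0.3, "fold": 0.2}

def choose_move(triple):
    """Sample a single move according to the triple's probabilities."""
    moves = list(triple)
    weights = list(triple.values())
    return random.choices(moves, weights=weights)[0]

print(choose_move(triple))  # e.g. "check_call"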
Neural Nets
• One promising way to learn such functions is with a neural network
• Neural Networks consist of connected neurons
• Each connection has a weight
• Input game state, output a prediction of payoff
• Train by modifying weights
• Weights are modified by an amount proportional to learning rate
Neural Net Example
[Diagram: a neural network whose inputs are the hand, the betting history, and the probability triple, and whose four outputs are P(2), P(1), P(-1), and P(-2), the probabilities of each possible payoff]
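To make the architecture concrete, here is a minimal sketch of such a network in Python with NumPy, assuming a single hidden layer and hypothetical layer sizes; the slides do not specify the original architecture.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the input features encode the hand, the betting
# history, and the chosen probability triple; the four outputs estimate
# the probability of each possible payoff: P(2), P(1), P(-1), P(-2).
N_IN, N_HIDDEN, N_OUT = 8, 6, 4

# Each connection has a weight; training modifies these weights.
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_IN))
W2 = rng.normal(scale=0.1, size=(N_OUT, N_HIDDEN))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(x):
    """Forward pass: encoded game state in, payoff estimates out."""
    hidden = sigmoid(W1 @ x)
    return sigmoid(W2 @ hidden)

x = rng.random(N_IN)   # a stand-in encoded game state
print(predict(x))      # four values, one per possible payoff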
Temporal Difference
• Most common way to train a multi-layer neural net is with backpropagation
• Relies on simple input-output pairs
• Problem: need to know the correct answer right away in order to train the net
• Solution: Temporal Difference (TD) learning
• TD(λ) algorithm developed by Richard Sutton
Temporal Difference (cont’d)
• Trains responses over the course of a game, over many time steps
• Tries to make each prediction closer to the prediction in the next time step
[Diagram: a chain of predictions P1 → P2 → P3 → P4 → P5 across time steps, each trained toward the next]
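The idea can be sketched as a generic TD(λ) update for a linear predictor, after Sutton's algorithm; this illustrates the technique, not the group's actual code. Each step nudges the prediction at time t toward the prediction at time t+1, with an eligibility trace spreading credit back to earlier steps.

import numpy as np

alpha = 0.01              # learning rate
lam = 0.9                 # trace decay (the lambda in TD(lambda))
w = np.zeros(8)           # weights of a linear predictor (size is hypothetical)
trace = np.zeros_like(w)  # eligibility trace: decayed sum of past gradients

def predict(x):
    return float(w @ x)

def td_step(x_t, x_next):
    """Move the prediction at time t toward the prediction at t+1.

    At the end of a game, predict(x_next) would be replaced by the
    actual payoff, grounding the whole chain of predictions.
    """
    global trace
    error = predict(x_next) - predict(x_t)
    trace = lam * trace + x_t           # gradient of w @ x_t is x_t
    w[:] += alpha * error * trace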
University of Mauritius Group
• TD poker program produced by a group supervised by Dr. Mutchler
• Provides environment for playing poker variants and testing agents
Simple Poker Game
• Experiments were conducted on an extremely simple variant of poker
• Deck consists of 2, 3, and 4 of Hearts
• Each player gets one card
• One round of betting
• Player with highest card wins the pot
• Goal: Get the net to produce accurate payoff values as outputs
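The variant is small enough to simulate directly. A minimal sketch follows, with the betting round omitted for brevity.

import random

DECK = [2, 3, 4]  # the 2, 3, and 4 of Hearts

def play_hand():
    """Deal one card to each of two players; the higher card wins the pot."""
    card_a, card_b = random.sample(DECK, 2)
    return "A" if card_a > card_b else "B"

wins = sum(play_hand() == "A" for _ in range(10_000))
print(f"Player A wins {wins / 10_000:.1%} of hands")  # close to 50%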
Early Results
• Started by pitting a neural net player against a random one
• Results were inconsistent
• Problem: inappropriate value for learning rate
• Too low: outputs never approach true payoffs
• Too high: outputs fluctuate between too high and too low
Experiment Set I
• Conjecture: learning should occur with a very small learning rate over many games
• Learning rate = 0.01
• Train for 50,000 games
• Only set to train when the card is a 4
• First player always bets, second player tested
• Two choices (expected payoffs derived in the sketch below)
– call 80%, fold 20% -> avg. payoff = 1.4
– call 20%, fold 80% -> avg. payoff = -0.4
• Want payoffs to settle in on the average values
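The quoted averages follow from simple expected-value arithmetic. The sketch below assumes, consistent with the slide's numbers, that calling with the 4 wins +2 and folding loses -1.

# Expected payoff of a mixed call/fold strategy for the second player
# holding the 4 after the first player bets. The underlying payoffs
# (+2 for calling and winning, -1 for folding) are inferred from the
# averages quoted on the slide.
CALL_PAYOFF = 2.0
FOLD_PAYOFF = -1.0

def expected_payoff(p_call):
    """Average payoff when calling with probability p_call, else folding."""
    return p_call * CALL_PAYOFF + (1 - p_call) * FOLD_PAYOFF

for p in (0.8, 0.2):
    print(f"call {p:.0%}, fold {1 - p:.0%} -> avg. payoff = {expected_payoff(p):.1f}")
# call 80%, fold 20% -> avg. payoff = 1.4
# call 20%, fold 80% -> avg. payoff = -0.4

The six distributions on the next slide come from the same formula, with the call probability stepping from 1.0 down to 0.0.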
Results
• 3 out of 10 trials came within 0.1 of the correct result for the highest payoff
• 2 out of 10 trials came within 0.1 of the correct result for the lowest payoff
• None of the trials came within 0.1 of the correct result for both
• The results were in the correct order in only half of the trials
More Distributions
• Repeated experiment with six choices instead of two
– call 100% -> avg. payoff = 2.0
– call 80%, fold 20% -> avg. payoff = 1.4
– call 60%, fold 40% -> avg. payoff = 0.8
– call 40%, fold 60% -> avg. payoff = 0.2
– call 20%, fold 80% -> avg. payoff = -0.4
– fold 100% -> avg. payoff = -1.0
• Using more distributions did help the program learn to order the values of the distributions correctly
• All six distributions were ranked correctly 7 out of 10 times (0.14% chance for any one trial)
Output Encoding
• Distributions are ranked correctly, but many output values are still inaccurate
• Seems to be largely caused by the encoding of outputs
• Network has four outputs, each representing probability of a specific payoff
• This encoding is not expandable, and four outputs must all be correct for good payoff prediction.
Relative Payoff Encoding
• Replace the four outputs with a single number
• The number represents the payoff relative to the highest payoff possible
P = 0.5 + (winnings/total possible)
• Total possible winnings determined at beginning of game (sum of other players’ holdings)
• Repeated previous experiments using this encoding
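A minimal sketch of this encoding, following the formula above literally; the slides do not show the original implementation.

def encode_payoff(winnings, total_possible):
    """Map winnings onto a single target value centered at 0.5.

    total_possible is determined at the start of the game as the sum
    of the other players' holdings, per the slide above.
    """
    return 0.5 + winnings / total_possible

def decode_payoff(p, total_possible):
    """Invert the encoding to recover the predicted winnings."""
    return (p - 0.5) * total_possible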
Results (Experiment Set 2)
• Payoff predictions were generally more accurate using this encoding
• 5 out of 10 trials got exact payoff (0.502) for best distribution choice with six choices available
• Most trials had very close value for payoff associated with one of the distributions
• However, no trial was significantly close on multiple probability distributions
Observations/Conclusions
• Neural Net player can learn strategies based on probability
• Payoff is successfully learned as a function of betting action
• Consistency is still a problem
• Trouble learning correct payoffs for more than one distribution
Further Study
• Issues of expandability
– Coding for multiple-round history
– Can previous learning be extended?
• Variable learning rate
• Study distribution choices
• Sample some bad distribution choices
• Test against a variety of other players
Questions?