Randomized Strategies and Temporal Difference Learning in Poker
Michael Oder
April 4, 2002
Advisor: Dr. David Mutchler
Overview
• Perfect vs. Imperfect Information Games
• Poker as Imperfect Information Game
• Randomization
• Neural Nets and Temporal Difference
• Experiments
• Conclusions
• Ideas for Further Study
Perfect vs. Imperfect Information
• World-class AI agents exist for many popular games
– Checkers
– Chess
– Othello
• These are games of perfect information
• All relevant information is available to each player
• Good understanding of imperfect information games would be a breakthrough
Poker as an Imperfect Information Game
• Other players’ hands affect how much will be won or lost; however, no player can see this vital information.
• There are non-deterministic aspects as well
Enter Loki
• One of the most successful computer poker players created
• Produced at the University of Alberta by Jonathan Schaeffer et al.
• Employs a randomized strategy
– Makes the player less predictable
– Allows for bluffing
Probability Triples
• At any point in a poker game, player has 3 choices
– Bet/Raise
– Check/Call
– Fold
• Assign a probability to each possible move
• Single move is now a probability triple
• Problem: Associate payoff with hand, betting history, and triple (move selected)
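To make this concrete, here is a minimal Python sketch of sampling a move from a probability triple. The probability values are hypothetical placeholders, not Loki's actual numbers.

import random

# A probability triple assigns a probability to each of the three
# possible moves. These values are hypothetical placeholders.
triple = {"bet_raise": 0.5, "check_call": 0.3, "fold": 0.2}

def choose_move(triple):
    """Sample a single move according to the triple's probabilities."""
    moves = list(triple)
    weights = list(triple.values())
    return random.choices(moves, weights=weights)[0]

print(choose_move(triple))  # e.g. "check_call"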
Neural Nets
• One promising way to learn such functions is with a neural network
• Neural Networks consist of connected neurons
• Each connection has a weight
• Input game state, output a prediction of payoff
• Train by modifying weights
• Weights are modified by an amount proportional to learning rate
Neural Net Example
[Diagram: a neural network whose inputs are the hand, the betting history, and the probability triple, and whose four outputs are P(2), P(1), P(-1), and P(-2), the probabilities of each possible payoff]
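To make the architecture concrete, here is a minimal sketch of such a network in Python with NumPy, assuming a single hidden layer and hypothetical layer sizes; the slides do not specify the original architecture.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: the input features encode the hand, the betting
# history, and the chosen probability triple; the four outputs estimate
# the probability of each possible payoff: P(2), P(1), P(-1), P(-2).
N_IN, N_HIDDEN, N_OUT = 8, 6, 4

# Each connection has a weight; training modifies these weights.
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_IN))
W2 = rng.normal(scale=0.1, size=(N_OUT, N_HIDDEN))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(x):
    """Forward pass: encoded game state in, payoff estimates out."""
    hidden = sigmoid(W1 @ x)
    return sigmoid(W2 @ hidden)

x = rng.random(N_IN)   # a stand-in encoded game state
print(predict(x))      # four values, one per possible payoff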
Temporal Difference
• Most common way to train a multi-layer neural net is with backpropagation
• Relies on simple input-output pairs
• Problem: need to know the correct answer right away in order to train the net
• Solution: Temporal Difference (TD) learning
• TD(λ) algorithm developed by Richard Sutton
Temporal Difference (cont’d)
• Trains responses over the course of a game, over many time steps
• Tries to make each prediction closer to the prediction in the next time step
[Diagram: a chain of predictions P1 → P2 → P3 → P4 → P5 across time steps, each trained toward the next]
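The idea can be sketched as a generic TD(λ) update for a linear predictor, after Sutton's algorithm; this illustrates the technique, not the group's actual code. Each step nudges the prediction at time t toward the prediction at time t+1, with an eligibility trace spreading credit back to earlier steps.

import numpy as np

alpha = 0.01              # learning rate
lam = 0.9                 # trace decay (the lambda in TD(lambda))
w = np.zeros(8)           # weights of a linear predictor (size is hypothetical)
trace = np.zeros_like(w)  # eligibility trace: decayed sum of past gradients

def predict(x):
    return float(w @ x)

def td_step(x_t, x_next):
    """Move the prediction at time t toward the prediction at t+1.

    At the end of a game, predict(x_next) would be replaced by the
    actual payoff, grounding the whole chain of predictions.
    """
    global trace
    error = predict(x_next) - predict(x_t)
    trace = lam * trace + x_t           # gradient of w @ x_t is x_t
    w[:] += alpha * error * trace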
University of Mauritius Group
• TD poker program produced by a group supervised by Dr. Mutchler
• Provides environment for playing poker variants and testing agents
Simple Poker Game
• Experiments were conducted on an extremely simple variant of poker
• Deck consists of 2, 3, and 4 of Hearts
• Each player gets one card
• One round of betting
• Player with highest card wins the pot
• Goal: Get the net to produce accurate payoff values as outputs
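The variant is small enough to simulate directly. A minimal sketch follows, with the betting round omitted for brevity.

import random

DECK = [2, 3, 4]  # the 2, 3, and 4 of Hearts

def play_hand():
    """Deal one card to each of two players; the higher card wins the pot."""
    card_a, card_b = random.sample(DECK, 2)
    return "A" if card_a > card_b else "B"

wins = sum(play_hand() == "A" for _ in range(10_000))
print(f"Player A wins {wins / 10_000:.1%} of hands")  # close to 50%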
Early Results
• Started by pitting a neural net player against a random one
• Results were inconsistent
• Problem: inappropriate value for learning rate
• Too low: outputs never approach true payoffs
• Too high: outputs fluctuate between too high and too low
Experiment Set I
• Conjecture: learning should occur with a very small learning rate over many games
• Learning rate = 0.01
• Train for 50,000 games
• Only set to train when the card is a 4
• First player always bets, second player tested
• Two choices (expected payoffs derived in the sketch below)
– call 80%, fold 20% -> avg. payoff = 1.4
– call 20%, fold 80% -> avg. payoff = -0.4
• Want payoffs to settle in on the average values
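The quoted averages follow from simple expected-value arithmetic. The sketch below assumes, consistent with the slide's numbers, that calling with the 4 wins +2 and folding loses -1.

# Expected payoff of a mixed call/fold strategy for the second player
# holding the 4 after the first player bets. The underlying payoffs
# (+2 for calling and winning, -1 for folding) are inferred from the
# averages quoted on the slide.
CALL_PAYOFF = 2.0
FOLD_PAYOFF = -1.0

def expected_payoff(p_call):
    """Average payoff when calling with probability p_call, else folding."""
    return p_call * CALL_PAYOFF + (1 - p_call) * FOLD_PAYOFF

for p in (0.8, 0.2):
    print(f"call {p:.0%}, fold {1 - p:.0%} -> avg. payoff = {expected_payoff(p):.1f}")
# call 80%, fold 20% -> avg. payoff = 1.4
# call 20%, fold 80% -> avg. payoff = -0.4

The six distributions on the next slide come from the same formula, with the call probability stepping from 1.0 down to 0.0.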
Results
• 3 out of 10 trials came within 0.1 of the correct result for the highest payoff
• 2 out of 10 trials came within 0.1 of the correct result for the lowest payoff
• None of the trials came within 0.1 of the correct result for both
• The results were in the correct order in only half of the trials
More Distributions
• Repeated experiment with six choices instead of two
– call 100% -> avg. payoff = 2.0
– call 80%, fold 20% -> avg. payoff = 1.4
– call 60%, fold 40% -> avg. payoff = 0.8
– call 40%, fold 60% -> avg. payoff = 0.2
– call 20%, fold 80% -> avg. payoff = -0.4
– fold 100% -> avg. payoff = -1.0
• Using more distributions did help the program learn to order the values of the distributions correctly
• All six distributions were ranked correctly 7 out of 10 times (0.14% chance for any one trial)
Output Encoding
• Distributions are ranked correctly, but many output values are still inaccurate
• Seems to be largely caused by the encoding of outputs
• Network has four outputs, each representing probability of a specific payoff
• This encoding is not expandable, and four outputs must all be correct for good payoff prediction.
Relative Payoff Encoding
• Replace the four outputs with a single number
• The number represents the payoff relative to the highest payoff possible
P = 0.5 + (winnings/total possible)
• Total possible winnings determined at beginning of game (sum of other players’ holdings)
• Repeated previous experiments using this encoding
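A minimal sketch of this encoding, following the formula above literally; the slides do not show the original implementation.

def encode_payoff(winnings, total_possible):
    """Map winnings onto a single target value centered at 0.5.

    total_possible is determined at the start of the game as the sum
    of the other players' holdings, per the slide above.
    """
    return 0.5 + winnings / total_possible

def decode_payoff(p, total_possible):
    """Invert the encoding to recover the predicted winnings."""
    return (p - 0.5) * total_possible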
Results (Experiment Set 2)
• Payoff predictions were generally more accurate using this encoding
• 5 out of 10 trials got exact payoff (0.502) for best distribution choice with six choices available
• Most trials had very close value for payoff associated with one of the distributions
• However, no trial was significantly close on multiple probability distributions
Observations/Conclusions
• Neural Net player can learn strategies based on probability
• Payoff is successfully learned as a function of betting action
• Consistency is still a problem
• Trouble learning correct payoffs for more than one distribution
Further Study
• Issues of expandability
– Coding for multiple-round history
– Can previous learning be extended?
• Variable learning rate
• Study distribution choices
• Sample some bad distribution choices
• Test against a variety of other players
Questions?