Learning to Play Blackjack
Thomas Boyett
Presentation for CAP 4630
Teacher: Dr. Eggen
Objective
Design and implement an agent that learns to play Blackjack.
The rational agent approach to AI described in Chapter 2 of our textbook was used as a guide.
Accomplish this without supplying the agent with theories or rules about how to play the game optimally.
Specifying the Task Environment
Can be considered the problem to which a rational agent is the solution.
Designing a good solution always includes gaining in-depth knowledge of the problem.
PEAS
Performance – objective measure of work quality.
Environment – the things the agent will interact with.
Actuators
Sensors
Other Properties of the Task Environment
Fully observable vs. partially observable
Deterministic vs. stochastic
Episodic vs. sequential
Static vs. dynamic
Discrete vs. continuous
Single agent vs. multi-agent
The Rules
Unlike most card games, you play only against the dealer. Whoever has the highest-valued hand without exceeding 21 is the winner. Going above 21 is called busting and is an immediate loss.
Aces are worth 11 or 1, your choice. Kings, Queens, Jacks, and Tens are worth 10. All other cards are worth their face value. The suit of the card is ignored.
The dealer gives you two cards face up and deals himself two cards, one of them face up. This is one of the features of Blackjack that causes it to be a partially observable task environment.
The Rules
If you want another card you can hit. If you are satisfied with your hand you can stand. Whoever has the best score wins the game. If the scores are equal, the game is a draw.
If either player on the initial deal receives an ace and any card worth 10, they have a Blackjack. A Blackjack is an immediate victory unless both players have a Blackjack, in which case the game is a draw.
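The scoring rules above (aces worth 11 or 1, face cards worth 10, suits ignored) can be sketched as a small helper function. This is my own illustrative implementation, not part of the presented agent; the card representation by rank strings is an assumption.

```python
def hand_value(cards):
    """Return the best value of a Blackjack hand, counting each ace
    as 11 unless that would bust the hand, in which case it counts as 1.
    Cards are rank strings ("A", "K", "Q", "J", "10", "9", ...); suits
    are ignored, as the rules state."""
    value = 0
    aces = 0
    for card in cards:
        if card == "A":
            value += 11
            aces += 1
        elif card in ("K", "Q", "J", "10"):
            value += 10
        else:
            value += int(card)
    # Demote aces from 11 to 1 while the hand would otherwise bust.
    while value > 21 and aces:
        value -= 10
        aces -= 1
    return value
```

For example, an ace and a king score 21 (a Blackjack), while two aces and a nine also score 21 because one ace is demoted to 1.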
Types of Rational Agents
Simple reflex agents
Model-based reflex agents
Goal-based agents
Utility-based agents
All types can be extended to be learning agents.
The Blackjack agent will be designed as a learning model-based reflex agent.
The Components of a Learning Agent
Performance element
Critic
Learning element
Problem generator
The Performance Element
The part of the agent that chooses what to do.
The Blackjack agent in this design will be limited to hitting and standing. Optimal winning and betting are separate, complex problems.
The Performance Element
Chooses to hit or stand based on the dealer's value and its own value.
Actions are stored in a reference table. Columns represent the dealer's value and rows represent the agent's value.
The Performance Element

      2   3   4   5   6   7   8   9   10  11
 2    H   H   H   H   H   H   H   H   H   H
 3    H   H   H   H   H   H   H   H   H   H
 4    H   H   H   H   H   H   H   H   H   H
 …    …   …   …   …   …   …   …   …   …   …
 16   S   S   S   S   S   H   H   H   H   H
 17   S   S   S   S   S   S   S   S   H   H
 18   S   S   S   S   S   S   S   S   S   S
 19   S   S   S   S   S   S   S   S   S   S
 20   S   S   S   S   S   S   S   S   S   S
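The reference table above can be sketched as a dictionary keyed by (player value, dealer value) pairs. The function names and the range bounds here are my own assumptions; the slides do not specify an implementation.

```python
def make_initial_table(default_action="H"):
    """Build a (player_value, dealer_value) -> action lookup table.
    Rows are the agent's hand value (2-20), columns are the dealer's
    visible value (2-11). "H" = hit, "S" = stand."""
    return {(player, dealer): default_action
            for player in range(2, 21)
            for dealer in range(2, 12)}

def choose_action(table, player_value, dealer_value):
    """The performance element: look up the stored action for the
    current dealer/player value combination."""
    return table[(player_value, dealer_value)]
```

With 19 player values and 10 dealer values, the table has 190 entries, matching the rows-by-columns layout shown above.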
The Critic
The critic tells the learning element if the results of an action were good or bad.
The critic must be objective and independent of the learning element.
The Critic
If the agent chooses to hit:
The outcome is good if the agent did not bust.
The outcome is bad if the agent did.
If the agent chooses to stand:
The outcome is good if the agent won the game.
The outcome is bad if the agent lost.
The outcome is ignored if the game ends in a draw. Neither the dealer nor the player benefits from a draw or is penalized by it.
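The critic's rules above can be written as a short function. The names and argument shapes are illustrative assumptions, not the presenter's actual code.

```python
def critic(action, busted=None, result=None):
    """Return "good", "bad", or None (ignored) feedback.
    For a hit, `busted` says whether the hit busted the agent.
    For a stand, `result` is "win", "lose", or "draw"."""
    if action == "hit":
        return "bad" if busted else "good"
    if action == "stand":
        if result == "win":
            return "good"
        if result == "lose":
            return "bad"
        return None  # draws are ignored: nobody benefits or is penalized
```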
The Learning Element
Makes improvements to the performance element.
Works in direct response to feedback provided by the critic.
The Learning Element
A lookup table entry actually consists of four values that represent the agent's previous experience with a specific dealer/player value combination. The learning element maintains these values:
BH – # of hits resulting in bad feedback
GH – # of hits resulting in good feedback
BS – # of stands resulting in bad feedback
GS – # of stands resulting in good feedback
The Learning Element
The good/bad ratios of the hitting and standing results are computed, and whichever ratio is larger decides the perceived optimal action.
This approach allows the agent to improve based on previous results. Thousands of games must be played to generate a reliable lookup table.
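The update-and-compare step can be sketched as below. The slides do not say how a zero denominator (no bad outcomes recorded yet) is handled, so I add 1 to each denominator as an assumed safeguard; the function names are also my own.

```python
def update(entry, action, feedback):
    """Record the critic's feedback in the entry's four counters
    (GH, BH, GS, BS)."""
    key = {("hit", "good"): "GH", ("hit", "bad"): "BH",
           ("stand", "good"): "GS", ("stand", "bad"): "BS"}[(action, feedback)]
    entry[key] += 1

def perceived_optimal(entry):
    """Compare the good/bad ratios for hitting and standing; the
    larger ratio decides the perceived optimal action. The +1 in the
    denominators avoids division by zero (an assumption, see above)."""
    hit_ratio = entry["GH"] / (entry["BH"] + 1)
    stand_ratio = entry["GS"] / (entry["BS"] + 1)
    return "H" if hit_ratio > stand_ratio else "S"
```

For the example entry on the next slide (GH=45, BH=5, GS=20, BS=30), hitting wins the comparison by a wide margin, so the entry evaluates to "H".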
The Learning Element
An example of table computation: a hypothetical table entry for a dealer with value 8 and a player with value 12.

GH  45
BH   5
GS  20
BS  30

GH/BH = 45/5 = 9
GS/BS = 20/30 ≈ 0.667

Since GH/BH is greater than GS/BS, this data evaluates to MUST HIT on (8, 12).
Problem Generator
The problem generator's job is to occasionally tell the learning agent to try a non-optimal action for a given situation.
At the cost of sometimes behaving less optimally, the agent is given the opportunity to find less obvious ways to perform better.
The Problem Generator
Force the agent to play a set of games either only hitting or only standing.
A naïve policy if you are playing to win, but it allows the agent to learn about the quality of both choices in all circumstances.
Problem generation may seem counterproductive, but it allows the agent to learn information that otherwise would have been left undiscovered.
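The forced-policy idea above can be sketched as a schedule: early blocks of games always hit, the next block always stands, and after that the learned policy takes over. The block size and function name are my own assumptions; the slides only say "a set of games".

```python
def forced_policy(game_index, block_size=1000):
    """Problem generator: return a forced action for early blocks of
    games, or None once exploration is over (use the learned policy).
    Games 0..block_size-1 always hit; the next block always stands."""
    block = game_index // block_size
    if block == 0:
        return "H"
    if block == 1:
        return "S"
    return None
```

This guarantees that every dealer/player value combination accumulates feedback for both actions, which is exactly the information the learning element's ratios need.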
Results
Before being allowed to learn:
Average Win%: 17%
Average Loss%: 80%
After being allowed to learn without guidance from a problem generator (50,000 games):
Average Win%: 32%
Average Loss%: 60%
After being allowed to learn with a problem generator (50,000 games):
Average Win%: 45%
Average Loss%: 49%
Results in Perspective
A player that always hits:
Average Win%: 15%
Average Loss%: 80%
A player that flips a coin to decide hit/stand:
Average Win%: 24%
Average Loss%: 70%
Results in Perspective
A player that always stands:
Average Win%: 40%
Average Loss%: 55%
A professional Blackjack player who uses Basic (Optimal) Strategy:
Average Win%: 45%
Average Loss%: 49%
References
Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach, Second Edition. Prentice Hall, 2003.