Learning to Play Blackjack
Thomas Boyett
Presentation for CAP 4630
Teacher: Dr. Eggen
Objective
Design and implement an agent that learns to play Blackjack.
The rational agent approach to AI described in Chapter 2 of our textbook was used as a guide.
Accomplish this without supplying the agent with theories or rules about how to play the game optimally.
Specifying the Task Environment
Can be considered the problem to which a rational agent is the solution.
Designing a good solution always includes gaining in-depth knowledge of the problem.
PEAS
Performance – objective measure of work quality.
Environment – the things the agent will interact with.
Actuators
Sensors
Other Properties of the Task Environment
Fully observable vs. partially observable
Deterministic vs. stochastic
Episodic vs. sequential
Static vs. dynamic
Discrete vs. continuous
Single agent vs. multi-agent
The Rules
Unlike most card games, you play only against the dealer. Whoever has the highest-valued hand without exceeding 21 is the winner. Going above 21 is called busting and is an immediate loss.
Aces are worth 11 or 1, your choice. Kings, Queens, Jacks, and Tens are worth 10. All other cards are worth their face value. The suit of the card is ignored.
The dealer gives you two cards face up and deals himself two cards, one of them face up. This is one of the features of Blackjack that causes it to be a partially observable task environment.
The Rules
If you want another card you can hit. If you are satisfied with your hand you can stand. Whoever has the best score wins the game. If the scores are equal, the game is a draw.
If either player on the initial deal receives an ace and any card worth 10, they have a Blackjack. A Blackjack is an immediate victory unless both players have a Blackjack, in which case the game is a draw.
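The scoring rules above (aces worth 11 or 1, face cards worth 10, suits ignored) can be sketched as a small helper function. This is my own illustrative implementation, not part of the presented agent; the card representation by rank strings is an assumption.

```python
def hand_value(cards):
    """Return the best value of a Blackjack hand, counting each ace
    as 11 unless that would bust the hand, in which case it counts as 1.
    Cards are rank strings ("A", "K", "Q", "J", "10", "9", ...); suits
    are ignored, as the rules state."""
    value = 0
    aces = 0
    for card in cards:
        if card == "A":
            value += 11
            aces += 1
        elif card in ("K", "Q", "J", "10"):
            value += 10
        else:
            value += int(card)
    # Demote aces from 11 to 1 while the hand would otherwise bust.
    while value > 21 and aces:
        value -= 10
        aces -= 1
    return value
```

For example, an ace and a king score 21 (a Blackjack), while two aces and a nine also score 21 because one ace is demoted to 1.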
Types of Rational Agents
Simple reflex agents
Model-based reflex agents
Goal-based agents
Utility-based agents
All types can be extended to be learning agents.
The Blackjack agent will be designed as a learning model-based reflex agent.
The Components of a Learning Agent
Performance element
Critic
Learning element
Problem generator
The Performance Element
The part of the agent that chooses what to do.
The Blackjack agent in this design will be limited to hitting and standing. Optimal winning and betting are separate, complex problems.
The Performance Element
Chooses to hit or stand based on the dealer's value and its own value.
Actions are stored in a reference table. Columns represent the dealer's value and rows represent the agent's value.
The Performance Element

      2   3   4   5   6   7   8   9   10  11
 2    H   H   H   H   H   H   H   H   H   H
 3    H   H   H   H   H   H   H   H   H   H
 4    H   H   H   H   H   H   H   H   H   H
 …    …   …   …   …   …   …   …   …   …   …
 16   S   S   S   S   S   H   H   H   H   H
 17   S   S   S   S   S   S   S   S   H   H
 18   S   S   S   S   S   S   S   S   S   S
 19   S   S   S   S   S   S   S   S   S   S
 20   S   S   S   S   S   S   S   S   S   S
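The reference table above can be sketched as a dictionary keyed by (player value, dealer value) pairs. The function names and the range bounds here are my own assumptions; the slides do not specify an implementation.

```python
def make_initial_table(default_action="H"):
    """Build a (player_value, dealer_value) -> action lookup table.
    Rows are the agent's hand value (2-20), columns are the dealer's
    visible value (2-11). "H" = hit, "S" = stand."""
    return {(player, dealer): default_action
            for player in range(2, 21)
            for dealer in range(2, 12)}

def choose_action(table, player_value, dealer_value):
    """The performance element: look up the stored action for the
    current dealer/player value combination."""
    return table[(player_value, dealer_value)]
```

With 19 player values and 10 dealer values, the table has 190 entries, matching the rows-by-columns layout shown above.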
The Critic
The critic tells the learning element if the results of an action were good or bad.
The critic must be objective and independent of the learning element.
The Critic
If the agent chooses to hit:
The outcome is good if the agent did not bust.
The outcome is bad if the agent did.
If the agent chooses to stand:
The outcome is good if the agent won the game.
The outcome is bad if the agent lost.
The outcome is ignored if the game ends in a draw. Neither the dealer nor the player benefits from a draw or is penalized by it.
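The critic's rules above can be written as a short function. The names and argument shapes are illustrative assumptions, not the presenter's actual code.

```python
def critic(action, busted=None, result=None):
    """Return "good", "bad", or None (ignored) feedback.
    For a hit, `busted` says whether the hit busted the agent.
    For a stand, `result` is "win", "lose", or "draw"."""
    if action == "hit":
        return "bad" if busted else "good"
    if action == "stand":
        if result == "win":
            return "good"
        if result == "lose":
            return "bad"
        return None  # draws are ignored: nobody benefits or is penalized
```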
The Learning Element
Makes improvements to the performance element.
Works in direct response to feedback provided by the critic.
The Learning Element
A lookup table entry actually consists of four values that represent the agent's previous experience with a specific dealer/player value combination. The learning element maintains these values:
BH – # of hits resulting in bad feedback
GH – # of hits resulting in good feedback
BS – # of stands resulting in bad feedback
GS – # of stands resulting in good feedback
The Learning Element
The good/bad ratios of the hitting and standing results are computed, and whichever ratio is larger decides the perceived optimal action.
This approach allows the agent to improve based on previous results. Thousands of games must be played to generate a reliable lookup table.
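The update-and-compare step can be sketched as below. The slides do not say how a zero denominator (no bad outcomes recorded yet) is handled, so I add 1 to each denominator as an assumed safeguard; the function names are also my own.

```python
def update(entry, action, feedback):
    """Record the critic's feedback in the entry's four counters
    (GH, BH, GS, BS)."""
    key = {("hit", "good"): "GH", ("hit", "bad"): "BH",
           ("stand", "good"): "GS", ("stand", "bad"): "BS"}[(action, feedback)]
    entry[key] += 1

def perceived_optimal(entry):
    """Compare the good/bad ratios for hitting and standing; the
    larger ratio decides the perceived optimal action. The +1 in the
    denominators avoids division by zero (an assumption, see above)."""
    hit_ratio = entry["GH"] / (entry["BH"] + 1)
    stand_ratio = entry["GS"] / (entry["BS"] + 1)
    return "H" if hit_ratio > stand_ratio else "S"
```

For the example entry on the next slide (GH=45, BH=5, GS=20, BS=30), hitting wins the comparison by a wide margin, so the entry evaluates to "H".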
The Learning Element
An example of table computation: a hypothetical table entry for a dealer with value 8 and a player with value 12.

GH  45
BH   5
GS  20
BS  30

GH/BH = 45/5 = 9
GS/BS = 20/30 ≈ 0.667

Since GH/BH is greater than GS/BS, this data evaluates to MUST HIT on (8, 12).
Problem Generator
The problem generator's job is to occasionally tell the learning agent to try a non-optimal action for a given situation.
At the cost of sometimes behaving less optimally, the agent is given the opportunity to find less obvious ways to perform better.
The Problem Generator
Force the agent to play a set of games either only hitting or only standing.
A naïve policy if you are playing to win, but it allows the agent to learn about the quality of both choices in all circumstances.
Problem generation may seem counterproductive, but it allows the agent to learn information that otherwise would have been left undiscovered.
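The forced-policy idea above can be sketched as a schedule: early blocks of games always hit, the next block always stands, and after that the learned policy takes over. The block size and function name are my own assumptions; the slides only say "a set of games".

```python
def forced_policy(game_index, block_size=1000):
    """Problem generator: return a forced action for early blocks of
    games, or None once exploration is over (use the learned policy).
    Games 0..block_size-1 always hit; the next block always stands."""
    block = game_index // block_size
    if block == 0:
        return "H"
    if block == 1:
        return "S"
    return None
```

This guarantees that every dealer/player value combination accumulates feedback for both actions, which is exactly the information the learning element's ratios need.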
Results
Before being allowed to learn:
Average Win%: 17%
Average Loss%: 80%
After being allowed to learn without guidance from a problem generator (50,000 games):
Average Win%: 32%
Average Loss%: 60%
After being allowed to learn with a problem generator (50,000 games):
Average Win%: 45%
Average Loss%: 49%
Results in Perspective
A player that always hits:
Average Win%: 15%
Average Loss%: 80%
A player that flips a coin to decide hit/stand:
Average Win%: 24%
Average Loss%: 70%
Results in Perspective
A player that always stands:
Average Win%: 40%
Average Loss%: 55%
A professional Blackjack player who uses Basic (Optimal) Strategy:
Average Win%: 45%
Average Loss%: 49%
References
Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach, Second Edition. Prentice Hall, 2003.