
Individual and Social Learning∗

Nobuyuki Hanaki†

July 2, 2004

Abstract

We use adaptive models to understand the dynamics that lead to efficient and fair

outcomes in a repeated Battle of the Sexes game. Human subjects appear to easily

recognize the possibility of a coordinated alternation of actions as a means to generate

an efficient and fair outcome. Yet such typical learning models as Fictitious Play and

Reinforcement Learning have found it hard to replicate this particular result. We de-

velop a model that uses not only individual learning but also the “social learning” that

operates through evolutionary selection. We find that the efficient and fair outcome

emerges relatively quickly in our model.

JEL Classification: B52, D83

Keywords: Reinforcement Learning, Evolutionary Dynamics

∗We are grateful to an anonymous referee for comments and suggestions, and to Jayant Ray for editorial support.

†The Earth Institute, Columbia University. Address: Hogan Hall B-19, 2910 Broadway, New York, NY, 10025, U.S.A. Tel: +1-212-663-2853. Fax: +1-212-854-6309. E-mail: [email protected].


1 Introduction

Experimental studies have been accumulating evidence that the human behavior exhib-

ited in simple games does not coincide with the equilibrium outcomes of standard game

theory. Instead of following the course of action predicted by theory, experimental subjects

often try to explore possible actions and decide what to do next based on the resulting out-

comes. Several distinct models have been proposed to account for such learning processes.

A reinforcement learning model successfully replicates human behavior for games with a

unique mixed strategy Nash equilibrium [Erev and Roth, 1998], while belief-based learning

models such as the Fictitious Play learning model can account for the behavior observed in

coordination games [Cheung and Friedman, 1997, Mookherjee and Sopher, 1997]. A hybrid

of these two models – applicable to games where reinforcement or belief-based learning mod-

els can replicate behavior of experimental subjects – is the Experience Weighted Attraction

(EWA) learning model proposed by Camerer and Ho [1999]. This can be used to test for

various learning models by estimating the parameters of the model. What these models

have in common is the assumption that players are learning about stage game actions.

While successful in some games, these models fail in such well known games as Prisoner’s

Dilemma, Battle of the Sexes, and Chicken as illustrated by Arifovic et al. [2002]. These

are the games where experimental subjects behave as if they are motivated by fairness and

efficiency considerations: subjects often find and coordinate actions that can maximize the

aggregate payoff (efficiency) and divide such payoffs equally amongst themselves (fairness).

Observations that people behave as if they are concerned about fairness and efficiency have

led some researchers to develop models with a richer class of preferences in which players

care not only about their own material payoffs but also about the payoffs to others [Rabin, 1993, Bolton

and Ockenfels, 2000, Fehr and Schmidt, 1999, Charness and Rabin, 2002].

Instead of assuming that fairness and efficiency considerations are a primitive of their

model, Hanaki et al. [2003] have demonstrated that a simple learning model applied to a

limited set of finitely-complex repeated game strategies can generate the characteristic out-


comes of laboratory experiments, i.e., efficient and fair outcomes. The idea that players may

be learning about strategies instead of actions is not new. Various authors have noted that

stage game actions may not be the most natural things for players to learn about [Erev and Roth, 1998, Camerer and Ho, 1999, McKelvey and Palfrey, 2001]. Rather, the contribution

of Hanaki et al. [2003] lies in the way they restricted the set of strategies that players learn

about from the otherwise infinitely many possibilities: complexity constraints on the strate-

gies players learn about and two phases of learning. The complexity of strategies is captured

by the number of states required in their finite automaton representations. Introducing two

phases of learning allows players to explore the set of restricted strategies beyond the confines

of short laboratory experiments by introducing a long horizon pre-experiment phase where

players interact with many others and gradually learn which strategies to use.

What has not been explored by Hanaki et al. [2003], although it is of great interest, is the

underlying dynamics that lead to an efficient and fair outcome. This is especially true of the

Battle of the Sexes game, where, in terms of replicating human behavior, their model differs most from learning based on stage game actions. The primary objective of this

paper is to take a first step toward investigating these dynamics. This is done by exploiting the

evolutionary selection concept, utilized by Miller [1996], and combining the individual and

the social levels of learning.

The most unrealistic aspect of the strategy learning model of Hanaki et al. [2003] is the

large number of strategies each player considers. Even if the set of strategies is limited to very

simple ones, each player in the model is choosing from 26 distinct strategies – too many to be

a reasonable reflection of the human cognitive process. A realistic model should account for

the fact that people consider a much smaller number of strategies from which they learn and

make decisions; and that the strategies people consider are often preconditioned by factors

such as “culture” that have evolved over the generations. While the former – that individual choices are made from a small set of strategies – can be captured by a model of individual learning, it is to the latter – the endogenous restriction of the sets of strategies – that evolutionary modeling is


applicable. The introduction of social learning modeled as an evolutionary process makes our

model different from that of Hanaki et al. [2003], and it makes a clearer distinction between

learning which may be taking place outside of the laboratory (the so-called “pre-experiment

phase”) and that which takes place in the course of laboratory experiments.

In particular, we consider a model where each individual is learning about and choosing

from a set of strategies to play a Battle of the Sexes game with a randomly chosen partner.

The set of strategies players consider is a small subset of the possible strategy space and

may vary from one player to another. Occasionally players are given a chance to look for

a different partner if they are unhappy with the status quo. Evolutionary pressures make

players more likely to adopt the strategy sets of players with higher payoff. Over time,

the majority of the population will be learning about and choosing from a similar subset

of possible strategies. Starting with an intuitively plausible set of assumptions, we first

demonstrate that the model generates the efficient and fair outcome of the game. We then

proceed to relax one assumption at a time so as to detect the minimal set of assumptions

that can generate such an outcome, and under this minimum set of assumptions we closely

investigate the dynamics behind the convergence to the efficient and fair outcome. The rest

of the paper is organized as follows: Section 2 explains the model in detail, followed by a

discussion of the results in Section 3. Section 4 illustrates the dynamics of the model with

an analytical example, and Section 5 concludes.

2 Model

Consider a population of agents N = {1, 2, 3, ..., 2N}. The population is partitioned

into row and column players. For convenience, let odd-numbered agents be row players and

even-numbered agents column players.

Players drawn from the population are matched pairwise to play several repetitions of a


specified two-player stage game, in each period, as shown below.

Row / Column      0         1
     0          18, 6      3, 3
     1           3, 3      6, 18

At the beginning of each period each player selects a repeated game strategy and adheres

to it throughout the period. The number of stages or repetitions in each period is determined

stochastically by a parameter ρ ∈ (0, 1), where ρ is the probability that the period will end

after any given stage. Each period therefore consists of 1/ρ stages on average.

The set of strategies available to player i, s^i, comprises m, not necessarily distinct,

strategies. These are randomly chosen from the set of all possible strategies, S, if i is a

member of the initial generation. Alternatively, these strategies are inherited from player

i’s parents, in a way described in section 2.4. In this paper the set of all possible strategies,

S, is the set of all the strategies that can be implemented by one- and two-state automata.

Section 2.1 provides a brief discussion of an automaton representation of a repeated game

strategy.

Players have propensities or ‘attractions’ associated with each of their strategies and these

attractions determine the probabilities with which a given strategy is chosen. Learning takes

place through the evolution of attractions: at the end of each period, players evaluate the

performance of their chosen strategy and update their attractions accordingly (the precise

manner in which this occurs is described in section 2.2 below).

At the end of each period, there is a probability γ that players are given the opportunity to be rematched with a different partner if they are not satisfied with their current one.

This decision is unilateral in the sense that if one of the players decides to change partners,

both have to draw a new partner from the pool of players wishing to change partners. If,

however, two partners cannot be matched with others because no one else wants to change

partners, they then must stay together. The break-up decision of each player is determined by


comparing that player’s average payoff (per interaction and hitherto referred to as “average

payoff”) to the population’s mean average payoff for the same type (column or row) of

player. For any given player, a low average payoff relative to the population mean increases

the probability of that agent changing partners.

At random intervals, a tournament is announced.1 Each existing pair is required to participate in a

contest against another pair. The contest takes the following form. Two pairs are randomly

matched up and their average payoffs are compared. The pair with the higher average payoffs

is declared the winner, and is allowed to produce two offspring (one column and one row

player) in a manner that will be described later.2 These offspring eventually form the next

generation. Losing pairs are not allowed to reproduce at the end of the contest they have

just lost. The winning and the losing pair of each contest are returned to the pool of

contestants, so that each pair may be picked again, at random, for future contests with other

pairs. The tournament continues in this fashion until the number of offspring is equal to the

number of parents. At this point the parent generation dies off. The offspring, now the new

generation, form pairs and play the repeated Battle of the Sexes game just as their parents did.

2.1 Representing Repeated Game Strategies with Automata

The strategies that we consider in this paper are repeated game strategies with finite

complexity – strategies that can be implemented by an automaton with a finite number of

states. The more complex a strategy becomes, the larger the number of states an automaton

requires to implement that strategy. An automaton consists of four components: a set of

states, the initial state, an output function which indicates which action is to be taken in each state, and a transition function.

¹One restriction applies: rematching among players and a tournament cannot take place at the same time.

²Normally, a tournament takes place at the level of individuals, based on the idea of asexual reproduction. In economics, models based on asexual reproduction are preferred to sexual reproduction owing to how evolutionary models are interpreted: a standard interpretation of evolution is a process of learning through imitating others who seem to be performing better. We have chosen sexual reproduction because of the setup of the stage game, in which each player takes just one role. The dependence of the result on this assumption has not been explored and is left to future research.


The transition function indicates which state will be reached in the next period, given the current state and the current action of the opponent. The simplest repeated

game strategy is to choose the same action regardless of any history. This strategy can be

represented by an automaton with one state. Two possible one-state automata for the game

considered in this paper are “always choose action 0” and “always choose action 1” which

generate the same action regardless of the game's history.

A slightly more complex repeated game strategy requires a player to choose his or her ac-

tion based on the opponent’s last action. Two-state automata can implement strategies with

this level of complexity. For example, two-state automata can implement the reciprocation

strategy – first choose action 0, and alternate between action 0 and action 1 as long as the

opponent is choosing the same action as the player herself. The following figure illustrates

this automaton:

[Figure: the two-state automaton implementing the reciprocation strategy. Two boxes represent states 0 (left, the initial state, marked by an incoming arrow) and 1 (right); the arrows between the boxes are labeled with the opponent's actions.]

where the left box (with an arrow pointing to it) is the initial state and i ∈ {0, 1} represents

the state i. In this paper, the output function is generic: the state the automaton occupies

and the action taken in that state are the same. Therefore, action 0 (1) will be chosen in state

0 (1). The transition function is indicated by arrows from one state to another. Each arrow is labeled with the opponent's action. The figure above shows that this automaton starts from state 0 and moves to state 1 if the opponent also chooses action 0; otherwise, it stays in state 0. Once the automaton is in state 1, it returns to state 0 only if the opponent's action is 1. In this fashion, the automaton alternates forever between the two states, conditional on the opponent's actions.
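To make this representation concrete, here is a minimal sketch in Python (our own illustration; the paper contains no code, and the class and variable names are ours). It encodes an automaton as an initial state plus a transition table keyed by (own state, opponent's action), uses the identity output function described above, and instantiates the reciprocation automaton from the figure.

```python
# A minimal sketch of the automaton representation used in the text.
# States and actions are both in {0, 1}; the action played in a state equals the state itself.

class Automaton:
    def __init__(self, initial_state, transition):
        # transition maps (own_state, opponent_action) -> next_state
        self.initial_state = initial_state
        self.transition = transition
        self.state = initial_state

    def act(self):
        # Output function: play the action equal to the current state.
        return self.state

    def observe(self, opponent_action):
        # Move to the next state given the opponent's last action.
        self.state = self.transition[(self.state, opponent_action)]


# The reciprocation strategy from the figure: start in state 0 and switch states
# whenever the opponent's action matches the automaton's current state.
reciprocation = Automaton(
    initial_state=0,
    transition={(0, 0): 1, (0, 1): 0, (1, 0): 1, (1, 1): 0},
)

# Two reciprocators playing each other alternate between (0, 0) and (1, 1).
a, b = reciprocation, Automaton(0, dict(reciprocation.transition))
for _ in range(4):
    x, y = a.act(), b.act()
    print((x, y))
    a.observe(y)
    b.observe(x)
```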

In this paper we consider a total of 26 possible automata. For a repeated game consisting of a two-player, two-action stage game, like the game we consider in this paper, these 26 automata are all the possible one- and two-state automata.³ See Appendix A for a complete listing of the automata we consider.

³There are two possible states for an automaton and two possible actions by the opponent. The transition function thus maps the four possible (own state, opponent's action) pairs into one of the two states to be occupied in the next period. This gives 2⁴ = 16 transition functions. As each of these can be combined with either of the two initial states, we have a total of 32 possible automata. Among these automata, however, 4 implement the same strategy “always play 0” and another 4 implement “always play 1.” Eliminating these non-unique automata yields a total of 26 automata: 2 one-state and 24 two-state automata.
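The counting argument in footnote 3 can be checked with the following short sketch (again Python and purely illustrative, not code from the paper): it enumerates every combination of initial state and transition table and removes the constant-action duplicates, recovering the figure of 26.

```python
# A small sketch that reproduces the count of 26 automata: all one- and two-state
# automata for a 2-action stage game, with the duplicate constant strategies removed.
from itertools import product

states, actions = (0, 1), (0, 1)
automata = []
for initial in states:
    # A transition table assigns a next state to each (own state, opponent action) pair.
    for nxt in product(states, repeat=4):
        transition = dict(zip(product(states, actions), nxt))
        automata.append((initial, transition))

print(len(automata))  # 32 = 2 initial states x 2^4 transition tables


def constant_action(initial, transition):
    # True if the automaton never leaves its initial state, whatever the opponent does,
    # so it always plays the action equal to that state.
    return all(transition[(initial, a)] == initial for a in actions)


constants = [a for a in automata if constant_action(*a)]
print(len(constants))                        # 8: four copies of "always 0", four of "always 1"
print(len(automata) - len(constants) + 2)    # 26 distinct strategies, as stated in footnote 3
```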

2.2 Individual Learning and the Strategy Choice

When players utilize repeated game strategies, they face the “inference problem” [McK-

elvey and Palfrey, 2001] when anticipating the opponent’s strategy from the observed history

of action. Thus, belief-based learning models, in which players are trying to learn about an

opponent’s next action (and play the best response against it) are not readily applicable to

such an environment. On the other hand, as argued by Hanaki et al. [2003], reinforcement

learning models do not have such a problem as they use only the realized payoff. In this paper

we assume that players learn based on the realized payoff, i.e., we follow the reinforcement

learning rule.

Let a^i_j(τ) be the level of attraction of strategy j ∈ s^i for player i in period τ. The initial level of attraction, a^i_j(0), can be set freely. We would like to set an equal value for all the strategies because our aim is to see what emerges out of the model without presupposing any differences between strategies. We would also like to set a rule for choosing this value that makes the model applicable to various games. Based on these considerations, we set a^i_j(0) equal to the expected payoff from the situation in which both players choose their actions at random, for all j's and i's. This can be interpreted as the average level of attraction of the game itself for the players. For our game, a^i_j(0) = (1/4)(18 + 3 + 3 + 6) = 7.5 for both types of players.

As a result of the interactions in period τ, the attraction level evolves as a weighted average of its previous value and the current reinforcement value:

a^i_j(τ + 1) =
    (1 − 1/n^i_j(τ + 1)) a^i_j(τ) + (1/n^i_j(τ + 1)) R^i_j(τ)    if strategy j is chosen in period τ
    a^i_j(τ)                                                     otherwise          (1)

where R^i_j(τ) is the reinforcement value: the average payoff the player obtained (by utilizing strategy j) in period τ. Each period consists of 1/ρ interactions on average. n^i_j(τ) is the number of times strategy j has been chosen up to period τ, plus the initial value n^i_j(0). Hence the weight, 1/n^i_j(τ + 1), captures the idea known as the “power law of practice” [Erev and Roth, 1998]: the more experience one obtains by using a particular strategy, the less effect additional information has on changes in that strategy's attraction level. The initial value n^i_j(0) ≥ 1 is a parameter of the model. It determines the strength of the early outcomes of the game relative to the later ones: the larger the value of n^i_j(0), the smaller the difference between the influence of earlier and later outcomes on changes in the attraction level. In this paper, we fix n^i_j(0) to 1 for all i and j.
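As an illustration of how the update in equation (1) behaves, the following sketch (ours; the variable names are hypothetical) reinforces one strategy of a single player, starting from the initial attraction of 7.5 and n^i_j(0) = 1 used in the paper.

```python
# A minimal sketch of the reinforcement update in equation (1) for a single player.
# Attractions start at the game's "average attraction" 7.5, and n_j(0) = 1.

m = 3                      # number of strategies in the player's set
attraction = [7.5] * m     # a^i_j(0)
count = [1] * m            # n^i_j(0)

def update(j, avg_payoff):
    """Reinforce strategy j with the average payoff it earned this period."""
    count[j] += 1                          # n^i_j(tau + 1)
    w = 1.0 / count[j]                     # weight of the new observation
    attraction[j] = (1 - w) * attraction[j] + w * avg_payoff

# Example: strategy 0 is used for two periods and earns average payoffs 12 and 6.
update(0, 12.0)   # attraction[0] = 0.5 * 7.5 + 0.5 * 12 = 9.75
update(0, 6.0)    # attraction[0] = (2/3) * 9.75 + (1/3) * 6 = 8.5
print(attraction)
```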

Once the levels of attraction have been updated, players simultaneously choose a new

strategy according to the following probability distribution:

p^i_j(τ + 1) = e^{λ a^i_j(τ+1)} / Σ_{k ∈ s^i} e^{λ a^i_k(τ+1)}          (2)

This logistic transformation is undertaken so as to ensure that the probability is positive

for all the strategies at any point in time. λ ≥ 0 is another parameter in this model;

it determines the probability distribution of the strategy choice, given how each player is

attracted to each of the strategies. In other words, λ represents the sensitivity of the strategy

choice to the level of attraction. If λ is zero, then all strategies are equally likely to be chosen

regardless of their attractiveness. Thus this situation is equivalent to having no learning

because all the learning takes place through evolution of the level of attraction. If on the


other hand, λ is large, the strategy with the highest attraction level will be chosen.4
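A minimal sketch of the choice rule in equation (2) is given below (our illustration; NumPy is assumed only for convenience). It shows the three regimes discussed above: λ = 0 gives uniform choice, the paper's λ = 2.0 tilts choice toward the most attractive strategy, and a very large λ makes the choice nearly deterministic.

```python
# A minimal sketch of the logit strategy choice in equation (2).
import numpy as np

def choice_probabilities(attractions, lam):
    """Map attraction levels to choice probabilities with sensitivity lam >= 0."""
    a = np.asarray(attractions, dtype=float)
    weights = np.exp(lam * (a - a.max()))   # subtract the max for numerical stability
    return weights / weights.sum()

attractions = [8.5, 7.5, 7.5]
print(choice_probabilities(attractions, lam=0.0))   # uniform: learning has no effect
print(choice_probabilities(attractions, lam=2.0))   # the paper's value: favors the best strategy
print(choice_probabilities(attractions, lam=50.0))  # very large lam: almost deterministic
```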

2.3 Rematching

After all the interactions for a period have taken place, there is a small probability γ that

that a rematching opportunity will become available to everyone. If this is the case, each

player compares his or her average payoff with the mean average payoff of the population of

the same type. The probability that player i of type k will decide to change partners is

b^i_k = max[0, (π̄_k − π^i) / (π̄_k − π_k^min)]          (3)

where π̄_k and π_k^min are the population mean and the minimum possible average payoff, respectively, of type k, and π^i is the average payoff that i has obtained by interacting with

his or her current partner. Equation 3 states that if the player is obtaining more than the

population mean average payoff of her type, then she does not choose to break up. If on

the other hand, the player is getting the minimum possible payoff for her type, then, with

probability 1, she breaks up and tries to find a new partner. The breaking-up decision is

unilateral, so if one player from a pair decides to break up, the other player also has to

look for a new partner. Players who break up with their current partners are pooled and

randomly rematched with another player of the different type in the pool.5

⁴This transformation is popular in the learning literature. In belief-based learning models, such as Smoothed Fictitious Play, A^i_j is defined as the expected payoff when player i uses strategy j, while λ is interpreted as that player's error rate. If λ is zero, the player ignores the expected payoff when choosing the best response, whereas if λ is infinity, the player does not make any mistakes and always chooses the strategy with the highest expected payoff.

⁵It is possible for a player to be rematched with the same partner as before. However, the probability of this occurring becomes very small as the number of players in the rematching pool increases.
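A sketch of the break-up rule in equation (3) follows (ours, not from the paper). It assumes that the minimum possible average payoff is the miscoordination payoff of 3 for either type, which is our reading of the payoff matrix above.

```python
# A minimal sketch of the break-up probability in equation (3).
def breakup_probability(own_avg_payoff, population_mean, min_payoff=3.0):
    """Probability that a player asks for a new partner, per equation (3).

    min_payoff is the minimum possible average payoff for the player's type;
    for the payoff matrix above we take it to be the miscoordination payoff of 3.
    """
    if population_mean <= min_payoff:
        return 0.0  # degenerate case: the whole population is at the minimum
    p = (population_mean - own_avg_payoff) / (population_mean - min_payoff)
    return max(0.0, min(1.0, p))  # the upper clamp is only a safeguard

# A row player earning 6 on average while the row population averages 12:
print(breakup_probability(6.0, 12.0))   # (12 - 6) / (12 - 3) = 2/3
```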


2.4 Generation Change

There is a very small probability, ν ≪ γ, that a tournament will be announced at the end

of any given period.6 A tournament (a collection of contests) produces a new set of players,

whom we call a new “generation” of players. In each contest, two pairs are randomly matched

up and the total average payoffs of the matched pairs are compared. The pair with the higher total average payoffs is declared the winner of the contest.7 The winning pair, but not the

losing pair, gets to produce two offspring in the following fashion.

Let us suppose the tournament is announced in period x. Let s^1 = {σ^1_1, σ^1_2, ..., σ^1_m} and s^2 = {σ^2_1, σ^2_2, ..., σ^2_m} be the strategy sets of the two players in the winning pair, and let A^1 = {a^1_{σ^1_1}(x), a^1_{σ^1_2}(x), ..., a^1_{σ^1_m}(x)} and A^2 = {a^2_{σ^2_1}(x), a^2_{σ^2_2}(x), ..., a^2_{σ^2_m}(x)} be the corresponding sets of their attraction levels for each of their strategies. We construct a source strategy set s^P_0 by joining s^1 and s^2, so that s^P_0 = {σ^1_1, σ^1_2, ..., σ^1_m, σ^2_1, σ^2_2, ..., σ^2_m}. To simplify the notation, let us re-label the elements of the source set so that s^P_0 = {σ^P_1, σ^P_2, ..., σ^P_{2m}}. Similarly, the set of attractions A^P_0 over the source set is created by joining A^1 and A^2, so that, after re-labeling, A^P_0 = {a^P_1, a^P_2, ..., a^P_{2m}}. An offspring is created by choosing m strategies, one at a time, from s^P_0 without replacement. Thus the source strategy set, as well as the corresponding set of weights, changes as strategies are selected.

Let s^P_{q−1} and A^P_{q−1} respectively be the source strategy set and the corresponding set of attractions after q − 1 < m strategies have been selected. The probability of choosing strategy σ^P_l in s^P_{q−1} at the q-th round of strategy choice is

p_{q,l} = e^{λ a^P_l} / Σ_{k ∈ A^P_{q−1}} e^{λ a^P_k},          (4)

where λ takes the same value as in the case of individual strategy choice. This parameter determines how well what the parents have learned is transferred to the offspring.

⁶As noted before, there is one restriction: rematching and a tournament cannot take place at the same time.

⁷In the event of a tie, one pair is randomly chosen as the winner.


If λ = 0, the attraction levels of the parents play no role in the transfer of strategies to an offspring: the offspring inherits only randomly chosen strategies from its parents' strategy sets. By contrast, when λ is very large, it is the m strategies with the highest weights in s^P_0 that will be inherited by the offspring. If σ^P_l is selected at the q-th strategy selection for the offspring, then the source strategy set becomes s^P_q = s^P_{q−1} \ {σ^P_l} and the corresponding set of weights becomes A^P_q = A^P_{q−1} \ {a^P_l}.

Each of the strategies in the offspring’s strategy set can be replaced, with a small proba-

bility µ, with a strategy randomly chosen from the set of all the possible strategies S.8 The

parents’ attraction for each of the strategies will not be inherited by the offspring. In the

beginning, the offspring are equally attracted to each of the strategies at hand.9 An example

of such an offspring creation process is shown in Fig. 1.
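The offspring-creation step just described can be sketched as follows (our illustration; the function and variable names are not from the paper): m strategies are drawn without replacement from the joined parental set with logit weights on the parents' attractions, each inherited strategy is then mutated with probability µ, and the offspring's attractions are reset to the initial value of 7.5.

```python
# A minimal sketch of offspring creation: logit-weighted sampling without
# replacement from the parents' joined strategy set, followed by mutation.
import math
import random

ALL_STRATEGIES = list(range(1, 27))   # the 26 automata, labeled 1..26

def make_offspring(parent_strategies, parent_attractions, m, lam, mu, a0=7.5):
    pool = list(parent_strategies)        # s^P_0: the 2m parental strategies
    weights = list(parent_attractions)    # A^P_0: the corresponding attractions
    child = []
    for _ in range(m):
        logits = [math.exp(lam * a) for a in weights]
        total = sum(logits)
        r, cum, pick = random.random() * total, 0.0, len(pool) - 1
        for idx, w in enumerate(logits):
            cum += w
            if r <= cum:
                pick = idx
                break
        child.append(pool.pop(pick))      # draw without replacement
        weights.pop(pick)
    # Mutation: each inherited strategy is replaced with probability mu.
    child = [random.choice(ALL_STRATEGIES) if random.random() < mu else s for s in child]
    return child, [a0] * m                # attractions are not inherited

# Example with the winning pair of Figure 1:
s1, A1 = [3, 5, 6], [10, 4, 4]
s2, A2 = [6, 10, 15], [8, 4, 8]
print(make_offspring(s1 + s2, A1 + A2, m=3, lam=2.0, mu=0.01))
```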

After each contest, both the winning and the losing pair are returned to the pool of

contestants so that each of them may be picked again, at random, for future contests with

other pairs. The tournament continues in this fashion until the number of offspring is equal

to the number of parents, 2N . At this point, the parent generation dies. The offspring,

now the new generation, form pairs randomly and play the specified game repeatedly. Each

player of the new generation will learn, over time, which strategy to use just as their parents

did.

3 Results

We first report the results from computational experiments in which N = 100, m = 3, and λ = 2.0: a society consisting of 200 players, or 100 pairs, in which each player considers three (not necessarily distinct) strategies and both individual strategy choice and offspring creation are sufficiently sensitive to the level of attraction.

⁸These processes are analogous to “cross-over” (creating an offspring from two parents) and to “mutation” (a random shock delivered to the offspring's strategy set) in evolutionary algorithms. See, for example, Miller [1996] for an implementation of a genetic algorithm as part of an inquiry into complex repeated game strategies in the Repeated Prisoner's Dilemma game.

⁹It is possible, however, for an offspring to inherit fewer than m distinct strategies from its parents. It is then effectively the case that the strategies that enter its strategy set more than once have higher levels of attraction than those that enter only once.


Figure 1: An example of the offspring-creation process, assuming players learn about 3 strategies. The winning pair's strategy sets and corresponding attractions are s^1 = {3, 5, 6} with A^1 = {10, 4, 4} and s^2 = {6, 10, 15} with A^2 = {8, 4, 8}. First strategy choice: the source strategy set is s^P_0 = {3, 5, 6, 6, 10, 15} with corresponding attractions A^P_0 = {10, 4, 4, 8, 4, 8}; suppose strategy 3 is chosen. Second strategy choice: s^P_1 = {5, 6, 6, 10, 15} with A^P_1 = {4, 4, 8, 4, 8}; suppose strategy 15 is chosen. Third strategy choice: s^P_2 = {5, 6, 6, 10} with A^P_2 = {4, 4, 8, 4}; suppose strategy 10 is chosen. The resulting strategy set for the offspring is s^i = {3, 15, 10}. Each element in the set can be replaced by a randomly chosen strategy σ ∈ S with probability µ. If, say, the first element were randomly changed to strategy 1, the final strategy set and corresponding attractions for offspring i would be s^i = {1, 15, 10} and A^i = {7.5, 7.5, 7.5}. (In the original figure, each choice is accompanied by a bar chart of the corresponding probability weights.)


Other parameters are set as ρ = 0.1, γ =

0.01, and ν = 0.001. Thus, each period consists of 10 interactions on average. Rematching

opportunities become available once in 100 periods and the tournament is announced once

in 1000 periods. Therefore each generation of players interacts 10,000 times before the

generation dies off and there are 20,000 periods in each simulation – i.e., each simulation

consists of 20 generations on average. Finally, the probability of random mutation in offspring

creation, µ, is set equal to 0.01.
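To fix the timing of events implied by these parameters, here is a schematic of a single simulated period (our own sketch, abstracting from the game play itself; how a simultaneous draw of both events is resolved is our simplification). With ρ = 0.1, γ = 0.01, and ν = 0.001 it reproduces the averages quoted above: about 10 stages per period, a rematching opportunity roughly every 100 periods, and about 20 tournaments in 20,000 periods.

```python
# A schematic of the per-period event schedule (abstracting from the game itself).
import random

RHO, GAMMA, NU = 0.1, 0.01, 0.001   # period-end, rematching, and tournament probabilities

def run_period():
    stages = 1
    while random.random() > RHO:     # the period ends after each stage with probability rho
        stages += 1
    # Rematching and a tournament cannot occur together; which check comes first is our choice.
    if random.random() < NU:
        event = "tournament"
    elif random.random() < GAMMA:
        event = "rematching opportunity"
    else:
        event = "nothing"
    return stages, event

periods = [run_period() for _ in range(20000)]
print(sum(s for s, _ in periods) / len(periods))      # approx. 1 / RHO = 10 stages per period
print(sum(e == "tournament" for _, e in periods))     # approx. 20 tournaments (generations)
```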

After an initial set of simulations, we will relax some of the assumptions by changing

the parameter values. Specifically, we will (1) drop the possibility of rematching by setting

γ = 0; (2) change the number of strategies in the individual strategy set, m; and (3) make

strategy choice and offspring creation insensitive to the level of attraction, λ = 0. We find

that the result is robust to the first change. It is also robust to the second change as long

as there are more than two strategies in the individual strategy set; the result is somewhat

weaker for m = 2. We do not, however, obtain the same result once we make the third

change.

3.1 Convergence to the efficient and fair outcome

Figure 2 shows the movement of the average payoff over generations. Efficiency requires

that the population mean of the average payoff be 12 because at the Pareto frontier the

aggregate average payoff for a pair is 24. Fairness can be represented by a small standard

deviation of the average payoff. Recall that the more equally payoffs are divided among

players, the smaller the standard deviation of the payoff distribution becomes. The figure

shows quick convergence to the efficient and fair outcome.
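The efficiency and fairness statistics plotted in Figure 2 can be computed as follows (a sketch of our own; pair_payoffs is a hypothetical array of end-of-generation average payoffs, one (row, column) entry per pair).

```python
# A sketch of the efficiency and fairness statistics used in Figure 2.
import numpy as np

# Hypothetical end-of-generation average payoffs, one (row, column) entry per pair.
pair_payoffs = np.array([[18.0, 6.0], [6.0, 18.0], [12.0, 12.0], [3.0, 3.0]])

payoffs = pair_payoffs.ravel()          # individual average payoffs
mean = payoffs.mean()                   # efficiency: the Pareto frontier gives a mean of 12
std = payoffs.std()                     # fairness: equal division pushes this toward 0
print(mean, std)
```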

Let us take a closer look at the dynamics. Figure 3 traces the evolution of the average

payoff distribution for all the pairs in society from a single realization. Each dot in the figure

represents an average-payoff combination for a pair. The average payoffs of the row players

are given on the horizontal axis and those of the column players are on the vertical axis.


Figure 2: The dynamics of the average payoff distribution over generations. The solid line in the middle represents the population mean average payoff. The two dotted lines each represent one standard deviation from the mean. Each line is averaged over 20 different realizations.

The triangle in the figure represents the locus of possible pairs of average payoffs, given the

payoff matrix. The side of the triangle farthest from the origin is the Pareto frontier and

the 45-degree line represents the equal division of payoffs between two players. We call this

the “fairness line.” Note that it is possible for more than one dot to be in the same or a

similar location. Higher population (payoff) densities are represented by darker shades of

gray. Each graph in the figure represents an end-of-generation average payoff distribution.

Starting from an almost random distribution at the end of the first generation, only efficient

and fair payoff pairs are left by the end of generation 8.

A close look at the dynamics of payoff distribution tells us that the unfair payoff pairs

are being eliminated from the beginning. By generation 3, points have begun to lie along

the fairness line. Then inefficient points – i.e., points in the interior of the triangle – are

eliminated. The efficiency dynamics are due to an evolutionary pressure that selects pairs

having higher average payoffs – i.e., those closer to the Pareto frontier. The fairness dynamics

require a closer examination.


Figure 3: Generational change in the average payoff distribution, from a representative run. Each point represents the end-of-generation average payoff for a pair. Notice that it is possible to have several points on top of each other; higher population (payoff) densities are represented by darker shades of gray. The x-axis represents the payoffs for row players and the y-axis those for column players. The triangle in the figure is the locus of possible average payoff pairs given the game, and the 45-degree line represents equal payoff division between the two players. The parameter values for this realization were N = 100, m = 3, ρ = 0.1, γ = 0.01, ν = 0.001, and µ = 0.01.


Figure 4: Dynamics of the average payoff over generations for the simulations without within-one-generation rematching. The solid line in the middle represents the population mean average payoff. The two dotted lines each represent one standard deviation from the population mean. Each line is averaged over 20 realizations.

3.2 Without rematching

Given how the breaking-up decision is modeled, one naturally suspects that it is the

rematching assumption that serves as the driving force behind the convergence to the fair

and efficient outcomes seen above.10 To see the importance of within-one-generation re-

matching,11 we run a simulation by setting the rematching probability, γ, to zero. Other

parameters of the model are kept unchanged.12 The result illustrated in Fig. 4 shows that

the rematching assumption is not what drives the convergence to the efficient and fair outcome. The only difference from the case with a positive rematching probability is that

now it takes slightly longer for convergence to occur.

¹⁰Consider the situation where one-third of the population is realizing an efficient outcome that favors column players, one-third the efficient outcome that favors row players, and the last third the efficient and fair outcome. As a result of rematching, on average, half the players who are getting an efficient but unfair payoff favoring one type of player will be rematched with players who are getting the efficient but unfair outcome that favors the opposite type of player. By contrast, none of the players obtaining the efficient and fair outcome break up. Since it is possible for the newly matched pairs to obtain less than the efficient payoff, they have a higher probability of losing the contests and of failing to produce offspring.

¹¹We emphasize within-one-generation rematching simply because there is another type of rematching at work within the model: the type that takes place when a new generation of players is created. Recall that, at the start of a newly created generation, players are randomly matched against other players of the type different from their own.

¹²That is, with N = 100, m = 3, ρ = 0.1, ν = 0.001, µ = 0.01, and 20,000 periods per simulation.


         gen. 1    gen. 4    gen. 8    gen. 12   gen. 16   gen. 20
m=1       7.46      7.86      9.22     10.52     11.63     11.87
         (4.22)    (4.41)    (4.98)    (5.13)    (5.08)    (5.02)
m=2       9.78      9.67     11.46     11.99     11.99     12.00
         (3.86)    (3.32)    (1.66)    (1.02)    (1.00)    (1.01)
m=3      10.30     10.50     11.66     11.94     11.99     12.00
         (3.64)    (2.80)    (0.93)    (0.45)    (0.36)    (0.36)
m=5      10.58     11.04     11.81     11.98     11.99     12.00
         (3.49)    (2.38)    (0.81)    (0.34)    (0.33)    (0.32)
m=10     10.79     11.20     11.70     11.96     11.98     11.98
         (3.24)    (2.13)    (0.72)    (0.34)    (0.32)    (0.33)
m=26     10.61     11.19     11.78     11.89     11.96     11.95
         (3.20)    (2.14)    (0.81)    (0.38)    (0.33)    (0.43)

Table 1: Dynamics of the mean and standard deviation of the end-of-generation average payoff for various values of m. Results are from simulations with zero rematching probability. Standard deviations are in parentheses. Each datum is averaged over 20 realizations.

3.3 Varying the parameter m

We have seen that individual learning and evolutionary pressure together produce the

efficient and fair outcome when the number of strategies that each individual considers is

three. We have also seen that this result is robust if we eliminate rematching. One might

ask another question: what happens when we vary m? Do we still obtain the same result

as before? Next we address this question and the results are summarized in Table 1. The

table shows that regardless of the value of m, a similar result can be obtained. The only

exception to this occurs when m = 1, i.e., when there is no strategy selection by individual

players. Under this condition, the model fails to produce the efficient and fair outcome even

with within-one-generation rematching (see Table 2).13

¹³Doubling the length of the simulation does not change the result.


        gen. 1    gen. 4    gen. 8    gen. 12   gen. 16   gen. 20
m=1      9.32      9.98     11.29     11.73     11.94     11.93
        (4.29)    (4.40)    (3.92)    (3.60)    (3.46)    (3.47)
m=2      9.62     10.65     11.46     11.93     11.99     12.00
        (3.94)    (2.93)    (0.49)    (0.36)    (0.37)    (0.34)

Table 2: Dynamics of the mean and standard deviation of the end-of-generation average payoff for m = 1 and m = 2, with positive rematching probability (γ = 0.01). Standard deviations are in parentheses. Each datum is averaged over 20 realizations.

3.4 Varying λ

We have shown that when individuals are deprived of active strategy choice, the model fails to attain the efficient and fair outcome. Since players who are not choosing their own strategies cannot learn individually in our model, this suggests that both individual strategy selection and “social learning” are needed to obtain such an outcome. Does this mean that both individual and social learning are required to obtain the efficient and fair outcome?

One way to investigate the importance of having both individual and social learning in

our framework is to set λ = 0. Notice that when λ = 0, strategy selection and offspring

creation are no longer sensitive to the attraction levels of strategies. Since all individual

learning takes place through the evolution of levels of attraction, we can separately study

the importance of individual learning while retaining the possibility of strategy selection.

The result depicted in Fig. 5 shows that although the efficient outcome can emerge if

the simulation is run long enough,14 we do not necessarily obtain the fair outcome. This

suggests that evolutionary selection alone is not sufficient to generate the efficient and fair

outcome.

¹⁴We ran simulations for 40,000 instead of 20,000 periods; hence there are, on average, 40 generations in each simulation.


Figure 5: Dynamics of the average payoff over generations for the simulations with λ = 0. The solid line in the middle represents the mean average payoff in the population. The two dotted lines each represent one standard deviation from the mean. Each line is averaged over 20 realizations.

3.5 Dynamics of Strategy

To understand the driving force behind convergence to the efficient and fair outcome

through both individual learning and evolutionary dynamics, we must also examine the dy-

namics of strategy choice in detail. The essential feature is that individual learning causes

a subset of all the possible strategies to become more likely to be chosen by players. These

strategies have a higher probability of being passed on to the next generation than do the

other ones. Then, the next generation of players learns to choose some of these inherited

strategies more often. And again, those are the very strategies that have a higher probability

of being inherited by the next generation. The repetition of this process over several gener-

ations leads to the survival of strategies which, when used by players to play among themselves, generate the efficient and fair outcome.

Figure 6 depicts the dynamics of the end-of-generation probability distribution of strategy choice, i.e., the average probability of each strategy being chosen by a player. At the end of the first generation, some strategies are already beginning to drop out of the race. The four strategies that have the highest probability of being chosen at the end of generation 2 are those that can be implemented by automata 1, 2, 7, and 9.15

¹⁵We call the strategy that can be implemented by automaton i “strategy i.”


Figure 6: Dynamics of the end-of-generation probability distribution of strategy choice, from a run with N = 100, m = 26, and γ = 0.0 (without within-one-generation rematching).

See Appendix A for illustrations of each of the strategies discussed here.

Of the four strategies just cited, strategies 1 and 2 begin to fall behind by the end of

generation 3. By the end of generation 4, the four strategies that are most likely to be

chosen are 7, 9, 11, and 13. If these four strategies were to play against each other, they

would achieve a perfectly coordinated alternation of action between {0,0} and {1,1}. As the

generation-change proceeds further, these will become the only strategies used by players in

the model.

It must be noted that there are realizations in which the four surviving strategies are 21,

23, 24, and 26, rather than the aforementioned 7, 9, 11, and 13. If, however, 21, 23, 24, and

26 play against each other, they too achieve the perfectly coordinated alternation of actions

between {1,1} and {0,0}. The initial dynamics of these realizations are similar to those seen

above. Strategies 1 and 2 are among those most likely to be chosen in the beginning, but

they drop out as further generation-change takes place. The two types of realization are

equally likely to occur.


4 Analytical exploration

The strategy dynamics explored at the end of the previous section illustrate the tendency of individual learning, in the beginning, to single out “always play zero,” “always play one,” and “alternate between zero and one.” We also saw how, as the generations proceed, the first two drop out. The challenge confronting us now is to see

whether a simple analytical setting can help us to understand why this is the case. In this

section, we take an initial step toward meeting that challenge.

Let us suppose that there are only three strategies being used by the players: “always play

zero”, “always play one”, and “mechanically alternate between zero and one.” Automata

1 and 2 correspond to the first two strategies, and either automaton 11 or automaton 24

can implement the third. Let us use automaton 11 to implement this strategy. Let p_1, p_2, and p_11, with p_1 + p_2 + p_11 = 1, be the average probabilities for strategies 1, 2, and 11, respectively, to be chosen by a player in the population at a given point in time.16

A standard way to model the evolutionary pressure is to use symmetric replicator dynamics:

ṗ_i = p_i (π_i − π̄)          (5)

where ṗ_i is the rate of change of the average probability of strategy i being chosen; π_i is the mean average payoff for those utilizing strategy i; and π̄ is the mean population average payoff. Equation 5 states that strategies that generate a higher (lower) average payoff than

the population mean will be used more (less) frequently.17 Notice that this is a global selection process, unlike individual learning, which takes place locally.

¹⁶One could also interpret these probabilities as the share of the population who utilize the given strategy at a given point in time.

¹⁷See, for example, Ch. 3 of Fudenberg and Levine [1998] and Ch. 3 of Weibull [1995] for two good, in-depth discussions of replicator dynamics.


Consider an infinitely repeated game that consists of the following symmetric stage game:

Row / Column      0         1
     0           a, b      c, c
     1           c, c      b, a

where a > c > 0, b > c > 0, and a ≠ b. In such a game, the efficient and fair outcome will

be achieved only through an alternation between {0,0} and {1,1}.

Under the “limit of means criterion,”18 the replicator dynamics becomes:19

ṗ_1 = −(1/4)(a + b − 2c) p_1 (1 + p_1(2p_1 − 3) + p_2(2p_2 − 1))
ṗ_2 = −(1/4)(a + b − 2c) p_2 (1 + p_2(2p_2 − 3) + p_1(2p_1 − 1))          (6)

This two-dimensional dynamical system has three stable and three unstable equilibria. The stable equilibria are {p_1, p_2, p_11} = {1, 0, 0}, {0, 1, 0}, and {0, 0, 1}; in these equilibria, only one of the three strategies is used by the players. The unstable equilibria are {p_1, p_2, p_11} = {0.5, 0.5, 0}, {0.5, 0, 0.5}, and {0, 0.5, 0.5}; in these, two of the three strategies are equally likely to be used.

¹⁸There are several ways to model the payoff from a repeated game, among them the “discounting criterion” and the “limit of means criterion.” With the discounting criterion, payoffs from the near future are weighted more heavily. With the limit of means criterion, it is the long-run payoff that matters. We choose the limit of means criterion because it has one less parameter (we do not have to worry about the discounting rate); under this criterion, however, the choice of strategy is not sensitive to a change in the payoff from a single period. See Ch. 8 of Osborne and Rubinstein [1994].

¹⁹These equations can be derived by substituting the following expressions into ṗ_1 and ṗ_2 and manipulating them:

π_1 = p_1 (a + b)/2 + p_2 c + (1 − p_1 − p_2)(a + b + 2c)/4
π_2 = p_2 (a + b)/2 + p_1 c + (1 − p_1 − p_2)(a + b + 2c)/4
π_11 = (p_1 + p_2)(a + b + 2c)/4 + (1 − p_1 − p_2)(a + b)/2
π̄ = p_1 π_1 + p_2 π_2 + (1 − p_1 − p_2) π_11

We give a brief explanation of the first equation. A player utilizing strategy 1 will meet another player using strategy 1 with probability p_1; in this case, her expected payoff is (a + b)/2, because she can be either a column or a row player with equal probability. She meets a player using strategy 2 with probability p_2 and obtains payoff c. Finally, with probability 1 − p_1 − p_2, she could meet a player utilizing strategy 11, in which case she would get (a + b + 2c)/4. The average payoffs for players using strategy 2 or strategy 11 are derived in a similar fashion.


Figure 7: Phase diagram in the unit simplex with vertices p_1 = 1, p_2 = 1, and p_11 = 1. The filled black circles correspond to the equilibria of the system; among them, those at the vertices of the triangle are stable. The dynamics of the system in the simplex are represented by the arrowed lines.

As shown in Fig. 7, however, since these equilibria are

unstable, even a slight deviation in the probability will take the system away from these

equilibria, and eventually the system will converge on one of the stable equilibria. Among

the stable equilibria, the one with the largest basin of attraction is {p_1, p_2, p_11} = {0, 0, 1},

where we observe the perfect alternation of actions.

An important lesson to be drawn from this analysis is that if the system starts with two groups of people of about the same size, one utilizing strategy 1 and the other strategy 2, with all the others using strategy 11, then evolutionary pressure alone will lead to the stable equilibrium in

which only strategy 11 is used. This is where the efficient and fair outcome of the game can

be observed.
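A simple Euler integration of system (6) with the paper's payoffs (a = 18, b = 6, c = 3) illustrates this lesson numerically (our own sketch, not from the paper): starting near the unstable equilibrium {0.5, 0.5, 0} with a small share of alternators, the system converges to {0, 0, 1}.

```python
# A sketch: Euler integration of the replicator dynamics in equation (6)
# for the paper's payoffs a = 18, b = 6, c = 3.
a, b, c = 18.0, 6.0, 3.0
k = 0.25 * (a + b - 2.0 * c)

def step(p1, p2, dt=0.01):
    dp1 = -k * p1 * (1.0 + p1 * (2.0 * p1 - 3.0) + p2 * (2.0 * p2 - 1.0))
    dp2 = -k * p2 * (1.0 + p2 * (2.0 * p2 - 3.0) + p1 * (2.0 * p1 - 1.0))
    return p1 + dt * dp1, p2 + dt * dp2

# Start near the unstable equilibrium {0.5, 0.5, 0} with a small share of alternators.
p1, p2 = 0.49, 0.49
for _ in range(5000):
    p1, p2 = step(p1, p2)

print(round(p1, 4), round(p2, 4), round(1.0 - p1 - p2, 4))  # approaches 0, 0, 1
```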

The strategy dynamics previously observed in the simulations show us that individual learning tends, in the beginning, to bring the system into a position where strategies 1 and 2 are

equally likely to be chosen by the players. Hence the model makes it clear that individual

learning, when it is combined with the evolutionary pressure or global selection process, can

lead to the efficient and fair outcome.


5 Conclusion

We have shown that, taken in tandem, individual learning and the “social learning” that operates over simple, finitely complex repeated game strategies generate the efficient and fair outcome in the repeated Battle of the Sexes game. In our framework, however, neither individual learning nor evolutionary selection can generate such an outcome by itself.

This notion – that the actual human learning process is a combination of two levels of

learning, individual and social – has a certain intuitive appeal. It is certainly of interest

to see that the model generates an outcome coinciding with the characteristic outcome of

laboratory experiments only when both dimensions of learning are effective.

This paper represents no more than an initial step toward understanding the

full implications of the two modes of learning. While it has identified a minimal setup that

generates such an outcome, to what extent its results can be generalized is not known. In

particular, we have not explored the validity of the result in other games. Future research

should analyze a model having an arbitrary 2 by 2 payoff matrix, so as to identify a set of

conditions that will always lead to the efficient and fair outcome.


A The Complete Set of One- and Two-State Automata

[Diagrams of the 26 automata, labeled Automaton 1 through Automaton 26: the two one-state automata (“always play 0” and “always play 1”, Automata 1 and 2) and the 24 two-state automata, each drawn with its initial state and with transitions labeled by the opponent's action.]


References

Jasmina Arifovic, Richard D. McKelvey, and Svetlana Pevnitskaya. An initial implementa-

tion of the Turing tournament to learning in two-person games. Mimeo, California Institute

of Technology, 2002.

Gary Bolton and Axel Ockenfels. ERC: A theory of equity, reciprocity, and competition.

American Economic Review, 90(1):166–193, 2000.

Colin Camerer and Teck-Hua Ho. Experience-weighted attraction learning in normal form

games. Econometrica, 67(4):827–874, 1999.

Gary Charness and Matthew Rabin. Understanding social preferences with simple tests.

Quarterly Journal Of Economics, 117(3):817–869, 2002.

Yin-Wong Cheung and Daniel Friedman. Individual learning in normal form games: Some

laboratory results. Games and Economic Behavior, 19:46–76, 1997.

Ido Erev and Alvin E. Roth. Predicting how people play games: Reinforcement learning in

experimental games with unique, mixed strategy equilibria. American Economic Review,

88(4):848–881, 1998.

Ernst Fehr and Klaus M. Schmidt. A theory of fairness, competition, and cooperation. The

Quarterly Journal of Economics, 114(3):817–868, 1999.

Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press,

Cambridge, MA, 1998.

Nobuyuki Hanaki, Rajiv Sethi, Ido Erev, and Alexander Peterhansl. Learning strategy.

Journal of Economic Behavior and Organization, 2003. Forthcoming.

Richard D. McKelvey and Thomas R. Palfrey. Playing in the dark: Information, learning,

and coordination in repeated games. Mimeo, California Institute of Technology, 2001.


John H. Miller. The coevolution of automata in the repeated prisoner’s dilemma. Journal

of Economic Behavior and Organization, 29(1):87–112, 1996.

Dilip Mookherjee and Barry Sopher. Learning and decision costs in experimental constant

sum games. Games and Economic Behavior, 19:97–132, 1997.

Martin J. Osborne and Ariel Rubinstein. A Course in Game Theory. The MIT Press,

Cambridge, MA, 1994.

Matthew Rabin. Incorporating fairness into game theory and economics. American Economic

Review, 83(5):1281–1302, 1993.

Jorgen W. Weibull. Evolutionary Game Theory. MIT Press, Cambridge, MA, 1995.
