
IEEE TRANSACTIONS ON COMPUTATIONAL INTELLIGENCE AND AI IN GAMES, VOL. 6, NO. 1, MARCH 2014

General Self-Motivation and Strategy Identification: Case Studies Based on Sokoban and Pac-Man

    Tom Anthony, Daniel Polani, and Chrystopher L. Nehaniv

Abstract: In this paper, we use empowerment, a recently introduced biologically inspired measure, to allow an AI player to assign utility values to potential future states within a previously unencountered game without requiring explicit specification of goal states. We further introduce strategic affinity, a method of grouping action sequences together to form strategies, by examining the overlap in the sets of potential future states following each such action sequence. We also demonstrate an information-theoretic method of predicting future utility. Combining these methods, we extend empowerment to soft-horizon empowerment, which enables the player to select a repertoire of action sequences that aim to maintain anticipated utility. We

show how this method provides a proto-heuristic for nonterminal states prior to specifying concrete game goals, and propose it as a principled candidate model for intuitive strategy selection, in line with other recent work on self-motivated agent behavior. We demonstrate that the technique, despite being generically defined independently of scenario, performs quite well in relatively disparate scenarios, such as a Sokoban-inspired box-pushing scenario and a Pac-Man-inspired predator game, suggesting novel and principle-based candidate routes toward more general game-playing algorithms.

Index Terms: Artificial intelligence (AI), games, information theory.

I. INTRODUCTION

"Act always so as to increase the number of choices." (Heinz von Foerster)

    A. Motivation

IN MANY GAMES, including some still largely inaccessible to computer techniques, there exists, for many states of that game, a subset of actions that can be considered preferable by default. Sometimes, it is easy to identify these actions, but for many more complex games, it can be extremely difficult. While in games such as Chess algorithmic descriptions of the quality of a situation have led to powerful computer strategies, the task of capturing the intuitive concept of the beauty of a position, often believed to guide human master players, remains

Manuscript received July 15, 2012; revised February 12, 2013; accepted September 02, 2013. Date of publication December 18, 2013; date of current version March 13, 2014.

The authors are with the Adaptive Systems Research Group, School of Computer Science, University of Hertfordshire, Hatfield, Herts AL10 9AB, U.K. (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCIAIG.2013.2295372

elusive [1]. One is unable to provide precise rules for a beauty heuristic, which would need to tally with the ability of master Chess players to appreciate the structural aspects of a game position, and from this identify important states and moves.

While there exist exceedingly successful algorithmic solutions for some games, much of their success derives from a combination of computing power with explicitly human-designed heuristics. In this unsatisfactory situation, the core challenge for AI remains: Can we produce algorithms able to identify relevant structural patterns in a more general way, which would apply to

a broader collection of games and puzzles? Can we create an AI player motivated to identify these structures itself? Game tree search algorithms were proposed to identify good actions or moves for a given state [2]–[4]. However, it has since been felt that tree search algorithms, with all their practical successes, make a limited contribution in moving us toward intelligence that could be interpreted as plausible from the point of view of humanlike cognition; by using brute-force computation, the algorithms sidestep the necessity of identifying how beauty and related structural cues would be detected (or constructed) by an artificial agent. John McCarthy predicted this shortcoming could be overcome by brute force for Chess but not yet for Go¹

and criticized the fact that with Chess the solutions were simply substituting large amounts of computation for understanding [5]. Recently, games such as Arimaa were created to challenge these shortcomings [6] and provoke research to find alternative methods. Arimaa is a game played with Chess pieces, on a Chess board, and with simple rules, but normally a player with only a few games' worth of experience can beat the best computer AIs.

At a conceptual level, tree search algorithms generally rely on searching exhaustively to a certain depth. While with various optimizations the search will not, in reality, be exhaustive, the approach is unlikely to be mimicking a human approach. Furthermore, at the leaf nodes of such a search, the state is usually evaluated with heuristics handcrafted by the AI designer for the specific game or problem.

These approaches do not indicate how higher level concepts might be extracted from the simple rules of the game, or how structured strategies might be identified by a human. For example, given a Chess position, a human might consider two strategies at a given moment (e.g., attack the opponent's queen or defend my king) before considering which moves, in particular, to use to enact the chosen strategy. Tree search approaches do not operate on a level which either presupposes or provides conceptual game structures (the human-made AI heuristics may,

¹Recent progress in Go-playing AI may render McCarthy's pessimistic prediction concerning performance moot, but the qualitative criticism stands.


of course, incorporate them, but only as an explicit proviso by the human AI designer).

More recently, Monte Carlo tree search (MCTS) algorithms [7] have been developed, which overcome a number of the limitations of the more traditional tree search approaches.

MCTS algorithms represent an important breakthrough in themselves, and lead us to a better understanding of tree searching. However, while MCTS has significantly extended the potential ability of tree search algorithms, it remains limited by similar conceptual constraints as previous tree search methods.

In this paper, we propose a model that we suggest is more cognitively plausible, and yet also provides first steps toward novel methods which could help address the weaknesses of tree search (and may be used alongside it). The methods we present have arisen from a different line of thought than MCTS and tree search in general. As this is, to our knowledge, the first application of this train of thought to games, at this early stage it is not intended to outcompete state-of-the-art approaches in terms of performance, but rather to develop qualitatively different, alternative approaches which, with additional research, may help to improve our understanding of and approach to game-playing AIs.

The model we propose stems from cognitive and biological considerations, and, for this purpose, we adopt the perspective of intelligence arising from situatedness and embodiment [8] and view the AI player as an agent that is embodied within an environment [9], [10]. The agent's actuator options will correspond to the legal moves within the game, and its sensors reflect the state of the game (those parts available to that player according to the relevant rules).

Furthermore, we create an incentive toward structured decisions by imposing a cost on the search/decision process; this is closely related to the concept of bounded rationality [11], [12], which deals with decision making under limited information, cognitive capacity, and time, and is used as a model of human decision making in economics [13]. As natural cost functionals for decision processes, we use information-theoretic quantities; there is a significant body of evidence that such quantities not only have a prominent role in learning theory [14], [15], but also that various aspects of biological cognition can be successfully described and understood by assuming informational processing costs imposed on organisms [16]–[22].

Thus, our adoption of an information-theoretic framework in the context of decisions in games is plausible not only from a learning- and decision-theoretic point of view, but also from the perspective of a biologically oriented, high-level view of cognition, where payoffs conferred by a decision must be traded off against the informational effort of achieving them.

Here, more specifically, we combine this thinking on informational constraints with empowerment [23], [24]: another information-theoretic concept generalizing the notion of mobility [25], or options available to an agent in its environment. Empowerment can be intuitively thought of as a measure of how many observable changes an embodied agent, starting from its current state, can make to its environment via its subsequent actions. Essentially, it is a measure of mobility that is generalized, as it can directly incorporate randomness

as well as incomplete information without any changes to the formalism. If noise causes actions to produce less controllable results, this is detected via a lower empowerment value.

This allows one to treat stochastic systems, systems with incomplete information, dynamical systems, games of complete information, and other systems in essentially the same coherent way [26].

In game terms, the above technique could be thought of as a type of proto-heuristic that transcends specific game dynamics and works as a default strategy to be applied before the game-specific mechanics are refined. This could prove useful either independently or as heuristics from genesis, which could be used to guide an AI player's behavior in a new game while game-specific heuristics were developed during play. In this paper, we do not go as far as exploring the idea of building game-specific heuristics on top of the proto-heuristics, but focus on deploying the method to generate useful behavior primitives. We demonstrate the operation of proto-heuristics in two game scenarios and show that intuitively sensible behaviors are selected.

    B. Information Theory

To develop the method, we require Shannon's theory of information, for which we give a very basic introduction. To begin, we introduce entropy, which is a measure of uncertainty; the entropy of a random variable X is defined as

H(X) = -\sum_{x} p(x) \log p(x)    (1)

where p(x) is the probability that X is in the state x. The logarithm can be taken to any chosen base; in this paper, we always use 2, and the entropy is thus measured in bits. If Y is another random variable jointly distributed with X, the conditional entropy is

H(Y \mid X) = -\sum_{x} p(x) \sum_{y} p(y \mid x) \log p(y \mid x)    (2)

This measures the remaining uncertainty about the value of Y if we know the value of X. This also allows us to measure the mutual information between two random variables

I(X; Y) = H(Y) - H(Y \mid X)    (3)

Mutual information can be thought of as the reduction in uncertainty about one random variable, given that we know the value of the other. In this paper, we will also examine the mutual information between a particular value x of one random variable and another random variable Y

I(x; Y) = \sum_{y} p(y \mid x) \log \frac{p(y \mid x)}{p(y)}    (4)

This can be thought of as the contribution by a specific action to the total mutual information, and will be useful for selecting a subset of actions that maximizes mutual information.
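To make the preceding definitions concrete, the following minimal Python sketch (not part of the original paper; the function names and the toy channel are illustrative) computes (1)–(4) numerically:

    import numpy as np

    def entropy(p_x):
        """H(X) in bits, per (1); p_x is a 1-D array summing to 1."""
        p = p_x[p_x > 0]
        return -np.sum(p * np.log2(p))

    def conditional_entropy(p_x, p_y_given_x):
        """H(Y|X) in bits, per (2); p_y_given_x[x, y] = p(y|x)."""
        h = 0.0
        for x, px in enumerate(p_x):
            py_x = p_y_given_x[x]
            py_x = py_x[py_x > 0]
            h -= px * np.sum(py_x * np.log2(py_x))
        return h

    def mutual_information(p_x, p_y_given_x):
        """I(X;Y) = H(Y) - H(Y|X), per (3)."""
        p_y = p_x @ p_y_given_x
        return entropy(p_y) - conditional_entropy(p_x, p_y_given_x)

    def pointwise_contribution(x, p_x, p_y_given_x):
        """I(x;Y) for a single value x, per (4)."""
        p_y = p_x @ p_y_given_x
        mask = p_y_given_x[x] > 0
        return np.sum(p_y_given_x[x, mask] * np.log2(p_y_given_x[x, mask] / p_y[mask]))

    # Toy check: a noiseless binary channel with uniform input gives 1 bit.
    p_x = np.array([0.5, 0.5])
    p_y_given_x = np.eye(2)
    print(mutual_information(p_x, p_y_given_x))        # 1.0
    print(pointwise_contribution(0, p_x, p_y_given_x)) # 1.0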

  • 8/10/2019 General Self-Motivation and Strategy Identification Case Studies Based

    3/17

    ANTHONY et al. : GENERAL SELF-MOTIVATION AND STRATEGY IDENTIFICATION: CASE STUDIES BASED ON Sokoba n AND Pac-Man 3

Fig. 1. Bayesian network representation of the perception–action loop.

Finally, we introduce the information-theoretic concept of the channel capacity [27]. It is defined as

C(p(y \mid x)) = \max_{p(x)} I(X; Y)    (5)

Channel capacity is measured as the maximum mutual information taken over all possible input (action) distributions p(x), and depends only on p(y|x), which is fixed for a given starting state. This corresponds to the potential maximum amount of information about its prior actions that an agent can later observe. One algorithm that can be used to find this maximum is the iterative Blahut–Arimoto algorithm [28].
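The following is a compact, illustrative Python sketch of the Blahut–Arimoto iteration for (5) over a finite channel; it is an assumption-laden reimplementation, not the authors' code:

    import numpy as np

    def _info_per_input(p_x, p_y_given_x):
        """D(x) = KL( p(y|x) || p(y) ) in bits, for every input x."""
        p_y = p_x @ p_y_given_x
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(p_y_given_x > 0, np.log2(p_y_given_x / p_y), 0.0)
        return np.sum(p_y_given_x * log_ratio, axis=1)

    def blahut_arimoto(p_y_given_x, iters=500, tol=1e-10):
        """Channel capacity max_{p(x)} I(X;Y) in bits, per (5); p_y_given_x[x, y] = p(y|x)."""
        n_x = p_y_given_x.shape[0]
        p_x = np.full(n_x, 1.0 / n_x)              # start from a uniform input distribution
        for _ in range(iters):
            d = _info_per_input(p_x, p_y_given_x)
            new_p_x = p_x * np.exp2(d)             # multiplicative Blahut-Arimoto update
            new_p_x /= new_p_x.sum()
            if np.max(np.abs(new_p_x - p_x)) < tol:
                p_x = new_p_x
                break
            p_x = new_p_x
        capacity = float(np.sum(p_x * _info_per_input(p_x, p_y_given_x)))
        return capacity, p_x

    # Binary symmetric channel with 10% crossover: capacity is about 0.531 bits.
    bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
    print(blahut_arimoto(bsc)[0])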

    C. Empowerment

Empowerment, based on the information-theoretic perception–action loop formalism introduced in [24] and [29], is a quantity characterizing the sensorimotor adaptedness of an agent in its environment. It quantifies the ability of situated agents to influence their environments via their actions.

For the purposes of puzzle solving and gameplay, we translate the setting as follows: Consider the player carrying out a move as a sender sending a message, and observing the subsequent state on the board as receiving the response to this message. In terms of Shannon information, when the agent performs an action, it injects information into the environment, and, subsequently, the agent reacquires part of this information from the environment via its sensors. Note that, in this paper, we discuss only puzzles and games with perfect information, but the formalism carries over directly to the case of imperfect-information games.

For the scenarios relevant in this paper, we will employ a slightly simplified version of the empowerment formalism. The player (agent) is represented by a Bayesian network, shown in Fig. 1, with the random variable S_t denoting the state of the game (as per the player's sensors), and the random variable A_t denoting the action at time t.

As mentioned above, we consider the communication channel formed by the action–state pair (A_t, S_{t+1}) and compute the channel capacity, i.e., the maximum possible Shannon information that one action can inject or store into the subsequent state S_{t+1}. We define empowerment as this motor–sensor channel capacity

E := C(p(s_{t+1} \mid a_t)) = \max_{p(a_t)} I(A_t; S_{t+1})    (6)

If we consider the game as homogeneous in time, we can, for simplicity, ignore the time index, and empowerment then only depends on the actual state s.

Instead of a single action, it often makes sense to consider an action sequence of length n and its effect on the state. In this case, which we will use throughout most of the paper, we will speak about n-step empowerment. Formally, we first construct a compound random variable of the next n actuations, A_t^n = (A_t, A_{t+1}, ..., A_{t+n-1}). We now maximize the mutual information between this variable and the state at time t+n, represented by S_{t+n}. The n-step empowerment is the channel capacity between these

E^n := C(p(s_{t+n} \mid a_t^n)) = \max_{p(a_t^n)} I(A_t^n; S_{t+n})    (7)

It should be noted that empowerment depends on s_t (the current state of the world) but, to keep the notation unburdened, we will always implicitly assume conditioning on the current state s_t and not explicitly write it.
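As an illustration of (7), the sketch below enumerates n-step action sequences in a small, hypothetical deterministic gridworld. In the noiseless case the channel p(s_{t+n}|a_t^n) is 0/1-valued, so its capacity reduces to the logarithm of the number of distinct reachable end states; a stochastic world would instead require running Blahut–Arimoto on the same channel matrix:

    import itertools
    import numpy as np

    # Illustrative world model: a 5x5 grid with a few hypothetical wall cells.
    ACTIONS = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}
    WIDTH, HEIGHT = 5, 5
    WALLS = {(2, 1), (2, 2), (2, 3)}

    def step(state, action):
        """Apply one action; moves into a wall or off the grid leave the state unchanged."""
        x, y = state
        dx, dy = ACTIONS[action]
        nxt = (x + dx, y + dy)
        if nxt in WALLS or not (0 <= nxt[0] < WIDTH and 0 <= nxt[1] < HEIGHT):
            return state
        return nxt

    def n_step_empowerment(state, n):
        """E^n(s) in bits for deterministic dynamics: log2 of the distinct end states."""
        end_states = set()
        for seq in itertools.product(ACTIONS, repeat=n):
            s = state
            for a in seq:
                s = step(s, a)
            end_states.add(s)
        return np.log2(len(end_states))

    print(n_step_empowerment((0, 0), 3))   # a corner cell
    print(n_step_empowerment((2, 0), 3))   # a cell next to the wall gap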

In this paper, we will present two extensions to the empowerment formalism which are of particular relevance for puzzles and games. The first, discussed in Section IV, is impoverished empowerment; it sets constraints on the number of action sequences an agent can retain, and was originally introduced in [30]. The second, presented in Section VI, introduces the concept of a soft horizon for empowerment, which allows an agent to use a hazy prediction of the future to inform action selection. Combined, these present a model of resource limitation on the actions that can be retained in memory by the player and correspond to formulating a bounded rationality constraint on empowerment fully inside the framework of information theory.

Before this paper, and [30], empowerment as a measure has solely been used as a utility applied to states, but, in this paper, we introduce the notion of how empowered an action can be. In this case, empowered corresponds to how much a particular action or action sequence contributes toward the empowerment

of a state.

Note that we use the Bayesian network formalism in its causal interpretation [31], as the action nodes have a well-defined interventional interpretation: the player can select its action freely. The model is a simpler version of a more generic Bayesian network model of the perception–action loop where the state of the game is not directly accessible, and only partially observable via sensors. The empowerment formalism generalizes naturally to this more general case of partial observability, and can be considered both in the case where the starting state is externally considered (objective empowerment landscape) or where it can only be internally observed (via context; i.e., distinguishing

states by observing sensor sequences for which empowerment values will differ; see [32] and [26]). In fact, the empowerment formalism could be applied without change to the more generic predictive state representation formalism (PSR; see [33]).²

Here, however, to develop the formalism for self-motivation and strategy identification, we do not burden ourselves with issues of context or state reconstruction, and we, therefore, concentrate on the case where the state is fully observable. Furthermore, we are not concerned here with learning the dynamics model itself, but assume that the model is given (which is, in the game case, typically true for one-player games, and for two-player games one can either use a given opponent model or use,

²Note that this relies on the actions in the entries of the system-dynamics matrix being interpreted interventionally (i.e., as freely choosable by the agent) in PSR.


3) The action policy generated by empowerment should identify that different states have different utilities. Naive mobility-like empowerment does not account for the fact that being able to reach some states can be more advantageous than being able to reach others.

In Section IV, we will address the first issue. As for issues 2) and 3), it turns out that they are very related to one another; they will be discussed further in Sections V and VI.

Finally, in Section VII, we will bring together all considerations and apply them to a selection of game scenarios.

IV. IMPOVERISHED EMPOWERMENT

In the spirit of bounded rationality outlined above, we modified the n-step empowerment algorithm to introduce a constraint on the bandwidth of action sequences that an agent could retain. We call this modified concept impoverished empowerment [30]. This allows us to identify possible favorable tradeoffs, where a large reduction in the bandwidth of action sequences has little impact on empowerment.

While, in the original empowerment definition, all possible action sequences leading to various states are considered, in impoverished empowerment, one considers only a strongly restricted set of action sequences. Therefore, we need to identify those action sequences which are most empowering, i.e., those contributing most to the agent's empowerment. How one action sequence can be more empowering than another is a function of the action sequence's stochasticity (does it usually get where it wanted to go?) and of whether other action sequences lead to the same state (are there other ways to get there?).

A. Scenario

To investigate the impoverished empowerment concept, we revisited the scenario from [24]: a player is situated within a 2-D infinite gridworld and can select one of four actions (north, south, east, and west) in any single time step. Each action moves the agent by one space into the corresponding cell, provided it is not occupied by a wall. The state of the world is completely determined by the position of the agent.

    B. Impoverished Empowerment Algorithm

This bandwidth reduction works by clustering the available action sequences together into a number of groups, from which a single representative action sequence is then selected. The selected action sequences then form a reduced set of action sequences, for which we can calculate the empowerment.

Stage 1: Compute the empowerment in the conventional way, obtaining an empowerment-maximizing probability distribution p(a_t^n) for all n-step action sequences (typically with n > 1).

Having calculated the empowerment, we have two distributions: p(a_t^n) is the capacity-achieving distribution of action sequences, and p(s_{t+n} | a_t^n) is the channel that represents the results of an agent's interactions with the environment. For conciseness, we will write a to represent an action sequence a_t^n.

Stage 2: In traditional empowerment computation, p(a) is retained for all n-step sequences a. Here, however, we assume

Fig. 2. Visualization of action sequence grouping using impoverished empowerment in an empty gridworld with two steps; each two-character combination (e.g., NW) indicates a two-step action sequence that leads to that cell from the center cell. Lighter lines represent the grid cells, and darker lines represent the groupings.

a bandwidth limitation on how many such action sequences can be retained. Instead of remembering p(a) for all action sequences a, we impoverish p(a), i.e., we are going to thin down the action sequences to the desired bandwidth limit.

To stay entirely in the information-theoretic framework, we employ the so-called information-bottleneck method [48], [49]. Here, one assumes that the probability p(s_{t+n} | a_t^n) is given, meaning one needs a model of the possible outcomes of a given action by a player in a given state. In single-player games, this is easily determined, whereas in multiplayer games, we need a model of the other players (we discuss this further in Section IX-A).

We start by setting our desired bandwidth limit by selecting a cardinality for a variable G with |G| < |A^n|; we now wish to find a distribution p(g|a), where each g in G represents a group of action sequences.

The information-bottleneck algorithm (see Appendix B) can be used to produce this mapping, using the original channel as an input. It acts to minimize I(A^n; G) while keeping I(G; S_{t+n}) constant; it can be thought of as squeezing the information A^n shares with S_{t+n} through the new variable G, so as to maximize the information G shares with S_{t+n} while discarding the irrelevant aspects. By setting a cardinality for G and then running the information-bottleneck algorithm, we obtain a conditional distribution p(g|a), which acts as a mapping of action sequences to groups.

The result of this is that action sequences that lead to identical states are usually clustered together into groups. However, if the number of groups is less than the number of observed states, then, beyond the identical-state action sequences, the grouping is arbitrary, as seen in Fig. 2. This is because there is nothing to imply any relation between states, be it spatial or otherwise; action sequences are only consistently grouped with others that lead to the same state.

Contrary to what might be expected, introducing noise into the environment actually improves the clustering of actions into those that are more similar (in this case, spatially). This is due to the possibility of being blown off course, meaning the agent sometimes ends up not in the expected outcome state but in a nearby one, which results in a slight overlap of outcome states between similar action sequences. However, it is clear that relying on noise for such a result is not ideal, and a better solution to this problem is introduced in Section VI.


Fig. 3. Typical behaviors where four action sequences were selected from the 4^6 = 4096 possibilities. The agent's starting location is shown in green, and its various final locations are shown in pink.

Stage 3: Because our aim is to select a subset of our original action sequences to form the new action policy for the agent, we must use an algorithm to decompose this conditional distribution p(g|a) into a new distribution of action sequences which has an entropy within the specified bandwidth limit.

We wish to maximize empowerment, so, for each group g, we select the action sequence which contributes the most toward our empowerment [i.e., the one with the highest value of I(a; S_{t+n})]. However, when selecting a representative action sequence for a given g, we must consider p(g|a) (i.e., does this action sequence truly represent this group?), so we weight by that; however, in most cases, the mapping between a and g is a hard partitioning, so this is not normally important. This results in collapsing groups to their dominant action sequence.

    C. Impoverishment Results

Fig. 3 shows three typical outcomes of this algorithm; in this example, we have a bandwidth constraint of two bits, corresponding to four action sequences, operating on sequences with a length of six actions; this is a reduction from 4^6 = 4096 action sequences down to four. The walls are represented in black, the starting position of the agent is the green center square, and the selected trajectories are represented by the thin arrowed lines, with a pink cell marking the end location of each sequence. The result that emerges consistently for different starting states is a set of skeleton action sequences extending into the state space around the agent. In particular, note that stepping through the doorway, which intuitively constitutes an environmental feature of particular salient interest, is very often found among the four action sequences.

Inspection reveals that a characteristic feature of the sequences surviving the impoverishment is that their end points usually all have a single unique sequence (of the available ones) that reaches them.

This can be understood by the following considerations. In order to maintain empowerment while reducing bandwidth, the most effective way is to eliminate equivalent actions first, since these waste action bandwidth without providing a richer set of end states. States reachable by only one action sequence are, therefore, advantageous to retain during impoverishment; in Fig. 3, the last action sequences the agent will retain are those leading to states that have only a single unique sequence that reaches them. This is a consequence of selecting the action sequence from each group by its individual empowerment contribution, and may or may not be desirable; however, in the soft-horizon empowerment model to follow, we will see that this effect disappears.

V. THE HORIZON

Identifying the correct value of n for n-step empowerment (i.e., the empowerment horizon depth) is critical for being able to make good use of empowerment in an unknown scenario. A value of n which is too small can mean that states are not correctly differentiated from one another, as some options lie beyond the agent's horizon. By contrast, a value of n which is too large (given the size of the world) can allow the agent to believe all states are equally empowered [30].

Furthermore, it is unlikely that a static value of n would be suitable in many nontrivial scenarios (where different parts of the scenario require different search horizons), and having a hard horizon compounds this.

    A. Softening the Horizon

We understand intuitively that, when planning ahead in a game, a human player does not employ a hard horizon, but instead probably examines some moves ahead precisely, and beyond that has a somewhat hazy prediction of likely outcomes.

In the soft-horizon empowerment model, we use a similar softening of the horizon, and demonstrate how it also helps identify relationships between action sequences, which allows the previously presented clustering process to operate more effectively. It allows us to group sets of action sequences together into alike sequences (determined by the overlap in their potential future states), with the resulting groups of action sequences representing strategies. This will later be shown to be useful for making complex puzzles easier to manage for agents. Furthermore, we will show that this horizon softening can help to estimate any ongoing utility we may have in future states, having followed an action sequence. We acknowledge that some future states may be more empowered than others (i.e., lead on to states with more empowerment).

VI. SOFT-HORIZON EMPOWERMENT

Soft-horizon empowerment is an extension of the impoverished empowerment model and provides two significant improvements: the clustering of action sequences into groups is enhanced such that the clusters formed represent strategies, and it allows an agent to roughly forecast future empowerment following an action sequence. We will show that these features also suggest a solution to having to predetermine the appropriate horizon value for a given scenario.

    A. Split the Horizon

Again, we designate a set of actions A from which an agent can select in any given time step. For convenience, we label the number of possible actions in any time step as |A|.

We form all possible action sequences of length n, representing all possible action trajectories a player could take in n time steps, such that we have |A|^n trajectories.

From here, we can imagine the set of possible states that the agent could arrive in, corresponding to the trajectory the agent took. It is likely that there are fewer than |A|^n unique states, because some trajectories are likely commutative, the game world is Markovian, and also because the world is possibly stochastic; but, for now, we proceed on the assumption that |A|^n trajectories lead to |A|^n states (we show how to optimize this in Section VI-D; see also [30]).


Fig. 6. End states of the four action sequences selected to represent the strategies seen in Fig. 4; each is distant from the walls to improve its individual ongoing empowerment.

channel which provides a capacity-achieving distribution of action sequences p(a_t^{n+m}). We now break this channel up into separate channels based on all those sequences whose first n steps a_t^n are identical, i.e., one channel for each case in which the first n steps are identical. We now have a set of channels, corresponding to each set of (n+m)-step sequences that stem from a common n-step ancestor sequence. For each of these subchannels, we sum the mutual information of each sequence in this subchannel with S_{t+n+m}, using the capacity-achieving distribution of action sequences calculated above. More formally,

\hat{E}(a_t^n) = \sum_{\hat{a} \in \mathcal{A}(a_t^n)} I(\hat{a}; S_{t+n+m})

where \mathcal{A}(a_t^n) denotes the set of (n+m)-step sequences whose first n actions coincide with a_t^n, and I(\hat{a}; S_{t+n+m}) is the per-sequence contribution defined in (4). This gives us an approximation of the (n+m)-step empowerment for each sequence of n steps; we

can now select, from within each strategy group, those that are most empowered.

As before, we must weight this by each sequence's likelihood of mapping to that group (i.e., using the highest value of p(g|a) for the given g), although once again these mappings are usually deterministic, so this is unnecessary.
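The following toy sketch illustrates the folding-back estimate: the contribution (4) of every (n+m)-step sequence is summed within the subchannel sharing the same n-step prefix; the prefixes and the tiny channel are hypothetical:

    import numpy as np

    def per_sequence_info(p_seq, p_s_given_seq):
        """I(a; S) contribution of each full (n+m)-step sequence, as in (4)."""
        p_s = p_seq @ p_s_given_seq
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ratio = np.where(p_s_given_seq > 0, np.log2(p_s_given_seq / p_s), 0.0)
        return np.sum(p_s_given_seq * log_ratio, axis=1)

    def forecast_empowerment(prefixes, p_seq, p_s_given_seq):
        """Sum the contributions of all (n+m)-step sequences sharing the same first
        n steps, giving an estimate of the ongoing utility of each n-step prefix."""
        contrib = per_sequence_info(p_seq, p_s_given_seq)
        estimate = {}
        for prefix, c in zip(prefixes, contrib):
            estimate[prefix] = estimate.get(prefix, 0.0) + c
        return estimate

    # Hypothetical toy: two 1-step prefixes ("L", "R"), each extended by two continuations.
    prefixes = ["L", "L", "R", "R"]
    p_seq = np.full(4, 0.25)                        # capacity-achieving distribution (assumed)
    p_s_given_seq = np.array([[1, 0, 0],            # both "L" continuations collapse to one state
                              [1, 0, 0],
                              [0, 1, 0],            # the "R" continuations reach distinct states
                              [0, 0, 1]], dtype=float)
    print(forecast_empowerment(prefixes, p_seq, p_s_given_seq))  # "R" is forecast more empowered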

For the mapping shown in Fig. 4, this leads to the selection of action sequences with the end states shown in Fig. 6.

It can be seen that this results in collapsing strategies to the action sequences which are forecast to lead to the most empowered states. Without any explicit goal or reward, we are able to identify action sequences which represent different strategies, and that are forecast to have future utility. The complete soft-horizon empowerment algorithm is presented in Appendix A.

    D. Single-Step Iterations to Approximate Full Empowerment

One major problem of the traditional empowerment computation is the necessity to calculate the channel capacity over the whole set of n-step action sequences. Their number grows exponentially with n, and this computation thus becomes infeasible for large n.

In [30], we, therefore, introduced a model whereby we iteratively extend the empowerment horizon by one step, followed by an impoverishment phase that restricts the number of retained action sequences to be equal to the number of observed states. Colloquially, this is summed up as "only remember one action sequence to reach each state." This is usually sufficient to ensure the player retains full empowerment while significantly improving the computational complexity. With this optimization, the computational bounds on empowerment grow with the number of states (usually linear) instead of with the number of action sequences (usually exponential). Space restrictions prevent us from including the algorithm here, but we have used it for the n-step phase of the empowerment calculations.
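A minimal sketch of this extend-and-impoverish iteration for a deterministic world, keeping one representative sequence per reachable state after each one-step extension; the one-dimensional corridor used here is an illustrative stand-in for the paper's gridworlds:

    import math

    def step(state, action):
        """Hypothetical 1-D corridor of seven cells; walls clamp the movement."""
        return min(6, max(0, state + {"W": -1, "E": +1}[action]))

    def reachable_with_one_sequence_each(start, actions, n):
        kept = {start: ()}                       # state -> one representative sequence
        for _ in range(n):
            extended = {}
            for state, seq in kept.items():
                for a in actions:
                    extended.setdefault(step(state, a), seq + (a,))  # one sequence per state
            kept = extended
        return kept

    kept = reachable_with_one_sequence_each(3, ("W", "E"), n=4)
    print(len(kept), math.log2(len(kept)))       # end states kept and the resulting bits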

Fig. 7. Sokoban-inspired gridworld scenario. Green represents the player, and purple represents the pushable box, which is blocking the door.

    E. Alternative Second Horizon Method

An alternative approach, not explored in this paper, to using the second horizon to predict the future empowerment is to use an equidistribution of actions over the m steps forming the second horizon, as opposed to calculating the channel capacity (see the corresponding step of the algorithm in Appendix A). This approximation is algorithmically cheaper, but usually at the expense of a less accurate forecast of future empowerment, as well as a less consistent identification of strategies. However, it may prove to be one path to optimizing the performance of soft-horizon empowerment.

VII. GAME SCENARIOS AND RESULTS

    A. Sokoban

Many puzzle games concern themselves with arranging objects in a small space to clear a path toward a good configuration. Strategy games are often concerned with route finding and similar such algorithms, and heuristics for these often have to be crafted carefully for a particular game's dynamics.

As an example of such games, we examine a simplified box-pushing scenario inspired by Sokoban. In the original incarnation, each level has a variety of boxes which need to be pushed (never pulled) by the player into some designated configuration; when this is done, the player completes the level and progresses to the next. Sokoban has received some attention for the planning problems it introduces [50], [51], and most pertinent approaches to it are explicitly search based and tuned toward the particular problem.

We change the original Sokoban problem insofar as, in our scenario, there are no target positions for the boxes and, in fact, there is no goal or target at all. As stated, we postulate that a critical element for general game playing is self-motivation that precedes the concretization of tasks, and thus we adapted the game accordingly.

Fig. 7 shows the basic scenario, with a player and a pushable box. Similarly to the earlier gridworlds, the player can move north, south, east, and west in any time step (there is no stand-still option). Should the player move into a neighboring cell occupied by the box, then the box is pushed in the same direction into the next cell; should the destination cell for the box be blocked, then neither the player nor the box moves, and the time step passes without any change in the world.
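A minimal sketch of these box-pushing dynamics; the coordinates and the wall set are illustrative and not the level of Fig. 7:

    ACTIONS = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}

    def sokoban_step(player, box, walls, action):
        """One time step: the player moves; if it walks into the box, the box is pushed,
        unless the box's destination is blocked, in which case nothing moves."""
        dx, dy = ACTIONS[action]
        target = (player[0] + dx, player[1] + dy)
        if target in walls:
            return player, box                  # walked into a wall: no change
        if target == box:
            box_target = (box[0] + dx, box[1] + dy)
            if box_target in walls:
                return player, box              # box is blocked: nothing moves
            return target, box_target           # box pushed one cell
        return target, box                      # ordinary move

    # Hypothetical usage: push the box north through a gap between two wall cells.
    walls = {(0, 1), (2, 1)}
    player, box = (1, 2), (1, 1)
    player, box = sokoban_step(player, box, walls, "N")
    print(player, box)                          # (1, 1) (1, 0)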

Our intuition would be that most human players, if presented with this scenario and given four time steps to perform a sequence of actions, on the understanding that an unknown task will follow in the subsequent time steps, would first consider moves that shift the box away from blocking the doorway. Humans, we believe, would understand instinctively from observing the setup of the environment that the box blocks us from


Fig. 8. Distribution of final game states following the four-step action sequences selected by an AI player constrained to two action sequences, aggregated from 1000 runs. The pink cells indicate the player's final position: (a) 48.1%; (b) 47.2%; (c) 2.0%; (d) 2.7%; (e) 92.7%; (f) 4.4%; (g) 1.4%; and (h) 1.5%.

Fig. 9. Second Sokoban scenario, with multiple boxes in purple and the player's starting position in green. Solutions shown represent the fullest possible recovery of the three boxes, along with example action sequences for recovery. Being one step short of these solutions is considered partial recovery (not shown). (a) Scenario. (b) Box 1. (c) Box 2. (d) Box 3.

the other room, and thus moving it gives us more options in terms of places (states) we can reach.

However, it is not obvious how to enable AI players, without explicit goals and with no hand-coded knowledge of their environment, to perform such basic tasks as making this identification. Here, we approach this task with the fully generic soft-horizon empowerment concept. We use the following parameters: four-step action sequences (n = 4) with a short second horizon, and a bandwidth constraint limiting the player to selecting two action sequences from among the 4^4 = 256 possible. This constraint was originally selected to see if the AI player would identify both choices for clearing a pathway to the door, and to allow for a possible contrast in strategies.

In Fig. 8, we can see the distribution of final states that were reached. We represent the final states rather than the selected action sequences, as there are various paths the player can take to reach the same state.

In Fig. 8(a) and (b), we can see the cases that result the majority of the time (95.3%); in one case, the box is pushed to the left or right of the door and, in the other case, the box is pushed into the doorway. The variants in Fig. 8(c) and (d) are almost identical, except that the player has moved one cell away from the box.

We can see that two clear strategies emerged: one to clear the path through the door for the player, and a second to push the box through the door (blocking the player from using the doorway). The two options for clearing a path to the doorway (box pushed left or right of the door) are clustered as being part of the same strategy.

However, it is clear that the types of strategy that can arise, and the ways that action sequences are clustered together, are dependent upon the horizon. If we revisit the same scenario but now adjust the second horizon to be longer, setting it to five steps, then we can see the results change.

The second row of Fig. 8 shows the altered results, and in Fig. 8(e), we can see that there is a single set of result states that now forms the majority of the results (92.7%). We can see that clearing the box to the left or right of the door is no longer part of the strategy; now that the player has a horizon of five steps, it prefers to retain the option of being able to move the box to either the left or the right of the door, rather than committing to one side from the outset. Inspection reveals that occupying the cell below the (untouched) box provides the player with an empowerment of 5.25 bits rather than the 4.95 bits that would be achieved by clearing the door (in either direction) immediately. The choices that clear the door continue to be clustered together, but the scenario is now represented by the higher empowerment option that emerges with the increased horizon.

Fig. 9 shows a more complex scenario, with multiple boxes

in the world, all trapped in a simple puzzle. Again, there is no explicit goal; for each of the three boxes, there exists a single unique trajectory that will recover the box from the puzzle without leaving it permanently trapped. Note that trapping any box immediately reduces the player's ability to control its environment and costs it some degrees of freedom (in the state space) afforded by being able to move the box.

The scenario is designed to present multiple intuitive goals, which are attainable only via a very sparse set of action sequences. With a horizon of 14 steps, there are 268 million possible action sequences (leading to 229 states), of which 13 fully retrieve a box. Note that box 2 (top right) cannot be fully


retrieved (passed through the doorway) within 14 steps, and that box 1 (top left) is the only box that could be returned to its starting position by the player.

Table I shows the results when allowing the AI player to select four action sequences of 14 steps with an extended horizon of 5 steps, averaged over 100 runs. An explicit count of the different paths to the doorway of each box's puzzle room reveals that there are only six action sequences that fully retrieve the box to the main room for each of the top two boxes, and only one for the bottom box.

The results indicate the ability of soft-horizon empowerment to discover actions that lead to improved future empowerment. Furthermore, in every run, all three boxes were moved in some way, with 35% of cases resulting in all three boxes being retrieved and 58% in at least two boxes being retrieved, indicating that the division of the action sequences into strategies is a helpful mechanism toward intuitive goal identification.

B. Pac-Man-Inspired Predator–Prey Maze Scenario

Pac-Man and its variants have been studied previously, including using a tree search approach [52], but the aim of the current paper is not to attempt to achieve the performance of these methods, but rather to demonstrate that, notwithstanding their genericness, self-motivation concepts such as soft-horizon empowerment are capable of identifying sensible goals of operation in these scenarios on their own, and of using these to perform the scenario tasks to a good level.

Thus, the final scenario we present is a simplified predator–prey game based on a simplified Pac-Man model; rather than having pills to collect, or any score, the game is reduced to having a set of ghosts that hunt the player and kill him should they catch him. The score, which we will measure, is given simply by the number of time steps that the player survives before he is killed; however, it is important to note that our algorithm will not be given an explicit goal; rather, the implicit aim is simply survival. If humans play the game a few times, it is plausible to assume (and we would also claim some anecdotal evidence for that) that they will quickly decide that survival is the goal of the game without being told. Choosing survival as your strategy is a perfectly natural decision: assuming no further knowledge/constraints beyond the game dynamics, and a single-player game, anything that may or may not happen later has your survival in the present as its precondition.

In the original Pac-Man game, each ghost used a unique strategy (to add variation and improve the gameplay), and they were not designed to be ruthlessly efficient; the ghosts in our scenario are far more efficient and all use the same algorithm. Here, in each time step, the player makes a move (there is no "do nothing" action, but he can indirectly achieve it by moving toward a neighboring wall), and then the ghosts, in turn, calculate the shortest path to his new location and move. Should multiple routes have the same distance, then the ghosts randomly decide between them. They penalize a route which has another ghost already on it by adding extra steps to that route; setting this penalty to zero results in the ghosts dumbly following one another in a chain, which is easy for the player. Increasing the value makes the ghosts swarm the player more efficiently. For the present results, we use a penalty value which is a good compromise

TABLE I. Percentage of runs in which each of the boxes was recovered. We can see the importance of a long enough horizon; box 2 (which cannot be retrieved completely from the room) is recovered less often than the other boxes.

Fig. 10. Pac-Man-inspired scenario, showing the three possible starting positions of the player (yellow) in the center, and the starting positions of each of the three ghosts (red).

Fig. 11. Player's score for different parameter sets, averaged over three different starting positions for the player, and 100 games per data point. It highlights that, with no second horizon (m = 0), performance does not improve as the first horizon n increases.

between ghost efficiency and giving the player sufficient chance to survive long enough to allow different values for n and m to differentiate.
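The sketch below illustrates a ghost policy of this kind: a breadth-first shortest-path distance to the player, a penalty added when the candidate step lands on a cell occupied by another ghost (a simplification of penalizing whole routes), and uniform tie-breaking; the maze encoding and the penalty value are illustrative assumptions:

    import random
    from collections import deque

    def bfs_distance(open_cells, start, goal):
        """Shortest path length over open cells, or a large number if unreachable."""
        if start == goal:
            return 0
        seen, queue = {start}, deque([(start, 0)])
        while queue:
            (x, y), d = queue.popleft()
            for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                if nxt == goal:
                    return d + 1
                if nxt in open_cells and nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, d + 1))
        return 10**6

    def ghost_move(ghost, other_ghosts, player, open_cells, penalty):
        """Pick the neighboring cell with the lowest (penalized) distance to the player,
        breaking ties uniformly at random."""
        best, best_cost = [], None
        for nxt in ((ghost[0] + 1, ghost[1]), (ghost[0] - 1, ghost[1]),
                    (ghost[0], ghost[1] + 1), (ghost[0], ghost[1] - 1)):
            if nxt not in open_cells and nxt != player:
                continue
            cost = bfs_distance(open_cells | {player}, nxt, player)
            if nxt in other_ghosts:
                cost += penalty                # discourage piling onto another ghost
            if best_cost is None or cost < best_cost:
                best, best_cost = [nxt], cost
            elif cost == best_cost:
                best.append(nxt)
        return random.choice(best) if best else ghost

    # Tiny open 5x3 arena; the penalty value here is arbitrary, chosen for illustration.
    open_cells = {(x, y) for x in range(5) for y in range(3)}
    print(ghost_move((0, 0), other_ghosts={(1, 0)}, player=(4, 2),
                     open_cells=open_cells, penalty=4))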

The maze setup we used is shown in Fig. 10, where the locations of the three ghosts can be seen. Having only three ghosts is another compromise, for the same reasons as above; using four ghosts usually resulted in the player not surviving long enough to get meaningful variance in the results generated with different parameter sets.

The player has a model of the ghosts' algorithm and, thus, can predict their paths with some accuracy, and is allowed four samples of their possible future positions (which are stochastic, given the possibility that for one or more ghosts the path lengths coincide) for a given player move. Once no equal routes are present, one sample represents perfect information; but once one or more ghosts have one or more equal-length paths, the sampling becomes less accurate and may lose information about the possible future ghost moves.

The game begins with the player's first move, and continues until he is caught by any of the ghosts; at this point, the player is dead and is no longer allowed to move. However, there is no special death state in our model; once caught, the player is


Fig. 13. Gridworld scenario. The green cell represents the starting position of the player; the player starts atop a wall.

TABLE II. Distribution of action sequences selected by each method (each over 1000 runs). Action sequences leading to the same state have been grouped. No noise.

tiple actions with equally maximal expected mobility, pick a subset randomly).

This results in a set of n-step action sequences that have an expected mobility over m steps, in a similar fashion to soft-horizon empowerment, and, like empowerment, this method can operate in stochastic scenarios.

    A. Gridworld Scenario

The gridworld scenario we present here, shown in Fig. 13, operates identically to the gridworld scenarios earlier, but in this instance, there is no box for the agent to manipulate. The player starts on a wall along which it cannot move; the player can move east or west and fall into the room on either side, but it cannot then return to the wall. While on the wall, the player can effectively stay still by trying to move north or south.

We ran both soft-horizon empowerment and the greedy mobility algorithm with n = 2, an extended second horizon, and a target action-sequence cardinality of 1, such that the output of each algorithm would be a two-step action sequence selected to maximize empowerment or mobility, respectively. For each method, 1000 runs were performed, and the results are shown in Table II.

All of the actions selected by the greedy mobility algorithm have an expected mobility of nine moves, and all those moves also lead to a state where the empowerment is 3.17 bits. However, due to the way in which the soft-horizon empowerment algorithm forecasts future empowerment (in the second horizon), it favors moving away from the wall.

We now introduce some noise into the environment, such that in the eastern room there is a 50% chance that, for any run, all directions are rotated (N→S, S→E, E→W, W→N). Again, 1000 runs were performed, and the results are shown in Table III.

The change can be seen clearly; empowerment immediately adapts and switches to always favoring the western room, where it has more control, whereas greedy mobility does not significantly change the distribution of actions it selects. This behavior is a critical advantage of empowerment over traditional mobility; the rotation of actions does nothing to decrease mobility, as each action within a specific run is deterministic. When beginning a new run, it is impossible to predict whether the actions

TABLE III. Distribution of action sequences selected by each method (each over 1000 runs). With noise in the eastern room.

would be inverted, so while there is no decrease in mobility, there is a decrease in predictability, which negatively impacts empowerment.

While empowerment can be intuitively thought of as a stochastic generalization of mobility, this is actually not exactly the case in many instances. It is possible to encounter stochasticity with no reduction in mobility, but stochasticity is reflected in empowerment due to its reduction of a player's control (over their own future).

IX. DISCUSSION

The presented soft-horizon empowerment method exhibits two powerful features, both of which require no hand-coded heuristic:

• the ability to assign sensible anticipated utility values to states where no task or goal has been explicitly specified;

• accounting for the strategic affinity between potential action sequences, as implicitly measured by the overlap in the distribution of their potential future states (naively, this can be thought of as how many states reachable from A are also reachable from B within the same horizon). This allows a player to select a set of action sequences that fit within a specific bandwidth limit while ensuring that they represent a diversity of strategies.

This clustering of action sequences allows strategic problems to be approached with a coarser grain; by grouping sets of actions together into a common strategy, different strategies can be explored without requiring that every possible action sequence be explored. We believe that such a grouping moves toward a cognitively more plausible perspective, which groups strategies a priori according to classes of strategic relevance rather than blindly evaluating an extremely large number of possible moves. Furthermore, by modifying the bandwidth, the concept of having strategies with differing granularities (i.e., "attack" versus "attack from the north" and "attack from the south," etc.) emerges; it has previously been shown that there is a strong urge to compress environments and tasks in such a way [53].

Before we go into a more detailed discussion of the approach in the context of games, however, some comments are required as to why a heuristic which is based only on the structure of a game, and does not take the ultimate game goal into account, can work at all. This is not obvious and seems, at first sight, to contradict the rich body of work on reward-based action selection (grounded in utility theory, reinforcement learning, etc.).


To resolve this apparent paradox, one should note that, for many games, the structure of the game rules already implicitly encodes partial aspects of the ultimate tasks to a significant degree (similarly to other tasks [54]). For instance, Pac-Man by its very nature is a survival game. Empowerment immediately reflects survival, as a dead player loses all empowerment.

    A. Application to Games

In the context of tree search, the ability to cluster action sequences into strategies introduces the opportunity to imbue game states and corresponding actions with a relatedness which derives from the intrinsic structure of the game and is not externally imposed by human analysis and introspection of the game.

The game tree could now be looked at from a higher level, where the branches represent strategies and the nodes represent groups of similar states. It is possible to foresee pruning a tree at the level of thus-determined strategies rather than individual actions, incurring massive efficiency gains. More importantly, however, these strategies emerge purely from the structure of the game rather than from externally imposed or assumed semantics.

In many games, it is reasonable to assume perfect knowledge of transitions in the game state given a move. However, note that the above model is fully robust to the introduction of probabilistic transitions, be it through noise, incomplete information, or simultaneous selection of moves by the opponent. The only precondition is the assumption that one can build a probabilistic model of the dynamics of the system. Such opponent or environment models can be learned adaptively [32], [35]. The quality of the model will determine the quality of the generated dynamics; however, we do not investigate this further here.

We illustrated the efficacy of this approach using two scenarios. Importantly, the algorithm was not specifically crafted to suit a particular scenario, but is generic and transfers directly to other examples.

In the Pac-Man-inspired scenario, we demonstrated that acting to maintain anticipated future empowerment is sufficient to provide a strong generic strategy for the player. More precisely, the player, without being set an explicit goal, made the natural decision to flee the ghosts. This behavior derives from the fact that empowerment is, by its very nature, a survival-type measure, with death being a zero-empowerment state. With the second horizon's forecast of the future, the player was able to use the essential basic empowerment principle to successfully evade capture for extended periods of time.

We presented several Sokoban-inspired scenarios; the first, smaller scenario presented a doorway that was blocked by a box, with human intuition identifying clearing the doorway as a sensible idea. We saw that soft-horizon empowerment identified clearing the doorway as a good approach for maximizing future utility, and also selected an alternative strategy of pushing the box through the doorway. It was interesting to see that soft-horizon empowerment identified clearing the door with the box to the left or to the right as part of the same strategy, as opposed to pushing the box through the door. This scenario also highlighted how the algorithm differentiates strategies based on its horizon limitations.

The second scenario presented three trapped boxes, each requiring a 14-step action sequence to retrieve from the trap. A human considering the problem could deduce the desirable target states for each box (freeing them so they could be moved into the main room). With a total of 268 × 10^6 possible action sequences to choose from, and lacking the a priori knowledge determining which states should be target states, the algorithm reliably selects a set of action sequences which includes an action sequence for retrieving each of the boxes. Not only are the target states identified as being important, but the possible action sequences to recover each of the different boxes are also identified as belonging to different strategies.

The importance of this result lies in the fact that, while, again, the approach used is fully generic, it nevertheless gives rise to distinct strategies which would also be preferred based on human inspection. This result is similarly important for the practical relevance of the approach. The above relevant solutions are found consistently, notwithstanding the quite considerable number and depth of possible action sequences. We suggest that this may shed additional light on how to construct cognitively plausible mechanisms which would allow AI agents to preselect candidates for viable medium-term strategies without requiring full exploration of the space.

The final Sokoban example introduced noise into the environment and compared empowerment to a simple mobility algorithm. It highlighted a distinct advantage of empowerment over mobility, in that empowerment identifies a reduction in control and is able to respond appropriately.

    B. Relation to Monte Carlo Tree Search

    The presented formalism could be thought of, in certain circumstances, as just dividing a transition table into two halves and using forecasts of probabilities of encountering states in the second half to differentiate those states in the first half and assign them some estimated utility. The information-theoretic approach makes this quantifiable and easily accessible to analysis. However, we believe the technique presented would work using other methodologies and could be combined with other techniques in the medium term. One important example of where it could be applied would be in conjunction with an MCTS approach, and we would like to discuss below how the formalism presented in this paper may provide pathways to address some weaknesses of MCTS.

    MCTS has been seen to struggle with being a global search in problems with a lot of local structure [55]. An example of this is a weakness seen in the Go program Fuego, which has been identified as having territorially weak play because of this problem [56]. A method which clusters action sequences into strategies, where the strategic affinity distance between the subsequent states is low, might allow the tree search to operate partially at the level of strategies instead of single actions, and this could help in addressing the problem.
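    To make the suggested clustering step concrete, the sketch below is a simplified, hypothetical illustration (not the information-bottleneck-based grouping used elsewhere in this paper): it measures the strategic-affinity distance between two action sequences as one minus the Jaccard overlap of the state sets they make reachable, and greedily merges sequences whose distance falls below a threshold. All names and the threshold value are our own choices.

def strategic_affinity_distance(states_a, states_b):
    """1 - Jaccard overlap of the state sets reachable after two action sequences."""
    union = states_a | states_b
    if not union:
        return 0.0
    return 1.0 - len(states_a & states_b) / len(union)

def cluster_by_affinity(reachable, threshold=0.5):
    """Greedy single-link grouping of action sequences into strategies.

    reachable: dict mapping action sequence -> set of states reachable afterwards.
    """
    clusters = []
    for seq, states in reachable.items():
        for cluster in clusters:
            if any(strategic_affinity_distance(states, reachable[other]) < threshold
                   for other in cluster):
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

# Toy example: two sequences share most reachable states, the third does not.
reachable = {
    ("up", "up"): {"A", "B", "C"},
    ("up", "left"): {"A", "B", "C", "D"},
    ("down", "down"): {"X", "Y"},
}
print(cluster_by_affinity(reachable))  # [[('up', 'up'), ('up', 'left')], [('down', 'down')]]

    A tree search could then expand one representative per cluster before refining within the chosen cluster, rather than branching over every single action.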

    A second aspect of MCTS that has attracted some criticism is that it relies on following states to the depth of the terminal nodes they lead to in order to evaluate a state. It is possible that the folding-back model of empowerment presented in this paper could be used as a method to evaluate states in an MCTS, operating within a defined horizon when no terminal states appear within that horizon. In this way, the search could be done without this terminal-state requirement, and this might allow a better balance between the depth of the search and its sparsity. This, of course, would make use of the implicit assumption of structural predictability underlying our formalism.
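    As an indication of what such a horizon-limited evaluator might look like, the sketch below is a hypothetical illustration rather than the paper's implementation: it scores a nonterminal state by its n-step empowerment, i.e., the capacity (in bits) of the channel from n-step action sequences to resulting states, computed with a basic Blahut-Arimoto iteration. For brevity it assumes a deterministic forward model supplied by the search, and it computes plain empowerment rather than the full soft-horizon, folded-back variant.

import math

def blahut_arimoto(channel, iterations=100):
    """Channel capacity (in bits) of p(s | a), with `channel` given as {a: {s: prob}}."""
    actions = list(channel)
    p_a = {a: 1.0 / len(actions) for a in actions}   # start from a uniform input distribution
    capacity = 0.0
    for _ in range(iterations):
        # Marginal over resulting states under the current input distribution.
        p_s = {}
        for a in actions:
            for s, p in channel[a].items():
                p_s[s] = p_s.get(s, 0.0) + p_a[a] * p
        # Re-weight each action by the exponential of its KL-divergence contribution.
        weights = {}
        for a in actions:
            d = sum(p * math.log(p / p_s[s]) for s, p in channel[a].items() if p > 0)
            weights[a] = p_a[a] * math.exp(d)
        z = sum(weights.values())
        p_a = {a: w / z for a, w in weights.items()}
        capacity = math.log2(z)                      # converges to the capacity in bits
    return capacity

def evaluate_leaf(state, forward_model, actions, horizon):
    """Empowerment-style score of a nonterminal leaf: capacity of the channel from
    horizon-step action sequences to resulting states, rooted at `state`.
    Assumes a deterministic forward_model(state, action) -> successor state."""
    frontier = [((), state)]
    for _ in range(horizon):                         # exhaustive enumeration, for illustration only
        frontier = [(seq + (a,), forward_model(s, a)) for seq, s in frontier for a in actions]
    channel = {}
    for seq, s in frontier:
        channel.setdefault(seq, {})[s] = 1.0         # deterministic: one resulting state per sequence
    return blahut_arimoto(channel)

    Within an MCTS, a function such as evaluate_leaf could stand in for the usual roll-out at leaves for which no terminal state appears within the chosen horizon.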

    C. Computational Complexity

    We are interested in demonstrating the abilities of the soft-horizon method, and in this paper we did not aim to optimize the algorithm. As such, the unrefined soft-horizon algorithm is still very time consuming.

    Previously, the time complexity of empowerment was exponential with respect to the horizon length, until the impoverish-and-iterate approach was introduced in [30] (which is linear with respect to the number of states encountered).

    This paper introduces soft-horizon empowerment and a second horizon, and currently the time complexity of the soft-horizon algorithm is exponential with respect to the length of that second horizon. We have not yet attempted to apply the iterative impoverishment approach to soft-horizon empowerment, but we expect that this or a similar approach would provide significant improvements.

    In continuous scenarios, where early empowerment studies used to be extremely time consuming, recently developed approximation methods for empowerment have reduced computation time by several orders of magnitude [57].

    X. CONCLUSION

    We have proposed soft-horizon empowerment as a candidate for solving implicit problems that are defined by the environment's dynamics, without imposing an externally defined reward. We argued that these cases are of a type that is intuitively noticed by human players when first exploring a new game, but which computers struggle to identify.

    Importantly, it is seen that this clustering of action sequences into strategies determined by their strategic affinity, combined with aiming to maintain a high level of empowerment (or naive mobility in simpler scenarios), brings about a form of self-motivation. It seems that setting out to maximize the agent's future control over a scenario produces action policies which are intuitively preferred by humans. In addition, the grouping of action sequences into strategies ensures that the solutions produced are diverse in nature, offering a wider selection of options instead of all converging to microsolutions in the same part of the problem at the expense of other parts. The philosophy of the approach is akin to best preparing the scenario for the agent to maximize its influence so as to react most effectively to a yet-to-emerge goal.

    In the context of general game playing, for an AI to play new or previously unencountered games, it is critical that it sheds its reliance on externally created heuristics (e.g., by humans) and discovers its own. In order to do this, we propose that it will need a level of self-motivation and a general method for assigning preference to states, as well as for identifying which actions should be grouped into similar strategies. Soft-horizon empowerment provides a starting point for how we may begin going about this.

    APPENDIX A
    SOFT-HORIZON EMPOWERMENT COMPLETE ALGORITHM

    The soft-horizon empowerment algorithm consists of two main phases. This appendix presents the complete algorithm. In order to make it somewhat independent and concise, it introduces some notation not used in the main paper.

    Phase 1 is not strictly necessary, but acts as a powerful optimization by vastly reducing the number of action sequences that need to be analyzed. The main contributions presented in this paper are within phase 2.

    A. Setup

    1) Define a set as the list of all possible single-step actions available to the player.

    2) Set the current sequence length to 1. Begin with an empty set that will hold the current list of action sequences.

    B. Phase 1

    Phase 1 serves to create a set of action sequences that, most likely, will reach all possible states within the first horizon, but will have very few (zero, in a deterministic scenario) redundant sequences. In stochastic scenarios that have heterogeneous noise in the environment, it may be that noisy areas are avoided in preference to staying within more stable states, and in these cases you will find there may be some redundancy in the form of multiple action sequences leading to the same state.

    In a deterministic scenario, phase 1 can be entirely skipped; the same optimization can be achieved by selecting at random a single action sequence for each state reachable within the first horizon (i.e., for each reachable state, select any single action sequence that leads to it). A sketch of this deterministic shortcut is given at the end of this phase.

    3) Produce an extended list of action sequences by forming each possible extension of every action sequence in the current list using every action in the single-step action set; the number of resultant action sequences should equal the size of the previous list multiplied by the number of single-step actions. Replace the current list with this new list and increment the sequence length by 1.

    a) Using the channel/transition table, note the number of unique states reachable using the action sequences in the current list, always starting from the current state.

    4) From the transition table, produce the joint distribution of action sequences and resulting states, assuming an equidistribution over the action sequences. Use this, combined with the number of unique states from step 3a), as input to the information-bottleneck algorithm (we recommend the implementation in [49, p. 30]; see Appendix B). For our groups of action sequences, we set the cardinality equal to the number of observed states, such that the number of groups matches the number of observed states. This will produce a mapping from action sequences to groups, which will typically be a hard mapping in game scenarios. Select a random action sequence from each group (choosing the most strongly associated one in cases where it is not a hard mapping). Form a set from these selected sequences, and use this set to replace the current list.

    5) Loop over steps 3) and 4) until the action sequences reach the desired first-horizon length.
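    A minimal Python sketch of the deterministic shortcut referred to above (one representative action sequence per reachable state, pruning as the sequences grow) is given here. It assumes a deterministic transition function step(state, action); the function and argument names are our own, the stochastic, information-bottleneck-based variant of steps 3)-4) is not shown, and it further assumes, as in typical grid scenarios, that states reachable in fewer steps remain reachable at the full length.

def phase1_deterministic(start_state, actions, step, n):
    """Build one representative n-step action sequence per state reachable in n steps.

    step(state, action) -> successor state (deterministic transition table).
    Returns a dict mapping reachable state -> one action sequence leading to it.
    """
    representatives = {start_state: ()}          # state -> chosen action sequence
    for _ in range(n):
        extended = {}
        for state, seq in representatives.items():
            for a in actions:
                nxt = step(state, a)
                # Keep only the first sequence found per resulting state,
                # discarding redundant sequences that reach the same state.
                if nxt not in extended:
                    extended[nxt] = seq + (a,)
        representatives = extended
    return representatives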


    C. Phase 2

    Phase 2 extends these base first-horizon sequences into extended sequences covering both horizons, before collapsing them again such that we retain only a set of first-horizon sequences which can forecast their own futures in the second-horizon steps available to them.

    6) Produce a list of action sequences by forming every possible second-horizon-length sequence of actions from the single-step actions.

    7) Produce an extended list of action sequences by forming each possible extension of every action sequence in the final value of the phase-1 list using every action sequence produced in step 6).

    8) Create a channel from these extended action sequences to the states they lead to by sampling from the environmental dynamics for our current state (using the game's transition table). For environments with other players, one can use any available approximation of their behavior and sample over multiple runs or, lacking that, model them with greedy empowerment maximization based on a small horizon.

    9) Calculate the channel capacity for this channel using the Blahut-Arimoto algorithm, which provides the capacity-achieving distribution over the extended action sequences.

    10) Now collapse the channel over the extended sequences to a channel over the first-horizon sequences by marginalizing over the equally distributed extensions of the action sequences; that is, the collapsed probability of reaching a state given a first-horizon sequence is the average, over its equally weighted extensions, of the probability of reaching that state.

    11) Apply the information bottleneck (as in [49, p. 30]; see Appendix B) to reduce this to a mapping of action sequences to strategies, so that the first-horizon action sequences are grouped into strategies. The cardinality parameter sets how many strategy groups you wish to select.

    12) We now need to select, from each group, a representative action sequence that maximizes approximated future empowerment (weighted by how well the sequence represents the strategy, which is relevant if the mapping from step 11) is not deterministic), using the capacity-achieving distribution of action sequences calculated in step 9). Note that the mutual information required here uses the full channel, but sums only over those parts of the channel sharing the same first-horizon steps, so algorithmically it is advisable to calculate and store these mutual information values while doing the channel collapse above.

    We can now form a distribution over the first-horizon action sequences selected from each action group; these represent a variety of strategies while aiming to maximize future empowerment within those strategies. A simplified sketch of steps 10)-12) is given below.
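    The following sketch is a simplified rendering of steps 10)-12) under our own assumptions, not the exact published procedure: it collapses a channel over (first-horizon prefix, second-horizon extension) pairs to one over prefixes, and then, given a hard grouping of prefixes into strategies (such as the one step 11) produces), picks from each group the prefix whose second-horizon sub-channel carries the most mutual information. For brevity it uses a uniform distribution over the extensions in place of the capacity-achieving distribution from step 9).

import math

def collapse_channel(extended_channel):
    """Collapse p(s | prefix, extension) to p(s | prefix) by marginalizing
    over an equidistribution on the second-horizon extensions.

    extended_channel: {(prefix, extension): {state: prob}}
    """
    collapsed, counts = {}, {}
    for (prefix, _ext), dist in extended_channel.items():
        acc = collapsed.setdefault(prefix, {})
        for s, p in dist.items():
            acc[s] = acc.get(s, 0.0) + p
        counts[prefix] = counts.get(prefix, 0) + 1
    for prefix, acc in collapsed.items():
        for s in acc:
            acc[s] /= counts[prefix]          # equal weight for every extension
    return collapsed

def mutual_information_bits(sub_channel):
    """I(extension; state) for one prefix, with a uniform p(extension)."""
    p_ext = 1.0 / len(sub_channel)
    p_s = {}
    for dist in sub_channel.values():
        for s, p in dist.items():
            p_s[s] = p_s.get(s, 0.0) + p_ext * p
    mi = 0.0
    for dist in sub_channel.values():
        for s, p in dist.items():
            if p > 0:
                mi += p_ext * p * math.log2(p / p_s[s])
    return mi

def select_representatives(extended_channel, grouping):
    """grouping: {prefix: strategy_label}. Returns one prefix per strategy,
    the one with the highest approximated future empowerment."""
    per_prefix = {}
    for (prefix, ext), dist in extended_channel.items():
        per_prefix.setdefault(prefix, {})[ext] = dist
    best = {}
    for prefix, label in grouping.items():
        score = mutual_information_bits(per_prefix[prefix])
        if label not in best or score > best[label][1]:
            best[label] = (prefix, score)
    return {label: prefix for label, (prefix, _s) in best.items()}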

    APPENDIX B
    INFORMATION-BOTTLENECK ALGORITHM

    The information-bottleneck algorithm is a variation of rate distortion in which the compressed representation is guided not by a standard rate-distortion function but rather through the concept of relevance to another variable. We wish to form a compressed representation of the discrete random variable X, denoted by T, but we acknowledge that different choices of distortion would result in different representations; it is likely that we have some understanding of which aspects of X we would like to retain and which could be discarded. The information-bottleneck method seeks to address this by introducing a third variable Y, which is used to specify the relevant aspects of X.

    More formally, we wish to compress X, minimizing I(X;T), while maintaining I(T;Y). We are looking to retain the aspects of X which are relevant to Y, while discarding what else we can. We introduce a Lagrange multiplier β, which controls the tradeoff between these two aspects, meaning we seek a mapping p(t|x) which minimizes

    L = I(X;T) - β I(T;Y).

    The iterative algorithm we present here is from [49]; an illustrative sketch in Python is given after the description below.

    Input:
    1) Joint distribution p(x, y).
    2) Tradeoff parameter β.
    3) Cardinality parameter |T| and a convergence parameter ε.

    Output: A mapping p(t|x), where the compressed variable t takes one of |T| values. For the scenarios in this paper, this is typically a hard mapping, but it is not necessarily so.

    Setup: Randomly initialize p(t|x), then calculate p(t) and p(y|t) using the corresponding equations below.

    Loop (iteration m):
    1) p^(m)(t|x) ∝ p^(m-1)(t) exp(-β D_KL[p(y|x) || p^(m-1)(y|t)]), normalized over t for each x;
    2) p^(m)(t) = Σ_x p(x) p^(m)(t|x);
    3) p^(m)(y|t) = (1/p^(m)(t)) Σ_x p^(m)(t|x) p(x) p(y|x).

    Until:

    JS[p^(m)(t|x), p^(m-1)(t|x)] ≤ ε for every x

    where JS is the Jensen-Shannon divergence, based on D_KL, the Kullback-Leibler divergence:

    JS[p, q] = (1/2) D_KL[p || r] + (1/2) D_KL[q || r], with r = (p + q)/2.
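    For completeness, the following Python sketch is one possible rendering of this iterative loop under our own conventions (plain dictionaries, equal-weight Jensen-Shannon convergence test); it follows the update equations above but is an illustration, not Slonim's reference implementation, and it assumes every x has positive marginal probability.

import math, random

def kl(p, q):
    # Kullback-Leibler divergence in bits; a tiny floor avoids log-of-zero issues.
    return sum(pi * math.log2(pi / max(q.get(k, 0.0), 1e-300)) for k, pi in p.items() if pi > 0)

def js(p, q):
    # Equal-weight Jensen-Shannon divergence, used as the convergence test.
    m = {k: 0.5 * p.get(k, 0.0) + 0.5 * q.get(k, 0.0) for k in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def information_bottleneck(p_xy, n_clusters, beta, eps=1e-6, max_iter=500):
    """Iterative IB: returns p(t|x) as {x: {t: prob}}.  p_xy: joint distribution {(x, y): prob}."""
    xs = sorted({x for x, _ in p_xy})
    ys = sorted({y for _, y in p_xy})
    ts = list(range(n_clusters))
    p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in xs}
    p_y_x = {x: {y: p_xy.get((x, y), 0.0) / p_x[x] for y in ys} for x in xs}

    # Random initialization of p(t|x).
    p_t_x = {}
    for x in xs:
        w = [random.random() + 1e-3 for _ in ts]
        z = sum(w)
        p_t_x[x] = {t: w[t] / z for t in ts}

    for _ in range(max_iter):
        # Update p(t) and p(y|t) from the current soft assignment.
        p_t = {t: sum(p_x[x] * p_t_x[x][t] for x in xs) for t in ts}
        p_y_t = {t: {y: (sum(p_t_x[x][t] * p_x[x] * p_y_x[x][y] for x in xs) / p_t[t])
                     if p_t[t] > 0 else 1.0 / len(ys)
                     for y in ys} for t in ts}
        # Re-estimate p(t|x) with the exponential update rule.
        new = {}
        for x in xs:
            w = {t: p_t[t] * math.exp(-beta * kl(p_y_x[x], p_y_t[t])) + 1e-300 for t in ts}
            z = sum(w.values())
            new[x] = {t: w[t] / z for t in ts}
        converged = all(js(new[x], p_t_x[x]) <= eps for x in xs)
        p_t_x = new
        if converged:
            break
    return p_t_x

    A hard grouping can then be read off, if desired, by assigning each x to the t with the largest p(t|x).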


    A. Parameter Values

    In this paper, the same values of the tradeoff parameter β and the convergence parameter ε were used for all scenarios. The algorithm is based on a random initialization of p(t|x), so it is usually beneficial to use multiple runs of the algorithm and select the run with the best result. Throughout this paper, we used 200 runs of the information-bottleneck algorithm in each instance it was used.
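    As a usage note, the sketch below (ours and hypothetical) shows one way such restarts can be scored, keeping the mapping that minimizes the IB objective I(X;T) - β I(T;Y); the solver argument stands for any implementation of the iterative algorithm above with the signature solver(p_xy, n_clusters, beta).

import math

def mutual_information(joint):
    """I(A;B) in bits from a joint distribution {(a, b): prob}."""
    p_a, p_b = {}, {}
    for (a, b), p in joint.items():
        p_a[a] = p_a.get(a, 0.0) + p
        p_b[b] = p_b.get(b, 0.0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in joint.items() if p > 0)

def ib_objective(p_xy, p_t_given_x, beta):
    """The functional I(X;T) - beta * I(T;Y) to be minimized."""
    p_x = {}
    for (x, _y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    joint_xt = {(x, t): p_x[x] * q for x, dist in p_t_given_x.items() for t, q in dist.items()}
    joint_ty = {}
    for (x, y), p in p_xy.items():
        for t, q in p_t_given_x[x].items():
            joint_ty[(t, y)] = joint_ty.get((t, y), 0.0) + p * q
    return mutual_information(joint_xt) - beta * mutual_information(joint_ty)

def best_of_runs(solver, p_xy, n_clusters, beta, runs=200):
    """Run a randomly initialized IB solver several times; keep the best mapping."""
    best, best_score = None, float("inf")
    for _ in range(runs):
        mapping = solver(p_xy, n_clusters, beta)
        score = ib_objective(p_xy, mapping, beta)
        if score < best_score:
            best, best_score = mapping, score
    return best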

    REFERENCES

    [1] S. Margulies, "Principles of beauty," Psychol. Rep., vol. 41, pp. 3-11, 1977.
    [2] J. v. Neumann, "Zur theorie der gesellschaftsspiele," Mathematische Annalen, vol. 100, no. 1, pp. 295-320, 1928.
    [3] C. E. Shannon, "Programming a computer for playing chess," Philosoph. Mag., vol. 41, no. 314, pp. 256-275, 1950.
    [4] A. Samuel, "Some studies in machine learning using the game of checkers," IBM J., vol. 11, pp. 601-617, 1967.
    [5] J. McCarthy, "What is artificial intelligence?" 2007 [Online]. Available: http://www-formal.stanford.edu/jmc/whatisai/
    [6] O. Syed and A. Syed, "Arimaa: A new game designed to be difficult for computers," Int. Comput. Games Assoc. J., vol. 26, pp. 138-139, 2003.
    [7] G. Chaslot, S. Bakkes, I. Szita, and P. Spronck, "Monte-Carlo tree search: A new framework for game AI," in Proc. 4th Artif. Intell. Interactive Digit. Entertain. Conf., 2008, pp. 216-217.

    [8] R. Pfeifer and J. C. Bongard, How the Body Shapes the Way We Think: A New View of Intelligence. Cambridge, MA, USA: MIT Press, 2006.
    [9] F. J. Varela, E. T. Thompson, and E. Rosch, The Embodied Mind: Cognitive Science and Human Experience. Cambridge, MA, USA: MIT Press, 1992.
    [10] T. Quick, K. Dautenhahn, C. L. Nehaniv, and G. Roberts, "On bots and bacteria: Ontology independent embodiment," in Proc. 5th Eur. Conf. Artif. Life, 1999, pp. 339-343.
    [11] H. A. Simon, Models of Man: Social and Rational, ser. Mathematical Essays on Rational Human Behavior in a Social Setting. New York, NY, USA: Wiley, 1957.
    [12] O. E. Williamson, "The economics of organization: The transaction cost approach," Amer. J. Sociol., vol. 87, no. 3, pp. 548-577, 1981.
    [13] C. A. Tisdell, Bounded Rationality and Economic Evolution: A Contribution to Decision Making, Economics, and Management. Cheltenham, U.K.: Edward Elgar, 1996.
    [14] D. A. McAllester, "PAC-Bayesian model averaging," in Proc. 12th Annu. Conf. Comput. Learn. Theory, 1999, pp. 164-170.
    [15] N. Tishby and D. Polani, "Information theory of decisions and actions," in Perception-Reason-Action Cycle: Models, Algorithms and Systems, V. Cutsuridis, A. Hussain, and J. Taylor, Eds. New York, NY, USA: Springer-Verlag, 2010.

    [16] F. Attneave, "Some informational aspects of visual perception," Psychol. Rev., vol. 61, no. 3, pp. 183-193, 1954.
    [17] H. B. Barlow, "Possible principles underlying the transformations of sensory messages," in Sensory Communication, W. A. Rosenblith, Ed. Cambridge, MA, USA: MIT Press, 1959, pp. 217-234.
    [18] H. B. Barlow, "Redundancy reduction revisited," Network, Comput. Neural Syst., vol. 12, no. 3, pp. 241-253, 2001.
    [19] J. J. Atick, "Could information theory provide an ecological theory of sensory processing," Network, Comput. Neural Syst., vol. 3, no. 2, pp. 213-251, May 1992.
    [20] M. Prokopenko, V. Gerasimov, and I. Tanev, "Evolving spatiotemporal coordination in a modular robotic system," in From Animals to Animats 9, ser. Lecture Notes in Computer Science, S. Nolfi, G. Baldassarre, R. Calabretta, J. Hallam, D. Marocco, J. Meyer, and D. Parisi, Eds. Berlin, Germany: Springer-Verlag, 2006, vol. 4095, pp. 558-569.
    [21] W. Bialek, I. Nemenman, and N. Tishby, "Predictability, complexity, and learning," Neural Comput., vol. 13, no. 11, pp. 2409-2463, 2001.
    [22] N. Ay, N. Bertschinger, R. Der, F. Guettler, and E. Olbrich, "Predictive information and explorative behavior of autonomous robots," Eur. Phys. J. B, Condensed Matter Complex Syst., vol. 63, no. 3, pp. 329-339, 2008.
    [23] A. S. Klyubin, D. Polani, and C. L. Nehaniv, "Empowerment: A universal agent-centric measure of control," Proc. IEEE Congr. Evol. Comput., vol. 1, pp. 128-135, 2005.
    [24] A. S. Klyubin, D. Polani, and C. L. Nehaniv, "All else being equal be empowered," in Advances in Artificial Life, ser. Lecture Notes in Computer Science, M. S. Capcarrère, A. A. Freitas, P. J. Bentley, C. G. Johnson, and J. Timmis, Eds. Berlin, Germany: Springer-Verlag, Sep. 2005, vol. 3630, pp. 744-753.

    [25] E. Slater, "Statistics for the chess computer and the factor of mobility," IRE Professional Group Inf. Theory, vol. 1, no. 1, pp. 150-152, Feb. 1953.
    [26] A. S. Klyubin, D. Polani, and C. L. Nehaniv, "Keep your options open: An information-based driving principle for sensorimotor systems," PLoS ONE, vol. 3, no. 12, 2008, DOI: 10.1371/journal.pone.0004018.
    [27] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, pp. 379-423, Jul. 1948.
    [28] R. Blahut, "Computation of channel capacity and rate distortion functions," IEEE Trans. Inf. Theory, vol. IT-18, no. 4, pp. 460-473, Jul. 1972.
    [29] A. S. Klyubin, D. Polani, and C. L. Nehaniv, "Organization of the information flow in the perception-action loop of evolved agents," in Proc. NASA/DoD Conf. Evolvable Hardware, R. S. Zebulum, D. Gwaltney, G. Hornby, D. Keymeulen, J. Lohn, and A. Stoica, Eds., 2004, pp. 177-180.
    [30] T. Anthony, D. Polani, and C. L. Nehaniv, "Impoverished empowerment: Meaningful action sequence generation through bandwidth limitation," in Advances in Artificial Life, ser. Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 2009, vol. 5778, pp. 294-301.
    [31] J. Pearl, Causality: Models, Reasoning and Inference. Cambridge, U.K.: Cambridge Univ. Press, 2000.
    [32] P. Capdepuy, D. Polani, and C. L. Nehaniv, "Constructing the basic umwelt of artificial agents: An information-theoretic approach," in Advances in Artificial Life. Berlin, Germany: Springer-Verlag, 2007, vol. 4648, pp. 375-383.

    [33] S. Singh, M. R. James, and M. R. Rudary, "Predictive state representations: A new theory for modeling dynamical systems," in Proc. 20th Conf. Uncertainty Artif. Intell., 2004, pp. 512-519.
    [34] T. Anthony, D. Polani, and C. L. Nehaniv, "On preferred states of agents: How global structure is reflected in local structure," in Artificial Life XI, S. Bullock, J. Noble, R. Watson, and M. A. Bedau, Eds. Cambridge, MA, USA: MIT Press, 2008, pp. 25-32.
    [35] T. Jung, D. Polani, and P. Stone, "Empowerment for continuous agent-environment systems," Adaptive Behav., Animals Animats Softw. Agents Robots Adaptive Syst., vol. 19, pp. 16-39, Feb. 2011.
    [36] J. Schmidhuber, "A possibility for implementing curiosity and boredom in model-building neural controllers," in From Animals to Animats, J. A. Meyer and S. W. Wilson, Eds. Cambridge, MA, USA: MIT Press/Bradford Books, 1991, pp. 222-227.
    [37] L. Steels, "The autotelic principle," in Embodied Artificial Intelligence, ser. Lecture Notes in Computer Science, F. Iida, R. Pfeifer, L. Steels, and Y. Kuniyoshi, Eds. Berlin, Germany: Springer-Verlag, 2004, vol. 3139, pp. 231-242.
    [38] P. Oudeyer, F. Kaplan, and V. V. Hafner, "Intrinsic motivation systems for autonomous mental development," IEEE Trans. Evol. Comput., vol. 11, no. 2, pp. 265-286, Apr. 2007.
    [39] J. Schmidhuber, "Formal theory of creativity, fun, and intrinsic motivation (1990-2010)," IEEE Trans. Autonom. Mental Develop., vol. 2, no. 3, pp. 230-247, Sep. 2010.
    [40] J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver, "A Monte-Carlo AIXI approximation," J. Artif. Intell. Res., vol. 40, pp. 95-142, 2011.

    [41] S. Singh, A. G. Barto, and N. Chentanez, "Intrinsically motivated reinforcement learning," in Proc. 18th Annu. Conf. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2005, pp. 1281-1288.
    [42] M. Vergassola, E. Villermaux, and B. I. Shraiman, "Infotaxis as a strategy for searching without gradients," Nature, vol. 445, no. 7126, pp. 406-409, 2007.
    [43] P.-Y. Oudeyer and F. Kaplan, "How can we define intrinsic motivation?," in Proc. 8th Int. Conf. Epigenetic Robot., Model. Cogn. Develop. Robot. Syst., M. Schlesinger, L. Berthouze, and C. Balkenius, Eds., 2008, pp. 93-101.
    [44] T. Berger, "Living information theory," IEEE Inf. Theory Soc. Newslett., vol. 53, no. 1, p. 1, 2003.
    [45] J. Pearl, Heuristics: Intelligent Search Strategies for Computer Problem Solving. Boston, MA, USA: Addison-Wesley, 1984.
    [46] M. Genesereth and N. Love, "General game playing: Overview of the AAAI competition," AI Mag., vol. 26, pp. 62-72, 2005.
    [47] S. B. Laughlin, R. R. de Ruyter van Steveninck, and J. C. Anderson, "The metabolic cost of neural information," Nature Neurosci., vol. 1, no. 1, pp. 36-41, 1998.


    [48] N. Tishby, F. Pereira, and W. Bialek, "The information bottleneck method," in Proc. 37th Annu. Allerton Conf. Commun. Control Comput., 1999, pp. 368-377.
    [49] N. Slonim, "The information bottleneck: Theory and applications," Ph.D. dissertation, Schl. Comput. Sci. Eng., The Hebrew Univ., Jerusalem, Israel, 2003.
    [50] A. Junghanns and J. Schaeffer, "Sokoban: Enhancing general single-agent search methods using domain knowledge," Artif. Intell., vol. 129, no. 1, pp. 219-251, 2001.
    [51] D. Dor and U. Zwick, "SOKOBAN and other motion planning problems," Comput. Geometry Theory Appl., vol. 13, no. 4, pp. 215-228, 1999.
    [52] D. Robles and S. M. Lucas, "A simple tree search method for playing Ms. Pac-Man," in Proc. 5th Int. Conf. Comput. Intell. Games, Piscataway, NJ, USA, 2009, pp. 249-255.
    [53] J. Schmidhuber, "Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes," 2008 [Online]. Available: http://arxiv.org/abs/0812.4360
    [54] J. Lehman and K. Stanley, "Abandoning objectives: Evolution through the search for novelty alone," Evol. Comput., vol. 19, no. 2, pp. 189-223, 2011.
    [55] M. Müller, "Challenges in Monte-Carlo tree search," 2010 [Online]. Available: http://www.aigamesnetwork.org/media/main:events:london2010-mcts-challenges.pdf
    [56] M. Müller, "Fuego-GB prototype at the Human Machine Competition in Barcelona 2010: A tournament report and analysis," Univ. Alberta, Edmonton, AB, Canada, Tech. Rep. TR10-08, 2010.
    [57] C. Salge, C. Glackin, and D. Polani, "Approximation of empowerment in the continuous domain," Adv. Complex Syst., vol. 16, no. 01n02, 2013, DOI: 10.1142/S0219525912500798.

    Tom Anthony received the B.Sc. degree in computer science (with honors) from the University of Hertfordshire, Hatfield, Herts, U.K., in 2004, where he is currently working toward the Ph.D. degree in the Adaptive Systems Research Group.

    His research is centered around the perception-action loop formalism, and using information-theoretic tools to drive intelligent behavior, with particular focus on what type of structure must exist in a world to encourage adaptivity and learning in an embodied agent, and how that structure may be detected and exploited by such an agent.

    Daniel Polani received the Doctor of Natural Sciences degree from the Faculty of Mathematics and Computer Science, University of Mainz, Mainz, Germany, in 1996.

    From 1996 to 2000, he was a Research Assistant with the Faculty of Mathematics and Computer Science, University of Mainz. In 1997, he was a Visiting Researcher with the Neural Networks Research Group, University of Texas at Austin, Austin, TX, USA. From 2000 to 2002, he was a Research Fellow with the Institute for Neuro- and Bioinformatics, University of Lübeck, Lübeck, Germany. From 2002 to 2008, he was a Principal Lecturer with the Adaptive Systems and Algorithms Research Groups, School of Computer Science, University of Hertfordshire (UH), Hatfield, U.K. Since 2008, he has been a Reader in Artificial Life and Member of the Adaptive Systems and Algorithms Research Groups, School of Computer Science, UH. His research interests include the foundations of intelligent behavior in biological as well as artificial agents, especially in terms of informational optimality principles.

    Chrystopher L. Nehaniv received the B.Sc. degree (with honors) from the University of Michigan, Ann Arbor, MI, USA, in 1987, where he studied mathematics, biology, linguistics, and cognitive science, and the Ph.D. degree in mathematics from the University of California at Berkeley, Berkeley, CA, USA, in 1992.

    He is a Ukrainian-American mathematician and computer scientist. He has held academic and research positions at the University of California at Berkeley, in Japan (professorship at University of Aizu, 1993-1998), and Hungary. Since 2001, he has been Research Professor of Mathematical and Evolutionary Computer Science at the University of Hertfordshire, Hatfield, U.K., where he plays leading roles in the Algorithms, BioComputation, and Adaptive Systems Research Groups. He is the author of over 200 publications and 30 edited volumes in computer science, mathematics, interactive systems, artificial intelligence, and systems biology. He is also the coauthor of a monograph on algebraic theory of automata networks. His current research interests focus on interaction, development, and enaction in biological and artificial systems, the notion of first- and second-person experience (including phenomenological and temporally extended experience in ontogeny) in humanoids and living things, as well as on mathematical and computer algebraic foundations and methods for complex adaptive systems.