
Improving meIRL-based prediction models in video games
using general behaviour classification

Inge Becht
6093906

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie

University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
dr. ing. S.C.J. Bakkes

Intelligent Systems Lab Amsterdam
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 28th, 2013


Abstract

In this research we claim that using behaviour classification models can improve the position prediction of opponents in video game AI. This claim is investigated by proposing a system called meIRL BC, which uses maximum entropy Inverse Reinforcement Learning for the creation of its position prediction models, and predicts the position of opponents based on the behaviour they most likely perform throughout a game of Capture The Flag. To test the performance of meIRL BC, it was integrated into a pre-existing AI and tested against meIRL, a previous implementation that uses only a single model regardless of witnessed behaviour. Our implementation of meIRL BC showed that, in case of correct classification of the behaviours in games, the Euclidean distance of the position prediction error in some of the behaviour models was smaller than that of meIRL. A big problem in the performance of meIRL BC was the large variance in the performance of the behaviour classifier throughout the matches. It did seem, however, that in matches where the classification was correct more than forty percent of the time, meIRL BC had a bigger win rate, but there is still uncertainty about its actual potential.


Contents

1 Acknowledgement
2 Introduction
3 Related Work
  3.1 Opponent position prediction
  3.2 IRL in video games
4 Theoretical Foundation: Maximum Entropy IRL
5 Approach
  5.1 Game Environment
  5.2 Behaviours
  5.3 Position Prediction Models
  5.4 Behaviour classification
  5.5 Position Propagation
6 Experiments
  6.1 Integrating meIRL BC into the Terminator AI
  6.2 Performance Evaluation
  6.3 Results
    6.3.1 Propagation errors
    6.3.2 Classification errors
    6.3.3 Match performance of meIRL BC
7 Discussion
  7.1 Discussion of propagation errors
  7.2 Discussion of classification errors
  7.3 Discussion of match performance of meIRL BC
8 Conclusion
  8.1 Future work


1 Acknowledgement

I would like to thank Bulent Tastan, co-author of [7], for providing me with his implementation of Maximum Entropy Inverse Reinforcement Learning, the author of the Terminator AI, for providing me with an AI to improve upon, and Sander Bakkes, for his guidance along the way.


2 Introduction

In recent years, most AI in video games has undergone a change intended to increase the entertainment value experienced by human players. This change lies in the amount of information the AI is equipped with. Traditionally, AI were equipped with full information about the environment and their opponents to increase their difficulty without much effort [2]. This practice often did not offer a fun game experience to the player [9], as the difficulty of the AI was not based on skillful tactics, but rather on cheating tactics: by having complete knowledge of the game map and the whereabouts of its opponents at all times, an unfair advantage is created over the human player, who only uses the information of the game agents he controls. This led to the development of AI that are equipped with only partial information about the game world, ideally the same amount of information as the human player has.

This shift towards AI that uses only partial knowledge of the game world, however, intensified the problem of enabling the AI to correctly predict the opponent's position. This is a crucial aspect for the construction of well-performing game AI in, for example, combat games, where knowing where the opponent is can be a great advantage when determining what course of action to take to win the game. Having a way to predict the position of an opponent can, for instance, give the AI the possibility to determine whether a route through the environment will be safe or not. This issue of predicting the opponent's position still has not been completely resolved¹, especially in the context of providing a challenging game experience.

The aim of this research is to successfully predict opponent positions in competitive multiplayer games, while dealing with limited knowledge about the opponent's whereabouts. Although various approaches have already been investigated for creating opponent position prediction models for video games, a distinction is not always made between the different behaviours that can occur in a video game. Behaviour and position are, however, often closely related in video games. For example, in a team death match, players could patrol around the map on the lookout for enemies, or move to a safe position to ambush unsuspecting opponents. One could conclude that when such different behaviours are not taken into account when making position predictions, the AI is limited in successfully determining the opponent's position [2][7].

In this paper we aim to improve on a technique for position prediction, called maximum entropy Inverse Reinforcement Learning (meIRL), by creating different position prediction models for each of the behaviours exhibited in a game. meIRL has formerly been investigated by [7], where predictive models were constructed for each individual opponent, instead of making a model for each possible behaviour. By adding position prediction models for different possible behaviours, it would no longer be necessary to create a single model for each opponent, as more general behaviours are used for prediction. Furthermore, we believe that creating these predictive models with meIRL for multiple behaviours will increase the performance of an AI that applies them in video games.

In summary, this research aims to find the answer to the following research question:

To what extent can general behaviour classification improve meIRL-based prediction models in video games?

¹ Visit http://aigamedev.com/open/editorial/open-challenges-fps/ for a list of current relevant challenges.


This thesis is concerned with answering the research question proposed above and is organised as follows. First, an outline of related work is given, to provide the reader with a greater understanding of position prediction models as a research topic for video games. Secondly, a short theoretical background is given on the topic of IRL and the addition of the maximum entropy solution. Thirdly, the implementation of meIRL and behaviour classification is discussed for a game of Capture The Flag. Fourthly, the experiment section follows, in which a comparison is made between the former application of meIRL prediction models and the addition of behaviour classification, by adding their functionality to a pre-existing game AI. Finally, the results are interpreted in the discussion section, and a conclusion is drawn regarding the added value of behaviour classification to meIRL.


3 Related Work

In this section related work regarding both opponent position prediction and the use of IRL in games is discussed, to establish the background on the research topic.

3.1 Opponent position prediction

In the work of Hladky and Bulitko [2], Hidden Semi-Markov Models and Particle Filters, two different models for opponent position prediction, are compared with regard to how accurate as well as how human-like their predictions are. The comparison of how human-like the predictions are was carried out by comparing the performance to that of various human subjects in a game setting of Capture the Flag. The research concludes that the Hidden Semi-Markov Models were the most accurate and made the most human-like errors. This research clearly focuses mostly on the measurement of human-like prediction, but does not integrate such a prediction system into an AI system that participates in the game.

Weber et al. [9] also use a Particle Filter, but rather than researching the likeness to human prediction, they attempt to discover whether adding such a filter for position prediction to an AI system can enhance its performance. The EISBot for the game StarCraft was equipped with a Particle Filter and with reasoning capabilities for different game states that actively use the predictions made by the filter. Results showed that, while the AI had a 10 percent increase in win percentage, increasing the number of available game states that employ the bot's reasoning ability does not always improve its performance.

Instead of focusing on predicting the opponent's position, Laird [3] focuses on creating a reasoner that anticipates the behaviour of an opponent in Quake. This was done by arming the Soar Quakebot with anticipation strategies, to be used only when the bot has a high chance of successfully predicting what the opponent is about to do. If such an anticipation strategy is triggered, the Soar bot predicts the behaviour of the opponent by reasoning about what it would itself do in its opponent's position. Although it is stated that these anticipation strategies are beneficial for the performance of the AI, no results are mentioned that confirm this. The implementation of these anticipation strategies might not be solely for position prediction, but it does highlight some interesting points regarding the use of one's own behaviours when anticipating what an opponent is going to do, something which arguably adds a human-like dimension to the bot.

3.2 IRL in video games

In Tastan et al. [7] predictive models were made using meIRL to try and intercept opponents in the game Quake. The predictive models were applied using a Particle Filter that integrates both meIRL position prediction models, as discussed in Ziebart et al. [10], and Brownian prediction models, of which the latter serves as a baseline performance test (by randomly spreading particles around the map). The predictive models using meIRL had a smaller tracking error of the opponent and were more likely to successfully intercept the opponent.

Lee and Popovic [4] use IRL for learning a player's behaviour style, making the behaviours applicable in a different game environment than the one in which they were witnessed. They find the optimal policy belonging to the demonstrated player behaviour in case the applied Markov Model state space is deterministic and discrete. A similar approach is taken by Tastan and Sukthankar [8], where IRL is used to teach game bots to attack, explore and target opponents in a human-like manner, by creating policies from demonstrations of human expert play. The results were promising, as the bots acted more human-like (this was decided by human test subjects, who were submitted to a Turing-like test that asked which bot was the human one). The problem with this research, however, was that multiple policies can be learned from a single demonstration by using IRL, making the selected policy not a straightforward choice.

Our research is closely related to the work of Tastan et al. [7], as we learn the position prediction models in a similar way: by applying meIRL. We differ from their approach by creating a single prediction model per distinguished behaviour in the game, instead of making a single position prediction model for each player, as we believe this will give a better representation of where a player might be throughout the game. In the upcoming section our approach will be discussed.


4 Theoretical Foundation: Maximum Entropy IRL

In this section a short introduction is given on the topic of IRL and how to constrain it to find the maximum entropy solution. For a more in-depth explanation, on which this introduction is based, see [10] and [7].

Inverse Reinforcement Learning is a method for finding a policy, a course of action for every state in the world, by learning from a demonstration of an expert agent [6]. To use this method for creating prediction models, some example trajectories are required that are a good representation of the model that needs to be learned, expressed in movement actions (a) and state positions (s). When learning the policy for these trajectories, a value map is created that for each position contains the probability that it will be visited by the expert.

The creation of this policy by IRL can be written as the following Bellman equation [7]:

V(s) = \max_{a_s} \Big\{ R(s, a_s) + \sum_{s'} P(s' \mid s, a_s) V(s') \Big\}    (1)

where V(s) is the reward of each state when applying the optimal policy, and P(s'|s, a_s) is the probability of ending up in state s' when applying action a in state s. Video game worlds are often deterministic in nature, and thus the terms in the summation over s' only take the value 1 or 0, in case a ∈ A(s), where A(s) is the set of actions that can be taken in state s. R(s, a_s) is the reward function, which is unknown in the case of IRL. The reward function is the product of a weight vector and a vector of features in which each state is expressed, to create a position-invariant description of positions on the map; the importance of the features is learned by adjusting the weights accordingly. The weights converge by applying the forward-backward pass algorithm, a process that can be read about in [10] and [7].
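Written out, and assuming (as the description above suggests) that the feature vector depends only on the state, the reward function takes the linear form

R(s, a_s) = \theta^{\top} f(s) = \sum_{k} \theta_k f_k(s),

where f(s) is the feature vector of state s and \theta the weight vector adjusted by the forward-backward pass; the symbol names \theta and f are ours, not those of [10].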

Because in practice IRL is an under-constrained problem, there are multiple possible optimal rewards that fit the demonstrated expert behaviour [8]. This can be changed by considering the maximum entropy solution:

V(s) = \sum_{a_s} P(a_s \mid s) \Big\{ R(s, a_s) + \sum_{s'} P(s' \mid s, a_s) V(s') \Big\}    (2)

The main difference between equations 1 and 2 is that in the second case the solution does not consider only the single action in state s that maximises V, but the distribution over the actions that are possible in each state, thus creating a reward value for each state that considers all possible actions in that state. For the creation of a motion model this means that the result considers more than just the exact behaviour that is exhibited: it considers all possible actions, and thus minimizes the information loss.
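To make the difference between the two backups concrete, the following sketch applies both updates to a small deterministic corridor world. It is a minimal illustration under assumptions of our own (a reward indexed by state, a uniform action distribution for P(a_s|s), and a discount factor added so that the toy example converges); it is not the forward-backward procedure of [10].

import numpy as np

def hard_backup(V, R, neighbours, gamma=0.9):
    # One sweep of equation (1): each state keeps only its best action.
    V_new = np.empty_like(V)
    for s in range(len(V)):
        # Deterministic transitions: every action a_s leads to exactly one successor s'.
        V_new[s] = max(R[s] + gamma * V[s2] for s2 in neighbours[s])
    return V_new

def maxent_backup(V, R, neighbours, gamma=0.9):
    # One sweep of equation (2): a weighted sum over the action distribution P(a_s|s).
    # P(a_s|s) is taken to be uniform here; this and the discount gamma are assumptions
    # of this sketch, not part of the method described above.
    V_new = np.empty_like(V)
    for s in range(len(V)):
        succ = neighbours[s]
        V_new[s] = sum((1.0 / len(succ)) * (R[s] + gamma * V[s2]) for s2 in succ)
    return V_new

# A tiny four-state corridor in which state 3 carries the highest (made-up) reward.
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
R = np.array([0.0, 0.1, 0.2, 1.0])
V_hard, V_soft = np.zeros(4), np.zeros(4)
for _ in range(100):
    V_hard = hard_backup(V_hard, R, neighbours)
    V_soft = maxent_backup(V_soft, R, neighbours)

The max-entropy backup spreads value over all reachable successors instead of committing to the single best one, which is why the resulting models keep probability mass on sub-optimal but plausible paths.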


5 Approach

To fully answer the question proposed in the Introduction,

To what extent can general behaviour classification improve meIRL-based motion modelling in video games?

the underlying problems first need to be understood. To make these problems clearer, we propose the following four sub-questions that should be investigated:

1. What different behaviours should be used for classification?

2. How can meIRL-based motion models be constructed for a specific behaviour?

3. How can a behaviour exhibited by an opponent be classified?

4. How are these position models actively used?

Figure 1 gives a schematic visualisation of all the components that need to be developed to fully answer the above questions. Throughout this research we will refer to this whole system as meIRL BC. The visualisation depicts two offline components that are activated before the start of a game: the position prediction modeller, which creates a position prediction model for each behaviour, and the construction of the behaviour classifier. The classifier is applied each game loop, for each opponent, in case that opponent is witnessed by one of the meIRL BC bots. In case the opponent is not witnessed, a final component is activated: the position propagator.

The above questions are related to each of the components, as indicated by the numbers, except for the choice of behaviours, as this is more of an abstract decision than an implementation in itself. The following sections explain these components in more detail, while also answering the above sub-questions. Section 5.2 answers question 1, 5.3 answers question 2, 5.4 answers question 3 and 5.5 explains question 4. But first the game environment, for which meIRL BC will be developed, will be briefly established.

5.1 Game Environment

Although our approach towards predictive modelling is in theory not limited to a single video game, its implementation is very domain dependent: the different behaviours that need to be classified and the features used for creating the motion model can vary for every video game. In this thesis the game Capture The Flag (CTF) is used for implementing meIRL BC. CTF is a typical strategic game mode often found in First Person Shooter games, where two teams try to capture each other's flag, while trying to prevent the other team from succeeding by shooting at the opponent team. When delivering the flag to a certain position on the map a point is scored, and the team with the most points wins the game.

The implementation in this section will be tested by integrating it into an already existing AI specifically designed for CTF games, called Terminator². The Terminator is created in the AISandbox³, a toolbox specifically designed for the creation of new multi-agent AI systems, which consists of maps, example bots and an API to write your AI with.

² To be found at: http://github.com/eiisolver/Terminator
³ For more information visit http://www.aisandbox.com


[Figure 1 is a flowchart. Offline, before the start of the game, example trajectories expressed in features are used to learn the position prediction models using meIRL (2), and a training set is used to construct the RF classifier (3). Then, for every game loop iteration and every remaining opponent: if the enemy is seen, its behaviour is classified and the right prediction model is activated (3); otherwise its position is propagated using the active prediction model (4). The loop continues until the game terminates.]

Figure 1: A schematic visualisation of all components of meIRL BC. The ellipses indicate input data, the rectangles actions and the diamonds if-statements. The dark blue elements are specifically constructed for meIRL BC, and the numbers indicate the related sub-question.


5.2 Behaviours

To create a motion model for each behaviour expected to be exhibited by the opponent, it first needs to be decided which kinds of behaviours are important to distinguish between in a game, as there are several ways to look at this problem. One possibility is to distinguish between different play-styles, for example aggressive and timid behaviours, which could end up using different positions around the map to achieve their goals [5]. For a game with such clearly defined game objectives as CTF, we decided to base the motion models on these objectives alone. Some examples of these objectives are attacking the opponent's flag, defending your own flag, or delivering the flag to your flag score location. The rest of the behaviours used in this research are inspired by the behaviours used by the Terminator AI. Although the opponent's exhibited behaviour might not fit the exact same behaviour models as those used by the Terminator AI, it is a common practice in many video games to project an AI's own workings onto those of an opponent [3].

The complete list of implemented behaviours, as performed by the Terminator, is:

• Attacking flag: Moving to an enemy flag to capture it from its spawning location. Often tends to be a single trajectory from the bot spawn location to the flag spawn location.

• Defending flag: Moving to your own flag for defending purposes and killing enemies that come near; often very centered around a few defending points.

• Deliver flag: When the flag is captured, a bot moves it to its score position. This behaviour often tries to choose a safe path from its current position to the flag score position, without standing still too much along the way.

• Assist flag deliverer: Helping the flag bearer to safely deliver the flag home. Often takes a path similar to that of the flag deliverer.

• Ambushing: Ambushes are positioned at places that contain points of interest, for example patrols close to the flag score location of the enemy.

• Stalk: When the agent knows the position of an opponent while that opponent is not aware of the agent, the agent follows him without being noticed until within shooting range.

5.3 Position Prediction Models

To successfully create prediction models for the different behaviours, meIRL needs some example trajectories in which these behaviours are exhibited (as indicated by figure 1). These example trajectories are extracted from the Terminator AI in a single game of Capture The Flag, by logging every bot position in each iteration of the game loop together with the behaviour exhibited at that moment in the game. These positions are then discretized, so that only a finite number of states is possible, as is expected by meIRL, and so that each consecutive position can be reached by moving one tile up, down, left or right.
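Such a discretization can be as simple as mapping the continuous world coordinates onto a fixed tile grid, as in the one-line sketch below (the tile size is a parameter made up for illustration, not a value taken from the implementation):

def to_tile(x, y, tile_size=4.0):
    # Map a continuous AISandbox position onto a discrete grid tile.
    return int(x // tile_size), int(y // tile_size)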

The features used for training the reward function in meIRL are based on previous research by Tastan et al. [7], in which the distances to specific points on the map, deemed important for that specific game environment, were used as features. We used the same approach, but tailored to the CTF environment. The features used are:


(a) Finished position prediction model of the deliver flag behaviour; (b) finished position prediction model of the flag defend behaviour

Figure 2: Two finished position prediction models, showing the input trajectories in red, the obstacles in blue and the probability of each tile on the grid (white having a high probability of being visited and black having a low probability of being visited)

1. Shortest distance to enemy bot spawn

2. Shortest distance to own bot spawn

3. Shortest distance to enemy score position

4. Shortest distance to my score position

5. Shortest distance to my flag spawn location

6. Shortest distance to enemy flag spawn location

7. Visibility: depending on how many surrounding tiles are obstructed

As stated before, the process of collecting the example trajectories is done by making the Terminator AI play a single game of Capture The Flag, while logging the above features and the behaviour that is performed. This gives enough representative material to create the position prediction models for each behaviour, to be used as input into equation 2. To learn the motion model, the right value of the weight belonging to each feature needs to be determined by the forward-backward pass algorithm, which takes at most about 500 iterations. Due to the size of the map we trained with, the weights would sometimes not converge successfully. In those cases, multiple position prediction models were trained for a single behaviour by using chunks of the complete trajectories, and these were then combined by taking the mean over the resulting models.
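All but the last of the features listed above are shortest distances over the tile grid to fixed points of interest, and the visibility feature counts unobstructed surrounding tiles. A straightforward way to obtain these values, sketched below under our own assumptions about the grid representation (grid[y][x] is True for obstructed tiles), is one breadth-first search per point of interest:

from collections import deque

def shortest_distances(grid, start):
    # Breadth-first search over the tile grid; movement is one tile up, down,
    # left or right, matching the discretization of the trajectories.
    # Returns a dict mapping every reachable (x, y) tile to its distance from start.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= ny < len(grid) and 0 <= nx < len(grid[0])
                    and not grid[ny][nx] and (nx, ny) not in dist):
                dist[(nx, ny)] = dist[(x, y)] + 1
                queue.append((nx, ny))
    return dist

def visibility(grid, tile):
    # Visibility feature: the number of unobstructed tiles among the eight neighbours
    # (the exact visibility measure is not detailed above, so this is an illustrative choice).
    x, y = tile
    open_tiles = 0
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            if dx == dy == 0:
                continue
            nx, ny = x + dx, y + dy
            if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and not grid[ny][nx]:
                open_tiles += 1
    return open_tiles

Since the points of interest (spawns, flag spawn locations and score positions) do not move, the six distance maps can be computed once per map and then simply looked up for any discretized position.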

Figure 2 shows two position prediction models constructed by meIRL on a map of the AISandbox, for the deliver flag and defend flag behaviours. The trajectories used for learning are superimposed in red and the numbers indicate the positions of the features (as enumerated above). The blue blocks indicate obstructed terrain. The black-to-white gradient indicates the model learned by meIRL and shows the probability of each position being visited when a behaviour is performed, with the lightest blocks having the highest probability of being visited by the expert AI and the darkest blocks having the lowest probability.

5.4 Behaviour classification

for every iteration of the game loop do
    for all enemy agents do
        if agent is seen then
            classify behaviour;
            activate the relevant position prediction model for this agent
        else
            if agent is alive then
                execute PropagateStep using the active position prediction model
            else
                clear propagation steps
            end
        end
    end
    use prediction data
end

Algorithm 1: Prediction algorithm

In this research we argue that distinguishing between various behaviours in a game can improve predicting the whereabouts of an opponent. To use behaviour-based prediction models, instead of just using a single model throughout a game, a decision needs to be made somewhere as to which behaviour model best fits what the opponent is doing. In our implementation of meIRL BC, this decision is made each time an opponent is witnessed by the AI using meIRL BC (see algorithm 1). For the bot to be able to classify behaviours correctly, a training set needs to be constructed that is considered a good representation of each behaviour. By playing a match with the Terminator AI in which all six behaviours are exhibited, one can extract the behaviour of every bot at every time step, together with additional feature information, creating around 6000 instances. The following features⁴ are deemed important in the classification process, based on our expert experience with the game:

• The agent's position expressed in POI distances (in the same way as for the motion model)

⁴ A possible addition to the feature set used in this research is a feature that makes use of the sequential nature of behaviour. For example, if a second ago an opponent was exhibiting some kind of behaviour, it seems quite likely it is still exhibiting the same behaviour a second later. Previous behaviour, however, cannot easily be added as a feature, because the interval between opponent sightings varies greatly within a match and therefore adds little extra value. One could take into account how long a behaviour typically lasts, and use this to see if previous sightings are still relevant, but this is not used in our current implementation.


• The orientation of the agent

• The game state:

– Both flags are at their spawn point

– Both flags are not at spawn point

– My flag is not at spawn point

– Enemy flag is not at spawn point

• Distance to the closest opponent

• A Boolean value indicating whether the bot sees an opponent

• Tile visibility

The classification method of choice in this research is Random Forest, a classifier that creates multiple decision trees which all cast a vote when an instance needs classification. This results in a probability for each candidate class being the right one, giving more insight into how well a distinction can be made between the different classes.
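As a concrete illustration of this step, the sketch below trains a Random Forest on logged instances with the features listed above and returns class probabilities for a new sighting. It uses scikit-learn rather than the Weka classifier that was actually used for evaluation (see section 6.3.2), and the file names and feature ordering are illustrative assumptions:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# One row per logged sighting, e.g. the six POI distances, orientation, game state,
# distance to the closest opponent, the sees-opponent flag and tile visibility;
# y holds the behaviour label logged by the Terminator AI at that time step.
X = np.load("classifier_training_features.npy")   # hypothetical log files
y = np.load("classifier_training_labels.npy")

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

def classify_sighting(features):
    # Return the most likely behaviour and the per-class vote probabilities.
    probs = forest.predict_proba([features])[0]
    return forest.classes_[int(np.argmax(probs))], dict(zip(forest.classes_, probs))

The per-class probabilities are what makes Random Forest attractive here: when two behaviours receive nearly equal votes, that ambiguity is visible instead of being hidden behind a single hard label.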

5.5 Position Propagation

For the prediction of the future position of an opponent that is not witnessed by the AI, a propagation method is required that, each game loop, spreads the candidate positions in the available movement directions. In previous work, Particle Filters [9][7][2] were often used, but due to the discrete nature of our behaviour-based position prediction models, it is more natural to apply the propagation steps in a discrete way as well. The propagation algorithm for a single agent is described in algorithms 2 and 3.

Basically, each time an opponent is not seen, the (at most) twenty most likely trajectories are each propagated a single step, by looking at the last candidate position (see line 3) and extending it in the four possible directions (up, down, left and right), or fewer in case of obstructions (see line 4). At the first ever propagation step, only a single trajectory with probability one is propagated, resulting in at most four new trajectories, for which the new probability is the product of the previous probability and the probability of entering the new state, depending on the active motion model (see line 12). Because the probability of each state in the model is not relative to the state that the opponent is in right now, it is made so by dividing the probability of the state by the sum of the probabilities of all possible subsequent states (see line 8).

If there are more than twenty trajectories after propagation, only the twenty most probable ones are kept (see line 21), as the remaining trajectories are very unlikely to end up being successful, yet do take up valuable computation time. A slight preference is given to the tiles towards which the opponent was facing, so as not to always head in the most probable direction; this preference gets smaller with every propagation step.


Data: T ← trajectory set, each trajectory carrying a probability;
      O ← preferred directions of the agent;
      s ← propagation step;
      MM ← active motion model: probability map
Result: Tnew ← new trajectories

1   for t in T do
2       probability ← getProbability(t);
3       position ← getLastPosition(t);
4       possiblePositions ← possiblePositions(position);
5       total ← 0;
6       for possiblePos in possiblePositions do
7           total ← total + MM[possiblePos]
8       end
9       for possiblePos in possiblePositions do
10          p ← calcProb(possiblePos, total, MM);
11          if possiblePos ∈ O then
12              probability ← probability × (p + (1/s × p))
13          else
14              probability ← probability × p
15          end
16          tnew ← add possiblePos to t;
17          add tnew to Tnew;
18          sort Tnew on highest probability;
19      end
20  end
21  Discard Tnew[i] if i > 20

Algorithm 2: The PropagateStep algorithm

Data: position ← position for which to find the probability;
      total ← total probability of the surrounding tiles;
      MM ← active motion model
Result: prob ← tile traversal probability

1   prob ← MM[position] / total;

Algorithm 3: The CalcProb algorithm
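For completeness, a compact Python rendering of Algorithms 2 and 3 is given below. The data-structure choices (a trajectory as a (probability, list of tiles) pair, the motion model as a mapping from tile to probability, and a neighbours function that already excludes obstructed tiles) are our own; the pseudocode above remains the authoritative description.

def calc_prob(tile, total, motion_model):
    # Algorithm 3: probability of traversing a tile, relative to its reachable neighbours.
    return motion_model[tile] / total

def propagate_step(trajectories, preferred, step, motion_model, neighbours, keep=20):
    # Algorithm 2: extend every candidate trajectory by one tile.
    #   trajectories: list of (probability, [tiles]) pairs
    #   preferred:    tiles in the direction the opponent was last seen facing
    #   step:         current propagation step s (the facing preference decays as 1/s)
    #   neighbours:   function mapping a tile to its unobstructed neighbouring tiles
    new_trajectories = []
    for probability, path in trajectories:
        position = path[-1]
        candidates = neighbours(position)
        total = sum(motion_model[c] for c in candidates)
        for cand in candidates:
            p = calc_prob(cand, total, motion_model)
            if cand in preferred:
                new_prob = probability * (p + p / step)   # slight facing preference
            else:
                new_prob = probability * p
            new_trajectories.append((new_prob, path + [cand]))
    # Keep only the most probable trajectories (at most twenty).
    new_trajectories.sort(key=lambda t: t[0], reverse=True)
    return new_trajectories[:keep]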


6 Experiments

In this section we establish the experimental setup of our research, as well as our results. It is organised as follows: first, we discuss how meIRL BC is added to the Terminator AI, in section 6.1. Next, we discuss what we would ideally evaluate our approach on, in section 6.2. Lastly, the results of the experiments follow in section 6.3.

6.1 Integrating meIRL BC into the Terminator AI

In this section we discuss how our approach, meIRL BC, is added to the Terminator AI, but first we explain how the Terminator itself operates with regard to predicting the opponent's position. The Terminator in its original form reasons about the whereabouts of its opponents using a histogram-like approach (see algorithm 4). Each time an opponent is seen by one of the Terminator bots, it increments the value of the tile visited by the opponent, as well as of all tiles that make up the shortest path between this position and the position of the previous sighting. Using this information, together with the number of attacks on each tile, safe routes and points for ambush are calculated each step of the way. The idea is straightforward, but in case opponents are not seen for a long time, the histogram approach becomes a very rough estimate of where safe paths can be found, as the shortest route becomes very unlikely.

Data: nrVisited ← matrix containing a counter for every tile on the board

for every iteration of the game loop do
    for all enemy agents do
        if agent is alive and just seen then
            (x, y) ← position of enemy;
            (prevx, prevy) ← position of enemy at previous sighting;
            path ← shortestPath(x, y, prevx, prevy);
            for (xpos, ypos) in path do
                nrVisited[xpos][ypos]++
            end
        else
            continue;
        end
    end
    use the updated values for determining the best course of action
end

Algorithm 4: Terminator visited tiles algorithm

By straightforwardly changing this update from calculating the shortest path between the last two points at which a bot was witnessed, to incrementing the visited count of the tile where each bot that is alive most probably is, the new motion model is already in use by the Terminator, and evaluation can take place. Algorithm 5 shows how this is implemented. The only disadvantage here is that the expected positions are used passively in the histogram-like fashion, without directly using the information to plan strategies, which does not test meIRL BC's full potential.


Data: nrVisited ← matrix containing a counter for every tile on the board

for every iteration of the game loop do
    for all enemy agents do
        if agent is alive then
            path ← most likely trajectory for the agent;
            (xpos, ypos) ← last position of path;
            nrVisited[xpos][ypos]++;
        else
            continue;
        end
    end
    use the updated values for determining the best course of action
end

Algorithm 5: Terminator meIRL BC visited tiles algorithm

6.2 Performance Evaluation

There are quite a lot of aspects to the implementation of meIRL BC, and all elements need to be tested accordingly. To evaluate the performance of meIRL BC we determine how accurate the position predictions are, how well the classifier works, and whether using multiple behaviour models gives a better performance than using only a single model, even when the opponent does not necessarily use the exact same behaviours as those applied by meIRL BC.

meIRL BC is tested on all these elements by means of a small competition, in which the Terminator AI with meIRL BC integration plays 40 matches against the Terminator with original meIRL, which uses only a single model for all opponents, 40 matches against the original Terminator, and 10 matches against a standard AISandbox AI, to see how well meIRL BC performs against an AI that does not use the exact same behaviours as those on which the prediction models were trained. Throughout this competition, information is gathered each iteration of the game loop on the position and behaviour of each bot, and on the predicted class and position of each opponent bot (the latter is, of course, only done for the meIRL BC Terminator, as only this implementation makes predictions about where the opponents are and which behaviour an opponent is executing). After these matches, the data can be used to check how often classes were predicted correctly (only in the case of the Terminator-based AIs) and how accurate the position predictions were.

All matches take place on the same map on which the prediction models were constructed. Although meIRL policy learning is map invariant, this is not of interest while testing the performance of meIRL BC.

6.3 Results

In this section an overview is given of the results. Section 6.3.1 shows the propagation error made by the different behaviour models of meIRL BC in the context of the performance of the original meIRL. Section 6.3.2 gives insight into the classification error throughout the matches played, and section 6.3.3 shows some statistics of the win and loss rates in the competition.


6.3.1 Propagation errors

Figure 3 shows the prediction error, in Euclidean distance, made by meIRL and meIRL BC for trajectories in 40 different games, for each step in the propagation of the trajectory. In the case of the meIRL BC position prediction models, only the trajectories for which the right classification was applied were used, so that the error is caused only by the performance of the model, and not by the classification as well. Note that not all behaviour models of meIRL BC are shown, as the remaining classes were not correctly classified often enough to give a good representation of their performance (as can be seen in the subfigure showing the assist flag deliverer propagation error).
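The error plotted is the Euclidean distance between the propagated position and the opponent's actual position at each propagation step; a small sketch of how, for example, a mean error curve per step can be computed from the logged match data is given below (the log format is an assumption made for illustration):

import math
from collections import defaultdict

def error_per_step(log):
    # Mean Euclidean prediction error for each propagation step.
    # log: iterable of (step, (pred_x, pred_y), (actual_x, actual_y)) tuples collected
    # whenever an opponent was unseen and its position was being propagated.
    sums, counts = defaultdict(float), defaultdict(int)
    for step, (px, py), (ax, ay) in log:
        sums[step] += math.hypot(px - ax, py - ay)
        counts[step] += 1
    return {step: sums[step] / counts[step] for step in sorted(sums)}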


(a) Euclidean distance prediction error for original meIRL
(b) Euclidean distance prediction error for meIRL BC: Attack Flag
(c) Euclidean distance prediction error for meIRL BC: Flag Defending
(d) Euclidean distance prediction error for meIRL BC: Ambushing
(e) Euclidean distance prediction error for meIRL BC: Assist Flag Deliverer

Figure 3: Euclidean distance prediction error for meIRL's prediction and some of the classes of meIRL BC


6.3.2 Classification errors

When evaluating the trained Random Forest classifier in Weka, using 10-fold cross-validation, the proportion of correctly classified instances was around 72%, but when running the 40 matches against meIRL, the mean proportion of correctly classified instances was much lower than that: around 27.5%. Table 1 shows the combined confusion matrix for all the matches of meIRL BC against meIRL.

Figure 4 shows the wins, losses and draws of every match of meIRL BC against meIRL, plotted against the percentage of instances correctly classified in that match. It also shows the bot kills of both teams and the number of flag captures plotted against the same data.

Table 1: Confusion matrix for all classifications in the 40 games of meIRL against meIRL BC. Each row shows the behaviour that was predicted, each column the behaviour that really belonged to the instance.

                Attack Flag  Defend Flag  Ambush  Assist  Deliver  Stalk
Attack Flag            1781          689    1776     888      508     22
Defend Flag            8617         7528    6208    6750     4001    120
Ambush                 1905          216    2069     576      231      8
Assist                  302          432     276     472      386      3
Deliver                 495          329     568     378     1507     40
Stalk                   112          102      72      56       51      1


(a) Wins/losses/draws of meIRL BC plotted against the percentage of instances correctly classified in a match
(b) Bot kills plotted against the percentage of instances correctly classified in a match
(c) Flag scores plotted against the percentage of instances correctly classified in a match

Figure 4: Flag scores, game wins and bot kills, all plotted against the classification error


6.3.3 Match performance of meIRL BC

Table 2: Match performance of the meIRL BC AI

Terminator meIRL BC vs.   Won  Draw  Loss  Total matches  Total Captures  Total Flag Loss
Balanced                   10     0     0             10              58                1
Terminator                  3    10    27             40              95              153
Terminator meIRL           16    12    12             40             113               98

Table 3: Match performance of the Terminator AI

Terminator vs.            Won  Draw  Loss  Total matches  Total Captures  Total Flag Loss
Balanced                   10     0     0             10              97                2
Terminator meIRL           26     7     7             40             185              113
Terminator meIRL BC        27    10     3             40             153               95

Table 4: Performance of Terminator meIRL

Terminator meIRL vs.      Won  Draw  Loss  Total matches  Total Captures  Total Flag Loss
Balanced                    9     1     0             10              46                3
Terminator                  7     7    26             40             113              185
Terminator meIRL BC        12    12    16             40              98              113

Tables 2, 3 and 4 show the performance of meIRL BC, the original Terminator and the Terminator with meIRL, respectively, by listing the win and loss rates throughout the competition, as well as the total number of flags captured and flags lost. Figure 5 shows, for each of the AIs that were matched up against the Balanced AI, their average score time and points scored.


Figure 5: Average flag score time and points scored for each AI in its matches against the Balanced AI


7 Discussion

In this section the results are discussed, looking in turn at the propagation error, the classification error and the overall match performance.

7.1 Discussion of propagation errors

Figure 3 gives the impression that the propagation error of the behaviour-based position prediction models can be smaller than that of the single-model approach of the original meIRL implementation, in which the exact same model is applied for every possible behaviour. Both the attacking flag and defending flag behaviours have a propagation error that grows less quickly than that of meIRL, most notably around the 10th propagation step, where the propagation error starts to converge. There is, however, a great variation in propagation error across trajectories, with quite a few trajectories ending up with a Euclidean distance error of 20 in as few as 14 steps, which shows that in a lot of cases the step propagator does not perform well at all.

Another thing to note is that the propagation for both the ambushing and assist flag deliverer behaviours does not seem to work as well. In the case of the assist flag deliverer behaviour this seems to be a case of too little data, as mentioned earlier, so that no good estimate can be made of the average performance of the propagation. This, however, is clearly not the case for the ambushing behaviour; there is about as much data as for the attacking flag behaviour. The problem here is that ambushing is a behaviour in which not much movement occurs. A lot of the trajectories are based on a single point at which the agent waits for opponents to ambush. Because position propagation only propagates to the left, right, up and down positions, the position at which the opponent is already situated is not considered as the following position. This was initially done so that not all trajectories would end up in a local maximum, but it causes problems for behaviours in which a lot of waiting at a single position occurs.

7.2 Discussion of classification errors

When considering the average proportion of correctly classified instances per match, 27.5%, and the accompanying confusion matrix (table 1), it seems that the classifier is completely incompatible with the classification problem we try to solve. This, however, is not entirely the case when viewing the classification error of each individual match, as in the graphs in figure 4. The performance of the classifier varies a lot per match, ranging between 10 and 70% of instances correctly classified, so at best matching the accuracy reported by Weka's cross-validation. Why exactly the performance of the classifier varies as much as it does is not completely clear, but it could be improved upon by experimenting more extensively with different features and classification techniques.

Although the performance of the classifier is not very satisfactory, it does give some interesting insight into how well the addition of behaviour classification can help in making position prediction models for video games. Figure 4 shows that when the proportion of correctly classified instances is around 40% or higher, the most wins are made against meIRL, along with a higher number of bot kills by meIRL BC. When it is lower than 40%, meIRL seems to perform better in most cases. This suggests that when meIRL BC is applied correctly, it indeed performs better than the original meIRL, having a higher kill rate, a higher flag capture rate and an overall higher win rate. There are still some cases in which meIRL wins although the classifier works correctly, but this might just be noisy data, or have to do with the great variation in the performance of the step propagator as shown before. To determine the exact cause, more matches could be run between meIRL and meIRL BC.

7.3 Discussion of match performance of meIRL BC

The win-loss performance of meIRL BC against meIRL has already been explained in the previous section, where it is suggested that accurate classification is paired with a higher win rate for meIRL BC. In table 2 the performance of meIRL BC seems negligible, with only 4 more wins against meIRL than losses and draws, but this can be explained by the varying performance of the classifier. The varying performance, however, does not explain the high losing rate against the original Terminator, which neither meIRL nor meIRL BC were a match for. This is something that we cannot yet explain, but it might have to do with the histogram-like approach of the original Terminator. Not a lot of research was done into how exactly this information influences the reasoner, and it could be that our position prediction model does not work well as a substitute. A better approach might have been to use an AI that expects specific positions of the enemy agents on the map for each iteration of the game loop, instead of the Terminator's grid-based histogram map in which each position records how often it has been visited by an opponent.

A last point of evaluation was to see whether meIRL BC could successfully play matches against an AI that did not necessarily perform the behaviours that meIRL BC tries to classify. The final scores in table 2 show that meIRL BC's performance does not differ much from that of the Terminator and meIRL. Figure 5 shows that meIRL BC is mostly a bit faster in capturing flags, and more often scores higher than meIRL. These results indicate that playing against an AI that might not exhibit the exact behaviours that are expected does not have to be a problem, so the behaviours seem general enough to be applied against other AIs that were not previously trained against.


8 Conclusion

In this research we argued that making a distinction between various behaviours in games is crucial for correctly predicting the position of an opponent when an AI cannot witness it. This theory was tested by creating a system called meIRL BC. meIRL BC creates position prediction models for different behaviours using Maximum Entropy Inverse Reinforcement Learning, which had formerly only been used for the creation of player-based prediction models. meIRL BC then uses these models to propagate the position prediction of opponents that are not seen in a game loop iteration, based on the behaviour they are most likely executing. By setting up a competition between meIRL BC and a version that does not distinguish between these behaviours, but uses a single prediction model for each opponent at all times (which we call meIRL), we showed that when the behaviour-based position prediction models were correctly selected by meIRL BC, the propagation error grew less quickly for the most executed behaviours than that of the meIRL implementation. A big problem throughout the experiments, however, was the performance of the classifier. Due to its great variation in correctly classified instances per match, the win/loss rates against both meIRL and the original Terminator were not a great representation of meIRL BC's full potential. They did show, however, that with a high classification success rate the number of wins was higher than the number of losses. Furthermore, meIRL BC could successfully perform against an AI that did not exactly fit the prediction models it used, supporting the idea that such behaviours are general enough to apply to different opponents.

Based on these results, we conclude that the addition of behaviour classification to meIRL has the potential to improve its performance, but our implementation is not consistent enough at the moment to be reliably implemented in video games or to show its full potential as of yet. More research into classification and good step propagation algorithms could improve the performance considerably, without changing the general idea of meIRL BC.

8.1 Future work

This section discusses some possible additions to meIRL BC that could be investigated in the future.

Classification The classification method used throughout this research has not been very reliable for the classification of behaviours in games, so a better method should be developed for reliable behaviour classification.

Dynamic features Right now, all the behaviour-specific prediction models are constructed offline, outside of the game, limiting the possible features that can be used for learning the position prediction models. For instance, a good feature to train the prediction models with would be the position of the flag throughout the game, so that the attack flag motion model changes when the flag of the enemy team is not at its base. Right now the applied model stays the same, which makes the prediction model not as good a representation as it could be for such situations. We did try to add the position of the flag as a feature, creating new motion models for the attack and defend flag behaviours each time the flag was dropped somewhere else, but the computations took too long to be of any use. Adding such continuously changing features could possibly create better prediction models than those that resulted from this research.


Adaptability Although it is not yet widely applied, adaptable AI is currently a hot topic in video game development [1]. In the current state of this research, adaptability to an opponent's play style is not possible at all, as the assumption is made that a few general behaviours are enough to classify multiple different agents. A way to classify different play-styles could cater more to the specific needs of each opponent, and could possibly be combined with the specific game objectives that were used as the basis of this research.


References

[1] Sander Bakkes, Pieter Spronck, and Jaap van den Herik. Rapid and reliable adaptation of video game AI. IEEE Transactions on Computational Intelligence and AI in Games, 1(2):93–104, 2009.

[2] Stephen Hladky and Vadim Bulitko. An evaluation of models for predicting opponent positions in first-person shooter video games, 2008.

[3] John E. Laird. It knows what you're going to do: adding anticipation to a Quakebot. In Proceedings of the Fifth International Conference on Autonomous Agents, AGENTS '01, pages 385–392, New York, NY, USA, 2001. ACM.

[4] Seong Jae Lee and Zoran Popovic. Learning behavior styles with inverse reinforcement learning. ACM Transactions on Graphics (TOG), 29(4):122, 2010.

[5] Yoshitaka Matsumoto and Ruck Thawonmas. MMOG player classification using hidden Markov models. In Entertainment Computing–ICEC 2004, pages 429–434. Springer, 2004.

[6] Andrew Y. Ng and Stuart Russell. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 663–670, 2000.

[7] B. Tastan, Yuan Chang, and G. Sukthankar. Learning to intercept opponents in first person shooter games. In 2012 IEEE Conference on Computational Intelligence and Games (CIG), pages 100–107, 2012.

[8] Bulent Tastan and Gita Sukthankar. Learning policies for first person shooter games using inverse reinforcement learning. In Artificial Intelligence and Interactive Digital Entertainment (AIIDE), pages 85–90, 2011.

[9] Ben G. Weber, Michael Mateas, and Arnav Jhala. A particle model for state estimation in real-time strategy games. In Proceedings of AIIDE, pages 103–108, Stanford, Palo Alto, California, 2011. AAAI Press.

[10] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind Dey. Maximum entropy inverse reinforcement learning. In Proceedings of AAAI 2008, July 2008.
