DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2018

A Study of Reinforcement Learning in Multi-Agent Systems

MARTA JABLECKA
HANNA KEREK

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES



Page 3: A Study of Reinforcement Learning in Multi-Agent Systems

INOM EXAMENSARBETE TEKNIK,GRUNDNIVÅ, 15 HP

, STOCKHOLM SVERIGE 2018

En studie av reinforcement learning i multiagentsystem

MARTA JABLECKA

HANNA KEREK

KTHSKOLAN FÖR TEKNIKVETENSKAP






A Study of Reinforcement Learning in Multi-Agent Systems

Marta Jablecka and Hanna Kerek

Abstract—Reinforcement learning has recently gained popularity due to its many successful applications in various fields. In this project, reinforcement learning is implemented in a simple warehouse situation where robots have to learn to interact with each other while performing specific tasks. The aim is to study whether reinforcement learning can be used to train multiple agents. Two different methods have been used to achieve this aim, Q-learning and deep Q-learning. Due to practical constraints, this paper cannot provide a comprehensive review of real-life robot interactions. Both methods are tested on single-agent and multi-agent models in Python computer simulations.

The results show that the deep Q-learning model performed better in the multi-agent simulations than the Q-learning model, and it was shown that the agents can learn to perform their tasks to some degree. Although the outcome of this project cannot yet be considered sufficient for moving the simulation into real life, it was concluded that reinforcement learning and deep learning methods can be seen as suitable for modelling warehouse robots and their interactions.

I. INTRODUCTION

Machine learning is an area within computer science where systems are given the ability to learn without being explicitly programmed [1]. The main areas of machine learning are supervised learning, unsupervised learning and reinforcement learning [2]. This project focuses on reinforcement learning (RL). RL is based on the idea that an agent, without any supervision and in an unknown environment, should find its own way to perform a given task. An example where RL has been successfully implemented is AlphaGo, a program developed by Google DeepMind which beat the best Go player in the world.

In this paper the word agent is used to describe a system which is an independent part of a larger program. The term program applies here to the entire system, which has to contain at least one agent, and an environment is the module where the simulations are done. In this project, the agent's task is to move to a designated location without any collisions. When the agent moves through the environment it obtains rewards, and by maximizing the rewards the agent learns to perform its task optimally [3].

This project's aim is to extend the typical single-agent model to more complex models containing two and four agents, respectively, working simultaneously in the same environment, and through that to examine whether RL can be used to train multiple agents. One major issue with moving to multi-agent simulation is the fact that the environment is no longer static, since collisions can occur between two or more agents. A static environment ensures that both negative and positive rewards are always obtained in precisely the same place in each simulation. In contrast, a non-static environment can return an unexpected negative reward, which can slow down the learning process or make it impossible. In an attempt to overcome this issue, Q-learning and deep Q-learning algorithms are implemented and modified as shown in the later parts of this report.

II. THEORY

A. Markov Decision Process

The environment in this project is based on the Markov Decision Process (MDP). MDP is widely used as a framework and has become a cornerstone of RL [4], [5]. It is important to note that an RL algorithm does not need to have any knowledge of the MDP [6]. Furthermore, using MDP as a framework for single-agent algorithms has been proven to be one of the most efficient solutions currently available [7].

Fig. 1: Agent-environment interaction in RL [7].

MDP can be explained by Figure 1. At each time step t the agent receives its current state s_t as input from the environment and takes an action a_t, which becomes the output. The agent then receives the reward r_{t+1} and the next state s_{t+1} from the environment [7]. A policy function π is used to decide which action to choose in each state [8]. The goal is to maximize the cumulative reward ∑_t r_t over a finite or infinite time horizon [7]. The expected reward for a given state s = s_t is called the state value function V^π(s) and can be described by the Bellman equation, where j ranges over all possible next states s_{t+1}:

V^\pi(s) = \max_{a \in A_s} \Big( r(s,a) + \gamma \sum_{j} p(j \mid s,a)\, V^\pi(j) \Big)    (1)

In equation (1), γ represents the discount factor, a value between 0 and 1 that determines the agent's interest in short-term versus long-term reward, and p(j | s, a) represents the probability of moving from state s to j = s_{t+1} with action a. The equation has a unique solution for each policy [8]. The optimal policy π* can be described by:


\pi^{\star}(s) \in \underset{a \in A_s}{\arg\max} \Big( r(s,a) + \gamma \sum_{j} p(j \mid s,a)\, V^{\star}(j) \Big)    (2)

One simple example of a grid-world with arrows indicating the optimal policy in each step is shown in Figure 2. As explained earlier, the policy function π dictates which move an agent takes. Finding the optimal policy is the main aim of using MDP.
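To illustrate how equations (1) and (2) can be evaluated in practice, the following is a minimal value-iteration sketch in Python for a small deterministic grid-world. The grid size, obstacle layout, rewards and all names below are illustrative assumptions, not the project's actual environment or code.

```python
# Minimal value-iteration sketch for a tiny deterministic grid-world.
# Layout, rewards and names are illustrative assumptions only.
GAMMA = 0.9
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIZE, GOAL, OBSTACLE = 3, (2, 2), (1, 1)

def reward(state):
    # Reward for entering a cell: goal is positive, obstacle negative.
    if state == GOAL:
        return 4.0
    if state == OBSTACLE:
        return -2.0
    return -0.1

def step(state, move):
    # Deterministic transition; moves off the grid leave the state unchanged.
    r, c = state[0] + move[0], state[1] + move[1]
    return (r, c) if 0 <= r < SIZE and 0 <= c < SIZE else state

states = [(r, c) for r in range(SIZE) for c in range(SIZE)]
V = {s: 0.0 for s in states}

# Repeated Bellman backup of equation (1); p(j|s,a) is 1 for the reached cell.
for _ in range(100):
    V = {s: max(reward(step(s, m)) + GAMMA * V[step(s, m)] for m in MOVES.values())
         for s in states}

# Greedy policy extraction as in equation (2).
policy = {s: max(MOVES, key=lambda a: reward(step(s, MOVES[a])) + GAMMA * V[step(s, MOVES[a])])
          for s in states}
print(policy[(0, 0)])
```

On this toy grid the greedy action in the start cell moves toward the goal without entering the obstacle cell.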

Fig. 2: Simple grid environment for a single agent, where the coloured squares represent obstacles and the oval-shaped object represents the goal position. The agent's task is to move to the goal using the optimal policy while avoiding obstacles.

B. Q-Learning

One well-established and widely used algorithm in RL is Q-learning [7]. The Q-learning algorithm works by maintaining a table of Q-values Q(S, A), referred to as the Q-table in the remainder of this paper. Here, S is the set of possible states and A is the set of possible actions, and Q(s, a) denotes the current estimate of the Q-value Q(s_t, a_t). After assigning values to the expected rewards, the Q-function selects the state-action pair with the highest Q-value. The Q-value is estimated by selecting an action a_t from the current state s_t and is updated according to:

Q(s_t, a_t) \leftarrow (1-\alpha)\, Q(s_t, a_t) + \alpha \big[ r(s_t, a_t) + \gamma \cdot \max_{a} Q(s_{t+1}, a) \big]    (3)

where r represents the reward, γ is the discount factor and α is the learning rate. The learning rate regulates to what extent new information overrides old information [7], [9]. The algorithm based on this theory is written in pseudo-code and presented in Algorithm 1 [9].
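As a concrete illustration of the update in equation (3), the following minimal Python sketch stores the Q-table as a dictionary; the environment interface (reset/step), the exploration constant and all names are our own assumptions for illustration, not the project's exact code.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.01, 0.9, 0.1
ACTIONS = ["up", "down", "left", "right", "stop"]

# Q-table: each state maps to one Q-value per action (0.0 until visited).
Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

def choose_action(state):
    # Mostly exploit the largest Q-value, sometimes explore at random.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(Q[state], key=Q[state].get)

def q_update(s, a, r, s_next):
    # Equation (3): blend the old estimate with the bootstrapped target.
    target = r + GAMMA * max(Q[s_next].values())
    Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * target

# Training-loop skeleton; `env` is an assumed grid-world with reset()/step().
# for episode in range(num_episodes):
#     s, done = env.reset(), False
#     while not done:
#         a = choose_action(s)
#         s_next, r, done = env.step(a)
#         q_update(s, a, r, s_next)
#         s = s_next
```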

Although a Q-learning table seems to work well in a simple single-agent situation, it is not an adequate method for more complex tasks. For multi-agent systems, where the state and action spaces are large, the convergence of learning may be extremely inefficient and time consuming. Furthermore, in a multi-agent system the environment is no longer static, which leads to penalties from collisions. Since the positions of the agents are updated based on the Q-table, both the learning time and the paths chosen by the agents to complete their tasks may be far from optimal, especially if the goal of the program is to apply it easily to a more complicated, real-life environment with multiple agents.

C. Deep Q-Learning

Because of the limitations of Q-learning, our approach is to combine Q-learning with a neural network, that is, to use so-called deep Q-learning [10].

Algorithm 1 Q-Learning

Initialize Q-table
Initialize the state s and add s to Q-table
for each episode do
    for each step in episode do
        Choose action with the largest reward (obtained from the Q-table)
        Move to the next state
        if next state not in Q-table then
            Add next state to Q-table
        end if
        Update Q-table using Q-function
    end for
end for

Fig. 3: Simple neural network with 3 nodes in the input layer, 4 nodes in each hidden layer and one node in the output layer.

Neural networks are algorithms modelled on the human brain: highly structured graphs, organized into layers, that are used to mimic human learning. They consist of an input layer, an output layer and one or several hidden layers, with each layer containing one or several nodes [11]. The variant of deep neural networks used in this project is the fully connected neural network, in which the nodes in one layer take inputs from every node in the previous layer. For example, Figure 3 shows a fully connected neural network consisting of an input layer, two hidden layers and an output layer, where the nodes in the first hidden layer take inputs from every node of the input layer.

Deep Q-learning involves a trade-off between exploration and exploitation. Exploration is when the agent tries new choices of actions, and exploitation is when the agent maximizes the reward according to the information it already has [12].

Deep Q-learning is obtained when neural networks are used to approximate the Q-function Q(s, a; θ_i), where θ_i are the parameters (weights) of the neural network at iteration i. When training the agent, a method called experience replay is used; a minimal sketch of such a memory is given below, after the loss function. Experience replay means that the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) are stored in a replay memory after each time step t, forming a data set D_t = {e_1, ..., e_t} over several episodes, where an episode includes every step until a terminal state is reached. A random sample from the memory, a minibatch, is then used to update the network with Q-learning using the following loss function:


L_i(\theta_i) = \mathbb{E}_{(s,a,r,s')} \Big[ \big( r + \gamma \cdot \max_{a'} Q(s', a'; \theta_i^{-}) - Q(s, a; \theta_i) \big)^2 \Big]    (4)

where θ_i^- are the neural network parameters used for computing the target r + γ · max_{a'} Q(s', a'; θ_i^-) at iteration i and Q(s, a; θ_i) is the predicted value. The goal is to minimize this function, that is, to predict a more accurate value. The advantages of this method over standard Q-learning are that every step of experience can be used in many weight updates, which improves efficiency, and that the randomized samples reduce the variance of the updates, since strong correlations between consecutive samples are minimized [13]. Algorithm 2 shows the algorithm following the theory above, based on the algorithm introduced by Google DeepMind [10].

Algorithm 2 Deep Q-Learning

Initialize replay memory to a chosen capacity N
Initialize action-value function Q with random weights θ
Initialize target action-value function Q with weights θ⁻ = θ
for episode = 1 to M do
    Initialize sequence s_1 = {x_1}
    for t = 1 to T do
        With probability 1 − ε select a random action a_t;
        otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t and observe reward r_t and new state s_{t+1}
        Store experience (s_t, a_t, r_t, s_{t+1}) in the memory
        Sample a random minibatch of experiences (s_j, a_j, r_j, s_{j+1}) from the memory
        Set y_j = r_j if the episode terminates at step j + 1,
        otherwise y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ⁻)
        Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² with respect to the network parameters θ
    end for
end for
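The inner update of Algorithm 2 can be sketched in TensorFlow roughly as follows. The network sizes (14 inputs, 64 hidden ReLU nodes, 5 outputs) match the description in Section III, but the optimizer choice, the function names and the batching details are illustrative assumptions rather than the project's exact implementation.

```python
import tensorflow as tf

GAMMA = 0.9

def build_network(n_inputs, n_actions):
    # Fully connected network: one hidden layer with 64 ReLU nodes.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_inputs,)),
        tf.keras.layers.Dense(n_actions),
    ])

q_eval = build_network(14, 5)               # evaluation network, weights theta
q_target = build_network(14, 5)             # target network, weights theta^-
q_target.set_weights(q_eval.get_weights())  # start with identical weights
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

def train_step(states, actions, rewards, next_states, terminal):
    """One gradient step on (y_j - Q(s_j, a_j; theta))^2 for a minibatch.
    `states` and `next_states` are float32 arrays of shape (batch, 14)."""
    rewards = tf.cast(rewards, tf.float32)
    terminal = tf.cast(terminal, tf.float32)
    # y_j = r_j for terminal transitions, else r_j + gamma * max_a' Q(s', a'; theta^-)
    next_q = q_target(next_states)
    targets = rewards + GAMMA * tf.reduce_max(next_q, axis=1) * (1.0 - terminal)
    with tf.GradientTape() as tape:
        q_values = q_eval(states)                            # Q(s_j, .; theta)
        one_hot = tf.one_hot(actions, q_values.shape[1])
        q_taken = tf.reduce_sum(q_values * one_hot, axis=1)  # Q(s_j, a_j; theta)
        loss = tf.reduce_mean(tf.square(targets - q_taken))
    grads = tape.gradient(loss, q_eval.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_eval.trainable_variables))
    return loss
```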

III. METHOD

In this project two types of RL methods were used, Q-learning and deep Q-learning. Five models were developed: one single-agent and one two-agent model with Q-tables, as well as one single-agent, one two-agent and one four-agent model with deep Q-learning. All models were implemented in Python, and the constants and rewards used are shown in Table I. An episode in this project is defined as the time between the start of an agent's life in its starting position and when it reaches a death point. The number of steps is the number of times an agent makes a move during one episode, and the reward is the total reward obtained during an episode.

In the Q-learning models with a Q-table, the agent starts over in its start position every time it collides with an obstacle or another agent. In the two-agent case the agents wait for each other until they both have reached their goals or collided with an obstacle, and then start over in their start positions at the same time. In the deep Q-learning models, however, the agent moves uninterrupted until it reaches its goal and then starts over from its start position. In the case of two and four agents, they wait in their goal positions until all agents have reached their goals, and then all agents start over in their respective start positions.

TABLE I: Constants used in the simulations.

Discount factor, γ                        0.9
Learning rate, α                          0.01
Reward for agent entering an obstacle     -2
Reward for agents colliding               -2
Reward for agent reaching its goal         4
Reward for all other states               -0.1

A. Single-Agent Static Model with Q-Table.

As mentioned earlier, a major advantage of the Q-learning table is that it is easy to implement [8]. In order to understand how to construct a simulation with Q-learning, a single-agent model was built.

Fig. 4: Grid-world environment for the single-agent model. The triangle and the circle denote the start location of the agent and its goal in the simulation, respectively. The coloured squares indicate obstacles with a negative reward.

The Q-learning algorithm was applied to a simple grid-world environment, shown in Figure 4, which allows straightforward investigation of the results. In this simulation, an agent starts in the upper-left corner of the grid and has to reach the goal located in the bottom-right corner, indicated by the circle. Since the grid has size 10 × 10, the model consists of 100 states. The agent can move only one square at a time and has 5 possible actions in each state: up, down, left, right and stop. As seen in Figure 4, some of the squares are marked with colour and indicate obstacles, which the agent should avoid.

Following the idea of RL, the agent was given no prior knowledge of its task or environment and was trained by obtaining rewards for each action. In this model, getting to the goal gives a positive reward (4), entering an obstacle gives a negative reward (-2) and moving along white, neutral cells gives a small negative reward (-0.1), as described in Table I. The algorithm used to construct the single-agent model can be seen in Algorithm 1 and follows the theory introduced earlier [9].
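For illustration, a minimal Python sketch of such a grid-world with the reward scheme of Table I is given below; the obstacle layout, the class interface and the decision to end the episode on a collision (as in the Q-table set-up) are our own assumptions.

```python
class GridWorld:
    """10 x 10 grid; the agent starts in the top-left corner and must
    reach the goal in the bottom-right corner while avoiding obstacles."""

    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1),
             "right": (0, 1), "stop": (0, 0)}

    def __init__(self, obstacles):
        self.size = 10
        self.start, self.goal = (0, 0), (9, 9)
        self.obstacles = set(obstacles)   # e.g. {(3, 4), (5, 5), ...}
        self.agent = self.start

    def reset(self):
        self.agent = self.start
        return self.agent

    def step(self, action):
        dr, dc = self.MOVES[action]
        # Moves are clipped so the agent stays on the grid.
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        # Rewards from Table I: +4 for the goal, -2 for an obstacle,
        # -0.1 for every other move.
        if self.agent == self.goal:
            return self.agent, 4.0, True
        if self.agent in self.obstacles:
            return self.agent, -2.0, True
        return self.agent, -0.1, False
```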


B. Two-Agent Model with Q-Table.

The two-agent model with a Q-table is based on the single-agent model. The grid-world is now adapted to the two-agent situation, where each agent starts in a different corner of the grid-world and its goal is located at the start location of the other agent, see Figure 5.

Fig. 5: Grid environment for the two-agent model.

Once the simulation starts, the agents, which again have not been given any prior information about the environment, receive a penalty (-2) every time they collide with each other, and the same rewards as in the single-agent model for entering obstacles, reaching the goal and moving from one white cell to another. The same algorithm as in the single-agent model was used, and the agents' primary goal is to learn to avoid collisions with each other and reach their goals.

In this model, the Q-table was enlarged to 100² = 10 000 states, since we are working on a 10 × 10 grid-world and each cell is assumed to represent one state where an agent can be found. Following this reasoning, it is easy to see that for a 10 × 10 grid-world the size of the Q-table increases significantly with an increasing number of agents, giving 100^x states, where x represents the number of agents in a simulation. In addition, the number of actions was doubled, so that each agent has its own set of actions in the Q-table. Even though the work presented in this paper focuses on a small multi-agent system, the most substantial problem with Q-tables, i.e. the lack of scalability, is already noticeable.
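The scaling argument can be made concrete with a small calculation; the helper below only illustrates the growth of the table under the state and action layout described above, it is not part of the project code.

```python
# A joint Q-table for x agents on a 10 x 10 grid has 100**x states,
# and each agent contributes its own 5 actions to every table row.
def q_table_entries(num_agents, cells=100, actions_per_agent=5):
    joint_states = cells ** num_agents
    action_columns = actions_per_agent * num_agents
    return joint_states * action_columns

for x in (1, 2, 4):
    print(x, "agent(s):", q_table_entries(x), "entries")
# 1 agent(s): 500 entries
# 2 agent(s): 100000 entries
# 4 agent(s): 2000000000 entries
```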

C. Single-Agent Static Model with Deep Q-Learning.

The single-agent model with deep Q-learning is based on the same grid-world as the single-agent model with a Q-table.

As mentioned earlier, Algorithm 2 shows the algorithm used for deep Q-learning in this project. A neural network with an input layer, one hidden layer consisting of 64 nodes and an output layer is used. The neural network has 5 output values, which correspond to the 5 possible actions that the agent can take in each step. Each state gives 14 input values in this model: the distance from the agent to the goal and the distances from the agent to the obstacles in the x- and y-direction. The neural network was implemented using the machine learning library TensorFlow in Python. In neural networks, activation functions are used to decide which nodes are activated and what information is relevant. The ReLU activation function is used in this project, which means that the output equals the net input if the total input is greater than zero, and zero otherwise. This can be expressed by the function

y = \max(0, w^{T} x + b)    (5)

where x is the input, y is the output, w is the weight vector and b is the bias [14]. The neural network consists of a target network and an evaluation network, which correspond to the target and the prediction explained in the theory section of this report. Their structures are the same, but they have different weights and biases since they are updated at different times. The weights in the target network are fixed for the first 200 steps and are then replaced by the weights in the evaluation network every 200 steps. The memory size is 2000 and the minibatch size is 32. Training is not conducted during the first 200 steps; thereafter the evaluation network is trained every fifth step.
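A sketch of this training schedule is given below; the constants follow the description above, while the function names and control flow are our own illustration.

```python
MEMORY_SIZE = 2000   # replay memory capacity
BATCH_SIZE = 32      # minibatch size
LEARN_START = 200    # no training during the first 200 steps
LEARN_EVERY = 5      # then train on every fifth step
SYNC_EVERY = 200     # copy evaluation weights into the target network

def should_train(step):
    return step >= LEARN_START and step % LEARN_EVERY == 0

def should_sync_target(step):
    return step >= LEARN_START and step % SYNC_EVERY == 0

# Inside the interaction loop (pseudostructure):
# for step in itertools.count():
#     ...store the latest experience in the replay memory...
#     if should_sync_target(step):
#         q_target.set_weights(q_eval.get_weights())   # theta^- <- theta
#     if should_train(step):
#         train on a random minibatch of BATCH_SIZE experiences,
#         updating only the evaluation network
```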

In deep Q-learning a parameter ε, the exploration rate, is also used; it controls the trade-off between exploration and exploitation. In this case ε starts at 0, which gives exploration a high priority at the beginning of the simulation. The value is then increased by 0.001 after every training step until the maximum value of 0.9 is reached. More specifically, each time an agent chooses its action, a random value 0 < i < 1 is generated; if i > ε a random action is chosen, i.e. the agent takes a random action with probability 1 − ε. For ε = ε_max the probability of selecting a random action is therefore 10%.
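The action-selection rule and the ε schedule described above can be sketched as follows; the convention matches the report (a random action with probability 1 − ε), while the helper names are illustrative.

```python
import random

EPSILON_MAX = 0.9
EPSILON_STEP = 0.001
epsilon = 0.0            # start with pure exploration

def select_action(q_values):
    # Random action with probability 1 - epsilon, greedy otherwise.
    if random.random() > epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def after_training_step():
    # Increase epsilon by 0.001 per training step, capped at 0.9.
    global epsilon
    epsilon = min(epsilon + EPSILON_STEP, EPSILON_MAX)
```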

D. Two-Agent Model with Deep Q-Learning.

In the two-agent model with deep Q-learning, the same grid-world as in the two-agent model with a Q-table is used. The algorithm is the same as in the single-agent model with deep Q-learning. Each agent has its own neural network with an input layer, one hidden layer consisting of 64 nodes and an output layer. Each state gives 16 input values in this model: the distance between the agent and the other agent, the distance between the agent and its goal, and the distances between the agent and the obstacles in the x- and y-direction. Each neural network has 5 output values, which correspond to the 5 possible actions that the agent can take in each step. The neural network is implemented in the same way as for the single-agent model, with the same activation function and structure.
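To make the 16 input values concrete, the helper below assembles the relative x- and y-distances to the other agent, the goal and each obstacle; with six obstacles this gives 2 + 2 + 12 = 16 values, consistent with the counts above (the number of obstacles and the exact feature layout are our assumptions).

```python
def observation(agent_pos, other_pos, goal_pos, obstacle_positions):
    """Relative x- and y-distances from the agent to the other agent,
    to its goal and to each obstacle."""
    ax, ay = agent_pos
    features = [other_pos[0] - ax, other_pos[1] - ay,
                goal_pos[0] - ax, goal_pos[1] - ay]
    for ox, oy in obstacle_positions:
        features.extend([ox - ax, oy - ay])
    return features   # 16 values for 6 obstacles; dropping the other-agent
                      # terms gives the 14 single-agent inputs
```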

E. Four-Agent Model with Deep Q-Learning.

The four-agent model contains 100⁴ possible states. The aim of presenting this model is to show that the deep Q-learning algorithm does not lose its properties when the number of states is drastically increased. As seen in Figure 6, the four-agent model is based on the previous models used in this paper. Each agent has its own network with the same structure as the neural networks used in the single-agent and two-agent models. Each neural network gets 20 input values in each state: the distance to the agent's goal, the distances to the 3 other agents and the distances to the obstacles in the x- and y-direction.


Fig. 6: Grid environment for the four-agent model.


IV. RESULTS

Table II shows the percentage of episodes with collisions for every agent in each model, for a selected interval of episodes, divided into three categories. As explained before, in the deep Q-learning models the agent is free to move anywhere on the grid until it reaches the goal point, while in the Q-learning models the agent "dies" if it enters an obstacle, collides with another agent or reaches its goal position. It is therefore important to bear in mind the possible bias when analysing the output of the simulations. The categories Obstacle and Agent give the percentages of episodes with collisions with one or more obstacles or with another agent, respectively. The category Total contains the percentage of all episodes with collisions. The interval of examined episodes was chosen to start at the point where it was clear that the reward function for the given model had converged. Figures 7, 8, 9 and 10 present the reward as a function of episode for each model and respective agent.

TABLE II: Percentage of episodes with collisions.

Model        Obstacle [%]   Agent [%]   Total [%]   Episodes [10³]
A            0              -           0           4-10
B, agent 1   0.24           0.84        -           85-95
B, agent 2   47.71          0.84        -           85-95
C            12.75          -           12.75       4-10
D, agent 1   14.30          2.98        16.98       4-10
D, agent 2   5.71           2.48        8.10        4-10
E, agent 1   12.06          13.86       21.45       4-10
E, agent 2   6.78           24.34       27.74       4-10
E, agent 3   31.64          14.25       32.34       4-10
E, agent 4   16.06          11.05       21.66       4-10

Turning now to Table III, it shows the average number of actions an agent needs to take to reach its goal, over the same interval of episodes examined in Table II. In all models, the agent's optimal policy consists of 18 steps. It is easy to see that the optimal policy is reached only in the single-agent case with Q-learning.

TABLE III: Average steps for reaching goal.

Model        Steps    Episodes [10³]
A            18.00    4-10
B, agent 1   18.13    85-95
B, agent 2   22.11    85-95
C            19.95    4-10
D, agent 1   20.02    4-10
D, agent 2   19.97    4-10
E, agent 1   19.97    4-10
E, agent 2   19.87    4-10
E, agent 3   19.99    4-10
E, agent 4   20.02    4-10

As can be seen in Table I, the discount factor, learning rate and rewards were chosen to be equal in all models. It is worth noting that changing those factors could significantly change the results of the simulations. The given values were chosen as a compromise that gave the best possible results in each model.

Fig. 7: Reward plotted against episodes for the single-agent model with both Q-learning and deep Q-learning.

Fig. 8: Reward plotted against episodes for the two-agent model with both Q-learning and deep Q-learning, for agent 1.


Fig. 9: Reward plotted against episodes for the two-agent model with both Q-learning and deep Q-learning, for agent 2.

Fig. 10: Rewards plotted against episodes for the four-agent model with deep Q-learning.

V. DISCUSSION

The first question this study sought to answer was whether RL can be used in a multi-agent situation, specifically, whether multiple agents can work simultaneously in the same environment while avoiding collisions with each other and with obstacles. Table II presents the percentages of episodes with collisions. This study did not find that it is possible to completely eliminate collisions between agents, but the relatively low percentage of collisions brings hope for further improvement and is worth investigating further.

The most obvious finding from the results is that both the Q-learning and deep Q-learning algorithms lead the agents to seek the highest reward, as seen in Figures 7, 8, 9 and 10, which means that after the initial phase of exploration the agents strive to reach their goals using the optimal policy. Although the average number of steps taken by the agents is slightly higher than 18 (the optimal policy) in the two-agent and four-agent models, it can be argued that all agents following the optimal policy in a multi-agent dynamic environment might cause a higher number of unwanted collisions. The fact that the agents find their optimal paths only in the static environment, see Table III, can be used to confirm the association between the policy and collisions, that is, the decreasing number of collisions is not simply a result of the agents finding the shortest path.

It is important to remember that the agents in the Q-learning algorithm with a Q-table and in the deep Q-learning algorithm have different death points. This is because this set-up was found to give the best results for each respective case. When both algorithms have the same death points, one of them does not converge or converges much more slowly because of the differences in how the algorithms work. The deep Q-learning algorithm only teaches agents to find their goals if the agents restart only when they reach their goals. The algorithm with a Q-table takes much more time to train if the agents do not restart when they collide with an obstacle or another agent. There is also a difference in the penalties agents get when they collide with each other. With the Q-table, both agents in a collision get a penalty, but in the deep Q-learning case only the agent that moves into another agent gets a penalty.

A. Single-agent model

When comparing the results between Q-learning and deep Q-learning in the single-agent model, it is evident from Figure 7 that the deep Q-learning algorithm converges faster than the Q-learning algorithm, but it is more unstable and never converges as completely as the Q-learning algorithm. This can also be seen in Tables II and III, where it is clear that the Q-learning algorithm reaches the optimal policy at 18 steps with 0% collisions with obstacles, while the deep Q-learning algorithm has a higher average number of steps and more collisions.

B. Two-agent model

In the two-agent situation, the deep Q-learning algorithm at first presents a better outcome than Q-learning. Even though the reward function is still very unstable and never really converges to one value, the learning process is noticeably shorter than for the Q-learning algorithm, as can be seen in Figures 8 and 9. However, the Q-learning algorithm performs better over a longer period of time when it comes to the percentage of collisions between agents. The most striking result to emerge from the data is that agents 1 and 2 in model B show very different outcomes in the number of collisions with obstacles, because agent 1 converges much faster than agent 2.

The rewards used were chosen to fit the deep Q-learning models and are therefore not optimal for the Q-table. With a higher reward for reaching the goal and a larger penalty for collisions, the agents converge faster in the Q-learning models; the deep Q-learning models, however, do not converge for those rewards. On average, agents in both cases were shown to seek an optimal path to reach their goal, and as mentioned above it is expected that the Q-learning algorithm would present an even better output if the simulation extended beyond 95·10³ episodes.

C. Four-agent model

When the number of agents increases to four, more collisions between agents occur, which was to be expected.


What is surprising is that the outcome of the four-agent model simulation is still comparable to both the single-agent model and the two-agent model with deep Q-learning. It can be observed that increasing the number of agents in a simulation does not reduce the algorithm's ability to learn quickly. The most interesting finding was that regardless of the number of agents, all of them still managed to cooperate and focus on finding the best path given the situation.

A note of caution is due here, since collisions still occur at an unsatisfactory level and may increase further if the number of agents were increased. This means that the method is not strictly reliable in the four-agent case. However, it seems possible that improving the outcome of the algorithm in the single-agent case would carry over into any large multi-agent model. Future studies on improving the deep Q-learning algorithm are therefore recommended.

D. Future work

In the two-agent and four-agent cases in this project, the aim of each agent is to maximize its own reward. For future work on this topic it would be interesting to study whether the agents would be more inclined to collaborate with each other if they were given a collective reward instead of individual rewards.

Further research might also explore ways of making the deep Q-learning algorithm more stable. A reasonable approach to tackle this issue could, for example, be to implement double deep Q-learning [15]. The double deep Q-learning algorithm is an improved version of the deep Q-learning algorithm introduced by Google DeepMind, designed to reduce overestimation.

It would also be interesting to use a larger environment with an increased number of agents to make the simulation more similar to a real warehouse situation. To mimic a warehouse even more closely, the agents' tasks would have to be more complicated, for example by having different goals and start positions in every episode, or by giving the agents a long sequence of tasks consisting of several smaller subtasks.

VI. CONCLUSION

The results of this study give insight into how a possible multi-agent system could be built. The main goal of the project was to determine whether RL algorithms can be used to model multi-agent systems. The most obvious finding to emerge from this study is that agents could learn to work simultaneously and seek their goals, provided that the stability of the deep Q-learning method is improved. It is clear that deep Q-learning methods are more suitable for this task than standard Q-learning implemented with a Q-table, especially if the system is large.

Whilst this study did not prove that it is definitely possible to use Q-learning or deep Q-learning as a core system for multi-agent models in a real-life situation, many opportunities exist for further research focused on finding a suitable method for that task. The promising results of this study encourage further research in the field, and a natural progression of this work is to analyse new deep learning methods as well as to improve existing ones.

Furthermore, multi-agent systems still receive comparatively little attention in machine learning, and the authors hope that this paper will contribute to increasing the interest in these particular systems.

APPENDIX
CODE

The Python code used for this project can be found on GitHub, https://github.com/hannakerek/KEX.

ACKNOWLEDGMENT

The authors would like to thank our supervisors, Prof. Alexandre Proutiere, Mina Ferizbegovic, Alexandros Nikou and Takuya Iwaki, for their input, feedback and guidance in this project.

A debt of gratitude is also owed to Morvan Zhou from the University of Technology Sydney for his complete and detailed tutorials on the subject of reinforcement learning [16] and his open source code for both Q-learning and deep Q-learning, which was used as a kernel for this project.

REFERENCES

[1] W. L. Hosch. (2016, September) Machine learning. Encyclopaedia Britannica, Inc. [Online]. Available: https://www.britannica.com/technology/machine-learning
[2] M. Gori, Machine Learning. Cambridge, MA: M. Kaufmann, 2018, ch. 1, pp. 2–58.
[3] V. F. Farias, C. C. Moallemi, B. Van Roy, and T. Weissman, “Universal reinforcement learning,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2441–2454, May 2010.
[4] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artificial Intelligence, vol. 112, pp. 181–211, Dec. 1998.
[5] K. Nitta. (2014, January) Decision making. Encyclopaedia Britannica, Inc. [Online]. Available: https://www.britannica.com/topic/decision-making
[6] E. Levin, R. Pieraccini, and W. Eckert, “Learning dialogue strategies within the Markov decision process framework,” in 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, Dec. 1997, pp. 72–79.
[7] H. M. Schwartz, “Single-agent reinforcement learning,” in Multi-Agent Machine Learning: A Reinforcement Approach. Hoboken, New Jersey: John Wiley & Sons, 2014, pp. 12–37.
[8] G. Neto, “From single-agent to multi-agent reinforcement learning: Foundational concepts and methods,” Learning theory course.
[9] C. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3, pp. 279–292, May 1992.
[10] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, December 2013.
[11] Google TensorFlow. (2018, March) Premade estimators for ML beginners. Google Brain Team, California. [Online]. Available: https://www.tensorflow.org/get_started/get_started_for_beginners
[12] F. Tan, P. Yan, and X. Guan, “Deep reinforcement learning: From Q-learning to deep Q-learning,” in Neural Information Processing, D. Liu, S. Xie, Y. Li, D. Zhao, and E.-S. M. El-Alfy, Eds. Cham: Springer International Publishing, 2017, pp. 475–483.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, February 2015.
[14] S. Pattanayak, “Introduction to deep-learning concepts and TensorFlow,” in Pro Deep Learning with TensorFlow. Berkeley, CA: Apress, 2017, pp. 106–107.
[15] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in AAAI, vol. 16, 2016, pp. 2094–2100.
[16] M. Zhou. (2016) GitHub repository. GitHub. [Online]. Available: https://github.com/MorvanZhou