


Stigmergic Independent Reinforcement Learning for Multi-Agent Collaboration

Xing Xu1, Rongpeng Li1*, Zhifeng Zhao2, Honggang Zhang1

1Zhejiang University, 2Zhejiang Lab
{hsuxing, lirongpeng, honggangzhang}@zju.edu.cn, [email protected]

Abstract—With the rapid evolution of wireless mobile devices, there emerges an increased need to design effective collaboration mechanisms between intelligent agents, so as to gradually approach the final collective objective through continuously learning from the environment based on their individual observations. In this regard, independent reinforcement learning (IRL) is often deployed in multi-agent collaboration to alleviate the problem of a non-stationary learning environment. However, behavioral strategies of intelligent agents in IRL can only be formulated upon their local individual observations of the global environment, and appropriate communication mechanisms must be introduced to reduce their behavioral localities. In this paper, we address the problem of communication between intelligent agents in IRL by jointly adopting mechanisms with two different scales. For the large scale, we introduce the stigmergy mechanism as an indirect communication bridge between independent learning agents, and carefully design a mathematical method to indicate the impact of the digital pheromone. For the small scale, we propose a conflict-avoidance mechanism between adjacent agents by implementing an additionally embedded neural network to provide more opportunities for participants with higher action priorities. In addition, we present a federal training method to effectively optimize the neural network of each agent in a decentralized manner. Finally, we establish a simulation scenario in which a number of mobile agents in a certain area move automatically to form a specified target shape. Extensive simulations demonstrate the effectiveness of our proposed method.

Index Terms—Collective Intelligence, Reinforcement Learning, Multi-Agent Collaboration, Stigmergy, Artificial Intelligence

I. INTRODUCTION

With the rapid development of mobile wireless communication and Internet of Things technologies, many scenarios have arisen in which collaboration between intelligent agents is required, such as in the deployment of unmanned aerial vehicles (UAVs) [1]–[3], distributed control in the field of industry automation [4]–[6], and mobile crowd sensing and computing (MCSC) [7], [8]. In these scenarios, traditional centralized control methods are usually impractical due to limited computational resources and the demand for ultra-low latency and ultra-high reliability. As an alternative, multi-agent collaboration technologies can be used in these scenarios to relieve the pressure on the centralized controller.

Guiding autonomous agents to act optimally through trial-and-error interaction with the corresponding environment is a primary goal in the field of artificial intelligence, and is regarded as one of the most important objectives of reinforcement learning (RL) [9]–[11]. Recently, deep RL (DRL), which combines RL and deep neural networks, has improved the ability to obtain information from high-dimensional input, such as high-resolution images, and has demonstrated extraordinary learning ability across a wide array of tasks [12]. There are a number of advanced DRL algorithms that can direct a single agent to improve its behavioral policy through continuously learning from the environment [13], [14]. However, the extension of single-agent DRL to multi-agent DRL is not straightforward, and many challenging problems remain to be solved in the application of multi-agent RL (MARL) [15], [16]. In particular, in a completely distributed multi-agent system (MAS), each agent is usually limited to partially observing the global environment, and its learning process following this local observation can thus be non-stationary, as other agents' behavioral policies may change temporally. In addition, the assignment of an individual reward is another challenging problem, as there is only one global reward for feedback in most cases. As an alternative, independent RL (IRL) has been proposed to alleviate the problem of a non-stationary learning environment, where each agent undergoes an independent learning process with only self-related sensations [17].

In IRL, most behavioral policies learned by intelligent agents are self-centered, aiming to maximize their own interests. Thus, the target of mutual communication is to integrate these individual behaviors effectively for the same task. Several studies have attempted to solve the problem of mutual communication in IRL. Foerster et al. [18] proposed differentiable inter-agent learning (DIAL), in which an additional communication action is added to the action set of each agent. In addition to the selection of the current action, a piece of inter-agent message is also generated and sent to other agents through a specified communication channel. Thereafter, a back-propagated error from the receiver of the communication messages is sent back to the sender to regulate the communication action. However, this type of communication channel exists between all pairs of independent learning agents, making DIAL very complex as the number of agents increases. Tan [19] tested sharing different types of messages (e.g., sensations, episodes, and learned policies) to coordinate intelligent agents on the basis of independent Q-learning. Despite its improvement in the final performance, the method has several over-optimistic assumptions that limit its potential application. For example, given the limited resources in a harsh environment, it is usually impractical for a single agent to transmit large messages in mobile wireless scenarios.



As an improvement, Mao et al. [20] proposed to utilize a coordinator network to aggregate compressed local messages and then share them among all agents. Because the shared messages contain joint information from all agents, the expected result of this design is to make each agent act optimally considering the other agents' behaviors. However, it is difficult to obtain a well-trained coordinator and local networks. In summary, appropriate communication mechanisms must be introduced between independent learning agents to reduce their behavioral localities [21].

As another approach, the concept of stigmergy was first introduced by French entomologist Pierre-Paul Grassé in the 1950s when studying the behavior of social insects [22], [23]. Recently, stigmergy has experienced rapid diffusion across a large number of application domains together with the popularization of distributed computing, collective intelligence, and broadband Internet [24]–[27]. In particular, stigmergy has demonstrated advantages in various scenarios requiring distributed control where the generation of messages is closely related to the environment, space, and time, such as the management of traffic lights [28], [29]. A key component of stigmergy is called the medium, which acts as the information aggregator in multi-agent collaboration. Benefiting from the existence of a medium, effective stigmergic interaction can be established between agents and their surrounding environment, and distributed agents can also interact with each other indirectly to reduce their behavioral localities. Therefore, stigmergy is a potential technique to solve the problem of mutual communication in IRL.

In addition to reaping the large-scale merits of stigmergy, we also propose a conflict-avoidance mechanism executed at smaller scales between adjacent agents to further reduce their behavioral localities. With this mechanism, we evaluate and assign a corresponding action priority to each agent, and a larger number of decision-making opportunities are provided for agents with higher action priorities. In particular, the action priority value is efficiently calculated by an additionally embedded neural network within each agent. Furthermore, to synchronously optimize the neural network of each agent, we apply a federal training method along with average optimization by improving the asynchronous advantage actor-critic (A3C) algorithm [13]. We summarize all the aforementioned techniques and propose the stigmergic IRL (SIRL) algorithm. Based on the simulation scenario in [30], in which a number of mobile agents move automatically to form a specified target shape, we examine the effectiveness of the proposed SIRL algorithm through an in-depth performance comparison with other available methods.

Focusing on the SIRL algorithm, the contributions of this paper can be summarized as follows:

• First, we introduce the stigmergy mechanism into MARL, and provide an effective cooperative algorithm. We also demonstrate that the stigmergy mechanism can decompose a global objective into small tasks that can be more efficiently perceived by individual agents.

• Second, we propose a conflict-avoidance mechanism to further reduce the behavioral localities of agents, whose foundation is an additionally embedded neural network in each agent.

• Third, we provide a federal training method by enhancing the A3C algorithm to synchronously optimize the neural network of each independent learning agent in an MAS.

The remainder of this paper is organized as follows. In Section II, we discuss the related work from the perspective of combining the stigmergy mechanism and MARL, and clarify the novelty of our work. In Section III, we present the system framework, followed by a description of the stigmergy mechanism and the proposed conflict-avoidance mechanism, and then introduce the federal training method. In Section IV, we mathematically analyze the details of the proposed SIRL algorithm. In Section V, we describe the simulation scenario, compare the numerical simulation results, and present key insights into these results. Finally, in Section VI, we conclude the paper.

II. RELATED WORK

A prototype of the stigmergy mechanism can be widely observed in natural colonies. Pagan et al. [31] described a colony of social insects as a super-organism with brain-like cognitive abilities. This super-organism consists of a large number of small insect brains coupled with appropriate cooperative mechanisms. Despite its limited size, the small brain of each insect is capable of performing an adaptive learning process, which is similar to RL [11], [32]. As a classical mechanism explaining the cooperative behavior of social insects [26], stigmergy also includes a large number of small-scale learning processes [33], [34] and records their effect on the surrounding environment with the involved medium.

The application of stigmergy has been studied in the field of MAS [24]–[29]. Stigmergy is generally used to coordinate the behavior of multiple agents to accomplish the target task more efficiently. In particular, the coordination process in most applications focuses on the maintenance of the digital pheromone, which is an important part of stigmergy, while the involved agent itself lacks the ability to learn the behavioral policy. For example, the coordination process in the classical ant colony optimization (ACO) algorithm [35] leads to an increased concentration of the correct pheromone; however, the behavioral policy of the involved agent is predetermined, that is, choosing among several concentrations in a probabilistic manner. In practice, this is effective in a bottom-up designed MAS [36], in which the behavioral policy of the involved agent can generally be predetermined and it is easier to predict the system's final performance. However, in many practical scenarios, the behavioral policies of the involved agents cannot be predetermined, and the agents must adjust their own policies while maintaining coordination.

In MARL, each agent can learn its behavioral policy through interaction with the surrounding environment. However, MARL also faces several challenges, such as the partial observation problem and the credit assignment problem [15], [16]. Several studies have attempted to combine stigmergy with MARL. Aras et al. [37] conceptually described how certain aspects of stigmergy can be imported into MARL, and defined an inter-agent communication framework.


Fig. 1. Framework of stigmergic independent reinforcement learning (SIRL).

Verbeeck et al. [38] investigated the impact of an ant algorithm (AA) on the extension of RL to an MAS, and explained that stigmergy is essential in systems in which agents do not fully cooperate. However, the above studies all lack available algorithms. For an agent to perform well in a partially observable environment, it is usually necessary to require its actions to depend on the history of observations [39]. Peshkin et al. [40] explored a stigmergic approach in which an external memory is created and included as part of the input to the involved agent. In addition, to optimize packet scheduling in routers, Bournenane et al. [41] presented a pheromone-Q learning approach that combines the Q-learning algorithm [42] with a synthetic pheromone that acts as a communication medium. Specifically, the synthetic pheromone is introduced into the Q-learning updating equation through a belief factor. However, the effectiveness of these methods should be verified in a more complex MAS. Verbeeck et al. [43] first proposed using learning automata to design and test stigmergic communication. Furthermore, Masoumi et al. [44] proposed to use the concept of stigmergy to calculate the reward received by each learning automaton (i.e., agent) to accelerate the learning process in Markov games. Similarly, Xu et al. [30] proposed to use the digital pheromone to coordinate the behavior of multiple agents attempting to form a target shape. In particular, the distribution of digital pheromones helps provide guidance for the movement of agents. However, none of the above-mentioned works incorporate the advantageous capability of RL into the involved agents' decision-making processes. In contrast to the above-mentioned studies, our work focuses on introducing DRL into the decision-making process of each involved agent.

Stochastic gradient descent based distributed machine learning algorithms have been studied in the literature in terms of both theoretical convergence analysis [45] and real-world experiments [46]. Moreover, in addition to the traditional distributed algorithms, Rui et al. [47] proposed to use the federated learning algorithm to reduce the communication cost to train one neural network. Federated learning distributes the training task (e.g., face recognition) among several devices and obtains a common neural network through the integration process. Similarly, Sartoretti et al. [48] proposed a distributed RL algorithm for decentralized multi-agent collection. This algorithm allows multiple agents to learn a homogeneous, distributed policy, where various agents work together toward a common target without explicit interaction. Sartoretti et al. [48] also demonstrated that the aggregation of experience from all agents can be leveraged to quickly obtain a collaborative behavioral policy that naturally scales to smaller and larger swarms. However, the above-mentioned distributed learning algorithms usually prohibit interactions among agents to stabilize the learning process, which may be harmful, especially in cases where agents do not fully cooperate. Therefore, in this study, the stigmergy mechanism is specifically incorporated into the proposed distributed learning algorithm to achieve effective collaboration between agents while allowing each agent to retain efficient RL ability.

III. SYSTEM FRAMEWORK

We present the framework of the SIRL mechanism in Fig. 1. In particular, each agent is designed to learn independently during the training phase, and is required to act automatically during the decentralized execution phase. Note that each agent can only observe the environment partially and locally. Therefore, as illustrated at the bottom of Fig. 1, we deploy stigmergy as an indirect communication bridge between independent learning agents, which represents an explicit feedback loop between agents and the medium. Furthermore, as illustrated in the center of Fig. 1, a conflict-avoidance mechanism is deployed among adjacent agents to further reduce their behavioral localities. At the top of Fig. 1, we introduce the federal training method by appending a virtual agent to effectively optimize the neural network of each agent. To more easily clarify the framework and process of the SIRL mechanism, we primarily consider the flight formation of UAVs to monitor a specific target area as a typical example. We assume that multiple UAVs (i.e., agents) are flying and collaborating to form a specific team shape to hover above the final target area, and that each UAV must determine its flying policy independently based on its limited local environmental information, such as its relative position to other UAVs. In this scenario, the proposed framework can be applied to improve the learning effectiveness in terms of the final similarity between the target team shape and the shape formed by the UAVs, or the aggregated cost of the UAVs.

A. Stigmergy Mechanism

In general, stigmergy consists of four main components (i.e., medium, trace, condition, and action), which together form a feedback loop between agents and their surrounding environment [22], as illustrated in Fig. 1. Note that the medium can also be regarded as part of the entire environment. Here, the environment and medium are represented separately by different parts in Fig. 1 to distinguish the traditional learning environment in RL from that utilized in the stigmergy mechanism. In addition, a trace (i.e., digital pheromone) is normally left by an agent in the medium as an indicator of the environmental change resulting from its action. Several traces left by different agents in the medium can diffuse and further mix in a spontaneous manner [26]. Then, the variation pattern of these digital pheromone traces is returned as the inter-influence to other agents for their subsequent actions, while the amplitude of this inter-influence is largely related to the inter-distance between agents [30].

In addition, stigmergy can also serve as a potential solution to the decomposition of a global objective. For instance, the construction of a termite nest requires the cooperation of the entire colony, which usually spans several generations. However, a single termite in this colony is unaware of the global objective (i.e., building a termite nest) due to the limited size of its brain. The cooperative mechanism utilized by this colony must therefore be able to decompose the global objective into several small tasks that can be perceived by a single termite. Thus, in addition to realizing indirect communication, stigmergy can also implicitly decompose the global objective to help obtain individual rewards in multi-agent collaboration.

B. Conflict-Avoidance Mechanism

Conflicts between the actions of different agents may arise from the competition for limited task resources during multi-agent collaboration.

Fig. 2. Intuitive schematic of the conflict-avoidance mechanism.

To reduce the number of conflicts and, at the same time, minimize the amount of data that agents need to transmit during the collaboration process, we propose a conflict-avoidance mechanism that calculates action priorities for different agents. As illustrated in Fig. 2, Agent 1 (or 2) represents an independent learning agent in Fig. 1. In particular, there are two different internal neural network modules in each agent: the Evaluation Module and the Behavior Module. The Evaluation Module is used to efficiently calculate the action priority of an agent at the current local state, which is further used to compete for the action opportunity. The Behavior Module is used to select appropriate actions for an agent according to the local input state when it obtains the action opportunity. Note that the action policy from the Behavior Module may be self-centered, signifying that each agent may be trapped in its local optimality while ignoring the global objective. Thus, the conflict-avoidance mechanism enabled by the Evaluation Module can help eliminate self-centered action policies to facilitate collaboration.

With the conflict-avoidance mechanism, each agent passes through two steps to determine its action at the current state. As illustrated in Fig. 2, in the first step, the input related to the current state of an agent is first sent to the Evaluation Module to calculate the action priority. The action priority value is then compared with that of nearby agents through a coordination channel, and a priority list within a small range can be obtained. In the second step, the same input is sent to the Behavior Module for selecting an appropriate action only if the agent has the highest action priority in the first step. Otherwise, the agent remains silent.
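To make the two-step procedure concrete, a minimal Python sketch is given below. It abstracts the Evaluation and Behavior Modules as generic callables and assumes that priorities can be exchanged instantaneously over the coordination channel; all function and variable names are illustrative and not part of the paper.

```python
import numpy as np

def conflict_avoidance_step(agents, neighbors, eval_module, behavior_module):
    """Minimal sketch of the two-step conflict-avoidance rule.

    `agents` maps agent id -> local state vector; `neighbors` maps agent id ->
    ids within its coordination range (e.g., the Moore neighborhood).
    `eval_module(s)` is assumed to return a scalar priority and
    `behavior_module(s)` a vector of action scores.
    """
    # Step 1: every agent computes its action priority locally.
    priority = {i: float(eval_module(s)) for i, s in agents.items()}

    actions = {}
    for i, s in agents.items():
        # Exchange priorities over the coordination channel and build a local list.
        local_list = [priority[j] for j in neighbors[i]] + [priority[i]]
        # Step 2: only the locally highest-priority agent queries its Behavior
        # Module and acts; the others remain silent (ties are not handled here).
        if priority[i] >= max(local_list):
            actions[i] = int(np.argmax(behavior_module(s)))  # greedy action index
    return actions
```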

C. Federal Training Method

Despite its success in the training of deep Q-learning networks [12], [49], experience replay may not be as effective in an MAS. Since a single agent in an MAS faces different tasks or situations, samples stored in the experience pool may not adapt to these changes. Recently, an asynchronous method for the advantage actor-critic (A2C) algorithm has been proposed, and its advantages have been verified through training human-level players for Atari games [13], [50], [51]. Here, we further propose a federal training method by improving the asynchronous A2C (i.e., A3C) algorithm to synchronously optimize the neural network of each agent through the average optimization of neural network gradients.

In this federal training method, each agent attempts to optimize its neural network not only through self-related experience, but also through neural network gradients from other collaborative teammates. Suppose that the number of active agents participating in the collaboration at time step $t$ is $N_t$, and $N_t \leq N$, where $N$ denotes the total number of agents. Moreover, $N_t$ can also represent the number of agents that have obtained action opportunities through the conflict-avoidance mechanism. Gradients from these participating agents naturally form a mini-batch whose functionality is similar to that in experience replay, and may be even more uncorrelated because they are sampled from different situations. Therefore, as illustrated in the right part of Fig. 1, a virtual agent is designed and added into SIRL, aiming to collect the local gradients of the involved agents for the average optimization. This virtual agent has the same neural network structure as the other agents, but takes no action.

IV. SIRL ALGORITHMS FOR MULTI-AGENT COLLABORATION

In this section, we provide additional details and mathematical formulations regarding the three above-mentioned mechanisms. We assume that $N$ agents are located in an area space and are dedicated to collaboratively fulfilling a specific task (e.g., UAVs forming a particular position shape). As illustrated in Figs. 1 and 2, at each time step $t$, each agent $i$ receives a local state $s_t^{(i)}$ from the environment's local state space $S$. Depending on the local state, an action $a_t^{(i)}$ is selected from the individual action set $A$ by each of the $N_t$ agents. After the selected action is performed, an individual reward $r_t^{(i)}$ is returned to each participating agent to calculate the neural network gradients according to the loss function, $Loss(\theta_t^{(i)})$, so as to adjust its neural network parameters $\theta_t^{(i)}$. Specifically, the local state received by each agent is impacted by some particular environmental attractors, which are selected according to the digital pheromone, as discussed in Section IV-A. Furthermore, because the selected action of each agent is influenced by its priority at the current local state, we discuss the related calculation method of this action priority in Section IV-B. Finally, we present the entire training process of the two neural network modules of each agent using the federal training method. The main notations used in this paper are listed in TABLE I.

A. Attractor Selection within the Stigmergic State Space

The effectiveness of stigmergy can be improved by utilizing the digital pheromone [2]. Unlike chemical pheromones left by ants in natural colonies, digital pheromones generated by intelligent agents are virtual and can be represented by several informative records in memory with attributes such as value, time, and location [52]. Furthermore, during swarm foraging, most ants are attracted to these chemical signals, whose distribution naturally forms a pheromone map between the food destination and nest.

TABLE I
MAIN NOTATIONS USED IN THIS PAPER.

Notation   Definition
N          Number of agents
N_t        Number of active agents at time t
S          Local state space
A          Individual action set
s          Local state
a          Selected action
π          Action policy
r          Individual reward
R          Accumulated individual reward
r̄          Deterministic individual reward
R̄          Accumulated deterministic individual reward
l          Learning rate
e          State value network within Evaluation Module
ē          Target state value network within Evaluation Module
p          Policy network within Behavior Module
b          State value network within Behavior Module
b̄          Target state value network within Behavior Module
V          State value
θ          Neural network parameters
d          Distance between agent and attractor
D          Distance-dependent function
ε          Amount of digital pheromone
ξ          Attractor set
r_m        Individual reward from medium
v          Virtual agent
SI         Similarity

Similarly, a digital pheromone map that contains the distribution of digital pheromones, providing relevant information of the state space in the entire activity area, is also deployed in SIRL. The entire digital pheromone map can be stored in a centralized manner in the virtual agent, or can be split into several parts and stored in a decentralized manner in several specified agents [2]. Moreover, the digital pheromone map is continuously updated by mutual communication between its maintainer (e.g., the virtual agent) and other agents in the activity area.

In SIRL, the digital pheromone is regarded as the trace left by an agent in the medium, while the digital pheromone map is regarded as the corresponding medium. As indicated by the dynamic features of the medium in Fig. 1, the digital pheromone experiences different evolution processes in the medium, which should be carefully designed to make the returned conditions more effective for each agent. Inspired by phenomena in natural colonies, in which chemical pheromones left by different ants can be superposed to increase the total influence, we model the accumulation of digital pheromones with different sources as a linear superposition. Moreover, instead of being restricted to a single area, a digital pheromone with a larger amount will diffuse into surrounding areas. Furthermore, the amount of digital pheromone will decay over time. Therefore, the maintainer of the digital pheromone map should follow three key principles: 1) linearly superposing digital pheromones with different sources in the same area, 2) diffusing digital pheromones into surrounding areas at a small scale with a fixed diffusion rate after a new digital pheromone has been left, and 3) decreasing the amount of digital pheromone at positions already occupied by agents with a fixed decay rate. Note that the decay and diffusion rates are both constants between 0 and 1.

With a digital pheromone map for the stigmergic state space, each agent can sense the amount of digital pheromone within a certain range. Without loss of generality, we take the aforementioned UAV application scenario as an example. Here, we regard any block (i.e., unit area) filled with the digital pheromone as an attractor in the local environment, which has an attractive effect on any nearby mobile agent for efficiently observing the local state space. Similar to the classical ACO algorithm, within the local state space, each intelligent agent can independently perform its action (i.e., approaching an attractor) by selecting a suitable attractor within its sensing range from several potential attractor candidates, which can be expressed by

$$C_{i,j}(t) = \frac{D(d_{i,j}(t)) \cdot \varepsilon_j(t)}{\sum_{j \in \xi_i(t)} D(d_{i,j}(t)) \cdot \varepsilon_j(t)}, \qquad (1)$$

where $C_{i,j}(t)$ is the probability of agent $i$ selecting attractor $j$; $\varepsilon_j(t)$ is the total amount of digital pheromone in attractor $j$ at time $t$; $\xi_i(t)$ is the set of attractors within the sensing range of agent $i$; $d_{i,j}(t)$ is the Euclidean distance between agent $i$ and attractor $j$; and $D(\cdot)$ is a monotone function used to reduce the effect of the digital pheromone as the inter-distance $d_{i,j}(t)$ increases [30], which is intuitively illustrated at the bottom of Fig. 1. This attractor selection method is inspired by the classical ACO algorithm [35]. However, we use the function $D(\cdot)$ to replace the heuristic factor of the original ACO algorithm, which is commonly used to account for the constraints of an optimization problem, so as to improve the selection process. The function $D(\cdot)$ can cause an agent to pay more attention to nearby influential attractors while avoiding the so-called ping-pong effect in the local environment. In addition, selecting attractors in a stochastic manner can cause agents to perform actions (i.e., approaching target positions) associated with a smaller amount of digital pheromone, and can prevent a large number of agents from crowding in a small local environment.
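As a concrete illustration of the selection rule in (1), the following Python sketch samples an attractor for one agent, using the Gaussian instantiation of $D(\cdot)$ given later in (12). The array-based interface and the small numerical-safety epsilon are assumptions of this sketch, not part of the paper.

```python
import numpy as np

def gaussian_influence(d, a2=1.0, b2=0.0, c1=0.25):
    # Distance-dependent function D(.) as instantiated in Eq. (12) of Section V;
    # the default a2, b2, c1 follow the values reported there.
    return a2 * np.exp(-((d - b2) ** 2) / (2.0 * c1 ** 2))

def select_attractor(distances, pheromone, rng=None):
    """Sample one attractor index according to the probabilities of Eq. (1).

    `distances[j]` plays the role of d_{i,j}(t) and `pheromone[j]` of eps_j(t)
    for the attractors inside the agent's sensing range xi_i(t).
    """
    rng = rng or np.random.default_rng()
    distances = np.asarray(distances, dtype=float)
    pheromone = np.asarray(pheromone, dtype=float)
    weights = gaussian_influence(distances) * pheromone + 1e-12  # epsilon only for numerical safety
    probs = weights / weights.sum()                              # C_{i,j}(t) of Eq. (1)
    return int(rng.choice(len(probs), p=probs))                  # stochastic choice, as in ACO
```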

The location of the selected attractor serves as an informative part of the input local state for each agent. Depending on the two neural network modules of each agent, an action is selected according to the input local state. Furthermore, following the stigmergic principle, any agent that has performed its selected action accordingly leaves an additional digital pheromone in the medium to provide new condition information for the subsequent selection of attractors. This process can be expressed by

$$\varepsilon_j(t+1) = \begin{cases} \varepsilon_j(t) + a_1, & \text{if } \varepsilon_j \text{ is in the labeled area;} \\ \varepsilon_j(t) \cdot b_1, & \text{otherwise,} \end{cases} \qquad (2)$$

where $a_1$ represents the fixed amount of digital pheromone left by an agent at a time, and $b_1$ is a discount constant between 0 and 1 that helps gradually remove useless attractors. The labeled area indicates that the agent has partially fulfilled a small task.
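A toy implementation of the medium is sketched below, combining the trace update of (2) with the three maintenance principles (linear superposition, diffusion, and decay). The grid API, the four-neighbor diffusion pattern, and the default parameter values are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

class PheromoneMap:
    """Toy grid medium following Section IV-A and the trace update of Eq. (2)."""

    def __init__(self, shape, a1=1.0, b1=0.9, diffusion=0.1, decay=0.9):
        self.eps = np.zeros(shape)      # eps_j(t): pheromone amount per block
        self.a1, self.b1 = a1, b1
        self.diffusion, self.decay = diffusion, decay

    def leave_trace(self, pos, labeled):
        """Eq. (2): deposit a1 in the labeled area, otherwise discount by b1."""
        if labeled:
            self.eps[pos] += self.a1
            self._diffuse(pos)          # diffusion is triggered by a new deposit
        else:
            self.eps[pos] *= self.b1

    def _diffuse(self, pos):
        """Principles 1-2: add a small share of the local amount to each of the
        four neighbors, superposing linearly with what is already there
        (a deliberately simplified, non-conservative diffusion step)."""
        r, c = pos
        share = self.diffusion * self.eps[pos]
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < self.eps.shape[0] and 0 <= nc < self.eps.shape[1]:
                self.eps[nr, nc] += share / 4.0

    def decay_occupied(self, occupied_positions):
        """Principle 3: decay the pheromone at positions already occupied by agents."""
        for pos in occupied_positions:
            self.eps[pos] *= self.decay
```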

B. Action Priority Determination

In this subsection, we discuss the methods used to calculate the action priority for the conflict-avoidance mechanism. Corresponding to the two neural network modules in Fig. 2, we exploit two algorithms to optimize their parameters, respectively. First, for the Evaluation Module, we define an internal state value network whose output is the expected accumulated deterministic individual reward at $s_t^{(i)}$:

$$V_e(s_t^{(i)}; \theta_e^{(i)}) = \mathbb{E}\big[\bar{R}_t^{(i)} \,\big|\, s_t^{(i)} = s,\ a_t^{(i)} = (a; \theta_p^{(i)})\big], \qquad (3)$$
$$a_t^{(i)} = \arg\max_{a \in A} \pi(s_t^{(i)}, a; \theta_p^{(i)}),$$

where $s_t^{(i)}$ represents the local state observed by agent $i$ at time $t$. Subscript $e$ represents the state value network in the Evaluation Module, and $\theta_e^{(i)}$ denotes the related parameters. Moreover, $V_e$ is the state value of $s_t^{(i)}$, which is regarded as the action priority of agent $i$ at the current local state. $a_t^{(i)}$ denotes the selected action toward the chosen attractor of agent $i$ at time $t$. Furthermore, subscript $p$ represents the policy network in the Behavior Module, while $\pi$ represents its action policy. Note that the evaluation of the action priority at $s_t^{(i)}$ is based on the deterministically executed action at the same local state, and the returned individual reward during the training process of the state value network in the Evaluation Module is also deterministic. Therefore, we define $\bar{R}_t^{(i)}$ as the accumulated deterministic individual reward of agent $i$ at time $t$, which is calculated by

$$\bar{R}_t^{(i)} = \begin{cases} \bar{r}_t^{(i)} + \gamma_2 \cdot V_{\bar{e}}(s_{t+1}^{(i)}; \theta_{\bar{e}}^{(i)}), & \text{if } s_{t+1}^{(i)} \text{ is non-terminal;} \\ \bar{r}_t^{(i)}, & \text{otherwise,} \end{cases} \qquad (4)$$

where $\gamma_2$ is a discount factor, and $\bar{r}_t^{(i)}$ is the returned deterministic individual reward. Subscript $\bar{e}$ represents the target state value network of agent $i$, whose parameters and output are represented by $\theta_{\bar{e}}^{(i)}$ and $V_{\bar{e}}$, respectively. The target state value network is used to calculate the state value of the new input local state $s_{t+1}^{(i)}$ and further help calculate the accumulated deterministic individual reward, as illustrated in the first line of (4). Moreover, the target state value network is almost the same as the state value network in the Evaluation Module except that its parameters are periodically copied from the original state value network [12]. Finally, the loss function of the state value network in the Evaluation Module can be expressed as

$$Loss(\theta_e^{(i)}) = 0.5 \cdot \big[\bar{R}_t^{(i)} - V_e(s_t^{(i)}; \theta_e^{(i)})\big]^2. \qquad (5)$$

For the Behavior Module, we use the A2C algorithm to optimize its parameters [50]. In particular, there are a policy network and a state value network in the Behavior Module that share the same input local state from the local environment, including the stigmergic state space with attractor indexing. Their loss functions, which are used to calculate the gradients of the neural network parameters, are respectively expressed as


$$Loss(\theta_p^{(i)}) = -\log \pi(a_t^{(i)} \mid s_t^{(i)}; \theta_p^{(i)}) \big(R_t^{(i)} - V_b(s_t^{(i)}; \theta_b^{(i)})\big), \qquad (6)$$

$$Loss(\theta_b^{(i)}) = 0.5 \cdot \big[R_t^{(i)} - V_b(s_t^{(i)}; \theta_b^{(i)})\big]^2, \qquad (7)$$

where subscripts $p$ and $b$ represent the policy and state value network in the Behavior Module, respectively. The local state $s_t^{(i)}$ is sent to two different neural networks (i.e., the policy network and the state value network) to calculate the corresponding output. For the policy network, the output is the action policy $\pi$. We select each action in a probabilistic manner during the training phase of the policy network; thus, the returned individual reward is stochastic. For the state value network, the output is the state value of $s_t^{(i)}$ (i.e., $V_b$), which is used to calculate the advantage (i.e., $R_t^{(i)} - V_b(s_t^{(i)}; \theta_b^{(i)})$) to speed up the convergence of the action policy in the parallel policy network. Accordingly, $R_t^{(i)}$ is the accumulated individual reward of agent $i$, which can be expressed as

$$R_t^{(i)} = \begin{cases} r_t^{(i)} + \gamma_1 \cdot V_{\bar{b}}(s_{t+1}^{(i)}; \theta_{\bar{b}}^{(i)}), & \text{if } s_{t+1}^{(i)} \text{ is non-terminal;} \\ r_t^{(i)}, & \text{otherwise,} \end{cases} \qquad (8)$$

where $r_t^{(i)}$ is the individual reward received by agent $i$ at time $t$, and $\gamma_1$ is a discount factor that can normally be set to a constant between 0 and 1 (e.g., 0.9). Similarly, subscript $\bar{b}$ represents another target state value network of agent $i$. Note that under the conflict-avoidance mechanism, any agent with the highest action priority will have the opportunity to perform the selected action. Because the execution order of different actions is arranged by their accumulated deterministic individual rewards, this estimation method of action priority is expected to obtain a large global reward.
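A corresponding sketch for the Behavior Module combines (6)–(8). It again assumes PyTorch modules, with the policy network producing a one-dimensional vector of action logits; it is illustrative rather than the paper's exact implementation.

```python
import torch

def behavior_module_losses(policy_net, value_net, target_net,
                           s_t, a_t, s_next, r, gamma1=0.9, terminal=False):
    """Sketch of Eqs. (6)-(8): advantage actor-critic losses for the Behavior Module."""
    with torch.no_grad():
        if terminal:
            R = torch.as_tensor(r, dtype=torch.float32)             # Eq. (8), second case
        else:
            R = r + gamma1 * target_net(s_next).squeeze()           # Eq. (8), first case

    V = value_net(s_t).squeeze()
    logits = policy_net(s_t).squeeze()                              # assumed 1-D action logits
    log_pi = torch.log_softmax(logits, dim=-1)[a_t]                 # log pi(a_t | s_t)
    advantage = (R - V).detach()                                    # R_t - V_b(s_t), no gradient

    policy_loss = -log_pi * advantage                               # Eq. (6)
    value_loss = 0.5 * (R - V) ** 2                                 # Eq. (7)
    return policy_loss, value_loss
```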

C. Training Neural Network Modules

Under the conflict-avoidance mechanism, the Evaluation and Behavior Modules work together to obtain an individual reward, and forge a cooperative relationship. Furthermore, according to (3), the estimation of the action priority is also based on the deterministic action policy of the Behavior Module. Therefore, there are two successive sessions in each training round (i.e., training epoch). As shown in the left part of Fig. 3, we freeze the parameters of the Behavior Module in the training session of the Evaluation Module. Similarly, we freeze the parameters of the Evaluation Module in the training session of the Behavior Module, as indicated in the right part of Fig. 3. The federal training method is applied in both sessions, where a virtual agent is also employed. Furthermore, we set a terminal condition for both training sessions: each session is stopped when the number of internal time steps reaches its maximum or when the global reward is positive. The performance of multi-agent collaboration is represented by the global reward, whose improvement can implicitly indicate the convergence of the current training of the neural network modules. In Fig. 3, $t$ denotes the index of time steps, while $t_{max}$ represents its maximum; $t$ is set to 0 at the beginning of a training session.

Fig. 3. Two successive sessions in each training round.

During each training round, a sample is sent to both sessions to optimize the different neural network modules.

In MARL, the global reward can typically be used to indicate the performance of multi-agent collaboration. However, it usually cannot be directly used as the individual reward received by an agent, as the global reward is determined by a series of actions from different agents. In general, the individual reward can be obtained through a reasonable decomposition of the global objective. In SIRL, after the selected action is performed, each agent receives an individual reward from the medium. Because the objective of each agent is to approach its selected attractor in the stigmergic state space, we define the returned individual reward as the Euclidean measure (i.e., the inter-distance change) between the position of each agent and its selected attractor, which can be expressed as

$$r_m^{(i)}(t) = \rho_1 \cdot \max\big([d_{i,j}(t-1) - d_{i,j}(t)],\ 0\big), \qquad (9)$$

where the subscript $m$ represents the medium in SIRL, and $r_m^{(i)}(t)$ represents the individual reward received by agent $i$ from the medium at time $t$. In addition, $\rho_1$ is a scalar factor, while $d_{i,j}(t)$ represents the inter-distance between agent $i$ and its selected attractor $j$, where $j \in \xi_i(t-1)$. Note that the reward $r_m^{(i)}(t)$ is obtained due to the implementation of the digital pheromone, which indicates the decomposition process of the stigmergy mechanism for the global objective. In particular, during each training round, $r_m^{(i)}(t)$ is set to $\bar{r}_t^{(i)}$ in (4) during the training session of the Evaluation Module, and to $r_t^{(i)}$ in (8) during the training session of the Behavior Module.
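For reference, the reward of (9) reduces to a one-line computation; the default value of $\rho_1$ below is only a placeholder, not a value given in the paper.

```python
def medium_reward(d_prev, d_now, rho1=1.0):
    """Sketch of Eq. (9): the medium rewards an agent in proportion to how much
    it reduced the distance to its selected attractor (never negative)."""
    return rho1 * max(d_prev - d_now, 0.0)
```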

Furthermore, as illustrated in Figs. 1 and 3, the federal training method is used to optimize the parameters of the different neural network modules through the average optimization:

$$\theta_{t+1}^{(v)} = \theta_t^{(v)} + v_t^{(v)}, \qquad (10)$$

$$v_t^{(v)} = \rho \cdot v_{t-1}^{(v)} - l_t \cdot \frac{1}{N_t} \sum_{i=1}^{N_t} \frac{\partial Loss(\theta_t^{(i)})}{\partial \theta_t^{(i)}}, \qquad (11)$$

where the superscript $v$ represents the virtual agent. At time step $t = 0$, $v_0^{(v)}$ is set to 0. $\rho$ is a momentum factor, while $l_t$ denotes the learning rate of the parameters. The federal training method is inspired by the A3C algorithm; however, it applies a synchronous updating method, which has been demonstrated to have a lower error bound [53].



Fig. 4. Moving mode and digital pheromone sensing range of a single mobile agent in the activity area.

Moreover, we add a momentum factor to the updating process to accelerate convergence. Because the virtual agent has the same neural network structure as the other agents, it can be used to optimize the parameters of the different neural network modules. For simplicity, we use $\theta_t^{(i)}$ in (11) to represent the parameters of either the Evaluation or Behavior Module of agent $i$, and $\theta_t^{(v)}$ to represent the parameters of the same neural network module of the virtual agent. In particular, for the current training of the neural network module, the gradients of the involved neural network parameters are first calculated in these $N_t$ agents according to the corresponding loss function, $Loss(\theta_t^{(i)})$, and are then sent to the virtual agent for the average optimization. Finally, the newly calculated parameters $\theta_{t+1}^{(v)}$ are sent back to all agents to update their neural network modules. The entire SIRL algorithm can be divided into a training part and a testing part, which are detailed in Algorithms 1 and 2, respectively.
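Under the assumption that every participating agent's gradients arrive as a list of tensors (one per parameter), the virtual agent's averaging step of (10)–(11) might look as follows; the container layout and hyperparameter values are illustrative.

```python
import torch

def federal_update(virtual_params, momentum_buf, agent_grads, lr=1e-3, rho=0.9):
    """Sketch of Eqs. (10)-(11): the virtual agent averages the gradients sent by
    the N_t participating agents, applies a momentum step, and returns the new
    parameters that are broadcast back to every agent.

    `virtual_params` and `momentum_buf` are lists of tensors; `agent_grads` is a
    list (one entry per agent) of per-parameter gradient lists.
    """
    new_params, new_buf = [], []
    for k, theta in enumerate(virtual_params):
        # (1/N_t) * sum_i dLoss(theta_i)/dtheta_i: the averaged gradient of Eq. (11)
        avg_grad = torch.stack([g[k] for g in agent_grads]).mean(dim=0)
        v = rho * momentum_buf[k] - lr * avg_grad        # Eq. (11), momentum step
        new_buf.append(v)
        new_params.append(theta + v)                     # Eq. (10)
    return new_params, new_buf
```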

V. EXPERIMENTAL SIMULATION SETTINGS AND NUMERICAL RESULTS

In this section, we create a simulation scenario in which a certain number of mobile agents (e.g., UAVs) in an activity area automatically form a specified team shape. In this simulation scenario, the target team shape is transferred through the binary normalization of an image taken from the standard Modified National Institute of Standards and Technology (MNIST) dataset [54]. The MNIST dataset is a large set of handwritten digits that is commonly used for training various image processing systems. In addition, it is widely used for training and testing in the field of machine learning. Specifically, each image in this dataset has a standard size (i.e., 28 × 28). As illustrated in Fig. 4, an agent is represented by a non-zero block (i.e., a pixel, as in a digitized binary image). The white block (i.e., unit area) represents an agent, while black blocks represent the potential positions the mobile agent can visit. We set the distance between any two adjacent blocks to 1. At the beginning, all mobile agents are distributed randomly across the entire activity area (i.e., the entire image). The experimental objective in this scenario can be mapped to many real-world multi-agent collaboration tasks, such as the flight formation of UAVs monitoring a certain target area.

First, the simulation settings for the involved agents and the activity area are described as follows:

• Each agent is supposed to move a block at each time step toward one of four directions: 1) Up, 2) Down, 3) Left, and 4) Right.

• Each agent can leave the digital pheromone in the current position and sense the amount of digital pheromones in the surrounding blocks within a certain range.

• Each agent can identify whether the occupied position is labeled or unlabeled.

• Each agent can sense the existence of neighbors in the Up, Down, Left, and Right directions to avoid collision.

• Each block in the activity area can only be occupied by one agent at a time.

• The activity area can be classified as labeled and unlabeled, where the former corresponds to the target team shape to be formed.

The moving mode of a single agent is illustrated in Fig. 4(a). In Fig. 4(b), the white dotted line represents the boundary of the digital pheromone sensing range of that agent. Brown blocks indicate the presence of digital pheromones, and colors with different shades represent distinct amounts of the pheromone. In addition, the coordination channel for conflict-avoidance exists between any central agent and its eight possible neighbors, which are also called Moore neighbors in mobile cellular automata [55]. In the experimental simulation, we use a Gaussian function to play the role of $D(\cdot)$ in (1), which can be expressed as

$$D(d_{i,j}(t)) = a_2 \cdot \exp\left(-\frac{(d_{i,j}(t) - b_2)^2}{2c_1^2}\right), \qquad (12)$$

where $a_2$ represents the peak value and is set to 1; $b_2$ is the mean value and is set to 0; and $c_1$ represents the standard deviation and is normally set to 0.25.

The similarity, $SI$, is calculated as the ratio of the number of agents that end up in the labeled area to the total number of agents in the activity area, which is in turn determined by the number of non-zero pixels required to form the target shape. In the training part, we define the increase in similarity after each time step (i.e., $\Delta SI$) as the global reward, which can be positive or negative. In addition, similar to the settings in [30], we use the position distribution of all agents after a certain number of iterations as the swarm's initial state, which can also be regarded as a sample. Approximately 7,500 samples are extracted from different iterations in this simulation during the formation of the target shape "4." In particular, the new position distribution of all agents is returned after each time step to calculate both the global and individual rewards so as to further optimize the neural network modules. Note that the neural network modules trained by this sample set can also be used in the testing process to form other shapes (e.g., "1," "2," "0," "6," and "8").
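A minimal sketch of the similarity metric and the resulting global reward is given below, assuming a boolean grid for the labeled area and (row, col) agent positions; both inputs are illustrative.

```python
import numpy as np

def similarity(agent_positions, labeled_mask):
    """Sketch of SI: the fraction of agents that currently sit inside the
    labeled (target-shape) area."""
    inside = sum(1 for pos in agent_positions if labeled_mask[pos])
    return inside / len(agent_positions)

# The global reward used during training is the per-step change Delta SI, e.g.:
# global_reward = similarity(positions_t1, labeled_mask) - similarity(positions_t0, labeled_mask)
```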

Furthermore, the Evaluation and Behavior Modules share the same input local state, which is denoted by a vector with seven elements. The first four elements, represented by bit numbers, are used to confirm whether there are neighbors in the four adjacent positions: Up, Right, Down, and Left. The fifth and sixth elements are used to describe the relative position of the selected attractor in a two-dimensional plane, and the seventh element is used to confirm whether the currently occupied position is labeled or unlabeled. The individual action set $A$ contains five different actions: Up, Right, Down, Left, and Stop. However, since the local state received by each agent contains the relative positions of adjacent agents, the recorded new input local state $s_{t+1}^{(i)}$ may be inaccurate at time step $t+1$ due to the possible movement of adjacent agents at time step $t$, thus leading to an inaccurate estimation of $V_{\bar{e}}(s_{t+1}^{(i)}; \theta_{\bar{e}}^{(i)})$ in (4). In the following simulation, we use different values of $\gamma_2$ to test this phenomenon.
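The seven-element local state and the five-action set described above can be encoded, for example, as follows; the argument names and any ordering beyond what the text specifies are assumptions of this sketch.

```python
import numpy as np

def encode_local_state(neighbor_up, neighbor_right, neighbor_down, neighbor_left,
                       attractor_dx, attractor_dy, on_labeled):
    """Sketch of the seven-element local state: four occupancy bits
    (Up, Right, Down, Left), the attractor's relative position in the plane,
    and one bit indicating whether the occupied block is labeled."""
    return np.array([neighbor_up, neighbor_right, neighbor_down, neighbor_left,
                     attractor_dx, attractor_dy, on_labeled], dtype=np.float32)

# Individual action set A (five discrete actions):
ACTIONS = ["Up", "Right", "Down", "Left", "Stop"]
```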


Algorithm 1 The training part of SIRL.
Input: Agents with the neural network modules, the sample set, the target shape, number of training rounds;
Output: Agents with the well-trained neural network modules;
1: Initialize the whole activity area, the digital pheromone map, the labeled area, the neural network modules within each agent, diffusion rate, decay rate, t_max, time step t, the range of sensing digital pheromone, the range of coordination channel;
2: for each training round do
3:   //* Training session of Evaluation Module *//
4:   Select a sample randomly from the sample set and initialize the location of agents according to the selected sample;
5:   while t ≤ t_max do
6:     for each agent do
7:       Select an attractor according to (1) and form the input local state;
8:       Send the local state to Evaluation Module;
9:       Send the same local state to Behavior Module and select an action with the largest probability;
10:      Perform the action;
11:      Modify the digital pheromone at the current position according to (2);
12:      if the extra digital pheromone is left then
13:        Diffuse the digital pheromone to the surrounding areas with the fixed diffusion rate;
14:        Superpose the amount of digital pheromone at the same position linearly;
15:      end if
16:      Calculate the individual reward according to (9);
17:      Select a new attractor according to (1) and form the new input local state;
18:    end for
19:    Decay the amount of digital pheromone at positions already occupied by agents with the fixed decay rate;
20:    if the calculated global reward > 0 then
21:      Break;
22:    else
23:      for each agent do
24:        Calculate the gradients $\partial Loss(\theta_e^{(i)})/\partial \theta_e^{(i)}$ of the state value network within Evaluation Module according to (4)-(5) with the new input local state;
25:        Send the calculated gradients to the virtual agent;
26:      end for
27:      The virtual agent receives the gradients from agents and optimizes the internal state value network within Evaluation Module according to (10)-(11);
28:      The virtual agent sends back the calculated parameters $\theta_{t+1}^{(v)}$ to all agents;
29:      Each agent updates the state value network within Evaluation Module with $\theta_{t+1}^{(v)}$;
30:    end if
31:  end while
32:  //* Training session of Behavior Module *//
33:  Initialize the location of agents according to the selected sample;
34:  while t ≤ t_max do
35:    for each agent do
36:      Select an attractor according to (1) and form the input local state;
37:      Send the local state to Evaluation Module and calculate the action priority;
38:      Send out the action priority through the coordination channel and receive the returned priority list;
39:      if the own action priority is the largest then
40:        Send the same local state to Behavior Module and select an action in a probabilistic manner;
41:        Perform the action;
42:        Modify the digital pheromone at the current position according to (2);
43:        if the extra digital pheromone is left then
44:          Diffuse the digital pheromone to the surrounding areas with the fixed diffusion rate;
45:          Superpose the amount of digital pheromone at the same position linearly;
46:        end if
47:        Calculate the individual reward according to (9);
48:        Select a new attractor according to (1) and form the new input local state;
49:      end if
50:    end for
51:    Decay the amount of digital pheromone at positions already occupied by agents with the fixed decay rate;
52:    if the calculated global reward > 0 then
53:      Break;
54:    else
55:      for each agent getting the action opportunity do
56:        Calculate the gradients $\partial Loss(\theta_p^{(i)})/\partial \theta_p^{(i)}$ and $\partial Loss(\theta_b^{(i)})/\partial \theta_b^{(i)}$ of the policy and state value networks within Behavior Module according to (6)-(8) with the new input local state;
57:        Send the calculated gradients to the virtual agent;
58:      end for
59:      The virtual agent receives the gradients from agents and optimizes the internal policy and state value networks within Behavior Module according to (10)-(11);
60:      The virtual agent sends back the calculated parameters $\theta_{t+1}^{(v)}$ to all agents;
61:      Each agent updates the policy and state value networks within Behavior Module with $\theta_{t+1}^{(v)}$;
62:    end if
63:  end while
64: end for
65: Return agents with the well-trained neural network modules;


Algorithm 2 The testing part of SIRL.
Input: Agents with the well-trained neural network modules, the location of each agent, the target shape, number of iterations;
Output: The final similarity;
1: Initialize the whole activity area, the digital pheromone map, the labeled area, diffusion rate, decay rate, time step t, the range of sensing digital pheromone, the range of coordination channel;
2: for each iteration do
3:   for each agent do
4:     Select an attractor according to (1) and form the input local state;
5:     Send the local state to Evaluation Module and calculate the action priority;
6:     Send out the action priority through the coordination channel and receive the returned priority list;
7:     if the own action priority is the largest then
8:       Send the same local state to Behavior Module and select an action with the largest probability;
9:       Perform the action;
10:      Modify the digital pheromone at the current position according to (2);
11:      if the extra digital pheromone is left then
12:        Diffuse the digital pheromone to the surrounding areas with the fixed diffusion rate;
13:        Superpose the amount of digital pheromone at the same position linearly;
14:      end if
15:    end if
16:  end for
17:  Decay the amount of digital pheromone at positions already occupied by agents with the fixed decay rate;
18: end for
19: Calculate the final similarity according to the target shape;
20: Return the final similarity;

Fig. 5. Training and testing performance of different methods (JL, IRL, SIRL, JL-O, IRL-O) in the task to form the shape "4": (a) similarity versus training round; (b) similarity versus number of iterations.

and the seventh element is used to confirm whether the currently occupied position is labeled or unlabeled. The individual action set A contains five different actions: Up, Right, Down, Left, and Stop. However, since the local state received by each agent contains the relative positions of adjacent agents, the recorded new input local state s_{t+1}^{(i)} may be inaccurate at time step t + 1 due to the possible movement of adjacent agents at time step t, thus leading to an inaccurate estimate of V_e(s_{t+1}^{(i)}; θ_e^{(i)}) in (4). In the following simulation, we use different values of γ2 to test this phenomenon.
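A hypothetical encoding of this local state and action set is sketched below; only the seventh element (the labeled-area flag) and the five action names come from the description above, while the first six features are placeholders standing for the attractor- and neighbor-related quantities.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    UP = 0
    RIGHT = 1
    DOWN = 2
    LEFT = 3
    STOP = 4

@dataclass
class LocalState:
    """Hypothetical container for the 7-element local state s_t^(i).
    Elements 1-6 are placeholders for position-related features;
    only element 7 (the labeled-area flag) is explicit in the text."""
    features: tuple          # six position-related features (placeholder)
    on_labeled_area: bool    # seventh element: current cell labeled (1) or unlabeled (0)

    def to_vector(self):
        return list(self.features) + [1.0 if self.on_labeled_area else 0.0]

s = LocalState(features=(0.2, -0.1, 0.0, 0.5, -0.3, 0.1), on_labeled_area=False)
print(s.to_vector(), list(Action))
```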

We first present the convergence of SIRL and compare it with that of other typical methods. The first method, joint learning (JL), attempts to learn the optimal behavior from the joint information using only the Behavior Module, whose input is a cascaded vector containing all the input local states of surrounding agents within the Moore neighborhood, while the conflict-avoidance mechanism is disabled. The second method, IRL, has almost the same settings as JL except that its input vector only contains the self-related local state.

Fig. 6. Shapes formed at different iterations using stigmergic independent reinforcement learning (SIRL).

TABLE II
RECEIVED INDIVIDUAL REWARD FOR JOINT LEARNING-ORIGIN (JL-O) AND INDEPENDENT REINFORCEMENT LEARNING-ORIGIN (IRL-O).

Transition                               | Reward
Unlabeled area to Unlabeled area (0 → 0) | 0
Unlabeled area to Labeled area (0 → 1)   | a3
Labeled area to Unlabeled area (1 → 0)   | 0
Labeled area to Labeled area (1 → 1)     | b3 · max(∆SI, 0)


The third method, joint learning-origin (JL-O), and the fourth method, IRL-origin (IRL-O), are modified from the JL and IRL methods, respectively, by further disabling the stigmergy mechanism and replacing the attractors with the exact coordinates of the agents. In this situation, each agent receives a non-zero individual reward only when it enters the labeled area, and the global reward is also considered afterward. The received individual reward in this situation is indicated in TABLE II. Here, the transition 0 → 1 represents an agent moving from the unlabeled area to the labeled area; a3 and b3 are both positive constants.
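TABLE II can be transcribed directly into a small reward function; here ∆SI is interpreted as the change in global similarity between consecutive time steps, and the values assigned to a3 and b3 are arbitrary placeholders.

```python
def individual_reward(was_labeled, is_labeled, delta_similarity, a3=1.0, b3=1.0):
    """Individual reward for JL-O / IRL-O following TABLE II.
    was_labeled / is_labeled: whether the agent's cell before / after the move is labeled.
    delta_similarity: change of the global similarity (interpreted here as the ∆SI term)."""
    if not was_labeled and not is_labeled:      # 0 -> 0
        return 0.0
    if not was_labeled and is_labeled:          # 0 -> 1
        return a3
    if was_labeled and not is_labeled:          # 1 -> 0
        return 0.0
    return b3 * max(delta_similarity, 0.0)      # 1 -> 1

print(individual_reward(False, True, 0.02))     # entering the labeled area yields a3
```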

The training and testing performance of the above-mentioned five methods is presented in Fig. 5, and an intuitive illustration of the formed shapes with respect to the iteration index in SIRL is presented in Fig. 6. In Fig. 5(a), the neural network modules are tested every 10 training rounds. It can be observed from Fig. 5 that there is an evident performance difference between the methods with and without the stigmergy mechanism, as indicated by the curves of SIRL, JL, and IRL, since stigmergy can better decompose the global objective and achieve a higher final similarity. In addition, although the joint information is obtained in JL, it is simply treated as a noisy signal by each agent without appropriate utilization. Therefore, despite different inputs, the performance of JL and IRL is almost identical. Moreover, SIRL performs better than JL or IRL, benefiting from the conflict-avoidance mechanism, which can reconcile the contradictions caused by the actions of different agents and further improve the cooperation efficiency.

We also present the training performance of SIRL in Fig. 7 when the discount factor γ2 takes different values. We can observe that the training performance of SIRL declines as the value of γ2 increases toward 1. In SIRL, based on the current local observations of the environment and the behavioral policy, the Evaluation Module of each agent is used to calculate the action priority among the action candidates. However, due to the existence and influence of other agents, the transition process from one state to another becomes unpredictable for each individual agent. In this situation, the reward mapped from the future state by the discount factor γ2 becomes inaccurate. A smaller γ2 can generally reduce this estimation error. Therefore, in the following simulation, we set the discount factor γ2 = 0 in the training session of the Evaluation Module of each agent to limit the accumulation of R_t^{(i)} to only one step.
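The effect of γ2 can be illustrated with a generic discounted sum (not necessarily the exact return form used in (4)): with γ2 = 0 only the immediate reward R_t^(i) survives, whereas a larger γ2 mixes in future rewards that depend on the unpredictable moves of other agents.

```python
def discounted_return(rewards, gamma2):
    """Generic discounted sum of future rewards, G_t = sum_k gamma2**k * R_{t+k}."""
    return sum((gamma2 ** k) * r for k, r in enumerate(rewards))

future_rewards = [0.5, 0.2, 0.8]                        # hypothetical rewards from time step t onward
print(discounted_return(future_rewards, gamma2=0.0))    # 0.5 : only R_t is kept
print(discounted_return(future_rewards, gamma2=0.9))    # 0.5 + 0.18 + 0.648 : depends on other agents' moves
```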

We further test the well-trained neural network modules using the above-mentioned five methods in the tasks to form shapes "1" and "2." The final similarity and the number of involved agents in each task are listed in TABLE III, where each value is an average of five replications. Note that the task complexity is largely related to the number of agents involved. With an increase in task complexity, the number of iterations required to reach convergence also increases, whereas the final similarity normally declines. In the first five rows of TABLE III, we can observe that the neural network modules fully trained in the task to form shape "4" can also be used to form other shapes (i.e., "1" and "2"), and the methods with stigmergy yield superior portability to those without it.

Fig. 7. Training performance of stigmergic independent reinforcement learning (SIRL) when γ2 takes different values (0, 0.3, 0.6, 0.7, 0.8, 0.9); similarity versus training round.

Fig. 8. Performance comparison between centralized selection (CS), decentralized (DC), and stigmergic independent reinforcement learning (SIRL) in three different tasks ("1," "4," and "2"); similarity versus number of iterations.

Next, we compare the performance with the centralized selection (CS) method presented in [30], in which a certain number of agents are selected synchronously at each iteration through a threshold-based function [26] to perform actions, while the other agents stop. Furthermore, we convert the CS method into a decentralized method called DC, in which each agent can automatically decide whether or not to perform an action through comparison with its surrounding neighbors. Thus, the number of active agents may vary within each task or iteration. Instead of competing based on the action priority, as in SIRL, agents in DC can directly determine their moving priorities in terms of the received rewards. In particular, an agent can move in the next time step t + 1 only when it receives the maximum reward in time step t within the comparison range, which has the same size as the Moore neighborhood. As illustrated in TABLE IV, the reward received by each agent in DC is carefully tuned and closely related to its surrounding conditions. Here, each agent should consider the existence of up to four nearest neighbors in the Up, Down, Left, and Right directions. The label "1" (or "0") in the second row signifies that the agent is in the labeled (or unlabeled) area, whereas "*" indicates that this condition is ignored. We observe that agents in the unlabeled area appear to receive greater rewards and thus obtain more action opportunities.


TABLE III
FINAL SIMILARITIES USING DIFFERENT METHODS. THE LAST SIX COLUMNS GIVE THE FINAL SIMILARITY FOR EACH TARGET SHAPE, WITH THE NUMBER OF INVOLVED AGENTS IN PARENTHESES.

Stigmergy | Conflict-Avoidance | Federal Training | Input Local State | Algorithm | "1" (65) | "4" (119) | "2" (161) | "0" (179) | "6" (128) | "8" (163)
Applied   | Applied            | Applied          | Self-related      | SIRL      | 96.6%    | 97.5%     | 97.5%     | 95.1%     | 96.0%     | 96.4%
Applied   | None               | Applied          | Joint             | JL        | 84.3%    | 85.7%     | 82.1%     | N/A       | N/A       | N/A
Applied   | None               | Applied          | Self-related      | IRL       | 84.9%    | 83.2%     | 81.9%     | N/A       | N/A       | N/A
None      | None               | Applied          | Joint             | JL-O      | 35.4%    | 49.6%     | 41.0%     | N/A       | N/A       | N/A
None      | None               | Applied          | Self-related      | IRL-O     | 33.9%    | 46.7%     | 37.9%     | N/A       | N/A       | N/A
Applied   | Applied            | None             | Self-related      | DC        | 99.7%    | 98.3%     | 94.5%     | 84.8%     | 88.4%     | 86.9%
Applied   | Applied            | None             | Self-related      | CS        | 94.2%    | 95.1%     | 87.8%     | 57.2%     | 74.4%     | 75.3%
Applied   | Applied            | Applied          | Self-related      | SIRL-A    | N/A      | N/A       | N/A       | 97.0%     | 95.4%     | 95.3%
Applied   | None               | Applied          | Self-related      | SIRL-WS   | N/A      | N/A       | N/A       | 85.7%     | 87.8%     | 88.7%

Fig. 9. Performance comparison using different methods (SIRL, SIRL-A, SIRL-WS, DC, CS) in the tasks to form shapes "0" (a), "6" (b), and "8" (c); similarity versus number of iterations.

TABLE IV
MANUALLY SET RECEIVED REWARD FOR THE DECENTRALIZED (DC) METHOD.

Number of neighbors | 4 | 3 | 3  | 2 | 1 | 1  | 0
In the labeled area | * | 1 | 0  | * | 1 | 0  | *
Reward              | 0 | 4 | 12 | 8 | 8 | 12 | 12

Fig. 10. Final shapes formed by centralized selection (CS), decentralized (DC), and stigmergic independent reinforcement learning (SIRL).

In addition, similar to CS, each agent in DC is also designed to approach its selected attractor in a circular path, so as to make most agents in the unlabeled area move around the labeled area to accelerate the convergence [30].
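The DC decision rule can be sketched by combining TABLE IV with the maximum-reward comparison described above; the tie-breaking behavior and the function names are assumptions.

```python
def dc_reward(num_neighbors, in_labeled_area):
    """Manually set reward of the DC method, transcribed from TABLE IV.
    num_neighbors: occupied cells among the Up/Down/Left/Right neighbors (0-4).
    in_labeled_area: True if the agent stands on a labeled cell."""
    if num_neighbors == 4:
        return 0
    if num_neighbors == 3:
        return 4 if in_labeled_area else 12
    if num_neighbors == 2:
        return 8
    if num_neighbors == 1:
        return 8 if in_labeled_area else 12
    return 12                                   # no neighbors, labeled or not

def may_move(own_reward, neighborhood_rewards):
    """An agent in DC moves at t+1 only if its reward at t is the maximum
    within the Moore-neighborhood comparison range."""
    return all(own_reward >= r for r in neighborhood_rewards)

print(dc_reward(3, in_labeled_area=False))      # 12: unlabeled agents tend to win the move
print(may_move(12, [8, 4, 12]))                 # True here; ties resolved optimistically (assumption)
```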

In Fig. 8, we present a performance comparison between CS, DC, and SIRL in the task of forming three different shapes (i.e., "1," "4," and "2"). Their final similarities are listed in the first, sixth, and seventh rows of TABLE III. We can observe that the performance of CS and DC tends to decline sharply with an increase in task complexity and the number of involved agents. Moreover, in CS and DC, the value of the received individual rewards and the moving manner of each agent must be determined manually, which is impractical in more complex scenarios. In contrast, the performance of SIRL reaches a level comparable to that of CS or DC, and achieves a superior convergence rate in more complex tasks, such as the task to form the shape "2."

Next, we increase the task complexity and add a ring structure into the shape to be formed so as to give it a more complex topology. The performance comparison between CS, DC, and SIRL in the tasks to form shapes "0," "6," and "8" is provided in Fig. 9. The numerical results are also listed in the first, sixth, and seventh rows of TABLE III. The neural network modules utilized in SIRL are still fully trained in the previous task to form shape "4." In Fig. 9, there is a large performance difference between CS or DC and SIRL. The reason is that it is generally easier for agents in the unlabeled area to win the moving opportunities; however, they move around the labeled area repeatedly in CS or DC.


Consistent with our previous discussion, both the CS and DC methods perform better in less complex scenarios, such as the task to form the shape "1." However, they reach a bottleneck for complex shapes that contain ring structures, because agents in the unlabeled area are blocked by other agents located at the edge of the labeled area.

We present the final shapes formed by the three methods in Fig. 10. It can be clearly observed that many agents in CS or DC are detained on the periphery of the labeled area, ultimately reducing the final similarity. In contrast, the performance of SIRL remains stable regardless of the task complexity, which benefits from the learning process of the federal training method. Compared with DC, the number of agents that are selected at each iteration must be predetermined in CS. However, the most appropriate number may depend on the specific task or iteration. Thus, CS would obtain lower final similarities or require a larger number of total steps in different tasks. In addition, the performance of CS also declines dramatically when the shape to be formed contains a ring structure. This phenomenon can also be observed in DC, as the two methods share the same manner of motion.

We further test the effect of different ranges of the coordination channel in the conflict-avoidance mechanism on the final performance based on SIRL. Instead of relying on the Moore neighbors, in the SIRL-A method we reduce the maximum range of the coordination channel to 1; thus, only four neighbors are included in the comparison of action priority. Furthermore, in the SIRL-WS method, we disable the conflict-avoidance mechanism (i.e., the well-trained Evaluation Module) and provide each agent with the action opportunity at each time step, which can be regarded as setting the maximum range of the coordination channel to 0. The performance of the two methods is illustrated in Fig. 9, and the numerical results are listed in the last two rows of TABLE III. It can be seen that the performance of SIRL-A reaches a level similar to that of SIRL, and achieves an even higher convergence rate, since more agents can obtain action opportunities and participate in the task at an early stage. In addition, the curves of SIRL-WS in various tasks grow faster but stop at similarly lower levels. It can be concluded that the conflict-avoidance mechanism plays an important role in multi-agent collaboration, as it can reduce the behavioral localities of agents and achieve a higher final similarity.
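The priority comparison behind SIRL, SIRL-A, and SIRL-WS can be sketched as a range-limited argmax; the use of Chebyshev versus Manhattan distance to express the Moore neighborhood and the four-neighbor range, the strict-inequality tie handling, and the function names are assumptions.

```python
def rivals_within_range(pos, positions, coord_range, use_moore=True):
    """Other agents inside the coordination range of pos.
    use_moore=True compares within the Moore neighborhood (SIRL-style);
    use_moore=False keeps only the four nearest Up/Down/Left/Right cells (SIRL-A style)."""
    r, c = pos
    def dist(p):
        dr, dc = abs(p[0] - r), abs(p[1] - c)
        return max(dr, dc) if use_moore else dr + dc
    return [p for p in positions if p != pos and dist(p) <= coord_range]

def gets_action_opportunity(pos, priorities, coord_range, use_moore=True):
    """True if this agent's action priority is the largest inside its range.
    coord_range=0 (SIRL-WS style) disables the comparison: every agent may act."""
    if coord_range == 0:
        return True
    rivals = rivals_within_range(pos, list(priorities), coord_range, use_moore)
    return all(priorities[pos] > priorities[q] for q in rivals)

priorities = {(0, 0): 0.9, (0, 1): 0.4, (1, 1): 0.95}
print(gets_action_opportunity((0, 0), priorities, coord_range=1))                   # False: (1,1) wins in the Moore range
print(gets_action_opportunity((0, 0), priorities, coord_range=1, use_moore=False))  # True: (1,1) falls outside the four-neighbor range
```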

As an important metric, the number of total steps required by all involved agents to form the target shape should be especially considered. In Fig. 11, we present the total number of steps when the performance reaches convergence using four different methods. In the Oracle method, it is assumed that the location of each agent and the target shape are known in advance. Therefore, for each vacancy in the labeled area, the nearest agent is moved in sequence to fill it greedily. This scheme represents a simple but effective control method in the extraordinary case in which all information is known, even though the greedy algorithm does not necessarily produce the optimal result. In Fig. 11, the formed shapes are arranged from small to large based on the number of involved agents.

Fig. 11. Number of steps required to form the target shape using different methods (Oracle, DC, CS, SIRL); shapes "1," "4," "6," "2," "8," and "0" on the x-axis, number of steps on the y-axis.

There is a trend that as the number of involved agents increases, the number of total steps required also increases. We can observe that the performance of SIRL reaches a level comparable to that of Oracle. Moreover, since the agents are controlled in parallel, SIRL may spend less time completing tasks in real scenarios.
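The Oracle baseline can be sketched as a greedy nearest-agent assignment; the Manhattan distance, the processing order of vacancies, and the abstraction away from the actual grid movement constraints are assumptions.

```python
def oracle_steps(agent_positions, vacancies):
    """Greedy Oracle baseline: for each vacancy (in sequence), move the nearest
    remaining agent to fill it, counting the steps as the Manhattan distance."""
    remaining = list(agent_positions)
    total_steps = 0
    for target in vacancies:
        nearest = min(remaining,
                      key=lambda p: abs(p[0] - target[0]) + abs(p[1] - target[1]))
        total_steps += abs(nearest[0] - target[0]) + abs(nearest[1] - target[1])
        remaining.remove(nearest)
    return total_steps

agents = [(0, 0), (3, 4), (6, 1)]
labeled_vacancies = [(1, 1), (3, 3), (5, 2)]
print(oracle_steps(agents, labeled_vacancies))   # 2 + 1 + 2 = 5 steps in this toy layout
```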

VI. CONCLUSIONS

This study utilizes the advantages of the stigmergy mechanism rooted in a natural MAS and contributes to integrating stigmergic collaboration into the DRL-based decision-making process of each involved intelligent agent. In particular, in our proposed SIRL algorithm, various agents are coordinated through the stigmergic functionality and the conflict-avoidance mechanism in fulfilling the corresponding tasks. As an enhancement to stigmergy, the proposed conflict-avoidance mechanism can further improve the coordination performance, especially when agents do not fully cooperate. In addition, the RL process of each agent is strengthened through the proposed federal training method. Compared with a traditional distributed learning algorithm, our learning method allows agents to optimize their internal neural networks while maintaining the external interaction processes required for effective inter-agent collaboration. Furthermore, due to the introduction of the stigmergic RL process, our proposed SIRL scheme can be further applied to more complex cooperative tasks for which the agent's behavioral policy cannot be predetermined. The results of numerical simulations verify the effectiveness of our proposed scheme, with significant improvements in both efficiency and scalability.

In future work, we plan to implement the proposed SIRL scheme in real-world application scenarios. As a preliminary step, our scheme has recently been implemented on multiple mobile robots in team formation tasks similar to those presented in this paper [56], and it achieved satisfactory results. We also plan to investigate the performance of the SIRL scheme in other coordination scenarios, such as the control of autonomous vehicles in intelligent urban transportation. In addition, determining how to improve the network training efficiency while accounting for the communication delay and cost among various agents is another valuable direction that we will consider in future research.

REFERENCES

[1] J. Schwarzrock, I. Zacarias, A. L. C. Bazzan, L. H. Moreira, and E. P. D. Freitas, "Solving task allocation problem in multi Unmanned Aerial Vehicles systems using Swarm intelligence," Engineering Applications of Artificial Intelligence, vol. 72, pp. 10–20, 2018.
[2] H. V. D. Parunak, S. A. Brueckner, and J. Sauter, "Digital Pheromones for Coordination of Unmanned Vehicles," Lecture Notes in Computer Science, vol. 3374, pp. 246–263, 2005.
[3] M. G. C. A. Cimino, A. Lazzeri, and G. Vaglini, "Combining stigmergic and flocking behaviors to coordinate swarms of drones performing target search," in 2015 6th International Conference on Information, Intelligence, Systems and Applications (IISA), Corfu, Greece, July 2015.
[4] P. Valckenaers, H. V. Brussel, M. Kollingbaum, and O. Bochmann, "Multi-Agent Coordination and Control Using Stigmergy Applied to Manufacturing Control," in Selected Tutorial Papers from the 9th ECCAI Advanced Course ACAI 2001 and AgentLink's 3rd European Agent Systems Summer School on Multi-Agent Systems and Applications, Berlin, Heidelberg, July 2001.
[5] Hadeli, P. Valckenaers, M. Kollingbaum, and H. V. Brussel, "Multi-agent coordination and control using stigmergy," Computers in Industry, vol. 53, no. 1, pp. 75–96, 2004.
[6] J. Werfel and R. Nagpal, "Extended Stigmergy in Collective Construction," IEEE Intelligent Systems, vol. 21, no. 2, pp. 20–28, 2006.
[7] B. Guo, C. Chen, D. Zhang, Z. Yu, and A. Chin, "Mobile Crowd Sensing and Computing: When Participatory Sensing Meets Participatory Social Media," IEEE Communications Magazine, vol. 54, no. 2, pp. 131–137, 2016.
[8] A. L. Alfeo, M. G. C. A. Cimino, S. Egidi, B. Lepri, A. Pentland, and G. Vaglini, "Stigmergy-Based Modeling to Discover Urban Activity Patterns from Positioning Data," in Social, Cultural, and Behavioral Modeling. Cham: Springer International Publishing, April 2017, pp. 292–301.
[9] G. Tesauro, "Temporal difference learning and TD-Gammon," Communications of the ACM, vol. 38, no. 3, pp. 58–68, 1995.
[10] N. Kohl and P. Stone, "Policy gradient reinforcement learning for fast quadrupedal locomotion," in IEEE International Conference on Robotics and Automation, vol. 3, New Orleans, LA, USA, May 2004, pp. 2619–2624.
[11] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "Deep Reinforcement Learning: A Brief Survey," IEEE Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017.
[12] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, and G. Ostrovski, "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[13] V. Mnih, A. P. Badia, M. Mirza, A. Graves, and K. Kavukcuoglu, "Asynchronous Methods for Deep Reinforcement Learning," February 2016, arXiv. [Online]. Available: https://arxiv.org/abs/1602.01783.
[14] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic Policy Gradient Algorithms," in Proceedings of the 31st International Conference on Machine Learning, vol. 32, no. 1, Beijing, China, June 2014.
[15] L. Busoniu, R. Babuska, and B. D. Schutter, "A Comprehensive Survey of Multiagent Reinforcement Learning," IEEE Trans. on Systems, Man, and Cybernetics, Part C, vol. 38, no. 2, pp. 156–172, 2008.
[16] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Machine Learning Proceedings 1994, San Francisco (CA), 1994.
[17] L. Matignon, G. Laurent, and N. Fort-Piat, "Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems," The Knowledge Engineering Review, vol. 27, pp. 1–31, March 2012.
[18] J. N. Foerster, Y. M. Assael, N. D. Freitas, and S. Whiteson, "Learning to Communicate with Deep Multi-Agent Reinforcement Learning," May 2016, arXiv. [Online]. Available: https://arxiv.org/abs/1605.06676.
[19] M. Tan, "Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents," in Proceedings of the 10th International Conference on Machine Learning (ICML-93), May 1993.
[20] H. Mao, N. Yan, Z. Gong, W. Ke, M. Chao, X. Yang, W. Yuan, J. Wang, Q. Wang, and X. Liu, "ACCNet: Actor-Coordinator-Critic Net for 'Learning-to-Communicate' with Deep Multi-agent Reinforcement Learning," June 2017, arXiv. [Online]. Available: https://arxiv.org/abs/1706.03235.

[21] M. J. Mataric, "Using communication to reduce locality in distributed multiagent learning," Journal of Experimental and Theoretical Artificial Intelligence, vol. 10, no. 3, p. 13, 1998.
[22] F. Heylighen, "Stigmergy as a universal coordination mechanism I: Definition and components," Cognitive Systems Research, vol. 38, pp. 4–13, 2016.
[23] ——, "Stigmergy as a universal coordination mechanism II: Varieties and evolution," Cognitive Systems Research, vol. 38, pp. 50–59, 2016.
[24] R. Schoonderwoerd, O. Holland, and J. Bruten, "Ant-Based Load Balancing in Telecommunications Networks," Adaptive Behavior, vol. 5, no. 2, pp. 169–207, 1996.
[25] G. D. Caro and M. Dorigo, "AntNet: Distributed Stigmergetic Control for Communications Networks," Journal of Artificial Intelligence Research, vol. 9, no. 1, pp. 317–365, 1999.
[26] M. Dorigo, E. Bonabeau, and G. Theraulaz, "Ant algorithms and stigmergy," Future Generation Computer Systems, vol. 16, no. 8, pp. 851–871, 2000.
[27] C. Maru, M. Enoki, A. Nakao, Y. Shu, S. Yamaguchi, and M. Oguchi, "QoE Control of Network Using Collective Intelligence of SNS in Large-Scale Disasters," in 2016 IEEE International Conference on Computer and Information Technology (CIT), Nadi, Fiji, March 2017.
[28] J. Takahashi, R. Kanamori, and T. Ito, "A Preliminary Study on Anticipatory Stigmergy for Traffic Management," in 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, vol. 3, Macau, China, December 2012.
[29] R. Kanamori, J. Takahashi, and T. Ito, "Evaluation of anticipatory stigmergy strategies for traffic management," in 2012 IEEE Vehicular Networking Conference (VNC), Seoul, South Korea, November 2012, pp. 33–39.
[30] X. Xu, Z. Zhao, R. Li, and H. Zhang, "Brain-Inspired Stigmergy Learning," IEEE Access, vol. 7, pp. 410–424, 2019.
[31] O. R. Pagan, "The brain: a concept in flux," Philosophical Transactions of the Royal Society B: Biological Sciences, vol. 374, no. 1774, 2019.
[32] O. E. Holland, "Multiagent systems: Lessons from social insects and collective robotics," Adaptation, Coevolution, and Learning in Multiagent Systems: Papers from the 1996 AAAI Spring Symposium, 1996.
[33] A. Ricci, A. Omicini, M. Viroli, L. Gardelli, and E. Oliva, "Cognitive Stigmergy: Towards a Framework Based on Agents and Artifacts," in Proceedings of the 3rd International Conference on Environments for Multi-Agent Systems III, Berlin, Heidelberg, May 2006.
[34] C. H. Yong and R. Miikkulainen, "Coevolution of Role-Based Cooperation in Multiagent Systems," IEEE Trans. on Autonomous Mental Development, vol. 1, pp. 170–186, 2009.
[35] M. Dorigo, M. Birattari, and T. Stutzle, "Ant colony optimization," IEEE Computational Intelligence Magazine, vol. 1, no. 4, pp. 28–39, 2007.
[36] V. Crespi, A. Galstyan, and K. Lerman, "Comparative analysis of top-down and bottom-up methodologies for multi-agent system design," in Proceedings of the 4th International Joint Conference on Autonomous Agents and Multiagent Systems, New York, NY, USA, July 2005.
[37] R. Aras, A. Dutech, and F. Charpillet, "Stigmergy in multi agent reinforcement learning," in Proceedings of the 4th International Conference on Hybrid Intelligent Systems, USA, December 2004.
[38] K. Verbeeck and A. Nowe, "Stigmergy and Bootstrapping: Using Ant Algorithms as a Case Study for Learning in MAS," in Proceedings of the 2nd International Workshop on Ant Algorithms, September 2000.
[39] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, "Deep Decentralized Multi-task Multi-Agent Reinforcement Learning under Partial Observability," in Proceedings of the 34th International Conference on Machine Learning, vol. 70, International Convention Centre, Sydney, Australia, August 2017.
[40] L. Peshkin, N. Meuleau, and L. P. Kaelbling, "Learning Policies with External Memory," in Proceedings of the 16th International Conference on Machine Learning, Williamstown, MA, USA, March 2001.
[41] M. Bourenane, A. Mellouk, and D. Benhamamouch, "Reinforcement Learning in Multi-Agent Environment and Ant Colony for Packet Scheduling in Routers," in Proceedings of the 5th ACM International Workshop on Mobility Management and Wireless Access, New York, NY, USA, October 2007, pp. 137–143.
[42] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[43] K. Verbeeck and A. Nowe, "Colonies of learning automata," IEEE Trans. on Systems, Man, and Cybernetics, vol. 32, no. 6, pp. 72–80, 2002.
[44] B. Masoumi and M. R. Meybodi, "Speeding up learning automata based multi agent systems using the concepts of stigmergy and entropy," Expert Systems with Applications, vol. 38, no. 7, pp. 8105–8118, 2011.


[45] O. Shamir and T. Zhang, "Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes," in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 1, Atlanta, Georgia, USA, June 2013, pp. 71–79.
[46] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng, "Large Scale Distributed Deep Networks," in Proceedings of the 25th International Conference on Neural Information Processing Systems, vol. 1, Red Hook, NY, USA, December 2012.
[47] R. Hu, Y. Gong, and Y. Guo, "CPFed: Communication-Efficient and Privacy-Preserving Federated Learning," March 2020, arXiv. [Online]. Available: https://arxiv.org/abs/2003.13761v1.
[48] G. Sartoretti, Y. Wu, W. Paivine, and T. K. S. Kumar, "Distributed Reinforcement Learning for Multi-robot Decentralized Collective Construction." Cham: Springer International Publishing, 2019, pp. 35–49.
[49] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized Experience Replay," December 2015, arXiv. [Online]. Available: https://arxiv.org/abs/1511.05952.
[50] I. Grondman, L. Busoniu, G. A. D. Lopes, and R. Babuska, "A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients," IEEE Trans. on Systems, Man, and Cybernetics, Part C, vol. 42, no. 6, pp. 1291–1307, 2012.
[51] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The Arcade Learning Environment: An Evaluation Platform for General Agents," Journal of Artificial Intelligence Research, vol. 47, no. 1, pp. 253–279, 2013.
[52] M. Mamei and F. Zambonelli, "Programming Stigmergic Coordination with the TOTA Middleware," in Proceedings of the 4th International Joint Conference on Autonomous Agents and Multiagent Systems, New York, NY, USA, July 2005.
[53] S. Dutta, G. Joshi, S. Ghosh, P. Dube, and P. Nagpurkar, "Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD," 2018, arXiv. [Online]. Available: https://arxiv.org/abs/1803.01113.
[54] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[55] O. Miramontes, R. V. Sole, and B. C. Goodwin, "Collective behaviour of random-activated mobile cellular automata," Physica D: Nonlinear Phenomena, vol. 63, no. 1-2, pp. 145–160, 1993.
[56] K. Chen, R. Li, J. Crowcroft, Z. Zhao, and H. Zhang, "Demo: The Implementation of Stigmergy in Network-assisted Multi-agent System," in Proc. ACM Mobicom 2020, London, UK, 2020.