
Cooperative and Distributed Reinforcement Learning of Drones for Field Coverage

Huy Xuan Pham, Hung Manh La, David Feil-Seifer, and Ara Nefian

Abstract— This paper proposes a distributed Multi-Agent Reinforcement Learning (MARL) algorithm for a team of Unmanned Aerial Vehicles (UAVs). The proposed MARL algorithm allows the UAVs to learn cooperatively to provide full coverage of an unknown field of interest while minimizing the overlapping sections among their fields of view. Two challenges in MARL for such a system are discussed in the paper: firstly, the complex dynamics of the joint actions of the UAV team, which will be handled using game-theoretic correlated equilibrium, and secondly, the huge dimensionality of the state-space representation, which will be tackled with efficient function approximation techniques. We also provide detailed experimental results, with both simulation and a physical implementation, to show that the UAV team can successfully learn to accomplish the task.

I. INTRODUCTION

Optimal sensing coverage is an active research branch. Solutions have been proposed in previous work, for instance, by solving a general locational optimization problem [1], using Voronoi partitions [2], [3], using potential field methods [4], [5], or scalar field mapping [6], [7]. In most of those works, the authors made assumptions about the mathematical model of the environment, such as a distribution model of the field or a predefined coverage path [8], [9]. In reality, however, it is very difficult to have an accurate model, because its data is normally limited or unavailable.

Unmanned aerial vehicles (UAVs), or drones, have already become popular in human society, with a wide range of applications from retail business to environmental issues. The ability to provide visual information at low cost and with high flexibility makes drones preferable equipment in tasks relating to field coverage and monitoring, such as wildfire monitoring [10] or search and rescue [11]. In such applications, a team of UAVs is usually deployed to increase the coverage range and reliability of the mission. As with other multi-agent systems [12], [13], the important challenges in designing an autonomous team of UAVs for field coverage include dealing with the dynamic complexity of the interaction between the UAVs so that they can coordinate to accomplish a common team goal.

Model-free learning algorithms, such as Reinforcement Learning (RL), would be a natural approach to address the

This material is based upon work supported by the National Aeronautics and Space Administration (NASA) Grant No. NNX15AK48A issued through the NV EPSCoR RID Seed, and NASA Grant No. NNX15AI02H issued through the NVSGC Faculty Award - CD.

Huy Pham is a PhD student in the Advanced Robotics and Automation (ARA) Laboratory, and Dr. Hung La is the director of the ARA Lab. Dr. David Feil-Seifer is an Assistant Professor in the Department of Computer Science and Engineering, University of Nevada, Reno, NV 89557, USA. Dr. Ara Nefian is with NASA Ames Research Center, Moffett Field, CA 94035. Corresponding author: Hung La, email: [email protected]

aforementioned challenges relating to the required accurate mathematical models of the environment and the complex behaviors of the system. These algorithms allow each agent in the team to learn new behaviors, or to reach consensus with others [14], without depending on a model of the environment [15]. Among them, RL is popular because it is generic enough to address a wide range of problems while being simple to implement.

Classic individual RL algorithms have already been extensively researched in UAV applications. Previous papers focus on applying RL algorithms to UAV control to achieve desired trajectory tracking/following [16], or discuss using RL to improve performance in UAV applications [17]. Multi-Agent Reinforcement Learning (MARL) is also an active field of research. In multi-agent systems (MAS), agents' behaviors cannot be fully designed a priori due to their complicated nature; therefore, the ability to learn appropriate behaviors and interactions provides a huge advantage for the system. This particularly benefits the system when new agents are introduced or the environment is changed [18]. Recent publications concern the possibility of applying MARL to a variety of applications, such as autonomous driving [19] or traffic control [20].

In robotics, efforts have been focused on robotic system coordination and collaboration [21], transfer learning [22], or multi-target observation [23]. For robot path planning and control, most prior research focuses on classic problems, such as navigation and collision avoidance [24], object carrying by robot teams [25], or pursuing prey/avoiding predators [26], [27]. Many other papers in multi-robotic systems even simplified the dynamic nature of the system to use individual agent learning such as the classic RL algorithm [28] or an actor-critic model [29]. To the best of our knowledge, few available works have addressed the complexity of MARL in a multi-UAV system and its practical missions such as optimal sensing coverage. In this paper, we propose how a MARL algorithm can be applied to solve an optimal coverage problem. We address two challenges in MARL: (1) the complex dynamics of the joint actions of the UAV team, which will be solved using game-theoretic correlated equilibrium, and (2) the challenge of the huge dimensional state space, which will be tackled with an efficient space-reduced representation of the value function.

The remainder of the paper is organized as follows. Section II details the optimal field coverage problem formulation. In Section III, we discuss our approach to solve the problem and the design of the learning algorithm. Basics of MARL will also be covered. We present our experimental


results in Section IV with a comprehensive simulation, followed by an implementation with physical UAVs in a lab setting. Finally, Section V concludes our paper and lays out future work.

II. PROBLEM FORMULATION

Fig. 1. A team of UAVs to cover a field of interest F. A UAV can enlarge its FOV by flying higher, but risks overlapping with the other UAVs in the system. Minimizing overlap will increase the field coverage and resolution.

In a mission of exploring a new environment, such as monitoring an oil spill or a wildfire area, there is growing interest in sending out a fleet of UAVs acting as a mobile sensor network, as it provides many advantages compared to traditional static monitoring methods [6]. In such a mission, the UAV team needs to surround the field of interest to gather more information, for example, visual data. Suppose that we have a team of quadrotor-type UAVs (Figure 1). Each UAV is an independent decision maker, thus the system is distributed. Each UAV can localize itself using an on-board localization system, such as GPS, and can also exchange information with other UAVs through communication links. Each UAV is equipped with an identical downward-facing camera that provides it with a square field of view (FOV). The camera of each UAV and its FOV form a pyramid with half-angles θT = [θ1, θ2]^T

(Figure 2). A point q is covered by the FOV of UAV i if it satisfies the following condition:

||q − ci|| / zi ≤ tan θT,   (1)

where ci is the lateral-projected position and zi is the altitude of UAV i, respectively. The objective of the team is not only to provide full coverage of the shape of the field of interest F with the UAVs' FOVs, but also to minimize overlap with the other UAVs' FOVs to improve the efficiency of the team (e.g., minimizing overlap can increase the resolution of the field coverage). A UAV covers F by trying to put a section of it under its FOV. It can enlarge the FOV to cover a larger section by increasing its altitude zi according to (1); however, it may risk overlapping other UAVs' FOVs in doing so. Formally speaking, let us consider a field F of arbitrary shape, and let p1, p2, ..., pm denote the positions of the m UAVs, respectively.
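
For concreteness, the coverage test in (1) can be implemented directly. Below is a minimal sketch, assuming the half-angles are applied componentwise along the two lateral axes of the square footprint; the function name, variable names, and example values are illustrative, not part of the paper's implementation.

```python
import numpy as np

def in_fov(q, c_i, z_i, theta=(np.pi / 6, np.pi / 6)):
    """Condition (1): is ground point q under UAV i's square FOV?

    q, c_i : lateral (x, y) coordinates of the point and of the UAV's projection.
    z_i    : altitude of UAV i.
    theta  : half-angles (theta_1, theta_2) of the pyramid-shaped FOV (assumed values).
    """
    q, c_i, theta = np.asarray(q), np.asarray(c_i), np.asarray(theta)
    # Flying higher (larger z_i) enlarges the footprint z_i * tan(theta) on each side.
    return bool(np.all(np.abs(q - c_i) / z_i <= np.tan(theta)))

print(in_fov(q=[1.0, 0.5], c_i=[0.0, 0.0], z_i=2.0))  # True for this example FOV
```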

Fig. 2. Field of view of each UAV.

Each UAV i has a square FOV projected on the environment plane, denoted as Bi. Let f(q, p1, p2, ..., pm) represent the combined area under the FOVs of the UAVs. The team has a cost function H represented by:

H = ∫_{q∈F} f(q, p1, p2, ..., pm) Φ(q) dq − ∫_{q∈Bi∩Bj, ∀i,j∈m} f(q, p1, p2, ..., pm) Φ(q) dq,   (2)

where Φ(q) measures the importance of a specific area. In a plain field of interest, Φ(q) is constant.

The problem can be solved using traditional methods, such as Voronoi partitions [2], [3] or potential field methods [4], [5]. Most of these works proposed model-based approaches, where the authors made assumptions about the mathematical model of the environment, such as the shape of the target [8], [9]. In reality, however, it is very difficult to obtain an accurate model, because the data of the environment is normally insufficient or unavailable. This can be problematic, as such systems may fail when using incorrect models. On the other hand, learning algorithms such as RL, which rely only on data obtained directly from the system, would be a natural option to address the problem.

III. ALGORITHM

A. Reinforcement Learning and Multi-Agent Reinforcement Learning

Classic RL defines a learning process in which a decision maker, or agent, interacts with the environment. During the learning process, the agent selects appropriate actions for the situation presented at each state according to a policy π, so as to maximize a numerical reward signal, fed back from the environment, that measures the performance of the agent. In a MAS, the agents interact not only with the environment but also with the other agents in the system, making their interactions more complex. The state transition of the system is also more complicated, since it results from a joint action containing the actions of all agents taken at a time step. The agents in the system must therefore also consider the other agents' states and actions in order to coordinate and/or compete with them. Assuming that the environment has the Markovian property, where the next state and reward of an agent depend only on the current state, the multi-agent learning model can be generalized as a Markov game <m, {S}, {A}, T, R>, where:

• m is the number of agents in the system.

• {S} is the joint state space, {S} = S1 × S2 × ... × Sm, where Si, i = 1, ..., m, is the individual state space of agent i. At time step k, the individual state of agent i is denoted si,k, and the joint state is denoted Sk = {s1,k, s2,k, ..., sm,k}.

• {A} is the joint action space, {A} = A1 × A2 × ... × Am, where Ai, i = 1, ..., m, is the individual action space of agent i. The joint action at time k is denoted Ak ∈ {A}, while the individual action of agent i is denoted ai,k; we have Ak = {a1,k, a2,k, ..., am,k}.

• T is the transition probability function, T : S × A × S → [0, 1], giving the probability that agent i, taking action ai,k, moves from state si,k to state si,k+1. Generally, it is represented as T(si,k, ai,k) = P(si,k+1 | si,k, ai,k) = Pi(ak).

• R is the individual reward function, R : S × A → R, specifying the immediate reward of agent i for getting from state si,k at time step k to state si,k+1 at time step k+1 after taking action ai,k. In MARL, the team also has a global reward GR : {S} × {A} → R for achieving the team's objective; we have GR(Sk, Ak) = rk+1.
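
To make the notation concrete, the tuple <m, {S}, {A}, T, R> can be mirrored by a simple container for one observed transition. This is only an illustrative sketch; the paper does not prescribe any particular data layout, and the type names are assumptions.

```python
from dataclasses import dataclass
from typing import Tuple

State = Tuple[int, int, int]   # discrete cube (x_c, y_c, z_c) occupied by one UAV
Action = int                   # index into an agent's individual action set A_i

@dataclass(frozen=True)
class JointTransition:
    """One step of the Markov game <m, {S}, {A}, T, R> for m agents."""
    joint_state: Tuple[State, ...]        # S_k = (s_1k, ..., s_mk)
    joint_action: Tuple[Action, ...]      # A_k = (a_1k, ..., a_mk)
    next_joint_state: Tuple[State, ...]   # S_{k+1}, drawn according to T
    global_reward: float                  # G_R(S_k, A_k) = r_{k+1}
```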

The agents seek to optimize their expected reward in an episode by determining which action to take that will have the highest return in the long run. In single-agent learning, a value function Q(sk, ak) : S × A → R helps quantify, strategically, how good it is for the agent to take action ak at state sk, by calculating the expected return obtained over an episode. In MARL, the action-state value function of each agent also depends on the joint state and joint action [30], represented as:

Q(si,k, ai,k, s−i,k, a−i,k) = Q(Sk, Ak) = E{ Σ_k^∞ γ ri,k+1 },   (3)

where 0 < γ ≤ 1 is the discount factor of the learning. This function is also called the Q-function. It is obvious that the state space and action space, as well as the value function, in MARL are much larger than in individual RL; therefore, MARL requires much larger memory space, which is a huge challenge concerning the scalability of the problem.

B. Correlated Equilibrium

In order to accomplish the team's goal, the agents must reach consensus in selecting actions. The set of actions that they agree to choose is called a joint action, Ak ∈ {A}. Such an agreement can be evaluated at an equilibrium, such as a Nash equilibrium (NE) [24] or a Correlated Equilibrium (CE) [31]. Unlike NE, CE can be solved with the help of linear programming (LP) [32]. Inspired by [31] and [25], in this work we use a strategy that computes the optimal policy by finding the CE for the agents in the system. From the general problem of finding CE in game theory [32], we formulate an LP to help find the stable action for each agent as follows:

π(Ak) = arg max_{Ak} { Σ_{i=1}^{m} Qi,k(Sk, Ak) Pi(ak) },

subject to:

Σ_{ak∈Ai} Pi(ak) = 1, ∀i ∈ {m},

Pi(ak) ≥ 0, ∀i ∈ {m}, ∀ak ∈ Ai,

Σ_{a′k∈Ai} [Qi,k(Sk, ak, Ak,−i) − Qi,k(Sk, a′k, Ak,−i)] Pi(ak) ≥ 0, ∀i ∈ {m}.   (4)

Here, Pi(ak) is the probability of UAV i selecting action ak at time k, and A−i denotes the rest of the actions of the other agents. Solving LPs has long been researched by the optimization community. In this work, we use a state-of-the-art program from that community to solve the above LP.
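
The paper solves (4) with an off-the-shelf LP solver (CVX in MATLAB, see Section IV). As an illustration only, the sketch below computes a utilitarian correlated equilibrium in the standard form, i.e., a probability distribution over joint actions that maximizes the summed Q-values subject to the rationality constraints, using scipy.optimize.linprog in Python. The function name, the array layout of the Q-values, and the greedy selection of the most probable joint action are assumptions, not the paper's exact formulation.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def correlated_equilibrium(Q, n_actions):
    """Utilitarian CE over joint actions.

    Q : list of m flat arrays; Q[i][j] is agent i's value Q_i(S_k, A_j) for the
        j-th joint action, with joint actions enumerated in itertools.product order.
    Returns one joint action (tuple of individual actions) picked greedily from the CE.
    """
    m = len(Q)
    joint = list(itertools.product(range(n_actions), repeat=m))
    index = {A: j for j, A in enumerate(joint)}
    n = len(joint)

    # Objective: maximize sum_i Q_i(A) p(A)  ->  linprog minimizes, so negate.
    c = -np.sum(np.vstack(Q), axis=0)

    # Rationality: for every agent i and action pair (a, a'), deviating from a to a'
    # must not increase agent i's conditional expected value.
    A_ub, b_ub = [], []
    for i in range(m):
        for a in range(n_actions):
            for a_alt in range(n_actions):
                if a == a_alt:
                    continue
                row = np.zeros(n)
                for j, A in enumerate(joint):
                    if A[i] != a:
                        continue
                    A_dev = list(A)
                    A_dev[i] = a_alt
                    row[j] = Q[i][j] - Q[i][index[tuple(A_dev)]]
                A_ub.append(-row)  # linprog expects <= 0, so flip the sign
                b_ub.append(0.0)

    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.ones((1, n)), b_eq=[1.0],
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return joint[int(np.argmax(res.x))]   # ties are later broken by social conventions
```

With m = 3 UAVs and six individual actions, this LP has 6^3 = 216 variables and 3 · 6 · 5 = 90 rationality constraints, which is small enough to solve at every learning step.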

C. Learning Design

In this section, we design a MARL algorithm to solve the problem formulated in Section II. We assume that the system is fully observable. We also assume that the UAVs are identical, operate in the same environment, and have identical sets of states and actions: S1 = S2 = ... = Sm and A1 = A2 = ... = Am.

The state space and action space of each agent should be approximated as discrete finite sets to guarantee convergence of the RL algorithm [33]. We consider the environment as a 3-D grid containing a finite set of cubes, with the center of each cube representing a discrete location of the environment. The state of a UAV i is defined as its approximate position in the environment, si,k ≜ [xc, yc, zc] ∈ S, where xc, yc, zc are the coordinates of the center of a cube c at time step k. The objective equation (2) now becomes:

max_{Sk∈{S}} H = arg max_{Sk∈{S}} { Σ_i fi(Sk) − Σ_i oi(Sk) },   (5)

where fi : {S} → R is the count of squares, or cells, approximating the field F under the FOV of UAV i, and oi : {S} → R is the total number of cells overlapped with other UAVs.
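
Under the grid discretization, fi and oi in (5) can be counted directly from boolean footprint masks. The sketch below is one possible way to do this, assuming the field F is given as a boolean mask over the ground plane and that a UAV's square footprint grows with its altitude; the helper names, the tan θ value, and the overlap-counting convention are hypothetical.

```python
import numpy as np

def footprint_mask(state, grid_shape, tan_theta=0.6):
    """Boolean mask of the ground cells under a UAV's square FOV.

    state : (x_c, y_c, z_c), the discrete cube occupied by the UAV.
    The half-width of the footprint grows with altitude, mirroring (1).
    """
    x, y, z = state
    half = int(np.floor(z * tan_theta))
    mask = np.zeros(grid_shape, dtype=bool)
    mask[max(0, x - half):min(grid_shape[0], x + half + 1),
         max(0, y - half):min(grid_shape[1], y + half + 1)] = True
    return mask

def coverage_terms(joint_state, field_mask, tan_theta=0.6):
    """Return (sum_i f_i, sum_i o_i) as used in objective (5) and reward (6)."""
    masks = [footprint_mask(s, field_mask.shape, tan_theta) for s in joint_state]
    f_total = sum(int(np.sum(m & field_mask)) for m in masks)  # field cells under each FOV
    counts = np.sum(masks, axis=0)                             # how many FOVs cover each cell
    # A cell seen by c >= 2 UAVs counts as overlapped once for each of those c agents.
    o_total = int(np.sum(counts[counts >= 2]))
    return f_total, o_total
```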

To navigate, each UAV i can take an action ai,k out of a set of six possible actions A: heading North, West, South, or East in the lateral direction, or going Up or Down to change altitude. Note that if the UAV is in a state near the border of the environment and selects an action that would take it out of the space, it stays still in its current state. Certainly, the action ai,k belongs to an optimal joint-action strategy Ak resulting from (4). Note that in case multiple equilibria exist, since each UAV is an independent agent, they may choose different equilibria, making their respective actions deviate from the optimal joint action to a sub-optimal one. To overcome this, we employ a mechanism called social conventions [34], where the UAVs take turns to commit to an action. Each UAV is assigned a specific ranking order. When considering the optimal joint action sets, the UAV with higher rank has priority to choose its action first, and lets the subsequent one know its action. The other UAVs can then match their actions with respect to the selected action. To ensure collision avoidance, lower-ranking UAVs cannot take an action that would lead to the newly occupied states of higher-ranking UAVs in the system. By this, at each time step k, only one unique joint action will be agreed upon among the UAVs.
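
The ranking-based tie-breaking described above can be sketched as follows. The candidate set of equally good joint actions, the next_state helper, and the fallback when no collision-free candidate remains are all assumptions made for illustration.

```python
def resolve_joint_action(candidates, next_state, ranking):
    """Social conventions: pick one joint action from several optimal candidates.

    candidates : list of joint actions (tuples, one individual action per UAV).
    next_state : next_state(uav, action) -> the cell that UAV would occupy.
    ranking    : UAV indices ordered from highest to lowest priority.
    """
    claimed = set()            # cells already reserved by higher-ranking UAVs
    feasible = list(candidates)
    for uav in ranking:
        # Drop candidates that would collide with a higher-ranking UAV's new cell;
        # if none survive, fall back to the unfiltered set (a simplifying choice).
        safe = [A for A in feasible if next_state(uav, A[uav]) not in claimed] or feasible
        chosen = safe[0][uav]                       # this UAV commits first, in rank order
        feasible = [A for A in safe if A[uav] == chosen]
        claimed.add(next_state(uav, chosen))
    return feasible[0]
```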

Defining the reward in MARL is another open problem due to the dynamic nature of the system [18]. In this paper, the individual reward that each agent receives can be considered as the total number of cells it covers, minus the cells overlapping with other agents. However, a global team goal helps the team accomplish the task more quickly, and also speeds up the convergence of the learning process [25]. We define the global team reward as a function GR : {S} × {A} → R that weights the entire team's joint state Sk and joint action Ak at time step k in achieving (5). The agents only receive a reward if the team's goal is reached:

GR(Sk, Ak) = { r, if Σi fi(Sk) ≥ fb and Σi oi(Sk) ≤ 0; 0, otherwise,   (6)

where fb ∈ R is an acceptable bound on the portion of the field being covered. During the course of learning, the state-action value function Qi,k(si, ai) of each agent i at time k can be iteratively updated as in a multi-agent Q-learning algorithm, similar to those proposed in [25], [30]:

Qi,k+1(Sk, Ak) ← (1 − α) Qi,k(Sk, Ak) + α [GR(Sk, Ak) + γ max_{A′∈{A}} Qi,k(Sk+1, A′)],   (7)

where 0 < α ≤ 1 is the learning rate and γ is the discount rate of the RL algorithm. The term max_{A′∈{A}} Qi,k(Sk+1, A′) is derived from (4) at the joint state Sk+1.
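
Putting (6) and (7) together, a tabular version of the update could look like the sketch below. It reuses the coverage_terms helper sketched earlier, stores each agent's joint-action values in a dictionary, and uses a plain max over candidate joint actions where the paper would evaluate the correlated-equilibrium joint action from (4); those choices are simplifications for illustration.

```python
def global_reward(joint_state, field_mask, f_bound, r=0.1):
    """Sparse team reward (6): r only when coverage reaches f_b with zero overlap."""
    f_total, o_total = coverage_terms(joint_state, field_mask)
    return r if (f_total >= f_bound and o_total <= 0) else 0.0

def q_update(Q_i, S_k, A_k, S_next, candidate_joint_actions, reward,
             alpha=0.1, gamma=0.9):
    """One tabular update of agent i's joint value function, as in (7).

    Q_i is a dict keyed by (joint_state, joint_action); missing entries default to 0.
    """
    best_next = max(Q_i.get((S_next, A), 0.0) for A in candidate_joint_actions)
    old = Q_i.get((S_k, A_k), 0.0)
    Q_i[(S_k, A_k)] = (1 - alpha) * old + alpha * (reward + gamma * best_next)
```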

D. Approximate Multi-Agent Q-learning

In MARL, each agent updates its value function with respect to the other agents' states and actions; therefore, the state and action variable dimensions grow exponentially as we increase the number of agents in the system. This makes value function representation a challenge. Considering the value function Qi,k+1(Sk, Ak) in (3), the space needed to store all the possible state-action pairs is |S1| · |S2| · ... · |Sm| · |A1| · |A2| · ... · |Am| = |Si|^m |Ai|^m.

Works have been proposed in the literature to tackle this problem: using graph theory [35] to decompose the global Q-function into local functions concerning only a subset of the agents, reducing the dimension of the Q-table [36], or eliminating other agents to reduce the space [37]. However, most previous approaches require an additional step to reduce the space, which may place more pressure on the already-intense calculation time. In this work, we employ simple approximation techniques [38], Fixed Sparse Representation (FSR) and Radial Basis Functions (RBF), to map the original Q to a parameter vector θ by using state- and action-dependent basis functions φ : {S} × {A} → R:

Qi,k(Sk, Ak) = Σ_l φl(Sk, Ak) θi,l = φ^T(Sk, Ak) θi,   (8)

The FSR scheme uses a column vector φ(S, A) of size D · |{A}|, where D is the sum of the dimensions of the state space. For example, if the state space is a 3-D space X × Y × Z, then D = X + Y + Z. Each element in φ is defined as follows:

φ(x, y) = { 1, if x = Sk and y = Ak; 0, otherwise.   (9)

In the RBF scheme, we use a column vector φ of l · |{A}| elements, each of which is calculated as:

φ(l, y) = { exp(−||Sk − cl||^2 / (2 µl^2)), if y = Ak; 0, otherwise,   (10)

where cl is the center and µl is the radius of the l pre-defined basis functions, which have the shape of a Gaussian bell.

The φ(S, A) and θi in the FSR and RBF schemes are column vectors of size D · |{A}| and l · |{A}|, respectively, which is much less than the space required by the original Q-value function. For instance, if we deploy 3 agents in a 7 × 7 × 4 space, and each agent has 6 actions, the original Q-table would hold (7 · 7 · 4 · 6)^3 = 1.6 · 10^9 numbers. In comparison, the total space required for the approximated parameter vectors in the FSR scheme is 3 · (7 + 7 + 4) · 6^3 = 3.8 · 10^3, and in the RBF scheme it is just 3 · 8 · 6 = 144 numbers, so the required space is hugely reduced.
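
The two feature constructions (9) and (10) can be sketched as below. Following the sizes quoted above, the FSR vector one-hot encodes each state coordinate inside the block of the selected joint action, and the RBF vector places l Gaussian bells over the state; the exact indexing, and the choice of centers and radii, are assumptions for illustration.

```python
import numpy as np

def fsr_features(state, dims, joint_action, n_joint_actions):
    """FSR feature vector of size D * |{A}| with D = sum(dims), as in (9).

    state : 0-based discrete coordinates, e.g. (x_c, y_c, z_c).
    dims  : sizes of those coordinates, e.g. (7, 7, 4), so D = 18.
    """
    D = sum(dims)
    phi = np.zeros(D * n_joint_actions)
    block = joint_action * D          # only the chosen joint action's block is active
    offset = 0
    for value, size in zip(state, dims):
        phi[block + offset + value] = 1.0
        offset += size
    return phi

def rbf_features(state, centers, radii, joint_action, n_joint_actions):
    """RBF feature vector of size l * |{A}|, as in (10)."""
    s = np.asarray(state, dtype=float)
    l = len(centers)
    phi = np.zeros(l * n_joint_actions)
    block = joint_action * l
    for j, (c, mu) in enumerate(zip(centers, radii)):
        phi[block + j] = np.exp(-np.linalg.norm(s - np.asarray(c)) ** 2 / (2 * mu ** 2))
    return phi
```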

After approximation, the update rule (7) for the Q-function becomes an update rule for the parameter set [33] of each UAV i:

θi,k+1 ← θi,k + α [GR(Sk, Ak) + γ max_{A′∈{A}} (φ^T(Sk+1, A′) θi,k) − φ^T(Sk, Ak) θi,k] φ(Sk, Ak).   (11)
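
With the features above, the approximate value (8) and the parameter update (11) reduce to a few lines. The sketch below assumes the maximizing joint action A′ at Sk+1 has already been obtained, e.g., from the correlated-equilibrium step (4).

```python
def approx_q(theta_i, phi):
    """Approximate value (8): Q_i(S, A) = phi(S, A)^T theta_i."""
    return float(phi @ theta_i)

def parameter_update(theta_i, phi_now, phi_next_best, reward, alpha=0.1, gamma=0.9):
    """Update rule (11) for agent i's parameter vector.

    phi_now       : phi(S_k, A_k) for the joint action actually taken.
    phi_next_best : phi(S_{k+1}, A') for the maximizing joint action at S_{k+1}.
    """
    td_error = (reward + gamma * approx_q(theta_i, phi_next_best)
                - approx_q(theta_i, phi_now))
    return theta_i + alpha * td_error * phi_now
```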

E. Algorithm

We present our learning process as Algorithm 1. The algorithm requires a learning rate α, a discount factor γ, and a schedule {εk}. The learning process is divided into episodes, with arbitrarily initialized UAV states in each episode. We use an ε-greedy policy π with a large initial ε to increase exploration in the early stages; it is diminished over time to focus on finding the optimal joint action according to (4). Each UAV evaluates its performance based on the global reward function in (6), and updates the approximated value function of its states and actions using the update rule (11) in a distributed manner.

Algorithm 1: MULTI-AGENT APPROXIMATED EQUILIBRIUM-BASED Q-LEARNING.

Input: Learning parameters: discount factor γ, learning rate α, schedule {εk}, number of steps per episode L
Input: Basis function vector φ(S, A), ∀si,0 ∈ Si, ∀ai,0 ∈ Ai

1  Initialize θi,0 ← 0, i = 1, ..., m
2  for episode = 1, 2, ... do
3      Randomly initialize state si,0, ∀i
4      for k = 0, 1, 2, ..., L do
5          for i = 1, 2, ..., m do
6              Exchange information with the other UAVs to obtain their states sj,k and parameters θj, j ≠ i, j = 1, ..., m
7              π(Ak) = { find an optimal joint action (strategy) by solving (4), with probability 1 − εk; take a random joint action, otherwise }
               Decide the unique joint action Ak, and take the individual action according to the social conventions rule
8              Receive the other UAVs' new states sj,k+1, j ≠ i, j = 1, ..., m
9              Observe the global reward rk+1 = GR(Sk, Ak)
10             Update: θi,k+1 ← θi,k + α [GR(Sk, Ak) + γ max_{A′∈{A}} (φ^T(Sk+1, A′) θi,k) − φ^T(Sk, Ak) θi,k] φ(Sk, Ak)

Output: parameter vector θi, i = 1, ..., m, and policy π
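
For completeness, the sketch below strings the earlier pieces into a training loop in the spirit of Algorithm 1. The environment interface (env.reset, env.step), the wrapper select_equilibrium_action around the CE computation of (4), the feature function, and the ε schedule are all assumptions; a faithful implementation would also apply the social conventions rule whenever several equilibria exist.

```python
import numpy as np

def train(env, m, n_actions, feature_fn, n_features, select_equilibrium_action,
          episodes=500, max_steps=2000, alpha=0.1, gamma=0.9,
          eps0=0.9, eps_decay=0.995):
    """Skeleton of Algorithm 1 under simplifying assumptions (hypothetical env API)."""
    theta = [np.zeros(n_features) for _ in range(m)]        # line 1: theta_i <- 0
    eps = eps0
    for episode in range(episodes):
        S = env.reset()                                     # line 3: random initial states
        for k in range(max_steps):
            # line 7: epsilon-greedy over the equilibrium joint action from (4)
            if np.random.rand() < eps:
                A = tuple(np.random.randint(n_actions) for _ in range(m))
            else:
                A = select_equilibrium_action(S, theta, feature_fn)
            S_next, r = env.step(A)                         # lines 8-9: new states, global reward
            A_next = select_equilibrium_action(S_next, theta, feature_fn)
            for i in range(m):                              # line 10: distributed update (11)
                theta[i] = parameter_update(theta[i], feature_fn(S, A),
                                            feature_fn(S_next, A_next), r, alpha, gamma)
            S = S_next
            if r > 0:                                       # team goal reached; end the episode
                break
        eps *= eps_decay                                    # diminish exploration over time
    return theta
```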

Fig. 3. 2-D result showing how the FOVs of 3 UAVs collaborate in the last learning episode to provide a full coverage of the unknown field F, with discrete points denoted by the ∗ mark, while avoiding overlapping each other. Panels (a)-(d) show time steps k = 1, 10, 22, and 35.

Fig. 4. Different optimal solutions (panels (a)-(d)) showing configurations of the FOVs of the 3 UAVs with full coverage and no discrete point (∗ mark) overlapped.

IV. EXPERIMENTAL RESULTS

A. Simulation

We set up a simulation in the MATLAB environment to prove the effectiveness of our proposed algorithm. Consider an environment space that is a 7 × 7 × 5 discrete 3-D space, and a field of interest F on a grid board with an unknown shape (Figure 3). The system has m = 3 UAVs, and each UAV can take six possible actions to navigate: forward, backward, go left, go right, go up, or go down. Each UAV in the team will have a positive reward r = 0.1 if the team covers the whole field F with no overlapping; otherwise it receives r = 0.

We implement the proposed Algorithm 1 with both approximation schemes, FSR and RBF, and compare their performance with a baseline algorithm. For the baseline algorithm,

Fig. 5. 3-D representation of the UAV team covering the field.

Fig. 6. Number of steps the UAV team took over episodes to derive the optimal solution.

Fig. 7. Physical implementation with 2 AR.Drones. The UAVs cooperate to cover a field consisting of black markers, while avoiding overlapping each other.

the agents seek to solve the problem by optimizing individual performance, that is, to maximize their own coverage of the field F and to stay away from overlapping others to avoid a penalty of −0.01 for each overlapping square. For the proposed algorithm, both schemes use a learning rate α = 0.1, a discount rate γ = 0.9, and ε = 0.9 for the greedy policy, which is diminished over time. To find the CE for the agents in (4), we utilize the CVX optimization package for MATLAB [39].

Our MATLAB simulation shows that, in both the FSR and RBF schemes, after some training episodes the proposed algorithm allows the UAV team to organize into several optimal configurations that fully cover the field with no overlapping, while the baseline algorithm fails in most episodes. Figure 3 shows how the UAVs coordinated to cover the field F in the last learning episode in 2-D. Figure 4 shows different solutions of the 3 UAVs' FOV configurations with no overlapping. For a clearer view, Figure 5 shows the UAV team and their FOVs in the 3-D environment in the last episode of the FSR scheme.

Figure 6 shows the number of steps per episode the team took to converge to an optimal solution. The baseline algorithm fails to converge, so it took the maximum number of steps (2000), while the two schemes using the proposed algorithm converged nicely. Interestingly, it took longer for the RBF scheme to converge compared to the FSR scheme. This is likely due to the difference in accuracy of the approximation techniques, where the RBF scheme has worse accuracy.

B. Implementation

In this section, we implement a lab-setting experiment for 2 UAVs to cover the field of interest F with specifications similar to the simulation, in an environment space that is a 7 × 7 × 4 discrete 3-D space. We use Parrot AR.Drone 2.0 quadrotors and the Motion Capture System from Motion Analysis [40] to provide state estimation. The UAVs are controlled by a simple PD position controller [41].

We carried out the experiment using the FSR scheme, with parameters similar to the simulation, but now for only 2 UAVs. Each receives a positive reward r = 0.1 if the team covers the whole field F with no overlapping, and r = 0 otherwise. The learning rate was α = 0.1, the discount rate was γ = 0.9, and ε = 0.9, which was diminished over time. Similar to the simulation result, the UAV team also accomplished the mission, with the two UAVs coordinating to cover the whole field without overlapping each other, as shown in Figure 7.

V. CONCLUSION

This paper proposed a MARL algorithm that can be applied to a team of UAVs to enable them to cooperatively learn to provide full coverage of an unknown field of interest, while minimizing the overlapping sections among their fields of view. The complex dynamics of the joint actions of the UAV team have been handled using game-theoretic correlated equilibrium. The challenge of the huge dimensional state space has also been tackled with FSR and RBF approximation techniques that significantly reduce the space required to store the variables. We also provided experimental results with both simulation and a physical implementation to show that the UAVs can successfully learn to accomplish the task without the need for a mathematical model. In the future, we are interested in using deep learning to reduce the computation time, especially in finding the CE. We will also consider more important applications where the dynamics of the field are present, such as wildfire monitoring.

REFERENCES

[1] J. Cortes, S. Martinez, T. Karatas, and F. Bullo, “Coverage control for mobile sensing networks,” IEEE Transactions on Robotics and Automation, vol. 20, no. 2, pp. 243–255, 2004.

[2] M. Schwager, D. Rus, and J.-J. Slotine, “Decentralized, adaptive coverage control for networked robots,” The International Journal of Robotics Research, vol. 28, no. 3, pp. 357–375, 2009.

[3] A. Breitenmoser, M. Schwager, J.-C. Metzger, R. Siegwart, and D. Rus, “Voronoi coverage of non-convex environments with a group of networked robots,” in Robotics and Automation (ICRA), 2010 IEEE International Conference on. IEEE, 2010, pp. 4982–4989.

[4] S. S. Ge and Y. J. Cui, “Dynamic motion planning for mobile robots using potential field method,” Autonomous Robots, vol. 13, no. 3, pp. 207–222, 2002.

[5] M. Schwager, B. J. Julian, M. Angermann, and D. Rus, “Eyes in the sky: Decentralized control for the deployment of robotic camera networks,” Proceedings of the IEEE, vol. 99, no. 9, pp. 1541–1561, 2011.

[6] H. M. La and W. Sheng, “Distributed sensor fusion for scalar field mapping using mobile sensor networks,” IEEE Transactions on Cybernetics, vol. 43, no. 2, pp. 766–778, 2013.

[7] M. T. Nguyen, H. M. La, and K. A. Teague, “Collaborative and compressed mobile sensing for data collection in distributed robotic networks,” IEEE Transactions on Control of Network Systems, vol. PP, no. 99, pp. 1–1, 2017.

[8] H. M. La, “Multi-robot swarm for cooperative scalar field mapping,” Handbook of Research on Design, Control, and Modeling of Swarm Robotics, p. 383, 2015.

[9] H. M. La, W. Sheng, and J. Chen, “Cooperative and active sensing in mobile sensor networks for scalar field mapping,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 1, pp. 1–12, 2015.

[10] H. X. Pham, H. M. La, D. Feil-Seifer, and M. Deans, “A distributed control framework for a team of unmanned aerial vehicles for dynamic wildfire tracking,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sept 2017, pp. 6648–6653.

[11] T. Tomic, K. Schmid, P. Lutz, A. Domel, M. Kassecker, E. Mair, I. L. Grixa, F. Ruess, M. Suppa, and D. Burschka, “Toward a fully autonomous UAV: Research platform for indoor and outdoor urban search and rescue,” IEEE Robotics & Automation Magazine, vol. 19, no. 3, pp. 46–56, 2012.

[12] T. Nguyen, H. M. La, T. D. Le, and M. Jafari, “Formation control and obstacle avoidance of multiple rectangular agents with limited communication ranges,” IEEE Transactions on Control of Network Systems, vol. 4, no. 4, pp. 680–691, Dec 2017.

[13] H. M. La and W. Sheng, “Dynamic target tracking and observing in a mobile sensor network,” Robotics and Autonomous Systems, vol. 60, no. 7, pp. 996–1009, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0921889012000565

[14] F. Muñoz, E. S. Espinoza Quesada, H. M. La, S. Salazar, S. Commuri, and L. R. Garcia Carrillo, “Adaptive consensus algorithms for real-time operation of multi-agent systems affected by switching network events,” International Journal of Robust and Nonlinear Control, vol. 27, no. 9, pp. 1566–1588, 2017, rnc.3687. [Online]. Available: http://dx.doi.org/10.1002/rnc.3687

[15] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998, vol. 1, no. 1.

[16] H. Bou-Ammar, H. Voos, and W. Ertel, “Controller design for quadrotor UAVs using reinforcement learning,” in Control Applications (CCA), 2010 IEEE International Conference on. IEEE, 2010, pp. 2130–2135.

[17] A. Faust, I. Palunko, P. Cruz, R. Fierro, and L. Tapia, “Learning swing-free trajectories for UAVs with a suspended load,” in Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013, pp. 4902–4909.

[18] L. Busoniu, R. Babuska, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” in Innovations in Multi-Agent Systems and Applications-1. Springer, 2010, pp. 183–221.

[19] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent, reinforcement learning for autonomous driving,” arXiv preprint arXiv:1610.03295, 2016.

[20] B. Bakker, S. Whiteson, L. Kester, and F. C. Groen, “Traffic light control by multiagent reinforcement learning systems,” in Interactive Collaborative Information Systems. Springer, 2010, pp. 475–510.

[21] J. Ma and S. Cameron, “Combining policy search with planning in multi-agent cooperation,” in Robot Soccer World Cup. Springer, 2008, pp. 532–543.

[22] M. K. Helwa and A. P. Schoellig, “Multi-robot transfer learning: A dynamical system perspective,” arXiv preprint arXiv:1707.08689, 2017.

[23] F. Fernandez and L. E. Parker, “Learning in large cooperative multi-robot domains,” 2001.

[24] J. Hu and M. P. Wellman, “Nash Q-learning for general-sum stochastic games,” Journal of Machine Learning Research, vol. 4, no. Nov, pp. 1039–1069, 2003.

[25] A. K. Sadhu and A. Konar, “Improving the speed of convergence of multi-agent Q-learning for cooperative task-planning by a robot-team,” Robotics and Autonomous Systems, vol. 92, pp. 66–80, 2017.

[26] Y. Ishiwaka, T. Sato, and Y. Kakazu, “An approach to the pursuit problem on a heterogeneous multiagent system using reinforcement learning,” Robotics and Autonomous Systems, vol. 43, no. 4, pp. 245–256, 2003.

[27] H. M. La, R. Lim, and W. Sheng, “Multirobot cooperative learning for predator avoidance,” IEEE Transactions on Control Systems Technology, vol. 23, no. 1, pp. 52–63, 2015.

[28] S.-M. Hung and S. N. Givigi, “A Q-learning approach to flocking with UAVs in a stochastic environment,” IEEE Transactions on Cybernetics, vol. 47, no. 1, pp. 186–197, 2017.

[29] A. A. Adepegba, M. S. Miah, and D. Spinello, “Multi-agent area coverage control using reinforcement learning,” in FLAIRS Conference, 2016, pp. 368–373.

[30] A. Nowe, P. Vrancx, and Y.-M. De Hauwere, “Game theory and multi-agent reinforcement learning,” in Reinforcement Learning. Springer, 2012, pp. 441–470.

[31] A. Greenwald, K. Hall, and R. Serrano, “Correlated Q-learning,” in ICML, vol. 3, 2003, pp. 242–249.

[32] C. H. Papadimitriou and T. Roughgarden, “Computing correlated equilibria in multi-player games,” Journal of the ACM (JACM), vol. 55, no. 3, p. 14, 2008.

[33] L. Busoniu, R. Babuska, B. De Schutter, and D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, 2010, vol. 39.

[34] L. Busoniu, R. Babuska, and B. De Schutter, “A comprehensive survey of multiagent reinforcement learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part C, vol. 38, no. 2, pp. 156–172, 2008.

[35] C. Guestrin, M. Lagoudakis, and R. Parr, “Coordinated reinforcement learning,” in ICML, vol. 2, 2002, pp. 227–234.

[36] Z. Zhang, D. Zhao, J. Gao, D. Wang, and Y. Dai, “FMRQ - A multiagent reinforcement learning algorithm for fully cooperative tasks,” IEEE Transactions on Cybernetics, vol. 47, no. 6, pp. 1367–1379, 2017.

[37] D. Borrajo, L. E. Parker, et al., “A reinforcement learning algorithm in cooperative multi-robot domains,” Journal of Intelligent and Robotic Systems, vol. 43, no. 2-4, pp. 161–174, 2005.

[38] A. Geramifard, T. J. Walsh, S. Tellex, G. Chowdhary, N. Roy, J. P. How, et al., “A tutorial on linear function approximators for dynamic programming and reinforcement learning,” Foundations and Trends in Machine Learning, vol. 6, no. 4, pp. 375–451, 2013.

[39] M. Grant, S. Boyd, and Y. Ye, “CVX: Matlab software for disciplined convex programming,” 2008.

[40] “Motion Analysis Corporation.” [Online]. Available: https://www.motionanalysis.com/

[41] H. X. Pham, H. M. La, D. Feil-Seifer, and L. V. Nguyen, “Autonomous UAV navigation using reinforcement learning,” arXiv preprint arXiv:1801.05086, 2018.