
A Data-driven Multi-agent Autonomous Voltage Control Framework Using Deep Reinforcement Learning

Shengyi Wang, Student Member, IEEE, Jiajun Duan, Member, IEEE, Di Shi, Senior Member, IEEE, Chunlei Xu, Haifeng Li, Ruisheng Diao, Senior Member, IEEE, Zhiwei Wang, Senior Member, IEEE

Abstract—The complexity of modern power grids keeps increasing due to the expansion of renewable energy resources and the requirement of fast demand response, which poses a great challenge for conventional power grid control systems. Existing autonomous control approaches for the power grid require an accurate system model and a powerful computational platform, which are difficult to scale up for large-scale energy systems with more control options and operating conditions. Facing these challenges, this paper proposes a data-driven multi-agent power grid control scheme using a deep reinforcement learning (DRL) method. Specifically, the classic autonomous voltage control (AVC) problem is taken as an example and formulated as a Markov Game with a heuristic method to partition agents. Then, a multi-agent AVC (MA-AVC) algorithm based on a multi-agent deep deterministic policy gradient (MADDPG) method that features centralized training and decentralized execution is developed to solve the AVC problem. The proposed method can learn from scratch and gradually master the system operation rules from input and output data. In order to demonstrate the effectiveness of the proposed MA-AVC algorithm, comprehensive case studies are conducted on the Illinois 200-bus system considering load/generation changes, N-1 contingencies, and a weak centralized communication environment.

Index Terms—Multi-agent system, autonomous voltage control, deep reinforcement learning, centralized training and decentralized execution, data-driven, deep neural network.

NOMENCLATURE

Scalar
α       Parameter for scaling
β_i^t   Cooperation level parameter of agent i at time step t
γ       Discount factor
σ_i^t   Exploration parameter of agent i at time step t
τ       Parameter for updating target networks
N_a     Number of agents in a Markov Game
n_bi    Number of buses that agent i has
r_i^t   Reward of agent i at time step t
r_ik^t  Reward of agent i at bus k at time step t
T       Time horizon
V_k^t   Voltage magnitude at bus k at time step t

This work is funded by the SGCC Science and Technology Program, Research on Real-time Autonomous Control Strategies for Power Grid based on AI Technologies, under contract no. 5700-201958523A-0-0-00. S. Wang, J. Duan, D. Shi, R. Diao and Z. Wang are with GEIRI North America, San Jose, CA 95134, USA. C. Xu and H. Li are with State Grid Jiangsu Electric Power Company, Nanjing, 210024, China. S. Wang is also with the Department of Electrical and Computer Engineering, Temple University, Philadelphia, PA 19122, USA.

Tensor
θ_i^π   Weights of actor network for agent i
θ_i^Q   Weights of critic network for agent i
θ_i^β   Weights of coordinator network for agent i
θ_i^π′  Weights of target actor network for agent i
θ_i^Q′  Weights of target critic network for agent i
a^t     Joint action at time step t
a_i^t   Action of agent i at time step t
a_-i^t  Actions of all agents except agent i at time step t
o_i^t   Observation of agent i at time step t
o_-i^t  Observations of all agents except agent i at time step t
s^t     State of the environment at time step t

Set
Λ_i^t   Set of violated bus indices of agent i at time step t
A_i     Set of actions for agent i
B_i     Set of local bus indices that agent i has
O_i     Set of observations for agent i
S       Set of states

Function
π_i     Actor network of agent i
π_i′    Target actor network of agent i
f_i     Coordinator network of agent i
g       Indication function
Q_i     Critic network of agent i
Q_i′    Target critic network of agent i
V_i     State-value function of agent i

I. INTRODUCTION

With the increasing integration of renewable energy farms and various distributed energy resources, fast demand response and voltage regulation of modern power grids are facing great challenges such as voltage quality degradation, cascading tripping faults, and voltage stability issues [1]. Various autonomous voltage control (AVC) methods have been developed to better tackle such challenges. The objective of AVC is to maintain bus voltage magnitudes within the desirable range by properly regulating control settings such as generator bus voltage magnitudes, capacitor bank switching, and transformer tap settings. However, most of them are model-based methods, which cannot handle the ever-changing power system [2].

Based on the implementation mechanism, the existing work on AVC can be categorized into three categories [3]: centralized control, distributed control, and decentralized control.


Due to space limitations, only several of the most representative advancements of AVC technology in each category are enumerated below. The centralized control strategy requires sophisticated communication networks to collect global operating conditions and a powerful central controller to process a huge amount of information. As one of the centralized solutions, the optimal power flow (OPF) based method has been extensively implemented to support the system-wide voltage profile, such as [4] in PJM, U.S., and [5] in Denmark, using convex relaxation techniques [6] to handle nonlinear and non-convex problems. However, such OPF-based methods are susceptible to single-point failure, communication burden, and scalability issues [7]. As an alternative solution, the distributed or decentralized control strategy [7], [8] has attracted more and more attention as a way to mitigate the disadvantages of the centralized control strategy. Neither of them requires a central controller, but the former asks neighboring agents to share a certain amount of information, while the latter only uses local measurements without any neighboring communication in a multi-agent system. For example, the alternating direction method of multipliers (ADMM) algorithm is used to develop a distributed voltage control scheme in [9] to achieve the globally optimal settings of reactive power. The paper [10] presents a gradient-projection based local reactive power (VAR) control framework with a guarantee of convergence to a surrogate centralized problem.

Although the majority of existing model-based work such as [4], [5], [9], [10] has been claimed to achieve promising performance in AVC, it heavily relies on accurate knowledge of power grids and their parameters, which is not practical for today's large interconnected power systems with increasing complexity. On one hand, it is quite challenging for model-based methods to handle uncertainties from stochastic load changes and contingencies. On the other hand, it is considerably difficult to acquire an accurate model of certain non-linear system components, e.g., power electronic devices and renewable energy resources. Given such inaccurate grid knowledge and parameters, data-driven methods have attracted more and more attention because they reduce the impact of model accuracy on control performance. A few researchers have developed reinforcement learning (RL) based AVC methods that allow controllers to learn a goal-oriented control scheme from interactions with a system-like simulation model [11] driven by a large amount of operating data. The model-free Q-learning algorithm is used in [12] to provide the optimal control setting, which is the solution of the constrained load flow problem. The authors in [13] propose a fully distributed method for optimal reactive power dispatch using a consensus-based Q-learning algorithm. Recently, deep RL (DRL) has been widely recognized by the research community because of its superior ability to represent continuous high-dimensional spaces. A novel AVC paradigm, called Grid Mind, is proposed in [1], [2] to correct abnormal voltage profiles using DRL. The policy for optimal tap setting of voltage regulation transformers is found by a batch RL algorithm in [14]. The paper [15] proposes a novel two-timescale solution, where the deep Q network method is applied to the optimal configuration of capacitors on the slow time scale.

The above-discussed work motivates us to design a data-driven control strategy implemented in a centralized-training and decentralized-execution fashion for a large-scale system reconstructed as a multi-agent system. In this paper, a novel multi-agent AVC (MA-AVC) scheme is proposed to maintain voltage magnitudes within their operation limits, which extends the latest work [1], [2] from a single-agent centralized-execution control system to a multi-agent decentralized-execution control system. First, a heuristic method is developed to partition agents in two steps, geographic partition and post-partition adjustment, in a trial-and-error manner. The whole system can then be divided into several small regions. Second, the MA-AVC problem is formulated as a Markov Game [16] with a bi-layer reward design considering the cooperation level. Third, the MADDPG algorithm from [17], which is a multi-agent, off-policy, actor-critic DRL algorithm, is modified and reformulated for the AVC problem. During the training process, a centralized communication network is required to provide global information for critic network updating. Notably, this process can be carried out offline in a safe lab environment without interaction with the real system. During online execution, each well-trained DRL agent only takes local measurements, and the output control commands can be verified by the grid operator before execution. Finally, a coordinator approximator is developed to adaptively learn the cooperation level among different agents defined in the reward function. In addition, an independent replay buffer is assigned to each agent to stabilize the MADDPG system. Compared to past work, our contributions can be summarized as follows.

• The DRL-based agent in the proposed MA-AVC scheme can learn its control policy through massive offline training without the need to model complicated physical systems, and can adapt its behavior to new changes including load/generation variations and topological changes.

• The proposed multi-agent DRL system alleviates the curse-of-dimensionality problem in existing DRL methods and can accordingly be scaled up to control large-scale power systems. The proposed control scheme can also be easily extended and applied to control problems other than AVC.

• The decentralized execution mechanism in the proposed MA-AVC scheme can be applied to large-scale intricate energy networks with low computational complexity for each agent. Meanwhile, it addresses the communication delay and single-point failure issues of the centralized control scheme.

• The proposed MA-AVC scheme realizes regional control with an operation-rule based policy design, refines the original MADDPG algorithm with independent replay buffers to stabilize the learning process and coordinators to model the cooperation behavior, and tests the robustness of the algorithm in a weak centralized communication environment.

Regarding the remainder of the paper, Section II introduces the definition of a Markov Game and formulates the AVC problem as a Markov Game. Section III presents MADDPG and proposes a data-driven MA-AVC scheme including offline training and online execution. Numerical simulations using the Illinois 200-bus system are presented in Section IV, with concluding remarks drawn in Section V.



II. PROBLEM FORMULATION

In this section, the preliminaries of Markov Games are introduced first, and then the AVC problem is formulated as a Markov Game.

A. Preliminaries of Markov Games

A multi-agent extension of Markov decision processes (MDPs) can be described by Markov Games. A Markov Game can also be viewed as a collection of coupled strategic games, one per state. At each time step t, a Markov Game for N_a agents is defined by a discrete set of states s^t ∈ S, a discrete set of actions a_i^t ∈ A_i, and a discrete set of observations o_i^t ∈ O_i for each agent. If the current observation o_i^t of each agent completely reveals the current state of the environment, that is, s^t = o_i^t, the game is a fully observable Markov Game; otherwise it is a partially observable Markov Game. This work focuses on the latter. To select actions, each agent has its individual policy π_i : O_i × A_i → [0, 1], which is a mapping π_i(o_i^t) from the observation to an action. When each agent takes its individual action, the environment changes as a result of the joint action a^t ∈ A (= ×_{i=1}^{N_a} A_i) according to the state transition model p(s^{t+1} | s^t, a^t). Each agent obtains a reward as a function of the state and the joint action, r_i^t : S × A → R, and receives a private observation o_i^{t+1} conditioned on the observation model p(o_i^{t+1} | s^t). The goal of each agent is to find a policy that maximizes its expected discounted return

$$\max_{\pi_i} \ \mathbb{E}_{a_i^t \sim \pi_i,\ s^{t+1} \sim p(s^{t+1}|s^t,a^t)} \left[ \sum_{t=0}^{T} \gamma^t r_i^t \right] \tag{1}$$

where γ ∈ [0, 1] is a discount factor and T is the time horizon. Finally, two important value functions of each agent i, the state-value function V_i(s) and the action-value function Q_i(s, a), are defined as follows

$$V_i(s) \doteq \mathbb{E}_{a_i^t \sim \pi_i,\ s^{t+1} \sim p(s^{t+1}|s^t,a^t)} \left[ \sum_{t=0}^{T} \gamma^t r_i^t \,\middle|\, s^0 = s \right] \tag{2}$$

$$Q_i(s, a) \doteq \mathbb{E}_{a_i^t \sim \pi_i,\ s^{t+1} \sim p(s^{t+1}|s^t,a^t)} \left[ \sum_{t=0}^{T} \gamma^t r_i^t \,\middle|\, s^0 = s,\ a^0 = a \right] \tag{3}$$

V_i(s) represents the expected discounted return when starting in s and following π_i thereafter, while Q_i(s, a) represents the expected discounted return when starting by taking action a in state s and following π_i thereafter.

B. Formulating AVC Problem as a Markov Game

For AVC, the control goal is to bring the system voltage profiles back to normal after unexpected disturbances, and the control variables include generator bus voltage magnitudes, capacitor bank switching, transformer tap settings, etc. In this work, without loss of generality, the generator bus voltage magnitude is chosen as the control variable to maintain acceptable voltage profiles.

1) Definition of Agent: In order to reconstruct a large-scale power grid as a multi-agent system, a heuristic method to partition multiple control agents is proposed here. First, the power grid is divided into several regional zones according to the default geographic location information, and each agent is assigned a certain number of inter-connected zones (geographic partition). However, the geographic partition cannot guarantee that every bus voltage is controllable by regulating the local generator bus voltage magnitudes. Therefore, the uncontrollable sparse buses are recorded and re-assigned to other effective agents (post-partition adjustment), which is implemented in a trial-and-error manner. Specifically, after the geographic partition, an offline evaluation program is set up and the uncontrollable buses are recorded during this process. The uncontrollable buses in the records are then re-assigned to other agents that have electrical connections to them. The above post-partition adjustment process is repeated until all of the buses are under control by local resources.
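As a concrete illustration, the following is a minimal sketch of the post-partition adjustment loop described above; `initial_assignment`, `is_controllable`, and `connected_agents` are hypothetical stand-ins for the geographic partition result and the offline evaluation program, not the authors' implementation.

```python
def post_partition_adjustment(initial_assignment, is_controllable,
                              connected_agents, max_rounds=10):
    """Re-assign uncontrollable buses until every bus is controllable locally.

    initial_assignment: dict bus_id -> agent_id from the geographic partition.
    is_controllable(bus, agent): hypothetical offline evaluation of controllability.
    connected_agents(bus): hypothetical list of agents electrically connected to bus.
    """
    assignment = dict(initial_assignment)
    for _ in range(max_rounds):                       # trial-and-error rounds
        uncontrollable = [b for b, a in assignment.items()
                          if not is_controllable(b, a)]
        if not uncontrollable:                        # every bus controlled by local resources
            break
        for bus in uncontrollable:
            for candidate in connected_agents(bus):   # try electrically connected agents
                if candidate != assignment[bus] and is_controllable(bus, candidate):
                    assignment[bus] = candidate       # re-assign the sparse bus
                    break
    return assignment
```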

Fig. 1 shows the Illinois 200-bus system with six default zones denoted by A-F. Initially, zones A and F are assigned to agent #1, zones B and C are assigned to agent #2, and zones D and E are assigned to agent #3, respectively. It should be noted that this partition is not necessarily unique. According to the offline simulation records, the recorded uncontrollable buses are re-assigned among agents #1 to #3. After the partition, zone D is separated into three different sub-zones, namely D1, D2 and D3, in which 14 out of 15 uncontrollable buses (buses #41, #80, #111, #163, #164, #165, #166, #168, #169, #173, #174, #175, #179, #184, i.e., sub-zone D1) are re-assigned from agent #3 to agent #1, and the remaining uncontrollable bus (bus #100, i.e., sub-zone D2) is re-assigned from agent #3 to agent #2. In the end, agent #1 is responsible for zones A, F and D1, agent #2 is responsible for zones B, C and D2, and agent #3 is responsible for zones E and D3.

2) Definition of Action, State and Observation: The control actions are defined as a vector of generator bus voltage magnitudes, each element of which can be continuously adjusted within a range from 0.95 pu to 1.05 pu. The states are defined as a vector of meter measurements that represent the system operation status, e.g., system-wide bus voltage magnitudes, phase angles, loads, generations, and power flows. Similar to some existing voltage control strategies [18], [19], the proposed RL method in this paper only takes the system-wide bus voltage magnitudes as the state in the Markov Game. On one hand, the other system operation statuses are to some extent reflected in the voltage profile. On the other hand, it also demonstrates how powerful DRL is in extracting useful information from limited states. In this way, many resources for measurement and communication can be saved. Three voltage operation zones are defined to differentiate voltage profiles: the normal zone (V_k^t ∈ [0.95, 1.05] pu), the violation zone (V_k^t ∈ [0.8, 0.95) ∪ (1.05, 1.25] pu), and the diverged zone (V_k^t ∈ [0, 0.8) ∪ (1.25, ∞) pu).


Figure 1. The demonstration of a heuristic method to partition agents

The observation of each agent is defined as the local measurements of bus voltage magnitudes. It is assumed that each agent can only observe and manage its own zones.
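For clarity, the following is a minimal sketch of the three voltage operation zones defined above (all values in pu); the function name is illustrative.

```python
def voltage_zone(v_k: float) -> str:
    """Classify a bus voltage magnitude into the three operation zones (pu)."""
    if 0.95 <= v_k <= 1.05:
        return "normal"
    if 0.8 <= v_k < 0.95 or 1.05 < v_k <= 1.25:
        return "violation"
    return "diverged"  # [0, 0.8) or (1.25, +inf)
```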

3) Definition of Reward: To implement DRL, a reward function is designed to evaluate the effectiveness of the actions, and it is defined through a hierarchical consideration. First, for each bus, the reward r_ik^t is designed to motivate the agent to reduce the deviation of the bus voltage magnitude from the given reference value V_ref = 1.0 pu. The complete definition of r_ik^t is given in Table I, where buses with smaller deviations are awarded larger rewards. Then, for each agent, the total reward of each transition is calculated according to three different occasions: i) if all of the voltages are located in the normal zone, each agent is rewarded with the value calculated in Eqn. (4); ii) if a violation exists in any agent's zone without divergence, each agent is penalized with the value in Eqn. (5); iii) if divergence exists in any agent's zone, each agent is penalized with a relatively large constant as in Eqn. (6).

$$r_i^t = \frac{\sum_{k \in B_i} r_{ik}^t + \sum_{j \neq i} \sum_{k \in B_j} r_{jk}^t}{n_{bi} + \sum_{j \neq i} n_{bj}} \in [0, 1] \tag{4}$$

$$r_i^t = \alpha \Big[ \sum_{k \in \Lambda_i^t} r_{ik}^t + \beta_i^t \sum_{j \neq i} \sum_{k \in \Lambda_j^t} r_{jk}^t \Big] \tag{5}$$

$$r_i^t = -5 \tag{6}$$

where B_i is the set of local bus indices that agent i has, and n_bi is the number of buses that agent i has. α is the parameter for scaling, Λ_i^t is the set of violated bus indices that agent i has, and β_i^t ∈ [0, 1] is the parameter reflecting the level of cooperation to fix the system voltage violation issues. When Λ_i^t = ∅, r_ik^t = 0 (k ∈ Λ_i^t).

It should be noted that in occasions i) and iii), each agent receives the same reward. In occasion ii), if β_i^t = 1, all agents share the same reward and collaborate to solve the bus voltage violations of the whole system, whereas when β_i^t approaches 0, each agent considers more its own regional buses and cares less about the other zones.

Table I
THE DEFINITION OF THE REWARD OF EACH BUS

| Operation zone | V_k^t (pu)    | r_ik^t                           | Trend of r_ik^t as V_k^t → 1.0 pu |
|----------------|---------------|----------------------------------|-----------------------------------|
| Normal         | [V_ref, 1.05] | (1.05 − V_k^t) / (1.05 − V_ref)  | 0 → 1                             |
| Normal         | [0.95, V_ref) | (V_k^t − 0.95) / (V_ref − 0.95)  | 0 → 1                             |
| Violation      | (1.05, 1.25]  | (V_k^t − V_ref) / (V_ref − 1.25) | −1 → −0.2                         |
| Violation      | [0.8, 0.95)   | (V_ref − V_k^t) / (0.8 − V_ref)  | −1 → −0.25                        |
| Diverged       | [1.25, ∞)     | −5                               | no change                         |
| Diverged       | [0, 0.8)      | −5                               | no change                         |
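To make the bi-layer design concrete, here is a hedged sketch of the per-bus reward in Table I and the agent-level reward in Eqns. (4)-(6); the value of the scaling parameter α and the dictionary-based interface are assumptions made for illustration.

```python
V_REF = 1.0

def bus_reward(v: float, v_ref: float = V_REF) -> float:
    """Per-bus reward r_ik^t as defined in Table I (v in pu)."""
    if v_ref <= v <= 1.05:
        return (1.05 - v) / (1.05 - v_ref)       # normal zone, upper half
    if 0.95 <= v < v_ref:
        return (v - 0.95) / (v_ref - 0.95)       # normal zone, lower half
    if 1.05 < v <= 1.25:
        return (v - v_ref) / (v_ref - 1.25)      # violation zone, high voltage
    if 0.8 <= v < 0.95:
        return (v_ref - v) / (0.8 - v_ref)       # violation zone, low voltage
    return -5.0                                   # diverged zone

def agent_reward(i, voltages_by_agent, beta_i, alpha=0.1):
    """Agent-level reward r_i^t following Eqns. (4)-(6).

    voltages_by_agent: dict agent_id -> list of bus voltage magnitudes (pu).
    alpha: scaling parameter (the 0.1 default is an assumption, not the paper's value).
    """
    def violated(v):
        return not 0.95 <= v <= 1.05

    all_v = [v for vs in voltages_by_agent.values() for v in vs]
    if any(v < 0.8 or v > 1.25 for v in all_v):                # occasion iii): divergence, Eqn. (6)
        return -5.0
    if not any(violated(v) for v in all_v):                    # occasion i): all normal, Eqn. (4)
        return sum(bus_reward(v) for v in all_v) / len(all_v)
    own = sum(bus_reward(v) for v in voltages_by_agent[i] if violated(v))
    others = sum(bus_reward(v) for j, vs in voltages_by_agent.items()
                 if j != i for v in vs if violated(v))
    return alpha * (own + beta_i * others)                     # occasion ii): Eqn. (5)
```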

III. DATA-DRIVEN MULTI-AGENT AVC SCHEME

In the previous section, the MA-AVC problem has been formulated as a Markov Game. Thus, one critical step toward solving Eqn. (1) is to design an agent that learns an effective policy (control law) through interaction with the environment. One desired feature of a suitable DRL algorithm is that it may utilize extra information to accelerate the training process, while only local measurements (i.e., observations) are required during execution. In this section, a multi-agent, off-policy, actor-critic DRL algorithm, MADDPG [17], is first briefly introduced. Then, a novel MA-AVC scheme is developed based on the extension and modification of MADDPG. The proposed method is data-driven, centrally trained (even in a weak communication environment during training), decentrally executed, and operation-rule-integrated, which meets the desired criteria of modern power grid operation.

A. MADDPG

Considering a deterministic parametric policy, called the actor and denoted by π_i(·|θ_i^π) : O_i → A_i, approximated by a neural network for agent i, the control law of each agent with Gaussian noise N(0, σ_i^t) can be expressed as

$$a_i^t = \pi_i(o_i^t \mid \theta_i^\pi) + \mathcal{N}(0, \sigma_i^t) \tag{7}$$

where θ_i^π is the weights of the actor for agent i, and σ_i^t is a parameter for exploration.


For the episodic case, the performance measure of the policy, J(θ_i^π), for agent i can be defined as the value function of the start state of the episode

$$J(\theta_i^\pi) = V_i(s^0) \tag{8}$$

According to policy improvement, the actor can be updated by implementing gradient ascent to move the policy in the direction of the gradient of Eqn. (8), which can be viewed as maximizing the action-value function, and an analytic expression of the gradient can be written as

$$\nabla_{\theta_i^\pi} J(\theta_i^\pi) \approx \mathbb{E}_{s^t \sim \mathcal{D}} \big[ \nabla_{\theta_i^\pi} Q_i(s^t, a_i^t = \pi_i(o_i^t \mid \theta_i^\pi), a_{-i}^t) \big] \tag{9}$$

where D is the replay buffer [20] which stores historical experience, and a_-i^t denotes the other agents' actions. At each time step, the actor and critic of each agent can be updated by sampling a minibatch uniformly from the buffer, which allows the algorithm to benefit from learning across a set of uncorrelated experiences and stabilizes the learning process. Without a replay buffer, the gradient ∇_{θ_i^π} J(θ_i^π) in Eqn. (8) would be calculated using sequential samples, which may always have the same direction and lead to divergence of learning.
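A minimal sketch of such a replay buffer with uniform minibatch sampling is given below; the capacity and batch size follow the hyperparameters reported in Section IV, while the class itself is illustrative rather than the authors' implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """Per-agent experience buffer D_i with uniform minibatch sampling."""

    def __init__(self, capacity: int = 200):
        self.buffer = deque(maxlen=capacity)      # the oldest transitions are discarded first

    def store(self, transition) -> None:
        # transition: (s_t, o_i_t, a_t, r_i_t, s_t1, o_i_t1, a_t1_target), cf. Eqn. (19)
        self.buffer.append(transition)

    def sample(self, batch_size: int = 126):
        # uniform sampling breaks the temporal correlation of consecutive transitions
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

    def __len__(self) -> int:
        return len(self.buffer)
```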

Applying the chain rule to Eqn. (9), the gradient of Eqn. (8) can be decomposed into the gradient of the action-value with respect to the action and the gradient of the policy with respect to the policy parameters

$$\nabla_{\theta_i^\pi} J(\theta_i^\pi) = \mathbb{E}_{s^t \sim \mathcal{D}} \big[ \nabla_{a_i^t} Q_i(s^t, a_i^t, a_{-i}^t)\, \nabla_{\theta_i^\pi} a_i^t \big|_{a_i^t = \pi_i(o_i^t \mid \theta_i^\pi)} \big] \tag{10}$$

It should be noted that the action-value Q_i(s^t, a_i^t, a_-i^t) is a centralized policy evaluation function considering not only agent i's own actions but also the other agents' actions, which helps to keep the environment stationary for each agent even as the policies change. In addition, we have s^t = (o_i^t, o_-i^t), but there is actually no restriction on this setting.

The process of learning an action-value function is called policy evaluation. Considering a parametric action-value function, called the critic and denoted by Q_i(·|θ_i^Q), approximated by a neural network for agent i, the action-value function can be updated by minimizing the following loss

$$L(\theta_i^Q) = \mathbb{E}_{s^t \sim \mathcal{D}} \big[ (Q_i(s^t, a_i^t, a_{-i}^t \mid \theta_i^Q) - y_i^t)^2 \big] \tag{11}$$

where

$$y_i^t = r_i^t + \gamma Q_i(s^{t+1}, a_i^{t+1}, a_{-i}^{t+1} \mid \theta_i^Q) \tag{12}$$

θ_i^Q is the weights of the critic for agent i. In order to improve the stability of learning, target networks for the actor, π_i′(·|θ_i^π′), and the critic, Q_i′(·|θ_i^Q′), are introduced in [21]; each is a copy of the original actor or critic network with an earlier snapshot of its weights. θ_i^π′ and θ_i^Q′ are the weights of the target actor and target critic, respectively. The target value y_i^t is a reference value that the critic network Q_i(·|θ_i^Q) tracks during training. This value is estimated by the target actor π_i′(·|θ_i^π′) and target critic Q_i′(·|θ_i^Q′) [21]. Then y_i^t is stabilized by replacing the networks with the target networks

$$y_i^t = r_i^t + \gamma Q_i'(s^{t+1}, a_i^{t+1\prime}, a_{-i}^{t+1\prime} \mid \theta_i^{Q\prime}) \big|_{a_i^{t+1\prime} = \pi_i'(o_i^{t+1})} \tag{13}$$

The weights of these target networks for agent i are updated by having them slowly track the learned networks (actor and critic)

$$\theta_i^{Q\prime} \leftarrow \tau \theta_i^{Q} + (1 - \tau)\, \theta_i^{Q\prime} \tag{14}$$

$$\theta_i^{\pi\prime} \leftarrow \tau \theta_i^{\pi} + (1 - \tau)\, \theta_i^{\pi\prime} \tag{15}$$

where τ ≪ 1 is a parameter for updating the target networks.
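The following is a hedged PyTorch sketch of one MADDPG update for agent i following Eqns. (10)-(15): the centralized critic is regressed onto the target value y_i^t, the actor ascends the critic's gradient, and the target networks are softly updated. The network wrappers, the batch layout, and the choice of PyTorch are assumptions; the target-actor actions are taken from the stored transitions, as in Eqn. (19).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def soft_update(target_net: torch.nn.Module, net: torch.nn.Module, tau: float = 1e-4):
    """Eqns. (14)-(15): theta' <- tau * theta + (1 - tau) * theta'."""
    for p_tgt, p in zip(target_net.parameters(), net.parameters()):
        p_tgt.mul_(1.0 - tau).add_(tau * p)

def maddpg_update(batch, actor, critic, target_actor, target_critic,
                  actor_opt, critic_opt, gamma: float = 0.99, tau: float = 1e-4):
    """One centralized-critic update for agent i on a sampled minibatch.

    batch holds tensors (s, o_i, a_i, a_others, r_i, s_next, a_i_next, a_others_next),
    where a_i_next / a_others_next are the target-actor actions stored with the
    transition, cf. Eqn. (19).
    """
    s, o_i, a_i, a_others, r_i, s_next, a_i_next, a_others_next = batch

    # critic update, Eqns. (11)-(13): regress Q_i onto the stabilized target value y_i
    with torch.no_grad():
        y = r_i + gamma * target_critic(s_next, a_i_next, a_others_next)
    critic_loss = F.mse_loss(critic(s, a_i, a_others), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # actor update, Eqn. (10): maximize Q_i with agent i's action replaced by pi_i(o_i)
    actor_loss = -critic(s, actor(o_i), a_others).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # have the target networks slowly track the learned networks, Eqns. (14)-(15)
    soft_update(target_critic, critic, tau)
    soft_update(target_actor, actor, tau)
    return critic_loss.item(), actor_loss.item()
```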

B. MA-AVC Scheme

From Eqn. (5), it can be seen that the proposed reward in the second occasion requires setting the parameter β_i^t to reflect the level of cooperation. It can be set manually as a constant, but in this work a coordinator, denoted by f_i(·|θ_i^β) : S → R and approximated by a neural network for agent i, is proposed to adaptively regulate it. The parameter β_i^t is calculated as

$$\beta_i^t = f_i(s^t \mid \theta_i^\beta) \tag{16}$$

where θ_i^β is the weights of the coordinator for agent i. It can be seen that the parameter β_i^t is determined by the system state. In this work, the coordinator is updated by minimizing the critic loss with respect to the coordinator weights, and its gradient can be expressed as

$$\nabla_{\theta_i^\beta} L(\theta_i^\beta) = 2\, \mathbb{E}_{s^t \sim \mathcal{D}} \Big[ \sum_{j \neq i} \sum_{k \in \Lambda_j^t} r_{jk}^t\, \nabla_{\theta_i^\beta} \beta_i^t \Big] \tag{17}$$

It is expected that the critic can evaluate how good the parameter β_i^t is during training, so that the learned β_i^t becomes a good predictor of the cooperation level for the next time step.
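As an illustration, the sketch below performs one coordinator update whose gradient matches Eqn. (17): the other agents' violated-bus rewards scale the gradient of β_i^t with respect to the coordinator weights. The coordinator is assumed to be a PyTorch module mapping the global state to β ∈ [0, 1] (cf. Eqn. (16) and Fig. 4); the batching details are assumptions.

```python
import torch

def update_coordinator(coordinator: torch.nn.Module,
                       optimizer: torch.optim.Optimizer,
                       state_batch: torch.Tensor,
                       others_violated_reward: torch.Tensor) -> torch.Tensor:
    """One coordinator update following Eqn. (17).

    state_batch: (B, dim_s) sampled global states.
    others_violated_reward: (B,) values of sum_{j != i} sum_{k in Lambda_j^t} r_jk^t.
    """
    beta = coordinator(state_batch).squeeze(-1)          # beta_i^t = f_i(s^t | theta_i^beta)
    # surrogate loss whose gradient equals 2 E[ (sum of others' violated rewards) * d(beta)/d(theta) ]
    loss = 2.0 * (others_violated_reward.detach() * beta).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return beta.detach()
```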

Conventionally, it is desired to regulate the generators in the abnormal voltage areas while maintaining the original settings of the generators in the other, normal areas. Considering this operation rule, the proposed control law of each agent can be expressed as

$$a_i^t = \begin{cases} \pi_i(o_i^t \mid \theta_i^\pi) + \mathcal{N}(0, \sigma_i^t) & \text{if } |\Lambda_i^t| > 0; \\ a_i^{t-1} & \text{if } |\Lambda_i^t| = 0. \end{cases} \tag{18}$$

where |Λ_i^t| is the number of violated buses that agent i has. In order to make the learning more stable, each agent has its own replay buffer, denoted by D_i, which stores the following transitions

$$\mathcal{D}_i \leftarrow (s^t, o_i^t, a^t, r_i^t, s^{t+1}, o_i^{t+1}, a^{t+1\prime}) \tag{19}$$

where a^t = (a_i^t, a_-i^t) and a^{t+1′} = (a_i^{t+1′}, a_-i^{t+1′}). This is done to make the samples more identically distributed.
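A minimal sketch of the operation-rule based control law in Eqn. (18) is shown below: only an agent that currently observes voltage violations issues a new setpoint, while the others hold their previous action. Clipping the noisy action to the 0.95-1.05 pu generator-voltage range is an assumption.

```python
import numpy as np

def rule_based_action(actor, obs, prev_action, num_violated: int, sigma_t: float):
    """Eqn. (18): act only when |Lambda_i^t| > 0, otherwise keep the previous setting."""
    if num_violated == 0:
        return prev_action                                 # no local violation: hold setpoints
    a = np.asarray(actor(obs), dtype=float)                # deterministic actor output
    a = a + np.random.normal(0.0, sigma_t, size=a.shape)   # Gaussian exploration noise
    return np.clip(a, 0.95, 1.05)                          # keep setpoints in the action range
```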

Incorporating Eqns. (10)-(11) and Eqns. (13)-(19), the proposed MA-AVC scheme is summarized in Algorithm 1 for training and Algorithm 2 for execution.


C. Training and Execution

In order to mimic the real power system in the lab, a power flow solver environment is used in Algorithm 1. Each agent has its individual actor, critic, coordinator, and replay buffer, but the agents can share a certain amount of information during the training process. In Algorithm 1, the values of M and N are the size of the training dataset and the maximum number of iterations, respectively. The size of the training dataset should be large enough so that it covers more system operation statuses. The maximum number of iterations should not be too large, so as to reduce the negative impact on training caused by consecutive transitions with ineffective actions. The information flow of the training process in the lab is given in Fig. 2. The detailed training and implementation process of the proposed MA-AVC method is summarized as follows.

Step 1. For each power flow file (with or without contingencies) taken as an episode, the environment (grid simulator) solves the power flow and obtains the initial grid states. Based on the states, if the agents detect any voltage violations, the observation of each agent is extracted; otherwise, move to the next episode (i.e., redo Step 1).

Step 2. The non-violated DRL agents maintain their original action settings, while the violated DRL agents execute new actions based on Eqn. (18). Then, new grid states are obtained from the environment using the modified power flow file. According to the obtained new states, the reward and the new observation of each agent are calculated and extracted, respectively.

Step 3. Each violated agent stores the transitions in its individual replay buffer. Periodically, the actor, critic, and coordinator networks are updated in turn with a randomly sampled minibatch.

Step 4. Along with the training, each DRL agent keeps reducing the noise to decrease the exploration probability. If one of the episode termination conditions is satisfied, store the information and go to the next episode (i.e., redo Step 1).

The above closed-loop process continues until all of the episodes in the training dataset run out. For each episode, the training process terminates when one of three conditions is satisfied: i) the violation is cleared; ii) the power flow solution diverges; iii) the maximum number of iterations is reached. It does not matter whether a voltage violation still exists when the episode is terminated under conditions ii) and iii): through the penalization mechanism designed into the reward, the agent can learn from the experience to avoid these bad termination conditions.

During online execution, the actors of the controllers only utilize the local measurements from the power grid. At the beginning stage of the online implementation, the decisions from the DRL agents are first confirmed by the system operator to avoid risks. In the meanwhile, the real-time actions from the existing AVC can also be used to quickly retrain the online DRL agents. It can be noted that the proposed control scheme is fully decentralized during execution, which realizes regional AVC without any communication. In the experimental environment, an example of decentralized execution under a heavy load condition is presented in Fig. 3.

Figure 2. Information flow of the DRL agent training process

It can be observed that agent #1 initially has several bus voltages (black points) dropping below the lower bound (red dashed line), while agents #2 and #3 are fine. Once agent #1 detects the violations, its actor outputs the control action to reset the PV bus voltages (blue crosses) given its own observations (black points and blue crosses), while the actors of the other agents remain the same. After control, the originally violated voltages are regulated back into the normal zone.

Algorithm 1 The MA-AVC Algorithm for Training
1: for episode = 1 to M do
2:   Initialize the power flow and send o_i^t, s^t to each agent
3:   Count |Λ_i^t|
4:   while voltages violate and step < N do
5:     Calculate a_i^t based on (18)
6:     Execute a_i^t in the power flow solver environment and send a^t, s^{t+1}, r_i^t to each agent
7:     Based on a^t, s^{t+1}, r_i^t, select a_i^{t+1′} using the target actor
8:     Congregate all a_i^{t+1′} and share a^{t+1′} with each agent
9:     Store transitions in D_i for each violated agent i
10:    Update the actor (10), critic (11), and coordinator (17) of the violated agents with a randomly sampled minibatch
11:    Update the target critic and actor using (14) and (15)
12:    Reduce the noise σ_i^t
13:    step += 1
14:  end while
15: end for

Algorithm 2 The MA-AVC Algorithm for Execution
1: repeat
2:   Detect the voltage violations of each agent and count |Λ_i^t|
3:   Select a_i^t using (18) with extremely small σ_i^t
4:   Execute a_i^t in the environment
5: until the voltage violations are cleared
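To connect Algorithm 1 with the update rules above, the sketch below outlines one possible training loop under stated assumptions: `env` is a power-flow environment exposing reset/step/violation queries, and each element of `agents` bundles an actor, critic, coordinator, target networks, and replay buffer behind simple methods. All of these wrappers are hypothetical.

```python
def train(env, agents, episodes: int, max_steps: int = 50, sigma_decay: float = 0.0009):
    """Sketch of the centralized-training loop of Algorithm 1."""
    for _ in range(episodes):
        state, obs = env.reset()                                  # solve power flow for a new case
        step = 0
        while env.has_violations() and step < max_steps:
            violated = [env.num_violated(ag) for ag in agents]    # |Lambda_i^t| per agent
            actions = [ag.act(o, nv)                              # Eqn. (18): hold or explore
                       for ag, o, nv in zip(agents, obs, violated)]
            state_next, obs_next, rewards, done = env.step(actions)
            next_actions = [ag.target_act(o) for ag, o in zip(agents, obs_next)]
            for ag, o, r, o_next, nv in zip(agents, obs, rewards, obs_next, violated):
                if nv > 0:                                        # only violated agents store/learn
                    ag.buffer.store((state, o, actions, r,
                                     state_next, o_next, next_actions))   # Eqn. (19)
                    ag.update()                                   # actor (10), critic (11), coord. (17)
                    ag.soft_update_targets()                      # Eqns. (14)-(15)
                ag.sigma *= (1.0 - sigma_decay)                   # decay exploration noise
            state, obs = state_next, obs_next
            step += 1
            if done:
                break
```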

IV. NUMERICAL SIMULATION

The proposed MA-AVC scheme is numerically simulated on the Illinois 200-bus system [22]. The whole system is partitioned into three agents and formulated as a Markov Game with the specifications shown in Table II. To mimic a real power system environment, an in-house developed power grid simulator is adapted to implement the AC power flow.


Figure 3. An example of decentralized execution under a heavy load condition

The in-house developed power grid simulator runs on the Linux operating system with BPA format data as input [23]. Its major functions include AC/DC power flow analysis, AC optimal power flow analysis, and voltage stability analysis. The training data are synthetically generated from a feasible power flow solution, and the power flow based constraints are not neglected. The specific generation process for the training data is summarized as follows. First, the load is perturbed within a random range from 80% to 120%. Then, the active generation output of each generator is adjusted based on its capacity to match the total active power consumption. Next, one transmission line can be randomly tripped to create a contingency situation. After that, the power flow solver is run to check whether the case is solvable. Finally, the feasible power flow cases are stored as the training data. During training, the generator bus voltage magnitudes are changed by the control actions. Each generator also has its own generation limits for active and reactive power. If a generator reaches its generation limits, it is automatically transferred from a P/V bus to a P/Q bus in the environment. Therefore, the AC power flow based constraints are always guaranteed.
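For reference, the following sketch mirrors the data-generation steps just described; the simulator wrappers passed in (scale_loads, redispatch_generation, trip_random_line, solve_power_flow) are hypothetical stand-ins for the in-house tool, and the contingency probability is an assumption.

```python
import random

def generate_training_cases(base_case, n_cases, scale_loads, redispatch_generation,
                            trip_random_line, solve_power_flow,
                            load_range=(0.8, 1.2), p_contingency=0.5):
    """Build a set of feasible power flow cases for training (a sketch)."""
    cases = []
    while len(cases) < n_cases:
        case = scale_loads(base_case, random.uniform(*load_range))   # perturb the load
        case = redispatch_generation(case)                           # match total active demand
        if random.random() < p_contingency:                          # optional N-1 contingency
            case = trip_random_line(case)
        solved, solution = solve_power_flow(case)                    # AC power flow check
        if solved:                                                   # keep feasible cases only
            cases.append(solution)
    return cases
```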

Table II
THE SPECIFICATION OF THE MARKOV GAME CONSTRUCTED ON THE ILLINOIS 200-BUS SYSTEM

| Agent i             | n_bi | dim(a_i^t) | dim(o_i^t) | dim(D_i) |
|---------------------|------|------------|------------|----------|
| #1 (Zones A, F, D1) | 106  | 15         | 106        | 690      |
| #2 (Zones B, C, D2) | 65   | 15         | 65         | 608      |
| #3 (Zones E, D3)    | 29   | 8          | 29         | 536      |

*dim(·) denotes the dimension of the vector.

The neural network architectures of the (target) actor, (target) critic, and coordinator for each agent are presented in Fig. 4. Fully connected (FC) layers are used to build the networks. Batch normalization (BN) can be applied to the input [24]. The Rectified Linear Unit (ReLU) and Sigmoid functions are selected as the activation functions. The number of neurons is labeled below each layer.

Figure 4. The neural network architecture of the (target) actor, (target) critic, and coordinator for each agent

During training, the Adam optimizer with learning rates of 10^-6, 10^-6, and 10^-5 is used for the actor, critic, and coordinator, respectively, and the parameter 10^-4 is used for updating the target networks. The discount factor γ, the size of the replay buffer, the batch size, and the maximum number of time steps are set to 0.99, 200, 126, and 50, respectively. The exploration parameter σ_i^t is decayed by 0.09% per time step. After all replay buffers are filled up, the network parameters are updated once every two time steps if needed.
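As a concrete reading of Fig. 4, the sketch below implements the actor branch in PyTorch: one FC hidden layer of 30 neurons with BN and ReLU, and an FC output layer with a Sigmoid. Rescaling the Sigmoid output onto the 0.95-1.05 pu setpoint range is an assumption added here for completeness; the paper only specifies the Sigmoid output layer.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Actor network of Fig. 4: obs -> FC(30) + BN + ReLU -> FC(act_dim) + Sigmoid."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 30):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.BatchNorm1d(hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim),
            nn.Sigmoid(),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # map the (0, 1) Sigmoid output onto the [0.95, 1.05] pu generator setpoint range
        return 0.95 + 0.10 * self.net(obs)

# Example: agent #1 observes 106 bus voltages and controls 15 generator setpoints (Table II)
actor_1 = Actor(obs_dim=106, act_dim=15)
```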

A. Case I: Without Contingencies

In Case I, all lines and transformers are in normal working condition and a strong centralized communication environment is utilized during training. The operation data have a 70%-130% load change from the original base value, and the power generation is re-dispatched based on a participation factor. The three DRL-based agents are trained on the first 2000 cases with an 80%-120% load change and tested on the remaining 3000 cases with a 70%-130% load change. Since minimizing the actor loss is equivalent to maximizing the action-value function, the actor loss is defined as the negative action-value function in this paper. As shown in Fig. 5, as the training process continues, the actor loss and the critic loss of each agent have a downward tendency and finally converge to an equilibrium solution. It can be observed in Fig. 6 that the total reward increases while the action time decreases, that is, each agent is trained to take as few steps as possible to fix the voltage violations. The blue lines in Fig. 6 are the actual total reward behavior, and the black lines are the smoothed total reward behavior. During testing, all agents only take one or two actions to fix the voltage problem. Fig. 7 shows the level of cooperation β_i of each agent. It remains 0.5 at the beginning of training because the replay buffers have not been filled up and no network parameters are updated. Once the network parameters start to update, the level of cooperation of each agent keeps adjusting based on the input state until the three agents converge to an equilibrium solution. The CPU time in Fig. 8 shows an obvious decreasing tendency along the training process.

B. Case II: With Contingencies

In Case II, the same episodes and settings as in Case I are used during training, but random N-1 contingencies are considered to represent emergency conditions in real grid operation. One transmission line is randomly tripped during training, e.g., 108-75, 19-17, 26-25, or 142-86. As shown in Fig. 9, both the actor loss and the critic loss of each agent show a downward tendency and finally converge to the equilibrium solution.


Figure 5. The actor and critic loss for Case I

Figure 6. The reward and action time for Case I

Figure 7. The level of cooperation for Case I

Figure 8. The CPU time for Case I

It can be observed in Fig. 10 that the total reward increases and the action execution time decreases. During testing, all agents only take one or two actions to fix the voltage problem as well. Fig. 11 shows the update of the cooperation level. Similarly, the CPU time in Fig. 12 shows a decreasing tendency.

Both Case I and Case II demonstrate the effectiveness of the proposed MA-AVC scheme for voltage regulation with and without contingencies.

C. Case III: With Weak Centralized Communication

The setting of Case III is the same as Case II, where N-1 contingencies are considered, but the communication graph among agents is not fully connected, namely weak centralized communication. We assume that agent #1 can communicate with agents #2 and #3, but agents #2 and #3 cannot communicate with each other.

Figure 9. The actor and critic loss for Case II

Figure 10. The reward and action time for Case II

Figure 11. The level of cooperation for Case II

Figure 12. The CPU time for Case II

As shown in Fig. 13, during the training process, the actor loss and the critic loss of each agent have a downward tendency and finally converge to the equilibrium solution. It can be observed in Fig. 14 that the total reward keeps increasing while the action time keeps decreasing along the training process. It should be noted that each agent takes a few more action steps than in Case II, which means the limited communication does reduce the performance of the system. Then, Fig. 15 and Fig. 16 show results similar to the previous cases.

Case III shows that the proposed MA-AVC scheme can still fix the voltage violations well in a weak centralized communication environment, at the cost of slightly more action steps. This is a solid basis for extending the proposed algorithm to distributed training later.


Figure 13. The actor and critic loss for Case III

Figure 14. The reward and action time for Case III

Figure 15. The level of cooperation for Case III

Figure 16. The CPU time for Case III

In addition, the levels of cooperation in Cases I, II, and III have a similar tendency: the cooperation level of agent #1 goes up while the cooperation levels of agents #2 and #3 go down. This indicates that agent #1 has more potential to fix voltage violations and thus can contribute more to solving voltage issues.

D. Case IV: The Effect of Reward on Learning

In Case IV, the effect of the reward on motivating learning is studied. In the proposed reward design principle, a reward is assigned to each bus in terms of the deviation of its voltage magnitude from the given reference value. Although the major objective of this paper is to maintain acceptable voltage profiles, there is a concern whether the DRL-based agent can autonomously learn to reduce the deviation of bus voltage magnitudes from a given reference value. Case studies are performed with two different reference values: 1.0 pu and 0.96 pu. As shown in Fig. 17, the red dashed lines are the lower and upper bounds of the normal voltage zone. The points are the average voltage magnitude at each bus over the samples of the testing dataset. The dash-dot lines are the average voltage magnitude over all buses and samples of the testing dataset. It can be seen that the average voltage differs before and after control. It can be further observed that the overall trend is toward the given reference, which demonstrates the ability of the DRL-based agent to reduce deviations and optimize the voltage profile.

E. Comparison Analysis

As mentioned in Section I, the proposed method is an extension of the latest work [1], [2] from a single-agent DRL-based AVC system to a multi-agent AVC system. In order to emphasize the superiority of the proposed method over the previous work with the same test system (the Illinois 200-bus system) and the same control variables, a relatively fair comparison is conducted in Table III. It shows that the proposed DRL-based method has improved scalability, which is the premise of deploying DRL-based AVC on a large-scale power grid. On one hand, the actor network of each agent has fewer neural network parameters and lower model complexity. Considering a one-hidden-layer fully connected neural network as the actor network, the previous work has at least 200 × 38 = 7600 parameters, while the proposed method has 106 × 15 + 65 × 15 + 29 × 8 = 2797 parameters. On the other hand, as the size of the power grid increases, the previous work cannot handle the high-dimensional input-output space of the actor network, while the proposed method can alleviate the curse-of-dimensionality problem because, once the system is reconstructed as a multi-agent system, each agent only controls its regional devices given the local measurements. Moreover, with the operation-rule based policy, the proposed work realizes regional control, i.e., when voltage violations occur in some agent's zone, only the problematic agent needs to make a decision to fix the voltage violations, as illustrated in Fig. 3, while the previous work cannot. This regional controllability better suits practical AVC.
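A quick arithmetic check of the parameter counts quoted above (input dimension times output dimension, biases and hidden-layer widths ignored):

```python
single_agent_params = 200 * 38                      # previous single-agent actor [2]
multi_agent_params = 106 * 15 + 65 * 15 + 29 * 8    # agents #1-#3 of the proposed scheme
print(single_agent_params, multi_agent_params)       # 7600 2797
```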

V. CONCLUSIONS

This paper has proposed a DRL-based data-driven MA-AVC scheme to mitigate voltage issues in large-scale energy systems. The MA-AVC problem is formulated as a Markov Game with a heuristic method to partition agents. The MADDPG algorithm is adapted and modified to learn an effective policy in a centralized way from the operating data. The well-trained DRL-based agents achieve satisfactory performance in controlling voltage profiles in a decentralized manner. The proposed coordinators can adaptively regulate the level of cooperation based on the system states during learning. Finally, numerical simulations on the Illinois 200-bus system verify the effectiveness of the MA-AVC algorithm. Moreover, the MA-AVC algorithm can also cope with a weak centralized communication environment, which is a good base for extending our training algorithm to distributed learning in future work.


Figure 17. The illustration of the effect of reward on learning

Table III
THE COMPARISON ANALYSIS OF THE PROPOSED METHOD WITH THE BASE WORK IN [2]

| Method | Implementation Mechanism | Solution Methodology | Control Performance |
|--------|--------------------------|----------------------|---------------------|
| [2] | Centralized training & centralized execution | 1. Single-agent control; 2. Formulation: Markov Decision Process (MDP); 3. State: active & reactive line power flows, bus voltage magnitudes & phase angles; 4. DDPG based agent: actor + critic + replay buffer; 5. Deterministic policy | With random load fluctuations and contingencies applied to the operation data, the DRL agent can effectively fix the voltage violation issues. |
| Proposed method | Centralized training & decentralized execution | 1. Multi-agent control; 2. Formulation: Markov Game; 3. State: bus voltage magnitudes; 4. DDPG based agent: actor + critic + coordinator + independent replay buffer; 5. Operation rule based deterministic policy | With random load fluctuations and contingencies applied to the operation data while considering communication limits, the DRL agents can effectively solve the voltage violation issues with improved scalability and regional controllability. |

Also, more controllable devices such as transformers and shunts can be incorporated into the MA-AVC system in the future.

REFERENCES

[1] R. Diao, Z. Wang, D. Shi et al., "Autonomous voltage control for grid operation using deep reinforcement learning," IEEE PES General Meeting, Atlanta, GA, USA, 2019.
[2] J. Duan, D. Shi, R. Diao, H. Li, Z. Wang, B. Zhang, D. Bian, and Z. Yi, "Deep-reinforcement-learning-based autonomous voltage control for power grid operations," IEEE Transactions on Power Systems, 2019.
[3] H. Sun, Q. Guo et al., "Review of challenges and research opportunities for voltage control in smart grids," IEEE Transactions on Power Systems, 2019.
[4] Q. Guo, H. Sun, M. Zhang et al., "Optimal voltage control of PJM smart transmission grid: Study, implementation, and evaluation," IEEE Transactions on Smart Grid, vol. 4, no. 3, pp. 1665–1674, Sep. 2013.
[5] N. Qin, C. L. Bak et al., "Multi-stage optimization-based automatic voltage control systems considering wind power forecasting errors," IEEE Transactions on Power Systems, vol. 32, no. 2, pp. 1073–1088, 2016.
[6] S. H. Low, "Convex relaxation of optimal power flow—part I: Formulations and equivalence," IEEE Transactions on Control of Network Systems, vol. 1, no. 1, pp. 15–27, 2014.
[7] D. K. Molzahn, F. Dörfler et al., "A survey of distributed optimization and control algorithms for electric power systems," IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2941–2962, 2017.
[8] K. E. Antoniadou-Plytaria, I. N. Kouveliotis-Lysikatos et al., "Distributed and decentralized voltage control of smart distribution networks: Models, methods, and future research," IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2999–3008, 2017.
[9] H. J. Liu, W. Shi, and H. Zhu, "Distributed voltage control in distribution networks: Online and robust implementations," IEEE Transactions on Smart Grid, vol. 9, no. 6, pp. 6106–6117, Nov. 2018.
[10] H. Zhu and H. J. Liu, "Fast local voltage control under limited reactive power: Optimality and stability analysis," IEEE Transactions on Power Systems, vol. 31, no. 5, pp. 3794–3803, Sep. 2016.
[11] M. Glavic, R. Fonteneau, and D. Ernst, "Reinforcement learning for electric power system decision and control: Past considerations and perspectives," IFAC-PapersOnLine, vol. 50, no. 1, pp. 6918–6927, 2017.
[12] J. G. Vlachogiannis and N. D. Hatziargyriou, "Reinforcement learning for reactive power control," IEEE Transactions on Power Systems, vol. 19, no. 3, pp. 1317–1325, 2004.
[13] Y. Xu, W. Zhang et al., "Multiagent-based reinforcement learning for optimal reactive power dispatch," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1742–1751, 2012.
[14] H. Xu, A. Dominguez-Garcia, and P. W. Sauer, "Optimal tap setting of voltage regulation transformers using batch reinforcement learning," IEEE Transactions on Power Systems, 2019.
[15] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun, "Two-timescale voltage control in distribution grids using deep reinforcement learning," IEEE Transactions on Smart Grid, 2019.
[16] M. L. Littman, "Markov games as a framework for multi-agent reinforcement learning," in Machine Learning Proceedings 1994. Elsevier, 1994, pp. 157–163.
[17] R. Lowe, Y. Wu, A. Tamar et al., "Multi-agent actor-critic for mixed cooperative-competitive environments," in Advances in Neural Information Processing Systems, 2017, pp. 6379–6390.
[18] A. Singhal, V. Ajjarapu, J. Fuller, and J. Hansen, "Real-time local volt/var control under external disturbances with high PV penetration," IEEE Transactions on Smart Grid, vol. 10, no. 4, pp. 3849–3859, 2018.
[19] G. Qu and N. Li, "Optimal distributed feedback voltage control under limited reactive power," IEEE Transactions on Power Systems, vol. 35, no. 1, pp. 315–331, 2019.
[20] D. Silver, G. Lever et al., "Deterministic policy gradient algorithms," in Proceedings of the 31st International Conference on Machine Learning - Volume 32, ser. ICML'14, 2014, pp. I-387–I-395.
[21] T. P. Lillicrap, J. J. Hunt et al., "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[22] A. B. Birchfield, T. Xu et al., "Grid structural characteristics as validation criteria for synthetic networks," IEEE Transactions on Power Systems, vol. 32, no. 4, pp. 3258–3265, 2016.
[23] C. EPRI, "BPA PSD-ST stability program user manual (version 5.2.0)," Beijing, China, 2018.
[24] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," arXiv preprint arXiv:1502.03167, 2015.

Shengyi Wang (S'17) received the B.S. and M.S. degrees in electrical engineering from Shanghai University of Electric Power, China, and Clarkson University, Potsdam, NY, in 2016 and 2017, respectively. He is currently pursuing his Ph.D. degree at Temple University, Philadelphia. He received the Outstanding Undergraduate Thesis Award in 2016. His research interests include game-theoretic control for multi-agent systems, data-driven optimization for power systems, and non-intrusive load monitoring.

Jiajun Duan (S'13-M'18) received his B.S. degree in power system and its automation from Sichuan University, Chengdu, China, in 2013, his M.S. degree in electrical engineering from Lehigh University, Bethlehem, PA, in 2015, and the Ph.D. degree in electrical engineering from Lehigh University in 2018. Currently, he is a research scientist at GEIRI North America, San Jose, CA, USA. His research interests include artificial intelligence, power systems, power electronics, control systems, and machine learning.

Di Shi (M'12-SM'17) received the B.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, China, in 2007, and the M.S. and Ph.D. degrees in electrical engineering from Arizona State University, Tempe, AZ, USA, in 2009 and 2012, respectively. He currently leads the AI & System Analytics Group at GEIRI North America, San Jose, CA, USA. His research interests include WAMS, energy storage systems, and renewable integration. He is an Editor of IEEE Transactions on Smart Grid.

Chunlei Xu received the B.S. degree in electrical engineering from Shanghai Jiao Tong University, Shanghai, China, in 1999. He currently leads the Dispatching Automation Department at Jiangsu Electrical Power Company in China. His research interests include power system operation and control and WAMS.

Haifeng Li (M'19) received his Ph.D. degree from Southeast University, China, in 2001. He is with SGCC Jiangsu Electric Power Company as deputy director of System Operations. His research interests are power system operation, stability analysis and control, and AI applications in power grids. He is the recipient of the 2018 Chinese National Science and Technology Progress Award, 1st prize.

Ruisheng Diao (M'09-SM'15) obtained his Ph.D. degree in EE from Arizona State University, Tempe, AZ, in 2009. Dr. Diao has been managing and supporting a portfolio of research projects in the areas of power system modeling, dynamic simulation, online security assessment and control, and HPC implementation in power grid studies. He is now with GEIRINA as deputy department head, AI & System Analytics, in charge of several R&D projects on power grid high-fidelity simulation tools and developing new AI methods for grid operations.

Zhiwei Wang (M'16-SM'18) received the B.S. and M.S. degrees in electrical engineering from Southeast University, Nanjing, China, in 1988 and 1991, respectively. He is President of GEIRI North America, San Jose, CA, USA. Prior to this assignment, he served as President of the State Grid US Representative Office, New York City, from 2013 to 2015, and President of State Grid Wuxi Electric Power Supply Company from 2012 to 2013. His research interests include power system operation and control, relay protection, power system planning, and WAMS.