



On games and simulators as a platform for development of artificial intelligence for command and control

The Journal of Defense Modeling and Simulation XX(X):1–11. ©The Author(s) 2021. Reprints and permission: sagepub.co.uk/journalsPermissions.nav. DOI: 10.1177/ToBeAssigned. www.sagepub.com/


Vinicius G. Goecks1, Nicholas Waytowich1, Derrik E. Asher1, Song Jun Park1, Mark Mittrick1, John Richardson1, Manuel Vindiola1, Anne Logie1, Mark Dennison2, Theron Trout1, Priya Narayanan1, and Alexander Kott1

Abstract
Games and simulators can be a valuable platform to execute complex multi-agent, multiplayer, imperfect information scenarios with significant parallels to military applications: multiple participants manage resources and make decisions that command assets to secure specific areas of a map or neutralize opposing forces. These characteristics have attracted the artificial intelligence (AI) community by supporting development of algorithms with complex benchmarks and the capability to rapidly iterate over new ideas. The success of artificial intelligence algorithms in real-time strategy games such as StarCraft II has also attracted the attention of the military research community aiming to explore similar techniques in military counterpart scenarios. Aiming to bridge the connection between games and military applications, this work discusses past and current efforts on how games and simulators, together with artificial intelligence algorithms, have been adapted to simulate certain aspects of military missions and how they might impact the future battlefield. This paper also investigates how advances in virtual reality and visual augmentation systems open new possibilities in human interfaces with gaming platforms and their military parallels.

Keywords
Artificial intelligence, Reinforcement learning, War gaming, Command and control, Human-computer interface, Future battlefield

Introduction

In warfare, the ability to anticipate the consequences of the opponent's possible actions and counteractions is an essential skill1. This becomes even more challenging in the future battlefield and multi-domain operations, where the speed and complexity of operations are likely to exceed the cognitive abilities of a human command staff in charge of conventional, largely manual, command and control processes. Strategy games such as chess, poker, and Go have abstracted some of the command and control (C2) concepts. While wargames are often manual games played with physical maps, markers, and spreadsheets to compute battle outcomes2, computer games and modern game engines such as Unreal Engine1 and Unity2 are capable of automating the simulation of complex battles and the related physics.

Computer games and game engines are not only suitable for simulating military scenarios but also provide a useful test-bed for developing AI algorithms. Games such as StarCraft II3–9, Dota 210, Atari11–13, Go14–16, chess17,18, and heads-up no-limit poker19 have all been used as platforms for training artificially intelligent agents. Another capability developed by the gaming industry that is relevant to the military domain is virtual reality and visual augmentation systems, which may potentially provide commanders and staff a more intuitive and information-rich display of the battlefield.

Recently, there has been a major effort towards leveraging the capabilities of computer games and simulation engines due to their suitability for integrating AI algorithms, some examples of which we discuss shortly. The work described in this paper focuses on adapting imperfect information real-time strategy (RTS) games, and their corresponding simulators, to emulate military assets in battlefield scenarios. Further, this work aims to advance the research and development efforts associated with AI algorithms for command and control. The flexibility of these games and game engines allows for simulations of a rather broad range of terrains and assets. In addition, most games and simulators allow battlefield scenarios to be played out at a "faster than real-time" speed, which is a highly desirable feature for the rapid development and testing of data-driven approaches.

The paper is organized as follows. In the Background and Related Work section, we offer motivation for research on AI-based tools for Command and Control (C2) decision-making support, and describe a number of prior related efforts to develop approaches to such tools. In particular, we offer arguments in support of using machine learning that can leverage data produced by simulation and wargaming runs to achieve affordable AI solutions.

1 DEVCOM Army Research Laboratory, USA
2 DEVCOM Army Research Laboratory West, USA

Corresponding author: Vinicius G. Goecks, DEVCOM Army Research Laboratory, Human Research and Engineering Directorate, Aberdeen Proving Ground, MD, USA. Email: [email protected]




In the following section, we briefly outline the project that yielded the results described in the paper, and explain how critical it is for such research to find or adapt a game or simulation with an appropriate set of features. Then, we describe in detail two case studies. In the first, we describe how we adapted a fast and popular gaming system to approximate a partly realistic military operation and how we used it in conjunction with a reinforcement learning algorithm to train an "artificial commander" (a trained agent) that commands a blue force to fight an opposing enemy force (red force). In the second case study, we describe a similar exploration using a realistic military simulation system. Then we discuss approaches to overcoming common challenges that are likely to arise in transitioning such future technologies to the real-world battlefield, and also describe our key findings.

The key contributions of the research described in the paper are two-fold. First, we explore and describe two case studies on modeling a partly realistic mission scenario and training an artificial commander leveraging existing (with some adaptations) off-the-shelf deep reinforcement learning algorithms. We show that such trained algorithms can significantly outperform both human and doctrine-based baselines without using any pre-coded expert knowledge, and learn effective behaviors entirely from experience. Second, we formulate and empirically confirm a set of features and requirements that a game or simulation engine should meet in order to serve as a viable platform for research and development of reinforcement learning tools for C2. In effect, these are initial recommendations for the researchers who build related experimental and developmental systems.

Background and Related Work

We start by defining important terms used throughout this research work. Wargame is defined in this work as a largely manual strategy game that uses rules, data, and procedures to simulate an armed conflict between two or more opposing sides1,20,21, and is used to train military officers and plan military courses of action. This is different from games, used here to mean fully automated, recreational computer applications with well-defined rules and score systems that use a simulation of an armed conflict as a form of entertainment. Similarly, simulators are used here as a hybrid form of wargames and games. Simulators are fully automated computer applications that aim to realistically simulate the outcome of military battles and courses of action. They are not designed as an entertainment platform but as a tool to aid military planning.

Concerning command and control (C2), we use the same definition as the command and control warfighting function defined in Army Doctrine Publication No. 3-0, Operations22: the "related tasks and a system that enable commanders to synchronize and converge all elements of combat power". Similarly, AI support to C2 refers to AI systems developed to aid human commanders by providing additional information or recommendations in the command and control process.

The past few decades have seen a number of ideas and corresponding research towards developing automated or semi-automated tools that could support decision-making in planning and executing military operations. DARPA's JFACC program took place in the late 1990s23 and developed a number of concepts and prototypes for agile management of a joint air battle. Most of the approaches considered at that time involved continuous real-time optimization and re-optimization (as the situation continually changes) of routes and activities of various air assets. Also in the mid-to-late 1990s, the Army funded the CADET project24, which explored the potential utility of classical hierarchical planning, adapted for adversarial environments, for transforming a high-level battle sketch into a detailed synchronization matrix, a key product of the doctrinal Military Decision-Making Process (MDMP). In the early 2000s, DARPA initiated the RAID project25, which explored a number of technologies for anticipating enemy battle plans, as well as dynamically proposing friendly tactical actions. At the time, game-solving algorithms emerged as the most successful among the technological approaches explored.

The role of multiple domains and their extremely complex interactions, beyond the traditional kinetic fights to include political, economic, and social effects, was explored in the late 2000s in DARPA's COMPOEX program26. This program investigated the use of interconnected simulation sub-models, mostly system-dynamic models, in order to assist senior military and civilian leaders in planning and executing large-scale campaigns in complex operational environments. The importance of non-traditional warfighting domains such as the cyber domain was recognized and studied in the mid-2010s by a NATO research group27 that looked into simulation approaches to assessing mission impacts of cyberattacks and highlighted strong non-linear effects of interactions between the cyber, human, and traditional physical domains.

All approaches taken in the research efforts mentioned above, and many other similar ones, have major and somewhat common weaknesses. They tend to require a rigid, precise formulation of the problem domain. Once such a formulation is constructed, they tend to produce effective results. However, as soon as a new element needs to be incorporated into the formulation (e.g., a new type of military asset or a new tactic), a difficult, expensive, manual, and long effort is required to "rewire" the problem formulation and to fine-tune the solution mechanism. And the real world endlessly presents a stream of new elements and entities that must be taken into account. In rule-based systems of the 1980s, a system would become unmaintainable as more and more rules (with often unpredictable interactions) had to be added to represent the real-world intricacies of a domain. Similarly, in optimization-based approaches, an endless number of relations between significant variables, and a variety of constraints, had to be continually and manually added (a maintenance nightmare) to represent the real-world intricacies of a domain. In game-based approaches, the rules governing legal moves and the effects of moves for each piece would gradually become hopelessly convoluted as more and more realities of a domain had to be manually contrived and added to the game formulation.

In short, such approaches are costly in their representation-building and maintenance. Ideally, we would like to see a system that learns its problem formulation and solution algorithm directly from its experiences in a real or simulated world, without any (or with little) manual programming. Machine learning, particularly reinforcement learning, offers that promise.

In contrast, the applicability of machine learning to problems of real-world C2 remains a matter of debate and exploration. Walsh et al.28,29 investigated how the military can use the capacity for dealing with large volumes of data and the greater decision speed of machine learning algorithms for military planning and command and control. Walsh et al.28 analyzed and rated the characteristics of ten games, such as Go, Bridge, and StarCraft II, and ten command and control processes, such as intelligence preparation of the battlefield, operational assessment, and troop leading procedures. Examples of such characteristics are the operational tempo, rate of environment change, problem complexity, data availability, stochasticity of action outcomes, and others. Their main conclusion related to using games as a platform for military applications was that real-world tasks are very different from many of the games and environments used to develop and demonstrate artificial intelligence systems, mostly because the games have fixed and well-defined rules that are regularly exploited by AI agents.

Related to these games, StarCraft II (SC2) is a real-time strategy game where players compete for map domination and resources. The players need to manage resources to expand their bases, build more units, upgrade them, and coordinate tens of units at the same time to defeat their opponent. Vinyals et al.5 presented AlphaStar, the first AI agent to learn how to play the full game of SC2 at the highest competitive level. AlphaStar played under professionally approved conditions, with the same kinds of constraints that human players face. The AI agent was able to achieve this success using a combination of self-play via reinforcement learning, multi-agent learning, and imitation learning from hundreds of thousands of expert replays. Sun et al.4 proposed the TStarBot AI agent, a combination of reinforcement learning and encoded knowledge of the game, which became the first learning agent to defeat all levels of the built-in rule-based AI in the StarCraft II game. Han et al.9 extended that work to TStarBot-X with a complete analysis of the learned agent trained under a limited scale of computation resources. Also using limited computational resources, Wang et al.8 presented StarCraft Commander, an AI system that also achieves the highest competitive level but uses a model with a third of the parameters used by AlphaStar and about one-sixth of the training time.

OpenAI Five10 was the first AI system to defeat the world champions at the Dota 2 esport, a partially-observable five-versus-five real-time strategy game with high-dimensional observation and action spaces. In the Dota 2 game, both teams compete for control of strategic locations on the map and to defend their bases at the corners of the map. Each agent controls a special hero unit with unique abilities that are used when teams engage in conflict against each other and against non-player-controlled units. OpenAI was able to solve this complex task by scaling existing reinforcement learning algorithms, in this case Proximal Policy Optimization (PPO)30 with Generalized Advantage Estimation (GAE)31, to train on thousands of GPUs over ten months with a specialized neural-network architecture to handle the long time horizons of the game.

Boron and Darken32 investigated the use of deep reinforcement learning algorithms to solve simulated, perfect information, turn-based, multi-agent, small tactical engagement scenarios based on military doctrine. The grid-world environment emulated an offensive scenario where the homogeneous units controlled by the reinforcement learning agent could move and attack, the defender was static, and combat was resolved using both deterministic and stochastic Lanchester combat models33–35. One of the main results was how the learned behavior could be controlled to follow different principles of war, such as mass or force36, based on the discount factor for the rewards.
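For reference, the deterministic "aimed-fire" (square-law) Lanchester model mentioned above can be written as a pair of coupled differential equations; this is the textbook form from the cited literature33–35, not necessarily the exact variant used by Boron and Darken:

$$\frac{dB}{dt} = -\rho R, \qquad \frac{dR}{dt} = -\beta B,$$

where $B$ and $R$ are the Blue and Red force strengths and $\rho$, $\beta$ are the per-unit attrition rates. The quantity $\beta B^2 - \rho R^2$ is conserved along trajectories, which is why concentrating force is advantageous under this model.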

Asher et al.37 proposed the adoption of militarily relevant tactics to a simulated predator-prey pursuit task that allowed for teams of deep reinforcement learning agents to engage in various tactics through capability modifications such as decoys, traps, and camouflage. Further, this research group has introduced methods for measuring coordination38–40 towards a framework for integrating AI agents into mixed Soldier-agent teams41,42.

With respect to simulators built using existing game engines, Fu et al.43 applied deep reinforcement learning to the command and control of air defense operations. A digital battlefield environment based on Unreal Engine was developed for the reinforcement learning training process. The training setup consisted of static and random opponent strategies, where the attack route and formation of units were either fixed or random. For evaluation, win rate, battle damage statistics, and combat details were compared against human experts. Their experimental results showed that the deep reinforcement learning agent achieved a higher win rate than the human experts in both the fixed and random opponent scenarios. Conversely, in this work, the authors propose a real-time decision-making and planning system that is designed using the reinforcement learning formalism and can operate in tandem with the commander. The Unity game engine was used as the base for a multi-agent robot and sensor simulation for military applications44 and as a quadrotor simulator45, while other simulators, such as Microsoft AirSim46, use Unreal Engine to simulate interactions between multiple robotic agents and the environment, being flexible enough to allow the implementation of military-relevant tasks, for example an autonomous landing task where a drone landed on top of a combat vehicle47.

With respect to military simulators, the AlphaDogfight Trial was a DARPA-sponsored competition48 in which an AI controlled a simulated F-16 fighter jet in aerial combat against an Air Force pilot flying in a virtual reality simulator. The goal of this program was to automate air-to-air combat. The first-place winner, Heron Systems, designed an F-16 AI which outperformed seven other participating companies and defeated an experienced Air Force pilot with a score of 5-049 in a simulated dogfight. The outcome of this competition demonstrated that an AI can provide precise aircraft maneuvering that may surpass human abilities. In addition, this F-16 AI opened possibilities for human-machine teaming such that human pilots can address high-level strategies while offloading low-level, tedious tactical tasks to an AI.

Schwartz et al.50 described a wargaming support tool that used a genetic algorithm to recommend modifications to an existing friendly course of action (COA), a sequence of decisions to be taken in a military scenario, framed as a task scheduling problem with multiple tasks and their respective start times. The system integrated user input into the optimization process to constrain which tasks could be modified and their minimum and maximum start times, to set the AI parameters, and to monitor all the recommendations being simulated by the AI. The authors showed that their system was able to generate expert-level recommendations for friendly courses of action and support the military decision-making process (MDMP)51.

Given these positive results on wargaming, complex real-time strategy and multi-agent recreational applications, and military simulators, extending these algorithms to military command and control applications as a whole is a promising research avenue that we explore in this research work.

Games and Simulations for Deep Reinforcement Learning of C2

In this work, we investigate whether deep reinforcement learning algorithms might support future agile and adaptive command and control (C2) of multi-domain forces that would enable the commander and staff to rapidly and effectively exploit fleeting windows of superiority.

To train reinforcement learning agents for a command and control scenario, a fast-running simulator with the appropriate interfaces is needed to enable learning algorithms to run for millions of simulation steps, as is often required by state-of-the-art deep reinforcement learning algorithms52,53. Much of what motivated us to adapt a game to the study of AI in C2 applications was the difficulty of finding a simulator that had all the required features for developing machine learning algorithms for C2. These features, illustrated by the interface sketch following the list, include but are not limited to being able to:

• interact with the simulator, also known as the environment, via an application programming interface (API), which includes the ability to query environment states and send actions computed by the agent;

• simulate interactions between agent and environment faster than real-time;

• parallelize simulations for distributed training of the machine learning agent, ideally over multiple nodes;

• randomize environment conditions during training to learn more robust and generalizable reinforcement learning policies; and

• emulate realistic combat characteristics such as terrain, mobility, visibility, communications, weapon range, damage, rate of fire, and other factors, in order to mimic real-world military scenarios.
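To make these requirements concrete, the following is a minimal sketch of a Gym-style environment wrapper of the kind implied by the list above. It is illustrative only: the class name C2ScenarioEnv, the observation and action shapes, and the speedup and randomization parameters are hypothetical placeholders rather than part of any of the simulators discussed in this paper.

```python
import numpy as np
import gym
from gym import spaces


class C2ScenarioEnv(gym.Env):
    """Hypothetical C2 environment illustrating the desired simulator features."""

    def __init__(self, scenario_config, speedup=100, randomize=True):
        # Requirement 1: programmatic (API) access to states and actions.
        self.observation_space = spaces.Box(-np.inf, np.inf, shape=(17,), dtype=np.float32)
        self.action_space = spaces.Discrete(12)
        # Requirement 2: run the underlying simulation faster than real time.
        self.speedup = speedup
        # Requirement 4: randomize terrain/assets each episode for robust policies.
        self.randomize = randomize
        self.scenario_config = scenario_config

    def reset(self):
        config = self._randomized(self.scenario_config) if self.randomize else self.scenario_config
        # ... (re)initialize terrain, assets, and fog-of-war from `config` ...
        return np.zeros(self.observation_space.shape, dtype=np.float32)

    def step(self, action):
        # Requirement 5: the simulator resolves mobility, visibility, weapon range,
        # rate of fire, and damage when advancing the battle by one decision step.
        obs = np.zeros(self.observation_space.shape, dtype=np.float32)
        reward, done, info = 0.0, False, {}
        return obs, reward, done, info

    def _randomized(self, config):
        # Placeholder for terrain/asset/location randomization.
        return config
```

Requirement 3 (parallelization) is typically handled outside the environment itself, for example by launching many copies of such an environment across worker processes or compute nodes.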

While we were unable to find a simulator that met all these requirements out of the box, game-based simulators, such as the StarCraft II Learning Environment (SC2LE)3, were able to be adapted for the purpose of exploring the use of artificial intelligence algorithms in potential military C2 applications. The next sections describe how we adapted SC2LE and the military simulator OpSim, respectively, to be used as our main platforms to develop an artificial commander for command and control tasks using reinforcement learning.

Case Study: StarCraft II for Military Applications

As part of this project, we developed a research prototype command and control (C2) simulation and experimentation capability that included simulated battlespaces using the StarCraft II Learning Environment (SC2LE)3 with interfaces to deep reinforcement learning algorithms via RLlib54, a library that provides scalable software primitives for reinforcement learning on a high performance computing system.

StarCraft II is a complex real-time strategy game in which players balance high-level economic decisions with low-level individual control of potentially hundreds of units in order to overpower and defeat an opponent force. StarCraft II poses a number of difficult challenges for artificial intelligence algorithms that make it a suitable simulation environment for wargaming, command and control, and other military applications. For example, the game has complex state and action spaces, can last tens of thousands of time-steps, can have thousands of actions selected in real time, and can capture uncertainty due to partial observability, or "fog-of-war". Further, the game has heterogeneous assets, an inherent control architecture that in some elements resembles military command and control, embedded objectives that have an adversarial nature, and a shallow learning curve for implementation and modification compared to more robust simulations.

DeepMind's SC2LE framework3 exposes Blizzard Entertainment's StarCraft II Machine Learning API as a reinforcement learning environment. This tool provides access to StarCraft II, its associated map editor, and an interface for RL agents to interact with StarCraft II, getting observations and sending actions.
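As an illustration of this interface, the sketch below steps a PySC2 environment with a placeholder policy. The map name "TigerClaw" is hypothetical here (a custom map built with the editor would have to be registered with PySC2's map list first); the remaining calls follow the standard pysc2 API, although exact argument names can vary across pysc2 versions.

```python
from pysc2.env import sc2_env
from pysc2.lib import actions, features

# Assumed: a custom "TigerClaw" melee map registered with pysc2.
env = sc2_env.SC2Env(
    map_name="TigerClaw",
    players=[sc2_env.Agent(sc2_env.Race.terran),
             sc2_env.Bot(sc2_env.Race.terran, sc2_env.Difficulty.easy)],
    agent_interface_format=features.AgentInterfaceFormat(
        feature_dimensions=features.Dimensions(screen=64, minimap=64)),
    step_mul=8,            # game steps per agent decision, to speed up interaction
    visualize=False)

timesteps = env.reset()
while not timesteps[0].last():
    obs = timesteps[0].observation      # screen/minimap feature layers + non-spatial data
    action = actions.FUNCTIONS.no_op()  # a learning agent would select an action here
    timesteps = env.step([action])
env.close()
```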

Using the StarCraft II Editor, we implemented TigerClaw, a brigade-scale offensive operation scenario55, to generate a tactical combat scenario within the StarCraft II simulation environment, as seen in Fig. 1. The game was militarized by re-skinning the icons to incorporate MIL-STD-2525C military symbology56 and by setting unit parameters (weapons, range, scaling) associated with the TigerClaw scenario, as seen in Fig. 2.

Figure 1. Illustration of areas of operations in the TigerClaw scenario.

In TigerClaw, the Blue Force's goal is to cross the wadi terrain, neutralize the Red Force, and control certain geographic locations. These objectives are encoded in the game score for use by reinforcement learning agents and serve as a benchmarking baseline for comparison across different neural network architectures and reward-driving attributes.


Figure 2. StarCraft II using MIL-STD-2525 symbols.

The following sections describe the process we used to adapt the map, units, and rewards to this military scenario.

Map Development We created a new Melee Map for the TigerClaw scenario using the StarCraft II Editor. The map size was the largest available, 256 by 256 tiles, using the StarCraft II coordinate system. A wasteland tile set was used as the default surface of the map since it visually resembled a desert region in the TigerClaw area of operations, as seen in Fig. 3. After the initial setup, we used the Terrain tools to modify the map to loosely approximate the area of operation. The key terrain feature was the impassable wadi with limited crossing points.

Figure 3. Modified StarCraft II map (left panel) made to resemble the real-world area of operation (right panel).

Distance scaling was an important factor for the scenario creation. In the initial map, we used the known distance between landmarks to translate StarCraft II distance, expressed in its internal coordinate system, into kilometers and latitude and longitude. This translation is important for adjusting weapon ranges during unit modification and for ensuring compatibility with other internal visualization tools that expect geographic coordinates as input.
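The translation described above reduces to a linear scaling anchored at known landmarks. The sketch below illustrates one way to express it; the anchor coordinates and scale factors are hypothetical placeholders, not values from the actual TigerClaw map.

```python
import math

# Hypothetical anchors: the same landmark expressed in both coordinate systems.
ANCHOR_GAME = (34.0, 120.0)   # StarCraft II map coordinates (tiles)
ANCHOR_GEO = (31.20, 47.55)   # latitude, longitude of the same landmark
KM_PER_TILE = 0.25            # derived from the known distance between two landmarks
KM_PER_DEG_LAT = 111.32       # approximate kilometers per degree of latitude


def game_to_geo(x, y):
    """Convert StarCraft II map coordinates to approximate (latitude, longitude)."""
    dx_km = (x - ANCHOR_GAME[0]) * KM_PER_TILE
    dy_km = (y - ANCHOR_GAME[1]) * KM_PER_TILE
    lat = ANCHOR_GEO[0] + dy_km / KM_PER_DEG_LAT
    lon = ANCHOR_GEO[1] + dx_km / (KM_PER_DEG_LAT * math.cos(math.radians(ANCHOR_GEO[0])))
    return lat, lon


def km_to_game_distance(range_km):
    """Scale a real-world weapon range in kilometers to StarCraft II distance units."""
    return range_km / KM_PER_TILE
```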

Playable Units Modification To simulate the TigerClaw scenario, we selected StarCraft II "fantastic" units that could be made to approximate the capabilities of realistic military units, even if crudely. The StarCraft II units were first duplicated and their attributes modified in the Editor to support the scenario. First, we modified the appearance of the units and replaced it with an appropriate MIL-STD-2525C symbol, as seen in Table 1.

Other attributes modified for the scenario included weapon range, weapon damage, unit speed, and unit health (how much damage a unit can sustain). Weapon ranges were discerned from open-source materials and scaled to the map dimensions.

Table 1. Mapping of TigerClaw units to StarCraft II units.

TigerClaw Unit          StarCraft II Unit
Armor                   Siege Tank (tank mode)
Mechanized Infantry     Hellion
Mortar                  Marauder
Aviation                Banshee
Artillery               Siege Tank (siege mode)
Anti-Armor              Reaper
Infantry                Marine

Unit speed was established in the TigerClaw operations order and fixed at that value. The attributes for damage and health were estimated, with the guiding principle of maintaining a conflict that challenges both sides. Each StarCraft II unit usually had only one weapon, making it challenging to simulate the variety of armaments available to a company-size unit. Additional effort to increase the accuracy of unit modifications will require wargaming subject matter experts.

Additionally, the Blue Force units were modified so that they would not engage offensively or defensively unless specifically commanded by the player or learning agent in control. To control the Red Force, we used two different strategies. The first strategy was to include a scripted course of action, a set of high-level actions to take, for Red Force movements that is executed in every simulation; the units' default aggressiveness attributes controlled how they engaged the Blue Force. The second strategy was to let a StarCraft II bot AI control the Red Force and execute an all-out attack, or "suicide" as it is termed in the Editor. The built-in StarCraft II bot has several difficulty levels (1–10) that dictate its proficiency: level 1 is a fairly rudimentary bot that can be easily defeated, while level 10 is a very sophisticated bot that uses information not available to players (i.e., a cheating bot). Finally, environmental factors such as fog-of-war were toggled across experiments to investigate their impact.

Game Score and Reward Implementation The reward function is an important component of reinforcement learning: it controls how the agent reacts to environmental changes by giving it a positive or negative reward in each situation. We incorporated the reward function for the TigerClaw scenario in StarCraft II, and our implementation overrode the internal game scoring system. The default scoring system in StarCraft II rewarded players for the resource value of their units and structures. Our new scoring system focused on gaining and occupying new territory as well as destroying the enemy.

Our reward function awarded +10 points for the Blue Force crossing the wadi (river) and −10 points for retreating back. In addition, we awarded +10 points for destroying a Red Force unit and −10 points if a Blue Force unit was destroyed. In order to implement the reward function, it was necessary to first use the StarCraft II map editor to define the various regions and objectives of the map. Regions are areas, defined by the user, which are utilized by internal map triggers that compute the game score, as seen in Fig. 4.
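Inside the game these rewards are computed by map-editor triggers, but the same logic is easy to restate in code. The sketch below is an illustrative Python equivalent of the scoring rules above; the state fields (blue_across_wadi, red_casualties, blue_casualties) are hypothetical helper names, not part of the SC2LE interface.

```python
# Reward weights from the TigerClaw scoring rules described above.
WADI_CROSS, WADI_RETREAT = +10, -10
RED_DESTROYED, BLUE_DESTROYED = +10, -10


def tigerclaw_reward(prev_state, state):
    """Illustrative restatement of the trigger-based TigerClaw game score."""
    reward = 0
    # Blue units crossing the wadi earn points; retreating back across it loses points.
    reward += WADI_CROSS * max(0, state.blue_across_wadi - prev_state.blue_across_wadi)
    reward += WADI_RETREAT * max(0, prev_state.blue_across_wadi - state.blue_across_wadi)
    # Attrition terms: reward destroying Red units, penalize losing Blue units.
    reward += RED_DESTROYED * (state.red_casualties - prev_state.red_casualties)
    reward += BLUE_DESTROYED * (state.blue_casualties - prev_state.blue_casualties)
    return reward
```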


Figure 4. Regions and objectives used as part of the reward function in the custom TigerClaw scenario.

Additional reward components can also be integrated based on the Commander's Intent in the TigerClaw, or other scenario, warning orders. Ideally, the reward function will train the agent to produce optimal behavior that is perceived as reasonable by a military subject matter expert.

Deep Reinforcement Learning Results Using the custom TigerClaw map, units, and reward function, we trained a multi-input and multi-output deep reinforcement learning agent adapted from Waytowich et al. (2019)7. The RL agent was trained using the Asynchronous Advantage Actor Critic (A3C) algorithm57. In this tactical version of the StarCraft II mini-game, as shown in Fig. 5, the state-space consists of 7 mini-map feature layers of size 64x64 and 13 screen feature layers of size 64x64, for a total of 20 64x64 2D images. It also includes 13 non-spatial features containing information such as player resources and build queues. The mini-map and screen features were processed by identical 2-layer convolutional neural networks (top two rows) in order to extract visual feature representations of the global and local states of the map, respectively. The non-spatial features were processed through a fully-connected layer with a non-linear activation. These three outputs were then concatenated to form the full state-space representation for the agent.

Figure 5. Input processing for the custom TigerClaw scenario.

The actions in StarCraft II are compound actions in the form of functions that require arguments and specifications about where the action is intended to take place on the screen. For example, an action such as "attack" is represented as a function that requires the x-y attack location on the screen. The action space consists of the action identifier (i.e., which action to run) and two spatial actions (x and y) that are represented as two vectors of length 64 with real-valued entries between 0 and 1.

The architecture of the A3C agent we use is similar to the Atari-net agent12, which is an A3C agent adapted from Atari to operate on the StarCraft II state and action spaces. We make one slight modification to this agent and add a long short-term memory (LSTM) layer58, which adds memory to the model and improves performance57. The complete architecture of our A3C agent is shown in Fig. 6.
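A condensed PyTorch sketch of this architecture is given below. It follows the description above: two identical two-layer CNNs for the 64x64 minimap and screen stacks, a fully connected layer for the non-spatial features, an LSTM, and separate heads for the action identifier, the x and y spatial arguments, and the value function. The layer widths and the number of action functions are illustrative guesses, not the exact values used in our experiments.

```python
import torch
import torch.nn as nn


class AtariNetLSTM(nn.Module):
    """Illustrative Atari-net-style A3C network with an added LSTM layer."""

    def __init__(self, minimap_ch=7, screen_ch=13, nonspatial_dim=13,
                 n_actions=549, hidden=256):
        super().__init__()

        def conv_branch(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
                nn.Flatten())

        self.minimap_cnn = conv_branch(minimap_ch)   # 7 x 64 x 64 minimap layers
        self.screen_cnn = conv_branch(screen_ch)     # 13 x 64 x 64 screen layers
        self.nonspatial_fc = nn.Sequential(nn.Linear(nonspatial_dim, 64), nn.Tanh())
        conv_out = 32 * 6 * 6                        # flattened size of each conv branch
        self.lstm = nn.LSTM(2 * conv_out + 64, hidden, batch_first=True)
        self.action_id_head = nn.Linear(hidden, n_actions)  # which action function to run
        self.x_head = nn.Linear(hidden, 64)                  # spatial x argument (64 bins)
        self.y_head = nn.Linear(hidden, 64)                  # spatial y argument (64 bins)
        self.value_head = nn.Linear(hidden, 1)                # state-value baseline for A3C

    def forward(self, minimap, screen, nonspatial, lstm_state):
        feats = torch.cat([self.minimap_cnn(minimap),
                           self.screen_cnn(screen),
                           self.nonspatial_fc(nonspatial)], dim=-1)
        out, lstm_state = self.lstm(feats.unsqueeze(1), lstm_state)
        h = out.squeeze(1)
        return (self.action_id_head(h), self.x_head(h), self.y_head(h),
                self.value_head(h), lstm_state)
```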

Figure 6. Schematic diagram of the full A3C reinforcement learning agent and its connection to the StarCraft II environment representing the TigerClaw scenario.

The A3C models were trained with 20 parallel actor-learners using 8,000 simulated battles against a built-in StarCraft II bot operating on hand-crafted rules. Each trained model was tested on 100 rollouts of the agent on the TigerClaw scenario. The models were compared against a random baseline with randomized actions as well as a human player playing 10 simulated battles against the StarCraft II bot. Fig. 7 shows plots of the total episode reward and the number of Blue Force casualties during the evaluation rollouts. We see that the AI commander has not only achieved performance comparable to a human player, but has performed slightly better at the task while also reducing Blue Force casualties.
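A training run of this kind can be launched through RLlib (introduced earlier) with its tune API; a minimal sketch is shown below. The environment name TigerClawSC2Env and module tigerclaw_env are hypothetical stand-ins for a Gym-compatible wrapper around the TigerClaw scenario, the hyperparameters are placeholders, and the configuration keys follow the RLlib 1.x-era API, which differs in newer releases.

```python
import ray
from ray import tune
from ray.tune.registry import register_env

from tigerclaw_env import TigerClawSC2Env  # hypothetical Gym wrapper around SC2LE

register_env("tigerclaw", lambda cfg: TigerClawSC2Env(cfg))

ray.init()
tune.run(
    "A3C",
    stop={"timesteps_total": 40_000_000},  # tens of millions of steps, as noted above
    config={
        "env": "tigerclaw",
        "num_workers": 20,                 # 20 parallel actor-learners, as in the text
        "framework": "torch",
        "gamma": 0.99,
        "lr": 1e-4,
    },
)
```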

Case Study: Reinforcement Learning using OpSim, a Military Simulator

The TigerClaw scenario, as described in the previous case study, was also implemented in the OpSim59 simulator. OpSim is a decision support tool developed by Cole Engineering Services Inc. (CESI) that provides planning support, mission rehearsal, embedded training, and mission execution monitoring and re-planning.


Figure 7. Total reward and Blue Force casualties of the trained AI commander (A3C agent) compared to human and random agent baselines. The AI commander is able to achieve a reward that is comparable to (and slightly better than) the human baseline while taking a reduced number of Blue Force casualties.

OpSim integrates with SitaWare C4I command and control, a critical component of the Command Post Computing Environment3 (CPCE) fielded by PEO C3T that allows all levels of command to have shared situational awareness and coordinate operational actions, thus making OpSim an embedded simulation that connects directly to operational mission command.

OpSim is fundamentally constructed as an extensible Service Oriented Architecture (SOA) based simulation and is capable of running faster than current state-of-the-art simulation environments such as One Semi-Automated Forces4 (OneSAF)60,61 and the MAGTF Tactical Warfare Simulation5 (MTWS)62. OpSim is designed to run much faster than wall-clock time and can run 30 replications of the TigerClaw mission, which would take two hundred and forty hours if run serially in real time. Outputs of a simulation plan in OpSim include an overall ranking of Blue Force plans based on criteria such as ammunition expenditure, casualties, equipment loss, fuel usage, and others. The OpSim tool, however, was not originally designed for AI applications and had to be adapted by incorporating interfaces to run reinforcement learning algorithms.

An OpenAI Gym63 interface was developed to expose the simulation state and offer simulation control to external agents, with the ability to supply actions for select entities within the simulation, as well as the amount of time to simulate before responding back to the interface. The observation space consists of a 17-feature vector and is partially observable based on each entity's equipment sensors. Unlike the StarCraft II environment, our OpSim-based prototype of an artificial C2 agent currently does not use image inputs or spatial features from screen images. The action space primarily consists of simple movements and engagement attacks, as shown in the list and code sketch below:

• Observation space: damage state, x location, y location, equipment loss, weapon range, sensor range, fuel consumed, ammunition consumed, ammunition total, equipment category, maximum speed, perceived opposition entities, goal distance, goal direction, fire support, taking fire, engaging targets.

• Action space: no operation, move forward, move backward, move right, move left, speed up, slow down, orient to goal, halt, fire weapon, call for fire, react to contact.

• Reward function: friendly damaged (−0.5), friendly destroyed (−1.0), enemy damaged (0.5), enemy destroyed (1.0), −0.01 * km from goal destination at every step.
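To make the action and reward definitions above concrete, the sketch below enumerates the twelve discrete actions and restates the reward terms in code. The class and field names are hypothetical stand-ins, not identifiers from OpSim's actual Gym wrapper.

```python
from enum import IntEnum


class OpSimAction(IntEnum):
    """The twelve discrete actions listed above, usable as a Gym Discrete(12) space."""
    NO_OP = 0
    MOVE_FORWARD = 1
    MOVE_BACKWARD = 2
    MOVE_RIGHT = 3
    MOVE_LEFT = 4
    SPEED_UP = 5
    SLOW_DOWN = 6
    ORIENT_TO_GOAL = 7
    HALT = 8
    FIRE_WEAPON = 9
    CALL_FOR_FIRE = 10
    REACT_TO_CONTACT = 11


def opsim_step_reward(events, km_from_goal):
    """Per-step reward restated from the list above (event counts are illustrative)."""
    reward = 0.0
    reward += -0.5 * events["friendly_damaged"]
    reward += -1.0 * events["friendly_destroyed"]
    reward += 0.5 * events["enemy_damaged"]
    reward += 1.0 * events["enemy_destroyed"]
    reward += -0.01 * km_from_goal   # distance-to-goal penalty applied at every step
    return reward
```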

Experimental Results Two types of "artificial commanders" were developed for the OpSim environment. The first is based on an expert-designed rule engine provided as part of OpSim and developed by Military Subject Matter Experts (SMEs) using military doctrinal rules. The second is a reinforcement learning-based long short-term memory (LSTM)58 deep neural network with multiple inputs and outputs, trained with the Advantage Actor Critic (A2C) algorithm57. OpSim's custom Gym interface supports multi-agent training, where each force can use either a rule-based or a learning-based commander.

The policy network was trained on a high performance computing facility with 39 parallel workers collecting 212,000 simulated battles in 44 hours. The trained models were evaluated over 100 rollout simulations using the frozen policy at the checkpoint with the highest mean reward for the Blue Force, in this case a rolling average of 195 for the Blue Force policy mean reward and −317 for the Red Force policy mean reward. Analysis of the 100 rollouts, as seen in Figure 8, shows that the reinforcement learning-based commander reduces Blue Force casualties from about 4 to 0.4 on average when compared to the rule-based commander, and increases Red Force casualties from 5.4 to 8.4 on average in the same comparison. This outcome is reached by employing a strategy of engaging using only the armor and fighting infantry companies. The reinforcement learning-based commander has learned a strategy that utilizes the Blue Force's (BLUFOR) most lethal units, the Abrams and Bradley vehicles, while protecting vulnerable assets from engaging with the Red Force (OPFOR), as seen in a snapshot of the beginning and end of one rollout shown in Figure 9.

Figure 8. Unit casualties comparison between rule-based (expert) and reinforcement learning-based (RL) commanders in the TigerClaw scenario.

Challenges and Future Applications with Virtual Reality

Transition to practice is a common and major challenge. An approach that might reduce the challenges of transitioning AI-based C2 tools, as we are beginning to explore in the case studies described above, is the adaptation of games and simulators with built-in AI capabilities to enable course of action (COA) development and wargaming.


Figure 9. Beginning (left) and end (right) of a reinforcement learning-based commander rollout in OpSim illustrating a learned strategy by the Blue Force to engage only with the most lethal units while protecting vulnerable assets.

This path provides the commander and staff with opportunities to rapidly explore in great detail multiple friendly courses of action (COAs) against multiple enemy COAs, capability or asset allocation and employment, terrain or environmental changes, and various visualizations to better understand how an adversary might respond to selected COAs in a simulated battle. Therefore, the inclusion of AI-enabled games into current military practices might accelerate, supplement, or otherwise enhance current practices in COA development, wargaming, and decision-making.

Another challenge is operator interfaces, which often fail to gain end-user acceptance. To this end, another capability potentially adapted from the gaming industry is head-mounted displays (HMDs), which afford users the ability to interact with content in virtual reality (VR), augmented reality (AR), and mixed reality (MR) settings, collectively XR. Traditionally, these technologies have been used in industry to provide immersive experiences for storytelling, to enable remote real-time collaboration in shared virtual spaces, and to augment natural perception with holographic overlays. For real-world military applications, XR technologies can be used to provide a more intuitive display of multi-domain information, to enable distributed collaboration, and to enhance situational awareness and understanding. While the Army has already invested in XR technologies for training, through the Synthetic Training Environment6, and to enhance lethality, through the Integrated Visual Augmentation System (IVAS)7, it has relied heavily on advances made in the gaming industry to do so. In fact, many military XR projects are developed using either Unity or Unreal, the two most popular game development engines.

For this project, initial integration has begun to pull live data from the StarCraft II simulation into a networked XR environment64. The goal is to allow multiple, decentralized users not only to examine AI-generated courses of action in an immersive platform, but also to interact with the AI and other human decision-makers collaboratively in real time. Prior research65 has shown that understanding of spatial information, like that depicted in a three-dimensional war-game, can be enhanced by visualization in an HMD. Furthermore, rendering the tactical information in a game engine like Unity enables researchers to precisely control all aspects of how the battlefield and the related user interfaces are visualized. Interoperability between this immersive XR interface and other military program-of-record command and control information systems, such as the CPCE discussed in the OpSim case study, is also being investigated to allow explicit comparison between immersive and non-immersive tools.

Findings and Discussion

Our research highlights several important aspects of developing a training system for AI algorithms in military-relevant games and battlefield simulators for command and control.

First, with respect to both the adapted game and the battlefield simulator, we found that an important aspect is simulation speed. Reinforcement learning algorithms are notorious for requiring millions, or even billions52,53, of data samples from the environment to train the best-performing policies, so fast simulation cycles are essential to achieve results in a reasonable time. In our StarCraft II experiments we were able to simulate 8 million training steps, or 12,800 battles, per day with 35 CPU workers, and that was still not fast enough given that our agents required over 40 million training steps to achieve satisfactory results. Given that the simulations run for days at a minimum, another important aspect for a simulator to be considered for this type of research work is scalability. Advances in computing capacity have propelled the field of reinforcement learning in the past decade, and thus a simulator for AI research needs to support distributed computing at scale. Yet another aspect is adaptability: the simulator should give the experiment designer the flexibility to model the desired military task, including terrain, assets, and fog-of-war.

Second, with respect to training reinforcement learning agents in simulators, due to the exploratory nature of these reward-driven algorithms, learning agents are sometimes able to exploit loopholes in the simulator dynamics and learn unrealistic behaviors, usually unintended by the simulator designer, in order to maximize the received reward66. This type of exploitation is often not detected during training and is discovered only when carefully investigating the trained policy behavior after the training session. For example, in our early experiments, we found that the agent was able to exploit a simulator deficiency that allowed armored assets to cross impassable terrain. In this sense, reinforcement learning can be a tool to detect such deficiencies and thereby strengthen simulators.

Third, with respect to the training procedure, it should comprise diverse training data, for instance varied terrain, assets, and locations, so the learning artificial commander is able to develop a more robust and general policy and to address the variability that exists in reality. How diverse these scenarios should be and how much variability they should contain is still an open research question. During training, it also helps to evaluate the learning agent periodically, either against a rule- or doctrine-based commander or against another learning artificial commander. Once training is complete and the agent is evaluated in test cases, we found that visualization tools are essential to uncover whether the artificial commander learned any unintended behavior, although more visualization tools are needed to understand each detail of the commander's actions and plans.

Finally, we found that reinforcement learning has been able, at least in the set of cases we explored, to outperform both human and doctrine-based baselines without access to prior human knowledge of the task, which becomes even more relevant as task complexity increases and human-dictated rules become more difficult to specify. However, there are still open questions on how human-like the learned strategies are, or how human-like they need to be, if at all. Another concern is that the reinforcement-learned policies are willing to sacrifice assets in the battle if that leads to a higher reward value at the end of the scenario, which may not be an acceptable decision for a human commander. We address this issue when handcrafting the reward function by penalizing the agent for assets lost; however, given that the reward function is also composed of additional goals, this solution is non-trivial to balance in practice66.

Conclusions

As artificial intelligence (AI) algorithms have been successfully learning how to solve complex games, their impact on military applications is probably imminent. We envision the need to develop agile and adaptive AI support tools for the future battlefield and multi-domain operation scenarios, under the main assumption that the future flow of information and speed of operation will likely exceed the capabilities of the current human staff if the command and control processes remain largely manual. This includes leveraging AI algorithms to analyze battlefield information from multiple sources, from both red and blue forces, and correctly identify and exploit emerging windows of superiority.

In this research work, we explore two case studies on modeling a partially realistic mission scenario and training artificial commanders that leverage existing (with some adaptations) off-the-shelf deep reinforcement learning algorithms. We show that such trained algorithms can significantly outperform both human and doctrine-based baselines without using any pre-coded expert knowledge, learning effective behaviors entirely from experience. Furthermore, we formulate and empirically confirm a set of features and requirements that a game or simulation engine must meet in order to serve as a viable platform for research and development of reinforcement learning tools for C2.

Acknowledgements

Portions of the Related Work, StarCraft II, and OpSim case studies also appear in the DEVCOM Army Research Laboratory report ARL-TR-9192 55.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research was sponsored by the Army Research Laboratory and was accomplished partly under Cooperative Agreement [W911NF-20-2-0114]. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

Notes

1. Unreal Engine game engine: https://www.unrealengine.com/
2. Unity game engine: https://unity.com/
3. Command Post Computing Environment (CPCE) webpage: https://peoc3t.army.mil/mc/cpce.php
4. One Semi-Automated Forces (OneSAF) webpage: https://www.peostri.army.mil/OneSAF
5. MAGTF Tactical Warfare Simulation (MTWS) webpage: https://coleengineering.com/capabilities/mtws
6. Synthetic Training Environment (STE) webpage: https://asc.army.mil/web/portfolio-item/synthetic-training-environment-ste/
7. Integrated Visual Augmentation System (IVAS) webpage: https://www.peosoldier.army.mil/Program-Offices/Project-Manager-Integrated-Visual-Augmentation-System/

References

1. Caffrey Jr MB. On wargaming. US Naval War College Press.
2. Perla PP. The art of wargaming: A guide for professionals and hobbyists. Naval Institute Press, 1990.
3. Vinyals O, Ewalds T, Bartunov S et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.
4. Sun P, Sun X, Han L et al. TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193, 2018.
5. Vinyals O, Babuschkin I, Czarnecki WM et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 2019; 575(7782): 350–354.
6. Waytowich N, Barton SL, Lawhern V et al. Grounding natural language commands to StarCraft II game states for narration-guided reinforcement learning. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006. International Society for Optics and Photonics, p. 110060S.
7. Waytowich N, Barton SL, Stump E et al. A narration-based reward shaping approach using grounded natural language commands. International Conference on Machine Learning (ICML), Workshop on Imitation, Intent and Interaction, 2019.
8. Wang X, Song J, Qi P et al. SCC: An efficient deep reinforcement learning agent mastering the game of StarCraft II. arXiv preprint arXiv:2012.13169, 2020.
9. Han L, Xiong J, Sun P et al. TStarBot-X: An open-sourced and comprehensive study for efficient league training in StarCraft II full game. arXiv preprint arXiv:2011.13729, 2020.
10. OpenAI, Berner C, Brockman G et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
11. Mnih V, Kavukcuoglu K, Silver D et al. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
12. Mnih V, Kavukcuoglu K, Silver D et al. Human-level control through deep reinforcement learning. Nature 2015; 518(7540): 529–533.
13. Hessel M, Modayil J, Van Hasselt H et al. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.


14. Silver D, Huang A, Maddison CJ et al. Mastering the gameof go with deep neural networks and tree search. nature 2016;529(7587): 484–489.

15. Silver D, Schrittwieser J, Simonyan K et al. Masteringthe game of go without human knowledge. nature 2017;550(7676): 354–359.

16. Schrittwieser J, Antonoglou I, Hubert T et al. Mastering atari,go, chess and shogi by planning with a learned model. Nature2020; 588(7839): 604–609.

17. Campbell M, Hoane Jr AJ and Hsu Fh. Deep blue. Artificialintelligence 2002; 134(1-2): 57–83.

18. Hsu FH. Behind Deep Blue: Building the computer thatdefeated the world chess champion. Princeton University Press,2002.

19. Brown N and Sandholm T. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science 2018;359(6374): 418–424.

20. Gortney WE. Department of defense dictionary of militaryand associated terms. Technical report, Joint Chiefs Of StaffWashington, 2010.

21. Burns S, Della Volpe D, Babb R et al. War gamers handbook:A guide for professional war gamers. Technical report, NavalWar College Newport United States, 2015.

22. Army U. Army doctrine publication no. 3-0: Operations.Washington, DC: Headquarters, Department of the Army 2019;.

23. Kott A et al. Advanced technology concepts for command andcontrol. Xlibris Corporation, 2004.

24. Kott A, Ground L, Budd R et al. Toward practical knowledge-based tools for battle planning and scheduling. In AAAI/IAAI, pp. 894–899.

25. Ownby M and Kott A. Reading the mind of the enemy: Predictive analysis and command effectiveness. Technical report, Solers Inc., Arlington, VA, 2006.

26. Kott A and Corpac PS. COMPOEX technology to assist leaders in planning and executing campaigns in complex operational environments. Technical report, Defense Advanced Research Projects Agency, Arlington, VA, 2007.

27. Kott A, Ludwig J and Lange M. Assessing mission impact of cyberattacks: Toward a model-driven paradigm. IEEE Security & Privacy 2017; 15(5): 65–74.

28. Walsh M, Menthe L, Geist E et al. Exploring the Feasibility and Utility of Machine Learning-Assisted Command and Control, Volume 1. RAND Corporation.

29. Walsh M, Menthe L, Geist E et al. Exploring the Feasibility and Utility of Machine Learning-Assisted Command and Control, Volume 2. RAND Corporation.

30. Schulman J, Wolski F, Dhariwal P et al. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

31. Schulman J, Moritz P, Levine S et al. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

32. Boron J and Darken C. Developing combat behavior through reinforcement learning in wargames and simulations. In 2020 IEEE Conference on Games (CoG). IEEE, pp. 728–731.

33. Lanchester FW. Aircraft in warfare: The dawn of the fourth arm. Constable Limited, 1916.

34. Washburn A. Lanchester systems. Technical report, Naval Postgraduate School, Monterey, CA, 2000.

35. MacKay NJ. Lanchester combat models. arXiv preprint math/0606300, 2006.

36. US Marine Corps. Marine Corps Doctrinal Publication 1-0: Marine Corps Operations, 2011.

37. Asher DE, Zaroukian E and Barton SL. Adapting the predator-prey game theoretic environment to army tactical edge scenarios with computational multiagent systems. arXiv preprint arXiv:1807.05806, 2018.

38. Asher D, Garber-Barron M, Rodriguez S et al. Multi-agent coordination profiles through state space perturbations. In 2019 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE, pp. 249–252.

39. Zaroukian E, Rodriguez SS, Barton SL et al. Algorithmically identifying strategies in multi-agent game-theoretic environments. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006. International Society for Optics and Photonics, p. 1100614.

40. Asher DE, Zaroukian E, Perelman B et al. Multi-agent collaboration with ergodic spatial distributions. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications II, volume 11413. International Society for Optics and Photonics, p. 114131N.

41. Barton SL, Waytowich NR and Asher DE. Coordination-driven learning in multi-agent problem spaces. arXiv preprint arXiv:1809.04918, 2018.

42. Barton SL and Asher D. Reinforcement learning framework for collaborative agents interacting with soldiers in dynamic military contexts. In Next-Generation Analyst VI, volume 10653. International Society for Optics and Photonics, p. 1065303.

43. Fu Q, Fan CL, Song Y et al. Alpha C2: An intelligent air defense commander independent of human decision-making. IEEE Access 2020; 8: 87504–87516. DOI: 10.1109/ACCESS.2020.2993459.

44. King-Doonan EK. ARO year in review 2020. Technical report, Army Research Office, 2020.

45. Song Y, Naji S, Kaufmann E et al. Flightmare: A flexible quadrotor simulator. In Conference on Robot Learning.

46. Shah S, Dey D, Lovett C et al. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics.

47. Goecks VG, Gremillion GM, Lawhern VJ et al. Efficiently combining human demonstrations and interventions for safe training of autonomous systems in real-time. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 2462–2470.

48. AlphaDogfight trials go virtual for final event. https://www.darpa.mil/news-events/2020-08-07. Accessed: 2021-05-17.

49. AlphaDogfight trials foreshadow future of human-machine symbiosis. https://www.darpa.mil/news-events/2020-08-26. Accessed: 2021-05-17.

50. Schwartz PJ, O'Neill DV, Bentz ME et al. AI-enabled wargaming in the military decision making process. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications II, volume 11413. International Society for Optics and Photonics, p. 114130H.

51. Reese PP. Military decisionmaking process: Lessons and best practices. Technical report, Center for Army Lessons Learned, Fort Leavenworth, United States, 2015.

52. Espeholt L, Soyer H, Munos R et al. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. In International Conference on Machine Learning. PMLR, pp. 1407–1416.

53. Kapturowski S, Ostrovski G, Quan J et al. Recurrent experience replay in distributed reinforcement learning. In International Conference on Learning Representations.

54. Liang E, Liaw R, Nishihara R et al. RLlib: Abstractions for distributed reinforcement learning. In Dy J and Krause A (eds.) Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, volume 80. PMLR, pp. 3053–3062.

55. Narayanan P, Vindiola M, Park S et al. First-Year Report of ARL Director's Strategic Initiative (FY20-23): Artificial Intelligence (AI) for Command and Control (C2) of Multi Domain Operations. Technical Report ARL-TR-9192, Adelphi Laboratory Center (MD): DEVCOM Army Research Laboratory (US), 2021.

56. MIL-STD-2525C. Common Warfighting Symbology, 2011.

57. Mnih V, Badia AP, Mirza M et al. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning. PMLR, pp. 1928–1937.

58. Hochreiter S and Schmidhuber J. Long short-term memory. Neural Computation 1997; 9(8): 1735–1780.

59. Surdu JR, Haines GD and Pooch UW. OpSim: A purpose-built distributed simulation for the mission operational environment. Simulation Series 1999; 31: 69–74.

60. Wittman Jr RL and Harrison CT. OneSAF: A product line approach to simulation development. Technical report, MITRE Corp., Orlando, FL, 2001.

61. Parsons D, Surdu J and Jordan B. OneSAF: A next generation simulation modeling the contemporary operating environment. In Proceedings of the Euro Simulation Interoperability Workshop.

62. Blais CL. Marine Air Ground Task Force (MAGTF) Tactical Warfare Simulation (MTWS). In Proceedings of the Winter Simulation Conference. IEEE, pp. 839–844.

63. Brockman G, Cheung V, Pettersson L et al. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

64. Dennison MS and Trout TT. The Accelerated User Reasoning for Operations, Research, and Analysis (AURORA) cross-reality common operating picture. Technical report, Combat Capabilities Development Command, Adelphi, 2020.

65. Skarbez R, Polys NF, Ogle JT et al. Immersive analytics: Theory and research agenda. Frontiers in Robotics and AI 2019; 6: 82. DOI: 10.3389/frobt.2019.00082.

66. Amodei D, Olah C, Steinhardt J et al. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
