
AUTONOMOUS INDUSTRIAL MANAGEMENT VIA REINFORCEMENT LEARNING: SELF-LEARNING AGENTS FOR DECISION-MAKING – A REVIEW

A PREPRINT

Leonardo A. Espinosa Leal∗
Department of Business Management and Analytics
Arcada University of Applied Sciences
Helsinki, Finland

Magnus Westerlund
Department of Business Management and Analytics
Arcada University of Applied Sciences
Helsinki, Finland

Anthony Chapman
Computing Department
University of Aberdeen
Aberdeen, UK

October 22, 2019

ABSTRACT

Industry has always pursued greater economic efficiency, and the current focus is on reducing human labour using modern technologies. Even with cutting-edge technologies, which range from packaging robots to AI for fault detection, there is still some ambiguity about the aims of some new systems, namely, whether they are automated or autonomous. In this paper we draw the distinction between automated and autonomous systems, review the current literature, and identify the core challenges for creating learning mechanisms for autonomous agents. We discuss using different types of extended realities, such as digital twins, to train reinforcement learning agents to learn specific tasks through generalization. Once generalization is achieved, we discuss how these agents can be used to develop self-learning agents. We then introduce self-play scenarios and how they can be used to teach self-learning agents through a supportive environment that focuses on how the agents can adapt to different real-world environments.

Keywords Autonomous systems · reinforcement learning · self-play · digital twin · Industry 4.0

1 Introduction

Recent developments in industry, such as Industry 4.0, have emphasized the need for a comprehensive automation of operational processes [1]. Similar developments in both the automobile [2] and maritime industry [3] have highlighted the need for vehicle/vessel driving automation. In recent media, we often see the term autonomous used as a synonym for automated. Hence, automated driving often becomes autonomous driving, without a deeper reflection on what an autonomous system would entail. Within information systems research we also have a long history of researching autonomous entities, sometimes as an abstract futuristic entity that is often ill-defined in terms of true autonomy, although some efforts have been made [4]. We consider an autonomous entity an evolutionary leap compared to an automated entity. Meanwhile, the automation component is far from solved for many physical environments. On the other hand, for digital environments such as stock exchanges and advertisement auctions, the process has, to a large extent, already been automated, but autonomous behaviour is rarely needed or desired in such environments. To address this gap, we delimit our focus in this paper to one part of an autonomous system, the learning mechanisms

∗Corresponding author: leonardo.espinosaleal@arcada.fi

arXiv:1910.08942v1 [cs.MA] 20 Oct 2019


of an autonomous entity. To achieve this, we first expand on what an autonomous system entails and on relevant areas that may assist in achieving autonomy.

The need for automation has led to a realization that we also need to digitize our industrial environments and processes. These processes include manufacturing, warehousing, building management, or hospital management. Digitization may happen by introducing IoT sensors, communication networks, and cameras into an environment and then digitally capturing the physical reality. Advanced types of automation will often form a model using past events and test the model using more recent events in a controlled environment before testing the model in real scenarios. A common approach has been to use fuzzy logic to describe expert behaviour for an entity or an agent that operates in or oversees the environment [5]. Prediction is sometimes required in automation, and neural networks are a popular way to compare the outcomes of the predictions [6].

Verifying a number of algorithmically derived scenarios in the physical factory may often be expensive, dangerous, and disruptive. Instead, a digital representation of the process/environment, a so-called extended reality, can be used for verification and optimization before ideas/changes are moved into the physical realm. An example of an extended reality is the digital twin technology [7], which can be described in many formats, e.g. process graphs, a 2D flat model, a 3D space, or a 4D space-time environment (which can be seen as stacked 3D models with temporal dependencies). In general, the more realistic the environment and simulation are, the better the final result can be expected to be. Complexity tends to increase with each added dimension, and different strategies must be devised to conquer the different environments.

A state-of-the-art form of digital twin technology is physics-based (it follows physical laws) and allows the agent to interact with the environment, i.e. bi-directional data exchange [8]. Furthermore, the environment may also be modified within certain parameters to allow for an improved process. Hence, both the agent and the extended reality environment are part of the modelling process and may be enhanced in order to improve an outcome. In this article, we focus on a method and experimental study for self-playing agents which can use deep reinforcement learning, in what can best be generalized as an extended reality environment [9]. Our aim is to address the topic of designing self-play scenarios for self-learning agents that combine various sources for discovering and generating training data. By reviewing the literature we noticed a gap which could improve existing architectures used for creating self-learning agents that can be used for enabling autonomous industrial management. [8] also highlight the need for more concrete digital twin case studies. Combining AI research and digital twin technology offers an interesting avenue for further research.

The structure of the paper is as follows. Section 2 reflects on three core terms used throughout the paper. The first subsection provides a discussion on autonomous vs. automated systems, and notes that although media and information systems literature provide a considerable history of autonomous systems, real-world examples of such systems do not yet exist in a strict language or legal sense. Following this, eXtended Reality (XR) is introduced as a self-play environment for teaching models through interactivity. The last subsection discusses the need for data augmentation to bridge the gap between training in extended reality and inference in the physical reality. The third section introduces reinforcement learning and recent advancements. The fourth section presents how algorithms can learn by self-play, i.e. learning from the interaction in XR, and a conceptual architecture. The fifth section contributes a design and proposes applications in some relevant areas. The final section concludes our findings and presents future directions for research.

2 Research methods and concepts

In the first part of this paper, a literature review of the relevant concepts is presented. Through the literature review we provide a foundational understanding of root definitions for the second part, where we provide an analysis of our findings. In this latter part, we follow a soft systems methodology (SSM) by framing a problem formulation (model learning) and an action plan (conceptual model for learning) aimed at future research [10]. [11] state that SSM consists of an analysis of the current status of the system. This includes inherent problems and activities, and a definition of the system that derives the actual goal of the targeted system (“root definition”) in order to propose a conceptual system model.

2.1 Requirements for achieving autonomous systems

Autonomous systems have long been a promise of both academia and industry alike. One of the early academic papers discussing autonomous systems was presented in the 1950s, where the author discusses an autonomous nonlinear oscillation system [12]. Much later, in the 1990s, with the introduction of the personal computer and precursors to the Internet, autonomous agents became a topic of interest. The agent implementations tended to cover relatively narrow and software-localized areas. [13] defines the agent as a persistent software entity dedicated to a specific purpose. [4] reviews the early definitions of autonomous agents, and then summarizes that the autonomous property can be defined as something which exercises control over its own actions. Given the progress since, we would argue that these definitions do not qualify an entity as autonomous, but merely automated.

Figure 1: Agent properties adapted from [4].

The word autonomous, in a language sense, can be described as follows: an autonomous entity is independent and has the freedom to govern itself (adapted from the Cambridge Online Dictionary). Using a strict interpretation, this suggests that truly autonomous systems have yet to come into existence, and that traditional methods have not been able to deliver beyond the automation component. Automation can here be defined as an entity made to operate by the use of machines or computers in order to reduce the work done by humans (adapted from the Cambridge Online Dictionary). This latter definition of automated systems corresponds better to the original definition of autonomous agents, as presented above. An autonomous system should have the freedom to govern itself, and governance implies some form of understanding of the real-world repercussions of one's actions, for example the limits the legal framework provides us with. Additionally, the requirement that an autonomous entity show independence suggests that a mere exploitation of the historical world is not sufficient, but rather that the entity must learn to explore unknown worlds through self-play. Hence, an entity that employs an intelligence derived from a rule-set or a neural network trained on historical data will likely fail the test of true autonomy.

A similar standpoint has been taken by organizations regulating and standardizing the automotive industry [14]. The Society of Automotive Engineers (SAE), through the On-Road Automated Vehicle Standards Committee, is responsible for defining levels of driving automation for cars and provides a classification of levels from 0 to 5. In the SAE classification, the highest level, "Level 5 (steering wheel optional)", is defined as a fully automated system, and not as an autonomous system [15].

There have been several methods proposed for creating agents, e.g. fuzzy logic or neural networks. In [5] the authors use the concept of intelligent agents for describing an important property of the entity. Intelligence, or rather reasoning, allows an agent to react to different defined circumstances or even unforeseen circumstances. Some early considerations realized the challenge of defining a globally validated view of a contained entity with so-called artificial intelligence, and instead highlighted that at the essence of an intelligent entity is an ability to perform subjective probability analysis over both logical and natural (physical) conditions [16]. [5] approach this through an expert system governed by fuzzy logic and highlight certain abilities an agent must achieve to be considered intelligent. For example, an agent must be able to plan a set of actions and needs to adapt its plan to changing conditions. [4] provide a classification of properties an agent may have (see Fig. 1).

On a more philosophical note, for an autonomous entity to achieve independence and the freedom to govern itself, one can argue that only an AI with consciousness can achieve such a feat. However, [17] highlights that consciousness, although not properly understood, may indeed be a representation rather than an agency. Hence, the consciousness created in our minds is a result of the brain attempting to make sense of sensory inputs, and consciousness is but an illusion, a model of the contents of our attention. Therefore, the state of dreaming and the state of wakefulness differ in sensory detail rather than in a termination/commencement of conscious experience. Assuming that the cortical conductor theory interpretation is correct [18], we can reduce an autonomous system into four main concepts that feed an attention layer:


learning, reasoning, control, and selection. In this paper we delimit our focus to the learning mechanism that occurs from an interaction, and describe some industrial environments where learning can occur. Our position is that without an interactive learning mechanism and an ability to self-improve, such as reinforcement learning provides us with, the system cannot be considered autonomous. Still, as discussed above, there are other properties that must also be fulfilled for a system to be considered truly autonomous.

2.2 Extended reality

The term eXtended Reality (XR) has become a common umbrella term for fields where digitally enhanced environments and human interaction are combined for various purposes. This includes Virtual Reality (VR), Mixed Reality (MR), and Augmented Reality (AR). Compared to, for example, autonomous driving, these environments offer AI researchers an important and relatively low-cost setting for implementing human-interacting algorithms; such algorithms may contain reasoning that can be considered "true" AI. Perhaps most importantly, these digital environments allow us to study an agent's decision making, thereby offering a feedback loop between the agent, the environment, and the user.

Still, the need for training new abilities often requires that agents are presented with big data. The traditional approach was often to gather the data from users, e.g. playing a game, and then train on this data. Today, the deep networks' need for massive data and the relatively complex training scenarios for reinforcement learning present researchers with a problem that is often better solved by augmenting additional data for training networks than by using real data. Data augmentation methods depend on the problem at hand; an agent may, for example, have to learn how to deal with object recognition, spatial actions to take in relation to detected objects, or temporal differences in scenarios, to name a few. The following subsection reviews some of the relevant literature regarding the performance of data augmentation for training agents.

2.3 Data augmentation

Data augmentation has gained prominence as a studied method for extending available datasets [19]. The ability to train deep networks often depends on the availability of big data. The fields of both image recognition and voice recognition have been strongly influenced by deep learning methods [20, 21], and this has motivated the emergence of data augmentation. For RL, many of the same concepts can be utilized, but there is also a need for methods that work particularly in the temporal dimension.

Traditional, naïve approaches tend to manipulate the investigated environment or dataset in various ways. For visual tasks, these have included scaling of objects, translating, i.e. moving objects spatially to various positions, rotation of objects at various angles, flipping objects so as to remove bias from any direction, adding noise, changing lighting conditions, and transforming the perspective of a known object by changing the angle of view [19, 22].
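As a minimal illustration of these naïve visual augmentations, the sketch below applies random flips, rotations, translations, noise, and a brightness change to an image array; the function name and parameter ranges are illustrative choices, not drawn from any particular library.

```python
import numpy as np

def augment_image(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a few naive augmentations to an H x W x C image with values in [0, 1]."""
    out = img.copy()
    # Horizontal flip with 50% probability (removes left/right bias).
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Rotation by a random multiple of 90 degrees.
    out = np.rot90(out, k=int(rng.integers(0, 4)), axes=(0, 1))
    # Random translation by rolling the image a few pixels in each direction.
    h, w, _ = out.shape
    dy = int(rng.integers(-h // 10, h // 10 + 1))
    dx = int(rng.integers(-w // 10, w // 10 + 1))
    out = np.roll(out, shift=(dy, dx), axis=(0, 1))
    # Additive Gaussian noise and a simple global brightness (lighting) change.
    out = out + rng.normal(0.0, 0.02, size=out.shape) + rng.uniform(-0.1, 0.1)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
augmented = [augment_image(np.zeros((64, 64, 3)), rng) for _ in range(8)]
```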

For audio tasks, data augmentation often includes deformations in the temporal dimension. Approaches include time stretching by changing audio speed, pitch shifting by raising or lowering the tone frequency by various degrees, dynamic range compression, and introducing background noise using Gaussian or natural noise methods [23].
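Purely as an illustration of such audio deformations, the sketch below implements a naive time stretch by resampling (which, unlike a proper phase-vocoder approach, also shifts pitch) and additive Gaussian background noise at a chosen signal-to-noise ratio; the function names and parameters are our own.

```python
import numpy as np

def time_stretch(signal: np.ndarray, rate: float) -> np.ndarray:
    """Naive time stretch by resampling; rate > 1 shortens the clip.

    Note: plain resampling also shifts pitch; this is only an illustration.
    """
    n_out = int(len(signal) / rate)
    new_positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(new_positions, np.arange(len(signal)), signal)

def add_background_noise(signal: np.ndarray, snr_db: float,
                         rng: np.random.Generator) -> np.ndarray:
    """Mix in Gaussian noise at a chosen signal-to-noise ratio (in dB)."""
    signal_power = np.mean(signal ** 2) + 1e-12
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone at 16 kHz
augmented = add_background_noise(time_stretch(clip, rate=1.1), snr_db=20, rng=rng)
```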

The naïve data augmentation approaches tend to produce limited alternative data for RL agents to learn from in an extended reality setting. For an RL agent to learn new abilities, data augmentation must support the agent's scenario learning process. As suggested by [24], we shift from learning to generalize on spatial data to reacting to continuous-time dynamical systems without a priori discretization of time, state, and action.

Several approaches exist for the creation of these scenarios. An important method is adversarial learning, as it can produce new and complex augmented datasets by pitting a generative model against an adversary [25]. A generative model in combination with XR can also address the exploration problem, as exploring some states in the physical reality could be very costly and dangerous. This combination also allows the system developer to understand which state spaces in the virtual environment have been visited and trained upon, and the model's ability to generalize in the extended reality environment. These generative techniques for extending the learning environment are further explored in the following section.

3 Reinforcement learning

Although XR is slowly receiving more recognition, AI and machine learning's use of XR to enhance learning is lagging behind. XR could help improve an AI's behaviour by providing information from either purely virtual or semi-real environments. Reinforcement learning is one machine learning technique which, when combined with XR, could produce interesting and beneficial results for many applications, such as driverless cars, autonomous factories, smart cities, gaming and more.


RL's primary purpose is to find the best action an agent should take in a given environment. With RL, we can determine the best action to take by maximizing the cumulative reward from previous actions, thus learning a policy. RL has long roots in areas such as dynamic programming [26]; for a historical review see [27]. However, recently it has received a lot of attention for its potential to advance and improve applications in gaming, robotics, natural language processing (including dialogue systems, machine translation, and text generation), computer vision, neural architecture design, business management, finance, healthcare, intelligent transportation systems, and other computer systems [28].
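To make the idea of maximizing cumulative reward concrete, the following sketch implements tabular Q-learning, one of the simplest RL algorithms. It assumes a simplified environment interface where reset() returns an integer state and step(action) returns (next_state, reward, done); all hyperparameter values are illustrative.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Learn a greedy policy by maximizing expected cumulative (discounted) reward."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))  # action-value estimates
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, done = env.step(action)
            # temporal-difference update towards reward + discounted future value
            target = reward + gamma * np.max(q[next_state]) * (not done)
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
    policy = np.argmax(q, axis=1)  # greedy policy derived from the learned values
    return q, policy
```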

Attention and memory are two parts of RL which, if handled carelessly, could negatively affect performance. Attention is the mechanism which focuses on the salient parts, whereas memory provides long-term data storage, and attention is an approach for memory addressing [28]. Using XR and self-play, agents may be able to learn desired behaviour before the actions they make become crucial to their performance. As an example, autonomous helicopter software could learn fundamental mechanisms of flight using virtual data in simulations, in order to achieve a high level of attention using the required memory, without the risks posed by real-world applications. Once the attention has reached a desired level, it can be applied to real agents in the physical world.

General value functions can be used to represent knowledge. RL, arguably, mimics knowledge in the sense that it (generally) learns from the results of actions taken. Thus, one may be able to represent knowledge with general value functions using policies, termination functions, reward functions, and terminal reward functions as parameters [29]. In doing so, an agent may be able to predict the values of sensors, learn policies to maximize those sensor values, and answer predictive or goal-oriented questions.
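A minimal sketch of this idea, assuming discrete states: a general value function is parameterized by a cumulant (reward) function and a state-dependent continuation (termination) function, and its prediction can be learned with an ordinary temporal-difference update. The structure follows the description in [29]; the concrete class and function signatures are our own.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class GeneralValueFunction:
    """Predicts the discounted sum of a cumulant under the behaviour experienced."""
    cumulant: Callable[[int, int, int], float]   # c(s, a, s'), e.g. a sensor reading
    continuation: Callable[[int], float]         # gamma(s'), 0 at termination
    n_states: int
    alpha: float = 0.1
    values: np.ndarray = field(init=False)

    def __post_init__(self):
        self.values = np.zeros(self.n_states)

    def update(self, s: int, a: int, s_next: int) -> None:
        # TD(0) update of the prediction towards cumulant + continued future value.
        target = self.cumulant(s, a, s_next) + self.continuation(s_next) * self.values[s_next]
        self.values[s] += self.alpha * (target - self.values[s])

# Example: predict how soon a hypothetical "sensor state 3" will be reached.
gvf = GeneralValueFunction(cumulant=lambda s, a, s2: float(s2 == 3),
                           continuation=lambda s2: 0.9, n_states=10)
gvf.update(s=0, a=1, s_next=3)
```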

Generative Adversarial Networks (GANs) estimate generative models via an adversarial process by training two models simultaneously [25]: a generative model G to capture the data distribution, and a discriminative model D to estimate the probability that a sample comes from the training data rather than from the generative model G. Such an approach could be extended to XR by training a generative model G on virtual/simulated test data and then a discriminative model D to estimate the probability that a sample comes from the real world. This could help tackle some of the issues with RL trained within virtual environments and then extended to the real world. RL and XR could be used before the agent is applied to a real environment; this could save resources and make autonomous systems a more viable option for general use.
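As a sketch of the discriminative side of this idea, assuming observations are available as flat feature vectors, the PyTorch snippet below trains a discriminator D to estimate the probability that a sample comes from the real world rather than from the simulator. The network architecture and training details are illustrative, not taken from [25].

```python
import torch
import torch.nn as nn

class SimVsRealDiscriminator(nn.Module):
    """D(x): logit of the probability that observation x comes from the real world."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # single logit; apply a sigmoid for a probability
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def discriminator_step(disc, optimizer, real_batch, sim_batch):
    """One training step: real samples labelled 1, simulated samples labelled 0."""
    loss_fn = nn.BCEWithLogitsLoss()
    logits = torch.cat([disc(real_batch), disc(sim_batch)])
    labels = torch.cat([torch.ones(len(real_batch), 1), torch.zeros(len(sim_batch), 1)])
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative usage with random stand-in data.
disc = SimVsRealDiscriminator(obs_dim=16)
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
discriminator_step(disc, opt, torch.randn(32, 16), torch.randn(32, 16))
```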

GANs together with transfer learning could advance self-play using virtual environments for real-world agents [30]. By combining virtual-data generative models and transferring the learning model to a discriminative model, we may be able to accurately express what was learned from the virtual learning environment to the real agent. Again, unforeseen problems will inevitably arise due to the nature of modelling. By using RL both in the virtual learning phase and embedded into the real agent, we may drastically improve a real agent's learning time.

[31] proposed a strategic attentive writer (STRAW), a deep recurrent neural network architecture, for learning high-level temporally abstracted macro-actions in an end-to-end manner based on observations from the environment. Macro-actions are commonly occurring sequences of actions. STRAW builds a multi-step action plan, updated periodically based on observed rewards, and learns for how long to commit to the plan by following it without re-planning. Similar to GANs, STRAW could be used after the simulation learning stage so that the agent copes with any discrepancies between the simulation and the real world.

Adaptive learning is a core characteristic for achieving strong AI [32]. Several adaptive learning methods have been proposed which utilize prior knowledge [33, 34, 28]. [33] proposed to represent a particular optimization algorithm as a policy, and the convergence rate as the reward. [34, 28] proposed to learn a flexible recurrent neural network (RNN) model to handle a family of RL tasks, to improve sample efficiency, learn new tasks from few samples, and benefit from prior knowledge.

The notion of self-play is one of the biggest advancements of modern AI. AlphaZero, DeepMind's successor to the AlphaGo Go-playing AI [35], learns, tabula rasa, superhuman proficiency in challenging domains. Starting with the basic rules, the authors used self-play for the AI to learn strategies by playing against itself and storing efficient/rewarding moves. Fictitious Self-Play is a machine learning framework that implements fictitious play in a sample-based fashion [36]. Three strategies that have been compared are: learning by self-play, learning from playing against a fixed opponent, and learning from playing against a fixed opponent while also learning from the opponent's moves [37].
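The self-play loop itself can be summarized schematically as follows: the current policy generates games against a frozen copy of itself, and the opponent is refreshed periodically. This is a structural sketch only; play_game and update_policy are hypothetical placeholders rather than any specific published algorithm.

```python
import copy

def self_play_training(policy, play_game, update_policy,
                       iterations=1000, refresh_every=50):
    """Train a policy purely from games played against a frozen copy of itself.

    play_game(p1, p2)   -> a record of (state, action, outcome) tuples for player 1
    update_policy(p, g) -> an improved policy given a batch of game records
    Both callables are assumptions used for illustration.
    """
    opponent = copy.deepcopy(policy)  # frozen opponent
    for it in range(1, iterations + 1):
        game_record = play_game(policy, opponent)    # generate experience by self-play
        policy = update_policy(policy, game_record)  # e.g. a policy-gradient or value update
        if it % refresh_every == 0:
            opponent = copy.deepcopy(policy)         # the opponent catches up periodically
    return policy
```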

4 Self-play Scenarios and Architectures

Generalizing from training into real scenarios is not easy for self-learning agents. This problem is known as the reality gap [38]. In the initial stages of AI research, the training of self-learning agents included rules or limited scenarios where the agent could learn and improve through competition against other introduced players. Interestingly, video games have emerged as one of the main sources of benchmark environments for the training and testing of such agents, mostly due


Figure 2: Proposed general architecture for a self-learning agent interacting with its environment.

to their realistic, yet controlled, approach to the real world, and the easy access to large amounts of data. For example, an AI agent can be trained by playing with a perfect copy of itself without any supervision [39]. In this scenario, a set of basic rules of the game is introduced at the beginning, and the agent improves much faster using a vector of rewards instead of the classical scalar quantity [40].

Initially, self-play agents were trained to play board games (such as chess and Go, among others) [35], but the approach has now been successfully extended from the classic and simpler Atari 2600 video games [41] to more complex first-person shooters: Doom [42], Battlefield [43], Quake [44]; role-playing games2: Warcraft [45]; real-time strategy games: Starcraft [46, 47, 48]; and more recently Multiplayer Online Battle Arena games (Dota 2, [49]). For a more comprehensive review see the work by [50].

Motivation for studying self-play scenarios has increased in recent years, mainly due to advances in neural network architectures suitable for the reinforcement learning paradigm, among them DQN [41], A3C [51], DSR [52], and Dueling networks [53], as well as the development of powerful and accessible graphics processing unit (GPU) computing resources. This area of research is broadly known as Deep Reinforcement Learning [28, 54, 55].

The challenge of training self-playing agents in order to develop more complex policies inside realistic and highly specific or general environments remains an open problem. Most of the recent developments tend to focus on very particular properties of the learning agent or on the way they interact with their surroundings. To address this issue, we identify two general mechanisms that can be improved in order to design a better self-learning agent: self-play scenarios and self-learning architectures.

4.1 Improving self-play scenarios and self-learning agents: closing the reality gap

Constructing realistic self-play scenarios plays a fundamental role in training self-learning agents. Once an agent is immersed in a specific environment, we expect (independently of the self-learning architecture) that it will learn a set of policies according to the received experiences3. A widely understood problem is that when agents learn from strictly simulated scenarios, they may not be prepared for unexpected situations when the environment changes, such as a pigeon flying towards the sensor of a driverless car. Here, we propose a general scheme that uses the versatility of video games or simulators as a source of synthetic data, together with the wide array of capabilities of modern extended reality technologies, to enrich the properties of the real environment during the training of self-learning agents. An agent may learn independently, but the environment can be controlled to persuade the agent to learn a set of additional policies for unexpected scenarios. In addition to the enriched data, the self-learning agent may be trained using purely synthetic data, but the limitation of this method lies in the accuracy of the representation of the real scenarios.
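As a sketch of how a simulated environment can be perturbed to expose the agent to unexpected situations, the wrapper below injects random sensor noise and rare disturbance events into an underlying simulator. The reset/step interface and the disturbance parameters are illustrative assumptions rather than any specific toolkit's API.

```python
import numpy as np

class RandomizedScenarioEnv:
    """Wraps a simulator and randomly perturbs observations and dynamics.

    The base environment is assumed to expose reset() -> obs and
    step(action) -> (obs, reward, done), as in common RL toolkits.
    """
    def __init__(self, base_env, noise_std=0.05, event_prob=0.01, seed=0):
        self.base_env = base_env
        self.noise_std = noise_std    # simulated sensor noise level
        self.event_prob = event_prob  # chance of a rare disturbance per step
        self.rng = np.random.default_rng(seed)

    def reset(self):
        return self._perturb(self.base_env.reset())

    def step(self, action):
        obs, reward, done = self.base_env.step(action)
        if self.rng.random() < self.event_prob:
            # Rare, unmodelled event, e.g. an occluded or blanked-out sensor reading.
            obs = np.zeros_like(obs)
        return self._perturb(obs), reward, done

    def _perturb(self, obs):
        return obs + self.rng.normal(0.0, self.noise_std, size=np.shape(obs))
```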

For the self-learning mechanism, we have identified three key steps which could improve the design of architectures for self-learning agents, and which may improve policies both in terms of effectiveness and robustness. The three areas

2 Including the Massively Multiplayer Online type of games.
3 A learning agent can learn a set of policies, or will optimize the parameters of a given set of policies. New policies can emerge even without previous knowledge. The sum of the whole policies is called the general policy.


Figure 3: Illustration of how different components from Figure 2 can be used in complete self-learning systems. a. Adapted from [56], b. Adapted from [57], c. Adapted from [58], d. Adapted from [59], e. Adapted from [60] and f. Adapted from [61].

are: data collection, task generalization, and emergent behaviour (see Fig. 2). For a given agent, in the first stage the agent will need to interact with the environment, possibly by accessing a data collection; then the agent should be able to generalize a set of given tasks and, simultaneously, new skills should emerge (independently or due to the task generalization). In the final stage, the emergent and generalization skills interact with the environment to create a continuously self-learning agent.

To illustrate the steps, we present a set of representative developments in the area of (deep) reinforcement learning. Note that this list is not comprehensive; the examples are presented to highlight their own specific properties and then to introduce them in our general model (see Figure 3). Each can be used as a building block inside a complete self-play scenario. For example, Figure 3a shows low-fidelity rendered images with random camera positions, lighting conditions, object positions, and non-realistic textures used to train a self-learning agent [56]; Figure 3b shows an agent which uses a compressed representation of a real scenario to learn a set of policies which are successfully used back in the real environment [57]. Figure 3c shows an image of a robot used as a one-demonstration example during the training stage [58]. Figure 3d shows another image of a robot used during a training stage to teach a robot to place an object in a target position [59]. Figure 3e depicts an agent stacking two blocks, a behaviour learned from sparse rewards [60], and Figure 3f illustrates a competitive environment where one of the agents develops a new set of skills [61].

1. Data collection:
Domain randomization (DR): the authors of [56] have recently shown how an agent can be trained on artificially generated scenarios. In their paper, they successfully transferred the knowledge from a neural network trained purely on low-resolution rendered RGB images: domain randomization [56]. This method can be used extensively for training agents when the amount of available data is low or when the separation between the real and the training environment is immense.

World models (WM): self-learning agents can be trained in a compressed spatial and temporal representation of real environments [57]. This method is highly powerful because an agent can learn in a more compact or hallucinated universe and then go back to the original environment, exporting the set of learned abilities. One of the main advantages of this method is the possibility of performing a much faster and more accurate in situ training of the agents using less demanding computational resources.

2. Task generalization:

One-shot imitation learning (OSIL): the authors present an improved method that uses a meta-learning framework built upon the soft attention model [62], named one-shot imitation learning [58]. Here, the agents are able to learn a new policy and solve a task after being exposed to a few demonstrations during the training stage.


One-shot visual imitation learning (OSVIL): a meta-imitation learning algorithm that teaches an agent how to learn efficiently [59]. In this work, a robot reuses past experiences and then, upon a single demonstration, develops new skills.

3. Emergent behaviour:
Scheduled auxiliary control (SAC): a research team from DeepMind introduced a new framework that allows agents to learn new and complex behaviours in the presence of a stream of sparse rewards [60].
Multi-agent competition (MAC): in a paper by a research team from OpenAI, the authors showed that multi-agent self-play in complex environments can produce behaviours which are more complex than the environment itself [61]. The emergent skills can improve the capabilities of the agents upon unexpected changes in the real environment.

To summarize, Figure 2 illustrates how an agent retrieves data from an environment and then generalizes to a specific task while simultaneously developing new abilities. The new skills can emerge independently or due to the task generalization process. In the final stage, the environment gets modified by the agent itself. A combination of such methods could be used to create more effective architectures for teaching self-learning agents. The novelty of this idea is that it builds upon the concept of modularity to create more complex architectures which develop new, untrained behaviours. This notion has been explored before to improve learning agents, for instance by synergistically combining two modelling methods such as type-based reasoning and policy reconstruction [63]. Interestingly, it has recently been stated that this is still a promising area of research [64]. In addition, the proposed architecture establishes a general framework for the design of intelligent or rational agents seeking to maximize their reward upon interaction with an external environment [65] via specific goals (task generalization and emergent behaviour). Once an architecture is defined, targeting the most adequate general policies depends on the external information gathered by the agent. In the data collection module, the main goal is to optimize the data used by the agent for learning. In the general case, however, a proper design should include a non-static, complete representation of the environment in order to explore all possible scenarios extensively.
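A structural sketch of the three-stage architecture of Figure 2, with the stage interfaces reduced to plain callables; the class and method names are our own illustrative abstraction rather than a prescribed implementation.

```python
from typing import Callable, List, Tuple

Experience = Tuple[object, object, float]  # (observation, action, reward)

class SelfLearningAgent:
    """Structural sketch of the three stages in Figure 2.

    The three callables are placeholders for concrete components such as
    domain randomization (data collection), one-shot imitation learning
    (task generalization) and multi-agent competition (emergent behaviour).
    """
    def __init__(self,
                 collect: Callable[[object], List[Experience]],
                 generalize: Callable[[List[Experience]], object],
                 discover: Callable[[List[Experience]], object]):
        self.collect = collect
        self.generalize = generalize
        self.discover = discover

    def learning_cycle(self, environment) -> Tuple[object, object]:
        data = self.collect(environment)      # stage 1: data collection
        task_policy = self.generalize(data)   # stage 2: task generalization
        emergent_policy = self.discover(data) # stage 3: emergent behaviour
        # Both policies feed back into the environment, closing the loop.
        return task_policy, emergent_policy
```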

In the next section, we present a complete general design for a self-learning agent, including an extra module for extending the description of the world experienced by the agent, and then, upon an ad-hoc division and by using the concept of multi-agent systems [66], we discuss its application in the design of a central autonomous agent able to provide, via rational decision-making, strategies to tackle the possible consequences derived from changes at any stage of a specific industrial infrastructure.

5 Proposed Design

As already discussed, our goal of designing general self-play scenarios for teaching self-learning agents can be tackled by separating the data retrieved from the environment and the agent's self-learning architecture. In the spirit of SSM (see Section 2), we present a general scheme in which we divide the general architecture into two modules: in Module 1, the agent retrieves the data from its surroundings as a combination of information from the real world and synthetic data (or purely synthetic data), and in Module 2 (equivalent to the structure of Fig. 3), the agent creates its final policies. The general scheme is depicted in Fig. 4.

For a self-learning agent inside a specific self-learning scenario, there is no difference between synthetic and real data. Here we call real data the information extracted from the physical world without any previous or further digital modifications. The agent uses exclusively the information, in terms of raw bytes, independently of the sensors that connect it with the environment. The use of synthetic data arises from the need to expose the self-learning agent to unexpected situations or conditions that allow it to create a set of optimal related policies. The representation of the real world can also be done via purely synthetic data; however, the best-case scenario is one where the synthetic data is used to extend the real world. The agent can also modify its own environment during the learning process, and if the whole process is done fully in a simulation environment, the use of engines that mimic the physical rules of the real world becomes necessary. The proposed general design can be encapsulated and used as a basic element in a more complex multi-agent based learning architecture [66]. The communication among its moieties can be performed via the local information modules that represent their individual real worlds. Subsequently, a central agent retrieves and updates its state, providing a final decision. Several mathematical strategies can be applied to interchange the information, for instance by defining scalar, vectorial or probabilistic representations.
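The two-module split of Fig. 4 can be sketched in the same spirit: Module 1 assembles the agent's world description from real and synthetic sources, and Module 2 is the self-learning architecture that turns that description into final policies. All names and the simple mixing strategy are illustrative assumptions.

```python
import numpy as np

class TwoModuleAgent:
    """Sketch of the two-module design in Fig. 4 (all names are illustrative).

    Module 1 assembles the world description from real and synthetic sources;
    Module 2 is the self-learning architecture (for example the three-stage
    agent sketched earlier) that turns that description into final policies.
    """
    def __init__(self, real_source, synthetic_source, learner, mix=0.5):
        self.real_source = real_source            # callable returning real observations
        self.synthetic_source = synthetic_source  # callable returning synthetic observations
        self.learner = learner                    # object exposing an update(batch) method
        self.mix = mix                            # fraction of synthetic data in each batch

    def module_one(self, batch_size: int) -> np.ndarray:
        n_synth = int(self.mix * batch_size)
        real = self.real_source(batch_size - n_synth)
        synth = self.synthetic_source(n_synth)
        return np.concatenate([real, synth], axis=0)  # enriched world description

    def module_two(self, batch: np.ndarray):
        return self.learner.update(batch)             # final policy creation
```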

In the next part we aim to present an example of how we can build a minimal supra-modular architecture for the specific case of industrial management. We discuss the advantages and the characteristics of the modular parts of the design, as well as review (not exhaustively) the state-of-the-art developments for learning agents in decision making.


Figure 4: Two-module design of a general architecture for a self-learning agent interacting with its enriched or altered environment.

Figure 5: Proposed architecture for a reinforcement learning agent in Autonomous Industrial Management (AIM) for adaptive industrial infrastructure.

5.1 Application in Autonomous Industrial Management

The aforementioned improvements can be employed in specific applications using specific designs, in particular by using new and open simulation frameworks such as OpenAI Gym4 or Dopamine5, among others. Here we discuss a particular application for industrial environments, solving a managerial problem; however, the developed ideas can be used for other purposes upon a clear definition of their parts.

In the industrial regime, decision-making is one of the key elements in the adaptive business intelligence discipline [67]. Despite the advances in computational tools, many relevant decisions are still taken by real persons. Here, we propose a general model where self-learning agents can be trained to make decisions at industrial scale through self-

4 https://github.com/openai/gym
5 https://github.com/google/dopamine


playing using extended reality training scenarios. We devised a strategy where an industrial infrastructure is divided ad hoc into three independent sections managed by three independent self-learning agents (see Figure 5). Each agent replicates the physical process but has the ability to explore (and improve) in its own learning space based on perturbations via self-learning mechanisms. Upon external perturbations in their environments, the final decision is made by a central agent which learns from the information gathered by the independent agents. This Multi-Agent Reinforcement Learning (MARL) structure brings a modularity that can be exploited in the industrial realm to improve and optimize decision-making processes. Our example focuses mainly on the manufacturing type of industry, but the methodology can easily be extended to other industrial fields. Similar conceptual frameworks have been presented and described in the literature [68, 69], and in some cases full computational frameworks or software applied to specific industrial environments have been described [70, 71]. The niche of industrial management is a fruitful source of academic research with the application of cutting-edge methods for the design of intelligent systems. Our division into three different entities, Supply Chain Management (SCM), Warehouse or Inventory Management, and Digital Twin, responds to the fact that a lot of work on modelling and creation of autonomous agents has been done in recent years independently in each area; therefore a low-level description of an autonomous agent for industrial management is possible by following our proposed architecture (see Fig. 4). In the next subsections we present a non-exhaustive review of methods, highlighting some of the relevant strategies employed in the creation of autonomous systems.
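A structural sketch of the proposed AIM arrangement of Figure 5: three local agents (SCM, warehouse, digital twin) each maintain and report a local state, and a central agent aggregates those reports into a final decision. All class and method names, and the toy update rule, are illustrative assumptions.

```python
import numpy as np

class LocalAgent:
    """One section of the infrastructure (e.g. SCM, warehouse, digital twin)."""
    def __init__(self, name: str, state_dim: int):
        self.name = name
        self.state = np.zeros(state_dim)

    def observe_and_learn(self, perturbation: np.ndarray) -> np.ndarray:
        # Placeholder for the section-specific self-learning mechanism.
        self.state = 0.9 * self.state + 0.1 * perturbation
        return self.state  # local information passed to the central agent

class CentralAgent:
    """Aggregates the local agents' information into one final decision."""
    def decide(self, local_states) -> int:
        combined = np.concatenate(local_states)
        return int(np.argmax(combined))  # e.g. index of the action to prioritize

sections = [LocalAgent(n, state_dim=4) for n in ["SCM", "warehouse", "digital_twin"]]
central = CentralAgent()
rng = np.random.default_rng(7)
states = [agent.observe_and_learn(rng.normal(size=4)) for agent in sections]
decision = central.decide(states)
```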

5.1.1 Supply Chain Management (SCM)

The use of RL for optimization in SCM has been discussed at length in recent decades [72, 73], in particular in the seminal works by [74], where the authors propose a multi-agent system for the optimization of tasks in the SCM, by [75], which solves the multi-agent problem as a semi-Markov decision problem, and by [76], which reduces the complexity by dividing the agents into three components (a concept similar to the one presented in this work), among others. Here, the self-learning agent must be able to decide among the different suppliers (external and internal) in coordination with the warehouse manager agent.

5.1.2 Warehouse or Inventory Management

Models for warehouse management have been developed in parallel with state-of-the-art technological developments to achieve short and optimized response times in delivering goods [77] while an optimal stock level is maintained. In particular, it is worth highlighting two types of warehousing systems: automatic and automated warehousing systems. Several methods for self-learning agents in warehouses have been shown to be effective in solving real problems, for instance stochastic learning [78], the temporal-difference Actor-Critic algorithm [79], or more recently deep reinforcement learning [80]. In general, the information can be controlled and retrieved via an optimized wireless network of sensors inside the warehouse [81]. In this case, the self-learning agent must be able not only to gather the internal information about the location of the goods inside the warehouse but also to optimize the architecture of the network via self-play mechanisms.

5.1.3 Digital Twin

The use of digital tools in industry has evolved from the stage in which only individual applications could be modelled accurately for very specific topics, as in the sixties, towards a complete simulation of systems during their entire life cycle in modern times [82]. The concept that includes a detailed model and description of a product, system or component in all its possible phases (product design, production system engineering, production planning, production execution, production intelligence and closed-loop optimization) [83] is known as the digital twin6.

In general terms, the digital twin is an extension of the model-based systems engineering (MBSE) concept [85]. Three important aspects of the digital twin have been discussed previously by other authors [82, 83, 8]: modularity, connectivity and autonomy. These aspects can be linked with our proposed design in Fig. 4, and our design can be extended into the MARL context.

6 Conclusion & Future Work

The design of self-learning and self-play scenarios is still an area of fruitful development and research. Many critics have pointed out that AI research is limited by its own ideas [86] as well as in its usefulness for industrial purposes. The creation and discussion of general architectures can open the door to new proposals, in particular for the eventual emergence of the expected autonomous systems (see Section 2.1). Despite a boom in the field in recent years, there are many open questions about how to enable self-learning agents to achieve specific tasks without any supervision. We

6 A term coined by NASA during the Apollo missions, but recently brought to the general public [84].


propose to tackle this question by using our general architecture, which can, in principle, be limited by the modular developments and the availability of specific datasets or sensors.

In this work we grounded our position on autonomous systems by providing a definition of requirements, and presented a general review of the relevant literature. We provided a proposal for the design of a general architecture for self-learning agents. Our design includes two separate modules, one for the creation of the data and a second for the independent self-learning of an agent. We conclude by stating that the second module is, in general, divided into three stages, where each stage is in charge of accomplishing an independent task: data collection, task generalization and emergent behaviour. In very particular designs, generalization can influence emergent behaviours, but only in one direction. We expanded our findings by presenting a concrete example of our ideas in the industrial manufacturing sector. Here we suggest a division into three modular elements that feed a central agent for decision-making. Each module is able to explore and learn from all possible scenarios and available data, taking into account changes or disruptions in its own infrastructure and the information provided by the other modules. In the near future, we plan to present the first implementation of our ideas and publish conclusions about the feasibility of our design.

References

[1] Klaus Schwab. The Fourth Industrial Revolution. Crown Publishing Group, New York, NY, USA, 2017.

[2] Tesla. Future of driving. https://www.tesla.com/autopilot, 2019. Accessed: 2019-07-04.

[3] Rolls-Royce. Rolls-Royce and Finferries demonstrate world's first fully autonomous ferry. https://www.rolls-royce.com/media/press-releases/2018/03-12-2018-rr-and-finferries-demonstrate-worlds-first-fully-autonomous-ferry.aspx, 2018. Accessed: 2019-07-04.

[4] Stan Franklin and Art Graesser. Is it an agent, or just a program?: A taxonomy for autonomous agents. In International Workshop on Agent Theories, Architectures, and Languages, pages 21–35. Springer, 1996.

[5] Christer Carlsson, Mario Fedrizzi, and Robert Fullér. Fuzzy logic in management, volume 66. Springer Science & Business Media, 2012.

[6] Tong Wang, Huijun Gao, and Jianbin Qiu. A combined adaptive neural network and nonlinear model predictive control for multirate networked industrial process control. IEEE Transactions on Neural Networks and Learning Systems, 27(2):416–425, 2016.

[7] Stefan Boschert, R Rosen, and C Heinrich. Next generation digital twin. In Proceedings of the 12th International Symposium on Tools and Methods of Competitive Engineering TMCE, pages 209–218, 2018.

[8] Werner Kritzinger, Matthias Karner, Georg Traar, Jan Henjes, and Wilfried Sihn. Digital twin in manufacturing: A categorical literature review and classification. IFAC-PapersOnLine, 51(11):1016–1022, 2018.

[9] Leonardo Espinosa Leal, Anthony Chapman, and Magnus Westerlund. Reinforcement learning for extended reality: designing self-play scenarios. In Proceedings of the 52nd Hawaii International Conference on System Sciences, pages 156–163, 2019.

[10] Peter Checkland and John Poulter. Soft systems methodology. In Martin Reynolds and Sue Holwell, editors, Systems Approaches to Managing Change: A Practical Guide, pages 1430–1436. Springer, 2010.

[11] CG Sørensen, S Fountas, E Nash, Liisa Pesonen, Dionysis Bochtis, Søren Marcus Pedersen, B Basso, and SB Blackmore. Conceptual model of a future farm management information system. Computers and electronics in agriculture, 72(1):37–47, 2010.

[12] Lawrence Lee Rauch. Oscillation of a third order nonlinear autonomous system. In S. Lefschetz, editor, Contributions to the theory of nonlinear oscillations, volume 1, page 39. Princeton University Press, 1950.

[13] David Canfield Smith, Allen Cypher, and Jim Spohrer. Kidsim: programming agents without a programming language. Communications of the ACM, 37(7):54–67, 1994.

[14] Miltos Kyriakidis, Riender Happee, and Joost CF de Winter. Public opinion on automated driving: Results of an international questionnaire among 5000 respondents. Transportation research part F: traffic psychology and behaviour, 32:127–140, 2015.

[15] SAE On-Road Automated Vehicle Standards Committee. Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems. SAE International, 2014.

[16] Francis J Anscombe, Robert J Aumann, et al. A definition of subjective probability. Annals of mathematical statistics, 34(1):199–205, 1963.

[17] Joscha Bach. Phenomenal experience and the perceptual binding state. In Papers of the 2019 Towards Conscious AI Systems Symposium. CEUR-WS, 2019.


[18] Joscha Bach. The cortical conductor theory: Towards addressing consciousness in AI models. In Biologically Inspired Cognitive Architectures Meeting, pages 16–26. Springer, 2018.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[21] Leonardo Espinosa Leal, Kaj-Mikael Björk, Amaury Lendasse, and Anton Akusok. A web page classifier library based on random image content analysis using deep learning. In Proceedings of the 11th PErvasive Technologies Related to Assistive Environments Conference, PETRA ’18, pages 13–16, New York, NY, USA, 2018. ACM.

[22] Prasad Pai. Data augmentation techniques in CNN using Tensorflow. https://bit.ly/2KLm8K6, 2017. Accessed: 2018-06-12.

[23] Justin Salamon and Juan Pablo Bello. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3):279–283, 2017.

[24] Kenji Doya. Reinforcement learning in continuous time and space. Neural computation, 12(1):219–245, 2000.

[25] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.

[26] R Bellman. Dynamic Programming. Princeton University Press, 1957.

[27] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.

[28] Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.

[29] Richard S Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M Pilarski, Adam White, and Doina Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pages 761–768. International Foundation for Autonomous Agents and Multiagent Systems, 2011.

[30] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.

[31] Alexander Vezhnevets, Volodymyr Mnih, Simon Osindero, Alex Graves, Oriol Vinyals, John Agapiou, et al. Strategic attentive writer for learning macro-actions. In Advances in neural information processing systems, pages 3486–3494, 2016.

[32] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

[33] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.

[34] Yutian Chen, Matthew W Hoffman, Sergio Gómez Colmenarejo, Misha Denil, Timothy P Lillicrap, Matt Botvinick, and Nando de Freitas. Learning to learn without gradient descent by gradient descent. arXiv preprint arXiv:1611.03824, 2016.

[35] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

[36] Johannes Heinrich, Marc Lanctot, and David Silver. Fictitious self-play in extensive-form games. In International Conference on Machine Learning, pages 805–813, 2015.

[37] Michiel Van Der Ree and Marco Wiering. Reinforcement learning in the game of othello: learning against a fixed opponent and learning from self-play. In Adaptive Dynamic Programming And Reinforcement Learning (ADPRL), 2013 IEEE Symposium on, pages 108–115. IEEE, 2013.

[38] Nick Jakobi, Phil Husbands, and Inman Harvey. Noise and the reality gap: The use of simulation in evolutionary robotics. In European Conference on Artificial Life, pages 704–720. Springer, 1995.

[39] Alexey Dosovitskiy and Vladlen Koltun. Learning to act by predicting the future. arXiv preprint arXiv:1611.01779, 2016.

[40] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. Cambridge, MA: MIT Press, 2011.


[41] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, AlexGraves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deepreinforcement learning. Nature, 518(7540):529, 2015.

[42] Michał Kempka, Marek Wydmuch, Grzegorz Runc, Jakub Toczek, and Wojciech Jaskowski. Vizdoom: Adoom-based ai research platform for visual reinforcement learning. In Computational Intelligence and Games(CIG), 2016 IEEE Conference on, pages 1–8. IEEE, 2016.

[43] Jack Harmer, Linus Gisslén, Henrik Holst, Joakim Bergdahl, Tom Olsson, Kristoffer Sjöö, and Magnus Nordin.Imitation learning with concurrent actions in 3d games. arXiv preprint arXiv:1803.05402, 2018.

[44] Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, AndrewLefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. Deepmind lab. arXiv preprint arXiv:1612.03801, 2016.

[45] Per-Arne Andersen, Morten Goodwin, and Ole-Christoffer Granmo. Towards a deep reinforcement learningapproach for tower line wars. In International Conference on Innovative Techniques and Applications of ArtificialIntelligence, pages 101–114. Springer, 2017.

[46] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo,Alireza Makhzani, Heinrich Küttler, John Agapiou, Julian Schrittwieser, et al. Starcraft ii: a new challenge forreinforcement learning. arXiv preprint arXiv:1708.04782, 2017.

[47] Gabriel Synnaeve, Nantas Nardelli, Alex Auvolat, Soumith Chintala, Timothée Lacroix, Zeming Lin, FlorianRichoux, and Nicolas Usunier. Torchcraft: a library for machine learning research on real-time strategy games.arXiv preprint arXiv:1611.00625, 2016.

[48] Yuandong Tian, Qucheng Gong, Wenling Shang, Yuxin Wu, and C Lawrence Zitnick. ELF: An extensive, lightweight and flexible research platform for real-time strategy games. In Advances in Neural Information Processing Systems, pages 2656–2666, 2017.

[49] OpenAI. Dota 2. https://blog.openai.com/dota-2/, 2018. Accessed: 2018-06-12.

[50] Niels Justesen, Philip Bontrager, Julian Togelius, and Sebastian Risi. Deep learning for video game playing. arXiv preprint arXiv:1708.07902, 2017.

[51] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[52] Tejas D Kulkarni, Ardavan Saeedi, Simanta Gautam, and Samuel J Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.

[53] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

[54] Seyed Sajad Mousavi, Michael Schukat, and Enda Howley. Deep reinforcement learning: An overview. In Proceedings of SAI Intelligent Systems Conference, pages 426–440. Springer, 2016.

[55] Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866, 2017.

[56] Joshua Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. CoRR, abs/1703.06907, 2017.

[57] David Ha and Jürgen Schmidhuber. World models. CoRR, abs/1803.10122, 2018.

[58] Yan Duan, Marcin Andrychowicz, Bradly C. Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. CoRR, abs/1703.07326, 2017.

[59] Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. CoRR, abs/1709.04905, 2017.

[60] Martin A. Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Van de Wiele, Volodymyr Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing - solving sparse reward tasks from scratch. CoRR, abs/1802.10567, 2018.

[61] Trapit Bansal, Jakub Pachocki, Szymon Sidor, Ilya Sutskever, and Igor Mordatch. Emergent complexity via multi-agent competition. CoRR, abs/1710.03748, 2017.

[62] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[63] Stefano V Albrecht and Subramanian Ramamoorthy. A game-theoretic model and best-response learning method for ad hoc coordination in multiagent systems. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pages 1155–1156. International Foundation for Autonomous Agents and Multiagent Systems, 2013.

[64] Stefano V Albrecht and Peter Stone. Autonomous agents modelling other agents: A comprehensive survey and open problems. Artificial Intelligence, 258:66–95, 2018.

[65] Stuart J Russell and Peter Norvig. Artificial intelligence: A modern approach. Pearson Education Limited, 2016.

[66] Michael Wooldridge. An introduction to multiagent systems. John Wiley & Sons, 2009.

[67] Zbigniew Michalewicz, Martin Schmidt, Matthew Michalewicz, and Constantin Chiriac. Adaptive business intelligence. Springer, 2006.

[68] Michael Wooldridge, Nicholas R. Jennings, and David Kinny. The Gaia methodology for agent-oriented analysis and design. Autonomous Agents and Multi-Agent Systems, 3(3):285–312, Sep 2000.

[69] Jacques Ferber, Olivier Gutknecht, and Fabien Michel. From agents to organizations: An organizational view of multi-agent systems. In Paolo Giorgini, Jörg P. Müller, and James Odell, editors, Agent-Oriented Software Engineering IV, pages 214–230, Berlin, Heidelberg, 2004. Springer Berlin Heidelberg.

[70] Jaelson Castro, Manuel Kolp, and John Mylopoulos. Towards requirements-driven information systems engineering: The Tropos project. Information Systems, 27(6):365–389, 2002.

[71] Rafael H Bordini, Jomi Fred Hübner, and Michael Wooldridge. Programming multi-agent systems in AgentSpeak using Jason, volume 8. John Wiley & Sons, 2007.

[72] Annapurna Valluri, Michael J North, and Charles M Macal. Reinforcement learning in supply chains. International Journal of Neural Systems, 19(05):331–344, 2009.

[73] Xianyu Zhang, Xinguo Ming, Zhiwen Liu, Dao Yin, Zhihua Chen, and Yuan Chang. A reference framework and overall planning of industrial artificial intelligence (I-AI) for new application scenarios. The International Journal of Advanced Manufacturing Technology, pages 1–23, 2018.

[74] Mihai Barbuceanu and Mark S Fox. Coordinating multiple agents in the supply chain. In Proceedings of WET ICE'96. IEEE 5th Workshop on Enabling Technologies; Infrastructure for Collaborative Enterprises, pages 134–141. IEEE, 1996.

[75] Pierpaolo Pontrandolfo, Abhijit Gosavi, O Geoffrey Okogbaa, and Tapas K Das. Global supply chain management: A reinforcement learning approach. International Journal of Production Research, 40(6):1299–1317, 2002.

[76] Tim Stockheim, Michael Schwind, and Wolfgang Koenig. A reinforcement learning approach for supply chain management. In 1st European Workshop on Multi-Agent Systems, Oxford, UK, 2003.

[77] Jeroen P Van den Berg and Willem HM Zijm. Models for warehouse management: Classification and examples. International Journal of Production Economics, 59(1-3):519–528, 1999.

[78] Reza Moazzez Estanjini, Yingwei Lin, Keyong Li, Dong Guo, and Ioannis Ch Paschalidis. Optimizing warehouse forklift dispatching using a sensor network and stochastic learning. IEEE Transactions on Industrial Informatics, 7(3):476–486, 2011.

[79] Reza Moazzez Estanjini, Keyong Li, and Ioannis Ch Paschalidis. A least squares temporal difference actor–critic algorithm with applications to warehouse management. Naval Research Logistics (NRL), 59(3-4):197–211, 2012.

[80] Joren Gijsbrechts, Robert N Boute, Jan A Van Mieghem, and Dennis Zhang. Can deep reinforcement learning improve inventory management? Performance and implementation of dual sourcing-mode problems. December 17, 2018.

[81] Lukasz Golab and Theodore Johnson. Data stream warehousing. In 2014 IEEE 30th International Conference on Data Engineering, pages 1290–1293. IEEE, 2014.

[82] Stefan Boschert and Roland Rosen. Digital twin—the simulation aspect. In Mechatronic Futures, pages 59–74. Springer, 2016.

[83] Roland Rosen, Georg Von Wichert, George Lo, and Kurt D Bettenhausen. About the importance of autonomy and digital twins for the future of manufacturing. IFAC-PapersOnLine, 48(3):567–572, 2015.

[84] Mike Shafto, Mike Conroy, Rich Doyle, Ed Glaessgen, Chris Kemp, Jacqueline LeMoigne, and Lui Wang. Modeling, simulation, information technology & processing roadmap. National Aeronautics and Space Administration, 2012.

[85] A Wayne Wymore. Model-based systems engineering. CRC Press, 1993.

[86] Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
